Database schema for an online RSS reader - rdbms

I need to create an online RSS reader (just like Google Reader, for example) as part of a bigger project. I've already built a test version using MS SQL. However, the problem is that I don't know how to store feed items in the database efficiently: each feed item has an id (a GUID or just the permanent link), and when I store them all in one table, performance becomes incredibly bad after just 300,000 - 500,000 items.
So I have questions:
1) What is the best DB engine for my problem? (I accept not only an RDBMS; maybe BerkeleyDB or something else. Please explain WHY I should use a certain engine.)
2) What is the best way to organize the data (that is, the schema) in the DB?
3) What is the best language/framework for this problem?
And I'll be pleased if you give me general performance-related advice.
UPDATE:
My idea is to split the feed space into 255 subspaces using a CRC8 hash of the feed URL. This CRC8, calculated once, is used as the name of the table where that feed's items will be stored.
@FractalizeR: the main question is:
given a string, find the feed item already stored in the database with that id (SELECT * FROM FeedItems WHERE pid = @pid).
The main problem here is that pid is arbitrarily long text.
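One common alternative to hashing the feed URL into table names is to keep a single table but index a fixed-length hash of the pid rather than the text itself; a minimal sketch, assuming SQL Server (table and column names are illustrative, not the original schema):

    CREATE TABLE FeedItems (
        ItemId   INT IDENTITY PRIMARY KEY,
        FeedId   INT NOT NULL,
        Pid      NVARCHAR(MAX) NOT NULL,   -- the arbitrarily long guid/permalink
        PidHash  BINARY(16) NOT NULL,      -- e.g. MD5 of Pid, fixed length
        Title    NVARCHAR(500),
        Content  NVARCHAR(MAX)
    );
    CREATE INDEX IX_FeedItems_FeedId_PidHash ON FeedItems (FeedId, PidHash);

    -- Lookup: seek on the short hash, then confirm against the full text
    SELECT *
    FROM FeedItems
    WHERE FeedId = @FeedId
      AND PidHash = HASHBYTES('MD5', @Pid)
      AND Pid = @Pid;

This keeps the index narrow even though the pid itself is unbounded text.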

The first question you need to ask yourself before designing a database is: "What questions will the database most likely be asked?" If you provide us with the answer to this question, we can proceed to planning the database.
A database can be slow on some questions and extremely fast on others.

Related

PostgreSQL - store real-time data of multiple subjects

Scenario:
I'm trying to build a real-time monitoring webpage for ship operations
I have 1,000 - 10,000 ships operating
All ships send real-time data to the DB, 24 hours a day, for 30 days
Each new record inserted has a dimension of 1 row x 100 columns
When the webpage loads, all historic data of a chosen ship will be fetched and visualized
The last row of the ship's real-time data table will be queried and fetched on the webpage to update the real-time screen
Each ship has its own non-real-time data, such as ship dimensions, cargo, attendants, etc.
So far I've been thinking about creating a new schema for each ship. Something like this:
public_schema
ship1_schema
ship2_schema
ship3_schema
|--- realtime_table
|--- cargo_table
|--- dimensions_table
|--- attendants_table
ship4_schema
ship5_schema
Is this a good way to store individual ship's real-time data, and fetch them on a webserver? What other ways would you recommend?
On the time-series side, I'm already using the PostgreSQL extension TimescaleDB. My question is rather about storing time-series data when I have many ships. Is it a good idea to differentiate each ship's RT data by constructing a new schema?
++ I'm pretty new to PostgreSQL, and some of the advice I got from other people was too advanced for me... I would greatly appreciate it if you could suggest a method and briefly explain what it is.
Personally, this seems like the wrong way to go about it.
In this case I would have all the ship data in one table and, from there, include a shipid in
realtime_table
cargo_table
dimensions_table
attendants_table
From there, if you believe that your data will reach high volume, you have the following choices (see the sketch after this list):
Create indexes on the fields that are important to query; the Postgres query planner is very good at using them.
Recent Postgres versions implement declarative table partitioning based on criteria you provide, without having to use table inheritance.
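A minimal sketch of that single-table layout with a ship index and declarative partitioning (all names are illustrative, and the 100 data columns are collapsed into one payload column for brevity):

    -- One realtime table for every ship, keyed by ship_id
    CREATE TABLE realtime_data (
        ship_id     INT         NOT NULL,
        recorded_at TIMESTAMPTZ NOT NULL,
        payload     JSONB       NOT NULL        -- or ~100 typed columns
    ) PARTITION BY RANGE (recorded_at);

    CREATE TABLE realtime_data_2024_01 PARTITION OF realtime_data
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- Serves both "all history of one ship" and "latest row of one ship"
    CREATE INDEX ON realtime_data (ship_id, recorded_at DESC);

    -- Latest row for the real-time screen
    SELECT *
    FROM realtime_data
    WHERE ship_id = 42
    ORDER BY recorded_at DESC
    LIMIT 1;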
Since you will be needing live data on the web page, you can use the LISTEN command in Postgres to be notified when data is received from a ship (unless you have another way of pushing this data to the web server, like WebSockets).
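A rough LISTEN/NOTIFY sketch (trigger, function and channel names are made up for illustration; realtime_data is the hypothetical table above):

    -- Notify listeners whenever a new realtime row arrives
    CREATE OR REPLACE FUNCTION notify_new_ship_data() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('ship_data', NEW.ship_id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_notify_new_ship_data
        AFTER INSERT ON realtime_data
        FOR EACH ROW EXECUTE FUNCTION notify_new_ship_data();

    -- The web server's database connection subscribes with:
    LISTEN ship_data;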
Adding a bit of color here - if you are already using the TimescaleDB extension, you won't need to use table partitioning, since TimescaleDB will handle that for you automatically.
The approach of storing all ship data in a single table with a metadata table outside of the time series table is a common practice. As long as you build the correct indexes, as others have suggested, you should be fine. An important thing to note is that if you (for example) build an index on time, you want to make sure to include time in your queries to benefit from constraint exclusion.
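For illustration, assuming a plain realtime_data table like the one sketched earlier but without the PARTITION BY clause (TimescaleDB does its own chunking), the setup could look roughly like this:

    -- Turn the realtime table into a hypertable partitioned on time
    SELECT create_hypertable('realtime_data', 'recorded_at');

    -- A time-bounded predicate lets the planner skip irrelevant chunks
    SELECT *
    FROM realtime_data
    WHERE ship_id = 42
      AND recorded_at > now() - interval '30 days'
    ORDER BY recorded_at DESC;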

How can I handle a very large database without losing performance?

I want to develop an application, but I'm worried about its performance once the number of users and the amount of stored data grow.
I don't know the best way to implement a program that works with really large data and does things like searching it, finding and retrieving user information, searching text and so on, in real time without any delay.
Let me explain the problem in more detail.
For example, I have chosen MongoDB as the database. Suppose we have at least five million users and a user wants to log in to the system, sending a username and password.
The first thing we should do is find the user with that username and then check the password. In MongoDB we would use something like the find method to get the user's information, like this:
Users.find({ username: entered_username })
then we get the user information and check the password.
But the find method has to search for that username among millions of users, and it runs for every authentication request, which causes heavy processing on the system.
Unfortunately, this problem isn't limited to finding a user: if we want to search text when there are lots of texts and posts in the database, the problem gets even bigger.
I don't know how big companies like Facebook and LinkedIn search through millions of records in such a short span of time. I don't want to build something like Facebook, but I do have a large amount of data and I'm looking for a good way to handle it.
Is there a framework or anything else that helps with handling large data in databases, or a way of organizing data in the database so that we can search and find it quickly? Should I use a particular data structure?
I found an open-source project, Elasticsearch, that helps with faster searching, but if I find something through Elastic, how do I find it in MongoDB too, for things like updating the data? And if I use Elasticsearch, do I still need MongoDB or not? Can I use Elastic as a database and as a search engine at the same time?
If I use Elasticsearch and MongoDB together, I'll have two copies of my data, one in MongoDB and one in Elasticsearch, and those two copies are separate. I wish Elasticsearch could search within MongoDB so I wouldn't have to keep two copies of the data.
Thank you for helping me find a good way forward and understand what I should do.
When you talk about performance, it usually boils down to three things:
Your design
Your definition of "quick", and
How much you're willing to pay
Your design
MongoDB is great if you want to iterate on your data model, can scale horizontally, and is very quick if used properly. Elasticsearch, on the other hand, is not a database; however, it is very quick for searching. A traditional relational database will be useful if you know exactly what your data looks like, don't expect it to change much, or the data is relational by nature.
You can, for example, use a relational database for user login, use MongoDB for everything else, and use Elastic for textual, searchable data. There is no rule that tells you to keep everything within a single database.
Make sure you understand indexing, and know how to utilize it to its fullest potential. The fastest hardware will not help you if you don't design your database properly.
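As a concrete illustration of both points (table and column names are hypothetical), a login lookup backed by a unique index never has to scan five million rows:

    CREATE TABLE users (
        user_id       BIGINT PRIMARY KEY,
        username      VARCHAR(100) NOT NULL,
        password_hash VARCHAR(255) NOT NULL
    );

    -- The unique index turns the login lookup into a single index seek
    CREATE UNIQUE INDEX ux_users_username ON users (username);

    SELECT user_id, password_hash
    FROM users
    WHERE username = 'entered_username';

MongoDB offers the same thing: a unique index on username (via createIndex) turns Users.find({ username: entered_username }) into an index lookup instead of a collection scan.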
Conclusion: use any tool you need, combine if necessary, but understand their strengths and weaknesses.
Your definition of "quick"
How "quick" is quick enough for your application? Is 100ms quick enough? Is 10ms quick enough? Remember that more performance you ask of the machine, more expensive it will be. You can get more performance with a better design, but design can only go so far.
Usually this boils down to what is acceptable for you and your client. Not every application needs a sub-10ms response time. There's plenty of applications that can tolerate queries that return in seconds.
Conclusion: determine what is acceptable, and design accordingly.
How much you're willing to pay
Of course, it all depends on how much you're willing to pay for the hardware that needs to host all that stuff. MongoDB might be open source, but you need somewhere to host it. Also, you cannot expect magic: you can't throw thousands of queries and updates per second at it and expect it to be blazing fast when you only give it 1 GB of RAM.
Conclusion: never under-provision to save money if you want your application to be successful.

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like date, subject IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs; there shouldn't be more than 200 fields).
Any files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but 2% have a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL Server 2008 R2. I've designed EAVs for each segment (please keep in mind that this is not our first EAV design; it worked well before, with less data), and the mid-tier/front-end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs first before it can start inserting. I'll explain: this is necessary because the client wants some control over field names. We created a repository of all possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to insert his csv fields into another table first, which we call the properties table. This doesn't completely solve the problem, but as he inserts the fields he sees possible matches that were already inserted: when the user types in "blood", a panel shows all the fields already used containing the word blood. If the user thinks it's the same thing, he has to change the csv header to that field. Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technology stack, but we have limitations on our possible choices: I have only worked with relational DBs so far (only SQL Server, actually) and the other guys know only PHP. I guess a full MS stack is out of the question.
It seems to me that a non-SQL approach would be best. I read a lot about MongoDB, but honestly I think it would be a super steep learning curve for us, and if they want to start crunching the numbers or even have some reporting capabilities, I guess Mongo wouldn't be up to it. I'm reading about PostgreSQL, which is relational, and its famous HStore type. So here is where my questions start:
Do you think Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects (or something similar) to be stored in HStore fields and still be somewhat queryable?
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins; nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk inserts or some other technique that relies on the back-end to do the heavy lifting?
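To make the HStore question more concrete, here is roughly what I imagine such a table looking like (only a sketch; all names are made up):

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE research_rows (
        project        TEXT NOT NULL,
        segment        TEXT NOT NULL,
        subject_id     TEXT NOT NULL,
        observed_on    DATE NOT NULL,
        dynamic_fields HSTORE,            -- the up-to-200 key/value fields
        PRIMARY KEY (project, segment, subject_id, observed_on)
    );

    -- GIN index so lookups inside the hstore stay fast
    CREATE INDEX idx_research_dynamic ON research_rows USING GIN (dynamic_fields);

    -- "Somewhat queryable": rows that contain a given field, or a given value
    SELECT * FROM research_rows WHERE dynamic_fields ? 'Blood Pressure';
    SELECT * FROM research_rows WHERE dynamic_fields -> 'Blood Pressure' = '120/80';

(In more recent Postgres versions, a jsonb column with a GIN index fills the same role.)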
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Querying a MySQL database from an iOS application

I have a MySQL database containing 1000 records of people (Name, Age, Address, Country). I want to make an application that:
Randomly displays ONE of the records in the database.
Randomly displays ONE of the records according to a specific 'Country'.
My usual approach to this is: parse ALL the data to the iPhone using JSON, then manipulate the data there. Because of the volume of data the user needs to download (1000 records), and because I only need ONE record to be displayed, it seems wasteful to download all 1000 records and then sort the data. What other approach can I use? (I'm not sure, but is it possible to query the database directly?)
It would be very unsafe to query the DB directly, and the usual approach you mentioned would be a very bad idea. The general approach for this scenario, safe and optimized, is to write some web-service end-points, in PHP or ASP.NET etc., which can optionally take parameters (like the specific country) as arguments and return the data in JSON format for you to display. This is how I'd approach such a problem. Here's a tutorial on writing some basic web services with PHP:
http://davidwalsh.name/web-service-php-mysql-xml-json
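For illustration, the query behind such an endpoint could be as simple as this (assuming a table named people with a country column):

    -- One random person overall
    SELECT * FROM people ORDER BY RAND() LIMIT 1;

    -- One random person from a specific country (passed in by the endpoint)
    SELECT * FROM people WHERE country = 'France' ORDER BY RAND() LIMIT 1;

ORDER BY RAND() is perfectly fine at 1000 rows; for much larger tables you would want a different random-row technique.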
Hope that'd help.

PostgreSQL and retrieving real-time business statistics: queries take too long - is there a solution?

We have a national application and the users would like to have accurate business statistics regarding some tables.
We are using Tomcat, Spring WS and Hibernate on top of that.
We have thought of many solutions:
A plain old query for each user request. The problem is that those tables contain millions of records, so every query takes many seconds at least. This solution was never used.
The actual solution in use: triggers. But they are painful to create and difficult to maintain (no OO, no nice IDE, no real debugging). The only helpful part is the possibility of writing JUnit tests at a higher level to verify the expected results. And for each different statistic on a table, we have to create another trigger on that table.
Using the Quartz framework to consolidate data every X minutes.
I have learned that databases are not designed for these heavy and complicated queries.
A separate data warehouse optimized for read-only queries would be better (OLAP?).
But I don't have any clue where to start with PostgreSQL. (Is Pentaho the solution, or just part of it?)
How could we extract data from the production database? Using some extractor?
And when? Every night?
If it is periodic, how will we manage to maintain near real-time statistics if the data are only dumped into our data warehouse once per day?
"I have learn that databases are NOT DESIGNED for these heavy and complicated queries."
Well you need to unlearn that. A database was designed for just these type of queries. I would blame bad design of the software you are using before I would blame the core technology.
It seems I have been misunderstood.
For those who think that a classic database is designed even for computing real-time statistics with queries over billions of rows: they might need to read articles on the origins of OLAP and on why people bothered to design products around it, if the answer to performance were just a design question.
"I would blame bad design of the software you are using before I would blame the core technology."
By the way, I'm not using any software (or does pgAdmin count?). I have two basic tables, you can't make it simpler, and the problem appears when you have billions of rows to retrieve for statistics.
For those who think it is just a design problem, I'm glad to hear their clever answer (not triggers, I know that one) to a simple problem:
Imagine you have 2 tables: employees & phones. An employee may have 0 to N phones.
Now let's say that you have 10,000,000 employees and 30,000,000 phones.
Your end users want to know, in real time:
1) the average number of phones per employee
2) the average age of employees who have more than 3 phones
3) the average number of phones for employees who have been with the company for more than 10 years
You potentially have 100 users who want those real-time statistics at any time.
Of course, no query should take more than 1/4 second.
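For reference, the straightforward versions of these statistics are the queries below, and they are exactly what becomes too slow at this volume (column names are invented for the example):

    -- 1) average number of phones per employee
    SELECT avg(phone_count)
    FROM (
        SELECT e.id, count(p.id) AS phone_count
        FROM employees e
        LEFT JOIN phones p ON p.employee_id = e.id
        GROUP BY e.id
    ) per_employee;

    -- 2) average age of employees with more than 3 phones
    SELECT avg(e.age)
    FROM employees e
    JOIN (
        SELECT employee_id
        FROM phones
        GROUP BY employee_id
        HAVING count(*) > 3
    ) many_phones ON many_phones.employee_id = e.id;

    -- 3) average number of phones for employees with more than 10 years of tenure
    SELECT avg(phone_count)
    FROM (
        SELECT e.id, count(p.id) AS phone_count
        FROM employees e
        LEFT JOIN phones p ON p.employee_id = e.id
        WHERE e.hired_on < now() - interval '10 years'
        GROUP BY e.id
    ) per_employee;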
Incrementally summarize the data..?
The frequency depends on your requirements, and in extreme cases you may need more hardware, but this is very unlikely.
Bulk load new data
Calculate new status [delta] using new data and existing status
Merge/update status
Insert new data into permanent table (if necessary)
NOTIFY wegotsnewdata
Commit
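A rough sketch of one such cycle in SQL (staging and summary table names are invented for illustration; ON CONFLICT needs PostgreSQL 9.5+ and a unique index on employee_id):

    BEGIN;

    -- Bulk load new data into a staging table
    COPY phones_staging FROM '/incoming/phones.csv' WITH (FORMAT csv);

    -- Calculate the delta and merge it into the small summary table
    INSERT INTO phone_counts (employee_id, phone_count)
    SELECT employee_id, count(*)
    FROM phones_staging
    GROUP BY employee_id
    ON CONFLICT (employee_id)
    DO UPDATE SET phone_count = phone_counts.phone_count + EXCLUDED.phone_count;

    -- Insert the new rows into the permanent table (if necessary)
    INSERT INTO phones SELECT * FROM phones_staging;
    TRUNCATE phones_staging;

    -- Tell listeners there is new data
    NOTIFY wegotsnewdata;

    COMMIT;

The real-time statistics are then computed from the small summary table instead of the 30-million-row phones table.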
StarShip3000 is correct, btw.