Retrieving real-time business statistics in PostgreSQL leads to queries that take too long: is there a solution?

We have a nationwide application, and the users would like accurate, up-to-date business statistics on some tables.
We are using Tomcat, Spring WS and Hibernate on top of PostgreSQL.
We have considered several solutions:
A plain query for each user request. The problem is that those tables contain millions of records, so every query takes at least several seconds. We have never used this solution.
The solution actually in use: triggers. But they are painful to write and difficult to maintain (no OO, no decent IDE, no real debugging). The only helpful part is the possibility to write JUnit tests at a higher level to verify the expected results. And for each different statistic on a table, we have to create yet another trigger on that table.
Using the Quartz framework to consolidate the data every X minutes.
I have learned that databases are not designed for these heavy and complicated queries.
A separate data warehouse optimized for read-only queries would be better (OLAP?).
But I have no clue where to start with PostgreSQL. (Is Pentaho the whole solution, or only a part of it?)
How would we extract data from the production database? With some kind of extractor?
And when? Every night?
If it is periodic, how do we keep near-real-time statistics when the data is only dumped into the data warehouse once a day?

"I have learn that databases are NOT DESIGNED for these heavy and complicated queries."
Well you need to unlearn that. A database was designed for just these type of queries. I would blame bad design of the software you are using before I would blame the core technology.

It seems I have been misunderstood.
Those who think a classic database is designed to serve real-time statistics with queries over billions of rows might want to read up on the origins of OLAP, and on why people bothered to design products around it if performance were merely a design question.
"I would blame bad design in the software you are using before I would blame the core technology."
By the way, I am not using any software on top (unless pgAdmin counts?). I have two basic tables, you cannot make it simpler, and the problem appears as soon as you have billions of rows to retrieve for statistics.
For those who think it is just a design problem, I would be glad to hear their clever answer (not triggers, I already know that one) to a simple problem:
Imagine you have two tables, employees and phones. An employee may have 0 to N phones.
Now let's say you have 10,000,000 employees and 30,000,000 phones.
Your end users want to know, in real time:
1. the average number of phones per employee
2. the average age of employees who have more than 3 phones
3. the average number of phones for employees who have been with the company for more than 10 years
You potentially have 100 users who want those real-time statistics at any time.
Of course, no query should take more than a quarter of a second.
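Written as plain SQL over such a schema, those three statistics would look roughly like the sketch below (only an illustration: the birth_date and hired_on columns are assumed, since the question does not spell out the exact columns).

-- 1. Average number of phones per employee (employees without a phone count as 0)
SELECT avg(p.phone_count)
FROM (
    SELECT count(ph.id) AS phone_count
    FROM employees e
    LEFT JOIN phones ph ON ph.employee_id = e.id
    GROUP BY e.id
) p;

-- 2. Average age of employees who have more than 3 phones
SELECT avg(date_part('year', age(e.birth_date)))
FROM employees e
WHERE (SELECT count(*) FROM phones ph WHERE ph.employee_id = e.id) > 3;

-- 3. Average number of phones for employees hired more than 10 years ago
SELECT avg(p.phone_count)
FROM (
    SELECT count(ph.id) AS phone_count
    FROM employees e
    LEFT JOIN phones ph ON ph.employee_id = e.id
    WHERE e.hired_on < now() - interval '10 years'
    GROUP BY e.id
) p;

Each of these has to scan or join tens of millions of rows, which is exactly why they cannot return in a quarter of a second without some form of pre-aggregation.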

Incrementally summarize the data?
The frequency depends on your requirements, and in extreme cases you may need more hardware, but that is very unlikely. The cycle looks roughly like this (a SQL sketch follows below):
Bulk load the new data
Calculate the new status [delta] using the new data and the existing status
Merge/update the status
Insert the new data into the permanent table (if necessary)
NOTIFY wegotsnewdata
Commit
StarShip3000 is correct, btw.
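A minimal sketch of that cycle in PostgreSQL might look like the following. The staging table phones_staging and the one-row summary table phone_stats are made-up names for illustration; a real summary would carry whatever counters your statistics need (row counts, sums of ages, and so on) so that averages can be derived without touching the big tables.

BEGIN;

-- 1. Bulk load the new data into a staging table (e.g. with COPY).
-- COPY phones_staging FROM '/path/to/new_phones.csv' CSV;

-- 2./3. Compute the delta from the staged rows and merge it into the summary.
UPDATE phone_stats
SET phone_count = phone_count + (SELECT count(*) FROM phones_staging);

-- 4. Move the staged rows into the permanent table and empty the staging table.
INSERT INTO phones SELECT * FROM phones_staging;
TRUNCATE phones_staging;

-- 5. Tell listening clients that fresh statistics are available.
NOTIFY wegotsnewdata;

COMMIT;

The application then reads phone_stats (a handful of rows) instead of aggregating the 30-million-row phones table on every request.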

Related

How to design the Timescaledb schema for engineering data

For a project at the engineering firm I work for, I want to use TimescaleDB to store experiment results.
I'm trying to figure out the DB schema that will deliver the best read performance.
In my use case, I will be collecting several sensor logs for several experiments. For each experiment, time starts from 0.
Thinking about it as a newbie, I see a couple of options:
1. Each experiment gets its own table, with as many columns as there are sensors. I suspect this is a terrible solution.
2. Each sensor gets its own table, named after the sensor, with two columns: the value and an ID identifying the experiment. In this table there would be many duplicate time values, since every experiment's time starts from 0.
I'm not sure either of these is actually a good solution.
What kind of schema should I use?
Thank you in advance,
Guido
Update 1:
This is the schema I'm testing out right now:
After loading data from about 10 tests (a few hundred sensors logged at 100 Hz), the Ch_Values table already has 180M rows and queries were terribly slow. I added indexes on ch_index, run_index and time, and now it's dramatically better. This is only ~10 tests, though; in reality the DB will contain hundreds of tests.
Any suggestions on how to store this data efficiently?
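For what it's worth, the layout usually suggested for this kind of data is neither of the two options above but a single narrow table with one row per sensor reading, which also matches the indexes that already helped in the update. A rough sketch; the table and column names (sensor_data, run_id, sensor_id) are placeholders, not the asker's actual schema:

CREATE TABLE sensor_data (
    run_id    integer          NOT NULL,  -- which experiment/test
    sensor_id integer          NOT NULL,  -- which channel/sensor
    t         double precision NOT NULL,  -- seconds since the start of the run
    value     double precision NOT NULL
);

-- Queries typically filter by run and sensor first, so index those columns together.
CREATE INDEX ON sensor_data (run_id, sensor_id, t);

-- If absolute timestamps are recorded as well, that column could be used to turn
-- the table into a TimescaleDB hypertable, e.g. SELECT create_hypertable('sensor_data', 'ts');

Sensor names and per-experiment metadata would then live in small lookup tables keyed by sensor_id and run_id.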

Is it bad practice to keep everything in one table?

Looking for some feedback. I am building social-networking-type software; one of the features allows users to post news stories and have friends comment on them. In the past I have kept separate tables for things like news, comments, calendar events, etc. However, a friend has pointed me to the WordPress-style database structure of "posts" and "post_types", where everything lives in one table and carries a "post_type".
This would mean that news stories, comments, events, etc. all end up in the same table. I love the efficiency of writing functions that update a single table. However, a single table in my old software was 1.5 million rows, and I'd expect this new table to grow to about 10 million in the first year.
Does MySQL handle this amount of data well as long as the indexes are set properly, or is it smarter to break everything into separate tables for that reason?
There is no general answer. It depends.
MySQL has no problem dealing with large tables. However, it will not work miracles for you. In the end, it's all about efficiency. It means you need to optimize your design for multiple, mutually exclusive goals. What you want to find is a sweet spot between complexity, performance, extensibility and maintenance costs. This is different for every project and is kind of an art.
Generally, you don't want to mix things that are too different. This is why they teach data normalization in just about every database book or CS course. If your data is small, it does not really matter. But if you have a lot of data and a lot of requests, you will almost certainly want to squeeze every last drop of performance out of your database. So not only will you be separating tables, scrutinizing indexes, inspecting execution plans, updating statistics, defragmenting pages and measuring performance, but you will also be using partitioning, clustering, materialized views, read-only replicas, I/O and CPU parallelism, SSDs, Memcached and a variety of other tools. This will all be much more challenging if you started with a bad data model. In my personal experience, locking is something that really bites you in the ass with large tables, unless you can somehow live without transactions.
To make any kind of estimate, you need a performance baseline. Just knowing the number of records is not enough. How many requests will there be? What will the queries be doing? Where do you expect the heaviest load? Can you prepare the most common queries that the system will be running most of the time? What about peak hours? What hardware will be available to run this load? What is the ratio of reads to writes? And so on.
To make optimizations, you need some kind of goal. As always, you will find out that in order to get there, you have to sacrifice something. Because you probably don't have all those answers yet, try following the principle of minimalism - start small, measure, analyze, improve, repeat.
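To make the original question concrete: the single-table design is not doomed at 10 million rows, but it only stays fast if the indexes match the queries. A hedged MySQL sketch, with column names invented for illustration rather than taken from the asker's schema:

CREATE TABLE posts (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    post_type  VARCHAR(20)     NOT NULL,  -- 'news', 'comment', 'event', ...
    user_id    BIGINT UNSIGNED NOT NULL,
    parent_id  BIGINT UNSIGNED NULL,      -- e.g. the news story a comment belongs to
    body       TEXT,
    created_at DATETIME        NOT NULL,
    KEY ix_type_created (post_type, created_at),  -- "latest news stories"
    KEY ix_parent (parent_id)                     -- "comments for this story"
) ENGINE=InnoDB;

The moment news, comments and events need very different columns or access patterns, the normalization advice above applies and separate tables become the simpler design.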

Real-time statistics (example). NoSQL

Task
Hi, I have 2-3 thousand users online. I also have groups, teams and a couple of other entities that contain users. Roughly every 10 seconds I want to show online statistics (querying various parameters of the users and the other entities). Every 5-30 seconds or so a user can change his status, and roughly every hour he can move to another group or team. Which NoSQL database should I use? I don't have experience with them; I just know NoSQL is quite fast, and I have read a little about Redis, MongoDB and Cassandra.
Of course, I store this data model in an RDBMS (except the online status and the statistics).
I am thinking about the following solution:
Store all the data as JSON in Redis, prefixing each key with its entity type (e.g. 'user_' + userId):
user_id:{"status":"123", "group":"group_id", "team":"team_id", "firstname":"firstname", "lastname":"lastname", ... other attributes}
group_id:{users:[user_id,user_id,...], ... other group attributes}
team_id:{users:[user_id,user_id,...], ... other team attributes}
...
What would you recommend or propose? Will it be convenient to query such data?
Maybe I can use some popular standard algorithms to compute the statistics (e.g. a Monte Carlo algorithm for percentage statistics, I don't know). Thanks
You could use Redis Hyperloglog, a feature added in Redis 2.8.9.
This blog post describes how to calculate very efficiently some statistics that look quite similar to the ones you need.

NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and gain revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an ad ("click"), and every time someone completes an ad ("lead").
Since we get so much traffic, we have millions of records in each of these tables. We then have to query these tables to let users see how much they have earned, so we end up performing multiple queries on tables with millions and millions of rows, multiple times in one request, hundreds of times concurrently.
We're looking to move away from MySQL to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, run ad hoc queries where we can filter on any single column, so we could do things like:
FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)
FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
etc.
Does anyone have any good suggestions? I was considering MongoDB or CouchDB, but I'm not sure they can handle querying millions of records multiple times a second, or the kind of ad hoc queries we need.
Thanks!
With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.
NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
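For the example filters in the question, that mostly means composite indexes matching the WHERE clauses. A sketch in MySQL, reusing the table and column names from the question; the payout column is an assumption for the earnings part:

CREATE INDEX ix_leads_user_country ON leads  (user_id, country);
CREATE INDEX ix_clicks_ad_user     ON clicks (ad_id, user_id, country);

-- The per-user earnings query then becomes an index range scan rather than a full table scan:
SELECT COUNT(*), SUM(payout)
FROM leads
WHERE user_id = 501 AND country = 'US';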
CouchDB's map/reduce is incremental which means it only processes a document once and stores the results.
Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.
You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.
I would not worry about performance, but rather if CouchDB can satisfy your ad-hoc query requirements. CouchDB wants to know what queries will occur, so it can do the hard work up-front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!
All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However, CouchDB cannot handle multiple inequality conditions simultaneously: you cannot ask CouchDB, in a single query, for users between ages 18 and 40 who have also clicked fewer than 10 times.
The nice thing about CouchDB's HTTP and JavaScript interface is that it's easy to do a quick feasibility study. I suggest you try it out!
Most people would probably recommend MongoDB for a tracking/analytics system like this, for good reasons. You should read the "MongoDB for Real-Time Analytics" chapter of the "MongoDB: The Definitive Guide" book. Depending on the size of your data and your scaling needs, you could get all the performance, schema-free storage and ad hoc querying features. You will need to decide for yourself whether the system's issues with durability and unpredictability are a risk for you or not.
For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed and real durability. To get a feel for how such a system would be implemented in Redis, see this gist. The downside is that you'd need to define all the "indices" yourself instead of getting them for "free", as is the case with MongoDB. Nevertheless, there's no free lunch, and MongoDB indexes are definitely not a free lunch.
I think you should have a look at what ElasticSearch would give you:
Blazing speed
Schema-free storage
Sharding and distributed architecture
Powerful analytic primitives in the form of facets
Easy implementation of "sliding window"-type data storage with index aliases
At heart it is a "full-text search engine", but don't let that confuse you. Read the "Data Visualization with ElasticSearch and Protovis" article for a real-world use case of ElasticSearch as a data-mining engine.
Have a look at these slides for a real-world use case of the "sliding window" scenario.
There are many client libraries for ElasticSearch available, such as Tire for Ruby, so it's easy to get off the ground with a prototype quickly.
For the record (with all due respect to #jhs :), based on my experience, I cannot imagine an implementation where CouchDB is a feasible and useful option. It would be an awesome backup store for your data, though.
If your working set fits in memory and you index the right fields in the documents, you'll be all set. Your requirements are not very typical, but I am sure that with proper hardware, the right collection design (denormalize!) and indexing, you should be good to go. Read up on Mongo querying, and use explain() to test your queries. Stay away from IN and NOT IN clauses; that would be my suggestion.
It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to remember that you must be able to split your data between your nodes effectively, otherwise you will end up with a horizontally scalable system where all the work is still being done on one node (albeit with better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases are eventually consistent (CP or AP), while traditional relational DBMSs are CA. This will affect the way you handle data and create certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no secondary indexing concept. All your indexes will have to be built by your application logic, and any updates and deletes will have to be managed accordingly. With Mongo you can actually create indexes on fields and query them relatively quickly, and there is also the possibility of integrating Solr with Mongo. You don't just query by ID in Mongo as you do in HBase, which is a column-family (aka Google BigTable-style) database where you essentially have nested key-value pairs.
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it, and so on. We don't just use one system but many, each best suited to the job at hand. For this process we use different systems at different stages, as this gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, lets us keep track of everything as we go (data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing MySQL on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned so far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your case.
Again it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. However, for your use case I think a column-family database may be the best solution; I believe a few places have implemented similar solutions to very similar problems (I think the NYTimes is using HBase to monitor user page clicks). Another great example is Facebook and the like; they are using HBase for this. There is a really good article here which may help you along your way and further explain some of the points above. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
A final point would be that NoSQL systems are not the be-all and end-all. Putting your data into a NoSQL database does not mean it is going to perform any better than MySQL, Oracle or even text files... For example, see this blog post: http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html
I'd have a look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In-memory key-value (not column family) - CP
Cassandra - Column family - Available & Partition Tolerant (AP)
HBase - Column family - Consistent & Partition Tolerant (CP)
Hadoop/Hive - Also have a look at Hadoop streaming...
Hypertable - Another CF CP DB.
VoltDB - A really good-looking product, a relational database that is distributed and might work for your case (it may be an easier move). They also seem to provide enterprise support, which may be more suited to a prod env (i.e. it gives business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.

Are document databases good for storing large amounts of Stock Tick data? [closed]

I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data, and I wanted to know whether this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would just be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc. for even faster calculations.
Example data:
500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint; all decimal/float)
So what do you guys think?
Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema so ticks are organized in an object keyed by second (or an object of minutes, each minute being another object holding 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
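For comparison, the pre-aggregation idea from the question also works in a plain relational store. A rough PostgreSQL-style sketch, reusing the market_data column names assumed in the queries above:

-- Daily aggregates per symbol, materialized once and queried cheaply afterwards.
CREATE MATERIALIZED VIEW market_data_daily AS
SELECT symbol,
       date_trunc('day', time) AS day,
       min(close)  AS min_close,
       max(close)  AS max_close,
       sum(volume) AS total_volume
FROM market_data
GROUP BY symbol, date_trunc('day', time);

-- Refresh after each load, e.g. nightly:
-- REFRESH MATERIALIZED VIEW market_data_daily;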
The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes then you can shard the data and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more complex and expensive.
Really fast, and as with point #1, you get the ability to add RAM horizontally to keep the speed up.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple: very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options for extracting this data and cramming it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there is a host of tools, both commercial and open source (such as Pentaho), that will handle the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple data store such as a key-value or document database is also beneficial in cases where the analytics reasonably exceed a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look for a horizontally scaling processing method for coming up with the per-unit/per-time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)