Cron job + new table vs materialized views for webapp analytics? - postgresql

Hi
I am in the process of adding analytics to my SaaS app, and I'd love to hear other people's experiences doing this.
Currently I see two different approaches:
Do most of the data handling at the DB level, building and aggregating data into materialized views for performance boost. This way the data will stay normalized.
Have different cronjobs/processes that will run at different intervals (10 min, 1 hour etc.) that will query the database and insert aggregate results into a new table. In this case, the metrics/analytics are denormalized.
Which approach makes the most sense, maybe something completely different?

On really big data, the cronjob or ETL is the only option. You read the data once, aggregate it and never go back. Querying aggregated data is then relatively cheap.
Plain (non-materialized) views still go through the underlying tables. If you run EXPLAIN on a view-based query, you will see that the data is still being read from the tables, possibly using indexes (if suitable indexes exist). Querying terabytes of data this way is not viable.
The only problem with the cronjob/ETL approach is that it's a PITA to maintain. If you find a bug in the production environment, you are screwed. You might spend days or weeks fixing and recalculating aggregations. Simply said: you have to get it right the first time :)

Related

How to design the Timescaledb schema for engineering data

For a project in the engineering firm I work in, I want to leverage TimescaleDB to store experiment results.
I'm trying to figure out the db schema that will deliver the best read performance.
In my use case, I will be collecting several sensor logs for several experiments. For each experiment the time will start from 0.
As a newbie, I think there are a couple of options:
1- each experiment has its own table. The table will have as many columns as there are sensors. I bet this is a terrible solution.
2- each sensor has its own table. I'm thinking of having 2 columns, value and an ID that identifies the experiment, and the table name will be the sensor's name. In this table there will be many duplicate times since each experiment's time starts from 0.
I'm not sure either of these solutions is actually going to be good.
What kind of schema should I use?
Thank you in advance,
Guido
Update1:
This is the schema I'm testing out right now:
After loading data from about 10 tests (a few hundred sensors logged at 100Hz) the Ch_Values table has 180M rows already and queries are terribly slow. I added indexes on ch_index, run_index, and time and now it's dramatically better. This is only ~10 tests tho. In reality, the db will contain hundreds of tests.
Any suggestion on how to efficiently store these data?
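For what it's worth, the indexing mentioned in the update can be sketched as a single composite index on one narrow table. This is only an illustration: it uses Python's built-in sqlite3 as a runnable stand-in rather than TimescaleDB, the table and column names follow the ones mentioned above, and the index name and the channel_window helper are made up for the example. The assumption is that reads typically ask for one channel of one run over a time window.

```python
import sqlite3


def create_channel_schema(conn):
    # One narrow table for all runs and channels (names taken from the post).
    conn.executescript("""
        CREATE TABLE ch_values (
            run_index INTEGER NOT NULL,  -- which experiment
            ch_index  INTEGER NOT NULL,  -- which sensor channel
            time      REAL    NOT NULL,  -- seconds since the run started
            value     REAL    NOT NULL
        );
        -- Equality columns first, range column last, matching the query below.
        CREATE INDEX idx_run_ch_time
            ON ch_values (run_index, ch_index, time);
    """)


def channel_window(conn, run_index, ch_index, t_start, t_end):
    """Return (time, value) rows for one channel of one run, ordered by time."""
    return conn.execute(
        "SELECT time, value FROM ch_values "
        "WHERE run_index = ? AND ch_index = ? AND time BETWEEN ? AND ? "
        "ORDER BY time",
        (run_index, ch_index, t_start, t_end),
    ).fetchall()
```

The same CREATE INDEX statement form is valid in PostgreSQL. Whether this column order is the right one depends on the queries actually run, so it is worth checking the query plan against real data.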

Is it bad practice to keep everything in one table?

Looking for some feedback - I am building social-networking-type software - one of the features allows users to post news stories and have friends comment. In the past I have kept different tables for things like news, comments, calendar events, etc. However, a friend has turned me on to the WordPress-type database structure of "POSTS" and "post_types", where everything is in one table and has a "post_type".
This would mean that news stories, comments, events, etc are all in the same table. I love the efficiency of creating functions that are updating one table. HOWEVER, a single table in my old software was 1.5MILLION rows, I'd expect this new table to grow to about 10Million in the first year.
Does MySQL handle this size of data okay as long as indexes are properly set, or is it smarter to break everything into separate tables for this reason?
There is no general answer. It depends.
MySQL has no problem dealing with large tables. However, it will not do miracles for you. In the end, it's all about efficiency. It means you need to optimize your design for multiple, mutually exclusive goals. What you want to find is a sweet spot between complexity, performance, extensibility and maintenance costs. This is different for every project and is kind of an art.
Generally you don't want to mix things that are too different. This is why they teach about data normalization in just about every database book or CS course. If your data is small, this does not really matter. But if you have a lot of data and a lot of requests, you will almost certainly want to squeeze every last drop of performance from your database. So not only will you be separating tables, scrutinizing indexes, inspecting execution plans, updating statistics, defragmenting pages and measuring performance, but you will also be using partitioning, clustering, materialized views, read-only replicas, I/O and CPU parallelism, SSDs, Memcached and a variety of other tools. This will all be much more challenging if you have started with a bad data model. In my personal experience, locking is something that really bites you in the ass with large tables, unless you can somehow live without transactions.
To make any kinds of estimations, you need to have some performance baseline. Just knowing number of records is not enough. How many requests will there be? What will the queries be doing? Where do you expect the heaviest load? Can you prepare the most common queries that the system will be running most of the time? What about peak hours? What hardware will be available to run this load? What is the ratio of reads to writes? Etc.
To make optimizations, you need some kind of goal. As always, you will find out that in order to get there, you have to sacrifice something. Because you probably don't have all those answers yet, try following the principle of minimalism - start small, measure, analyze, improve, repeat.

Getting Over to Someone Why Views Can't Be Used as Tables

For some reason I'm having a hard time getting over to some people that using a view in Postgres as you would use a table, is a bad idea.
As some background, there are a number of tables containing completely static data that is updated every few months via a batch import into different tables by date - table_201603 or table_201607. A view has then been created called 'table' which clients then use which is just a 'SELECT * FROM' of the table. When an updated batch of data is put into a new table the view is then updated to point at the new table. This means an in-place rename of the table does not need to take place that might mean downtime. This is in a version of Postgres before 9.3 where materialized views came in, just to clarify. These tables generally have about 100 million rows in them.
This is understandably leading to some confusing results when people are querying these views with very inconsistent query times. Sometimes queries are taking seconds, other times 20 or 30 milliseconds.
Additional: This is geospatial data, so they're doing geospatial queries on a view.
I know what many of the pitfalls here are - views are created on-the-fly like a sub-query, you're very much at the whim of the query planner as to what predicates get brought down and how long results are cached as results aren't physically stored as tables - but can anyone see anything else and suggest a better way of doing this? I can imagine this would be a reasonably common scenario so it might help others.
Thanks,
In general, this reminds me of a use case for synonyms. However, there are no synonyms in Postgres, and they recommend using views and/or separation by schema:
https://www.postgresql.org/message-id/kon2r2$mo6$1#ger.gmane.org
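The view-swap pattern described in the question can be sketched roughly as follows. This is a hedged illustration using Python's built-in sqlite3 as a stand-in: SQLite has no CREATE OR REPLACE VIEW, so the sketch drops and recreates the view inside one transaction, whereas in PostgreSQL CREATE OR REPLACE VIEW would be the closer equivalent (it requires the new query to produce the same leading columns). The view name places and the publish_batch helper are hypothetical; the table names follow the table_201603 / table_201607 convention from the question.

```python
import sqlite3


def publish_batch(conn, view_name, table_name):
    """Repoint view_name at table_name in a single transaction.

    The names are interpolated into DDL, so they must come from trusted
    code, never from user input.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("BEGIN")  # sqlite3 does not open a transaction for DDL by itself
        conn.execute(f'DROP VIEW IF EXISTS "{view_name}"')
        conn.execute(
            f'CREATE VIEW "{view_name}" AS SELECT * FROM "{table_name}"'
        )
```

Clients keep querying the view name, and a new batch is published by calling publish_batch once the new table has been fully loaded and indexed.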

Implement interval analysis on top of PostgreSQL

I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool which determines unique entries for a specific interval, let's say between yesterday and 2 months before yesterday.
Depending on the interval, the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end which would let users query this data quite quickly, similar to Google Analytics?
I was thinking of moving the data into Redis and do something clever with interval and sorted sets etc. but I was wondering if there's something in PostgreSQL which would allow to execute aggregated queries, re-use old queries, so that for instance, after querying the first couple of days it does not start from scratch again when looking at different interval.
If not, what should I do? Export the data to something like Apache Spark or DynamoDB and do the analysis there, filling Redis to retrieve it more quickly?
Either will do.
Aggregation is a basic task they all can do, and your data is small enough to fit into main memory. So you don't even need a database (but the aggregation functions of a database may still be better implemented than if you rewrite them, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S. make sure to enable data indexing, and choose the right data types. Maybe check query plans, too.
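To make the "just try it in SQL" advice concrete, here is a minimal sketch of counting unique entries whose interval overlaps a query window. It runs on Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name sessions, its columns, the index name and the unique_entries helper are all assumptions, since the question does not show its schema. Two intervals overlap when each one starts before the other ends, which is what the WHERE clause expresses.

```python
import sqlite3


def create_interval_schema(conn):
    # Hypothetical table: one row per entry interval, ISO-8601 text timestamps.
    conn.executescript("""
        CREATE TABLE sessions (
            entry_id   INTEGER NOT NULL,
            started_at TEXT    NOT NULL,
            ended_at   TEXT    NOT NULL
        );
        CREATE INDEX idx_sessions_start_end
            ON sessions (started_at, ended_at);
    """)


def unique_entries(conn, window_start, window_end):
    """Count distinct entries with an interval overlapping the window."""
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT entry_id) FROM sessions "
        "WHERE started_at <= ? AND ended_at >= ?",
        (window_end, window_start),
    ).fetchone()
    return count
```

In PostgreSQL the timestamps would normally be real timestamp columns rather than text, and checking the query plan, as the answer suggests, shows whether the index is actually used for a given window.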

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are joined to this table. The data is fetched, but it takes around 10 minutes even though I have indexed the join columns and used materialized views in Postgres. Performance is still very low: a SELECT query on the materialized view takes 5 minutes.
Please suggest some technique to get the result within 5 seconds. I don't want to change the DB storage structure, as it would require too many code changes to support. I would like to know if there are any built-in methods for improving query speed.
Thanks in Advance
In general you can take care of this issue by creating a better data structure (most engines do this for you to an extent with keys).
But if you were to create a sort column of sorts and build a tree-like structure on it, you'd be left with a search cost of O(log N) per lookup rather than what you may be facing right now. This should give you a large speed-up in your searches.
This is in regard to binary trees, red-black trees and so on.
Another option for a speed-up may be to make use of something along the lines of Redis, i.e. a database caching layer.
For analytical reasons in the past I have also chosen to make use of technologies related to hadoop. Though this may be a larger migration in your case at this point.
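The point about sorted, tree-like structures can be illustrated outside the database with a binary search over a sorted list, which is the same idea a B-tree index applies on disk. This is only a sketch: range_lookup is a hypothetical helper, and it uses Python's standard bisect module rather than anything Postgres-specific.

```python
from bisect import bisect_left, bisect_right


def range_lookup(sorted_keys, low, high):
    """Return the keys with low <= key <= high from an ascending list.

    The two boundary searches take O(log N) comparisons each, instead of
    the O(N) comparisons a scan over unsorted data needs.
    """
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    stop = bisect_right(sorted_keys, high)   # first position with key > high
    return sorted_keys[start:stop]
```

For example, range_lookup(list(range(0, 100, 5)), 12, 31) returns [15, 20, 25, 30] after two short binary searches rather than a pass over every key.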