Aggregate as part of ETL or within the database?

Is there a general preference or best practice when it comes to whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example) versus with a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?

If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database to a staging area and do joins and aggregations there. But take care that the ETL tool does not perform a nested loop join on the source database tables, that is, pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your goal is to perform joins and aggregations as quickly and efficiently as possible, by all means push them down to the source database. This may put more load on the source database, though. I say “may” because if all you need is an aggregation on a single table, it can be cheaper to perform it in the source database than to pull the whole table.
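To make the trade-off concrete, here is a minimal sketch that runs the same daily aggregation both ways. It assumes an in-memory SQLite database as a stand-in for the source database, and the `interactions` table and its columns are hypothetical. The point is only the number of rows that have to leave the database in each case.

```python
import sqlite3
from collections import defaultdict

# Assumption: in-memory SQLite stands in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (day TEXT, user_id INTEGER, clicks INTEGER)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [
        ("2024-01-01", 1, 3),
        ("2024-01-01", 2, 5),
        ("2024-01-02", 1, 2),
        ("2024-01-02", 3, 7),
    ],
)

def aggregate_on_worker(conn):
    """Pull every detail row, then sum per day inside the ETL process."""
    totals = defaultdict(int)
    rows_transferred = 0
    for day, clicks in conn.execute("SELECT day, clicks FROM interactions"):
        totals[day] += clicks
        rows_transferred += 1
    return dict(totals), rows_transferred

def aggregate_in_database(conn):
    """Push the GROUP BY down; only one row per day leaves the database."""
    rows = conn.execute(
        "SELECT day, SUM(clicks) FROM interactions GROUP BY day"
    ).fetchall()
    return dict(rows), len(rows)

worker_totals, worker_rows = aggregate_on_worker(conn)
db_totals, db_rows = aggregate_in_database(conn)
```

Both functions produce the same totals. The first moves every detail row to the worker, while the second moves one row per day at the cost of doing the grouping on the database server.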

If you aggregate by day, what if your boss wants it aggregated by hour or week?
The general rule is: your fact table should be as granular as possible. Then you can drill down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
Tools like Pentaho Aggregation Designer can automate this for you.
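As a rough illustration of keeping the fact table granular and deriving pre-aggregated tables from it, the sketch below builds hourly, daily, and monthly rollups from one detail table. SQLite is used only as a stand-in, and the `fact_events` table, the `agg_*` table names, and the `build_rollups` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fact table kept at the finest grain: one row per event (hypothetical schema).
conn.execute("CREATE TABLE fact_events (event_ts TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?)",
    [
        ("2024-01-01 09:15:00", 10),
        ("2024-01-01 09:45:00", 20),
        ("2024-01-01 14:05:00", 5),
        ("2024-01-02 08:30:00", 7),
    ],
)

# strftime patterns that truncate a timestamp to each rollup grain.
GRAINS = {
    "hour": "%Y-%m-%d %H:00",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}

def build_rollups(conn):
    """Materialize one pre-aggregated table per grain from the detail rows."""
    for grain, pattern in GRAINS.items():
        conn.execute(f"DROP TABLE IF EXISTS agg_{grain}")
        conn.execute(
            f"CREATE TABLE agg_{grain} AS "
            f"SELECT strftime('{pattern}', event_ts) AS bucket, "
            f"SUM(amount) AS total FROM fact_events GROUP BY bucket"
        )

build_rollups(conn)
hourly = conn.execute("SELECT bucket, total FROM agg_hour ORDER BY bucket").fetchall()
daily = conn.execute("SELECT bucket, total FROM agg_day ORDER BY bucket").fetchall()
monthly = conn.execute("SELECT bucket, total FROM agg_month ORDER BY bucket").fetchall()
```

Because the detail rows are retained, adding a new grain later (week, for example) only requires another rollup, not a reload from the source.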

Related

Creating denormalized tables with triggers too slow

Assume I'm doing everything in one PostgreSQL database. I have 10 source tables that I'm using to create one huge denormalized table. These source tables change frequently and have triggers firing after insert/update/delete to modify the denormalized table in near real time. The problem is that some of the source tables I'm joining are huge (one table has 120M rows and another 25M), and the statements that insert new rows into the denormalized table run for a long time (20+ minutes for 50-100k rows).
So I was wondering what the best solution would be for applying insert/update/delete (IUD) changes to this denormalized table, based on changes arriving in the source tables. Should I run these operations on a schedule, should I dedicate a specific database replica just for this, or should I keep trying to use triggers?
I'm open to using a totally different approach, as long as it's doable on the same database.

That sounds like there is no good and simple solution.
Perhaps you don't need that one huge denormalized table, and denormalizing a few attributes would be good enough for your query speed.
If not, you will probably need a kind of data warehouse for the denormalized data, and refresh that daily with increments. Ideally, tables there are already pre-aggregated.
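One possible shape for such an incremental refresh is a scheduled job that re-joins only the rows changed since the last run. The sketch below is an assumption-laden illustration, not a PostgreSQL recipe: it uses SQLite as a stand-in, a hypothetical `updated_at` column as the change marker, and hypothetical table names. It does not handle deletes or changes to the joined `customers` rows, which a real job would need to cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER, updated_at TEXT
    );
    CREATE TABLE orders_denorm (
        order_id INTEGER PRIMARY KEY, customer_name TEXT, amount INTEGER
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 1, 100, "2024-01-01"), (2, 2, 50, "2024-01-02")],
)

def refresh_increment(conn, since):
    """Re-join only the orders changed after `since`; return the new watermark."""
    conn.execute(
        """
        INSERT OR REPLACE INTO orders_denorm (order_id, customer_name, amount)
        SELECT o.id, c.name, o.amount
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
        WHERE o.updated_at > ?
        """,
        (since,),
    )
    newest = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    return newest or since

watermark = refresh_increment(conn, "")  # initial load: every row is newer than ""
conn.execute("UPDATE orders SET amount = 75, updated_at = '2024-01-03' WHERE id = 2")
watermark = refresh_increment(conn, watermark)  # re-joins only order 2
result = conn.execute("SELECT * FROM orders_denorm ORDER BY order_id").fetchall()
```

The join cost of each run is then proportional to the number of changed rows rather than to the size of the large source tables.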

Are schemas in PostgreSQL physical objects?

I use schemas in PostgreSQL to organize my huge accounting database. At the end of every year I run a reconciliation process that creates a new schema for the next year.
Are the files of the new schema physically separated from those of the old schema? Or are all schemas stored on the hard disk together?
This is vital for me because at the end of every year I have huge tables with millions of records, which means I'll soon be running heavy queries (I didn't plan for this when I chose PostgreSQL).

Schemas are namespaces so they are a "logical" thing, not a physical thing.
As documented in the manual, each table is represented as one (or more) files inside the directory corresponding to the database the table is created in. The namespaces (schemas) are not reflected in the physical database layout.
In general you shouldn't care about the storage of the database to begin with and your SQL queries will not know where the actual data is stored.
"Millions" of rows is not considered "huge" these days. If you do run into performance problems, you will tune your query, e.g. by adding indexes or by rewriting it into a more efficient form. In rare cases partitioning a table can help with really huge tables, but then we are talking about hundreds of millions or even billions of rows. With small to medium-sized tables, partitioning usually doesn't help with performance.

Should I migrate to Redshift?

I'm currently struggling to query a big chunk of data that is stored in a partitioned table (one partition per date).
The data looks like this:
date, product_id, orders
2019-11-01, 1, 100
2019-11-01, 2, 200
2019-11-02, 1, 300
I have hundreds of date-partitions and millions of rows per date.
Now, if I want to query, for instance, the total orders for product IDs 1 and 2 over a period of 2 weeks, grouped by date (to show in a graph per date), the database has to go to 2 weeks of partitions and fetch the data from them.
That process can take a long time when the number of products is large or the requested time frame is long.
I have read that AWS Redshift is suitable for this kind of task. I'm considering moving my partitioned tables (aggregated analytics per date) to that technology, but I wonder if that's really what I should do to make those queries run much faster.
Thanks!

For your use case, Redshift is a good choice.
To get the best performance out of Redshift, it is very important to set a proper distribution key and sort key. In your case the "date" column should be the distribution key and "product_id" should be the sort key. Another important note: do not encode (compress) the "date" and "product_id" columns.
You should get better performance.

If you are struggling with traditional SQL databases, then Amazon Redshift is certainly an option. It can handle tables with billions of rows.
This would involve loading the data from Amazon S3 into Redshift. This will allow Redshift to optimize the way that the data is stored, making it much faster for querying.
Alternatively, you could consider using Amazon Athena, which can query the data directly from Amazon S3. It understands data that is partitioned into separate directories (eg based on date).

Which version of PostgreSQL are you using?
Are you using native partitioning or trigger-based inheritance partitioning?
Recent versions of PostgreSQL have improved partitioning management.

Considering your case, Amazon Redshift can be a good choice, and so can Amazon Athena. But it is also important to consider your application framework. Are you moving to Amazon only for the database, or do you have other Amazon services on your list too?
Also, before making the decision, please check the cost of Redshift.

Row based database or Column based database

We are working on an audit system where auditors are given access to the transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid or erroneous transactions that have some exceptions.
Generally, these analyses require the data to be presented on charts to view the outliers, and sometimes duplicate detection is done based on multiple columns.
Sometimes the exception detection algorithms are pretty involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over a huge number of rows.
Occasionally, they may change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL and NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or for a row-based store, like NoSQL or some RDBMS?
In short, the requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, joining it with another table in a later step, deleting some rows, etc.
Thanks

For your task, it does not really matter how the data is stored. You need to think instead how to create a solid dimensional model, populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
- SQL Server for data storage
- SSIS for data ETL (or write your own stored procedures if you know what you are doing)
- Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
- Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports
Open-source setup:
- PostgreSQL for data storage
- Stored procedures and/or Python to process data
- Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
- Tableau or Power BI for interactive reporting, or build your own reporting interface
I think a NoSQL database is the wrong choice here because auditing requires highly structured data.

Tableau - How to query large data sources efficiently?

I am new to Tableau and am having performance issues, so I need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL database.
The issue I am having is that it is not applying the filter criteria before asking MySQL for the data. So it is essentially doing a SELECT * from my query and not applying the filter criteria to the where clause. It pulls all the data from MySQL db back to Tableau, then throws away the un-needed data based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of the accounts from just doing a select from my account table to populate the filter list, then need to know how to apply that selection when it goes to pull the data from the main data query from MySQL.

To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.

I would personally use an extract: go into your MySQL back end and run your query as a CREATE TABLE extract1 AS statement (or whatever you want to call the table).
When you import this table into Tableau, its SELECT * will return only your already aggregated data. From here your query efficiency will be increased tenfold.
Unfortunately, it is still going to take a while to process your data: total time = Tableau processing time + MySQL back-end query time.
Try the extracts...
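A minimal sketch of that CREATE TABLE ... AS approach follows. It assumes SQLite as a stand-in for the MySQL back end; the `transactions` table, its columns, and the filter values are hypothetical, and `extract1` is the name used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (account_id INTEGER, tx_date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", 10),
        (1, "2024-02-10", 20),
        (2, "2024-01-07", 30),
        (3, "2024-01-09", 40),
    ],
)

# Materialize only the accounts and date range the dashboard needs,
# so a later SELECT * from the reporting tool returns just those rows.
conn.execute(
    """
    CREATE TABLE extract1 AS
    SELECT account_id, tx_date, amount
    FROM transactions
    WHERE account_id IN (1, 2)
      AND tx_date BETWEEN '2024-01-01' AND '2024-01-31'
    """
)

extract_rows = conn.execute("SELECT * FROM extract1 ORDER BY account_id").fetchall()
```

The filtering happens once, inside the database, when the table is built. The table is a snapshot, so it has to be rebuilt whenever the underlying data or the filter criteria change.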

I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done instead is create tables in SQL that already contain the filtered data, so the SELECT * returns only the needed data. The downside is that this takes up more space on the server, but that isn't a problem on my side.

For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing this data will be faster than over a live connection.
All charts and visualizations will load faster, which saves you time each time you open the dashboard.
The filters you use on the data set will also work faster with an extract connection. But to get the latest data you have to refresh the extract, or schedule a refresh on the server (if you are publishing the report to a server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.