Creating denormalized tables with triggers too slow - PostgreSQL

Assume I'm doing everything in one PostgreSQL database. I have 10 source tables I'm using to create one huge denormalized table. These source tables change frequently and have triggers firing after insert/update/delete to modify the denormalized table in near real time. The problem is that some of the source tables I'm joining are huge (one has 120M rows, another 25M), and the statements that insert new rows into the denormalized table run for a long time (20+ minutes for 50-100k rows).
So I was wondering what the best way would be to apply changes (insert/update/delete) to this denormalized table as changes come into the source tables. Should I run these operations on a schedule, should I dedicate a specific database replica just for this, or should I keep trying to make triggers work?
I'm open to using a totally different approach, as long as it's doable on the same database.
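For reference, here is a stripped-down version of the kind of per-row trigger I mean (table and column names are just placeholders, not my real schema):

    CREATE TABLE customers (id int PRIMARY KEY, name text);
    CREATE TABLE orders (id int PRIMARY KEY,
                         customer_id int REFERENCES customers,
                         total numeric);
    CREATE TABLE orders_denorm (order_id int PRIMARY KEY,
                                customer_name text,
                                total numeric);

    -- Re-derive the affected denormalized row whenever a source row changes.
    CREATE FUNCTION orders_denorm_sync() RETURNS trigger AS $$
    BEGIN
        DELETE FROM orders_denorm WHERE order_id = NEW.id;
        INSERT INTO orders_denorm (order_id, customer_name, total)
        SELECT o.id, c.name, o.total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        WHERE o.id = NEW.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_denorm_sync_trg
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_denorm_sync();

With 10 source tables the real join is much bigger, which is where the 20+ minutes comes from.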

It sounds like there is no good and simple solution here.
Perhaps you don't need that one huge denormalized table, and denormalizing a few attributes would be good enough for your query speed.
If not, you will probably need a kind of data warehouse for the denormalized data, and refresh it daily with increments. Ideally, the tables there are already pre-aggregated.
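A rough sketch of what such an incremental refresh could look like, assuming the source rows carry an updated_at timestamp, the warehouse table has a unique key on order_id, and you are on PostgreSQL 9.5+ for ON CONFLICT (all names here are invented):

    -- Upsert only the rows touched since the last refresh.
    INSERT INTO denorm_warehouse (order_id, customer_name, total)
    SELECT o.id, c.name, o.total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.updated_at >= (SELECT last_run FROM denorm_refresh_log)
    ON CONFLICT (order_id) DO UPDATE
        SET customer_name = EXCLUDED.customer_name,
            total         = EXCLUDED.total;

    -- Record this run so the next one only picks up newer changes.
    UPDATE denorm_refresh_log SET last_run = now();

Run nightly, this touches only the rows that changed since the previous run instead of re-joining the full 120M-row tables.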

Related

Are schemas in PostgreSQL physical objects?

I use schemas in PostgreSQL to organize my huge accounting database. At the end of every year I run a reconciliation process and create a new schema for the next year.
Are the files of the new schema physically separated from those of the old schema, or are all schemas stored together on disk?
This is vital for me because at the end of every year I have huge tables with millions of records, which means I'll soon be running heavy queries (I didn't plan for this when I decided to choose PostgreSQL).
Schemas are namespaces, so they are a "logical" thing, not a physical thing.
As documented in the manual, each table is represented as one (or more) files inside the directory corresponding to the database the table was created in. The namespaces (schemas) are not reflected in the physical database layout.
In general you shouldn't care about the storage layout of the database to begin with, and your SQL queries will not know where the actual data is stored.
"Millions" of rows is not considered "huge" these days. If you do run into performance problems, you tune the query, e.g. with indexes or by rewriting it into a more efficient form. In rare cases partitioning can help with really huge tables, but there we are talking about hundreds of millions or even billions of rows; with small to medium-sized tables, partitioning usually doesn't help performance.

Aggregate as part of ETL or within the database?

Is there a general preference or best practice when it comes to whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example), versus doing a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load as possible on the source database server, it is best to pull the tables from the source database into a staging area and do the joins and aggregations there. But take care that the ETL tool does not perform a nested loop join against the source database tables, that is, pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your goal is to perform joins and aggregations as fast and efficiently as possible, by all means push them down to the source database. This may put more load on the source database, though. I say "may" because if all you need is an aggregation on a single table, it can be cheaper to perform it in the source database than to pull the whole table.
If you aggregate by day, what happens when your boss wants it aggregated by hour or by week?
The general rule is: Your fact table granularity should be as granular as possible. Then you can drill-down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
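For example, a daily rollup built right in the database can be as simple as this (the interactions table and its columns are only illustrative):

    CREATE TABLE interactions_daily AS
    SELECT date_trunc('day', event_time) AS day,
           page,
           count(*)                AS interactions,
           count(DISTINCT user_id) AS unique_users
    FROM interactions
    GROUP BY 1, 2;

Note that distinct counts don't roll up further, so anything like weekly unique users still has to come from the detailed data.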
Tools like Pentaho Aggregation Designer can automate this for you.

PostgreSQL: archiving old data

I need some expert advice on Postgres.
I have a few tables in my database that can grow huge, maybe a hundred million records, and I have to put some sort of data archiving in place. Say I have a subscriber table and a subscriber_logs table. The subscriber_logs table will grow huge over time, affecting performance. I wanted to create a separate table called archive_subscriber_logs and a scheduled task which reads from subscriber_logs, inserts the data into archive_subscriber_logs, and then deletes the copied data from subscriber_logs.
But my concern is whether I should create archive_subscriber_logs in the same database or in a different database. The problem with storing it in a different db is the foreign key constraints that already exist on the main tables.
Can anyone suggest whether the same db or a different db is preferable? Or any other solutions?
Consider table partitioning, which is implemented in Postgres using table inheritance. This will improve performance on very large tables. Of course you would do measurements first to make sure it is worth implementing. The details are in the excellent Postgres documentation.
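A minimal sketch of the inheritance approach, assuming subscriber_logs has a log_time timestamp column:

    -- Child table holding one year of data; the CHECK constraint lets the
    -- planner skip it when a query's WHERE clause excludes that range
    -- (constraint exclusion).
    CREATE TABLE subscriber_logs_2013 (
        CHECK (log_time >= DATE '2013-01-01' AND log_time < DATE '2014-01-01')
    ) INHERITS (subscriber_logs);

    CREATE INDEX ON subscriber_logs_2013 (log_time);

    -- Queries against the parent automatically include all children.
    SELECT count(*) FROM subscriber_logs WHERE log_time >= DATE '2013-06-01';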
Using a separate database is not recommended because you cannot have foreign key constraints across databases.

No merge tables in MariaDB - options for very large tables?

I've been happily using MySQL for years, and have followed the MariaDB fork with interest.
The server for one of my projects is reaching end of life and needs to be rehosted, likely to CentOS 7, which includes MariaDB.
One of my concerns is the lack of the merge table feature, which I use extensively. We have a very large (at least by my standards) data set, on the order of 100M records / 20 GB (with most data compressed) and growing. I've split this into read-only compressed MyISAM "archive" tables organized by data epoch, plus a regular MyISAM table for current data and inserts. I then span all of these with a merge table.
The software working against this database is written so that it figures out which table to retrieve data from for the timespan in question; if the timespan spans multiple tables, it queries the overlying merge table.
This does a few things for me:
Queries are much faster against the smaller tables; unfortunately, the index needed for the most typical query and for preventing duplicate records is relatively complicated
Frees the user from having to query multiple tables and assemble the results when a query spans multiple tables
Allowing > 90% of the data to reside in the compressed tables saves a lot of disk space
I can back up the archive tables once - this saves tremendous time, bandwidth and storage on the nightly backups
Any suggestions for how to handle this without merge tables? Does any other table type offer the compressed, read-only option that MyISAM does?
I'm thinking we may have to go with separate tables and live with the additional complication of changing all the code in the multiple projects which use this database.
MariaDB 10 introduced the CONNECT storage engine, which does a lot of different things. One of the table types it provides is TBL, which is basically an expansion of the MERGE table type. The TBL CONNECT type is currently read-only, but you should be able to just insert into the base tables as needed. This is probably your best option, but I'm not very familiar with the CONNECT engine in general, and you will need to do a bit of experimentation to decide whether it will work.
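Untested, but the definition would look roughly like this; check the CONNECT documentation for the exact option names, and note that the column definitions are taken from the first table in the list if you omit them:

    -- A CONNECT TBL table presenting the per-epoch base tables as one.
    CREATE TABLE readings_all
    ENGINE=CONNECT TABLE_TYPE=TBL
    TABLE_LIST='readings_2013,readings_2014,readings_current';

Inserts would still go directly to the current base table, as with your merge setup.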

Most efficient way of bulk loading unnormalized dataset into PostgreSQL?

I have loaded a huge CSV dataset -- Eclipse's Filtered Usage Data -- using PostgreSQL's COPY, and it's taking a huge amount of space because it's not normalized: three of the TEXT columns would be much more efficiently refactored into separate tables, referenced from the main table via foreign key columns.
My question is: is it faster to refactor the database after loading all the data, or to create the intended tables with all the constraints, and then load the data? The former involves repeatedly scanning a huge table (close to 10^9 rows), while the latter would involve doing multiple queries per CSV row (e.g. has this action type been seen before? If not, add it to the actions table, get its ID, create a row in the main table with the correct action ID, etc.).
Right now each refactoring step is taking roughly a day or so, and the initial loading also takes about the same time.
In my experience, you want to get all the data you care about into a staging table in the database and go from there. After that, do as much set-based logic as you can, most likely via stored procedures. When you load into the staging table, don't have any indexes on it; create the indexes after the data is loaded.
Check this link for some tips: http://www.postgresql.org/docs/9.0/interactive/populate.html
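As a sketch of the set-based approach once everything is in a staging table (the table and column names here are invented):

    -- Build the lookup table once from the distinct values...
    CREATE TABLE actions (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    INSERT INTO actions (name)
    SELECT DISTINCT action FROM staging;

    -- ...then populate the main table with a single join back to the lookup,
    -- instead of one lookup query per CSV row.
    CREATE TABLE events AS
    SELECT a.id AS action_id,
           s.event_time,
           s.bundle_id
    FROM staging s
    JOIN actions a ON a.name = s.action;

    -- Indexes only after the bulk load, per the tips linked above.
    CREATE INDEX ON events (action_id);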