PostgreSQL: Automating deletion of old partitions - postgresql

I am using Imperial Wicket's fine automated partitioning function for PostgreSQL http://imperialwicket.com/postgresql-automating-monthly-table-partitions/) to create weekly partitions to store a large amount of data. The database grows by ~100GB per week, so some frequent attention is needed to ensure the server does not run out of disk space.
I have searched SO but cannot find any solutions to automatically drop tables based on disk space thresholds or time periods. Are there any native PostgreSQL solutions or developed functions to automate dropping old partitions, or is this a roll-my-own solution kind of problem?

This seems to be a roll my own solution kind of problem. Since I am using Imperial Wicket's automated table partitioning function, I used it as the base to develop a new function that can drop partioning tables older than the specified date.
The code is available at https://github.com/stevbev/postgresql-drop-time-series-table-partitions
Usage is similar to Imperial Wicket's syntax and supports day/week/month/year partitioning schemes. For example, to delete table partitions with name 'my_table_name' created with weekly partitioning that are older than the week 180 days ago, execute the SQL statement:
SELECT public.drop_partitions(current_date-180, 'public', 'my_table_name', 5, 'week');

Related

Sync Elasticsearch Postgresql on a Springboot application

I have Postgresql as my primary database and I would like to take advantage of the Elasticsearch as a search engine for my SpringBoot application.
Problem: The queries are quite complex and with millions of rows in each table, most of the search queries are timing out.
Partial solution: I utilized the materialized views concept in the Postgresql and have a job running that refreshes them every X minutes. But on systems with huge amounts of data and with other database transactions (especially writes) in progress, the views tend to take long times to refresh (about 10 minutes to refresh 5 views). I realized that the current views are at it's capacity and I cannot add more.
That's when I started exploring other options just for the search and landed on Elasticsearch and it works great with the amount of data I have. As a POC, I used the Logstash's Jdbc input plugin but then it doesn't support the DELETE operation (bummer).
From here the soft delete is the option which I cannot take because:
A) Almost all the tables in the postgresql DB are updated every few minutes and some of them have constraints on the "name" key which in this case will stay until a clean-up job runs.
B) Many tables in my Postgresql Db are referenced with CASCADE DELETE and it's not possible for me to update 220 table's Schema and JPA queries to check for the soft delete boolean.
The same question mentioned in the link above also provides PgSync that syncs the postgresql with elasticsearch periodically. However, I cannot go with that either since it has LGPL license which is forbidden in our organization.
I'm starting to wonder if anyone else encountered this strange limitation of elasticsearch and RDMS.
I'm open to other options rather than elasticsearch to solve my need. I just don't know what's the right stack to use. Any help here is much appreciated!

Should I migrate to Redshift?

I'm currently struggling querying be chunk of data that is stored in partitioned table (partition per date)
the data looks like that:
date, product_id, orders
2019-11-01, 1, 100
2019-11-01, 2, 200
2019-11-02, 1, 300
I have hundreds of date-partitions and millions of rows per date.
Now, if I want to query, for instance, total orders for product id 1 and 2 for period of 2 weeks, and group by date (to show in a graph per date), the db has to go to 2 weeks of partitions and fetch the data for them.
That process might be taking a long time when the number of products is big or the time frame required is long.
I have read that AWS Redshift is suitable for this kind of tasks. I'm considering shifting my partitioned tables (aggregated analytics per date) to that technology but I wonder if that's really what I should do to make those queries to run much faster.
Thanks!
As per your use case Redshift is really a good choice for you.
To gain the best performance out of Redshift, it is very important to set proper distribution and sort key. In your case "date" column should be distribution key and "productid" should be sort key. Another important note, Do not encode "date" and "productid" column.
You should get better performance.
If you are struggling with traditional SQL databases, then Amazon Redshift is certainly an option. It can handle tables with billions of rows.
This would involve loading the data from Amazon S3 into Redshift. This will allow Redshift to optimize the way that the data is stored, making it much faster for querying.
Alternatively, you could consider using Amazon Athena, which can query the data directly from Amazon S3. It understands data that is partitioned into separate directories (eg based on date).
Which version of PostgreSQL are you using?
Are you using native partioning or inheritance partitioning trigger-based?
Latest version of postgresql improved partitioning management.
Considering your case Amazon Redshift can be a good choice, so does Amazon Athena. But it is also important to consider your application framework. Are you opt moving to Amazon only for Database or you have other Amazon services in the list too?
Also before making the decision please check the cost of Redshift.

Postgres table partitioning based on table name

I have a table that stores information about weather for specific events and for specific timestamps. I do insert, update and select (more often than delete) on this table. All of my queries query on timestamp and event_id. Since this table is blowing up, I was considering doing table partitioning in postgres.
I could also think of having multiple tables and naming them "table_< event_id >_< timestamp >" to store specific timestamp information, instead of using postgres declarative/inheritance partitioning. But, I noticed that no one on the internet has done or written about any approach like this. Is there something I am missing?
I see that in postgres partitioning, the data is both kept in master as well as child tables. Why keep in both places? It seems less efficient to do inserts and updates to me.
Is there a generic limit on the number of tables when postgres will start to choke?
Thank you!
re 1) Don't do it. Why re-invent the wheel if the Postgres devs have already done it for you by providing declarative partitioning
re 2) You are mistaken. The data is only kept in the partition to which it belongs to. It just looks as if it is stored in the "master".
re 3) there is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but especially query planning will be slower. And sometime the query execution might also suffer because runtime partition pruning is not as efficient any more.
Given your description you probably want to do hash partitioning on the event ID and then create range sub-partitions on the timestamp value (so each partition for an event is again partitioned on the range of the timestamps)

How to see changes in a postgresql database

My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.

No merge tables in MariaDB - options for very large tables?

I've been happily using MySQl for years, and have followed the MariahDB fork with interest.
The server for one of my projects is reaching end of life and needs to be rehosted - likely to CentOS 7, which includes MariahDB
One of my concerns is the lack of the merge table feature, which I use extensively. We have a very large (at least by my standards) data set with on the order of 100M records/20 GB (with most data compressed) and growing. I've split this into read only compressed myisam "archive" tables organized by data epoch, and a regular myisam table for current data and inserts. I then span all of these with a merge table.
The software working against this database is then written such that it figures out which table to retrieve data from for the timespan in question, and if the timespan spans multiple tables, it queries the overlying merge table.
This does a few things for me:
Queries are much faster against the smaller tables - unfortunately, the index needed for the most typical query, and preventing duplicate records is relatively complicated
Frees the user from having to query multiple tables and assemble the results when a query spans multiple tables
Allowing > 90% of the data to reside in the compressed tables saves alot of disk space
I can back up the archive tables once - this saves tremendous time, bandwidth and storage on the nightly backups
An suggestions for how to handle this without merge tables? Does any other table type offer the compressed, read-only option that myisam does?
I'm thinking we may have to go with separate tables, and live with the additional complication and changing all the code in the multiple projects which use this database.
MariaDB 10 introduced the CONNECT storage engine that does a lot of different things. One of the table types it provides is TBL, which is basically an expansion of the MERGE table type. The TBL CONNECT type is currently read only, but you should be able to just insert into the base tables as needed. This is probably your best option but I'm not very familiar with the CONNECT engine in general and you will need to do a bit of experimentation to decide if it will work.