Best way to automate addition of future partitions to a postgres table - postgresql

I am not sure what the best practices are for automating the addition of future partitions to a table. The situation is this: in, say, December 2021, we want to create partitions for the next year (2022) for some tables in Postgres. We can obviously do this manually, but we want to automate it. So far, I have come up with (and found by researching, talking to some people, etc.) the following ways:
Using PL/pgSQL (In my opinion, there could be issues related to version control and deployment here)
Writing a script (in Python, say) and executing it as a cron job annually
Adding the partitioning logic to the application code that inserts records into the database, i.e., whenever a record is inserted into a table, check whether the partition corresponding to that record exists and, if not, create it (fetching the table's metadata for every incoming record can be expensive, but in my opinion we could try to optimize that)
Is there any other way that I am missing? If not, which of the above would be the best way to automate the addition of future partitions to a table in Postgres?
Also, please point out if this is not the right platform for such questions (it would be great if you could direct me to the right one).
Thank you for reading this.

Option 1 is not sufficient, because you need a way to run the code automatically (that's the hard part). It doesn't matter much whether you use PL/pgSQL or a client-side language for the procedural parts of the operation.
Option 3 is not easy to achieve, and certainly not in an efficient fashion.
I would say that the best way is to schedule a job for partition creation, either with the operating system scheduler (cron) or with a PostgreSQL extension like pg_timetable or pg_cron.
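For illustration, a minimal sketch of that scheduled approach, assuming a hypothetical measurements table range-partitioned on a measured_at column and the pg_cron extension installed (the names and the schedule are placeholders):

    -- Creates next year's partition if it does not exist yet
    -- (hypothetical table and column names).
    CREATE OR REPLACE FUNCTION create_next_year_partition() RETURNS void
    LANGUAGE plpgsql AS
    $$
    DECLARE
        next_year int := extract(year FROM now())::int + 1;
    BEGIN
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS measurements_%s PARTITION OF measurements
                 FOR VALUES FROM (%L) TO (%L)',
            next_year,
            make_date(next_year, 1, 1),
            make_date(next_year + 1, 1, 1));
    END;
    $$;

    -- Schedule it with pg_cron: every December 1st at 03:00.
    SELECT cron.schedule('0 3 1 12 *', 'SELECT create_next_year_partition()');

The same function could just as well be called from an operating system cron job via psql; the scheduler is the interchangeable part.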

Related

Postgres: monitor what part of the execution plan is running for long queries

The information in pg_stat_activity is sort of scarce, and does not give progress information for long queries.
This information is partially available in v$session_longops in Oracle, which shows which object is being processed (target), the number of items it needs to go through (totalwork), and the number of items done so far (sofar). One can then use that to infer what part of the execution plan the engine is in. This information is available in Spark and Flink as well.
I was wondering if there is a way to get access to that in Postgres, either in system tables or by observing the processes, or where one might look in the internals if one wanted to implement a patch.
Cheers!
AFAIK there is no existing feature to monitor long-running queries in detail (there is such a feature only for 3 DDL statements).
A patch was proposed 3 years ago, but it looks like it was not integrated.
See the discussion on the hackers mailing list:
https://www.postgresql.org/message-id/CADdR5nxQUSh5kCm9MKmNga8+c1JLxLHDzLhAyXpfo9-Wmc6s5g#mail.gmail.com
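For what it is worth, the closest readily available information is still pg_stat_activity; a query like the sketch below (the 5-minute threshold is arbitrary) lists long-running statements with their elapsed time and wait state, but not their position in the execution plan:

    -- Active statements running for more than 5 minutes
    -- (elapsed time and wait info only, no plan-level progress).
    SELECT pid,
           now() - query_start AS runtime,
           state,
           wait_event_type,
           wait_event,
           query
    FROM   pg_stat_activity
    WHERE  state = 'active'
      AND  now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC;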

Table partitioning in PostgreSQL 11 with automatic partition creation?

I need to maintain an audit table, and since the number of changes is going to be huge, I need an efficient way of dealing with the problem. The solution I have thought of is to record only the changed column in the audit table and partition it on the createdon column, quarterly or half-yearly.
I wanted to know whether there is anything like Oracle's 'interval partitioning'. If not, how can I achieve it?
I want a new partition to be created automatically every 6 months as rows are inserted.
I am using postgres 11 as my db.
I do not think there is any magic configuration that makes your life easier on this point:
https://www.postgresql.org/docs/11/ddl-partitioning.html
If you want the partitions auto-created, I think you have two major possibilities:
Check each row on insert into the 'mother' table to see whether it fits into an already existing partition (via a trigger; with a huge number of inserts this could be a problem)
Check once in a while that you already have the partitions that are going to be needed in the future. For this one, pg_partman is going to be your best ally.
As an example, a few years ago I built a partitioning mechanism when there was only the declarative one and no possibility of adding pg_partman. With the trigger mechanism, at 15 million rows per month, it still works like a charm.
If you never want to hurt your performance (and especially if you do not know how large your system is going to grow), I recommend the same answer as in a_horse_with_no_name's comment: use pg_partman.
If you cannot use it, as was the case for me, adopt one of the two approaches above (a trigger, or creating partitions in advance via a cron task, for example).
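To make the 'advance creation' idea concrete, here is a sketch using PostgreSQL 11's declarative partitioning (the audit table and column names are made up; pg_partman's create_parent()/run_maintenance() would take care of creating the upcoming partitions for you, otherwise a scheduled job has to issue the CREATE TABLE statements):

    -- Audit table range-partitioned on createdon.
    CREATE TABLE audit_log (
        tablename text        NOT NULL,
        changed   jsonb,
        createdon timestamptz NOT NULL
    ) PARTITION BY RANGE (createdon);

    -- Half-yearly partitions created in advance.
    CREATE TABLE audit_log_2019_h1 PARTITION OF audit_log
        FOR VALUES FROM ('2019-01-01') TO ('2019-07-01');
    CREATE TABLE audit_log_2019_h2 PARTITION OF audit_log
        FOR VALUES FROM ('2019-07-01') TO ('2020-01-01');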

Is belated partitioning possible?

I've got a PostgreSQL 10+ (probably 11) database to work with. As the tables are growing fast but implementation time is scarce, my approach is to first set up the whole schema so that the app can run on it. After that, I'd like to introduce partitioning (by date) on some tables.
Is belated partitioning possible?
In practical terms, I mean that I would create partitions by date and that already-inserted data would then automatically be re-assigned to those partitions (if applicable). Also, if this is possible, is it a good approach or are there better alternatives?
I expect that reorganizing a big table takes its time, but that's OK.
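There is no automatic re-assignment: a regular table cannot be turned into a partitioned one in place. A common pattern (sketched below with made-up names) is to create a new partitioned table, copy the data over, and swap the names; if all existing data fits into one bound, the old table can instead be attached as a single partition with ALTER TABLE ... ATTACH PARTITION.

    -- New, partitioned version of the (hypothetical) events table.
    CREATE TABLE events_partitioned (
        id         bigint      NOT NULL,
        payload    text,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- Partitions covering the existing data plus the upcoming period.
    CREATE TABLE events_2020 PARTITION OF events_partitioned
        FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
    CREATE TABLE events_2021 PARTITION OF events_partitioned
        FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

    -- Move the existing rows; this rewrites the data and takes time on a big table.
    INSERT INTO events_partitioned SELECT * FROM events;

    -- Swap the tables in one transaction.
    BEGIN;
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE events_partitioned RENAME TO events;
    COMMIT;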

wait/notify mechanism for multiple readers in Oracle SQL?

We have multiple processes that read one database table, get an available record and work with it. It works fine.
When there is no record in this table, each process waits 5 seconds and reads it again.
So a record could sit idle in the table for up to 5 seconds, which is not good.
What would be the recommended solution to eliminate such waiting and proceed immediately when a record is created? One solution could be a trigger that does something when a record is created, but that solution requires knowledge of the worker processes in order to deliver the record to one of the idle ones.
It looks like the ideal solution would be for each process to read via SQL from something, and when a record is created, one of the waiting processes gets that record while the others continue to wait.
Does Oracle 10 provide such a mechanism, or something similar?
Look at Database Change Notification in 10g, which has since been renamed Continuous Query Notification.
I normally like to include an example, but it's hard to find a 10g instance these days, and even a short example requires a lot of code. The process looks complicated; you might be better off using triggers as you suggested and dealing with the tight coupling.

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark let's say we are hoping for every 15 minutes) and feed it into a data warehouse.
How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP side is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERS allow selected tables to be targeted rather than the whole system, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy (a sketch of such a trigger follows below this list). SLONY uses a similar approach and its overhead is acceptable
CSV easy and fast to transform
Easy to pump CSV into the DW
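To make the TRIGGER idea concrete, here is a minimal sketch (table and column names invented) that records each DML event in a changelog table rather than writing the CSV directly; the CSV can then be produced periodically with COPY:

    -- Changelog table capturing which rows changed and how.
    CREATE TABLE orders_changelog (
        change_id  bigserial PRIMARY KEY,
        table_name text        NOT NULL,
        operation  text        NOT NULL,
        row_pk     bigint      NOT NULL,
        changed_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION log_orders_change() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO orders_changelog (table_name, operation, row_pk)
            VALUES (TG_TABLE_NAME, TG_OP, OLD.id);
            RETURN OLD;
        ELSE
            INSERT INTO orders_changelog (table_name, operation, row_pk)
            VALUES (TG_TABLE_NAME, TG_OP, NEW.id);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_change_log
        AFTER INSERT OR UPDATE OR DELETE ON orders
        FOR EACH ROW EXECUTE PROCEDURE log_orders_change();

    -- Periodic export to CSV (12345 is a placeholder for the last
    -- change_id already exported, tracked by the export job).
    COPY (SELECT * FROM orders_changelog WHERE change_id > 12345)
        TO '/var/tmp/orders_changes.csv' WITH CSV;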
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). The problem with this is that it looked very verbose relative to what I needed and was a little trickier to parse and transform. However, it could be faster, as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier, as it is system-wide, but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is that the OLTP schema would need to be tweaked to support this, and that has many negative side effects
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave, so the conceptual framework is there, but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 for the first extraction). This incremental, decoupled approach avoids introducing trigger latency into the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
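A sketch of that incremental pull, assuming a hypothetical orders table with a sequential id column and a small bookkeeping table holding the last extracted key:

    -- One-row-per-table watermark holding the last key already extracted.
    CREATE TABLE etl_watermark (
        table_name   text   PRIMARY KEY,
        last_max_key bigint NOT NULL DEFAULT 0
    );
    INSERT INTO etl_watermark VALUES ('orders', 0);

    -- Each extraction run: dump everything newer than the watermark ...
    COPY (
        SELECT o.*
        FROM   orders o, etl_watermark w
        WHERE  w.table_name = 'orders'
          AND  o.id > w.last_max_key
        ORDER BY o.id
    ) TO '/var/tmp/orders_delta.csv' WITH CSV;

    -- ... then advance the watermark (a real job would use the highest id
    -- actually written to the file, to avoid skipping concurrent inserts).
    UPDATE etl_watermark
    SET    last_max_key = (SELECT coalesce(max(id), 0) FROM orders)
    WHERE  table_name = 'orders';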
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres), I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine). Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (mostly in terms of thrashing the filesystem cache) from shipping WALs to a different machine as from running the incremental SELECT.
If you can maintain a 'checksum table' that contains only the ids and a 'checksum', you can quickly select not only the new records but also the changed and deleted ones.
The checksum could be any checksum function you like, crc32 for example.
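A sketch of that idea, using md5() in place of crc32 since md5 is built in (the orders columns are made up; concatenate whatever columns matter for change detection):

    -- Stored per-row fingerprints, keyed by id.
    CREATE TABLE orders_checksum (
        id       bigint PRIMARY KEY,
        checksum text   NOT NULL
    );

    -- New or changed rows: current fingerprint missing or different.
    -- (Deleted rows are ids present here but missing from orders.)
    SELECT o.id,
           md5(o.customer_id::text || '|' || o.amount::text || '|' || o.status) AS current_sum
    FROM   orders o
    LEFT   JOIN orders_checksum c ON c.id = o.id
    WHERE  c.id IS NULL
       OR  c.checksum IS DISTINCT FROM
           md5(o.customer_id::text || '|' || o.amount::text || '|' || o.status);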
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned, then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" systems (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
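For reference, a sketch of that temp-table-plus-ON-CONFLICT load (names are invented; it needs PostgreSQL 9.5 or later and a unique constraint on the target key):

    -- Stage the freshly extracted rows.
    CREATE TEMP TABLE stage_orders (LIKE dw_orders INCLUDING DEFAULTS);
    COPY stage_orders FROM '/var/tmp/orders_delta.csv' WITH CSV;

    -- Single-statement upsert into the target table.
    INSERT INTO dw_orders (id, customer_id, amount, status, row_update_timestamp)
    SELECT id, customer_id, amount, status, row_update_timestamp
    FROM   stage_orders
    ON CONFLICT (id) DO UPDATE
    SET    customer_id          = EXCLUDED.customer_id,
           amount               = EXCLUDED.amount,
           status               = EXCLUDED.status,
           row_update_timestamp = EXCLUDED.row_update_timestamp;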