How can I automatically maintain a dump of modified rows in PostgreSQL?

So, I have a PostgreSQL DB. For some chosen tables in that DB I want to maintain a plain dump of rows as they are modified. Note this dump is not a recovery or backup dump; it is just a file holding the incremental rows. That is, whenever a row is inserted or updated, I want it appended to this file, or to a file in a folder. The idea is to load that folder into something like Hive periodically, so that I can run queries to check previous states of certain rows and columns. These are very high-transaction tables, and the dump does not need to be real time; it can be produced in batches, say every hour. I want to avoid a trigger firing hundreds of times every minute. I am looking for something off the shelf, already available in PostgreSQL. I did some research, but everything I found relates to PostgreSQL backup, which is not this use case.
I have read links like https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/ and "Implementing history of PostgreSQL table", but these are based on INSERT/UPDATE triggers and create the history table inside PostgreSQL itself. I want to avoid both: I cannot keep the history in PostgreSQL because it will soon be huge, and I do not want a trigger constantly writing to files.
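(For what it's worth, the closest off-the-shelf fit inside PostgreSQL itself is probably logical decoding, available since 9.4: a replication slot captures row changes from the WAL with no triggers involved, and pg_recvlogical drains the slot to a file whenever you run it. A rough sketch; the slot name, database name, and output path below are made up:

(in postgresql.conf, then restart) wal_level = logical
(from psql) SELECT pg_create_logical_replication_slot('rowdump_slot', 'test_decoding');
(from bash) pg_recvlogical -d mydb --slot=rowdump_slot --start -f /data/rowdump/changes.txt

The slot retains changes while pg_recvlogical is not running, so an hourly batch run works; note that the test_decoding output is a textual change log that would need parsing before loading into Hive.)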

Related

PostgreSQL: even read access changes data files on disk, leading to large incremental backups using pgbackrest

We are using pgbackrest to backup our database to Amazon S3. We do full backups once a week and an incremental backup every other day.
The size of our database is around 1 TB; a full backup is around 600 GB, and even an incremental backup is around 400 GB!
We found out that even read access (pure select statements) on the database has the effect that the underlying data files (in /usr/local/pgsql/data/base/xxxxxx) change. This results in large incremental backups and also in very large storage (costs) on Amazon S3.
Usually the files with low index names (e.g. 391089.1) change on read access.
On an update, we see changes in one or more files - the index could correlate to the age of the row in the table.
Some more facts:
Postgres version 13.1
Database is running in docker container (docker version 20.10.0)
OS is CentOS 7
We see the phenomenon on multiple servers.
Can someone explain why PostgreSQL changes data files on pure read access?
We tested on an otherwise idle database, with nothing else accessing it.
This is normal. Some cases I can think of right away are:
a SELECT or other SQL statement setting a hint bit
This is a shortcut for subsequent statements that access the data, so they don't have to consult the commit log any more.
a SELECT ... FOR UPDATE writing a row lock
autovacuum removing dead row versions
These are leftovers from DELETE or UPDATE.
autovacuum freezing old visible row versions
This is necessary to prevent data corruption if the transaction ID counter wraps around.
The only fairly reliable way to prevent PostgreSQL from modifying a table in the future is:
never perform an INSERT, UPDATE or DELETE on it
run VACUUM (FREEZE) on the table and make sure that there are no concurrent transactions
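A minimal sketch of that second step, with the table name assumed, plus a check that the freeze took effect:

(from psql) VACUUM (FREEZE) my_table;
            SELECT relname, age(relfrozenxid) FROM pg_class WHERE relname = 'my_table';

Once every row is frozen and its hint bits are set, a pure SELECT has nothing left to write, so the underlying files should stop changing between backups.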

Slow insert and update commands during mysql to redshift replication

I am trying to build a replication pipeline from MySQL to Redshift by parsing the MySQL binlog. For the initial replication, I take a dump of the MySQL table, convert it into a CSV file, upload that to S3, and then use the Redshift COPY command; the performance of this is efficient.
After the initial replication, for the continuous sync, when I read the binlog the inserts and updates have to be run sequentially, which is very slow.
Is there anything that can be done for increasing the performance?
One possible solution I can think of is to wrap the statements in a transaction and send the transaction at once, to avoid multiple network calls. But that would not address the problem that single UPDATE and INSERT statements in Redshift run very slowly; a single UPDATE statement is taking 6 s. Knowing the limitations of Redshift (it is a columnar database, so single-row insertion is slow), what can be done to work around them?
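(The usual way around slow single-row DML in Redshift is to batch the binlog changes to S3 and apply each batch with COPY plus a bulk merge. A rough sketch; the bucket path, IAM role, and table/key names are placeholders, and upstream DELETEs would need extra handling, e.g. a tombstone column:

(from psql, against Redshift)
BEGIN;
CREATE TEMP TABLE stage (LIKE my_table);
COPY stage FROM 's3://my-bucket/binlog-batches/batch-0001.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' CSV;
DELETE FROM my_table USING stage WHERE my_table.id = stage.id;  -- drop old versions of changed rows
INSERT INTO my_table SELECT * FROM stage;                       -- insert the latest versions
COMMIT;

One COPY plus one DELETE/INSERT per batch replaces thousands of single-row statements, which is the access pattern Redshift's columnar engine handles efficiently.)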
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution which just replicates our MySQL continuously; I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication, DMS shows many errors multiple times a day and fails completely after a day or two, and it's very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates new tables, and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, switch the old one with the new one, and then delete the old table.
Here is what you need to do to get DMS to work
1) create and run a dms task with "migrate and ongoing replication" and "Drop tables on target"
2) this will probably fail, do not worry. "stop" the dms task.
3) on redshift make the following changes to the table (see the type-change sketch after these steps):
Change all dates and timestamps to varchar, because the options used by DMS for the Redshift COPY cannot cope with the '0000-00-00 00:00:00' dates you get in MySQL.
Change all bool columns to varchar, due to a bug in DMS.
4) on dms - modify the task to "Truncate" in "Target table preparation mode"
5) restart the dms task - full reload
now - the initial copy and ongoing binlog replication should work.
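Since Redshift cannot change a column's type in place (beyond widening a varchar), step 3 typically means the add/copy/drop/rename pattern; a sketch for one hypothetical timestamp column:

(from psql, against Redshift)
ALTER TABLE my_table ADD COLUMN created_at_v VARCHAR(32);
UPDATE my_table SET created_at_v = CAST(created_at AS VARCHAR);  -- copy values as text
ALTER TABLE my_table DROP COLUMN created_at;
ALTER TABLE my_table RENAME COLUMN created_at_v TO created_at;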
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly:
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (the documentation is poor on this point).

What is the safe limit for the number of rows in postgresql?

I have a cron job that fires a query every three minutes, and with each firing, data is inserted into my DB, so the database keeps growing after each interval. However, I am only interested in the latest row. Is this safe? Will PostgreSQL truncate the oldest entries automatically?
If you post the gist of your cron job you could get a better answer.
If it's a direct INSERT, execute a TRUNCATE beforehand to delete the old, unwanted data (see the sketch below). DELETE is also possible, but you will end up with a lot of dead tuples and will need to VACUUM the table on a regular basis.
UPDATE is a good option, but it depends on how much of the data is static and how much is not; e.g. if you repeat values in any columns, go for UPDATE. This is also subject to dead tuples and vacuuming.
If you are loading from an external source, e.g. CSV, JSON, or XML, there are methods to overwrite existing data automatically; pgloader may be an option here.
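A minimal sketch of the truncate-then-insert option, with a made-up table and columns; wrapping both statements in one transaction means readers never observe the empty in-between state, since TRUNCATE is transactional in PostgreSQL:

(from psql)
BEGIN;
TRUNCATE readings;
INSERT INTO readings (value, recorded_at) VALUES (42.0, now());
COMMIT;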

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed and deleting any records that have been added. So far this strategy seems the best fit for us because, per test, not a lot of data changes, and the database is not very big.
Now I'm looking for a way to do essentially the same thing with PostgreSQL. Before committing to the same approach, I was wondering whether anyone else has done something similar, and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via templates feature of createdb command (or CREATE DATABASE statement) in PostgreSQL.
So for e.g.
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always exact same DB by design (no custom logic)
Replication is a file-transfer (not DB restore). So far less time taken (i.e. doesn't run SQL again, doesn't recreate indexes / restore tables etc.)
Cons:
Takes 2x the disk space (although the template could live on low-performance storage, e.g. NFS)
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy.
There are three types of changes to undo:
For INSERTed records: find max(id) in the clean table and delete any record in the working table with a higher id.
For UPDATEd or DELETEd records: find all records in the clean table EXCEPT the records found in the working table, then UPSERT those records into the working table (see the sketch below).
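A sketch of those two steps in SQL, assuming a table items(id primary key, name, qty) kept in two schemas, clean and working (ON CONFLICT needs PostgreSQL 9.5+):

(from psql)
-- undo INSERTs made by the test
DELETE FROM working.items
WHERE id > (SELECT max(id) FROM clean.items);

-- undo UPDATEs and DELETEs: upsert every clean row that differs
INSERT INTO working.items (id, name, qty)
SELECT id, name, qty FROM clean.items
EXCEPT
SELECT id, name, qty FROM working.items
ON CONFLICT (id) DO UPDATE
    SET name = EXCLUDED.name, qty = EXCLUDED.qty;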

How do I ALTER a set of partitioned tables in Postgres?

I created a set of partitioned tables in Postgres and started inserting a lot of rows via the master table. When the load process blew up on me, I realized I should have declared the id column BIGSERIAL (BIGINT with a sequence behind the scenes), but had inadvertently set it as SERIAL (INTEGER). Now that I have a couple of billion rows loaded, I am trying to ALTER the column to BIGINT. The process seems to be working, but it is taking a long time, so in reality I don't know whether it is working or hung. I'd rather not restart the entire load process.
Any suggestions?
When you update a row in PostgreSQL, it writes out a new copy of the row and does some cleanup later to remove the original. This means that trying to fix the problem with updates can take longer than just loading all the data in from scratch again: it is more disk I/O than loading a new copy, plus some extra processing time. The only situation where you'd want an update instead of a reload is when the original load was very inefficient, for example if a slow client program was inserting the data and was the bottleneck.
To figure out whether the process is still working, see if it's using CPU when you run top (UNIX-ish systems) or the Task Manager (Windows). On Linux, "top -c" will even show you what the PostgreSQL server processes are doing. You probably just expected it to take less time than the original load, which it won't; it's most likely still running rather than hung.
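From inside the database you can check the same thing; a small sketch (the filter on the query text is only illustrative, and the wait_event columns need PostgreSQL 9.6+):

(from psql)
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE query ILIKE 'ALTER TABLE%';

A session that shows state "active" with a growing running_for and no persistent wait_event is most likely still working, not hung.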
Restart it (clarifying edit: restart the entire load process).
Altering a column value requires a new row version, and all indexes pointing to the old row version must be updated to point to the new one.
Additionally, see how much of the advice on populating databases you can follow.
Correction from @archnid:
Altering the type of the column triggers a table rewrite, so the row versioning isn't a big problem, but it will still temporarily take a lot of disk space. You can usually monitor progress by looking at which files in the database directory are being appended to (see below).
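To know which files to watch, pg_relation_filepath maps a table to its file path relative to the data directory (segments past 1 GB get .1, .2, ... suffixes; the table name here is assumed):

(from psql) SELECT pg_relation_filepath('my_big_table');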