Best way to backup and restore data in PostgreSQL for testing - postgresql

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed and deleting any records that have been added. So far this strategy has been the best approach for us because each test changes very little data and the database is not very big.
Now I'm looking for a way to do essentially the same thing with PostgreSQL, and I'm inclined to port the exact same approach. But before doing so, I was wondering whether anyone else has done something similar, and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I would have to re-establish the DB connection from the app after every test, which is not possible at the moment.

If you're okay with some extra storage, and if you (like me) would rather not re-invent the wheel by writing your own diff-checking code, you should try creating a new DB (per run) via the template feature of the createdb command (or the CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always exactly the same DB by design (no custom logic)
Cloning is a file-level copy (not a DB restore), so it takes far less time (it doesn't run SQL again, recreate indexes, restore tables, etc.)
Cons:
Takes 2x the disk space (although the template could live on a low-performance NFS mount etc.)
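A per-run reset could then look like the following sketch (testDB is a hypothetical name; CREATE DATABASE ... TEMPLATE requires that nobody is connected to the template database, and DROP DATABASE that nobody is connected to the copy, while they run):
(from bash, between test runs)
dropdb --if-exists testDB
createdb testDB -T snapshotDB
With this setup the app only has to reconnect once per run rather than per test, which may sidestep the reconnection issue you mentioned.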

For my specific situation, I decided to go back to the original solution: comparing the "working" copy of the database with the "clean" copy.
There are 3 types of changes to handle (see the SQL sketch after this list):
For INSERTed records - find max(id) in the clean table and delete any record in the working table with a higher ID
For UPDATEd or DELETEd records - find all records in the clean table EXCEPT the records found in the working table, then UPSERT those records into the working table
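As a rough per-table sketch, assuming hypothetical names: a working schema and a clean schema in the same database, and a table orders(id, name, qty) with a unique id (if the clean copy lives in a separate database, you would need postgres_fdw or dblink instead):
-- 1) remove rows INSERTed during the test (ids above the clean high-water mark)
DELETE FROM working.orders
WHERE id > (SELECT max(id) FROM clean.orders);

-- 2) restore rows that were UPDATEd or DELETEd during the test
INSERT INTO working.orders (id, name, qty)
SELECT id, name, qty
FROM (
    SELECT id, name, qty FROM clean.orders
    EXCEPT
    SELECT id, name, qty FROM working.orders
) AS changed
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    qty  = EXCLUDED.qty;
The ON CONFLICT clause (PostgreSQL 9.5+) handles the UPDATE case, while rows that were DELETEd are simply re-inserted.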

Related

DB2 append backups into one unique db

I am using restore db to import a backup into a test environment database, which works fine, but I need to extend this import process to combine several backups from several dates into one unique test environment DB. What is the command to append backups into a unique database?
thanks
Phil
You write "I get only full offline database backup files from several dates (2 weeks of production) that I need to restore in a test database for analysis... example 4 backup files of 2 weeks of data = 2 months of data ...."
and you also write "What is the command to append backups into a unique database..."
While there is no explicit method for full-offline-backup images to be combined for Db2-LUW, there's always another way to get what you need...given the right skills and tools.
IF you have a FULL backup image, it can either be restored to a new database or fully overwrite an existing database. If you have 4 FULL backup images, each can be restored into its own uniquely named database (or overwrite 4 existing databases).
You can also restore specific tablespaces from a backup image, if properly configured. Some sites have designed discrete tablespaces for specific time periods (one per day/week/month) to help with such activities. Some sites have designed their tables to be range partitioned, with each time period having its own partition (and sometimes dedicated tablespaces too), which makes subsequent merging of content easier with the right skills.
If you are competent with scripting, you can restore the first (earliest) image, export the relevant table contents to flat files, restore the next backup image and export the relevant tables to new flat files (repeat as needed), then load these flat files into a table for analysis (see the sketch below). If your database size is small, this can be considered a keep-it-simple approach.
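A minimal sketch of that loop, in bash with the Db2 CLP; the database names, paths and backup timestamps here are all hypothetical and would need adjusting to your environment:
#!/bin/bash
# loop over the full offline backup images, earliest first
for ts in 20240101120000 20240108120000 20240115120000 20240122120000; do
    # overwrite the scratch database with this backup image
    db2 "RESTORE DATABASE PRODDB FROM /backups TAKEN AT $ts INTO TESTDB REPLACE EXISTING"
    db2 "CONNECT TO TESTDB"
    # export this period's table contents to a flat file
    db2 "EXPORT TO /staging/sales_$ts.del OF DEL SELECT * FROM MYSCHEMA.SALES"
    db2 "CONNECT RESET"
done
# finally, LOAD or IMPORT each /staging/sales_*.del file into one analysis table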
You can also do clever things with federation if you restore to discrete databases.
Separately purchasable tools exist to let you extract selected content from a backup image (which can then be loaded into a Db2 database), without needing to do a restore action. These are not included with the Db2 product. So you could extract specific table contents from a backup image if you pay for the right tools and learn how to use them. Speak with your IBM Salesperson. Such tools may require currently supported versions of Db2 however.

PostgreSQL: even read access changes data files disk leading to large incremental backups using pgbackrest

We are using pgbackrest to backup our database to Amazon S3. We do full backups once a week and an incremental backup every other day.
The size of our database is around 1TB; a full backup is around 600GB, and even an incremental backup is around 400GB!
We found out that even read access (pure SELECT statements) on the database has the effect that the underlying data files (in /usr/local/pgsql/data/base/xxxxxx) change. This results in large incremental backups and very high storage costs on Amazon S3.
Usually the files with low-numbered names (e.g. 391089.1) change on read access.
On an update, we see changes in one or more files - the index could correlate to the age of the row in the table.
Some more facts:
Postgres version 13.1
Database is running in docker container (docker version 20.10.0)
OS is CentOS 7
We see the phenomenon on multiple servers.
Can someone explain why PostgreSQL changes data files on pure read access?
We tested on a quiet database with no other clients accessing it.
This is normal. Some cases I can think of right away are:
a SELECT or other SQL statement setting a hint bit
This is a shortcut for subsequent statements that access the data, so they don't have to consult the commit log any more.
a SELECT ... FOR UPDATE writing a row lock
autovacuum removing dead row versions
These are leftovers from DELETE or UPDATE.
autovacuum freezing old visible row versions
This is necessary to prevent data corruption if the transaction ID counter wraps around.
The only way to fairly reliably prevent PostgreSQL from modifying a table in the future is (see the sketch after this list):
never perform an INSERT, UPDATE or DELETE on it
run VACUUM (FREEZE) on the table and make sure that there are no concurrent transactions
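A minimal sketch of that procedure (the table name my_table is hypothetical):
-- verify that no other transactions are running
SELECT pid, state, query
FROM pg_stat_activity
WHERE state <> 'idle' AND pid <> pg_backend_pid();

-- set all hint bits and freeze all visible row versions in one pass
VACUUM (FREEZE, VERBOSE) my_table;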

Is there a way to show everything that was changed in a PostgreSQL database during a transaction?

I often have to execute complex SQL scripts in a single transaction on a large PostgreSQL database, and I would like to verify everything that was changed during the transaction.
Verifying each single entry on each table "by hand" would take ages.
Dumping the database before and after the script to plain sql and using diff on the dumps isn't really an option since each dump would be about 50G of data.
Is there a way to show all the data that was added, deleted or modified during a single transaction?
Dude, what you are looking for is one of the most searched-for things on the internet when it comes to capturing database changes. It's a kind of version control, you could say.
But as far as I know, there is sadly no built-in approach in PostgreSQL or MySQL. You can work around it by adding triggers for the operations you use most.
You can create backup schemas and tables to capture the rows that were updated, created, or deleted.
This way you can achieve what you want. I know this process is fully manual, but it is really effective.
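A minimal sketch of that trigger-based approach, assuming a hypothetical audit schema and a target table my_table (EXECUTE FUNCTION needs PostgreSQL 11+; use EXECUTE PROCEDURE on older versions):
CREATE SCHEMA IF NOT EXISTS audit;

-- one generic log table; row images are stored as JSON
CREATE TABLE IF NOT EXISTS audit.change_log (
    id         bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    operation  text        NOT NULL,
    old_row    jsonb,
    new_row    jsonb,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION audit.log_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit.change_log (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO audit.change_log (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
    ELSE  -- INSERT
        INSERT INTO audit.change_log (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_audit
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE FUNCTION audit.log_change();
After the transaction, querying audit.change_log shows every captured change; you would create one such trigger per table of interest.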
If you need to analyze the script's behaviour only sporadically, the easiest approach is to change the server configuration parameter log_min_duration_statement to 0 and then back to whatever value it had before the analysis. Then all of the script's activity will be written to the instance log.
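For example (requires superuser; note that ALTER SYSTEM persists across restarts, so remember to reset it afterwards):
-- log every statement
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();

-- ... run the script in its transaction ...

-- restore the previous behaviour
ALTER SYSTEM RESET log_min_duration_statement;
SELECT pg_reload_conf();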
This approach is not suitable if your storage is not prepared to accommodate this amount of data, or for systems in which you don't want sensitive client data to be written to a plain-text log file.

How to see changes in a postgresql database

My PostgreSQL database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS RDS instance. I have automated backups enabled (these are different from RDS snapshots, which are user-initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see the difference between RDS snapshots. But in the past we tested several solutions for a similar problem; maybe you can take some inspiration from them.
The obvious solution is of course an auditing system. This way you can see, in a relatively simple way, what was changed - depending on the granularity of your auditing system, down to column values. Of course there is an impact on your application due to the auditing triggers and the queries into the audit tables.
Another possibility - for tables with primary keys, you can store the primary key values together with the 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before the update, and compare them with the values after the update. But this way you can only identify changed / inserted / deleted rows, not changes to individual columns.
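A rough sketch of that idea, assuming a hypothetical table my_table with primary key id (ctid also moves when VACUUM FULL or similar relocates rows, so treat this as a heuristic):
-- before the nightly update: remember each row's system columns
CREATE TABLE pre_update_state AS
SELECT id, xmin::text AS xmin_before, ctid AS ctid_before
FROM my_table;

-- after the update: rows whose xmin or ctid changed were updated
SELECT t.id
FROM my_table t
JOIN pre_update_state p USING (id)
WHERE t.xmin::text <> p.xmin_before
   OR t.ctid <> p.ctid_before;

-- rows present on only one side were inserted or deleted
SELECT id FROM my_table EXCEPT SELECT id FROM pre_update_state;  -- inserted
SELECT id FROM pre_update_state EXCEPT SELECT id FROM my_table;  -- deleted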
You can also set up a streaming replica with replication slots (and, to be on the safe side, WAL archiving too). Then pause replication on the replica before the updates and compare the data after the updates using dblink selects. But these queries can be very heavy.
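A sketch of that comparison via dblink (connection string, table and columns are hypothetical; requires CREATE EXTENSION dblink):
-- on the replica, pause replay before the nightly update (PostgreSQL 10+):
--   SELECT pg_wal_replay_pause();

-- on the primary, after the update: rows that differ from the paused replica
SELECT id, payload
FROM my_table
EXCEPT
SELECT id, payload
FROM dblink('host=replica-host dbname=mydb',
            'SELECT id, payload FROM my_table')
     AS r(id integer, payload text);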

How can I make and query read only snapshots in Postgres (or MySql)?

I'd like to create a read-only snapshot of a database at the end of each day, and keep them around for a couple of months.
I'd then like to be able to run queries against a specific (named) snapshot.
Is this possible to achieve elegantly and with minimal resource usage? (The database changes only very slowly, but holds a few GBs of data, so almost all data is common to all snapshots.)
The usual way to create a snapshot in PostgreSQL is to use pg_dump/pg_restore.
A much quicker method is to simply use CREATE DATABASE to clone your database.
CREATE DATABASE my_copy_db TEMPLATE my_production_db;
which will be much faster than a dump/restore. The only drawback to this solution is that the source database must not have any open connections.
The copy will not be read-only by default, but you can simply revoke the respective write privileges from the users to ensure that nobody modifies it.
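A sketch of the whole procedure (database and role names are hypothetical; note that default_transaction_read_only can be overridden per session, so it is a guard rail rather than a hard lock):
-- clone the database (no open connections to the source while this runs)
CREATE DATABASE snapshot_2024_06_01 TEMPLATE my_production_db;

-- inside the new database: revoke write privileges from the application role
REVOKE INSERT, UPDATE, DELETE, TRUNCATE
ON ALL TABLES IN SCHEMA public
FROM app_user;

-- additionally, default every transaction in this database to read-only
ALTER DATABASE snapshot_2024_06_01
SET default_transaction_read_only = on;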