Amazon RDS PostgreSQL snapshot preserves schema but loses all data

Using the AWS RDS console I created a snapshot backup of a PostgreSQL v11 database containing multiple schemas. I then created a new instance from the backup. The process appeared to complete without error. However, upon inspecting the data in the new instance, I noticed that in one of my schemas the data was not preserved. The schema structure, tables, indexes, constraints, etc. looked fine, but every table was empty (select count(*) from schema.table returned 0 for every table in the schema). All other schemas looked fine and contained the expected data. I could not find any help for this online and tried many tests myself (changing roles, rebuilding the schema, adjusting privileges, and more) while attempting to solve this issue. What would cause my snapshots to preserve the entire schema structure but lose all of the data itself?

I finally realized that the only difference between the problem schema and the others was that all tables in the problem schema had been created with the UNLOGGED keyword. This was done to increase write speed for the millions of rows inserted when the schema was first built. However, when a snapshot is created/restored as described above, the process depends on the WAL that is written for normal (logged) tables to restore the data, and unlogged tables write no WAL. To fix my problem I simply altered all of the tables to be logged (alter table schema.table set logged). After this, snapshots worked fine. For anyone else doing something similar in the future: if unlogged tables are needed for the initial mass population of data to get better write speed, it is a good idea to change them to logged after the initial data load (if you plan on using snapshots, replication, or similar). Side note: pg_dump/pg_restore does still work for unlogged tables.
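If there are many tables to convert, a query like the following can generate the ALTER statements for every unlogged table in a schema. This is only a sketch; the schema name my_schema is a placeholder, and relpersistence = 'u' is how PostgreSQL marks unlogged tables in pg_class.

-- generate ALTER statements for all unlogged tables in a (placeholder) schema
SELECT format('ALTER TABLE %I.%I SET LOGGED;', n.nspname, c.relname)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'           -- ordinary tables
  AND c.relpersistence = 'u'    -- unlogged
  AND n.nspname = 'my_schema';

Run the generated statements (or wrap them in a DO block) to switch the whole schema over to logged tables.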

Related

How to replicate a Postgres DB with only a sample of the data

I'm attempting to mock a database for testing purposes. What I'd like to do is, given a connection to an existing Postgres DB, retrieve the schema, limit the data pulled to 1000 rows from each table, and persist both of these components as a file which can later be imported into a local database.
pg_dump doesn't seem to fulfill my requirements, as there's no way to tell it to retrieve only a limited number of rows from each table; it's all or nothing.
COPY/\copy commands can help fill this gap; however, there doesn't seem to be a way to copy data from multiple tables into a single file. I'd rather avoid having to create one file per table. Is there a way to work around this?
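For reference, the \copy form that takes a query is what makes the row limit possible; a minimal sketch, assuming a placeholder table my_table and output file my_table.csv:

\copy (SELECT * FROM my_table LIMIT 1000) TO 'my_table.csv' WITH (FORMAT csv, HEADER)

This still produces one file per table, so combining everything into a single importable file would need additional scripting on top of it.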

Source of data in Redshift tables

I am looking to find the data source of a couple of tables in Redshift. I have gone through all the stored procedures in the Redshift instance and couldn't find any that populate these tables. I have also checked the Database Migration Service and didn't see these tables being migrated from an RDS instance. However, the tables are updated regularly each day.
What would be the way to find how data is populated in those 2 tables? Is there any logs or system tables I can look in to?
One place I'd look is svl_statementtext. That will pull any queries and utility statements that may be inserting into or running COPY jobs against that table. Just use WHERE text LIKE '%yourtablenamehere%' and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
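A minimal sketch of such a query (the table name is a placeholder, and svl_statementtext only retains a few days of history):

SELECT starttime, xid, type, text
FROM svl_statementtext
WHERE text ILIKE '%yourtablenamehere%'
ORDER BY starttime DESC;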
Also check scheduled queries in the Redshift UI console.

How does AWS postgres RDS read replication handle schema switching?

I want to know how AWS RDS for PostgreSQL handles replication when I rename schemas to "swap" them within the read/write instance of the database.
Does it replicate this action to the read replicas by passing on the "alter schema" rename commands I ran on my read/write instance? Or, after my renames, does it see wholly different sets of data in the schemas and copy each of them out to the read replicas all over again?
For example...
In my RDS instance I have a read/write instance of "my_mega_database" which I want to create read-replicas of for my applications to connect to.
Typically, in "my_mega_database" there are two schemas "my_data" and "my_data_old", whereby "my_data" contains data that was delivered last night, and "my_data_old" contains data from the previous night. Each contains many tables and huge amounts of data.
If I were to do the following...
ALTER SCHEMA my_data_old RENAME TO my_data_tmp;
ALTER SCHEMA my_data RENAME TO my_data_old;
ALTER SCHEMA my_data_tmp RENAME TO my_data;
... I have effectively swapped these around.
My expectation is that these actions are replicated via the PostgreSQL WAL (i.e. the renames are replayed on the replicas) and that AWS RDS replication won't waste time copying huge amounts of data all over the place.
Is this correct?
(Speaking about PostgreSQL here, but RDS is probably similar.)
Renaming a schema (or any other object) is a small update in a catalog table, and no data are moved. Internally PostgreSQL uses only the numeric object ID, which stays the same.
You might wrap the three statements in a transaction to make the whole swap atomic.
The same is true on the standby: it is a trivial (meta)data modification.
The only thing that might be a problem is concurrent sessions holding locks.
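A sketch of the transactional version of the swap, using the schema names from the question:

BEGIN;
ALTER SCHEMA my_data_old RENAME TO my_data_tmp;
ALTER SCHEMA my_data RENAME TO my_data_old;
ALTER SCHEMA my_data_tmp RENAME TO my_data;
COMMIT;

Sessions already holding locks on objects in these schemas will block the ALTER statements until they finish, so it can help to run the swap in a quiet window or set a lock_timeout first.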

How to see changes in a postgresql database

My PostgreSQL database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS RDS instance. I have automated backups enabled (these are different from RDS snapshots, which are user-initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see the difference between RDS snapshots, but in the past we tested several solutions for a similar problem. Maybe you can take some inspiration from them.
The obvious solution is of course an auditing system. This way you can see in a relatively simple way what was changed, and depending on the granularity of your auditing system, down to column values. Of course there is an impact on your application due to the auditing triggers and the queries into the audit tables.
Another possibility, for tables with primary keys, is to store the primary key values together with the 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before the update and compare them with the values after the update. This way you can identify changed / inserted / deleted rows, but not which columns changed.
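A minimal sketch of that second approach, assuming a hypothetical table my_table with primary key id (xmin is cast to text because the xid type has limited comparison operators):

-- before the nightly update: capture pk + xmin of every row
CREATE TABLE my_table_before AS
SELECT id, xmin::text AS xmin_before
FROM my_table;

-- after the update: rows whose xmin changed were modified
SELECT t.id
FROM my_table t
JOIN my_table_before b USING (id)
WHERE t.xmin::text <> b.xmin_before;

-- rows present on only one side were inserted or deleted
SELECT id FROM my_table EXCEPT SELECT id FROM my_table_before;   -- inserted
SELECT id FROM my_table_before EXCEPT SELECT id FROM my_table;   -- deleted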
You can also set up a streaming replica with replication slots (and, to be on the safe side, WAL archiving). Then pause replication on the replica before the update and compare the data after the update using dblink queries. But these queries can be very heavy.
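A sketch of such a comparison, assuming the dblink extension is installed, a hypothetical connection string, and a table my_table(id integer, val text); the paused replica still holds the pre-update data:

-- run on the updated primary: rows that differ from the pre-update copy on the replica
SELECT * FROM my_table
EXCEPT
SELECT *
FROM dblink('host=replica-host dbname=mydb user=myuser',
            'SELECT id, val FROM my_table')
     AS old_rows(id integer, val text);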

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed, or deleting any records that have been added. So far this strategy seems to be the best approach for us, because per test not a lot of data changes and the database is not very big.
Now I'm looking for a way to do essentially the same thing with PostgreSQL, and I'm considering the exact same approach. But before doing so, I was wondering if anyone else has done something similar and what method they used to restore data in their automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are not particularly interested in re-inventing the wheel by checking for diffs in your own code, you should try creating a new DB (per run) via the template feature of the createdb command (or the CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always the exact same DB by design (no custom logic)
Creating from a template is a file-level copy (not a DB restore), so it takes far less time (i.e. it doesn't run SQL again, recreate indexes, or reload tables, etc.)
Cons:
Takes 2x the disk space (although the template could live on low-performance storage such as NFS)
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy of the database.
There are three types of changes to handle:
For INSERTed records: find max(id) in the clean table and delete any record in the working table that has a higher ID.
For UPDATEd or DELETEd records: find all records in the clean table EXCEPT the records found in the working table, then UPSERT those records into the working table.
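A minimal sketch of that reset logic, assuming hypothetical schemas clean and working that each contain my_table(id integer primary key, val text):

-- 1. remove rows the test inserted (ids above the clean max)
DELETE FROM working.my_table
WHERE id > (SELECT max(id) FROM clean.my_table);

-- 2. restore rows the test updated or deleted
INSERT INTO working.my_table (id, val)
SELECT id, val FROM clean.my_table
EXCEPT
SELECT id, val FROM working.my_table
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;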