How to see changes in a postgresql database - postgresql

My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?

I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.

Related

Insert data into remote DB tables from multiple databases through trigger or replication or foreign data wrapper

I need some advice about the following scenario.
I have multiple embedded systems supporting PostgreSQL database running at different places and we have a server running on CentOS at our premises.
Each system is running at remote location and has multiple tables inside its database. These tables have the same names as the server's table names, but each system has different table name than the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which will run on every insert of both systems' tables and insert the same data on the server's tables. This trigger will also delete the records in the systems after inserting the data into server. The problem with this solution is that if the connection with the server is not established due to network issue than that trigger will not execute or the insert will be wasted. I have checked the following solution for this
Trigger to insert rows in remote database after deletion
The second solution is to replicate tables from system 1 and system 2 to the server's tables. The problem with replication will be that if we delete data from the systems, it'll also delete the records on the server. I could add the alternative trigger on the server's tables which will update on the duplicate table, hence the replicated table can get empty and it'll not effect the data, but it'll make a long tables list if we have more than 200 systems.
The third solution is to write a foreign table using postgres_fdw or dblink and update the data inside the server's tables, but will this effect the data inside the server when we delete the data inside the system's table, right? And what will happen if there is no connectivity with the server?
The forth solution is to write an application in python inside each system which will make a connection to server's database and write the data in real time and if there is no connectivity to the server than it will store the data inside the sys1.table1 or sys2.table2 or whatever the table the data belongs and after the re-connect, the code will send the tables data into server's tables.
Which option will be best according to this scenario? I like the trigger solution best, but is there any way to avoid the data loss in case of dis-connectivity from the server?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you you have a recent PostgreSQL versions that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after they have been replicated.

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated test, we restore the database back to "clean" state at the start of every test. We do this by comparing the "diff" between the working copy of the database with the clean copy (table by table). Then copying over any records that have changed. Or deleting any records that have been added. So far this strategy seems to be the best way to go about for us because per test, not a lot of data is changed, and the size of the database is not very big.
Now I'm looking for a way to essentially do the same thing but with PostgreSQL. I'm considering doing the exact same thing with PostgreSQL. But before doing so, I was wondering if anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via templates feature of createdb command (or CREATE DATABASE statement) in PostgreSQL.
So for e.g.
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snaptshotDB;
Pros:
In theory, always exact same DB by design (no custom logic)
Replication is a file-transfer (not DB restore). So far less time taken (i.e. doesn't run SQL again, doesn't recreate indexes / restore tables etc.)
Cons:
Takes 2x the disk space (although template could be on a low performance NFS etc)
For my specific situation. I decided to go back to the original solution. Which is to compare the "working" copy of the database with "clean" copy of the database.
There are 3 types of changes.
For INSERT records - find max(id) from clean table and delete any record on working table that has higher ID
For UPDATE or DELETE records - find all records in clean table EXCEPT records found in working table. Then UPSERT those records into working table.

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databses? For example if I want to add a table or add a column to a table, I have to either write a program that loops through all databases and execute a SQL against them or using pgadmin, go through them one by one.
Is there smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have the column TENANT and add this to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously considerably more work than this as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database then I'm afraid you have to alter them all individually, and doing it programatically would be simplest.
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database in one hand, and have tenants share the same table on the other, this becomes the most apparent dimension. If you provide your users raw SQL access or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row level security policies that makes this less of a concern for most applications.
Extensibility: If your tenants are sharing the same schema (approach #3), and your tenants have fields that varies between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in this column based on the tenant_id. PostgresQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
Scaling: Another criteria is how easily your database would scale-out. If you create a tenant per database or schema (#1 or #2 above), your application can make use of existing Ruby Gems or [Django packages][1] to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
it's time for me to give back to the community :) So after 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per each tenant. This has given us extreme flexibility as the size of the databases in the backups are not huge and hence we can easily import them into our staging environment for customers issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of learning curve here an there but nothing substantial. 2-5 days can significantly save you time.
Given that we use Spring/JPA/Hibernate, we use a technique called Dynamic Data Source Routing. So when a user logs-in, we find the related datasource with a lookup and connect them to the session to the right database. That's also when the Liquibase scripts get applied for updates.
This is, for now, I will come back with more later on.
Well, there are problems with one database for all tenants in our case for sure.
The backup file gets huge and becomes almost not practical hard to manage
For troubleshooting, we need to restore customer's data in our dev env, we just use that customer's backup file and usually the file is not as big as if we were to use one database for all customers.
Again, Liquibase has been key in allowing to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So Liquibase, Liquibase and more Liquibase.
I also suspect that we would need a more powerful hardware to manage a huge database with large joins across millions of records vs much lighter database with much smaller queries.
In case of problems, the service doesn't go down for everyone and there will be limited to one or few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had one customer that didn't have their archiving active and their database size grew to over 3 GB. With offshore teams and slower internet as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex

Database Content Versioning

I am interested in keeping a running history of every change which has happened on some tables in my database, thus being able to reconstruct historical states of the database for analysis purposes.
I am using Postgres, and this MVCC thing just seems like I should be able to exploit it for this purpose but I cannot find any documentation to support this. Can I do it? Is there a better way?
Any input is appreciated!
UPD
I have marked Denis' response as the answer, because he did in fact answer whether MVCC is what I want which was the question. However, the strategy I have settled on is detailed below in case anyone finds it useful:
The Postgres feature that does what I want: online backup/point in time recovery.
http://www.postgresql.org/docs/8.1/static/backup-online.html explains how to use this feature but essentially you can set this "write ahead log" to archive mode, take a snapshot of the database (say, before it goes live), then continually archive the WAL. You can then use log replay to recall the state of the database at any time, with the side benefit of having a warm standby if you choose (by continually replaying the new WALs on your standby server).
Perhaps this method is not as elegant as other ways of keeping a history, since you need to actually build the database for every point in time you wish to query, however it looks extremely easy to set up and loses zero information. That means when I have the time to improve my handling of historical data, I'll have everything and will therefore be able to transform my clunky system to a more elegant system.
One key fact that makes this so perfect is that my "valid time" is the same as my "transaction time" for the specific application- if this were not the case I would only be capturing "transaction time".
Before I found out about the WAL, I was considering just taking daily snapshots or something but the large size requirement and data loss involved did not sit well with me.
For a quick way to get up and running without compromising my data retention from the outset, this seems like the perfect solution.
Time Travel
PostgreSQL used to have just this feature, and called it "Time Travel". See the old documentation.
There's somewhat similar functionality in the spi contrib module that you might want to check out.
Composite type audit trigger
What I usually do instead is to use triggers to log changes along with timestamps to archival tables, and query against those. If the table structure isn't going to change you can use something like:
CREATE TABLE sometable_history(
command_tag text not null check (command_tag IN ('INSERT','DELETE','UPDATE','TRUNCATE')),
new_content sometable,
change_time timestamp with time zone
);
and your versioning trigger can just insert into sometable_history(TG_OP,NEW,current_timestamp) (with a different CASE for DELETE, where NEW is not defined).
hstore audit trigger
That gets painful if the schema changes to add new NOT NULL columns though. If you expect to do anything like that consider using a hstore to archive the columns, instead of a composite type. I've already added an implementation of that on the PostgreSQL wiki already.
PITR
If you want to avoid impact on your master database (growing tables, etc), you can alternately use continuous archiving and point-in-time recovery to log WAL files that can, using a recovery.conf, be replayed to any moment in time. Note that WAL files are big and they include not only the tuples you changed, but VACUUM activity and other details. You'll want to run them through clearxlogtail since they can have garbage data on the end if they're partial segments from an archive timeout, then you'll want to compress them heavily for long term storage.
I am using Postgres, and this MVCC thing just seems like I should be able to exploit it for this purpose but I cannot find any documentation to support this. Can I do it?
Not really. There are tools to see dead rows, because auto-vacuuming is so that will eventually be reclaimed.
Is there a better way?
If I get your question right, you're looking into logging slowly changing dimensions.
You might find this recent related thread interesting:
Temporal database design, with a twist (live vs draft rows)
I'm not aware of any tools/products that are built for that purpose.
While this may not be exactly what you're asking for, you can configure Postgresql to log ddl changes. Setting the log_line_prefix parameter (try including %d, %m, and %u) and setting the log_statement parameter to ddl should give you a reasonable history of who made what ddl changes and when.
Having said that, I don't believe logging ddl to be foolproof. For example, consider a situation where:
Multiple schemas have a table with the same name,
one of the tables is altered, and
the ddl doesn't fully qualify the table name (relying on the search path to get it right),
then it may not be possible to know from the log which table was actually altered.
Another option might be to log ddl as above but then have a watcher program perform a pg_dump of the database schema whenever a ddl entry get's logged. You could even compare the new dump with the previous dump and extract just the objects that were changed.

How to prevent Write Ahead Logging on just one table in PostgreSQL?

I am considering log-shipping of Write Ahead Logs (WAL) in PostgreSQL to create a warm-standby database. However I have one table in the database that receives a huge amount of INSERT/DELETEs each day, but which I don't care about protecting the data in it. To reduce the amount of WALs produced I was wondering, is there a way to prevent any activity on one table from being recorded in the WALs?
Ran across this old question, which now has a better answer. Postgres 9.1 introduced "Unlogged Tables", which are tables that don't log their DML changes to WAL. See the docs for more info, but at least now there is a solution for this problem.
See Waiting for 9.1 - UNLOGGED tables by depesz, and the 9.1 docs.
Unfortunately, I don't believe there is. The WAL logging operates on the page level, which is much lower than the table level and doesn't even know which page holds data from which table. In fact, the WAL files don't even know which pages belong to which database.
You might consider moving your high activity table to a completely different instance of PostgreSQL. This seems drastic, but I can't think of another way off the top of my head to avoid having that activity show up in your WAL files.
To offer one option to my own question. There are temp tables - "temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below)" - which I think don't generate WALs. Even so, this might not be ideal as the table creation & design will be have to be in the code.
I'd consider memcached for use-cases like this. You can even spread the load over a bunch of cheap machines too.