I have inherited an existing Postgres database full of data. Most of the data has a 'created_date' column value. Some of the earlier data was inserted before this was being tracked.
Is there a Postgres metadata table hidden away somewhere that tracks when INSERT queries were done?
Postgres 9.5 or later
You can enable track_commit_timestamp in postgresql.conf (and restart) to start tracking commit timestamps. Then you can get a timestamp for your xmin. Related answer:
Atomically set SERIAL value when committing transaction
Postgres 9.4 or older
There is no such metadata in PostgreSQL unless you record it yourself.
You may be able to deduce some information from the row headers (HeapTupleHeaderData), in particular from the insert transaction id xmin. It holds the ID of the transaction in which the row was inserted (needed to decide visibility in PostgreSQL's MVCC model). Try (for any table):
SELECT xmin, * FROM tbl LIMIT 10;
Some limitations apply:
If the database was dumped and restored then, obviously, the information is gone - all rows are inserted in the same transaction.
If the database is huge / very old / very heavily written, then it may have gone through transaction ID wraparound, and the order of numbers in xmin is ambiguous.
But for most databases you should be able to derive:
the chronological order of INSERTs
which rows were inserted together
when there (probably) was a long period of time between inserts
No timestamp, though.
Building on Erwin Brandstetter's answer, if you have PostgreSQL 9.5 or later, the timestamps of commits are being recorded in the write-ahead log all the time, even if track_commit_timestamp is off. They are recorded there to support point-in-time recovery, where you can roll the database to an exact past state that you can specify as a date and time.
What you get by turning track_commit_timestamp on is an easier way to retrieve that information, where you can simply query with
SELECT pg_xact_commit_timestamp(xid);
where xid is the xmin from the row you care about, and it gives you the timestamp.
That's convenient, but it only works if:
track_commit_timestamp is on
it was on when the transaction committed
the transaction ID is not far enough in the past to be 'frozen'.
(PostgreSQL controls the overhead of remembering transaction IDs forever, by eventually 'freezing' old ones. That also controls how far the track_commit_timestamp-dependent functions can look back. There is another setting, vacuum_freeze_max_age, for adjusting that.)
So what do you do if you need the timestamp for a transaction that happened before you turned on track_commit_timestamp?
As long as it happened in PG 9.5 or later, the timestamp is in the write-ahead log. If you have been keeping backups sufficient for point-in-time recovery, that gives you a crude way to find the answer: you can restore a base backup from before you think it happened, set a recovery 'pause' target timestamp near where you guess it happened, connect when it pauses and query to see if it happened yet. If not, set a slightly later target, let the recovery continue, and check again. This can all be done using the backups in another PostgreSQL instance, to avoid interfering with one running production.
That is a clumsy-enough procedure you might wish you could just go back in time and tell your former self to turn track_commit_timestamp on, so it would have been on when the transaction happened that you are interested in. You can turn on track_commit_timestamp before starting the server to recover from a backup, but that doesn't quite do the trick: if it was turned off at the time of the backup, it will only begin saving timestamps for new transactions, after the ones it recovers.
It turns out it is possible to fool PostgreSQL into thinking track_commit_timestamp was on, and then start the server in recovery, and that has much the desired effect: as it replays transactions from the write-ahead log, it does remember their timestamps, and you can then use pg_xact_commit_timestamp() to query them. It will not have timestamps for anything that was in the base backup, but only for the transactions that followed the base backup and were replayed from the WAL. Still, by choosing a base backup known to be earlier than the wanted transaction, this allows the timestamp to be recovered.
There is no official tool/option to 'retroactively' set track_commit_timestamp in this way, but the (fiddly and unsupported) proof-of-concept has been discussed on pgsql-hackers.
track_commit_timestamp (boolean)
Mostly used at time of replication server setup.
Record commit time of transactions. This parameter can only be set in postgresql.conf file or on the server command line. The default value is off.
Short answer: no.
If there was, everyone would complain it was a waste of space on all the tables they didn't want to track.
Related
My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.
Let's say I have a .csv-File with 100 million rows. I import that csv-file into pentaho Kettle and want to write all rows into a PostgreSQL database. What is the fastest insert-transformation? I have tried the normal table output transformation and the PostgreSQL Bulk Loader (which is way faster than the table output). But still, it is too slow. Is there a faster way than using the PostgreSQL Bulk Loader?
Considering the fact that PostgreSQL Bulk Loader runs COPY table_name FROM STDIN - there's nothing faster from data load in postgres. Multi-value insert will be slower, just multiple insert will be slowest. So you can't make it faster.
To speed up COPY you can:
set commit_delay to 100000;
set synchronous_commit to off;
and other server side tricks (like dropping indexes before loading).
NB:
very old but still relevant depesz post
most probably won't work with pentaho Kettle,but worth of checking pgloader
update
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be
written to disk before the command returns a “success” indication to
the client. Valid values are on, remote_apply, remote_write, local,
and off. The default, and safe, setting is on. When off, there can be
a delay between when success is reported to the client and when the
transaction is really guaranteed to be safe against a server crash.
(The maximum delay is three times wal_writer_delay.) Unlike fsync,
setting this parameter to off does not create any risk of database
inconsistency: an operating system or database crash might result in
some recent allegedly-committed transactions being lost, but the
database state will be just the same as if those transactions had been
aborted cleanly. So, turning synchronous_commit off can be a useful
alternative when performance is more important than exact certainty
about the durability of a transaction.
(emphasis mine)
Also notice I recommend using SETfor the session level, so if the GeoKettle does not allow to set config before running commands on postgres, you can use pgbouncer connect_query for the specific user/database pair, or think some other trick. And if you can't do anything to set synchronous_commit per session and you decide to change it per database or user (so it would be applied to GeoKettle connection, don't forget to set it back to on after load is over.
I have following Quires:
How Do I check redo / un-committed data size in PostgreSQL ?
Looks like if I do multiple update in sequence, it slows down.
Like Update 1, update 2, .... update n; ...seem update n is slower than update 1. Does uncommitted data volume affects it ? How redo management works in PostgreSQL ?
How do I monitor current running SQL in stored function? pg_stat_activity just shows function call; at session level. How do I get current SQL under that function which is running now ?
~ Santosh
You're clearly coming from an Oracle background.
PostgreSQL does not have undo and redo logs, as such.
Uncommitted (in-progress or rolled back), live committed data and comimtted-then-deleted data are mixed together in the heap, i.e. the main table contents. The fraction used by rolled back transactions, old versions of updated rows and deleted rows is referred to as table bloat. See the wiki.
The closest thing to do the redo log is the write-ahead logs in pg_xlog. There's no SQL-level interface to getting the current xlog size.
The documentation discusses this in some more detail, but it's an area of PostgreSQL management that could really use more attention from interested contributors. Both better built-in monitoring tools and better documentation would be good. Patches are welcome.
As for your second question... you don't. There isn't currently a way to get a function call stack. One is being discussed, but hasn't been implemented as of 9.5.
While editing some records in my PostgreSQL database using sql in the terminal (in ubuntu lucid), I made a wrong update.
Instead of -
update mytable set start_time='13:06:00' where id=123;
I typed -
update mytable set start_time='13:06:00';
So, all records are now having the same start_time value.
Is there a way to undo this change? There are some 500+ records in the table, and I do not know what the start_time value for each record was
Is it lost forever?
I'm assuming it was a transaction that's already committed? If so, that's what "commit" means, you can't go back.
Some data may be recoverable if you're lucky. Stop the database NOW.
Here's an answer I wrote on the same topic earlier. I hope it's helpful.
This might be too: Recoved deleted rows in postgresql .
Unless the data is absolutely critical, just restore from backups, it'll be lots easier and less painful. If you didn't have backups, consider yourself soundly thwacked.
If you catch the mistake and immediately bring down any applications using the database and take it offline, you can potentially use Point-in-Time Recovery (PITR) to replay your Write Ahead Log (WAL) files up to, but not including, the moment when the errant transaction was made. This would return the database to the state it was in prior, thus effectively 'undoing' that transaction.
As an approach for a production application database it has a number of obvious limitations, but there are circumstances in which PITR may be the best option available, especially when critical data loss has occurred. However, it is of no value if archiving was not already configured before the corruption event.
https://www.postgresql.org/docs/current/static/continuous-archiving.html
Similar capabilities exist with other relational database engines.
I am interested in keeping a running history of every change which has happened on some tables in my database, thus being able to reconstruct historical states of the database for analysis purposes.
I am using Postgres, and this MVCC thing just seems like I should be able to exploit it for this purpose but I cannot find any documentation to support this. Can I do it? Is there a better way?
Any input is appreciated!
UPD
I have marked Denis' response as the answer, because he did in fact answer whether MVCC is what I want which was the question. However, the strategy I have settled on is detailed below in case anyone finds it useful:
The Postgres feature that does what I want: online backup/point in time recovery.
http://www.postgresql.org/docs/8.1/static/backup-online.html explains how to use this feature but essentially you can set this "write ahead log" to archive mode, take a snapshot of the database (say, before it goes live), then continually archive the WAL. You can then use log replay to recall the state of the database at any time, with the side benefit of having a warm standby if you choose (by continually replaying the new WALs on your standby server).
Perhaps this method is not as elegant as other ways of keeping a history, since you need to actually build the database for every point in time you wish to query, however it looks extremely easy to set up and loses zero information. That means when I have the time to improve my handling of historical data, I'll have everything and will therefore be able to transform my clunky system to a more elegant system.
One key fact that makes this so perfect is that my "valid time" is the same as my "transaction time" for the specific application- if this were not the case I would only be capturing "transaction time".
Before I found out about the WAL, I was considering just taking daily snapshots or something but the large size requirement and data loss involved did not sit well with me.
For a quick way to get up and running without compromising my data retention from the outset, this seems like the perfect solution.
Time Travel
PostgreSQL used to have just this feature, and called it "Time Travel". See the old documentation.
There's somewhat similar functionality in the spi contrib module that you might want to check out.
Composite type audit trigger
What I usually do instead is to use triggers to log changes along with timestamps to archival tables, and query against those. If the table structure isn't going to change you can use something like:
CREATE TABLE sometable_history(
command_tag text not null check (command_tag IN ('INSERT','DELETE','UPDATE','TRUNCATE')),
new_content sometable,
change_time timestamp with time zone
);
and your versioning trigger can just insert into sometable_history(TG_OP,NEW,current_timestamp) (with a different CASE for DELETE, where NEW is not defined).
hstore audit trigger
That gets painful if the schema changes to add new NOT NULL columns though. If you expect to do anything like that consider using a hstore to archive the columns, instead of a composite type. I've already added an implementation of that on the PostgreSQL wiki already.
PITR
If you want to avoid impact on your master database (growing tables, etc), you can alternately use continuous archiving and point-in-time recovery to log WAL files that can, using a recovery.conf, be replayed to any moment in time. Note that WAL files are big and they include not only the tuples you changed, but VACUUM activity and other details. You'll want to run them through clearxlogtail since they can have garbage data on the end if they're partial segments from an archive timeout, then you'll want to compress them heavily for long term storage.
I am using Postgres, and this MVCC thing just seems like I should be able to exploit it for this purpose but I cannot find any documentation to support this. Can I do it?
Not really. There are tools to see dead rows, because auto-vacuuming is so that will eventually be reclaimed.
Is there a better way?
If I get your question right, you're looking into logging slowly changing dimensions.
You might find this recent related thread interesting:
Temporal database design, with a twist (live vs draft rows)
I'm not aware of any tools/products that are built for that purpose.
While this may not be exactly what you're asking for, you can configure Postgresql to log ddl changes. Setting the log_line_prefix parameter (try including %d, %m, and %u) and setting the log_statement parameter to ddl should give you a reasonable history of who made what ddl changes and when.
Having said that, I don't believe logging ddl to be foolproof. For example, consider a situation where:
Multiple schemas have a table with the same name,
one of the tables is altered, and
the ddl doesn't fully qualify the table name (relying on the search path to get it right),
then it may not be possible to know from the log which table was actually altered.
Another option might be to log ddl as above but then have a watcher program perform a pg_dump of the database schema whenever a ddl entry get's logged. You could even compare the new dump with the previous dump and extract just the objects that were changed.