Script to track Database change - postgresql

I need to track any changes to data in a PostgreSQL database. Is there any option in the database, or any script, to view the changed data and the DML statements as well?

Sorry - I have no clue. But I do have some different suggestions:
Log /all/ queries and grep for those involving UPDATE, DELETE, INSERT, ALTER TABLE, etc. Caveats: this may cause performance problems if there are lots of queries and the log is on the same RAID as the data and/or WAL; it's not clear that a regexp can be written that is 100% certain to catch all modifying statements; and it may be difficult to account for rollbacks, etc. To log everything, add log_min_duration_statement = 0 to the configuration file, and check that the other log_* configuration variables are sane as well.
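As a minimal sketch of the configuration (the values are illustrative, and you would pick one of the first two logging approaches), postgresql.conf might contain:
log_min_duration_statement = 0   # log every statement, regardless of duration
#log_statement = 'mod'           # alternative: log only statements that modify data or schema
log_line_prefix = '%t %u %d '    # timestamp, user and database name make grepping easier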
The rules/trigger approach (as hinted at by another user) - I believe it involves writing rules or triggers for each and every table, but it's of course doable (and it should be possible to create them through some external script if you have a lot of tables). You may also look a bit into how Slony works - Slony is a trigger-based replication system, and it should be possible to use it to catch all the changes in the DB.
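A minimal sketch of the trigger approach, assuming a hypothetical audit_log table and a tracked table called accounts (both names are illustrative only):
-- one row per change, storing the operation and the affected row as text
CREATE TABLE audit_log (
    changed_at  timestamp NOT NULL DEFAULT now(),
    table_name  text      NOT NULL,
    operation   text      NOT NULL,
    row_data    text
);

CREATE OR REPLACE FUNCTION audit_trigger() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit_log (table_name, operation, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
        RETURN OLD;
    ELSE
        INSERT INTO audit_log (table_name, operation, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- one such trigger per table you want to track (could be generated by a script)
CREATE TRIGGER accounts_audit
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE PROCEDURE audit_trigger();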
All changes to the database end up in the WAL files, so maybe it's theoretically possible to extract something out of the WAL, but I suspect that's not practical unless you're already a skilled Postgres hacker ... and if you're a skilled Postgres hacker, you probably wouldn't ask this question in the first place ;-) (The WALs may also be used to see the rate of change in the data and spot times of day when there are more updates than usual, and they can be used for replication and roll-forward from a binary backup.)

Besides setting log_statement = 'all' in postgresql.conf, you can also use tablelog to capture the old versions of the data.

Related

Disabling MVCC in Postgres

I've decades of experience with MSSQL but none with Postgres and its MVCC style of concurrency control.
In MSSQL, if I had a very large dataset which was read-only, I would set the database to read-only (for safety) and use transaction isolation level READ UNCOMMITTED, which avoids lock contention that the dataset didn't need anyway.
In Postgres, is there some equivalent? Some way of setting a database to read-only and reassuring PG that it is completely safe not to use MVCC, just read without making row copies? Because it seems that MVCC has some considerable overhead which, for multiple readers of very large passive datasets, seems potentially expensive.
Edit: as the comments point out, I misunderstood; row copies are only made when writing occurs, not when reading as I assumed.
"MVCC" stands for "Multiversion Concurrency Control". Multiple versions of the same table row are only spawned by write activity (mostly UPDATE).
If your database is read-only - whether enforced or voluntary, all the same for the purpose of this question - then there cannot be multiple versions of a row, ever. And the question is moot.
No, there is no way to do that, and there is no reason for it either.
In PostgreSQL, writers never block readers and vice versa, precisely because of the MVCC implementation that you want to disable. So there is no need for the unsavory crutch of reading uncommitted data.
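If you still want the safety of marking the database read-only, a minimal sketch (the database name mydb is an assumption) is:
-- make new sessions in this database start in read-only transactions
ALTER DATABASE mydb SET default_transaction_read_only = on;
-- note: a session can still turn this off explicitly, so it is a guard rail rather than a hard lock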

CLUSTER USING for Postgres in Django (table defragmenting / packing)

Let's say that I'm building a stack exchange clone, and every time I examine a question, I also load each and every answer. The table might look like:
id integer
question_id FOREIGN KEY
answer bool
date timestamp
How can I tell django to tell postgres to keep all the answers together for fast access? Postgres has the underlying feature CLUSTER USING.
(CLUSTER USING is a 'defragmenting' feature for tables. It works especially well for small records, since they may all end up in the same disk block, which greatly reduces load time. The defragmenting is typically done as a batch job at times of low load.)
As far as I can tell, you can't. But you can treat this as a database administration task, and do it from the psql command line:
# CLUSTER table USING index_name;
# ANALYZE VERBOSE table;
# CLUSTER VERBOSE;
This will be remembered. Each time you run CLUSTER VERBOSE it will lock all the tables and sort the data. All your answers (in the example above) will be gathered together on disk. This makes sense even for solid state storage, since the eventual database read will cover fewer sectors, meaning fewer I/O operations to retrieve the group.
Obviously you must pick your index well: the wrong choice can scatter the data you actually access. The performance benefit is greatest for sparse datasets, and becomes less relevant if almost everything is frequently accessed.
A better name for the CLUSTER feature might be "DEFRAG", as this operation is analogous to defragmenting a filesystem.
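For the answers table above, a concrete sketch might look like this (the table and index names are assumptions; Django's actual generated names will differ):
-- index on the foreign key so that rows for the same question sort together
CREATE INDEX answers_question_id_idx ON answers (question_id);

-- sort the table by that index once; the choice of index is remembered
CLUSTER answers USING answers_question_id_idx;

-- periodic batch job: re-clusters every table that has a cluster index set
CLUSTER VERBOSE;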

PostgreSQL temporary table cache in memory?

Context:
I want to store some temporary results in temporary tables. These tables may be reused in several queries that occur close together in time, but at some point the evolutionary algorithm I'm using will no longer need some of the old tables and will keep generating new ones. There will be several queries, possibly concurrent, using those tables, but only one user running all of them. I don't know if that clarifies everything about sessions and so on; I'm still uncertain about how that works.
Objective:
What I would like to do is create temporary tables (if they don't already exist), keep them in memory as far as possible, and if at some point there is not enough memory, drop only those that would otherwise have to be written out to the HDD (I guess those would be the least recently used).
Examples:
The client will be running queries for EMAs with different parameters, and aggregations of them with different coefficients. Each individual may vary in the coefficients used, so the parameters for the EMAs may repeat while they are still in the gene pool, and may not be needed after a while. There will be similar queries with more parameters, and the genetic algorithm will find the right values for them.
Questions:
Is that what "on commit drop" means? I've seen descriptions about sessions and transactions but I don't really understand those concepts. Sorry if the question is stupid.
If it is not, do you know about any simple way to get Postgres to do this?
Workaround:
In the worst case I should be able to make a guesstimate of how many tables I can keep in memory and try to implement the LRU myself, but it's never going to be as good as what Postgres could do.
Thank you very much.
This is a complicated topic and probably one to discuss in some depth. I think it is worth both explaining why PostgreSQL doesn't support this and also what you can do instead with recent versions to approach what you are trying to do.
PostgreSQL has a pretty good approach to caching diverse data sets across multiple users. In general you don't want to allow a programmer to specify that a temporary table must be kept in memory if it becomes very large. Temporary tables however are managed quite differently from normal tables in that they are:
Buffered by the individual back-end, not the shared buffers
Locally visible only, and
Unlogged.
What this means is that typically you aren't generating a lot of disk I/O for temporary tables. The tables do not normally write WAL, and they are managed by the local backend, so they don't affect shared buffer usage. This means that data is only occasionally written to disk, and only when necessary to free memory for other (usually more frequent) tasks. You certainly aren't forcing disk writes, and only need disk reads when something else has used up memory.
The end result is that you don't really need to worry about this. PostgreSQL already tries, to a certain extent, to do what you are asking it to do, and temporary tables have much lower disk I/O requirements than standard tables do. It does not force the tables to stay in memory though and if they become large enough, the pages may expire into the OS disk cache, and eventually on to disk. This is an important feature because it ensures that performance gracefully degrades when many people create many large temporary tables.
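As a sketch of how this looks in practice (the table definitions and the buffer size are illustrative; note that temp_buffers can only be changed before the first use of temporary tables within a session):
SET temp_buffers = '256MB';      -- per-session buffer pool for temporary tables

-- lives until the session ends, or until you drop it yourself
CREATE TEMPORARY TABLE ema_cache (param integer, value numeric);

BEGIN;
-- ON COMMIT DROP means the table disappears when this transaction ends
CREATE TEMPORARY TABLE scratch (x integer) ON COMMIT DROP;
-- ... use scratch here ...
COMMIT;  -- scratch is gone after this point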

Does PostgreSQL cache Prepared Statements like Oracle

I have just moved to PostgreSQL after having worked with Oracle for a few years.
I have been looking into some performance issues with prepared statements in the application (Java, JDBC) with the PostgreSQL database.
Oracle caches prepared statements in its SGA - the pool of prepared statements is shared across database connections.
PostgreSQL documentation does not seem to indicate this. Here's the snippet from the documentation (https://www.postgresql.org/docs/current/static/sql-prepare.html) -
Prepared statements only last for the duration of the current database session. When the session ends, the prepared statement is forgotten, so it must be recreated before being used again. This also means that a single prepared statement cannot be used by multiple simultaneous database clients; however, each client can create their own prepared statement to use.
I just want to make sure that I am understanding this right, because it seems so basic for a database to implement some sort of common pool of commonly executed prepared statements.
If PostgreSQL does not cache these that would mean every application that expects a lot of database transactions needs to develop some sort of prepared statement pool that can be re-used across connections.
If you have worked with PostgreSQL before, I would appreciate any insight into this.
Yes, your understanding is correct. Typically, if you have a set of prepared queries that are that critical, you'd have the application call a custom setup function on connection to create them.
There are three key reasons for this afaik:
There's a long todo list and they get done when a developer is interested/paid to tackle them. Presumably no-one has thought it worth funding yet or come up with an efficient way of doing it.
PostgreSQL runs in a much wider range of environments than Oracle. I would guess that 99% of installed systems wouldn't see much benefit from this. There are an awful lot of setups without high-transaction performance requirement, or for that matter a DBA to notice whether it's needed or not.
Planned queries don't always provide a win. There's been considerable work done on delaying planning/invalidating caches to provide as good a fit as possible to the actual data and query parameters.
I'd suspect the best place to add something like this would be in one of the connection pools (pgbouncer/pgpool) but last time I checked such a feature wasn't there.
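For illustration, a minimal per-connection setup in plain SQL (the statement name and the users table are assumptions), the kind of thing a pool's connection-init hook could run:
-- run once per connection/session
PREPARE get_user (integer) AS
    SELECT * FROM users WHERE id = $1;

-- later calls in the same session reuse the stored statement
EXECUTE get_user(42);

-- a different session will not see get_user and must PREPARE it again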
HTH

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is: as frequently as I reasonably can, but I'll be pragmatic; as a benchmark, let's say we are hoping for every 15 minutes) and feed it into a data warehouse.
How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this drops significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP side is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERS allow selected tables to be targeted rather than being system-wide, the output is configurable (e.g. into a CSV), and they are relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable.
CSV easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... problem is the OLTP schema would need to be tweaked to support this, and that has many negative side effects.
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if it is the first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
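A sketch of that incremental extraction, assuming a hypothetical orders table whose id column is the unique, indexed, sequential key, and run from psql with last_max_key supplied as a variable:
-- e.g. psql -v last_max_key=12345 -f extract.sql  (use 0 on the first run)
COPY (
    SELECT *
    FROM orders
    WHERE id > :last_max_key
    ORDER BY id
) TO '/var/tmp/orders_delta.csv' WITH CSV;
-- note: server-side COPY TO a file needs superuser rights; psql's \copy is the client-side alternative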
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing the filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
If you maintain a 'checksum table' that contains only the ids and a 'checksum' of each row, you can quickly select not only the new records but also the changed and deleted ones.
The checksum could be any checksum function you like, such as crc32.
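A sketch of that comparison, assuming the same hypothetical orders table; the built-in md5() of the row's text form stands in for a crc32 here, since it is available out of the box:
-- snapshot of ids and row checksums, maintained by the ETL job after each extraction
CREATE TABLE orders_checksums (id integer PRIMARY KEY, checksum text);

-- new or changed rows: missing from the snapshot, or present with a different checksum
SELECT o.id,
       CASE WHEN c.id IS NULL THEN 'new' ELSE 'changed' END AS change_type
FROM orders o
LEFT JOIN orders_checksums c ON c.id = o.id
WHERE c.id IS NULL OR c.checksum <> md5(o::text);

-- deleted rows: present in the snapshot but no longer in the source table
SELECT c.id
FROM orders_checksums c
LEFT JOIN orders o ON o.id = c.id
WHERE o.id IS NULL;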
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" approaches (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
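A minimal sketch of that pattern (the dw_orders target table, its id primary key, and the column names are assumptions):
-- staging table with the same shape as the target
CREATE TEMPORARY TABLE staging_orders (LIKE dw_orders);

-- ... COPY / INSERT the extracted delta into staging_orders here ...

-- single-statement upsert into the warehouse table
INSERT INTO dw_orders
SELECT * FROM staging_orders
ON CONFLICT (id) DO UPDATE
    SET amount     = EXCLUDED.amount,      -- illustrative columns only
        updated_at = EXCLUDED.updated_at;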