List transaction timestamps (transaction_timestamp()) of all active (i.e. not yet committed) transactions - postgresql

I would like to query for the lowest transaction_timestamp() value among the currently active (not yet committed nor rolled back) transactions from all connections to my PostgreSQL database (I am not interested in just the transaction_timestamp() of my own session).
I can get the transaction xid-s for such by txid_current_snapshot() and related, but I would like to find out what the transaction_timestamp for them are (or at least the lowest of the timestamps).

You could query the pg_stat_activity view for that:
SELECT MIN(xact_start) FROM pg_stat_activity
WHERE state IN ('active', 'idle in transaction')
You should have a look at the possible states you're interested in.

Related

reason for transaction holding snapshot (postgres, dbeaver)

As per my understanding, I can see that a transaction is holding a snapshot by either of the columns backend_xid or backend_xmin not being NULL in pg_stat_activity.
I am currently investigating cases where backend_xid is not null for sessions from dbeaver and I don't understand why the transaction is requiring a snapshot. This is of interest as long running transaction that are holding a snapshot can cause problems, for autovacuum for instance.
My question is: Can I (serverside) find the reason why a transaction is holding a snapshot? Is there a table where I can see why the transaction is holding a snapshot?
backend_xid is the transaction ID of the session and does not mean that the session has an active snapshot. The documentation says:
Top-level transaction identifier of this backend, if any.
backend_xmin is described as
The current backend's xmin horizon.
“xmin horizon” is PostgreSQL jargon and refers to the lowest transaction ID that was active when the snapshot was taken. It is an upper limit of what VACUUM is allowed to remove.

How do transactions work in the context of reads to the database?

I am using transactions to make changes to a SQL database. As I understand it, this means that changes to the database will happen in an all-or-nothing fashion. What I want to know is, does this have any guarantees for reads? For example, suppose I have some (pseudo)-code like this:
1) start TRANSACTION
2) INSERT INTO users ... // insert some data
3) count = SELECT COUNT(*) FROM ... // count something in the database
4) if count > 10: // do something based on the read
5) INSERT INTO other_table ... // write based on the read
6) COMMMIT TRANSACTION
In this code, I'm doing an INSERT, followed by a SELECT, and then conditionally doing another INSERT based on the outcome of the SELECT.
So my question is, if another process modifies the database between steps (3) and (5), what happens to the count variable, and to my transaction?
If it makes a difference, I am using PostgreSQL.
As Xin pointed out, it depends on the isolation level.
At the default READ COMMITTED level, records from other sessions will become visible as they are committed; you would see the same records if you didn't start a transaction at all (though of course, other processes would see your inserts appear at different times).
With REPEATABLE READ, your queries will not see any records committed by other sessions after your transaction starts. But while you don't have to worry about the result of SELECT COUNT(*) changing during your transaction, you can't assume that this result will still be accurate by the time you commit.
Using SERIALIZABLE provides the strongest guarantee: if your script does the right thing when given exclusive access to the database, then it will do the right thing in the presence of other serialisable transactions (or it will fail outright). However, this means that all transactions which might interfere with yours must be using the same isolation level (which comes at a cost), and all must be prepared to retry their transaction in the event of a serialisation failure.
When serialisable transactions are not an option, you generally guard against race conditions by explicitly locking things against concurrent writes. It's often enough to lock a selection of records, but you can't exactly lock the result of a COUNT(*); in your case, you'd probably need to lock the whole table.
I am not working on postgreSQL, but I think I can answer your question. Think of every query is parallel. I am saying so, because there are 2 transactions: when you insert into a; others can insert into b; then when you check b; whether you can see the new data depends on your isolation setting (read committed or just dirty read).
Also please note that, in database, there is a technology called lock: you can lock a table so that prevent altering it from others before committing your transaction.
Hope

Lock and transaction in postgres that should block a query

Let's assume in SQL window 1 I do:
-- query 1
BEGIN TRANSACTION;
UPDATE post SET title = 'edited' WHERE id = 1;
-- note that there is no explicit commit
Then from another window (window 2) I do:
-- query 2
SELECT * FROM post WHERE id = 1;
I get:
1 | original title
Which is fine as the default isolation level is READ COMMITTED and because query 1 is never committed, the change it performs is not readable until I explicitly commit from window 1.
In fact if I, in window 1, do:
COMMIT TRANSACTION;
I can then see the change if I re-run query 2.
1 | edited
My question is:
Why is query 2 returning fine the first time I run it? I was expecting it to block as the transaction in window 1 was not committed yet and the lock placed on row with id = 1 was (should be) an unreleased exclusive one that should block a read like the one performed in window 2. All the rest makes sense to me but I was expecting the SELECT to get stuck until an explicit commit in window 1 was executed.
The behaviour you describe is normal and expected in any transactional relational database.
If PostgreSQL showed you the value edited for the first SELECT it'd be wrong to do so - that's called a "dirty read", and is bad news in databases.
PostgreSQL would be allowed to wait at the SELECT until you committed or rolled back, but it isn't required to by the SQL standard, you haven't told it you want to wait, and it doesn't have to wait for any technical reason, so it returns the data you asked for immediately. After all, until it's committed, that update only kind-of exists - it still might or might not happen.
If PostgreSQL always waited here, then you'd quickly land up with a situation where only one connection could be doing anything with the database at a time. Not pretty for performance, and totally unnecessary the vast majority of the time.
If you want to wait for a concurrent UPDATE (or DELETE), you'd use SELECT ... FOR SHARE. (But be aware that this won't work for INSERT).
Details:
SELECT without a FOR UPDATE or FOR SHARE clause does not take any row level locks. So it sees whatever is the current committed row, and is not affected by any in-flight transactions that might be modifying that row. The concepts are explained in the MVCC section of the docs. The general idea is that PostgreSQL is copy-on-write, with versioning that allows it to return the correct copy based on what the transaction or statement could "see" at the time it started - what PostgreSQL calls a "snapshot".
In the default READ COMMITTED isolation snapshots are taken at the statement level, so if you SELECT a row, COMMIT a change to it from another transaction, and SELECT it again you'll see different values even within one transation. You can use SNAPSHOT isolation if you don't want to see changes committed after the transaction begins, or SERIALIZABLE isolation to add further protection against certain kinds of transaction inter-dependencies.
See the transaction isolation chapter in the documentation.
If you want a SELECT to wait for in-progress transactions to commit or rollback changes to rows being selected, you must use SELECT ... FOR SHARE. This will block on the lock taken by an UPDATE or DELETE until the transaction that took the lock rolls back or commits.
INSERT is different, though - the tuples just don't exist to other transactions until commit. The only way to wait for concurrent INSERTs is to take an EXCLUSIVE table-level lock, so you know nobody else is changing the table while you read it. Usually the need to do that means you have a design problem in the application though - your app should not care if there are uncommitted inserts still in flight.
See the explicit locking chapter of the documentation.
In PostgreSQL's MVCC implementation, the principle is reading does not block writing and vice-versa. The manual:
The main advantage of using the MVCC model of concurrency control
rather than locking is that in MVCC locks acquired for querying
(reading) data do not conflict with locks acquired for writing data,
and so reading never blocks writing and writing never blocks reading.
PostgreSQL maintains this guarantee even when providing the strictest
level of transaction isolation through the use of an innovative
Serializable Snapshot Isolation (SSI) level.
Each transaction only sees (mostly) what has been committed before the transaction began.
That does not mean there'd be no locking. Not at all. For many operations various kinds of locks are acquired. And various strategies are applied to resolve possible conflicts.

All Queries in a given transaction

I am currently trying to debug "idle in transaction" scenarios in my application.I can find out the process id and transaction start time for a query with state 'idle in transaction' by looking at pg_state_activity.
select pid,query from pg_stat_activity where state='idle in transaction' OR state='idle'
Is there any way to identify list of all queries executed within a transaction corresponding to the query with 'idle in transaction'
Are you attempting to get a list of all previous statements run by the transaction that is now showing up as idle in transaction?
If so, there is no easy and safe way to do so at the SQL level. You should use CSV logging mode to analyze your query history and group queries up into transactions. Handily, you can do this with SQL, by COPYing the CSV into a PostgreSQL table for easier analysis.
Alternately, use ordinary text logs, and set a log_line_prefix that includes the transaction ID and process ID.
(I could've sworn I saw an extension for debugging that collected a query trace, but cannot find it now, and I'm not sure it's that useful as you must run a command on the problem connection to extract the data it's collected).

How to wait during SELECT that pending INSERT commit?

I'm using PostgreSQL 9.2 in a Windows environment.
I'm in a 2PC (2 phase commit) environment using MSDTC.
I have a client application, that starts a transaction at the SERIALIZABLE isolation level, inserts a new row of data in a table for a specific foreign key value (there is an index on the column), and vote for completion of the transaction (The transaction is PREPARED). The transaction will be COMMITED by the Transaction Coordinator.
Immediatly after that, outside of a transaction, the same client requests all the rows for this same specific foreign key value.
Because there may be a delay before the previous transaction is really commited, the SELECT clause may return a previous snapshot of the data. In fact, it does happen sometimes, and this is problematic. Of course the application may be redesigned but until then, I'm looking for a lock solution. Advisory Lock ?
I already solved the problem while performing UPDATE on specific rows, then using SELECT...FOR SHARE, and it works well. The SELECT waits until the transaction commits and return old and new rows.
Now I'm trying to solve it for INSERT.
SELECT...FOR SHARE does not block and return immediatley.
There is no concurrency issue here as only one client deals with a specific set of rows. I already know about MVCC.
Any help appreciated.
To wait for a not-yet-committed INSERT you'd need to take a predicate lock. There's limited predicate locking in PostgreSQL for the serializable support, but it's not exposed directly to the user.
Simple SERIALIZABLE isolation won't help you here, because SERIALIZABLE only requires that there be an order in which the transactions could've occurred to produce a consistent result. In your case this ordering is SELECT followed by INSERT.
The only option I can think of is to take an ACCESS EXCLUSIVE lock on the table before INSERTing. This will only get released at COMMIT PREPARED or ROLLBACK PREPARED time, and in the mean time any other queries will wait for the lock. You can enforce this via a BEFORE trigger to avoid the need to change the app. You'll probably get the odd deadlock and rollback if you do it that way, though, because INSERT will take a lower lock then you'll attempt lock promotion in the trigger. If possible it's better to run the LOCK TABLE ... IN ACCESS EXCLUSIVE MODE command before the INSERT.
As you've alluded to, this is mostly an application mis-design problem. Expecting to see not-yet-committed rows doesn't really make any sense.