Is it safe to rely on Postgres' deadlock detection for concurrency control?

I've been running into occasional deadlocks in my application because two transactions need to update the same rows, but in different orders (e.g., transaction A updates rows X then Y, while transaction B updates rows Y then X).
For various reasons, the traditional approaches to avoiding this kind of deadlock – explicit locking, or updating rows in a consistent order – are less than ideal.
Since the updates I'm trying to perform are otherwise idempotent and order-independent, is it safe and reasonable to simply catch these occasional deadlocks at the application level and retry the transaction?
For example:
def process_update(update):
    attempt = 0
    while attempt < 10:
        try:
            execute("SAVEPOINT foo")
            for row in update:
                execute("UPDATE mytable SET … WHERE …", row)
            execute("RELEASE SAVEPOINT foo")
            return  # success; don't fall through to the retry-exhausted error
        except Deadlock:
            execute("ROLLBACK TO SAVEPOINT foo")
            attempt += 1
    raise Exception("Too many retries")
Is this a reasonable idea? Or are there costs associated with Postgres' deadlock detection that might make it dangerous?

I did a lot of research and experimentation on this for a system that runs 50 to 100 concurrent processes against the same tables. Besides basic deadlocks, there are a number of other transaction failures that can occur. My case includes both read committed and serializable transactions. In no situation did handling this at the application level cause any issues. Fortunately, Postgres fails immediately, so the only performance hit is to the application, with no significant cost to the database.
The key components are catching every type of error, knowing which cases require a rollback, and having an exponential backoff for retries. I found that immediate retries or static sleep times cause processes to simply deadlock each other repeatedly and cause a bit of a domino effect, which makes sense.
This is the complete logic my system requires to handle every concurrency issue (pseudocode):
begin transaction (either read committed or serializable)
while not successful and count < 5
    try
        execute sql
        commit
    except
        if error code is '40P01' or '55P03'
            # deadlock detected or lock not available
            sleep a random time (200 ms to 1 sec) * number of retries
        else if error code is '40001' or '25P02'
            # serialization failure or in-failed-SQL-transaction
            rollback
            sleep a random time (200 ms to 1 sec) * number of retries
            begin transaction
        else if error message is 'There is no active transaction'
            sleep a random time (200 ms to 1 sec) * number of retries
            begin transaction
        increment count
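As a concrete illustration, here is how that might look in Python (a minimal sketch, assuming psycopg2; the function name, SQL placeholder, and retry cap are illustrative, not from the original system). Unlike the pseudocode, the sketch rolls back in every retryable case, since psycopg2 keeps the connection in a failed state until it does:

import random
import time
import psycopg2

# SQLSTATEs worth retrying: deadlock_detected (40P01), lock_not_available (55P03),
# serialization_failure (40001), in_failed_sql_transaction (25P02)
RETRYABLE = {'40P01', '55P03', '40001', '25P02'}

def execute_with_retry(conn, sql, params=None, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
            conn.commit()
            return
        except psycopg2.Error as e:
            if e.pgcode not in RETRYABLE:
                raise
            conn.rollback()  # always safe, and required before the session can be reused
            # jittered backoff that grows with each attempt
            time.sleep(random.uniform(0.2, 1.0) * attempt)
    raise RuntimeError("gave up after %d attempts" % max_attempts)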

Related

Postgres ISOLATION LEVEL

I'd like to ask for help with a problem. I have a program that triggers asynchronous computations in parallel and waits in a loop until they are finished.
I am using Postgres as the database, where I have created a computation_status table that contains the following data when the computations are triggered:
computation     finished
COMPUTATION_A   null
COMPUTATION_B   null
COMPUTATION_C   null
Then I wait in a loop until all computations are finished. The loop accepts a notification for each computation that finishes and triggers a SQL transaction to update its status and check whether any other computation is still running. For example:
T1:
BEGIN TRANSACTION
update computation_status set finished = NOW() where computation = 'COMPUTATION_A'
select exists (select 1 from computation_status where finished is null)
COMMIT
T2:
BEGIN TRANSACTION
update computation_status set finished = NOW() where computation = 'COMPUTATION_B'
select exists (select 1 from computation_status where finished is null)
COMMIT
T3:
BEGIN TRANSACTION
update computation_status set finished = NOW() where computation = 'COMPUTATION_C'
select exists (select 1 from computation_status where finished is null)
COMMIT
And when the last computation is finished the program exits the waiting loop.
What isolation level should I use to avoid these problems? I know I should use at least the READ COMMITTED isolation level to prevent dirty reads, but is that enough? Or is it also possible that phantom reads will occur and I should use REPEATABLE READ? (I'm not sure whether an UPDATE counts as a READ too.)
I want to avoid the race where, for example, computations A and B finish at the same time as the last two remaining: T1 sets A=finished and reads that B is not finished, while T2 sets B=finished and reads that A is not finished. Then neither transaction sees that everything is done, and my application ends up in an infinite loop.
To avoid race conditions here, you have to effectively serialize the transactions.
The only isolation level where that works reliably is SERIALIZABLE. However, it incurs a performance penalty, and you have to be ready to repeat transactions when a serialization error is thrown. If more than one of these transactions runs concurrently, a serialization error will be thrown.
The alternative would be to use locks, but that is not very appealing: using row locks would lead to deadlocks, and using table locks would block autovacuum, which would eventually bring your system down.
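For illustration, the update-and-check from the question could be wrapped in a retry loop like this (a minimal sketch, assuming psycopg2; mark_finished is a hypothetical helper, and the connection is assumed to have been set to SERIALIZABLE once via conn.set_session):

import psycopg2

def mark_finished(conn, computation):
    # assumes conn.set_session(isolation_level='SERIALIZABLE') was called once
    while True:
        try:
            with conn.cursor() as cur:
                cur.execute("UPDATE computation_status SET finished = NOW()"
                            " WHERE computation = %s", (computation,))
                cur.execute("SELECT EXISTS (SELECT 1 FROM computation_status"
                            " WHERE finished IS NULL)")
                others_running = cur.fetchone()[0]
            conn.commit()
            return others_running
        except psycopg2.errors.SerializationFailure:
            conn.rollback()  # SQLSTATE 40001: lost the race, just retry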

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that can accept data up to 5 seconds old query the replica servers, using REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms, so I know this kind of stall is rare: maybe a couple of queries a day. We average around 200-400 transactions per second during normal load, with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the error 40001 faster, even if it causes more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions" but I could set the timeout only for transactions by this client that's able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like another session had the parent row of the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If it failed rather than blocking, and upon failure you retried the exact same thing, the same parent row would (most likely) still be locked, so it would just fail again and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail, a retry should succeed.
An obvious exception to blocking-better-than-failing is when, on retry, you can pick a different parent row, if that makes sense in your context. In that case, maybe the best thing to do is explicitly lock the parent row with NOWAIT before attempting the insert. That way you can deal with failures in a more nuanced way, as in the sketch below.
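For example (a minimal sketch, assuming psycopg2 and that parent_id references restservices itself, as the FOR KEY SHARE query in the error context suggests; conn, parent_id, and new_id are placeholders, and the column list is abbreviated):

import psycopg2

try:
    with conn.cursor() as cur:
        # Fail fast (SQLSTATE 55P03) instead of blocking if the parent row is locked
        cur.execute("SELECT 1 FROM restservices WHERE id = %s"
                    " FOR KEY SHARE NOWAIT", (parent_id,))
        cur.execute("INSERT INTO restservices (id, parent_id)"
                    " VALUES (%s, %s)", (new_id, parent_id))
    conn.commit()
except psycopg2.errors.LockNotAvailable:
    conn.rollback()  # parent row locked: pick another parent or back off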
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)

Is it possible that both transactions rollback during a deadlock or serialization error?

In PostgreSQL (and other MVCC databases), transactions can roll back due to a deadlock or serialization error. Assuming two transactions are currently running, is it ever possible that both transactions, instead of just one, will fail due to this kind of error?
The reason why I am asking is that I am writing a retry implementation. If both transactions can fail, we might end up in a never-ending loop of retries if both retry immediately. If only one transaction can fail, I don't see any harm in retrying as soon as possible.
Yes. A deadlock can involve more than two transactions, in which case more than one may be terminated. But that is normally an extremely rare condition.
If just two transactions deadlock, one survives. The manual:
PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete.
Serialization failures only happen at the REPEATABLE READ or SERIALIZABLE isolation level. I am not aware of any particular limit on how many serialization failures can happen concurrently, but I have also never heard of any need to delay retrying.
I would retry as soon as possible either way.

How to debug ShareLock in Postgres

I am seeing quite a few occurrences of the following in my Postgres server log:
LOG: process x still waiting for ShareLock on transaction y after 1000.109 ms
DETAIL: Process holding the lock: z. Wait queue: x.
CONTEXT: while inserting index tuple (a,b) in relation "my_test_table"
SQL function "my_test_function" statement 1
...
LOG: process x acquired ShareLock on transaction y after 1013.664 ms
CONTEXT: while inserting index tuple (a,b) in relation "my_test_table"
I am running Postgres 9.5.3. In addition I am running on Heroku so I don't have access to the fine grained superuser-only debugging tools.
I am wondering how best to debug such an issue given these constraints and the fact that each individual lock is relatively transient (generally 1000-2000 ms).
Things I have tried:
Monitoring pg_locks (and joining to pg_class for context).
Investigating pageinspect.
Replicating locally both by hand and with pgbench where I do have superuser perms. I have so far been unable to replicate the issue locally (I suspect due to having a much smaller data set but I can't be sure).
It is worth noting that CPU utilisation appears high (load average of >1) when I see these issues so it's possible there is nothing wrong with the above per se and that I'm seeing it as a consequence of insufficient system resources being available. I would still like to understand how best to debug it though so I can understand what exactly is happening.
The key thing is that it's a ShareLock on the transaction.
This means that one transaction is waiting for another to commit/rollback before it can proceed. It's only loosely a "lock". What's happening here is that a PostgreSQL transaction takes an ExclusiveLock on its own transaction ID when it starts. Other transactions that want to wait for it to finish can try to acquire a ShareLock on the transaction, which will block until the ExclusiveLock is released on commit/abort. It's basically using the locking mechanism as a convenience to implement inter-transaction completion signalling.
This usually happens when the waiting transaction(s) are trying to INSERT a UNIQUE or PRIMARY KEY value for a row that's recently inserted/modified by the waited-on transaction. The waiting transactions cannot proceed until they know the outcome of the waited-on transaction - whether it committed or rolled back, and if it committed, whether the target row got deleted/inserted/whatever.
That's consistent with what's in your error message: proc "x" is trying to insert into "my_test_table" and has to wait until proc "z" commits or rolls back xact "y" to find out whether to raise a unique violation or whether it can proceed.
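The wait is easy to reproduce in isolation (a sketch with two psycopg2 connections; the DSN is a placeholder, and lock_timeout is set only so the second insert errors out instead of hanging):

import psycopg2

c1 = psycopg2.connect("dbname=test")  # placeholder DSN
c2 = psycopg2.connect("dbname=test")
cur1, cur2 = c1.cursor(), c2.cursor()

cur1.execute("CREATE TABLE IF NOT EXISTS my_test_table (id int PRIMARY KEY)")
c1.commit()

cur1.execute("INSERT INTO my_test_table VALUES (1)")  # left uncommitted
cur2.execute("SET lock_timeout = '500ms'")
try:
    # Blocks waiting for a ShareLock on c1's transaction ID,
    # exactly like the log lines above, then times out.
    cur2.execute("INSERT INTO my_test_table VALUES (1)")
except psycopg2.errors.LockNotAvailable:
    print("second insert waited on the first transaction's XID")
c2.rollback()
c1.rollback()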
Most likely you have contention in some kind of upsert or queue processing system. This can also happen if you have some function/transaction pattern that tries to insert into a heavily contended table, then does a lot of other time consuming work before it commits.
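One low-privilege way to catch these in the act is to poll pg_locks for ungranted transaction-ID locks, pairing each waiter with the session holding the ExclusiveLock on that XID (a sketch, assuming a psycopg2 connection; pg_locks is readable without superuser, which fits the Heroku constraint):

import psycopg2

def show_transaction_waits(conn):
    # Pair each ungranted transaction-ID lock (the waiter) with the
    # session that holds the lock on that same transaction ID.
    with conn.cursor() as cur:
        cur.execute("""
            SELECT waiter.pid, holder.pid, waiter.transactionid
            FROM pg_locks waiter
            JOIN pg_locks holder
              ON holder.transactionid = waiter.transactionid
             AND holder.granted
             AND holder.pid <> waiter.pid
            WHERE waiter.locktype = 'transactionid'
              AND NOT waiter.granted
        """)
        for waiting_pid, holding_pid, xid in cur.fetchall():
            print("pid %s waits on xact %s held by pid %s"
                  % (waiting_pid, xid, holding_pid))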

Handle locks in Redshift

I have a Python script that executes multiple SQL scripts (one after another) in Redshift. Some of the tables in these SQL scripts can be queried multiple times. For example, table t1 can be SELECTed in one script and dropped/recreated in another. This whole process runs in one transaction. Now, sometimes I get a "deadlock detected" error and the whole transaction is rolled back. If there is a deadlock on a table, I would like to wait for the table to be released and then retry the SQL execution. For other types of errors, I would like to roll back the transaction. From the documentation, it looks like a table lock isn't released until the end of the transaction. I would like to achieve all-or-nothing data changes (which the transaction accomplishes) but also handle deadlocks. Any suggestion on how this can be accomplished?
I would execute all of the SQL you are referring to in one transaction with a retry loop. Below is the logic I use to handle concurrency issues and retry (pseudocode for brevity). I do not have the system wait indefinitely for the lock to be released. Instead I handle it in the application by retrying over time.
begin transaction
while not successful and count < 5
    try
        execute sql
        commit
    except
        if error code is '40P01' or '55P03'
            # deadlock detected or lock not available
            sleep a random time (200 ms to 1 sec) * number of retries
        else if error code is '40001' or '25P02'
            # serialization failure or in-failed-SQL-transaction
            rollback
            sleep a random time (200 ms to 1 sec) * number of retries
            begin transaction
        else if error message is 'There is no active transaction'
            sleep a random time (200 ms to 1 sec) * number of retries
            begin transaction
        increment count
The key components are catching every type of error, knowing which cases require a rollback, and having an exponential backoff for retries.
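Applied to the batch-of-scripts case, the whole transaction is replayed on failure, which preserves the all-or-nothing behavior (a minimal sketch, assuming psycopg2 connected to the Redshift cluster; run_scripts_atomically and the retry cap are illustrative):

import random
import time
import psycopg2

RETRYABLE = {'40P01', '55P03', '40001', '25P02'}

def run_scripts_atomically(conn, scripts, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            with conn.cursor() as cur:
                for sql in scripts:   # every script runs in the same transaction
                    cur.execute(sql)
            conn.commit()             # all-or-nothing commit point
            return
        except psycopg2.Error as e:
            conn.rollback()           # release locks before sleeping
            if e.pgcode not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(random.uniform(0.2, 1.0) * attempt)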