Is it possible that both transactions rollback during a deadlock or serialization error? - postgresql

In PostgreSQL (and other MVCC databases), transactions can rollback due to a deadlock or serialization error. Assume two transactions are currently running, is it ever possible that both, instead of just one, transaction will fail due to this kind of errors?
The reason why I am asking is that I am writing a retry implementation. If both transactions can fail, we might end up in a never-ending loop of retries if both retry immediately. If only one transaction can fail, I don't see any harm in retrying as soon as possible.

Yes. A deadlock can involve more than two transactions. In this case more than one may be terminated. But this is an extremely rare condition. Normally.
If just two transactions deadlock, one survives. The manual:
PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete.
Serialization failures only happen in REPEATABLE READ or SERIALIZABLE transaction isolation. I wouldn't know of any particular limit to how many serialization failures can happen concurrently. But I also never heard of any necessity to delay retrying.
I would retry as soon as possible either way.

Related

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master master and two async hot-standby replica servers and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL to return the error 40001 faster even if it caused more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions" but I could set the timeout only for transactions by this client that's able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking-better-than-failing being if when you retry, you can pick a different parent row to retry with, if that make sense in your context. In this case, maybe the best thing to do is explicitly lock the parent row with NOWAIT before attempting the insert. That way you can perhaps deal with failures in a more nuanced way.
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)

maxTransactionLockRequestTimeoutMillis with concurrent transactions

I'm trying to get a better understanding of the lock acquisition behavior on MongoDB transactions. I have a scenario where two concurrent transactions try to modify the same document. Since one transaction will get the write lock on the document first, the second transaction will run into a write conflict and fail.
I stumbled upon the maxTransactionLockRequestTimeoutMillis setting as documented here: https://docs.mongodb.com/manual/reference/parameters/#param.maxTransactionLockRequestTimeoutMillis and it states:
The maximum amount of time in milliseconds that multi-document transactions should wait to acquire locks required by the operations in the transaction.
However, changing this value does not seem to have an impact on the observed behavior with a write conflict. Transaction 2 does not seem to wait for the lock to be released again but immediately runs into a write conflict when another transaction holds the lock (other than concurrent writes outside a transaction which will block and wait for the lock).
Do I understand correctly that the configured time in maxTransactionLockRequestTimeoutMillis does not include the act of actually receiving the write lock on the document or is there something wrong with my tests?

Condition for deadlock to happen

Deadlock-
A deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.
For deadlock to happen all these four conditions must hold simultaneously
Mutual exclusion
Hold and wait
No preemption
Circular wait
we apply deadlock detection algorithm to check whether the system is in deadlock or not. But if any of the above criterion fails(For example No preemption fails, so some resource is being released) which incur the system to be deadlock free. So what I think, if the deadlock detection algorithm finds the state to be unsafe and all the above four criterion holds true simultaneously then we can say the system is in deadlock.
Unsafe state may or may not lead to deadlock.
But unsafe state with all these 4 conditions holding simultaneously must incur deadlock.
Am I thinking right?
I have another question in my mind. How can we say deadlock has occurred definitely because the next moment some process may release their resources to get rid of deadlock.
Am I thinking right?
Yes, you are correct.
See this link to see why unsafe may not lead to deadlock.
I have another question in my mind. How can we say deadlock has occurred definitely because the next moment some process may release their resources to get rid of deadlock.
Say deadlock has occurred. All processes causing the deadlock are waiting for some resource to be acquired. And because of "No preemption" no such process will get preempted and therefore release resources. Also because of "Hold and wait" property, process needs some more resources to continue but is not going to give up or release whatever it is holding now and will wait till its required resources are met. Once there is deadlock, nothing can happen (there cannot be any progress) until you break one of the above condition. Breaking a condition will make some other process to meet its requirements and ensure progress and completion.

Can committing an transaction in PostgreSQL fail?

If I execute some SQL inside a transaction successfully, can it happens that the commit will fail? And what are possible causes? Can it fail related to the executed queries, or just due to some DB side issues?
The question comes up because I need to judge if it makes sense to commit transactions inside tests or if it is "safe enough" to just rollback after each test case.
If I execute some SQL inside a transaction successfully, can it happens that the commit will fail?
Yes.
And what are possible causes?
DEFERRABLE constraints with SET CONSTRAINTS DEFERRED or in a one-statement autocommit transaction. (Can't happen unless you use DEFERRABLE constraints)
SERIALIZABLE transaction with serialization failure detected at commit time. (Can't happen unless you use SERIALIZABLE transactions)
Asynchronous commit where the DB crashes or is shut down. (Can't happen if synchronous_commit = on, the default)
Disk I/O error, filesystem error, etc
Out-of-memory error
Network error leading to session disconnect after you send the commit but before you get confirmation of success. In this case you don't know for sure if it committed or not.
... probably more
Can it fail related to the executed queries, or just due to some DB side issues?
Either. A serialization failure, for example, is definitely related to the queries run.
If you're using READ COMMITTED isolation with no deferred constraints then commits are only likely to fail due to underlying system errors.
The question comes up because I need to judge if it makes sense to commit transactions inside tests or if it is "safe enough" to just rollback after each test case.
Any sensible test suite has to cover multiple concurrent transactions interacting, committing in different orders, etc.
If all you test is single standalone transactions you're not testing the real system.
So the question is IMO moot, because a decent suite of tests has to commit anyway.

Does MongoDB fail silently if I don't check error codes?

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fireAndForget. You'll indeed miss all errors which could arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry I always fail to find the official, non driver related documentation, I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, with NONE no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronous, as this is against the intention. Your connection which sent the write operation, may be already closed or reused, so you can't sent it through that connection. Further more only your actual code knows what to do if it fails. As mongoDB doesn't offer some remote procedure call to asynchronous inform you of updates you'll have to wait until the write finished to a given stage.
So the fastest, but most unrelieable is SAFE, where the write only happened to memory. JOURNAL gives you the security that it was written at least to disk. With FSYNC you'll have those changes persisted on your db on disk. REPLICA that a least two replicas have written it, and MAJORITY that more than half of your replicas have written it(by three replicas which should be the default this doesn't differ).
The only chance I see to have something like asynchronous, is to have a separate Thread who is performing all write operations synchronous. This thread you could handle the actual update as well as a class which is called in case of a failure to perform the needed operations to handle this failure. But I don't think that this is good application design.
Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."