Can multiple threads cause duplicate updates on constrained set? - postgresql

In postgres if I run the following statement
update table set col = 1 where col = 2
In the default READ COMMITTED isolation level, from multiple concurrent sessions, am I guaranteed that:
In the case of a single match, only one thread will get a ROWCOUNT of 1 (meaning only one thread writes)
In the case of multiple matches, only one thread will get a ROWCOUNT > 0 (meaning only one thread writes the batch)

Your stated guarantees apply in this simple case, but not necessarily in slightly more complex queries. See the end of the answer for examples.
The simple case
Assuming that col is unique, has exactly one value "2", or has stable ordering so every UPDATE matches the same rows in the same order:
What'll happen for this query is that the threads will find the row with col=2 and all try to grab a write lock on that tuple. Exactly one of them will succeed. The others will block waiting for the first thread's transaction to commit.
That first tx will write, commit, and return a rowcount of 1. The commit will release the lock.
The other tx's will again try to grab the lock. One by one they'll succeed. Each transaction will in turn go through the following process:
Obtain the write lock on the contested tuple.
Re-check the WHERE col=2 condition after getting the lock.
The re-check will show that the condition no longer matches so the UPDATE will skip that row.
The UPDATE has no other rows so it will report zero rows updated.
Commit, releasing the lock for the next tx trying to get hold of it.
In this simple case the row-level locking and the condition re-check effectively serializes the updates. In more complex cases, not so much.
You can easily demonstrate this. Open say four psql sessions. In the first, lock the table with BEGIN; LOCK TABLE test;*. In the rest of the sessions run identical UPDATEs - they'll block on the table level lock. Now release the lock by COMMITting your first session. Watch them race. Only one will report a row count of 1, the others will report 0. This is easily automated and scripted for repetition and scaling up to more connections/threads.
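A minimal sketch of that demonstration, assuming a throwaway table named test with a single integer column col (the setup is an assumption, not from the question):

-- Session 1: create the test data and take a table lock so the other sessions queue up
CREATE TABLE test (col integer);
INSERT INTO test VALUES (2);
BEGIN;
LOCK TABLE test;

-- Sessions 2..4: run the same statement; each blocks on the table-level lock
UPDATE test SET col = 1 WHERE col = 2;

-- Session 1: release the lock and let the blocked sessions race
COMMIT;
-- Exactly one of sessions 2..4 reports "UPDATE 1"; the rest report "UPDATE 0".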
To learn more, read rules for concurrent writing, page 11 of PostgreSQL concurrency issues - and then read the rest of that presentation.
And if col is non-unique?
As Kevin noted in the comments, if col isn't unique, so you might match multiple rows, then different executions of the UPDATE could get different orderings. This can happen if they choose different plans (say one goes via PREPARE and EXECUTE and another is direct, or you're messing with the enable_ GUCs) or if the plan they all use produces an unstable sort of equal values. If they get the rows in a different order then tx1 will lock one tuple, tx2 will lock another, and then each will try to get a lock on the other's already-locked tuple. PostgreSQL will abort one of them with a deadlock exception. This is yet another good reason why all your database code should always be prepared to retry transactions.
If you're careful to make sure concurrent UPDATEs always get the same rows in the same order you can still rely on the behaviour described in the first part of the answer.
Frustratingly, PostgreSQL doesn't offer UPDATE ... ORDER BY, so ensuring that your updates always select the same rows in the same order isn't as simple as you might wish. A SELECT ... FOR UPDATE ... ORDER BY followed by a separate UPDATE is often safest.
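For example, a hedged sketch of that pattern, assuming the rows also have a unique id column to order by:

BEGIN;
-- Lock the matching rows in a deterministic order first...
SELECT id FROM test WHERE col = 2 ORDER BY id FOR UPDATE;
-- ...then update them; every concurrent session queues on the same first row
-- instead of locking different rows in different orders and deadlocking.
UPDATE test SET col = 1 WHERE col = 2;
COMMIT;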
More complex queries, queuing systems
If you're doing queries with multiple phases, involving multiple tuples, or conditions other than equality you can get surprising results that differ from the results of a serial execution. In particular, concurrent runs of anything like:
UPDATE test SET col = 1 WHERE col = (SELECT t.col FROM test t ORDER BY t.col LIMIT 1);
or other efforts to build a simple "queue" system will *fail* to work how you expect. See the PostgreSQL docs on concurrency and this presentation for more info.
If you want a work queue backed by a database there are well-tested solutions that handle all the surprisingly complicated corner cases. One of the most popular is PgQ. There's a useful PgCon paper on the topic, and a Google search for 'postgresql queue' is full of useful results.
* BTW, instead of a LOCK TABLE you can use SELECT 1 FROM test WHERE col = 2 FOR UPDATE; to obtain a write lock on just that one tuple. That'll block updates against it but won't block writes to other tuples or block any reads. That allows you to simulate different kinds of concurrency issues.

Is it safe to join "pg_locks" table in production code?

I have a queue-like table t1 holding timestamp-ordered data. One of its columns is a foreign key, ext_id. I have a number of workers that process rows from t1 and remove them once their job is done. The outcome of the processing is upserting rows into another table t2. t2 also references ext_id, but in this case the relation is unique: if a row pointing to the particular ext_id already exists, I want to update it instead of inserting.
As long as a single worker is processing the data, the task is fairly simple. When multiple workers are brought into play, the SKIP LOCKED clause comes to the rescue: each thread locks the row that it is processing and makes it invisible to other threads. SKIP LOCKED guarantees that threads are not interfering with each other in terms of the source table t1. The problem is that they can still try to insert rows into table t2 simultaneously. Since there is a uniqueness constraint on t2, this can yield an error if multiple workers select t1 rows sharing an ext_id. Since the constraint raises an error, I can simply retry processing of a particular row, but then I lose the guarantee of processing order (not to mention that exception-based flow control feels like a serious anti-pattern).
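A minimal sketch of the per-worker dequeue step described above (the created_at and payload column names are assumptions, not from the actual schema):

BEGIN;
-- Grab the oldest row that no other worker has locked; locked rows are skipped, not waited on
SELECT id, ext_id, payload
FROM t1
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ... process the row, upsert the result into t2, delete the t1 row, then ...
COMMIT;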
I considered adding an auxiliary "synchronization" table (let's call it sync) that would hold an entry for each ext_id currently being processed. The processing becomes more complicated then: I need to commit the sync row insertion before actually starting to process, so that other threads can use this information to select t1 rows that are safe to process. The t1 row selection can join the auxiliary table and match the first row whose ext_id is not present in the sync table. It is still possible that concurrent threads will select consecutive rows and try to insert synchronization rows pointing to the same ext_id. If that happens I need to retry the t1 row selection.
The second approach solves concurrent processing of conflicting t1 rows and guarantees that row order is maintained (within the partitions defined by ext_id values). What it fails to solve is the dirty flow-control structure based on failed insertions.
PostgreSQL provides an advisory lock mechanism, which allows building application-specific custom synchronization logic.
A quote from the explicit locking documentation:
For example, a common use of advisory locks is to emulate pessimistic locking strategies typical of so-called “flat file” data management systems. While a flag stored in a table could be used for the same purpose, advisory locks are faster, avoid table bloat, and are automatically cleaned up by the server at the end of the session.
Usage of advisory locks solves exactly this problem and, according to the documentation, should yield better performance. When a process selects a row, it also obtains an advisory lock parameterized with ext_id. If another process tries to select a conflicting row, it will have to wait until the other process's lock is released.
This is much better, but in some cases it will prohibit a subset of workers from performing their tasks simultaneously and make them perform those tasks in sequence. What those workers could do instead of waiting is try fetching another row: something the sync-based solution solved by excluding t1 rows with an outer join.
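One hedged sketch of that idea combines SKIP LOCKED with a non-blocking, transaction-scoped advisory lock per ext_id (column names as in the earlier sketch; ext_id is assumed to fit in a bigint, and pg_try_advisory_xact_lock() in a WHERE clause may be evaluated for rows the query later discards, so treat this as a starting point rather than a drop-in solution):

SELECT id, ext_id
FROM t1
WHERE pg_try_advisory_xact_lock(ext_id)   -- skip ext_ids another worker already holds
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- The advisory lock is released automatically when the transaction commits or rolls back.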
After this lengthy introduction, finally, the question:
Existing advisory locks can be inspected by querying the pg_locks view. This view can be joined like a regular relation within queries. It is tempting to join it while fetching the next t1 row, to exclude rows that are currently unprocessable due to an existing lock. Since pg_locks is not a regular table, I have some doubts about whether this approach is safe.
Is it?

How to avoid being blocked by deadlocks?

Can I write an UPDATE statement that will simply not bother executing if there's a deadlock?
I have a small, but frequently updated table.
This statement is run quite frequently on it....
UPDATE table_a SET lastChangedTime = 'blah' WHERE pk = 1234;
Where pk is the primary key.
Every now and again this statement gets blocked. That's not in itself a big deal; the issue is that each time there's a lock it seems to take a minute or two for Postgres to sort itself out, and I can lose a lot of data.
table_a is very volatile, and lastChangedTime gets altered all the time, so rather than occasionally having to wait two minutes for the UPDATE to get executed, I'd rather it just didn't bother. Ok, my data might not be as up-to-date as I'd like for this one record, but at least I wouldn't have locked the whole table for 2 minutes.
Update following comments:
The application interacts very simply with the database: it only issues simple, one-line UPDATE and INSERT statements, and commits each one immediately. One of the things causing me a lot of head scratching is how something can work a million times without a problem, and then just fail on another record that appears to be identical to all the others.
Final suggestion/question.....
The UPDATE statement is being invoked from a C# application. If I change the 'command timeout' to a very short value - say 1 millisecond would that have the desired effect? or might it end up clogging up the database with lots of broken transactions?
To avoid waiting for locks, first run
SELECT 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;
If there is a lock on the row, the statement will fail immediately, and you can go on working on something different.
Mind that the SELECT ... FOR UPDATE statement has to be in the same database transaction as your UPDATE statement.
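Putting it together, a minimal sketch (error handling is left to the application; if the row is locked, the SELECT fails at once with SQLSTATE 55P03 and you simply skip this update attempt):

BEGIN;
SELECT 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;  -- fails immediately if the row is locked
UPDATE table_a SET lastChangedTime = 'blah' WHERE pk = 1234;
COMMIT;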
As general advice, you should use shorter transactions, which will reduce the length of lock waits and the risk of deadlock.

PostgreSQL Serialized Inserts Interleaving Sequence Numbers

I have multiple processes inserting into a Postgres (10.3) table using the SERIALIZABLE isolation level.
Another part of our system needs to read these records and be guaranteed that it receives all of them in sequence. For example, in the picture below, the consumer would need to
select * from table where sequenceNum > 2309 limit 5
and then receive sequence numbers 2310, 2311, 2312, 2313 and 2314.
The reading query is using the READ COMMITTED isolation level.
What I'm seeing though is that the reading query is only receiving the rows I've highlighted in yellow. Looking at the xmin, I'm guessing that transaction 334250 had begun but not finished, then transactions 334251, 334252 et al started and finished prior to my reading query starting.
My question is: how did they get sequence numbers interleaved with those of 334250? Why weren't those transactions blocked by merit of all of the writing transactions being serialized?
Any suggestions on how to achieve what I'm after, which is a guarantee that different transactions don't generate interleaving sequence numbers? (It's OK if there are gaps, but they can't interleave.)
Thanks very much for your help. I'm losing hair over this one!
PS - I just noticed that 334250 has a non zero xmax. Is that a clue that I'm missing perhaps?
The SQL standard in its usual brevity defines SERIALIZABLE as:
The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.
In the light of this definition, I understand that your wish is that the sequence numbers be in the same order as the “serial execution” that “produces the same effect”.
Unfortunately the equivalent serial ordering is not clear at the time the transactions begin, because statements later in the transaction can determine the “logical” order of the transactions.
Sequence numbers on the other hand are ordered according to the wall time when the number was requested.
In a way, you would need sequence numbers that are determined by something that is not certain until the transactions commit, and that is a contradiction in terms.
So I think that it is not possible to get what you want, unless you actually serialize the execution, e.g. by locking the table in SHARE ROW EXCLUSIVE mode before you insert the data.
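A hedged sketch of that workaround, with assumed table, column, and sequence names:

BEGIN;
-- SHARE ROW EXCLUSIVE conflicts with itself, so only one inserting transaction runs at a time
LOCK TABLE events IN SHARE ROW EXCLUSIVE MODE;
INSERT INTO events (sequence_num, payload)
VALUES (nextval('events_seq'), '...');
COMMIT;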
My question is why you have that unusual demand. I cannot think of a good reason.

Why are lock hints needed on an atomic statement?

Question
What is the benefit of applying locks to the below statement?
Similarly, what issue would we see if we didn't include these hints? i.e. do they prevent a race condition, improve performance, or maybe something else? I'm asking because perhaps they're included to prevent some issue I've not considered, rather than the race condition I'd assumed.
NB: This is an overflow from a question asked here: SQL Threadsafe UPDATE TOP 1 for FIFO Queue
The Statement In Question
WITH nextRecordToProcess AS
(
SELECT TOP(1) Id, StatusId
FROM DemoQueue
WHERE StatusId = 1 --Ready for processing
ORDER BY DateSubmitted, Id
)
UPDATE nextRecordToProcess
SET StatusId = 2 --Processing
OUTPUT Inserted.Id
Requirement
The SQL is used to retrieve an unprocessed record from a queue.
The record to be obtained should be the first record in the queue with status Ready (StatusId = 1).
There may be multiple workers/sessions processing messages from this queue.
We want to ensure that each record in the queue is only picked up once (i.e. by a single worker), and that each worker processes messages in the order in which they appear in the queue.
It's OK for one worker to work faster than another (i.e. if Worker A picks up record 1 then Worker B picks up record 2 it's OK if worker B completes the processing of record 2 before Worker A has finished processing record 1). We're only concerned within the context of picking up the record.
There's no ongoing transaction; i.e. we just want to pick up the record from the queue; we don't need to keep it locked until we come back to progress the status from Processing to Processed.
Additional SQL for Context:
CREATE TABLE Statuses
(
Id SMALLINT NOT NULL PRIMARY KEY CLUSTERED
, Name NVARCHAR(32) NOT NULL UNIQUE
)
GO
INSERT Statuses (Id, Name)
VALUES (0,'Draft')
, (1,'Ready')
, (2,'Processing')
, (3,'Processed')
, (4,'Error')
GO
CREATE TABLE DemoQueue
(
Id BIGINT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
, StatusId SMALLINT NOT NULL FOREIGN KEY REFERENCES Statuses(Id)
, DateSubmitted DATETIME --will be null for all records with status 'Draft'
)
GO
Suggested Statement
In the various blogs discussing queues, and in the question which caused this discussion, it's suggested that the above statement be changed to include lock hints as below:
WITH nextRecordToProcess AS
(
SELECT TOP(1) Id, StatusId
FROM DemoQueue WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE StatusId = 1 --Ready for processing
ORDER BY DateSubmitted, Id
)
UPDATE nextRecordToProcess
SET StatusId = 2 --Processing
OUTPUT Inserted.Id
My Understanding
I understand that were locking required the benefits of these hints would be:
UPDLOCK: Because we're selecting the record to update its status, we need to ensure that any other session reading this record after we've read it but before we've updated it won't be able to read the record with the intent to update it (or rather, such a statement would have to wait until we've performed our update and released the lock before the other session could see our record with its new value).
ROWLOCK: Whilst we're locking the record, we want to ensure that our lock only impacts the row we're locking; i.e. as we don't need to lock many resources / we don't want to impact other processes / we want other sessions to be able to read the next available item in the queue even if that item's in the same page as our locked record.
READPAST: If another session is already reading an item from the queue, rather than waiting for that session to release its lock, our session should pick the next available (not locked) record in the queue.
i.e. Were we running the below code I think this would make sense:
DECLARE @nextRecordToProcess BIGINT
BEGIN TRANSACTION
SELECT TOP (1) @nextRecordToProcess = Id
FROM DemoQueue WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE StatusId = 1 --Ready for processing
ORDER BY DateSubmitted, Id
--and then in a separate statement
UPDATE DemoQueue
SET StatusId = 2 --Processing
WHERE Id = @nextRecordToProcess
COMMIT TRANSACTION
--@nextRecordToProcess is then returned either as an out parameter or by including a `SELECT @nextRecordToProcess Id`
However when the select and update occur in the same statement I'd have assumed that no other session could read the same record between our session's read & update; so there'd be no need for explicit lock hints.
Have I misunderstood something fundamentally with how locks work; or is the suggestion for these hints related to some other similar but different use case?
John is right in that these are optimizations, but in the SQL world these optimizations can mean the difference between 'fast' and 'unbearably size-of-data slow', and/or the difference between 'works' and 'unusable deadlock mess'.
The readpast hint is clear. For the other two, I feel I need to add a bit more context:
The ROWLOCK hint is there to prevent page-granularity lock scans. The lock granularity (row vs. page) is decided up front when the query starts, based on an estimate of the number of pages the query will scan (the third granularity, table, is only used in special cases and does not apply here). Normally dequeue operations should never have to scan so many pages that page granularity is considered by the engine. But I've seen cases 'in the wild' where the engine decided to use page lock granularity, and this leads to blocking and deadlocks in the dequeue.
UPDLOCK is needed to prevent the lock-upgrade deadlock scenario. The UPDATE statement is logically split into a search for the rows that need to be updated, followed by the update of those rows. The search needs to lock the rows it evaluates. If a row qualifies (meets the WHERE condition) then it is updated, and an update always takes an exclusive lock. So the question is how you lock the rows during the search. If you use a shared lock then two UPDATEs can look at the same row (the shared lock allows it), both decide the row qualifies, and both try to upgrade the lock to exclusive -> deadlock. If you use exclusive locks during the search the deadlock cannot happen, but then the UPDATE conflicts with any other read on every row it evaluates, even rows that do not qualify (not to mention that exclusive locks cannot be released early without breaking two-phase locking). This is why there is a U mode lock, one that is compatible with shared locks (so that an UPDATE's evaluation of candidate rows does not block reads) but is incompatible with another U lock (so that two UPDATEs do not deadlock). There are two reasons why the typical CTE-based dequeue needs this hint:
because it is a CTE, the query processor does not always understand that the SELECT inside the CTE is the target of an UPDATE and should use U mode locks, and
the dequeue operation will always go after the same rows to update (the rows being 'dequeued'), so deadlocks are frequent.
tl;dr
They're for performance optimisation in a high concurrency dedicated queue table scenario.
Verbose
I think I've found the answer via a related SO answer by the quoted blog's author.
It seems that this advice is for a very specific scenario: one where the table being used as the queue is dedicated as a queue, i.e. the table is not used for any other purpose. In such a scenario the lock hints make sense. They have nothing to do with preventing a race condition; they're there to improve performance in high-concurrency scenarios by avoiding (very short term) blocking.
The ReadPast hint improves performance in high-concurrency scenarios: there's no waiting for the currently read record to be released; the only thing locking it will be another "queue worker" process, so we can safely skip it, knowing that that worker is dealing with the record.
The RowLock ensures that we don't lock more than one row at a time, so the next worker to request a message will get the next record rather than skipping several records because they're in a locked record's page.
The UpdLock is used to actually take a lock; i.e. RowLock says what to lock but doesn't say that there must be a lock, and ReadPast determines the behaviour when encountering other locked records, so again doesn't cause a lock on the current record. I suspect this is not explicitly needed, as SQL would acquire it in the background anyway (in fact, in the linked SO answer only ReadPast is specified); but it was included in the blog post for completeness / to explicitly show the lock which SQL would be implicitly taking in the background anyway.
However that post is written for a dedicated queue table. Where the table is used for other things (e.g. in the original question it was a table holding invoice data, which happened to have a column used to track what had been printed), that advice may not be desirable. i.e. By using a ReadPast lock you're jumping over all locked records; and there's no guarantee that those records are locked by another worker processing your queue; they may be locked for some completely unrelated purpose. That will then break the FIFO requirement.
Given this, I think my answer on the linked question stands: either create a dedicated table to handle the queue scenario, or consider the other options and their pros and cons in the context of your scenario.

Lock for SELECT so another process doesn't get old data

I have a table that could have two threads reading data from it. If the data is in a certain state (let's say state 1) then the process will do something (not relevant to this question) and then update the state to 2.
It seems to me that there could be a case where thread 1 and thread 2 both perform a select within microseconds of one another and both see that the row is in state 1, and then both do the same thing and 2 updates occur after locks have been released.
The question is: is there a way to prevent the second thread from being able to modify this data in Postgres, i.e. force it to do another SELECT after the first one's lock is released for its update, so it knows to bail in order to prevent dupes?
I looked into row locking, but it says you cannot prevent select statements which sounds like it won't work for my condition here. Is my only option to use advisory locks?
Your question, referencing an unknown source:
I looked into row locking, but it says you cannot prevent select statements which sounds like it won't work for my condition here. Is my only option to use advisory locks?
The official documentation on the matter:
Row-level locks do not affect data querying; they block only writers and lockers to the same row.
Concurrent attempts will not just select but try to take out the same row-level lock with SELECT ... FOR UPDATE - which causes them to wait for any previous transaction holding a lock on the same row to either commit or roll back. Just what you wanted.
However, many use cases are better solved with advisory locks - in versions before 9.5. You can still lock rows being processed with FOR UPDATE additionally to be safe. But if the next transaction just wants to process "the next free row" it's often much more efficient not to wait for the same row, which is almost certainly unavailable after the lock is released, but skip to the "next free" immediately.
In Postgres 9.5+ consider FOR UPDATE SKIP LOCKED for this. Like @Craig commented, this can largely replace advisory locks.
Related question stumbling over the same performance hog:
Function taking forever to run for large number of records
Explanation and code example for advisory locks or FOR UPDATE SKIP LOCKED in Postgres 9.5+:
Postgres UPDATE ... LIMIT 1
To lock many rows at once:
How to mark certain nr of rows in table on concurrent access
What you want is the fairly-common SQL SELECT ... FOR UPDATE. The Postgres-specific docs are here.
Using SELECT FOR UPDATE will lock the selected records for the span of the transaction, allowing you time to update them before another thread can select.
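For example, a hedged sketch with an assumed jobs table: both threads run the same transaction; the second blocks on the FOR UPDATE until the first commits, then re-reads the row under READ COMMITTED, sees it is no longer in state 1, and bails:

BEGIN;
SELECT id FROM jobs WHERE id = 42 AND state = 1 FOR UPDATE;
-- Proceed only if the SELECT returned a row; otherwise another thread got there first.
UPDATE jobs SET state = 2 WHERE id = 42 AND state = 1;
COMMIT;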