Concurrent update in PostgreSQL 11 - postgresql

I have about 10 queries that concurrently update a row, so I want to know what the difference is between
UPDATE account SET balance = balance + 1000
WHERE id = (SELECT id FROM account
WHERE id = 1 FOR UPDATE);
and
BEGIN;
SELECT balance FROM account WHERE id = 1 FOR UPDATE;
-- compute $newval = $balance + 1000
UPDATE account SET balance = $newval WHERE id = 1;
COMMIT;
I am using PostgreSQL 11. What is the right approach, and what happens when multiple transactions run concurrently with each of these two solutions?

Both versions will have exactly the same effect, and both prevent anomalies in the face of concurrency, because the row is locked before it is modified.
The first method is preferable, because there is only one client-server round trip, so the transaction is shorter and the lock is held for a shorter time, which improves concurrency.
The best way to do this and be safe from concurrent data modifications is:
UPDATE account
SET balance = balance + 1000
WHERE id = 1;
This does the same, because an UPDATE automatically puts an exclusive lock on the affected row, and a blocked query will see the updated version of the row when the lock is gone.
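To see the effect, here is a minimal two-session sketch (the session labels in the comments are only illustrative; the table is the one from the question):
-- Session A
BEGIN;
UPDATE account SET balance = balance + 1000 WHERE id = 1;  -- takes a row lock on id = 1

-- Session B, running concurrently
UPDATE account SET balance = balance + 1000 WHERE id = 1;  -- blocks until session A commits

-- Session A
COMMIT;  -- session B now proceeds and applies its increment on top of A's result
Under the default READ COMMITTED level, the blocked UPDATE re-evaluates its WHERE clause against the newest committed row version before modifying it, so no increment is lost.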

Related

PostgreSQL - Does row locking depend on the update syntax in a transaction?

I have a table called user_table with 2 columns: id (integer) and data_value (integer).
Here are two transactions that end up with the same results:
-- TRANSACTION 1
BEGIN;
UPDATE user_table
SET data_value = 100
WHERE id = 0;
UPDATE user_table
SET data_value = 200
WHERE id = 1;
COMMIT;
-- TRANSACTION 2
BEGIN;
UPDATE user_table AS user_with_old_value SET
data_value = user_with_new_value.data_value
FROM (VALUES
(0, 100),
(1, 200)
) AS user_with_new_value(id, data_value)
WHERE user_with_new_value.id = user_with_old_value.id;
COMMIT;
I would like to know if there is a difference in the locks applied to the rows.
If I understand it correctly, transaction 1 will first lock user 0, then lock user 1, then free both locks.
But what does transaction 2 do?
Does it do the same thing, or does it lock user 0 and user 1 together, then free both locks?
There is a difference, because with two concurrent transactions, writing my queries like the first transaction might lead to deadlock issues. But if I write my transactions like the second one, can I still run into deadlocks?
If it does the same thing, is there a way to write this transaction so that, at the beginning, before doing anything else, it determines every row it needs to update, waits until none of those rows are locked, and then locks them all at the same time?
links:
the syntax of the second transaction comes from: Update multiple rows in same query using PostgreSQL
Both transactions lock the same two rows, and both can run into a deadlock. The difference is that the first transaction always locks the two rows in a fixed order (id 0 first, then id 1), while the order in which the second one locks them depends on the execution plan.
If your objective is to avoid deadlocks, the first transaction is better: if you make a rule that every transaction must update the user with the lower id first, then no two such transactions can deadlock with each other (but they can still deadlock with other transactions that do not obey that rule).
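If you want the multi-row form of transaction 2 but with a deterministic lock order, one common pattern (a sketch, not something the question requires) is to lock the rows explicitly in id order before the UPDATE; the ORDER BY makes the locking order predictable:
BEGIN;
-- lock both rows in ascending id order before touching them
SELECT id FROM user_table WHERE id IN (0, 1) ORDER BY id FOR UPDATE;
-- now the multi-row update cannot deadlock with another transaction
-- that follows the same locking rule
UPDATE user_table AS user_with_old_value SET
data_value = user_with_new_value.data_value
FROM (VALUES
(0, 100),
(1, 200)
) AS user_with_new_value(id, data_value)
WHERE user_with_new_value.id = user_with_old_value.id;
COMMIT;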

How to lock row for multi worker rate limit?

I have multiple workers distributed across multiple nodes that scrape HTML. I need to specify a rate limit in the system so no domain gets more than 1 request every 5 seconds.
Each worker has access to a shared database (PostgreSQL) so I created a table with 2 columns:
domain key, last scan date
In the worker code I want to check the last scan date before making a request. The problem is that thousands of workers could get the same domain at almost the same instant if tasks are distributed round robin, so if they all read at once they will see no recent scan and all fire off requests. So I need a way to lock the row so that the first worker to check takes a lock, performs the scan, then updates the scan date and releases the lock. All the other workers can then see that a lock exists on the row and reject the task so it is re-scheduled.
I read the manual page on locks and found it very confusing. It says all locks are table-level locks, and I didn't really understand what it means by conflicts. I am going to need multiple workers to be able to lock/unlock different rows at the same time, and also to check whether a lock exists before placing one, so that a worker doesn't hang waiting for the lock to be released and can move on to the next task.
What type of lock do I need? Are there any good examples showing this type of lock?
If I just wrap each process in a transaction will that work?
Your core code would be the block:
begin;
set transaction isolation level read committed; -- should already be the default
select domain_key
from your_table
where last_scan_date < now() - interval '5 seconds'
for update skip locked
limit 1;
-- Do your stuff here, and issue a rollback if it fails
update your_table
set last_scan_date = <value goes here>
where domain_key = <value goes here>;
commit;
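For reference, a minimal table definition that this pattern assumes (the column names come from the question; the types and the primary key are guesses):
create table your_table (
domain_key text primary key,
last_scan_date timestamptz not null default '-infinity'
);
-- an index on last_scan_date helps once the table grows, since every worker filters on it
create index on your_table (last_scan_date);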
I expect this will be used from a host language. The following snippet shows an example worker in Python:
import psycopg2

conn = psycopg2.connect('<db connect parameters>')
conn.autocommit = False
c = conn.cursor()
c.execute("set transaction isolation level read committed;")
c.execute("""
    select domain_key
    from your_table
    where last_scan_date < now() - interval '5 seconds'
    order by last_scan_date
    for update skip locked
    limit 1
""")
row = c.fetchone()          # None if every eligible row is locked or too recent
if row:
    domain_key = row[0]
    result = process_url(domain_key)  # <-- This is your scraping routine
    if result == 'Ok':
        c.execute("""
            update your_table
            set last_scan_date = now()
            where domain_key = %s
        """, (domain_key,))
        conn.commit()
    else:
        conn.rollback()
else:
    conn.rollback()         # nothing to do; release the transaction

job queue with multiple consumers did the same job twice

Actually a lot of things might be covered here: Job queue as SQL table with multiple consumers (PostgreSQL)
However I just wanted to ask for my specific query.
Currently I have a job queue that should emit a new job for every consumer, but we found out that we sometimes got the same job twice on different consumers (probably a race condition).
This was our query (run inside a transaction):
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
RETURNING *
The table is pretty simple: it has a status field (can be "created", "working", "done"), a date_time_start field, a created field (not used for the query), an id field, and a node field (where the job was run).
However this emitted the same job twice at one point.
I have now changed the query to:
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED)
RETURNING *
Would that actually help and emit each job only once?
Your solution with FOR UPDATE SKIP LOCKED is fine. It ensures a row is locked by exactly one session before being updated for processing. No transaction can choose a row already locked by another transaction, and once the lock is released on commit, subsequent SELECTs will no longer match the row because its status has changed.
The original failed because the subquery's SELECT can choose the same row concurrently in multiple sessions, each of which then tries to UPDATE the row. There's no WHERE clause in the UPDATE that'd make that fail; it's perfectly fine for two concurrent sessions to UPDATE invoice_job SET status = 'working' WHERE node = 42 or whatever. The second update will happily run and commit once the first update succeeds.
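A sketch of the race with the original query, under the default READ COMMITTED level (the session labels are only illustrative):
-- Session A and session B run the original UPDATE at the same time.
-- Both subqueries see the same snapshot and both pick id = 1 (status = 'created').
-- Session A locks row 1, sets status = 'working', and commits.
-- Session B was blocked on row 1; after A commits it re-checks only "id = 1",
-- which still holds, so it updates row 1 again and returns the same job.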
You could also make it safe by repeating the WHERE clause in the UPDATE
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
AND status = 'created'
RETURNING *
... but this will often return zero rows under high concurrency.
In fact it will return zero rows for all but one of a set of concurrent executions, so it's no better than a serial queue worker. This is true of most of the other "clever" tricks people use to try to do concurrent queues, and one of the main reasons SKIP LOCKED was introduced.
The fact that you only noticed this problem now tells me that you would actually be fine with a simple, serial queue dispatch where you LOCK TABLE before picking the first row. But SKIP LOCKED will scale better if your workload grows.
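For completeness, a sketch of that serial variant, assuming the same invoice_job table (SKIP LOCKED remains the better choice if throughput ever matters):
BEGIN;
-- serializes all dispatchers: only one of them at a time can pick a job,
-- while plain reads of the table are still allowed
LOCK TABLE invoice_job IN EXCLUSIVE MODE;
UPDATE invoice_job SET status = 'working', date_time_start = now(), node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
RETURNING *;
COMMIT;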

Why does a PostgreSQL serializable transaction treat this as a conflict?

In my understanding, PostgreSQL uses some kind of monitoring to detect conflicts at the serializable isolation level. Many examples are about modifying the same resource in concurrent transactions, and serializable transactions work great there. But I wanted to test a concurrency issue in another way.
I decided to test 2 users modifying their own account balances, hoping that PostgreSQL would be smart enough not to flag it as a conflict, but the result is not what I wanted.
Below is my table; there are 4 accounts belonging to 2 users, and each user has a checking account and a saving account.
create table accounts (
id serial primary key,
user_id int,
type varchar,
balance numeric
);
insert into accounts (user_id, type, balance) values
(1, 'checking', 1000),
(1, 'saving', 1000),
(2, 'checking', 1000),
(2, 'saving', 1000);
The table data is like this:
 id | user_id | type     | balance
----+---------+----------+---------
  1 |       1 | checking |    1000
  2 |       1 | saving   |    1000
  3 |       2 | checking |    1000
  4 |       2 | saving   |    1000
Now I run 2 concurrent transactions, one per user. In each transaction, I reduce the checking account by some amount and check that user's total balance. If it is greater than 1000 I commit, otherwise I roll back.
User 1's transaction:
begin;
-- Reduce checking account for user 1
update accounts set balance = balance - 200 where user_id = 1 and type = 'checking';
-- Make sure user 1's total balance > 1000, then commit
select sum(balance) from accounts where user_id = 1;
commit;
User 2's transaction is the same, except that user_id = 2 in the WHERE clauses:
begin;
update accounts set balance = balance - 200 where user_id = 2 and type = 'checking';
select sum(balance) from accounts where user_id = 2;
commit;
I commit user 1's transaction first; it succeeds, as expected. When I commit user 2's transaction, it fails:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
HINT: The transaction might succeed if retried.
My questions are:
Why does PostgreSQL think these 2 transactions conflict? I added the user_id condition to every statement and never modify user_id, but that had no effect.
Does that mean serializable transactions don't allow concurrent transactions on the same table, even if their reads and writes don't conflict?
Doing something per user is very common. Should I avoid serializable transactions for operations that happen very frequently?
You can fix this problem with the following index:
CREATE INDEX accounts_user_idx ON accounts(user_id);
Since there is so little data in your example table, you will have to tell PostgreSQL to use an index scan:
SET enable_seqscan=off;
Now your example will work!
If that seems like black magic, take a look at the query execution plans of your SELECT and UPDATE statements.
Without the index both will use a sequential scan on the table, thereby reading all rows in the table. So both transactions will end up with a SIReadLock on the whole table.
This triggers the serialization failure.
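A quick way to verify this yourself (a diagnostic sketch; run the pg_locks query in the session that executed the statements, before committing):
-- check whether the statements use an index scan or a sequential scan
EXPLAIN SELECT sum(balance) FROM accounts WHERE user_id = 1;
EXPLAIN UPDATE accounts SET balance = balance - 200
WHERE user_id = 1 AND type = 'checking';

-- inspect the granularity of the predicate (SIRead) locks held so far
SELECT locktype, relation::regclass, page, tuple
FROM pg_locks
WHERE mode = 'SIReadLock';
With the sequential scan you should see a relation-level SIReadLock; with the index scan, only page or tuple entries for what was actually read.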
To my knowledge, SERIALIZABLE is the strictest isolation level and therefore allows the least concurrency: the outcome must be equivalent to running the transactions one after the other. The transactions still execute concurrently, but any interleaving that cannot be mapped onto such a serial order is aborted with a serialization failure, which is what happened here.

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next SELECT query execution from the same system? And one more thing: there will be no UPDATE on the locked rows.
The scenario is something like this.
If table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock the rows with id 1 to 5 against other systems that are executing SELECT statements concurrently.
Later, if sys1 executes another SELECT query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block read-only access to a table (unless that SELECT is done FOR UPDATE).
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
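For example, the release side might look like this (the ids 1 to 5 stand in for whatever ids the locking query above returned):
-- release the advisory locks one by one for each stored id ...
select pg_advisory_unlock(id)
from (values (1), (2), (3), (4), (5)) as locked(id);

-- ... or release everything the current session holds at once
select pg_advisory_unlock_all();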
It looks like you really have a transaction that transcends the borders of your database, with the changes happening in another system.
My idea is to use SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into the other system, then ROLLBACK to unlock the rows. No two SELECT ... FOR UPDATE queries will select the same row, and the second SELECT will fail immediately rather than wait and proceed.
But you don't seem to mark offloaded records in any way, so I don't see why two non-consecutive SELECTs wouldn't happily select overlapping ranges. So I'd still update the records with a flag and/or a target user name, and would only select records with the flag unset.
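A sketch of that idea, assuming a hypothetical offloaded boolean column added to table1:
begin;
-- grab the next unprocessed rows; with NOWAIT the statement fails immediately
-- if another session already holds locks on them, instead of waiting
select id, txt
from table1
where not offloaded          -- hypothetical flag column
order by id
limit 5
for update nowait;

-- hand the rows to the other system, then mark them so a later
-- select does not pick them again (ids below are placeholders)
update table1 set offloaded = true where id in (1, 2, 3, 4, 5);
commit;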
I tried both select ... for update and pg_try_advisory_lock and managed to get close to my requirement.
/* rows are locked, but LIMIT is the problem */
select * from table1 where pg_try_advisory_lock(id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // number of rows to process
.
.
/* after each word is processed */
$_SESSION['rows'] -=1;
.
.
/*and finally unlock locked rows*/
if ($_SESSION['rows']===0)
select pg_advisory_unlock_all();
But there are two problems with this:
1. Because the LIMIT picks rows independently of the locking, every instance ends up trying to lock the same rows.
2. I am not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance, or those of all instances.