What concurrency issues can this PostgreSQL code create?

Ok so here's the schema, which is pretty self-explanatory:
STORE(storeID, name, city)
PRODUCT(productID, name, brand)
PRODUCT_FOR_SALE(productID, storeID, price)
I have 2 transactions: T1 and T2.
T1 raises by 5% the price of any product sold in any store in 'London'.
T2 lowers by 10% the price of any product whose price is >= $1050.
What I am asked is to tell what kind of concurrency anomaly they may result in, and what isolation level I should apply to which transaction to make it safe.
The code for the transactions is not given, but I suppose it would be something on the lines of:
-- T1:
BEGIN;
UPDATE product_for_sale
SET price = price + (price / 100) * 5
WHERE storeID IN (SELECT storeID FROM store WHERE city = 'London');
COMMIT;
-- T2:
BEGIN;
UPDATE product_for_sale
SET price = price - price / 10
WHERE price >= 1050;
COMMIT;
My "guess" to what might happen with READ COMMITTED (default) is:
Considering a product P, sold in 'London' for $1049
both transactions begin
they both consider their row sets: T1 will consider all products sold in London (which includes P), T2 will consider products whose price is $1050 or more (which excludes P)
T1 commits and sets the price of P to $1101.45 but, since P wasn't in T2's row set to begin with, the change goes unnoticed, and T2 commits without considering it
If I'm not messing up definitions, this should be a case of a phantom read, which would be fixed if I set T2 to ISOLATION LEVEL REPEATABLE READ.

First, it is not quite clear what you mean by a concurrency issue. It could be:
1. something that could conceivably be a problem, but is handled by PostgreSQL automatically so that no problems arise.
2. something that can generate an unexpected error or an undesirable result.
For 1., this is handled by locks that serialize transactions that are trying to modify the same rows concurrently.
I assume you are more interested in 2.
What you describe can happen, but it is not a concurrency problem. It just means that T2 logically takes place before T1: the end result is the same as if T2 had run first (skipping P at $1049) and T1 had run afterwards. This will work just fine on all isolation levels.
I might be missing something, but the only potential problem I see here is a deadlock between the two statements:
They both can update several rows, so it could happen that one of them updates row X first and then tries to update row Y, which has already been updated by the other statement. The first statement is then blocked. If the second statement now tries to update row X, it is blocked too, and we have a deadlock.
Such a deadlock is broken by the deadlock detector after one second (the default deadlock_timeout) by killing one of the statements with an error.
Note that deadlocks are not real problems either; all your code has to do is repeat the failed transaction.
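If you would rather avoid that deadlock than retry, you can make both transactions acquire their row locks in a consistent order before updating. Below is a sketch of T1 under that scheme, assuming the tables above; T2 would do the same with its price >= 1050 predicate. The ORDER BY on the key columns is what makes the lock order deterministic.
BEGIN;
-- Lock the target rows in primary-key order first ...
SELECT 1
FROM product_for_sale
WHERE storeID IN (SELECT storeID FROM store WHERE city = 'London')
ORDER BY productID, storeID
FOR UPDATE;
-- ... then the UPDATE only touches rows this transaction already locks.
UPDATE product_for_sale
SET price = price + (price / 100) * 5
WHERE storeID IN (SELECT storeID FROM store WHERE city = 'London');
COMMIT;
Since both transactions now take their locks in the same order, one can block the other, but they can no longer deadlock.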

Related

Try to understand MVCC

I'm trying to understand MVCC and can't quite get it. For example:
Transaction T1 tries to read a data row while, at the same time, T2 updates that same row.
The flow of the transactions is
T1 begin -> T2 begin -> T2 commit -> T1 commit
So the first transaction gets its snapshot of the database and returns a result to the user, who is going to build further calculations on it. But as I understand it, the user gets the old data value? As I understand MVCC, after T1 begins, that transaction doesn't know that some other transaction has changed the data. So if the user now does some calculation on that result (without the DB involved), he is doing it on stale data? Or am I wrong, and the first transaction has some mechanism to know that the row was changed?
Let's now change the transaction flow.
T2 begin -> T1 begin -> T2 commit -> T1 commit
With 2PL the user gets the newer version of the data because of locks (T1 must wait until the exclusive lock is released). But with MVCC it will still be the old data, as I understand the PostgreSQL MVCC model. So I can get stale data in exchange for speed. Am I right, or am I missing something?
Thank you
Yes, it can happen that you read some old data from the database (that a concurrent transaction has modified), perform a calculation based on this and store “outdated” data in the database.
This is not a problem, because rather than the actual order in which transactions happen, the logical order is more relevant:
If T1 reads some data, then T2 modifies the data, then T1 modifies the database based on what it read, you can say that T1 logically took place before T2, and there is no inconsistency.
You only get an anomaly if T1 and T2 modify the same data: T1 reads data, T2 modifies the data, then T1 modifies the same data based on what it read. This anomaly is called a “lost update”.
Lost updates can only occur with the weakest (which is the default) isolation level, READ COMMITTED.
If you want better isolation from concurrent activity, you need to use at least REPEATABLE READ isolation. Then T1 would receive a serialization error when it tries to update the data, and you would have to repeat the transaction. On the second attempt, it would read the new data, and everything will be consistent.
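As a concrete illustration, here is a minimal two-session sketch of that behaviour; the accounts table and its balance column are made up for the example:
-- Hypothetical setup
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);
-- Session A, at REPEATABLE READ:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- sees 100
-- Session B, meanwhile:
--   BEGIN;
--   UPDATE accounts SET balance = 150 WHERE id = 1;
--   COMMIT;
-- Back in session A, an update based on the value it read:
UPDATE accounts SET balance = 100 + 10 WHERE id = 1;
-- ERROR:  could not serialize access due to concurrent update
-- Session A must ROLLBACK and retry; the retry will see balance = 150.
At READ COMMITTED the same UPDATE would simply succeed and silently overwrite session B's change, which is exactly the lost update described above.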
The flow of transaction is next T1 begin -> T2 begin -> T2 commit -> T1 commit. So the first transaction get it snapshot of database and returns to user result on which he is gonna build other calculation. But as I understand, customer get the old data value?
Unless T1 and T2 try to update the same row, this could usually be reordered to be the same as: T1 begin; T1 commit; T2 begin; T2 commit;
In other words, any undesirable business outcome that could be achieved by T2 changing the data while T1 was making a decision, could have also occurred by T2 changing the data immediately after T1 made the same decision.
Each transaction can only see data written by transactions whose ID is older than or equal to its own.
When transaction 1 reads data, it marks the read timestamp of that data with transaction 1's ID.
If transaction 2 tries to read the same data, it checks the read timestamp of the data; if that read timestamp is less than transaction 2's ID, transaction 2 is aborted, because 1 < 2 -- transaction 1 got there before us and must finish before us.
At commit time, we also check whether the read timestamp of the data is less than the committing transaction's ID. If it is, we abort the transaction and restart with a new transaction ID.
We also check whether write timestamps are less than our transaction's ID. The younger transaction wins.
There is an edge case where a younger transaction can abort if an older transaction gets ahead of the younger transaction.
I have actually implemented MVCC in Java. (see transaction, runner and mvcc code)

INSERT INTO .. SELECT causing possible race condition?

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? Is it one or multiple?
If it is multiple queries (find max(timestamp)[1], select[2] then insert[3]) I can imagine this will cause duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement into a transaction won't change that (a single statement always executes atomically, regardless of the number of sub-queries involved).
The statement will never see uncommitted data from other transactions, which is the root cause of why you can wind up with duplicate values.
The only safe way to avoid duplicate values is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use INSERT ... ON CONFLICT.
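A minimal sketch of that approach, assuming the timestamp column is what uniquely identifies a row coming from B (if it is not truly unique, use B's primary key instead); ON CONFLICT needs PostgreSQL 9.5 or later:
-- Reject duplicates at the database level
ALTER TABLE A ADD CONSTRAINT a_timestamp_uniq UNIQUE (timestamp);
-- Concurrent runs no longer create duplicates; conflicting rows are skipped
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A)
ON CONFLICT (timestamp) DO NOTHING;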
This depends on the isolation level set in your database (see the transaction isolation chapter of the Postgres documentation).
By default, this is set to Read Committed, which means that each statement sees a snapshot of the data taken when that statement started. If both statements take their snapshots before either one has committed its insert, you will get duplicate data in the table.
If you want to avoid having duplicate entries, you have a few options.
Try using the isolation level Serializable
Apply a unique index or constraint on a column of A. The timestamp is not a great candidate, as you might legitimately have two rows with the same timestamp; the id carried over from B is probably a better option.
Take a lock at the application level before performing such a query (see the advisory-lock sketch below).
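For the application-level lock, PostgreSQL's advisory locks are one convenient option. A sketch, using an arbitrary lock key (42 here) that all writers agree on:
BEGIN;
-- Only one session at a time can hold this transaction-scoped advisory lock;
-- the others block here until the holder commits or rolls back.
SELECT pg_advisory_xact_lock(42);
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
COMMIT;  -- the advisory lock is released automatically at commit
Note that this relies on the default READ COMMITTED level, where the INSERT takes its snapshot only after the lock has been acquired, so it sees the rows committed by the previous lock holder.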

PostgreSQL performance tuning with table partitions

I am solving a performance issue on a PostgreSQL 9.6 based system. Intro:
12yo system, similar to banking system, with most queried primary table called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the table transactions we store all transactions within a bank account. The field type_id determines the type of a transaction and also serves as the C# EntityFramework discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In general, there are 4 types of queries (no. 3 and no. 4 are by far the most frequent):
select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734
select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
select 0-100, 100-200, .. transactions of given type like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In the last few months we have had unexpected row count growth; the table is now at 120M rows.
We are thinking of table partitioning, following the PostgreSQL docs: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add column year and partition table by year (or year_month) into 12 (or 144) partitions.
I am now restoring data into our test environment, and I am going to test both options.
What do you consider the most appropriate partitioning rule for such situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; in that case, partitioning by date would be best (see the sketch below).
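For illustration, a declarative range-partitioning sketch. This syntax needs PostgreSQL 10 or later (the version the linked documentation describes); on 9.6 you would have to emulate it with inheritance and triggers. The column list and partition names are illustrative:
-- Parent table, partitioned by the date column
CREATE TABLE jrn.transactions_part (
    id          bigint NOT NULL,
    type_id     varchar(200),
    account_id  int NOT NULL,
    date_issued date NOT NULL,
    amount      numeric
) PARTITION BY RANGE (date_issued);
-- One partition per year; dropping an old year is then just a DROP TABLE
CREATE TABLE jrn.transactions_2017 PARTITION OF jrn.transactions_part
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE jrn.transactions_2018 PARTITION OF jrn.transactions_part
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');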
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage as you may get it using a SAP HANA Database. However, as you explicitly have asked for a Postgres answer and I doubt that a HANA database will be within the budget limit, we will have to stick with Postgres.
Your queries no. 3 and no. 4 go in quite different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet, I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is query no. 4, which creates quite a high load on your Postgres server just because it is summing up values. Moreover, you are summing up the same values over and over again, even though they most likely won't change often (or even at all), as you have said that UPDATEs nearly never happen. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean column to transactions that indicates which rows are contained in transactions_aggr and which are not (yet). Query no. 4 would then have to be changed so that it reads only the non-preaggregated rows from transactions, while the rest comes from transactions_aggr. To facilitate that you could define a view like this:
select account_id, date_issued, sum(amount) as sumamount
from (
    select account_id, date_issued, sumamount as amount
    from transactions_aggr as aggr
  union all
    select account_id, date_issued, amount
    from transactions as t
    where t.aggregated = false
) as combined
group by account_id, date_issued;
Needless to say, putting an index on transactions.aggregated (perhaps in conjunction with account_id) could greatly help to improve the performance here.
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120M rows once. This would at least reduce the aggregation load on your machine significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-creating it from scratch (all the original data is still there in transactions).
You have a quiet period somewhere during the week/month/night, where few or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and fold them into transactions_aggr. Keep in mind to then toggle aggregated to true (this should happen in the same transaction). The tricky part, however, is what reading queries will "see" of this transaction: depending on your accuracy requirements during the timeframe of this "update job", you may have to consider raising the isolation level of the reading transactions to REPEATABLE READ to prevent non-repeatable ("ghost") reads. A sketch of such a job is shown below.
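A minimal sketch of such an update job, assuming the transactions_aggr table and the aggregated flag introduced in this answer:
-- REPEATABLE READ so both statements work on the same set of rows
-- (the transactions table is assumed to be INSERT-only)
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- Fold all not-yet-aggregated rows into the per-day totals
INSERT INTO transactions_aggr (account_id, date_issued, sumamount)
SELECT account_id, date_issued, sum(amount)
FROM transactions
WHERE aggregated = false
GROUP BY account_id, date_issued
ON CONFLICT (account_id, date_issued)
    DO UPDATE SET sumamount = transactions_aggr.sumamount + EXCLUDED.sumamount;
-- Mark those rows as aggregated in the same transaction
UPDATE transactions
SET aggregated = true
WHERE aggregated = false;
COMMIT;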
On the matter of your query no. 3, you could then really go for the approach of partitioning based on type_id. However, I perceive your query as a little strange, as you are performing a LIMIT/OFFSET without having specified an ordering (there is no ORDER BY clause in place; NB: you are not saying that you would be using database cursors). This may have the effect that the implicit order currently in use changes once you enable partitioning on the table, so be careful about the side effects this may have on your program.
And one more thing: before really doing the partition split, I would first check the data distribution of type_id by issuing
select type_id, count(*) from transactions group by type_id
It might turn out that, for example, 90% of your data is card_payment, so that you would have a heavily uneven distribution among your partitions and the biggest performance-hogging queries would still go to this single "large partition".
Hope this helps a little - and good luck!

In-order sequence generation

Is there a way to generate some kind of in-order identifier for a table records?
Suppose that we have two threads doing queries:
Thread 1:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
Thread 2:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'world');
commit;
It's entirely possible (depending on timing) that an external observer would see the (2, 'world') record appear before the (1, 'hello').
That's fine, but I want a way to get all the records in the 'table1' that appeared since the last time the external observer checked it.
So, is there any way to get the records in the order they were inserted? Maybe OIDs can help?
No. Since there is no natural order of rows in a database table, all you have to work with is the values in your table.
Well, there are the Postgres specific system columns cmin and ctid you could abuse to some degree.
The tuple ID (ctid) contains the file block number and position in the block for the row. So this represents the current physical ordering on disk. Later additions will have a bigger ctid, normally. Your SELECT statement could look like this
SELECT *, ctid -- save ctid from last row in last_ctid
FROM tbl
WHERE ctid > last_ctid
ORDER BY ctid
ctid has the data type tid. Example: '(0,9)'::tid
However it is not stable as long-term identifier, since VACUUM or any concurrent UPDATE or some other operations can change the physical location of a tuple at any time. For the duration of a transaction it is stable, though. And if you are just inserting and nothing else, it should work locally for your purpose.
I would add a timestamp column with default now() in addition to the serial column ...
I would also let a column default populate your id column (a serial or IDENTITY column). That retrieves the number from the sequence at a later stage than explicitly fetching and then inserting it, thereby minimizing (but not eliminating) the window for a race condition - the chance that a lower id would be inserted at a later time. Detailed instructions:
Auto increment table column
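A sketch of those two suggestions combined; the inserted_at column name is illustrative, and, as noted above, this shrinks but does not close the race window:
-- id and the timestamp are both filled in by column defaults
CREATE TABLE table1 (
    id          bigserial PRIMARY KEY,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    value       text
);
-- The sequence value is fetched as late as possible, during the INSERT itself
INSERT INTO table1 (value) VALUES ('hello');
INSERT INTO table1 (value) VALUES ('world');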
What you want is to force transactions to commit (making their inserts visible) in the same order that they did the inserts. As far as other clients are concerned the inserts haven't happened until they're committed, since they might roll back and vanish.
This is true even if you don't wrap the inserts in an explicit begin / commit. Transaction commit, even if done implicitly, still doesn't necessarily happen in the same order in which the rows themselves were inserted. It's subject to operating system CPU scheduler ordering decisions, etc.
Even if PostgreSQL supported dirty reads this would still be true. Just because you start three inserts in a given order doesn't mean they'll finish in that order.
There is no easy or reliable way to do what you seem to want that will preserve concurrency. You'll need to do your inserts in order on a single worker - or use table locking as Tometzky suggests, which has basically the same effect since only one of your insert threads can be doing anything at any given time.
You can use advisory locking, but the effect is the same.
Using a timestamp won't help, since you don't know if for any two timestamps there's a row with a timestamp between the two that hasn't yet been committed.
You can't rely on an identity column where you read rows only up to the first "gap" because gaps are normal in system-generated columns due to rollbacks.
I think you should step back and look at why you have this requirement and, given this requirement, why you're using individual concurrent inserts.
Maybe you'll be better off doing small-block batched inserts from a single session?
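If you go down that road, a single session can batch many rows into one INSERT and one commit, so the id order and the commit order trivially agree; a trivial sketch using the question's table and sequence:
-- One statement, one transaction: both rows become visible together
INSERT INTO table1 (id, value)
VALUES (nextval('table1_seq'), 'hello'),
       (nextval('table1_seq'), 'world');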
If you mean that every query that sees the 'world' row must also see the 'hello' row, then you'd need to do:
begin;
lock table table1 in share update exclusive mode;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
This share update exclusive mode is the weakest lock mode which is self-exclusive — only one session can hold it at a time.
Be aware that this will not make this sequence gap-less — this is a different issue.
We found another solution with recent PostgreSQL servers, similar to #erwin's answer but with txid.
When inserting rows, instead of using a sequence, insert txid_current() as row id. This ID is monotonically increasing on each new transaction.
Then, when selecting rows from the table, add to the WHERE clause id < txid_snapshot_xmin(txid_current_snapshot()).
txid_snapshot_xmin(txid_current_snapshot()) corresponds to the transaction index of the oldest still-open transaction. Thus, if row 20 is committed before row 19, it will be filtered out because transaction 19 will still be open. When the transaction 19 is committed, both rows 19 and 20 will become visible.
When no other transaction is open, the snapshot xmin will be the transaction id of the currently running SELECT statement.
The returned transaction IDs are 64-bit: the higher 32 bits are an epoch and the lower 32 bits are the actual ID.
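Putting this together, a minimal sketch of the scheme (the :last_seen parameter is a placeholder for the value the observer remembered from its previous poll):
-- id stores the inserting transaction's txid instead of a sequence value
CREATE TABLE table1 (
    id    bigint NOT NULL,
    value text
);
-- Writers:
INSERT INTO table1 (id, value) VALUES (txid_current(), 'hello');
-- The observer only reads rows that every still-open transaction has passed:
SELECT *
FROM table1
WHERE id > :last_seen
  AND id < txid_snapshot_xmin(txid_current_snapshot())
ORDER BY id;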
Here is the documentation of these functions: https://www.postgresql.org/docs/9.6/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT
Credits to tux3 for the idea.

Postgres deadlock with READ COMMITTED isolation

We have noticed a rare occurrence of a deadlock on a PostgreSQL 9.2 server in the following situation:
T1 starts the batch operation:
UPDATE BB bb SET status = 'PROCESSING', chunk_id = 0 WHERE bb.status ='PENDING'
AND bb.bulk_id = 1 AND bb.user_id IN (SELECT user_id FROM BB WHERE bulk_id = 1
AND chunk_id IS NULL AND status ='PENDING' LIMIT 2000)
When T1 commits after a few hundred milliseconds or so (BB has many millions of rows), multiple threads begin new Transactions (one transaction per thread) that read items from BB, do some processing and update them in batches of 50 or so with the queries:
For select:
SELECT * FROM (SELECT *, RANK() OVER (ORDER BY user_id) AS rno FROM BB WHERE status = 'PROCESSING' AND bulk_id = 1) AS t WHERE rno = $1
And Update:
UPDATE BB set datetime=$1, status='DONE', message_id=$2 WHERE bulk_id=1 AND user_id=$3
(user_id, bulk_id have a UNIQUE constraint).
Due to a problem external to this situation, another transaction T2 executes the same query as T1 almost immediately after T1 has committed (the initial batch operation where items are marked as 'PROCESSING').
UPDATE BB bb SET status = 'PROCESSING', chunk_id = 0 WHERE bb.status ='PENDING'
AND bb.bulk_id = 1 AND bb.user_id IN (SELECT user_id FROM BB WHERE bulk_id = 1
AND chunk_id IS NULL AND status ='PENDING' LIMIT 2000)
However, although these items are marked as 'PROCESSING', this query deadlocks with some of the updates (which are done in batches, as I said) from the worker threads. To my understanding this should not happen with the READ COMMITTED isolation level (default) that we use. I am sure that T1 has committed, because the worker threads execute after it has done so.
edit: One thing I should clear up is that T2 starts after T1, but before T1 commits. However, due to a write_exclusive tuple lock that we acquire with a SELECT FOR UPDATE on the same row (one that is not affected by any of the above queries), it waits for T1 to commit before it runs the batch update query.
When T1 commits after a few hundred milliseconds or so (BB has many millions of rows), multiple threads begin new Transactions (one transaction per thread) that read items from BB, do some processing and update them in batches of 50 or so with the queries:
This strikes me as a concurrency problem. I think you are far better off having one transaction read the rows and hand them off to worker processes, then update them in batches when they come back. Your fundamental problem is that the workers are operating on rows in an uncertain state, holding the rows during their transactions, and the like. You have to handle rollbacks and so forth separately, and consequently the locking is a real problem.
Now, if that solution is not possible, I would have a separate locking table. In this case, each thread spins up separately, locks the locking table, claims a bunch of rows, inserts records into the locking table, and commits. In this way each thread has claimed its own records. The threads can then work on their record sets, update them, etc. You may want a process which periodically clears out stale locks.
In essence your problem is that rows go from state A -> processing -> state B and may be rolled back. Since the other threads have no way of knowing which rows are being processed, and by which threads, you can't safely allocate records. One option is to change the model to:
state A -> claimed state -> processing -> state B. However, you have to have some way of ensuring that rows are effectively allocated and that each thread knows which rows have been allocated to which thread. A sketch of such a claim step follows.
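For illustration only: on PostgreSQL 9.5 or later the claim step is usually written with FOR UPDATE SKIP LOCKED; on 9.2, as in the question, you would have to fall back to a locking table or advisory locks, but the shape of the claim is the same. The chunk id 7 and the batch size of 50 are placeholders:
-- Claim step: each worker atomically marks a batch of rows as its own.
-- SKIP LOCKED lets concurrent workers claim disjoint batches without
-- blocking or deadlocking on each other.
BEGIN;
UPDATE BB bb
SET status = 'CLAIMED', chunk_id = 7          -- 7 = this worker's id
WHERE bb.status = 'PENDING' AND bb.bulk_id = 1
  AND bb.user_id IN (
      SELECT user_id
      FROM BB
      WHERE status = 'PENDING' AND bulk_id = 1
      ORDER BY user_id                        -- deterministic lock order
      LIMIT 50
      FOR UPDATE SKIP LOCKED
  );
COMMIT;
-- The worker then processes and finishes only the rows it claimed:
-- UPDATE BB SET datetime = $1, status = 'DONE', message_id = $2
-- WHERE bulk_id = 1 AND chunk_id = 7 AND status = 'CLAIMED' AND user_id = $3;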