I have to write SQL transactions for a very high-traffic web application that uses Postgres as its database.
My question is: how do I control concurrency for a READ THEN UPDATE THEN WRITE transaction when two users run that transaction concurrently?
What is the best practice for doing this in a very high-traffic web application? Any help or suggestions would be really appreciated.
Thanks in advance.
Explanatory note: I'm assuming you mean a read-modify-write workload, and that the capital letters for "READ THEN UPDATE THEN WRITE" are not intended to signify some special transaction option SQL syntax from a product I'm unfamiliar with.
If your webapp is doing read-modify-write cycles with high concurrency and traffic, then you can't use traditional row locking:
BEGIN;
SELECT primarykey, col1 FROM thetable WHERE ... FOR UPDATE;
-- process in the application
UPDATE thetable SET col1 = ... WHERE primarykey = ...;
COMMIT;
because user "think time" and network latency are potentially unbounded. Most of your connections will be stuck for an indefinite amount of time in the "process in the application" phase. Each waiting session holds open an idle transaction, which ties up finite database resources such as connection slots and memory.
The conventional, well-established solution to this is to use optimistic concurrency control, sometimes misleadingly referred to as optimistic locking. Some ORMs support this natively. It's easy enough to implement yourself if you're working with SQL directly or via a framework that doesn't support it. The principle is that your logic flow looks more like this:
BEGIN TRANSACTION READ ONLY;
SELECT primarykey, col1, row_version FROM thetable WHERE ...;
COMMIT;
-- process in the application and wait for user response
BEGIN;
UPDATE thetable SET col1 = ..., row_version = row_version + 1 WHERE primarykey = ... AND row_version = 'prev_row_version';
Check to see if the UPDATE affected any rows using the affected-row-count returned by the database in the UPDATE response
If it affected zero rows, the WHERE clause didn't match, suggesting that someone else updated the row since we SELECTed it. Go back to the beginning and start again.
If it affected one row, we know nobody else beat us to updating this row, so COMMIT and tell the user everything's OK.
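The flow above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere, the helper name update_if_unchanged is hypothetical, and the two "users" are simulated sequentially on one connection.

```python
import sqlite3

def update_if_unchanged(conn, primarykey, new_col1, prev_row_version):
    """Apply the update only if row_version still matches what was read earlier."""
    cur = conn.execute(
        "UPDATE thetable "
        "SET col1 = ?, row_version = row_version + 1 "
        "WHERE primarykey = ? AND row_version = ?",
        (new_col1, primarykey, prev_row_version),
    )
    conn.commit()
    # rowcount is the affected-row count: 1 means we won, 0 means the row changed under us.
    return cur.rowcount == 1

# SQLite in-memory database used purely as a runnable stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thetable ("
    "primarykey INTEGER PRIMARY KEY, col1 TEXT, "
    "row_version INTEGER NOT NULL DEFAULT 1)"
)
conn.execute("INSERT INTO thetable (primarykey, col1) VALUES (1, 'original')")
conn.commit()

# Two users read the same row (and its version) before either of them writes.
version_seen_by_a = conn.execute(
    "SELECT row_version FROM thetable WHERE primarykey = 1"
).fetchone()[0]
version_seen_by_b = version_seen_by_a

a_succeeded = update_if_unchanged(conn, 1, "edited by A", version_seen_by_a)
b_succeeded = update_if_unchanged(conn, 1, "edited by B", version_seen_by_b)
print(a_succeeded, b_succeeded)
```

User B's update matches zero rows because A already bumped row_version, so B would have to re-read the row and start again.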
Frameworks like Hibernate support this automatically by annotating a column as a row version.
Optimistic concurrency control can inter-operate with traditional locking with appropriate database triggers. See, e.g. the sample trigger I wrote for Hibernate inter-operation.
Related
I'm looking for a way to manage optimistic concurrency control across more than one table in Postgres. I'm also trying to keep business logic out of the database. I have a table setup something like this:
CREATE TABLE master
(
    id SERIAL PRIMARY KEY NOT NULL,
    status VARCHAR NOT NULL,
    some_value INT NOT NULL,
    row_version INT NOT NULL DEFAULT 1
);
CREATE TABLE detail
(
    id SERIAL PRIMARY KEY NOT NULL,
    master_id INT NOT NULL REFERENCES master ON DELETE CASCADE ON UPDATE CASCADE,
    some_data VARCHAR NOT NULL
);
master.row_version is automatically incremented by a trigger whenever the row is updated.
The client application does the following:
1. Reads a record from the master table.
2. Calculates some business logic based on the values of the record; this may include a delay of several minutes involving user interaction.
3. Inserts a record into the detail table based on logic in step 2.
I want step 3 to be rejected if the value of master.row_version has changed since the record was read at step 1. Optimistic concurrency control seems to me like the right answer (the only answer?), but I'm not sure how to manage it across two tables like this.
I'm thinking a function in Postgres with a row-level lock on the relevant record in the master table is probably the way to go. But I'm not sure if this is my best/only option, or what that would look like (I'm a bit green on Postgres syntax).
I'm using Npgsql, given that the client application is written in C#. I don't know if there's anything in it which can help me? I'd like to avoid a function if possible, but I'm struggling to find a way to do this with straight-up SQL, and anonymous code blocks (at least in Npgsql) don't support the I/O operations I'd need.
Locking is out if you want to use optimistic concurrency control; see the Wikipedia article on the topic:
OCC assumes that multiple transactions can frequently complete without
interfering with each other. While running, transactions use data
resources without acquiring locks on those resources.
You could use a more complicated INSERT statement.
If $1 is the original row_version and $2 and $3 are master_id and some_data to be inserted in detail, run
WITH m(id) AS
(SELECT CASE WHEN master.row_version = $1
THEN $2
ELSE NULL
END
FROM master
WHERE master.id = $2)
INSERT INTO detail (master_id, some_data)
SELECT m.id, $3 FROM m;
If row_version has changed, this will try to insert NULL as detail.master_id, which will cause an
ERROR: null value in column "master_id" violates not-null constraint
that you can translate into a more meaningful error message.
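As a rough sketch of how the client side might translate that error, the following uses Python's built-in sqlite3 module as a runnable stand-in for Postgres and Npgsql. The helper name insert_detail is hypothetical, the schema is simplified to SQLite types, and the statement drops the WITH clause but keeps the same CASE trick. It also checks the affected-row count, because a missing master row yields zero inserted rows rather than an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        some_value INTEGER NOT NULL,
        row_version INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE detail (
        id INTEGER PRIMARY KEY,
        master_id INTEGER NOT NULL REFERENCES master,
        some_data TEXT NOT NULL
    );
    INSERT INTO master (id, status, some_value) VALUES (1, 'open', 10);
""")

def insert_detail(conn, expected_version, master_id, some_data):
    """Return True if a detail row was inserted, False if the version check failed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "INSERT INTO detail (master_id, some_data) "
                "SELECT CASE WHEN row_version = ? THEN id ELSE NULL END, ? "
                "FROM master WHERE id = ?",
                (expected_version, some_data, master_id),
            )
        # Zero rows means the master row does not exist at all.
        return cur.rowcount == 1
    except sqlite3.IntegrityError:
        # Assumed here to be the NOT NULL violation on detail.master_id,
        # i.e. row_version changed since it was read.
        return False

fresh = insert_detail(conn, 1, 1, "first")

# Someone else modifies the master row, bumping its version.
conn.execute("UPDATE master SET some_value = 11, row_version = row_version + 1 WHERE id = 1")
conn.commit()

stale = insert_detail(conn, 1, 1, "second")
print(fresh, stale)
```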
I've since come to the conclusion that a row lock can be employed in a "typical" pessimistic concurrency control approach, but when combined with a row version can produce a "hybrid" approach with some meaningful benefits.
Unsurprisingly, the choice of pessimistic, optimistic or "hybrid" concurrency control depends on the needs of the application.
Pessimistic Concurrency Control
A typical pessimistic concurrency control approach might look like this.
1. Begin database transaction.
2. Read (and lock) record from master table.
3. Perform business logic.
4. Insert a record into detail table.
5. Commit database transaction.
If the business logic at step 3 is long-running, this approach may be undesirable as it leads to a long-running transaction (generally unfavourable), and a long-running lock on the record in master which may be otherwise problematic for concurrency.
Optimistic Concurrency Control
An approach using only optimistic concurrency control might look more like this.
1. Read record (including row version) from master table.
2. Perform business logic.
3. Begin database transaction.
4. Increment row version on record in master table (an optimistic concurrency control check).
5. Insert a record into detail table.
6. Commit database transaction.
In this scenario, the database transaction is held for a shorter period of time, as are any (implicit) row locks. However, incrementing the row version on the record in the master table may be a bit misleading to concurrent operations. Imagine several concurrent operations following this scenario: they'll start failing the optimistic concurrency check because the row version has been incremented, even though the meaningful properties of the record haven't changed.
Hybrid Concurrency Control
A "hybrid" approach uses both pessimistic locking and (sort of) optimistic locking, like this.
1. Read record (including row version) from master table.
2. Perform business logic.
3. Begin database transaction.
4. Re-read record from master table based on its ID and row version (an optimistic concurrency control check of sorts) AND lock the row.
5. Insert a record into detail table.
6. Commit database transaction.
If step 4 fails to obtain a record, this should be considered an optimistic concurrency control check failure. The record has been changed since step 1 so the business logic is no longer valid.
Like the typical pessimistic concurrency control scenario, this involves a transaction and an explicit row lock, but the duration of the transaction+lock no longer includes the time necessary to perform the business logic.
Like the optimistic concurrency control scenario, the record requires a version. But where it differs is that the version is not updated, which means other operations depending on that row version won't be impacted.
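The hybrid steps can be sketched as follows. This is only an approximation under stated assumptions: it uses Python's built-in sqlite3 module so that it runs without a server, and because SQLite has no row-level locks, BEGIN IMMEDIATE (a database-wide write lock) stands in for the row lock that a Postgres re-read with FOR UPDATE or FOR SHARE would take. The helper name add_detail_hybrid and the simplified schema are hypothetical.

```python
import sqlite3

# isolation_level=None puts the module in autocommit mode, so BEGIN/COMMIT are issued by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE master (id INTEGER PRIMARY KEY, some_value INTEGER NOT NULL, "
    "row_version INTEGER NOT NULL DEFAULT 1)"
)
conn.execute(
    "CREATE TABLE detail (id INTEGER PRIMARY KEY, "
    "master_id INTEGER NOT NULL REFERENCES master, some_data TEXT NOT NULL)"
)
conn.execute("INSERT INTO master (id, some_value) VALUES (1, 10)")

def add_detail_hybrid(conn, master_id, version_read_earlier, some_data):
    """Steps 3 to 6 of the hybrid approach; returns False on a version mismatch."""
    # Step 3: begin the transaction (and take SQLite's write lock up front).
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Step 4: re-read by ID *and* version; no row means the record has changed.
        row = conn.execute(
            "SELECT id FROM master WHERE id = ? AND row_version = ?",
            (master_id, version_read_earlier),
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return False
        # Step 5: insert the detail row; master.row_version is deliberately left alone.
        conn.execute(
            "INSERT INTO detail (master_id, some_data) VALUES (?, ?)",
            (master_id, some_data),
        )
        # Step 6: commit, which also releases the lock.
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

# Two operations that read version 1 both succeed, since neither bumps the version.
first = add_detail_hybrid(conn, 1, 1, "detail A")
second = add_detail_hybrid(conn, 1, 1, "detail B")

# A genuine change to the master row bumps the version, so a stale operation is rejected.
conn.execute("UPDATE master SET some_value = 11, row_version = row_version + 1 WHERE id = 1")
third = add_detail_hybrid(conn, 1, 1, "detail C")
print(first, second, third)
```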
Example of Hybrid Approach
An example of where the hybrid approach might be favourable:
A blog has a post table and comment table. Comments can be added to a post only if the post.comments_locked flag is false. The process for adding comments could use the hybrid approach, ensuring users can concurrently add comments without any concurrency exceptions.
The owner of the blog may edit their post, in which case the conventional optimistic concurrency control approach could be employed. The owner of the blog can have a long-running edit process which won't be affected by users adding comments. When the post is updated to the database, the version will be incremented, which means any in-progress comment-adding operations will fail, but they could be easily retried with a database-wins approach of re-fetching the post record from the database and retrying the process.
I am using transactions to make changes to a SQL database. As I understand it, this means that changes to the database will happen in an all-or-nothing fashion. What I want to know is, does this have any guarantees for reads? For example, suppose I have some (pseudo)-code like this:
1) start TRANSACTION
2) INSERT INTO users ... // insert some data
3) count = SELECT COUNT(*) FROM ... // count something in the database
4) if count > 10: // do something based on the read
5) INSERT INTO other_table ... // write based on the read
6) COMMIT TRANSACTION
In this code, I'm doing an INSERT, followed by a SELECT, and then conditionally doing another INSERT based on the outcome of the SELECT.
So my question is, if another process modifies the database between steps (3) and (5), what happens to the count variable, and to my transaction?
If it makes a difference, I am using PostgreSQL.
As Xin pointed out, it depends on the isolation level.
At the default READ COMMITTED level, records from other sessions will become visible as they are committed; you would see the same records if you didn't start a transaction at all (though of course, other processes would see your inserts appear at different times).
With REPEATABLE READ, your queries will not see any records committed by other sessions after your transaction starts. But while you don't have to worry about the result of SELECT COUNT(*) changing during your transaction, you can't assume that this result will still be accurate by the time you commit.
Using SERIALIZABLE provides the strongest guarantee: if your script does the right thing when given exclusive access to the database, then it will do the right thing in the presence of other serialisable transactions (or it will fail outright). However, this means that all transactions which might interfere with yours must be using the same isolation level (which comes at a cost), and all must be prepared to retry their transaction in the event of a serialisation failure.
When serialisable transactions are not an option, you generally guard against race conditions by explicitly locking things against concurrent writes. It's often enough to lock a selection of records, but you can't exactly lock the result of a COUNT(*); in your case, you'd probably need to lock the whole table.
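The "be prepared to retry" requirement for serialisable transactions usually ends up as a small loop in the application. The sketch below is driver-agnostic and rests on assumptions: SerializationFailure is a hypothetical stand-in for whatever exception your driver raises for SQLSTATE 40001, and run_with_retry is an illustrative name, not a library function.

```python
class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's SQLSTATE 40001 (serialization_failure) error."""

def run_with_retry(transaction, max_attempts=5):
    """Call transaction() until it succeeds or max_attempts serialisation failures occur."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # A real implementation would roll back here and probably back off briefly.

# Demo: a fake transaction that fails twice before succeeding.
attempts = []

def flaky_transaction():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

result = run_with_retry(flaky_transaction)
print(result, len(attempts))
```

The whole transaction, including the reads it based its decisions on, has to be re-run on each attempt; retrying only the failed statement would reuse stale data.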
I don't work with PostgreSQL, but I think I can answer your question. Think of every query as running in parallel with everyone else's. With two transactions, while you insert into table a, another session can insert into table b; whether you then see the new data when you check b depends on your isolation level (for example, read committed versus read uncommitted, which permits dirty reads).
Also note that databases provide locks: you can lock a table to prevent others from altering it before your transaction commits.
Hope this helps.
Let's assume in SQL window 1 I do:
-- query 1
BEGIN TRANSACTION;
UPDATE post SET title = 'edited' WHERE id = 1;
-- note that there is no explicit commit
Then from another window (window 2) I do:
-- query 2
SELECT * FROM post WHERE id = 1;
I get:
1 | original title
Which is fine as the default isolation level is READ COMMITTED and because query 1 is never committed, the change it performs is not readable until I explicitly commit from window 1.
In fact if I, in window 1, do:
COMMIT TRANSACTION;
I can then see the change if I re-run query 2.
1 | edited
My question is:
Why is query 2 returning fine the first time I run it? I was expecting it to block as the transaction in window 1 was not committed yet and the lock placed on row with id = 1 was (should be) an unreleased exclusive one that should block a read like the one performed in window 2. All the rest makes sense to me but I was expecting the SELECT to get stuck until an explicit commit in window 1 was executed.
The behaviour you describe is normal and expected in any transactional relational database.
If PostgreSQL showed you the value edited for the first SELECT it'd be wrong to do so - that's called a "dirty read", and is bad news in databases.
PostgreSQL would be allowed to wait at the SELECT until you committed or rolled back, but it isn't required to by the SQL standard, you haven't told it you want to wait, and it doesn't have to wait for any technical reason, so it returns the data you asked for immediately. After all, until it's committed, that update only kind-of exists - it still might or might not happen.
If PostgreSQL always waited here, then you'd quickly land up with a situation where only one connection could be doing anything with the database at a time. Not pretty for performance, and totally unnecessary the vast majority of the time.
If you want to wait for a concurrent UPDATE (or DELETE), you'd use SELECT ... FOR SHARE. (But be aware that this won't work for INSERT).
Details:
SELECT without a FOR UPDATE or FOR SHARE clause does not take any row level locks. So it sees whatever is the current committed row, and is not affected by any in-flight transactions that might be modifying that row. The concepts are explained in the MVCC section of the docs. The general idea is that PostgreSQL is copy-on-write, with versioning that allows it to return the correct copy based on what the transaction or statement could "see" at the time it started - what PostgreSQL calls a "snapshot".
In the default READ COMMITTED isolation level, snapshots are taken at the statement level, so if you SELECT a row, COMMIT a change to it from another transaction, and SELECT it again, you'll see different values even within one transaction. You can use REPEATABLE READ isolation (which PostgreSQL implements as snapshot isolation) if you don't want to see changes committed after the transaction begins, or SERIALIZABLE isolation to add further protection against certain kinds of transaction inter-dependencies.
See the transaction isolation chapter in the documentation.
If you want a SELECT to wait for in-progress transactions to commit or rollback changes to rows being selected, you must use SELECT ... FOR SHARE. This will block on the lock taken by an UPDATE or DELETE until the transaction that took the lock rolls back or commits.
INSERT is different, though - the tuples just don't exist to other transactions until commit. The only way to wait for concurrent INSERTs is to take an EXCLUSIVE table-level lock, so you know nobody else is changing the table while you read it. Usually the need to do that means you have a design problem in the application though - your app should not care if there are uncommitted inserts still in flight.
See the explicit locking chapter of the documentation.
In PostgreSQL's MVCC implementation, the principle is that reading does not block writing and vice versa. The manual:
The main advantage of using the MVCC model of concurrency control
rather than locking is that in MVCC locks acquired for querying
(reading) data do not conflict with locks acquired for writing data,
and so reading never blocks writing and writing never blocks reading.
PostgreSQL maintains this guarantee even when providing the strictest
level of transaction isolation through the use of an innovative
Serializable Snapshot Isolation (SSI) level.
Each transaction only sees (mostly) what has been committed before the transaction began.
That does not mean there'd be no locking. Not at all. For many operations various kinds of locks are acquired. And various strategies are applied to resolve possible conflicts.
I'm using PostgreSQL 9.2 in a Windows environment.
I'm in a 2PC (2 phase commit) environment using MSDTC.
I have a client application that starts a transaction at the SERIALIZABLE isolation level, inserts a new row of data in a table for a specific foreign key value (there is an index on the column), and votes for completion of the transaction (the transaction is PREPARED). The transaction will be COMMITTED by the Transaction Coordinator.
Immediately after that, outside of a transaction, the same client requests all the rows for this same specific foreign key value.
Because there may be a delay before the previous transaction is really committed, the SELECT may return a previous snapshot of the data. In fact, it does happen sometimes, and this is problematic. Of course the application may be redesigned, but until then I'm looking for a lock-based solution. An advisory lock, perhaps?
I already solved the problem for UPDATEs on specific rows by using SELECT ... FOR SHARE, and it works well. The SELECT waits until the transaction commits and returns both old and new rows.
Now I'm trying to solve it for INSERT.
SELECT ... FOR SHARE does not block and returns immediately.
There is no concurrency issue here as only one client deals with a specific set of rows. I already know about MVCC.
Any help appreciated.
To wait for a not-yet-committed INSERT you'd need to take a predicate lock. There's limited predicate locking in PostgreSQL for the serializable support, but it's not exposed directly to the user.
Simple SERIALIZABLE isolation won't help you here, because SERIALIZABLE only requires that there be an order in which the transactions could've occurred to produce a consistent result. In your case this ordering is SELECT followed by INSERT.
The only option I can think of is to take an ACCESS EXCLUSIVE lock on the table before INSERTing. This will only get released at COMMIT PREPARED or ROLLBACK PREPARED time, and in the meantime any other queries will wait for the lock. You can enforce this via a BEFORE trigger to avoid the need to change the app. You'll probably get the odd deadlock and rollback if you do it that way, though, because INSERT will take a weaker lock and then you'll attempt lock promotion in the trigger. If possible, it's better to run the LOCK TABLE ... IN ACCESS EXCLUSIVE MODE command before the INSERT.
As you've alluded to, this is mostly an application mis-design problem. Expecting to see not-yet-committed rows doesn't really make any sense.
I have a PostgreSQL 9.2.2 database that serves orders to my ERP system. The database tables contain boolean columns indicating if a customer is added or not among other records. The code I use extracts the rows from the database and sends them to our ERP system one at a time (single threaded). My code works perfectly in this regard; however over the past year our volume has grown enough to require a multi-threaded solution.
I don't think the MVCC modes will work for me because the added_customer column is only updated once a customer has been successfully added. The default MVCC modes could cause the same row to be worked on at the same time resulting in duplicate web service calls. What I want to avoid is duplicate web service calls to our ERP system as they can be rather heavy, although admittedly I am not an expert on MVCC nor the other modes that PostgreSQL provides.
My question is: How can I be sure that a row, or series of rows returned in one select statement are excluded from other queries to the database in separate threads?
You will need to record the fact that the rows are being processed somehow. You will also need to deal with concurrent attempts to mark them as being processed and handle failures with sending them to your ERP system.
You may find SELECT ... FOR UPDATE useful to get a set of rows and simultaneously lock them against updates. One approach might be for each thread to select a target row, try to add its ID to a "processing" table, then remove it in the same transaction in which you update added_customer.
If a thread fetches no candidate rows, or fails to insert then it just needs to sleep briefly and try again. If anything goes badly wrong then you should have rows left in the "processing" table that you can inspect/correct.
Of course the other option is to just grab a set of candidate rows and spawn a separate process/thread for each that communicates with the ERP. That keeps the database fetching single-threaded while allowing multiple channels to the ERP.
You can add a column user_is_proccesed to the table. It can hold the process ID of the backend that is updating the record.
Then use a small serializable transaction to set user_is_proccesed, which marks the row as locked for processing.
Something like:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE user_table
SET user_is_proccesed = pg_backend_pid()
WHERE <some condition>
AND user_is_proccesed IS NULL; -- no one is processing it now
COMMIT;
The key thing here - with SERIALIZABLE only one transaction can successfully update the record (all other concurrent SERIALIZABLE updates will fail with ERROR: could not serialize access due to concurrent update).
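The claim-by-UPDATE idea can be illustrated with a small runnable sketch. It rests on assumptions: it uses Python's built-in sqlite3 module instead of Postgres, the two workers are simulated sequentially on one connection rather than running concurrently, and try_claim and the worker IDs are hypothetical (in Postgres the value would come from pg_backend_pid()).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_table ("
    "id INTEGER PRIMARY KEY, "
    "added_customer INTEGER NOT NULL DEFAULT 0, "
    "user_is_proccesed INTEGER)"  # NULL means nobody has claimed the row
)
conn.executemany("INSERT INTO user_table (id) VALUES (?)", [(1,), (2,)])
conn.commit()

def try_claim(conn, row_id, worker_id):
    """Claim a row for processing; returns True only for the worker that got it."""
    cur = conn.execute(
        "UPDATE user_table SET user_is_proccesed = ? "
        "WHERE id = ? AND user_is_proccesed IS NULL",
        (worker_id, row_id),
    )
    conn.commit()
    # Exactly one affected row means this worker now owns the row.
    return cur.rowcount == 1

claimed_by_101 = try_claim(conn, 1, 101)  # first worker claims row 1
claimed_by_202 = try_claim(conn, 1, 202)  # second worker is too late for row 1
other_row = try_claim(conn, 2, 202)       # but can still claim row 2
print(claimed_by_101, claimed_by_202, other_row)
```

Only the worker whose UPDATE affected a row goes on to call the ERP web service. With truly concurrent SERIALIZABLE transactions in Postgres, the losing worker would see the serialization error quoted above instead of a zero row count, so it should catch that error and move on to another row.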