Are two identical sequential reads in a Postgres transaction guaranteed return the same records? - postgresql

I was wondering if transactions in Postgres freeze the state of a table, similar to the way a point-in-time does in Elasticsearch.
If have a query with a where clause, and I inside a transaction I run it, first as a count(*), then with a select, and finally, with an update, do I need to be worried about a different process inserting a record into the db and throwing off the results?

In the default transaction isolation level READ COMMITTED, each statement in a transaction sees a different state (snapshot) of the database.
If you want all statements in a transaction to see the same snapshot, you will have to use the REPEATABLE READ isolation level. However, there is the possibility that concurrent data manipulations cause the UPDATE to fail with a serialization error, forcing you to repeat the transaction.

Related

does sql statement ensure atomicity in postgres

I have a simple bug in my program that uses multi user support. I'm using knex to build sql queries, and I have a pseudocode that depicts the scenerio:
const value = queryBuilder().readDataFromTheDatabase();//executes this
//do some other work and get value
queryBuilder.writeValueToTheDatabase(updateValue(value));
This piece of code is being use in sort of a middleware function. And as you can see, this is a possible race condition i.e. when multiple users access the thing, one of them gets a stale value when they try to execute this at roughly the same amount of time.
My solution
So, I was think a possible solution would be create a single queryBuilder statement:
queryBuilder().readAndUpdateValueInTheDatabase();
So, I'll probably have to use a little bit of plpgsql. I was wondering if this solution will be sufficient. Will the statement be executed atomically? i.e. When one request reads and doesn't finish his write, does another request wait around to both read and write or just waits to write but, reads the stale value?
I think what you are looking for here is isolation, not atomicity. You could set all transactions to the highest isolation level, serializable (which is higher than the usual default level). With that level, if data that a transaction read (and presumably relied upon) is changed, then when it tries to commit it might get a serialization failure error. I say "might", because the system could conclude the situation would be consistent with the data change having happened after the commit, in which case the commit is allowed to stand.
To avoid a race condition with such a setup, you must run both the read and the write in the same database transaction.
There are two ways to do that:
Use the default READ COMMITTED isolation level and lock the rows when you read them:
SELECT ... FROM ... FOR NO KEY UPDATE;
That locks the rows against concurrent modifications, and the lock is held until the end of the transaction.
Use the REPEATABLE READ isolation level and don't lock anything. Then your UPDATE will receive a serialization error (SQLSTATE 40001) if somebody modified the row concurrently. In that case, you roll the transaction back and try again in a new REPEATABLE READ transaction.
The first solution is usually better if you expect conflicts frequently, while the second is better if conflicts are rare.
Note that you should keep the database transaction as short as possible in both cases to keep the risk of conflicts low.
Transaction in PostgreSQL use an optimistic locking model when accessing to tables, while some other DBMS do pessimistic locking (IBM Db2) or the two locking model (MS SQL Server).
Optimistic locking snapshot the data on which you are working, and the modifications are done on the snapshot until the transaction ended. When the transaction finishes, the snapshot modifications are postponed on the real database (table rows), but if some other user had made a change between the moment of the snapshot capture and the commit, then the commit cannot apply and the COMMIT is rejected as a ROLLBACK.
You can try to raise the ISOLATION LEVEL (REPEATABLE READ or SERIALIZABLE) to avoid the trouble.

What's the benefit of wrapping an insert into a transacation in Postgres or other relational database servers?

I'm trying to increase the insert into number per second on Postgres. Now I can insert into around 200-300 rows into a table per second.
I found the ORM I'm using on node.js, TypeORM. Default to wrap every insert with transaction as its comments on source code
https://github.com/typeorm/typeorm/blob/master/src/repository/SaveOptions.ts#L18-L22
/**
* By default transactions are enabled and all queries in persistence operation are wrapped into the transaction.
* You can disable this behaviour by setting { transaction: false } in the persistence options.
*/
transaction?: boolean;
if i set it to false without transaction which increased around 100 per second. I know transaction can make multiple sql commands to have atom ability, if one of the command failed, postgres will recover it. But why Typeorm default to wrap .save api to enable transaction? since one single .save only send a insert into once without other SQL statement.
What's the benefit of wrapping an insert into in a transaction in Postgres or other RDBMS?
I don't understand most of what you wrote, since I don't know typeorm. But I can answer the question what the benefit of wrapping INSERT statements into a transaction is.
All relational database that provide ACID guarantees will write a transaction log that is used to replay the transaction in the case of a crash or to recover from a backup. This transaction log has to be persisted to disk at the end of every transaction.
Now if you run 1000 INSERT statements in a single transaction, the transaction log has to be persisted only once. But if each INSERT runs in its own transaction (the default in PostgreSQL), you'll end up with 1000 I/O requests to persist the transaction log. That will clearly affect performance.

How do transactions work in the context of reads to the database?

I am using transactions to make changes to a SQL database. As I understand it, this means that changes to the database will happen in an all-or-nothing fashion. What I want to know is, does this have any guarantees for reads? For example, suppose I have some (pseudo)-code like this:
1) start TRANSACTION
2) INSERT INTO users ... // insert some data
3) count = SELECT COUNT(*) FROM ... // count something in the database
4) if count > 10: // do something based on the read
5) INSERT INTO other_table ... // write based on the read
6) COMMMIT TRANSACTION
In this code, I'm doing an INSERT, followed by a SELECT, and then conditionally doing another INSERT based on the outcome of the SELECT.
So my question is, if another process modifies the database between steps (3) and (5), what happens to the count variable, and to my transaction?
If it makes a difference, I am using PostgreSQL.
As Xin pointed out, it depends on the isolation level.
At the default READ COMMITTED level, records from other sessions will become visible as they are committed; you would see the same records if you didn't start a transaction at all (though of course, other processes would see your inserts appear at different times).
With REPEATABLE READ, your queries will not see any records committed by other sessions after your transaction starts. But while you don't have to worry about the result of SELECT COUNT(*) changing during your transaction, you can't assume that this result will still be accurate by the time you commit.
Using SERIALIZABLE provides the strongest guarantee: if your script does the right thing when given exclusive access to the database, then it will do the right thing in the presence of other serialisable transactions (or it will fail outright). However, this means that all transactions which might interfere with yours must be using the same isolation level (which comes at a cost), and all must be prepared to retry their transaction in the event of a serialisation failure.
When serialisable transactions are not an option, you generally guard against race conditions by explicitly locking things against concurrent writes. It's often enough to lock a selection of records, but you can't exactly lock the result of a COUNT(*); in your case, you'd probably need to lock the whole table.
I am not working on postgreSQL, but I think I can answer your question. Think of every query is parallel. I am saying so, because there are 2 transactions: when you insert into a; others can insert into b; then when you check b; whether you can see the new data depends on your isolation setting (read committed or just dirty read).
Also please note that, in database, there is a technology called lock: you can lock a table so that prevent altering it from others before committing your transaction.
Hope

How to wait during SELECT that pending INSERT commit?

I'm using PostgreSQL 9.2 in a Windows environment.
I'm in a 2PC (2 phase commit) environment using MSDTC.
I have a client application, that starts a transaction at the SERIALIZABLE isolation level, inserts a new row of data in a table for a specific foreign key value (there is an index on the column), and vote for completion of the transaction (The transaction is PREPARED). The transaction will be COMMITED by the Transaction Coordinator.
Immediatly after that, outside of a transaction, the same client requests all the rows for this same specific foreign key value.
Because there may be a delay before the previous transaction is really commited, the SELECT clause may return a previous snapshot of the data. In fact, it does happen sometimes, and this is problematic. Of course the application may be redesigned but until then, I'm looking for a lock solution. Advisory Lock ?
I already solved the problem while performing UPDATE on specific rows, then using SELECT...FOR SHARE, and it works well. The SELECT waits until the transaction commits and return old and new rows.
Now I'm trying to solve it for INSERT.
SELECT...FOR SHARE does not block and return immediatley.
There is no concurrency issue here as only one client deals with a specific set of rows. I already know about MVCC.
Any help appreciated.
To wait for a not-yet-committed INSERT you'd need to take a predicate lock. There's limited predicate locking in PostgreSQL for the serializable support, but it's not exposed directly to the user.
Simple SERIALIZABLE isolation won't help you here, because SERIALIZABLE only requires that there be an order in which the transactions could've occurred to produce a consistent result. In your case this ordering is SELECT followed by INSERT.
The only option I can think of is to take an ACCESS EXCLUSIVE lock on the table before INSERTing. This will only get released at COMMIT PREPARED or ROLLBACK PREPARED time, and in the mean time any other queries will wait for the lock. You can enforce this via a BEFORE trigger to avoid the need to change the app. You'll probably get the odd deadlock and rollback if you do it that way, though, because INSERT will take a lower lock then you'll attempt lock promotion in the trigger. If possible it's better to run the LOCK TABLE ... IN ACCESS EXCLUSIVE MODE command before the INSERT.
As you've alluded to, this is mostly an application mis-design problem. Expecting to see not-yet-committed rows doesn't really make any sense.

Do writes in SNAPSHOT isolation level block writes in another SNAPSHOT transaction in SQL Server 2008 R2

For SNAPSHOt isolation level in SQL Server 2008 R2, the following is mentioned in MSDN ADO.Net documentation:
Transactions that modify data do not block transactions that read data, and transactions that read data do not block transactions that write data, as they normally would under the default READ COMMITTED isolation level in SQL Server.
There is no mention of whether writes will block writes, when both transactions are in SNAPSHOT isolation mode. So my question is as follows:
Will writes in a SNAPSHOT transaction1 block writes to same tables in another SNAPHOT transaction2?
LATEST UPDATE
After doing a lot of thinking on my question, I am coming to a conclusion as mentioned in paragraph below. Hope others can throw more light on this.
There is no relational database in which writes do NOT block writes. In other words, writes will always block writes. Writes would include statements like INSERT or UPDATE or DELETE. This is true no matter which isolation level you use, since all relational databases need to implement data consistency, when multiple writes are happening in database. Of course, the simultaneous writes need to be conflicting ( as in inserting into the same table or updating the same row/s) for this blocking to occur.
Ligos is actually incorrect - if two separate transactions are trying to update the same record with Snapshot on, transaction 2 WILL be blocked until transaction 1 releases the lock. Then, and ONLY then, will you get error 3960. I realize this thread is over 2 years old, but I wanted to avoid miss-information being out there.
Even the link Ligos references says the exact same thing I am mentioning above (check out the last non-red paragraph)
Write vs. Write will only not be blocked if the two records (ie. rows) trying to be updated are different
No. They will not block. Instead, the UPDATE command in trans2 will fail with error number 3960.
Because of how SNAPSHOT isolation level works, any UPDATE command may fail. The only way you can tell is to catch and handle error 3960 (it is called optimistic concurrency because you don't expect this situation to happen very often).
I ended up testing this empirically, because it's not entirely obvious from the documentation. This blog post illustrates it nicely though.
Assumption: both trans1 and trans2 are UPDATEing the same row in the same table. Updating two different rows should work just fine.