In PostgreSQL, the documentation for the MVCC concurrency control mechanism states:
MVCC locks acquired for querying (reading) data do not conflict with
locks acquired for writing data, and so reading never blocks writing
and writing never blocks reading
So, even under READ COMMITTED, an UPDATE statement will lock the affected rows so that no other transaction can modify them until the current transaction commits or rolls back.
If a concurrent transaction issues an UPDATE on the locked rows, the second transaction will block until the first one releases its locks.
Is this behavior meant to prevent write-write conflicts?
Lost updates can still happen in READ COMMITTED: after the first transaction commits, the second one will overwrite the row (even if the data changed between the start and the end of its UPDATE). So if lost updates are still possible, why does the second transaction have to wait? Couldn't row-level snapshots be used to store the uncommitted changes, so that transactions wouldn't have to wait for write locks to be released?
The answer to the first question is yes. No DBMS can allow dirty writes: if two transactions T1 and T2 are executing concurrently and T2 overwrites an update from T1, the system cannot handle the case where T1 subsequently issues a ROLLBACK, since T2's update has already been applied.
To avoid dirty writes, the original definition of snapshot isolation was "first committer wins": conflicting writes were allowed to happen, but only the first transaction to issue a COMMIT would succeed; all other conflicting transactions would have to ROLLBACK. But this programming model is somewhat problematic, if not wasteful, since a transaction might update a significant proportion of the database only to be refused the ability to COMMIT at the end. So, instead of "first committer wins", most DBMSs that support MVCC implement "first updater wins" using fairly traditional two-phase locking.
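To make this concrete, here is a minimal two-session sketch of "first updater wins" under READ COMMITTED in PostgreSQL (the accounts table and values are hypothetical):

-- Session 1
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- takes a row-level lock on id = 1

-- Session 2, concurrently
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- blocks, waiting for session 1

-- Session 1
COMMIT;
-- Session 2's UPDATE now unblocks, re-reads the freshly committed
-- row version, and applies its change on top of it.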
Related
I have a script that performs a bunch of updates on a moderately large (approximately 6 million rows) table, based on data read from a file.
It currently begins and then commits a transaction for each row it updates, and I wanted to improve its performance somehow. I wonder if starting a single transaction at the beginning of the script's run and rolling back to individual savepoints in case of validation errors would actually result in a performance increase.
I looked online but haven't had much luck finding any documentation or benchmarks.
COMMIT is mostly an I/O problem, because the transaction log (WAL) has to be synchronized to disk.
So using subtransactions (savepoints) will very likely boost performance. But beware that using more than 64 subtransactions per transaction will again hurt performance if you have concurrent transactions.
If you can live with losing some committed transactions in the event of a database server crash (which is rare), you could simply set synchronous_commit to off and stick with many small transactions.
Another, more complicated method is to process the rows in batches without using subtransactions and repeating the whole batch in case of a problem.
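For illustration, a minimal sketch of the savepoint pattern in SQL (table, column, and savepoint names are hypothetical):

BEGIN;
-- for each row read from the file:
SAVEPOINT row_sp;
UPDATE big_table SET val = 'new value' WHERE id = 42;
-- if validation of this row failed:
ROLLBACK TO SAVEPOINT row_sp;
-- if the row was processed successfully:
RELEASE SAVEPOINT row_sp;
-- after all rows:
COMMIT;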
Having a single transaction with only one COMMIT should be faster than having one transaction per updated row, because each COMMIT must synchronize the WAL write to disk. But how much faster it is depends greatly on the environment (number of transactions, table structure, index structure, UPDATE statement, PostgreSQL configuration, system configuration, etc.): only you can benchmark it in your environment.
My case: I have connected to another Greenplum (GP) database to import data into my PostgreSQL tables, and I have written Java schedulers to refresh them daily. But when I try to fetch the records each day using SQL functions, I get the error Greenplum Database does not support REPEATABLE READ transactions. Can anyone suggest how I can load data frequently from GP to PostgreSQL without this isolation hassle?
I know I can refresh the tables by executing:
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
But I'm not able to use the same statement inside my functions, because they run within transaction blocks.
Unlike database systems that rely on locking for concurrency control, Greenplum Database (like PostgreSQL) maintains data consistency by using a multiversion model (Multiversion Concurrency Control, MVCC). This means that while querying a database, each transaction sees a snapshot of the data, which protects the transaction from viewing inconsistent data that could be caused by (other) concurrent updates on the same data rows. This provides transaction isolation for each database session. In a nutshell, readers don't block writers and writers don't block readers. Each transaction sees a snapshot of the database rather than locking tables.
Transaction Isolation Levels
The SQL standard defines four transaction isolation levels. In Greenplum Database, you can request any of the four standard transaction isolation levels. But internally, there are only two distinct isolation levels — read committed and serializable:
read committed — When a transaction runs at this isolation level, a SELECT query sees only data committed before the query began. It never sees either uncommitted data or changes committed during query execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. Note that two successive SELECT commands can see different data, even within a single transaction, if other transactions commit changes during execution of the first SELECT.

UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found.

The partial transaction isolation provided by read committed mode is adequate for many applications, and this mode is fast and simple to use. However, for applications that do complex queries and updates, it may be necessary to guarantee a more rigorously consistent view of the database than read committed mode provides.
serializable — This is the strictest transaction isolation. This level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently. Applications using this level must be prepared to retry transactions due to serialization failures.

When a transaction is at the serializable level, a SELECT query sees only data committed before the transaction began. It never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Successive SELECT commands within a single transaction always see the same data.

UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the serializable transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, its effects are negated and the serializable transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it), then the serializable transaction will be rolled back.
read uncommitted — Treated the same as read committed in Greenplum Database.
repeatable read — Treated the same as serializable in Greenplum Database.
The default transaction isolation level in Greenplum Database is read committed. To change the isolation level for a transaction, you can declare the isolation level when you BEGIN the transaction, or else use the SET TRANSACTION command after the transaction is started.
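For example, a sketch of both forms (my_table is a hypothetical table):

-- Declare the isolation level when beginning the transaction:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM my_table;
COMMIT;

-- Or set it after the transaction has started; note that
-- SET TRANSACTION must come before the first query:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM my_table;
COMMIT;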
I have found a bug in my application code where a transaction is started but never committed or rolled back. The connection is used periodically, just reading some data every 10 seconds or so. In pg_stat_activity, the session's state is reported as "idle in transaction", and its backend_start time is over a week ago.
What is the impact on the database of this? Does it cause additional CPU and RAM usage? Will it impact other connections? How long can it persist in this state?
I'm using PostgreSQL 9.1 and 9.4.
Since you only SELECT, the impact is limited. It would be more severe for write operations, whose changes are not visible to any other transaction until committed - and are lost if never committed.
It does cost some RAM and permanently occupies one of your allowed connections (which may or may not matter).
One of the more severe consequences of very long-running transactions: they block VACUUM from doing its job, since there is still an old transaction that can see the old rows. The system will start bloating.
In particular, SELECT acquires an ACCESS SHARE lock (the least blocking of all) on all referenced tables. This does not interfere with other DML commands like INSERT, UPDATE, or DELETE, but it will block DDL commands as well as TRUNCATE and VACUUM FULL, which need an ACCESS EXCLUSIVE lock. See "Table-level Locks" in the manual.
It can also interfere with various replication solutions and lead to transaction ID wraparound in the long run if it stays open long enough / you burn enough XIDs fast enough. More about that in the manual on "Routine Vacuuming".
Blocking effects can mushroom if other transactions are blocked from committing and those have acquired locks of their own. Etc.
You can keep transactions open (almost) indefinitely - until the connection is closed (which also happens when the server is restarted, obviously.)
But never leave transactions open longer than needed.
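As an aside, a query along these lines lists such sessions so they can be investigated and, if need be, terminated (a sketch for 9.2+; on 9.1 the view exposes procpid and current_query = '<IDLE> in transaction' instead of pid, state and query):

SELECT pid, backend_start, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- then, if necessary, end the offending session:
SELECT pg_terminate_backend(12345);  -- 12345 = the pid found above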
There are two major impacts on the system.
The tables that have been used in those transactions:
are not vacuumed, which means they are not "cleaned up" and their statistics aren't updated, which might lead to bad (= slow) execution plans
cannot be changed using ALTER TABLE
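One way to see this happening (a sketch) is to check when tables were last (auto)vacuumed and analyzed, and how many dead rows have piled up:

SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;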
The PostgreSQL wiki describes an approach to implementing UPSERT that uses a retry loop. Implicit in this solution is the use of "subtransaction IDs". The wiki article carries the following warning:
The correct solution is slow and clumsy to use, and is unsuitable for significant amounts of data. It also potentially burns through a lot of subtransaction IDs - avoiding burning XIDs is an explicit goal of the current "native UPSERT in PostgreSQL" effort.
What is the consequence of using "a lot of subtransaction IDs"? I don't really know what a subtransaction ID is - is this just a way of numbering nested transactions, and is the implication that these numbers might run out?
The resource in question is the 32-bit XID transaction counter itself, which the engine uses to determine whether the version of a row in a table belongs to an "old" transaction (committed or rolled back) or to a not-yet-committed transaction, and whether it is visible from any given transaction.
Consuming XIDs at a very high rate creates or increases the risk of a transaction ID wraparound issue. In the worst case, this issue escalates into a database self-shutdown to avoid data inconsistencies.
What avoids the transaction ID wraparound is routine vacuuming. This is detailed in the doc under Preventing Transaction ID Wraparound Failures.
But autovacuum is a background task that is meant to stay out of the way of foreground activity. Among other things, it cancels itself instead of locking out other queries. At times, it can lag far behind.
We can imagine a worst case where foreground database activity consumes XID values so fast that autovacuum simply doesn't have time to freeze the rows with "old XIDs" before those XID values are claimed by a new transaction or subtransaction, a situation PostgreSQL couldn't deal with.
It might also be that those foreground transactions stay uncommitted while this is going on, so even an aggressive vacuum couldn't do anything about it.
That's why programmers should be cautious about techniques that make this event more likely, like opening/closing subtransactions in huge loops.
The range is about 2 billion transactions. That kind of limit seemed unreachable when the system was designed, but it becomes problematic as hardware capabilities, and what we ask of our databases, keep increasing.
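For reference, the wiki's retry loop looks roughly like this (a sketch with hypothetical table and column names). Each pass through the BEGIN ... EXCEPTION block opens a subtransaction, and every subtransaction that writes consumes an XID:

CREATE FUNCTION merge_db(k INT, v TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- first try to update an existing row
        UPDATE db SET val = v WHERE key = k;
        IF found THEN
            RETURN;
        END IF;
        -- no row: try to insert; a concurrent session may insert
        -- the same key first, so trap the duplicate-key error
        BEGIN
            INSERT INTO db (key, val) VALUES (k, v);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, loop back and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;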
In a COBOL batch program, which is better in performance terms?
With commit:
IF SW-NEW-TRANSACT
EXEC SQL
COMMIT
END-EXEC
END-IF.
PERFORM SOMETHING
THRU SOMETHING-EXIT.
IF SW-ERROR
EXEC SQL
ROLLBACK
END-EXEC
END-IF.
With savepoints:
IF SW-NEW-TRANSACT
EXEC SQL
SAVEPOINT NAMEPOINT ON ROLLBACK RETAIN CURSORS
END-EXEC
END-IF.
PERFORM SOMETHING
THRU SOMETHING-EXIT.
IF SW-ERROR
EXEC SQL
ROLLBACK TO SAVEPOINT NAMEPOINT
END-EXEC
END-IF.
SAVEPOINTs and COMMITs are not interchangeable.
A process always has to COMMIT or ROLLBACK database work at some point. A COMMIT is taken between transactions (between complete units of work). COMMIT may be taken after each transaction or, as is common in batch processing, after some number of transactions. A COMMIT should never be taken mid-transaction (this defeats the UNIT OF WORK concept).

A SAVEPOINT is generally taken, and possibly released, within a single unit of work. SAVEPOINTs should always be released upon completion of a unit of work; they are always released upon a COMMIT.
The purpose of a SAVEPOINT is to allow partial backout from within a unit of work. This is useful when a process begins with a sequence of common database inserts/updates, followed by a process branch where some updates may be performed before it can be determined that the alternate process branch should have been executed. The SAVEPOINT allows backing out of the "blind alley" branch and then continuing with the alternate branch while preserving the common "up front" work. Without a SAVEPOINT, backing out of a "blind alley" might require extensive data buffering within the transaction (complex processing) or a ROLLBACK and re-do from the start of the transaction with some sort of flag indicating that the alternate process branch needs to be followed. All this leads to complex application logic.
ROLLBACK TO SAVEPOINT has several advantages.
It can preserve "up front" work, saving the cost of doing it over. It saves rolling back the entire transaction. Rollbacks can be more
"expensive" than the original inserts/updates were and may span multiple transactions (depending on the commit frequency).
Finally, process complexity is generally reduced when database work can be
selectively "undone" through a ROLLBACK TO SAVEPOINT.
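In generic SQL terms, the pattern looks something like this (a sketch; tables and values are hypothetical):

BEGIN;                                                     -- start the unit of work
UPDATE account SET balance = balance - 100 WHERE id = 1;  -- common "up front" work

SAVEPOINT branch_a;                                        -- mark before the speculative branch
UPDATE inventory SET qty = qty - 1 WHERE item_id = 42;     -- branch A updates

ROLLBACK TO SAVEPOINT branch_a;                            -- branch A was a "blind alley": back it out
UPDATE backorders SET qty = qty + 1 WHERE item_id = 42;    -- branch B, with the "up front" work preserved

COMMIT;                                                    -- complete the unit of work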
How might SAVEPOINT be used to improve the efficiency of your batch program? If your transactions employ self-induced rollbacks to recover from "blind alley" processing, then SAVEPOINT can be a huge benefit. Similarly, if the internal processing logic is complicated by the need to avoid performing database updates for similar "backout" requirements, then SAVEPOINT can be used to refactor the process into something quite a bit simpler and probably more efficient. Outside of these factors, SAVEPOINT is not going to affect performance in a positive manner.
Some claim that a high COMMIT frequency in a batch program reduces performance, and consequently that the lower the commit frequency, the better the performance. Tuning COMMIT frequency is not trivial. The lower the commit frequency, the longer database resources are held and, consequently, the greater the probability of inducing database timeouts. Suffering a database timeout generally causes a process to roll back. The rollback is a very expensive operation: ROLLBACKs are a big hit to the DBMS itself, and your transaction needs to re-apply all of its updates a second time once it is restarted. Lowering commit frequency can end up costing you a lot more than it gains. BEWARE!
EDIT
Rule of thumb: commits have a cost; rollbacks have a higher cost. Discounting rollbacks due to bad data, device failure, and program abends (all of which should be rare), most rollbacks are caused by timeouts arising from resource contention among processes. Doing fewer commits increases DB contention, yet doing fewer commits may improve performance. The trick is to find the point where the performance gained by not committing outweighs the cost of rollbacks due to contention. A large number of factors influence this, many of them dynamic. My overall advice is to look elsewhere to improve performance: tuning commit frequency (where timeouts are not the issue) is generally a low-return investment.
Other, more fruitful ways to improve batch performance often involve:
improving parallelism by load splitting and running multiple images of the same job
analyzing DB2 bind plans and optimizing access paths
profiling the behaviour of the batch programs and refactoring the parts that consume the most resources
This isn't a performance issue at all.
You COMMIT when you finish a unit of work, whatever a unit of work means to your application. Usually, it means that you've processed a complete transaction. In the batch world, you'd take a commit after 1,000 to 2,000 transactions, so you don't spend all your time COMMITing. The number depends on how many transactions you can rerun in the event of a ROLLBACK.
You ROLLBACK when you've encountered an error of some sort, either a database error or an application error.
You SAVEPOINT when you are processing a complex unit of work, and you want to save what you've done without taking a full COMMIT. In other words, you would take one or more SAVEPOINTs and then finally take a COMMIT.