I have a script that performs a bunch of updates on a moderately large (approximately 6 million rows) table, based on data read from a file.
It currently begins and then commits a transaction for each row it updates, and I want to improve its performance somehow. I wonder if starting a single transaction at the beginning of the script's run and then rolling back to individual savepoints whenever a validation error occurs would actually result in a performance increase.
I looked online but haven't had much luck finding any documentation or benchmarks.
COMMIT is mostly an I/O problem, because the transaction log (WAL) has to be synchronized to disk.
So using subtransactions (savepoints) will very likely boost performance. But beware that using more than 64 subtransactions per transaction will again hurt performance if you have concurrent transactions.
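A minimal sketch of that pattern, assuming a hypothetical table my_table with id and val columns (the real script would issue these statements from its host language):

BEGIN;
SAVEPOINT row_sp;
UPDATE my_table SET val = 'corrected' WHERE id = 1;
-- if this row fails validation, undo only this row's work:
-- ROLLBACK TO SAVEPOINT row_sp;
SAVEPOINT row_sp;   -- re-using the name simply starts a new subtransaction
UPDATE my_table SET val = 'corrected' WHERE id = 2;
-- ... one savepoint per row read from the file ...
COMMIT;             -- a single WAL flush instead of one per row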
If you can live with losing some committed transactions in the event of a database server crash (which is rare), you could simply set synchronous_commit to off and stick with many small transactions.
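For example, assuming that trade-off is acceptable, the setting can be limited to the session (or even a single transaction) that runs the script:

-- this session only: a crash may lose its most recent commits, but cannot corrupt data
SET synchronous_commit = off;
-- or scoped to one transaction
BEGIN;
SET LOCAL synchronous_commit = off;
-- ... the updates ...
COMMIT;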
Another, more complicated method is to process the rows in batches without using subtransactions and to repeat the whole batch in case of a problem.
Having a single transaction with only one COMMIT should be faster than having one single-row update transaction per row, because each COMMIT must synchronize the WAL to disk. But how much faster it really is depends a lot on the environment (number of transactions, table structure, index structure, UPDATE statement, PostgreSQL configuration, system configuration, etc.): only you can benchmark it in your environment.
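If you want a rough number before touching the real script, a sketch of such a benchmark with psql (table and column names are made up, and a scratch copy keeps the test away from production data):

CREATE TABLE bench_copy AS SELECT * FROM my_table LIMIT 100000;
\timing on
-- variant A: one transaction around the whole batch
BEGIN;
UPDATE bench_copy SET val = val;   -- stand-in for the real per-row updates
COMMIT;
-- variant B: one transaction per row is what the current script does;
-- time a representative slice of the real script for comparison.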
I've been reading through the WAL chapter of the Postgres manual and was confused by a portion of the chapter:
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
How is it that continuously writing to WAL is more performant than simply writing to the table/index data itself?
As I see it (forgetting for now the resiliency benefits of WAL), Postgres needs to complete two disk operations: first it needs to commit to the WAL on disk, and then it still needs to change the table data to be consistent with the WAL. I'm sure there's a fundamental aspect of this I've misunderstood, but it seems like adding an additional step between a client transaction and the final state of the table data couldn't actually increase overall performance. Thanks in advance!
You are fundamentally right: the extra writes to the transaction log will per se not reduce the I/O load.
But a transaction will normally touch several files (tables, indexes etc.). If you force all these files out to storage (“sync”), you will incur more I/O load than if you sync just a single file.
Of course all these files will have to be written and sync'ed eventually (during a checkpoint), but often the same data are modified several times between two checkpoints, and then the corresponding files will have to be sync'ed only once.
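As an aside, you can watch this batching effect in the statistics; a sketch against pg_stat_bgwriter (on newer versions some of these counters have moved to pg_stat_checkpointer):

SELECT checkpoints_timed,     -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,       -- checkpoints requested because enough WAL accumulated
       buffers_checkpoint,    -- buffers written out by checkpoints
       buffers_backend        -- buffers backends had to write themselves
FROM   pg_stat_bgwriter;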
With slow query logging turned on, we see a lot of COMMITs taking upwards of multiple seconds to complete on our production database. On investigation, these are generally simple transactions: fetch a row, UPDATE the row, COMMIT. The SELECTs and UPDATEs in these particular transactions aren't being logged as slow. Is there anything we can do, or tools that we can use, to figure out the reason for these slow commits? We're running on an SSD, and are streaming to a slave, if that makes a difference.
Postgres commits are synchronous by default. This means a COMMIT waits for its WAL writes to be flushed to disk before returning to the client. You can adjust the WAL settings in the config file to compensate for this.
You can set commits to asynchronous at the session or user level, or database-wide, with the synchronous_commit setting in the config file.
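For instance (the role and database names below are placeholders), the scope can be a single session, a role, or a whole database:

SET synchronous_commit = off;                        -- current session only
ALTER ROLE batch_user SET synchronous_commit = off;  -- for one role
ALTER DATABASE mydb SET synchronous_commit = off;    -- for one database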
On the database side:
Vacuum your tables and update the statistics. This will get rid of dead tuples; since you're performing updates, there will be many.
VACUUM ANALYZE
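A sketch of how that might look for a single table (the table name is made up): check how many dead tuples have piled up, then vacuum and analyze the worst offenders:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;

VACUUM (VERBOSE, ANALYZE) my_table;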
I have found a bug in my application code where I have started a transaction, but never commit or do a rollback. The connection is used periodically, just reading some data every 10s or so. In the pg_stat_activity table, its state is reported as "idle in transaction", and its backend_start time is over a week ago.
What is the impact on the database of this? Does it cause additional CPU and RAM usage? Will it impact other connections? How long can it persist in this state?
I'm using PostgreSQL 9.1 and 9.4.
Since you only SELECT, the impact is limited. It is more severe for any write operations, where the changes are not visible to any other transaction until committed - and lost if never committed.
It does cost some RAM and permanently occupies one of your allowed connections (which may or may not matter).
One of the more severe consequences of very long running transactions: it blocks VACUUM from doing its job, since there is still an old transaction that can see old rows. The system will start bloating.
In particular, SELECT acquires an ACCESS SHARE lock (the least blocking of all) on all referenced tables. This does not interfere with other DML commands like INSERT, UPDATE or DELETE, but it will block DDL commands as well as TRUNCATE or VACUUM (including autovacuum jobs). See "Table-level Locks" in the manual.
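You can check what such a backend actually holds; a sketch against pg_locks (replace 12345 with the pid reported by pg_stat_activity):

SELECT locktype, relation::regclass AS relation, mode, granted
FROM   pg_locks
WHERE  pid = 12345;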
It can also interfere with various replication solutions and lead to transaction ID wraparound in the long run if it stays open long enough / you burn enough XIDs fast enough. More about that in the manual on "Routine Vacuuming".
Blocking effects can mushroom if other transactions are blocked from committing and those have acquired locks of their own. Etc.
You can keep transactions open (almost) indefinitely - until the connection is closed (which also happens when the server is restarted, obviously.)
But never leave transactions open longer than needed.
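To find such sessions, and as a last resort end them, something along these lines works on 9.2+ (on 9.1 the view has different column names, e.g. procpid and current_query):

SELECT pid, usename, state, xact_start, query
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
ORDER  BY xact_start;

-- last resort: terminating the backend rolls back its open transaction
-- SELECT pg_terminate_backend(12345);   -- pid taken from the query above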
There are two major impacts to the system.
The tables that have been used in those transactions:
are not vacuumed, which means they are not "cleaned up" and their statistics aren't updated, which might lead to bad (=slow) execution plans
cannot be changed using ALTER TABLE
We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB, reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired, and of course no other reads or writes happen while a write lock is held. MongoDB operations yield periodically to keep threads that are waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? Because there is no way to tell MongoDB to access the data without a read lock, nor a way to prioritize the requests (at least not yet), whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is, when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data), to try running smaller aggregation queries on short time slices. This will accomplish two things:
read jobs will be shorter lived and therefore finish quicker; this will give you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all old data into RAM at once - by spacing out these analytical queries over time you will minimize the impact they have on current write performance.
Depending on what it is you can't afford about getting another server, you might consider getting a short-lived AWS instance, which may not be very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when making that copy of your data: doing a full sync off the production system will place a heavy load on it (a more effective way would be to resume from a recent backup/file snapshot).
Such operations are best left to the slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks block reads. And while you can't prioritize queries, MongoDB yields long-running read/write queries. Their concurrency docs should help.
If you can't afford another server, you can set up a slave on the same machine, provided you have some spare RAM/disk headroom and you use the slave lightly/occasionally. You must be careful, though: your disk I/O will increase significantly.
In a COBOL batch program, which is better in performance terms?
With commit:
IF SW-NEW-TRANSACT
    EXEC SQL
        COMMIT
    END-EXEC
END-IF.
PERFORM SOMETHING
    THRU SOMETHING-EXIT.
IF SW-ERROR
    EXEC SQL
        ROLLBACK
    END-EXEC
END-IF.
With savepoints:
IF SW-NEW-TRANSACT
    EXEC SQL
        SAVEPOINT NAMEPOINT ON ROLLBACK RETAIN CURSORS
    END-EXEC
END-IF.
PERFORM SOMETHING
    THRU SOMETHING-EXIT.
IF SW-ERROR
    EXEC SQL
        ROLLBACK TO SAVEPOINT NAMEPOINT
    END-EXEC
END-IF.
SAVEPOINTs and COMMITs are not interchangeable.
A process always has to COMMIT or ROLLBACK its database work at some point. A COMMIT is taken between transactions (between complete units of work). COMMIT may be taken after each transaction or, as is common in batch processing, after some number of transactions. A COMMIT should never be taken mid-transaction (this defeats the unit-of-work concept).
A SAVEPOINT is generally taken and possibly released within a single unit of work. SAVEPOINTs should always be released upon completion of a unit of work. They are always released upon a COMMIT.
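As a plain-SQL sketch of that lifecycle (the savepoint name is arbitrary, matching the question's example):

SAVEPOINT NAMEPOINT ON ROLLBACK RETAIN CURSORS;
-- ... updates for the tentative ("blind alley") branch ...
ROLLBACK TO SAVEPOINT NAMEPOINT;   -- back out only the work done since the savepoint
-- or, if the branch succeeded and the savepoint is no longer needed:
RELEASE SAVEPOINT NAMEPOINT;
COMMIT;   -- the COMMIT at the end of the unit of work releases any remaining savepoints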
The purpose of a SAVEPOINT is to allow partial backout from within a unit of work. This is useful when a process begins with a sequence of common database inserts/updates followed by a process branch where some updates may be performed before it can be determined that the alternate process branch should have been executed. The SAVEPOINT allows backing out of the "blind alley" branch and then continuing on with the alternate branch while preserving the common "up front" work. Without a SAVEPOINT, backing out of a "blind alley" might have required extensive data buffering within the transaction (complex processing) or a ROLLBACK and re-do from the start of the transaction with some sort of flag indicating that the alternative process branch needs to be followed. All this leads to complex application logic.
ROLLBACK TO SAVEPOINT has several advantages. It can preserve "up front" work, saving the cost of doing it over. It saves rolling back the entire transaction. Rollbacks can be more "expensive" than the original inserts/updates were and may span multiple transactions (depending on the commit frequency). Finally, process complexity is generally reduced when database work can be selectively "undone" through a ROLLBACK TO SAVEPOINT.
How might SAVEPOINT be used to improve the efficiency of your batch program? If your transactions employ self-induced rollbacks to recover from "blind alley" processing, then SAVEPOINT can be a huge benefit. Similarly, if the internal processing logic is complicated by the need to avoid performing database updates for similar "backout" requirements, then SAVEPOINT can be used to refactor the process into something quite a bit simpler and probably more efficient. Outside of these factors, SAVEPOINT is not going to affect performance in a positive manner.
Some claim that having a high COMMIT frequency in a batch program reduces performance, and consequently that the lower the commit frequency, the better the performance. Tuning COMMIT frequency is not trivial. The lower the commit frequency, the longer database resources are held and, consequently, the greater the probability of inducing database timeouts. Suffering a database timeout generally causes a process to roll back. The rollback is a very expensive operation: ROLLBACKs are a big hit to the DBMS itself, and your transaction needs to re-apply all of its updates a second time once it is restarted. Lowering the commit frequency can end up costing you a lot more than it gains. BEWARE!
EDIT
Rule of thumb: Commits have a cost. Rollbacks have a higher cost. Discounting rollbacks due to bad data, device failure and program abends (all of which should be rare), most rollbacks are caused by timeouts due to resource contention among processes. Doing fewer commits increases db contention, yet doing fewer commits may improve performance. The trick is to find the point where the performance gained by not committing outweighs the cost of rollbacks due to contention. There are a large number of factors that influence this - many of them dynamic. My overall advice is to look elsewhere to improve performance - tuning commit frequency (where timeouts are not the issue) is generally a low-return investment.
Other, more fruitful ways to improve batch performance often involve:
improving parallelism by load splitting and running multiple images of the same job
analyzing DB2 bind plans and optimizing access paths
profiling the behaviour of the batch programs and refactoring the parts that consume the most resources
This isn't a performance issue at all.
You COMMIT when you finish a unit of work, whatever a unit of work means to your application. Usually, it means that you've processed a complete transaction. In the batch world, you'd take a commit after 1,000 to 2,000 transactions, so you don't spend all your time COMMITing. The number depends on how many transactions you can rerun in the event of a ROLLBACK.
You ROLLBACK when you've encountered an error of some sort, either a database error or an application error.
You SAVEPOINT when you are processing a complex unit of work, and you want to save what you've done without taking a full COMMIT. In other words, you would take one or more SAVEPOINTs and then finally take a COMMIT.