What's the equivalent of INSERT ... ON CONFLICT that first tries update? - postgresql

I've been using INSERT ... ON CONFLICT DO UPDATE to insert/update data. The thing is, I know that most of the time I will want to do an update: for every day, update a counter. If there's no row for that date yet, create it. That creation will only happen once (obviously), but the update could happen millions of times per day. Is using INSERT ... ON CONFLICT DO UPDATE still the right approach? Is there an equivalent of "try to update first, and if that fails, insert the row" (like an actual "UPSERT")?

There is no variant of UPDATE that has the same behavior, for the simple reason that it would do exactly the same thing as INSERT ... ON CONFLICT. Don't worry about the name.
If you have millions of updates for each row per day, you should worry much more about VACUUM.
If you can, do not index the attributes that will be updated frequently, and create the table with a fillfactor lower than 100. Then you can get the much more efficient “HOT updates”, which will significantly reduce the amount of disk writes and VACUUM required.
Make sure to tune autovacuum to be more aggressive by lowering autovacuum_vacuum_cost_delay.
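For the daily-counter case above, a minimal sketch might look like this (the table, its columns and the fillfactor value are all illustrative, not taken from the question):
-- Counter column is deliberately not indexed, and fillfactor leaves room for HOT updates.
CREATE TABLE daily_counters (
    day  date PRIMARY KEY,
    hits bigint NOT NULL DEFAULT 0
) WITH (fillfactor = 70);
-- One statement covers both the single insert per day and the millions of updates.
INSERT INTO daily_counters (day, hits)
VALUES (current_date, 1)
ON CONFLICT (day) DO UPDATE SET hits = daily_counters.hits + 1;
-- Make autovacuum more aggressive (value is illustrative); reload the configuration afterwards.
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();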

Related

How to avoid being blocked by deadlocks?

Can I write an UPDATE statement that will simply not bother executing if there's a deadlock?
I have a small, but frequently updated table.
This statement is run quite frequently on it....
UPDATE table_a SET lastChangedTime = 'blah' WHERE pk = 1234;
Where pk is the primary key.
Every now and again this statement gets blocked. That's not in itself a big deal; the issue is that each time there's a lock it seems to take a minute or two for Postgres to sort itself out, and I can lose a lot of data.
table_a is very volatile, and lastChangedTime gets altered all the time, so rather than occasionally having to wait two minutes for the UPDATE to get executed, I'd rather it just didn't bother. Ok, my data might not be as up-to-date as I'd like for this one record, but at least I wouldn't have locked the whole table for 2 minutes.
Update following comments:
The application interacts very simply with the database: it only issues simple, one-line UPDATE and INSERT statements, and commits each one immediately. One of the things causing me a lot of head scratching is how something can work a million times without a problem, and then just fail on another record that appears identical to all the others.
Final suggestion/question.....
The UPDATE statement is being invoked from a C# application. If I change the 'command timeout' to a very short value - say 1 millisecond - would that have the desired effect? Or might it end up clogging up the database with lots of broken transactions?
To avoid waiting for locks, first run
SELECT 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;
If there is a lock on the row, the statement will fail immediately, and you can go on working on something different.
Mind that the SELECT ... FOR UPDATE statement has to be in the same database transaction as your UPDATE statement.
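Put together, the pattern looks roughly like this (error handling is left to the application: if the SELECT fails, ROLLBACK and move on):
BEGIN;
-- Fails immediately instead of waiting if another transaction holds a lock on the row.
SELECT 1 FROM table_a WHERE pk = 1234 FOR UPDATE NOWAIT;
-- Only reached if the lock was acquired; must run in the same transaction.
UPDATE table_a SET lastChangedTime = 'blah' WHERE pk = 1234;
COMMIT;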
As general advice, use shorter transactions; that will reduce the length of lock waits and the risk of deadlock.

Does trigger(after insert) on table slow down inserting into this table

I have a big table (bo_sip_cti_event) which is too large to even run queries on, so I made an identical table (bo_sip_cti_event_day), added an AFTER INSERT trigger on bo_sip_cti_event that copies all the same values into bo_sip_cti_event_day, and now I am wondering whether I have significantly slowed down inserts into bo_sip_cti_event.
So generally, does an AFTER INSERT trigger slow down operations on this table?
Yes, the trigger must slow down inserts.
The reason is that relational databases are ACID compliant: All actions, including side-effects like triggers, must be completed before the update transaction completes. So triggers must be executed synchronously, and that consumes CPU, and in your case I/O too, which ultimately takes more time. There's no getting around it.
The answer is yes: it is additional overhead, so obviously it takes time to finish the transaction with the additional trigger execution.
Your design makes me wonder if:
You explored all options to speed up your large table. Even billions of rows can be handled quite well if you have proper indexes etc. But it all depends on the table, the design, the data and the queries.
What exactly your trigger is doing. The table name suffix "_day" raises questions about when, where and how exactly this table is cleaned out at midnight. Hopefully not inside the trigger function, and hopefully not with a DELETE FROM.
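For illustration of where that overhead comes from, such a trigger might look roughly like this (a sketch only; the real column layout and trigger body are not shown in the question):
CREATE OR REPLACE FUNCTION copy_event_to_day() RETURNS trigger AS $$
BEGIN
    -- Runs synchronously for every row inserted into bo_sip_cti_event.
    INSERT INTO bo_sip_cti_event_day VALUES (NEW.*);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_copy_event_day
AFTER INSERT ON bo_sip_cti_event
FOR EACH ROW EXECUTE PROCEDURE copy_event_to_day();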

Postgres: is it possible to lock some rows for changing?

I have pretty old tables which hold records of clients' payments and commissions for several years. In the regular business cycle it is sometimes necessary to recalculate commissions and update the table, but usually the recalculation covers only 1 or 2 months back, not more.
Recently, as a result of a bug in a PHP script, our developer recalculated commissions since the very beginning. The recalculation process is really complicated, so it can't be undone just by grabbing yesterday's backup - the data changes in numerous databases, so restoring it is a really complicated and awfully expensive procedure. Add complaints from clients and changes in accounting... you know. Horror.
We can't split the tables by periods. (Well, we can, but it would take a year to rework all the data selects.)
What I'm thinking about is setting up an update trigger that would compare the date of the record being changed against an allowed cutoff date and reject changes to records older than that cutoff. So in case of a mistake or bug, anyone who tried to update such a 'restricted' row would get an exception and the data would stay unchanged.
Is that a good approach? And how can that be done - I mean, with a trigger?
For Postgres you can use a check constraint to ensure that allowed_date is always less than update_date:
ALTER TABLE mytable ADD CONSTRAINT datecheck CHECK (allowed_date < update_date);
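If you do want the trigger the question asks about, a minimal sketch could look like this (the table name commissions, the column period_date and the two-month cutoff are all illustrative assumptions):
CREATE OR REPLACE FUNCTION block_old_period_updates() RETURNS trigger AS $$
BEGIN
    IF OLD.period_date < current_date - interval '2 months' THEN
        RAISE EXCEPTION 'row for % is closed and may not be modified', OLD.period_date;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_block_old_periods
BEFORE UPDATE ON commissions
FOR EACH ROW EXECUTE PROCEDURE block_old_period_updates();
A superuser or table owner can still drop or disable the trigger for a legitimate historical correction.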

Postgresql Truncation speed

We're using PostgreSQL 9.1.4 as our db server. I've been trying to speed up my test suite, so I've started profiling the db a bit to see exactly what's going on. We are using database_cleaner to truncate tables at the end of tests. YES, I know transactions are faster; I can't use them in certain circumstances, so I'm not concerned with that.
What I AM concerned with, is why TRUNCATION takes so long (longer than using DELETE) and why it takes EVEN LONGER on my CI server.
Right now, locally (on a Macbook Air) a full test suite takes 28 minutes. Tailing the logs, each time we truncate tables... ie:
TRUNCATE TABLE table1, table2 -- ... etc
it takes over 1 second to perform the truncation. Tailing the logs on our CI server (Ubuntu 10.04 LTS), it takes a full 8 seconds to truncate the tables, and a build takes 84 minutes.
When I switched over to the :deletion strategy, my local build took 20 minutes and the CI server went down to 44 minutes. This is a significant difference and I'm really blown away as to why this might be. I've tuned the DB on the CI server, it has 16gb system ram, 4gb shared_buffers... and an SSD. All the good stuff. How is it possible:
a. that it's SO much slower than my Macbook Air with 2gb of ram
b. that TRUNCATION is so much slower than DELETE when the postgresql docs state explicitly that it should be much faster.
Any thoughts?
This has come up a few times recently, both on SO and on the PostgreSQL mailing lists.
The TL;DR for your last two points:
(a) The bigger shared_buffers may be why TRUNCATE is slower on the CI server. Different fsync configuration or the use of rotational media instead of SSDs could also be at fault.
(b) TRUNCATE has a fixed cost, but it is not necessarily lower than that of a DELETE, plus it does more work. See the detailed explanation that follows.
UPDATE: A significant discussion on pgsql-performance arose from this post. See this thread.
UPDATE 2: Improvements have been added to 9.2beta3 that should help with this, see this post.
Detailed explanation of TRUNCATE vs DELETE FROM:
While not an expert on the topic, my understanding is that TRUNCATE has a nearly fixed cost per table, while DELETE is at least O(n) for n rows; worse if there are any foreign keys referencing the table being deleted.
I always assumed that the fixed cost of a TRUNCATE was lower than the cost of a DELETE on a near-empty table, but this isn't true at all.
TRUNCATE table; does more than DELETE FROM table;
The state of the database after a TRUNCATE table is much the same as if you'd instead run:
DELETE FROM table;
VACUUM (FULL, ANALYZE) table; (9.0+ only, see footnote)
... though of course TRUNCATE doesn't actually achieve its effects with a DELETE and a VACUUM.
The point is that DELETE and TRUNCATE do different things, so you're not just comparing two commands with identical outcomes.
A DELETE FROM table; allows dead rows and bloat to remain, allows the indexes to carry dead entries, doesn't update the table statistics used by the query planner, etc.
A TRUNCATE gives you a completely new table and indexes, as if they were just CREATEd. It's like you deleted all the records, reindexed the table and did a VACUUM FULL.
If you don't care if there's crud left in the table because you're about to go and fill it up again, you may be better off using DELETE FROM table;.
Because you aren't running VACUUM you will find that dead rows and index entries accumulate as bloat that must be scanned then ignored; this slows all your queries down. If your tests don't actually create and delete all that much data you may not notice or care, and you can always do a VACUUM or two part-way through your test run if you do. Better, let aggressive autovacuum settings ensure that autovacuum does it for you in the background.
You can still TRUNCATE all your tables after the whole test suite runs to make sure no effects build up across many runs. On 9.0 and newer, VACUUM (FULL, ANALYZE); run globally on the database is at least as good, if not better, and it's a whole lot easier.
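A rough sketch of that end-of-suite cleanup, plus per-table autovacuum tuning for the tables that churn the most (table names follow the earlier TRUNCATE example; the values are illustrative):
-- After the whole suite has run:
TRUNCATE TABLE table1, table2 RESTART IDENTITY;
VACUUM (FULL, ANALYZE);  -- 9.0+; with no table named it covers the whole database
-- Or let autovacuum keep up during the run:
ALTER TABLE table1 SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 0);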
IIRC Pg has a few optimisations that mean it might notice when your transaction is the only one that can see the table and immediately mark the blocks as free anyway. In testing, when I've wanted to create bloat I've had to have more than one concurrent connection to do it. I wouldn't rely on this, though.
DELETE FROM table; is very cheap for small tables with no f/k refs
To DELETE all records from a table with no foreign key references to it, all Pg has to do is a sequential table scan, setting the xmax of the tuples encountered. This is a very cheap operation - basically a linear read and a semi-linear write. AFAIK it doesn't have to touch the indexes; they continue to point to the dead tuples until they're cleaned up by a later VACUUM that also marks blocks in the table containing only dead tuples as free.
DELETE only gets expensive if there are lots of records, if there are lots of foreign key references that must be checked, or if you count the subsequent VACUUM (FULL, ANALYZE) table; needed to match TRUNCATE's effects within the cost of your DELETE .
In my tests here, a DELETE FROM table; was typically 4x faster than TRUNCATE at 0.5ms vs 2ms. That's a test DB on an SSD, running with fsync=off because I don't care if I lose all this data. Of course, DELETE FROM table; isn't doing all the same work, and if I follow up with a VACUUM (FULL, ANALYZE) table; it's a much more expensive 21ms, so the DELETE is only a win if I don't actually need the table pristine.
TRUNCATE table; does a lot more fixed-cost work and housekeeping than DELETE
By contrast, a TRUNCATE has to do a lot of work. It must allocate new files for the table, its TOAST table if any, and every index the table has. Headers must be written into those files, and the system catalogs may need updating too (not sure on that point, haven't checked). It then has to replace the old files with the new ones or remove the old ones, and has to ensure the file system has caught up with the changes with a synchronization operation - fsync() or similar - that usually flushes all buffers to the disk. I'm not sure whether the sync is skipped if you're running with the (data-eating) option fsync=off.
I learned recently that TRUNCATE must also flush all PostgreSQL's buffers related to the old table. This can take a non-trivial amount of time with huge shared_buffers. I suspect this is why it's slower on your CI server.
The balance
Anyway, you can see that a TRUNCATE of a table that has an associated TOAST table (most do) and several indexes could take a few moments. Not long, but longer than a DELETE from a near-empty table.
Consequently, you might be better off doing a DELETE FROM table;.
--
Note: on DBs before 9.0, CLUSTER table_id_seq ON table; ANALYZE table; or VACUUM FULL ANALYZE table; REINDEX table; would be a closer equivalent to TRUNCATE. The VACUUM FULL impl changed to a much better one in 9.0.
Brad, just to let you know. I've looked fairly deeply into a very similar question.
Related question: 30 tables with few rows - TRUNCATE the fastest way to empty them and reset attached sequences?
Please also look at this issue and this pull request:
https://github.com/bmabey/database_cleaner/issues/126
https://github.com/bmabey/database_cleaner/pull/127
Also this thread: http://archives.postgresql.org/pgsql-performance/2012-07/msg00047.php
I am sorry for writing this as an answer, but I didn't find any comment link, maybe because there are too many comments there already.
I've encountered similar issue lately, i.e.:
The time to run a test suite which used DatabaseCleaner varied widely between different systems with comparable hardware,
Changing DatabaseCleaner strategy to :deletion provided ~10x improvement.
The root cause of the slowness was a filesystem with journaling (ext4) used for database storage. During TRUNCATE operation the journaling daemon (jbd2) was using ~90% of disk IO capacity. I am not sure if this is a bug, an edge case or actually normal behaviour in these circumstances. This explains however why TRUNCATE was a lot slower than DELETE - it generated a lot more disk writes. As I did not want to actually use DELETE I resorted to setting fsync=off and it was enough to mitigate this issue (data safety was not important in this case).
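For reference, the usual knobs for a disposable test or CI database are along these lines in postgresql.conf (all of them trade away crash safety, so never use them in production):
# postgresql.conf - throwaway test/CI database only
fsync = off
synchronous_commit = off
full_page_writes = off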
A couple of alternate approaches to consider:
Create an empty database with static "fixture" data in it, and run the tests in that. When you are done, just drop the database, which should be fast.
Create a new table called "test_ids_to_delete" that contains columns for table names and primary key ids. Update your deletion logic to insert the ids/table names into this table instead, which will be much faster than running deletes. Then write a script to run "offline" to actually delete the data, either after an entire test run has finished, or overnight (sketched after the next paragraph).
The former is a "clean room" approach, while latter means there will be some test data will persist in database for longer. The "dirty" approach with offline deletes is what I'm using for a test suite with about 20,000 tests. Yes, there are sometimes problems due to having "extra" test data in the dev database but at times. But sometimes this "dirtiness" has helped us find and fixed bug because the "messiness" better simulated a real-world situation, in a way that clean-room approach never will.
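A rough sketch of the bookkeeping table and the offline delete from the second approach (all names are illustrative):
CREATE TABLE test_ids_to_delete (
    table_name text   NOT NULL,
    row_id     bigint NOT NULL
);
-- During the tests, record instead of deleting:
INSERT INTO test_ids_to_delete (table_name, row_id) VALUES ('orders', 42);
-- Later, offline, delete for real (one pair of statements per table, e.g. for "orders"):
DELETE FROM orders
WHERE id IN (SELECT row_id FROM test_ids_to_delete WHERE table_name = 'orders');
DELETE FROM test_ids_to_delete WHERE table_name = 'orders';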

PostgreSQL - Why are some queries on large datasets so incredibly slow

I have two types of queries I run often on two large datasets. They run much slower than I would expect them to.
The first type is a sequential scan updating all records:
Update rcra_sites Set street = regexp_replace(street,'/','','i')
rcra_sites has 700,000 records. It takes 22 minutes from pgAdmin! I wrote a vb.net function that loops through each record and sends an update query for each record (yes, 700,000 update queries!) and it runs in less than half the time. Hmmm....
The second type is a simple update with a relation and then a sequential scan:
Update rcra_sites as sites
Set violations='No'
From narcra_monitoring as v
Where sites.agencyid=v.agencyid and v.found_violation_flag='N'
narcra_monitoring has 1,700,000 records. This takes 8 minutes. The query planner refuses to use my indexes. The query runs much faster if I start with a set enable_seqscan = false;. I would prefer if the query planner would do its job.
I have appropriate indexes, I have vacuumed and analyzed. I optimized my shared_buffers and effective_cache_size best I know to use more memory since I have 4GB. My hardware is pretty darn good. I am running v8.4 on Windows 7.
Is PostgreSQL just this slow? Or am I still missing something?
Possibly try reducing your random_page_cost (default: 4) compared to seq_page_cost: this will reduce the planner's preference for seq scans by making random-accesses driven by indices more attractive.
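A sketch of trying that out (the value and the database name mydb are placeholders; measure before persisting anything):
-- Experiment in the current session first:
SET random_page_cost = 2;
-- If the plans and timings improve, persist it for the database:
ALTER DATABASE mydb SET random_page_cost = 2;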
Another thing to bear in mind is that MVCC means that updating a row is fairly expensive. In particular, updating every row in a table requires doubling the amount of storage for the table, until it can be vacuumed. So in your first query, you may want to qualify your update:
UPDATE rcra_sites Set street = regexp_replace(street,'/','','i')
where street ~ '/'
(AFAIK PostgreSQL doesn't automatically suppress the update if it looks like you're not actually changing anything. ISTR there was a standard trigger function added in 8.4 (?) to allow you to do that, but it's perhaps better to address it on the client side.)
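The built-in trigger function half-remembered there is suppress_redundant_updates_trigger(), added in 8.4; wiring it up looks roughly like this (the trigger name is arbitrary, prefixed with "z_" so it fires after any other BEFORE UPDATE triggers):
CREATE TRIGGER z_suppress_redundant
BEFORE UPDATE ON rcra_sites
FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();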
When a row is updated, a new row version is written.
If the new row does not fit in the same disk block, then every index entry pointing to the old row needs to be updated to point to the new row.
It is not just indexes on the updated data that need updating.
If you have a lot of indexes on rcra_sites, and only one or two frequently updated fields, then you might gain by separating the frequently updated fields into a table of their own.
You can also reduce the fillfactor percentage below its default of 100, so that some of the updates can result in new rows being written to the same block, resulting in the indexes pointing to that block not needing to be updated.
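A sketch of the fillfactor change (70 is just an example value; it only affects blocks written after the change, so existing data benefits once the table is rewritten):
ALTER TABLE rcra_sites SET (fillfactor = 70);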