What is the best approach for upserting large number of rows into a single table? - postgresql

Im working on a product that involves large number of upsert operations into a single table.
We are dealing with a time-based data and using timescaledb hypertables with 7 days chunk interval size. we have concurrent tasks that upserts data into a single table, and in extreme cases its possible that we will have 40 concurrent tasks, each one upserting around 250k rows, all to the same table.
Initially we decided to go with the approach of deleting all the old rows and then inserting the updated ones with a COPY FROM statement, but when we got to test the system on large scale these COPYs took long time to finish, eventually resulting in the db's CPU usage to reach 100%, and become unresponsive.
We also noticed that the index size of the table increased radically and filled up the disk usage to 100%, and SELECT statements took extremely long time to execute (over 10 minutes). We concluded that the reason for that was large amount of delete statements that caused index fragmentation, and decided to go with another approach.
following the answers on this post, we decided to copy all the data to a temporary table, and then upsert all the data to the actual table using an "extended insert" statement -
INSERT INTO table SELECT * FROM temp_table ON CONFLICT DO UPDATE...;
our tests show that it helped with the index fragmentation issues, but still large upsert operations of ~250K take over 4 minutes to execute, and during this upsert process SELECT statements take too long to finish which is unacceptable for us.
I'm wondering whats the best approach to create this upsert operation with as low impact to the performance of SELECTs as possible. The only thing that comes in mind right now is to split the insert into smaller chunks -
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 0 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 50000 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 100000 ON CONFLICT DO UPDATE ...;
...
but if we batch the inserts, is there any advantage of first copying all the data into a temporary table? will it perform better then a simple multi-row insert statement?
and how do i decide whats the best chunk size to use when splitting up the upsert? is
using a temporary table and upserting the rows directly from it allows for a bigger chunk sizes?
Is there any better approach to achieve this? any suggestion would be appreciated

There are a handful of things that you can do to speed up data loading:
Have no index or foreign key on the table while you load data (check constraints are fine). I am not suggesting that you drop all your constraints, but you could for example use partitioning, load the data into a new table, then create indexes and constraints and attach the table as a new partition.
Load the data with COPY. If you cannot use COPY, use a prepared statement for the INSERT to save on parsing time.
Load many rows in a single transaction.
Set max_wal_size high so that you get no more checkpoints than necessary.
Get fast local disks.

Related

Postgres index creation time on temp table

I have a statement trigger that executes a function after some rows are updated.
With the transition table of the trigger (joined with some other tables) I need to create a temporary table, which I will use to query later.
The thing is, the query on this temp table is pretty fast but when the number of rows increase it gets slower. So I was trying to add an index.
My question is how can I measure the impact of the index creation? If the number of rows is not high, the index is not being used, so I would need to know if the index creation is expensive.
The cost of creating a b-tree index is dominated by the cost of sorting for bigger tables: O(n * log(n))
For small tables, the cost is low. So I wouldn't worry about creating the index even for small tables where it is not used.
You shouldn't forget to ANALYZE the table, however, because autovacuum doesn't do that for temporary tables.

Temp table updates is slower then normal table in postgresql

I have a situation where updates on my temp table is slow. Below is the scenario
Created temp table in session for every session,first time temp table created and then going forward doing insert,update and delete operations this operations until session ends only.
First i'm inserting the rows and based on rows i'm updateing other columns. but this updates is slow compared to norma table. i checked the performance by replacing temp table whereas normal table taking around 50 to 60s but temp table is taking nearly 5 mins.
I tried analyze on temp table, then i got the improved performance. when im using analyze the updates are completed in with 50 seconds.
I tried Types also, but no luck.
Record count in temp table is 480
Can anyone help to imprrove the performance on temp table with out analyze OR any alternative for bulk collect and bulk insert in user defined types
All the above ooperations i'm doing in postgresql.
The lack of information in your question forces me to guess, but if all other things are equal, the difference is probably that you don't have accurate statistics on the temporary table. For normal tables (which are visible to the public), autovacuum takes care of that automatically, but for temporary tables, you have to call ANALYZE explicitly to gather table statistics.

Should I use database trigger in insert heavy operation table?

My use-case is that I need to copy a few columns from TABLE A to another TABLE B and also derive values of a few other columns of TABLE B by some calculation.
As per current estimation around 50,000 rows will be inserted on daily basis in TABLE A.
TABLE B should be updated with all data before End of day.
Hence, either I can use trigger which will be invoked on INSERT operation on TABLE A or schedule some Job at EOD which read all data in bulk from TABLE A , do some calculation and insert in TABLE B.
As I am new to trigger, I am not sure which option should i pick for this use-case. Any suggestion which would be a better approach ?
So far what I have read about triggers, they can slow down DBs performance if they are invoked frequently.
As around 50,000 insert operation will happen daily , so can I assume that 50,000 falls under heavy operations where triggers would not be beneficial ?
EDIT 1 : 50,000 insert operation will reach 100,000 insert operations daily
Postgres DB is used.
If you are doing bulk COPY into an unindexed table, adding a simple trigger will slow you down by a lot (like 5 fold). But if you are using single-row INSERTs or the table is indexed, the marginal slow down of adding a simple trigger will be pretty small.
50,000 inserts per day is not very many. You should have no trouble using a trigger on that, unless the trigger has to scan a large table on each call or something.

Bulk update Postgres table

I have a table with around 200 million records and I have added 2 new columns to it. Now the 2 columns need values from a different table. Nearly 80% of the rows will be updated.
I tried update but it takes more than 2 hours to complete.
The main table has a composite primary key of 4 columns. I have dropped it and dropped an index that is present on a column before updating. Now the update takes little over than 1 hour.
Is there any other way to speed up this update process (like batch processing).
Edit: I used the other table(from where values will be matched for update) in from clause of the update statement.
Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints.
After the update, the table will be bloated to about twice its original size.
That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.
Do you need whole update in single transaction? I had quite similar problem, with table that was under heavy load, and column required not null constraint. Do deal with it - I did some steps:
Add columns without constraints like not null, but with defaults. That way it went really fast.
Update columns in steps like 1000 entries per transaction. In my case load of the DB rise, so I had to put small delay.
Update columns to have not null constraints.
That way you don't block table for long time, but that is not an answer to your question.
First to validate where you are - I would check iostats to see if that is not the limit... To speed up, I would consider:
higher free space map - to be sure DB is aware of entries that can be removed, but note that if pages are packed to the limit it would not bring much...
maybe foreign keys referring to the table can be also removed? To stop locking the table,
removing all indices since they are slowing down, and create them afterwords - that looks like slicing problem but other way, but is an option, so counts...
There is a 2 type of solution to your problem.
1) This approach work if your main table doesn't update or inserted during this process
First create the same table schema without composite primary key and index with a different name.
Then insert the data in the new table with join table data.
Apply all constraints and indexes on the new table after insert.
Drop the old table and rename the new table with the old table name.
2) Or you can use a trigger to update that two-column on insert or update event. (This will make insert update operation slightly slow)

Deleting rows in Postgres table using ctid

We have a table with nearly 2 billion events recorded. As per our data model, each event is uniquely identified with 4 columns combined primary key. Excluding the primary key, there are 5 B-tree indexes each on single different columns. So totally 6 B-tree indexes.
The events recorded span for years and now we need to remove the data older than 1 year.
We have a time column with long values recorded for each event. And we use the following query,
delete from events where ctid = any ( array (select ctid from events where time < 1517423400000 limit 10000) )
Does the indices gets updated?
During testing, it didn't.
After insertion,
total_table_size - 27893760
table_size - 7659520
index_size - 20209664
After deletion,
total_table_size - 20226048
table_size - 0
index_size - 20209664
Reindex can be done
Command: REINDEX
Description: rebuild indexes
Syntax:
REINDEX { INDEX | TABLE | DATABASE | SYSTEM } name [ FORCE ]
Considering #a_horse_with_no_name method is the good solution.
What we had:
Postgres version 9.4.
1 table with 2 billion rows with 21 columns (all bigint) and 5 columns combined primary key and 5 individual column indices with date spanning 2 years.
It looks similar to time-series data with a time column containing UNIX timestamp except that its analytics project, so time is not at an ordered increase. The table was insert and select only (most select queries use aggregate functions).
What we need: Our data span is 6 months and need to remove the old data.
What we did (with less knowledge on Postgres internals):
Delete rows at 10000 batch rate.
At inital, the delete was so fast taking ms, as the bloat increased each batch delete increased to nearly 10s. Then autovacuum got triggered and it ran for almost 3 months. The insert rate was high and each batch delete has increased the WAL size too. Poor stats in the table made the current queries so slow that they ran for minutes and hours.
So we decided to go for Partitioning. Using Table Inheritance in 9.4, we implemented it.
Note: Postgres has Declarative Partitioning from version 10, which handles most manual work needed in partitioning using Table Inheritance.
Please go through the official docs as they have clear explanation.
Simplified and how we implemented it:
Create parent table
Create child table inheriting it with check constraints. (We had monthly partitions and created using schedular)
Indexes are need to be created separately for each child table
To drop old data, just drop the table, so vacuum is not needed and will be instant.
Make sure to have the postgres property constraint_exclusion to partition.
VACUUM ANALYZE the old partition after started inserting in the new partition. (In our case, it helped the query planner to use Index-Only scan instead of Seq. scan)
Using Triggers as mentioned in the docs may make the inserts slower, so we deviated from it, as we partitioned based on time column, we calculated the table name at application level based on time value before every insert and it didn't affect the insert rate for us.
Also read other caveats mentioned there.