Bulk update Postgres table

Bulk update Postgres table - postgresql

I have a table with around 200 million records and I have added 2 new columns to it. Now the 2 columns need values from a different table. Nearly 80% of the rows will be updated.
I tried update but it takes more than 2 hours to complete.
The main table has a composite primary key of 4 columns. I have dropped it and dropped an index that is present on a column before updating. Now the update takes little over than 1 hour.
Is there any other way to speed up this update process (like batch processing).
Edit: I used the other table(from where values will be matched for update) in from clause of the update statement.

Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints.
After the update, the table will be bloated to about twice its original size.
That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.

Do you need whole update in single transaction? I had quite similar problem, with table that was under heavy load, and column required not null constraint. Do deal with it - I did some steps:
Add columns without constraints like not null, but with defaults. That way it went really fast.
Update columns in steps like 1000 entries per transaction. In my case load of the DB rise, so I had to put small delay.
Update columns to have not null constraints.
That way you don't block table for long time, but that is not an answer to your question.
First to validate where you are - I would check iostats to see if that is not the limit... To speed up, I would consider:
higher free space map - to be sure DB is aware of entries that can be removed, but note that if pages are packed to the limit it would not bring much...
maybe foreign keys referring to the table can be also removed? To stop locking the table,
removing all indices since they are slowing down, and create them afterwords - that looks like slicing problem but other way, but is an option, so counts...

There is a 2 type of solution to your problem.
1) This approach work if your main table doesn't update or inserted during this process
First create the same table schema without composite primary key and index with a different name.
Then insert the data in the new table with join table data.
Apply all constraints and indexes on the new table after insert.
Drop the old table and rename the new table with the old table name.
2) Or you can use a trigger to update that two-column on insert or update event. (This will make insert update operation slightly slow)

Related

Huge delete on PostgreSQL table : Deleting 99,9% of the rows of the table

I have a table in my PostgreSQL database that became huge, filled with a lot of useless rows.
As these useless rows represent 99.9% of my table data (about 3.3M rows), I was wondering if deleting them could have a bad impact on my DB :
I know that this operation could take some time and I will be able to block writes on the table during the maintenance operation
But I was wondering if this huge change in the data could also impact performance after the opertation itself.
I found solutions like creating a new table / using TRUNCATE to drop all lines but as this operation will be specific and one shot, I would like to be able to choose the most adapted solution.
I know that Postgre SQL has a VACUUM mechanism but I'm not a DBA expert : Could anyone please confirm that this delete will not impact my table integrity / data structure and that freed space will be reclaimed if needed for new data ?
PostgreSQL 11.12, with default settings on AWS RDS. I don't have any index on my table and the criteria for rows deletion will not be based on the PK

Deleting rows typically does not shrink a PostgreSQL table, sou you would then have to run VACUUM (FULL) to compact it, during which the table is inaccessible.
If you are deleting many rows, both the DELETE and the VACUUM (FULL) will take a long time, and you would be much better off like this:
create a new table that is defined like the old one
INSERT INTO new_tab SELECT * FROM old_tab WHERE ... to copy over the rows you want to keep
drop foreign key constraints that point to the old table
create all indexes and constraints on the new table
drop the old table and rename the new one
By planning that carefully, you can get away with a short down time.

Performance of truncate and insert vs update

I have a table with more than 1 million records and table is growing everyday.I need to update two columns of that table everyday. What is the best way either to truncate the table and insert or update row wise?
example :-
today
userid activitycount
1 18
tomorrow
userid activitycount
1 19

Make sure that the fillfactor of the table is less than 50 and that the updated columns are not indexed.
Then the updates will become HOT updates that don't need to modify any index, and autovacuum will make sure that tomorrow's update will find enough free space.
The disadvantage is the bloat you have with this method, but you don't need to create new tables and rename them, which may be problematic with concurrent transactions.

Is faster to truncate table and copy it again. On Postgres docs you can learn how to do to populate tables with big datasets:
This section contains some suggestions on how to make this process as efficient as possible.
Use Copy: Use COPY to load all the rows in one command, instead of using a series of INSERT commands.
Remove Indexes: if you need indexes, just create indexes when data is already inserted.
Remove Foreign Key Constraints: Create constraints when data is already inserted.
Tuning Postgres installation: maintenance_work_mem, max_wal_size, Disable WAL Archival and Streaming Replication, ...

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The files contain data like these "article_name,article_time,start_time,end_time"
There is a contraint on the table: for the same article name, i don't insert a new row if the new article_time falls in an existing range [start_time,end_time] for the same article.
ie: don't insert row y if exists [start_time_x,end_time_x] for which time_article_y inside range [start_time_x,end_time_x] , with article_name_y = article_name_x
I tried with psycopg by selecting the existing article names ad checking manually if there is an overlap --> too long
I tried again with psycopg, this time by setting a condition 'exclude using...' and tryig to insert with specifying "on conflict do nothing" (so that it does not fail) but still too long
I tried the same thing but this time trying to insert many values at each call of execute (psycopg): it got a little better (1M rows processed in almost 10minutes), but still not as fast as it needs to be for the amount of data I have (500M+)
I tried to parallelize by calling the same script many time, on different files but the timing didn't get any better, I guess because of the locks on the table each time we want to write something
Is there any way to create a lock only on rows containing the same article_name? (and not a lock on the whole table?)
Could you please help with any idea to make this parallellizable and/or more time efficient?
Lots of thanks folks

Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?

I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of the it being a columnar database so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.

I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table, you have to use temporary tables etc if you want to insert a column somewhere in the middle.

Personally I have seen rebuilding the table works best.
I do it in following ways
Create a new table N_OLD_TABLE table
Define the datatype/compression encoding in the new table
Insert data into N_OLD(old_columns) select(old_columns) from old_table Rename OLD_Table to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. Doesn't block any table and you always have a backup of old table incase anything goes wrong

efficiently trimming postgresql tables

I have about 10 tables with over 2 million records and one with 30 million. I would like to efficiently remove older data from each of these tables.
My general algorithm is:
create a temp table for each large table and populate it with newer data
truncate the original tables
copy tmp data back to original tables using: "insert into originaltable (select * from tmp_table)"
However, the last step of copying the data back is taking longer than I'd like. I thought about deleting the original tables and making the temp tables "permanent", but I lose constraint/foreign key info.
If I delete from the tables directly, it takes much longer. Given that I need to preserve all foreign keys and constraints, are there any faster ways of removing the older data?
Thanks.

The fastest process is likely to be exactly as you've outlined:
Copy new data into a temporary table
Drop indexes and foreign keys
Drop the old table
Copy the temporary table back to the old table name
Rebuild indexes and foreign keys.
The Postgres manual has some suggestions on perfomance, too, that may or may not apply. Frankly, however, it is significantly quicker to drop a table than to drop millions of rows (since each delete is performed tuple by tuple) and it is significantly quicker to insert millions of rows into a table with no constraints or indexes (as each constraint must be checked and each index must be updated for each record insert; by removing all constraints, you limit this to a single build of the index and a single verification for the constraint).

The "standard" solution for these problems typically involves partitioning your tables on the appropriate key, such that when you need to delete old data, you can simply drop a whole partition -- certainly the fastest deletion that you will ever get.
However, partitioning in PostgreSQL isn't as easy as some other databases -- you need to relocate data manually using triggers, and there are caveats (e.g. no global primary keys)
See the PostgreSQL manual on Partitioning

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Bulk update Postgres table - postgresql

Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints. After the update, the table will be bloated to about twice its original size. That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.

Related

Huge delete on PostgreSQL table : Deleting 99,9% of the rows of the table

Performance of truncate and insert vs update

Postgres parallel/efficient load huge amount of data psycopg

Implications of using ADD COLUMN on large dataset

efficiently trimming postgresql tables

Categories

Resources