Performance of truncate and insert vs update - postgresql

I have a table with more than 1 million records and table is growing everyday.I need to update two columns of that table everyday. What is the best way either to truncate the table and insert or update row wise?
example :-
today
userid activitycount
1 18
tomorrow
userid activitycount
1 19

Make sure that the fillfactor of the table is less than 50 and that the updated columns are not indexed.
Then the updates will become HOT updates that don't need to modify any index, and autovacuum will make sure that tomorrow's update will find enough free space.
The disadvantage is the bloat you have with this method, but you don't need to create new tables and rename them, which may be problematic with concurrent transactions.

Is faster to truncate table and copy it again. On Postgres docs you can learn how to do to populate tables with big datasets:
This section contains some suggestions on how to make this process as efficient as possible.
Use Copy: Use COPY to load all the rows in one command, instead of using a series of INSERT commands.
Remove Indexes: if you need indexes, just create indexes when data is already inserted.
Remove Foreign Key Constraints: Create constraints when data is already inserted.
Tuning Postgres installation: maintenance_work_mem, max_wal_size, Disable WAL Archival and Streaming Replication, ...

Related

Huge delete on PostgreSQL table : Deleting 99,9% of the rows of the table

I have a table in my PostgreSQL database that became huge, filled with a lot of useless rows.
As these useless rows represent 99.9% of my table data (about 3.3M rows), I was wondering if deleting them could have a bad impact on my DB :
I know that this operation could take some time and I will be able to block writes on the table during the maintenance operation
But I was wondering if this huge change in the data could also impact performance after the opertation itself.
I found solutions like creating a new table / using TRUNCATE to drop all lines but as this operation will be specific and one shot, I would like to be able to choose the most adapted solution.
I know that Postgre SQL has a VACUUM mechanism but I'm not a DBA expert : Could anyone please confirm that this delete will not impact my table integrity / data structure and that freed space will be reclaimed if needed for new data ?
PostgreSQL 11.12, with default settings on AWS RDS. I don't have any index on my table and the criteria for rows deletion will not be based on the PK
Deleting rows typically does not shrink a PostgreSQL table, sou you would then have to run VACUUM (FULL) to compact it, during which the table is inaccessible.
If you are deleting many rows, both the DELETE and the VACUUM (FULL) will take a long time, and you would be much better off like this:
create a new table that is defined like the old one
INSERT INTO new_tab SELECT * FROM old_tab WHERE ... to copy over the rows you want to keep
drop foreign key constraints that point to the old table
create all indexes and constraints on the new table
drop the old table and rename the new one
By planning that carefully, you can get away with a short down time.

Cloning a Postgres table, including indexes and data

I am trying to create a clone of a Postgres table using plpgsql.
To date I have been simply truncating table 2 and re-inserting data from table 1.
TRUNCATE TABLE "dbPlan"."tb_plan_next";
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
As code this works as expected, however "dbPlan"."tb_plan" contains around 3 million records and therefore completes in around 20 minutes. This is too long and has a knock on effects on other processes.
It's important that all constraints, indexes and data are copied exactly to table 2.
I had tried dropping the table and re-creating it, however this did not improve speed.
DROP TABLE IF EXISTS "dbPlan"."tb_plan_next";
CREATE TABLE "dbPlan"."tb_plan_next" (LIKE "dbPlan"."tb_plan" INCLUDING ALL);
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
Is there a better method for achieving this?
I am considering creating the table and then creating indexes as a second step.
PostgreSQL doesn't provide a very elegant way of doing this. You could use pg_dump with -t and --section= to dump the pre-data and post-data for the table. Then you would replay the pre-data to create the table structure and the check constraints, then load the data from whereever you get it from, then replay the post-data to add the indexes and FK constraints.

Bulk update Postgres table

I have a table with around 200 million records and I have added 2 new columns to it. Now the 2 columns need values from a different table. Nearly 80% of the rows will be updated.
I tried update but it takes more than 2 hours to complete.
The main table has a composite primary key of 4 columns. I have dropped it and dropped an index that is present on a column before updating. Now the update takes little over than 1 hour.
Is there any other way to speed up this update process (like batch processing).
Edit: I used the other table(from where values will be matched for update) in from clause of the update statement.
Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints.
After the update, the table will be bloated to about twice its original size.
That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.
Do you need whole update in single transaction? I had quite similar problem, with table that was under heavy load, and column required not null constraint. Do deal with it - I did some steps:
Add columns without constraints like not null, but with defaults. That way it went really fast.
Update columns in steps like 1000 entries per transaction. In my case load of the DB rise, so I had to put small delay.
Update columns to have not null constraints.
That way you don't block table for long time, but that is not an answer to your question.
First to validate where you are - I would check iostats to see if that is not the limit... To speed up, I would consider:
higher free space map - to be sure DB is aware of entries that can be removed, but note that if pages are packed to the limit it would not bring much...
maybe foreign keys referring to the table can be also removed? To stop locking the table,
removing all indices since they are slowing down, and create them afterwords - that looks like slicing problem but other way, but is an option, so counts...
There is a 2 type of solution to your problem.
1) This approach work if your main table doesn't update or inserted during this process
First create the same table schema without composite primary key and index with a different name.
Then insert the data in the new table with join table data.
Apply all constraints and indexes on the new table after insert.
Drop the old table and rename the new table with the old table name.
2) Or you can use a trigger to update that two-column on insert or update event. (This will make insert update operation slightly slow)

Is the speed of a PostgreSQL SELECT adversely affected by too many indexes on the table?

I have read that when having a lot of indexes on a database It can seriously hurt the performance but in the PostgreSQL doc I can't find anything about it.
I have a very big table with something like 100 columns and a billion rows and often I have to do a lot of searches in a lot of different fields.
Does the performance of the PostgreSQL table will drop if I add a lot of indexes (maybe 10 unique column indexes and 5 or 7 3 column indexes)?
EDIT: With performance drop I mean the performance in fetching rows (select), the database will be updated once a month so the update and insert time are not an issue.
The indexes are maintained when the content of the table has been modified (i.e. INSERT, UPDATE, DELETE)
The query planner of PostgreSQL can decide when to use an index and when it's not needed and a sequential scan is more optimal.
So having too many indexes will hurt the modifying performance, not the fetching.
The indexes will have to be updated at each insert and update involving those columns.
I have some charts about that on my site: http://use-the-index-luke.com/sql/dml
An index is pure redundancy. It contains only data that is also stored
in the table. During write operations, the database must keep those
redundancies consistent. Specifically, it means that insert, delete
and update not only affect the table but also the indexes that hold a
copy of the affected data.
The chapter titles suggest the impact that indexes can have:
Insert — cannot take direct benefit from indexes
Delete — uses indexes for the where clause
Update — does not affect all indexes of the table

efficiently trimming postgresql tables

I have about 10 tables with over 2 million records and one with 30 million. I would like to efficiently remove older data from each of these tables.
My general algorithm is:
create a temp table for each large table and populate it with newer data
truncate the original tables
copy tmp data back to original tables using: "insert into originaltable (select * from tmp_table)"
However, the last step of copying the data back is taking longer than I'd like. I thought about deleting the original tables and making the temp tables "permanent", but I lose constraint/foreign key info.
If I delete from the tables directly, it takes much longer. Given that I need to preserve all foreign keys and constraints, are there any faster ways of removing the older data?
Thanks.
The fastest process is likely to be exactly as you've outlined:
Copy new data into a temporary table
Drop indexes and foreign keys
Drop the old table
Copy the temporary table back to the old table name
Rebuild indexes and foreign keys.
The Postgres manual has some suggestions on perfomance, too, that may or may not apply. Frankly, however, it is significantly quicker to drop a table than to drop millions of rows (since each delete is performed tuple by tuple) and it is significantly quicker to insert millions of rows into a table with no constraints or indexes (as each constraint must be checked and each index must be updated for each record insert; by removing all constraints, you limit this to a single build of the index and a single verification for the constraint).
The "standard" solution for these problems typically involves partitioning your tables on the appropriate key, such that when you need to delete old data, you can simply drop a whole partition -- certainly the fastest deletion that you will ever get.
However, partitioning in PostgreSQL isn't as easy as some other databases -- you need to relocate data manually using triggers, and there are caveats (e.g. no global primary keys)
See the PostgreSQL manual on Partitioning