Cloning a Postgres table, including indexes and data - postgresql

I am trying to create a clone of a Postgres table using plpgsql.
To date I have been simply truncating table 2 and re-inserting data from table 1.
TRUNCATE TABLE "dbPlan"."tb_plan_next";
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
As code this works as expected; however, "dbPlan"."tb_plan" contains around 3 million records, so the copy takes around 20 minutes. This is too long and has knock-on effects on other processes.
It's important that all constraints, indexes and data are copied exactly to table 2.
I had tried dropping the table and re-creating it, however this did not improve speed.
DROP TABLE IF EXISTS "dbPlan"."tb_plan_next";
CREATE TABLE "dbPlan"."tb_plan_next" (LIKE "dbPlan"."tb_plan" INCLUDING ALL);
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
Is there a better method for achieving this?
I am considering creating the table and then creating indexes as a second step.

PostgreSQL doesn't provide a very elegant way of doing this. You could use pg_dump with -t and --section= to dump the pre-data and post-data for the table. Then you would replay the pre-data to create the table structure and the check constraints, load the data from wherever you get it, and finally replay the post-data to add the indexes and FK constraints.
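The same ordering (structure first, then data, then indexes) can also be sketched in plain SQL. This is a minimal sketch rather than the pg_dump route described above; the index name and the columns plan_id and created_at are hypothetical stand-ins for whatever "dbPlan"."tb_plan" actually defines.

```sql
BEGIN;

DROP TABLE IF EXISTS "dbPlan"."tb_plan_next";

-- Copy columns, defaults and CHECK constraints only. Indexes (and with them
-- primary key and unique constraints) are left out on purpose so the bulk
-- insert does not have to maintain them row by row.
CREATE TABLE "dbPlan"."tb_plan_next"
    (LIKE "dbPlan"."tb_plan" INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

INSERT INTO "dbPlan"."tb_plan_next"
SELECT * FROM "dbPlan"."tb_plan";

-- Hypothetical key and index; recreate whatever the source table really has.
ALTER TABLE "dbPlan"."tb_plan_next" ADD PRIMARY KEY (plan_id);
CREATE INDEX tb_plan_next_created_at_idx
    ON "dbPlan"."tb_plan_next" (created_at);

COMMIT;

-- Refresh planner statistics for the freshly loaded table.
ANALYZE "dbPlan"."tb_plan_next";
```

Foreign keys are not copied by LIKE at all, so any that exist on the source table would have to be added afterwards with ALTER TABLE ... ADD CONSTRAINT.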

Related

Huge delete on PostgreSQL table : Deleting 99,9% of the rows of the table

I have a table in my PostgreSQL database that became huge, filled with a lot of useless rows.
As these useless rows represent 99.9% of my table data (about 3.3M rows), I was wondering if deleting them could have a bad impact on my DB:
I know that this operation could take some time, and I will be able to block writes on the table during the maintenance operation.
But I was wondering if this huge change in the data could also impact performance after the operation itself.
I found solutions like creating a new table / using TRUNCATE to drop all rows, but as this operation will be specific and one-shot, I would like to be able to choose the most suitable solution.
I know that PostgreSQL has a VACUUM mechanism, but I'm not a DBA expert: could anyone please confirm that this delete will not impact my table integrity / data structure and that freed space will be reclaimed if needed for new data?
PostgreSQL 11.12, with default settings on AWS RDS. I don't have any index on my table and the criteria for rows deletion will not be based on the PK
Deleting rows typically does not shrink a PostgreSQL table, so you would then have to run VACUUM (FULL) to compact it, during which the table is inaccessible.
If you are deleting many rows, both the DELETE and the VACUUM (FULL) will take a long time, and you would be much better off like this:
create a new table that is defined like the old one
INSERT INTO new_tab SELECT * FROM old_tab WHERE ... to copy over the rows you want to keep
drop foreign key constraints that point to the old table
create all indexes and constraints on the new table
drop the old table and rename the new one
By planning that carefully, you can get away with a short down time.
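The steps above could be sketched as a single transaction, since DDL in PostgreSQL is transactional. Only old_tab and new_tab come from the answer; the keep predicate, the child_tab table, its foreign key, and the id, old_id and created_at columns are hypothetical.

```sql
BEGIN;

-- Allow reads but block writes on the old table while rows are copied.
LOCK TABLE old_tab IN SHARE MODE;

-- Columns and NOT NULL only; defaults, keys and indexes are added after the load.
CREATE TABLE new_tab (LIKE old_tab);

-- Hypothetical predicate selecting the small share of rows worth keeping.
INSERT INTO new_tab
SELECT * FROM old_tab
WHERE created_at >= DATE '2021-01-01';

-- Hypothetical foreign key pointing at the old table.
ALTER TABLE child_tab DROP CONSTRAINT child_tab_old_id_fkey;

-- Hypothetical primary key, created after the data is in place.
ALTER TABLE new_tab ADD PRIMARY KEY (id);

DROP TABLE old_tab;
ALTER TABLE new_tab RENAME TO old_tab;

-- Re-create the foreign key against the renamed table.
ALTER TABLE child_tab
    ADD CONSTRAINT child_tab_old_id_fkey
    FOREIGN KEY (old_id) REFERENCES old_tab (id);

COMMIT;
```

If child_tab still references rows that were not copied, the final ADD CONSTRAINT fails and the whole transaction rolls back, leaving the old table untouched.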

Performance of truncate and insert vs update

I have a table with more than 1 million records, and the table is growing every day. I need to update two columns of that table every day. Which is the better way: truncating the table and inserting, or updating row by row?
example :-
today
userid activitycount
1 18
tomorrow
userid activitycount
1 19
Make sure that the fillfactor of the table is less than 50 and that the updated columns are not indexed.
Then the updates will become HOT updates that don't need to modify any index, and autovacuum will make sure that tomorrow's update will find enough free space.
The disadvantage is the bloat you have with this method, but you don't need to create new tables and rename them, which may be problematic with concurrent transactions.
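A minimal sketch of that setup follows. The table name user_activity and the fillfactor value 45 are assumptions chosen for illustration; only the userid and activitycount columns come from the question.

```sql
-- Hypothetical table; only userid and activitycount come from the question.
CREATE TABLE user_activity (
    userid        bigint PRIMARY KEY,
    activitycount integer NOT NULL
) WITH (fillfactor = 45);

-- For an existing table: the new setting affects only pages written afterwards,
-- so a rewrite such as VACUUM (FULL) is needed to apply it to existing data.
ALTER TABLE user_activity SET (fillfactor = 45);

-- The daily update touches only the non-indexed column.
UPDATE user_activity
SET activitycount = activitycount + 1
WHERE userid = 1;

-- Compare total updates with HOT updates to see whether the setup works.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'user_activity';
```

If n_tup_hot_upd stays close to n_tup_upd after a daily run, most updates are avoiding index maintenance.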
It is faster to truncate the table and copy the data in again. The Postgres docs explain how to populate tables with large datasets:
This section contains some suggestions on how to make this process as efficient as possible.
Use Copy: Use COPY to load all the rows in one command, instead of using a series of INSERT commands.
Remove Indexes: if you need indexes, create them after the data has been inserted.
Remove Foreign Key Constraints: create the constraints after the data has been inserted.
Tune the Postgres installation: maintenance_work_mem, max_wal_size, disabling WAL archival and streaming replication, ...
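Put together, a reload following these suggestions might look like the sketch below. The table, index, and file path are hypothetical, and the memory value is only an example, not a recommendation.

```sql
-- Session-level setting; more memory here speeds up CREATE INDEX.
SET maintenance_work_mem = '1GB';

BEGIN;

TRUNCATE user_activity;

-- Hypothetical secondary index, dropped so COPY does not have to maintain it.
DROP INDEX IF EXISTS user_activity_count_idx;

-- Reads a server-side file, which requires suitable privileges.
-- From psql, \copy reads a client-side file instead.
COPY user_activity (userid, activitycount)
FROM '/tmp/user_activity.csv' WITH (FORMAT csv, HEADER);

-- Build the index once, after all rows are in place.
CREATE INDEX user_activity_count_idx ON user_activity (activitycount);

COMMIT;

-- Refresh planner statistics after the bulk load.
ANALYZE user_activity;
```

Keeping TRUNCATE and COPY in the same transaction means other sessions never see the table empty.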

Best practices for performing a table swap in Redshift

We're in the process of running a handful of hourly scripts on our Redshift cluster which build summary tables for data consumers. After assembling a staging table, the script then runs a transaction which deletes the existing table and replaces it with the staging table, as such:
BEGIN;
DROP TABLE IF EXISTS public.data_facts;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
COMMIT;
The problem with this operation is that long-running analysis queries will place an AccessShareLock on public.data_facts, preventing it from being dropped and thrashing our ETL cycle. I'm thinking a better solution would be one which renames the existing table, as such:
ALTER TABLE public.data_facts RENAME TO data_facts_old;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
DROP TABLE public.data_facts_old;
However, this approach presupposes that 1) public.data_facts exists, and 2) public.data_facts_old does not exist.
Do you know if there's a way to conduct this operation safely in SQL, without relying on application logic? (eg. something like ALTER TABLE IF EXISTS).
I haven't tried it but looking at the documentation of CREATE VIEW it seems that this can be done with late-binding views.
The main idea would be a view public.data_facts that users interact with. Behind the scenes, you can load new data and then swap the view to “point” to the new table.
Bootstrap
-- load data into public.data_facts_v0
CREATE VIEW public.data_facts AS
SELECT * from public.data_facts_v0 WITH NO SCHEMA BINDING;
Update
-- load data into public.data_facts_v1
CREATE OR REPLACE VIEW public.data_facts AS
SELECT * from public.data_facts_v1 WITH NO SCHEMA BINDING;
DROP TABLE public.data_facts_v0;
The WITH NO SCHEMA BINDING means the view will be late-binding. “A late-binding view doesn't check the underlying database objects, such as tables and other views, until the view is queried.” This means the update can even introduce a table with renamed columns or a completely new structure.
Notes:
It might be a good idea to wrap the swap operations into a transaction to make sure we don't drop the previous table if the VIEW swap failed.
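Wrapped in a transaction, the update step might look like this. Like the rest of this answer, it is an untested sketch that reuses the table and view names from the example above.

```sql
BEGIN;

-- Repoint the view that consumers query to the freshly loaded table.
CREATE OR REPLACE VIEW public.data_facts AS
SELECT * FROM public.data_facts_v1 WITH NO SCHEMA BINDING;

-- If the statement above fails, the transaction is aborted and this
-- drop does not take effect.
DROP TABLE public.data_facts_v0;

COMMIT;
```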
You can add a new column, load_time timestamp encode runlength default getdate(), to your target table, and make your ETL do this:
INSERT INTO public.data_facts
SELECT * FROM public.data_facts_staging;
DELETE FROM public.data_facts
WHERE load_time<(select max(load_time) from public.data_facts);
DROP TABLE public.data_facts_staging;
note: public.data_facts_staging should have exactly the same structure as public.data_facts except that the last column of public.data_facts is load_time, so that on insert it will be populated with the current timestamp.
The only implication is that it requires extra disk space for the moment between inserting the new rows and deleting the old ones, and load_time always has to be the last column. You also have to vacuum the table every time you do this.
Another good thing about this is that if your ETL fails and staging table is empty or there is no staging table you won't lose your data. In the pure SQL scenario of swapping tables with DDL you're not protected from dropping the target table when staging table is missing. In the suggested scenario if no new rows are inserted the delete statement deletes nothing (there are no rows less than max load time), so worst case is just having the old version of data.
p.s. there is a command that instead of insert ... select ... just changes the pointer from staging to target table (alter table ... append from ...) but it requires the same type of lock as alter table I guess, so I don't suggest this

Postgres backup and overwrite one table

I have a postgres database, I am trying to backup a table with :
pg_dump --data-only --table=<table> <db> > dump.sql
Then days later I am trying to overwrite it (basically want to erase all data and add the data from my dump) by:
psql -d <db> -c --table=<table> < dump.sql
But it doesn't overwrite; it appends to the table without deleting the existing data.
Any advice would be awesome, thanks!
You have basically two options, depending on your data and fkey constraints.
If there are no fkeys to the table, then the best thing to do is to truncate the table before loading it. Note that truncate behaves a little oddly in transactions, so the best thing to do is (in a transaction block):
Lock the table
Truncate
Load
This will avoid other transactions seeing an empty table.
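As a psql script, those three steps might look like the sketch below. The table name my_table and the script name reload.sql are hypothetical; dump.sql is the data-only dump from the question, and \i is a psql meta-command rather than SQL.

```sql
-- Run with: psql -d <db> -f reload.sql   (reload.sql is a hypothetical file name)
BEGIN;

-- Explicit lock, following the steps above. TRUNCATE would take the same
-- lock itself; requesting it first just makes the intent visible.
LOCK TABLE my_table IN ACCESS EXCLUSIVE MODE;

TRUNCATE my_table;

-- Replay the data-only dump inside the same transaction.
\i dump.sql

COMMIT;
```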
If you have fkeys then you may want to load into a temporary table and then do an upsert. In this case you may still want to lock the table to avoid a race condition if it is possible other transactions may want to write to the table (also in a transaction block):
1. Load data into a temporary table
2. Lock the destination table (optional, see above)
3. Use a writeable CTE to "upsert" into the table.
4. Use a separate delete statement to delete data from the table.
Stage 3 is a little tricky. You might need to ask a separate question about it, but basically you will have two stages (and write this in consultation with the docs):
Update existing records
Insert non-existing records
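A minimal sketch of the whole sequence, assuming a hypothetical table my_table with key column id and a single data column val:

```sql
BEGIN;

-- Temporary staging table with the same columns; dropped at commit.
CREATE TEMP TABLE staging (LIKE my_table INCLUDING DEFAULTS) ON COMMIT DROP;

-- ... load the dump data into staging here ...

-- Optional: keep concurrent writers out while the tables are reconciled.
LOCK TABLE my_table IN SHARE ROW EXCLUSIVE MODE;

-- Writeable CTE: update rows that already exist, then insert the rest.
WITH updated AS (
    UPDATE my_table t
    SET val = s.val
    FROM staging s
    WHERE t.id = s.id
    RETURNING t.id
)
INSERT INTO my_table (id, val)
SELECT s.id, s.val
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM updated u WHERE u.id = s.id);

-- Separate delete for rows that are no longer present in the dump.
DELETE FROM my_table t
WHERE NOT EXISTS (SELECT 1 FROM staging s WHERE s.id = t.id);

COMMIT;
```

On PostgreSQL 9.5 and later, INSERT ... ON CONFLICT (id) DO UPDATE can replace the writeable CTE, provided id has a unique index.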
Hope this helps.

efficiently trimming postgresql tables

I have about 10 tables with over 2 million records and one with 30 million. I would like to efficiently remove older data from each of these tables.
My general algorithm is:
create a temp table for each large table and populate it with newer data
truncate the original tables
copy tmp data back to original tables using: "insert into originaltable (select * from tmp_table)"
However, the last step of copying the data back is taking longer than I'd like. I thought about deleting the original tables and making the temp tables "permanent", but I lose constraint/foreign key info.
If I delete from the tables directly, it takes much longer. Given that I need to preserve all foreign keys and constraints, are there any faster ways of removing the older data?
Thanks.
The fastest process is likely to be exactly as you've outlined:
Copy new data into a temporary table
Drop indexes and foreign keys
Drop the old table
Copy the temporary table back to the old table name
Rebuild indexes and foreign keys.
The Postgres manual has some suggestions on performance, too, that may or may not apply. Frankly, however, it is significantly quicker to drop a table than to delete millions of rows (since each delete is performed tuple by tuple), and it is significantly quicker to insert millions of rows into a table with no constraints or indexes (as each constraint must be checked and each index must be updated for each record insert; by removing all constraints, you limit this to a single build of the index and a single verification for the constraint).
The "standard" solution for these problems typically involves partitioning your tables on the appropriate key, such that when you need to delete old data, you can simply drop a whole partition -- certainly the fastest deletion that you will ever get.
However, partitioning in PostgreSQL isn't as easy as some other databases -- you need to relocate data manually using triggers, and there are caveats (e.g. no global primary keys)
See the PostgreSQL manual on Partitioning
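On PostgreSQL 10 and later, declarative partitioning removes the need for the hand-written triggers mentioned above. A minimal sketch, with a hypothetical events table partitioned by month:

```sql
-- Hypothetical parent table, partitioned on the timestamp used for trimming.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Removing a month of old data is a metadata operation, not a row-by-row delete.
ALTER TABLE events DETACH PARTITION events_2024_01;
DROP TABLE events_2024_01;
```

Rows inserted into events are routed to the matching partition automatically.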