Postgres backup and overwrite one table

I have a postgres database, I am trying to backup a table with :
pg_dump --data-only --table=<table> <db> > dump.sql
Then days later I am trying to overwrite it (basically want to erase all data and add the data from my dump) by:
psql -d <db> -c --table=<table> < dump.sql
But it doesn't overwrite; it appends to the table without deleting the existing data.
Any advice would be awesome, thanks!

You have basically two options, depending on your data and fkey constraints.
If there are no fkeys to the table, then the best thing to do is to truncate the table before loading it. Note that truncate behaves a little oddly in transactions, so the best thing to do is (in a transaction block):
Lock the table
Truncate
Load
This will avoid other transactions seeing an empty table.
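A minimal sketch of that sequence as a psql script (mytable is just a placeholder for your table, and dump.sql is the data-only dump from above):
BEGIN;
-- ACCESS EXCLUSIVE keeps other transactions from seeing the table while it is empty.
LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
-- TRUNCATE is transactional here, so the old rows come back if the load fails.
TRUNCATE mytable;
-- Replay the data-only dump inside the same transaction.
\i dump.sql
COMMIT;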
If you have fkeys, then you may want to load into a temporary table and then do an upsert. In this case you may still want to lock the table to avoid a race condition, in case other transactions may write to it (again in a transaction block):
Load data into a temporary table
Lock the destination table (optional, see above)
Use a writable CTE to "upsert" into the table.
Use a separate delete statement to delete data from the table.
Stage 3 is a little tricky. You might need to ask a separate question about it, but basically it has two stages (see the sketch below, and write this in consultation with the docs):
Update existing records
Insert non-existing records
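A sketch of the whole flow, using made-up names (target, staging, id, val) that you would replace with your own; on PostgreSQL 9.5+ an INSERT ... ON CONFLICT would also do for stage 3:
BEGIN;
-- 1. Stage the dumped rows in a table shaped like the target.
CREATE TEMP TABLE staging (LIKE target INCLUDING DEFAULTS);
-- ... populate staging here, e.g. with \copy or an edited data-only dump ...

-- 2. Optional: lock the target if other transactions may write to it.
LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;

-- 3. Writable CTE "upsert": update the rows that already exist,
--    then insert the staged rows that were not updated.
WITH updated AS (
    UPDATE target t
       SET val = s.val
      FROM staging s
     WHERE t.id = s.id
    RETURNING t.id
)
INSERT INTO target (id, val)
SELECT s.id, s.val
FROM staging s
WHERE s.id NOT IN (SELECT id FROM updated);

-- 4. Separate DELETE for rows present in the target but absent from the dump.
DELETE FROM target t
WHERE NOT EXISTS (SELECT 1 FROM staging s WHERE s.id = t.id);

COMMIT;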
Hope this helps.

What happens to existing data with psql dbname < pg_dump_file

I have a database on AWS RDS, and I populate it from a pg_dump of a local version of the database, using psql dbname < pg_dump_file with the proper arguments for the remote upload.
I'd like to know what is expected to happen if that rds db already contains data. More specifically:
Data present in the local dump, but absent on RDS
Data present on RDS, but absent in the local dump
Data present in both, but that has been modified
My current understanding:
New data will be added and be present in both after upload
Data on RDS should be unaffected?
The data from the pg_dump will be present in both (assuming the same pk, but different fields otherwise)
Is that about correct? I've been reading this, but it's a little thin on how the restore is actually performed, so I'm having a harder time figuring that out. Thanks.
EDIT: following @wildplasser's comment, looking at the pg_dump file it appears that the following happens:
CREATE TABLE [....]
ALTER TABLE [setting table owner]
ALTER SEQUENCE [....]
For each table in the db. Then, again one table at a time:
COPY [tablename] (list of cols) FROM stdin;
[data to be copied]
Finally, more ALTER statements to set constraints, foreign keys etc.
So I guess the ultimate answer is "it depends". One could, I suppose, remove the CREATE TABLE [...], ALTER TABLE, and ALTER SEQUENCE statements if those objects already exist as they should. I am not positive yet what happens if one tries CREATE TABLE with an existing table (an error is thrown, perhaps?).
Then I guess the COPY statements would overwrite whatever already exists. Or perhaps throw an error. I'll have to test that. I'll write up an answer once I figure it out.
So the answer is a bit dull. It turns out that even if one removes the initial statements before the COPY, if the table has a primary key (and thus a uniqueness constraint) then it won't work:
ERROR: duplicate key value violates unique constraint
So one gets shut down pretty quickly there. One would, I guess, have to rewrite the dump as a list of UPDATE statements instead, but then one might as well write a script to do so. Unsure if pg_dump is all that useful in that case.
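For what it's worth, one way to merge the dumped data instead of replaying it directly is to load it into a staging table and upsert from there; a rough sketch (mytable, id and payload are made-up names, and it assumes id is the primary key and PostgreSQL 9.5+ for ON CONFLICT):
-- Load the dump's COPY data into a staging copy instead of the real table.
CREATE TABLE mytable_staging (LIKE mytable INCLUDING DEFAULTS);
-- ... replay the edited dump into mytable_staging here ...

-- Merge: update rows that already exist, insert the rest.
INSERT INTO mytable (id, payload)
SELECT id, payload FROM mytable_staging
ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload;

DROP TABLE mytable_staging;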

Cloning a Postgres table, including indexes and data

I am trying to create a clone of a Postgres table using plpgsql.
To date I have been simply truncating table 2 and re-inserting data from table 1.
TRUNCATE TABLE "dbPlan"."tb_plan_next";
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
As code this works as expected, however "dbPlan"."tb_plan" contains around 3 million records, so the copy takes around 20 minutes. This is too long and has knock-on effects on other processes.
It's important that all constraints, indexes and data are copied exactly to table 2.
I had tried dropping the table and re-creating it, however this did not improve speed.
DROP TABLE IF EXISTS "dbPlan"."tb_plan_next";
CREATE TABLE "dbPlan"."tb_plan_next" (LIKE "dbPlan"."tb_plan" INCLUDING ALL);
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
Is there a better method for achieving this?
I am considering creating the table and then creating indexes as a second step.
PostgreSQL doesn't provide a very elegant way of doing this. You could use pg_dump with -t and --section= to dump the pre-data and post-data for the table separately. Then you would replay the pre-data to create the table structure and the check constraints, load the data from wherever you get it, and finally replay the post-data to add the indexes and FK constraints.
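A rough sketch of that workflow (mydb and the file names are assumptions; -t and --section are documented pg_dump options):
# Dump the table definition and the indexes/constraints into separate files.
pg_dump -d mydb -t '"dbPlan"."tb_plan"' --section=pre-data  -f pre.sql
pg_dump -d mydb -t '"dbPlan"."tb_plan"' --section=post-data -f post.sql

# For the clone, edit pre.sql and post.sql to rename tb_plan to tb_plan_next,
# then create the bare table, load the rows, and only then build indexes and FKs.
psql -d mydb -f pre.sql
psql -d mydb -c 'INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";'
psql -d mydb -f post.sql
Loading into an index-less table and building the indexes afterwards is usually considerably faster than inserting three million rows into a fully indexed one.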

pg_dump with --exclude-table still includes those tables in the background COPY commands it runs?

I am trying to take a backup of a TimescaleDB database, excluding two very big hypertables.
That means that while the backup is running, I would not expect to see any COPY command of the underlying chunks, but I actually do!
Let's say TestDB is my database and it has two big hypertables on schema mySchema called hyper1 and hyper2, as well as other normal tables.
I run the following command:
pg_dump -U user -F t TestDB --exclude-table "mySchema.hyper1" --exclude-table "mySchema.hyper2" > TestDB_Backup.tar
Then I check the running queries (esp. because I did not expect it to take this long) and I find out that several COPY commands are running, for each chunk of the tables I actually excluded.
This is TimescaleDB version 1.7.4.
Did this ever happen to any of you and what is actually going on here?
P.S. I am sorry I cannot really provide a repro for this, and this is more of a discussion than an actual programmatic problem, but I still hope someone has seen this before and can show me what I am missing :)
pg_dump dumps each child table separately and independently of its parent, so when you exclude a hypertable, its chunk tables are still dumped. That is why you see COPY commands for all the chunk tables.
Note that excluding hypertables and chunks will not produce a dump that restores correctly into a TimescaleDB instance, since the TimescaleDB metadata will not match the actual state of the database. TimescaleDB maintains catalog tables with information about hypertables and chunks; to pg_dump these are just more user tables, so it dumps them (which is important), but when restored they will still describe all the hypertables and chunks that existed in the database before the dump.
So you need to exclude the data of the tables you want to leave out (not the hypertables or chunks themselves), which will reduce dump and restore time. It will then be necessary to drop the excluded hypertables after the restore. You exclude table data with the pg_dump parameter --exclude-table-data. There is an issue in the TimescaleDB GitHub repo which discusses how to exclude hypertable data from a dump; it suggests how to generate the exclude string:
SELECT string_agg(format($$--exclude-table-data='%s.%s'$$,coalesce(cc.schema_name,c.schema_name), coalesce(cc.table_name, c.table_name)), ' ')
FROM _timescaledb_catalog.hypertable h
INNER JOIN _timescaledb_catalog.chunk c on c.hypertable_id = h.id
LEFT JOIN _timescaledb_catalog.chunk cc on c.compressed_chunk_id = cc.id
WHERE h.schema_name = <foo> AND h.table_name = <bar> ;
Alternatively, you can find hypertable_id and exclude data from all chunk tables prefixed with the hypertable id. Find hypertable_id from catalog table _timescaledb_catalog.hypertable:
SELECT id
FROM _timescaledb_catalog.hypertable
WHERE schema_name = 'mySchema' AND table_name = 'hyper1';
Let's say that the id is 2. Then dump the database according to the instructions:
pg_dump -U user -Fc -f TestDB_Backup.bak \
--exclude-table-data='_timescaledb_internal._hyper_2*' TestDB
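After restoring such a dump, as noted above, the excluded hypertables still have to be dropped; roughly (table names taken from the question):
-- Run after the restore; CASCADE also removes the (now empty) chunks
-- and any objects that depend on the hypertables.
DROP TABLE "mySchema"."hyper1" CASCADE;
DROP TABLE "mySchema"."hyper2" CASCADE;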

Is it possible to dump from Timescale without hypertable insertions?

I followed the manual on: https://docs.timescale.com/v1.0/using-timescaledb/backup
When I dump it into a binary file everything works out as expected (I can restore it easily).
However, when I dump it into plain-text SQL, the insertions are generated against the internal chunk tables. Is it possible to generate the INSERTs against the table itself?
Say I have an 'Auto' table with columns id, brand, speed
and with only one row: 1,Opel,170
Dumping into SQL results in something like this:
INSERT INTO _timescaledb_catalog.hypertable VALUES ...
INSERT INTO _timescaledb_internal._hyper_382_8930_chunk VALUES (1, 'Opel',170);
What I need is this (and let TS do the work in the background):
INSERT INTO Auto VALUES (1,'Opel',170);
Is that possible somehow? (I know I can exclude tables from pg_dump but that wouldn't create the needed insertion)
Beatrice, unfortunately pg_dump dumps commands that mirror the underlying implementation of Timescale. For example, _hyper_382_8930_chunk is a chunk underlying the Auto hypertable that you have.
Might I ask why you don't want pg_dump to behave this way? The plain-SQL file that pg_dump creates is intended to be replayed by psql on restore, so as long as you dump and restore and end up with a correct state, there is no problem with dump/restore.
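For instance, with the Auto example above: once the dump has been replayed, the chunk-level INSERT is visible through the hypertable again, so queries behave exactly as before:
-- The dump wrote the row into _timescaledb_internal._hyper_382_8930_chunk,
-- but the hypertable presents it transparently.
SELECT id, brand, speed FROM Auto;
--  id | brand | speed
-- ----+-------+-------
--   1 | Opel  |   170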
Perhaps you are asking a different question?

PostgreSQL reset tables to original state

We have a large PostgreSQL dump with hundreds of tables that I can successfully import with pg_restore. We are developing software that inserts into a lot of these tables (~100), and for every run we need to return these tables to their original state (that is, the content that was in the dump). Restoring the original dump again takes a lot of time and we just can't wait half an hour before every debugging session, so I need a relatively fast way to revert these tables to the state they are in right after restoring from the dump.
I've tried using pg_restore with the -L switch and selecting these tables, but I get either a duplicate-key error when using both --data-only and --clean, or a "cannot drop table X because other objects depend on it" error when using only --clean. Issuing a SET CONSTRAINTS ALL DEFERRED command before pg_restore did not work either. Maybe I have the rows in the table list all wrong; right now it's
491; 1259 39623998 TABLE public some_table some_user
8021; 0 0 COMMENT public TABLE some_table some_user
8022; 0 0 ACL public some_table some_user
for every table and then
6700; 0 39624062 TABLE DATA public some_table postgres
8419; 0 0 SEQUENCE SET public some_table_pk_id_seq some_user
for every table.
We only insert data and don't update existing rows, so deleting all rows above an index and resetting the sequences might work, but I really don't want to create these commands manually for all the hundred tables, and I'm not even sure it would work even if I set cascade to delete other objects depending on the given rows.
Does anyone have any better idea how to handle this?
So you are looking for something like a snapshot in order to be able to revert quickly to a certain state.
I am not aware of a possibility in PostgreSQL to roll back to a certain timestamp.
While searching for a solution, I've found two ideas here
Use CREATE DATABASE with the TEMPLATE option (sketched below)
Virtualize your PostgreSQL installation using VMware or VirtualBox, and use the snapshot feature of the virtual machines.
Again, both ideas are copied from the above source (I searched for "postgresql db snapshots").
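A sketch of the template approach (dbname and dbname_pristine are placeholder names; copying requires that nobody else is connected to the database used as the template):
-- One-time: keep a pristine copy of the freshly restored database.
CREATE DATABASE dbname_pristine TEMPLATE dbname;

-- Before each debugging run: drop the working copy and recreate it from the template.
DROP DATABASE dbname;
CREATE DATABASE dbname TEMPLATE dbname_pristine;
Because this is a file-level copy of the database directory, it is usually much faster than replaying the dump.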
You can use PITR to create a snapshot before loads and use the PITR snapshot to take you back to any point that you have the logs for.