Loading a big table, referenced by others, efficiently - postgresql

My use case is the following:
I have a big table users (~200 million rows) with user_id as the primary key. users is referenced by several other tables using foreign keys with ON DELETE CASCADE.
Every day I have to replace the whole content of users using a lot of csv files. (Please don't ask why I have to do that, I just have to...)
My idea was to set the primary key and all foreign keys as DEFERRED, then, in the same transaction, DELETE the whole table and copy all the CSVs in with the COPY command. The expected result was that all checks and index calculations would happen at the end of the transaction.
But in practice the insert process is super slow (4 hours, versus 10 minutes if I insert first and add the primary key afterwards) AND no foreign key can refer to a deferrable primary key.
I can't drop the primary key during the insertion because of the foreign keys. I don't want to get rid of the foreign keys either, because I would have to simulate the behavior of ON DELETE CASCADE manually.
So basically I am looking for a way to tell Postgres not to care about the primary key index or the foreign key checks until the very end of the transaction.
PS1: I made up the users table; I am actually working with a very different kind of data, but it's not really relevant to the problem.
PS2: As a rough estimate, every day on my 200+ million records about 10 records are removed, 1 million are updated and 1 million are added.
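Roughly, what I was hoping to do (a sketch only; the column and file names are made up):
-- Sketch of the attempted approach, assuming the constraints were created DEFERRABLE.
BEGIN;
SET CONSTRAINTS ALL DEFERRED;   -- only affects constraints declared DEFERRABLE
DELETE FROM users;              -- cascades queued until commit
COPY users (user_id, name, email) FROM '/data/users_0001.csv' WITH (FORMAT csv);
-- ... one COPY per csv file ...
COMMIT;                         -- deferred checks all fire here
Note that even with a deferred unique constraint, the index entries are still maintained row by row during the COPY, which is why this stays slow.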

A full delete + a full insert will cause a flood of cascading FK actions,
which will have to be postponed by DEFERRED,
which will cause an avalanche of aftermath for the DBMS at commit time.
Instead, don't {delete+create} keys, but keep them right where they are.
Also, don't touch records that don't need to be touched.
-- staging table
CREATE TABLE tmp_users AS SELECT * FROM big_users WHERE 1=0;
COPY tmp_users (...) FROM '...' WITH CSV;
-- ... and more copying ...
-- ... from more files ...
-- If this fails, you have a problem!
ALTER TABLE tmp_users
ADD PRIMARY KEY (id);
-- [EDIT]
-- I added this later, because the user_comments table
-- was not present in the original question.
DELETE FROM user_comments c
WHERE NOT EXISTS (
SELECT * FROM tmp_users u WHERE u.id = c.user_id
);
-- These deletes are allowed to cascade
-- [we assume that the import of the CSV files was complete, here ...]
DELETE FROM big_users b
WHERE NOT EXISTS (
SELECT *
FROM tmp_users t
WHERE t.id = b.id
);
-- Only update the records that actually **change**
-- [ updates are expensive in terms of I/O, because they create row-versions
-- , and the need to delete the old row-versions, afterwards ]
-- Note that the key (id) does not change, so there will be no cascading.
-- ------------------------------------------------------------
UPDATE big_users b
SET name_1 = t.name_1
, name_2 = t.name_2
, address = t.address
-- , ... ALL THE COLUMNS here, except the key(s)
FROM tmp_users t
WHERE t.id = b.id
AND (t.name_1, t.name_2, t.address, ...) -- ALL THE COLUMNS, except the key(s)
IS DISTINCT FROM
(b.name_1, b.name_2, b.address, ...)
;
-- Maybe there were some new records in the CSV files. Add them.
INSERT INTO big_users (id,name_1,name_2,address, ...)
SELECT id,name_1,name_2,address, ...
FROM tmp_users t
WHERE NOT EXISTS (
SELECT *
FROM big_users x
WHERE x.id = t.id
);

I found a hacky solution:
update pg_index set indisvalid = false, indisready=false where indexrelid= 'users_pkey'::regclass;
DELETE FROM users;
COPY users FROM 'file.csv';
REINDEX INDEX users_pkey;
DELETE FROM user_comments c WHERE NOT EXISTS (SELECT * FROM users u WHERE u.id = c.user_id);
commit;
The magic dirty hack is to disable the primary key index in the Postgres catalog and, at the end, to force the reindexing (which will override what we changed). I can't use foreign keys with ON DELETE CASCADE because for some reason that makes the constraint be checked immediately... So instead my foreign keys are ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED and I have to do the delete myself.
This works well in my case because only a few users are referenced in other tables.
I wish there was a cleaner solution though...
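For reference, the deferred foreign keys described above are declared roughly like this (a sketch; the constraint and column names are just examples):
-- Illustrative only: an FK that does not cascade and is checked at commit time.
ALTER TABLE user_comments
    ADD CONSTRAINT user_comments_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (user_id)
    ON DELETE NO ACTION
    DEFERRABLE INITIALLY DEFERRED;
-- The check is postponed to commit, so the manual DELETE above can run
-- before the constraint is evaluated.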

Related

How to use ON CONFLICT with a primary key on a foreign table

I am basically trying to replicate data from a table on one server to another.
I have two identical databases on the servers. I created a foreign table called opentickets_aux1 on the primary server to represent the opentickets table on the secondary server. Both have a primary key of incidentnumber. I can access the data in the foreign table just fine, but when I try the following SQL, I get "ERROR: there is no unique or exclusion constraint matching the ON CONFLICT specification."
INSERT INTO opentickets_aux1 (SELECT * FROM opentickets)
ON CONFLICT (incidentnumber)
DO
UPDATE SET
status = EXCLUDED.status,
lastmodifieddate = EXCLUDED.lastmodifieddate
I want to update a few columns if the primary key exists. I use this statement for other queries and they work when it's a local table. Any ideas?
A foreign table cannot have a primary key constraint, because PostgreSQL wouldn't be able to enforce its integrity. Therefore, you cannot use INSERT ... ON CONFLICT with foreign tables.
Your idea also does not handle rows that are deleted on the foreign server, but maybe that's intentional.
If you want a local copy of a foreign table, the easiest way would be to create a materialized view on the foreign table.
If that is not your desire (perhaps because you don't want to copy deletions), you'd have to use statements like
INSERT INTO localtable
SELECT * FROM foreigntable f
WHERE NOT EXISTS
(SELECT 1 FROM localtable l
WHERE f.id = l.id);
UPDATE localtable l
SET /* all columns from f */
FROM foreigntable f
WHERE f.id = l.id
AND (f.*) <> (l.*);
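For the materialized view route mentioned above, a minimal sketch (assuming the foreign table opentickets_aux1 already exists via a foreign data wrapper; the view name is made up):
CREATE MATERIALIZED VIEW opentickets_local AS
    SELECT * FROM opentickets_aux1;
-- Re-run periodically to pull fresh data; REFRESH ... CONCURRENTLY would
-- additionally require a unique index on the materialized view.
REFRESH MATERIALIZED VIEW opentickets_local;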

Avoid exclusive access locks on referenced tables when DROPping in PostgreSQL

Why does dropping a table in PostgreSQL require ACCESS EXCLUSIVE locks on any referenced tables? How can I reduce this to an ACCESS SHARED lock or no lock at all? i.e. is there a way to drop a relation without locking the referenced table?
I can't find any mention of which locks are required in the documentation, but unless I explicitly get locks in the correct order when dropping multiple tables during concurrent operations, I can see deadlocks waiting on an AccessExclusiveLock in the logs, and acquiring this restrictive lock on commonly-referenced tables is causing momentary delays to other processes when tables are deleted.
To clarify,
CREATE TABLE base (
id SERIAL,
PRIMARY KEY (id)
);
CREATE TABLE main (
id SERIAL,
base_id INT,
PRIMARY KEY (id),
CONSTRAINT fk_main_base FOREIGN KEY (base_id)
REFERENCES base (id)
ON DELETE CASCADE ON UPDATE CASCADE
);
DROP TABLE main; -- why does this need to lock base?
For anyone googling and trying to understand why their drop table (or drop foreign key or add foreign key) got stuck for a long time:
PostgreSQL (I looked at versions 9.4 to 13) foreign key constraints are actually implemented using triggers on both ends of the foreign key.
If you have a company table (id as primary key) and a bank_account table (id as primary key, company_id as foreign key pointing to company.id), then there are actually 2 triggers on the bank_account table and also 2 triggers on the company table.
table_name   | timing       | trigger_name                   | function_name
-------------+--------------+--------------------------------+----------------------
bank_account | AFTER UPDATE | RI_ConstraintTrigger_c_1515961 | RI_FKey_check_upd
bank_account | AFTER INSERT | RI_ConstraintTrigger_c_1515960 | RI_FKey_check_ins
company      | AFTER UPDATE | RI_ConstraintTrigger_a_1515959 | RI_FKey_noaction_upd
company      | AFTER DELETE | RI_ConstraintTrigger_a_1515958 | RI_FKey_noaction_del
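Those internal triggers can be listed with a query along these lines (a sketch; the trigger names and OIDs will differ in every database):
SELECT tgrelid::regclass AS table_name,
       tgname            AS trigger_name,
       tgfoid::regproc   AS function_name
FROM pg_trigger
WHERE tgisinternal
  AND tgrelid IN ('company'::regclass, 'bank_account'::regclass);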
Initial creation of those triggers (when creating the foreign key) requires a SHARE ROW EXCLUSIVE lock on those tables (it used to be an ACCESS EXCLUSIVE lock in version 9.4 and earlier). This lock does not conflict with "data reading locks", but will conflict with all other locks, for example a simple INSERT/UPDATE/DELETE into the company table.
Deletion of those triggers (when dropping the foreign key, or the whole table) requires an ACCESS EXCLUSIVE lock on those tables. This lock conflicts with every other lock!
So imagine a scenario where you have a transaction A running that first did a simple SELECT from the company table (causing it to hold an ACCESS SHARE lock on the company table until the transaction is committed or rolled back) and is now doing some other work for 3 minutes. You try to drop the bank_account table in transaction B. This requires an ACCESS EXCLUSIVE lock, which will need to wait until the ACCESS SHARE lock is released first.
In addition to that, all other transactions that want to access the company table (just a SELECT, or maybe an INSERT/UPDATE/DELETE) will be queued behind the ACCESS EXCLUSIVE lock, which is itself waiting on the ACCESS SHARE lock.
Long running transactions and DDL changes require delicate handling.
-- SESSION#1
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
BEGIN;
CREATE TABLE base (
id SERIAL
, dummy INTEGER
, PRIMARY KEY (id)
);
CREATE TABLE main (
id SERIAL
, base_id INTEGER
, PRIMARY KEY (id)
, CONSTRAINT fk_main_base FOREIGN KEY (base_id) REFERENCES base (id)
-- comment the next line out (plus maybe the previous one)
ON DELETE CASCADE ON UPDATE CASCADE
);
-- make some data ...
INSERT INTO base (dummy)
SELECT generate_series(1,10)
;
-- make some FK references
INSERT INTO main(base_id)
SELECT id FROM base
WHERE random() < 0.5
;
COMMIT;
BEGIN;
DROP TABLE main; -- why does this need to lock base?
SELECT pg_backend_pid();
-- allow other session to check the locks
-- and attempt an update to "base"
SELECT pg_sleep(20);
-- On rollback the other session will fail.
-- On commit the other session will succeed.
-- In both cases the other session must wait for us to complete.
-- ROLLBACK;
COMMIT;
-- SESSION#2
-- (Start this after session#1 from a different terminal)
SET search_path = tmp, pg_catalog;
PREPARE peeklock(text) AS
SELECT dat.datname
, rel.relname as relrelname
, cat.relname as catrelname
, lck.locktype
-- , lck.database, lck.relation
, lck.page, lck.tuple
-- , lck.virtualxid, lck.transactionid
-- , lck.classid
, lck.objid, lck.objsubid
-- , lck.virtualtransaction
, lck.pid, lck.mode, lck.granted, lck.fastpath
FROM pg_locks lck
LEFT JOIN pg_database dat ON dat.oid = lck.database
LEFT JOIN pg_class rel ON rel.oid = lck.relation
LEFT JOIN pg_class cat ON cat.oid = lck.classid
WHERE EXISTS(
SELECT * FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation AND c.relname = $1
WHERE l.pid =lck.pid
)
;
EXECUTE peeklock( 'base' );
BEGIN;
-- attempt to perform some DDL
ALTER TABLE base ALTER COLUMN id TYPE BIGINT;
-- attempt to perform some DML
UPDATE base SET id = id+100;
COMMIT;
EXECUTE peeklock( 'base' );
\d base
SELECT * FROM base;
I suppose DDL locks everything it touches exclusively for the sake of simplicity — you're not supposed to run DDL involving not-temporary tables during normal operation anyway.
To avoid deadlock you may use advisory lock:
start transaction;
select pg_advisory_xact_lock(0);
drop table main;
commit;
This would ensure that only one client is concurrently running DDL involving referenced tables, so it wouldn't matter in which order the other locks are acquired.
You can avoid locking table for long time by dropping foreign key first:
start transaction;
select pg_advisory_xact_lock(0);
alter table main drop constraint fk_main_base;
commit;
start transaction;
drop table main;
commit;
This would still need to lock base exclusively, but for much shorter time.

PostgreSQL multi-column unique constraint causes errors

I am new to PostgreSQL and have a question about multi-column unique constraints.
I got this error when I tried to add rows to the table:
ERROR: duplicate key value violates unique constraint "i_rb_on"
DETAIL: Key (a_fk, b_fk)=(296, 16) already exists.
I used this code (short version):
INSERT INTO rb_on (a_fk, b_fk) SELECT a.pk, b.pk FROM A, B WHERE NOT EXISTS (SELECT * FROM rb_on WHERE a_fk=a.pk AND b_fk=b.pk);
i_rb_on is a unique constraint on the columns (a_fk, b_fk).
It seems that my WHERE NOT EXISTS doesn't provide protection against the duplicate key error for this kind of unique key.
UPDATE:
INSERT INTO tabA (mark_done, log_time, tabB_fk, tabC_fk)
SELECT FALSE, '2003-09-02 04:05:06', tabB.pk, tabC.pk FROM tabB, tabC, tabD, tabE, tabF
WHERE (tabC.sf_id='SUMMER' AND tabC.sf_status IN(0,1)
AND tabE.inventory_status=0)
AND tabF.tabD_fk=tabD.pk
AND tabD.tabE_fk=tabE.pk
AND tabE.tabB_fk=tabB.pk
AND tabF.tabC_fk=tabC.pk
AND NOT EXISTS (SELECT *
FROM tabA
WHERE tabB_fk=tabB.pk AND tabC_fk=tabC.pk);
The unique index on tabA:
CREATE UNIQUE INDEX i_tabA
ON tabA
USING btree
(tabB_fk , tabC_fk );
Only one row (of many candidates) must be inserted into tabA.
Your WHERE NOT EXISTS never provides proper protection against a unique violation. It only seems to most of the time. The WHERE NOT EXISTS can run concurrently with another insert, so the row is still inserted multiple times and all but one of the inserts causes a unique violation.
For that reason it's often better to just run the insert and let the violation happen if the row already exists.
I can't help you with the exact problem described unless you show the data (as SQL CREATE TABLE and INSERTs) and the real query.
BTW, please don't use old style A, B joins. Use A INNER JOIN B ON (...). It makes it easier to tell which join conditions are supposed to apply to which parts of the query, and harder to forget a join condition. You seem to have done that; you're attempting to insert a cartesian product. I suspect it's just an editing mistake in the query.
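For illustration, the same query rewritten with explicit joins might look something like this (a sketch only, not tested against the real schema):
INSERT INTO tabA (mark_done, log_time, tabB_fk, tabC_fk)
SELECT FALSE, '2003-09-02 04:05:06', tabB.pk, tabC.pk
FROM tabE
JOIN tabB ON tabE.tabB_fk = tabB.pk
JOIN tabD ON tabD.tabE_fk = tabE.pk
JOIN tabF ON tabF.tabD_fk = tabD.pk
JOIN tabC ON tabF.tabC_fk = tabC.pk
WHERE tabC.sf_id = 'SUMMER'
  AND tabC.sf_status IN (0, 1)
  AND tabE.inventory_status = 0
  AND NOT EXISTS (SELECT 1 FROM tabA
                  WHERE tabA.tabB_fk = tabB.pk
                    AND tabA.tabC_fk = tabC.pk);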
I added LIMIT 1 to the end: ...WHERE tabB_fk=tabB.pk AND tabC_fk=tabC.pk) LIMIT 1;
and it did the trick.
I created a function with LIMIT 1 and ...EXCEPTION WHEN unique_violation THEN ... and it also worked.
But when LIMIT 1 and NOT EXISTS are used, I think the unique_violation error handling is not necessary.
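For reference, the unique_violation handling mentioned above typically looks something like this (a sketch; the function and parameter names are made up):
-- Sketch only: insert one pair and swallow the duplicate-key error.
CREATE OR REPLACE FUNCTION insert_rb_on(p_a_fk int, p_b_fk int)
RETURNS void AS $$
BEGIN
    INSERT INTO rb_on (a_fk, b_fk) VALUES (p_a_fk, p_b_fk);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- the pair already exists, nothing to do
END;
$$ LANGUAGE plpgsql;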

Restoring only some key values using COPY STDIN in Postgres?

I accidentally ran a query on live data that deleted 5000 odd rows. I made a backup before I did this, and the backup is in this format:
COPY table (id, "position", event) FROM stdin;
529 1 5283
648 1 6473
687 1 6853
\.
The problem is, if I run it, I get:
ERROR: duplicate key value violates unique constraint "table_pkey"
Is there a way to alter this query to only insert the rows I deleted? Something like an "if exists, ignore" kind of thing? I know this would normally affect many things, but because it's literally just those entries that need to be replaced, I think something like this could work, but I don't know if it exists.
Easiest way may be to create a copy of the original table and restore to that.
Then insert to original table from copy where no entry exists in original.
e.g.
create table copy_table as select * from table where 1=2;
-- change the copy statement
COPY copy_table from stdin;
...
-- Insert to original
INSERT INTO table
SELECT ct.*
FROM copy_table ct
LEFT JOIN table t2 ON t2.id = ct.id -- assuming id is primary key
WHERE t2.id IS NULL;
No, this is unfortunately not possible using the COPY command.
You need to insert all rows into a staging table, then use insert into .. select ... where not exists (...) to copy the missing rows from the staging table into the real table.
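A sketch of that staging-table approach (using my_table as a stand-in for the real table name, with the columns from the backup snippet above):
-- staging table with the same columns but no constraints
CREATE TABLE restore_staging AS SELECT * FROM my_table WHERE false;
-- point the backup's COPY at the staging table:
COPY restore_staging (id, "position", event) FROM stdin;
-- ... paste the backup rows here, ending with \.
INSERT INTO my_table (id, "position", event)
SELECT s.id, s."position", s.event
FROM restore_staging s
WHERE NOT EXISTS (SELECT 1 FROM my_table t WHERE t.id = s.id);
DROP TABLE restore_staging;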

PostgreSQL delete all content

Hello, I want to delete all data in my PostgreSQL tables, but not the tables themselves.
How could I do this?
Use the TRUNCATE TABLE command.
The content of one or more tables in a PostgreSQL database can be deleted in several ways.
Deleting table content using SQL:
Deleting content of one table:
TRUNCATE table_name;
DELETE FROM table_name;
Deleting content of all named tables:
TRUNCATE table_a, table_b, …, table_z;
Deleting content of named tables and tables that reference them (I will explain this in more detail later in this answer):
TRUNCATE table_a, table_b CASCADE;
Deleting table content using pgAdmin:
Deleting content of one table:
Right click on the table -> Truncate
Deleting content of a table and tables that reference it:
Right click on the table -> Truncate Cascaded
Difference between delete and truncate:
From the documentation:
DELETE deletes rows that satisfy the WHERE clause from the specified
table. If the WHERE clause is absent, the effect is to delete all rows
in the table.
http://www.postgresql.org/docs/9.3/static/sql-delete.html
TRUNCATE is a PostgreSQL extension that provides a faster mechanism to
remove all rows from a table. TRUNCATE quickly removes all rows from a
set of tables. It has the same effect as an unqualified DELETE on each
table, but since it does not actually scan the tables it is faster.
Furthermore, it reclaims disk space immediately, rather than requiring
a subsequent VACUUM operation. This is most useful on large tables.
http://www.postgresql.org/docs/9.1/static/sql-truncate.html
Working with table that is referenced from other table:
When you have a database with more than one table, the tables probably have relationships.
As an example there are three tables:
create table customers (
customer_id int not null,
name varchar(20),
surname varchar(30),
constraint pk_customer primary key (customer_id)
);
create table orders (
order_id int not null,
number int not null,
customer_id int not null,
constraint pk_order primary key (order_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
create table loyalty_cards (
card_id int not null,
card_number varchar(10) not null,
customer_id int not null,
constraint pk_card primary key (card_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
And some prepared data for these tables:
insert into customers values (1, 'John', 'Smith');
insert into orders values
(10, 1000, 1),
(11, 1009, 1),
(12, 1010, 1);
insert into loyalty_cards values (100, 'A123456789', 1);
Table orders references table customers, and table loyalty_cards also references table customers. When you try to TRUNCATE / DELETE FROM a table that is referenced by other tables (the other tables have a foreign key constraint on the named table), you get an error. To delete content from all three tables you have to name all of them (the order is not important)
TRUNCATE customers, loyalty_cards, orders;
or just the referenced table with the CASCADE keyword (you can name more than one table)
TRUNCATE customers CASCADE;
The same applies for pgAdmin. Right click on customers table and choose Truncate Cascaded.
For small tables DELETE is often faster and needs less aggressive locking (important for concurrent load):
DELETE FROM tbl;
With no WHERE condition.
For medium or bigger tables, go with TRUNCATE, like @Greg posted:
TRUNCATE tbl;
It's hard to pin down the line between "small" and "big", as that depends on many variables. You'll have to test in your installation.
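One way to get a feel for the crossover on your own data is to time both inside a transaction and roll back (a psql sketch; tbl stands for your table):
\timing on
BEGIN;
DELETE FROM tbl;   -- note the reported time
ROLLBACK;
BEGIN;
TRUNCATE tbl;      -- compare against this time
ROLLBACK;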
I found a very easy and fast way for everyone who might use a tool like DBeaver:
You just need to select all the tables that you want to truncate (SHIFT + click or CTRL + click), then right-click and choose the truncate action.
If you have foreign keys, also select the CASCADE option on the Settings panel. Start, and that's all it takes!