PostgreSQL: proper way to delete 'orphaned' records

In my current project, I have a few cases where, within data pump operations, I have to execute queries like this (it's not a real example, but it should give you the idea):
DELETE FROM notification
WHERE user_id NOT IN (SELECT id FROM "user")
On big tables, this construction performs poorly. I believe it's because of the NOT IN construct, which prevents the planner from using indexes.
An approach like this should perform better:
DELETE FROM notification
USING (
    SELECT n.user_id, u.id
    FROM notification n
    LEFT OUTER JOIN "user" u ON n.user_id = u.id
) i
WHERE notification.user_id = i.user_id
  AND i.id IS NULL
... but it looks a bit overcomplicated.
Is there a better way / best practice for such operations?

We can use NOT EXISTS instead:
DELETE FROM notification
WHERE NOT EXISTS
    (SELECT id FROM "user" WHERE id = notification.user_id)
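A side note on why NOT EXISTS is generally safer here than NOT IN: if the subquery produces a NULL, NOT IN matches no rows at all, while NOT EXISTS is null-safe. A minimal self-contained sketch:
-- NOT IN yields zero rows as soon as the list contains a NULL:
SELECT 1 WHERE 2 NOT IN (SELECT unnest(ARRAY[1, NULL]));          -- no rows
-- NOT EXISTS handles the NULL as expected:
SELECT 1 WHERE NOT EXISTS
    (SELECT 1 FROM unnest(ARRAY[1, NULL]) AS x(v) WHERE v = 2);   -- one row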

I'd use NOT EXISTS.
DELETE FROM notification n
WHERE NOT EXISTS (SELECT *
                  FROM "user" u
                  WHERE u.id = n.user_id);
But what you really should do is add a foreign key constraint so that such rows cannot exist in the first place. Assuming that "user".id already is a (primary) key:
ALTER TABLE notification
    ADD FOREIGN KEY (user_id) REFERENCES "user" (id);
If "user".id isn't a (primary) key, you first need to change that and make it a (primary) key.

Related

Delete all records that violate new unique constraint

I have a table that has the following fields
----------------------------------
| id | user_id | doc_id |
----------------------------------
I want to create a new unique constraint to make sure that there are no repeated (user_id, doc_id) records, i.e. a user can only be linked to a doc one time. That is simple enough.
ALTER TABLE mytable
ADD CONSTRAINT uniquectm_const UNIQUE (user_id, doc_id);
The issue is that I have records that currently violate that constraint. I was wondering if there is an easy way to query for those records, or to tell Postgres to just delete anything that violates the constraint.
Identifying the records that violate your new key:
SELECT *
FROM (
    SELECT id, user_id, doc_id,
           COUNT(*) OVER (PARTITION BY user_id, doc_id) AS unique_check
    FROM mytable
) t
WHERE unique_check > 1;
From those duplicates you can then decide which rows should be deleted, and perform the delete.
To my knowledge there is no built-in way to do this, since any automated "delete any duplicates" command would leave it to the database engine to decide which of the two or more duplicate records to get rid of.
If the entire record is a duplicate (all columns match) then you could just create a new table with your new unique constraint and do an INSERT INTO newtable SELECT DISTINCT * FROM oldtable, but I'm betting that isn't the case.
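If keeping one deterministic survivor per group is acceptable, for example the row with the lowest id, the delete itself can be automated with a self-join; a sketch:
-- keep the row with the lowest id in each (user_id, doc_id) group,
-- delete every row that has an older duplicate
DELETE FROM mytable m
USING mytable keep
WHERE keep.user_id = m.user_id
  AND keep.doc_id = m.doc_id
  AND keep.id < m.id;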

How to use ON CONFLICT with a primary key on a foreign table

I am basically trying to replicate data from a table on one server to another.
I have two identical databases on the two servers. On the primary server I created a foreign table called opentickets_aux1 that represents the opentickets table on the secondary server. Both tables have a primary key of incidentnumber. I can access the data in the foreign table just fine, but when I try the following SQL, I get "ERROR: there is no unique or exclusion constraint matching the ON CONFLICT specification."
INSERT INTO opentickets_aux1
SELECT * FROM opentickets
ON CONFLICT (incidentnumber)
DO UPDATE SET
    status = EXCLUDED.status,
    lastmodifieddate = EXCLUDED.lastmodifieddate;
I want to update a few columns if the primary key exists. I use this kind of statement in other queries and they work when it's a local table. Any ideas?
A foreign table cannot have a primary key constraint, because PostgreSQL wouldn't be able to enforce its integrity. Therefore, you cannot use INSERT ... ON CONFLICT with foreign tables.
Your idea also does not handle rows that are deleted on the foreign server, but maybe that's intentional.
If you want a local copy of a foreign table, the easiest way would be to create a materialized view on the foreign table.
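For example, a minimal sketch (the local name opentickets_local is made up):
-- one-time setup: a local snapshot of the foreign table
CREATE MATERIALIZED VIEW opentickets_local AS
SELECT * FROM opentickets_aux1;
-- re-run whenever a fresh copy is needed
REFRESH MATERIALIZED VIEW opentickets_local;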
If that is not your desire (perhaps because you don't want to copy deletions), you'd have to use statements like
INSERT INTO localtable
SELECT * FROM foreigntable f
WHERE NOT EXISTS
    (SELECT 1 FROM localtable l
     WHERE f.id = l.id);

UPDATE localtable l
SET /* all columns from f */
FROM foreigntable f
WHERE f.id = l.id
  AND (f.*) <> (l.*);
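One caveat on the UPDATE above: if any column is nullable, (f.*) <> (l.*) evaluates to NULL rather than true when the only difference involves a NULL, so such rows are silently skipped. A null-safe variant of the condition:
UPDATE localtable l
SET /* all columns from f */
FROM foreigntable f
WHERE f.id = l.id
  AND (f.*) IS DISTINCT FROM (l.*);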

Loading a big table, referenced by others, efficiently

My use case is the following:
I have a big table users (~200 million rows) with user_id as the primary key. users is referenced by several other tables using foreign keys with ON DELETE CASCADE.
Every day I have to replace the whole content of users using a lot of CSV files. (Please don't ask why I have to do that, I just have to...)
My idea was to set the primary key and all foreign keys as DEFERRED, then, in the same transaction, DELETE the whole table and copy in all the CSVs using the COPY command. The expected result was that all checks and index calculations would happen at the end of the transaction.
But actually the insert process is super slow (4 hours, versus 10 minutes if I insert first and then add the primary key), AND no foreign key can refer to a deferrable primary key.
I can't remove the primary key during the insertion because of the foreign keys. I don't want to get rid of the foreign keys either, because I would have to simulate the behavior of ON DELETE CASCADE manually.
So basically I am looking for a way to tell Postgres not to care about the primary key index or foreign key checks until the very end of the transaction.
PS1: I made up the users table; I am actually working with a very different kind of data, but it's not really relevant to the problem.
PS2: As a rough estimate, every day, out of my 200+ million records, about 10 are removed, 1 million are updated and 1 million are added.
A full delete plus a full insert will cause a flood of cascading FK actions,
which will have to be postponed by DEFERRED,
which will cause an avalanche of aftermath for the DBMS at commit time.
Instead, don't delete and re-create the keys; keep them right where they are.
Also, don't touch records that don't need to be touched.
-- staging table
CREATE TABLE tmp_users AS SELECT * FROM big_users WHERE 1=0;
COPY tmp_users (...) FROM '...' WITH CSV;
-- ... and more copying ...
-- ... from more files ...
-- If this fails, you have a problem!
ALTER TABLE tmp_users
ADD PRIMARY KEY (id);
-- [EDIT]
-- I added this later, because the user_comments table
-- was not present in the original question.
DELETE FROM user_comments c
WHERE NOT EXISTS (
SELECT * FROM tmp_users u WHERE u.id = c.user_id
);
-- These deletes are allowed to cascade
-- [we assume that the import of the CSV files was complete, here ...]
DELETE FROM big_users b
WHERE NOT EXISTS (
SELECT *
FROM tmp_users t
WHERE t.id = b.id
);
-- Only update the records that actually **change**
-- [ updates are expensive in terms of I/O, because they create row-versions
--   that have to be cleaned up afterwards ]
-- Note that the key (id) does not change, so there will be no cascading.
-- ------------------------------------------------------------
UPDATE big_users b
SET name_1 = t.name_1
, name_2 = t.name_2
, address = t.address
-- , ... ALL THE COLUMNS here, except the key(s)
FROM tmp_users t
WHERE t.id = b.id
AND (t.name_1, t.name_2, t.address, ...) -- ALL THE COLUMNS, except the key(s)
IS DISTINCT FROM
(b.name_1, b.name_2, b.address, ...)
;
-- Maybe there were some new records in the CSV files. Add them.
INSERT INTO big_users (id,name_1,name_2,address, ...)
SELECT id,name_1,name_2,address, ...
FROM tmp_users t
WHERE NOT EXISTS (
SELECT *
FROM big_users x
WHERE x.id = t.id
);
I found a hacky solution:
BEGIN;
UPDATE pg_index SET indisvalid = false, indisready = false WHERE indexrelid = 'users_pkey'::regclass;
DELETE FROM users;
COPY users FROM 'file.csv' WITH CSV;
REINDEX INDEX users_pkey;
DELETE FROM user_comments c WHERE NOT EXISTS (SELECT * FROM users u WHERE u.id = c.user_id);
COMMIT;
The magic dirty hack is to disable the primary key index in the Postgres catalog and, at the end, to force a reindex (which overrides what we changed). I can't use foreign keys with ON DELETE CASCADE because for some reason that makes the constraint be checked immediately... So instead my foreign keys are ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED and I have to do the delete myself.
This works well in my case because only a few users are referenced by the other tables.
I wish there was a cleaner solution though...
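For reference, a sketch of the deferred foreign key shape described above (constraint and column names assumed):
ALTER TABLE user_comments
    ADD CONSTRAINT user_comments_user_fk
    FOREIGN KEY (user_id) REFERENCES users (id)
    ON DELETE NO ACTION
    DEFERRABLE INITIALLY DEFERRED;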

Removing data which is not referenced by other tables (Postgres)

I have one table USER and some other tables like USER_DETAILS, USER_QUALIFICATION, etc. USER_ID is referenced by all such tables. I want to remove those USER_IDs which are not present in any of the other tables.
Deleting all of the users that are not present in a connected table:
DELETE FROM table WHERE user_id NOT IN (SELECT user_id FROM other_table)
If you want to delete only users that are not found in any other table, you can add more conditions:
AND user_id NOT IN (SELECT user_id FROM another_table)
Alternatively you can create a temp table, merge in all the user_ids that you want to keep, and use that table in the sub-select for the NOT IN, as sketched below.
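A sketch of that temp-table variant, assuming the table names from the question:
-- collect every user_id that is still referenced somewhere
CREATE TEMP TABLE keep_ids AS
SELECT user_id FROM user_details
UNION
SELECT user_id FROM user_qualification;
-- note: if user_id can be NULL in those tables, prefer NOT EXISTS over NOT IN
DELETE FROM "USER"
WHERE user_id NOT IN (SELECT user_id FROM keep_ids);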
Use a DELETE with a not exists condition for all related tables:
delete from "USER" u
where not exists (select *
from user_details ud
where ud.user_id = u.user_id)
and not exists (select *
from user_qualification uq
where uq.user_id = u.user_id);
Note that user is a reserved word, and thus needs to be quoted to be usable as a table name. But quoting makes it case-sensitive. So "USER" and "user" are two different table names. As you have not included the DDL for your tables I cannot tell if your table is named "USER" or "user".
In general I would strongly recommend avoiding quoted identifiers entirely.
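A quick illustration of the case-sensitivity point (throwaway table; assumes no table named user exists):
CREATE TABLE "USER" (user_id int);  -- stored with the exact name USER
SELECT * FROM "USER";               -- works
SELECT * FROM "user";               -- ERROR: relation "user" does not exist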

Is it possible to insert values in column using just a primary key reference?

I have a table 'users' with the columns:
user_id(PK), user_firstname, user_lastname
and another table 'room' with the columns:
event_id (PK), user_id (FK), user_firstname, user_lastname, ... (and more columns).
I want to know if it is possible to fill the user_firstname and user_lastname automatically just knowing the user_id column.
Something like the default value of user_firstname being: "select users.user_firstname where users.user_id = user_id".
I don't know if I was clear enough... As you can see, my knowledge of databases is very narrow.
What you want to achieve can be done with JOINs. They will avoid those redundant user_firstname and user_lastname columns. So you'd just fetch from both tables when querying the room table and you get the extra columns of users into the result set:
SELECT * FROM room AS r INNER JOIN users AS u ON r.user_id = u.user_id;
The thing we did here is called normalization. Another important thing to take care of are foreign key constraints and their cascades; in your case room.user_id references users.user_id. A delete on users should most probably cascade to room, if you want to delete users instead of flagging them as deleted.
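A sketch of such a cascading foreign key (the constraint name is made up):
ALTER TABLE room
    ADD CONSTRAINT room_user_fk
    FOREIGN KEY (user_id) REFERENCES users (user_id)
    ON DELETE CASCADE;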
The columns user_firstname and user_lastname do not belong in your room table. The user_id column references the users table, that is all you need.
To select the data, you can use a JOIN statement, something like
SELECT R.event_id, R.user_id, U.user_firstname, U.user_lastname
FROM room AS R
JOIN users AS U ON R.user_id = U.user_id
The answer here is sideways to the question: you do not want user_firstname and user_lastname columns in the room table. The user_id is a proxy for the corresponding row of the entire users table. When you need to access user_firstname, you do a JOIN of the two tables on the common column.