I accidentally ran a query on live data that deleted 5,000-odd rows. I made a backup before I did this, and the backup is in this format:
COPY table (id, "position", event) FROM stdin;
529 1 5283
648 1 6473
687 1 6853
\.
The problem is, if I run it, I get:
ERROR: duplicate key value violates unique constraint "table_pkey"
Is there a way to alter this query so it only inserts the rows I deleted? Something like an "if exists, ignore" kind of thing? I know an option like that would normally affect more than just my case, but since it's literally just those entries that need to be put back, I think something like this could work, if it exists.
The easiest way may be to create a copy of the original table and restore the backup into that.
Then insert into the original table from the copy where no matching row exists in the original.
e.g.
-- create an empty copy with the same structure
create table copy_table as select * from table where 1=2;
-- change the backup's COPY statement to target the copy
COPY copy_table from stdin;
...
-- Insert into the original only the rows that are missing there
INSERT INTO table
SELECT ct.*
FROM copy_table ct
LEFT JOIN table t2 ON t2.id = ct.id -- assuming id is the primary key
WHERE t2.id IS NULL;
No, this is unfortunately not possible using the COPY command.
You need to insert all rows into a staging table, then use insert into .. select ... where not exists (...) to copy the missing rows from the staging table into the real table.
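A minimal sketch of that staging-table approach, using your_table as a stand-in for the real table name and the columns from the backup:
-- staging table with the same structure but no constraints
CREATE TABLE staging_table AS SELECT * FROM your_table WHERE 1=2;

-- point the backup's COPY statement at staging_table, then:
INSERT INTO your_table (id, "position", event)
SELECT s.id, s."position", s.event
FROM staging_table s
WHERE NOT EXISTS (
    SELECT 1 FROM your_table t WHERE t.id = s.id
);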
Related
I have an unusual problem: I need to delete duplicate records from a table in PostgreSQL. Since the table has duplicates, it has no primary key or unique index. The table contains about 20 million records, with duplicates among them. When I try the query below, it takes far too long:
DELETE FROM temp a USING temp b WHERE a.recordid = b.recordid AND a.ctid < b.ctid;
So what would be a better approach to handle such a huge table with no index on it?
Any help is appreciated.
If you have enough free space, you can copy the table without duplicates, then drop the old table and rename the new one (the full sequence is sketched after the query below), like this:
INSERT INTO new_table
SELECT DISTINCT ON (column) *
FROM old_table
ORDER BY column ASC;
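Put together, the whole sequence might look roughly like this (column stands in for whatever identifies a duplicate):
-- 1. empty table with the same structure
CREATE TABLE new_table (LIKE old_table);

-- 2. run the de-duplicating INSERT ... SELECT DISTINCT ON ... shown above

-- 3. swap the tables
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;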
Use COPY TO to dump the table.
Then Unix sort -u to de-duplicate it.
Drop or truncate the table in Postgres, use COPY FROM to read it back in.
Add a primary key column.
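A rough sketch of that sequence (the file paths are assumptions; a server-side COPY needs appropriate privileges, or use psql's \copy instead):
-- dump the table to a flat file
COPY temp TO '/tmp/temp_dump.txt';

-- in the shell:  sort -u /tmp/temp_dump.txt > /tmp/temp_dedup.txt

-- reload the de-duplicated data
TRUNCATE temp;
COPY temp FROM '/tmp/temp_dedup.txt';

-- add a surrogate primary key so this doesn't happen again
ALTER TABLE temp ADD COLUMN id bigserial PRIMARY KEY;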
My use case is the following:
I have a big table, users (~200 million rows), with user_id as the primary key. users is referenced by several other tables via foreign keys with ON DELETE CASCADE.
Every day I have to replace the whole content of users using a lot of CSV files. (Please don't ask why I have to do that, I just have to...)
My idea was to set the primary key and all foreign keys as DEFERRED, then, in the same transaction, DELETE the whole table and copy in all the CSVs using the COPY command. The expected result was that all checks and index calculations would happen at the end of the transaction.
But in practice the insert process is super slow (4 hours, versus 10 minutes if I insert first and then add the primary key), AND no foreign key can refer to a deferrable primary key.
I can't remove the primary key during the insertion because of the foreign keys. I don't want to get rid of the foreign keys either, because I would have to simulate the behavior of ON DELETE CASCADE manually.
So basically I am looking for a way to tell Postgres not to care about the primary key index or foreign key checks until the very end of the transaction.
PS1: I made up the users table; I am actually working with a very different kind of data, but it's not really relevant to the problem.
PS2: As a rough estimate, every day, out of my 200+ million records, about 10 are removed, 1 million are updated, and 1 million are added.
A full delete + a full insert will cause a flood of cascading FK actions,
which will have to be postponed by DEFERRED,
which will cause an avalanche of work for the DBMS at commit time.
Instead, don't delete and recreate the keys; keep them right where they are.
Also, don't touch records that don't need to be touched.
-- staging table
CREATE TABLE tmp_users AS SELECT * FROM big_users WHERE 1=0;
COPY tmp_users (...) FROM '...' WITH CSV;
-- ... and more copying ...
-- ... from more files ...
-- If this fails, you have a problem!
ALTER TABLE tmp_users
ADD PRIMARY KEY (id);
-- [EDIT]
-- I added this later, because the user_comments table
-- was not present in the original question.
DELETE FROM user_comments c
WHERE NOT EXISTS (
SELECT * FROM tmp_users u WHERE u.id = c.user_id
);
-- These deletes are allowed to cascade
-- [we assume that the import of the CSV files was complete, here ...]
DELETE FROM big_users b
WHERE NOT EXISTS (
SELECT *
FROM tmp_users t
WHERE t.id = b.id
);
-- Only update the records that actually change
-- [ updates are expensive in terms of I/O, because they create row-versions
--   and the old row-versions have to be cleaned up afterwards ]
-- Note that the key (id) does not change, so there will be no cascading.
-- ------------------------------------------------------------
UPDATE big_users b
SET name_1 = t.name_1
, name_2 = t.name_2
, address = t.address
-- , ... ALL THE COLUMNS here, except the key(s)
FROM tmp_users t
WHERE t.id = b.id
AND (t.name_1, t.name_2, t.address, ...) -- ALL THE COLUMNS, except the key(s)
IS DISTINCT FROM
(b.name_1, b.name_2, b.address, ...)
;
-- Maybe there were some new records in the CSV files. Add them.
INSERT INTO big_users (id,name_1,name_2,address, ...)
SELECT id,name_1,name_2,address, ...
FROM tmp_users t
WHERE NOT EXISTS (
SELECT *
FROM big_users x
WHERE x.id = t.id
);
I found a hacky solution:
begin;
update pg_index set indisvalid = false, indisready = false where indexrelid = 'users_pkey'::regclass;
DELETE FROM users;
COPY users FROM 'file.csv' WITH CSV;
REINDEX INDEX users_pkey;
DELETE FROM user_comments c WHERE NOT EXISTS (SELECT * FROM users u WHERE u.id = c.user_id);
commit;
The magic dirty hack is to disable the primary key index in the Postgres catalog and, at the end, force a reindex (which overrides what we changed). I can't use foreign keys with ON DELETE CASCADE because, for some reason, that makes the constraint fire immediately... So instead my foreign keys are ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED and I have to do the delete myself.
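For reference, a foreign-key declaration matching that description might look roughly like this (the constraint and column names are assumptions):
ALTER TABLE user_comments
    ADD CONSTRAINT user_comments_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (id)
    ON DELETE NO ACTION
    DEFERRABLE INITIALLY DEFERRED;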
This works well in my case because only a few users are referenced by other tables.
I wish there was a cleaner solution though...
I'm trying to achieve database abstraction in my project, but now I've gotten stuck doing a bulk INSERT in PostgreSQL. My project is in C# and I'm using PostgreSQL 9.3 with npgsql.dll 2.0.14.
For Microsoft SQL Server I'm doing the bulk INSERT simply by concatenating all statements and then performing an ExecuteNonQuery:
IF NOT EXISTS (SELECT id FROM table WHERE id = 1) INSERT INTO table (id) VALUES (1);
IF NOT EXISTS (SELECT id FROM table WHERE id = 2) INSERT INTO table (id) VALUES (2);
IF NOT EXISTS (SELECT id FROM table WHERE id = 3) INSERT INTO table (id) VALUES (3);
Though the IF-NOT-EXISTS clause can be substituted in PostgreSQL by a SELECT-WHERE, this approach unfortunately still doesn't work - because every single statement in PostgreSQL is committed separately.
So I googled for another solution and found the approach of using the COPY command along with NpgsqlCopySerializer/NpgsqlCopyIn to "stream" the bulk data efficiently. But now I'm getting primary key violation errors all the time, because the EXISTS/WHERE clause apparently can't be used together with the COPY statement.
I would really like to avoid doing the INSERTs one by one, as this would slow my application down considerably, so I hope someone has solved this issue already!
Generally, for this type of situation, I'd have a separate staging table without the PK constraint, which I'd populate using COPY (assuming the data are in a format for which COPY makes sense; a sketch of that setup follows the query below). Then I'd do something like:
insert into table
select a.*
from staging a
where not exists (select 1
                  from table b
                  where a.id = b.id)
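For completeness, the staging-table setup described above might look roughly like this (using the same placeholder names as the query above):
-- staging table with the same columns as the real table, but no PK constraint
create table staging as select * from table where false;

-- bulk-load into the staging table (from C#, NpgsqlCopyIn can drive this)
COPY staging FROM STDIN;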
That approach isn't too far off from your original design.
I don't totally understand this part of your question, however, and it doesn't even seem entirely relevant:
this approach unfortunately still doesn't work - because every single
statement in postgreSQL is committed separately.
That's not true at all, for any RDBMS. Sure, auto-commit might be enabled on your client, but that doesn't mean that Postgres commits every statement separately or that you can't disable the auto-commit. This approach would work:
begin;
insert into table (id) select 1 where not exists (select 1 from table where id = 1);
insert into table (id) select 2 where not exists (select 1 from table where id = 2);
insert into table (id) select 3 where not exists (select 1 from table where id = 3);
commit;
As you pointed out, however, if you've got more than a handful of such statements you'll quickly be hitting some performance concerns.
I've never had to post a question on StackOverflow before because I can always find an answer here by just searching. Only this time, I think I've got a real stumper....
I'm writing code that automates the process of moving data from one SQL Server database to another. I have some pretty standard SQL Server databases with foreign key relationships between some of their tables. Straightforward stuff. One of my requirements is that each entire table needs to be copied in one fell swoop, without looping through rows or using a cursor. Another requirement is that I have to do this in SQL, with no SSIS or other external helpers.
For example:
INSERT INTO TargetDatabase.dbo.MasterTable
SELECT * FROM SourceDatabase.dbo.MasterTable
That's easy enough. Then, once the data from the MasterTable has been moved, I move the data of the child table.
INSERT INTO TargetDatabase.dbo.ChildTable
SELECT * FROM SourceDatabase.dbo.ChildTable
Of course, in reality I use more explicit SQL... like I specifically name all the fields and things like that, but this is just a simplified version. Anyway, so far everything's going alright, except ...
The problem is that the primary key of the master table is defined as an identity field. So, when I insert into the MasterTable, the primary key for the new table gets calculated by the database. So to deal with that, I tried using the OUTPUT INTO statement to get the updated values into a Temp table:
INSERT INTO TargetDatabase.dbo.MasterTable
OUTPUT INSERTED.* INTO #MyTempTable
SELECT * FROM SourceDatabase.dbo.MasterTable
So here's where it all falls apart. Since the database changed the primary key, how on earth do I figure out which record in the temp table matches up with the original record in the source table?
Do you see the problem? I know what the new ID is, I just don't know how to match it with the original record reliably. SQL Server lets me output the INSERTED values, but doesn't let me output the source-table values alongside the INSERTED values. I've tried it with triggers, I've tried it with an SP; I always hit the same problem.
If I were just updating one record at a time, I could easily match up my INSERTED values with the original record I was trying to insert to see the old and new primary key values, but I have this requirement to do it in a batch.
Any Ideas?
PS: I'm not allowed to change the table structure of the target or source table.
You can use MERGE.
declare #Source table (SourceID int identity(1,2), SourceName varchar(50))
declare #Target table (TargetID int identity(2,2), TargetName varchar(50))
insert into #Source values ('Row 1'), ('Row 2')
merge #Target as T
using #Source as S
on 0=1 -- always false, so every source row falls into "not matched" and gets inserted
when not matched then
insert (TargetName) values (SourceName)
output inserted.TargetID, S.SourceID; -- MERGE's OUTPUT may reference source columns, unlike INSERT's
Result:
TargetID SourceID
----------- -----------
2 1
4 3
Covered in this blog post by Adam Machanic: Dr. OUTPUT or: How I Learned to Stop Worrying and Love the MERGE
To illustrate what I mentioned in the comment:
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (IdentityColumn, OtherColumn1, OtherColumn2, ...)
SELECT IdentityColumn, OtherColumn1, OtherColumn2, ...
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
Okay, since that didn't work for you (pre-existing values in the target tables), how about adding a fixed increment (offset) to the id values in both tables, using the current max id value in the target? Assuming the identity column is "id" in both tables:
DECLARE #incr int
BEGIN TRAN
SELECT @incr = max(id)
FROM TargetDatabase.dbo.MasterTable AS m WITH (TABLOCKX, HOLDLOCK)
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
INSERT INTO TargetDatabase.dbo.ChildTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.ChildTable
COMMIT TRAN
I need to duplicate selected rows with all the fields exactly the same except the ID (an identity int), which is generated automatically by SQL Server.
What is the best way to duplicate/clone a record or records (up to 50)?
Is there any T-SQL functionality in MS SQL 2008, or do I need to write SELECT/INSERT statements in a stored procedure?
The only way to accomplish what you want is by using INSERT statements that enumerate every column except the identity column (a concrete sketch follows the skeleton below).
You can of course select multiple rows to be duplicated by using a SELECT statement in your INSERT statement. However, I would assume that this will violate your business key (the other unique constraint on the table besides the surrogate key, which you do have, right?) and require some other column to be altered as well.
Insert MyTable( ...
Select ...
From MyTable
Where ....
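As a concrete sketch, with a hypothetical table MyTable that has an identity column Id plus columns Name and Price:
-- duplicate the chosen rows; the identity column generates new IDs automatically
INSERT INTO MyTable (Name, Price)
SELECT Name, Price
FROM MyTable
WHERE Id IN (1, 2, 3);  -- the rows to clone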
If it is a pure copy (minus the ID field), then the following will work. Replace 'NameOfExistingTable' with the table you want to duplicate rows from, and optionally use the WHERE clause to limit the data you wish to duplicate:
SELECT *
INTO #TempImportRowsTable
FROM (
SELECT *
FROM [NameOfExistingTable]
-- WHERE ID = 1
) AS createTable
-- If needed make other alterations to the temp table here
ALTER TABLE #TempImportRowsTable DROP COLUMN Id
INSERT INTO [NameOfExistingTable]
SELECT * FROM #TempImportRowsTable
DROP TABLE #TempImportRowsTable
If you're able to check the duplication condition as rows are inserted, you could put an INSERT trigger on the table. This would allow you to check the columns as they are inserted instead of having to select over the entire table.
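A rough sketch of what such a trigger could look like (the table and column names are made up, and the "duplicate" test here is assumed to be on a Name column):
CREATE TRIGGER trg_MyTable_SkipDuplicates
ON MyTable
INSTEAD OF INSERT
AS
BEGIN
    -- insert only rows whose Name is not already present in the table
    INSERT INTO MyTable (Name, Price)
    SELECT i.Name, i.Price
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM MyTable t WHERE t.Name = i.Name);
END;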