How to do PostgreSQL Bulk INSERT without Primary Key Violation [duplicate] - postgresql

This question already has an answer here:
How to bulk insert only new rows in PostreSQL
(1 answer)
Closed 8 years ago.
I'm trying to achieve database abstraction in my project, but now I got stuck with doing a bulk INSERT in PostgreSQL. My project is in C# and I'm using PostgreSQL 9.3 with npgsql.dll 2.0.14.
For Microsoft SQL Server I'm doing the bulk INSERT simply by concatenating all statements and then performing an ExecuteNonQuery:
IF NOT EXISTS (SELECT id FROM table WHERE id = 1) INSERT INTO table (id) VALUES (1);
IF NOT EXISTS (SELECT id FROM table WHERE id = 2) INSERT INTO table (id) VALUES (2);
IF NOT EXISTS (SELECT id FROM table WHERE id = 3) INSERT INTO table (id) VALUES (3);
Though the IF-NOT-EXISTS clause can be substituted in PostgreSQL by a SELECT-WHERE, this approach unfortunately still doesn't work - because every single statement in PostgreSQL is committed separately.
So I googled for another solution and found the approach of using the COPY command along with NpgsqlCopySerializer/NpgsqlCopyIn to performantly "stream" the bulk data. But now I'm getting primary key violation errors all the time - 'cause the EXISTS/WHERE clause can seemingly not be used together with the COPY statement.
I would really like to avoid to do the INSERTs all one-by-one, as this will slow down my application extremely, so I hope that anyone solved this issue already!

Generally for this type of situation I'd have a separate staging table that does not have the PK constraint, which I'd populate using COPY (assuming the data were in a format for which it makes sense to do a COPY). Then I'd do something like:
insert into table
select a.*
from staging a
where not exists (select 1
from table
where a.id = b.id)
That approach isn't too far off from your original design.
I don't totally understand this part of your question, however, which doesn't even seem totally relevant to your question:
this approach unfortunately still doesn't work - because every single
statement in postgreSQL is committed separately.
That's not true at all, not for any RDBMS. Sure, auto-commit might be enabled on your client, but that doesn't mean that postgres commits every statement separately and that you can't disable the auto-commit. This approach would work:
begin;
insert into table (id) select 1 where not exists (select 1 from table where id = 1);
insert into table (id) select 2 where not exists (select 1 from table where id = 2);
insert into table (id) select 3 where not exists (select 1 from table where id = 3);
commit;
As you pointed out, however, if you've got more than a handful of such statements you'll quickly be hitting some performance concerns.

Related

Redshift Temp Table Identity column

My stored procedure includes the following code:
CREATE TEMPORARY TABLE #lala
(
idx int IDENTITY(1,1),
tablename nvarchar(128)
);
INSERT INTO #lala(tablename)
SELECT LEFT(tablename, LEN(tablename) - 3)
FROM SVV_EXTERNAL_TABLES
WHERE schemaname = 'spectrum'
AND tablename LIKE '%_v2';
I'm then calling it like this:
BEGIN;
CALL myschema.make_union_views('spectrum_views','spectrum','mycursor');
FETCH ALL FROM mycursor;
COMMIT;
At first it was running succesfully.
Then it began falling over, and debugging, I listed the contents of '#lala
I am confused as to how this has come about - that the [idx] column is not sequential ?
Hope that someone can shed some light on what might be happening?
This is by design. Redshift is a cluster and as such communications between parts of the cluster are expensive. Redshift ensures the uniqueness of identity columns but NOT sequentiality. Per the CREATE TABLE documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html):
When you load the table using an INSERT INTO [tablename] SELECT * FROM
or COPY statement, the data is loaded in parallel and distributed to
the node slices. To be sure that the identity values are unique,
Amazon Redshift skips a number of values when creating the identity
values.

Restoring only some key values using COPY STDIN in Postgres?

I accidentally ran a query on live data that deleted 5000 odd rows. I made a backup before I did this, and the backup is in this format:
COPY table (id, "position", event) FROM stdin;
529 1 5283
648 1 6473
687 1 6853
\.
Problem is, if I run it, i get:
ERROR: duplicate key value violates unique constraint "table_pkey"
is there a way to alter this query to only insert the rows I deleted? Something like an "if exists, ignore" kind of thing? Normally I know this affects many things, but because it's literally just those entries that need to be replaced, I think something like this could work, but I don't know if it exists?
Easiest way may be to create a copy of the original table and restore to that.
Then insert to original table from copy where no entry exists in original.
e.g.
create table copy_table as select * from table where 1=2;
-- change the copy statement
COPY copy_table from stdin;
...
-- Insert to original
INSERT INTO table t1
SELECT ct.*
FROM copy_table ct
LEFT JOIN table t2 ON t2.id = ct.id -- assuming id is primary key
WHERE t2.id IS NULL;
No, this is unfortunately not possible using the COPY command.
You need to insert all rows into a staging table, then use insert into .. select ... where not exits (...) to copy the misssing rows from the staging table into the real table.

SQL Server - How to find a records in INSERTED when the database generates a primary key

I've never had to post a question on StackOverflow before because I can always find an answer here by just searching. Only this time, I think I've got a real stumper....
I'm writing code that automates the process of moving data from one SQL Server database to another. I have some pretty standard SQL Server Databases with foreign key relationships between some of their tables. Straight forward stuff. One of my requirements is that the entire table needs to be copied in one fell swoop, without looping through rows or using a cursor. Another requirement is I have to do this in SQL, no SSIS or other external helpers.
For example:
INSERT INTO TargetDatabase.dbo.MasterTable
SELECT * FROM SourceDatabase.dbo.MasterTable
That's easy enough. Then, once the data from the MasterTable has been moved, I move the data of the child table.
INSERT INTO TargetDatabase.dbo.ChildTable
SELECT * FROM SourceDatabase.dbo.ChildTable
Of course, in reality I use more explicit SQL... like I specifically name all the fields and things like that, but this is just a simplified version. Anyway, so far everything's going alright, except ...
The problem is that the primary key of the master table is defined as an identity field. So, when I insert into the MasterTable, the primary key for the new table gets calculated by the database. So to deal with that, I tried using the OUTPUT INTO statement to get the updated values into a Temp table:
INSERT INTO TargetDatabase.dbo.MasterTable
OUPUT INSERTED.* INTO #MyTempTable
SELECT * FROM SourceDatabase.dbo.MasterTable
So here's where it all falls apart. Since the database changed the primary key, how on earth do I figure out which record in the temp table matches up with the original record in the source table?
Do you see the problem? I know what the new ID is, I just don't know how to match it with the original record reliably. The SQL server lets me output the INSERTED values, but doesn't let me output the FROM TABLE values along side the INSERTED values. I've tried it with triggers, I've tried it with an SP, always I have the same problem.
If I were just updating one record at a time, I could easily match up my INSERTED values with the original record I was trying to insert to see the old and new primary key values, but I have this requirement to do it in a batch.
Any Ideas?
PS: I'm not allowed to change the table structure of the target or source table.
You can use MERGE.
declare #Source table (SourceID int identity(1,2), SourceName varchar(50))
declare #Target table (TargetID int identity(2,2), TargetName varchar(50))
insert into #Source values ('Row 1'), ('Row 2')
merge #Target as T
using #Source as S
on 0=1
when not matched then
insert (TargetName) values (SourceName)
output inserted.TargetID, S.SourceID;
Result:
TargetID SourceID
----------- -----------
2 1
4 3
Covered in this blog post by Adam Machanic: Dr. OUTPUT or: How I Learned to Stop Worrying and Love the MERGE
To illustrate what I mentioned in the comment:
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (IdentityColumn, OtherColumn1, OtherColumn2, ...)
SELECT IdentityColumn, OtherColumn1, OtherColumn2, ...
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
Okay, since that didn't work for you (pre-existing values in target tables), how about adding a fixed increment (offset) to the id values in both tables (use the current max id value). Assuming the identity column is "id" in both tables:
DECLARE #incr int
BEGIN TRAN
SELECT #incr = max(id)
FROM TargetDatabase.dbo.MasterTable AS m WITH (TABLOCKX, HOLDLOCK)
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
INSERT INTO TargetDatabase.dbo.ChildTable (id{, othercolumns...})
SELECT id+#incr{, othercolumns...}
FROM SourceDatabase.dbo.ChildTable
COMMIT TRAN

Cascade new IDs to a second, related table

I have two tables, Contacts and Contacts_Detail. I am importing records into the Contacts table and need to run a SP to create a record in the Contacts_Detail table for each new record in the Contacts. There is an ID in the Contacts table and a matching ID_D in the Contacts_Detail table.
I'm using this to insert the record into Contacts_Detail but get the 'Subquery returned more than 1 value.' error and I can't figure out why. There are multiple records in Contacts that need have matching records in Contacts_Detail.
Insert into Contacts_Detail (ID_D)
select id from Contacts c
left join Contacts_Detail cd
on c.id = cd.id_d
where id_d is null
I'm open to a better way...
thanks.
It sounds like you're inserting blank child-records into your Contacts_Detail table -- so the first question I'd ask is: Why?
As for why your specific SQL isn't working...
A few things you can check:
Contacts table -- do you have any records there WHERE id is null?
(delete them -- then make the id field a primary key)
Contacts_Detail
table -- do you have any records there WHEERE id_d is null?
(delete them -- then go into your designer and create a relationship
/ enforce referential integrity.)
Verify that c.id is the primary
key, and cd.id_d is the correct foreign key to relate the tables.
Hope that helps
Why not just have a trigger? This seems a little simpler than having to determine for all time which rows are missing - that seems more like something you would do periodically to correct for some anomalies, not something you should have to do after every insert. Something like this should work:
CREATE TRIGGER dbo.NewContacts
ON dbo.Contacts
FOR INSERT
AS
BEGIN
INSERT dbo.Contacts_Detail(ID_D) SELECT ID FROM inserted;
END
GO
But I suspect you have a trigger on the Contacts_Detail table that is not written to correctly handle multi-row inserts, and that's where your subquery error is coming from. Can you show the trigger on Contacts_Detail?

Is there a way to quickly duplicate record in T-SQL?

I need to duplicate selected rows with all the fields exactly same except ID ident int which is added automatically by SQL.
What is the best way to duplicate/clone record or records (up to 50)?
Is there any T-SQL functionality in MS SQL 2008 or do I need to select insert in stored procedures ?
The only way to accomplish what you want is by using Insert statements which enumerate every column except the identity column.
You can of course select multiple rows to be duplicated by using a Select statement in your Insert statements. However, I would assume that this will violate your business key (your other unique constraint on the table other than the surrogate key which you have right?) and require some other column to be altered as well.
Insert MyTable( ...
Select ...
From MyTable
Where ....
If it is a pure copy (minus the ID field) then the following will work (replace 'NameOfExistingTable' with the table you want to duplicate the rows from and optionally use the Where clause to limit the data that you wish to duplicate):
SELECT *
INTO #TempImportRowsTable
FROM (
SELECT *
FROM [NameOfExistingTable]
-- WHERE ID = 1
) AS createTable
-- If needed make other alterations to the temp table here
ALTER TABLE #TempImportRowsTable DROP COLUMN Id
INSERT INTO [NameOfExistingTable]
SELECT * FROM #TempImportRowsTable
DROP TABLE #TempImportRowsTable
If you're able to check the duplication condition as rows are inserted, you could put an INSERT trigger on the table. This would allow you to check the columns as they are inserted instead of having to select over the entire table.