Merging two tables in PostgreSQL using `INSERT ... SELECT ... WHERE NOT IN(..)`

All,
I am trying to bulk-insert some data into a table using the COPY command, and I can't seem to get around the unique key error. Here's my workflow.
Create a dump of the data I want to move to another server
COPY (
SELECT *
FROM mytable
WHERE created_at >= '2012-10-01')
TO 'D:\tmp\file.txt'
Create a new "temp" table in the target DB, then COPY the data in like so.
COPY temp FROM 'D:\tmp\file.txt'
I now want to move the data from the "temp" table into the master table in the target DB like so.
INSERT INTO master SELECT * FROM temp
WHERE id NOT IN (SELECT id FROM master)
This runs fine, but nothing gets inserted and no fields are updated. Does anyone have a clue what might be going on here? The schemas for temp and master are identical. Any help on this matter would be great! I am using PostgreSQL 9.2.
Adam

This can happen if there's a null value in the IN list.
In SQL, a comparison against NULL never evaluates to true (you need the special IS NULL test to get a match). This has the unfortunate consequence of making the entire NOT IN condition fail if any null values are returned from SELECT id FROM master.
See if there are any rows returned from this query:
SELECT id
FROM master
WHERE id is null;
If not, then this isn't your problem.
If there are values, then the fix is to exclude null ids from the list:
INSERT INTO master
SELECT *
FROM temp
WHERE id NOT IN (SELECT id FROM master WHERE id IS NOT NULL)
The other thing to consider is that there may simply be no rows in temp that aren't already in master!
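Another option that sidesteps the NULL issue entirely is NOT EXISTS; a minimal sketch against the same two tables:
INSERT INTO master
SELECT *
FROM temp
WHERE NOT EXISTS (SELECT 1 FROM master WHERE master.id = temp.id);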

Related

Insert into table on conflict do update from csv

I have a table with two columns: id, which is varchar, and data, which is jsonb. I also have a CSV file with new IDs that I would like to insert into the table. The data I would like to assign to these IDs is identical, and if an ID already exists I would like to update its current data value with the new data. This is what I have done so far:
INSERT INTO "table" ("id", "data")
VALUES ('[IDs from CSV file]', ' {dataObject}')
ON CONFLICT (id) do UPDATE set data='{dataObject}';
I have got it working with a single ID, but I would now like to run this for every ID in my CSV file (hence the array in the example, to illustrate this). Is there a way to do this in a query? I was thinking I could create a temporary table and import the IDs there, but I am still not sure how I would use that table with my query.
Yes, use a staging table to upload your CSV into; make sure to truncate it before each upload. After uploading:
insert into prod_table
select * from csv_upload
on conflict (id) do update
set data = excluded.data;
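For completeness, one possible way to load the staging table before that upsert (a sketch only; the staging table's shape, the file path, and the constant jsonb value are assumptions, not from the question):
TRUNCATE csv_upload;
-- psql meta-command: load the new ids from the CSV file
\copy csv_upload (id) FROM 'ids.csv' WITH (FORMAT csv)
-- every new id gets the same data value in this scenario
UPDATE csv_upload SET data = '{"key": "value"}'::jsonb;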
Don't complicate the process unnecessarily:
1. Import the CSV into a temporary table T2.
2. Update T1 where rows match in T2.
3. Insert into T1 from T2 where rows do not match.
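A sketch of those three steps in SQL, with t1 and t2 standing in for the real table and the CSV staging table:
CREATE TEMPORARY TABLE t2 (id varchar, data jsonb);
-- load t2 from the CSV file with COPY or \copy here
-- step 2: update rows that already exist
UPDATE t1
SET data = t2.data
FROM t2
WHERE t1.id = t2.id;
-- step 3: insert rows that do not exist yet
INSERT INTO t1 (id, data)
SELECT t2.id, t2.data
FROM t2
WHERE NOT EXISTS (SELECT 1 FROM t1 WHERE t1.id = t2.id);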

Is this the correct way to bulk INSERT ON CONFLICT in Postgres?

I will provide a simplified example of my problem.
I have two tables: reviews and users.
reviews is updated with a bunch of reviews that users post. The process that fetches the reviews also returns information for the user that submitted it (and certain user data changes frequently).
I want to update users whenever I update reviews, in bulk, using COPY. The issue arises for users when the fetched data contains two or more reviews from the same user. If I do a simple INSERT ... ON CONFLICT, I might end up with errors, since an INSERT statement cannot update the same row twice.
A SELECT DISTINCT would solve that problem, but I also want to guarantee that I insert the latest data into the users table. This is how I am doing it. Keep in mind I am doing this in bulk:
1. Create a temporary table so that we can COPY to/from it.
CREATE TEMPORARY TABLE users_temp (
id uuid,
stat_1 integer,
stat_2 integer,
account_age_in_mins integer);
2. COPY data into temporary table
COPY users_temp (
id,
stat_1,
stat_2,
account_age_in_mins) FROM STDIN CSV ENCODING 'utf-8';
3. Lock users table and perform INSERT ON CONFLICT
LOCK TABLE users in EXCLUSIVE MODE;
INSERT INTO users SELECT DISTINCT ON (1)
users_temp.id,
users_temp.stat_1,
users_temp.stat_2,
users_temp.account_age_in_mins
FROM users_temp
ORDER BY 1, 4 DESC, 2, 3
ON CONFLICT (id) DO UPDATE
SET
stat_1 = EXCLUDED.stat_1,
stat_2 = EXCLUDED.stat_2,
account_age_in_mins = EXCLUDED.account_age_in_mins;
The reason I am doing a SELECT DISTINCT and an ORDER BY in step 3 is because I:
1. Only want to return one instance of each duplicated row.
2. From those duplicates, want to make sure that I get the most up-to-date record by sorting on account_age_in_mins.
Is this the correct method to achieve my goal?
This is a very good approach.
Maybe you can avoid the table lock by locking only the tuples you have in your temporary table.
https://dba.stackexchange.com/questions/106121/locking-in-postgres-for-update-insert-combination
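For example, locking just the rows the upsert will touch could look something like this (a sketch only; whether it is sufficient depends on your concurrent workload):
BEGIN;
-- lock only the existing users rows that appear in the staged data
SELECT 1
FROM users
WHERE id IN (SELECT id FROM users_temp)
FOR UPDATE;
-- ... run the INSERT ... ON CONFLICT from step 3 here ...
COMMIT;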

Merging postgres data

I have data in two PostgreSQL databases that needs to be merged into one. Just to be clear, both databases have "good" data in them from a certain date that needs to be combined; this isn't merely appending the data from one into the other. In other words, let's say that table foo has a serial id field. Both databases have a foo with id = 5555 and both rows are valid (but different). So, the target database's foo keeps 5555 and the new record should get added with a new id of nextval('foo_id_seq').
So, it's a big mess.
My thoughts are to create a tmp schema in the target db and to copy the needed data from the source db. Then I need to essentially "upsert" the data: new records get inserted with new ids (and foreign keys updated), and records that exist in both dbs get updated.
I don't believe there is a tool that will help me with this.
My question:
How best to handle generating the new ids? I know I could do it via SELECTs and just leave out the id column, but that's a lot of typing and would be slow. My thinking is to create a temporary trigger on these tables that overrides the id supplied when doing an insert.
Final notes:
Both databases are offline, and I'm the only one who can get to them.
Both databases have the exact same schema.
The target database is 9.2.
try using something like:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id IN (SELECT A.id FROM A);
INSERT INTO A(id, f1, f2)
SELECT tmp_A.id, tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id NOT IN (SELECT A.id FROM A);
The idea: use one INSERT ... SELECT ... for the rows whose ids conflict (generating new ids for them), and another INSERT ... SELECT ... for the rows without a conflict (keeping their original ids).
Or simply generate a new id for every inserted record:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A;
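If child tables reference A.id, the freshly generated ids for the conflicting rows also need to be recorded so their foreign keys can be rewritten. A sketch of one way to do that (id_map, tmp_child, and a_id are illustrative names, not from the question):
CREATE TEMP TABLE id_map AS
SELECT id AS old_id, nextval('A_seq') AS new_id
FROM tmp_A
WHERE id IN (SELECT id FROM A);
-- insert the conflicting rows under their new ids
INSERT INTO A(id, f1, f2)
SELECT m.new_id, t.f1, t.f2
FROM tmp_A t
JOIN id_map m ON m.old_id = t.id;
-- repoint the staged child rows at the new parent ids before inserting them
UPDATE tmp_child c
SET a_id = m.new_id
FROM id_map m
WHERE c.a_id = m.old_id;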

Restoring only some key values using COPY STDIN in Postgres?

I accidentally ran a query on live data that deleted 5000-odd rows. I made a backup before I did this, and the backup is in this format:
COPY table (id, "position", event) FROM stdin;
529 1 5283
648 1 6473
687 1 6853
\.
Problem is, if I run it, I get:
ERROR: duplicate key value violates unique constraint "table_pkey"
Is there a way to alter this query to only insert the rows I deleted? Something like an "if exists, ignore" kind of thing? I know this normally affects many things, but because it's literally just those entries that need to be replaced, I think something like this could work, but I don't know if it exists.
The easiest way may be to create a copy of the original table and restore to that.
Then insert into the original table from the copy where no entry exists in the original.
e.g.
create table copy_table as select * from table where 1=2;
-- change the copy statement
COPY copy_table from stdin;
...
-- Insert to original
INSERT INTO table
SELECT ct.*
FROM copy_table ct
LEFT JOIN table t2 ON t2.id = ct.id -- assuming id is primary key
WHERE t2.id IS NULL;
No, this is unfortunately not possible using the COPY command.
You need to insert all rows into a staging table, then use insert into ... select ... where not exists (...) to copy the missing rows from the staging table into the real table.
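A sketch of that approach, reusing the column list and rows from the backup (the staging table name and column types are guesses based on the sample data):
CREATE TEMP TABLE staging (id integer, "position" integer, event integer);
COPY staging (id, "position", event) FROM stdin;
529	1	5283
648	1	6473
687	1	6853
\.
INSERT INTO "table" (id, "position", event)
SELECT s.id, s."position", s.event
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM "table" t WHERE t.id = s.id);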

SQL Server - How to find records in INSERTED when the database generates a primary key

I've never had to post a question on StackOverflow before because I can always find an answer here by just searching. Only this time, I think I've got a real stumper....
I'm writing code that automates the process of moving data from one SQL Server database to another. I have some pretty standard SQL Server databases with foreign key relationships between some of their tables. Straightforward stuff. One of my requirements is that the entire table needs to be copied in one fell swoop, without looping through rows or using a cursor. Another requirement is that I have to do this in SQL, with no SSIS or other external helpers.
For example:
INSERT INTO TargetDatabase.dbo.MasterTable
SELECT * FROM SourceDatabase.dbo.MasterTable
That's easy enough. Then, once the data from the MasterTable has been moved, I move the data of the child table.
INSERT INTO TargetDatabase.dbo.ChildTable
SELECT * FROM SourceDatabase.dbo.ChildTable
Of course, in reality I use more explicit SQL... like I specifically name all the fields and things like that, but this is just a simplified version. Anyway, so far everything's going alright, except ...
The problem is that the primary key of the master table is defined as an identity field. So, when I insert into the MasterTable, the primary key for each new record gets generated by the database. To deal with that, I tried using the OUTPUT INTO clause to get the inserted values into a temp table:
INSERT INTO TargetDatabase.dbo.MasterTable
OUTPUT INSERTED.* INTO #MyTempTable
SELECT * FROM SourceDatabase.dbo.MasterTable
So here's where it all falls apart. Since the database changed the primary key, how on earth do I figure out which record in the temp table matches up with the original record in the source table?
Do you see the problem? I know what the new ID is, I just don't know how to match it with the original record reliably. SQL Server lets me output the INSERTED values, but doesn't let me output the source table's values alongside the INSERTED values. I've tried it with triggers, I've tried it with an SP; I always have the same problem.
If I were just updating one record at a time, I could easily match up my INSERTED values with the original record I was trying to insert to see the old and new primary key values, but I have this requirement to do it in a batch.
Any Ideas?
PS: I'm not allowed to change the table structure of the target or source table.
You can use MERGE.
declare @Source table (SourceID int identity(1,2), SourceName varchar(50))
declare @Target table (TargetID int identity(2,2), TargetName varchar(50))
insert into @Source values ('Row 1'), ('Row 2')
merge @Target as T
using @Source as S
on 0=1
when not matched then
insert (TargetName) values (SourceName)
output inserted.TargetID, S.SourceID;
Result:
TargetID    SourceID
----------- -----------
2           1
4           3
Covered in this blog post by Adam Machanic: Dr. OUTPUT or: How I Learned to Stop Worrying and Love the MERGE
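If you need that old-to-new mapping to drive the child-table copy, the OUTPUT clause of MERGE can also write into a table variable that you then join on. A sketch along those lines, with made-up column names (not the asker's actual schema):
declare @IdMap table (NewID int, OldID int);
merge TargetDatabase.dbo.MasterTable as T
using SourceDatabase.dbo.MasterTable as S
on 0 = 1
when not matched then
insert (OtherColumn1, OtherColumn2)
values (S.OtherColumn1, S.OtherColumn2)
output inserted.ID, S.ID into @IdMap (NewID, OldID);
insert into TargetDatabase.dbo.ChildTable (MasterID, ChildColumn)
select m.NewID, c.ChildColumn
from SourceDatabase.dbo.ChildTable c
join @IdMap m on m.OldID = c.MasterID;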
To illustrate what I mentioned in the comment:
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (IdentityColumn, OtherColumn1, OtherColumn2, ...)
SELECT IdentityColumn, OtherColumn1, OtherColumn2, ...
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
Okay, since that didn't work for you (pre-existing values in the target tables), how about adding a fixed increment (offset) to the id values in both tables, using the current max id value? Assuming the identity column is "id" in both tables:
DECLARE @incr int
BEGIN TRAN
SELECT @incr = max(id)
FROM TargetDatabase.dbo.MasterTable AS m WITH (TABLOCKX, HOLDLOCK)
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable ON
INSERT INTO TargetDatabase.dbo.MasterTable (id{, othercolumns...})
SELECT id+@incr{, othercolumns...}
FROM SourceDatabase.dbo.MasterTable
SET IDENTITY_INSERT TargetDatabase.dbo.MasterTable OFF
INSERT INTO TargetDatabase.dbo.ChildTable (id{, othercolumns...})
SELECT id+@incr{, othercolumns...}
FROM SourceDatabase.dbo.ChildTable
COMMIT TRAN