Is this the correct way to bulk INSERT ON CONFLICT in Postgres? - postgresql

I will provide a simplified example of my problem.
I have two tables: reviews and users.
reviews is updated with a bunch of reviews that users post. The process that fetches the reviews also returns information for the user that submitted it (and certain user data changes frequently).
I want to update users whenever I update reviews, in bulk, using COPY. The issue arises for users when the fetched data contains two or more reviews from the same user. If I do a simple INSERT ON CONFLICT, I might end up with errors, since a single INSERT ... ON CONFLICT statement cannot update the same row twice.
A SELECT DISTINCT would solve that problem, but I also want to guarantee that I insert the latest data into the users table. This is how I am doing it. Keep in mind I am doing this in bulk:
1. Create a temporary table so that we can COPY to/from it.
CREATE TEMPORARY TABLE users_temp (
id uuid,
stat_1 integer,
stat_2 integer,
account_age_in_mins integer);
2. COPY data into temporary table
COPY users_temp (
id,
stat_1,
stat_2,
account_age_in_mins) FROM STDIN CSV ENCODING 'utf-8';
3. Lock users table and perform INSERT ON CONFLICT
LOCK TABLE users in EXCLUSIVE MODE;
INSERT INTO users SELECT DISTINCT ON (1)
users_temp.id,
users_temp.stat_1,
users_temp.stat_2,
users_temp.account_age_in_mins
FROM users_temp
ORDER BY 1, 4 DESC, 2, 3
ON CONFLICT (id) DO UPDATE
SET
stat_1 = EXCLUDED.stat_1,
stat_2 = EXCLUDED.stat_2,
account_age_in_mins = EXCLUDED.account_age_in_mins;
The reason I am doing a SELECT DISTINCT and an ORDER BY in step 3 is that I:
1. Only want to return one instance of each duplicated row.
2. From those duplicates, want to make sure that I get the most up-to-date record by sorting on account_age_in_mins.
Is this the correct method to achieve my goal?

This is a very good approach.
You may be able to avoid the table lock by locking only the tuples that appear in your temporary table.
https://dba.stackexchange.com/questions/106121/locking-in-postgres-for-update-insert-combination
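A minimal sketch of what that could look like, reusing the users and users_temp tables from the question (this assumes users has the same four columns; SELECT ... FOR UPDATE only locks rows that already exist, and brand-new ids are still handled by ON CONFLICT):
BEGIN;
-- lock only the user rows we are about to touch, instead of the whole table
SELECT 1
FROM users u
JOIN users_temp t ON t.id = u.id
FOR UPDATE OF u;
INSERT INTO users (id, stat_1, stat_2, account_age_in_mins)
SELECT DISTINCT ON (t.id)
       t.id, t.stat_1, t.stat_2, t.account_age_in_mins
FROM users_temp t
ORDER BY t.id, t.account_age_in_mins DESC
ON CONFLICT (id) DO UPDATE
SET stat_1 = EXCLUDED.stat_1,
    stat_2 = EXCLUDED.stat_2,
    account_age_in_mins = EXCLUDED.account_age_in_mins;
COMMIT;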

Related

PostgreSQL: Return auto-generated ids from COPY FROM insertion

I have a non-empty PostgreSQL table with a GENERATED ALWAYS AS IDENTITY column id. I do a bulk insert with the C++ binding pqxx::stream_to, which I'm assuming uses COPY FROM. My problem is that I want to know the ids of the newly created rows, but COPY FROM has no RETURNING clause. I see several possible solutions, but I'm not sure if any of them is good, or which one is the least bad:
Provide the ids manually through COPY FROM, taking care to give the values which the identity sequence would have provided, then afterwards synchronize the sequence with setval(...).
First stream the data to a temp-table with a custom index column for ordering. Then do something like
INSERT INTO foo (col1, col2)
SELECT ttFoo.col1, ttFoo.col2 FROM ttFoo
ORDER BY ttFoo.idx RETURNING foo.id
and depend on the fact that the identity sequence produces ascending numbers to correlate them with ttFoo.idx (I cannot also do RETURNING ttFoo.idx, because only the inserted row is available there, and it doesn't contain idx).
Query the current value of the identity sequence prior to insertion, then check afterwards which rows are new.
I would assume that this is a common situation, yet I don't see an obviously correct solution. What do you recommend?
You can find out which rows have been affected by your current transaction using the system columns. The xmin column contains the ID of the inserting transaction, so to return the id values you just copied, you could:
BEGIN;
COPY foo(col1,col2) FROM STDIN;
SELECT id FROM foo
WHERE xmin::text = (txid_current() % (2^32)::bigint)::text
ORDER BY id;
COMMIT;
The WHERE clause comes from this answer, which explains the reasoning behind it.
I don't think there's any way to optimise this with an index, so it might be too slow on a large table. If so, I think your second option would be the way to go, i.e. stream into a temp table and INSERT ... RETURNING.
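If you go that route, a rough sketch could look like the following (it assumes foo(id GENERATED ALWAYS AS IDENTITY, col1, col2) as in the question, with hypothetical text payload columns, and relies on the question's assumption that the identity sequence hands out ascending ids in insertion order):
CREATE TEMP TABLE ttFoo (idx bigint, col1 text, col2 text);
-- stream the rows into ttFoo, e.g. with pqxx::stream_to
COPY ttFoo (idx, col1, col2) FROM STDIN;
-- insert in idx order; sort the returned ids ascending on the client
-- and pair them with ascending idx values
INSERT INTO foo (col1, col2)
SELECT col1, col2
FROM ttFoo
ORDER BY idx
RETURNING id;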
I think you can create the id column with type uuid.
As a first step, generate the ids randomly yourself and then bulk insert them; that way you will not need to return the ids from the database.
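A minimal sketch of that idea (the table foo and its columns are hypothetical; gen_random_uuid() is built in from PostgreSQL 13, older versions need the pgcrypto extension):
CREATE TABLE foo (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    col1 text,
    col2 text
);
-- the client generates the uuids itself and ships them in the COPY stream,
-- so it already knows every id and needs no RETURNING clause
COPY foo (id, col1, col2) FROM STDIN;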

PostgreSQL Latest Record w/o id nor date

I have a foreign table without id nor date.
If for example other users insert a number of records, is it possible in PostgreSQL to select the last record inserted?
*Note: My only access to that table is select only
SQL tables represent unordered sets, and so do result sets. You cannot guarantee the order of your data without specifying ORDER BY.
And:
I have a foreign table without id nor date
There is no way to work around this; you need a column that specifies what you want.
My only access to that table is select only
If you only have the SELECT privilege, you should tell your DBA that you cannot guarantee 100% that the data you return is really the last data inserted by that user.
Based on my knowledge, PostgreSQL does not guarantee that insertion order is preserved. Without a timestamp field or a sequential primary key, I do not think guaranteed retrieval of the last row is possible.
You can try this
SELECT * FROM YOUR_TABLE WHERE CTID = (SELECT MAX(CTID) FROM YOUR_TABLE)
provided that the target table never receives UPDATE operations (an UPDATE gives the row a new ctid, so the maximum ctid would no longer point at the most recently inserted row).

"ON UPDATE" equivalent for Amazon Redshift

I want to create a table that has a column updated_date that is set to SYSDATE every time any field in that row is updated. How should I do this in Redshift?
You should create the table definition like below; that will make sure that whenever you insert a record, the column is populated with SYSDATE.
create table test(
id integer not null,
update_at timestamp DEFAULT SYSDATE);
What about updating the column on every field update?
Remember, Redshift is a DW solution, not a simple database, hence updates should be avoided or minimized.
UPDATE = DELETE + INSERT
Ideally, instead of updating a record you should delete and re-insert it; since an update is effectively DELETE + INSERT anyway, the DEFAULT takes care of populating update_at on the re-insert.
Also, most ETLs use a staging table; if you populate your data through a stg_sales table, the above solution still works, and you could do something like below.
DELETE from SALES where id in (select Id from stg_sales);
INSERT INTO SALES select id from stg_sales;
Hope this answers your question.
Redshift doesn't support UPSERTs, so you should load your data to a temporary/staging table first and check for IDs in the main tables, which also exist in the staging table (i.e. which need to be updated).
Delete those records, and INSERT the data from the staging table, which will have the new updated_date.
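A minimal sketch of that staging pattern, using hypothetical sales / stg_sales names (as in the earlier answer) and an updated_date column with DEFAULT SYSDATE; the amount column is just a placeholder for your real payload columns:
BEGIN;
-- remove the main-table rows that are about to be replaced
DELETE FROM sales
USING stg_sales
WHERE sales.id = stg_sales.id;
-- re-insert them from staging; updated_date is not listed,
-- so it falls back to its DEFAULT SYSDATE
INSERT INTO sales (id, amount)
SELECT id, amount
FROM stg_sales;
COMMIT;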
Also, don't forget to run VACUUM on your tables every once in a while, because your use case involves a lot of DELETEs and UPDATEs.
Refer this for additional info.

Merging postgres data

I have data in two postgresql databases that needs to be merged into one. Just to be clear, both databases have "good" data in them from a certain date that needs to be combined. This isn't merely appending the data from one into the other. In other words, let's say that table foo has a serial id field. Both databases have a foo with ID=5555 and both values are valid (but different). So, the target database's foo keeps 5555 and the new record should get added with a new ID of nextval(foo_id_seq).
So, it's a big mess.
My thoughts are to create a tmp schema in the target db and to copy the needed data from the source db. Then I need to essentially "upsert" the data. New records get inserted with new ids (and foreign keys updated), and records that exist in both dbs get updated.
I don't believe there is a tool that will help me with this.
My question:
How best to handle generating the new ids? I know I could do it via selects and just leave out the id column, but that's a lot of typing and would be slow. My thinking is to create a temporary trigger on these tables that overrides the id supplied when doing an insert (see the sketch after the notes below).
Final notes:
Both databases are offline, and I'm the only one who can access them.
Both databases have the exact same schema.
Target database is 9.2
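A minimal sketch of that temporary-trigger idea, using the table A and sequence A_seq that the answer below also uses (this is only the asker's idea sketched out, not necessarily the best approach):
CREATE OR REPLACE FUNCTION force_new_id() RETURNS trigger AS $$
BEGIN
    -- ignore whatever id the source row carried and take a fresh one
    NEW.id := nextval('A_seq');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER force_new_id_trg
BEFORE INSERT ON A
FOR EACH ROW
EXECUTE PROCEDURE force_new_id();
-- ... run the inserts from the tmp schema ...
DROP TRIGGER force_new_id_trg ON A;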
try using something like:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id IN (select A.id FROM A);
INSERT INTO A(id, f1, f2)
SELECT tmp_A.id, tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id NOT IN (select A.id FROM A);
The idea - use one INSERT .. SELECT .. to insert the data with conflicts in id fields and other INSERT .. SELECT .. to insert the data without the conflict.
Or simply generate new id for every inserted record:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A;

Cascade new IDs to a second, related table

I have two tables, Contacts and Contacts_Detail. I am importing records into the Contacts table and need to run a SP to create a record in the Contacts_Detail table for each new record in the Contacts. There is an ID in the Contacts table and a matching ID_D in the Contacts_Detail table.
I'm using this to insert the record into Contacts_Detail but get the 'Subquery returned more than 1 value.' error and I can't figure out why. There are multiple records in Contacts that need to have matching records in Contacts_Detail.
Insert into Contacts_Detail (ID_D)
select id from Contacts c
left join Contacts_Detail cd
on c.id = cd.id_d
where id_d is null
I'm open to a better way...
thanks.
It sounds like you're inserting blank child-records into your Contacts_Detail table -- so the first question I'd ask is: Why?
As for why your specific SQL isn't working...
A few things you can check:
Contacts table -- do you have any records there WHERE id is null? (Delete them, then make the id field a primary key.)
Contacts_Detail table -- do you have any records there WHERE id_d is null? (Delete them, then go into your designer and create a relationship / enforce referential integrity.)
Verify that c.id is the primary key and cd.id_d is the correct foreign key relating the tables.
Hope that helps
Why not just have a trigger? This seems a little simpler than having to determine for all time which rows are missing - that seems more like something you would do periodically to correct for some anomalies, not something you should have to do after every insert. Something like this should work:
CREATE TRIGGER dbo.NewContacts
ON dbo.Contacts
FOR INSERT
AS
BEGIN
INSERT dbo.Contacts_Detail(ID_D) SELECT ID FROM inserted;
END
GO
But I suspect you have a trigger on the Contacts_Detail table that is not written to correctly handle multi-row inserts, and that's where your subquery error is coming from. Can you show the trigger on Contacts_Detail?
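For illustration only, a hedged sketch of the kind of single-row assumption that typically causes that error, next to a set-based version (the trigger name and the Contacts_Detail_Log table are hypothetical, since the actual trigger hasn't been shown):
-- problematic pattern: assumes "inserted" holds exactly one row, so a
-- multi-row insert makes the scalar subquery return more than one value
-- DECLARE @id INT = (SELECT ID_D FROM inserted);
-- set-based rewrite: operate on the whole "inserted" set at once
CREATE TRIGGER dbo.ContactsDetailAudit
ON dbo.Contacts_Detail
FOR INSERT
AS
BEGIN
    INSERT dbo.Contacts_Detail_Log (ID_D)
    SELECT ID_D FROM inserted;
END
GO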