PostgreSQL multi-value upserts

Is it possible to perform a multi-value upsert in PostgreSQL? I know multi-value inserts exist, as do the "ON CONFLICT" keywords to perform an update if the key is violated... but is it possible to bring the two together? Something like so...
INSERT INTO table1(col1, col2) VALUES (1, 'foo'), (2,'bar'), (3,'baz')
ON CONFLICT ON CONSTRAINT theConstraint DO
UPDATE SET (col2) = ('foo'), ('bar'), ('baz')
I googled the crud out of this and couldn't find anything regarding it.
I have an app that is utilizing pg-promise and I'm doing batch processing. It works but it's horrendously slow (like 50-ish rows every 5 seconds or so...). I figured if I could do away with the batch processing and instead correctly build this multi-valued upsert query, it could improve performance.
Edit:
Well... I just tried it myself and no, it doesn't work. Unless I'm doing it incorrectly. So now I guess my question has changed to, what's a good way to implement something like this?

Multi-valued upsert is definitely possible, and a significant part of why insert ... on conflict ... was implemented.
CREATE TABLE table1(col1 int, col2 text, constraint theconstraint unique(col1));
INSERT INTO table1 VALUES (1, 'parrot'), (4, 'turkey');
INSERT INTO table1 VALUES (1, 'foo'), (2,'bar'), (3,'baz')
ON CONFLICT ON CONSTRAINT theconstraint
DO UPDATE SET col2 = EXCLUDED.col2;
results in
regress=> SELECT * FROM table1 ORDER BY col1;
 col1 |  col2
------+--------
    1 | foo
    2 | bar
    3 | baz
    4 | turkey
(4 rows)
If the docs were unclear, please submit appropriate feedback to the pgsql-general mailing list. Or even better, propose a patch to the docs.
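For completeness, the same upsert can also target the unique column directly instead of naming the constraint; here that is equivalent, since theconstraint is the unique constraint on col1:
INSERT INTO table1 VALUES (1, 'foo'), (2,'bar'), (3,'baz')
ON CONFLICT (col1)
DO UPDATE SET col2 = EXCLUDED.col2;
EXCLUDED refers to the row that was proposed for insertion, so each conflicting row picks up its own new col2 value rather than a single hard-coded one.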

Related

Poor performance while upserting data to Postgres

I am submitting 3 million records to a Postgres table table1 from a staging table table2. I have my update and insert queries as below:
UPDATE table1 t set
col1 = stage.col1,
col2 = stage.col2 ,
col3 = stage.col3
from table2 stage
where t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level');
INSERT INTO table1
(col1,
col2,
col3,
col4,
id,
name,
level)
select
stage.col1,
stage.col2,
stage.col3,
stage.col4,
stage.id,
stage.name,
stage.level
from table2 stage
where NOT EXISTS (select 1
from table1 t
where t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level'));
I am facing performance issues (it takes a long 1.5 hours) even though I am using exactly the same indexed keys (btree) as defined on the table. In order to test the cause, I created a replica of table1 without indexes and I was able to submit the entire data set in roughly 15~17 minutes, so I am inclined to think that the indexes are killing the performance on the table, as there are so many of them (including some unused indexes which I cannot drop due to permission issues). I am looking for suggestions to improve/optimize my query, or perhaps some other strategy to upsert the data, in order to reduce the data load time. Any suggestion is appreciated.
Running an EXPLAIN ANALYZE on the query helped me realize the query was never using the defined indexes on the target table and was doing a sequential scan over a large number of rows. The cause was that one of the keys used in the update/insert is wrapped in coalesce() in the query, but the defined indexes were built without it. Working around that means I have to handle nulls properly before feeding the data in, but it improved the performance significantly. I am open to further improvements.
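The other direction, if the query itself cannot be changed, would be an expression index built from the exact expressions used in the join; a sketch along these lines (the index name is made up, and it assumes id is stored as text, hence the cast):
CREATE INDEX idx_table1_upsert_keys ON table1 (
    (id::uuid),
    (coalesce(name, 'name')),
    (coalesce(level, 'level'))
);
The planner only considers an expression index when the query uses the same expressions, which is exactly the mismatch the EXPLAIN ANALYZE exposed here.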

Cap on number of table rows matching a certain condition using Postgres exclusion constraint?

If I have a Postgresql db schema for a tires table like this (a user has many tires):
user_id integer
description text
size integer
color text
created_at timestamp
and I want to enforce a constraint that says "a user can only have 4 tires".
A naive way to implement this would be to do:
SELECT COUNT(*) FROM tires WHERE user_id='123'
, compare the result to 4, and insert only if it's lower than 4. That's susceptible to race conditions, which is why it's a naive approach.
I don't want to add a count column. How can I do this (or can I do this) using an exclusion constraint? If it's not possible with exclusion constraints, what is the canonical way?
The "right" way is using locks here.
Firstly, you need to lock the related rows. Secondly, insert the new record if the user has fewer than 4 tires. Here is the SQL:
begin;
-- lock the user's existing rows
-- (FOR UPDATE is not allowed with aggregate functions, so lock in a
-- subquery and count the locked rows)
select count(*)
from (select 1
      from tires
      where user_id = 123
      for update) as locked;
-- throw an error if there are more than 3
-- insert new row
insert into tires (user_id, description)
values (123, 'tire123');
commit;
You can read more here: How to perform conditional insert based on row count?
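To actually throw that error inside the database rather than in application code, the same lock-then-count logic can be wrapped in a function; a minimal sketch (the function name and error message are made up):
create or replace function add_tire(p_user_id integer, p_description text)
returns void
language plpgsql
as $$
declare
    n bigint;
begin
    -- lock the user's existing tires, then count the locked rows
    select count(*) into n
    from (select 1
          from tires
          where user_id = p_user_id
          for update) as locked;

    if n >= 4 then
        raise exception 'user % already has % tires', p_user_id, n;
    end if;

    insert into tires (user_id, description)
    values (p_user_id, p_description);
end;
$$;
Calling select add_tire(123, 'tire123'); keeps the lock, the check and the insert in one transaction, just like the block above.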

Update a very large table in PostgreSQL without locking

I have a very large table with 100M rows in which I want to update a column with a value on the basis of another column. The example query to show what I want to do is given below:
UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null
This is a master DB in a live environment with multiple slaves and I want to update it without locking the table or affecting the performance of the live environment. What would be the most effective way to do it? I'm thinking of making a procedure that updates rows in batches of 1000 or 10000 rows, using something like LIMIT, but I'm not quite sure how to do it as I'm not that familiar with Postgres and its pitfalls. Oh, and neither of the two columns has an index, but the table has other columns that do.
I would appreciate a sample procedure code.
Thanks.
There is no update without locking, but you can strive to keep the row locks few and short.
You could simply run batches of this:
UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
FROM mytable
WHERE col1 IS NOT NULL
AND col2 IS DISTINCT FROM 'ABCD'
LIMIT 10000) AS part
WHERE mytable.id = part.id;
Just keep repeating that statement until it modifies less than 10000 rows, then you are done.
Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.
To make that performant, an index like this would help:
CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;
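Since a sample procedure was requested: something along these lines can drive the batches automatically. It is only a sketch, assuming PostgreSQL 11 or later (so COMMIT is allowed inside a procedure) and a primary key column id; the procedure name is made up:
CREATE OR REPLACE PROCEDURE update_mytable_in_batches()
LANGUAGE plpgsql
AS $$
DECLARE
    rows_done bigint;
BEGIN
    LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        FROM (SELECT id
              FROM mytable
              WHERE col1 IS NOT NULL
                AND col2 IS DISTINCT FROM 'ABCD'
              LIMIT 10000) AS part
        WHERE mytable.id = part.id;

        GET DIAGNOSTICS rows_done = ROW_COUNT;
        COMMIT;   -- end each batch so its row locks are released promptly
        EXIT WHEN rows_done < 10000;
    END LOOP;
END;
$$;
Run it with CALL update_mytable_in_batches(); outside an explicit transaction block, otherwise the COMMIT inside the procedure will fail.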
Just an off-the-wall, out-of-the-box idea. Requiring both col1 and col2 to be null to qualify precludes using an index, so perhaps building a pseudo-index might be an option. This index would of course be a regular table, but it would only exist for a short period. Additionally, this relieves the lock-time worry.
create table indexer (mytable_id integer primary key);
insert into indexer(mytable_id)
select mytable_id
from mytable
where col1 is null
and col2 is null;
The above creates our 'index' that contains only the qualifying rows. Now wrap an update/delete statement into an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.
create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
with idx as
  ( update mytable
       set col2 = 'ABCD'
     where col2 is null
       and mytable_id in (select mytable_id
                            from indexer
                           limit rows_to_process_in)
    returning mytable_id
  )
delete from indexer
 where mytable_id in (select mytable_id from idx);
select count(*) from indexer;
$$;
When the function returns 0, all rows initially selected have been processed. At that point, repeat the entire process to pick up any rows added or updated that the initial selection didn't identify. That should be a small number, and the process is still available if needed later.
Like I said just an off-the-wall idea.
Edited
I must have read into it something that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting it in the 'index': no, the 'index' contains a single column, the primary key of the big table (and of itself).
Yes, you would need to run it multiple times unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs.
do $$
declare
    rows_remaining bigint;
begin
    loop
        rows_remaining := set_mytable_col2(200000);
        commit;
        exit when rows_remaining = 0;
    end loop;
end;
$$;

T-SQL Indices when using OR statement

I've looked through several sources to get information on indices regarding AND statements, table joins, etc., but I've yet to find much useful information for cases where OR statements are present. That being said, how would someone ideally handle creating indexes for a situation like the one here?
Updated SQL Statement to not use like.
SELECT *
FROM table_name
WHERE table_name.field1 = 'criteria'
and (table_name.field2 = 1 or
table_name.field3 = 0)
Obviously, I would want to create an index for field1. But whether I include fields 2 and 3 or handle this in another way, I'm not sure. If this were a simple three part AND statement I would probably use CREATE INDEX IND_index_name on table_name (field1, field2, field3) but I have reason to believe this logic doesn't work the same way for OR statements. Based on the statement given, I can assume that field1 always needs to be evaluated first, but then I have multiple ways I could handle this. The potential solutions I am evaluating are listed below, but I'm not sure if there may yet be another better solution. Any help would be greatly appreciated!
CREATE INDEX IND_index_name on table_name (field1) INCLUDE (field2, field3)
CREATE INDEX IND_index_name on table_name (field1, field2, field3)
CREATE INDEX IND_index_name1 on table_name (field1, field2)
CREATE INDEX IND_index_name2 on table_name (field1, field3)
CREATE INDEX IND_index_name1 on table_name (field1)
CREATE INDEX IND_index_name2 on table_name (field2)
CREATE INDEX IND_index_name3 on table_name (field3)
As additional info, I do not have access to the SQL Server Management Studio tools because I am using DBeaver. For the sake of this example, let's assume the query takes almost a full workday to run without indexes. The answers submitted here will be used to tackle a much larger, more complex query where data from table_name would be used in several subsequent queries that run after the query shown above.
There is not just one way to index OR.
I would try 3 separate indexes (see the sketch below).
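Presumably that means the last option from the question, one index per filtered column, so the optimizer can seek each index and combine the results. If the OR still forces a scan, a commonly suggested fallback is to rewrite the query as a UNION so each branch lines up with a composite index such as (field1, field2) and (field1, field3). A sketch (index names are made up):
CREATE INDEX IND_table_name_field1 on table_name (field1)
CREATE INDEX IND_table_name_field2 on table_name (field2)
CREATE INDEX IND_table_name_field3 on table_name (field3)
SELECT *
FROM table_name
WHERE table_name.field1 = 'criteria'
and table_name.field2 = 1
UNION
SELECT *
FROM table_name
WHERE table_name.field1 = 'criteria'
and table_name.field3 = 0
UNION (rather than UNION ALL) collapses rows that satisfy both branches, so the result matches the original OR query as long as the table has no fully duplicate rows.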

T-SQL batch insert performance

I would like to improve the speed of some inserts on an application I am working on.
I had originally created a batch like this:
insert tableA (field1, field2) values (1,'test1'), (2, 'test2')
This works great on SQL Server 2008 and above, but I need my inserts to work on SQL Server 2005. My question is would I get any performance benefit using a batch like this:
insert tableA (field1, field2)
select 1, 'test1'
union all
select 2, 'test2'
Over this batch:
insert tableA (field1, field2) values (1, 'test1')
insert tableA (field1, field2) values (2, 'test2')
Although it may seem that less code to process should give you a performance gain, using the first option (union all) is a bad idea.
All of the select statements have to be parsed and executed before the data is inserted into the table. It'll consume a lot of memory and may take forever to finish, even for a fairly small amount of data. On the other hand, separate insert statements are executed "on the fly".
Run a simple test in SSMS that will prove this: create a simple table with 4 fields and try inserting 20k rows. The second solution will execute in seconds, while the first... you'll see :).
Another problem is that when the data is not correct in some row (a type mismatch, for example), you'll receive an error like Conversion failed when converting the varchar value 'x' to data type int., but you'll have no indication of which row failed; you'd have to find it yourself. With separate inserts, you'll know exactly which insert failed.
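One further tweak that is often suggested for the separate-insert approach on SQL Server 2005: wrap the statements in a single explicit transaction, so the log is flushed once at commit instead of once per autocommitted insert. A minimal sketch:
begin tran
insert tableA (field1, field2) values (1, 'test1')
insert tableA (field1, field2) values (2, 'test2')
-- ... remaining single-row inserts ...
commit tran
The statements are still parsed and executed one at a time, so the row-level error reporting described above is preserved.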