Most efficient way to do a bulk UPDATE with pairs of input - postgresql

Suppose I want to do a bulk update, setting a=b for a collection of a values. This can easily be done with a sequence of UPDATE queries:
UPDATE foo SET value='foo' WHERE id=1
UPDATE foo SET value='bar' WHERE id=2
UPDATE foo SET value='baz' WHERE id=3
But now I suppose I want to do this in bulk. I have a two dimensional array containing the ids and new values:
[ [ 1, 'foo' ]
[ 2, 'bar' ]
[ 3, 'baz' ] ]
Is there an efficient way to do these three UPDATEs in a single SQL query?
Some solutions I have considered:
A temporary table
CREATE TABLE temp ...;
INSERT INTO temp (id,value) VALUES (....);
UPDATE foo SET ... FROM temp ...
But this really just moves the problem. Although it may be easier (or at least less ugly) to do a bulk INSERT, there is still a minimum of three queries.
Denormalize the input by passing the data pairs as SQL arrays. This makes the query incredibly ugly, though:
UPDATE foo
SET value=x.value
FROM (
SELECT
split_part(x,',',1)::INT AS id,
split_part(x,',',2)::VARCHAR AS value
FROM (
SELECT UNNEST(ARRAY['1,foo','2,bar','3,baz']) AS x
) AS u
) AS x
WHERE foo.id=x.id
This makes it possible to use a single query, but makes that query ugly, and inefficient (especially for mixed and/or complex data types).
Is there a better solution? Or should I resort to multiple UPDATE queries?

Normally you want to batch-update from a table with a sufficient index to make the merge easy:
CREATE TEMP TABLE updates_table
( id integer not null primary key
, val varchar
);
INSERT into updates_table(id, val) VALUES
( 1, 'foo' ) ,( 2, 'bar' ) ,( 3, 'baz' )
;
UPDATE target_table t
SET value = u.val
FROM updates_table u
WHERE t.id = u.id
;
So you should probably populate your updates_table with something like:
INSERT into updates_table(id, val)
SELECT
split_part(x,',',1)::INT AS id,
split_part(x,',',2)::VARCHAR AS value
FROM (
SELECT UNNEST(ARRAY['1,foo','2,bar','3,baz']) AS x
) AS u
;
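If you are on PostgreSQL 9.4 or later, multi-argument unnest is allowed in the FROM clause, so a sketch of a less string-mangling way to populate the same table is to pass the ids and values as two parallel arrays:
INSERT into updates_table(id, val)
SELECT *
FROM unnest( ARRAY[1, 2, 3]             -- the ids
           , ARRAY['foo', 'bar', 'baz'] -- the values, paired by position
           ) AS t(id, val)
;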
Remember: an index (or the primary key) on the id field in the updates_table is important. (But for small sets like this one, a hash join will probably be chosen by the optimiser.)
In addition: for updates, it is important to avoid updating rows to the value they already have; such updates create extra row versions, plus the resulting VACUUM activity after the update is committed:
UPDATE target_table t
SET value = u.val
FROM updates_table u
WHERE t.id = u.id
AND (t.value IS NULL OR t.value <> u.value)
;

You can use a CASE conditional expression:
UPDATE foo
SET "value" = CASE id
WHEN 1 THEN 'foo'
WHEN 2 THEN 'bar'
WHEN 3 THEN 'baz'
END
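Note that without a WHERE clause the CASE yields NULL for every id that is not listed, so all other rows in foo would have value set to NULL. A guarded sketch of the same statement:
UPDATE foo
SET "value" = CASE id
WHEN 1 THEN 'foo'
WHEN 2 THEN 'bar'
WHEN 3 THEN 'baz'
END
WHERE id IN (1, 2, 3)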

Related

How to assert the expected number of result rows of a sub-query in PostgreSQL?

Oftentimes I find myself writing code such as:
const fooId = await pool.oneFirst(sql`
SELECT id
FROM foo
WHERE nid = 'BAR'
`);
await pool.query(sql`
INSERT INTO bar (foo_id)
VALUES (${fooId})
`);
oneFirst is a Slonik query method that ensures that the query returns exactly 1 result. It is needed there because foo_id also accepts NULL as a valid value, i.e. if SELECT id FROM foo WHERE nid = 'BAR' returned no results, this part of the program would otherwise fail silently.
The problem with this approach is that it causes two database roundtrips for what should be a single operation.
In a perfect world, PostgreSQL would support assertions natively, e.g.
INSERT INTO bar (foo_id)
VALUES
(
(
SELECT id
FROM foo
WHERE nid = 'BAR'
EXPECT 1 RESULT
)
)
EXPECT 1 RESULT is a made-up DSL.
The expectation is that EXPECT 1 RESULT would cause PostgreSQL to throw an error if that query returns anything other than 1 result.
Since PostgreSQL does not support this natively, what are the client-side solutions?
You can use
const fooId = await pool.oneFirst(sql`
INSERT INTO bar (foo_id)
SELECT id
FROM foo
WHERE nid = 'BAR'
RETURNING foo_id;
`);
This will insert all rows matched by the condition in foo into bar, and Slonik will throw if that was not exactly one row.
Alternatively, if you insist on using VALUES with a subquery, you can do
INSERT INTO bar (foo_id)
SELECT tmp.id
FROM (VALUES (
(SELECT id
FROM foo
WHERE nid = 'BAR')
)) AS tmp(id)
WHERE tmp.id IS NOT NULL
Nothing would be inserted if no row matched the condition, and your application could check for that. Postgres would throw an exception if multiple rows were matched, since a subquery used as an expression must return at most one row.
That's a cool idea.
You can cause an error on extra results by putting the select in a context where only one result is expected. For example, just writing
SELECT (SELECT id from foo WHERE nid = 'bar')
will get you a decent error message on multiple results: error: more than one row returned by a subquery used as an expression.
To handle the case where nothing is returned, you could use COALESCE.
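A minimal sketch of that idea (the -1 fallback is an assumption; in practice you would pick a value that a NOT NULL or foreign key constraint on bar.foo_id would reject, so the statement errors instead of silently doing nothing sensible):
INSERT INTO bar (foo_id)
VALUES (
COALESCE(
(SELECT id FROM foo WHERE nid = 'BAR'),
-1  -- assumed sentinel for "no matching row"
)
);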

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t in data_list, where mydata_t.user_id is 100000
But I don't know which array element's user_id is equal to 100000
So I have to make a search first to find the element whose user_id is equal to 100000 ... that's my problem ... I don't know how to write the query. In fact, I want to update the value of the array element whose user_id is equal to 100000 (also where the id of tbl is, for example, 1). What will be my query?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX' where the user_id of the element is equal to 11
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current value, then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index on this.
Furthermore, updates actually insert a new row in the underlying file on disk, and if your array has more than a few entries this will involve substantial work. This gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. All behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds like changes are made in place (and they indeed are in memory), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, that means at least 40kB of data will have to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
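A sketch of what that normalized layout could look like (the table and column names are assumptions carried over from the thread; the view is optional):
CREATE TABLE data_list
(
id integer NOT NULL REFERENCES tbl (id),
user_id integer NOT NULL,
value character(4),
PRIMARY KEY (id, user_id)
);
-- Optional: publish the old aggregated shape through a view
CREATE VIEW tbl_data AS
SELECT id, array_agg((user_id, value)::mydata_t) AS data_list
FROM data_list
GROUP BY id;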

Can PostgreSQL ON CONFLICT combine with JSON objects?

I wanted to perform a conditional insert in PostgreSQL. Something like:
INSERT INTO {TABLE_NAME} (user_id, data) values ('{user_id}', '{data}')
WHERE not exists(select 1 from files where user_id='{user_id}' and data->'userType'='Type1')
Unfortunately, INSERT and WHERE do not cooperate in PostgreSQL. What could be a suitable syntax for my query? I was considering ON CONFLICT, but couldn't find the syntax for using it with a JSON object (data in the example).
Is it possible?
Rewrite the VALUES part to a SELECT, then you can use a WHERE condition:
INSERT INTO { TABLE_NAME } ( user_id, data )
SELECT
user_id,
data
FROM
( VALUES ( '{user_id}', '{data}' ) ) sub ( user_id, data )
WHERE
NOT EXISTS (
SELECT 1
FROM files
WHERE user_id = '{user_id}'
AND data ->> 'userType' = 'Type1'
);
But there is NO guarantee that the WHERE condition works! Another transaction that has not been committed yet is invisible to this query. This could lead to data quality issues.
You can use INSERT ... SELECT ... WHERE ....
INSERT INTO elbat
(user_id,
data)
SELECT 'abc',
'xyz'
WHERE NOT EXISTS (SELECT *
FROM files
WHERE user_id = 'abc'
AND data->>'userType' = 'Type1')
And it looks like you're creating the query in a host language. Don't use string concatenation or interpolation for getting the values into it. That's error-prone and makes your application vulnerable to SQL injection attacks. Look up how to use parameterized queries in your host language. Very likely, parameters cannot be used for the table name; you need some other method of either whitelisting the names or properly quoting them.
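As a rough sketch of that idea at the SQL level (a prepared statement with $1/$2 placeholders; your driver's parameter syntax will differ, and the elbat/files names and the jsonb type of data are assumptions carried over from the examples above):
PREPARE insert_if_absent (text, jsonb) AS
INSERT INTO elbat (user_id, data)
SELECT $1, $2
WHERE NOT EXISTS (SELECT *
FROM files
WHERE user_id = $1
AND data->>'userType' = 'Type1');
EXECUTE insert_if_absent('abc', '{"userType": "Type2"}');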

SQL NOT LIKE comparison against dynamic list

Working on a new T-SQL stored procedure, I want to get all rows where the values in a specific column don't start with any of a specific set of 2-character substrings.
The general idea is:
SELECT * FROM table WHERE value NOT LIKE 's1%' AND value NOT LIKE 's2%' AND value NOT LIKE 's3%'.
The catch is that I am trying to make it dynamic so that the specific substrings can be pulled from another table in the database, which can have more values added to it.
While I have never used the IN operator before, I think something along these lines should do what I am looking for. However, I don't think it is possible to use wildcards with IN, so I might not be able to compare just the substrings.
SELECT * FROM table WHERE value NOT IN (SELECT substrings FROM subTable)
To get around that limitation, I am trying to do something like this:
SELECT * FROM table WHERE SUBSTRING(value, 1, 2) NOT IN (SELECT Prefix FROM subTable WHERE Prefix IS NOT NULL)
but I'm not sure this is right, or if it is the most efficient way to do this. My preference is to do this in a Stored Procedure, but if that isn't feasible or efficient I'm also open to building the query dynamically in C#.
Here's an option: load the values you want to filter on into a table, LEFT OUTER JOIN to it, and use PATINDEX().
DECLARE @FilterValues TABLE
(
[FilterValue] NVARCHAR(10)
);
--Table with values we want to filter on.
INSERT INTO @FilterValues (
[FilterValue]
)
VALUES ( N's1' )
, ( N's2' )
, ( N's3' );
DECLARE @TestData TABLE
(
[TestValues] NVARCHAR(100)
);
--Load some test data
INSERT INTO @TestData (
[TestValues]
)
VALUES ( N's1 Test Data' )
, ( N's2 Test Data' )
, ( N's3 Test Data' )
, ( N'test data not filtered out' )
, ( N'test data not filtered out 1' );
SELECT a.*
FROM @TestData [a]
LEFT OUTER JOIN @FilterValues [b]
ON PATINDEX([b].[FilterValue] + '%', [a].[TestValues]) > 0
WHERE [b].[FilterValue] IS NULL;
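An equivalent formulation, if you prefer NOT EXISTS over the left join (a sketch against the same table variables as above):
SELECT a.*
FROM @TestData [a]
WHERE NOT EXISTS ( SELECT 1
FROM @FilterValues [b]
WHERE [a].[TestValues] LIKE [b].[FilterValue] + '%' );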

use two .nextval in an insert statement

I'm using an Oracle database and facing a problem where using id_product.nextval twice raises an error: ORA-00001: unique constraint (SYSTEM.SYS_C004166) violated
It is a primary key. Using INSERT ALL is a requirement. Can I use two .nextval calls in one statement?
insert all
into sale_product values (id_product.nextval, id.currval, 'hello', 123, 1)
into sale_product values (id_product.nextval, id.currval, 'hi', 123, 1)
select * from dual;
insert into sale_product
select id_product.nextval, id.currval, a, b, c
from
(
select 'hello' a, 123 b, 1 c from dual union all
select 'hi' a, 123 b, 1 c from dual
);
This doesn't use the insert all syntax, but it works the same way if you are only inserting into the same table.
The value of id_product.NEXTVAL in the first INSERT is the same as in the second INSERT, hence you'll get the unique constraint violation. If you remove the constraint and perform the insert, you'll notice the duplicate values!
The only way is to perform two bulk INSERTs in sequence, or to have two separate sequences with different ranges; the latter would require an awful lot of coding and checking.
create table temp(id number ,id2 number);
insert all
into temp values (supplier_seq.nextval, supplier_seq.currval)
into temp values (supplier_seq.nextval, supplier_seq.currval)
select * from dual;
ID ID2
---------- ----------
2 2
2 2
Reference
The subquery of the multitable insert statement cannot use a sequence
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9014.htm#i2080134