Is it possible to bulk update specific values in postgresql efficiently? - postgresql

I have built a pipeline that needs to update a large number of rows in Postgres, where each row gets different values.
After some research I found that this can be done with the Postgres UPDATE ... FROM syntax (https://www.postgresql.org/docs/current/sql-update.html), and I came up with the following query, which works fine:
update grades
set course_id = data_table.course_id,
student_id = data_table.student_id,
grade = data_table.grade
from
(select unnest(array[1,2]) as id, unnest(array['Math', 'Math']) as course_id, unnest(array[1000, 1001]) as student_id, unnest(array[95, 100]) as grade) as data_table
where grades.id = data_table.id;
There's also another way to do it with WITH syntax like this:
update grades
set course_id = data_table.course_id,
student_id = data_table.student_id,
grade = data_table.grade
from
(WITH vals (id, course_id, student_id, grade) as (VALUES (1, 'Math', 1000, 95), (2, 'Math', 1001, 100)) SELECT * from vals) as data_table
where grades.id = data_table.id;
My problem is that for some rows I want to update a field and for others I don't. When I don't want to update it, I just want to keep the value that is currently in the table. In that case, I would like to do something like:
update grades g
set course_id = data_table.course_id,
student_id = data_table.student_id,
grade = data_table.grade
from
(select unnest(array[1,2]) as id, unnest(array[g.course_id, 'Math2']) as course_id, unnest(array[1000, 1001]) as student_id, unnest(array[95, g.grade]) as grade) as data_table
where grades.id = data_table.id;
However, this is not possible and I get the error: HINT: There is an entry for table "g", but it cannot be referenced from this part of the query.
The PostgreSQL documentation also mentions this in its description of the FROM clause:
Note that the target table must not appear in the from_list,
unless you intend a self-join (in which case it must appear with an alias in the from_list).
Does anyone know if there's a way to perform such a bulk update?
I've tried using JOINs in the inner query, but with no luck.

Choose a value that cannot be a valid value, e.g. '-1' for the course name and -1 for a grade, and use that in your generated values. Then use a CASE expression in the update to decide whether to keep the current value or not:
update grades g
set course_id = case when data_table.course_id = '-1' then g.course_id else data_table.course_id end,
student_id = data_table.student_id,
grade = case when data_table.grade = -1 then g.grade else data_table.grade end
from (
select
unnest(array[1,2]) as id,
unnest(array['-1', 'Math2']) as course_id, -- use '-1' instead of g.course_id
unnest(array[1000, 1001]) as student_id,
unnest(array[95, -1]) as grade -- use -1 instead of g.grade
) as data_table
where g.id = data_table.id;
Pick whatever values you like for the impossible value.
If the columns do not allow nulls, this becomes more straightforward and needs less code: use null as the impossible value and coalesce() to pick the update value.
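A minimal sketch of the coalesce() variant, assuming the updated columns are declared NOT NULL so that null can safely mean "keep the current value". The alias d and the explicit casts are choices made for this illustration and are not part of the original answer. The casts assume course_id is text and grade is an integer:

```sql
-- NULL in the generated data means "leave this column unchanged".
UPDATE grades AS g
SET course_id  = COALESCE(d.course_id, g.course_id),
    student_id = d.student_id,
    grade      = COALESCE(d.grade, g.grade)
FROM (
    VALUES
        (1, NULL::text, 1000, 95),        -- keep course_id, set grade to 95
        (2, 'Math2',    1001, NULL::int)  -- set course_id, keep grade
) AS d (id, course_id, student_id, grade)
WHERE g.id = d.id;
```

The casts on the NULLs matter mainly when a whole column of the VALUES list is null, because PostgreSQL then has no other value to infer the column type from.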

Related

Fast new row insertion if a value of a column depends on previous value in existing row

I have a table cusers with a primary key:
primary key(uid, lid, cnt)
And I try to insert some values into the table:
insert into cusers (uid, lid, cnt, dyn, ts)
values
(A, B, C, (
select C - cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 1
), now())
on conflict do nothing
Quite often (about 98% of the time) a row cannot be inserted into cusers because it violates the primary key constraint, so the expensive select query does not need to be executed at all. But as far as I can see, PostgreSQL first evaluates the select query for the dyn column and only then rejects the row because of the uid, lid, cnt violation.
What is the best way to insert rows quickly in such a situation?
Another explanation
I have a system where one row depends on another. Here is an example:
(x, x, 2, 2, <timestamp>)
(x, x, 5, 3, <timestamp>)
Two columns contain an absolute value (2 and 5) and a relative value (2 and 5 - 2 = 3). Each time I insert a new row it should:
avoid duplicate rows (see the primary key constraint)
if the new row differs, compute the difference and put it into the dyn column (so I take the last inserted row for the user according to the timestamp and subtract the values).
Another solution I've found is to use returning uid, lid, ts for inserts and get user ids which were really inserted - this is how I know they have differences from existing rows. Then I update inserted values:
update cusers
set dyn = (
select max(cnt) - min(cnt)
from (
select cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 2) Table
)
where uid = A and lid = B and ts = TS
But this is not a fast approach either, as it searches the whole ts column to find the last two inserted rows for each user. I need a fast insert query because I insert millions of rows at a time (but I do not write duplicates).
What could the solution be? Maybe I need a new index for this? Thanks in advance.

Using ANY with raw data works, but not with a subquery

I just can't figure out why this query works:
SELECT id, name, organization_id
FROM facilities
WHERE organization_id = ANY(
'{abc-xyz-123,678-ght-nmp}'
)
But this query fails with the error operator does not exist: uuid = uuid[]:
SELECT id, name, organization_id
FROM facilities
WHERE organization_id = ANY(
SELECT organization_ids
FROM admins
WHERE id = 'jkl-iop-345'
)
This happens even though the subquery
SELECT organization_ids
FROM admins
WHERE id = 'jkl-iop-345'
returns exactly {abc-xyz-123,678-ght-nmp}.
I'm using postgres (PostgreSQL) 13.3
The subquery produces one row that contains an array.
If you use = ANY (SELECT ...), the result set is converted to an array, so you end up with
{{abc-xyz-123,678-ght-nmp}}
which is an array of arrays.
You probably want
SELECT id, name, organization_id
FROM facilities
WHERE EXISTS (SELECT 1 FROM admins
WHERE admins.id = 'jkl-iop-345'
AND facilities.organization_id = ANY (admins.organization_ids)
);
Let me remark that storing references to other tables in an array, JSON or other composite data type is an exceptionally bad idea. A normalized schema with a junction table would serve you better.
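A rough sketch of what such a junction table could look like. The table name admin_organizations and its columns are hypothetical, and the sketch assumes admins.id is a uuid primary key. 'jkl-iop-345' is the placeholder from the question, not a valid uuid:

```sql
-- One row per (admin, organization) pair replaces the organization_ids array.
CREATE TABLE admin_organizations (
    admin_id        uuid NOT NULL REFERENCES admins (id),
    organization_id uuid NOT NULL,  -- would reference the organizations table, if one exists
    PRIMARY KEY (admin_id, organization_id)
);

-- The lookup then becomes a plain join, with no array comparison involved.
SELECT f.id, f.name, f.organization_id
FROM facilities AS f
JOIN admin_organizations AS ao
  ON ao.organization_id = f.organization_id
WHERE ao.admin_id = 'jkl-iop-345';
```

Because the primary key makes each (admin_id, organization_id) pair unique, the join cannot duplicate facility rows for a single admin.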

In DB2, perform an update based on insert for large number of rows

In DB2, I need to do an insert, then, using results/data from that insert, update a related table. I need to do it on a million plus records and would prefer not to lock the entire database. So, 1) how do I 'couple' the insert and update statements? 2) how can I ensure the integrity of the transaction (without locking the whole she-bang)?
some pseudo-code should help clarify
STEP 1
insert into table1 (neededId, id) select DYNAMICVALUE, id from tableX where needed value is null
STEP 2
update table2 set neededId = (GET THE DYNAMIC VALUE JUST INSERTED) where id = (THE ID JUST INSERTED)
Note: in table1, the id column is not unique, so I can't just filter on it to find the new DYNAMICVALUE.
This should be clearer. (FTR, this works, but I don't like it, because I'd have to lock the tables to maintain integrity. It would be great if I could run these statements together and let the update refer to the newAddressNumber value.)
/****RUNNING TOP INSERT FIRST****/
--insert a new address for each order that does not have a address id
insert into addresses
(customerId, addressNumber, address)
select
cust.Id,
--get next available addressNumber
ifNull((select max(addy2.addressNumber) from addresses addy2 where addy2.customerId = cust.id),0) + 1 as newAddressNumber,
cust.address
from customers cust
where exists (
--find all customers with at least 1 order where addressNumber is null
select 1 from orders ord
where 1=1
and ord.customerId = cust.id
and ord.addressNumber is null
)
/*****RUNNING THIS UPDATE SECOND*****/
update orders ord1
set addressNumber = (
select max(addressNumber) from addresses addy3
where addy3.customerId = ord1.customerId
)
where 1=1
and ord1.addressNumber is null
The IDENTITY_VAL_LOCAL function is a non-deterministic function that returns the most recently assigned value for an identity column, where the assignment occurred as a result of a single INSERT statement using a VALUES clause

Why does using lower() change the order of the result set?

I have a table where I store information about users. The table has the following structure:
CREATE TABLE PERSONS
(
ID NUMBER(20, 0) NOT NULL,
FIRSTNAME VARCHAR2(40),
LASTNAME VARCHAR2(40),
BIRTHDAY DATE,
CONSTRAINT PERSONEN_PK PRIMARY KEY
(ID)
ENABLE
);
After inserting some test data:
SET DEFINE OFF;
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('1','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('2','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('3','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('6','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
I want to select all duplicates of a given user. Let's use "Max Mustermann" for example:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
This gives me a result like this:
id first last birthday
=================================
1 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
I want to do a case insensitive compare, so I change the query using lower (and trim) like this:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname;
Now, surprise, the order has changed!
id first last birthday
=================================
1 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
Why does the order change just by using lower() (the result is the same without trim())? I can get a stable ordering by adding the id column to the ORDER BY. But shouldn't lower() have no effect on the ordering?
Workaround by also using id column for ORDER BY:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname,p.id;
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname,p.id;
If the values to be ordered by are identical, then the DBMS is free to choose any order it likes (the same way it is free to choose any order if no order by is specified at all).
Because all values of the columns in the order by are identical, the resulting order is not stable. The only way to get a stable order is to include a unique column as an additional sort criterion for ties, which is exactly what you did when you added the id column.
Why does the order change, just by using lower()
From a technical point of view, I'd guess that applying lower() changed the execution plan and therefore the access path to the data.
But again (just to make sure): ordering on identical values never guarantees a stable order!
There is no ordering without an order by clause. Sometimes it looks like there might be (group by fooled a lot of people in older releases), but it's only coincidental and must not be relied upon. In your case you're ordering by some columns, but you expect duplicates within that ordering to be further ordered implicitly, which won't happen, or at least cannot be relied on.
In this case Oracle probably happens to be retrieving the rows for your first query in the order you inserted them purely as a side effect of how it's reading data from the blocks, and the order by sorts them within that set without actually changing them (or quite likely it's skipping the order by step internally if it realises it's pointless; the explain plan would tell you that).
If you change the order in which the records are created:
...
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
...
then the result 'order' changes too:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
ID FIRSTNAME LASTNAME BIRTHDAY
---------- -------------------- -------------------- ---------
1 Max Mustermann 31-OCT-89
2 Max Mustermann 31-OCT-89
5 Max Mustermann 31-OCT-89
4 Max Mustermann 31-OCT-89
Once you have the function things are changing enough for that happy accident to go out of the window, even if the records are inserted in id order (which has no relevance to the DB internally). lower() isn't changing the ordering, you just aren't getting lucky any more.
You cannot expect or rely on an order unless you fully specify it in the order by clause.
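To check whether the two queries really are executed differently, as both answers suspect, the execution plans can be compared. A minimal sketch using Oracle's EXPLAIN PLAN statement and DBMS_XPLAN.DISPLAY, shown here for the lower() variant of the query. The plan you get depends on your data, statistics and Oracle version:

```sql
-- Store the plan for the case-insensitive query in the plan table
EXPLAIN PLAN FOR
SELECT p.id, p.firstname, p.lastname, p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname, p.firstname;

-- Display the plan that was just stored
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Running the same two statements for the original query and comparing the outputs shows whether the sort step or the access path differs. Either way, only a fully specified ORDER BY guarantees the row order.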

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database in which I want to apply some logic to populate new fields.
The primary key is id for the table harvard_assignees
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had a condition "id = ...", but I removed it because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1.0/cnt -- 1.0 avoids integer division, which would give 0 for cnt > 1
FROM (SELECT count(*) AS cnt FROM example_table AS x WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp WHERE cnt > 0);
That one will be kind of slow though.
I'm thinking of a solution based on window functions; you may want to explore those too.
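One way the window-function idea might look, as a sketch rather than a tested solution. It assumes example_table has a unique id column (as harvard_assignees does in the question) and that percent is a numeric column. The aliases t and s are made up for this illustration:

```sql
-- Count the rows of each (fn_key_1, fn_key_2) group once with a window
-- function, then join the result back to the table by primary key.
UPDATE example_table AS t
SET percent = s.pct
FROM (
    SELECT id,
           1.0 / count(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS pct
    FROM example_table
) AS s
WHERE t.id = s.id;
```

This reads the table once for the counts instead of running a correlated subquery per row, which is why it may be faster on a large table.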