PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT) - postgresql

I have a large database, that I want to do some logic to update new fields.
The primary key is id for the table harvard_assignees
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)

For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1/cnt FROM (SELECT count(*) AS cnt FROM example_table AS x WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp WHERE cnt > 0)
That one will be kind of slow though.
I'm thinking on a solution based on window functions, you may want to explore those too.

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table, uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t in data_list, where mydata_t.user_id is 100000
But I don't know which array element's user_id is equal to 100000
So I have to make a search first to find the element where its user_id is equal to 100000 ... that's my problem ... I don't know how to make the query .... in fact, I want to update the value of the array element, where it's user_id is equal to 100000 (Also where the id of tbl is for example 1) ... What will be my query?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now i want to update tbl where id is 2 and set the value of one of the tbl.data elements to 'XXXX' where the user_id of element is equal to 11
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the value value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the value value then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index on this. Furthermore, updates actually insert a new row in the underlying file on disk and if your array has more than a few entries this will involve substantial work. This gets even more problematic when your arrays are larger than the pagesize of your PostgreSQL server, typically 8kB. All behind the scene so it will work, but at a performance penalty. Even though array_replace sounds like changes are made in-place (and they indeed are in memory), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements that means that at least 40kB of data will have to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As #klin pointed out, this design may be more trouble than it is worth. Should you make data_list as table (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.

Fast new row insertion if a value of a column depends on previous value in existing row

I have a table cusers with a primary key:
primary key(uid, lid, cnt)
And I try to insert some values into the table:
insert into cusers (uid, lid, cnt, dyn, ts)
values
(A, B, C, (
select C - cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 1
), now())
on conflict do nothing
Quite often (with the possibility of 98%) a row cannot be inserted to cusers because it violates the primary key constraint, so hard select queries do not need to be executed at all. But as I can see PostgreSQL first counts the select query as a result of dyn column and only then rejects row because of uid, lid, cnt violation.
What is the best way to insert rows quickly in such situation?
Another explanation
I have a system where one row depends on another. Here is an example:
(x, x, 2, 2, <timestamp>)
(x, x, 5, 3, <timestamp>)
Two columns contain an absolute value (2 and 5) and relative value (2, 5 - 2). Each time I insert new row it should:
avoid same rows (see primary key constraint)
if new row differs, it should count a difference and put it into the dyn column (so I take the last inserted row for the user according to the timestamp and subtract values).
Another solution I've found is to use returning uid, lid, ts for inserts and get user ids which were really inserted - this is how I know they have differences from existing rows. Then I update inserted values:
update cusers
set dyn = (
select max(cnt) - min(cnt)
from (
select cnt
from cusers
where uid = A and lid = B
order by ts desc
limit 2) Table
)
where uid = A and lid = B and ts = TS
But it is not a fast approach either, as it seeks all over the ts column to find the two last inserted rows for each user. I need a fast insert query as I insert millions of rows at a time (but I do not write duplicates).
What the solution can be? May be I need a new index for this? Thanks in advance.

How to use dynamic regex to match value in Postgres

SUMMARY: I've two tables I want to derive info out of: family_values (family_name, item_regex) and product_ids (product_id) to be able to update the property family_name in a third.
Here the plan is to grab a json array from the small family_values table and use the column value item_regex to do a test match against the product_id for every row in product_ids.
MORE DETAILS: Importing static data from CSV to table of orders. But, in evaluating cost of goods and market value I'm needing to continuously determine family from a prefix regex (item_regex from family_values) match on the product_id.
On the client this looks like this:
const families = {
FOOBAR: 'Big Ogre',
FOOBA: 'Wood Elf',
FOO: 'Valkyrie'
};
// And to find family, and subsequently COGs and Market Value:
const findFamily = product_id => Object.keys(families).find(f => new RegExp('^' + f).test(product_id));
This is a huge hit for the client so I made a family_values table in PG to include a representative: family_name, item_regex, cogs, market_value.
Then, the product_ids has a list of only the products the app cares about (out of millions). This is actually used with an insert trigger 'on before' to ignore any CSV entries that aren't in the product_ids view. So, I guess after that the product_ids view could be taken out of the equation because the orders, after inserting readonly data, has its own matching product_id. It does NOT have family_name, so I still have the issue of determining that client-side.
PSUEDO CODE: update family column of orders with family_name from family_values regex match against orders.product_id
OR update the product_ids table with a new family column and use that with the existing on insert trigger (used to left pad zeros and normalize data right now). Now I'm thinking this may be just an update as suggested, but not real good with regex in PG. I'm a PG novice.
PROBLEM: But, I'm having a hangup in doing what I thought would be like a JS Array Find operation. The family_values have been sorted on the item_regex so that the most strict match would be on top, and therefor found first.
For example, with sorting we have:
family_values_array = [
{"family_name": "Big Ogre", "item_regex": "FOOBAR"},
{"family_name": "Wood Elf", "item_regex": "FOOBA"},
{"family_name": "Valkyrie", "item_regex": "FOO"}]
So, that the comparison of product_id of ^FOOBA would yield family "Wood Elf".
SOLUTION:
The solution I finally came about using was simply using concat to write out the front-anchored regex. It was so simple in the end. The key line I was missing is:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') limit 1;
Whole function:
create or replace function iol.populate_families () returns void as $$
declare
product_row record;
family_value_row record;
begin
for product_row in
select product_id, lvl3_id from iol.products
loop
-- family_name is what we want after finding the BEST match fr a product_id against item_regex
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id like concat(item_regex, '%') limit 1;
-- update family_name and value columns
update iol.products set
family_name = family_value_row.family_name,
cog_cents = family_value_row.cog_cents,
market_value_cents = family_value_row.market_value_cents
where product_id = product_row.product_id;
end loop;
end;
$$
LANGUAGE plpgsql;
Use concat as updated above:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') limit 1;

In DB2, perform an update based on insert for large number of rows

In DB2, I need to do an insert, then, using results/data from that insert, update a related table. I need to do it on a million plus records and would prefer not to lock the entire database. So, 1) how do I 'couple' the insert and update statements? 2) how can I ensure the integrity of the transaction (without locking the whole she-bang)?
some pseudo-code should help clarify
STEP 1
insert into table1 (neededId, id) select DYNAMICVALUE, id from tableX where needed value is null
STEP 2
update table2 set neededId = (GET THE DYNAMIC VALUE JUST INSERTED) where id = (THE ID JUST INSERTED)
note: in table1, the ID col is not unique, so i can't just filter on that to find the new DYNAMICVALUE
This should be more clear (FTR, this works, but I don't like it, because I'd have to lock the tables to maintain integrity. Would be great it I could run these statements together, and allow the update to refer to the newAddressNumber value.)
/****RUNNING TOP INSERT FIRST****/*
--insert a new address for each order that does not have a address id
insert into addresses
(customerId, addressNumber, address)
select
cust.Id,
--get next available addressNumber
ifNull((select max(addy2.addressNumber) from addresses addy2 where addy2.customerId = cust.id),0) + 1 as newAddressNumber,
cust.address
from customers cust
where exists (
--find all customers with at least 1 order where addressNumber is null
select 1 from orders ord
where 1=1
and ord.customerId = cust.id
and ord.addressNumber is null
)
/*****RUNNING THIS UPDATE SECOND*****/
update orders ord1
set addressNumber = (
select max(addressNumber) from addresses addy3
where addy3.customerId = ord1.customerId
)
where 1=1
and ord1.addressNumber is null
The IDENTITY_VAL_LOCAL function is a non-deterministic function that returns the most recently assigned value for an identity column, where the assignment occurred as a result of a single INSERT statement using a VALUES clause

excluding rows from resultset in postgres

This is my result set
i am returning this result set on the base of refid using WHERE refid IN.
Over here, i need to apply a logic without any kind of programming (means SQL query only).
if in result set, i am getting period for particular refid then other rows with the same refid must not returned.
for example, 2667105 having period then myid = 612084598 must not get returned in result set.
according to me this can be achieved using CASE but i have no idea how to use it, i mean that i don't know should i use the CASE statement in SELECT statement or WHERE clause...
EDIT:
This is how it suppose to work,
myid = 612084598 is the default row for refid = 2667105 but if specifically wants the refid for period = 6 then it must return all rows except myid = 612084598
but if i am looking for period = 12, for this period no specific refid present in database.. so for this it must return all rows except first one.. means all rows with the refid which is default one..
Not very clear definition of the problem, but try this:
with cte as (
select
*,
first_value(period) over(partition by refid order by myid) as fv
from test
)
select
myid, refid, period
from cte
where period is not null or fv is null
sql fiddle demo