How to use dynamic regex to match value in Postgres - postgresql

SUMMARY: I have two tables I want to derive information from, family_values (family_name, item_regex) and product_ids (product_id), so that I can update the family_name property in a third table.
The plan is to grab a JSON array from the small family_values table and test each item_regex value against the product_id of every row in product_ids.
MORE DETAILS: I'm importing static data from CSV into an orders table. To evaluate cost of goods and market value, I continuously need to determine the family from a prefix regex match (item_regex from family_values) on the product_id.
On the client this looks like this:
const families = {
FOOBAR: 'Big Ogre',
FOOBA: 'Wood Elf',
FOO: 'Valkyrie'
};
// And to find family, and subsequently COGs and Market Value:
const findFamily = product_id => Object.keys(families).find(f => new RegExp('^' + f).test(product_id));
This is a huge performance hit for the client, so I made a family_values table in PG with the columns family_name, item_regex, cogs, and market_value.
Then the product_ids view lists only the products the app cares about (out of millions). It is actually used with a BEFORE INSERT trigger to ignore any CSV entries that aren't in the product_ids view. So I guess after that the product_ids view could be taken out of the equation, because the orders table, after the read-only data is inserted, has its own matching product_id. It does NOT have family_name, so I still have the issue of determining that client-side.
PSEUDOCODE: update the family column of orders with the family_name from the family_values regex match against orders.product_id
OR update the product_ids table with a new family column and use that with the existing insert trigger (currently used to left-pad zeros and normalize data). Now I'm thinking this may be just an update as suggested, but I'm not very good with regex in PG. I'm a PG novice.
PROBLEM: I'm getting hung up on doing what I thought would be like a JS Array find operation. The family_values rows have been sorted on item_regex so that the strictest match is on top, and is therefore found first.
For example, with sorting we have:
family_values_array = [
{"family_name": "Big Ogre", "item_regex": "FOOBAR"},
{"family_name": "Wood Elf", "item_regex": "FOOBA"},
{"family_name": "Valkyrie", "item_regex": "FOO"}]
So a product_id that starts with FOOBA (but not FOOBAR) would yield the family "Wood Elf".
SOLUTION:
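The solution I finally arrived at was simply to use concat to build the front-anchored pattern. It was so simple in the end. Note that this is a LIKE prefix match rather than a true regex, and that LIMIT 1 without ORDER BY returns an arbitrary matching row, so the longest pattern is sorted first. The key line I was missing is:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') order by length(item_regex) desc limit 1;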
Whole function:
create or replace function iol.populate_families () returns void as $$
declare
  product_row record;
  family_value_row record;
begin
  for product_row in
    select product_id, lvl3_id from iol.products
  loop
    -- family_name is what we want after finding the BEST match for a product_id against item_regex;
    -- the longest pattern is sorted first because LIMIT 1 alone returns an arbitrary matching row
    select * into family_value_row from iol.family_values
    where lvl3_id = product_row.lvl3_id
      and product_row.product_id like concat(item_regex, '%')
    order by length(item_regex) desc
    limit 1;
    -- update family_name and value columns
    update iol.products set
      family_name = family_value_row.family_name,
      cog_cents = family_value_row.cog_cents,
      market_value_cents = family_value_row.market_value_cents
    where product_id = product_row.product_id;
  end loop;
end;
$$
LANGUAGE plpgsql;

Use concat as updated above (ordering by pattern length so that the longest prefix wins):
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') order by length(item_regex) desc limit 1;
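
For anyone who wants to try the longest-prefix idea without a Postgres instance, here is a minimal sketch. It is not the function above: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, it leaves out the lvl3_id filter, and the sample product ids are made up. The correlated subquery picks, for each product, the matching pattern with the greatest length, so row order in family_values does not matter.

```python
import sqlite3

# SQLite (in memory) is only a stand-in for PostgreSQL here; the table and
# column names follow the question, the sample product ids are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE family_values (family_name TEXT, item_regex TEXT);
    -- deliberately inserted out of order: the query must not rely on row order
    INSERT INTO family_values VALUES
        ('Valkyrie', 'FOO'), ('Big Ogre', 'FOOBAR'), ('Wood Elf', 'FOOBA');

    CREATE TABLE products (product_id TEXT, family_name TEXT);
    INSERT INTO products (product_id) VALUES
        ('FOOBAR-001'), ('FOOBA-002'), ('FOO-003'), ('ZZZ-004');
""")

# One set-based UPDATE instead of a row-by-row loop: for each product, take
# the family whose pattern is a prefix of product_id, longest pattern first.
conn.execute("""
    UPDATE products SET family_name = (
        SELECT fv.family_name
        FROM family_values AS fv
        WHERE products.product_id LIKE fv.item_regex || '%'
        ORDER BY length(fv.item_regex) DESC
        LIMIT 1
    )
""")

families = dict(conn.execute("SELECT product_id, family_name FROM products"))
print(families)
```

Products with no matching pattern end up with a NULL family_name. Also keep in mind that LIKE treats % and _ inside item_regex as wildcards, so this only behaves as a plain prefix match while the patterns contain neither character.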

Related

How to convert an jsonb array and use stats moment

how are you?
I needed to store an array of numbers as JSONB in PostgreSQL.
Now I'm trying to calculate statistical moments from this JSON, and I'm facing some issues.
Sample of my data:
I was already able to convert the JSON into a float array, using this function:
CREATE OR REPLACE FUNCTION jsonb_array_castdouble(jsonb) RETURNS float[] AS $f$
SELECT array_agg(x)::float[] || ARRAY[]::float[] FROM jsonb_array_elements_text($1) t(x);
$f$ LANGUAGE sql IMMUTABLE;
Using this SQL:
with data as (
select
s.id as id,
jsonb_array_castdouble(s.snx_normalized) as serie
FROM
spectra s
)
select * from data;
I found a function that can do these calculations, and I need to pass it my array: https://github.com/ellisonch/PostgreSQL-Stats-Aggregate/
But this function expects its input in another form: unnested, one value per row, rather than as an array.
I already tried to use unnest, but it only gets one value, not the entire array :(.
My goal is:
Be able to apply stats moment (kurtosis, skewness) for each row.
like:
index | skewness
1     | 21.2131
2     | 1.123
Bonus: Is there a way to avoid the 'with data' part and do the transformation directly in the select statement?
snx_wavelengths is JSON, right? You also provided the data as a picture rather than text :( I assume it looks like (id, snx_wavelengths). I believe you meant id when you said index (using a keyword as a column name is not a good idea, because it would require double-quoting the identifier):
1,[1,2,3,4]
2,[373,232,435,84]
If that is right:
select id, (stats_agg(v::float)).skewness
from myMeasures,
lateral json_array_elements_text(snx_wavelengths) v
group by id;
DBFiddle demo
BTW, you don't need "with data" in the original sample if you don't want to use it; you could replace it with a subquery, i.e.:
select (stats_agg(n)).* from (select unnest(array[16,22,33,24,15])) data(n)
union all
select (stats_agg(n)).* from (select unnest(array[416,622,833,224,215])) data(n);
EDIT: And if you needed other stats too:
select id, "count","min","max","mean","variance","skewness","kurtosis"
from myMeasures,
lateral (select (stats_agg(v::float)).* from json_array_elements_text(snx_wavelengths) v) foo
group by id,"count","min","max","mean","variance","skewness","kurtosis";
DBFiddle demo
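
To make "one set of moments per row" concrete, here is a small Python sketch of what the per-id grouping computes. It is not the stats_agg implementation: it uses plain population central moments, and the aggregate may apply different (for example sample-corrected) definitions, so the numbers are not guaranteed to match. The two sample rows are the ones assumed in the answer above.

```python
import json

def moments(values):
    """Return (skewness, excess kurtosis) from population central moments.

    Assumption: population formulas; stats_agg may define these differently.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # All values identical: both ratios are undefined.
        return float("nan"), float("nan")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# (id, JSON array as text), mirroring the (id, snx_wavelengths) rows
rows = [(1, "[1,2,3,4]"), (2, "[373,232,435,84]")]

# Expand each JSON array and reduce it again, one result per id
per_row = {
    row_id: moments([float(x) for x in json.loads(raw)])
    for row_id, raw in rows
}

for row_id, (skewness, kurtosis) in per_row.items():
    print(row_id, skewness, kurtosis)
```

The point is the same as in the lateral query: each array is expanded and aggregated separately, keyed by id, rather than all values being pooled into one series.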

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
I also have a table that uses an array of this composite type, mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list whose user_id is 100000.
But I don't know which array element has user_id equal to 100000.
So I have to search first to find the element whose user_id is 100000, and that's my problem: I don't know how to write the query. In fact, I want to update the value of the array element whose user_id is 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong!):
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of the one tbl.data element whose user_id is 11 to 'XXXX'.
The final result for Row2 would be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current content of the value field, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current content of the value field, the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
  -- UPDATE doesn't allow aggregate functions so aggregate here
  SELECT array_agg(new_data) AS data_arr
  FROM (
    -- For the id value, get the data_list values that are NOT modified
    SELECT (user_id, value)::mydata_t AS new_data
    FROM tbl, unnest(data_list)
    WHERE id = 2 AND user_id != 11
    UNION
    -- Add the values to update
    VALUES ((11, 'XXXX')::mydata_t)
  ) x
) y
WHERE id = 2
You should keep in mind, though, that a lot of work is going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index for this. Furthermore, an update actually inserts a new row in the underlying file on disk, and if your array has more than a few entries this involves substantial work. This gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. It all happens behind the scenes, so it will work, but at a performance penalty. Even though array_replace sounds like the changes are made in place (and in memory they indeed are), the UPDATE command writes a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
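
A minimal sketch of that normalized layout, using Python's built-in sqlite3 purely as a stand-in for PostgreSQL. The composite primary key on (id, user_id) is my assumption about how the rows would be identified; it is not something stated in the question.

```python
import sqlite3

# SQLite (in memory) is only a stand-in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per former array element. The (id, user_id) key is an
    -- assumption; it also gives the UPDATE below an index to use.
    CREATE TABLE data_list (
        id      INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        value   TEXT,
        PRIMARY KEY (id, user_id)
    );
    INSERT INTO data_list VALUES
        (1, 5, 'YYYY'), (1, 6, 'YYYY'),
        (2, 10, 'YYYY'), (2, 11, 'YYYY');
""")

# The update touches exactly one small row instead of rewriting a whole array.
cur = conn.execute(
    "UPDATE data_list SET value = 'XXXX' WHERE id = ? AND user_id = ?",
    (2, 11),
)
changed = cur.rowcount

rows = conn.execute(
    "SELECT id, user_id, value FROM data_list ORDER BY id, user_id"
).fetchall()
print(changed, rows)
```

With this layout there is no need to know the old value or the position of the element, which was the sticking point with the array version.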

With PostgREST, convert a column to and from an external encoding in the API

We are using PostgREST to automatically generate a REST API for a Postgres database. Our primary keys have an external representation that's different from how we store them internally. For simplicity's sake, let's pretend the ids are stored as integers but we represent them as hexadecimal strings outwardly.
It's simple enough to get PostgREST to convert to the external representation for read operations:
CREATE DOMAIN hexid AS bigint;
CREATE TABLE fruits (
fruit_id hexid PRIMARY KEY,
name text
);
CREATE OR REPLACE VIEW api_fruits AS
SELECT to_hex(fruit_id) as fruit_id, name FROM fruits;
INSERT INTO fruits(fruit_id, name) VALUES('51955', 'avocado');
PostgREST generates the expected representation when we GET api_fruits:
[
{
"fruit_id": "caf3",
"name": "avocado"
}
]
But that's about as far as we get with this solution. It's a one way transformation so we won't be able to POST/PATCH records this way. The way PostgREST works is to transform such requests into equivalent INSERT and UPDATE statements. But this view with its custom formatting is not updatable. This is what would happen if we tried:
ERROR: cannot insert into column "fruit_id" of view "api_fruits"
DETAIL: View columns that are not columns of their base relation are not updatable.
STATEMENT: WITH pgrst_source AS (WITH pgrst_payload AS (SELECT $1::json AS json_data), pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload) INSERT INTO "api_x"."api_fruits"("fruit_id", "name") SELECT "fruit_id", "name" FROM json_populate_recordset (null::"api_x"."api_fruits", (SELECT val FROM pgrst_body)) _ RETURNING "api_x"."api_fruits".*) SELECT '' AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, CASE WHEN pg_catalog.count(_postgrest_t) = 1 THEN coalesce((
WITH data AS (SELECT row_to_json(_) AS row FROM pgrst_source AS _ LIMIT 1)
SELECT array_agg(json_data.key || '=eq.' || json_data.value)
FROM data CROSS JOIN json_each_text(data.row) AS json_data
WHERE json_data.key IN ('')
), array[]::text[]) ELSE array[]::text[] END AS header, '' AS body, nullif(current_setting('response.headers', true), '') AS response_headers, nullif(current_setting('response.status', true), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t
We can't INSERT into "View columns that are not columns of their base relation".
The obvious workaround is to serve fruit_id as a straight column, just an integer. With some post and preprocessing at the nginx level we can hex encode it there (and hex decode incoming ids). I'm wondering if we can do better than that though. For large API operations, re-encoding the JSON will use a lot of memory and CPU time and it seems so unnecessary.
It would have been great to be able to use a custom CREATE CAST to take the incoming hexadecimal strings and turn them back into integers, something like this:
CREATE CAST (json AS hexid) WITH FUNCTION json_to_hexid AS ASSIGNMENT;
But alas custom casts are ignored on CREATE DOMAIN types. And we can't make a true custom column type because our cloud Postgres host (Google Cloud SQL) doesn't allow custom extensions.
It feels like some combination of INSTEAD OF triggers or rules could work. But when filtering results with query parameters (e.g. selecting a fruit by id), I don't think there's an appropriate trigger to use. INSTEAD OF doesn't work for a plain SELECT, does it?
For example I've tested doing something like this to take care of INSERT and allow POST with PostgREST. It works:
CREATE OR REPLACE FUNCTION api_fruits_insert()
RETURNS trigger AS
$$
BEGIN
INSERT INTO fruits(fruit_id, name) VALUES (('x' || lpad(NEW.fruit_id, 16, '0'))::bit(64)::bigint::hexid, NEW.name);
RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER api_fruits_insert
INSTEAD OF INSERT
ON api_fruits
FOR EACH ROW
EXECUTE PROCEDURE api_fruits_insert();
The trouble is in the WHERE clause. Let's PATCH api_fruits?fruit_id=in.(7b,caf3) with {"name": "pear"}. This works out of the box since the name column is updatable but look at the query:
WITH pgrst_source AS (WITH pgrst_payload AS (SELECT $1::json AS json_data), pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload) UPDATE "api_x"."api_fruits" SET "name" = _."name" FROM (SELECT * FROM json_populate_recordset (null::"api_x"."api_fruits" , (SELECT val FROM pgrst_body) )) _ WHERE "api_x"."api_fruits"."fruit_id" = ANY ($2) RETURNING 1) SELECT '' AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, array[]::text[] AS header, '' AS body, nullif(current_setting('response.headers', true), '') AS response_headers, nullif(current_setting('response.status', true), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t
DETAIL: parameters: $1 = '{
"name": "pear"
}', $2 = '{7b,caf3}'
So we essentially have UPDATE api_fruits SET name='pear' WHERE fruit_id IN ('7b', 'caf3');. Surprisingly this works, but it needs a full table scan so that Postgres can evaluate to_hex(fruit_id) for each row while looking for matches. The same happens if we try to GET a record by fruit_id. How would we rewrite the WHERE clauses?
It really feels like some combination of just the right Postgres and PostgREST features should be able to get us to a point where it's all happening in Postgres without nginx's help and without excessive complexity. Any ideas?
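
To spell out the conversion in both directions, here is a small Python sketch of the arithmetic only. It says nothing about how to wire this into PostgREST; the function names are hypothetical. hex_to_id is meant to mirror the ('x' || lpad(..., 16, '0'))::bit(64)::bigint expression from the trigger above and id_to_hex mirrors to_hex() on a bigint. The point is that decoding the two filter values once yields plain integers that can be compared against the stored column, instead of encoding every row.

```python
import string

MASK_64 = (1 << 64) - 1

def hex_to_id(text: str) -> int:
    """Decode a hex string into a signed 64-bit integer (hypothetical helper)."""
    if not 1 <= len(text) <= 16 or any(c not in string.hexdigits for c in text):
        raise ValueError(f"not a 64-bit hex id: {text!r}")
    value = int(text, 16)
    # bit(64)::bigint reinterprets the bits as two's complement
    return value - (1 << 64) if value >= (1 << 63) else value

def id_to_hex(value: int) -> str:
    """Encode a signed 64-bit integer as lowercase hex (hypothetical helper)."""
    return format(value & MASK_64, "x")

# The filter from the PATCH example, decoded once up front
decoded = [hex_to_id(s) for s in ["7b", "caf3"]]
print(decoded, [id_to_hex(n) for n in decoded])
```

Whatever layer ends up doing the decoding, the goal is a WHERE clause of the form fruit_id = ANY (decoded integers) on the base table, which can use the primary key index.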

Prepare dynamic case statement using PostgreSQL 9.3

I want to build the following case statement dynamically.
Example:
case
when cola between '2001-01-01' and '2001-01-05' then 'G1'
when cola between '2001-01-10' and '2001-01-15' then 'G2'
when cola between '2001-01-20' and '2001-01-25' then 'G3'
when cola between '2001-02-01' and '2001-02-05' then 'G4'
when cola between '2001-02-10' and '2001-02-15' then 'G5'
else ''
end
Note: I want to make the case statement dynamic because the date ranges and names are passed in as parameters and may change.
Declare
dates varchar = '2001-01-01to2001-01-05,2001-01-10to2001-01-15,
2001-01-20to2001-01-25,2001-02-01to2001-02-05,
2001-02-10to2001-02-15';
names varchar = 'G1,G2,G3,G4,G5';
The values in the variables may change as requirements change, so the case statement has to be built dynamically, without using a loop.
You may not need any function for this, just join to a mapping data-set:
with cola_map(low, high, value) as (
values(date '2001-01-01', date '2001-01-05', 'G1'),
('2001-01-10', '2001-01-15', 'G2'),
('2001-01-20', '2001-01-25', 'G3'),
('2001-02-01', '2001-02-05', 'G4'),
('2001-02-10', '2001-02-15', 'G5')
-- you can include as many rows, as you want
)
select table_name.*,
coalesce(cola_map.value, '') -- else branch from case expression
from table_name
left join cola_map on table_name.cola between cola_map.low and cola_map.high
If your date ranges can overlap, you can use DISTINCT ON or GROUP BY to avoid duplicated rows.
Note: you can use a simple sub-select too; I used a CTE because it's more readable.
Edit: passing this data as a single parameter can be achieved with a multi-dimensional array (or an array of row values, but that requires a distinct, predefined composite type).
How arrays are passed as parameters depends on the actual client (and driver) you use, but in general you can use the array's input representation:
-- sql
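with cola_map(low, high, value) as (
-- unnest() would flatten the 2-D array into single elements, so walk the outer dimension by subscript instead
select p.a[i][1]::date, p.a[i][2]::date, p.a[i][3]
from (select ?::text[] as a) p,
generate_subscripts(p.a, 1) as i
)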
select table_name.*,
coalesce(cola_map.value, '') -- else branch from case expression
from table_name
left join cola_map on table_name.cola between cola_map.low and cola_map.high
// client pseudo code
query = db.prepare(sql);
query.bind(1, "{{2001-01-10,2001-01-15,G2},{2001-01-20,2001-01-25,G3}}");
query.execute();
Passing each chunk of data separately is also possible with some clients (or with some abstractions), but this depends heavily on the driver/ORM you use.
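
If the input really arrives as the two comma-separated strings from the question, the client can turn them into the array input representation before binding. A rough Python sketch follows; build_cola_map_literal is a hypothetical helper name, and it assumes the names are plain alphanumeric labels that need no quoting inside the array literal.

```python
from datetime import date

def build_cola_map_literal(dates: str, names: str) -> str:
    """Turn 'lowtohigh,...' and 'G1,...' into a 2-D array literal (hypothetical helper)."""
    ranges = [chunk.strip() for chunk in dates.split(",")]
    labels = [name.strip() for name in names.split(",")]
    if len(ranges) != len(labels):
        raise ValueError("dates and names must have the same number of entries")

    rows = []
    for chunk, label in zip(ranges, labels):
        low, high = chunk.split("to")
        # Validate early so a malformed date fails here, not inside the query
        date.fromisoformat(low)
        date.fromisoformat(high)
        if not label.isalnum():
            raise ValueError(f"label needs quoting, not handled here: {label!r}")
        rows.append("{%s,%s,%s}" % (low, high, label))
    return "{" + ",".join(rows) + "}"

literal = build_cola_map_literal(
    "2001-01-10to2001-01-15,\n2001-01-20to2001-01-25",
    "G2,G3",
)
print(literal)
```

The resulting string is what gets bound to the single parameter in the query above; the values are never spliced into the SQL text itself.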

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database in which I want to apply some logic to populate new fields.
The primary key of the table harvard_assignees is id.
The logic goes like this:
1. Select all of the records based on id
2. For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out = country
I see step 1 as a PostgreSQL query and step 2 as a function. I'm just trying to figure out the easiest way to implement this natively, with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
1. Find all DISTINCT foreign keys (a bivariate key of pat_type, patent)
2. Count the records that contain that value (e.g., n=3 records have fkey "D","388585")
3. Update those 3 records to set percent to 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1.0/cnt FROM (SELECT count(*) AS cnt FROM example_table AS x WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp WHERE cnt > 0)
Note the 1.0: count(*) returns an integer, so 1/cnt would be integer division and yield 0 for every group with more than one record.
That one will be kind of slow though.
I'm thinking of a solution based on window functions; you may want to explore those too.
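
A rough sketch of the window-function idea, run through Python's built-in sqlite3 as a stand-in for PostgreSQL (it needs an SQLite build with window function support, 3.25 or newer). The column names pat_type and patent come from the question; the sample rows are made up. count(*) OVER (PARTITION BY ...) attaches the group size to every row in a single pass, so no correlated subquery is needed.

```python
import sqlite3

# SQLite (in memory) is only a stand-in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_table ("
    " id INTEGER PRIMARY KEY, pat_type TEXT, patent TEXT)"
)
conn.executemany(
    "INSERT INTO example_table (pat_type, patent) VALUES (?, ?)",
    [("D", "388585"), ("D", "388585"), ("D", "388585"), ("U", "100")],
)

# 1.0 forces a non-integer division; the window count is the group size n.
rows = conn.execute("""
    SELECT id,
           1.0 / count(*) OVER (PARTITION BY pat_type, patent) AS percent
    FROM example_table
    ORDER BY id
""").fetchall()
print(rows)
```

In PostgreSQL the same SELECT could feed an UPDATE ... FROM (subquery) joined on id to write percent back, which I have not benchmarked against the correlated version.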