Convert jsonb column to a user-defined type - postgresql

I'm trying to convert each row in a jsonb column to a type that I've defined, and I can't quite seem to get there.
I have an app that scrapes articles from The Guardian Open Platform and dumps the responses (as jsonb) into a column called 'body' in an ingestion table. The other columns are a sequential ID and a timestamp extracted from the response payload, which helps my app scrape only new data.
I'd like to move the response dump data into a properly-defined table, and as I know the schema of the response, I've defined a type (my_type).
I've been referring to section 9.16, JSON Functions and Operators, in the Postgres docs. I can get a single record as my type:
select * from jsonb_populate_record(null::my_type, (select body from data_ingestion limit 1));
produces
 id         | type         | sectionId          | ...
------------+--------------+--------------------+-----
 example_id | example_type | example_section_id | ...
(abbreviated for concision)
If I remove the limit, I get an error, which makes sense: the subquery would be providing multiple rows to jsonb_populate_record which only expects one.
I can get it to do multiple rows, but the result isn't broken into columns:
select jsonb_populate_record(null::my_type, body) from reviews_ingestion limit 3;
produces:
jsonb_populate_record
(example_id_1,example_type_1,example_section_id_1,...)
(example_id_2,example_type_2,example_section_id_2,...)
(example_id_3,example_type_3,example_section_id_3,...)
This is a bit odd: I would have expected to see column names, since that, after all, is the point of providing the type.
I'm aware I can do this by using Postgres JSON querying functionality, e.g.
select
body -> 'id' as id,
body -> 'type' as type,
body -> 'sectionId' as section_id,
...
from reviews_ingestion;
This works but it seems quite inelegant, and I lose the data types.
I've also considered aggregating all rows in the body column into a JSON array, so as to be able to supply this to jsonb_populate_recordset, but that seems a bit of a silly approach and unlikely to be performant.
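For reference, the aggregate-then-populate approach I mean would look roughly like this (an untested sketch against the same tables):
select r.*
from jsonb_populate_recordset(
       null::my_type,
       (select jsonb_agg(body) from reviews_ingestion)
     ) r;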
Is there a way to achieve what I want, using Postgres functions?

Maybe you need this, to break the my_type record into columns:
select (jsonb_populate_record(null::my_type, body)).*
from reviews_ingestion
limit 3;
-- or whatever other query clauses here
i.e. select all from these my_type records. All column names and types are in place.
Here is an illustration. My custom type is delmet and the CTE t roughly mimics data_ingestion.
create type delmet as (x integer, y text, z boolean);
with t(i, j, k) as
(
values
(1, '{"x":10, "y":"Nope", "z":true}'::jsonb, 'cats'),
(2, '{"x":11, "y":"Yep", "z":false}', 'dogs'),
(3, '{"x":12, "y":null, "z":true}', 'parrots')
)
select i, (jsonb_populate_record(null::delmet, j)).*, k
from t;
Result:
 i | x  | y    | z     | k
---+----+------+-------+---------
 1 | 10 | Nope | true  | cats
 2 | 11 | Yep  | false | dogs
 3 | 12 |      | true  | parrots
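Applied to your tables it would be along these lines (a sketch; articles here is just a hypothetical target table whose columns match my_type):
insert into articles
select (jsonb_populate_record(null::my_type, body)).*
from reviews_ingestion;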

Related

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
Also, I have a table that uses this composite type as an array of mydata_t.
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list where mydata_t.user_id is 100000.
But I don't know which array element's user_id is equal to 100000, so I have to search for it first. That's my problem: I don't know how to write the query. In short, I want to update the value of the array element whose user_id is equal to 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong !!!)
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=10000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of the tbl.data element whose user_id is equal to 11 to 'XXXX'.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current value, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current value then the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that there is an awful lot of work going on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish and you cannot use an index on this. Furthermore, an update actually inserts a new row in the underlying file on disk, and if your array has more than a few entries this involves substantial work. It gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. It all happens behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds like changes are made in place (and in memory they indeed are), the UPDATE command will write a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data will have to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written back to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table of its own (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.
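For illustration, that normalized design could be as simple as this (a sketch; column names mirror the composite type):
CREATE TABLE data_list
(
    id      integer NOT NULL REFERENCES tbl (id),
    user_id integer NOT NULL,
    value   character(4),
    PRIMARY KEY (id, user_id)   -- also serves as the index for the UPDATE above
);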

How to use dynamic regex to match value in Postgres

SUMMARY: I have two tables I want to derive info from, family_values (family_name, item_regex) and product_ids (product_id), in order to update the property family_name in a third.
Here the plan is to grab a json array from the small family_values table and use the column value item_regex to do a test match against the product_id for every row in product_ids.
MORE DETAILS: I'm importing static data from CSV into a table of orders. But in evaluating cost of goods and market value, I need to continuously determine the family from a prefix regex match (item_regex from family_values) on the product_id.
On the client this looks like this:
const families = {
FOOBAR: 'Big Ogre',
FOOBA: 'Wood Elf',
FOO: 'Valkyrie'
};
// And to find family, and subsequently COGs and Market Value:
const findFamily = product_id => Object.keys(families).find(f => new RegExp('^' + f).test(product_id));
This is a huge hit for the client, so I made a family_values table in PG with a representative family_name, item_regex, cogs and market_value.
Then, product_ids has a list of only the products the app cares about (out of millions). This is actually used with a 'before insert' trigger to ignore any CSV entries that aren't in the product_ids view. So I guess after that the product_ids view could be taken out of the equation, because orders, after inserting the read-only data, has its own matching product_id. It does NOT have family_name, though, so I still have the issue of determining that client-side.
PSEUDO CODE: update family column of orders with family_name from family_values regex match against orders.product_id
OR update the product_ids table with a new family column and use that with the existing on-insert trigger (currently used to left-pad zeros and normalize data). Now I'm thinking this may be just an update as suggested, but I'm not very good with regex in PG. I'm a PG novice.
PROBLEM: I'm having a hangup in doing what I thought would be like a JS Array find operation. The family_values have been sorted on item_regex so that the strictest match is on top, and therefore found first.
For example, with sorting we have:
family_values_array = [
{"family_name": "Big Ogre", "item_regex": "FOOBAR"},
{"family_name": "Wood Elf", "item_regex": "FOOBA"},
{"family_name": "Valkyrie", "item_regex": "FOO"}]
So the comparison of a product_id against ^FOOBA would yield the family "Wood Elf".
SOLUTION:
The solution I finally arrived at was simply using concat to build the front-anchored pattern. It was so simple in the end. The key line I was missing is:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') limit 1;
Whole function:
create or replace function iol.populate_families() returns void as $$
declare
    product_row      record;
    family_value_row record;
begin
    for product_row in
        select product_id, lvl3_id from iol.products
    loop
        -- family_name is what we want after finding the BEST match for a product_id against item_regex
        select * into family_value_row
        from iol.family_values
        where lvl3_id = product_row.lvl3_id
          and product_row.product_id like concat(item_regex, '%')
        limit 1;
        -- update family_name and value columns
        update iol.products set
            family_name        = family_value_row.family_name,
            cog_cents          = family_value_row.cog_cents,
            market_value_cents = family_value_row.market_value_cents
        where product_id = product_row.product_id;
    end loop;
end;
$$
LANGUAGE plpgsql;
Use concat as updated above:
select * into family_value_row from iol.family_values
where lvl3_id = product_row.lvl3_id and product_row.product_id
like concat(item_regex, '%') limit 1;
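For what it's worth, the same matching can also be done in a single set-based UPDATE rather than a plpgsql loop (a sketch, assuming the same columns used in the function above; note that products with no matching prefix would get NULLs):
update iol.products p
set (family_name, cog_cents, market_value_cents) =
    (select fv.family_name, fv.cog_cents, fv.market_value_cents
     from iol.family_values fv
     where fv.lvl3_id = p.lvl3_id
       and p.product_id like concat(fv.item_regex, '%')
     order by length(fv.item_regex) desc   -- prefer the strictest (longest) prefix
     limit 1);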

How to split array in json using json_query?

I've got a column in a table that's a json. It contains only values without keys (e.g. a plain json array such as [20, 200]).
Now I'm trying to split the data from the json and create a new table, using every index of each array as a new entry.
I've already tried
SELECT JSON_QUERY(abc) as 'Type', Id as 'ValueId' from Table FOR JSON AUTO
Is there any way to handle the splitting, given that some arrays might be empty and look like []?
A fairly simple approach would be to use outer apply with openjson.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
    Id int,
    Value nvarchar(20)
)
INSERT INTO @T VALUES
(1, '[10]'),
(2, '[20, 200]'),
(3, '[]'),
(4, '')
The query:
SELECT Id, JsonValues.Value
FROM @T As t
OUTER APPLY
OPENJSON( Value ) As JsonValues
WHERE ISJSON(t.Value) = 1
Results:
Id Value
1 10
2 20
2 200
3 NULL
Note the ISJSON condition in the where clause will prevent exceptions in case the Value column contains anything other than a valid json (an empty array [] is still considered valid for this purpose).
If you don't want to return a row where the json array is empty, use cross apply instead of outer apply.
Your own code using FOR JSON AUTO tries to create JSON out of tabular data. But what you really need seems to be the opposite direction: you want to transform JSON into a result set, a derived table. This is done by OPENJSON.
Your JSON seems to be a very minimalistic array.
You can try something along these lines:
DECLARE @json NVARCHAR(MAX) = N'[1,2,3]';
SELECT * FROM OPENJSON(@json);
The result returns the zero-based ordinal position in key, the actual value in value and a (very limited) type-enum.
Hint: If you want to use this against a table's column you must use APPLY, something along these lines:
SELECT *
FROM YourTable t
OUTER APPLY OPENJSON(t.TheJsonColumn);

postgres `order by` argument type

What is the argument type for the order by clause in Postgresql?
I came across a very strange behaviour (using Postgresql 9.5). Namely, the query
select * from unnest(array[1,4,3,2]) as x order by 1;
produces 1,2,3,4 as expected. However the query
select * from unnest(array[1,4,3,2]) as x order by 1::int;
produces 1,4,3,2, which seems strange. Similarly, whenever I replace 1::int with some function (e.g. greatest(0,1)) or even a case expression, the results are unordered (contrary to what I would expect).
So which type should an argument of order by have, and how do I get the expected behaviour?
This is expected (and documented) behaviour:
A sort_expression can also be the column label or number of an output column
So the expression:
order by 1
sorts by the first column of the result set (as defined by the SQL standard)
However the expression:
order by 1::int
sorts by the constant value 1; it's essentially the same as:
order by 'foo'
By using a constant value for the order by, all rows have the same sort value and thus aren't really sorted.
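For contrast, ordering by the output column itself works as expected:
select * from unnest(array[1,4,3,2]) as x order by x; -- returns 1,2,3,4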
To sort by an expression, just use that:
order by
case
when some_column = 'foo' then 1
when some_column = 'bar' then 2
else 3
end
The above sorts the result based on the result of the case expression.
Actually I have a function with an integer argument which indicates the column to be used in the order by clause.
In the case when all columns are of the same type, this can work:
SELECT ....
ORDER BY
CASE function_to_get_a_column_number()
WHEN 1 THEN column1
WHEN 2 THEN column2
.....
WHEN 1235 THEN column1235
END
If columns are of different types, you can try:
SELECT ....
ORDER BY
CASE function_to_get_a_column_number()
WHEN 1 THEN column1::varchar
WHEN 2 THEN column2::varchar
.....
WHEN 1235 THEN column1235::varchar
END
But these "workarounds" are horrible. You need some other approach than the function returning a column number.
Maybe dynamic SQL?
I would say that dynamic SQL (thanks @kordirko and the others for the hints) is the best solution to the problem I originally had in mind:
create temp table my_data (
id serial,
val text
);
insert into my_data(id, val)
values (default, 'a'), (default, 'c'), (default, 'd'), (default, 'b');
create function fetch_my_data(col text)
returns setof my_data as
$f$
begin
return query execute $$
select * from my_data
order by $$|| quote_ident(col);
end
$f$ language plpgsql;
select * from fetch_my_data('val'); -- order by val
select * from fetch_my_data('id'); -- order by id
In the beginning I thought this could be achieved using a case expression in the argument of the order by clause, the sort_expression. And here comes the tricky part which confused me: when the sort_expression is an identifier (the name or number of an output column), the corresponding column is used when ordering the results. But when the sort_expression is some value, we actually order the results by that value itself (computed for each row). This is @a_horse_with_no_name's answer rephrased.
So when I queried ... order by 1::int, in a way I assigned the value 1 to each row and then tried to sort a set of ones, which clearly is useless.
There are some workarounds without dynamic queries, but they require writing more code and do not seem to have any significant advantages.

Postgresql Select all columns and column names with a specific value for a row

I have a table with many (1000+) columns and many rows (~1M). The columns either have the value 1 or are NULL.
For a specific row (user), I want to be able to retrieve the names of the columns that have a value of 1.
Since there are many columns on the table, specifying them all would yield an extremely long query.
You're doing something SQL is quite bad at: dynamic access to columns, or treating a row as a set. It'd be nice if this were easier, but it doesn't work well with SQL's typed nature and the concept of a relation. Working with your data set in its current form is going to be frustrating; consider storing an array, json, or hstore of values instead.
Actually, for this particular data model, you could probably use a bitfield. See bit(n) and bit varying(n).
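A rough sketch of that bit-string idea (purely illustrative; you would also need a mapping from bit position to the original column name):
CREATE TABLE blah_bits (id serial primary key, flags bit varying(1000));
-- which bit positions are set for each row
SELECT id, pos
FROM blah_bits,
     LATERAL generate_series(1, length(flags)) AS pos
WHERE substring(flags::text from pos for 1) = '1';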
It's still possible to make a working query with your current model using PostgreSQL extensions, though.
Given sample:
CREATE TABLE blah (id serial primary key, a integer, b integer, c integer);
INSERT INTO blah(a,b,c) VALUES (NULL, NULL, 1), (1, NULL, 1), (NULL, NULL, NULL), (1, 1, 1);
I would unpivot each row into a key/value set using hstore (or, in newer PostgreSQL versions, the json functions). SQL itself provides no way to dynamically access columns, so we have to use an extension. So:
SELECT id, hs FROM blah, LATERAL hstore(blah) hs;
then extract the hstores to sets:
SELECT id, k, v FROM blah, LATERAL each(hstore(blah)) kv(k,v);
... at which point you can filter for values matching the criteria. Note that all columns have been converted to text, so you may want to cast them back:
SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1;
You also need to exclude id from matching, so:
regress=> SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1 AND
k <> 'id';
id | k
----+---
1 | c
2 | a
2 | c
4 | a
4 | b
4 | c
(6 rows)
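For completeness, the json route mentioned above looks much the same; a sketch using to_jsonb and jsonb_each_text against the same sample table:
SELECT id, k
FROM blah, LATERAL jsonb_each_text(to_jsonb(blah)) kv(k, v)
WHERE v::integer = 1 AND k <> 'id';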