Creating table views with columns from nested value of a jsonb column - postgresql

I have a jsonb column in one of the tables that has the following structure:
{
  "3424": {
    "status": "pending",
    "remarks": "sample here"
  },
  "6436": {
    "status": "new",
    "remarks": "sample here"
  },
  "9768": {
    "status": "cancelled",
    "remarks": null,
    "by": "customer"
  }
}
I am trying to create a view that puts the statuses into individual columns, with the key as their value:
pending | new | cancelled | accepted | id | transaction
3424 | 6436 | 9768 | null | 1 | testing
The problem is that the key is dynamic (numeric, corresponding to some id), so I cannot pinpoint an exact key to use with the functions/operators documented here: https://www.postgresql.org/docs/9.5/functions-json.html
I've read about jsonb_path_query and was able to extract the statuses here without needing to know the key, but I cannot combine that with the integer key yet.
select mt.id, mt.transaction, hstatus
from mytable mt
cross join lateral jsonb_path_query(mt.hist, '$.**.status') hstatus
where mt.id = <id>
but this returns the statuses as rows for now. I'm pretty new to (Postgre)SQL, so this is as far as I've gotten.

You can indeed use a path query. Unfortunately it's not possible to access the "parent" inside a jsonpath in Postgres, but you can work around that by expanding the whole value into a list of key/value pairs, so that the id value you have can be accessed through .key:
select jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "pending").key') #>> '{}' as pending,
       jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "new").key') #>> '{}' as new,
       jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "cancelled").key') #>> '{}' as cancelled,
       jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "accepted").key') #>> '{}' as accepted,
       id,
       "transaction"
from the_table
The jsonpath function $.keyvalue() returns something like this:
{"id": 0, "key": "3424", "value": {"status": "pending", "remarks": "sample here"}}
This is then used to pick an element through a condition on @.value.status, and the accessor .key then returns the corresponding key value (e.g. 3424).
The #>> '{}' is a hack to convert the returned jsonb value into a proper text value (otherwise the result would be e.g. "3424" instead of just 3424).
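For a quick check, here is a self-contained sketch of the same idea against an inline version of the sample document from the question (the inline value and the expected output are illustrative only):
select jsonb_path_query_first(h, '$.keyvalue() ? (@.value.status == "pending").key') #>> '{}' as pending,
       jsonb_path_query_first(h, '$.keyvalue() ? (@.value.status == "new").key') #>> '{}' as new,
       jsonb_path_query_first(h, '$.keyvalue() ? (@.value.status == "cancelled").key') #>> '{}' as cancelled
from (values ('{"3424": {"status": "pending"}, "6436": {"status": "new"}, "9768": {"status": "cancelled"}}'::jsonb)) t(h);
-- pending | new  | cancelled
-- 3424    | 6436 | 9768
Wrapping the full query in create view ... as then gives the view asked for in the question.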

Related

Delta Lake table update column-wise in parallel

I hope everyone is doing well. I have a long question therefore please bear with me.
Context:
So I have CDC payloads coming from the Debezium connector for Yugabyte in the following form:
r"""
{
"payload": {
"before": null,
"after": {
"id": {
"value": "MK_1",
"set": true
},
"status": {
"value": "new_status",
"set": true
},
"status_metadata": {
"value": "new_status_metadata",
"set": true
},
"creator": {
"value": "new_creator",
"set": true
},
"created": null,
"creator_type": null,
"updater": null,
"updated": null,
"updater_type": {
"value": "new_updater_type",
"set": true
}
},
"source": {
"version": "1.7.0.13-BETA",
"connector": "yugabytedb",
"name": "dbserver1",
"ts_ms": -4258763692835,
"snapshot": "false",
"db": "yugabyte",
"sequence": "[\"0:0::0:0\",\"1:338::0:0\"]",
"schema": "public",
"table": "customer",
"txId": "",
"lsn": "1:338::0:0",
"xmin": null
},
"op": "u",
"ts_ms": 1669795724779,
"transaction": null
}
}
"""
The payload consists of before and after fields. As visible from "op": "u", this is an update operation: a row in the Yugabyte table customer with id MK_1 was updated with new values. However, the after field only shows those columns whose value has been updated. The fields in "after" which are null have not been updated (e.g. created is null, so it was not touched), whereas status is {"value": "new_status", "set": true}, which means the status column was updated to the new value "new_status". Now I have a PySpark Structured Streaming pipeline which takes in these payloads, processes them, and then makes a micro dataframe of the following form:
id | set_id | status | set_status | status_metadata | set_status_metadata | creator | set_creator | created | creator_type | set_created_type | updater | set_updater | updated | set_updated | updater_type | set_updater_type
The "set_column" is either true or false depending on the payload.
Problem:
Now I have a delta table on delta lake with the following schema:
id | status | status_metadata | creator | created | created_type | updater | updated | updater_type
And I am using the following code to update the above delta table using the python delta lake API (v 2.2.0):
for column in fields_map:
    delta_lake_merger.whenMatchedUpdate(
        condition=f"update_table.op = 'u' AND update_table.set_{column} = 'true'",
        set={column: fields_map[column]}
    ).execute()
Now you might be wondering why I am doing the update column-wise rather than for all the columns at once. This is exactly the problem that I am facing. If I update all of the columns at once without the set_{column} = 'true' condition, it overwrites the entire state of the rows for the matching id in the delta table, which is not what I want.
What do I want?
I only want to update those columns from the payload whose values are not null in the payload. If I update all columns at once like this:
delta_lake_merger.whenMatchedUpdate(
    condition=f"update_table.op = 'u'",
    set=fields_map
).execute()
Then the delta lake API will also replace the columns that have not been updated with nulls in the delta table, since null is the value for the non-updated columns in the CDC payload. The iterative solution above works because it updates one column at a time across all rows of the delta table: it simply skips a given row in a given column when its set_column is false, and therefore keeps the existing value in the delta table.
However, this is slow since it writes the data N times sequentially, which bottlenecks my streaming query. Since all of the column-wise updates are independent, is there any way in the delta lake Python API to update all of the columns at once, but still with the set_column condition? I suspect there is, because each of these is just an independent call to write data for one column under the given condition. I want to call execute once for all columns with the set condition rather than putting it in a loop.
PS: I was thinking of using asyncio library for python but not so sure. Thank you so much.
I have been able to find a solution. If someone is stuck on a similar problem, you can use a CASE WHEN in the set field of whenMatchedUpdate:
delta_lake_merger.whenMatchedUpdate(
    set={column: f"CASE WHEN update_table.set_{column} = 'true' THEN update_table.{column} ELSE main_table.{column} END" for column in fields_map}
).execute()
This will execute the update for all of the columns at once with the set condition.

PostgreSQL update a jsonb column multiple times

Consider the following:
create table query(id integer, query_definition jsonb);
create table query_item(path text[], id integer);
insert into query (id, query_definition)
values
(100, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"},{"type":"str","field":"lastname"}]}'::jsonb),
(101, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"}]}'::jsonb);
insert into query_item(path, id) values
('{columns,0,type}'::text[], 100),
('{columns,1,type}'::text[], 100),
('{columns,2,type}'::text[], 100),
('{columns,0,type}'::text[], 101),
('{columns,1,type}'::text[], 101);
I have a query table which has a jsonb column named query_definition.
The jsonb value looks like the following:
{
  "columns": [
    {
      "type": "integer",
      "field": "id"
    },
    {
      "type": "str",
      "field": "firstname"
    },
    {
      "type": "str",
      "field": "lastname"
    }
  ]
}
In order to replace all "type": "..." with "type": "string", I've built the query_item table which contains the following data:
path |id |
----------------+---+
{columns,0,type}|100|
{columns,1,type}|100|
{columns,2,type}|100|
{columns,0,type}|101|
{columns,1,type}|101|
path matches each path from the json root to the "type" entry, id is the corresponding query's id.
I made up the following sql statement to do what I want:
update query q
set query_definition = jsonb_set(q.query_definition, query_item.path, ('"string"')::jsonb, false)
from query_item
where q.id = query_item.id
But it only partially works: for each query it applies the first matching row and skips the others (only the 1st and 4th lines of the query_item table take effect).
I know I could build a for statement, but it requires a plpgsql context and I'd rather avoid its use.
Is there a way to do it with a single update statement?
I've read in this topic that it's possible to do this with strings, but I couldn't figure out how to adapt that mechanism to jsonb.
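For what it's worth, if the only goal is to set every columns[*].type to "string", one alternative (a sketch added here, not from the original post) is to rebuild the columns array in a single statement and skip the query_item table entirely:
update query q
set query_definition = jsonb_set(
        q.query_definition,
        '{columns}',
        (select jsonb_agg(jsonb_set(elem, '{type}', '"string"'))
         from jsonb_array_elements(q.query_definition -> 'columns') as elem)
    );
This does not cover the general case of applying an arbitrary list of stored paths in one statement, though.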

Postgres Querying/filtering JSONB with nested arrays

Below is my sample requirement
I want customers who meet all of the below conditions:
In country "xyz", incorporated between 2019 and 2021.
Should have at least one account with a balance between 10000 and 13000, branch "abc", and transaction dates between 20200110 and 20210625 (the dates are formatted and stored as numbers).
Should have at least one address in the state "state1" and a pin code between 625001 and 625015.
Below is table structure
CREATE TABLE customer_search_ms.customer
(
customer_id integer,
customer_details jsonb
)
There can be millions of rows in the table.
I have created a GIN index of type jsonb_ops on the customer_details column, as we will also be checking existence conditions and range comparisons.
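For reference, such an index might look like the following (the index name is made up; jsonb_ops is the default GIN operator class, so spelling it out is optional):
CREATE INDEX customer_details_gin_idx
    ON customer_search_ms.customer
    USING gin (customer_details jsonb_ops);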
Below is sample data in the customer_details jsonb column
customer_id : 1
{
  "customer_data": {
    "name": "abc",
    "incorporated_year": 2020,
    "country": "xyz",
    "account_details": [
      {
        "transaction_dates": [
          20180125, 20190125, 20200125, 20200525
        ],
        "account_id": 1016084,
        "account_balance": 2000,
        "account_branch": "xyz"
      },
      {
        "transaction_dates": [
          20180125, 20190125, 20200125
        ],
        "account_id": 1016087,
        "account_balance": 12010,
        "account_branch": "abc"
      }
    ],
    "address": [
      {
        "address_id": 24739,
        "door_no": 4686467,
        "street_name": "street1",
        "city": "city1",
        "state": "state1",
        "pin_code": 625001
      },
      {
        "address_id": 24730,
        "door_no": 4686442,
        "street_name": "street2",
        "city": "city1",
        "state": "state1",
        "pin_code": 625014
      }
    ]
  }
}
Now the query I have written for the above is:
SELECT c.customer_id,
       c.customer_details
FROM customer_search_ms.customer c
WHERE c.customer_details @@ CAST('$.customer_data.country == "xyz" && $.customer_data.incorporated_year >= 2019 && $.customer_data.incorporated_year <= 2021' AS JSONPATH)
  AND c.customer_details @? CAST('$.customer_data.account_details[*] ? (@.account_balance >= 10000) ? (@.account_balance <= 13000) ? (@.account_branch == "abc") ? (@.transaction_dates >= 20200110) ? (@.transaction_dates <= 20210625)' AS JSONPATH)
  AND c.customer_details @? CAST('$.customer_data.address[*] ? (@.state == "state1") ? (@.pin_code >= 625001) ? (@.pin_code <= 625015)' AS JSONPATH)
Is this the best way to write the query for the above scenario? Is it possible to combine all three criteria (customer/account/address) into one expression? The table will have millions of rows, and I am of the opinion that hitting the DB with a single expression will give the best performance.
Your query does not give me the error you report. Rather, it runs, but does give the "wrong" results compared to what you want. There are several mistakes in it which are not syntax errors, but just give wrong results.
Your first jsonpath looks fine. It is a Boolean expression, and @@ checks whether that expression yields true.
Your second jsonpath has two problems. It yields a list of objects which match your conditions. But objects are not booleans, so @@ will be unhappy and return SQL NULL, which is treated the same as false here. Instead, you need to test whether that list is empty. This is what @? does, so use that instead of @@. Also, your dates are stored as 8-digit integers, but you are comparing them to 8-character strings. In jsonpath, such cross-type comparisons yield JSON null, which is treated the same as false here. So you either need to change the storage to strings, or change the literals they are compared to into integers.
Your third jsonpath also has the @@ problem. And it has the reverse of the type problem: you have pin_code stored as strings, but you are testing it against integers. Finally, you have 'pin_code' misspelled in one occurrence.
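A tiny illustration of the cross-type comparison issue (my own example, not from the original answer):
SELECT '{"d": 20200125}'::jsonb @? '$.d ? (@ >= "20200110")';  -- false: number vs. string yields unknown
SELECT '{"d": 20200125}'::jsonb @? '$.d ? (@ >= 20200110)';    -- true: number vs. number compares as expected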

Postgresql jsonb traversal

I am very new to the PG jsonb field.
I have for example a jsonb field containing the following
{
  "RootModule": {
    "path": [
      1
    ],
    "tags": {
      "ModuleBase1": {
        "value": 40640,
        "humanstring": "40640"
      },
      "ModuleBase2": {
        "value": 40200,
        "humanstring": "40200"
      }
    },
    "children": {
      "RtuInfoModule": {
        "path": [
          1,
          0
        ],
        "tags": {
          "in0": {
            "value": 11172,
            "humanstring": "11172"
          },
          "in1": {
            "value": 25913,
            "humanstring": "25913"
          }
etc....
Is there a way to query X levels deep and search the "tags" key for a certain key?
Say I want "ModuleBase2" and "in1" and I want to get their values?
Basically I am looking for a query that will traverse a jsonb field until it finds a key and returns the value without having to know the structure.
In Python or JS a simple loop or recursive function could easily traverse a json object (or dictionary) until it finds a key.
Is there a built in function PG has to do that?
Ultimately I want to do this in django.
Edit:
I see I can do stuff like
SELECT data.key AS key, data.value AS value
FROM trending_snapshot, jsonb_each(trending_snapshot.snapshot -> 'RootModule') AS data
WHERE key = 'tags';
But I must specify the levels.
You can use a recursive query to flatten a nested jsonb, see this answer. Modify the query to find values for specific keys (add a condition in the where clause):
with recursive flat (id, path, value) as (
select id, key, value
from my_table,
jsonb_each(data)
union
select f.id, concat(f.path, '.', j.key), j.value
from flat f,
jsonb_each(f.value) j
where jsonb_typeof(f.value) = 'object'
)
select id, path, value
from flat
where path like any(array['%ModuleBase2.value', '%in1.value']);
id | path | value
----+--------------------------------------------------+-------
1 | RootModule.tags.ModuleBase2.value | 40200
1 | RootModule.children.RtuInfoModule.tags.in1.value | 25913
(2 rows)
Test it in SqlFiddle.
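If one row per id is preferred, the final select of the query above could be replaced with something along these lines (a sketch; the filtered aggregates and the #>> '{}' cast to text are my additions):
select id,
       max(value #>> '{}') filter (where path like '%ModuleBase2.value') as modulebase2,
       max(value #>> '{}') filter (where path like '%in1.value') as in1
from flat
group by id;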

Using LIKE operator for array of objects inside jsonb field in PostgreSQL

Is it possible to use LIKE operator for single key/value inside array of objects for jsonb field in PostgreSQL 9.4? For example I have:
id | body
------------------------------------------------------------
1 | {"products": [{"name": "qwe", "description": "asd"}, {"name": "zxc", "description": "vbn"}]}
I know I can get a product with something like this:
select * from table where body -> 'products' @> '[{"name": "qwe"}]'::jsonb
The question is: can I get this product if I don't know its full name?
Try to get the key and value by using the jsonb_each() function:
WITH json_test(data) AS ( VALUES
    ('{"products": [{"name": "qwe", "description": "asd"}, {"name": "zxc", "description": "vbn"}]}'::JSONB)
)
SELECT doc.key, doc.value
FROM json_test jt,
     jsonb_array_elements(jt.data -> 'products') array_elements,
     jsonb_each(array_elements) doc
WHERE doc.key = 'name'
  AND doc.value::TEXT LIKE '%w%';
Output will be the following:
key | value
------+-------
name | "qwe"
(1 row)
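Applied to the table from the question, the same idea might look like this (a sketch; the table name my_table is a stand-in for the real one):
SELECT DISTINCT t.id, t.body
FROM my_table t,
     jsonb_array_elements(t.body -> 'products') AS product
WHERE product ->> 'name' LIKE '%w%';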