PostgreSQL update a jsonb column multiple times

Consider the following:
create table query(id integer, query_definition jsonb);
create table query_item(path text[], id integer);
insert into query (id, query_definition)
values
(100, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"},{"type":"str","field":"lastname"}]}'::jsonb),
(101, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"}]}'::jsonb);
insert into query_item(path, id) values
('{columns,0,type}'::text[], 100),
('{columns,1,type}'::text[], 100),
('{columns,2,type}'::text[], 100),
('{columns,0,type}'::text[], 101),
('{columns,1,type}'::text[], 101);
I have a query table which has a jsonb column named query_definition.
The jsonb value looks like the following:
{
  "columns": [
    {
      "type": "integer",
      "field": "id"
    },
    {
      "type": "str",
      "field": "firstname"
    },
    {
      "type": "str",
      "field": "lastname"
    }
  ]
}
In order to replace all "type": "..." with "type": "string", I've built the query_item table which contains the following data:
path             | id
-----------------+----
{columns,0,type} | 100
{columns,1,type} | 100
{columns,2,type} | 100
{columns,0,type} | 101
{columns,1,type} | 101
path is the path from the JSON root to the "type" entry; id is the corresponding query's id.
I made up the following sql statement to do what I want:
update query q
set query_definition = jsonb_set(q.query_definition, query_item.path, ('"string"')::jsonb, false)
from query_item
where q.id = query_item.id
But it only partially works: for each query row, only the first matching query_item row is applied (the 1st and 4th lines of the query_item table) and the others are skipped, because UPDATE ... FROM uses a single joined row per target row.
I know I could build a FOR loop, but that requires a PL/pgSQL context and I'd rather avoid it.
Is there a way to do it with a single update statement?
I've read in this topic that it's possible to do this with strings, but I couldn't figure out how to adapt that mechanism to jsonb.
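One possible direction (only a sketch, and it bypasses the query_item table entirely): since every "type" should end up as "string", the columns array can be rebuilt in a single statement with jsonb_array_elements and jsonb_agg, using || to override the key in each element:

-- Sketch: rewrite every element of "columns", forcing "type" to "string".
-- COALESCE guards against an empty array, where jsonb_agg would return NULL.
UPDATE query q
SET query_definition = jsonb_set(
    q.query_definition,
    '{columns}',
    COALESCE(
        (SELECT jsonb_agg(elem || '{"type": "string"}'::jsonb)
         FROM jsonb_array_elements(q.query_definition -> 'columns') AS elem),
        '[]'::jsonb),
    false)
WHERE q.query_definition ? 'columns';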

Related

Delta Lake table update column-wise in parallel

I hope everyone is doing well. I have a long question, so please bear with me.
Context:
So I have CDC payloads coming from the Debezium connector for Yugabyte in the following form:
r"""
{
"payload": {
"before": null,
"after": {
"id": {
"value": "MK_1",
"set": true
},
"status": {
"value": "new_status",
"set": true
},
"status_metadata": {
"value": "new_status_metadata",
"set": true
},
"creator": {
"value": "new_creator",
"set": true
},
"created": null,
"creator_type": null,
"updater": null,
"updated": null,
"updater_type": {
"value": "new_updater_type",
"set": true
}
},
"source": {
"version": "1.7.0.13-BETA",
"connector": "yugabytedb",
"name": "dbserver1",
"ts_ms": -4258763692835,
"snapshot": "false",
"db": "yugabyte",
"sequence": "[\"0:0::0:0\",\"1:338::0:0\"]",
"schema": "public",
"table": "customer",
"txId": "",
"lsn": "1:338::0:0",
"xmin": null
},
"op": "u",
"ts_ms": 1669795724779,
"transaction": null
}
}
"""
The payload consists of before and after fields. As visible from "op": "u", this is an update operation: a row in the Yugabyte table customer with id MK_1 was updated with new values. However, the after field only shows those columns whose values have been updated; the fields in "after" which are null have not been updated. For example, created is null and therefore was not updated, while status is {"value": "new_status", "set": true}, which means the status column was updated to the new value "new_status".
Now I have a PySpark Structured Streaming pipeline which takes in these payloads, processes them, and then builds a micro data frame of the following form:
id | set_id | status | set_status | status_metadata | set_status_metadata | creator | set_creator | created | creator_type | set_created_type | updater | set_updater | updated | set_updated | updater_type | set_updater_type
The "set_column" is either true or false depending on the payload.
Problem:
Now I have a delta table on delta lake with the following schema:
id | status | status_metadata | creator | created | created_type | updater | updated | updater_type
And I am using the following code to update the above delta table using the Python Delta Lake API (v2.2.0):
for column in fields_map:
    delta_lake_merger.whenMatchedUpdate(
        condition=f"update_table.op = 'u' AND update_table.set_{column} = 'true'",
        set={column: fields_map[column]}
    ).execute()
Now you might be wondering why I am doing the update column-wise rather than for all the columns at once. This is exactly the problem that I am facing. If I update all of the columns at once without the set_{column} = 'true' condition, then it will overwrite the entire state of the rows for the matching id in the delta table. This is not what I want.
What do I want?
I only want to update those columns from the payload whose values are not null in the payload. If I update all columns at once like this:
delta_lake_merger.whenMatchedUpdate(
    condition=f"update_table.op = 'u'",
    set=fields_map
).execute()
Then the Delta Lake API will also replace those columns which have not been updated with nulls in the delta table, since null is the value of the non-updated columns in the CDC payload. The iterative solution above works where I do a column-wise update for all of the rows in the delta table, since it just ignores the rows in the given column whose set_column is false and therefore keeps the existing value in the delta table.
However, this is slow, since it writes the data N times in a sequential manner, which bottlenecks my streaming query. Since all of the column-wise updates are independent, is there any way in the Delta Lake Python API that I can update all of the columns at once, but with the set_column condition as well? I know there might be a way, because each of these is just an independent call to write data for each column with the given condition. I want to call the execute command once for all columns with the set condition rather than putting it in a loop.
PS: I was thinking of using the asyncio library for Python, but I'm not so sure. Thank you so much.
I have been able to find a solution, in case someone is stuck on a similar problem: you can use a CASE WHEN expression in the set field of whenMatchedUpdate:
set_expressions = {column: f"CASE WHEN update_table.set_{column} = 'true' THEN update_table.{column} ELSE main_table.{column} END" for column in fields_map}
delta_lake_merger.whenMatchedUpdate(set=set_expressions).execute()
This will execute the update for all of the columns at once with the set condition.

Iteration for insert into values

I will try and present the current setup as an abstract view, the focus being on the logical approach to batch insert.
CREATE TABLE factv (id, ..other columns..);
CREATE TABLE facto (id, ..other columns..);
CREATE TABLE dims (id serial, dimensions jsonb);
The 2 fact tables share the same dimensions, but they've got different columns.
There's an event stream which sends messages to a table, and there's a function which is executed for each row; the logic is similar to:
CREATE OR REPLACE FUNCTION insert_event() RETURNS TRIGGER AS $$
BEGIN
    IF event_json ->> 'type' = 'someEvent' THEN
        WITH factv_insert AS (
            INSERT INTO factv VALUES (id, ..other columns..)
                fn_createDimId,
                NEW.event_json->>..,
                ...
            RETURNING id
        )
        INSERT INTO facto VALUES (id, ..other columns..)
            (select id from factv_insert),
            NEW.event_json->>..
    ELSE DoSomethingElse...
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;
The function fn_createDimId called here just looks up the dims table, and if these dimensions are not found, they're inserted. If they are already in there, it just gives back the id of those dimensions to use as the id for this fact insert.
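For reference, a minimal sketch of what such a get-or-create function might look like; the question does not show its body, so the signature, the jsonb parameter and the unique constraint on dims(dimensions) are all assumptions:

-- Hypothetical sketch of fn_createDimId: assumes a unique constraint on
-- dims(dimensions) so ON CONFLICT can handle the get-or-create step.
CREATE OR REPLACE FUNCTION fn_createDimId(p_dimensions jsonb) RETURNS integer AS $$
    WITH ins AS (
        INSERT INTO dims (dimensions)
        VALUES (p_dimensions)
        ON CONFLICT (dimensions) DO NOTHING
        RETURNING id
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM dims WHERE dimensions = p_dimensions
    LIMIT 1;
$$ LANGUAGE sql;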
I do have some new events coming through, and I need to grab some information in a way that breaks the rules around INSERT INTO .. VALUES, with ERROR: more than one row returned by a subquery used as an expression.
The event structure is similar to, but not limited to:
{
  "type": "someEvent",
  "instruction": {
    "contains": {
      "id": "containerid",
      "map": {
        "50561:null:null": {
          "productid": "50561",
          "quantity": 3
        },
        "50562:null:null": {
          "productid": "50562",
          "quantity": 8
        },
        "50559:null:null": {
          "productid": "50559",
          "quantity": 5
        }
      }
    },
    "target": {
      "50561": "Random",
      "50562": "Random",
      "50559": "Mix"
    }
  }
}
What is causing the problems here is the information around target and the respective quantities for those ids. From the event shown above, I need to aggregate and insert into the fact table:
target | qty
-------+-----
Random |  11
Mix    |   5
If I were to run a query just to grab this information, I would run the following:
WITH meta_data AS (
    SELECT
        json_object_keys(event_json -> 'instruction' -> 'target') AS prodid,
        event_json -> 'instruction' -> 'target' ->> json_object_keys(event_json -> 'instruction' -> 'target') AS target,
        event_json -> 'instruction' -> 'contains' -> 'map' -> json_object_keys(event_json -> 'instruction' -> 'contains' -> 'map') ->> 'quantity' AS qty
    FROM event_table
)
SELECT
    target,
    sum(qty::int)
FROM meta_data
GROUP BY target;
I am looking for a solution which performs the same logical operations but overcomes the error about multiple returned rows, ideally an iteration for each event which returns more than one row.
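For what it's worth, a sketch of one way to sidestep the error: expand the "map" and "target" objects once each with lateral joins instead of calling a set-returning function inside an expression. This assumes event_json is jsonb (with a plain json column, json_each / json_each_text work the same way):

-- Sketch: join each product in "map" to its entry in "target", then aggregate.
WITH meta_data AS (
    SELECT t.value AS target,
           (m.value ->> 'quantity')::int AS qty
    FROM event_table e
    CROSS JOIN LATERAL jsonb_each(e.event_json -> 'instruction' -> 'contains' -> 'map') AS m(key, value)
    CROSS JOIN LATERAL jsonb_each_text(e.event_json -> 'instruction' -> 'target') AS t(key, value)
    WHERE t.key = m.value ->> 'productid'
)
SELECT target, sum(qty) AS qty
FROM meta_data
GROUP BY target;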

Update selected values in a jsonb column containing an array

Table faults contains a column recacc (jsonb) which holds an array of JSON objects. Each of them contains a field action. If the value of action is abc, I want to change it to cba. The change should be applied to all rows.
[
  {
    "action": "abc",
    "created": 1128154425441
  },
  {
    "action": "lmn",
    "created": 1228154425441
  },
  {
    "action": "xyz",
    "created": 1328154425441
  }
]
The following doesn't work, probably because the data is in array format:
update faults
set recacc = jsonb_set(recacc, '{action}', to_jsonb('cba'::TEXT), false)
where recacc ->> 'action' = 'abc'
I'm not sure if this is the best option, but you may first get the elements of the jsonb array using jsonb_array_elements, replace the value, and then reconstruct the JSON using array_agg and array_to_json.
UPDATE faults SET recacc = new_recacc::jsonb
FROM (
    SELECT array_to_json(array_agg(s)) AS new_recacc
    FROM (
        SELECT
            replace(c ->> 'action', 'abc', 'cba'),  -- this changes the value
            c ->> 'created'
        FROM faults f
        CROSS JOIN LATERAL jsonb_array_elements(f.recacc) AS c
    ) AS s (action, created)
) m;
Demo
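One caveat with the query above: the aggregated new_recacc is built over all rows of faults at once, and the numeric created values come back as text. A sketch of an alternative that keeps each row's own array, assuming faults has a primary key id (hypothetical, it is not shown in the question):

-- Sketch: rebuild each row's array in place, overriding "action" only where it is 'abc'.
UPDATE faults f
SET recacc = sub.new_recacc
FROM (
    SELECT f2.id,
           jsonb_agg(
               CASE WHEN elem ->> 'action' = 'abc'
                    THEN elem || '{"action": "cba"}'::jsonb
                    ELSE elem
               END) AS new_recacc
    FROM faults f2
    CROSS JOIN LATERAL jsonb_array_elements(f2.recacc) AS elem
    GROUP BY f2.id
) sub
WHERE f.id = sub.id;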

Querying Postgres 9.6 JSONB array of objects

I have the following table:
CREATE TABLE trip
(
id SERIAL PRIMARY KEY ,
gps_data_json jsonb NOT NULL
);
The JSON in gps_data_json contains an array of trip objects with the following fields (sample data below):
mode
timestamp
latitude
longitude
I'm trying to get all rows that contain a certain "mode".
SELECT * FROM trip
where gps_data_json ->> 'mode' = 'WALK';
I'm pretty sure I'm using the ->> operator wrong, but I'm unsure how to tell the query that the JSONB field is an array of objects.
Sample data:
INSERT INTO trip (gps_data_json) VALUES
('[
  {
    "latitude": 47.063480377197266,
    "timestamp": 1503056880725,
    "mode": "TRAIN",
    "longitude": 15.450349807739258
  },
  {
    "latitude": 47.06362533569336,
    "timestamp": 1503056882725,
    "mode": "WALK",
    "longitude": 15.450264930725098
  }
]');
INSERT INTO trip (gps_data_json) VALUES
('[
  {
    "latitude": 47.063480377197266,
    "timestamp": 1503056880725,
    "mode": "BUS",
    "longitude": 15.450349807739258
  },
  {
    "latitude": 47.06362533569336,
    "timestamp": 1503056882725,
    "mode": "WALK",
    "longitude": 15.450264930725098
  }
]');
The problem arises because the ->> operator cannot walk through an array:
First unnest your JSON array using the jsonb_array_elements function;
then use the operator for filtering.
The following query does the trick:
WITH
A AS (
    SELECT
        Id,
        jsonb_array_elements(gps_data_json) AS point
    FROM trip
)
SELECT *
FROM A
WHERE (point ->> 'mode') = 'WALK';
Unnesting the array works fine if you only want the objects containing the queried values.
The following checks for containment and returns the full JSONB:
SELECT * FROM trip
WHERE gps_data_json @> '[{"mode": "WALK"}]';
See also Postgresql query array of objects in JSONB field
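A side note on performance: if the containment query above is the common access path, a GIN index on gps_data_json lets @> use an index scan; the jsonb_path_ops operator class is a smaller alternative that supports only containment. A sketch (the index names here are arbitrary):

-- Default GIN operator class: supports the @>, ?, ?| and ?& operators.
CREATE INDEX trip_gps_data_json_idx ON trip USING gin (gps_data_json);

-- Smaller jsonb_path_ops class: supports only containment-style queries.
CREATE INDEX trip_gps_data_json_path_idx ON trip USING gin (gps_data_json jsonb_path_ops);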
select * from
    (select id, jsonb_array_elements(gps_data_json) point from trip where id = 16) t
where point @> '{"mode": "WALK"}';
In my table, id = 16 is there to make sure that this specific row contains a jsonb array ONLY, since the other rows' data is just a JSONB object. So you must filter down to the jsonb-array rows FIRST; otherwise: ERROR: cannot extract elements from an object

Postgresql jsonb traversal

I am very new to the PG jsonb field.
I have, for example, a jsonb field containing the following:
{
  "RootModule": {
    "path": [
      1
    ],
    "tags": {
      "ModuleBase1": {
        "value": 40640,
        "humanstring": "40640"
      },
      "ModuleBase2": {
        "value": 40200,
        "humanstring": "40200"
      }
    },
    "children": {
      "RtuInfoModule": {
        "path": [
          1,
          0
        ],
        "tags": {
          "in0": {
            "value": 11172,
            "humanstring": "11172"
          },
          "in1": {
            "value": 25913,
            "humanstring": "25913"
          }
          etc....
Is there a way to query X levels deep and search the "tags" key for a certain key?
Say I want "ModuleBase2" and "in1" and I want to get their values?
Basically I am looking for a query that will traverse a jsonb field until it finds a key and returns the value without having to know the structure.
In Python or JS a simple loop or recursive function could easily traverse a json object (or dictionary) until it finds a key.
Is there a built-in function in PG to do that?
Ultimately I want to do this in django.
Edit:
I see I can do stuff like
SELECT data.key AS key, data.value AS value
FROM trending_snapshot, jsonb_each(trending_snapshot.snapshot -> 'RootModule') AS data
WHERE key = 'tags';
But I must specify the levels.
You can use a recursive query to flatten a nested jsonb, see this answer. Modify the query to find values for specific keys (add a condition in the where clause):
with recursive flat (id, path, value) as (
    select id, key, value
    from my_table,
         jsonb_each(data)
    union
    select f.id, concat(f.path, '.', j.key), j.value
    from flat f,
         jsonb_each(f.value) j
    where jsonb_typeof(f.value) = 'object'
)
select id, path, value
from flat
where path like any(array['%ModuleBase2.value', '%in1.value']);

 id |                       path                       | value
----+--------------------------------------------------+-------
  1 | RootModule.tags.ModuleBase2.value                | 40200
  1 | RootModule.children.RtuInfoModule.tags.in1.value | 25913
(2 rows)
Test it in SqlFiddle.
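A possible alternative, beyond the answer above: on PostgreSQL 12 and later, a jsonpath expression can traverse to arbitrary depth without a recursive CTE. A sketch, reusing the my_table / data names from the query above:

-- The .** accessor descends recursively; each query returns the matching values.
SELECT id, jsonb_path_query(data, '$.**.ModuleBase2.value') AS value
FROM my_table;

SELECT id, jsonb_path_query(data, '$.**.in1.value') AS value
FROM my_table;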