Delta Lake table update column-wise in parallel - pyspark

I hope everyone is doing well. I have a long question, so please bear with me.
Context:
So I have CDC payloads coming from the Debezium connector for Yugabyte in the following form:
r"""
{
  "payload": {
    "before": null,
    "after": {
      "id": {
        "value": "MK_1",
        "set": true
      },
      "status": {
        "value": "new_status",
        "set": true
      },
      "status_metadata": {
        "value": "new_status_metadata",
        "set": true
      },
      "creator": {
        "value": "new_creator",
        "set": true
      },
      "created": null,
      "creator_type": null,
      "updater": null,
      "updated": null,
      "updater_type": {
        "value": "new_updater_type",
        "set": true
      }
    },
    "source": {
      "version": "1.7.0.13-BETA",
      "connector": "yugabytedb",
      "name": "dbserver1",
      "ts_ms": -4258763692835,
      "snapshot": "false",
      "db": "yugabyte",
      "sequence": "[\"0:0::0:0\",\"1:338::0:0\"]",
      "schema": "public",
      "table": "customer",
      "txId": "",
      "lsn": "1:338::0:0",
      "xmin": null
    },
    "op": "u",
    "ts_ms": 1669795724779,
    "transaction": null
  }
}
"""
The payload consists of before and after fields. As visible from `op: "u"`, this is an update operation: a row in the Yugabyte table called customer with id MK_1 was updated with new values. However, the after field only carries those columns whose values were actually updated. The fields in "after" that are null were not updated, e.g. created is null and therefore was not touched, whereas status is {"value": "new_status", "set": true}, which means the status column was updated to the new value "new_status".
Now I have a PySpark Structured Streaming pipeline which takes in these payloads, processes them, and then builds a micro-dataframe of the following form:
id | set_id | status | set_status | status_metadata | set_status_metadata | creator | set_creator | created | set_created | creator_type | set_creator_type | updater | set_updater | updated | set_updated | updater_type | set_updater_type
The "set_column" is either true or false depending on the payload.
Problem:
Now I have a delta table on delta lake with the following schema:
id | status | status_metadata | creator | created | creator_type | updater | updated | updater_type
And I am using the following code to update the above delta table using the Python Delta Lake API (v2.2.0):
for column in fields_map:
    delta_lake_merger.whenMatchedUpdate(
        condition=f"update_table.op = 'u' AND update_table.set_{column} = 'true'",
        set={column: fields_map[column]}
    ).execute()
Now you might be wondering why I am doing the update column-wise rather than for all columns at once. This is exactly the problem I am facing: if I update all of the columns at once, without the set_column = 'true' condition, the merge overwrites the entire state of the matching rows in the delta table. This is not what I want.
What do I want?
I only want to update those columns whose values are not null in the payload. If I update all columns at once like this:
delta_lake_merger.whenMatchedUpdate(
    condition="update_table.op = 'u'",
    set=fields_map
).execute()
Then the Delta Lake API will also overwrite the columns that were not updated with nulls, since null is the value of the non-updated columns in the CDC payload. The iterative solution above works because each column-wise update simply skips the rows whose set_column is false for that column, and therefore keeps the existing value in the delta table.
However, this is slow since it writes the data N times sequentially, which bottlenecks my streaming query. Since all of the column-wise updates are independent, is there any way in the Delta Lake Python API to update all of the columns at once, but still with the set_column condition? I suspect there is, because each iteration is just an independent call to write data for one column with its condition. I want to call execute once for all columns with the set condition rather than putting it in a loop.
PS: I was thinking of using Python's asyncio library, but I am not so sure. Thank you so much.

I have been able to find a solution. If someone is stuck on a similar problem, you can use a CASE WHEN expression in the set field of whenMatchedUpdate:
delta_lake_merger.whenMatchedUpdate(
    set={column: f"CASE WHEN update_table.set_{column} = 'true' "
                 f"THEN update_table.{column} ELSE main_table.{column} END"
         for column in fields_map}
).execute()
This will execute the update for all of the columns at once with the set condition.
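For completeness, a sketch of how this single merge can replace the loop inside a foreachBatch handler. Here delta_table_path, fields_map, spark and the main_table/update_table aliases are assumptions carried over from the question, not the actual code:

from delta.tables import DeltaTable

def upsert_cdc_batch(micro_batch_df, batch_id):
    # assumed: delta_table_path points at the target Delta table and
    # fields_map lists the target columns, as in the question above
    main = DeltaTable.forPath(spark, delta_table_path)
    # one CASE WHEN per column: take the new value only when set_<column> is true
    set_expr = {
        column: f"CASE WHEN update_table.set_{column} = 'true' "
                f"THEN update_table.{column} ELSE main_table.{column} END"
        for column in fields_map
    }
    (main.alias("main_table")
         .merge(micro_batch_df.alias("update_table"), "main_table.id = update_table.id")
         .whenMatchedUpdate(condition="update_table.op = 'u'", set=set_expr)
         .execute())  # a single write for all columns instead of N sequential writes

# wired into the streaming query, e.g.:
# update_table_df.writeStream.foreachBatch(upsert_cdc_batch).start()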

Related

PostgreSQL update a jsonb column multiple times

Consider the following:
create table query(id integer, query_definition jsonb);
create table query_item(path text[], id integer);
insert into query (id, query_definition)
values
(100, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"},{"type":"str","field":"lastname"}]}'::jsonb),
(101, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"}]}'::jsonb);
insert into query_item(path, id) values
('{columns,0,type}'::text[], 100),
('{columns,1,type}'::text[], 100),
('{columns,2,type}'::text[], 100),
('{columns,0,type}'::text[], 101),
('{columns,1,type}'::text[], 101);
I have a query table which has a jsonb column named query_definition.
The jsonb value looks like the following:
{
  "columns": [
    {
      "type": "integer",
      "field": "id"
    },
    {
      "type": "str",
      "field": "firstname"
    },
    {
      "type": "str",
      "field": "lastname"
    }
  ]
}
In order to replace all "type": "..." with "type": "string", I've built the query_item table which contains the following data:
path |id |
----------------+---+
{columns,0,type}|100|
{columns,1,type}|100|
{columns,2,type}|100|
{columns,0,type}|101|
{columns,1,type}|101|
path matches each path from the json root to the "type" entry, id is the corresponding query's id.
I made up the following SQL statement to do what I want:
update query q
set query_definition = jsonb_set(q.query_definition, query_item.path, ('"string"')::jsonb, false)
from query_item
where q.id = query_item.id
But it only partially works: UPDATE ... FROM applies at most one matching query_item row per query row, so it takes the first matching path and skips the others (only the 1st and 4th lines of the query_item table take effect).
I know I could build a for statement, but it requires a plpgsql context and I'd rather avoid its use.
Is there a way to do it with a single update statement?
I've read in this topic that it is possible to do this with strings, but I could not work out how to adapt that mechanism to jsonb.
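No answer is included here, so just a hedged sketch: UPDATE ... FROM applies at most one joined row per target row, which is why the repeated jsonb_set calls have to be folded into a single expression. For this particular data, where every element of the columns array is supposed to end up with "type": "string", the per-path rows can be collapsed into one rewrite of the whole array (the jsonb_agg rewrite below is an assumption about the desired end state, not a general per-path solution):

update query q
set query_definition = jsonb_set(
        q.query_definition,
        '{columns}',
        (select jsonb_agg(col || '{"type": "string"}')   -- overwrite "type" in every element
         from jsonb_array_elements(q.query_definition -> 'columns') as col)
    )
where q.id in (select id from query_item);               -- only queries referenced by query_item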

Creating table views with columns from nested value of a jsonb column

I have a jsonb column in one of the tables that has the following structure:
{
  "3424": {
    "status": "pending",
    "remarks": "sample here"
  },
  "6436": {
    "status": "new",
    "remarks": "sample here"
  },
  "9768": {
    "status": "cancelled",
    "remarks": null,
    "by": "customer"
  }
}
I am trying to create a view that puts the statuses into individual columns, with the corresponding key as each column's value:
pending | new | cancelled | accepted | id | transaction
3424 | 6436 | 9768 | null | 1 | testing
The problem is the key is dynamic (numeric and corresponds to some id) so I cannot pinpoint the exact key to use the functions/operations stated here: https://www.postgresql.org/docs/9.5/functions-json.html
I've read about jsonb_path_query and was able to extract the statuses without needing to know the key, but I cannot combine it with the integer key yet.
select mt.id, mt.transaction, hstatus from mytable mt
cross join lateral jsonb_path_query(mt.hist, '$.**.status') hstatus
where mt.id = <id>
but this returns the statuses as rows for now. I'm pretty noob with (postgre)sql so I've only gotten this far.
You can indeed use a PATH query. Unfortunately it is not possible to access the "parent" inside a jsonpath in Postgres. But you can work around that by expanding the whole value into a list of key/value pairs, so that the id value you are after can be accessed through .key:
select jsonb_path_query_first(the_column, '$.keyvalue() ? (#.value.status == "pending").key') #>> '{}' as pending,
jsonb_path_query_first(the_column, '$.keyvalue() ? (#.value.status == "new").key') #>> '{}' as new,
jsonb_path_query_first(the_column, '$.keyvalue() ? (#.value.status == "cancelled").key') #>> '{}' as cancelled,
jsonb_path_query_first(the_column, '$.keyvalue() ? (#.value.status == "accepted").key') #>> '{}' as accepted,
id,
"transaction"
from the_table
The jsonpath function $.keyvalue() returns something like this:
{"id": 0, "key": "3424", "value": {"status": "pending", "remarks": "sample here"}}
This is then used to pick an element through a condition on #.value.status, and the accessor .key then returns the corresponding key value (e.g. 3424).
The #>> '{}' is a hack to convert the returned jsonb value into a proper text value (otherwise the result would be e.g. "3424" instead of just 3424).

Update jsonb array value stored as text in Postgres

Postgres Version: 9.5.0
I have a database table where one of the columns is stored as text that represents a JSON value. The JSON value is an array of dictionaries, e.g.:
[{"picture": "XXX", "image_hash": null, "name": "test", "video": null, "link": "http://www.google.com", "table_id": 356}, ..]
I am trying to update the value associated to the key table_id only for the 1st array element. Here is the query I ran:
update table1 set "json_column" = jsonb_set("json_column", "{0, table_id}", null, false) where id = 1;
I keep running into the error - ERROR: column "{0, table_id}" does not exist
Could someone please help me understand how this can be fixed?
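No answer is included here, but the error itself is a quoting issue: double quotes make "{0, table_id}" an identifier (hence "column ... does not exist"), while jsonb_set expects the path as a single-quoted text[] literal. A hedged sketch, assuming the text column really holds valid JSON and that a JSON null is the intended new value:

update table1
set json_column = jsonb_set(
        json_column::jsonb,      -- the column is stored as text, so cast it to jsonb first
        '{0,table_id}',          -- path as a single-quoted text[] literal
        'null'::jsonb,           -- JSON null; a bare SQL NULL would make jsonb_set return NULL
        false
    )::text                      -- cast back to text to match the column's type
where id = 1;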

How can I do less than / greater than comparisons on JSON array object fields in Postgres, with good performance?

I want to retrieve data by an operation on a specific field; the column stores an array of objects. I also want to add new objects to it.
CREATE TABLE justjson ( id INTEGER, doc JSONB);
INSERT INTO justjson VALUES ( 1, '[
  {
    "name": "abc",
    "age": "22"
  },
  {
    "name": "def",
    "age": "23"
  }
]');
How can I retrieve the elements where age is greater than or equal to 23?
I have a solution for this, but it decreases query performance too much.
My solution uses jsonb_array_elements:
t=# with a as (select *,jsonb_array_elements(doc) k from justjson)
select k from a where (k->>'age')::int >= 23;
k
------------------------------
{"age": "23", "name": "def"}
(1 row)
I need a solution or another approach that does this with better performance.
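No answer is included for this one either, so only a hedged sketch. On PostgreSQL 12 or newer (not the old versions shown elsewhere on this page), jsonb_path_exists can filter rows directly, and the jsonpath .double() method deals with age being stored as a string:

-- PostgreSQL 12+: keep only rows whose array contains an element with age >= 23
select id, doc
from justjson
where jsonb_path_exists(doc, '$[*] ? (@.age.double() >= 23)');

Note that this function call is not index-assisted, and GIN indexes on jsonb only accelerate containment/equality-style operators such as @>, so a range predicate like this still inspects each candidate row; with a fixed document shape, an expression index on the extracted value is the usual alternative.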

Postgresql jsonb traversal

I am very new to the PG jsonb field.
I have for example a jsonb field containing the following
{
  "RootModule": {
    "path": [
      1
    ],
    "tags": {
      "ModuleBase1": {
        "value": 40640,
        "humanstring": "40640"
      },
      "ModuleBase2": {
        "value": 40200,
        "humanstring": "40200"
      }
    },
    "children": {
      "RtuInfoModule": {
        "path": [
          1,
          0
        ],
        "tags": {
          "in0": {
            "value": 11172,
            "humanstring": "11172"
          },
          "in1": {
            "value": 25913,
            "humanstring": "25913"
          }
etc....
Is there a way to query X levels deep and search the "tags" key for a certain key?
Say I want "ModuleBase2" and "in1" and I want to get their values?
Basically I am looking for a query that will traverse a jsonb field until it finds a key and returns the value without having to know the structure.
In Python or JS a simple loop or recursive function could easily traverse a json object (or dictionary) until it finds a key.
Is there a built-in function in PG to do that?
Ultimately I want to do this in django.
Edit:
I see I can do stuff like
SELECT data.key AS key, data.value as value
FROM trending_snapshot, jsonb_each(trending_snapshot.snapshot -> 'RootModule') AS data
WHERE key = 'tags';
But I must specify the levels.
You can use a recursive query to flatten a nested jsonb, see this answer. Modify the query to find values for specific keys (add a condition in the where clause):
with recursive flat (id, path, value) as (
select id, key, value
from my_table,
jsonb_each(data)
union
select f.id, concat(f.path, '.', j.key), j.value
from flat f,
jsonb_each(f.value) j
where jsonb_typeof(f.value) = 'object'
)
select id, path, value
from flat
where path like any(array['%ModuleBase2.value', '%in1.value']);
id | path | value
----+--------------------------------------------------+-------
1 | RootModule.tags.ModuleBase2.value | 40200
1 | RootModule.children.RtuInfoModule.tags.in1.value | 25913
(2 rows)
Test it in SqlFiddle.