How does Get Metadata order/sort its output? - azure-data-factory

I have set up a Data Factory pipeline that gets a list of files in Azure Data Lake Storage Gen2 and then iterates over each file using a ForEach loop.
I'm using a Get Metadata activity to produce the list of files, and the argument it outputs is 'Child Items'.
I want to make sure the list (child items) is always sorted in name order. My question is: what is the default sorting method for child items, and can I sort this manually?
Thanks
{
    "name": "GetMetadata",
    "description": "",
    "type": "GetMetadata",
    "dependsOn": [
        {
            "activity": "Execute Previous Activity",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "userProperties": [],
    "typeProperties": {
        "dataset": {
            "referenceName": "Folder",
            "type": "DatasetReference"
        },
        "fieldList": [
            "childItems"
        ]
    }
},

I've implemented the following solution to work around the Get Metadata activity's default sorting order, without using Azure Functions:
1. Get a list of items from the blob storage with the Get Metadata activity.
2. Apply custom filtering (out of scope in the context of your question - just skip it).
3. Apply a Lookup activity that receives the JSON representation of the Get Metadata output from step 1, parses it using a T-SQL stored procedure, and returns it back as a table representation of the input JSON, sorted in a descending manner.
4. Let the ForEach activity iterate through the list from top to bottom, starting with the most recent date folders and moving to the oldest ones.
Below you can find:
- the configuration of the Lookup activity from step 3 (a sketch is shown after the stored procedure below);
- the T-SQL stored procedure that transforms the output of the Get Metadata activity into an input for the ForEach activity.
ALTER PROCEDURE Tech.spSortBlobMetadata
    @Json NVARCHAR(MAX)
    , @SortOrder VARCHAR(5) = 'DESC'
    , @Debug INT = 0
AS
/***************************************************************************
EXEC Tech.spSortBlobMetadata
    '[{"name":"dt=2020-06-17","type":"Folder"},{"name":"dt=2020-06-18"}]'
    , 'DESC'
    , 1
***************************************************************************/
BEGIN
    -- Build a dynamic query so the sort direction can be parameterized
    DECLARE @sqlTransform NVARCHAR(MAX) = 'SELECT *
        FROM OPENJSON(@Json) WITH(name NVARCHAR(200) ''$.name'', type NVARCHAR(50) ''$.type'')
        ORDER BY name ' + @SortOrder

    IF @Debug = 0
    BEGIN
        EXEC sp_executesql
            @sqlTransform
            , N'@Json NVARCHAR(MAX)'
            , @Json = @Json
    END
    ELSE
    BEGIN
        -- Debug mode: just return the generated statement
        SELECT @sqlTransform
    END
END
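The Lookup activity's configuration was originally shown as a screenshot, which isn't reproduced here. Roughly, its query just EXECs the stored procedure and injects the Get Metadata output through ADF string interpolation. The exact expression below is an assumption based on the activity names used above (and it assumes the Lookup activity itself is named 'Lookup' with 'First row only' disabled), not the original screenshot:
-- Query used by the Lookup activity (sketch). The @{...} part is an ADF
-- pipeline expression that injects the childItems array as a JSON string.
EXEC Tech.spSortBlobMetadata
    @Json = N'@{string(activity('GetMetadata').output.childItems)}'
    , @SortOrder = 'DESC';
-- The ForEach activity then iterates over: @activity('Lookup').output.value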

You should probably try to refactor your process to take advantage of one of the real strengths of Azure Data Factory (ADF), which is the ability to process things in parallel. What if you did a DELETE based on the file / date / period instead of a TRUNCATE?
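As a sketch of that DELETE idea (the staging table and the LoadDate column here are hypothetical, just to illustrate):
-- Hypothetical staging table; @LoadDate identifies the file / period being loaded.
-- Deleting only that slice lets several ForEach iterations load different
-- periods in parallel, which a TRUNCATE-then-load pattern cannot do safely.
DELETE FROM dbo.StagingSales
WHERE LoadDate = @LoadDate;   -- passed in from the pipeline per file / period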
I did get a sequential process to work using a Lookup against a database, a query with an ORDER BY clause to sort the output, and a ForEach loop running in sequential mode, but this is counter to the strengths of ADF:

Unfortunately, there is no way to sort the order of childItems. I find this requirement quite strange - what is the scenario in which you need your files sorted?

Related

Delta Lake table update column-wise in parallel

I hope everyone is doing well. I have a long question, so please bear with me.
Context:
So I have CDC payloads coming from the Debezium connector for Yugabyte in the following form:
r"""
{
    "payload": {
        "before": null,
        "after": {
            "id": {
                "value": "MK_1",
                "set": true
            },
            "status": {
                "value": "new_status",
                "set": true
            },
            "status_metadata": {
                "value": "new_status_metadata",
                "set": true
            },
            "creator": {
                "value": "new_creator",
                "set": true
            },
            "created": null,
            "creator_type": null,
            "updater": null,
            "updated": null,
            "updater_type": {
                "value": "new_updater_type",
                "set": true
            }
        },
        "source": {
            "version": "1.7.0.13-BETA",
            "connector": "yugabytedb",
            "name": "dbserver1",
            "ts_ms": -4258763692835,
            "snapshot": "false",
            "db": "yugabyte",
            "sequence": "[\"0:0::0:0\",\"1:338::0:0\"]",
            "schema": "public",
            "table": "customer",
            "txId": "",
            "lsn": "1:338::0:0",
            "xmin": null
        },
        "op": "u",
        "ts_ms": 1669795724779,
        "transaction": null
    }
}
"""
The payload consists of before and after fields. As visible from op: "u", this is an update operation, so a row in the Yugabyte customer table with id MK_1 was updated with new values. However, the after field only shows the columns whose values were updated: fields in "after" that are null were not updated (e.g. created is null, so it was not touched), whereas status is {"value": "new_status", "set": true}, which means the status column was updated to the new value "new_status". I have a PySpark Structured Streaming pipeline which takes in these payloads, processes them, and then builds a micro-batch data frame of the following form:
id | set_id | status | set_status | status_metadata | set_status_metadata | creator | set_creator | created | creator_type | set_created_type | updater | set_updater | updated | set_updated | updater_type | set_updater_type
The "set_column" is either true or false depending on the payload.
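For illustration only, here is a rough Spark SQL sketch of how such a frame could be derived once the Debezium JSON has been parsed into a struct column named payload (the view name, the COALESCE defaults and the exact field access are assumptions, not the asker's actual pipeline code):
SELECT
    payload.after.id.`value`                              AS id,
    payload.after.status.`value`                          AS status,
    COALESCE(payload.after.status.`set`, false)           AS set_status,
    payload.after.status_metadata.`value`                 AS status_metadata,
    COALESCE(payload.after.status_metadata.`set`, false)  AS set_status_metadata,
    -- ...the remaining columns follow the same value / set pattern
    payload.op                                            AS op
FROM parsed_cdc_stream;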
Problem:
Now I have a delta table on delta lake with the following schema:
id | status | status_metadata | creator | created | created_type | updater | updated | updater_type
And I am using the following code to update the above delta table using the python delta lake API (v 2.2.0):
for column in fields_map:
    delta_lake_merger.whenMatchedUpdate(
        condition=f"update_table.op = 'u' AND update_table.set_{column} = 'true'",
        set={column: fields_map[column]}
    ).execute()
Now you might be wondering why I am doing the update column-wise rather than for all columns at once. This is exactly the problem that I am facing. If I update all of the columns at once without the set_{column} = 'true' condition, then it will overwrite the entire state of the rows for the matching id in the delta table. This is not what I want.
What do I want?
I only want to update those columns from the payload whose values are not null in the payload. If I update all columns at once like this:
delta_lake_merger.whenMatchedUpdate(
    condition="update_table.op = 'u'",
    set=fields_map
).execute()
Then the Delta Lake API will also overwrite the columns that were not updated with nulls, since that is their value in the CDC payload. The iterative solution above works because the column-wise update simply skips, for each column, the rows whose set_column is false and therefore keeps the existing value in the delta table.
However, this is slow, since it writes the data N times sequentially, which bottlenecks my streaming query. Since all of the column-wise updates are independent, is there any way in the Delta Lake Python API to update all of the columns at once while still applying the set_column condition? I suspect there is, because each of these is just an independent conditional write per column. I want to call execute once for all columns with the set condition rather than putting it in a loop.
PS: I was thinking of using Python's asyncio library, but I'm not so sure. Thank you so much.
I have been able to find a solution in case someone is stuck on a similar problem: you can use a CASE WHEN expression in the set field of whenMatchedUpdate, building one expression per column:
delta_lake_merger.whenMatchedUpdate(
    set={column: f"CASE WHEN update_table.set_{column} = 'true' THEN update_table.{column} ELSE main_table.{column} END"
         for column in fields_map}
).execute()
This will execute the update for all of the columns at once with the set condition.
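For reference, the builder call above corresponds to a single Spark SQL MERGE of roughly this shape (a sketch only: main_table and update_table follow the question's naming, the merge key id is assumed, and the column list is abbreviated):
MERGE INTO main_table
USING update_table
ON main_table.id = update_table.id
WHEN MATCHED AND update_table.op = 'u' THEN UPDATE SET
    status = CASE WHEN update_table.set_status = 'true'
                  THEN update_table.status ELSE main_table.status END,
    status_metadata = CASE WHEN update_table.set_status_metadata = 'true'
                  THEN update_table.status_metadata ELSE main_table.status_metadata END
    -- ...and so on for the remaining columns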

PostgreSQL update a jsonb column multiple times

Consider the following:
create table query(id integer, query_definition jsonb);
create table query_item(path text[], id integer);
insert into query (id, query_definition)
values
(100, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"},{"type":"str","field":"lastname"}]}'::jsonb),
(101, '{"columns":[{"type":"integer","field":"id"},{"type":"str","field":"firstname"}]}'::jsonb);
insert into query_item(path, id) values
('{columns,0,type}'::text[], 100),
('{columns,1,type}'::text[], 100),
('{columns,2,type}'::text[], 100),
('{columns,0,type}'::text[], 101),
('{columns,1,type}'::text[], 101);
I have a query table which has a jsonb column named query_definition.
The jsonb value looks like the following:
{
    "columns": [
        {
            "type": "integer",
            "field": "id"
        },
        {
            "type": "str",
            "field": "firstname"
        },
        {
            "type": "str",
            "field": "lastname"
        }
    ]
}
In order to replace all "type": "..." with "type": "string", I've built the query_item table which contains the following data:
path |id |
----------------+---+
{columns,0,type}|100|
{columns,1,type}|100|
{columns,2,type}|100|
{columns,0,type}|101|
{columns,1,type}|101|
path matches each path from the JSON root to the "type" entry; id is the corresponding query's id.
I made up the following SQL statement to do what I want:
update query q
set query_definition = jsonb_set(q.query_definition, query_item.path, ('"string"')::jsonb, false)
from query_item
where q.id = query_item.id
But it only partially works: it applies just the first matching query_item row for each id and skips the others (only the 1st and 4th rows of the query_item table are applied).
I know I could build a FOR statement, but it requires a plpgsql context and I'd rather avoid its use.
Is there a way to do it with a single update statement?
I've read in this topic that it's possible to do it with strings, but I didn't find out how to adapt that mechanism to jsonb.
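For what it's worth, the reason the statement above applies only one path per row is that UPDATE ... FROM uses just one of the joined query_item rows when several match the same query row. One possible direction (a sketch only, not a tested answer): since every path in query_item points at a "type" key directly under "columns", the array can be rebuilt in a single pass without the per-path table:
UPDATE query q
SET query_definition = jsonb_set(
        q.query_definition,
        '{columns}',
        (SELECT jsonb_agg(jsonb_set(col, '{type}', '"string"'::jsonb))
         FROM jsonb_array_elements(q.query_definition->'columns') AS col)
    )
WHERE q.id IN (SELECT id FROM query_item);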

Map Nested Array to Sink Columns

I'm currently working with a survey API to retrieve results and store them in our data warehouse (SQL database). The results are returned as a JSON object, which includes an array ("submissions") containing each individual's responses. An individual submission contains an array ("answers") with each answer to the questions in the survey.
I would like each submission to be one row in one table.
I will provide some very simple data examples and am just looking for a general way to approach this problem. I certainly am not looking for an entire solution.
The API returns a response like this:
{
    "surveyName": "Sample Survey",
    "count": 2,
    "submissions": [
        {
            "id": 1,
            "created": "2021-01-01T12:00:00.000Z",
            "answers": [
                {
                    "question_id": 1,
                    "answer": "Yes"
                },
                {
                    "question_id": 2,
                    "answer": 5
                }
            ]
        },
        {
            "id": 2,
            "created": "2021-01-02T12:00:00.000Z",
            "answers": [
                {
                    "question_id": 1,
                    "answer": "No"
                },
                {
                    "question_id": 2,
                    "answer": 4
                }
            ]
        }
    ]
}
Essentially, I want to add a row into a SQL table where the columns are: id, created, answer1, answer2. Within the Sink tab of the Copy Data activity, I cannot figure out how to essentially say, "If question_id = 1, map the answer to column answer1. If question_id = 2, map the answer to column answer2."
Will I likely have to use a Data Flow to handle this sort of mapping? If so, can you think of the general steps included in that type of flow?
For those looking for a similar solution, I'll post the general idea of how I solved this problem, thanks to the suggestion from @mark-kromer-msft.
First of all, the portion of my pipeline where I obtained the JSON files is not included. For that, I had to use an Until loop to paginate through this particular endpoint in order to obtain all submission results. I used a Copy Data activity to create JSON files in blob storage for each page. After that, I created a Data Flow.
I had to first flatten the "submissions" array in order to separate each submission into a separate row. I then used Derived Column to pull out each answer to a separate column. Here's what that looks like:
Here's one example of an Expression:
find(submissions.answers, equals(#item.question_id, '1')).answer
Finally, I just had to create the mapping in the last step (Sink) in order to map my derived columns.
An alternate approach would be to use the native JSON abilities of Azure SQL DB. Use a Stored Proc task, pass the JSON in as a parameter and shred it in the database using OPENJSON:
-- Submission level
-- INSERT INTO yourTable ( ...
SELECT
    s.surveyName,
    s.xcount,
    s.submissions
FROM OPENJSON( @json )
WITH (
    surveyName  VARCHAR(50)   '$.surveyName',
    xcount      INT           '$.count',
    submissions NVARCHAR(MAX) AS JSON
) s
CROSS APPLY OPENJSON( s.submissions ) so;

-- Question level, additional CROSS APPLY and JSON_VALUEs required
-- INSERT INTO yourTable ( ...
SELECT
    'b' s,
    s.surveyName,
    s.xcount,
    --s.submissions,
    JSON_VALUE ( so.[value], '$.id' ) AS id,
    JSON_VALUE ( so.[value], '$.created' ) AS created,
    JSON_VALUE ( a.[value], '$.question_id' ) AS question_id,
    JSON_VALUE ( a.[value], '$.answer' ) AS answer
FROM OPENJSON( @json )
WITH (
    surveyName  VARCHAR(50)   '$.surveyName',
    xcount      INT           '$.count',
    submissions NVARCHAR(MAX) AS JSON
) s
CROSS APPLY OPENJSON( s.submissions ) so
CROSS APPLY OPENJSON( so.[value], '$.answers' ) a;
Results at submission and question level:
Full script with sample JSON here.

Update selected values in a jsonb column containing a array

The table faults contains a column recacc (jsonb) which holds an array of JSON objects. Each of them contains a field action. If the value of action is abc, I want to change it to cba. The change should be applied to all rows.
[
    {
        "action": "abc",
        "created": 1128154425441
    },
    {
        "action": "lmn",
        "created": 1228154425441
    },
    {
        "action": "xyz",
        "created": 1328154425441
    }
]
The following doesn't work, probably because of the data being in array format
update faults
set recacc = jsonb_set(recacc, '{action}', to_jsonb('cba'::text), false)
where recacc ->> 'action' = 'abc'
I'm not sure if this is the best option, but you may first get the elements of the jsonb array using jsonb_array_elements, replace the value, and then reconstruct the JSON using array_agg and array_to_json.
UPDATE faults SET recacc = new_recacc::jsonb
FROM
    (SELECT array_to_json(array_agg(s)) AS new_recacc
     FROM
         (SELECT
              replace(c->>'action', 'abc', 'cba'),  -- this changes the value
              c->>'created'
          FROM faults f
          CROSS JOIN LATERAL jsonb_array_elements(f.recacc) AS c
         ) AS s (action, created)
    ) m;
Demo

postgres - syntax for updating a jsonb array

I'm struggling to find the right syntax for updating an array in a jsonb column in postgres 9.6.6
Given a column "comments", with this example:
[
{
"Comment": "A",
"LastModified": "1527579949"
},
{
"Comment": "B",
"LastModified": "1528579949"
},
{
"Comment": "C",
"LastModified": "1529579949"
}
]
Suppose I wanted to append Z to each comment (giving AZ, BZ, CZ).
I know I need to use something like jsonb_set(comments, '{"Comment"}',
Any hints on finishing this off?
Thanks.
Try:
UPDATE elbat
SET comments = array_to_json(ARRAY(SELECT jsonb_set(x.original_comment,
'{Comment}',
concat('"',
x.original_comment->>'Comment',
'Z"')::jsonb)
FROM (SELECT jsonb_array_elements(elbat.comments) original_comment) x))::jsonb;
It uses jsonb_array_elements() to get the array elements as a set, applies the changes to them using jsonb_set(), then transforms this back to an array and to JSON with array_to_json().
But that's an awful lot of work. OK, maybe there is a more elegant solution that I didn't find. But since your JSON seems to have a fixed schema anyway, I'd recommend a redesign to do it the relational way: a simple table for the comments plus a linking table for the objects the comments are on. The change would have been very easy in such a model.
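A rough sketch of that relational shape (all table and column names here are invented for illustration):
-- Comments live in their own table...
CREATE TABLE comment (
    comment_id    bigserial PRIMARY KEY,
    comment_text  text NOT NULL,
    last_modified timestamptz NOT NULL DEFAULT now()
);

-- ...and a linking table ties them to whatever object they belong to.
CREATE TABLE object_comment (
    object_id  bigint NOT NULL,
    comment_id bigint NOT NULL REFERENCES comment (comment_id),
    PRIMARY KEY (object_id, comment_id)
);

-- Appending 'Z' to every comment then becomes a plain UPDATE:
UPDATE comment SET comment_text = comment_text || 'Z';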
Find a query returning the expected result:
select jsonb_agg(value || jsonb_build_object('Comment', value->>'Comment' || 'Z'))
from my_table
cross join jsonb_array_elements(comments);
jsonb_agg
-----------------------------------------------------------------------------------------------------------------------------------------------------
[{"Comment": "AZ", "LastModified": "1527579949"}, {"Comment": "BZ", "LastModified": "1528579949"}, {"Comment": "CZ", "LastModified": "1529579949"}]
(1 row)
Create a simple SQL function based on the above query:
create or replace function update_comments(jsonb)
returns jsonb language sql as $$
select jsonb_agg(value || jsonb_build_object('Comment', value->>'Comment' || 'Z'))
from jsonb_array_elements($1)
$$;
Use the function:
update my_table
set comments = update_comments(comments);
DbFiddle.