Map Nested Array to Sink Columns - azure-data-factory

I'm currently working with a survey API to retrieve results and store them in our data warehouse (a SQL database). The results are returned as a JSON object, which includes an array ("submissions") containing each individual's responses. Each submission contains an array ("answers") with that person's answer to each question in the survey.
I would like each submission to be one row in one table.
I will provide some very simple data examples and am just looking for a general way to approach this problem. I certainly am not looking for an entire solution.
The API returns a response like this:
{
  "surveyName": "Sample Survey",
  "count": 2,
  "submissions": [
    {
      "id": 1,
      "created": "2021-01-01T12:00:00.000Z",
      "answers": [
        {
          "question_id": 1,
          "answer": "Yes"
        },
        {
          "question_id": 2,
          "answer": 5
        }
      ]
    },
    {
      "id": 2,
      "created": "2021-01-02T12:00:00.000Z",
      "answers": [
        {
          "question_id": 1,
          "answer": "No"
        },
        {
          "question_id": 2,
          "answer": 4
        }
      ]
    }
  ]
}
Essentially, I want to add a row into a SQL table where the columns are: id, created, answer1, answer2. Within the Sink tab of the Copy Data activity, I cannot figure out how to essentially say, "If question_id = 1, map the answer to column answer1. If question_id = 2, map the answer to column answer2."
Will I likely have to use a Data Flow to handle this sort of mapping? If so, can you think of the general steps included in that type of flow?

For those looking for a similar solution, I'll post the general idea of how I solved this problem, thanks to the suggestion from @mark-kromer-msft.
First of all, the portion of my pipeline where I obtained the JSON files is not included. For that, I had to use an Until loop to paginate through this particular endpoint in order to obtain all submission results. I used a Copy Data activity to create JSON files in blob storage for each page. After that, I created a Data Flow.
I first had to use a Flatten transformation on the "submissions" array to separate each submission into its own row. I then used a Derived Column transformation to pull each answer out into a separate column. Here's one example of an expression used for a derived column:
find(submissions.answers, equals(#item.question_id, '1')).answer
Finally, in the last step (the Sink), I created the mapping from my derived columns to the table's columns.

An alternate approach would be to use the native JSON abilities of Azure SQL DB. Use a Stored Procedure activity, pass the JSON in as a parameter, and shred it in the database using OPENJSON:
-- Submission level
-- INSERT INTO yourTable ( ...
SELECT
    s.surveyName,
    s.xcount,
    s.submissions
FROM OPENJSON( @json )
    WITH (
        surveyName  VARCHAR(50)   '$.surveyName',
        xcount      INT           '$.count',
        submissions NVARCHAR(MAX) AS JSON
    ) s
    CROSS APPLY OPENJSON( s.submissions ) so;
-- Question level, additional CROSS APPLY and JSON_VALUEs required
-- INSERT INTO yourTable ( ...
SELECT
    'b' s,
    s.surveyName,
    s.xcount,
    --s.submissions,
    JSON_VALUE ( so.[value], '$.id' )          AS id,
    JSON_VALUE ( so.[value], '$.created' )     AS created,
    JSON_VALUE ( a.[value], '$.question_id' )  AS question_id,
    JSON_VALUE ( a.[value], '$.answer' )       AS answer
FROM OPENJSON( @json )
    WITH (
        surveyName  VARCHAR(50)   '$.surveyName',
        xcount      INT           '$.count',
        submissions NVARCHAR(MAX) AS JSON
    ) s
    CROSS APPLY OPENJSON( s.submissions ) so
    CROSS APPLY OPENJSON( so.[value], '$.answers' ) a;
Results at submission and question level:
Full script with sample JSON here.
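If you need the one-row-per-submission shape from the original question (id, created, answer1, answer2), the question-level rows can be pivoted with conditional aggregation. This is only a sketch, not part of the original answer, and it assumes @json holds the full API response and that question_id values are limited to 1 and 2:
-- Sketch: pivot question-level rows into answer1/answer2 columns.
-- Assumes question_id values are only 1 and 2.
SELECT
    JSON_VALUE ( so.[value], '$.id' )      AS id,
    JSON_VALUE ( so.[value], '$.created' ) AS created,
    MAX( CASE WHEN JSON_VALUE( a.[value], '$.question_id' ) = '1'
              THEN JSON_VALUE( a.[value], '$.answer' ) END ) AS answer1,
    MAX( CASE WHEN JSON_VALUE( a.[value], '$.question_id' ) = '2'
              THEN JSON_VALUE( a.[value], '$.answer' ) END ) AS answer2
FROM OPENJSON( @json, '$.submissions' ) so
    CROSS APPLY OPENJSON( so.[value], '$.answers' ) a
GROUP BY JSON_VALUE( so.[value], '$.id' ), JSON_VALUE( so.[value], '$.created' );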

Related

How does Get Metadata order/sort its output?

I have set up a Data Factory pipeline that gets a list of files in Azure Data Lake Storage Gen2 and then iterates over each file using a ForEach loop.
I'm using a Get Metadata activity to produce the list of files, and the argument it outputs is 'Child Items'.
I want to make sure the list (child items) is always sorted in name order. My question is: what is the default sorting method for child items, or can I sort this manually?
Thanks
"name": "GetMetadata",
"description": "",
"type": "GetMetadata",
"dependsOn": [
{
"activity": "Execute Previous Activity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "Folder",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
I've implemented the following solution to overcome the problem with Get Metadata's default sorting order, without the use of Azure Functions:
1. Get a list of items from the blob storage with a Get Metadata activity.
2. Apply custom filtering (out of scope in your question's context - just skip it).
3. Apply a Lookup activity that receives the JSON output of the Get Metadata activity from step 1, parses it using a T-SQL stored procedure, and returns it back as a table representation of the input JSON, sorted in descending order.
4. Have the ForEach activity iterate through the list from top to bottom, starting with the most recent date folders and moving to the oldest ones.
Below you can find the configuration of the Lookup activity from step 3 and the T-SQL stored procedure that transforms the output of the Get Metadata activity into an input for the ForEach activity.
ALTER PROCEDURE Tech.spSortBlobMetadata
    @Json NVARCHAR(MAX)
    , @SortOrder VARCHAR(5) = 'DESC'
    , @Debug INT = 0
AS
/***************************************************************************
EXEC Tech.spSortBlobMetadata
    '[{"name":"dt=2020-06-17","type":"Folder"},{"name":"dt=2020-06-18"}]'
    , 'DESC'
    , 1
***************************************************************************/
BEGIN
    DECLARE
        @sqlTransform NVARCHAR(MAX) = 'SELECT *
            FROM OPENJSON(@Json) WITH(name NVARCHAR(200) ''$.name'', type NVARCHAR(50) ''$.type'')
            ORDER BY name ' + @SortOrder

    IF @Debug = 0
    BEGIN
        EXEC sp_executesql
            @sqlTransform
            , N'@Json NVARCHAR(MAX)'
            , @Json = @Json
    END
    ELSE
    BEGIN
        SELECT @sqlTransform
    END
END
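As a rough usage sketch (not from the original post), the Lookup activity's query can simply call the procedure with the childItems JSON produced by Get Metadata, passing 'ASC' if you want plain name order:
-- Hypothetical call; the JSON literal stands in for the Get Metadata childItems output.
EXEC Tech.spSortBlobMetadata
    '[{"name":"dt=2020-06-18","type":"Folder"},{"name":"dt=2020-06-17","type":"Folder"}]'
    , 'ASC'
    , 0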
You should probably try to refactor your process to take advantage of one of the real strengths of Azure Data Factory (ADF), which is the ability to process things in parallel. What if you did a DELETE based on the file / date / period instead of a TRUNCATE?
I did get a sequential process to work using a Lookup to a database, a query with an ORDER BY clause to sort the output, and a ForEach loop running in sequential mode, but this is counter to the strengths of ADF.
Unfortunately, there is no way to sort the order of childItems. I find this requirement quite strange; what is the scenario in which you need your files sorted?

Writing a rather obtuse JSON query using Slick

I am looking to translate an SQL query (Postgres) into Scala Slick code for use in my Play application.
The data looks something like this:
parent_id | json_column
----------+-----------------------------------------
| [ {"id": "abcde-12345", "data": "..."}
2 | , {"id": "67890-fghij", "data": "..."}
| , {"id": "klmno-00000", "data": "..."} ]
Here's my query in PostgreSQL:
SELECT * FROM table1
WHERE id IN (
SELECT id
FROM
table1 t1,
json_array_elements(t1.json_column) e,
json_to_record(e.value) AS r("id" text, data text)
WHERE
"id" = 'abcde-12345'
AND t1.parent_id = 2
);
This finds the results I need: any rows in t1 that include an object in the json_column array with the id "abcde-12345". The parent_id and id will be passed into this query via query parameters (both Strings).
How would I write this query in Scala using Slick?
The easiest (maybe laziest?) way is probably to just use plain SQL:
sql"""[query]""".as[ (type1,type2..) ]
using the $var notation for the variables.
Otherwise you can use SimpleFunction to map the JSON calls, but I'm not quite sure how that works when they generate multiple results per row. That might get complicated.

updating postgres jsonb column

I have the below JSON string in my table column, which is of type jsonb:
{
    "abc": 1,
    "def": 2
}
I want to remove the "abc" key from it and insert "mno" with some default value. I followed the below approach for it:
UPDATE books SET books_desc = books_desc - 'abc';
UPDATE books SET books_desc = jsonb_set(books_desc, '{mno}', '5');
and it works.
Now I have another table with JSON as below:
{
    "a": {
        "abc": 1,
        "def": 2
    },
    "b": {
        "abc": 1,
        "def": 2
    }
}
In this JSON too, I want to do the same thing: take out "abc" and introduce "mno" with some default value. Please help me achieve this.
The keys "a" and "b" are dynamic and can change. But the values for "a" and "b" will always have same keys but values may change.
I need a generic logic.
Requirement 2:
abc:true should get converted to xyz:1.
abc:false should get converted to xyz:0.
Demo: db<>fiddle
Because your JSON keys can vary, it might be complicated to generate a common query: you need to give the exact path within the jsonb_set() function, which is hard without knowing the actual keys.
A simple work-around is using the regexp_replace() function on the text representation of the JSON string to replace the relevant objects.
UPDATE my_table
SET my_data =
regexp_replace(my_data::text, '"abc"\s*:\s*\d+', '"mno":5', 'g')::jsonb
For added requirement 2:
I wrote the below queries based on the already given solution:
UPDATE books
SET book_info =
regexp_replace(book_info::text, '"abc"\s*:\s*true', '"xyz":1', 'g')::jsonb;
UPDATE books
SET book_info =
regexp_replace(book_info::text, '"abc"\s*:\s*false', '"xyz":0', 'g')::jsonb;
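As a sanity check (a sketch using the nested sample document from the question, not part of the original answer), the pattern can be tested against a literal before running the UPDATE:
-- The 'g' flag rewrites every "abc": <number> pair at any nesting depth.
SELECT regexp_replace(
    '{"a": {"abc": 1, "def": 2}, "b": {"abc": 1, "def": 2}}',
    '"abc"\s*:\s*\d+',
    '"mno":5',
    'g'
)::jsonb;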

postgres - syntax for updating a jsonb array

I'm struggling to find the right syntax for updating an array in a jsonb column in postgres 9.6.6
Given a column "comments", with this example:
[
    {
        "Comment": "A",
        "LastModified": "1527579949"
    },
    {
        "Comment": "B",
        "LastModified": "1528579949"
    },
    {
        "Comment": "C",
        "LastModified": "1529579949"
    }
]
Suppose I want to append Z to each comment (giving AZ, BZ, CZ).
I know I need to use something like jsonb_set(comments, '{"Comment"}', ...).
Any hints on finishing this off?
Thanks.
Try:
UPDATE elbat
SET comments = array_to_json(ARRAY(SELECT jsonb_set(x.original_comment,
'{Comment}',
concat('"',
x.original_comment->>'Comment',
'Z"')::jsonb)
FROM (SELECT jsonb_array_elements(elbat.comments) original_comment) x))::jsonb;
It uses jsonb_array_elements() to get the array elements as a set, applies the changes to them using jsonb_set(), and transforms this into an array and back to JSON with array_to_json().
But that's an awful lot of work. OK, maybe there is a more elegant solution that I didn't find. But since your JSON seems to have a fixed schema anyway, I'd recommend a redesign to do it the relational way: a simple table for the comments plus a linking table for the objects the comments are on. The change would have been very, very easy in such a model.
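For illustration only, a minimal sketch of that relational layout (all table and column names here are made up, and the parent table elbat is assumed to have an id key):
-- Comments live in their own table; a linking table ties them to the parent rows.
CREATE TABLE comment (
    comment_id    bigserial PRIMARY KEY,
    comment_text  text NOT NULL,
    last_modified timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE elbat_comment (
    elbat_id   bigint NOT NULL REFERENCES elbat (id),
    comment_id bigint NOT NULL REFERENCES comment (comment_id),
    PRIMARY KEY (elbat_id, comment_id)
);

-- The "append Z" change then becomes a plain UPDATE:
UPDATE comment SET comment_text = comment_text || 'Z';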
Find a query returning the expected result:
select jsonb_agg(value || jsonb_build_object('Comment', value->>'Comment' || 'Z'))
from my_table
cross join jsonb_array_elements(comments);
jsonb_agg
-----------------------------------------------------------------------------------------------------------------------------------------------------
[{"Comment": "AZ", "LastModified": "1527579949"}, {"Comment": "BZ", "LastModified": "1528579949"}, {"Comment": "CZ", "LastModified": "1529579949"}]
(1 row)
Create a simple SQL function based on the above query:
create or replace function update_comments(jsonb)
returns jsonb language sql as $$
select jsonb_agg(value || jsonb_build_object('Comment', value->>'Comment' || 'Z'))
from jsonb_array_elements($1)
$$;
Use the function:
update my_table
set comments = update_comments(comments);
DbFiddle.

How can you do projection with array_agg(order by)?

I have a table with three columns: id, name and position. I want to create a JSON array as follows:
[
{"id": 443, "name": "first"},
{"id": 645, "name": "second"}
]
This should be listed by the position column.
I started with the following query:
with output as
(
select id, name
from the_table
)
select array_to_json(array_agg(output))
from output
This works, great. Now I want to add the ordering. I started with this:
with output as
(
select id, name, position
from the_table
)
select array_to_json(array_agg(output order by output.position))
from output
Now the output is as follows:
[
{"id": 443, "name": "first", "position": 1},
{"id": 645, "name": "second", "position": 2}
]
But I don't want the position field in the output.
I am facing a chicken-and-egg problem: I need the position column to be able to order by it, but I don't want the position column to appear in the result output.
How can I fix this?
I don't think the following query is correct, as table ordering is (theoretically) not preserved between queries:
with output as
(
select id, name
from the_table
order by position
)
select array_to_json(array_agg(output))
from output
There are two ways (at least):
Build JSON object:
with t(x,y) as (values(1,1),(2,2))
select json_agg(json_build_object('x',t.x) order by t.y) from t;
Or delete unnecessary key:
with t(x,y) as (values(1,1),(2,2))
select json_agg((to_jsonb(t)-'y')::json order by t.y) from t;
Note that in the second case you need some type casts because the - operator is defined only for the jsonb type.
Also note that I used direct JSON aggregation with json_agg() instead of the array_to_json(array_agg()) pair.
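Applied to the table from the question (a sketch assuming the columns are exactly id, name and position), the first option becomes:
-- Aggregate directly, ordering by position without including it in the output.
select json_agg(json_build_object('id', id, 'name', name) order by position)
from the_table;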