Azure Data Factory - traverse JSON array with multiple rows

I have a REST API that outputs JSON data similar to this example:
{
  "GroupIds": [
    "1234",
    "2345",
    "3456",
    "4567"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}
Using ADF, I want to parse through this JSON object and insert a row for each value in the GroupIds array, along with the object's Id and Name. So ultimately the above JSON should translate to a table like this:
GroupID | Id                         | Name
--------|----------------------------|----------
1234    | w5a19-a493-bfd4-0a0c8djc05 | Test Item
2345    | w5a19-a493-bfd4-0a0c8djc05 | Test Item
3456    | w5a19-a493-bfd4-0a0c8djc05 | Test Item
4567    | w5a19-a493-bfd4-0a0c8djc05 | Test Item
Is there some configuration I can use in the Copy Activity settings to accomplish this?

You can use a Data Flow activity to get the desired result.
First add the REST API source, then use a Select transformation to add the required columns.
After this, add a Derived Column transformation and use the unfold function to flatten the JSON array.
Another way is to use the Flatten transformation.

I tend to use a more ELT pattern for this, i.e. passing the JSON to a Stored Procedure activity and letting the SQL database handle the JSON. This assumes you already have access to a SQL DB, which is very capable with JSON.
A simplified example:
DECLARE @json NVARCHAR(MAX) = '{
  "GroupIds": [
    "1234",
    "2345",
    "3456",
    "4567"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}'

SELECT
    g.[value] AS groupId,
    m.Id,
    m.[Name]
FROM OPENJSON( @json, '$' )
WITH
(
    Id VARCHAR(50) '$.Id',
    [Name] VARCHAR(50) '$.Name',
    GroupIds NVARCHAR(MAX) AS JSON
) m
CROSS APPLY OPENJSON( @json, '$.GroupIds' ) g;
You could convert this to a stored procedure where @json is the parameter, and convert the SELECT to an INSERT.
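A minimal sketch of that conversion (the procedure name and the dbo.GroupIdMap target table are just placeholders):

CREATE PROCEDURE dbo.usp_LoadGroupIds
    @json NVARCHAR(MAX)
AS
BEGIN
    -- dbo.GroupIdMap is a placeholder target table
    -- inserts one row per GroupIds element, repeating the parent Id and Name
    INSERT INTO dbo.GroupIdMap ( GroupId, Id, [Name] )
    SELECT
        g.[value],
        m.Id,
        m.[Name]
    FROM OPENJSON( @json, '$' )
    WITH
    (
        Id VARCHAR(50) '$.Id',
        [Name] VARCHAR(50) '$.Name',
        GroupIds NVARCHAR(MAX) AS JSON
    ) m
    CROSS APPLY OPENJSON( m.GroupIds ) g;
END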
I worked through a very similar example with more screenshots here which is worth a look. It's a different pattern to using Mapping Data Flows, but if you already have SQL available then it makes sense to use it rather than firing up separate compute at duplicate cost. If you are not loading to a SQL DB or don't have access to one, then the Mapping Data Flows approach may make sense for you.

Related

How can I get the count of a JSON array in ADF?

I'm using Azure Data Factory to retrieve data and copy it into a database... the source looks like this:
{
  "GroupIds": [
    "4ee1a-0856-4618-4c3c77302b",
    "21259-0ce1-4a30-2a499965d9",
    "b2209-4dda-4e2f-029384e4ad",
    "63ac6-fcbc-8f7e-36fdc5e4f9",
    "821c9-aa73-4a94-3fc0bd2338"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?
You can use the length function in Azure Data Factory (ADF) to check the length of JSON arrays:
length(json(variables('varSource')).GroupIds)
If you are loading the data into a SQL database then you could use OPENJSON. A simple example:
DECLARE @json NVARCHAR(MAX) = '{
  "GroupIds": [
    "4ee1a-0856-4618-4c3c77302b",
    "21259-0ce1-4a30-2a499965d9",
    "b2209-4dda-4e2f-029384e4ad",
    "63ac6-fcbc-8f7e-36fdc5e4f9",
    "821c9-aa73-4a94-3fc0bd2338"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}'

SELECT *
FROM OPENJSON( @json, '$.GroupIds' );

SELECT COUNT(*) countOfGroupIds
FROM OPENJSON( @json, '$.GroupIds' );
If your data is stored in a table the code is similar (see the sketch below). Make sense?
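A rough sketch of the table variant, assuming the JSON sits in a column called jsonPayload on a table called dbo.SourceTable (both names are assumptions):

SELECT
    JSON_VALUE( t.jsonPayload, '$.Id' ) AS Id,
    COUNT(*) AS countOfGroupIds
FROM dbo.SourceTable t
CROSS APPLY OPENJSON( t.jsonPayload, '$.GroupIds' ) g
GROUP BY JSON_VALUE( t.jsonPayload, '$.Id' );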
Another funky way to approach it, if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds and allows you to get the result inside the pipeline.
The Data Factory xml function converts JSON to XML, but that JSON must have a single root property. We can fix up the JSON with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
@concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
@string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
Easy, huh? It's a shame there isn't more built-in support for JSONPath, unless I'm missing something, although you can use limited JSONPath in the Copy activity.
You can use a Data Flow activity in the Azure Data Factory pipeline to get the count.
Step 1:
Connect the source to the JSON dataset, and in the source options under JSON settings, select Single document.
In the source preview, you can see there are 5 GroupIds per Id.
Step 2:
Use the Flatten transformation to denormalize the values into rows for GroupIds.
Select the GroupIds array in Unroll by and Unroll root.
Step 3:
Use the Aggregate transformation to get the count of GroupIds grouped by Id.
Under Group by: select the column to group on from the drop-down.
Under Aggregate: build the expression to get the count of the column, for example count(GroupIds).
Step 4: Connect the output to a Sink transformation to load the final output to the database.
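For the sink, a simple target table could look something like this (a sketch; the table and column names are assumptions):

-- hypothetical sink table for the aggregated counts
CREATE TABLE dbo.GroupIdCounts (
    Id VARCHAR(50) NOT NULL,
    countOfGroupIds INT NOT NULL
);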

Postgres - add an additional field to each element of a json array prior to inserting as a record set into the database

I'm creating a stored procedure in Postgres which takes a json and inserts it into a table by performing a json_to_recordset command.
Prior to inserting the record set into a SQL table with that command, I want to add an additional field to each element of the json array.
Is this possible?
For each element in the array I want to add "current_status": "pending". Before:
[
  {
    "batch_id": "40",
    "state_id": "10"
  },
  {
    "batch_id": "40",
    "state_id": "10"
  }
]
After:
[
  {
    "batch_id": "40",
    "state_id": "10",
    "current_status": "pending"
  },
  {
    "batch_id": "40",
    "state_id": "10",
    "current_status": "pending"
  }
]
Another option is updating only the NEW records in the table after the fact.
I'm new to postgres and have been reading through the documentation.
Based on your added comment, the current_status = 'pending' should be added as part of the insert into your target table instead of appending the key to the json objects.
insert into target_table (batch_id, state_id, current_status)
select batch_id, state_id, 'pending' as current_status
from json_to_recordset(<json>) as x(batch_id text, state_id text);
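For illustration, running that select against the sample array inline (the literal is just the "before" payload from the question):

select batch_id, state_id, 'pending' as current_status
from json_to_recordset('[{"batch_id":"40","state_id":"10"},{"batch_id":"40","state_id":"10"}]'::json)
    as x(batch_id text, state_id text);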
Adding to Mike's correct answer: in the case where we want to modify existing tables with json arrays to include the status, we may do the following:
-- First create the table
CREATE TABLE myJson AS
SELECT '[
  {
    "batch_id": "40",
    "state_id": "10"
  },
  {
    "batch_id": "50",
    "state_id": "60"
  }
]'::json js;

-- Unnest the array, merge the new key into each element, then re-aggregate
WITH unnest_and_concat AS (
    SELECT json_array_elements(js)::jsonb || json_build_object('current_status', 'pending')::jsonb jee
    FROM myJson
)
SELECT json_agg(jee)::json
FROM unnest_and_concat;
Of course this is only intended to work for a table with these rows (for illustration). If the objective is to update the entire table then we can do that (ideally with a LATERAL join) combined with an update statement. It looks like:
UPDATE myJson
SET old_col=new_col
FROM <insert subquery or table>
WHERE myJson.id = new_table.id;
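A more concrete sketch of that update, assuming the table also carries an id primary key column (the illustrative myJson table above does not, so the names here are hypothetical):

UPDATE myJson m
SET js = agg.new_js
FROM (
    -- id is an assumed primary key column on myJson
    SELECT id,
           json_agg(elem::jsonb || '{"current_status": "pending"}'::jsonb)::json AS new_js
    FROM myJson
    CROSS JOIN LATERAL json_array_elements(js) AS elem
    GROUP BY id
) agg
WHERE m.id = agg.id;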
Nevertheless, I would recommend modifying upon insert rather than updating.

Postgres jsonb nested array append

I have a simple table with a jsonb column:
CREATE TABLE things (
    id SERIAL PRIMARY KEY,
    data jsonb
);
with data that looks like:
{
  "id": 1,
  "title": "thing",
  "things": [
    {
      "title": "thing 1",
      "moreThings": [
        { "title": "more thing 1" }
      ]
    }
  ]
}
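For reference, a row like that can be seeded as follows, so the statements below can be tried directly (the serial id of the first row will be 1):

INSERT INTO things (data) VALUES (
  '{"id": 1, "title": "thing", "things": [{"title": "thing 1", "moreThings": [{"title": "more thing 1"}]}]}'
);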
So how do I append inside of a deeply nested array like moreThings?
For a single-level nested array I could do this and it works:
UPDATE things SET data = jsonb_set(data, '{things}', data->'things' || '{ "text": "thing" }', true);
But the same doesn't work for deeply nested arrays:
UPDATE things SET data = jsonb_set(data, '{things}', data->'things'->'moreThings' || '{ "text": "thing" }', true)
How can I append to moreThings?
It works just fine:
UPDATE things
SET data =
    jsonb_set(data,
              '{things,0,moreThings}',
              data->'things'->0->'moreThings' || '{ "text": "thing" }',
              TRUE)
WHERE id = 1;
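To confirm the append worked, a quick check:

SELECT jsonb_pretty(data->'things'->0->'moreThings')
FROM things
WHERE id = 1;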
If you have a table that consists only of a primary key and a jsonb attribute and you regularly want to manipulate this jsonb in the database, you are certainly doing something wrong. Your life will be much easier if you normalize the data some more.

Building a query in Postgres 9.4.2 for the JSONB datatype using built-in functions

I have a table schema as follows:
DummyTable
-------------
someData JSONB
All my values will be a JSON object. For example, when you do a select * from DummyTable, it would look like:
someData (JSONB)
------------------
{"values":["P1","P2","P3"],"key":"ProductOne"}
{"values":["P3"],"key":"ProductTwo"}
I want a query which will give me a result set as follows:
[
  {
    "values": ["P1","P2","P3"],
    "key": "ProductOne"
  },
  {
    "values": ["P3"],
    "key": "ProductTwo"
  }
]
I'm using Postgres version 9.4.2. I looked at the documentation page for it, but could not find a query which would give the above result.
However, in my API I can build the JSON by iterating over the rows, but I would prefer a query doing the same. I tried json_build_array and row_to_json on the result of select * from table_name, but no luck.
Any help would be appreciated.
Here is the link I looked at while trying to write a query for JSONB.
You can use json_agg or jsonb_agg. Note that jsonb_agg and jsonb_pretty were only introduced in Postgres 9.5, so on 9.4.2 stick with json_agg (see the variant after the result below):
create table dummytable(somedata jsonb not null);
insert into dummytable(somedata) values
('{"values":["P1","P2","P3"],"key":"ProductOne"}'),
('{"values":["P3"],"key":"ProductTwo"}');
select jsonb_pretty(jsonb_agg(somedata)) from dummytable;
Result:
[
  {
    "key": "ProductOne",
    "values": [
      "P1",
      "P2",
      "P3"
    ]
  },
  {
    "key": "ProductTwo",
    "values": [
      "P3"
    ]
  }
]
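Given the question targets 9.4.2, a json_agg-only variant of the same aggregation (the same output, minus the pretty-printing) should work there:

select json_agg(somedata) from dummytable;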
Note that retrieving the data row by row and building the array on the client side can be more efficient, as the server can start to send data much sooner: as soon as it retrieves the first matching row from storage. If it needs to build the json array first, it has to retrieve all the rows and merge them before it can start sending data.

mongodb-php: nested "key" value querying with the find() function does not work

I want to retrieve the records matching a booking's client id and show them to the client. I am doing the following:
$mongoDb = $mongoDb->selectCollection('booking');
$bookingInfo = $mongoDb->find(array("client.id" => $_SESSION['client_id']));
My mongo database record looks like this:
"paymentDue": "",
"client": {
"contacts": [
{
"name": "loy furison",
"email": "loy#hotmail.com"
}
],
"id": "5492abba64363df013000029",
"name": "Birdfire"
},
I want to run the query with the key client.id in the find() function, but this query doesn't work. What's the issue?
I noticed something that differs by key name only: if I query by client.name, it returns the records, and I can then put them into a JSON object, loop through each record with foreach, retrieve and compare, and that works. But the same query on client.id doesn't work, and I don't understand why.