How can I get the count of a JSON array in ADF?

I'm using Azure Data Factory to retrieve data and copy it into a database. The source looks like this:
{
    "GroupIds": [
        "4ee1a-0856-4618-4c3c77302b",
        "21259-0ce1-4a30-2a499965d9",
        "b2209-4dda-4e2f-029384e4ad",
        "63ac6-fcbc-8f7e-36fdc5e4f9",
        "821c9-aa73-4a94-3fc0bd2338"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?

You can use the length function in Azure Data Factory (ADF) to check the length of JSON arrays:
length(json(variables('varSource')).GroupIds)
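For example, to capture the count for your database column with a Set Variable activity, something like this should work (a sketch, assuming varSource holds the JSON above and the target variable is of type String):
@string(length(json(variables('varSource')).GroupIds))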
If you are loading the data into a SQL database then you could use OPENJSON. A simple example:
DECLARE @json NVARCHAR(MAX) = '{
    "GroupIds": [
        "4ee1a-0856-4618-4c3c77302b",
        "21259-0ce1-4a30-2a499965d9",
        "b2209-4dda-4e2f-029384e4ad",
        "63ac6-fcbc-8f7e-36fdc5e4f9",
        "821c9-aa73-4a94-3fc0bd2338"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}'
SELECT *
FROM OPENJSON( @json, '$.GroupIds' );

SELECT COUNT(*) AS countOfGroupIds
FROM OPENJSON( @json, '$.GroupIds' );
My results: the first query returns the five GroupIds as rows; the second returns countOfGroupIds = 5.
If your data is stored in a table the code is similar. Make sense?
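For example, a sketch assuming the JSON sits in an NVARCHAR column jsonPayload of a table dbo.SourceData (both names hypothetical):
SELECT
    JSON_VALUE(s.jsonPayload, '$.Id') AS Id,
    (SELECT COUNT(*) FROM OPENJSON(s.jsonPayload, '$.GroupIds')) AS countOfGroupIds
FROM dbo.SourceData s; -- hypothetical table holding the raw JSON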
Another funky way to approach it, if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds and allows you to get the result inside the pipeline.
The Data Factory xml function converts JSON to XML, but that JSON must have a single root property. We can fix up the JSON with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
@concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
@string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
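If you'd rather skip the intermediate variable, the two steps should also compose into a single expression (an untested sketch):
@string(xpath(xml(json(concat('{"root":', variables('varSource'), '}'))), 'count(/root/GroupIds)'))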
My results: the expression returns 5 for the sample data above.
Easy, huh? It's a shame there isn't more built-in support for JSONPath, unless I'm missing something, although you can use limited JSONPath in the Copy activity.

You can use a Data flow activity in the Azure Data Factory pipeline to get the count.
Step 1:
Connect the source to the JSON dataset and, in Source options under JSON settings, select Single document.
In the source preview, you can see there are 5 GroupIds per Id.
Step 2:
Use a Flatten transformation to denormalize the GroupIds values into rows.
Select the GroupIds array in Unroll by and Unroll root.
Step 3:
Use an Aggregate transformation to get the count of GroupIds grouped by Id.
Under Group by, select the column for your aggregation (here, Id) from the drop-down.
Under Aggregates, build the expression that gets the count of the GroupIds column (see the sketch after these steps).
The aggregate data preview then shows one row per Id with the count of GroupIds.
Step 4: Connect the output to a Sink transformation to load the final output to the database.
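A minimal sketch of those Aggregate settings (the output column name groupIdCount is just an example):
Group by: Id
Aggregates: groupIdCount = count(GroupIds)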

Related

Check if the array contains a value in Azure Data Factory

I need to process files in a container using Azure Data Factory and keep track of processed files for the next execution, so I am keeping a table in the DB which stores the processed file information.
In ADF I am getting the FileNames of the processed files, and I want to check whether the current file has been processed or not.
I am using a Lookup activity ("Get All Files Processed") to get the processed files from the DB using the query below:
select FileName from meta.Processed_Files;
Then I am traversing the directory and getting the details of the current file with a Get Metadata activity ("Get Detail of Current File in Iteration").
In the If Condition activity, I am using the following expression:
@not(contains(activity('Get All Files Processed').output.value,activity('Get Detail of current file in iteration').output.itemName))
This always returns True, even if the file has been processed.
How do I compare the FileName against the returned value?
Output of activity('Get All Files Processed').output.value
{
    "count": 37,
    "value": [
        {
            "FileName": "20210804074153AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210805074129AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210806074152AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210809074143AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210809074316AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210810074135AlteryxRunStats.xlsx"
        },
        {
            "FileName": "20210811074306AlteryxRunStats.xlsx"
        },
        ...
Output of activity('Get Detail of current file in iteration').output.itemName
"20210804074153AlteryxRunStats.xlsx"
I often pass this type of thing off to SQL in Azure Data Factory (ADF) too, especially if I've already got a SQL database in the architecture. However, bearing in mind that any hand-offs in ADF take time, it is possible to check if an item exists in an array using contains, eg a set of files returned from a Lookup.
Background
Ordinary arrays normally look like this: [1,2,3] or ["a","b","c"], but if you think about the values that get returned in ADF, eg from Lookups, they look more like this:
{
    "count": 3,
    "value": [
        {
            "Filename": "file1.txt"
        },
        {
            "Filename": "file2.txt"
        },
        {
            "Filename": "file3.txt"
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (North Europe)",
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                ...
So what you've got is a complex piece of JSON representing an object (the return value of the Lookup activity plus some additional useful info about the execution), and the array we are interested in is the value property. However, each of its elements has additional curly brackets, ie each element is itself an object.
Solution
So the thing to do is to pass to contains something that will look like an element of your array, ie an object with the single attribute Filename. Use concat to create the string and json to make it a real object:
@contains(activity('Lookup').output.value, json(concat('{"Filename":"',pipeline().parameters.pFileToCheck,'"}')))
Here I'm using a parameter which holds the filename to check but this could also be a variable or output from another Lookup activity.
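Applied to the activity names from the question, that would look something like this (assuming the Lookup rows carry a FileName attribute as shown above):
@contains(activity('Get All Files Processed').output.value, json(concat('{"FileName":"', activity('Get Detail of current file in iteration').output.itemName, '"}')))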
The result of the contains expression can then be assigned to a variable of Boolean type using a Set Variable activity.
I tried something like this.
Step 1: From the SQL table, bring back all the processed files as a single comma-separated value using select STRING_AGG(processedfile, ',') as files in a Lookup activity (the full query is sketched after these steps).
Step 2: Assign the comma-separated value to an array variable (test) using the split function:
@split(activity('Lookup1').output.value[0]['files'],',')
Step 3: Use a Get Metadata activity to get the current files in the directory.
Step 4: Use a Filter activity to filter the files in the current directory against the processed files.
Items:
@activity('Get Metadata1').output.childItems
Condition:
@not(contains(variables('test'),item().name))
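A sketch of the Lookup query from Step 1, assuming the meta.Processed_Files table from the question (where the column is FileName):
select STRING_AGG(FileName, ',') as files from meta.Processed_Files;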

Azure Data Factory - traverse JSON array with multiple rows

I have a REST API that outputs JSON data similar to this example:
{
    "GroupIds": [
        "1234",
        "2345",
        "3456",
        "4567"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}
Using ADF, I want to parse through this JSON object and insert a row for each value in the GroupIds array, along with the object's Id and Name. So ultimately the above JSON should translate to a table like this:
GroupID    Id                           Name
1234       w5a19-a493-bfd4-0a0c8djc05   Test Item
2345       w5a19-a493-bfd4-0a0c8djc05   Test Item
3456       w5a19-a493-bfd4-0a0c8djc05   Test Item
4567       w5a19-a493-bfd4-0a0c8djc05   Test Item
Is there some configuration I can use in the Copy Activity settings to accomplish this?
You can use a Data flow activity to get the desired result.
First add the REST API source, then use a Select transformer and add the required columns.
After this, add a Derived Column transformer and use the unfold function to flatten the JSON array.
Another way is to use the Flatten transformation.
I tend to use a more ELT pattern for this, ie passing the JSON to a Stored Proc activity and letting the SQL database handle the JSON. This assumes you already have access to a SQL DB which is very capable with JSON.
A simplified example:
DECLARE @json NVARCHAR(MAX) = '{
    "GroupIds": [
        "1234",
        "2345",
        "3456",
        "4567"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}'
SELECT
    g.[value] AS groupId,
    m.Id,
    m.[Name]
FROM OPENJSON( @json, '$' )
WITH
(
    Id VARCHAR(50) '$.Id',
    [Name] VARCHAR(50) '$.Name',
    GroupIds NVARCHAR(MAX) AS JSON
) m
CROSS APPLY OPENJSON( @json, '$.GroupIds' ) g;
You could convert this to a stored procedure where @json is the parameter and convert the SELECT to an INSERT.
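A rough sketch of that stored procedure (the procedure and target table names are assumptions):
CREATE PROCEDURE dbo.usp_LoadGroupIds
    @json NVARCHAR(MAX)
AS
BEGIN
    -- dbo.GroupIdStage is a hypothetical target table with columns (GroupId, Id, Name)
    INSERT INTO dbo.GroupIdStage ( GroupId, Id, [Name] )
    SELECT
        g.[value],
        m.Id,
        m.[Name]
    FROM OPENJSON( @json, '$' )
    WITH
    (
        Id VARCHAR(50) '$.Id',
        [Name] VARCHAR(50) '$.Name',
        GroupIds NVARCHAR(MAX) AS JSON
    ) m
    CROSS APPLY OPENJSON( @json, '$.GroupIds' ) g;
END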
My results: one row per GroupId, with the Id and Name repeated on each row.
I worked through a very similar example with more screenprints here, which is worth a look. It's a different pattern to using Mapping Data Flows, but if you already have SQL available then it makes sense to use it rather than fire up separate compute at duplicate cost. If you are not logging to a SQL DB or don't have access to one, then the Mapping Data Flows approach might make sense for you.

Azure Data Factory - Copy Activity - REST API collection reference

Hello everyone,
I am fairly new to Data Factory and I need to copy information from Dynamics Business Central's REST API. I am struggling with the "Details" type entities such as "invoiceSalesHeader".
The API for that entity forces me to provide a header ID as a filter, so I would have to loop x times (a few thousand) and call the REST API to retrieve the lines of each sales invoice. I find that completely ridiculous and am trying to find other ways to get the information.
To avoid doing that, I am trying to get the information by calling the "salesInvoice" entity and using "$expand=salesInvoiceLines".
That gets me the information I need, but inside Data Factory's Copy activity I am struggling with what I should put as the "collection reference" so that I end up with one row per salesInvoiceLine.
The data returned is an array of sales invoices with a sub-array of invoice lines.
If I select "salesInvoiceLines" as the collection reference, I end up with "$['value'][0]['salesInvoiceLines']", and that only gives me the lines for the first invoice (since there is an index of zero).
What should I put in Collection Reference so that I get one row per salesInvoiceLine?
It is not supported to ForEach over a nested JSON array in ADF.
Alternatively, we can use a Flatten transformation in a data flow to flatten the nested JSON array.
Here is my example:
This is my example JSON data; the structure is like yours:
[
    {
        "id": 1,
        "Value": "January",
        "orders": [{"orderid":1,"orderno":"qaz"},{"orderid":2,"orderno":"edc"}]
    },
    {
        "id": 2,
        "Value": "February",
        "orders": [{"orderid":3,"orderno":"wsx"},{"orderid":4,"orderno":"rfv"}]
    },
    {
        "id": 3,
        "Value": "March",
        "orders": [{"orderid":5,"orderno":"rfv"},{"orderid":6,"orderno":"tgb"}]
    },
    {
        "id": 11,
        "Value": "November",
        "orders": [{"orderid":7,"orderno":"yhn"},{"orderid":8,"orderno":"ujm"}]
    }
]
In the data flow's Flatten settings, we can select the name of the nested JSON array to unroll by; here it is orders.
Then we can see the result: the orders arrays (2 objects each, with orderid and orderno) have been transposed into 8 flattened rows.
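For reference, the data flow script behind such a Flatten transformation looks roughly like this (a sketch based on the sample data; source1 and FlattenOrders are assumed stream names, and the script ADF generates for you may differ):
source1 foldDown(unroll: orders,
    mapColumn(
        id,
        Value,
        orderid = orders.orderid,
        orderno = orders.orderno
    ),
    skipDuplicateMapInputs: false,
    skipDuplicateMapOutputs: false) ~> FlattenOrders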

Azure Data Factory - get object properties from an array

I have a Get Metadata activity which gets all child items under a blob container. There are both files and folders, but I just need files, so I use a Filter activity to keep only items of type = File. This is what I got from the Filter activity:
Output:
{
    "ItemsCount": 4,
    "FilteredItemsCount": 3,
    "Value": [
        {
            "name": "BRAND MAPPING.csv",
            "type": "File"
        },
        {
            "name": "ChinaBIHRA.csv",
            "type": "File"
        },
        {
            "name": "ChinaBIHRA.csv_20201021121500",
            "type": "File"
        }
    ]
}
So there is an array of 3 objects being returned. Each object has name and type properties. I want just the names to be fed to a stored procedure activity as a parameter, so I used this expression to try to get a comma-separated list:
@join(activity('Filter1').output.Value.name, ',')
and got this error:
The expression 'join(activity('Filter1').output.Value.name, ',')' cannot be evaluated because property 'name' cannot be selected. Array elements can only be selected using an integer index.
So how can I achieve this?
You can add a ForEach activity after the Filter activity and append each file name to an array variable within it; the key settings are sketched after these steps.
Steps:
1. Create two variables.
2. Configure the ForEach activity.
3. Configure the Append Variable activity within the ForEach activity.
4. Configure the Set Variable activity.
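A minimal sketch of those settings, assuming an array variable fileNames and a string variable fileNamesCsv (both names hypothetical):
ForEach items: @activity('Filter1').output.Value
Append Variable (fileNames): @item().name
Set Variable (fileNamesCsv): @join(variables('fileNames'), ',')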
Use the expression below instead:
@concat('''',join(json(replace(replace(replace(replace(string(
activity('Filter1').output.Value)
,',"type":"File"','')
,'"name":','')
,'{','')
,'}','')),''','''),'''')
This forgoes the use of multiple activities, and you can use the existing framework that you have.

Building a query in Postgres 9.4.2 for the JSONB datatype using built-in functions

I have a table schema as follows:
DummyTable
-------------
someData JSONB
All my values will be JSON objects. For example, when you do a select * from DummyTable, it would look like this:
someData(JSONB)
------------------
{"values":["P1","P2","P3"],"key":"ProductOne"}
{"values":["P3"],"key":"ProductTwo"}
I want a query which will give me a result set as follows:
[
    {
        "values": ["P1","P2","P3"],
        "key": "ProductOne"
    },
    {
        "values": ["P4"],
        "key": "ProductTwo"
    }
]
I'm using Postgres version 9.4.2. I looked at its documentation page but could not find a query which would give the above result.
In my API I can build the JSON by iterating over the rows, but I would prefer a query that does the same. I tried json_build_array and row_to_json on the result given by select * from table_name, but no luck.
Any help would be appreciated.
Here is the link I looked at to write a query for JSONB.
You can use json_agg or jsonb_agg:
create table dummytable(somedata jsonb not null);
insert into dummytable(somedata) values
('{"values":["P1","P2","P3"],"key":"ProductOne"}'),
('{"values":["P3"],"key":"ProductTwo"}');
select jsonb_pretty(jsonb_agg(somedata)) from dummytable;
Result:
[
    {
        "key": "ProductOne",
        "values": [
            "P1",
            "P2",
            "P3"
        ]
    },
    {
        "key": "ProductTwo",
        "values": [
            "P3"
        ]
    }
]
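Note that jsonb_agg and jsonb_pretty were only added in PostgreSQL 9.5, so on 9.4.2 the json variant should work instead (a sketch; the output just won't be pretty-printed):
select json_agg(somedata) from dummytable;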
Retrieving the data row by row and building the array on the client side can, however, be more efficient: the server can start to send data much sooner, right after it retrieves the first matching row from storage. If it needs to build the JSON array first, it has to retrieve all the rows and merge them before it can start sending data.