Check if an array contains a value in Azure Data Factory

I need to process files in a container using Azure Data Factory and keep track of the processed files for the next execution, so I keep a table in a database which stores the processed file information.
In ADF I get the file names of the processed files, and I want to check whether the current file has already been processed.
I am using a Lookup activity ("Get All Files Processed") to get the processed files from the DB using the query below:
select FileName from meta.Processed_Files;
Then I traverse the directory and get the details of the current file with a Get Metadata activity: "Get Detail of Current File in Iteration".
In the If Condition activity I use the following expression:
@not(contains(activity('Get All Files Processed').output.value,activity('Get Detail of current file in iteration').output.itemName))
This always returns True, even when the file has already been processed.
How do I compare the FileName against the returned value?
Output of activity('Get All Files Processed').output.value:
{
  "count": 37,
  "value": [
    { "FileName": "20210804074153AlteryxRunStats.xlsx" },
    { "FileName": "20210805074129AlteryxRunStats.xlsx" },
    { "FileName": "20210806074152AlteryxRunStats.xlsx" },
    { "FileName": "20210809074143AlteryxRunStats.xlsx" },
    { "FileName": "20210809074316AlteryxRunStats.xlsx" },
    { "FileName": "20210810074135AlteryxRunStats.xlsx" },
    { "FileName": "20210811074306AlteryxRunStats.xlsx" },
    ...
Output of activity('Get Detail of current file in iteration').output.itemName:
"20210804074153AlteryxRunStats.xlsx"

I often pass this type of thing off to SQL in Azure Data Factory (ADF) too, especially if I've already got a database in the architecture. However, bearing in mind that any hand-off in ADF takes time, it is possible to check whether an item exists in an array using contains, e.g. a set of files returned from a Lookup.
Background
Ordinary arrays normally look like this: [1,2,3] or ["a","b","c"], but if you think about values that get returned in ADF, e.g. from Lookups, they look more like this:
{
  "count": 3,
  "value": [
    { "Filename": "file1.txt" },
    { "Filename": "file2.txt" },
    { "Filename": "file3.txt" }
  ],
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (North Europe)",
  "billingReference": {
    "activityType": "PipelineActivity",
    "billableDuration": [
      {
        ...
So what you've got is a complex piece of JSON representing an object (the return value of the Lookup activity, plus some additional useful information about the execution), and the array we are interested in is inside the value property. However, each element of that array has its own curly brackets, i.e. each element is itself an object.
Solution
So the thing to do is to pass to contains something that looks like your object with the single attribute Filename. Use concat to create the string and json to make it authentic:
@contains(activity('Lookup').output.value, json(concat('{"Filename":"',pipeline().parameters.pFileToCheck,'"}')))
Here I'm using a parameter which holds the filename to check but this could also be a variable or output from another Lookup activity.
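To see why the bare-string check fails, here is a rough Python analogy (illustrative only, not ADF syntax): contains compares whole array elements, and the Lookup's value is an array of objects, not of strings.

```python
# Illustrative Python analogy of ADF's contains() on a Lookup result.
# The Lookup's "value" is an array of objects, not an array of strings.
lookup_value = [
    {"Filename": "file1.txt"},
    {"Filename": "file2.txt"},
    {"Filename": "file3.txt"},
]

# Comparing a bare string against object elements never matches:
print("file2.txt" in lookup_value)                # False

# Wrapping the name in a matching object does match, which is exactly
# what json(concat(...)) builds in the expression above:
print({"Filename": "file2.txt"} in lookup_value)  # True
```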
(Screenshots in the original answer show the sample output from the Lookup, the Set Variable expression using contains, and the result assigned to a variable of Boolean type.)

I tried something like this:
1. From the SQL table, bring back all the processed files as a comma-separated value using select STRING_AGG(processedfile, ',') as files in a Lookup activity.
2. Assign the comma-separated value to an array variable (test) using the split function:
@split(activity('Lookup1').output.value[0]['files'],',')
3. Use a Get Metadata activity to get the current files in the directory.
4. Use a Filter activity to filter the files in the current directory against the processed files:
Items: @activity('Get Metadata1').output.childItems
Condition: @not(contains(variables('test'),item().name))
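The chain of steps above can be sketched in plain Python to show what the split and filter do (the values are made up for illustration; this is not ADF syntax):

```python
# Sketch of the Lookup -> split -> Filter chain (plain Python, not ADF).
# STRING_AGG in the Lookup returns one comma-separated string:
processed_csv = "a.xlsx,b.xlsx,c.xlsx"

# split(...) turns it into an array variable ("test"):
processed = processed_csv.split(",")

# Get Metadata childItems for the directory, as name/type objects:
child_items = [{"name": "a.xlsx", "type": "File"},
               {"name": "d.xlsx", "type": "File"}]

# Filter condition: not(contains(variables('test'), item().name))
unprocessed = [f for f in child_items if f["name"] not in processed]
print(unprocessed)  # only d.xlsx survives
```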

Related

Why does Data Flow Sink Cache not have all Data Preview results?

I'm seeing a significant discrepancy in Data Flow results when using a Cache Sink vs a Data Set Sink. I recreated a simple example to demonstrate.
I uploaded a simple JSON file to Azure Data Lake Storage Gen 2:
{
  "data": [
    { "id": 123, "name": "ABC" },
    { "id": 456, "name": "DEF" },
    { "id": 789, "name": "GHI" }
  ]
}
I created a simple Data Flow that loads this JSON file, flattens it out, then returns it via a Sink. I'm primarily interested in using a Cache Sink because the output is small and I will ultimately need the output for the next pipeline step. (Write to activity output is checked.)
You can see that the Data Preview shows all 3 rows. (I have two sinks in this example simply because I'm illustrating that these do not match.)
Next, I create a pipeline to run the data flow:
Now, when I debug it, the Data Flow output only shows 1 record:
"output": {
  "TestCacheSink": {
    "value": [
      {
        "id": 123,
        "name": "ABC"
      }
    ],
    "count": 1
  }
},
However, the second Data Set Sink contains all 3 records:
{"id":123,"name":"ABC"}
{"id":456,"name":"DEF"}
{"id":789,"name":"GHI"}
I expect that the output from the Cache Sink would also have 3 records. Why is there a discrepancy?
When you choose cache as a sink, you are not allowed to use logging; you will see the error below during validation, before the debug run.
The catch: when you select "None" for logging, ADF automatically checks the "First row only" property, which causes only the first row to be written to the cache sink. You just have to manually uncheck it before running the debug.

Escape characters in JSON causing issues while retrieving an attribute in a ForEach activity

I have the below JSON:
{
  "id": " https://xxx.vault.azure.net/secrets/xxx ",
  "attributes": {
    "enabled": true,
    "nbf": 1632075242,
    "created": 1632075247,
    "updated": 1632075247,
    "recoveryLevel": "Recoverable+Purgeable"
  },
  "tags": {}
}
The above JSON is the output of a Web activity, and I am passing this output into a ForEach activity. When the output goes into the ForEach activity as input, all the values come through with escape characters:
{"id":" https://xxx.vault.azure.net/secrets/xxx ","attributes":{"enabled":true,"nbf":1632075242,"created":1632075247,"updated":1632075247,"recoveryLevel":"Recoverable+Purgeable"},"tags":{}}
From this JSON, I am trying to get only the xxx value from the id attribute. How can I do this in a dynamic expression?
Any help is much appreciated. Thanks
Use the built-in functions lastIndexOf (to find the last occurrence of the forward slash), length (to determine the length of a string), add (to add numbers), sub (to subtract numbers) and substring to do this. For example:
@substring(item().id,add(lastIndexOf(item().id,'/'),1),sub(length(item().id),add(lastIndexOf(item().id,'/'),1)))
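The same index arithmetic can be mirrored in plain Python to check the logic (illustrative only; the padding spaces in the sample id are stripped here for clarity, whereas the ADF expression would operate on the raw value):

```python
# Python mirror of the substring/lastIndexOf expression above.
# The sample id has leading/trailing spaces; strip them for clarity.
secret_id = " https://xxx.vault.azure.net/secrets/xxx ".strip()

# lastIndexOf(item().id, '/') + 1 is where the secret name starts;
# substring then takes (length - start) characters, i.e. the rest.
start = secret_id.rfind("/") + 1
secret_name = secret_id[start : start + (len(secret_id) - start)]
print(secret_name)  # -> xxx
```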

How can I get the count of JSON array in ADF?

I'm using Azure Data Factory to retrieve data and copy it into a database. The source looks like this:
{
  "GroupIds": [
    "4ee1a-0856-4618-4c3c77302b",
    "21259-0ce1-4a30-2a499965d9",
    "b2209-4dda-4e2f-029384e4ad",
    "63ac6-fcbc-8f7e-36fdc5e4f9",
    "821c9-aa73-4a94-3fc0bd2338"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?
You can use the length function in Azure Data Factory (ADF) to check the length of JSON arrays:
@length(json(variables('varSource')).GroupIds)
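As a sanity check outside ADF, the same count can be reproduced in plain Python (illustrative only; varSource here stands in for the pipeline variable):

```python
import json

# Plain-Python equivalent of length(json(variables('varSource')).GroupIds).
var_source = '''{
  "GroupIds": [
    "4ee1a-0856-4618-4c3c77302b",
    "21259-0ce1-4a30-2a499965d9",
    "b2209-4dda-4e2f-029384e4ad",
    "63ac6-fcbc-8f7e-36fdc5e4f9",
    "821c9-aa73-4a94-3fc0bd2338"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05"
}'''

# json(...) parses the string; length(...) counts the array elements.
count = len(json.loads(var_source)["GroupIds"])
print(count)  # -> 5
```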
If you are loading the data to a SQL database then you could use OPENJSON, a simple example:
DECLARE @json NVARCHAR(MAX) = '{
  "GroupIds": [
    "4ee1a-0856-4618-4c3c77302b",
    "21259-0ce1-4a30-2a499965d9",
    "b2209-4dda-4e2f-029384e4ad",
    "63ac6-fcbc-8f7e-36fdc5e4f9",
    "821c9-aa73-4a94-3fc0bd2338"
  ],
  "Id": "w5a19-a493-bfd4-0a0c8djc05",
  "Name": "Test Item",
  "Description": "test item description",
  "Notes": null,
  "ExternalId": null,
  "ExpiryDate": null,
  "ActiveStatus": 0,
  "TagIds": [
    "784083-4c77-b8fb-0135046c",
    "86de96-44c1-a497-0a308607",
    "7565aa-437f-af36-8f9306c9",
    "d5d841-1762-8c14-d8420da2",
    "bac054-2b6e-a19b-ef5b0b0c"
  ],
  "ResourceIds": []
}';

SELECT *
FROM OPENJSON( @json, '$.GroupIds' );

SELECT COUNT(*) countOfGroupIds
FROM OPENJSON( @json, '$.GroupIds' );
If your data is stored in a table the code is similar. Make sense?
Another funky way to approach it, if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds and allows you to get the result inside the pipeline.
The Data Factory XML function converts JSON to XML, but that JSON must have a single root property. We can fix up the json with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
@concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
@string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
Easy huh. It's a shame there isn't more built-in support for JSONPath, unless I'm missing something, although you can use limited JSONPath in the Copy activity.
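The JSON-to-XML-to-XPath trick can be sketched in Python as well (illustrative only: to_elements is a hypothetical helper that roughly mimics how ADF's xml() maps JSON arrays to repeated elements, and the findall stands in for count(/root/GroupIds)):

```python
import json
import xml.etree.ElementTree as ET

def to_elements(tag, value):
    """Rough stand-in for ADF's xml() conversion: JSON arrays become
    repeated elements named after their key (illustrative only)."""
    if isinstance(value, list):
        elems = []
        for item in value:
            elems.extend(to_elements(tag, item))
        return elems
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for k, v in value.items():
            for child in to_elements(k, v):
                elem.append(child)
    else:
        elem.text = "" if value is None else str(value)
    return [elem]

var_source = '{"GroupIds": ["a", "b", "c", "d", "e"], "Id": "x"}'

# Same trick as concat('{"root":', ..., '}') to give the XML a single root:
doc = json.loads('{"root":' + var_source + "}")
root = to_elements("root", doc["root"])[0]

# XPath count(/root/GroupIds) becomes a count of the repeated elements:
print(len(root.findall("GroupIds")))  # -> 5
```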
You can use a Data Flow activity in the Azure Data Factory pipeline to get the count.
Step 1:
Connect the source to the JSON dataset and, in the source options under JSON settings, select Single document.
In the source preview, you can see there are 5 GroupIds per Id.
Step 2:
Use a Flatten transformation to denormalize the values into rows for GroupIds.
Select the GroupIds array in Unroll by and Unroll root.
Step 3:
Use an Aggregate transformation to get the count of GroupIds grouped by Id.
Under Group by: select a column from the drop-down for your aggregation.
Under Aggregates: build the expression to get the count of the column (GroupIds).
Step 4: Connect the output to a Sink transformation to load the final output to the database.

Azure data factory get object properties from array

I have a Get Metadata activity which gets all child items under a blob container. There are both files and folders, but I just need the files, so I use a Filter activity to keep only items of type File. This is what I got from the Filter activity:
Output
{
  "ItemsCount": 4,
  "FilteredItemsCount": 3,
  "Value": [
    {
      "name": "BRAND MAPPING.csv",
      "type": "File"
    },
    {
      "name": "ChinaBIHRA.csv",
      "type": "File"
    },
    {
      "name": "ChinaBIHRA.csv_20201021121500",
      "type": "File"
    }
  ]
}
So there is an array of 3 objects being returned, and each object has name and type properties. I want just the names to be fed to a Stored Procedure activity as a parameter, so I used this expression to try to get a comma-separated list:
@join(activity('Filter1').output.Value.name, ',')
and got this error:
The expression 'join(activity('Filter1').output.Value.name, ',')' cannot be evaluated because property 'name' cannot be selected. Array elements can only be selected using an integer index.
So how can I achieve this?
You can create a For Each activity after the Filter activity and, within the For Each activity, append each file name.
Steps:
1. Create two variables (an array variable for collecting the names, and a string variable for the joined result).
2. Configure the For Each activity to iterate over the Filter output.
3. Configure the Append Variable activity within the For Each activity.
4. Configure the Set Variable activity to join the array into a string.
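The ForEach + Append Variable pattern amounts to the following, sketched in plain Python (illustrative only; variable names are made up):

```python
# Python sketch of the ForEach + Append Variable pattern (not ADF syntax).
filtered_value = [
    {"name": "BRAND MAPPING.csv", "type": "File"},
    {"name": "ChinaBIHRA.csv", "type": "File"},
    {"name": "ChinaBIHRA.csv_20201021121500", "type": "File"},
]

names = []                        # the array variable Append Variable builds up
for item in filtered_value:       # ForEach over the Filter activity's Value
    names.append(item["name"])    # Append Variable: item().name

# Set Variable: join the accumulated names for the stored procedure parameter
csv_param = ",".join(names)
print(csv_param)
```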
Use the code block below instead:
@concat('''',join(json(replace(replace(replace(replace(string(
activity('Filter1').output.Value)
,',"type":"File"','')
,'"name":','')
,'{','')
,'}','')),''','''),'''')
This would forgo the use of multiple activities, and you can use the existing framework that you have.
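To see what the nested replace/join chain actually does, here is a step-by-step Python sketch of the same transformation (illustrative only, not ADF syntax):

```python
import json

# Step-by-step Python sketch of the replace/join chain above.
value = [
    {"name": "BRAND MAPPING.csv", "type": "File"},
    {"name": "ChinaBIHRA.csv", "type": "File"},
]

s = json.dumps(value, separators=(",", ":"))  # string(activity(...).output.Value)
s = s.replace(',"type":"File"', "")           # strip the type attribute
s = s.replace('"name":', "")                  # strip the name key
s = s.replace("{", "").replace("}", "")       # what's left is an array of strings
names = json.loads(s)                         # json(...) parses it back

# join with ',' between quotes, plus the outer quotes from concat,
# yields a quoted, comma-separated list ready for a SQL parameter:
param = "'" + "','".join(names) + "'"
print(param)  # -> 'BRAND MAPPING.csv','ChinaBIHRA.csv'
```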

Database with ability to search JSON object

Please help me find a database solution.
Here is the main requirement: One of the columns (the data column) holds a JSON data object which has to be easily searched.
Database row fields:
source (string)
action (string)
description (string)
timestamp (epoch timestamp)
data (can be anything but most likely JSON object or a string)
The nature/schema of this JSON object is unknown (e.g. it can be 1 level deep or 3 levels deep). The important thing is that the tree of the object can be searched.
One possible object:
{
  "things": [
    {
      "ip": "123.13.3.3.",
      "sn": "ADF"
    },
    {
      "ip": "123.13.3.3.",
      "sn": "ABC"
    }
  ]
}
Another possible object:
{
  "ip": "123.13.3.3.",
  "sn": "ABC"
}
Example: I want to return all the rows where the data column holds a JSON object that has the key/value pair "sn": "ABC" somewhere in its tree.
Additional Requirements:
Low overhead and/or simple to set up
This database will not be getting hundreds/thousands/millions of hits per minute (now or ever)
Python and/or PHP hooks