Why does Data Flow Sink Cache not have all Data Preview results? - azure-data-factory

I'm seeing a significant discrepancy in Data Flow results when using a Cache Sink vs a Data Set Sink. I recreated a simple example to demonstrate.
I uploaded a simple JSON file to Azure Data Lake Storage Gen 2:
{
"data": [
{
"id": 123,
"name": "ABC"
},
{
"id": 456,
"name": "DEF"
},
{
"id": 789,
"name": "GHI"
}
]
}
I created a simple Data Flow that loads this JSON file, flattens it out, then returns it via a Sink. I'm primarily interested in using a Cache Sink because the output is small and I will ultimately need the output for the next pipeline step. (Write to activity output is checked.)
You can see that the Data Preview shows all 3 rows. (I have two sinks in this example simply because I'm illustrating that these do not match.)
Next, I create a pipeline to run the data flow:
Now, when I debug it, the Data Flow output only shows 1 record:
"output": {
"TestCacheSink": {
"value": [
{
"id": 123,
"name": "ABC"
}
],
"count": 1
}
},
However, the second Data Set Sink contains all 3 records:
{"id":123,"name":"ABC"}
{"id":456,"name":"DEF"}
{"id":789,"name":"GHI"}
I expect that the output from the Cache Sink would also have 3 records. Why is there a discrepancy?

When you choose cache as a sink, you will not be allowed to use logging. You see the below error during validation before debug.
To fix which, when you select "none" for logging, it automatically checks "first row only" property! This is causing it to write only the first row to cache sink. You just have to manually uncheck it before running debug.
Here is how it looks...

Related

Kafka Connect Filter Transformer

I am using a Kafka-connect, and I want to filter out some messages.
This is how a single message in Kafka looks like.
{
"code": 2001,
"time": 11111111,
"requestId": "123456789",
"info": [
{
"name": "dan",
"value": 21
}
]
}
And this is how my transformer looks like
transforms:
transforms: FilterByCode
transforms.FilterByCode.type: io.confluent.connect.transforms.Filter$Value
transforms.FilterByCode.filter.condition: $[?(#.code == 2000)]
transforms.FilterByCode.filter.type: include
I want to filter out the messages that their code value is not 2000.
I tried to few different syntax for the condition, but could not find any that works.
All the messages are transformed, and no one is filtered.
Any ideas on how I should use the syntax for the filtering condition?
If you want to filter out all non 2000 codes, try [?(#.code != 2000)]
You may test it here - http://jsonpath.herokuapp.com/

Check If the Array contains value in Azure Data Factory

I need to process files in the container using Azure Datafactory and keep a track of processed files in the next execution.
so I am keeping a table in DB which stores the processed file information,
In ADF I am getting the FileNames of the processed files and I want to check whether the current file has been processed or not.
I am Using Lookup activity: Get All Files Processed
to get the processed files from DB by using below query:
select FileName from meta.Processed_Files;
Then I am traversing over the directory, and getting File Details for current File in the directory by using Get Metadata Activity: "Get Detail of Current File in Iteration"
and in the If Condition activity, I am using following Expression:
#not(contains(activity('Get All Files Processed').output.value,activity('Get Detail of current file in iteration').output.itemName))
This is always returning True even if the file has been processed
How do we compare the FileName from the returned value
Output of activity('Get All Files Processed').output.value
{
"count": 37,
"value": [
{
"FileName": "20210804074153AlteryxRunStats.xlsx"
},
{
"FileName": "20210805074129AlteryxRunStats.xlsx"
},
{
"FileName": "20210806074152AlteryxRunStats.xlsx"
},
{
"FileName": "20210809074143AlteryxRunStats.xlsx"
},
{
"FileName": "20210809074316AlteryxRunStats.xlsx"
},
{
"FileName": "20210810074135AlteryxRunStats.xlsx"
},
{
"FileName": "20210811074306AlteryxRunStats.xlsx"
},
Output of activity('Get Detail of current file in iteration').output.itemName
"20210804074153AlteryxRunStats.xlsx"
I often pass this type of thing off to SQL in Azure Data Factory (ADF) too, especially if I've got one in the architecture. However bearing in mind that any hand-offs in ADF take time, it is possible to check if an item exists in an array using contains, eg a set of files returned from a Lookup.
Background
Ordinary arrays normally look like this: [1,2,3] or ["a","b","c"], but if you think about values that get returned in ADF, eg from Lookups, they they look more like this:
{
"count": 3,
"value": [
{
"Filename": "file1.txt"
},
{
"Filename": "file2.txt"
},
{
"Filename": "file3.txt"
}
],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (North Europe)",
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
...
So what you've got is a complex piece of JSON representing an object (the return value of the Lookup activity plus some additional useful info about the execution), and the array we are interested in is within the value object. However it has additional curly brackets, ie it is itself an object.
Solution
So the thing to do is to pass to contains something that will look like your object which has the single attribute Filename. Use concat to create the string and json to make it authentic:
#contains(activity('Lookup').output.value, json(concat('{"Filename":"',pipeline().parameters.pFileToCheck,'"}')))
Here I'm using a parameter which holds the filename to check but this could also be a variable or output from another Lookup activity.
Sample output from Lookup:
The Set Variable expression using contains:
The result assigned to a variable of boolean type:
I tried something like this.
from SQL table, brought all the processed files as comma-separated values using select STRING_AGG(processedfile, ',') as files in lookup activity
Assign the comma separated value to an array variable (test) using split function
#split(activity('Lookup1').output.value[0]['files'],',')
meta data activity to get current files in directory
filter activity to filter the files in current directory against the processed files
items:
#activity('Get Metadata1').output.childitems
condition:
#not(contains(variables('test'),item().name))

How can I get the count of JSON array in ADF?

I'm using Azure data factory to retrieve data and copy into a database... the Source looks like this:
{
"GroupIds": [
"4ee1a-0856-4618-4c3c77302b",
"21259-0ce1-4a30-2a499965d9",
"b2209-4dda-4e2f-029384e4ad",
"63ac6-fcbc-8f7e-36fdc5e4f9",
"821c9-aa73-4a94-3fc0bd2338"
],
"Id": "w5a19-a493-bfd4-0a0c8djc05",
"Name": "Test Item",
"Description": "test item description",
"Notes": null,
"ExternalId": null,
"ExpiryDate": null,
"ActiveStatus": 0,
"TagIds": [
"784083-4c77-b8fb-0135046c",
"86de96-44c1-a497-0a308607",
"7565aa-437f-af36-8f9306c9",
"d5d841-1762-8c14-d8420da2",
"bac054-2b6e-a19b-ef5b0b0c"
],
"ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?
You can use the length function in Azure Data Factory (ADF) to check the length of json arrays:
length(json(variables('varSource')).GroupIds)
If you are loading the data to a SQL database then you could use OPENJSON, a simple example:
DECLARE #json NVARCHAR(MAX) = '{
"GroupIds": [
"4ee1a-0856-4618-4c3c77302b",
"21259-0ce1-4a30-2a499965d9",
"b2209-4dda-4e2f-029384e4ad",
"63ac6-fcbc-8f7e-36fdc5e4f9",
"821c9-aa73-4a94-3fc0bd2338"
],
"Id": "w5a19-a493-bfd4-0a0c8djc05",
"Name": "Test Item",
"Description": "test item description",
"Notes": null,
"ExternalId": null,
"ExpiryDate": null,
"ActiveStatus": 0,
"TagIds": [
"784083-4c77-b8fb-0135046c",
"86de96-44c1-a497-0a308607",
"7565aa-437f-af36-8f9306c9",
"d5d841-1762-8c14-d8420da2",
"bac054-2b6e-a19b-ef5b0b0c"
],
"ResourceIds": []
}'
SELECT *
FROM OPENJSON( #json, '$.GroupIds' )
SELECT COUNT(*) countOfGroupIds
FROM OPENJSON( #json, '$.GroupIds' );
My results:
If your data is stored in a table the code is similar. Make sense?
Another funky way to approach it if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds and would allow you to get the result inside the pipeline.
The Data Factory XML function converts JSON to XML, but that JSON must have a single root property. We can fix up the json with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
#concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
#string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
My results:
Easy huh. It's a shame there isn't more built-in support for JPath unless I'm missing something, although you can use limited JPath in the Copy activity.
You can use Data flow activity in the Azure data factory pipeline to get the count.
Step1:
Connect the Source to JSON dataset, and in Source options under JSON settings, select single document.
In the source preview, you can see there are 5 GroupIDs per ID.
Step2:
Use flatten transformation to deformalize the values into rows for GroupIDs.
Select GroupIDs array in Unroll by and Unroll root.
Step3:
Use Aggregate transformation, to get the count of GroupIDs group by ID.
Under Group by: Select a column from the drop-down for your aggregation.
Under Aggregate: You can build the expression to get the count of the column (GroupIDs).
Aggregate Data preview:
Step4: Connect the output to Sink transformation to load the final output to database.

Azure Data Factory - Copy Activity - rest api collection reference

Helo eveyone,
I am fairly new to Data Factory and I need to copy information from Dynamics Business Central's Rest API. I am struggling with the "Details" type entities such as "invoiceSalesHeader".
The api for that entity forces me to provide a header ID as a filter. In that sense, I would have to loop x times (a few thousand) and call the Rest API to retreive the lines of each sales invoice. I find that completely ridiculous and am trying to find other ways to get the information.
To avoid doing that, I am trying to get the information by calling the "salesInvoice" entity and use "$expand=salesInvoiceLines".
That gets me the information I need but inside data factory's Copy Activity, I am struggling with what I should put as a "collection reference" so that I end up with one row per salesInvoiceLine.
The data returned is an array of sales invoices with a sub array of invoice lines.
If I select "salesInvoiceLines" as the collection reference, I end up with "$['value'][0]['salesInvoiceLines']" and that only gives me the lines for the first invoice (since there is an index of zero).
What should I put in Collection Reference so that I get one row per salesInvoiceLine
It is not support to foreach nested json array in ADF.
Alternatively, we can use a Flattern activity in data flow to flatten the nested json array.
Here is my example:
This is my example json data, the structure is like yours:
[
{
"id": 1,
"Value": "January",
"orders":[{"orderid":1,"orderno":"qaz"},{"orderid":2,"orderno":"edc"}]
},
{
"id": 2,
"Value": "February",
"orders":[{"orderid":3,"orderno":"wsx"},{"orderid":4,"orderno":"rfv"}]
},
{
"id": 3,
"Value": "March",
"orders":[{"orderid":5,"orderno":"rfv"},{"orderid":6,"orderno":"tgb"}]
},
{
"id": 11,
"Value": "November",
"orders":[{"orderid":7,"orderno":"yhn"},{"orderid":8,"orderno":"ujm"}]
}
]
In the dataflow, we can select the header of the nested json array, here is orders:
Then we can see the result, we have transposed the JSON orders array with 2 objects (orderid, orderno) into 8 flatten rows:

How do I use ADF copy activity with multiple rows in source?

I have source which is JSON array, sink is SQL server. When I use column mapping and see the code I can see mapping is done to first element of array so each run produces single record despite the fact that source has multiple records. How do I use copy activity to import ALL the rows?
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['#odata.context']": "BuyerFinancing",
"['#odata.nextLink']": "PropertyCondition",
"value[0].AssociationFee": "AssociationFee",
"value[0].AssociationFeeFrequency": "AssociationFeeFrequency",
"value[0].AssociationName": "AssociationName",
Use * as the source field to indicate all elements in json format. For example, with json:
{
"results": [
{"field1": "valuea", "field2": "valueb"},
{"field1": "valuex", "field2": "valuey"}
]
}
and a database table with a column result to store the json. The mapping with results as the collection and * and the sub element will create two records with:
{"field1": "valuea", "field2": "valueb"}
{"field1": "valuex", "field2": "valuey"}
in the result field.
Copy Data Field Mapping
ADF support cross apply for json array. Please check the example in this doc. https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#jsonformat-example
For schema mapping: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#schema-mapping