In an Azure Data Factory pipeline, I have a ForEach1 loop over Databricks activities. Those Databricks activities output arrays of different sizes. I would like to union those arrays and pass the result to another loop, ForEach2, so that every element of every array becomes an item in ForEach2.
How could I collect output arrays from ForEach1 into one big array? I've tried Append Variable Activity, but got the following error:
The value of type 'Array' cannot be appended to the variable of type 'Array'.
The action type 'AppendToArrayVariable' only supports values of types 'Float, Integer, String, Boolean, Object'.
Is there a way to union/merge arrays inside the ForEach1 loop? Or is there some other way to pass the arrays to ForEach2 so that each element of each array is treated as a separate item and ForEach2 loops over every one of them?
There is a collection function called union() in Azure Data Factory which takes 2 arguments (both of type array or object). This can be used to achieve your requirement. You can follow the example below, which I have tried with a Get Metadata activity instead of a Databricks Notebook activity.
I have a container called input with 2 folders, a and b, in it, and these folders contain some files. Using a similar approach, I am appending the array of child items (file names) generated by the Get Metadata activity in each iteration to get a list of all file names (one big array). The following is my folder structure inside which the files are present:
First, I used a Get Metadata activity to get the names of the folders inside the container.
I used @activity('folder_names').output.childItems as the Items value in the ForEach activity. Inside the ForEach, I used another Get Metadata activity to get the child items of each folder (I created a dataset and gave a dynamic value for the folder name in the path).
You can use the procedure below to achieve the requirement:
I have given the output of the 2nd Get Metadata activity (files in folder) to a Set Variable activity. I created a new variable current_file_list (array type) and gave its value as:
@union(variables('list_of_files'),activity('files in folder').output.childItems)
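For example (illustrative values), if list_of_files currently holds [{"name":"f1.csv","type":"File"}] and the current folder's childItems is [{"name":"f2.csv","type":"File"}], the union evaluates to one array containing both objects:
[{"name":"f1.csv","type":"File"},{"name":"f2.csv","type":"File"}]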
Note: Instead of activity('files in folder').output.childItems in the union above, you can use the array returned by your Databricks activity in ForEach1 for each iteration.
The list_of_files is another array-type variable (set using another Set Variable activity) which is the final array containing all elements as one big array. I am assigning this variable's value as @variables('current_file_list').
This indirectly means that the list_of_files value is the union of the previously collected elements and the current folder's child item array.
Reason: if we use one variable (say list_of_files) and give its value as @union(variables('list_of_files'),activity('files in folder').output.childItems), it throws an error. We cannot self-reference a variable in Azure Data Factory dynamic content, so we need to create 2 variables to overcome this.
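Adapted to the Databricks scenario, the two Set Variable activities inside ForEach1 would look roughly like this (the activity name notebook1 and the output property runOutput are placeholders; use whatever your Databricks activity actually returns as its array):
Set Variable 1 sets current_file_list (Array) to: @union(variables('list_of_files'), activity('notebook1').output.runOutput)
Set Variable 2 sets list_of_files (Array) to: @variables('current_file_list')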
When I debug the pipeline, it gives the expected output, and we can see the output produced after each activity.
The following are the output images:
Input for 'list of files in current folder' in the first iteration:
Input for 'list of files in current folder' in the second iteration:
Final list_of_files array with all elements in single array:
You can follow the above approach to build the combined array and use it for your ForEach2 activity.
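The Items value of ForEach2 would then simply be the final variable:
@variables('list_of_files')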
Related
A question concerning Azure Data Factory.
I need to persist the iterator value from a Lookup activity (an Id column from a SQL table) to my sink together with other values.
How do I do that?
I thought that I could just reference the iterator value as @{item().id} as the source for a destination column in my SQL table sink. That doesn't seem to work; the resulting value in the destination column is NULL.
I have used 2 Lookup activities, one for the id values and the other for the remaining values. To combine and insert these values into the sink table, I have used the following.
The output of the 'ids' Lookup activity is as follows:
I have one more column to combine with the above id values. The following is the Lookup output for that:
I have given the following dynamic content as the Items value in the ForEach:
@range(0,length(activity('ids').output.value))
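This makes the ForEach iterate over row indexes rather than rows: for example, if the 'ids' lookup returns 3 rows, @range(0,3) evaluates to [0,1,2], so @item() is the current row index in each iteration.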
Inside the ForEach activity, I have given the following Script activity query to insert the data as required into the sink table:
insert into t1 values(@{activity('ids').output.value[item()].id},'@{activity('remaining rows').output.value[item()].gname}')
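For instance, if the first row of the 'ids' lookup had id 1 and the first row of 'remaining rows' had gname 'abc' (illustrative values only), the query sent in the first iteration (where @item() is 0) would expand to:
insert into t1 values(1,'abc')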
The data is inserted successfully, and the following is a reference image of the same:
I'm trying to loop over data in a SQL table, but when I try to use the value inside a ForEach loop action using @item(), I get the error:
"Failed to convert the value in 'table' property to 'System.String' type. Please make sure the payload structure and value are correct."
So the row value can't be converted to a string.
Could that be my problem? And if so, what can I do about it?
Here is the pipeline:
I reproduced the above scenario with a SQL table containing table names in the Lookup and CSV files from ADLS Gen2 in the Copy activity, and got the same error.
The above error arises when we pass the lookup output array items directly into a string parameter inside the ForEach.
If we look at the below lookup output,
The above value array is not a normal array; it is an array of objects. So @item() in the 1st iteration of the ForEach is one object, { "tablename": "sample1.csv" }. But our parameter expects a string value, and that's why it gives the above error.
To resolve this, use @item().tablename, which will give the table name in every iteration inside the ForEach.
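As a sketch (the lookup activity name Lookup1 is a placeholder, and tablename is the column name from the output above), the lookup output has this shape:
{ "count": 2, "value": [ { "tablename": "sample1.csv" }, { "tablename": "sample2.csv" } ] }
With the ForEach Items set to @activity('Lookup1').output.value, @item() is the whole object in each iteration, while @item().tablename is just the string sample1.csv (or sample2.csv), which is what the string parameter expects.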
My repro for your reference:
I have given the same in the sink as well, and this is my output.
Pipeline Execution
Copied data in target
I'm trying to save some values in a SQL table and then loop over the values to use each value as a path in an OData source.
First, I have defined an array variable in which to save the values:
Then the variable is set to @activity('Lookup1').output.value
Now the data is accessed from the ForEach.
Inside the ForEach loop I have a Copy activity where the OData source should be set to the value.
But I don't have access to the item. Why is that?
The above approach will work for you when you debug the pipeline, even though it gives a warning in the dataset.
The dataset's dynamic content doesn't know about the ForEach @item() at first, because @item() belongs to the pipeline's dynamic content. That's why it gives a warning in the dataset.
But at debug time, it resolves the @item() value.
Please go through the below 2 scenarios to understand it better.
Here I am using ADLS as source and target, with an array of file names as a sample, passed to the ForEach.
These are my source files.
I have created an array variable with the above names, ["sample1.csv","sample2.csv"], and passed it to the ForEach.
Using @item() in the dataset:
Source dataset and target dataset.
You can see it gives the same warning, but it will give the correct result when you debug. In the dataset preview, however, it will give the error.
Copied files to target successfully.
Using @item() inside the ForEach with dataset parameters:
I have created the parameters and used them in the datasets.
Source:
Target:
Copy activity inside ForEach:
Source parameter @item():
Sink parameter @item():
Files copied to target successfully.
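As a sketch of this parameterized approach (the parameter name filename is only an illustration): define a String parameter filename on both the source and sink datasets and use @dataset().filename as the file name in each dataset's file path. Then, in the Copy activity inside the ForEach, set the datasets' filename property to @item(). Because the dataset itself no longer references @item() directly, the warning goes away.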
I want to do some activity in an ADF pipeline, but only if a field in a JSON output is present. What kind of ADF expression can I use to check that?
I set up two JSON files for testing, one with a firstName attribute and one without:
I then created a Lookup activity to get the contents of the JSON file and a Set Variable activity for testing the expression. I often use these to try out expressions, and it's a good way to test and view expression results iteratively:
I then created a Boolean variable (which is one of the datatypes supported by Azure Data Factory and Synapse pipelines) and the expression I am using to check the existence of the attribute is this:
@bool(contains(activity('Lookup1').output.firstRow, 'firstName'))
You can then use that boolean variable in an If activity, to execute subsequent activities conditionally based on the value of the variable.
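For example (the file contents and variable name are illustrative), if the lookup's firstRow is { "firstName": "John", "lastName": "Smith" }, contains(...) returns true; for the file without the attribute it returns false. The If Condition activity's expression would then simply be @variables('attributeExists'), with the True and False branches holding the activities to run in each case.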
How can I create an array of columns from an array of column names in dataflow?
The following creates an array of sorted column names, with the exception of the last column:
sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
I want to get an array of the columns for this array of column names. I tried this:
toString(byNames(sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))))
But I keep getting the error:
Column name function 'byNames' does not accept column or argument parameters
Please can anyone help me with a workaround for this?
Update:
It seems that using columnNames() in any way (directly or assigned to a parameter) leads to an error, because at runtime on Spark it is fed to the byNames() function. Since there is no way to re-introduce it as a parameter or assign it to a variable directly in the Data Flow, the following works for me:
Have an empty string-array-type parameter in the Data Flow.
Use the sha2 function as usual in a derived column with the parameter: sha2(256,byNames($cols))
Create a pipeline; there, use Get Metadata to get the Structure, from which you can get the column names.
Inside a ForEach activity, append each column name to a variable.
Next, connect to the Data Flow and pass the variable containing the column names, as sketched below.
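A minimal sketch of the pipeline side (the activity, variable and parameter names Get Metadata1, column_names and cols are placeholders): set the Get Metadata activity's field list to Structure, use @activity('Get Metadata1').output.structure as the ForEach Items, append @item().name to an array variable column_names inside the loop with an Append Variable activity, and pass @variables('column_names') to the Data Flow's cols parameter.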
The documentation for the byNames function states 'Computed inputs are not supported but you can use parameter substitutions'. This explains why you should use a parameter as the input that creates the array used in the byNames function.
Example, where the $cols parameter holds the list of columns:
sha2(256,byNames(split($cols,',')))
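Here $cols would be a string parameter holding a comma-separated list of column names such as "col1,col2,col3" (illustrative), typically supplied from the calling pipeline, for example with @join(variables('column_names'), ',') if the column names were collected into an array variable in the pipeline (names illustrative).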
You can use computed column names as input by creating the array prior to using it in the function. Instead of building the expression in-line in the function call, set the column values in a parameter first and then use that parameter in your function directly afterwards.
For a parameter $cols of type array:
$cols = sort(slice(columnNames(), 1, size(columnNames()) - 1), compare(#item1, #item2))
toString(byNames($cols))
Refer: byNames