How to pass the output of Get Metadata activities and use it for file name comparison in a Databricks notebook - azure-data-factory

I have two Get Metadata activities in ADF that fetch file names from two different folders. I need to use their outputs for file name comparison in a Databricks notebook.
How do I pass the output of the Get Metadata activities to Databricks, perform a string comparison, and return true if all the files are present and false if even one file is missing?

The answer below uses one Get Metadata activity; the same approach can be replicated for more than one.
Create an ADF pipeline containing a Get Metadata activity and a Databricks Notebook activity.
In the Get Metadata activity, add childItems as an argument in the Field list so that the file names are included in the output passed to the notebook.
In the Databricks Notebook activity, add the expression below as a base parameter (named metadata_output here, to match the widget read in the notebook). It captures the output of Get Metadata and passes it to the notebook as an input parameter. This value is normally of object type, but I converted it to a string so that the file names can be read in the notebook.
@string(activity('Get Metadata1').output.childItems)
The Get Metadata output can now be accessed as a string in the notebook.
import ast

# File names that must be present; compared against the Get Metadata output.
required_filenames = ['File1.csv', 'File2.csv', 'File3.csv']
# Read the Get Metadata output passed from ADF through a Databricks widget.
metadata_value = dbutils.widgets.get('metadata_output')
# Convert the string back into a list of dictionaries.
metadata_list = ast.literal_eval(metadata_value)
# Collect the file names returned by the Get Metadata activity.
blob_output_list = []
for i in metadata_list:
    blob_output_list.append(i['name'])
# True only if every required file name is present, otherwise False.
validateif = all(item in blob_output_list for item in required_filenames)
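The same check can be extended to two Get Metadata outputs. The following is only a sketch under stated assumptions: the helper name missing_files and the two sample strings are hypothetical, and it assumes that string() on childItems yields a JSON array of objects with a "name" key, so json.loads is used instead of ast.literal_eval.

```python
import json


def missing_files(metadata_json, required_filenames):
    """Return the required file names absent from a childItems JSON string."""
    child_items = json.loads(metadata_json)
    present = {item["name"] for item in child_items}
    return sorted(name for name in required_filenames if name not in present)


# Hypothetical sample values standing in for the two widget strings.
folder1_output = (
    '[{"name":"File1.csv","type":"File"},'
    '{"name":"File2.csv","type":"File"},'
    '{"name":"File3.csv","type":"File"}]'
)
folder2_output = (
    '[{"name":"File1.csv","type":"File"},'
    '{"name":"File2.csv","type":"File"}]'
)
required = ["File1.csv", "File2.csv", "File3.csv"]

missing_1 = missing_files(folder1_output, required)
missing_2 = missing_files(folder2_output, required)

# True only when neither folder is missing a required file.
all_present = not missing_1 and not missing_2
print(missing_1, missing_2, all_present)
```

In a Databricks notebook the two strings would come from dbutils.widgets.get(...), and the result could be handed back to the pipeline with dbutils.notebook.exit(str(all_present)).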
I tried the approach above and it met the stated requirement. Hope this helps.
Please upvote the answer if it helps with your requirement.

Related

What is the equivalent to Kusto's CountOf() function in Azure Data Factory?

My requirement is to extract a string from file names using an ADF variable. I need to extract everything up to the final underscore '_', and the number of underscores varies between file names, as in the examples below.
abc_xyz_20221221.txt --> abc_xyz
abc_xyz_a1_20221221.txt --> abc_xyz_a1
abc_c_ab_a1_20221221.txt --> abc_c_ab_a1
abc_c_ab_a1_a11_20221221.txt --> abc_c_ab_a1_a11
I tried using indexof() to get the position of the final underscore, but it does not accept negative values. The logic below works in KQL (Azure Data Explorer) but fails in ADF because ADF has no countof() function. Is there an equivalent function in ADF, or can you suggest how to achieve the same result in ADF?
substring("abc_xyz_20221221.txt", 0,
indexof("abc_xyz_20221221.txt", "_", 0,
strlen("abc_xyz_20221221.txt"),
countof("abc_xyz_20221221.txt", '_')))
You can also do this with split and join inside a ForEach activity.
Array for ForEach activity:
["abc_xyz_20221221.txt","abc_xyz_a1_20221221.txt","abc_c_ab_a1_20221221.txt","abc_c_ab_a1_a11_20221221.txt"]
Append variable inside ForEach:
@join(take(split(item(), '_'),add(length(split(item(), '_')),-1)),'_')
Result in an array variable:
As mentioned by @Joel Cochran, use the expression below with lastIndexOf() in the Append Variable activity inside the ForEach.
@substring(item(),0,lastindexof(item(),'_'))
This is just a simpler form of what @Rakesh called out above. The only difference is that his implementation iterates. In my case the file name is stored in a variable named foo.
@substring(variables('foo'),0,lastindexof(variables('foo'),'_'))
output
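To see what the two expressions compute, the logic can be mirrored outside ADF. This is a Python sketch of the string logic only, not ADF syntax; the function names are hypothetical, and it assumes every file name contains at least one underscore.

```python
file_names = [
    "abc_xyz_20221221.txt",
    "abc_xyz_a1_20221221.txt",
    "abc_c_ab_a1_20221221.txt",
    "abc_c_ab_a1_a11_20221221.txt",
]


def prefix_by_split(name):
    # Mirrors join(take(split(name, '_'), length - 1), '_').
    parts = name.split("_")
    return "_".join(parts[: len(parts) - 1])


def prefix_by_last_index(name):
    # Mirrors substring(name, 0, lastIndexOf(name, '_')).
    return name[: name.rfind("_")]


results = [prefix_by_last_index(name) for name in file_names]
for name, prefix in zip(file_names, results):
    print(name, "-->", prefix)
```

Both helpers give the same prefix for these four names, which matches the mapping listed in the question.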

Azure Data Factory, If Activity expression with array element

I have an array variable HeaderList with a list of names, and a Lookup activity that reads a CSV file header. An If Condition activity then compares the first element. The expression in the If Condition activity is:
@equals(activity('Lookup2').output.firstRow.Prop_0,variables('HeaderList')[0])
That does not work. If I change it to this:
@equals(activity('Lookup2').output.firstRow.Prop_0,'XYZ')
then it works. How do I reference an array element in an expression?
Thanks
@equals(activity('Lookup2').output.firstRow.Prop_0,variables('HeaderList')[0])
What does it mean?
I got the same error in the If Condition activity, but the pipeline did not throw any error when it was debugged. I reproduced this in my ADF environment with the steps below.
A Lookup activity refers to a CSV file.
An array variable 'HeaderList' is created and its values are set with a Set Variable activity.
An If Condition activity is added with the expression below as dynamic content.
@equals(activity('Lookup1').output.firstRow.prop_0,variables('HeaderList')[0])
The same error is produced.
Error: Cannot fit unknown into function parameter any.
When the pipeline is debugged, it does not throw any error and runs successfully.

Dataset Empty parameter value

I have an XML dataset and want to parameterize the compression type so that the same pipeline handles both .xml and .xml.gz files:
When I set the compression type to 'gzip', it reads the .xml.gz file. What value should I use to read an uncompressed .xml file? The parameter does not accept an empty value, and the .xml file can only be read when I delete the compression_type parameter.
You should pass "None" and it should work.
I feel "None" is more of a workaround in this particular case. "None" is still a string value, not empty.
In my scenario right now, I have an Excel dataset. I want to make every parameter as generic as possible, including the file path/name, sheet name, and range. The "Range" value under the Connection tab allows an empty value. However, if I specify it as @dataset().DataRange and leave my parameter DataRange empty, I cannot preview the data or submit the pipeline because it complains that the value cannot be empty.

Azure-data-Factory Copy data If a certain file exists

I have many files in a blob container, but I want to run a stored procedure only if a certain file (e.g. SRManifest.csv) exists in the container. I used Get Metadata and If Condition activities in Data Factory. Can you please help me with the dynamic expression for this? I tried the following, but it doesn't work:
@bool(startswith(activity('Get Metadata1').output.childitems.ItemName,'SRManifest.csv'))
Then I tried @greaterOrEquals(activity('Get Metadata1').output.LastModified,adddays(utcnow(),-2)), but this checks whether the blob was last modified within the past 2 days, not whether the file exists. Thank you.
Please see my diagram below.
I think I have understood your requirement differently.
I wanted to run a Stored procedure only IF a certain file (e.g. SRManifest.csv) exists on the blob Container
1. Change your Get Metadata activity to look for the existence of the sentinel file (SRManifest.csv).
2. Follow it with an If Condition activity, using this condition:
3. Put your stored procedure in the True branch of the If Condition activity.
If you also need the file list passed to the stored procedure, you will need a Get Metadata activity with the childItems option inside the True branch of the If Condition activity.
Based on your diagram, since you are looping over all the blob names already, you can add a Boolean variable to the pipeline and set its default value to false:
Inside the ForEach activity, you only want to attempt to set the variable if the value is still false, and if the blob name is found, set it to true. Since Set Variable cannot be self-referential, do this inside the False branch of an If activity:
This will only attempt to process if the value is false (so the file name has not been found yet), and will do nothing if the value is true. Now set the variable based on your file name:
[NOTE: This value can be hard coded, parameterized, or based on a variable]
When you execute the pipeline, you'll see the Set Variable stops attempting once the value is set to true:
In the main pipeline, after the ForEach activity has completed, you can use the variable to set the condition of your final If activity. If the blob is never found, it will still be false, so put the Stored Procedure activity inside the True branch.
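The flag pattern described above can be sketched in Python to make the control flow explicit. This only illustrates the logic, not ADF syntax; the function name and the sample childItems list are hypothetical.

```python
def blob_found(child_items, target_name):
    # Pipeline Boolean variable, default value false.
    found = False
    # ForEach over the blob names returned by Get Metadata.
    for item in child_items:
        # If activity: only attempt the assignment while the flag is still false.
        if not found:
            # Set Variable: becomes true when the blob name matches.
            found = item["name"] == target_name
    return found


# Hypothetical Get Metadata childItems output.
sample_items = [
    {"name": "Orders.csv", "type": "File"},
    {"name": "SRManifest.csv", "type": "File"},
    {"name": "Returns.csv", "type": "File"},
]

run_stored_procedure = blob_found(sample_items, "SRManifest.csv")
print(run_stored_procedure)
```

Once the flag is true, later iterations skip the assignment, which corresponds to the Set Variable activity no longer being attempted in the pipeline run.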

Unable to pass dynamic and unique date values in JMeter

I have a request payload (JSON format) containing an array of 1000 objects. Each object has 6 key-value pairs: 5 are read from a CSV file using parameterization, and the 6th must be a unique future date for each object in the array.
I tried the time-shift function, which works for 1 iteration, but I want to execute it for n iterations.
I looked into Groovy code for this, but I have no knowledge of Groovy and have only started learning it.
How can I achieve this in JMeter?
Also, when the time-shift function is read from the Parameters of HTTP Request Defaults or from the User Defined Variables of the Test Plan, it does not produce a different date for each object; it repeats the date of the first variable in every object.
{
  "deviceNumber": "XX",
  "array": [
    {
      "keyValue1": "${value1_ReadFromCSV}",
      "keyValue2": "${value2_ReadFromCSV}",
      "keyValue3": "${value3_ReadFromCSV}",
      "keyValue4": "${value4_ReadFromCSV}",
      "keyValue5": "${value5_ReadFromCSV}",
      "keyValue6": "2020-05-23" (Should be dynamically generated)
    },
    {
      "keyValue7": "value7_ReadFromCSV",
      "keyValue8": "value8_ReadFromCSV",
      "keyValue9": "value9_ReadFromCSV",
      "keyValue10": "value10_ReadFromCSV",
      "keyValue11": "value11_ReadFromCSV",
      "keyValue12": "2020-05-24" (Should be dynamically generated)
    },
    .
    .
    .
    .
    {
      "keyValue995": "value995_ReadFromCSV",
      "keyValue996": "value996_ReadFromCSV",
      "keyValue997": "value997_ReadFromCSV",
      "keyValue998": "value998_ReadFromCSV",
      "keyValue999": "value999_ReadFromCSV",
      "keyValue1000": "2025-12-31" (Should be dynamically generated)
    }
  ]
}
I have a partial solution: reading the CSV file line by line and storing each line in a variable using Groovy. However, I don't want to store the line directly in the variable; I want to create a JSON object like the one above from each line of the CSV file, with a unique future date for each object in the array.
The CSV file is as follows. (Note: I have removed the date column from the CSV file because I no longer need it.)
deviceNumber,keyValue1,keyValue2,keyValue3,keyValue4,keyValue5,keyValue7,keyValue8,keyValue9,keyValue10,keyValue11,keyValue12,keyValue13,keyValue15,keyValue15,keyValue16
01,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring
02,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring
03,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring
.
.
.
1000,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring,somestring
Kindly suggest any reference or example for doing this.
I can provide only generic instructions:
You can dynamically construct the request body using a JSR223 PreProcessor.
You can read the CSV file into memory using the File.readLines() function.
You can build JSON from the CSV values using the JsonBuilder class.
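The three steps above can be sketched as follows. This is written in Python purely to show the logic; inside JMeter the equivalent would be written in Groovy in a JSR223 PreProcessor. The function name, the sample CSV columns, the date key name, and the start date are all assumptions made for illustration.

```python
import csv
import io
import json
from datetime import date, timedelta


def build_payload(csv_text, device_number, start, date_key="dateKey"):
    """Build the request body: one object per CSV row, each with its own date."""
    rows = csv.DictReader(io.StringIO(csv_text))
    objects = []
    for offset, row in enumerate(rows, start=1):
        obj = dict(row)
        # Each row gets a different date: start + 1 day, start + 2 days, ...
        obj[date_key] = (start + timedelta(days=offset)).isoformat()
        objects.append(obj)
    return {"deviceNumber": device_number, "array": objects}


# Hypothetical two-row CSV; a real test would read the file from disk.
sample_csv = "keyValue1,keyValue2\na1,a2\nb1,b2\n"

# A fixed start date keeps the example repeatable; date.today() would give
# dates in the future relative to the test run.
payload = build_payload(sample_csv, "XX", date(2020, 5, 22))
print(json.dumps(payload, indent=2))
```

Deriving each date from the row index is what guarantees uniqueness within one request; a different start value per iteration would keep the dates unique across iterations as well.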
More information:
Apache Groovy - Parsing and producing JSON
Apache Groovy - Why and How You Should Use It