Azure Data Factory: Copy data if a certain file exists

I have many files in a blob container. However, I want to run a stored procedure only if a certain file (e.g. SRManifest.csv) exists in the blob container. I used Get Metadata and If Condition activities in Data Factory. Can you please help me with the dynamic expression for this? I tried @bool(startswith(activity('Get Metadata1').output.childitems.ItemName, 'SRManifest.csv')), but it doesn't work.
Then I thought, what if I used @greaterOrEquals(activity('Get Metadata1').output.lastModified, addDays(utcnow(), -2))? But this checks whether the blob was last modified within 2 days, not whether the file exists. Thank you.
Please see my diagram below.

I think I have understood your requirement differently:
"I wanted to run a Stored procedure only IF a certain file (e.g. SRManifest.csv) exists on the blob Container"
1. Change your Get Metadata activity to look for the existence of the sentinel file (SRManifest.csv).
2. Follow with an If Condition activity; use the expression shown after this list as its condition.
3. Put your stored procedure in the True branch of the If activity.
If you also need the file list passed to the stored procedure, then you'll need a Get Metadata activity with the childItems option inside the If-True branch.
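A minimal sketch of that condition, assuming the metadata activity is named 'Get Metadata1', its Field list includes 'Exists', and its dataset points at SRManifest.csv:
@activity('Get Metadata1').output.exists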

Based on your diagram, since you are looping over all the blob names already, you can add a Boolean variable to the pipeline and set its default value to false:
Inside the ForEach activity, you only want to attempt to set the variable if the value is still false, and if the blob name is found, set it to true. Since Set Variable cannot be self-referential, do this inside the False branch of an If activity:
This will only attempt to process if the value is false (so the file name has not been found yet), and will do nothing if the value is true. Now set the variable based on your file name (see the expression sketch at the end of this answer):
[NOTE: This value can be hard coded, parameterized, or based on a variable]
When you execute the pipeline, you'll see the Set Variable stops attempting once the value is set to true:
In the main pipeline, after the ForEach activity has completed, you can use the variable to set the condition of your final If activity. If the blob is never found, it will still be false, so put the Stored Procedure activity inside the True branch.
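A minimal sketch of the expressions this pattern uses, assuming a Boolean pipeline variable named FileFound (the variable name is an assumption) and a ForEach over @activity('Get Metadata1').output.childItems:
Inner If condition (inside the ForEach): @variables('FileFound') , with the Set Variable activity in its False branch
Set Variable value for FileFound: @equals(item().name, 'SRManifest.csv')
Final If condition after the ForEach: @variables('FileFound')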

Related

How to build OR-behavior into Azure Synapse pipelines?

I have a pipeline in Synapse that reloads data from a number of REST endpoints. It looks as follows:
What I want to achieve is that the 'CheckAccessTokenValidity' activity checks whether my API token is still valid. If yes, all is fine, proceed to the ForEachDivision activity. If no, refresh the tokens, then proceed to the ForEachDivision activity.
I thought I had implemented this logic in the screenshot provided. However, this is not the case. My pipeline is now deadlocked because ForEachDivision apparently expects all three preceding activities to be successful before running (which is impossible by design).
How do I implement the logic described above in Azure Synapse?
You can modify your pipeline to accommodate the following changes:
1. Create a variable 'var1' and set its default value to 'False'.
2. After 'CheckAccessTokenValidity', add a Set Variable activity 'set variable1' pointing to 'var1' and set its value to 'True'.
3. After 'SetRefreshToken', add another Set Variable activity 'set variable2' pointing to the same 'var1' and set its value to 'True', so that either step 2 or step 3 can turn the value of the variable to 'True'.
4. Remove all activities from the ForEach block onwards from this pipeline, and cut-paste those activities into a new pipeline, 'pipeline2'.
5. After the 'set variable2' activity, attach an If activity with both a 'Success' and a 'Skipped' conditional path. If step 2 is successful, 'set variable2' will be skipped and the If activity (step 6) will still be called. If step 2 fails and step 3 is called, it will call 'set variable2', and after its success the If activity (step 6) will run.
6. In this If activity, write the condition to check the value of var1 (a sketch follows below). If var1 == 'True', then execute 'pipeline2'.
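A minimal sketch of that condition, assuming var1 is a String variable holding 'True'/'False':
@equals(variables('var1'), 'True')
The True branch of the If activity then contains an Execute Pipeline activity that calls 'pipeline2'.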

How to pass the outputs from the Get Metadata stage and use them for file name comparison in a Databricks notebook

I have 2 Get Metadata stages in ADF which are fetching file names from 2 different folders. I need to use these outputs for file name comparison in a Databricks notebook and return true if all the files are present.
How do I pass the output from the Get Metadata stages to Databricks, perform the string comparison, and return true if all files are present and false if even one file is missing?
Please find the answer below, which I explained with one Get Metadata stage; the same can be replicated for more than one.
Create an ADF pipeline with the below activities.
Now in the Get Metadata activity, add childItems in the Field list as an argument, to pass the output of Get Metadata to the notebook as shown below.
In the Databricks Notebook activity, add the below expression as a Base Parameter, which will capture the output of Get Metadata and pass it as an input parameter to the notebook (the parameter name must match the widget name read in the notebook, here metadata_output). Generally this parameter will be of object datatype, but I converted it to string datatype to access the names of the files in the notebook, as shown below.
@string(activity('Get Metadata1').output.childItems)
Now we are able to access the Get Metadata output as a string in the notebook.
import ast

required_filenames = ['File1.csv','File2.csv','File3.csv']  # The list to compare against the output we get from the Get Metadata activity.
metadata_value = dbutils.widgets.get('metadata_output')  # Access the Get Metadata output passed in as a base parameter, using Databricks widgets.
metadata_list = ast.literal_eval(metadata_value)  # Convert the string back into a list of dictionaries.
blob_output_list = []  # Empty list to collect the names of the files returned by the Get Metadata activity.
for i in metadata_list:
    blob_output_list.append(i['name'])  # Add each blob file name to the list created above.
validateif = all(item in blob_output_list for item in required_filenames)  # True only if every required file name is present in the blob list.
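If the true/false result also needs to flow back to the pipeline (an assumption here, since the question only asks the notebook to return it), the notebook can exit with the value so the calling activity can read it:
dbutils.notebook.exit(str(validateif))  # Return 'True'/'False' to the calling Databricks Notebook activity (an addition, not part of the original answer)
In ADF this can then be read with an expression like @activity('Notebook1').output.runOutput, assuming the Notebook activity is named 'Notebook1'.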
I tried it in the above way and was able to solve the provided requirement. Hope this helps.

Dataset Empty parameter value

I have an XML dataset. I want to parameterize the compression type to treat .xml and .xml.gz files with the same pipeline:
When I put the 'gzip' value in the compression type it reads the .xml.gz file. I want to know what value I should put to read an uncompressed .xml file, because it does not accept an empty value. It is only able to read the .xml file when I delete the compression_type parameter.
You should pass "None" and it should work.
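As a sketch of how this can be wired up (the parameter name compression_type comes from the question; the rest is an assumed setup): give the dataset a string parameter compression_type, set the dataset's compression type property to the dynamic content @dataset().compression_type, and pass 'gzip' for .xml.gz files and 'None' for plain .xml files from the pipeline.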
I feel "None" is more of a workaround in this particular case. "None" is still a string value, not empty.
In my scenario right now, I have an Excel dataset. I want to make every parameter as generic as possible, including the file path/name, sheet name, and the range. The value of "Range" under Connection tab allows empty value. However if I specify it as #dataset().DataRange and leave my parameter DataRange empty, I cannot preview the data or submit the pipeline because it complains that the value cannot be empty.

Debugging values into variables or user properties

How can I spy on my values when I'm in ADF debug mode?
I want to build a simple pipeline that digs into a storage account table. For each row, take the value of the second column, use it to create a URL, and call a web service.
I saw the output of the Lookup activity, but how can I see, for example, the content of each item() inside the ForEach activity? Can I use the user properties for debugging?
When debugging, I frequently make use of the Set Variable activity. Viewing the output of a Set Variable activity is effectively spying on the value.
You want to see the input to each iteration of your ForEach. Prepend the inner activity with a Set Variable activity. The dynamic content @string(item()) should be enough.

Talend: How to pass the last modified file into tFileInputDelimited?

I have searched all over, and read this post.
But it doesn't seem complete and doesn't work.
The situation: I need to get the last modified file from a directory on the local machine. I then need to pass that file into the tFileInputDelimited component.
I currently have:
tFileList --> iterate --> tIterateToFlow --> tSampleRow --> tFlowToIterate --> tFileInputDelimited --> tLogRow (just to make sure it's pulling the right file)
But it doesn't work. I have configured it so that tIterateToFlow has a column called
"FileName" with ((String)globalMap.get("CURRENT_FILE")) as the value,
"FileDirectory" with ((String)globalMap.get("CURRENT_FILEDIRECTORY")) as the value, and
"FileAndDirectory" with ((String)globalMap.get("CURRENT_FILEPATH")) as the value.
The tSampleRow is limited to "1".
The tFlowToIterate is set so that
"FileNameOnly" is the value of "FileName",
"FileDirectoryOnly" is "FileDirectory", and
"FilePathComplete" is "FileAndDirectory".
In the file location field of the tFileInputDelimited, I have ((String)globalMap.get("FilePathComplete")).
When it runs I get an error saying it cannot find the file or path. If I cut out the file input component and have it send straight to the tLogRow, it shows a single row of blank entries.
Any ideas?
I'm not sure if you've just slightly misconfigured the job here but it seems to work fine for me.
Here are a few screenshots showing my job design:
The only thing I can think of just by looking at your post is that you might have slightly messed up the key-value pair combinations in the tFlowToIterate. I tend to find that the default settings there work fine pretty much all of the time, and they make it a little more obvious what it's doing as well.
EDIT: Actually, it looks like you might be using the wrong values in your tIterateToFlow. The tFileList will throw the values for the file paths etc. into the globalMap, but it will prefix them with the unique component name. If you hit Ctrl+Space in the value window it should prompt you with a list of available values (these are also shown in the "Outline" tab of the Studio). It typically makes an implicit conversion to String, but for this you will need to convert it explicitly, so use .toString() instead of (String).
Another way to get the last modified file is as below:
tFileList (sorted DESC by file modified date) ------> tFixedFlowInput (schema: filename, filenumber) -----> tHashOutput
Here in tFixedFlowInput:
filename = (String)globalMap.get("tFileList_1_CURRENT_FILEPATH")+"/"+(String)globalMap.get("tFileList_1_CURRENT_FILE")
filenumber = (Integer)globalMap.get("tFileList_1_NB_FILE")
What the above will accomplish is to get the list of all files in the directory with their number/rank, where the last modified file will have file number = 1, the next one 2, and so on.
Now on SubjobOK of the above tFileList you can have a tHashInput which will read from the above tHashOutput and filter only the row where filenumber == 1, which is the last modified file.
tHashInput (linked to tHashOutput) ----> tFilterRow (filenumber == 1) ------> tLogRow
One reason why you are getting null is probably that you have used globalMap.get("CURRENT_FILEPATH") instead of globalMap.get("tFileList_1_CURRENT_FILEPATH").
A simple solution for the above problem could be as below:
tFileList (sorted ASC by file modified date) --> tIterateToFlow --> tJava (just to end the subjob).
Then on
subjob OK --> tFileInputDelimited (use (String)globalMap.get("tFileList_1_CURRENT_FILE") or (String)globalMap.get("tFileList_1_CURRENT_FILEPATH") as the file name / file path)
Explanation:
Since tFileList iterates over all the files in ASC order, it will always have the latest file name stored in the globalMap on the last iteration. The list is only iterated up to tIterateToFlow, hence after this component (String)globalMap.get("tFileList_1_CURRENT_FILE") will always give the last file name from the iterated list, which is the latest file in our case.
Main Flow:
Component View: