Azure Data Factory: Get the result of a query on the Databricks notebook to create a condition

I want the result of a query in a Databricks notebook to act as the success or failure condition of the pipeline, so that it can reprocess, for example, the Copy Data activity in Azure Data Factory.
For example:
If x = 1, terminate the pipeline; if not, reprocess (with a limit of 3 attempts).
What's the best way to do this?

You can do this with the help of the If and Until activities in ADF.
Please go through the sample demonstration below:
This is the sample notebook code from Databricks:
# your code
x = 1
dbutils.notebook.exit(x)
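If the value really comes from a query, a slightly fuller version of the notebook might look like this (a sketch only; the database, table and column names are placeholders, and spark and dbutils are available inside any Databricks notebook):

# run the query whose result should drive the pipeline condition
row = spark.sql("SELECT count(*) AS cnt FROM my_database.my_table WHERE status = 'failed'").first()
# x = 1 means "terminate the pipeline", anything else means "reprocess"
x = 1 if row["cnt"] == 0 else 0
dbutils.notebook.exit(x)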
In ADF, first create an array variable that will be used in the Until activity.
The length of this array controls how many times the re-process runs.
Next, add your Databricks Notebook activity.
Now use an If Condition activity and give it the expression below.
@equals(activity('Notebook1').output.runOutput,1)
If this is true, our pipeline has to be terminated, so add a Fail activity in the True activities of the If.
Here you can give any message that you want.
Leave the False activities of the If empty.
Now add an Until activity and connect it to the success output of the If activity.
Inside the Until activities you can add any activity. If you want to reprocess another pipeline, you can use an Execute Pipeline activity as well. Here I have used a Copy activity.
After the Copy activity, use an Append Variable activity, select the array variable we defined at the start, and append any single value you want.
Now, in the Until expression, give the below.
@equals(length(variables('iter')),4)
So the activities inside the Until loop will reprocess 3 times if x != 1 (the loop exits once the array reaches the target length).
If x = 1 in the notebook, the pipeline fails and is terminated at the If activity.
If x != 1 in the notebook, the Until loop reprocesses the Copy activity 3 times.
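To see the branch-and-retry logic at a glance, the same control flow can be written as plain Python (an illustration only, not ADF code; run_notebook and copy_data are stand-ins for the Notebook and Copy activities):

MAX_ATTEMPTS = 3

def run_notebook():
    # stand-in for the Databricks Notebook activity; returns its runOutput
    return 1

def copy_data():
    # stand-in for the Copy activity (or an Execute Pipeline activity)
    print("reprocessing...")

if run_notebook() == 1:
    # mirrors the Fail activity inside the True branch of the If
    raise RuntimeError("x = 1, terminating the pipeline")

# mirrors the Until loop, which appends to the array until the target length is reached
for attempt in range(MAX_ATTEMPTS):
    copy_data()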

Related

Persistable key value pair storage in Synapse or ADF

I am using Synapse and have a lot of scenarios where I need to read a value at the beginning of a pipeline and then save a value at the end of the pipeline as a key-value pair (KVP). E.g. when the pipeline begins I read a value from a KVP store to get the max date from the last time the pipeline ran, and I use that value to get all rows from a table that are greater than or equal to that datetime. When the pipeline finishes doing what it has to do, I save the max modified date from this run. Wash, rinse, dry. I have a few ideas, like a parquet file or Redis (this seems a bit much). Just trying to see if anyone has come up with a more elegant/simple approach.
You can use Global Parameters, which can be shared across different pipelines and whose values can be modified at runtime.
Go to Manage in Azure Data Factory and click on Global Parameters in the left panel options. Then click on + New.
Create a new Global Parameter.
Later you can use this global parameter in any pipeline and change its value at runtime.
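Global parameters are normally edited in the Manage hub, but they can also be updated programmatically between runs. A rough Python sketch against the management REST API might look like the following; the endpoint path and api-version are assumptions to verify against the current ADF REST documentation, and the subscription, resource group, factory and parameter names are placeholders:

import requests
from azure.identity import DefaultAzureCredential

# placeholders - replace with your own values
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<factory-name>"

# assumed endpoint for the factory's global parameters (verify against the ADF REST docs)
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/globalParameters/default?api-version=2018-06-01"
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

# hypothetical parameter holding the last processed date
body = {"properties": {"LastMaxDate": {"type": "String", "value": "2023-01-01T00:00:00Z"}}}

response = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()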

AZURE DATA FACTORY - Can I set a variable from within a CopyData task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return 1 row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable, so that at the end of the pipeline I can use a stored procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like @BatchNo, but that fails. I have even tried to add a Set Variable task but can't figure out how to take a single column; @{activity('Populate Aleprstw').output} does not error, but I'm not sure what that will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable
Use the batch number in all activities (might be multiple copies, procs etc.)
Write the batch completion
From your description it seems you have the batch embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.

Event based trigger for a sequential run of the same data factory pipeline

I would like to use an event based trigger to run a data factory pipeline.
The trigger will check a folder in a data lake for any new file and start a pipeline once a new CSV file is copied.
The pipeline will then copy the data to an intermediate table to check its consistency (multiple checks using different data flow activities) and if everything's correct, copies it into a stage table.
It is thus very important that the intermediate table contains the data from only one single CSV file before it is checked.
I have read, though, that the event-based trigger will start in parallel as many pipeline runs as there are (simultaneously) downloaded CSV files.
Is this right? In this case, how can I force each pipeline run to wait until the previous one is done?
Thank you for your help.
There is a flag in the pipeline properties (accessible in the top-right of the editor pane) called concurrency. Set this to 1 and only one run will execute at a time; any other invocations will be queued until that one finishes.
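The same flag can also be set outside the editor. A rough sketch with the azure-mgmt-datafactory Python SDK might look like this (resource names are placeholders, and it assumes the pipeline already exists):

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# placeholders - replace with your own values
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<factory-name>"
pipeline_name = "<pipeline-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# fetch the existing pipeline definition, cap it to a single concurrent run, and save it back
pipeline = client.pipelines.get(resource_group, factory_name, pipeline_name)
pipeline.concurrency = 1
client.pipelines.create_or_update(resource_group, factory_name, pipeline_name, pipeline)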

Using Azure Data Factory output in Logic App

I have a Logic App that initially runs on a recurrence; it runs an ADF pipeline which outputs a folder of files.
Then, I use a List Blobs action to pull one specific file from the newly made folder and place its path on a queue.
Once a message is placed on that queue, it triggers the run of another ADF pipeline.
The issue is that I have not seen a way to get the output of the first ADF pipeline to put on the queue. I have tried to cheat within the List Blobs action that follows the 1st ADF pipeline by explicitly searching for the name of the output folder, because it will be the same every time.
However, even after the 1st ADF pipeline has run and produced the folder, within the first run of this Logic App the List Blobs action can't find the folder and says the file path is not found.
Only after I run the Logic App a second time is the folder finally found, which is not at all optimal. How can I fix this? I would prefer to keep everything in one Logic App. Are there other Azure tools that can help in addition?
I don't have the details of the implementation, but I am wondering: is the message written by the first pipeline only used as a signal for the second pipeline? If that's the case, why can't you call the second pipeline on completion of the first one? Maybe these pipelines are in different ADFs?
I suggest you read about event triggers and see if you can use them.

Do pipeline variables persist between runs?

I'm doing a simple data flow pipeline between 2 Cosmos DBs. The pipeline starts with the data flow, which grabs the pipeline variable "LastPipelineStartTime" and passes that parameter to the data flow for the query to use in order to get all new data where c._ts >= "LastPipelineStartTime". Then, on data flow success, it updates the variable via Set Variable to pipeline().TriggerTime. Essentially, so I'm always grabbing new data between pipeline runs.
My question is: it looks like the variable reverts back to its default value of 0 during each debug run, and the data flow instead grabs everything each time. Am I misunderstanding or using pipeline variables wrong? Thanks!
As far as I know, a variable set in the Set Variable activity has its own life cycle: the current execution of the pipeline. Any change to the variable does not persist into the next run.
To implement your needs, please refer to my workarounds below:
1. If you execute the ADF pipeline on a schedule, you could just pass the schedule time into it as a parameter to make sure you grab new data.
2. If the frequency is random, persist the trigger time somewhere else (e.g. a simple file in blob storage) and, before the data flow activity, use a Lookup activity to grab that time from the blob storage file (a minimal sketch of this pattern follows below).
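For workaround 2, a minimal Python sketch of the blob-based high-water-mark store could look like this (the container, blob and connection-string values are placeholders; inside the pipeline itself a Lookup activity would read the same file):

from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

# placeholders - point these at your own storage account
conn_str = "<storage-connection-string>"
blob = BlobClient.from_connection_string(conn_str, container_name="config", blob_name="last_run_time.txt")

def read_last_run_time(default="1970-01-01T00:00:00Z"):
    # read the persisted high-water mark; fall back to the default on the first run
    try:
        return blob.download_blob().readall().decode("utf-8")
    except ResourceNotFoundError:
        return default

def save_last_run_time(trigger_time):
    # overwrite the blob with the latest trigger time after a successful run
    blob.upload_blob(trigger_time, overwrite=True)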