Do pipeline variables persist between runs? - azure-data-factory

I'm building a simple data flow pipeline between 2 Cosmos DBs. The pipeline starts with the data flow, which grabs the pipeline variable "LastPipelineStartTime" and passes it as a parameter to the data flow, for the query to use in order to get all new data where c._ts >= "LastPipelineStartTime". Then, on data flow success, the pipeline updates the variable via Set Variable to @pipeline().TriggerTime. Essentially, I'm always grabbing only the new data between pipeline runs.
My question is: during each debug run the variable seems to revert back to its Default Value of 0, so the pipeline grabs everything each time. Am I misunderstanding or using pipeline variables wrong? Thanks!

As far as I know, a variable set in the Set Variable activity has its own life cycle: the current execution of the pipeline. Changes to a variable do not persist into the next run.
To implement your needs, please refer to the workarounds below:
1. If you execute the ADF pipeline on a schedule, you could just pass the schedule time as a parameter into it to make sure you grab new data.
2. If the frequency is random, persist the trigger time somewhere else (e.g. a simple file in blob storage); before the data flow activity, use a Lookup activity to grab that time from the blob storage file (see the sketch below).
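For the second workaround, a minimal sketch of the persistence side, assuming a small script step (for example an Azure Function or a notebook) writes the value at the end of a successful run; the container name, blob name and connection string are placeholders:

# Persist the last pipeline start time to a text blob; a Lookup activity
# over a DelimitedText dataset pointing at this file can read it back.
from datetime import datetime, timezone
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    container_name="adf-state",              # placeholder
    blob_name="last_run_time.txt",           # placeholder
)

# End of a successful run: overwrite the stored value with this run's start time.
blob.upload_blob(datetime.now(timezone.utc).isoformat(), overwrite=True)

# Start of the next run (this is effectively what the Lookup activity does):
last_run_time = blob.download_blob().readall().decode("utf-8")
print(last_run_time)

The data flow parameter is then fed from the Lookup output instead of from a pipeline variable.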

Related

Trigger Date for reruns

My pipeline's activities need the date of the run as a parameter. Right now I get the current date in the pipeline from the utcnow() function. Ideally this would be something I could enter dynamically in the trigger, so I could rerun a failed day and the parameter would be set correctly; currently a rerun leads to my pipeline running again, but with today's date rather than the failed run's date.
I am used to Airflow, where such things are pretty easy to do, including scheduling reruns. I probably think too much in terms of Airflow, but I can't wrap my head around a better solution.
In ADF, there is no direct support for passing the trigger date on which the pipeline failed back to the trigger.
You can get the trigger time using @pipeline().TriggerTime.
This system variable gives the time at which the trigger started the pipeline run.
You can store this trigger value for every pipeline run and pass it back in as a parameter when you rerun the pipeline for the failed date.
Reference: Microsoft document on System Variables on ADF
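As one hedged illustration of that approach (not part of the original answer), if the pipeline declares a date parameter, say RunDate, defaulting to utcnow(), a failed day can be rerun by starting a new run through the Data Factory REST API and passing the date explicitly; the subscription, resource group, factory, pipeline and parameter names below are placeholders:

# Start a pipeline run for a specific date via the ADF REST API (createRun).
import requests
from azure.identity import DefaultAzureCredential

subscription = "<subscription-id>"
resource_group = "<resource-group>"
factory = "<data-factory-name>"
pipeline = "<pipeline-name>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory}/pipelines/{pipeline}/createRun?api-version=2018-06-01"
)

# The request body is the set of pipeline parameters; here the hypothetical
# RunDate parameter is overridden with the day that failed.
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"RunDate": "2023-01-15"},  # placeholder date
)
response.raise_for_status()
print(response.json()["runId"])

The Trigger now option in the portal similarly prompts for pipeline parameter values, so the failed date can also be supplied manually there.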
To resolve my problem I had to create a nested structure of pipelines: the top pipeline sets a variable for the date and then calls the other pipelines, passing that variable.
With this I still can't rerun the top pipeline, but rerunning Execute Pipeline1/2/3 reruns them with the right variable set. It is still not perfect, since the top pipeline run stays in an error state and it is difficult to keep track of what needs to be rerun, but it is a partial solution.

Persistable key value pair storage in Synapse or ADF

I am using Synapse and have a lot of scenarios where I need to read a value at the beginning of a pipeline and then save a value at the end of the pipeline as a key-value pair (KVP). E.g. when the pipeline begins I read a value from a KVP store to get the max date from the last time the pipeline ran, and I use that value to get all rows from a table that are greater than or equal to that datetime. When the pipeline finishes doing what it has to do, I save the max modified date from this run. Wash, rinse, repeat. I have a few ideas, like a parquet file or Redis (the latter seems a bit much). Just trying to see if anyone has come up with a more elegant/simple approach.
You can use Global Parameters, which can be referenced across different pipelines, and their values can be changed later without editing each pipeline.
Go to Manage in Azure Data Factory and click on Global parameters in the left panel. Then click on + New.
Create a new global parameter.
You can then use this global parameter in any pipeline and update its value as needed.
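Separately from the global-parameter suggestion, a minimal sketch of the read-at-start / write-at-end pattern described in the question, using a small JSON file in blob storage as the key-value store (the container, blob name, key name and connection string are all placeholders):

# A tiny key-value store backed by a JSON blob: read the watermark at the
# start of the pipeline, write the new one back at the end.
import json
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    container_name="pipeline-state",         # placeholder
    blob_name="kvp.json",                    # placeholder
)

def read_kvp() -> dict:
    # e.g. {"max_modified_date": "2023-01-15T00:00:00Z"}
    return json.loads(blob.download_blob().readall())

def write_kvp(kvp: dict) -> None:
    blob.upload_blob(json.dumps(kvp), overwrite=True)

# Start of the run: get the last watermark.
state = read_kvp()
last_max_date = state.get("max_modified_date", "1900-01-01T00:00:00Z")

# ... pipeline work: load rows where modified_date >= last_max_date ...

# End of the run: persist the new watermark.
state["max_modified_date"] = "2023-01-16T00:00:00Z"  # placeholder value
write_kvp(state)

Inside a Synapse or ADF pipeline, the read side of this is typically a Lookup activity over the same file and the write side a small notebook or function step.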

Azure Data Factory: Get the result of a query on the databricks notebook to create a condition

I want the result of a query in a Databricks notebook to act as the success or failure condition of the pipeline, for example to decide whether to reprocess the Copy Data activity in Azure Data Factory.
For example:
If x = 1, terminate the pipeline; if not, reprocess (with a limit of 3 attempts).
What's the best way to do this?
You can do this with the help of If and Until activities in ADF.
Please go through the sample demonstration below.
This is the sample notebook code from Databricks:
# your notebook logic goes here; x holds the result you want to act on
x = 1
# return x to ADF: it appears in the activity output as runOutput
dbutils.notebook.exit(x)
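A slightly fuller sketch of the same idea, where x actually comes from a query (the table name and condition below are assumptions, not part of the original answer):

# Run a query and return its result to ADF as runOutput.
row = spark.sql(
    "SELECT COUNT(*) AS cnt FROM my_schema.my_table WHERE status = 'FAILED'"  # placeholder query
).first()

# x = 1 means "terminate the pipeline"; any other value means "reprocess".
x = 1 if row["cnt"] == 0 else 0
dbutils.notebook.exit(x)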
In ADF, first create an array variable, which will be used in the Until activity.
The array's length controls how many times the reprocessing runs.
Next, add your Databricks Notebook activity.
Now use an If activity and give it the expression below.
@equals(activity('Notebook1').output.runOutput, 1)
If this is true, the pipeline has to be terminated, so add a Fail activity in the True activities of the If.
Here you can give any message that you want.
Leave the False activities of the If empty.
Now, add an Until activity and connect it to the success output of the If.
Inside the Until activities you can place any activity; if you want to reprocess another pipeline, you can use an Execute Pipeline activity here as well. In this demonstration I have used a Copy activity.
After the Copy activity, use an Append Variable activity on the array variable defined at the start (here called iter) and append any single value that you want.
Now, in the Until expression, give the below.
@equals(length(variables('iter')), 4)
So, the activities inside the Until will reprocess 3 times if x != 1 (the loop stops once the array variable reaches a length of 4).
If x = 1 in the notebook, the pipeline fails and terminates at the If activity.
If x != 1 in the notebook, the Until reprocesses the Copy activity 3 times.

ADF - replace switch statement for dataflows with a parameterized solution

I have a switch statement that looks at a variable value and, based on that, determines which data flow to execute. The problem with this is that I need to update the switch statement every time I add a new ID/data flow.
Is there an alternative design to this? What if my data flows had the same name as the variable value - would it be possible to parameterize the name of the data flow to execute?
E.g. if the variable value is "1", execute the data flow named "1_dataflow"; if it is "2", execute "2_dataflow", and so on. How would I accomplish this?
Yes, you can parameterize the values in any activity in Azure Data Factory and make the pipeline dynamic instead of giving hard-coded values.
You can use parameters to pass external values into pipelines, datasets, linked services, and data flows. Once the parameter has been passed into the resource, it cannot be changed. By parameterizing resources, you can reuse them with different values each time.
You can also use a Set Variable activity to define a variable and assign it a value, which you can then use in the Switch activity and change later.
Refer to: How to use parameters, expressions and functions in Azure Data Factory, and Set Variable Activity in Azure Data Factory and Azure Synapse Analytics.
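For reference, and as a hedged illustration only (the parameter name is an assumption), a pipeline parameter supplied at trigger time is referenced in dynamic content in the form below; a parameterized design would build its expressions in this style.
@pipeline().parameters.FlowId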

AZURE DATA FACTORY - Can I set a variable from within a CopyData task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return 1 row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable, so that at the end of the pipeline I can use a Stored Procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like @BatchNo, but that fails. I have also tried to add a Set Variable task, but I can't figure out how to take a single column. @{activity('Populate Aleprstw').output} does not error, but I'm not sure what that will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable (see the example expression below)
Use the batch number in all activities (might be multiple copies, procs etc.)
Write the batch completion
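As a concrete illustration of the lookup step (the activity and column names here are assumptions), the Set Variable activity's value would be dynamic content along the lines of the line below, with the variable then referenced wherever the batch number is needed.
@activity('Lookup BatchNo').output.firstRow.BatchNo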
From your description it seems you have the batch embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.