I am using Synapse and have a lot of scenarios where I need to read a value at the beginning of a pipeline and then save a value at the end of the pipeline as a key-value pair (KVP). For example, when the pipeline begins I read a value from a KVP store to get the max date from the last time the pipeline ran, and I use that value to get all rows from a table that are greater than or equal to that datetime. When the pipeline finishes doing what it has to do, I save the max modified date from this run. Wash, rinse, repeat. I have a few ideas, like a parquet file or Redis (this seems a bit much). Just trying to see if anyone has come up with a more elegant/simple approach.
You can use Global Parameters, which can be used across different pipelines, and their values can be modified at runtime.
Go to Manage in Azure Data Factory and click on Global Parameters in the left panel options. Then click on + New.
Create a new Global Parameter.
You can then use this global parameter in any pipeline and change its value at runtime; a sketch of referencing it is shown below.
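For example, a minimal sketch (the parameter name LastMaxModifiedDate is hypothetical): at the start of the pipeline you would read the stored value with an expression such as

    @pipeline().globalParameters.LastMaxModifiedDate

and use it in your source query's WHERE clause to filter on the modified-date column.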
I have a switch statement that looks at a variable value and based on that determines which data flow to execute. The problem with this is, I need to update the switch statement every time I add a new ID/dataflow.
Is there an alternative design to this? What if my dataflows had the same name as the variable value - would it be possible to parameterize the name of the dataflow to execute?
For example, if the variable value is "1", execute the data flow named "1_dataflow"; if it is "2", execute "2_dataflow", and so on. How would I accomplish this?
Yes, you can parameterize the values in any activity in Azure Data Factory and make the pipeline dynamic instead of giving hard-coded values.
You can use parameters to pass external values into pipelines, datasets, linked services, and data flows. Once the parameter has been passed into the resource, it cannot be changed. By parameterizing resources, you can reuse them with different values each time.
You can also use a Set Variable activity to define a variable and set its value, which you can then use in the Switch activity and change later.
Refer: How to use parameters, expressions and functions in Azure Data Factory, Set Variable Activity in Azure Data Factory and Azure Synapse Analytics
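As a minimal sketch (the variable name flowId is hypothetical, and whether the Execute Data Flow activity accepts a dynamic reference depends on your workspace version), you could build the data flow name from the variable with an expression such as:

    @concat(variables('flowId'), '_dataflow')

Otherwise, you can keep the Switch activity, use @variables('flowId') as its expression, and have one case per data flow.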
I am using Azure Data Factory's built-in copy task, set on a daily schedule, to copy data into a container in Azure Data Lake Storage Gen2. For the destination, I'm trying to use date variables to create a folder structure for the data. In the resulting pipeline the formula looks like this:
Dir1/Dir2/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/#{formatDateTime(pipeline().parameters.windowStart,'MM')}/#{formatDateTime(pipeline().parameters.windowStart,'dd')}
Unfortunately this is throwing an error:
Operation on target ForEach_h33 failed: Activity failed because an inner activity failed; Inner activity name: Copy_h33, Error: The function 'formatDateTime' expects its first parameter to be of type string. The provided value is of type 'Null'.
Everything I've created was just generated by the tool, and the folder path I used when following the tool was as suggested:
Dir1/Dir2/{year}/{month}/{day} (I was then able to set the format of each variable - e.g., yyyy, MM, dd - which suggests the tool understood what I was doing.)
The only other thing I can think of is that the folder structure in the container only contains Dir1/Dir2/ - I am expecting the subdirectories to be created as the copy task runs.
I'll also add that everything runs fine if I just use the directory Dir1/Dir2/ - so the issue is with my variables.
There is nothing wrong with your built-in copy task. It will give the correct result when it runs at the scheduled trigger time.
But if you run it with a manual trigger, it will give the error above.
When you run with a manual trigger, you must supply the windowStart parameter value yourself. The error is saying that the value is null.
Give the value in a format such as MM/DD/YYYY. The schedule trigger supplies this value automatically when it runs daily, but with manual triggering we have to specify it ourselves.
Then click OK. You will no longer get that error, and the folders will be created in the year/month/day format.
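For example, when triggering manually you could supply a value like 01/15/2023 for windowStart. If you would rather not supply it every time, a hedged alternative (an assumption on my part, not part of the tool-generated pipeline) is to add a fallback in the folder expression so a null parameter defaults to the current UTC time:

    Dir1/Dir2/#{formatDateTime(coalesce(pipeline().parameters.windowStart, utcnow()),'yyyy')}/#{formatDateTime(coalesce(pipeline().parameters.windowStart, utcnow()),'MM')}/#{formatDateTime(coalesce(pipeline().parameters.windowStart, utcnow()),'dd')}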
I want to perform some file-level and field-level validation checks on the dataset I receive.
Below are some of the checks I want to perform, capturing any issues into audit tables.
File-level checks: file is present, size of the file, count of records matches the count present in the control file.
Field-level checks: content in the right format, duplicate key checks, ranges on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check for file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should keep trying before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity times out, which downstream dependencies treat as a failure.
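A minimal sketch of the activity JSON (the activity name, dataset name, timeout, and minimum size are hypothetical values you would adjust):

    {
        "name": "Check source file",
        "type": "Validation",
        "typeProperties": {
            "dataset": { "referenceName": "SourceFileDataset", "type": "DatasetReference" },
            "timeout": "0.01:00:00",
            "sleep": 60,
            "minimumSize": 1024
        }
    }

Here sleep is in seconds, timeout uses the d.hh:mm:ss format, and minimumSize is in bytes.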
To count matching records, assuming you are using CSV, you could create a generic dataset (one column) and run a Copy activity over whatever folders you want to count, writing to a temp folder. Get the row count from the Copy activity output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders - just for easy rerunning)
For Each (base folder)
Copy recursively to temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied (see the expression sketch after this list)
Delete temp files recursively.
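As a sketch (the activity name must match your actual Copy activity, and the stored procedure parameter mapping is up to you), the Stored Procedure activity's parameter would be set with an expression like:

    @activity('Copy recursively to temp folder').output.rowsCopied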
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/
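As a rough sketch of the dynamic approach (dataset, linked service, and parameter names here are hypothetical), you can add a folderPath parameter to the generic dataset and reference it in its location, so one dataset serves every folder you loop over:

    {
        "name": "GenericCsvDataset",
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": { "referenceName": "AdlsGen2LinkedService", "type": "LinkedServiceReference" },
            "parameters": { "folderPath": { "type": "string" } },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobFSLocation",
                    "fileSystem": "raw",
                    "folderPath": { "value": "@dataset().folderPath", "type": "Expression" }
                }
            }
        }
    }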
I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return one row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable, so that at the end of the pipeline I can use a stored procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like #BatchNo, but that fails. I have even tried adding a Set Variable task but can't figure out how to take a single column. #{activity('Populate Aleprstw').output} does not error, but I'm not sure what that will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable
Use the batch number in all activities (might be multiple copies, procs, etc.)
Write the batch completion
From your description it seems you have the batch number embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
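If you do go the Lookup route, step 2 above typically translates to an expression like this in a Set Variable activity (the activity name 'Lookup Batch Number' and the column BatchNumber are hypothetical; the Lookup is set to return the first row only):

    @activity('Lookup Batch Number').output.firstRow.BatchNumber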
Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.
I'm doing a simple data flow pipeline between two Cosmos DBs. The pipeline starts with the data flow, which grabs the pipeline variable "LastPipelineStartTime" and passes that parameter to the data flow for the query to use, in order to get all new data where c._ts >= "LastPipelineStartTime". Then, on data flow success, it updates the variable via Set Variable to pipeline().TriggerTime. Essentially, I'm always grabbing new data between pipeline runs.
My question is: it looks like the variable reverts back to its default value of 0 during each debug run, and the flow instead grabs everything each time. Am I misunderstanding or using pipeline variables wrong? Thanks!
As far as I know, a variable set in the Set Variable activity has its own life cycle: the current execution of the pipeline. Any change to the variable does not persist to the next execution.
To implement what you need, please refer to the workarounds below:
1. If you execute the ADF pipeline on a schedule, you could just pass the schedule time into it as a parameter to make sure you grab new data.
2. If the frequency is random, persist the trigger time somewhere else (e.g., a simple file in blob storage), and before the data flow activity use a Lookup activity to grab that time from the blob storage file.
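For workaround 2, a rough sketch (file, dataset, activity, and parameter names are hypothetical): keep a one-row file such as lastrun.json in blob storage, read it with a Lookup activity at the start of the pipeline, and pass the value into the data flow parameter with an expression like

    @activity('Get Last Run Time').output.firstRow.LastPipelineStartTime

then at the end of the pipeline overwrite the file with the new trigger time (for example via a small Copy activity or a Web activity against the storage API) so the next run picks it up.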