ADF - what's the best way to execute one from a list of Data Flow activities based on a condition - azure-data-factory

I have 20 file formats and one Data Flow activity that maps to each of them. Based on the file name, I know which Data Flow activity to execute. Is a "Switch" activity the only way to handle this? Is there another way, e.g. can I parameterize which data flow to execute by a variable name?

Unfortunately, there is no option to run one out of a list of data flows based on an input condition.
To perform data migration and transformation for multiple tables, you can use a single data flow and parameterize it, either by providing the table names at runtime or by using a control table that holds all the table names and calling the Data Flow activity inside a ForEach. In the sink settings, use the merge schema option.
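As a rough sketch of the control-table idea, the Python below models what the ForEach would work from: each control-table row is turned into the parameter set passed to one generic data flow. The row fields (`file_pattern`, `table_name`) and the parameter names (`sourceFileName`, `sinkTableName`) are hypothetical and chosen only for illustration; they are not part of any ADF API.

```python
from fnmatch import fnmatch

# Hypothetical control table: one row per file format.
CONTROL_TABLE = [
    {"file_pattern": "sales_*.csv", "table_name": "Sales"},
    {"file_pattern": "customers_*.csv", "table_name": "Customers"},
]


def resolve_parameters(file_name, control_table=CONTROL_TABLE):
    """Return the parameters for the generic data flow, or None if no row matches."""
    for row in control_table:
        if fnmatch(file_name, row["file_pattern"]):
            # Assumed parameter names; a real data flow defines its own.
            return {"sourceFileName": file_name, "sinkTableName": row["table_name"]}
    return None
```

With this shape, supporting a new file format means adding a row to the control table rather than adding another Data Flow activity.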

Related

Perform data checks in azure data factory

I have an ADF pipeline which reads data from an on-prem source and copies it to a dataset in Azure.
I want to perform some data checks:
If the data contains the features I need
If there is null in some features
If the feature is all nulls
The pipeline should fail if the conditions above are not met.
Is there a way to do this in Data Factory without using a batch service, using just activities in Data Factory or maybe a data flow?
There are many approaches to this. You could do a traditional batch process that runs a function or code in a separate process. You could also weave ADF activities together into multiple steps: a combination of a 'Lookup Activity', possibly followed by a 'Validation Activity' and a 'Delete Activity', with your criteria and rules defined.
Azure Data Factory 'Data Flows' - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview - let you map out data transformations as data moves through the pipeline in a codeless fashion.
A pattern with ADF Data Flows is 'Wrangling Data Flows' to work with data and prepare it for consumption. Ref Article - https://learn.microsoft.com/en-us/azure/data-factory/wrangling-overview
The Copy activity in Azure Data Factory (ADF) or Synapse Pipelines provides some basic validation checks called 'data consistency'. This can do things like: fail the activity if the number of rows read from the source is different from the number of rows in the sink, or identify the number of incompatible rows which were not copied depending on the type of copy you are doing.
This is probably not quite at the level you want, so you could look at writing something custom, e.g. using the Stored Procedure activity, or at Mapping Data Flows and its Assert transformation, which could do something like this. There's a useful video in the link which shows the feature.
I tried using the Assert transformation, but for the scope of my work this wasn't enough!
Therefore, I ended up using Python code for the data checks.
However, the Assert transformation serves better if your data check criteria are not as complex as mine.
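For illustration only, here is a minimal sketch of what such Python data checks might look like for the three conditions in the question. It assumes the copied data has already been loaded as a list of dictionaries; the function names and the `required` / `not_null` arguments are hypothetical, and this is not the code the poster actually used.

```python
def check_rows(rows, required, not_null):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Collect every column name that appears in at least one row.
    present = set()
    for row in rows:
        present.update(row.keys())

    # Check 1: the data contains the features we need.
    for col in required:
        if col not in present:
            problems.append(f"missing feature: {col}")

    for col in required:
        if col not in present:
            continue
        values = [row.get(col) for row in rows]
        if rows and all(v is None for v in values):
            # Check 3: the feature is entirely null.
            problems.append(f"feature is all nulls: {col}")
        elif col in not_null and any(v is None for v in values):
            # Check 2: nulls in a feature that must not contain them.
            problems.append(f"nulls found in feature: {col}")
    return problems


def validate_or_fail(rows, required, not_null):
    """Raise ValueError when any check fails, so the calling step fails too."""
    problems = check_rows(rows, required, not_null)
    if problems:
        raise ValueError("; ".join(problems))
```

Raising an exception is one way to make the step that hosts the script fail, which matches the requirement that the run should fail when the checks are not met.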
You can try creating a data flow and applying a Conditional Split transformation. This will help you achieve your scenario.
No code is needed for this; you can build it visually in an ADF or Azure Synapse data flow.
See my attached data flow diagram, which checks a few conditions such as whether the year is less than a specified year, whether data in a column is null, the date format, etc.

Azure Data Flow generic curation framework

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10-100) and write to sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.)
Using the Python SDK, create the Data Factory and mapping data flow pipelines from the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues to a Log Analytics workspace and tears down the resources, at specific schedules.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time to validate/curate the data feeds, thus I wanted to prepare the validation logic quickly for each feed and create python classes or Powershell scripts scheduled to run them on generic data pipelines at specific times of the day.
many thanks
CK
To validate the schema, you can keep a reference dataset that has the same schema (first row) as your main dataset. Then use a “Get Metadata” activity on each dataset to get its structure. Your Get Metadata activity will look like this:
You can then use an “If Condition” activity to check that the structures of both datasets match, using the equals logical function. Your equals expression will look something like this:
If the structures of both datasets match, your next required activity (such as copying the dataset to another container) will be performed.
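The comparison that the If Condition performs can be sketched in Python as follows. This assumes each structure is a list of objects with `name` and `type` fields, which is the general shape Get Metadata returns for the structure field; the helper names are hypothetical and only illustrate the logic.

```python
def structures_match(reference, actual):
    """True when both structures list the same columns, in the same order, with the same types."""
    ref_cols = [(c["name"], c["type"]) for c in reference]
    act_cols = [(c["name"], c["type"]) for c in actual]
    return ref_cols == act_cols


def describe_differences(reference, actual):
    """List missing, retyped and unexpected columns, for logging when the match fails."""
    ref = {c["name"]: c["type"] for c in reference}
    act = {c["name"]: c["type"] for c in actual}
    diffs = []
    for name, expected in ref.items():
        if name not in act:
            diffs.append(f"missing column: {name}")
        elif act[name] != expected:
            diffs.append(f"type mismatch for {name}: expected {expected}, got {act[name]}")
    for name in act:
        if name not in ref:
            diffs.append(f"unexpected column: {name}")
    return diffs
```

A plain equality check only says whether the schemas differ; a helper like `describe_differences` is an optional extra that could make the logged issue more useful.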
Your complete pipeline will look like this:
The script that you want to run on your ingested dataset can be executed using a “Custom” activity. You again need to create the linked service and its corresponding dataset for the script that validates the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling is handled by triggers in Azure Data Factory. A schedule trigger will run your pipeline automatically at the specific times you require.

Data from multiple sources and deciding destination based on the Lookup SQL data

I am trying to solve the problem below, where I get data from different sources and try to copy it to a single destination based on metadata stored in a SQL table. These are the steps I followed:
1. I have 3 REST API calls, and the output of those calls goes as input to a Lookup activity.
2. The Lookup activity queries a SQL DB table which has 3 records, pulling 2 columns only: file_name and table_name.
3. A ForEach activity then iterates over the Lookup output array, and from each item I get item().file_name.
4. For each item, I am trying to use a Switch activity to decide, based on the file name, what the destination of the data should be.
I am not sure how I can use the file_name from step 3 as a case in the Switch activity. Can anyone please guide me on that?
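You need to create a variable and save the value of file_name in it. Then you can use that variable in the Switch activity. If you do this, please make sure the Sequential setting of the ForEach activity is checked: pipeline variables are shared across iterations, so parallel iterations would overwrite each other's value.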
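The routing that the Switch performs for each item can be pictured with the small Python sketch below. The file names, destination names and the default are all hypothetical; the point is only that each case is matched against the item's file_name and unmatched values fall through to a default, one item at a time.

```python
# Hypothetical Switch cases: file name -> destination.
DESTINATIONS = {
    "orders.csv": "dbo.Orders",
    "customers.csv": "dbo.Customers",
}
DEFAULT_DESTINATION = "dbo.Unmapped"  # plays the role of the Switch default case


def route(items):
    """Process items one by one (sequentially) and pick a destination for each."""
    results = []
    for item in items:
        file_name = item["file_name"]  # the value saved to the variable
        destination = DESTINATIONS.get(file_name, DEFAULT_DESTINATION)
        results.append((file_name, destination))
    return results
```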

Azure Data Flow (Pass output of one data flow to another in the pipeline)

I have a requirement where I have to pass the Select transformation output from one data flow directly to another.
Example:
I have a data flow with a SELECT transformation as Final step.
I have another data flow that needs to take the above SELECT transformation output as input.
Currently, I am storing the output of the first data flow in a table and reading from that table in the second data flow, which takes a long time to execute. I want to avoid storing the data in the table.
Thanks,
Karthik
Data flows require your logic to terminate with a Sink when you execute them from a pipeline, so you must persist your output somewhere in Azure. The next pipeline activity can read from that output dataset.

How to pass the Result of my CustomDotNet Activity to my stored procedure

I've created an ADF for my project, which consists of a Custom activity and a Stored Procedure activity.
This is where I'm blocked:
My custom activity obtains the most recently modified file (let's assume xx.txt is the file) from my Azure Blob container.
My stored procedure has the single parameter FileName. I want to pass the file name to my stored procedure which can be obtained from the above custom activity.
(We can say simply as my stored procedure activity input depends on the Custom Activity output)
How can I do this in my ADF?
You will need to do this with two separate activities (maybe in two separate pipelines, if you want greater control) and using some middle ground storage that both processes can access.
For example:
1. Input dataset, Blob.
2. Custom activity.
3. Output dataset, SQL DB.
4. Input dataset, SQL DB (same as 3).
5. Stored Proc activity.
6. Output dataset, SQL DB.
The staging-area datasets used in points 3 and 4 are the most important part here.
This is how ADF will want to handle it, although I understand why you would want to pass the output of the custom activity straight to the stored procedure without using an additional dataset.
Hope this helps.