Passing a dataset parameter to a mapping data flow activity in Azure Data Factory - azure-data-factory

I have a parameterized dataset that I used for a Copy Data activity, and it worked fine.
I am trying to replicate that using a mapping data flow, but I can't find where to input the value for the dataset parameter...

To clarify Joel's answer - you cannot assign parameter values to a Dataset from within the Data Flow settings. It is done from the Pipeline that executes the Data Flow. This means that you may get an error message if you attempt to 'Test connection' for a parameterised dataset.

Parameterized datasets work exactly the same way in a Data Flow activity. If a dataset used in the Data Flow has parameters, they will have configuration points on the Settings tab of the Execute Data Flow activity in the pipeline.
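For reference, below is a minimal sketch of that wiring using the azure-mgmt-datafactory Python SDK. The resource names, the source name inside the data flow ("sourceCustomers"), and the "fileName" parameter are all hypothetical, and model signatures can vary slightly between SDK versions:

```python
# Minimal sketch (hypothetical names): the pipeline's Execute Data Flow activity
# supplies the value for the parameterized dataset, keyed by the name of the
# source/sink inside the data flow that uses that dataset.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run_flow = ExecuteDataFlowActivity(
    name="RunMyDataFlow",
    data_flow=DataFlowReference(
        reference_name="MyMappingDataFlow",
        # Dataset parameter values, keyed by the source/sink name in the data flow.
        dataset_parameters={"sourceCustomers": {"fileName": "customers_2021.csv"}},
    ),
)

adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "RunMyDataFlowPipeline",
    PipelineResource(activities=[run_flow]),
)
```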

Related

ADF - replace switch statement for dataflows with a parameterized solution

I have a switch statement that looks at a variable value and based on that determines which data flow to execute. The problem with this is, I need to update the switch statement every time I add a new ID/dataflow.
Is there an alternative design to this? What if my dataflows had the same name as the variable value - would it be possible to parameterize the name of the dataflow to execute?
e.g. if the variable value is "1", execute the data flow named "1_dataflow"; if it is "2", execute "2_dataflow", and so on. How would I accomplish this?
Yes, you can parameterize the values in any activity in Azure Data Factory and make the pipeline dynamic instead of giving hard-coded values.
You can use parameters to pass external values into pipelines, datasets, linked services, and data flows. Once the parameter has been passed into the resource, it cannot be changed. By parameterizing resources, you can reuse them with different values each time.
You can also use the Set Variable activity to define a variable and set a value to it, which you can then use in the Switch activity and change later.
Refer to: How to use parameters, expressions and functions in Azure Data Factory, and Set Variable Activity in Azure Data Factory and Azure Synapse Analytics.
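To make that concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that starts the pipeline and passes the ID as a pipeline parameter instead of hard-coding it. The pipeline name and the "flowId" parameter are made up; what the pipeline does with the value (driving a Switch case, building a file or table name in an expression, etc.) is up to your design:

```python
# Minimal sketch (hypothetical names): pass an external value into the pipeline
# at run time rather than hard-coding it inside the pipeline definition.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    "<resource-group>", "<factory-name>", "ExecuteDataflowPipeline",
    parameters={"flowId": "1"},  # the pipeline decides which data flow this maps to
)
print("Started pipeline run:", run.run_id)
```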

Perform data checks in azure data factory

I have an ADF pipeline which reads data from an on-prem source and copies it to a dataset in Azure.
I want to perform some data checks:
Whether the data contains the features I need
Whether there are nulls in some features
Whether a feature is all nulls
The pipeline should fail if the conditions above are not met.
Is there a way to do this in Data Factory without using a Batch service, using just Data Factory activities or maybe a data flow?
There are many approaches to this. You could do a traditional batch process running a function/code as a separate process, or you could weave together ADF activities into multiple steps: a combination of a 'Lookup Activity', possibly followed by a 'Validation Activity' and a 'Delete Activity', with your criteria and rules defined.
Azure Data Factory 'Data Flows' - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview - allow you to map out data transformations as data moves through the pipeline in a codeless fashion.
A pattern with ADF Data Flows is 'Wrangling Data Flows' to work with data and prepare it for consumption. Ref Article - https://learn.microsoft.com/en-us/azure/data-factory/wrangling-overview
The Copy activity in Azure Data Factory (ADF) or Synapse Pipelines provides some basic validation checks called 'data consistency'. This can do things like: fail the activity if the number of rows read from the source is different from the number of rows in the sink, or identify the number of incompatible rows which were not copied depending on the type of copy you are doing.
This is probably not quite at the level you want so you could look at writing something custom, eg using the Stored Proc activity, or looking at Mapping Data Flows and its Assert task which could do something like this. There's a useful video in the link which shows the feature.
I tried using Assert activities but for the scope of my work this wasn't enough!
Therefore, I ended up using Python code for the data checks.
However, the Assert activity serves you better if your data check criteria are not as hard as mine.
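For anyone who also ends up doing the checks in Python, a minimal sketch of the kind of validation described in the question (required columns present, no unexpected nulls, no all-null columns) could look like this; the column names are made up:

```python
# Minimal sketch (hypothetical column names): raise an error, and therefore fail
# the custom/batch step, if the data fails any of the three checks.
import pandas as pd

REQUIRED_COLUMNS = ["customer_id", "order_date", "amount"]
NOT_NULL_COLUMNS = ["customer_id", "order_date"]

def validate(df: pd.DataFrame) -> None:
    # 1. The data must contain the features we need.
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # 2. Some features must not contain nulls at all.
    for col in NOT_NULL_COLUMNS:
        if df[col].isna().any():
            raise ValueError(f"Column {col!r} contains nulls")

    # 3. No required feature may be entirely null.
    all_null = [c for c in REQUIRED_COLUMNS if df[c].isna().all()]
    if all_null:
        raise ValueError(f"Columns are entirely null: {all_null}")

df = pd.read_csv("staged_copy.csv")  # e.g. the dataset copied to Azure storage
validate(df)
```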
You can try to create a data flow and apply a Conditional Split transformation; this will help you achieve your scenario.
There is no coding required for this - you can do it diagrammatically in an ADF or Azure Synapse data flow.
Find my attached data flow diagram, which checks a few conditions such as whether the year is less than the specified year, whether data in a column is null, the date format, etc.

Azure Data Flow generic curation framework

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10-100) and write to sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.)
Using Python SDK, create Data Factory and mapping data flows pipelines using the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the Python code that creates the pipelines for each feed, performs validation, writes the issues to a Log Analytics workspace, and tears down the resources on a specific schedule.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time to validate/curate the data feeds, so I wanted to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts scheduled to run them on generic data pipelines at specific times of the day.
many thanks
CK
To validate the schema, you can have a reference dataset that has the same schema (first row) as your main dataset. Then use a “Get Metadata” activity on each dataset to get its structure.
You can then use an “If Condition” activity to compare the structures of both datasets with the equals() logical function, applied to the two Get Metadata outputs.
If both datasets' structures match, your next required activity (such as copying the dataset to another container) will be performed.
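Since the question mentions building this with the Python SDK, a rough sketch of the Get Metadata / If Condition part with azure-mgmt-datafactory might look like the following; the dataset and activity names are made up, and model signatures can vary slightly between SDK versions:

```python
# Rough sketch (hypothetical names): compare the structure of a raw dataset with
# a reference dataset and only run the "schema matches" branch when they are equal.
from azure.mgmt.datafactory.models import (
    PipelineResource, GetMetadataActivity, IfConditionActivity,
    DatasetReference, Expression, ActivityDependency,
)

get_ref = GetMetadataActivity(
    name="GetRefMetadata",
    dataset=DatasetReference(reference_name="ReferenceSchemaDataset"),
    field_list=["structure"],
)
get_raw = GetMetadataActivity(
    name="GetRawMetadata",
    dataset=DatasetReference(reference_name="RawFeedDataset"),
    field_list=["structure"],
)
check = IfConditionActivity(
    name="SchemaMatches",
    depends_on=[
        ActivityDependency(activity="GetRefMetadata", dependency_conditions=["Succeeded"]),
        ActivityDependency(activity="GetRawMetadata", dependency_conditions=["Succeeded"]),
    ],
    expression=Expression(
        value="@equals(activity('GetRefMetadata').output.structure, "
              "activity('GetRawMetadata').output.structure)"
    ),
    if_true_activities=[
        # put the "schema matches" activities here, e.g. the copy or curation step
    ],
)

pipeline = PipelineResource(activities=[get_ref, get_raw, check])
```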
The script which you want to run on your ingested dataset can be executed using a “Custom” activity. You again need to create the linked service and its corresponding dataset for the script that you will run to validate the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling the pipeline is handled by triggers in Azure Data Factory. A schedule trigger will cover your requirement of automatically triggering the pipeline at a specific time.
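For the scheduling part, a minimal schedule trigger created with the same Python SDK could look roughly like this; names and times are made up, and older (track 1) SDK versions expose triggers.start instead of triggers.begin_start:

```python
# Minimal sketch (hypothetical names): run the curation pipeline daily at 02:00 UTC.
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
        time_zone="UTC",
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CurationPipeline"),
            parameters={"feedName": "feed_01"},
        )
    ],
)

adf_client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyCurationTrigger",
    TriggerResource(properties=trigger),
)
adf_client.triggers.begin_start("<resource-group>", "<factory-name>", "DailyCurationTrigger")
```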

Azure Data Flow (Pass output of one data flow to another in the pipeline)

I have a requirement where I have to pass the Select transformation output from one data flow to another directly.
Example:
I have a data flow with a Select transformation as the final step.
I have another data flow that needs to take the above Select transformation's output as input.
Currently, I am storing the output of the first data flow in a table and reading from that table in the second data flow, which takes a long time to execute. I want to avoid storing the data in a table.
Thanks,
Karthik
Data flows require your logic to terminate with a Sink when you execute them from a pipeline, so you must persist your output somewhere in Azure. The next pipeline activity can read from that output dataset.
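A rough sketch of that pattern with the azure-mgmt-datafactory Python SDK (all names are hypothetical): the first data flow sinks to a staging dataset, and the second data flow, whose source reads the same dataset, simply runs after it in the pipeline:

```python
# Rough sketch (hypothetical names): chain the two data flows so the second one
# only starts after the first has persisted its output to the staging sink.
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference, ActivityDependency,
)

first = ExecuteDataFlowActivity(
    name="SelectAndStage",
    data_flow=DataFlowReference(reference_name="FirstDataFlow"),   # sinks to StagingDataset
)
second = ExecuteDataFlowActivity(
    name="ConsumeStaged",
    data_flow=DataFlowReference(reference_name="SecondDataFlow"),  # sources from StagingDataset
    depends_on=[ActivityDependency(activity="SelectAndStage",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[first, second])
```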

Null dates being converted to 1900-01-01 in Azure Data Factory

I am bringing data from different databases into Blob storage in Azure with Data Factory; the problem is that all date values turn into 1900-01-01 when they are null.
Do you have any suggestions on how to keep the null values?
Thanks
You have to use Custom Activity to achieve this. In your Custom Activity you will have access to the source and destination Linked Services and Datasets. You can transform the data the way you want. You have to write your own transformation logic.
You will not have much control over data transformation if you are using Data Factory's Copy Activity.
Another approach is to use the Data Factory Management APIs. This way you can create the output dataset structure to match the input data. In this approach you will have full control over the data movement and transformations.
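As a hedged illustration of the 'write your own transformation logic' part, the code that a Custom activity runs could preserve the nulls explicitly when it writes the extract to Blob storage; the column name and file paths below are made up:

```python
# Minimal sketch (hypothetical names): keep null dates as empty fields instead of
# letting them come through as the 1900-01-01 sentinel value.
import pandas as pd

df = pd.read_csv("extracted_from_source.csv", parse_dates=["order_date"])

# Map the sentinel back to a real null, then write nulls out as empty strings.
df.loc[df["order_date"] == pd.Timestamp("1900-01-01"), "order_date"] = pd.NaT
df.to_csv("cleaned_for_blob.csv", index=False, na_rep="")
```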