Azure Data Flow generic curation framework - azure-data-factory

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10-100) and write to sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specifici data types and value ranges, etc.)
Using Python SDK, create Data Factory and mapping data flows pipelines using the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the python code that creates the pipelines for each feed, does validation, write the issues into Log Analytics workspace and tear off the resources at specific schedules.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time to validate/curate the data feeds, thus I wanted to prepare the validation logic quickly for each feed and create python classes or Powershell scripts scheduled to run them on generic data pipelines at specific times of the day.
many thanks
CK

To validate the schema, you can have a reference dataset which will be having the same schema (first row) as of your main dataset. Then you need to use “Get Metadata” activity for each dataset and get the structure of each dataset. Your Get Metadata activity will look like this:
You can then use “If Condition” activity to matches the structure of both datasets using equal Logical Function. Your equal expression will look something like this:
If both datasets’ structure matches, your next required activity(like copy the dataset to another container) will be performed.
Your complete pipeline will look like this:
The script which you want to run on your inserted dataset could be performed using “Custom” activity. You again need to create the linked service and it’s corresponding dataset for your script which you will run to validate the raw data. Please refer: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
To schedule the pipeline as per your specific pipeline will be take care by Triggers in Azure Data Factory. A schedule trigger will take care of your requirement of auto trigger your pipeline at any specific time.

Related

ADF - what's the best way to execute one from a list of Data Flow activities based on a condition

I have 20 file formats and 1 Data Flow activity that maps to each one of them. Based on the file name, I know which data flow activity to execute. Is the only way to handle this through a "Switch" activity? Is there another way? e.g. can I parameterize the data flow to execute by a variable name?:
Unfortunately , there is no option to run one out of list of dataflows based on input condition.
To perform data migration and transformation for multiple tables, you can use same dataflow and parameterize the dataflow by providing the table names either during the runtime or use a control table to hold all the tablenames and inside foreach , call the dataflow activity. In the sink settings, use merge schema option.

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in account storage. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, dataset for the source and dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing)
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contact etc.) that essentially does the following:
accesses all of the correct files, using a wildcard path along the lines of: Raw/CRM/ * / * / * /campaign.parquet
selects the columns that I need
Rename column headings
in some cases, just take the most recent data (using the RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM///*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory, but when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but this dataset has all 12 entities in the dataset and the data flow activity produces the error: "No value provided for Parameter 'cw_fileName" - but I don't see any place when configuring the data flow to specify a parameter (its not under source settings, source options, projection, optimize or inspect)
using Azure Data Factory, tried to add a script - but in trying to connect to my SQL script in Synapse - I don't know my Service Principal Key for the synapse workspace
using a notebook Databricks, I tried to mount my container but got an error along the lines that "adding secret to Databricks scope doesn't work in Standard Tier" so couldn't proceed
using Synapse, but as expected, it wants things in SQL whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction. What's the best approach that I should take? And if its one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow dataset parameter values from the pipeline data flow activity settings.

Perform data checks in azure data factory

I have and ADF pipeline which reads data from an on-prem source and copies it to a dataset in azure.
I want to perform some datachecks:
If the data contains the features I need
If there is null in some features
If the feature is all nulls
It should fail if the conditions above dnt meet
Is there a way to do this in data factory without using a batch service and just activities in data factory or maybe a dataflow.
Many approaches to this you could do a traditional batch process running function/code in a process. You could weave together ADF activities into multiple steps combination of 'Lookup Activity' possibly followed by a 'Validation Activity' and 'Delete Activity' with your criteria and rules defined.
Azure Data Factory 'Data Flows' - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview - Allows you map out data transformation as data moves through the pipeline in a codeless fashion.
A pattern with ADF Data Flows is 'Wrangling Data Flows' to work with data and prepare it for consumption. Ref Article - https://learn.microsoft.com/en-us/azure/data-factory/wrangling-overview
The Copy activity in Azure Data Factory (ADF) or Synapse Pipelines provides some basic validation checks called 'data consistency'. This can do things like: fail the activity if the number of rows read from the source is different from the number of rows in the sink, or identify the number of incompatible rows which were not copied depending on the type of copy you are doing.
This is probably not quite at the level you want so you could look at writing something custom, eg using the Stored Proc activity, or looking at Mapping Data Flows and its Assert task which could do something like this. There's a useful video in the link which shows the feature.
I tried using Assert activities but for the scope of my work this wasn't enough!
Therefore, I ended up using python code for data checks.
However, assert activity servers better if your datacheck criteria is not hard as mine.
You can try to create data flows and apply conditional split activity. This will help you to achieve your scenario.
There is no such coding for this. This is diagrammatically you can do this in ADF or Azure Synapse Data Flow.
Find my attached data flow diagram that checks a few conditions like when the year is less than the specified year or if data in a column is null, date format, etc.

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
Customer A
Employees.csv
PayPeriods.csv
etc
Customer B
Employees.csv
PayPeriods.csv
etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to setup ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on file name, there is a branching conditional that will fire the right dataflow. E.g. if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it loads a different dataflow. Also, they'd all be using the same generic target sink datasource, the table name being parameterized and those parameters being set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that you if you define column-specific mappings with expressions that expect those columns to be present, you may have data flow execution failures if those columns are not present in your customer source files.
The ways to protect against that in ADF Data Flow is to use column patterns. This will allow you to define mappings that are generic and more flexible.

How to implement conditional branches in Azure Data Factory pipelines

I am implementing a pipeline to insert data updates from csv files to SQL DB. Plan is to first insert the data to temporary SQL table for validation and transformation, and then move processed data to actual SQL table. I would like to branch the pipeline execution depending on the validation result. If data is OK, it will be inserted to target SQL table. If there are fatal fails, insertion activity should be skipped.
Tried to find instructions / guidance but no success so far. Any ideas if pipeline activity supports conditional execution, e.g. based on some properties in input dataset?
It is possible now with Azure Data Factory ver 2.
Post execution our downstream activities can now be dependent on four possible outcomes as standard.
- On success
- On failure
- On completion
- On skip
Also, custom ‘if’ conditions will be available for branching based expressions.
Refer below links for more detail:-
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-control-flow
The short answer is no.
I think its worth pointing out that ADF is just an orchestration tool to invoke other services. The current version can't do what you want because it does not have any of its own compute. Its not an SSIS data flow engine.
If you want this behaviour you'll need to code it into the SQL DB stored procedures with flags etc on the processed datasets.
Then maybe have some boiler plate code with a parameters that are passed from ADF to perform either the insert or update or divert operation.
Handy link for called stored procedure with params from ADF: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity
Hope this helps.