How to manually update a pipeline in Azure Data Factory

Right now we are doing some tasks manually while the entire solution is prepared. One of these tasks is the update process in some Resource Groups. Normally, what we do is export a template from our development environment and then import that .zip template file. However, if a pipeline or other object is already present in the target ADF, the imported objects are created with a "1" suffix.
Then we need to rename the objects (pipelines, data flows, data sets) to implement the new solution. We were wondering if there is a way to avoid this, or another way to do it.

If both your Test_Pipeline and Test_Pipeline1 are going to perform similar types of actions, you can create a dynamic pipeline using parameters in ADF: build a metadata/config store describing the operations you need to perform and read it with a Lookup activity. That avoids creating multiple near-identical pipelines.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
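For illustration, a rough sketch of that pattern with the ADF Python SDK (azure-mgmt-datafactory): a Lookup activity reads a config table and a ForEach loop calls one parameterized child pipeline per row. The dataset, pipeline and resource names here (ConfigTable, Child_Copy_Pipeline, my-resource-group, etc.) are placeholders, not taken from the question.

    # Metadata-driven pattern: Lookup reads config rows, ForEach fans out over them
    # and calls a single parameterized child pipeline per row.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, LookupActivity, ForEachActivity, ExecutePipelineActivity,
        AzureSqlSource, DatasetReference, PipelineReference, Expression, ActivityDependency,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    lookup = LookupActivity(
        name="LookupConfig",
        dataset=DatasetReference(reference_name="ConfigTable"),  # hypothetical config dataset
        source=AzureSqlSource(sql_reader_query="SELECT SourcePath, SinkTable FROM dbo.Config"),
        first_row_only=False,
    )

    for_each = ForEachActivity(
        name="ForEachConfigRow",
        items=Expression(value="@activity('LookupConfig').output.value"),
        activities=[
            ExecutePipelineActivity(
                name="RunChildPipeline",
                pipeline=PipelineReference(reference_name="Child_Copy_Pipeline"),  # hypothetical child
                parameters={
                    "sourcePath": {"value": "@item().SourcePath", "type": "Expression"},
                    "sinkTable": {"value": "@item().SinkTable", "type": "Expression"},
                },
                wait_on_completion=True,
            )
        ],
        depends_on=[ActivityDependency(activity="LookupConfig", dependency_conditions=["Succeeded"])],
    )

    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory", "Dynamic_Master_Pipeline",
        PipelineResource(activities=[lookup, for_each]),
    )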

Related

Azure Data Flow generic curation framework

I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10-100) and write to sink as curated tables:
For each raw data feed, need to validate the expected schema (based on a parameterized file name)
For each raw data feed, need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.)
Using the Python SDK, create Data Factory and mapping data flow pipelines using the Data Flow Script prepared with the parameters provided (for schema validation)
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues into a Log Analytics workspace, and tears down the resources at specific schedules.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time to validate/curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts scheduled to run them on generic data pipelines at specific times of the day.
many thanks
CK
To validate the schema, you can keep a reference dataset that has the same schema (the first row) as your main dataset. Then use a “Get Metadata” activity on each dataset, with the Structure field selected, to retrieve each dataset's column structure.
You can then use an “If Condition” activity to compare the structure of the two datasets with the equals logical function; a sketch of the expression is shown below.
If both datasets' structures match, your next required activity (like copying the dataset to another container) is performed.
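As a rough sketch using the Python SDK models (the activity names "Get Metadata1" and "Get Metadata2" and the dataset names are placeholders, not taken from the original pipeline), the If Condition and its equals expression could look like this:

    from azure.mgmt.datafactory.models import (
        IfConditionActivity, CopyActivity, Expression, ActivityDependency,
        BlobSource, BlobSink, DatasetReference,
    )

    validate = IfConditionActivity(
        name="CompareStructures",
        # equals() over the 'structure' field returned by the two Get Metadata activities
        expression=Expression(
            value="@equals(activity('Get Metadata1').output.structure, "
                  "activity('Get Metadata2').output.structure)"
        ),
        if_true_activities=[
            CopyActivity(
                name="CopyValidatedFile",
                inputs=[DatasetReference(reference_name="RawCsv")],       # hypothetical datasets
                outputs=[DatasetReference(reference_name="CuratedCsv")],
                source=BlobSource(),
                sink=BlobSink(),
            )
        ],
        if_false_activities=[],  # e.g. log the mismatch or fail the run
        depends_on=[
            ActivityDependency(activity="Get Metadata1", dependency_conditions=["Succeeded"]),
            ActivityDependency(activity="Get Metadata2", dependency_conditions=["Succeeded"]),
        ],
    )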
The complete pipeline then chains the two “Get Metadata” activities into the “If Condition” activity, with the follow-up activity in its true branch.
The script which you want to run on the ingested dataset can be executed using a “Custom” activity. You again need to create a linked service and its corresponding dataset for the script that validates the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling the pipeline is taken care of by triggers in Azure Data Factory: a schedule trigger covers the requirement to run your pipeline automatically at specific times.
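A minimal sketch of such a schedule trigger through the Python SDK (pipeline, factory and trigger names are placeholders; the trigger still has to be started after it is created):

    from datetime import datetime, timezone

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        ScheduleTrigger, ScheduleTriggerRecurrence, RecurrenceSchedule,
        TriggerResource, TriggerPipelineReference, PipelineReference,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    trigger = ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
            time_zone="UTC",
            schedule=RecurrenceSchedule(hours=[2], minutes=[0]),  # run daily at 02:00
        ),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(reference_name="ValidateRawFeedPipeline")
            )
        ],
    )

    adf.triggers.create_or_update(
        "my-resource-group", "my-data-factory", "DailyValidationTrigger",
        TriggerResource(properties=trigger),
    )
    # Start the trigger afterwards (triggers.begin_start / triggers.start, depending on SDK version).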

Event based trigger for a sequential run of the same data factory pipeline

I would like to use an event based trigger to run a data factory pipeline.
The trigger will check a folder in a data lake for any new file and start a pipeline once a new CSV file is copied.
The pipeline will then copy the data to an intermediate table to check its consistency (multiple checks using different data flow activities) and if everything's correct, copies it into a stage table.
It is thus very important that the intermediate table will contain the data from only one single CSV file before it is checked.
I have read, though, that the event-based trigger will start as many pipelines in parallel as there are (simultaneously) downloaded CSV files.
Is this right? In that case, how can I force each pipeline to wait until the previous one is done?
Thank you for your help.
There is a setting in the pipeline properties (accessible in the top-right of the editor pane) called concurrency. Set this to 1 and only one run will execute at a time; any other invocations will be queued until that one finishes.
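The same property can also be set programmatically; a minimal sketch with the Python SDK, assuming placeholder resource and pipeline names (in the pipeline's JSON it is simply "concurrency": 1):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Fetch the existing pipeline, cap it at one run at a time, and redeploy it.
    pipeline = adf.pipelines.get("my-resource-group", "my-data-factory", "LoadCsvPipeline")
    pipeline.concurrency = 1  # extra triggered runs are queued until the active run finishes

    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory", "LoadCsvPipeline", pipeline
    )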

Execute a pipeline after another pipeline completes

I have a first pipeline that ingests data for multiple countries from BigQuery to Azure; it copies the transformed BigQuery data into Azure.
In Data Factory, I created a folder per country, each holding multiple pipelines: for example, a specific machine learning model for only 1 or 2 countries, a data preparation pipeline for an application for only 5 countries, etc.
I think I need this folder structure per market to keep it clear for anybody who needs to implement a pipeline, and to avoid errors.
My main problem with this is how to call, for example, a machine learning pipeline in my UK folder that may only start after the first pipeline (the BigQuery-to-Azure copy) has completed.
I can't use the Execute Pipeline activity because my first pipeline, bigquerytoazure, runs on its own; it's the essential step that must finish before any other pipeline can run.
Is there any way to react to a completed pipeline without using the Execute Pipeline activity?
I thought about having the first pipeline write a dummy blob that could act as a trigger for all the pipelines that come after it.
Thanks in advance, I hope I was clear.
A Data Factory event trigger based on blob storage. I think that's the best way; a sketch follows.
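A minimal sketch, building on the "dummy blob" idea from the question: the first pipeline writes a marker blob, and a blob event trigger starts the downstream pipeline when that blob is created. The storage account, container, pipeline and trigger names below are placeholders; assumes the azure-mgmt-datafactory and azure-identity packages.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        BlobEventsTrigger, TriggerResource, TriggerPipelineReference, PipelineReference,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    trigger = BlobEventsTrigger(
        events=["Microsoft.Storage.BlobCreated"],
        blob_path_begins_with="/triggers/blobs/bigquerytoazure/",  # container/folder of the marker blob
        blob_path_ends_with="_done.txt",
        scope=("/subscriptions/<subscription-id>/resourceGroups/my-resource-group"
               "/providers/Microsoft.Storage/storageAccounts/mystorageaccount"),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(reference_name="UK_ML_Pipeline")  # hypothetical
            )
        ],
    )

    adf.triggers.create_or_update(
        "my-resource-group", "my-data-factory", "AfterBigQueryCopy",
        TriggerResource(properties=trigger),
    )
    # Start the trigger once it is created so it begins firing on new blobs.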
Another way: you can think about using a Logic App. Add a trigger that listens to the copied BigQuery table in the SQL database; if the table is modified, execute the Data Factory pipeline. Create a workflow for the pipeline runs.
Workflow:
SQL Server Trigger: when an item is modified.
Add a parallel branch
Data Factory Action: Get a pipeline run
Reference: Automate workflows for SQL Server or Azure SQL Database by using Azure Logic Apps
Hope this helps.

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to setup ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on file name, there is a branching conditional that will fire the right dataflow. E.g. if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it loads a different dataflow. Also, they'd all be using the same generic target sink datasource, the table name being parameterized and those parameters being set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are not present in a customer's source files.
The way to protect against that in ADF Data Flow is to use column patterns. These will allow you to define mappings that are generic and more flexible.
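For the generic target sink the question describes, one option is a single dataset whose schema and table names are dataset parameters filled in by the pipeline (from the "schemaname" variable and the triggering file name). A rough sketch with the Python SDK models; the linked service and parameter names are placeholders, and older SDK versions expose the table as a single table_name property instead:

    from azure.mgmt.datafactory.models import (
        AzureSqlTableDataset, DatasetResource, LinkedServiceReference, ParameterSpecification,
    )

    generic_sink = AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(reference_name="Stage1SqlDb"),  # hypothetical
        parameters={
            "schemaName": ParameterSpecification(type="String"),
            "tableName": ParameterSpecification(type="String"),
        },
        # Resolved at run time, e.g. CustomerA.Employees
        schema_type_properties_schema={"value": "@dataset().schemaName", "type": "Expression"},
        table={"value": "@dataset().tableName", "type": "Expression"},
    )

    dataset_resource = DatasetResource(properties=generic_sink)
    # Deploy with the management client's datasets.create_or_update(...), then have each
    # copy/data flow step pass schemaName and tableName when it references this dataset.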

Is it possible to 'update' an ADF pipeline property through PowerShell

I would like to update some of the properties of an ADF pipeline (e.g. the concurrency level) across lots of pipelines. I am not able to find a cmdlet that can do this through PowerShell. I know I can drop the existing pipeline and create a new one, but that will start reprocessing all the Ready slices for that pipeline's active period, which I don't want, because it would mean working out up to what point the existing pipeline has already processed slices. And the change is only temporary; at some stage I am going to revert the settings again. I just want the pipeline to change one of its properties. Doing this manually through the UI is slow and tedious. I am guessing there is no way around this, but let me know if you know of one.
You can still use "New-AzureRmDataFactoryPipeline" for this Update scenario:
https://msdn.microsoft.com/en-us/library/mt619358.aspx
Use with the -Force parameter to force it to proceed even if the message reads "... may overwrite the existing resource".
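For example, a hedged sketch of the call (resource names and the JSON file path are placeholders; these are the ADF v1 AzureRM cmdlets):

    # Re-deploy the updated pipeline definition over the existing pipeline.
    New-AzureRmDataFactoryPipeline -ResourceGroupName "MyResourceGroup" `
        -DataFactoryName "MyDataFactory" `
        -File "C:\adf\MyPipeline.json" `
        -Force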
Under the hood, it's the same HTTP PUT API call used by the Azure portal UX. You can verify that with Fiddler.
The already executed slices won't be re-run unless you set their status back to PendingExecution.
This rule applies to LinkedService and Dataset as well, but NOT to the top-level DataFactory resource. A New-AzureRmDataFactory will cause the service to delete the existing data factory along with all its sub-resources and create a brand new one, so be careful there.