Event based trigger for a sequential run of the same data factory pipeline - azure-data-factory

I would like to use an event based trigger to run a data factory pipeline.
The trigger will check a folder in a data lake for any new file and start a pipeline once a new CSV file is copied.
The pipeline will then copy the data to an intermediate table to check its consistency (multiple checks using different data flow activities) and if everything's correct, copies it into a stage table.
It is thus very important that the intermediate table will contain the data from only one single CSV file before it is checked.
I have read though that the event based trigger will start in parallel as many pipelines as the (simultaneously) downloaded CSV files.
Is this right? If so, how can I force each pipeline run to wait until the previous one is done?
Thank you for your help.

There is a setting in the pipeline properties (accessible in the top-right of the editor pane) called concurrency. Set this to 1 and only one run will execute at a time; any other invocations will be queued until that one finishes.
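The same setting shows up in the pipeline's JSON definition as a `concurrency` property. Here is a minimal sketch in Python that sets it on a pipeline definition held as a dictionary. The pipeline name `CheckAndStageCsv` and the helper `with_concurrency` are made up for illustration; only the `properties.concurrency` field is meant to mirror the real definition.

```python
import copy
import json


def with_concurrency(pipeline: dict, limit: int = 1) -> dict:
    """Return a copy of a pipeline definition with its concurrency limit set."""
    if limit < 1:
        raise ValueError("concurrency must be at least 1")
    updated = copy.deepcopy(pipeline)  # leave the caller's definition untouched
    updated.setdefault("properties", {})["concurrency"] = limit
    return updated


# Hypothetical pipeline definition; the name and the empty activity list are placeholders.
pipeline = {
    "name": "CheckAndStageCsv",
    "properties": {"activities": []},
}

serialized = with_concurrency(pipeline, 1)
print(json.dumps(serialized, indent=2))
```

With the limit at 1, each file that fires the trigger still creates its own run, but the runs execute one after another, so the intermediate table only ever holds one file's data at a time.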

Related

Issue with Copy Activity Metadata

The Copy Data activity used to show the number of rows written, but it isn't showing up any more.
Is there any option in the copy activity to make sure it reports the number of rows written?
Whether it is a debug run or a triggered pipeline run, you can check the output of the Copy Data activity to see whether the data read equals the data written.
Let's say it is a triggered pipeline run. Navigate to the Monitor section and click on the pipeline run.
Now the activity run dialog opens up. There you can check from the activity output whether the data read equals the data written:
NOTE: The above is for a blob-to-blob copy. For your source and sink, there will be similar activity output data that might contain the required information (like rows read and rows written). The following is an example for Azure SQL database to blob:
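As a rough illustration of that check (not a reproduction of the original example), here is a minimal Python sketch that inspects a copy activity output held as a dictionary. The sample values are invented. The field names `rowsRead`, `rowsCopied`, `dataRead` and `dataWritten` are the ones the copy activity output normally uses, but which of them appear depends on your source and sink, so treat that as an assumption to verify against your own run.

```python
def copy_counts_match(output: dict) -> bool:
    """Compare read vs. written metrics from a copy activity output.

    Row counts are preferred when present (tabular sources and sinks);
    otherwise the byte counts are compared (for example, blob-to-blob copy).
    """
    if "rowsRead" in output and "rowsCopied" in output:
        return output["rowsRead"] == output["rowsCopied"]
    if "dataRead" in output and "dataWritten" in output:
        return output["dataRead"] == output["dataWritten"]
    raise KeyError("no comparable read/written metrics in activity output")


# Invented sample outputs, shaped like the JSON shown in the monitoring view.
sql_to_blob = {"rowsRead": 1200, "rowsCopied": 1200, "dataRead": 40960, "dataWritten": 51200}
blob_to_blob = {"dataRead": 2048, "dataWritten": 2048, "filesRead": 1, "filesWritten": 1}
partial_copy = {"rowsRead": 1200, "rowsCopied": 1150}

for label, sample in [("sql_to_blob", sql_to_blob),
                      ("blob_to_blob", blob_to_blob),
                      ("partial_copy", partial_copy)]:
    print(label, copy_counts_match(sample))
```

Byte counts are only compared when no row counts are available, because a format change between source and sink (for example, SQL rows written as text) can make the bytes differ even when every row was copied.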

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in account storage. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contacts etc.) that each essentially do the following:
accesses all of the correct files, using a wildcard path along the lines of: Raw/CRM/ * / * / * /campaign.parquet
selects the columns that I need
Rename column headings
in some cases, just take the most recent data (using the RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using an Azure Data Factory data flow: when I tell it which dataset to use, I point it to the dataset I created in my original pipeline, but this dataset covers all 12 entities and the data flow activity produces the error: "No value provided for Parameter 'cw_fileName'". I don't see any place when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect)
Using Azure Data Factory, I tried to add a script, but when trying to connect to my SQL script in Synapse, I don't know my Service Principal Key for the Synapse workspace
Using a Databricks notebook, I tried to mount my container but got an error along the lines of "adding secret to Databricks scope doesn't work in Standard Tier", so I couldn't proceed
Using Synapse, but as expected, it wants things in SQL whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction? What's the best approach to take? And if it's one that I've described above, how do I get past the issue I've described?
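Pass the values for the data flow's dataset parameters from the Data Flow activity's settings in the pipeline. The parameter belongs to the dataset, so it is supplied on the activity that runs the data flow rather than inside the data flow's source configuration.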
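To make that concrete, here is a minimal sketch in Python that builds one Data Flow activity definition per entity, each passing a different value for `cw_fileName`. Several things here are assumptions: the source name `source1`, the data flow name `ProcessEntity`, the activity names, and the `datasetParameters` layout, which is my understanding of what the activity's code view produces. Check the JSON that your own pipeline generates before relying on it.

```python
import json


def data_flow_activity(name: str, data_flow: str, source_name: str, file_name: str) -> dict:
    """Build a Data Flow activity that supplies the dataset parameter cw_fileName."""
    return {
        "name": name,
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataflow": {
                "referenceName": data_flow,
                "type": "DataFlowReference",
                # Assumed layout: dataset parameters keyed by the source's name
                # inside the data flow.
                "datasetParameters": {
                    source_name: {"cw_fileName": file_name},
                },
            }
        },
    }


# Three of the entities named in the question; the remaining ones follow the same pattern.
entities = ["campaign", "lead", "contact"]
activities = [
    data_flow_activity(f"Process_{entity}", "ProcessEntity", "source1", f"{entity}.parquet")
    for entity in entities
]
print(json.dumps(activities[0], indent=2))
```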

need help in ADF trigger with Blob

I have a pipeline which needs to be triggered when a file is received in Blob storage.
But the complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in location 1, Pipeline A should trigger, and when B.JSON is generated in location 2, Pipeline A should also trigger. I have set up the blob trigger for 1 file in 1 location, but I'm not sure how to do it when 2 different files arrive in 2 different locations.
There are three ways you could do this.
Use ADF directly, with conditions to evaluate whether the file that fired the trigger is from a specific path, as per your need.
Set up Logic Apps, one for each path you want to monitor for created blobs.
Add two different triggers configured for the different paths (best option).
First method: (This has the overhead of running every time a file is created in the container.)
Edit the trigger to look through the whole storage account or all containers. Select the file type: JSON in your case.
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline: one each for the folder paths you want to monitor and one for holding the triggered file name,
where receive_trigger_files will be assigned the triggered file name dynamically.
I am showing an example here where a Lookup activity evaluates the path and executes the respective downstream activities if the triggered file path matches one of the monitored paths.
Another for path2.
For example, a Get Metadata activity, or any other activity in your scenario.
Let's manually debug and check for a file exercise01.json that is stored in path2.
You can also use an If Condition activity similarly, but it would require multiple steps, and monitoring by activity status would be less clear.
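The decision at the core of the first method is just a path comparison. Here is a minimal sketch of it in plain Python rather than ADF expressions. The folder values `container/path1` and `container/path2` and the function name are placeholders; in the pipeline, the equivalent values would come from the trigger and the pipeline parameters.

```python
from typing import Optional


def match_monitored_path(trigger_folder: str, monitored: dict) -> Optional[str]:
    """Return the label of the monitored path that the triggering folder matches.

    Leading and trailing slashes are ignored; the comparison is otherwise
    exact and case-sensitive.
    """
    normalized = trigger_folder.strip("/")
    for label, path in monitored.items():
        if normalized == path.strip("/"):
            return label
    return None


# Placeholder paths standing in for the two monitored locations.
monitored_paths = {"path1": "container/path1", "path2": "container/path2"}

print(match_monitored_path("container/path2/", monitored_paths))
print(match_monitored_path("container/other", monitored_paths))
```

A result of `None` corresponds to the case where the pipeline was started by a file outside both monitored paths and should do nothing, which is the overhead mentioned at the start of this method.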
Second method: Set up a blob-triggered logic app.
Run the ADF pipeline using the Create a pipeline run action, and set or pass the appropriate parameters as explained previously.
Third method: Add 2 triggers, one for each path where you wish to monitor blob creation.
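For the third method, the two triggers differ only in their path filters, and both point at the same pipeline. Here is a minimal sketch in Python that builds both definitions. The trigger names, the container names `location1` and `location2`, and the scope string with its angle-bracket placeholders are all assumptions for illustration; the property names follow the storage event trigger's JSON as I understand it, so compare against a trigger exported from your own factory.

```python
import json

# Placeholder resource ID; substitute the real storage account's ID.
STORAGE_SCOPE = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)


def blob_created_trigger(name: str, begins_with: str, ends_with: str, pipeline: str) -> dict:
    """Build a storage event trigger that starts `pipeline` when a matching blob is created."""
    return {
        "name": name,
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                "blobPathBeginsWith": begins_with,
                "blobPathEndsWith": ends_with,
                "ignoreEmptyBlobs": True,
                "scope": STORAGE_SCOPE,
                "events": ["Microsoft.Storage.BlobCreated"],
            },
            "pipelines": [
                {"pipelineReference": {"referenceName": pipeline, "type": "PipelineReference"}}
            ],
        },
    }


# Hypothetical containers standing in for "location 1" and "location 2".
triggers = [
    blob_created_trigger("OnFileA", "/location1/blobs/", "A.JSON", "Pipeline A"),
    blob_created_trigger("OnFileB", "/location2/blobs/", "B.JSON", "Pipeline A"),
]
print(json.dumps(triggers, indent=2))
```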

Validation checks on dataset ADF vs databricks

I want to perform some file-level and field-level validation checks on the dataset I receive.
Below are some checks which I want to perform, capturing any issues into audit tables.
File-level checks: file present, size of the file, count of records matches the count present in the control file
Field-level checks: content in the right format, duplicate key checks, range checks on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If it is ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check the file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should keep trying before giving up). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity can time out, which dependent activities treat as a failure.
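Those settings map onto a small JSON fragment. Here is a minimal sketch in Python that builds a Validation activity definition with a shorter timeout than the default. The activity name, dataset name and chosen values are placeholders; the timeout string uses the day.hour:minute:second form, and the exact field names should be checked against the activity's code view.

```python
import json
from typing import Optional


def validation_activity(
    name: str,
    dataset: str,
    timeout: str = "0.00:10:00",       # d.hh:mm:ss, here 10 minutes instead of 7 days
    sleep_seconds: int = 30,           # wait between existence checks
    minimum_size: Optional[int] = None,  # bytes; omitted when not needed
) -> dict:
    """Build a Validation activity definition for a dataset existence check."""
    type_properties = {
        "dataset": {"referenceName": dataset, "type": "DatasetReference"},
        "timeout": timeout,
        "sleep": sleep_seconds,
    }
    if minimum_size is not None:
        type_properties["minimumSize"] = minimum_size
    return {"name": name, "type": "Validation", "typeProperties": type_properties}


# Hypothetical dataset name; require at least 1 byte so empty files fail the check.
activity = validation_activity("WaitForInputFile", "InputCsvDataset", minimum_size=1)
print(json.dumps(activity, indent=2))
```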
To count matching records, assuming you are using CSV, you could create a generic dataset (one column) and run a copy activity from whatever folders you want to count to a temp folder. Get the row count from the copy activity and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup Activity (gets your list of base folders, just for easy rerunning)
For Each (Base Folder)
Copy Recursively to temp folder
Stored procedure activity which stores the Copy Activity.output.rowsCopied
Delete temp files recursively.
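Once the row count is stored, the file-level check itself is a comparison against the count in the control file. Here is a minimal sketch of that arithmetic in plain Python, outside ADF. The file names and contents are invented, and the assumption that every CSV has a single header row is mine.

```python
import csv
import io


def count_records(csv_text: str, has_header: bool = True) -> int:
    """Count the data rows in CSV text, skipping blank lines and an optional header."""
    rows = sum(1 for row in csv.reader(io.StringIO(csv_text)) if row)
    return max(rows - 1, 0) if has_header else rows


def check_against_control(files: dict, expected: int) -> dict:
    """Compare the total record count across files with the control file's count."""
    actual = sum(count_records(text) for text in files.values())
    return {"expected": expected, "actual": actual, "matches": actual == expected}


# Invented sample data standing in for the files in one base folder.
sample_files = {
    "part1.csv": "id,value\n1,x\n2,y\n",
    "part2.csv": "id,value\n3,z\n",
}
result = check_against_control(sample_files, expected=3)
print(result)
```

The resulting dictionary is the kind of record that could be written to an audit table, with `matches` flagging files whose count disagrees with the control file.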
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/

Using Azure Data Factory output in Logic App

I have a logic app that runs on occurrence initially that runs an ADF
pipeline which outputs a folder of files.
Then, I use a List Blobs action to pull one specific file
from the newly made folder and place its path on a queue.
And once a message is placed on that queue, it triggers the run of
another ADF pipeline.
The issue is I have not seen a way to get the output of the first ADF pipeline to put on the queue. I have tried to cheat within the List Blobs action that is sequential to the 1st ADF pipeline by explicitly searching the name of the output folder because it will be the same every time.
However, even after the 1st ADF pipeline has run and produced the folder, within the first run of this Logic App the List Blobs action can't find the folder and says the file path is not found.
Only after I run the Logic App a second time is the folder finally found, which is not at all optimal. How can I fix this? I prefer to keep everything in one logic app. Are there other Azure tools that can help in addition?
I don't have the details of the implementation, but I am wondering whether the message written by the first pipeline is only used as a signal for the second pipeline. If that's the case, why can't you call the second pipeline on completion of the first one? Or are these pipelines perhaps in different data factories?
I suggest you read up on event triggers and see if you can use them.
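If both pipelines do live in the same data factory, calling the second on completion of the first can be done with a parent pipeline containing two Execute Pipeline activities. Here is a minimal sketch in Python that builds such a definition. The pipeline names `ProduceFiles` and `ConsumeFile`, the parent name and the activity names are placeholders, and this is one possible arrangement rather than a confirmed fix for the setup described in the question.

```python
import json


def execute_pipeline(name: str, pipeline: str, depends_on: str = "") -> dict:
    """Build an Execute Pipeline activity that waits for the child pipeline to finish."""
    activity = {
        "name": name,
        "type": "ExecutePipeline",
        "typeProperties": {
            "pipeline": {"referenceName": pipeline, "type": "PipelineReference"},
            "waitOnCompletion": True,
        },
    }
    if depends_on:
        # Run only after the named activity has succeeded.
        activity["dependsOn"] = [
            {"activity": depends_on, "dependencyConditions": ["Succeeded"]}
        ]
    return activity


# Hypothetical names for the two existing pipelines and the new parent.
parent = {
    "name": "RunBothInOrder",
    "properties": {
        "activities": [
            execute_pipeline("RunFirst", "ProduceFiles"),
            execute_pipeline("RunSecond", "ConsumeFile", depends_on="RunFirst"),
        ]
    },
}
print(json.dumps(parent, indent=2))
```

Because the first activity waits for its child pipeline to complete, the second pipeline only starts once the output folder has been written, so no queue message is needed as a signal.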