Validation checks on dataset: ADF vs Databricks - azure-data-factory

I want to perform some file-level and field-level validation checks on the dataset I receive.
Below are some checks which I want to perform, capturing any issues into audit tables.
File-level checks: file present, size of the file, record count matches the count in the control file.
Field-level checks: content in the right format, duplicate key checks, value ranges in important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If it is ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar

You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check file existence, you can use the Validation activity.
In the Validation activity, you specify the dataset whose existence you want to validate, the sleep interval (how long to wait between retries), and the timeout (how long to keep trying before giving up and timing out). A minimum size is optional.
Be sure to set the timeout value properly; the default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity eventually times out, which downstream dependencies treat as a failure.
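For reference, a minimal sketch of the Validation activity JSON; the dataset reference and the values for sleep, timeout and minimumSize below are placeholders, not taken from the original post:

    {
        "name": "Check source file",
        "type": "Validation",
        "typeProperties": {
            "dataset": { "referenceName": "YourSourceFileDataset", "type": "DatasetReference" },
            "timeout": "0.02:00:00",
            "sleep": 60,
            "minimumSize": 1024
        }
    }

Here the timeout is two hours rather than the 7-day default, sleep is the retry interval in seconds, and minimumSize is in bytes.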
To get the record count and compare it with the control file, and assuming you are using CSV, you could create a generic dataset (one column) and run a Copy activity over whatever folders you want to count, writing to a temp folder. Get the row count from the Copy activity output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders - just for easy rerunning)
For Each (base folder)
Copy recursively to the temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied (a sketch follows this list)
Delete the temp files recursively.
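As a rough sketch of the row-count capture step, the Stored Procedure activity can pass the copy output straight into an audit procedure's parameters (the copy activity name Copy to temp is a placeholder I have assumed):

    RowsCopied parameter:  @activity('Copy to temp').output.rowsCopied
    FolderName parameter:  @string(item())

rowsCopied comes from the Copy activity's output, and item() is the current base folder from the surrounding ForEach.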
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/

How to check IF several files exist in a folder with foreach activity?

UPDATE
I am uploading 2 files (each will have the same filename every week) to an Azure Blob storage container.
I would like to execute another pipeline only if these two files exist.
So what I thought to do is:
Create an empty folder in an Azure Blob storage container
Upload these two files into this folder
Check in this folder whether both files exist before executing a main pipeline. There would be two triggers, one for each file, and I guess with the second trigger I will find both files.
a) Get Metadata activity
b) ForEach activity
c) If Condition: to check if the two specific files exist
If they exist, I move these two files to another folder and execute the other pipeline. This way I keep the folder empty for the next upload.
These two files will always have the same name. Example: file_1.csv and file_2.csv
But I don't know if it is technically possible or what to do inside each of these steps. What can I do?
If you just want to check whether the files exist and get their names, the Get Metadata activity will return the values for the arguments you have added in its Field list.
For example, it returns output for the Exists and Child items arguments.
Now, in the ForEach activity, you can capture the Child items values by using a dynamic expression in the ForEach activity -> Settings tab -> Items field.
Use the below expression:
@activity('Get Metadata1').output.childItems
Use the If Condition activity inside the ForEach activity rather than separately. Based on the child item names, you can then assign tasks to the True and False branches inside the If Condition activity.
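If you would rather test for both files in a single expression instead of looping, a sketch of an If Condition over the same Get Metadata output could look like the below (the activity name Get Metadata1 matches the expression above; the file names are the ones from the question):

    @and(contains(string(activity('Get Metadata1').output.childItems), 'file_1.csv'), contains(string(activity('Get Metadata1').output.childItems), 'file_2.csv'))

If this evaluates to true, the True branch can move the two files and call the main pipeline with an Execute Pipeline activity.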
So this is how I would roll it. The architecture is an adaptation of a binary AND, i.e. when A AND B are both true, then do something. Since A and B are independent events and one can happen before the other, we need to wait for the other event before executing the process.
So here are two approaches:
#1
You only have a trigger on file_2.csv, which triggers the ADF pipeline.
In ADF, start with an Until activity; inside it, run a Wait activity for, say, a minute, then a Get Metadata activity to grab the child items, then a Filter activity to check whether file_1.csv is present in the child items array. If yes, set a variable which breaks the Until condition.
Then write your processing logic, as both files are now available.
This approach is good if we know both files will arrive within some duration of one another. However, I would bake in logic to break out and end the pipeline in error when, say, 20 minutes have passed without the second file arriving. A sketch of the loop follows.
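As a rough sketch of approach #1, assuming a Boolean pipeline variable fileFound initialised to false and activity names of my own choosing (Get Metadata1, Filter1):

    Until expression:          @variables('fileFound')
    Filter activity items:     @activity('Get Metadata1').output.childItems
    Filter condition:          @equals(item().name, 'file_1.csv')
    Set Variable (fileFound):  @greater(activity('Filter1').output.FilteredItemsCount, 0)

The 20-minute bail-out can be handled by counting iterations inside the loop and routing to a Fail activity once the count is exceeded, or by tightening the Until activity's timeout.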
#2
With the blob event trigger you can pass the name of the file; see the link on how it is done with @triggerBody().fileName. So have a trigger on both files.
The next thing I would do is move this file to a new folder. But before I do, I check the count of files that already exist in the destination folder. If it is 0, then my new file gets the prefix file1- added to it (file1-originalfilename). If the destination already has 1 file, then the prefix is file2-.
Have a trigger on the destination folder which fires when a file with the prefix file2- arrives. In this newly triggered pipeline, write your processing code. A sketch of the prefix logic follows.
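A sketch of that prefix decision, assuming the triggered file name reaches the pipeline through a parameter (here called triggeredFileName) mapped from @triggerBody().fileName, and a Get Metadata activity (here called Get Metadata Destination) that lists the destination folder:

    If Condition expression:  @equals(length(activity('Get Metadata Destination').output.childItems), 0)
    True branch sink file:    @concat('file1-', pipeline().parameters.triggeredFileName)
    False branch sink file:   @concat('file2-', pipeline().parameters.triggeredFileName)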

need help in ADF trigger with Blob

I have a pipeline (Pipeline A) which needs to be triggered when a file is received in a Blob container.
The complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in Location 1, Pipeline A should trigger, and when B.JSON is generated in Location 2, Pipeline A should also trigger. I have done the blob trigger using 1 file in 1 location, but I am not sure how to do it when 2 different files arrive in 2 different locations.
There are three ways you could do this.
Using ADF directly, with conditions to evaluate whether the triggering file is from a specific path, as per your need.
Set up a Logic App for each of the different paths you want to monitor for created blobs.
Add two different triggers configured for the different paths (best option).
First method: (This has the overhead of running every time a file lands anywhere in the container.)
Edit the trigger to look through the whole storage account or all containers. Select the file type: JSON in your case.
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline: one for each folder path you want to monitor, and one for holding the triggered filename,
where receive_trigger_files will be assigned the triggered file name dynamically.
I am showing an example here where a Lookup activity evaluates the path and the respective downstream activities execute if the triggered file path matches one of our monitored paths,
and another branch for path2.
For example, a Get Metadata activity, or whatever fits your scenario.
Let's manually debug and check for a file exercise01.json that is stored in path2.
You can also use an If Condition activity similarly, but it would require multiple steps, and monitoring using activity statuses won't be as clear. A rough sketch of the trigger mapping and path check follows.
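For orientation, a minimal sketch of how the trigger values could reach the pipeline and how the path check could be expressed. receive_trigger_files is the parameter named above; the other names (receive_trigger_folder, monitor_path1) are assumptions for illustration.

    Trigger parameter mapping (Parameters tab of the blob event trigger):
        receive_trigger_files:   @triggerBody().fileName
        receive_trigger_folder:  @triggerBody().folderPath

    Path check expression:
        @equals(pipeline().parameters.receive_trigger_folder, pipeline().parameters.monitor_path1)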
Second method: Set up a blob-triggered Logic App.
Run the ADF pipeline using the Create a pipeline run action, and set or pass the appropriate parameters as explained previously.
Third method: Add 2 triggers, each for a path you wish to monitor for blob creation.

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names to move only the files matching the conditions into another folder. I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) and then to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things.

Is there a way to link the Filter activity directly to the Copy activity without using the ForEach activity, i.e. pass the collection from the Filter straight into the Copy's source?

Alternatively, some of our other pipelines just use the Copy activity pointing at a source folder, and we configure its fileFilter setting with a simple regex using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex and I need to compare attributes in each file's name with values to decide whether the file should be moved. The fileFilter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder and put the conditional logic in the fileFilter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write array output as text to a .json in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data Source:
Copy Data Sink:
Write the JSON (array output) to a text file that has the names of the files you want to copy.
Copy activity source (to get it from JSON to .txt): the .json file written above.
The sink will be a .txt file in your Blob Storage.
Use that text file in your main Copy activity's source, via the "List of files" path setting.
This should copy over all the files that you identified in your Filter activity.
I realize this is a workaround, but it really is the only solution for what you are asking. Otherwise there is no way to link a Filter activity straight to a Copy activity.
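For reference, a rough sketch of what the main Copy activity's source could then look like in JSON, assuming an on-premises file system source (FileServerReadSettings) over a Binary dataset; the path is a placeholder, and the list file must be reachable from the same source data store, one relative file path per line:

    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "FileServerReadSettings",
            "fileListPath": "temp/files-to-copy.txt"
        }
    }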

Event based trigger for a sequential run of the same data factory pipeline

I would like to use an event based trigger to run a data factory pipeline.
The trigger will check a folder in a data lake for any new file and start a pipeline once a new CSV file is copied.
The pipeline will then copy the data to an intermediate table to check its consistency (multiple checks using different data flow activities) and if everything's correct, copies it into a stage table.
It is thus very important that the intermediate table contains the data from only a single CSV file before it is checked.
I have read, though, that the event-based trigger will start in parallel as many pipeline runs as there are (simultaneously) downloaded CSV files.
Is this right? In that case, how can I force each pipeline run to wait until the previous one is done?
Thank you for your help.
There is a setting in the pipeline properties (accessible in the top-right of the editor pane) called concurrency. Set this to 1 and only one run will execute at a time; any other invocations will be queued until that one finishes.
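In the underlying pipeline JSON this is the concurrency property; a minimal sketch (the pipeline name is a placeholder and the activities are omitted):

    {
        "name": "ProcessSingleCsv",
        "properties": {
            "concurrency": 1,
            "activities": []
        }
    }

Each event-triggered run then queues behind the one already in progress, so the intermediate table only ever holds one CSV's data while it is being checked.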

Azure Data Factory : Set a limit to copy number of files using Copy activity

I have a Copy activity used in my pipeline to copy files from Azure Data Lake Gen2. The source location may have thousands of files that need to be copied, but we need to set a limit on the number of files to copy. Is there any option available in ADF to achieve this, barring a custom activity?
E.g.: I have 2000 files available in the data lake, but while running the pipeline I should be able to pass a parameter to copy only 500 files.
Regards,
Sandeep
I think you can use a Lookup activity with a ForEach loop and a Copy activity to achieve this. You will have to use a counter variable as well (this will make the process slow, as you will have to copy one file at a time). The Lookup activity has a limit of 5000 rows at this time, so you will have to keep that in mind.
I would use the Get Metadata activity to get a list of all items in your data lake: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
After that, you can use a ForEach step to loop through the list of files and copy them. In order to set a limit, you can create two variables/parameters: limit and files_copied. At the beginning of each iteration, check whether files_copied is less than limit; if so, perform the copy operation and add 1 to files_copied. A sketch of these expressions follows.
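A sketch of those expressions, assuming the ForEach runs sequentially, limit is an Int pipeline parameter and files_copied is a String variable; the tmp_count variable is mine, needed because a Set Variable activity cannot reference the variable it is setting:

    If Condition expression:    @less(int(variables('files_copied')), pipeline().parameters.limit)
    Set Variable tmp_count:     @string(add(int(variables('files_copied')), 1))
    Set Variable files_copied:  @variables('tmp_count')

Pipeline variables can only be String, Boolean or Array, which is why the counter is kept as a string and converted with int().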
Alternatively, you can create a database table with the names of all the files after the first step and then use Lookup and ForEach steps, just like @HimanshuSinha-msft mentioned. In the Lookup step you can use a SQL OFFSET/FETCH query in combination with your limit parameter to process only a certain number of files. That also works around the 5000-row limit of the Lookup activity.
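A sketch of such a Lookup query, where dbo.file_list is a hypothetical table of file names and limit is the pipeline parameter; enter the query as dynamic content so the @{...} interpolation is evaluated, and untick firstRowOnly on the Lookup so the full result set comes back:

    SELECT file_name
    FROM dbo.file_list
    ORDER BY file_name
    OFFSET 0 ROWS FETCH NEXT @{pipeline().parameters.limit} ROWS ONLY

The ForEach then iterates over @activity('Lookup1').output.value, where Lookup1 is whatever your Lookup activity is named.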