How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names and move only the files matching the conditions into another folder.
I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) and then to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting on the ForEach to 50, but it hasn't improved things.
Is there a way to link the Filter activity directly to the Copy activity without using a ForEach activity, i.e. pass the collection from the Filter straight into the Copy activity's source?
Alternatively, some of our other pipelines just use a Copy activity pointing at a source folder, with its file filter setting configured as a simple wildcard pattern using a combination of * and ?, which is extremely fast. In this particular scenario, however, my conditional logic is more complex and I need to compare attributes in each file's name with values to decide whether the file should be moved. The file filter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy activity at the source folder and put the conditional logic in the file filter's dynamic content, but how would I get a reference to the file name to do the conditional checks?

Here is one solution:
Write the array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data Source:
Copy Data Sink:
Write the JSON (array output) to a text file that contains the names of the files you want to copy.
Copy activity source (to get it from JSON to .txt):
The sink will be a .txt file in your Blob storage.
Use that text file in your main Copy activity via its 'List of files' setting:
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking; otherwise there is no way to link a Filter activity straight to a Copy activity.
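As a rough sketch of the idea (the activity name FilterFiles and the list file name filelist.txt are hypothetical, and the exact store settings depend on your source connector), the array to persist comes from the Filter activity's output, and the final Copy activity then reads the written list through the 'List of files' (fileListPath) setting in its source:
@activity('FilterFiles').output.Value
"source": {
    "type": "BinarySource",
    "storeSettings": {
        "type": "FileServerReadSettings",
        "recursive": false,
        "fileListPath": "filelist.txt"
    }
}
Each line of filelist.txt should be a path relative to the folder configured in the source dataset.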

Related

How to check IF several files exist in a folder with foreach activity?

UPDATE
I am uploading 2 files (and each will have same filename every week) in an Azure Blob storage container.
I would like to execute another pipeline only if these two files exist.
So what I thought to do is:
Create an empty folder in an Azure blob storage container
Upload these two files in this folder
Check in this folder whether they exist, in order to execute a main pipeline. Two triggers, one for each file, and I guess with the second trigger I will find both files.
a) Get metadata activity
b) Foreach activity
c) If condition : to check if the two specific files exist
If they exist I move these two files to another folder and execute the other pipeline. This way I keep the folder empty for next upload.
These two files will always have the same name. Example: file_1.csv and file_2.csv
But I don't know if it is technically possible, or what to do inside each of these steps. What can I do?
If you just want to check whether the files exist and get the names of the files, the Get Metadata activity will return the values for the arguments you have added in its Field list.
For example, it returns the below output for the Exists and Child items arguments.
Now, in the ForEach activity, you can capture the Child items values by using a dynamic expression in the ForEach activity -> Settings tab -> Items field.
Use the below expression:
@activity('Get Metadata1').output.childItems
Use the If Condition activity inside the ForEach activity rather than separately. Based on the child item names, you can then assign tasks to the True and False branches of the If Condition activity.
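As a minimal sketch, assuming the Get Metadata activity is named Get Metadata1 as above, the If Condition expression inside the ForEach could check each item's name like this:
@or(equals(item().name, 'file_1.csv'), equals(item().name, 'file_2.csv'))
item().name works here because the ForEach iterates over the childItems array, whose elements each expose a name and a type.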
So this is how I would roll it. The architecture is an adaptation of a binary AND, i.e. when A AND B are both true, then do something. Since A and B are independent events and one can happen before the other, we need to wait for the other event before executing the process.
So here are two approaches:
#1
You only have a trigger on file_2.csv, which triggers the ADF pipeline.
In ADF, start with an Until activity; inside it, run a Wait activity for, say, a minute, then run a Get Metadata activity to grab the child items, and then run a Filter activity to check whether file_1.csv is present in the child items array. If yes, set a variable that breaks the Until condition.
Write your processing logic, as both files are now available.
This approach is good if we know both files will arrive within some duration of one another. However, I would bake in logic to break out and end the pipeline in error when, say, 20 minutes have passed without the second file arriving.
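A minimal sketch of the key expressions, assuming hypothetical names Filter1 for the Filter activity and fileFound for a Boolean pipeline variable:
Filter items:              @activity('Get Metadata1').output.childItems
Filter condition:          @equals(item().name, 'file_1.csv')
Set Variable (fileFound):  @greater(activity('Filter1').output.FilteredItemsCount, 0)
Until expression:          @variables('fileFound')
The Until activity also has its own Timeout property, which can be used to cap how long the loop keeps polling.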
#2
You can pass the name of the file with the blob event trigger; see the link on how it is done with @triggerBody().fileName. So have a trigger on both files.
The next thing I would do is move this file to a new folder. But before I do, I check the count of files that already exist in the destination folder. If it's 0, then my new file gets the prefix file1- added to it: file1-originalfilename. If the destination already has 1 file, then the prefix is file2- (a sketch of the sink file name expression follows these steps).
Have a trigger on the destination folder which fires when a file with a prefix of file2- arrives. In this newly triggered pipeline, write your processing code.
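A minimal sketch of that sink file name expression, assuming a hypothetical Get Metadata activity named Get Metadata Dest that lists the destination folder and a fileName pipeline parameter fed from @triggerBody().fileName:
@concat(if(equals(length(activity('Get Metadata Dest').output.childItems), 0), 'file1-', 'file2-'), pipeline().parameters.fileName)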

need help in ADF trigger with Blob

I have a pipeline which needs to be triggered when a file is received in a Blob container.
But the complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in Location 1, Pipeline A should trigger, and when B.JSON is generated in Location 2, Pipeline A should also trigger. I have done a blob trigger using 1 file in 1 location, but I am not sure how to do it when 2 different files arrive in 2 different locations.
There are three ways you could do this.
Using ADF directly, with conditions to evaluate whether the triggered file is from a specific path, as per your need.
Set up a Logic App for each different path you want to monitor for created blobs.
Add two different triggers configured for the different paths (best option).
First method (this has the overhead of running every time a blob is created in the container):
Edit the trigger to look through the whole storage account or all containers. Select the file type: JSON in your case.
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline, one for each folder path you want to monitor and one for holding the triggered file name.
where receive_trigger_files will be assigned the triggered file name dynamically.
I am showing an example here where a Lookup activity evaluates the path and executes the respective downstream activities if the triggered file path and one of our monitored paths match.
and another for path2
For example a Get Metadata activity, or whatever fits your scenario.
Let's manually debug and check for a file exercise01.json that is stored in path2.
You can also use the If Condition activity in a similar way, but it would require multiple steps, or monitoring via activity statuses won't be as clear.
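A minimal sketch of the path check itself, assuming hypothetical pipeline parameters triggered_folder (mapped from @triggerBody().folderPath) and monitor_path_1 (the folder you want to watch); the same pattern repeats for path2:
@equals(pipeline().parameters.triggered_folder, pipeline().parameters.monitor_path_1)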
Second method: Set up a blob-triggered Logic App.
Run the ADF pipeline using the 'Create a pipeline run' action, and set or pass the appropriate parameters as explained previously.
Third method: Add 2 triggers, one for each path where you wish to monitor blob creation.
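A rough sketch of one of the two trigger definitions for the third method (the container and folder names are hypothetical); the second trigger is identical except that blobPathBeginsWith points at Location 2 and B.JSON, and both triggers reference the same Pipeline A:
"typeProperties": {
    "blobPathBeginsWith": "/location1container/blobs/folder1/A",
    "blobPathEndsWith": ".JSON",
    "events": ["Microsoft.Storage.BlobCreated"]
}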

Validation checks on dataset ADF vs databricks

I want to perform some file level, field level validation checks on the dataset I receive.
Given below some checks which I want to perform and capture any issues into audit tables.
File Level Checks: File present, size of the file, Count of records matches to count present in control file
Field Level checks: Content in right format, Duplicate key checks, range in important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If it is ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check for file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should keep trying before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity can time out, which is treated as a failure by downstream dependencies. A sketch of the activity definition follows.
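A minimal sketch of a Validation activity definition, assuming a hypothetical dataset named SourceFileDataset, a 2-hour timeout, a 60-second sleep and a 1 KB minimum size:
{
    "name": "Check source file",
    "type": "Validation",
    "typeProperties": {
        "dataset": {
            "referenceName": "SourceFileDataset",
            "type": "DatasetReference"
        },
        "timeout": "0.02:00:00",
        "sleep": 60,
        "minimumSize": 1024
    }
}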
To count matching records, and assuming that you are using CSV, you could create a generic dataset (one column) and run a Copy activity over whatever folders you want to count, copying into a temp folder. Get the row count from the Copy activity's output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders; just for easy rerunning)
For Each (base folder)
Copy recursively to the temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied (see the sketch after this list)
Delete the temp files recursively.
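A minimal sketch of the Stored Procedure activity settings for that step; the procedure name usp_LogRowCount, the Copy activity name Copy to temp, the parameter names and the @item().FolderName shape (which depends on what your Lookup returns) are all assumptions:
"typeProperties": {
    "storedProcedureName": "usp_LogRowCount",
    "storedProcedureParameters": {
        "FolderName": { "value": "@item().FolderName", "type": "String" },
        "RowsCopied": { "value": "@activity('Copy to temp').output.rowsCopied", "type": "Int64" }
    }
}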
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/

Load multiple multischema delimited file from same directories

Could I know whether there is any method in Talend to load multiple multi-schema delimited files stored in the same directory?
I have tried using the tFileInputMSDelimited component before, but I was unable to link it with the tFileList component to loop through the files inside the directory.
Does anyone have an idea how to solve this problem?
To make it clearer: each file contains only one batch line, but it contains multiple header lines, each of which comes with a bunch of transaction lines, as shown in the sample data below.
The component tFileOutputMSDelimited should suit your needs.
You will need multiple flows going into it.
You can either keep the files and read them or use tHashInput/tHashOutput to get the data directly.
Then you direct all the flows to the tFileOutputMSDelimited (example with tFixedFlowInput; adapt with your flows):
In it, you can configure which flow is the parent flow containing your ID.
Then you can add the child flows and define the parent and the ID used to recognize the rows in the parent flow:

Azure Data factory, How to incrementally copy blob data to sql

I have an Azure Blob container where some JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other JSON data files, so I have to filter for the files in the dataset.
The first question is: how can I set the file path on the blob dataset to only pick up the JSON files that I want? I tried the wildcard *.data.json, but that doesn't work. The only filename wildcard I have gotten to work is *.json.
The second question is: how can I copy data only from the new files (with the specific file pattern) that land in the blob storage to Azure SQL? I have no control over the process that puts the data in the blob container, and I cannot move the files to another location, which makes it harder.
Please help.
You could use ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map these properties to pipeline parameters and pass an @pipeline().parameters.parameterName expression to the fileName in your copy activity.
This also answers the second question: each time the trigger fires, you'll get the folder path and file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
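A rough sketch of the trigger filter and its parameter mapping (the container name, pipeline name and parameter names are hypothetical):
"typeProperties": {
    "blobPathBeginsWith": "/yourcontainer/blobs/customer_",
    "blobPathEndsWith": ".data.json",
    "events": ["Microsoft.Storage.BlobCreated"]
}
"pipelines": [
    {
        "pipelineReference": {
            "referenceName": "CopyBlobToSql",
            "type": "PipelineReference"
        },
        "parameters": {
            "sourceFolder": "@triggerBody().folderPath",
            "sourceFile": "@triggerBody().fileName"
        }
    }
]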
Thanks.
I understand your situation. It seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like this:
Create a Storage Account trigger that will fire on every new file in the source container.
In the triggered pipeline, examine the blob name to see if it fits your pattern (a sketch of this check follows the list). If not, just end, taking no action. If it does, binary-copy the blob to an account/container your app owns, leaving the original in place.
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
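A minimal sketch of the name check in the triggered pipeline, assuming the blob name was mapped into a hypothetical fileName pipeline parameter:
@and(startsWith(pipeline().parameters.fileName, 'customer_'), endsWith(pipeline().parameters.fileName, '.data.json'))
Put this in an If Condition activity: the True branch does the binary copy, the False branch does nothing.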
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still processing.
If for any reason you do miss a file, all you need to do is copy it to your container, where your process will pick it up. You can load all previous blobs in the same way.