Using ADFv2 Validation activity to check minimum size of Virtual Directory of Azure Blob dataset

ADFv2's Validation activity, when used with an Azure Blob dataset, has a property called Minimum size. I would like to validate that a certain virtual directory in a given Azure Blob storage account has a total file size of at least the value specified in the Minimum size field. For that I tried leaving the 'File' field of the connected dataset blank, but it didn't work: the activity succeeded even though there was an empty file in the virtual directory. I then set the 'File' field to * and the validation activity just kept running and never succeeded. How do I achieve this?

Actually, I tested this and found that Minimum size only works when the dataset points to a specific file.
A dataset parameter with dynamic content is not supported either.
If the Minimum size value is bigger than the actual file size, the validation activity stays in progress until it times out.
These appear to be limitations of the Validation activity, so we can't achieve that. You could contact Azure support for more help.
Hope this helps.
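For reference, here is a minimal sketch of the only shape that worked in my tests: a dataset that names one specific blob explicitly (no blank or wildcard 'File' field). All names here are hypothetical.

```json
{
    "name": "SingleBlobDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "folderPath": "myvirtualdir",
                "fileName": "data.csv"
            }
        }
    }
}
```

The Validation activity then references this dataset and sets its minimumSize property (in bytes) against that single file; it cannot sum the sizes of all files in the virtual directory.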

Related

Batch size from database reference

I tried to get batch size values from a database table in sequence, but I am getting errors in the process.
I tried to set a condition to choose the size according to an id parameter in my agent.
This bit does not allow access to agents flowing through.
As suggested previously, it is easier to store the batch size within the agent as a parameter p_BatchSizeToUse (define it in the Source block when the agent is created).
Then simply set the batch size upfront, in a block upstream of the Batch block, using myBatchBlock.set_batchSize(agent.p_BatchSizeToUse).
HOWEVER: it is not logical to vary the batch size agent by agent. If the first agent has a batch size of 5 and the second has a batch size of 10, the first agent's batch size is never considered because the second agent's value takes over. You will not get the desired result with this setup.

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names and move only the files matching the conditions into another folder.

I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things.

Is there a way to link the Filter activity directly to the Copy activity without using the ForEach activity, i.e. pass the collection from the Filter straight into the Copy's source?

Alternatively, some of our other pipelines just use a Copy activity pointing at a source folder, with its file filter setting configured with a simple pattern using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex: I need to compare attributes in each file's name with values to decide whether the file should be moved. The file filter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder, and put the conditional logic in the file filter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write the Filter activity's array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
First, use a Copy activity to write the JSON (the array output) to a file in your Blob storage.
Then use another Copy activity to get it from JSON to .txt: the sink is a .txt file in your Blob storage, so you end up with a plain text file that lists the names of the files you want to copy.
Finally, use that text file in your main Copy activity via the source's 'List of files' setting.
This should copy over all the files that you identified in your Filter activity.
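As a sketch of that last step, the main Copy activity might look roughly like this in pipeline JSON, assuming hypothetical datasets SourceFolderDataset and DestinationFolderDataset and a file list written to staging/files-to-copy.txt:

```json
{
    "name": "CopyFilteredFiles",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceFolderDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "DestinationFolderDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "fileListPath": "staging/files-to-copy.txt"
            }
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
        }
    }
}
```

Each line in files-to-copy.txt should be a path relative to the folder configured in the source dataset, one file per line.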
I realize this is a workaround, but it really is the only solution for what you are asking. Otherwise there is no way to link a Filter activity straight to a Copy activity.

Validation checks on dataset: ADF vs Databricks

I want to perform some file level, field level validation checks on the dataset I receive.
Below are some checks I want to perform, capturing any issues into audit tables.
File-level checks: file present, size of the file, count of records matches the count present in a control file.
Field-level checks: content in the right format, duplicate key checks, ranges on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should try before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, which is much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity keeps waiting until it times out, which is treated as a failure by dependent activities.
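A minimal sketch of such a Validation activity in pipeline JSON, assuming a hypothetical dataset named InputFileDataset; sleep is in seconds, minimumSize is in bytes, and the timeout here is overridden from the 7-day default ("7.00:00:00") to two hours:

```json
{
    "name": "WaitForInputFile",
    "type": "Validation",
    "typeProperties": {
        "dataset": {
            "referenceName": "InputFileDataset",
            "type": "DatasetReference"
        },
        "timeout": "0.02:00:00",
        "sleep": 30,
        "minimumSize": 1
    }
}
```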
To get the count of matching records, and assuming you are using CSV, you could create a generic dataset (one column) and run a Copy activity over whatever folders you want to count into a temp folder. Get the row count from the Copy activity's output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders, just for easy rerunning)
For Each (base folder)
Copy recursively to the temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied
Delete the temp files recursively.
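A sketch of the ForEach body in pipeline JSON, with hypothetical dataset, linked service, and stored procedure names (parameterizing the source folder from @item() is omitted for brevity); the key part is reading @activity('CopyToTemp').output.rowsCopied in the Stored Procedure activity's parameters:

```json
{
    "name": "ForEachBaseFolder",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@activity('LookupBaseFolders').output.value", "type": "Expression" },
        "activities": [
            {
                "name": "CopyToTemp",
                "type": "Copy",
                "inputs": [ { "referenceName": "GenericOneColumnCsv", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "TempFolderCsv", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true }
                    },
                    "sink": { "type": "DelimitedTextSink" }
                }
            },
            {
                "name": "SaveRowCount",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [ { "activity": "CopyToTemp", "dependencyConditions": [ "Succeeded" ] } ],
                "linkedServiceName": { "referenceName": "AuditDb", "type": "LinkedServiceReference" },
                "typeProperties": {
                    "storedProcedureName": "[dbo].[usp_SaveRowCount]",
                    "storedProcedureParameters": {
                        "RowsCopied": {
                            "value": "@activity('CopyToTemp').output.rowsCopied",
                            "type": "Int64"
                        }
                    }
                }
            }
        ]
    }
}
```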
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/

Logic App Blob Trigger for a group of blobs

I'm creating a Logic App that has to process all blobs in a certain container. I would like to periodically check whether there are any new blobs and, if so, start a run. I tried using the "When a blob is added or modified" trigger. However, if at the time of checking there are several new blobs, several new runs are initiated. Is there a way to initiate only one run when one or more blobs are added/modified?
I experimented with the "Number of blobs to return from the trigger" and also with the split-on setting, but I haven't found a way yet.
If you want to trigger on multiple blob files, then yes, you have to use When a blob is added or modified. From the connector description you can see:
This operation triggers a flow when one or more blobs are added or modified in a container.
You must also set the maxFileCount property. As you already found, the result is split into separate runs; this is because the splitOn setting is on by default. If you want the result to arrive as a whole in a single run, you need to turn it off.
The result should then be what you want.
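In the Logic App's code view, the trigger then looks roughly like the sketch below. maxFileCount sits in the trigger's queries, and there is no splitOn property (deleting it, or turning 'Split On' off in the trigger's settings, is what makes one run receive the whole array). The folder id and recurrence values are placeholders:

```json
"When_a_blob_is_added_or_modified_(properties_only)": {
    "type": "ApiConnection",
    "recurrence": { "frequency": "Minute", "interval": 5 },
    "inputs": {
        "host": {
            "connection": { "name": "@parameters('$connections')['azureblob']['connectionId']" }
        },
        "method": "get",
        "path": "/datasets/default/triggers/batch/onupdatedfile",
        "queries": {
            "folderId": "/my-container",
            "maxFileCount": 10
        }
    }
}
```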

Azure Data Factory: How to incrementally copy blob data to SQL

I have an Azure Blob container where some JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other JSON data files, so I have to filter for the files in the dataset.
First question is how can I set the file path on the blob dataset to only look for the json files that I want? I tried with the wildcard *.data.json but that doesn't work. The only filename wildcard I have gotten to work is *.json
Second question is how can I copy data only from the new files (with the specific file pattern) that lands in the blob storage to Azure SQL? I have no control of the process that puts the data in the blob container and cannot move the files to another location which makes it harder.
Please help.
You could use an ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob into the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map these properties to pipeline parameters and pass the @pipeline().parameters.parameterName expression to the fileName of your copy activity.
This also answers the second question: each time the trigger fires, you'll get the file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
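A sketch of what that trigger definition might look like, assuming a hypothetical pipeline CopyNewBlobToSql with sourceFolder and sourceFile parameters; the container name and storage account scope are placeholders:

```json
{
    "name": "NewDataBlobTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/mycontainer/blobs/customer_",
            "blobPathEndsWith": ".data.json",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "CopyNewBlobToSql", "type": "PipelineReference" },
                "parameters": {
                    "sourceFolder": "@triggerBody().folderPath",
                    "sourceFile": "@triggerBody().fileName"
                }
            }
        ]
    }
}
```

Note that blobPathEndsWith also solves the *.data.json wildcard problem from the first question, since the trigger itself filters on the suffix.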
Thanks.
I understand your situation. It seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like this:
Create a Storage Account trigger that will fire on every new file in the source container.
In the triggered pipeline, examine the blob name to see if it fits your parameters (see the sketch after this list). If not, just end, taking no action. If it does, binary-copy the blob to an account/container your app owns, leaving the original in place.
Create another trigger on your container that runs the import pipeline.
Run your import process.
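A sketch of that name check as an If Condition activity; the dataset names and the sourceFile parameter are hypothetical, and the false branch is simply left empty so non-matching blobs are ignored:

```json
{
    "name": "IfBlobNameMatches",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@and(startsWith(pipeline().parameters.sourceFile, 'customer_'), endsWith(pipeline().parameters.sourceFile, '.data.json'))",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "CopyBlobToOwnedContainer",
                "type": "Copy",
                "inputs": [ { "referenceName": "SourceBlobBinary", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "OwnedContainerBinary", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BinarySource" },
                    "sink": { "type": "BinarySink" }
                }
            }
        ]
    }
}
```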
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance, because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still being processed.
If for any reason you do miss a file, all you need to do is copy it to your container, where your process will pick it up. You can load all previous blobs in the same way.