I'm creating a Logic App that has to process all blobs in a certain container. I would like to periodically check whether there are any new blobs and, if so, start a run. I tried using the "When a blob is added or modified" trigger. However, if there are several new blobs at the time of checking, several runs are initiated. Is there a way to initiate only one run when one or more blobs are added or modified?
I experimented with the "Number of blobs to return from the trigger" setting and also with the split-on setting, but I haven't found a way yet.
If you want a single trigger to handle multiple blob files, then yes, you have to use "When a blob is added or modified". The connector description tells you:
This operation triggers a flow when one or more blobs are added or modified in a container.
You must also set the maxFileCount. As you have already found, the result gets split into separate runs; this is because the splitOn setting is on by default. If you want the result delivered as a whole, you need to turn it OFF.
Then the result should be what you want.
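For reference, here is a rough sketch of what the trigger can look like in code view once split-on is turned off (note there is no splitOn property) and maxFileCount is set. The recurrence, path, folderId and connection values are illustrative placeholders, not your actual ones:

    "triggers": {
        "When_a_blob_is_added_or_modified_(properties_only)": {
            "type": "ApiConnection",
            "recurrence": {
                "frequency": "Minute",
                "interval": 5
            },
            "inputs": {
                "host": {
                    "connection": {
                        "name": "@parameters('$connections')['azureblob']['connectionId']"
                    }
                },
                "method": "get",
                "path": "/datasets/default/triggers/batch/onupdatedfile",
                "queries": {
                    "folderId": "<your-container-folder-id>",
                    "maxFileCount": 10
                }
            }
        }
    }

With splitOn removed, one run receives the whole array of up to maxFileCount blobs, which you can then loop over with a For each action.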
I have an ADF pipeline which needs to be triggered when a file is received in a Blob container.
But the complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in location 1, Pipeline A should trigger, and when B.JSON is generated in location 2, Pipeline A should also trigger. I have done the blob trigger using 1 file in 1 location, but I am not sure how to do it when 2 different files come in 2 different locations.
There are three ways you could do this.
Using ADF directly, with conditions to evaluate whether the triggered file comes from a specific path, as per your need.
Setting up a Logic App for each different path you want to monitor for blob creation.
Adding two different triggers configured for the different paths (best option).
First method (this has the overhead of running every time a file is triggered anywhere in the container):
Edit the trigger to look through the whole storage account or all containers. Select the file type: JSON in your case.
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline: one for each folder path you want to monitor and one for holding the triggered file name,
where receive_trigger_files will be assigned the triggered file name dynamically.
I am showing an example here where a Lookup activity evaluates the path and executes the respective downstream activities if the triggered file path matches one of the monitored paths.
Another Lookup for path2.
For example, a Get Metadata activity, or whatever fits your scenario.
Let's manually debug and check for a file exercise01.json that is stored in path2.
You can also use an If Condition activity similarly, but it would require multiple steps, and monitoring via activity statuses won't be as clear.
Second method: Set up a blob-triggered Logic App.
Run the ADF pipeline using the "Create a pipeline run" action, and set or pass the appropriate parameters as explained previously.
Third method: Add 2 triggers, one for each path where you wish to monitor blob creation (a sketch of the two trigger definitions follows below).
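For the third option, the two trigger definitions might look roughly like this (storage account, container, paths and pipeline name are placeholders), both referencing the same pipeline:

    {
        "name": "TriggerOnAJson",
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                "blobPathBeginsWith": "/container1/blobs/location1/",
                "blobPathEndsWith": "A.JSON",
                "ignoreEmptyBlobs": true,
                "events": [ "Microsoft.Storage.BlobCreated" ],
                "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
            },
            "pipelines": [
                {
                    "pipelineReference": { "referenceName": "PipelineA", "type": "PipelineReference" }
                }
            ]
        }
    }

A second trigger, say TriggerOnBJson, would be identical except that blobPathBeginsWith points at location 2 and blobPathEndsWith is B.JSON. Either file arriving then starts Pipeline A.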
I want to perform some file-level and field-level validation checks on the datasets I receive.
Given below are some checks which I want to perform, capturing any issues into audit tables.
File-level checks: file is present, size of the file, count of records matches the count present in the control file.
Field-level checks: content is in the right format, duplicate key checks, ranges on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If it is ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check for file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should keep trying before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity eventually times out, which is treated as a failure by downstream dependencies.
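A minimal sketch of a Validation activity, assuming a dataset named SourceFileDataset (the activity name, dataset name and values are illustrative):

    {
        "name": "CheckSourceFileExists",
        "type": "Validation",
        "typeProperties": {
            "dataset": { "referenceName": "SourceFileDataset", "type": "DatasetReference" },
            "timeout": "0.02:00:00",
            "sleep": 60,
            "minimumSize": 1024
        }
    }

Here timeout is two hours in d.hh:mm:ss format, sleep is the retry interval in seconds, and minimumSize is in bytes.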
To count matching records, assuming you are using CSV, you could create a generic dataset (one column) and run a Copy activity over whatever folders you want to count, copying into a temp folder. Get the row count from the Copy activity output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders; just for easy rerunning)
For Each (base folder)
Copy recursively to the temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied (see the sketch below)
Delete temp files recursively.
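Inside the For Each, the Stored Procedure activity can read the Copy activity's output directly. A sketch with hypothetical activity, procedure and parameter names:

    {
        "name": "SaveRowCount",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
            { "activity": "CopyToTemp", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "AuditSqlDatabase", "type": "LinkedServiceReference" },
        "typeProperties": {
            "storedProcedureName": "[audit].[LogRowCount]",
            "storedProcedureParameters": {
                "BaseFolder": { "value": "@item().folderName", "type": "String" },
                "RowsCopied": { "value": "@activity('CopyToTemp').output.rowsCopied", "type": "Int64" }
            }
        }
    }

The key piece is the @activity('CopyToTemp').output.rowsCopied expression, which carries the record count into your audit table.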
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/
I have two separate Data flows in Azure Data Factory, and I want to combine them into a single Data flow.
There is a technique for copying elements from one Data flow to another, as described in this video: https://www.youtube.com/watch?v=3_1I4XdoBKQ
This does not work for Source or Sink stages, though. The Script elements do not contain the Dataset that the Source or Sink is connected to, and if you try to copy them, the designer window closes and the Data flow is corrupted. The details are in the JSON, but I have tried copying and pasting into the JSON and that doesn't work either - the source appears on the canvas, but is not usable.
Does anyone know if there is a technique for doing this, other than just manually recreating the objects on the canvas?
Thanks Leon for confirming that this isn't supported; here is my workaround process.
1. Open the Data Flow that will receive the merged code.
2. Open the Data Flow that contains the code to merge in.
3. Go through the to-be-merged flow and change the names of any transformations that clash with the names of transformations in the target flow.
4. Manually create, in the target flow, any Sources that did not already exist.
5. Copy the entire script out of the to-be-merged flow into a text editor.
6. Remove the Sources and Sinks.
7. Copy the remaining transformations to the clipboard, and paste them into the target flow's script editor.
8. Manually create the Sinks, remembering to set all properties such as "Allow Update".
Be prepared that, if you make a mistake and paste in something that is not correct, then the flow editor window will close and the flow will be unusable. The only way to recover it is to refresh and discard all changes since you last published, so don't do this if you have other unpublished changes that you don't want to lose!
I have already established a practice in our team that no mappings are done in Sinks. All mappings are done in Derived Column transformations, and any column name ambiguity is resolved in a Select transformation, so the Sink is always just auto-map. That makes operations like this simpler.
It should be possible to keep the Source definitions in Step 6, remove the Source elements from the target script, and paste the new Sources in to replace them, but that's a little more complex and error-prone.
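For context on why the script alone isn't enough: the dataset bindings live outside the script, in the data flow's JSON. A trimmed-down sketch of that structure (names are illustrative):

    {
        "name": "TargetDataFlow",
        "properties": {
            "type": "MappingDataFlow",
            "typeProperties": {
                "sources": [
                    { "name": "SourceOrders", "dataset": { "referenceName": "OrdersDataset", "type": "DatasetReference" } }
                ],
                "sinks": [
                    { "name": "SinkOrders", "dataset": { "referenceName": "OrdersSqlDataset", "type": "DatasetReference" } }
                ],
                "transformations": [
                    { "name": "AddTax" }
                ],
                "scriptLines": [ "..." ]
            }
        }
    }

The script only refers to sources and sinks by name, so the matching entries in the sources and sinks arrays, with their dataset references, have to exist before a pasted script will validate. That is why the Sources and Sinks need to be recreated manually in the steps above.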
I have an Azure blob container where some JSON data files get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other JSON data files, so I have to filter for the files in the dataset.
The first question is: how can I set the file path on the blob dataset to only look for the JSON files that I want? I tried the wildcard *.data.json, but that doesn't work. The only filename wildcard I have gotten to work is *.json.
The second question is: how can I copy data only from the new files (with the specific file pattern) that land in the blob storage to Azure SQL? I have no control over the process that puts the data in the blob container, and I cannot move the files to another location, which makes it harder.
Please help.
You could use an ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map those properties to pipeline parameters and pass a @pipeline().parameters.parameterName expression to the fileName in your copy activity.
This also answers the second question: each time the trigger fires, you get the folder path and file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
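A rough sketch of what that can look like, with blobPathEndsWith doing the *.data.json filtering (trigger, pipeline, storage account and parameter names are hypothetical):

    {
        "name": "CustomerJsonCreatedTrigger",
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                "blobPathBeginsWith": "/mycontainer/blobs/",
                "blobPathEndsWith": ".data.json",
                "events": [ "Microsoft.Storage.BlobCreated" ],
                "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
            },
            "pipelines": [
                {
                    "pipelineReference": { "referenceName": "CopyBlobToSqlPipeline", "type": "PipelineReference" },
                    "parameters": {
                        "sourceFolder": "@triggerBody().folderPath",
                        "sourceFileName": "@triggerBody().fileName"
                    }
                }
            ]
        }
    }

The copy activity's source dataset can then use @pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFileName as expressions for its folder path and file name, so each run only reads the blob that fired the trigger.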
Thanks.
I understand your situation. Seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like this:
Create a Storage Account Trigger that will fire on every new file in the source container.
In the triggered pipeline, examine the blob name to see if it fits your parameters (see the sketch at the end of this answer). If not, just end, taking no action. If so, binary-copy the blob to an account/container your app owns, leaving the original in place.
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still being processed.
If for any reason you do miss a file, all you need to do is copy it to your container where your process will pick it up. You can load all previous blobs in the same way.
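The blob-name check in that first pipeline can be a simple If Condition. A sketch, assuming the trigger passes the file name into a hypothetical pipeline parameter called triggeredFileName and that you are filtering on a suffix such as .data.json:

    {
        "name": "IfBlobMatchesPattern",
        "type": "IfCondition",
        "typeProperties": {
            "expression": {
                "value": "@endswith(pipeline().parameters.triggeredFileName, '.data.json')",
                "type": "Expression"
            },
            "ifTrueActivities": [],
            "ifFalseActivities": []
        }
    }

The binary Copy activity to your own container goes inside ifTrueActivities; ifFalseActivities can stay empty so non-matching blobs are simply ignored.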
I would like to update some of the parameters of an ADF pipeline (e.g. the concurrency level) across lots of mappings. I am not able to find any cmdlet to do this through PowerShell. I know I can drop the existing pipeline and create a new one, but that would start reprocessing all the Ready slices for that pipeline's active period, which I don't want, because it would involve working out up to what point the existing pipeline has already processed slices. And this change is only temporary; at some stage I am going to revert the settings. I just want the pipeline to change one of its properties. Doing this manually through the UI is slow and tedious. I am guessing there is no way around this, but let me know if you know of one.
You can still use "New-AzureRmDataFactoryPipeline" for this Update scenario:
https://msdn.microsoft.com/en-us/library/mt619358.aspx
Use with the -Force parameter to force it to proceed even if the message reads "... may overwrite the existing resource".
Under the hood, it's the same HTTP PUT API call used by the Azure portal. You can verify that with Fiddler.
The already executed slices won't be re-run unless you set their status back to PendingExecution.
This rule applies to LinkedService and Dataset as well, but NOT to the top-level DataFactory resource. A New-AzureRmDataFactory call will cause the service to delete the existing data factory along with all its sub-resources and create a brand new one, so be careful there.
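For example, to change the concurrency of an activity, you would edit the policy block in the pipeline JSON and re-deploy it with New-AzureRmDataFactoryPipeline and -Force. This is an illustrative v1 pipeline fragment, not your actual definition:

    {
        "name": "MyPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopySlice",
                    "type": "Copy",
                    "inputs": [ { "name": "InputDataset" } ],
                    "outputs": [ { "name": "OutputDataset" } ],
                    "policy": {
                        "concurrency": 4,
                        "retry": 3,
                        "timeout": "01:00:00"
                    },
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "SqlSink" }
                    }
                }
            ],
            "start": "2016-01-01T00:00:00Z",
            "end": "2016-12-31T00:00:00Z"
        }
    }

Because the slices keep their existing statuses, the Ready slices are left alone, and reverting later is just another deploy of the old JSON with -Force.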