Wildcard in 'Blob path ends with' - ADF blob storage event trigger - azure-data-factory

I have a blob structure like this:
> data/
>   folder1/
>     dirA/
>     dirB/
>     dirC/
>       file1.csv
>       file2.csv
>       file3.csv
>   dir2/
>     dirA/
>     dirB/
>     dirC/
>       file1.csv
>       file2.csv
>       file3.csv
>   source3/
>     dirA/
>     dirB/
>     dirC/
>       file1.csv
>       file2.csv
>       file3.csv
I want to trigger the blob storage event when any CSV file is uploaded to source3/dirC only.
The problem is that ADF doesn't support a wildcard path here. I want something like this:
Blob_path_ends_with: any_dir(exclude folder1, include dir2, source3)/dirC/*.csv (any CSV file in dirC under any main directory)
So I want to ignore CSV uploads in folder1 but trigger the event when files are uploaded under dir2 and source3.

As mentioned by Rakesh Govindula, blob path begins with and blob path ends with are the only pattern matching allowed in a Storage Event Trigger; other types of wildcard matching aren't supported for this trigger type.
However, you can work around this with a Logic App as follows:
Steps to reproduce:
Create an Event Grid System Topic (Documentation)
Select Storage Accounts (Blob & GPv2) as the topic type. You can choose any name for the topic.
Create a Logic App with an HTTP request trigger.
Paste the following Event Grid event schema into the Request Body JSON Schema field of the Logic App's HTTP trigger:
[
  {
    "topic": "/subscriptions/{subscription-id}/resourceGroups/Storage/providers/Microsoft.Storage/storageAccounts/my-storage-account",
    "subject": "/blobServices/default/containers/test-container/blobs/new-file.txt",
    "eventType": "Microsoft.Storage.BlobCreated",
    "eventTime": "2017-06-26T18:41:00.9584103Z",
    "id": "831e1650-001e-001b-66ab-eeb76e069631",
    "data": {
      "api": "PutBlockList",
      "clientRequestId": "6d79dbfb-0e37-4fc4-981f-442c9ca65760",
      "requestId": "831e1650-001e-001b-66ab-eeb76e000000",
      "eTag": "\"0x8D4BCC2E4835CD0\"",
      "contentType": "text/plain",
      "contentLength": 524288,
      "blobType": "BlockBlob",
      "url": "https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt",
      "sequencer": "00000000000004420000000000028963",
      "storageDiagnostics": {
        "batchId": "b68529f3-68cd-4744-baa4-3c0498ec19f0"
      }
    },
    "dataVersion": "",
    "metadataVersion": "1"
  }
]
Save the Logic App in the designer and copy the contents of the HTTP POST URL field. You will use it in the next step.
Create an Event Subscription for the storage account (Documentation)
In the System Topic Name field, enter the Event Grid system topic you created in the first step. Select the Webhook endpoint type and paste in the HTTP POST URL you copied from the Logic App.
Once you are ready, you can test whether the Logic App is triggered by uploading a file to the storage account. Go to the Logic App and check the run history (the raw outputs of the HTTP trigger).
Add a condition step to the Logic App to stop the workflow if the request's body().data.url value doesn't contain the path you need (see the example expression after these steps).
Add a Data Factory pipeline run step to the Logic App. (Useful blog post)
You can pass the path string to the pipeline as a parameter from the HTTP body: body().data.url
I hope you can follow the steps I described without screenshots.
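For the condition step, a minimal sketch of the expression (entered in expression mode) is below, assuming you want the behaviour from the question (CSV files under dir2/dirC or source3/dirC only) and that you reference a single event object, as body().data.url does above. Note that Event Grid posts events as an array, so you may need to index the first element (for example triggerBody()?[0]) or split the incoming array first:

@and(
  endswith(triggerBody()?['data']?['url'], '.csv'),
  or(
    contains(triggerBody()?['data']?['url'], '/dir2/dirC/'),
    contains(triggerBody()?['data']?['url'], '/source3/dirC/')
  )
)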

You can use the below workaround using ADF.
This is my folder structure:
input
  source1
    dirC
  source2
    dirC
  source3
    dirC
Create two pipeline parameters for the triggering file name and folder path, and another parameter of array type containing the folder names for which you want the uploaded files to be processed.
I have created the trigger like below for the container input.
Pass @triggerBody().fileName and @triggerBody().folderPath to these parameters while creating the trigger.
We will get the triggering file's folder path as a value like input/source3/dirC. So take this and, in an If Condition activity, check whether any of the folder names from the array exist in this string:
@contains(pipeline().parameters.folder_list, split(pipeline().parameters.folderpath, '/')[1])
Then, inside the True branch of the If Condition, add your activities; if there are many activities, you can use an Execute Pipeline activity instead.
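For illustration, suppose the folder_list parameter is ["source2", "source3"] (a hypothetical value) and a file lands in input/source3/dirC, so folderpath is input/source3/dirC. The expression above then evaluates as:

split('input/source3/dirC', '/')            -> ["input", "source3", "dirC"]
split('input/source3/dirC', '/')[1]         -> "source3"
contains(['source2','source3'], 'source3')  -> true

so the True branch runs. A file under input/source1/dirC would give "source1", which is not in the array, and the expression would return false.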
Sample copy activity:

I ended up changing the folder structure to
data
  source1
    dirA
    dirB
  source2
    dirA
    dirB
  source3
    dirA
    dirB
So all the files that I want to trigger the pipeline on are always in source3/dirB,
and I am using Blob path begins with = data/source3/dirB and Blob path ends with = .csv.
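For reference, a rough sketch of what that trigger could look like in ADF's JSON view, assuming data is the container name (the trigger and pipeline names are placeholders, scope is your storage account resource ID, and in the JSON the container name is followed by a literal blobs segment):

{
  "name": "trigger_source3_dirB_csv",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/data/blobs/source3/dirB",
      "blobPathEndsWith": ".csv",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "my_pipeline", "type": "PipelineReference" } }
    ]
  }
}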

Related

Azure Data Factory - Check If Any Zip File Exists

I am trying to check whether any zip file exists in my SFTP folder. The Get Metadata activity works fine if I explicitly provide the file name, but I can't know the file name here because it is embedded with a timestamp and sequence number, which are dynamic.
I tried specifying *.zip, but that never works and the Get Metadata activity always returns false even though the zip file actually exists. Is there any way to get this to work? Suggestions please.
Sample file name below; the last part, 0000000004_20210907080426, is dynamic and will change every time:
TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426
You could possibly do a Get Metadata on the folder and include Child items under the Field list.
You'll then have to iterate with a ForEach using the expression
@activity('Get Folder Files').output.childItems
and check whether item().name (within the ForEach) ends with '.zip'.
I know it's a pain when the wildcard stuff doesn't work for a given dataset, but this alternative ought to work for you.
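A minimal sketch of that check: with the ForEach's Items set to the expression above (and assuming the Get Metadata activity is named 'Get Folder Files'), an If Condition activity inside the loop could use

@endswith(item().name, '.zip')

and run the zip-handling activities in its True branch.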
If you are using exists in the Get Metadata activity, you need to provide the file name in it.
As a workaround, you can get the child items (with the file name set to *.zip) using the Get Metadata activity.
Output:
Pass the output to an If Condition activity to check whether the required file exists:
@contains(string(json(string(activity('Get Metadata1').output.childItems))),'.zip')
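For illustration (with hypothetical file names), if childItems came back as below, the expression above serializes that array to a string and checks it for the substring '.zip', so it would evaluate to true:

[
  { "name": "TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426.zip", "type": "File" },
  { "name": "readme.txt", "type": "File" }
]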
You can add other activities inside the True and False branches of the If Condition.
If no file exists, no child items are found by the Get Metadata activity.
If Condition output:
For an SFTP dataset, if you want to use a wildcard to filter files under the specified folderPath, you would have to skip that setting and specify the file name in the activity's source settings (the Get Metadata activity).
But a wildcard filter on folders/files is not supported for the Get Metadata activity.

Azure Data Factory Cannot Read Metadata Folder

I hope you are all keeping healthy and staying strong during the COVID-19 pandemic.
I have a question on Azure Data Factory. I have created a pipeline with a Get Metadata activity, with the details below:
I have files in a folder and subfolder like this:
I have a Get Metadata activity followed by a ForEach, with the first Get Metadata activity getting the child items (in the folder) like this:
Get Metadata with Last modified configured like this (if you set it up like this, the metadata only reads the last modified subfolder):
After that I add a variable and use @item().Name to read the files in that folder, like this:
After running Get Metadata on a folder that has a subfolder, I get an error like this:
The error says that @item().Name cannot read the subfolder in that folder. The metadata for each file succeeds, but the activity fails with this error and cannot read the metadata of the subfolder.
Many thanks in advance for any answers. Thank you.
If you need to access the folder:
Create a clone of the same dataset and set up the parameters as below, leaving the file field empty.
If you need to access the files inside a directory, use the condition @equals(item().type,'Folder') to identify directories, and inside that use the dataset with parameters for the directory and file.
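A minimal sketch of that inner step, assuming the cloned dataset is named DS_Folder_Clone and takes directory and file parameters (all names here are placeholders): inside the True branch of an If Condition using @equals(item().type,'Folder'), a second Get Metadata activity could reference the clone like this, passing the subfolder name and leaving the file empty:

"dataset": {
  "referenceName": "DS_Folder_Clone",
  "type": "DatasetReference",
  "parameters": {
    "directory": "@item().name",
    "file": ""
  }
}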

ADF Copy only when a new CSV file is placed in the source and copy to the Container

I want to copy a file from the source container to the target container, but only when the source file is new (the latest file placed in the source). I am not sure how to proceed with this, and I am not sure about the syntax to check whether the source file is newer than the target. Should I use two Get Metadata activities to check the source and target last modified dates and use an If Condition? I tried a few ways but it didn't work.
Any help would be handy.
The syntax I used for the condition gives me an error:
@if(greaterOrEquals(ticks(activity('Get Metadata_File').output.lastModified),activity('Get Metadata_File2')),True,False)
Error message:
The function 'greaterOrEquals' expects all of its parameters to be either integer or decimal numbers. Found invalid parameter types: 'Object'
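For reference, the error occurs because the second argument passed to greaterOrEquals is the whole Get Metadata_File2 output object rather than a number. A hedged sketch of a corrected If Condition expression (assuming Get Metadata_File2 also returns a lastModified field) would be:

@greaterOrEquals(ticks(activity('Get Metadata_File').output.lastModified), ticks(activity('Get Metadata_File2').output.lastModified))

The if(..., True, False) wrapper is also unnecessary, since the If Condition expects the expression itself to return a boolean.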
You can try one of the Pipeline Templates that ADF offers.
Use this template to copy new and changed files only by using LastModifiedDate. This template first selects the new and changed files only by their attribute "LastModifiedDate", and then copies them from the data source store to the data destination store. You can also go to "Copy Data Tool" to get the pipeline for the same scenario with more connectors.
View documentation
OR...
You can use a Storage Event Trigger to run the pipeline with a copy activity, so that each new file is copied as it is written to storage.
Follow the detailed example here: Create a trigger that runs a pipeline in response to a storage event
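As a rough sketch (names are placeholders), the trigger created in that walkthrough can hand the new file's location to the copy pipeline through the trigger's pipeline reference, using the two properties a storage event trigger exposes:

"pipelines": [
  {
    "pipelineReference": { "referenceName": "copy_new_file", "type": "PipelineReference" },
    "parameters": {
      "sourceFolder": "@triggerBody().folderPath",
      "sourceFile": "@triggerBody().fileName"
    }
  }
]

The pipeline would then declare matching sourceFolder and sourceFile parameters and use them in the copy activity's source dataset.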

creating a metadata driven pipeline - parameterizing a source file

I have CSV files that are placed in various folders on a blob storage container.
Each file will map to a table in the database, and we will use ADF to copy the data to the database.
The aim is to have the pipeline be metadata-driven. We have a file that contains JSON with the details of each source file and sink table.
[
  {
    "sourceContainer": "container1",
    "sourceFolder": "folder1",
    "sourceFile": "datafile.csv",
    "sinkTable": "staging1"
  },
  {
    "sourceContainer": "container1",
    "sourceFolder": "folder2",
    "sourceFile": "datafile2.csv",
    "sinkTable": "staging2"
  }
]
A ForEach will loop through these values, place them in variables, and use them to load the appropriate table from the appropriate CSV.
The issue is that, for a CSV source dataset, I cannot parameterize the source dataset with user variables (the fields marked with a red x in the screenshot below).
I would appreciate advice on how to tackle this.
The feature is definitely supported, so I'm not sure what you mean by "cannot parameterize". Here is an example of defining the parameters:
And here is an example of referencing them:
I recommend you use the "Add dynamic content" link and the expression builder to get the correct reference.
If you are having some other issue, please describe it in more detail.
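As a hedged illustration of defining and referencing those parameters (dataset and parameter names are placeholders, assuming a DelimitedText dataset on blob storage): the dataset's location references its own parameters, for example

"location": {
  "type": "AzureBlobStorageLocation",
  "container": { "value": "@dataset().sourceContainer", "type": "Expression" },
  "folderPath": { "value": "@dataset().sourceFolder", "type": "Expression" },
  "fileName": { "value": "@dataset().sourceFile", "type": "Expression" }
}

and the copy activity inside the ForEach passes the values from the current item:

"inputs": [
  {
    "referenceName": "DS_SourceCsv",
    "type": "DatasetReference",
    "parameters": {
      "sourceContainer": "@item().sourceContainer",
      "sourceFolder": "@item().sourceFolder",
      "sourceFile": "@item().sourceFile"
    }
  }
]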

Azure Data Factory Passing a pipeline-processed file as a mail attachment with logic app

I have an ADF pipeline moving a file to blob storage. I am trying to pass the processed file as a parameter of my web activity so that I can use it as an email attachment. I am successfully passing the following parameters:
{
  "Title": "Error File Received From MOE",
  "Message": "This is a test message.",
  "DataFactoryName": "@{pipeline().DataFactory}",
  "PipelineName": "@{pipeline().Pipeline}",
  "PipelineRunId": "@{pipeline().RunId}",
  "Time": "@{utcnow()}",
  "File": ????????????????????????????
}
But, how should I specify the path to the file I just processed within the same pipeline?
Any help would be greatly appreciated,
Thanks
Eric
I'm copying data to the output container. My current assumption is that one file is uploaded per day, and I then use two Get Metadata activities to get the lastModified attribute of the files and filter out the name of the most recently uploaded file.
Get the child items in the Get Metadata1 activity.
Then, in a ForEach activity, get the child items via the dynamic content @activity('Get Metadata1').output.childItems
Inside the ForEach activity, in the Get Metadata2 activity, point the json4 dataset at the output container.
Enter the dynamic content @item().name to loop over the file name list.
In the If Condition activity, use @equals(dayOfMonth(activity('Get Metadata2').output.lastModified),dayOfMonth(utcnow())) to determine whether the file was uploaded today.
In the True branch, add the dynamic content @concat('https://{account}.blob.core.windows.net/{Path}/',item().name) to assign the value to the variable.
The output is as follows:
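For completeness, a hedged sketch of the web activity body from the question, assuming the URL built above is stored in a string variable named fileUrl (a hypothetical name):

{
  "Title": "Error File Received From MOE",
  "Message": "This is a test message.",
  "DataFactoryName": "@{pipeline().DataFactory}",
  "PipelineName": "@{pipeline().Pipeline}",
  "PipelineRunId": "@{pipeline().RunId}",
  "Time": "@{utcnow()}",
  "File": "@{variables('fileUrl')}"
}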