ADF: copy to the target container only when a new CSV file is placed in the source - azure-data-factory

I want to copy the file from the source to the target container, but only when the source file is new (the latest file has been placed in the source). I am not sure how to proceed with this, and I am not sure about the syntax to check whether the source file is newer than the target. Should I use two Get Metadata activities to check the source and target last modified dates and then use an If Condition? I tried a few ways but it didn't work.
Any help will be handy.
The syntax I used for the condition is giving me the following error:
@if(greaterOrEquals(ticks(activity('Get Metadata_File').output.lastModified),activity('Get Metadata_File2')),True,False)
Error message:
The function 'greaterOrEquals' expects all of its parameters to be either integer or decimal numbers. Found invalid parameter types: 'Object'

You can try one of the Pipeline Templates that ADF offers.
Use this template to copy new and changed files only by using LastModifiedDate. This template first selects the new and changed files only by their attributes "LastModifiedDate", and then copies them from the data source store to the data destination store. You can also go to "Copy Data Tool" to get the pipeline for the same scenario with more connectors.
View documentation
OR...
You can use a Storage Event Trigger to run the pipeline with the copy activity whenever a new file is written to storage.
Follow the detailed example here: Create a trigger that runs a pipeline in response to a storage event
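If you prefer to keep the approach from the question (two Get Metadata activities plus an If Condition), the error is caused by passing the whole second activity object to greaterOrEquals instead of the ticks of its lastModified output. A minimal sketch of the corrected If Condition expression, assuming the activity names from the question and that both activities include Last modified in their field list:
@greaterOrEquals(
    ticks(activity('Get Metadata_File').output.lastModified),
    ticks(activity('Get Metadata_File2').output.lastModified)
)
The If Condition expression must already evaluate to a boolean, so the surrounding if(..., True, False) is not needed.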

Related

Azure Data Factory - Check If Any Zip File Exists

I am trying to check if any zip file exists in my SFTP folder. The Get Metadata activity works fine if I explicitly provide the filename, but I can't know the file name here because it is embedded with a timestamp and sequence number, which are dynamic.
I tried specifying *.zip, but that never works and the Get Metadata activity always returns false even though the zip file actually exists. Is there any way to get this to work? Suggestions please.
A sample file name is below; the last part, 0000000004_20210907080426, is dynamic and will change every time:
TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426
You could possibly do a Get Metadata on the folder and include the Child items under the Field list.
You'll then have to iterate with a ForEach using the expression
@activity('Get Folder Files').output.childItems
and check whether item().name (within the ForEach) ends with '.zip'.
I know it's a pain when the wildcard options don't work for a given dataset, but this alternative ought to work for you.
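As a rough sketch of that pattern (the activity name follows the expression above; using an If Condition for the inner check is an assumption about how you wire it up):
ForEach Items:
@activity('Get Folder Files').output.childItems
If Condition expression (inside the ForEach):
@endswith(item().name, '.zip')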
If you are using exists in the Get Metadata activity, you need to provide the file name in it.
As a workaround, you can get the child items (with filename *.zip) using the Get Metadata activity.
Output:
Pass the output to an If Condition activity to check whether the required file exists.
@contains(string(json(string(activity('Get Metadata1').output.childItems))),'.zip')
You can then use other activities inside the True and False branches of the If Condition.
If no file exists, no child items are found by the Get Metadata activity and the condition evaluates to false.
If Condition output:
For an SFTP dataset, if you want to use a wildcard to filter files under the folder specified in folderPath, you would have to skip that setting and specify the file name in the activity source settings (the Get Metadata activity). However, a wildcard filter on folders/files is not supported for the Get Metadata activity.

Azure Data Factory Cannot Read Metadata Folder

I hope you are all keeping healthy and strong during the COVID-19 pandemic.
I have a question about Azure Data Factory. I have created a pipeline with a Get Metadata activity, with the details below:
I have files in a folder and subfolder like this:
I have a Get Metadata activity inside a ForEach, with a first Get Metadata activity returning the child items (in the folder) like this:
The Get Metadata activity with last modified is configured like this (if you configure it like this, the metadata only reads the last modified subfolder):
After that I add a variable and use @item().Name to read the files in that folder, like this:
After running the Get Metadata activity on the folder that has a subfolder, I get an error like this:
The error says that @item().Name cannot read the subfolder in that folder. The metadata for each file succeeds, but the activity fails like this because it cannot read the metadata of the subfolder.
Many thanks for any answer. Thank you.
If you need to access the folder:
Create a clone of the same dataset and set up a parameter as below, leaving the file field empty.
If you need to access the files inside a directory, use the condition @equals(item().type,'Folder') to identify directories, and inside that use a dataset with parameters for the directory and file.
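A hedged sketch of that second approach; the dataset parameter names (folderName, fileName) and the exact activity wiring are assumptions, not from the original post:
If Condition expression inside the ForEach over childItems:
@equals(item().type, 'Folder')
In the True branch, call another Get Metadata activity against the cloned, parameterized dataset, passing:
folderName: @item().name
fileName:   (leave empty to read the folder, or set it to a specific file)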

Creating a metadata-driven pipeline - parameterizing a source file

I have CSV files that are placed in various folders on a blob storage container.
These files will map to a table in a database, and we will use ADF to copy the data to the database.
The aim is to have the pipeline metadata-driven. We have a file that contains JSON with details of each source file and sink table.
[
    {
        "sourceContainer": "container1",
        "sourceFolder": "folder1",
        "sourceFile": "datafile.csv",
        "sinkTable": "staging1"
    },
    {
        "sourceContainer": "container1",
        "sourceFolder": "folder2",
        "sourceFile": "datafile2.csv",
        "sinkTable": "staging2"
    }
]
A ForEach will loop through these values, place them in variables and use them to load the appropriate table from the appropriate CSV.
The issue is that, for a CSV source dataset, I cannot parameterize the source dataset with user variables (the fields marked with a red x in the screenshot below).
Would appreciate advice on how to tackle this.
The feature is definitely supported, so I'm not sure what you mean by "cannot parameterize". Here is an example of defining the parameters:
And here is an example of referencing them:
I recommend you use the "Add dynamic content" link and the expression builder to get the correct reference.
If you are having some other issue, please describe it in more detail.
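For reference, a hedged sketch of what those two screenshots typically show, using the property names from the JSON config in the question:
Dataset parameters defined on the CSV dataset (all of type String):
sourceContainer, sourceFolder, sourceFile
Dataset file path set with dynamic content:
Container: @dataset().sourceContainer
Directory: @dataset().sourceFolder
File name: @dataset().sourceFile
Inside the ForEach, the Copy activity's source dataset properties are then populated from the current item:
sourceContainer: @item().sourceContainer
sourceFolder:    @item().sourceFolder
sourceFile:      @item().sourceFile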

How to load data from files last modified within one day from subfolders in Azure Data Flow

I have the following directory structure on an Azure container:
-dwh-prod
    -Main_Folder
        -2021-01
            -file1.parquet
        -2021-02
            -file2.parquet
            -file3.parquet
where the data is partitioned by year and month into subfolders. Within these subfolders I have my data files. I want to load into my data flow only the latest files, i.e. those added within one day of running my data flow pipeline.
I tried using currentUTC() as the End time and subtracting one day -> addDays(currentUTC(), -1) as the Start time in the 'Filter by last modified' option provided in the source options, but it didn't work.
I also tried using currentTimestamp() instead but to no avail.
How do I go about solving this?
Your expression is correct. Please change the folder path from MainFolder to Main_Folder in your dataset and set Main_Folder/*/*.parquet as the Wildcard path in your source options. Then it will work.
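In other words, the source options end up looking roughly like this (a sketch reusing the expressions already quoted in the question):
Wildcard paths: Main_Folder/*/*.parquet
Filter by last modified:
    Start time: addDays(currentUTC(), -1)
    End time:   currentUTC()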
I think your solution is close, but I'm not sure the folder name is sufficient. I'm also not familiar with "currentUTC". The correct function should be utcNow.
Below is an outline of how I would approach this problem.
Source Dataset
Add a Parameter for the subfolder (year-month):
and then set the Folder path to an expression like:
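For example, with an assumed dataset parameter named SubFolder, the Folder path expression might be:
@concat('Main_Folder/', dataset().SubFolder)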
Pipeline
You could either pass in the subfolder or calculate it at runtime. My preference would be to pass it in as a parameter:
I would then add variables to calculate the start and end times. Since you are running this daily, I would be sure to force the times to the START of the day(s); this should handle any vagaries based on run time. Also, I would use the built-in getPastTime function:
Now use these objects in your Source configuration:
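A hedged sketch of what those variables might look like (windowStart and windowEnd are assumed names):
windowStart: @startOfDay(getPastTime(1, 'Day'))
windowEnd:   @startOfDay(utcnow())
These values can then be supplied to the source's 'Filter by last modified' Start time and End time fields (for a data flow source, by passing them in as data flow parameters).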

Unable to run experiment on Azure ML Studio after copying from different workspace

My simple experiment reads from an Azure Storage Table, selects a few columns and writes to another Azure Storage Table. This experiment runs fine on the workspace (let's call it WorkSpace1).
Now I need to move this experiment as-is to another workspace (call it WorkSpace2) using PowerShell and need to be able to run the experiment.
I am currently using this Library - https://github.com/hning86/azuremlps
Problem :
When I copy the experiment using 'Copy-AmlExperiment' from WorkSpace1 to WorkSpace2, the experiment and all its properties get copied except the Azure Table account key.
Now, this experiment runs fine if I manually enter the account key for the Import/Export modules on studio.azureml.net.
But I am unable to do this via PowerShell. If I export (Export-AmlExperimentGraph) the copied experiment from WorkSpace2 as JSON, insert the AccountKey into the JSON file and import (Import-AmlExperiment) it into WorkSpace2, the experiment fails to run.
In PowerShell I get an "Internal Server Error : 500".
While running on studio.azureml.net, I get the notification as "Your experiment cannot be run because it has been updated in another session. Please re-open this experiment to see the latest version."
Is there anyway to move an experiment with external dependencies to another workspace and run it?
Edit: I think the problem is something to do with how the experiment handles the AccountKey. When I enter it manually, it is converted into a JSON array comprising RecordKey and IndexInRecord. But when I upload the JSON experiment with the AccountKey, it remains unchanged and does not get resolved into RecordKey and IndexInRecord.
For me, publishing the experiment as a private experiment to the Cortana Gallery is one of the most useful options. Only the people with the link can see and add the experiment from the gallery. At the link below I've explained the steps I followed.
https://naadispeaks.wordpress.com/2017/08/14/copying-migrating-azureml-experiments/
When the experiment is copied, the password is wiped for security reasons. If you want to programmatically inject it back, you have to set another metadata field to signal that this is a plain-text password, not an encrypted password that you are setting. If you export the experiment in JSON format, you can easily figure this out.
I think I found the reason why you are unable to get the credentials back in.
Export the JSON graph to your local disk, then update whatever parameter has to be updated.
You will also notice that the credentials are stored as 'Placeholders' instead of 'Literals'. Hence it makes sense to change them to Literals instead of Placeholders.
You can do this by traversing the JSON to find the relevant parameters you need to update.
Here is a brief illustration.
Changing the Placeholder to a Literal: