GetMetadata to get the full file directory in Azure Data Factory

I am working through a use case where I want to load all the folder names that were loaded into an Azure database into a separate "control" table, but am having problems using the Get Metadata activity properly.
The purpose of this use case is to skip all of the older folders (that were already loaded), focus only on the new folder, get the ".gz" file, and load it into an Azure database. On a high level I thought I would use the Get Metadata activity to send all of the folder names to a stored procedure. That stored procedure would then load those folder names with a status of '1' (meaning successful).
That table would then be used in a separate pipeline that loads files into the database. I would use a Lookup activity to compare against the already loaded folders, and if one of them doesn't match, that would be the folder to get the file from (the source is an S3 bucket).
The folder structure is nested in the YYYY/MM/DD format (e.g. 2019/12/27), where each day a new folder is created and a ".gz" file is placed in it.
I created an ADF pipeline using the "GetMetadata" activity pointing to the blob storage that has already had the folders loaded into it.
However, when I run this pipeline I only get the three top-level folder names: 2019, 2018, 2017.
Is it possible to not only get the top-level folder name but to go all the way down to the day level? So instead of the output being "2019" it would be "2019/12/26", then "2019/12/27", and so on for all of the months and days from 2017 and 2018.
If anyone has faced this issue, any insight would be greatly appreciated.
Thank you

You can also use a wildcard placeholder in this case, if you have a defined and unchanging folder structure.
Use as directory: storageroot / * / * / * / filename
For example I used csvFiles / * / * / * / * / * / * / *.csv
to get all files that have this structure:
csvFiles / topic / subtopic / country / year / month / day
Then you get all files in this folder structure.
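As a rough local sanity check of what such a wildcard pattern covers, here is a minimal Python sketch (the paths and the helper are hypothetical, not part of the answer above): it only mimics the depth rule, i.e. a file matches when it sits exactly the expected number of folder levels below the root.

# Hypothetical paths; the pattern corresponds to csvFiles/*/*/*/*/*/*/*.csv
paths = [
    "csvFiles/topic/subtopic/country/2019/12/27/data.csv",   # correct depth -> matched
    "csvFiles/topic/subtopic/2019/data.csv",                 # too shallow -> not matched
]

def matches(path, root="csvFiles", depth=7):
    # True if the path has exactly `depth` segments below the root and ends in .csv
    parts = path.split("/")
    return parts[0] == root and len(parts) == depth + 1 and parts[-1].endswith(".csv")

for p in paths:
    print(p, matches(p))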

Based on the statements in the Get Metadata activity documentation, childItems only returns elements from the specified path; it won't include items in subfolders.
I suppose you have to use a ForEach activity to loop through the childItems array layer by layer to flatten the whole structure. At the same time, use a Set Variable activity to concatenate the complete folder path. Then use an If Condition activity: when you detect that the element type is File rather than Folder, you can call the stored procedure you mentioned in your question.
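If it helps to see that layer-by-layer idea outside of ADF, here is a minimal Python sketch using the azure-storage-blob SDK (the connection string and container name are placeholders I made up); it recursively walks the year/month/day prefixes and prints the full day-level folder paths.

from azure.storage.blob import ContainerClient, BlobPrefix

# Placeholder connection string and container name -- point these at the storage
# account that holds the YYYY/MM/DD folders.
container = ContainerClient.from_connection_string("<connection-string>", container_name="<container>")

def list_day_folders(prefix=""):
    # walk_blobs with delimiter="/" returns BlobPrefix items for virtual folders
    # and BlobProperties items for the files at that level.
    sub_folders = [item.name for item in container.walk_blobs(name_starts_with=prefix, delimiter="/")
                   if isinstance(item, BlobPrefix)]
    if not sub_folders and prefix:
        yield prefix                       # deepest level reached, e.g. "2019/12/27/"
    for sub in sub_folders:
        yield from list_day_folders(sub)   # recurse one layer down

for folder in list_day_folders():
    print(folder.rstrip("/"))              # e.g. 2019/12/27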

Related

Pipeline filter modifies the date every time I run the pipeline, which prevents me from pulling only the last modified date to SQL

Once a week, a file is generated in an on-prem folder. My pipeline copies that on-prem file to blob storage and then from blob to blob; during this part the pipeline filters my data before it goes to SQL. The problem is that when the data gets filtered, the modified date changes and all the files in blob storage are pulled rather than only the one that was originally pulled for that week. I have attached images of my pipeline, the on-prem files, and what I filter for.
Instead of relying on the file's last modified date, you can use the file name instead.
Since you have the date (yyyyddMM format) in the filename itself, you can build the filename dynamically and check whether this file is present in the filtered files list.
Look at the following demonstration. Let's say I have the following 2 files as my filtered files. I used a Get Metadata activity (child items) on the blob storage.
Since we know the format of the filename (SalesWeekly_yyyyddMM.csv), create the current filename value dynamically using the following dynamic content in a Set Variable activity (the variable name is file_name_required).
@concat('SalesWeekly_',formatDateTime(utcnow(),'yyyyddMM'),'.csv')
Now, create an array containing all the filenames returned by our Get Metadata activity. The ForEach activity's items value is given as @activity('Get Metadata1').output.childItems.
Inside this, use an Append Variable activity with the value @item().name.
Now you have the file name you actually need (dynamically built) and the filtered file names array. You can check whether the filename is present in the array of filtered file names and take the necessary actions. I used an If Condition activity with the following dynamic content.
@contains(variables('files_names'),variables('file_name_required'))
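For reference, the same check can be reproduced locally. This small Python sketch (with made-up file names) mirrors the two expressions above: build the current week's file name, then test whether it is in the filtered list.

from datetime import datetime, timezone

# Hypothetical filtered file names returned by Get Metadata (child items).
files_names = ["SalesWeekly_20220703.csv", "SalesWeekly_20221003.csv"]

# Equivalent of @concat('SalesWeekly_', formatDateTime(utcnow(),'yyyyddMM'), '.csv')
file_name_required = "SalesWeekly_" + datetime.now(timezone.utc).strftime("%Y%d%m") + ".csv"

# Equivalent of @contains(variables('files_names'), variables('file_name_required'))
if file_name_required in files_names:
    print("Current week's file found -> True case (copy to SQL)")
else:
    print("Current week's file missing -> False case")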
The following are reference images of the output flow.
When the current week's file is not present in the filtered files.
When the current week's file is present in the filtered files.
I have used a Wait activity for the demo here. You can replace it with a Copy activity (from blob to SQL) in the True case. If you don't want to insert anything when the current week's file is missing, leave the False case empty.

How to merge all files using the Copy activity or other tools in ADF

I have JSON files sitting under different subfolders.
The folder structure is like this:
/UserData/data/json/A/2021/01/28/03/
A_2021_01_28_03_file1.json
A_2021_01_28_03_file2.json
A_2021_01_28_03_file3.json
/UserData/data/json/A/2021/01/28/02/
A_2021_01_28_02_file1.json
A_2021_01_28_02_file2.json
/UserData/data/json/B/2021/03/27/02/
A_2021_03_27_02_file1.json
A_2021_03_27_02_file2.json
/UserData/data/json/C/2021/04/21/01/
A_2021_04_21_01_file1.json
A_2021_04_21_01_file2.json
I want to merge all the files available under the A folder, B folder, and C folder
and ingest them as an A table, B table, and C table in Azure Data Explorer.
The schema is:
name string
timestamp date
value string
I don't see a merge feature in the Copy activity; how could I achieve this?
I appreciate your help.
You need 3 Copy activities.
In each Copy activity, under Source, select "Wildcard file path"
and use * to select all files (see the attached picture).
It will copy all the files under the specific folder.
Please read more: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage?tabs=data-factory
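If you prefer to do the merge outside of ADF, or just want to see the idea, here is a minimal Python sketch with the azure-storage-blob SDK; the connection string, container name, and the assumption that every file holds a JSON array are all placeholders, not something stated in the question.

import json
from azure.storage.blob import ContainerClient

# Placeholder connection string / container -- adjust to your storage account.
container = ContainerClient.from_connection_string("<connection-string>", container_name="UserData")

def merge_folder(prefix):
    # Read every .json blob under the prefix and return one combined list of records.
    records = []
    for blob in container.list_blobs(name_starts_with=prefix):
        if blob.name.endswith(".json"):
            data = container.download_blob(blob.name).readall()
            records.extend(json.loads(data))   # assumes each file contains a JSON array
    return records

# One merged result per top-level folder, mirroring the three Copy activities.
merged_a = merge_folder("data/json/A/")
merged_b = merge_folder("data/json/B/")
merged_c = merge_folder("data/json/C/")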

How do I fetch the list of date folders in ADLS and pass them to a Delete activity?

I have created an ADF pipeline with only one activity, and that is a Delete activity.
Below are my ADLS folders; here raw is the container name:
/raw/2022-05-28/omega/omega3.txt
/raw/2022-05-28/demo/demo3.txt
/raw/2022-05-29/omega/omega2.txt
/raw/2022-05-29/demo/demo2.txt
/raw/2022-05-30/omega/omega1.txt
/raw/2022-05-30/demo/demo1.txt
My intention is to delete all the folders inside the raw container except the current date folder.
The folders to be deleted are below:
2022-05-28
2022-05-29
So basically, once the pipeline completes, only the folders and files below should still be available, because they belong to the current date:
/raw/2022-05-30/omega/omega1.txt
/raw/2022-05-30/demo/demo1.txt
Right now this is what I am doing:
Created a dataset for ADLS and gave the container name and 2022-05-28 as the folder
Created a pipeline with a Delete activity using the dataset from step 1
Running the pipeline twice manually, altering the dataset folder for 2022-05-28 and 2022-05-29
I don't want manual intervention like this. I want an array of folder dates to be passed automatically based on the old folders in ADLS. So how do I fetch the list of folders in ADLS, extract the date folders from that list, and pass that list of folder dates as an array to my delete pipeline?
Can you please help?
Since it is not ideal to change the folder name manually to delete each folder, you can use dynamic parameters. We can use a Get Metadata activity to get the folder names, a For Each activity to loop through each folder name, an If Condition activity to compare each folder name to the current date folder, and finally a Delete activity to delete the folders.
Create a dataset pointing to the container that holds all these folders (raw). Create a parameter folder_name for this dataset and give its value as @dataset().folder_name.
Use a Get Metadata activity referring to the dataset just created, with the field list set to Child items. Give '/' as the value for the parameter folder_name (we do not need a dynamic parameter value in this activity).
Create a For Each activity. The output of the Get Metadata activity is passed to this For Each activity. In For Each -> Settings, give the items field the value @activity('get_foldername').output.childItems, where get_foldername is the name of the Get Metadata activity.
Under For Each -> Activities, add an If Condition activity and, under its Activities tab, build the expression
@not(equals(utcnow('yyyy-MM-dd'), item().name)) (true if the folder name from the For Each item is not equal to the current date folder). When this condition is true, we need to perform the delete, so create a Delete activity for the True case.
In the Delete activity, use the dataset created initially, and give the value for folder_name as @item().name (a dynamic parameter).
Publish and run the pipeline. It will run successfully and delete all the folders except the one for the current date. This way you can delete the folders in your container that do not belong to the current date.
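The same logic is easy to prototype outside the pipeline. Here is a minimal Python sketch with the azure-storage-blob SDK (the connection string is a placeholder); it lists the top-level date folders in the raw container and deletes everything that does not match today's date.

from datetime import datetime, timezone
from azure.storage.blob import ContainerClient, BlobPrefix

# Placeholder connection string; "raw" is the container from the question.
container = ContainerClient.from_connection_string("<connection-string>", container_name="raw")

today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# delimiter="/" lists only the top-level "folders" (e.g. 2022-05-28/).
for item in container.walk_blobs(delimiter="/"):
    if not isinstance(item, BlobPrefix):
        continue
    folder = item.name.rstrip("/")
    if folder != today:
        # Delete every blob under the old date folder.
        for blob in container.list_blobs(name_starts_with=item.name):
            container.delete_blob(blob.name)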

Azure Data Factory V2 Copy Activity - Save List of All Copied Files

I have pipelines that copy files from on-premises to different sinks, such as on-premises and SFTP.
I would like to save a list of all files that were copied in each run for reporting.
I tried using Get Metadata and For Each, but I am not sure how to save the output to a flat file or even a database table.
Alternatively, is it possible to find the list of objects that were copied somewhere in the Data Factory logs?
Thank you
Update:
Items: @activity('Get Metadata1').output.childItems
If you want to record the source file names, yes, we can. As you said, we need to use the Get Metadata and ForEach activities.
I've created a test to save the source file names of the Copy activity into a SQL table.
As we all know, we can get the file list via Child items in the Get Metadata activity.
The dataset of the Get Metadata1 activity specifies the container, which contains several files.
The list of files in the test container is as follows:
Inside the ForEach activity, we can traverse this array. I set up a Copy activity named Copy-Files to copy files from source to destination.
@item().name represents each file in the test container. I key in the dynamic content @item().name to specify the file name, so the file names in the test container are passed in sequentially. This executes the copy task in batches; each batch passes in one file name to be copied, so that we can record each file name into the database table later.
Then I set another Copy activity to save the file names into a SQL table. Here I'm using Azure SQL, and I've created a simple table.
create table dbo.File_Names(
Copy_File_Name varchar(max)
);
As this post also said, we can use a similar syntax, select '@{item().name}' as Copy_File_Name, to access activity data in ADF. Note: the alias name should be the same as the column name in the SQL table.
Then we can sink the file names into the SQL table.
Select the table created previously.
After I run debug, I can see all the file names are saved into the table.
If you want to add more information, you can reference the post I mentioned previously.
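For completeness, here is a rough Python sketch of the same idea done outside ADF (both connection strings are placeholders, and pyodbc is just one way to reach Azure SQL): list the files in the container and insert one row per file name into the dbo.File_Names table.

import pyodbc
from azure.storage.blob import ContainerClient

# Placeholder connection strings -- the copied container and the Azure SQL database.
container = ContainerClient.from_connection_string("<storage-connection-string>", container_name="test")
sql = pyodbc.connect("<azure-sql-odbc-connection-string>")

cursor = sql.cursor()
for blob in container.list_blobs():
    # One row per copied file, the counterpart of: select '@{item().name}' as Copy_File_Name
    cursor.execute("insert into dbo.File_Names (Copy_File_Name) values (?)", blob.name)
sql.commit()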

GCP Cloud Storage - Wildcard Prefix List

This is a question of how to accomplish a certain task with the GCP Cloud Storage API.
I have a bucket with a "folder" structure as follows:
ID / Year / Month / Day / FILES
I need to search for all files with the following format: ID/2016/04/03/. I had hoped I could use a * in the prefix (*/2016/04/03/), but this does not work.
Does anyone know a way to make this happen without iterating over every top-level folder myself?
There is no API support for wildcard expressions - just for prefix queries.
When you say "iterating every top level folder myself" it sounds like you mean manually listing them in your client code? You can avoid doing that by doing a query that specifies delimiter="/" and prefix="" to find the top-level "folders". You would then iterate over that list and construct prefix queries to list the individual objects within the given date-named folder.
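In the Python client library that two-step approach looks roughly like the sketch below (the bucket name and date are placeholders): one delimiter query to discover the top-level ID prefixes, then one prefix query per ID for the date in question.

from google.cloud import storage

client = storage.Client()
bucket_name = "my-bucket"        # placeholder
date_path = "2016/04/03/"

# delimiter="/" with an empty prefix returns only the top-level "folders" (the IDs).
top_level = client.list_blobs(bucket_name, prefix="", delimiter="/")
list(top_level)                               # consume the iterator so .prefixes is populated
for id_prefix in top_level.prefixes:          # e.g. "12345/"
    for blob in client.list_blobs(bucket_name, prefix=id_prefix + date_path):
        print(blob.name)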
If it's possible for you to restructure your names so that the top level is the date, you could avoid having to do the extra prefix+delimiter query and iteration, e.g.,
Year / Month / Day / ID / FILES