How to merge all files using Copy activity or other tools in ADF - azure-data-factory

I have JSON files sitting under different subfolders.
The folder structure is like this:
/UserData/data/json/A/2021/01/28/03/
A_2021_01_28_03_file1.json
A_2021_01_28_03_file2.json
A_2021_01_28_03_file3.json
/UserData/data/json/A/2021/01/28/02/
A_2021_01_28_02_file1.json
A_2021_01_28_02_file2.json
/UserData/data/json/B/2021/03/27/02/
B_2021_03_27_02_file1.json
B_2021_03_27_02_file2.json
/UserData/data/json/C/2021/04/21/01/
C_2021_04_21_01_file1.json
C_2021_04_21_01_file2.json
I want to merge all the files available under the A, B, and C folders
and ingest them as tables A, B, and C in Azure Data Explorer.
The schema is:
name string
timestamp date
value string
I don't see a merge feature in the Copy activity, so how could I achieve this?
I'd appreciate your help.

You need three Copy activities, one per top-level folder (A, B, C).
In each Copy activity, under Source, select "Wildcard file path"
and use * to match all files. The activity will then copy every file under that folder into the corresponding sink table; a minimal sketch of the source settings is shown below.
Read more: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage?tabs=data-factory
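For reference, here is a rough sketch of what one of those Copy activities could look like in the JSON (Code) view. This is illustrative only: the dataset names (JsonFilesFolderA, AdxTableA), the store settings type (AzureBlobFSReadSettings assumes ADLS Gen2), and the wildcard paths are assumptions you would replace with your own.

{
    "name": "Copy_Folder_A_To_ADX",
    "type": "Copy",
    "inputs": [ { "referenceName": "JsonFilesFolderA", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "AdxTableA", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "JsonSource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": true,
                "wildcardFolderPath": "UserData/data/json/A/*/*/*/*",
                "wildcardFileName": "*.json"
            }
        },
        "sink": { "type": "AzureDataExplorerSink" }
    }
}

Duplicate the activity (or parameterize the folder letter) for B and C, pointing each one at its own Azure Data Explorer sink dataset and table.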

Related

Copy all the files of a particular date using ADF

I have multiple file dumps in a single day.
For example:
pqr_20220627_1.csv
pqr_20220627_2.csv
pqr_20220627_3.csv
abc_20220628_1.csv
abc_20220628_2.csv
abc_20220628_3.csv
xyz_20220629_1.csv
xyz_20220629_2.csv
xyz_20220629_3.csv
I have to fetch the files for a given date using ADF and copy them to a different blob storage.
For example, for 2022/06/29 I only want
xyz_20220629_1.csv
xyz_20220629_2.csv
xyz_20220629_3.csv
these three files in my target blob.
To copy files that match a particular date to a target location, you can use a Wildcard file path with dynamic content. Consider the following demonstration.
Create a Copy data activity in an Azure Data Factory pipeline. Create a pipeline parameter called ip_date that holds the date for which you want to copy files. You can enter this value before triggering the pipeline.
Now create the dataset using a linked service to your source blob storage. In the Copy activity source settings, choose Wildcard file path as the file path type and enter the value for the wildcard file name as:
@{concat('*',formatDateTime(pipeline().parameters.ip_date,'yyyyMMdd'),'*.txt')}
You can use *.csv instead of *.txt. With this dynamic content, the wildcard file path evaluates to *20220629*.txt when ip_date has the value 2022/06/29 (* matches zero or more characters).
You can now create the linked service and dataset for your sink if required. Publish and select Add Trigger -> Trigger Now. Before triggering the pipeline, you will be asked to enter a value for the parameter ip_date; here you can enter the date 2022/06/29.
NOTE: The date value you give for ip_date should be in the format yyyy-MM-dd or yyyy/MM/dd, or the pipeline will fail when you run it.
The pipeline will successfully copy only the required files for the date 2022/06/29.
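For reference, a rough sketch of how the source side of that Copy activity could look in the JSON (Code) view, assuming Azure Blob Storage and delimited text; ip_date is the pipeline parameter from above, everything else is a placeholder (the *.csv suffix matches the file names in the question):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": {
            "value": "@{concat('*', formatDateTime(pipeline().parameters.ip_date, 'yyyyMMdd'), '*.csv')}",
            "type": "Expression"
        }
    }
}

The sink side simply points to a dataset on the target blob container; no wildcard is needed there.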

How do I fetch the list of date folders in ADLS and pass them to a Delete activity?

I have created an ADF pipeline with only one activity, a Delete activity.
Below are my ADLS folders; here raw is the container name:
/raw/2022-05-28/omega/omega3.txt
/raw/2022-05-28/demo/demo3.txt
/raw/2022-05-29/omega/omega2.txt
/raw/2022-05-29/demo/demo2.txt
/raw/2022-05-30/omega/omega1.txt
/raw/2022-05-30/demo/demo1.txt
My intention is to delete all the folders inside the raw container except the current date folder.
The folders to be deleted are:
2022-05-28
2022-05-29
So basically, once the pipeline completes, only the folders and files below should remain, because they belong to the current date:
/raw/2022-05-30/omega/omega1.txt
/raw/2022-05-30/demo/demo1.txt
Right now this is what I am doing:
Created a dataset for ADLS with the container name and 2022-05-28 as the folder
Created a pipeline with a Delete activity using the dataset from #1
Running the pipeline twice manually, changing the dataset folder to 2022-05-28 and then 2022-05-29
I don't want this kind of manual intervention. I want to pass an array of folder dates automatically, based on however many old folders exist in ADLS. So how do I fetch the list of folders in ADLS, extract the date folder names from that list, and pass that list of folder dates as an array to my delete pipeline?
Can you please help?
Since it is not ideal to change the folder name manually for each delete, you can use dynamic parameters. We can use a Get Metadata activity to get the folder names, a ForEach activity to loop through each folder name, an If Condition activity to compare each folder name with the current date folder, and finally a Delete activity to delete the old folders.
Create a dataset pointing to the container that holds all these folders (raw). Create a parameter folder_name for this dataset and set the folder path to @dataset().folder_name.
Use a Get Metadata activity referring to the dataset just created, with Field list set to Child items. Give '/' as the value for the parameter folder_name (we do not need a dynamic parameter value in this activity).
Create a ForEach activity. The output of the Get Metadata activity is passed to this ForEach activity. In ForEach -> Settings, set the Items field value to @activity('get_foldername').output.childItems, where get_foldername is the name of the Get Metadata activity.
Under ForEach -> Activities, add an If Condition activity and build the expression
@not(equals(utcNow('yyyy-MM-dd'), item().name)) (true when the folder name is not the current date folder). When this condition is true, we need to perform the delete, so create the Delete activity under the True case.
In the Delete activity, use the dataset created initially, and give the value for folder_name as @item().name (dynamic parameter).
Publish and run the pipeline. It will run successfully and delete all the folders except the one for the current date. This way you can delete the folders from your container that do not belong to the current date.
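For reference, a rough sketch of how the ForEach body could look in the pipeline JSON, using the names from above (get_foldername, folder_name) plus a placeholder dataset name RawFolderDataset; the exact Delete activity properties (for example, where recursive lives) can vary by connector and ADF version:

{
    "name": "ForEachFolder",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('get_foldername').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "IfNotCurrentDate",
                "type": "IfCondition",
                "typeProperties": {
                    "expression": {
                        "value": "@not(equals(utcNow('yyyy-MM-dd'), item().name))",
                        "type": "Expression"
                    },
                    "ifTrueActivities": [
                        {
                            "name": "DeleteOldFolder",
                            "type": "Delete",
                            "typeProperties": {
                                "dataset": {
                                    "referenceName": "RawFolderDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "folder_name": { "value": "@item().name", "type": "Expression" }
                                    }
                                },
                                "recursive": true
                            }
                        }
                    ]
                }
            }
        ]
    }
}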

Help required with removing columns from a text file using ADF

I have a sample file like the one below. Using Data Factory, I need to create another text file in which the first two columns are removed. Is there a way to generate an output file like the one below?
Source file:
Output file:
Core Data Factory (i.e. not including Mapping Data Flows) is not gifted with many abilities to do data transformation (which this is); however, it can do some things. It can change formats (e.g. .csv to JSON), it can add some metadata columns (like $$FILENAME), and it can remove columns, simply by using the mapping in the Copy activity.
1. Add a Copy activity to your pipeline and set the source to your main file.
2. Set the Sink to your target file name. It can be the same name as your original file, but I would make it different for audit trail purposes.
3. Import the schema of your file, and make sure the separator in the dataset is set to semicolon (;).
4. Now press the trash can button to delete the mappings for columns 1 and 2.
5. Run your pipeline. The output file should not have the two columns.
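For reference, that column removal shows up in the Copy activity JSON as an explicit mapping (translator). The sketch below is only illustrative: it assumes the file has no header row, in which case ADF auto-names the columns Prop_0, Prop_1, Prop_2, and so on; if you imported a schema with headers, use your real column names instead.

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Prop_2" }, "sink": { "name": "Prop_2" } },
        { "source": { "name": "Prop_3" }, "sink": { "name": "Prop_3" } },
        { "source": { "name": "Prop_4" }, "sink": { "name": "Prop_4" } }
    ]
}

Columns 1 and 2 (Prop_0 and Prop_1) simply have no mapping entry, so they are never written to the sink.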
You can accomplish this task by using the Select transformation in a mapping data flow in Azure Data Factory (ADF). You can delete any unwanted columns from your delimited text file with this transformation.
I tested the same in my environment and it works fine.
Please follow the steps below:
1. Create the Azure Data Factory using the Azure portal.
2. Upload the data to the source (e.g. a blob container).
3. Create a linked service to connect the blob storage with ADF.
4. Then, create DelimitedText datasets using the above linked service for the source and sink files. In the source dataset, set the column delimiter to semicolon (;). Also, in the Schema tab, select Import schema -> From connection/store.
5. Create a data flow. Select the source dataset from your datasets list. Click the + symbol and add a Select transformation.
6. In its settings, select the columns you want to delete and then click the delete option.
7. Add the sink at the end. In the Sink tab, use the sink dataset you created earlier in step 4. In the Settings tab, for the File name option select Output to single file and give the file name in the option below.
8. Now create a pipeline and use a Data flow activity. Select the data flow you created. Click Trigger Now to run the pipeline.
9. Check the output file at the sink location.
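For reference, the finished data flow roughly corresponds to JSON like the sketch below. The names (source1, Select1, sink1, the datasets, and columns Column_3 to Column_5) are placeholders, and the scriptLines content may differ slightly between ADF versions:

{
    "name": "RemoveColumnsDataFlow",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [ { "name": "source1", "dataset": { "referenceName": "SourceSemicolonCsv", "type": "DatasetReference" } } ],
            "sinks": [ { "name": "sink1", "dataset": { "referenceName": "SinkCsv", "type": "DatasetReference" } } ],
            "transformations": [ { "name": "Select1" } ],
            "scriptLines": [
                "source(allowSchemaDrift: true, validateSchema: false) ~> source1",
                "source1 select(mapColumn(Column_3, Column_4, Column_5), skipDuplicateMapInputs: true, skipDuplicateMapOutputs: true) ~> Select1",
                "Select1 sink(allowSchemaDrift: true, validateSchema: false) ~> sink1"
            ]
        }
    }
}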

Azure Data Factory V2 Copy Activity - Save List of All Copied Files

I have pipelines that copy files from on-premises to different sinks, such as on-premises and SFTP.
I would like to save a list of all files that were copied in each run for reporting.
I tried using Get Metadata and ForEach, but I'm not sure how to save the output to a flat file or even a database table.
Alternatively, is it possible to find the list of objects that were copied somewhere in the Data Factory logs?
Thank you
Update:
Items: @activity('Get Metadata1').output.childItems
If you want to record the source file names, yes, we can do that. As you said, we need to use the Get Metadata and ForEach activities.
I've created a test to save the source file names of the Copy activity into a SQL table.
As we all know, we can get the file list via Child items in the Get Metadata activity.
The dataset of the Get Metadata1 activity specifies the container, which contains several files.
Inside the ForEach activity, we can traverse this array. I set a Copy activity named Copy-Files to copy files from source to destination.
@item().name represents every file in the test container. I key in the dynamic content @item().name to specify the file name, so the ForEach passes the file names in the test container one by one. The copy task executes in batches, and each batch passes in one file name to be copied, so that we can record each file name into the database table later.
Then I set another Copy activity to save the file names into a SQL table. Here I'm using Azure SQL and I've created a simple table.
create table dbo.File_Names(
Copy_File_Name varchar(max)
);
As this post also says, we can use similar syntax, select '@{item().name}' as Copy_File_Name, to access activity data in ADF. Note: the alias name should be the same as the column name in the SQL table.
Then we can sink the file names into the SQL table.
Select the table created previously.
After a debug run, I can see all the file names saved into the table.
If you want to add more information, you can refer to the post I mentioned previously.
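For reference, the file-name-logging Copy activity could look roughly like this in JSON. The dataset names are placeholders; the source is simply an Azure SQL dataset used to run a one-row query, and the sink is the dbo.File_Names table created above:

{
    "name": "Record_File_Name",
    "type": "Copy",
    "inputs": [ { "referenceName": "AzureSqlDummySource", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "FileNamesTable", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
                "value": "select '@{item().name}' as Copy_File_Name",
                "type": "Expression"
            }
        },
        "sink": { "type": "AzureSqlSink" }
    }
}

Because this activity runs inside the same ForEach as Copy-Files, item().name resolves to the file that was just copied.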

GetMetadata to get the full file directory in Azure Data Factory

I am working through a use case where I want to load all of the folder names that were loaded into an Azure database into a separate "control" table, but I am having problems using the Get Metadata activity properly.
The purpose of this use case is to skip all of the older folders (that were already loaded), focus only on the new folder, get its ".gz" file, and load it into an Azure database. On a high level, I thought I would use the Get Metadata activity to send all of the folder names to a stored procedure. That stored procedure would then load those folder names with a status of '1' (meaning successful).
That table would then be used in a separate pipeline that loads files into the database. I would use a Lookup activity to compare against already-loaded folders, and if one of them doesn't match, that would be the folder to get the file from (the source is an S3 bucket).
The folder structure is nested in the YYYY/MM/DD format (e.g. 2019/12/27), where each day a new folder is created and a ".gz" file is placed there.
I created an ADF pipeline using the Get Metadata activity pointing to the blob storage that the folders have already been loaded into.
However, when I run this pipeline, I only get the top-level folder names: 2019, 2018, 2017.
Is it possible to not only get the top-level folder name but go all the way down to the day level? So instead of the output being "2019", it would be "2019/12/26", and the next one would be "2019/12/27", plus all of the months and days from 2017 and 2018.
If anyone has faced this issue, any insight would be greatly appreciated.
Thank you
You can also use a wildcard placeholder in this case, if you have a defined and non-changing folder structure.
Use as directory: storageroot / * / * / * / filename
For example, I used csvFiles / * / * / * / * / * / * / *.csv
to get all files that have this structure:
csvFiles / topic / subtopic / country / year / month / day
Then you get all files in this folder structure.
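If you apply that wildcard in a Copy activity, the source store settings would look roughly like the fragment below; AzureBlobStorageReadSettings and the path relative to the storage root are assumptions, so adjust them to your connector and container layout:

"storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFolderPath": "csvFiles/*/*/*/*/*/*",
    "wildcardFileName": "*.csv"
}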
Based on the statements in the Get Metadata activity documentation, childItems only returns elements from the specified path; it won't include items in subfolders.
I suppose you have to use a ForEach activity to loop through the childItems array layer by layer to flatten the whole structure. At the same time, use a Set Variable activity to concatenate the complete folder path. Then use an If Condition activity: when you detect that the element type is File rather than Folder, you can call the stored procedure you mentioned in your question.
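For reference, a rough JSON sketch of the Get Metadata building block mentioned above; the dataset name FolderDataset, its folder_path parameter, and the pipeline variable current_path are all placeholder names:

{
    "name": "Get_Child_Items",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "FolderDataset",
            "type": "DatasetReference",
            "parameters": {
                "folder_path": { "value": "@variables('current_path')", "type": "Expression" }
            }
        },
        "fieldList": [ "childItems" ]
    }
}

Each element returned in childItems has a name and a type, so inside the ForEach the If Condition can use an expression such as @equals(item().type, 'File') to decide whether to keep concatenating the folder path or to call the stored procedure.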