Iterate through folders in Azure Data Factory

I have a requirement like this: I have three folders in an Azure blob container, each of those folders contains a zip file, and each zip file contains the respective source file (*.csv) with the same structure. I want to loop through the folders, extract each of the zip files into an output folder, and then load all three CSV files into a target SQL table. How can I achieve this using Azure Data Factory?
Azure storage account
productblob (blob container)
Folder1 >> product1.zip >> product1.csv
Folder2 >> product2.zip >> product2.csv
Folder3 >> product3.zip >> product3.csv
I've already tried to loop through the folders and got the output in the ForEach iterator activity, but I am unable to extract the zip files.

After looping with the ForEach activity, you could follow these steps:
Use a binary dataset for the source and pass the ForEach output as the file path (create a parameter on the dataset and set its value in the copy activity's source). Set the compression type to ZipDeflate.
In the sink, select the path where you want to save the unzipped files. (Choose Flatten hierarchy in the sink if you want only the files.)
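If you want to prototype the same flow outside ADF, here is a minimal Python sketch using azure-storage-blob and zipfile that does the equivalent work; the container name, folder names, output prefix, and connection string are assumptions taken from the example above.

# Minimal sketch (not the ADF pipeline itself): loop over the folders, unzip
# each product zip in memory, and write the extracted CSVs to an output prefix.
# Container/folder names and the connection string are assumptions.
import io
import zipfile
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("productblob")

for folder in ["Folder1", "Folder2", "Folder3"]:
    for blob in container.list_blobs(name_starts_with=f"{folder}/"):
        if not blob.name.endswith(".zip"):
            continue
        data = container.download_blob(blob.name).readall()
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for member in zf.namelist():          # e.g. product1.csv
                container.upload_blob(
                    name=f"output/{member}",      # flatten into one output folder
                    data=zf.read(member),
                    overwrite=True,
                )

From the output folder, a regular copy activity can then load the three CSV files into the target SQL table.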


Azure Data Factory - What is the fastest way to copy a lot of files from OnPrem to blob storage when they are deeply nested

I need to get two different Excel files that are nested within 360 parent directories (XXXX), each with a \ME (month-end) directory, then a year directory, and finally a yyyymm directory.
Example: Z500\ME\2022\202205\Z500_contributions_202205.xls.
I tried the Copy Data activity and killed it while it was still spinning on the "listing source" step. I thought about the Lookup and Get Metadata activities, but those have limits of 5,000 rows. Any thoughts on what would be the fastest way to do this?
Code for creating the file list (I'll clean the results up in Excel):
dir L:*.xls /s /b > "C:\Foo.txt"
Right now I am creating a list of files with the DOS dir command, in the hope that if I give the copy activity a file list it will run faster because it doesn't have to go through the "list sources" step and interrogate the filesystem.
Thoughts on an ADF option?
If you are facing issues with the copy activity, you can instead try azcopy, which can also be used for copying from on-prem to Blob storage.
You can try the below code:
azcopy copy "local path/*" "https://<storage account
name>.blob.core.windows.net/<container name><path to blob" --recursive=true --include-pattern "*.xlsx"
Please go through this Microsoft documentation to know how to use azcopy.
The above command copies all the Excel files from the nested folders recursively, but it copies the folder structure to Blob storage as well.
After copying to Blob storage, you can use Start-AzureStorageBlobCopy in PowerShell to copy all the Excel files from the nested folders into a single folder.
Start-AzureStorageBlobCopy -SrcFile $sourcefile -DestCloudBlob "Destination path"
Please refer to this SO thread for listing out the files in the blob recursively.
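If PowerShell is not convenient, here is a minimal Python sketch of the same consolidation step using azure-storage-blob; the container name and the "consolidated" destination prefix are assumptions.

# Sketch of the "consolidate into one folder" step using azure-storage-blob.
# The container name and the destination prefix are assumptions.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("<container name>")

for blob in container.list_blobs():                           # walks the nested "folders"
    if not blob.name.lower().endswith((".xls", ".xlsx")):
        continue
    flat_name = "consolidated/" + blob.name.split("/")[-1]    # keep only the file name
    # Server-side copy; with a private container the source URL needs a SAS token.
    container.get_blob_client(flat_name).start_copy_from_url(f"{container.url}/{blob.name}")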
If you are creating the list of files on-prem, then you can use either azcopy or the copy activity, as you wish.
Here I am using azcopy with a SAS token. You can use it either with SAS or with Azure Active Directory authentication, as mentioned in the documentation above.

Azure Data Factory file content replace in Azure Blob storage

Good morning,
We have Azure Data Factory (ADF) and two files that we want to merge into one another. The files are currently in Azure Blob storage. Below are the contents of the files. We are trying to take the contents of File2.txt and use them to replace the '***' in File1.txt. When finished, it should look like File3.txt.
File1.txt
OP01PAMTXXXX01997
***
CL9900161313
File2.txt
ZCBP04178 2017052520220525
NENTA2340 2015033020220330
NFF232174 2015052720220527
File3.txt
OP01PAMTXXXX01997
ZCBP04178 2017052520220525
NENTA2340 2015033020220330
NFF232174 2015052720220527
CL9900161313
Does anyone know how we can do this? I have been working on this for two days, and it would seem that it should not be a difficult thing to do.
All the best,
George
You can merge two or more files using ADF, but I can't see a way to merge with a condition or to control the way the files are merged, so what I can recommend is to use an Azure Function and do the merge programmatically.
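As a rough illustration of that programmatic merge, here is a minimal Python sketch over the blob contents; the container name and connection string are assumptions, and the file names follow the example above. The same logic could run inside an Azure Function.

# Sketch: download File1.txt and File2.txt, replace the '***' placeholder with
# the contents of File2.txt, and upload the result as File3.txt.
# Container name and connection string are assumptions.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("<container name>")

file1 = container.download_blob("File1.txt").readall().decode("utf-8")
file2 = container.download_blob("File2.txt").readall().decode("utf-8")

# Replace the placeholder line with File2's contents, preserving line order.
merged = file1.replace("***", file2.rstrip("\n"))

container.upload_blob("File3.txt", merged.encode("utf-8"), overwrite=True)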
If you want to know how to merge files without preserving line priorities, use my approach:
Create a pipeline.
Add a Copy activity.
In the Copy activity, use these basic settings:
In the source, choose the wildcard file path option (select the folder where the files are located) and make sure to write "*" as the wildcard file name; this guarantees that all files under the same folder are picked up.
This will merge all the files under the same folder.
In the sink, make sure to set Copy behavior to Merge files.

How to copy CSV file from blob container to another blob container with Azure Data Factory?

I would like to copy any file in a blob container to another blob container. No transformation is needed. How do I do it?
However, I get a validation error:
Copy data1: Dataset yellow_tripdata_2020_1 location is a folder, the wildcard file name is required for Copy data1
As the error states: the wildcard file name is required for Copy data1.
On your data source, in the file field, you should enter a pattern that matches the files you want to copy. So *.* if you want to copy all the files, and something like *.csv if you only want to copy over CSV files.
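If you ever need to do the same copy outside ADF, a short azure-storage-blob sketch can do it as well; the container names and connection string below are assumptions.

# Sketch: copy every CSV blob from one container to another with a server-side copy.
# Container names and the connection string are assumptions.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
source = service.get_container_client("source-container")
target = service.get_container_client("target-container")

for blob in source.list_blobs():
    if blob.name.lower().endswith(".csv"):                # same idea as the *.csv wildcard
        # With a private source container, append a SAS token to the source URL.
        target.get_blob_client(blob.name).start_copy_from_url(f"{source.url}/{blob.name}")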

Prevent Glue from reading all files in an S3 folder

We have one S3 folder that is used for storing different files that are ETL-processed separately. The ETL processing of one file ends up reading all the other files placed in the same S3 folder. I don't see an option to read only one file from the folder. The Location property in the table is set at the folder level.
Code:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
gluedb = "srcgluedb"
gluetbl = "gluesrctable"
# from_catalog reads everything under the table's Location, i.e. the whole folder
dfRead = glue_context.create_dynamic_frame.from_catalog(database=gluedb, table_name=gluetbl)
df = dfRead.toDF()

Copy activity with simultaneous renaming of a file. From blob to blob

I have a "copy data" activity in Azure Data Factory. I want to copy .csv files from blob container X to Blob container Y. I don't need to change the content of the files in any way, but I want to add a timestamp to the name, e.g. rename it. However, I get the following error "Binary copy does not support copying from folder to file". Both the source and the sink are set up as binary.
If you want to copy the files and rename them, your pipeline should look like this:
Create a Get Metadata activity to get the file list (dataset Binary1).
Create a ForEach activity to copy each file, with its items set to @activity('Get Metadata1').output.childItems.
Inside the ForEach, create a copy activity with source dataset Binary2 (pointing to the same storage as Binary1), with a dataset parameter to specify the source file.
In the copy activity's sink settings, use sink dataset Binary3, also with a parameter, to rename the files: @concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])
Run the pipeline and check the output.
Note: The example I made just copies the files to the same container but with a new name.
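If you'd rather do the same rename-on-copy outside the pipeline, here is a small Python sketch with azure-storage-blob; the container names, timestamp format, and connection string are assumptions.

# Sketch: copy each .csv blob from container X to container Y, inserting a UTC
# timestamp into the new name (mirrors the @concat(...utcnow()...) expression).
# Container names and the connection string are assumptions.
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
source = service.get_container_client("container-x")
target = service.get_container_client("container-y")
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

for blob in source.list_blobs():
    if blob.name.endswith(".csv"):
        base, ext = blob.name.rsplit(".", 1)
        new_name = f"{base}_{stamp}.{ext}"            # e.g. product1_20240101T000000Z.csv
        # Server-side copy; a SAS token on the source URL is needed for private containers.
        target.get_blob_client(new_name).start_copy_from_url(f"{source.url}/{blob.name}")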