How to copy files from SFTP to a target folder created dynamically based on the source filename (Blob storage)?

I am new to ADF and need help with 2 scenarios.
1. I have to copy files from SFTP to blob storage (Azure Gen2) using ADF. In the source SFTP folder there are 3-5 different files, for example:
S09353.DB2K.AFC00R46.F201130.txt
S09353.DB2K.XYZ00R46.F201130.txt
S09353.DB2K.GLY00R46.F201130.txt
On copying, these files should be placed under corresponding folders that are created dynamically based on the file type.
For example: S09353.DB2K.AFC00R46.F201130.txt copies to the AFC00R46 folder,
S09353.DB2K.XYZ00R46.F201130.txt copies to the XYZ00R46 folder.
2. Another requirement is to copy csv files from blob storage to SFTP. On copying, the files need to go to a target folder created dynamically based on the file name,
for example: cust-fin.csv -----> copy to -----> Finance folder.
Please help me with this.

The basic solution to your problem is to use Parameters in DataSets. This example is for a Blob Storage connection, but the approach is the same for SFTP as well. Another hint: if you are just moving files, use Binary DataSets.
Create Parameter(s) in DataSet
Reference Parameter(s) in DataSet
Supply Parameter(s) in the Pipeline
In this example, I am passing Pipeline parameters to a GetMetadata Activity, but the principles are the same for all DataSet types. The values could also be hard coded, expressions, or variables.
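For illustration, here is a minimal sketch of what the parameterised Binary dataset JSON could look like; the dataset, linked service, container, and parameter names (BlobBinarySink, AzureBlobStorageLS, landing, FolderName) are only placeholders, not names from your environment:
{
    "name": "BlobBinarySink",
    "properties": {
        "type": "Binary",
        "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
        "parameters": { "FolderName": { "type": "string" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "folderPath": { "value": "@dataset().FolderName", "type": "Expression" }
            }
        }
    }
}
The activity that uses the dataset then supplies a value for FolderName under its dataset properties in the pipeline.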
Your Process
If you need this to be dynamic for each file name, then you'll probably want to break this into parts:
Use a GetMetadata Activity to list the files from SFTP.
Foreach over the return list and process each file individually.
Inside the Foreach -> Parse each file name individually to extract the Folder name to a variable (see the expression sketch after this list).
Inside the Foreach -> Use the Variable in a Copy Activity to populate the Folder name in the DataSet.
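As a sketch of the parsing step, assuming your file names always follow the S09353.DB2K.<TYPE>.F<date>.txt pattern shown above, the folder name is the third dot-separated token, so a Set Variable activity inside the Foreach could use:
@split(item().name, '.')[2]
For S09353.DB2K.AFC00R46.F201130.txt this returns AFC00R46, which the Copy Activity's sink dataset parameter can then pick up via @variables('FolderName') (or whatever you name the variable). For the second scenario (cust-fin.csv -> Finance) the folder name is not part of the file name, so you would need a small lookup or mapping instead of a split.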

Related

How to copy file based on date in Azure Data Factory

I have a list of files in an ADLS container whose names contain a date, as given below:
TestFile-Name-20221120.csv
TestFile-Name-20221119.csv
TestFile-Name-20221118.csv
and I want to copy only the files containing today's date, like TestFile-Name-20221120.csv today, and so on.
I've used a Get Metadata activity to get the list of files and then a ForEach to iterate over each file, and then used Set Variable to extract the date part of the name, like 20221120, but I'm not sure how to proceed further.
We have something similar running. We check an SFTP folder for the existence of files, using the Get Metadata activity. In our case, there can be folders or files. We only want to process files, and very specific ones for that matter (i.e. we have one pipeline per filename we can process, as the different filenames contain different columns/data types etc.).
Our pipeline looks like this:
Within our Get Metadata component, we basically just filter for the name of the object we want, and we only want files ending in .zip, meaning we added a Filename filter:
In your case, the first part would be 'TestFile-Name-', and the second part would be '*.csv'.
We then have a For Each loop set up, to process anything (the child items) we retrieved in the Get Metadata step. Within the For Each we defined an If Condition to only process files, and not folders.
In our case, we use the following expression:
@equals(item().type, 'File')
In your case, you could use something like:
@endsWith(item().name, concat(<variable containing your date>, '.csv'))
Assuming all the file names start with TestFile-Name- and you want to copy the file with today's date,
use a Get Metadata activity to check whether the file exists; the file name can be dynamic, like:
@concat('TestFile-Name-', utcnow(), '.csv')
Note: you need to format utcnow as per the needed format.
If the file exists, proceed with the copy; otherwise ignore it.
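As a sketch of that formatting, assuming the yyyyMMdd pattern used in the file names above, the dynamic file name could be:
@concat('TestFile-Name-', formatDateTime(utcnow(), 'yyyyMMdd'), '.csv')
The same formatDateTime(utcnow(), 'yyyyMMdd') expression can also populate the date variable used in the endsWith filter shown earlier.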

How to rename an XML file using the pattern in a different .txt file?

I have a folder say, source folder, containing 1000+ xml files with some ambiguous names, like:
_MIM_15646432635_6664684
_MIM_54154548557_6568436 etc.
Out of these thousands of XML files I have to select some 10-12 xml files that contain a particular node, move them to another folder (the destination folder), and rename them with meaningful names.
For example:
There is an xml file named _MIM_15646432635_6664684 that contains a node pattern like "bab6e7h835468eg", and I have to rename it to a name like {1FE9909E-4450-B98665362022}.
So I have to write a script which will search for the file in the source folder and, if it finds my desired node pattern, move the file to the destination folder after renaming it to a meaningful name.
I have an Excel sheet with a list of two columns: one containing the specific node pattern, the other the respective new name.
Currenty, I’ve a script which can search a file and move it to another folder provided I’m giving input to the script with the node pattern and file name from that excel sheet:
Select-String – Path “\Dubwta01\AIR\Invalid*.xml” – Pattern ‘bab6e7h835468eg’ | %{Copy-Item – Path $_.Path – Destination ‘\Dubwta01\AIR\Invalid{1FE9909E-4450-B98665362022}.xml’
What I need now is a new script that picks all 10 files from the source folder and moves them to the destination folder after renaming them, without the pattern and new name being hard-coded in the script; instead it should fetch those details from that Excel sheet, or I can save the details in a text file, whichever is more suitable for the script to pick the name and pattern from.

Configuring sink dataset in Azure Data Factory

I am trying to copy multiple folders with their files (.dat and .csv) from FTP to an Azure storage account, so I am using Get Metadata, ForEach, and Copy activities. My problem is that, when setting the file path in the output dataset, I am not sure how to set the filename so it picks up all files in my folder.
I added a filename parameter in the dataset and in the Copy Data sink I set it as @item().name, but it's not working: instead of copying the files, it copies the folder. On the second try I didn't set the filename in the directory, and it does copy the files, but it adds the extension .txt to the files instead of keeping their original format.
Thank you for your help,
You need to add a third parameter to the sink dataset for the filename.
Here you can pass a parameter just as you have for container and folder.
The filename can be set from the Get Metadata activity output.
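As a sketch, assuming the sink dataset now has container, folder, and filename parameters, the file name part of the dataset's file path would reference @dataset().filename, and inside the ForEach the Copy activity's sink dataset properties would pass:
@item().name
where item() is the current child item returned by the Get Metadata activity's childItems output, so each file keeps its original name and extension.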

How to parameterise the Dataset definition filename in Azure Data Factory

I have a blob storage container folder (source) that gets several csv files. My task is to pick the csv files starting with "file". See an example filename below:
file12345.csv
The numeric part varies every time.
I have set the "fixed" Container and Directory names in the image below but it seems the File parameter does not accept wildcard "File*.csv".
How can I pass a wildcard to the Dataset definition?
Thanks
You can't do that operation in the Source dataset.
Just choose the container or folder in the dataset like below:
Choose the Wildcard file path in the Source settings:
This will help you filter by the filename wildcard "File*.csv".
Ref: Copy activity properties.
Hope this helps.
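For reference, a minimal sketch of the "source" fragment inside the Copy activity's typeProperties once the wildcard is set (the folder name "input" is only a placeholder):
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": "input",
        "wildcardFileName": "file*.csv"
    },
    "formatSettings": { "type": "DelimitedTextReadSettings" }
}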

Copy multiple .gz files from one GCS bucket to another bucket in Java

I want to copy multiple .gz files from one GCS bucket to another. The file name pattern has the prefix 'Logs_' and a date suffix like '20160909', so the full file names are Logs_20160909.gz, Logs_20160908.gz, etc. I want to copy all files starting with Logs_ from one GCS bucket to another GCS bucket. For this I am using the wildcard character * at the end, like Logs_*.gz, for the copy operation as below:
Storage.Objects.Copy request =
    storageService
        .objects()
        .copy("source_bucket", "Logs_*.gz", "destination_bucket", ".", content);
Above I am using "." because all files have to be copied to destination_bucket, so I can't specify a single file name there. Unfortunately, this code doesn't work and errors out saying the file doesn't exist. I am not sure what change is required here. Any Java link or any piece of code will be helpful. Thanks!
While the gsutil command-line utility happily supports wildcards, the GCS APIs themselves are lower-level and do not. The storage.objects.copy method must have one precise source and one precise destination.
I recommend one of the following:
Use a small script invoking gsutil, or
Make a storage.objects.list call to get the names of all matching source objects, then iterate over them, calling copy for each, or
If you're dealing with more than, say, 10 TB or so of gzip files, consider using Google's Cloud Storage Transfer Service to copy the files.