Copy each '.txt' file into its respective date folder based on the date in the filename using Data Factory - azure-data-factory

I have to copy files from a source folder to a target folder; both are in the same storage account (ADL). The files in the source folder are in .txt format and have a date embedded in the file name,
eg: RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
and
RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
(20221201 and 20221202 are the dates in the file names; date format: yyyyMMdd)
I have to create a pipeline that will sort and store the files in ADL folders in this hierarchy,
ex: adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
Even if multiple files carry the same date in the file name, the pipeline has to create a year (YYYY) folder, inside it a month (MM) folder, and inside that a day (DD) folder, as in the example above. Each file should be copied into its respective yyyy/MM/dd folder.
What I have done:
In Get Metadata - the field list argument is set to extract **Child items**.
A ForEach activity that contains a Copy activity.
In the Copy activity source, the wildcard path is given as *.txt.
For the sink, I used a concat expression with the split and substring functions.
Please check the screenshots of all activities and expressions.
This pipeline does create the folders based on the date in the file name (like adl/2022/12/01),
but the problem is that it copies all files into every date (DD) folder, like:
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/01/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
1.[GET META to extract child items](https://i.stack.imgur.com/GVYgZ.png)
2.[Giving GET META output to FOREACH](https://i.stack.imgur.com/cbo30.png)
3.[Inside FOREACH using COPY ](https://i.stack.imgur.com/U5LK5.png)
4.[Source Data Set](https://i.stack.imgur.com/hyzuC.png)
5.[Sink Data Set](https://i.stack.imgur.com/aiYYm.png) Expression used in the dataset Folder Path: @concat('adl','/',dataset().FolderName)
6.[Took parameter for Sink](https://i.stack.imgur.com/QihZR.png)
7.[Sink in copy activity ](https://i.stack.imgur.com/4OzT5.png)
Expression used in the sink for the dynamic folders, using the split and substring functions:
@concat(substring(split(item().name,'.')[3],0,4),'/',
substring(split(item().name,'.')[3],4,2),'/',
substring(split(item().name,'.')[3],6,2)
)
**OUTPUT of this pipeline**
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/01/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
**Required Output is**
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
(i.e. each file should be copied only to its respective date folder; even if multiple files carry the same date, each should go to the date folder matching the date in its own file name)

I have reproduced the above and got the same result when I followed the steps you have given.
The Copy activity behaves like this because you did not give @item().name (the file name for that particular iteration) in the source or sink, and you gave *.txt as the wildcard path of the source in the Copy activity.
That means for every iteration (for every file name) it copies all .txt files from the source into that iteration's target folder (which is what happened for you).
To avoid this:
Give @item().name as the source wildcard file name.
This way only the current iteration's file name is used as the source of the copy.
(OR)
Keep the wildcard file name in the source as it is (*.txt), create a sink dataset parameter for the file name,
and give @item().name to it in the Copy activity sink.
You can do either of the above, or both at once. I have checked all 3 scenarios:
1. @item().name as the source wildcard file name.
2. @item().name in the sink dataset file name parameter, keeping the wildcard path the same.
3. Combining both 1 and 2 (@item().name in the wildcard file name and in the sink dataset parameter).
All are working fine and give the desired result.
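For reference, here is a minimal sketch of the expressions involved, assuming the sink dataset has two parameters named FolderName and FileName (illustrative names; adjust to your own dataset):
Source wildcard file name:
@item().name
Sink FolderName parameter value in the Copy activity:
@concat(substring(split(item().name,'.')[3],0,4),'/',substring(split(item().name,'.')[3],4,2),'/',substring(split(item().name,'.')[3],6,2))
Sink FileName parameter value in the Copy activity:
@item().name
With this, RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt resolves to folder 2022/12/01 and is copied there only.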

Related

Requirement of an error-logging mail (the mail should contain all the missing file details) in the ADF flow

The requirement is simple: I have a folder containing 4 txt files (1.txt, 2.txt, 3.txt, 4.txt). The flow is controlled by a parameter called "all or some", which is of string type.
If I select "all" in the parameter, all 4 files should be processed. The requirement starts here >>
If any file is missing from the folder (for example, 2.txt and 3.txt are not present and I selected ALL in the parameter), I need a mail saying that 2.txt and 3.txt are missing.
If I select "some" in the parameter, for example 1.txt and 4.txt, and any of those files is missing (for example 1.txt is missing), I need a mail with the missing file name (i.e. 1.txt in our case).
Capture the missing file details in one variable
I tried to reproduce capturing the missing files using Azure Data Factory. Below is the approach.
Take a parameter of array type in the pipeline. At runtime, you can give this array parameter the list of file names in the folder that are to be processed.
Take a Get Metadata activity and add a dataset to it. Click +New in the field list and select Child items as the argument.
Take a Filter activity, give the array parameter value in Items, and write the condition to filter out the missing files in the Condition box.
Items:
@pipeline().parameters.AllorSome
Condition:
@not(contains(string(activity('Get Metadata1').output.childItems),item()))
I ran this pipeline. At run time, four file names were given to the array parameter.
The Get Metadata activity output has three file names.
The parameter has 4 file names while the Get Metadata activity returns 3; the missing file name is what gets filtered out.
The output of the Filter activity contains the missing file names. Use this output and send it in the email.
Refer to the MS document How to send email - Azure Data Factory & Azure Synapse | Microsoft Learn for sending the email.
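As a minimal sketch of building the email body (the activity name Filter1 comes from this repro; the rest is an assumption), the filtered array can be flattened into a single string with a join expression:
@join(activity('Filter1').output.Value, ', ')
For the example above this would yield something like "2.txt, 3.txt" when those two files are missing.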

How to fetch a file path dynamically using PySpark

I have multiple files in my folder. I want to pattern-match to see whether a given file is present, and if it is, store the whole file path in a variable.
How do I achieve this in PySpark?
Since you want to store the whole path in a variable, you can achieve this with a combination of dbutils and regular-expression pattern matching.
We can use dbutils.fs.ls(path) to list the files present in a folder (storage account or DBFS). Assign its return value to a variable called files.
# my sample path - a mounted storage account folder
files = dbutils.fs.ls("/mnt/repro")
Loop through this list. Now using Python's re.match() you can check if the current item's file name matches your pattern. If it matches, append its path to your result variable (list).
from re import match

matched_files = []
for file in files:
    # print(file)
    if match("sample.*csv", file.name):  # "sample.*csv" is the pattern to be matched
        matched_files.append(file.path)
# print("Matched files: ", matched_files)
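If dbutils is not available (outside Databricks), a similar listing can be sketched with plain Python, assuming the folder is also reachable as a local path; the /dbfs prefix below is an illustrative local view of the same mount:
import os
import re

folder = "/dbfs/mnt/repro"  # hypothetical local view of the mounted folder
matched_files = [os.path.join(folder, f)
                 for f in os.listdir(folder)
                 if re.match(r"sample.*csv", f)]  # same pattern as above
print("Matched files:", matched_files)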

Copy a dynamically generated filename and paste it into another folder

I am held up with a task where I need to create a CSV file with its name generated at run time, and then copy that same file and paste it into a different folder. I'm able to create the required file.
Here is what I've done till now:
In SSIS I'm using a DFT in the control flow, taking a view as my OLE DB source, then pointing it to a Flat File destination and creating a file in my desired location, say folder x. Here are the variables I've created:
My_dest_folder, of type string, with my folder's path as the value.
Filename, of type string, with a name, say cv99351_, as the value.
Timestamp, of type string, with an expression that generates a timestamp in YYYYMMDDHHMISS format.
Archivefolder, of type string, with another path into which the generated file is supposed to be copied from My_dest_folder.
In the connection string of my flat file connection manager, I have given the variables as @My_dest_folder + @Filename + @Timestamp + ".csv", which creates a file with a name like cv99351_<timestamp>.csv in folder x.
After the file is created, I am trying to reconstruct the file name in My_dest_folder, but since the timestamp also contains seconds I am not able to capture it every time.
Can someone please help me out here? I would really appreciate it.
If someone wants to save files with SSIS, your description is already nice and could be used as a tutorial :)
But if I understand well, you have a problem at the end of your process, when you try to get the generated filename.
To read it you use the same variable concatenation, but sometimes your Timestamp changes in between and then you get an error (your file doesn't exist).
If so, I guess you use a kind of GETDATE() function in the expression of your variable. It appears that SSIS re-evaluates the value of your variable each time you request it.
I tested it:
I ran 3 INSERT statements and waited in the debugger between each.
It gave me 3 different values.
I recommend that you not use the GETDATE() function in the variable expression.
You can retrieve the timestamp once with a single Execute SQL Task (with a SELECT GETDATE() query) or with a C# / VB method.
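As a minimal sketch of the C# route (the variable name User::Timestamp comes from the question; everything else is illustrative), a Script Task can evaluate the timestamp exactly once and store it, so every later read returns the same value:
public void Main()
{
    // Evaluated a single time; stable for the rest of the package run.
    Dts.Variables["User::Timestamp"].Value =
        DateTime.Now.ToString("yyyyMMddHHmmss");
    Dts.TaskResult = (int)ScriptResults.Success;
}
Downstream expressions can then concatenate @My_dest_folder + @Filename + @Timestamp + ".csv" without the value drifting between reads.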
Does it solve your problem?
Regards,
Arnaud
I had a similar issue.
1. There was a specifically named file in the source folder.
@[User::v_Orig_FileName] : @[$Project::v_FilePath] + @[$Project::v_FileName]
2. It was renamed with a timestamp using GETDATE():
@[User::v_Archive_FileName] : @[$Project::v_FilePath] + "\\" + REPLACE(REPLACE(REPLACE(@[$Project::v_FileName], ".csv", SUBSTRING((DT_WSTR,50)GETDATE(),1,19) + ".csv"), ":", ""), " ", "_")
Source variable: @[User::v_Orig_FileName]
Destination variable: @[User::v_Archive_FileName]
3. The file was moved into the archive folder. To get the source file name, I used exactly the same variable as the destination variable of step 2.
@[User::v_Archive_Folder] : @[$Project::v_FilePath] + "\\Archive"
@[User::v_Archive_ArchivedFileName] : @[User::v_Archive_Folder] + "\\" + REPLACE(REPLACE(REPLACE(@[$Project::v_FileName], ".csv", SUBSTRING((DT_WSTR,50)GETDATE(),1,19) + ".csv"), ":", ""), " ", "_")
Source variable: @[User::v_Archive_FileName]
Destination variable: @[User::v_Archive_ArchivedFileName]
If the timestamps of step 2 and step 3 differ by even a second, there is an error because, as pointed out above, GETDATE() is re-evaluated each time it is requested.
So the solution I came up with was swapping step 2 and step 3:
1. There was a specifically named file in the source folder.
@[User::v_Orig_FileName] : @[$Project::v_FilePath] + @[$Project::v_FileName]
2. The file was moved into the archive folder.
@[User::v_Archive_Folder] : @[$Project::v_FilePath] + "\\Archive"
@[User::v_Archive_OrigFileName] : @[User::v_Archive_Folder] + "\\" + @[$Project::v_FileName]
Source variable: @[User::v_Orig_FileName]
Destination variable: @[User::v_Archive_OrigFileName]
3. It was renamed with a timestamp using GETDATE():
@[User::v_Archive_FileName] : @[User::v_Archive_Folder] + "\\" + REPLACE(REPLACE(REPLACE(@[$Project::v_FileName], ".csv", SUBSTRING((DT_WSTR,50)GETDATE(),1,19) + ".csv"), ":", ""), " ", "_")
Source variable: @[User::v_Archive_OrigFileName]
Destination variable: @[User::v_Archive_FileName]
With the swap, GETDATE() is used only in the very last rename, so no two operations depend on it evaluating to the same value. Hope this gives you an idea for a different spin on this issue.

Using a context variable with fixed values

In my Talend job I have a context variable named context.TempFolder.
Now, while copying data from a SQL table to an Excel file, I need to create an Excel file named export.excel (a fixed name) in the folder specified by the variable context.TempFolder.
How do I specify the 'File Name' of my tFileOutputExcel component?
The value of the context variable TempFolder might change, but I will always be creating the Excel file with the same name, export.excel.
You just need to concatenate context.TempFolder with your output file name.
So the file path for your tFileOutputExcel should look something like:
context.TempFolder + "export.excel.xls"
You can use variables and strings like this in a lot of places in Talend. To do something slightly more complicated, you might define the output file name in your job (i.e. calculate it at run time), put that file name in the globalMap, and then retrieve it when you output your file, ending up with something like:
context.OutputFolder + (String)globalMap.get("FileName") + ".xls"
This is useful for date-time stamping files, for example, or for deriving the file name from some of the data in your input.
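As a minimal sketch of that globalMap approach (the key "FileName" and the timestamp format are illustrative assumptions), a tJava component earlier in the job could compute the name once:
// Compute the file name at run time and share it via the globalMap.
globalMap.put("FileName",
    "export_" + new java.text.SimpleDateFormat("yyyyMMddHHmmss").format(new java.util.Date()));
Any later component can then read it back with (String)globalMap.get("FileName"), as in the path expression above.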

MATLAB: multiple folders

I have one directory with 50 folders, and each folder has 50 files. I have a script to read all the files in each folder and save the results, but I need to type the folder name every time. Is there a loop or batch tool I can use? Any suggestions or code greatly appreciated.
There may be a cleaner way to do it, but the output of the dir command can be assigned to a variable. This gives you a struct array whose pertinent fields are name and isdir. For instance, assuming that the top-level directory (the one with 50 folders) only has folders in it, the following gives you the first folder's name:
folderList = dir();
folderList(3).name
(Note that the first two entries in the folderList struct will be "." (the current directory) and ".." (the parent directory), so the first directory with files in it is the third entry.) If you wish to go through the folders one by one, you can do something like the following:
folderList = dir();
for i = 3:length(folderList)
    curr_directory = pwd;
    cd(folderList(i).name); % change into the next folder
    % operate on the files as if you were in that directory
    cd(curr_directory);     % return to the top-level directory
end
If the top-level directory contains files as well as folders, then you need to check the isdir field of each entry in the folderList struct: if it is 1, the entry is a directory; if it is 0, it is a file.
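Here is a minimal sketch of that isdir check (the '*.txt' pattern is an illustrative assumption); it also skips "." and ".." by name rather than relying on their position:
folderList = dir();
for i = 1:length(folderList)
    % Process only real subfolders, skipping '.' and '..'.
    if folderList(i).isdir && ~ismember(folderList(i).name, {'.', '..'})
        files = dir(fullfile(folderList(i).name, '*.txt')); % files inside the subfolder
        % operate on 'files' here without cd-ing into the folder
    end
end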