Azure Factory v2 Wildcard - azure-data-factory

I am trying to create a new dataset in ADF that looks for csv files that meet a certain naming convention. These files are located within a series of different folders in my Azure Blob Storage.
For instance, in the sample directory below, I am trying to pull out csv files that contain the word "cars".
Folder A
    fastcars.csv
    fasttrucks.csv
Folder B
    slowcars.csv
    slowtrucks.csv
Ideally, I would end up with the files "slowcars.csv" and "fastcars.csv". I've seen examples out there where people were able to wildcard the file name. I have been playing around with that, but have had no luck. (See image below for one example of what I have been doing.)
Is what I am trying to do even possible? Would appreciate any advice you guys may have. Please let me know if I can provide further clarification.

According to the description of filename in this documentation,
The file name under the given fileSystem + folderPath. If you want to
use a wildcard to filter files, skip this setting and specify it in
activity source settings.
so you need to specify the wildcard in the activity's source settings, not in the dataset's file path.
An easy sample in a copy activity:
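Roughly, the source settings would look like this (a sketch only; it assumes a delimited text dataset pointing at the container above, with the dataset's file name left blank):

File path type:       Wildcard file path
Wildcard folder path: *             (search every folder, i.e. Folder A and Folder B)
Wildcard file name:   *cars*.csv    (only csv files whose names contain "cars")
Recursively:          true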
Hope this can help you.


Azure Data Factory Cannot Read Metadata Folder

I hope you all stay healthy and strong during the COVID-19 pandemic.
I have a question about Azure Data Factory. I have created a pipeline with a Get Metadata activity, with the details below:
I have files in a folder and subfolders like this:
I have a Get Metadata activity followed by a ForEach, with the first Get Metadata reading the child items (in the folder) like this:
Get Metadata with Last modified configured like this (if you set it up like this, the metadata activity only reads the last modified subfolder):
After that I add a variable and use @item().Name to read the files in that folder, like this:
After running it against a folder that contains subfolders, I get an error like this:
The error says that @item().Name cannot read the subfolders in that folder. The metadata for each file succeeds, but the activity fails like this because it cannot read the metadata of the subfolders.
Many thanks in advance for any answers. Thank you!
If you need to access the folder, create a clone of the same dataset and set up parameters as below, leaving the file field empty.
If you need to access the files inside a directory, use the condition @equals(item().type,'Folder') to identify directories, and inside that use the dataset with parameters for the directory and file.
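A minimal sketch of that setup (the dataset, parameter, and folder names here are assumptions, not from the original post):

Cloned dataset parameters: directory (String), file (String, default '')
Dataset folder path:       @dataset().directory
Dataset file name:         @dataset().file

Inside the ForEach over the first Get Metadata's childItems:
If Condition expression: @equals(item().type, 'Folder')
True branch: run a Get Metadata (or further activities) on the cloned dataset with
    directory = @concat('Folder', '/', item().name)
    file      = ''    (left empty so it lists the subfolder's child items)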

How to load data from last modified files within one day from subfolders in Azure Data Flow

I have the following directory structure on an Azure container:
-dwh-prod
    -Main_Folder
        -2021-01
            -file1.parquet
        -2021-02
            -file2.parquet
            -file3.parquet
where the data is partitioned by year and month into subfolders. Within these subfolders I have my data files. I want to load into my data flow only the latest files, i.e. those added within one day of running my data flow pipeline.
I tried using currentUTC() in End Time and subtracting one day -> AddDays(currentUTC(), -1) in Start Time in the 'Filter by last modified' option provided in the source options, but it didn't work.
I also tried using currentTimestamp() instead but to no avail.
How do I go about solving this?
Your expression is correct. Please change the folder path from MainFolder to Main_Folder in your dataset and set Main_Folder/*/*.parquet as your Wildcard paths in your source options. Then it will work.
I think your solution is close, but I'm not sure the folder name is sufficient. I'm also not familiar with "currentUTC". The correct function should be utcNow.
Below is an outline of how I would approach this problem.
Source Dataset
Add a Parameter for the subfolder (year-month):
and then set the Folder path to an expression like:
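For example, assuming the parameter is named subfolder:

@concat('Main_Folder/', dataset().subfolder)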
Pipeline
You could either pass in the subfolder or calculate it at runtime. My preference would be to pass it in as a parameter:
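For instance, the pipeline parameter (here assumed to be called subfolder) could be given a fixed value such as 2021-02, or computed at trigger time with an expression like:

@formatDateTime(utcNow(), 'yyyy-MM')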
I would then add variables to calculate the start and end times. Since you are running this daily, I would be sure to force the time to the START of the day(s). This should handle any vagaries based on run time. Also, I would use the built in getPastTime function:
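A sketch of those variables (the names startTime and endTime are my own):

startTime = @startOfDay(getPastTime(1, 'Day'))
endTime   = @startOfDay(utcNow())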
Now use these objects in your Source configuration:
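Roughly, in the source's Filter by last modified settings (shown here as pipeline expressions; for a mapping data flow source the same values would be passed in through data flow parameters):

Start time (UTC): @variables('startTime')
End time (UTC):   @variables('endTime')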

Is there a way to know filenames generated with MultiResourceItemWriter?

I'm writing a spring-batch application with spring-boot support and I'm looking for a way to know which files were generated by MultiResourceItemWriter. The first solution I have in mind is to have a folder containing only the generated files and check its contents, but if there is something already implemented in spring-batch, that would be great!
The intention is to encrypt and then upload each file to an sftp server.
The file names generated by the MultiResourceItemWriter are the combination of the resource name + the suffix created by the ResourceSuffixCreator. For example, if you create the writer like the following:
MultiResourceItemWriter<String> writer = new MultiResourceItemWriter<>();
writer.setResource(new FileSystemResource(new File("data.txt")));
writer.setResourceSuffixCreator(index -> ".part" + index); // the suffix is appended directly to the resource path
Then the generated files will be data.txt.part1, data.txt.part2, etc.
MultiResourceItemWriter doesn't perform writes directly but delegates this job to other components.
All those components are ResourceAwareItemWriterItemStream implementors, so you can write a delegate that implements ResourceAwareItemWriterItemStream, intercept the setResource() method, and store the resources in the current step's execution context as a collection.
If you want to pass this list of resources to the next steps, you can use an ExecutionContextPromotionListener.
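A minimal sketch of such a delegate (the class name and the "writtenFiles" key are my own, and the signatures assume Spring Batch 4.x):

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.file.ResourceAwareItemWriterItemStream;
import org.springframework.core.io.Resource;

public class ResourceTrackingWriter<T> implements ResourceAwareItemWriterItemStream<T> {

    private final ResourceAwareItemWriterItemStream<T> delegate;
    private final List<String> writtenFiles = new ArrayList<>();

    public ResourceTrackingWriter(ResourceAwareItemWriterItemStream<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void setResource(Resource resource) {
        // called by MultiResourceItemWriter each time it rolls over to a new file
        writtenFiles.add(resource.getFilename());
        delegate.setResource(resource);
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        delegate.open(executionContext);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // store the list so an ExecutionContextPromotionListener can promote it to the job context
        executionContext.put("writtenFiles", new ArrayList<>(writtenFiles));
        delegate.update(executionContext);
    }

    @Override
    public void close() throws ItemStreamException {
        delegate.close();
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        delegate.write(items);
    }
}

You would wrap your actual writer (e.g. a FlatFileItemWriter) in this class and set the wrapper as the MultiResourceItemWriter's delegate.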

How can I rename file uploaded to s3 using javascript api?

The 'pickAndStore' method allows me to specify the full path to the file, but I don't know its extension at that point (the file path has to be defined before the file is uploaded, so it's not possible to provide a path with the correct extension).
If I use 'pick' and then 'store', I end up with two files (because both methods upload the file to S3). I can delete the 'old' file, but that's not optimal and can be a pain (take ages) with really big files.
Is there any better solution? Ideally to rename existing file.
Currently, there is no workaround for renaming a file.
However, in our JavaScript API v2 we are planning to add a new callback function. The onStart callback will be fired after the user picks a file but before the file is uploaded. There could be an option to rename the file based on the original filename.
We will keep you updated.

Connecting cleansing components to tFileList - Talend

What is the best way to apply logic to objects during an iteration of tFileList?
The issue is that if I use a tFileList to get a list of files, I am not able to use tJavaRow or tMap to create the filename that I want the file to be renamed to. Basically, I have zip files named by year (2010, 2011, 2012, etc.) and each zip file contains files with the same names (f1.csv, f2.csv, f3.csv). I want to iterate through the compressed files, uncompress them, and rename the files to
f1_2010.csv, f2_2010.csv, f3_2010.csv ... f1_2012.csv, etc.
Thanks!
Iterate links provide a way to execute components based on events or facts, while main links transfer data between components.
With something like the following, you should be able to resolve your problem:
tFileList_1 --iterate--> tFileUnarchive_1
       |
 onComponentOK
       |
tFileList_2 --iterate--> tFileCopy_1
       |
 onComponentOK
       |
tFileArchive_1
Use ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) in your tFileUnarchive to get the ZIP path.
In tFileCopy use ((String)globalMap.get("tFileList_2_CURRENT_FILEPATH")) to get the path of the file, and configure it to do a rename.
For your name modification you can add a tJava on the "onComponentOK" links, using globalMap.put("year", ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).substring(x,x)) or more complicated code, and then use these variables in your other components' parameters.
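For instance, assuming the zip files are named like 2010.zip (so the year can be taken from the file name instead of hard-coded substring offsets), the tJava code could look like this:

// tJava on the onComponentOK link after tFileList_1: derive the year from the current zip path
String zipPath = (String) globalMap.get("tFileList_1_CURRENT_FILEPATH"); // e.g. /data/in/2010.zip
String zipName = new java.io.File(zipPath).getName();                    // 2010.zip
globalMap.put("year", zipName.replaceAll("\\.zip$", ""));                // 2010

and in tFileCopy's destination filename, something like:

((String) globalMap.get("tFileList_2_CURRENT_FILE")).replace(".csv", "_" + globalMap.get("year") + ".csv")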