Connecting cleansing components to tFileList - Talend

What is the best way to apply logic to objects during an iteration of tFileList?
The issue is that when I use a tFileList to get a list of files, I am not able to use tJavaRow or tMap to build the new name for each file. Basically, I have zip files named by year (2010, 2011, 2012, etc.), and each zip file contains files with the same names (f1.csv, f2.csv, f3.csv). I want to iterate through the compressed files, uncompress them, and rename the files to
f1_2010.csv, f2_2010.csv, f3_2010.csv ... f1_2012.csv, etc.
Thanks!

Iterate links execute components based on events or facts, while main links transfer data between components.
With something like this you should be able to solve your problem:
tFileList_1 --iterate--> tFileUnarchive_1
|
onComponentOK
|
tFileList_2 -- iterate --> tFileCopy_1
|
onComponentOK
|
tFileArchive_1
Use ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) in your tFileUnarchive to get the ZIP path.
In tFileCopy, use ((String)globalMap.get("tFileList_2_CURRENT_FILEPATH")) to get the path of the file, and configure the component to rename it.
For the name modification, you can add a tJava on the "onComponentOK" links and store a value with something like globalMap.put("year", ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).substring(start, end)), where start and end are placeholders for the position of the year in the path, or with more elaborate code. You can then use these variables in the parameters of your other components.
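As a rough sketch of the "more elaborate code" idea, the helper below pulls a four-digit year out of the archive's file name with a regular expression instead of fixed substring positions, and builds the renamed file name. The class name YearRename, both method names, and the sample path are made up for illustration, and it assumes the zip file name actually contains the year (e.g. 2010.zip).

```java
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend.
final class YearRename {
    // Assumption: the archive file name contains a four-digit year (19xx or 20xx).
    private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");

    /** Returns the first four-digit year in the archive's file name, or null if none is found. */
    static String yearOf(String zipPath) {
        String name = Paths.get(zipPath).getFileName().toString();
        Matcher m = YEAR.matcher(name);
        return m.find() ? m.group() : null;
    }

    /** Inserts "_<year>" before the extension, e.g. f1.csv becomes f1_2010.csv. */
    static String renamed(String fileName, String year) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return fileName + "_" + year; // no extension: just append the suffix
        }
        return fileName.substring(0, dot) + "_" + year + fileName.substring(dot);
    }

    public static void main(String[] args) {
        String year = yearOf("/data/in/2010.zip"); // sample path, illustrative only
        System.out.println(renamed("f1.csv", year)); // prints f1_2010.csv
    }
}
```

In a job, the same logic could sit in a tJava that reads tFileList_1_CURRENT_FILEPATH from globalMap and stores the result under a key such as "year", as described above.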

Related

Azure Data Factory - Check If Any Zip File Exists

I am trying to check if any zip file exists in my SFTP folder. The GetMetadata activity works fine if I explicitly provide the file name, but I can't know the file name in advance because it is embedded with a timestamp and a sequence number, which are dynamic.
I tried specifying *.zip, but that never works: the GetMetadata activity always returns false even though the zip file actually exists. Is there any way to get this to work? Suggestions, please.
Sample file name as below, in this the last part 0000000004_20210907080426 is dynamic and will change every time:
TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426
You could possibly do a Get Metadata on the folder and include the Child items under the Field List.
You'll have to iterate with a ForEach using the expression
@activity('Get Folder Files').output.childItems
and then check if item().name (within the ForEach) ends with '.zip'.
I know it's a pain when the wildcard stuff doesn't work for a given dataset, but this alternative ought to work for you.
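To make the check itself concrete, here is a small Java sketch of the logic the ForEach would apply to each child item name. It is not ADF code, only an illustration. The class name ZipCheck is hypothetical, the ".zip" suffix on the sample name is assumed, and the case-insensitive comparison is a choice made for this sketch.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the "does any child item end with .zip" check.
final class ZipCheck {
    /** True if at least one name ends with ".zip" (compared case-insensitively). */
    static boolean anyZip(List<String> childNames) {
        return childNames.stream()
                .anyMatch(n -> n.toLowerCase(Locale.ROOT).endsWith(".zip"));
    }

    public static void main(String[] args) {
        // Sample names; the ".zip" extension on the first one is assumed.
        List<String> names = List.of(
                "TEST_TEST_9999_OK_TT_ENTITY_0000000004_20210907080426.zip",
                "readme.txt");
        System.out.println(anyZip(names));                // prints true
        System.out.println(anyZip(List.of("a.zip.bak"))); // prints false
    }
}
```

Testing the end of the name, rather than testing whether the name contains ".zip" anywhere, avoids false positives on names such as a.zip.bak.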
If you are using exists in the Get Metadata activity, you need to provide the file name in it.
As a workaround, you can get the child items (with filename *.zip) using the Get Metadata activity.
Pass the output to If Condition activity, to check if the required file exists.
@contains(string(json(string(activity('Get Metadata1').output.childItems))),'.zip')
You can use other activities inside True and False activities based on If Condition.
If no file exists, or no child items are found by the Get Metadata activity, the expression evaluates to false and the False activities run.
For an SFTP dataset, if you want to use a wildcard to filter files under the specified folderPath, you have to skip the file name setting in the dataset and specify the wildcard in the activity source settings.
However, a wildcard filter on folders/files is not supported for the Get Metadata activity.

how to Load data from last modified files within one day from subfolders Azure Data Flow

I have the following directory structure on an Azure container:
-dwh-prod
  -Main_Folder
    -2021-01
      -file1.parquet
    -2021-02
      -file2.parquet
      -file3.parquet
where the data is partitioned by year and month to create subfolders. Within these subfolders, I have my data files. I want to load into my data flow only the latest files, that is, those added within one day before my data flow pipeline runs.
I tried using currentUTC() in End Time and subtracting one day -> AddDays(currentUTC(), -1) in Start Time in the 'Filter by last modified' option provided in source options but it didn't work.
I also tried using currentTimestamp() instead but to no avail.
How do I go about solving this?
Your expression is correct. Please change the folder path from MainFolder to Main_Folder in your dataset and set Main_Folder/*/*.parquet as the Wildcard paths in your Source options. Then it will work.
I think your solution is close, but I'm not sure the folder name is sufficient. I'm also not familiar with "currentUTC". The correct function should be utcNow.
Below is an outline of how I would approach this problem.
Source Dataset
Add a Parameter for the subfolder (year-month):
and then set the Folder path to an expression like:
Pipeline
You could either pass in the subfolder or calculate it at runtime. My preference would be to pass it in as a parameter:
I would then add variables to calculate the start and end times. Since you are running this daily, I would be sure to force the time to the START of the day(s). This should handle any vagaries based on run time. Also, I would use the built in getPastTime function:
Now use these objects in your Source configuration:
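The date arithmetic behind the start and end variables can be sketched in Java as below. This is not ADF expression code. It assumes the window runs from the start of yesterday (UTC) to the start of today (UTC), and that the year-month subfolder is derived from the window start. The class name ModifiedWindow and its methods are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.ZoneOffset;

// Hypothetical sketch of a one-day "last modified" window aligned to day boundaries.
final class ModifiedWindow {
    final Instant start; // inclusive
    final Instant end;   // exclusive

    ModifiedWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    /** Builds the window [start of yesterday, start of today) in UTC for a given run time. */
    static ModifiedWindow forRun(Instant runTime) {
        LocalDate today = runTime.atZone(ZoneOffset.UTC).toLocalDate();
        Instant end = today.atStartOfDay(ZoneOffset.UTC).toInstant();
        Instant start = today.minusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant();
        return new ModifiedWindow(start, end);
    }

    /** True if the file's last-modified time falls inside the window. */
    boolean contains(Instant lastModified) {
        return !lastModified.isBefore(start) && lastModified.isBefore(end);
    }

    /** Year-month subfolder name (e.g. 2021-02) for a point in time, in UTC. */
    static String subfolder(Instant t) {
        return YearMonth.from(t.atZone(ZoneOffset.UTC)).toString();
    }

    public static void main(String[] args) {
        ModifiedWindow w = forRun(Instant.parse("2021-02-15T09:30:00Z"));
        System.out.println(w.start);            // prints 2021-02-14T00:00:00Z
        System.out.println(w.end);              // prints 2021-02-15T00:00:00Z
        System.out.println(subfolder(w.start)); // prints 2021-02
    }
}
```

Aligning both bounds to midnight means the result does not depend on the exact minute at which the daily run starts.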

Azure Factory v2 Wildcard

I am trying to create a new dataset in ADF that looks for csv files that meet a certain naming convention. These files are located within a series of different folders in my Azure Blob Storage.
For instance, in the sample directory below, I am trying to pull out csv files that contain the word "cars".
Folder A
    fastcars.csv
    fasttrucks.csv
Folder B
    slowcars.csv
    slowtrucks.csv
Ideally, I would end up with the files "slowcars.csv" and "fastcars.csv". I've seen examples out there where people were able to wildcard the file name. I have been playing around with that, but have had no luck. (See image below for one example of what I have been doing).
Is what I am trying to do even possible? Would appreciate any advice you guys may have. Please let me know if I can provide further clarification.
According to the description of filename in this documentation,
The file name under the given fileSystem + folderPath. If you want to
use a wildcard to filter files, skip this setting and specify it in
activity source settings.
so you need to specify the wildcard in the activity source settings, not in the dataset's file path.
An easy sample in a copy activity:
Hope this can help you.
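To show which names a file-name wildcard would select, here is a small Java sketch using the JDK's glob matcher. The pattern *cars.csv is an assumption based on the question, not taken from the answer, and Java glob rules are only an approximation of ADF's wildcard handling. The class name WildcardDemo is hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of filtering file names with a wildcard pattern.
final class WildcardDemo {
    /** Returns the file names that match the given glob pattern. */
    static List<String> matching(String glob, List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return fileNames.stream()
                .filter(n -> matcher.matches(Paths.get(n)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "fastcars.csv", "fasttrucks.csv", "slowcars.csv", "slowtrucks.csv");
        // Assumed pattern: any name ending in "cars.csv".
        System.out.println(matching("*cars.csv", files)); // prints [fastcars.csv, slowcars.csv]
    }
}
```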

How to iterate over files for tMongoDBBulkLoad

What is the best way to iterate over files and feed them into tMongoDBBulkLoad? It seems that you cannot feed into this component from a tFileList component (Iterate), which would make the most sense.
I want to import 80 files, rather than create one massive file which is too large to open in notepad if I have issues during the import.
Thanks
---Update----
I know how to do this with other components; my issue is that I cannot feed an Iterate component into the tMongoDBBulkLoad.
The simplified job will be like this :
tFileList ---------iterate--------tMongoDBBulkLoad
and in the tMongoDBBulkLoad settings you set the Data file to :
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
Here, the tFileList iterates over the files. In each iteration, the tMongoDBBulkLoad is triggered to load the current file, which is indicated by the global variable.
--- Reply for the Update ---
To connect an iterate trigger to the component, you can add a dummy tJava with no code. It will look like this:
tFileList -----(iterate)-----tJava-------(onComponentOk)-------tMongoDBBulkLoad

Jenkins: How can I upload a text file and use it as a parameter

I have a txt file that holds a string, and I want to be able to use this string in one of my scripts. I'm wondering if there is a way to set the content of the file as one of the build properties or parameters, so that I can use it in my scripts the same way as one of the build environment properties.
For example, ${JOB_NAME} holds the job name; in the same way, I want to access the content of the file, which holds some value.
Is it possible?
You can upload a file from your computer to the workspace through the File parameter of the job.
You can use Extended Choice plugin parameter, to read value(s) from a file and display them in a dropdown/radio-button/checkbox for the user to select, dynamically, every time the build is triggered.
You can use EnvInject plugin to read value(s) from a file and inject them into the build as environment variables, so that they can be used by the rest of the build steps/scripts.
Your question is very unclear about what you are trying to do. Pick one of the three methods above based on what you need, or clarify your question.