tFileList catches only one of the 6 files - Talend

I am trying to display results from several files in a directory. I use tFileList and two tFileInputDelimited components, both linked to the tFileList. I don't know why, but at the end of processing my results come from just one of the 6 files I want; it appears they come from the last file listed in the directory.
Each tFileInputDelimited uses ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as its file name.
Here is my tMap:

Your job is set up so that your lookup is iterative, which causes issues: Talend only seems to use the last iteration rather than doing what you might expect and re-running the lookup for every iteration (although that would be more complicated than you might first think).
One option is to rework the job so that the iterated flow is the main input to the tMap rather than the lookup.
Alternatively, you could iterate the data into a tBufferOutput component and then, OnSubjobOk, link the job as before but with the iterative part replaced by a tBufferInput component, since it will hold all of the data from every file iterated through (see the sketch below).
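A rough sketch of that second approach (component names and layout are illustrative, not taken from your job):

Subjob 1: tFileList_1 -(iterate)-> tFileInputDelimited_2 -(main)-> tBufferOutput_1
              |
         OnSubjobOk
              |
Subjob 2: main input -(main)-> tMap_1 <-(lookup)- tBufferInput_1 -(out)-> output

The tBufferInput then feeds the tMap lookup with the accumulated rows from all 6 files rather than just the last iteration.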

Related

Azure Data Factory For Each Loop is importing all my CSV files per iteration instead of just the file name I *think* I've told it to

I could really do with some help with ADF; I've recently started trying to use it, thinking it would be similar to SSIS, but wow am I having a hard time! I've built up this kind of complicated pipeline over the last few weeks which basically reads a list of files from a folder and, from within a For Each loop, is supposed to check where the data starts in each file and import it into a SQL table.
I'll not bore you with all the issues I've had so far, but at the moment it seems to be working aside from the For Each part: it's importing all the files in the folder per iteration. It seems to be the dataset configuration that is not recognising the file name per iteration, because if I look through the debugging I can see it pick up the list of files and set the DSFileName variable to the first of them, yet the output of the data flow task is both files. So it seems like I've missed a step somewhere, and I've just spent the last 5 hours looking and could really do with some help :(
I believe I've followed the instructions here: https://www.sqlshack.com/how-to-use-iterations-and-conditions-activities-in-azure-data-factory/
Some pictures to show the debugging I've done:
Here it shows it's picking up 2 files (after I filtered out folders and the like)
Here it shows only the first file name being passed into the first data flow
Here it shows the output from it, where it has somehow picked up both files and displays a count of 2 files
Here it shows the Data Set setup, where I believe I have correctly set the variable as the file name to be used
I just don't know where to start now, to be honest. I believe I've checked everything I can see, and I'm not using any wildcards or anything. I can see it passing the one file name per iteration into that variable, but on each iteration I see counts for both files going into the table, and the output of each data flow task shows both file counts.
Does anybody have any ideas or know what I've missed?
EDIT 23/07/22: Pics of the source as requested:
Data Source Settings
Data Source Options
So it turns out that adding .name to item() in the dataset parameter means it uses just the current file instead of all of them. I'm confused by this, as all the documentation I've read states that item() references the CURRENT item within the For Each. Did I misunderstand?
Adding .name to the dataset parameter here now imports just the current file per loop iteration
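For what it's worth, item() does reference the current item; the catch (an assumption here, since the pipeline JSON isn't shown) is that when the file list comes from something like a Get Metadata childItems array, each item is an object rather than a plain string, so the dataset parameter needs the .name property to resolve to a file name:

@item()        resolves to the whole item object, e.g. {"name":"file1.csv","type":"File"}
@item().name   resolves to just the file name string, e.g. "file1.csv"

If the parameter never resolves to a valid file name, the dataset's file name is effectively blank, and a blank file name makes the source read every file in the folder, which would explain seeing both files on each iteration.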

Append datasets from xlsx and database - Talend

I have three Excel files and one database connection which I need to append as a part of my flow. All four datasets in the pre-append stage have just one column.
When I try to use tUnite, I get the error for tFileInputExcel - see the screenshot. Moreover, I cannot join the database connection with tUnite.
What am I doing wrong?
I think the problem is with the tFileExist components (I think that's what these are on the left with the "if" links coming out), because each of them is trying to start a new flow. Once you're joining them with the tUnite, there can be only one start to the flow, and that goes to the start of the first branch of the merge order.
You can move the if logic elsewhere. Another idea is to put the output from each of the Excel inputs into a tHashOutput (linked together), then use a tHashInput to write to your DB (see the sketch below).
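A rough sketch of that second idea (component names are illustrative; the second and third tHashOutput are configured with "Link with a tHashOutput" pointing at the first, so all rows land in one cache):

tFileInputExcel_1 -(main)-> tHashOutput_1
tFileInputExcel_2 -(main)-> tHashOutput_2 (linked to tHashOutput_1)
tFileInputExcel_3 -(main)-> tHashOutput_3 (linked to tHashOutput_1)
        |
   OnSubjobOk
        |
tHashInput_1 (reading tHashOutput_1) -(main)-> your DB output component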

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume the data flows should run in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 minutes, the second 6, the third 8, and so on). It seems as if each data flow waits for the previous one to finish and then uses its cluster to process its own file.
Is that intended behaviour? I have a TTL set on the cluster, but that did not help much. If it is intended, what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I will see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel; however, I have managed to change the solution in a way that increases performance.
What I had before: a Lookup activity that got a list of files to process, passed on to a ForEach loop with a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it to a text file in ADLS, then another data flow activity (the one previously inside the ForEach loop) with its source changed to use "List of files" and point at that list.
The result was an increase in efficiency (using the same cluster, 40 files would take around 80 minutes using ForEach and only 2-3 minutes using List of files); however, debugging is not easy now that everything is in one data flow.
You can overwrite the list-of-files file on each run, or use a dynamic expression and name it after the pipelineId or something else (see the example below).
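For example, a sketch of such a dynamic expression for the list file's name (the prefix and extension are illustrative; pipeline().RunId is the ADF system variable holding the current run's unique ID):

@concat('filelist_', pipeline().RunId, '.txt')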

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of Excel files where there is a line of text (and a blank row) above the header row of the table.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And end up with the data from multiple similar files
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic (you can add more files to process, but you need to know how many files in advance of building your job), and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date (a sketch of that parsing follows at the end of this answer).
But here is how my sample works. Each Excel file is set up as two inputs. For the header, one tFileInputExcel is configured to read only the first line, while the body tFileInputExcel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for both the Machine A and Machine B Excel files, then those tMaps are combined with a tUnite for the final output.
As you can see in the tLogRow output, the data is combined and includes the header info.
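To extract the machine name and date from a header line like the one in the question, the header flow could be run through a tJavaRow before the tMap. A minimal sketch, assuming the header arrives in a single String column called line and the output schema has machine and date String columns (all of these names are illustrative):

// Parse "This file was created on machine A on 01/02/2013"
// into separate machine and date fields.
java.util.regex.Matcher m = java.util.regex.Pattern
        .compile("created on (.+) on (\\d{2}/\\d{2}/\\d{4})")
        .matcher(input_row.line);
if (m.find()) {
    output_row.machine = m.group(1); // e.g. "machine A"
    output_row.date = m.group(2);    // e.g. "01/02/2013", still a String here
}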

How to increment a number from a csv and write over it

I'm wondering how to increment a number "extracted" from a field in a csv, and then rewrite the file with the number incremented.
I need this counter in a tMap.
Is the design below a good way to do it?
EDIT: I'm trying a new method. See the design of my subjob below, but I have an error when I link the tJavaRow to my main tMap in the main job:
Exception in component tMap_1
java.lang.NullPointerException
at mod_file_02.file_02_0_1.FILE_02.tFileList_1Process(FILE_02.java:9157)
at mod_file_02.file_02_0_1.FILE_02.tRowGenerator_5Process(FILE_02.java:8226)
at mod_file_02.file_02_0_1.FILE_02.tFileInputDelimited_2Process(FILE_02.java:7340)
at mod_file_02.file_02_0_1.FILE_02.runJobInTOS(FILE_02.java:12170)
at mod_file_02.file_02_0_1.FILE_02.main(FILE_02.java:11954)
2014-08-07 12:43:35|bm9aSI|bm9aSI|bm9aSI|MOD_FILE_02|FILE_02|Default|6|Java Exception|tMap_1|java.lang.NullPointerException:null|1
[statistics] disconnected
You should be able to do this mid-flow in a tMap or a tJavaRow.
Simply read the number in as an Integer (or another numeric data type) and then add your increment to it.
A really simple example might look like this:
Here we have a tFixedFlowInput that has some hard-coded values for the job:
And we run it through a tMap where we add 1 to the age column:
And finally, we output it to the console in a table:
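The tMap expression for that increment is trivial; a minimal sketch, assuming the input flow is named row1 and age is an Integer column (names are from my example, not your job):

// Expression for the age column in the tMap output:
row1.age + 1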
EDIT:
As Gabriele B has pointed out, this doesn't exactly work when reading and writing the same flat file, as Talend claims an exclusive read-write lock on the file when reading and keeps it open throughout the job.
Instead, you would have to write the incremented data somewhere else, such as a temporary file, a database, or even just the buffer, and then read that data into a separate job which would then output the file you want and clean up anything temporary.
The problem with that is you can't do the output in the same process. I've just tried reading in the file in one child job, passing the data back to a parent job using a tBufferOutput, then passing that data to another child job as a context variable and trying to output to the file. Unfortunately the file lock remains on it, so you can't do this all in one self-contained job (even using a parent job and several child jobs).
If this sounds horrible to you (it is) and you absolutely need this to happen (I'd suggest a database table sounds like a better match for this functionality than a flat file) then you could raise a feature request on the Talend Jira for the tFileInputDelimited to not hold the file open or to not insist on an exclusive read-write lock on the file.
Once again, I strongly recommend that you move to using a database table for this because even without the file lock issue, this is definitely not the right use of a flat file and this use case perfectly fits a database, even something as lightweight as an embedded H2 database.
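For illustration only, a minimal sketch of keeping such a counter in an embedded H2 table instead of a flat file (the table and column names are made up, and this assumes the H2 driver is on the job's classpath, e.g. loaded via a tLibraryLoad; the snippet could live in a tJava):

import java.sql.*;

public class CounterSketch {
    public static void main(String[] args) throws SQLException {
        // H2 creates the database file counterdb.mv.db next to the job if it is missing.
        try (Connection c = DriverManager.getConnection("jdbc:h2:./counterdb", "sa", "");
             Statement s = c.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS counter (id INT PRIMARY KEY, value INT)");
            // Increment the counter if it exists, otherwise seed it at 1.
            int updated = s.executeUpdate("UPDATE counter SET value = value + 1 WHERE id = 1");
            if (updated == 0) {
                s.executeUpdate("INSERT INTO counter VALUES (1, 1)");
            }
            try (ResultSet rs = s.executeQuery("SELECT value FROM counter WHERE id = 1")) {
                rs.next();
                System.out.println("Counter is now " + rs.getInt(1));
            }
        }
    }
}

There are no file locks to fight here, and the read-increment-write becomes a single, ordinary database update.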