How to log progress of tasks in Talend Open Studio?

I have some sample jobs that migrate data from one database to another, and I would like some information about the current progress, like what you see when the job is run interactively from the Studio itself (I export the job and run it from the command line).
I use tFlowMeter and tStatCatcher, but all I get is the overall time and the overall number of records passed (e.g. 4657 sec, 50,000,000 rows).
Is there any solution to get a decent log?

One solution is to add a conditional clause to your logging: something that is true for one row in every, let's say, 50,000. This condition using a sequence should work:
Numeric.sequence("log_seq",1,1) % 50000 == 0
You can use the custom component bcLogBack to output your log through an slf4j facade stack. The component has an option called "Conditional logging" that sends the message only when the condition evaluates to true.
Alternatively, if you don't like the idea of installing a custom component, you can end your subjob with the standard tLogRow (or tWarn, tDie or whatever), prefixed by a tFilter using the same expression as its advanced condition. This way you let the stream pass (and the log message be triggered) just once every 50,000 rows. Here's a very basic job diagram:
//---->tMySqlOutput--->tFilter-----//filter--->tWarn (or tLogRow)
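Outside of Talend, the condition boils down to a counter plus a modulo test. A minimal plain-Java sketch of the idea (the interval and the method names here are illustrative, not Talend API):

public class ThrottledLog {
    private static long counter = 0;           // plays the role of Numeric.sequence("log_seq", 1, 1)
    private static final int INTERVAL = 50000; // log once per 50,000 rows

    static void processRow(String row) {
        counter++;
        if (counter % INTERVAL == 0) {         // the tFilter / conditional-logging test
            System.out.println(counter + " rows processed");
        }
        // ... actual per-row work goes here ...
    }

    public static void main(String[] args) {
        for (int i = 0; i < 200_000; i++) {
            processRow("dummy");
        }
    }
}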

As far as I know, tLogRow outputs to the console, so you can easily plug it onto any flow.
If tLogRow isn't enough, you can plug your output into a tJavaFlex component, where you could use something like log4j or any custom output.
You can also use tFileOutputDelimited as a log file. This component has a nice "append" option that works like a charm for this use case.
For the question above, how to obtain the log information:
From experience, I can tell that some components output their flow. For example, tMysqlOutput outputs the successfully inserted rows.
Generally, to log the information I use the tReplicate component, which lets me send a copy of the flow to a log file.
t*Input ---- tReplicate ---- tMap ---- tMySqlOutput (insert in DB)
                 +---------- tMap ---- tFileOutputDelimited (log info)

You can also use tWarn in combination with tLogCatcher:
tMySqlOutput ---- tFilter ---- tWarn
tLogCatcher ---- tMap ---- tLogRow
tFilter prevents you from logging progress on every single row (see Gabriele B's answer). tWarn carries the actual message you want logged.
tLogCatcher picks up the output of all the tWarns, tMap transforms each row from the log catcher into an output row, and tLogRow prints it.
That answer is described in more detail (with pictures): http://blog.wdcigroup.net/2012/05/error-handling-in-talend-using-tlogcatcher/
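For the tWarn message itself, any Java string expression works. For example (a sketch; the "progress" sequence name and the 50,000 interval are assumptions that just mirror the filter above):

"Roughly " + (Numeric.sequence("progress", 1, 1) * 50000) + " rows processed so far"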

Related

How to pass output from a Datastage Parallel job to input as another job?

My requirement is:
Parallel Job1 extracts data from a table.
Parallel Job2 should be triggered in the sequence only when the row count from the source query in Job1 is greater than 0.
I want to achieve this without creating any intermediate file in Job1.
So basically what you want to do is take information from a data stream (of your Job1) and use it in the sequence "above" as a parameter.
In your case, you want to decide at sequence level whether to run subsequent jobs (only if more than 0 rows were returned).
Two options for that:
Job1 writes information to a file which is the value file of a parameter set. These files are stored in a fixed directory. The parameter from the value file can then be used in your sequence to decide on further processing. Details on parameter sets can be found in the IBM DataStage documentation.
You could use a server job for Job1 and set a user status (BASIC function DSSetUserStatus) in a transformer. This is passed back to the sequence and can be referenced in subsequent stages of the sequence. See the documentation, but you will find plenty of other information on this topic on the internet as well.
There are more solutions to this problem, or let us call it a challenge. Another way may be a script called at sequence level which queries the database directly and avoids Job1 altogether...

How to increment a number from a csv and write over it

I'm wondering how to increment a number "extracted" from a field in a CSV, and then rewrite the file with the incremented number.
I need this counter in a tMap.
Is the design below a good way to do it?
EDIT: I'm trying a new method. See the design of my subjob below; I get an error when I link the tJavaRow to my main tMap in the main job:
Exception in component tMap_1
java.lang.NullPointerException
at mod_file_02.file_02_0_1.FILE_02.tFileList_1Process(FILE_02.java:9157)
at mod_file_02.file_02_0_1.FILE_02.tRowGenerator_5Process(FILE_02.java:8226)
at mod_file_02.file_02_0_1.FILE_02.tFileInputDelimited_2Process(FILE_02.java:7340)
at mod_file_02.file_02_0_1.FILE_02.runJobInTOS(FILE_02.java:12170)
at mod_file_02.file_02_0_1.FILE_02.main(FILE_02.java:11954)
2014-08-07 12:43:35|bm9aSI|bm9aSI|bm9aSI|MOD_FILE_02|FILE_02|Default|6|Java Exception|tMap_1|java.lang.NullPointerException:null|1
[statistics] disconnected
You should be able to do this mid flow in a tMap or a tJavaRow.
Simply read the number in as an integer (or other numeric data type) and then add your increment to it.
A really simple example might look like this:
Here we have a tFixedFlowInput with some hard-coded values for the job. We run it through a tMap where we add 1 to the age column, and finally we output it to the console in a table.
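In tMap terms, the increment is just a Java expression on the output column. Assuming the incoming flow is named row1, the expression for the age output column would be something like:

row1.age + 1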
EDIT:
As Gabriele B has pointed out, this doesn't exactly work when reading and writing to the same flat file, because Talend claims an exclusive read-write lock on the file when reading and keeps it open throughout the job.
Instead, you would have to write the incremented data somewhere else, such as a temporary file, a database, or even just the buffer, and then read that data into a separate job which outputs the file you want and cleans up anything temporary.
The problem is that you can't do the output in the same process. I've just tried reading the file in one child job, passing the data back to the parent job using a tBufferOutput, then passing that data to another child job as a context variable and trying to output to the file there. Unfortunately the file lock remains, so you can't do this all in one self-contained job (even using a parent job and several child jobs).
If this sounds horrible to you (it is) and you absolutely need this to happen, then you could raise a feature request on the Talend Jira for tFileInputDelimited not to hold the file open, or not to insist on an exclusive read-write lock on the file (I'd suggest a database table sounds like a better match for this functionality than a flat file).
Once again, I strongly recommend that you move to using a database table for this because even without the file lock issue, this is definitely not the right use of a flat file and this use case perfectly fits a database, even something as lightweight as an embedded H2 database.
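As an illustration only (the JDBC URL, table and column names are assumptions, and the H2 driver must be on the classpath), a persistent counter in an embedded H2 database could look like this:

import java.sql.*;

public class CounterDb {
    public static void main(String[] args) throws SQLException {
        // embedded H2 database stored in a local file; no file-lock issues at job level
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./counter_db", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS counter (id INT PRIMARY KEY, value INT)");
                // seed the single row once; a primary key violation on later runs is expected
                try {
                    st.executeUpdate("INSERT INTO counter VALUES (1, 0)");
                } catch (SQLException alreadySeeded) {
                    // ignore: the counter row already exists
                }
                // increment, then read back the new value
                st.executeUpdate("UPDATE counter SET value = value + 1 WHERE id = 1");
                try (ResultSet rs = st.executeQuery("SELECT value FROM counter WHERE id = 1")) {
                    rs.next();
                    System.out.println("counter is now " + rs.getInt(1));
                }
            }
        }
    }
}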

Tracking the job progress in Talend

I have to copy data from Excel sheets to SQL Server tables.
I want to track my job's progress: I would like an output message saying 'data has been loaded in tableX' after each table completes.
I tried to use tLogRow but it outputs each row being copied.
Which component should I use and how do I do it?
I want my messages to be printed when running from the command line as well.
You can do this by logging to the console in a tJava component after each of your tMSSqlOutput components, linking them with an onComponentOk link.
To print to the console you can use System.out.println("data has been loaded in tableX");.
You'll then see this output in your Run tab, and also in any logs produced when the job is run, just as you would with a tLogRow component.
A slightly more lengthy approach, but one that avoids writing even this small snippet of Java, is to link a tFixedFlowInput to your database output component with an onComponentOk trigger. In it, specify a single row of data with a single column "message" (or whatever you want to call it) containing your message, then just link it to a tLogRow as normal, as sketched below.
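Wired up, that second approach looks roughly like this (component names illustrative):

tMSSqlOutput --onComponentOk--> tFixedFlowInput --row--> tLogRow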

tFileList catches one of the 6 files only

I am trying to display some results from several files in a directory. I use tFileList and two tFileInputDelimited components, both linked to the tFileList. I don't know why, but at the end of the processing my results are loaded from just one of the 6 files I want. It appears that the results come from the last file in the directory.
Each tFileInputDelimited has ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as its file name.
Here is my tMap:
Your job is set up so that your lookup is iterative, which causes problems: Talend only seems to use the last iteration, rather than doing what you might expect and running every iteration for everything it needs (although this might be more complicated than you first think).
One option is to rework the job so that the iterating part feeds the main input of the tMap rather than the lookup.
Alternatively, you could iterate the data into a tBufferOutput component and then, OnSubjobOk, link the job as before but replace the iterative part with a tBufferInput component, since it will hold all of the data from all of the files iterated through, as sketched below.
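The reworked buffered version could be wired roughly like this (a sketch; component names illustrative):

tFileList --iterate--> tFileInputDelimited --row--> tBufferOutput
then, OnSubjobOk:
t*Input --row--> tMap <--lookup-- tBufferInput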

Split file to more files in talend

I'm looking for a way to split job execution in Talend Studio according to the actual file row: I'd like to process rows starting with "DEBUG" in one branch of the job and the other rows in another branch. Is that possible?
To do this, use a tMap component. Your job will look like this:
t*Input --row--> tMap --out1--> tFileOutput*
                      --out2--> tFileOutput*
In the tMap component, you have inputs on the left and outputs on the right. In your output table, select "Activate expression filter" and use the text box to define your filter; only rows that match that filter will be output from that connection. You can have as many output tables and filters as you need.
Using tMap is cool, but if the number of output streams is not known and fixed in advance, tMap is not a good choice.
In that case, an iterate link or tJavaFlex can help you, as sketched below.
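To make the tJavaFlex idea concrete, here is a plain-Java sketch that routes each line to a per-key output file (the file names and the key extraction are assumptions; in a real tJavaFlex the writer map would live in the start and end parts and the loop body in the main part):

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SplitByKey {
    public static void main(String[] args) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();   // tJavaFlex "start" part
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {             // tJavaFlex "main" part
                String key = line.split("\\s+", 2)[0];           // e.g. "DEBUG" or anything else
                BufferedWriter out = writers.computeIfAbsent(key, k -> {
                    try {
                        return Files.newBufferedWriter(Paths.get(k + ".txt"));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                out.write(line);
                out.newLine();
            }
        }
        for (BufferedWriter out : writers.values()) out.close(); // tJavaFlex "end" part
    }
}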
Also have a look at the tutorial on "how to split a file into many files regarding a key on each record", which explains how to solve this kind of task; it shows three different techniques, though it is only available in French.
Finally, I used the tExtractRegexFields component: I simply defined a regex to match the lines. The most important thing (which I didn't know before) is that you can connect components with different types of connections: I right-clicked the component and chose Row > Reject for the new branch of the job, as described in the question.
We can do it by using tFileOutputDelimited and tFileInputDelimited.
tFileOutputDelimited has an option in its Advanced settings, "Split output in several files", that writes the output across several files.