Execute Pentaho job completely for first row before starting job for second row - pentaho-spoon

My root job has two steps: a Transformation Executor (to copy rows to results) and a Job Executor (executing for each input row).
What I want is for my sub-job to execute completely for the first incoming row before it starts executing for the second row.

Click on the Job Executor step and check the box "Execute for every input row".
Tell me if that is not what you need.

Unless you specify a value other than 1 for Change Number Of Copies To Start (right-click on any transformation entry to see that option), that will always be the expected behavior.
If the number is greater than 1, the Job Executor will have that number of copies running in parallel, distributing the input rows among them (for example, with 100 input rows and 10 copies, each copy will execute 10 rows regardless).

Related

How to check if the stream of rows has ended

Is there a way for me to know if the stream of rows has ended, that is, whether the job is on the last row?
What I'm trying to do is perform an action every 10 rows; my problem is the last rows. For example, with 115 rows, the last 5 won't be processed, but I need them to be.
There is no built-in functionality in Talend that tells you whether you're on the last row. You can work around this in one of the following ways:
- Get the row count beforehand. For instance, if you have a file, you can use tFileRowCount to count the number of rows; then, when you process your file, you keep a variable with your current row number, so you can tell when you've reached the last row. If your data comes from a database, you could either issue a query that returns the total number of rows beforehand, or modify your main query to return the total number of rows in an additional column and use that (using ranking functions).
- Do some processing after the subjob has ended. There may be situations where you need special processing for the last row; you can achieve this by getting the last row processed by the previous subjob (which you have already saved, for instance, by putting a tSetGlobalVar after your target, so that when your subjob is done, your variable contains the last written value).
Edit
For your use case, what you could do is first store the result of the API call in memory using tHashOutput, then read it with a tHashInput in order to process it; you will then know how many rows you retrieved using tHashOutput's global variable tHashOutput_X_NB_LINE.
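For instance, a minimal sketch of that counting idea in a tJavaRow placed in the second subjob (the component name tHashOutput_1 and the sequence name are illustrative, not taken from the question, and it assumes Talend's Numeric.sequence routine):

// Total number of rows written by tHashOutput_1 in the previous subjob.
int total = (Integer) globalMap.get("tHashOutput_1_NB_LINE");

// Running counter for the current row, using Talend's Numeric.sequence routine.
int current = Numeric.sequence("rowCounter", 1, 1);

// Fire the "every 10 rows" logic on each full batch and on the final,
// possibly incomplete batch (e.g. the last 5 of 115 rows).
boolean endOfBatch = (current % 10 == 0) || (current == total);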

Closed loop does not work in a Talend Job

I have a Talend Job in which the components somehow form a closed loop. Image is as follows:
The schemas of both tMap outputs are the same. After connecting one tMap to tUnite, when I try to connect the second tMap, it does not connect to it.
I heard that Talend does not allow a closed loop in a Job. Is that true? If yes, why?
Someone had a similar question here, but found no answers.
Talend actually creates a Java program; essentially, that is the reason for the limitation you've encountered.
tUnite takes all the data provided by each of its inputs in turn, i.e. all of A, then all of B, then all of C.
It cannot take row 1 from A, then row 1 from B, then row 1 from C, then row 2 from A, then row 2 from B, etc., because of the nature of the programming loops generated for each flow.
However, tMap's multiple outputs or tReplicate do emit row 1 to A, then row 1 to B, then row 1 to C, then row 2 to A, then row 2 to B, etc.
This is why you cannot split and then rejoin flows.
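To illustrate the point, here is a small standalone Java sketch (not actual Talend-generated code) contrasting the two loop shapes:

import java.util.Arrays;
import java.util.List;

public class UniteSketch {
    public static void main(String[] args) {
        List<String> flowA = Arrays.asList("a1", "a2");
        List<String> flowB = Arrays.asList("b1", "b2");

        // tUnite-style: each input flow is consumed in its own loop, one after the other,
        // so all of A is emitted before any of B.
        for (String r : flowA) System.out.println("unite: " + r);
        for (String r : flowB) System.out.println("unite: " + r);

        // tMap/tReplicate-style fan-out: both targets are written inside a single loop,
        // which is why splitting a flow works row by row but rejoining it does not.
        List<String> input = Arrays.asList("r1", "r2");
        for (String r : input) {
            System.out.println("out A: " + r);
            System.out.println("out B: " + r);
        }
    }
}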
PreetyK has explained the why. I'll explain how to work around this limitation.
You can store the output of tMap_10 and tMap_11 in a tHashOutput each. On the 2nd tHashOutput you must check the "Link with a tHashOutput" checkbox and then select the other tHashOutput from the drop-down list. This tells it to write to the same buffer as the 1st tHashOutput, effectively making a union of your tMap_10 and tMap_11 outputs.
In the next subjob, you use a tHashInput to read from your tHashOutput (you must use a single tHashInput, as the 2 outputs share the same data).
Here are some screenshots :
Then the tHashInput:
Note that by default these components are hidden. You have to go to File > Project Settings > Designer > Palette settings, and then move them from the left to the right pane as below. You will then find them in your palette.

tJavaFlex behaviour when changing loop position

Having some problems in a job, and I suspect it is due to a lack of understanding of tJavaFlex. I am generating 10 rows in this test job, and am generating a loop inside a tJavaFlex:
So there are 10 rows coming in, and a loop in the Start and End section. I was expecting that for each row coming in, it would generate 10 identical rows coming out. And that I would see iterations 0,1,2,3....9 for each row.
What I got was this. This looks to me like the entire job is running 10 times, and so I have 100 random values coming through the flow from the tRowGenerator.
If I move the for loop into the Main Code section, I get close to the behaviour I was expecting. I am expecting each row when it comes in to be repeated 10 times, and for 1 row coming in to produce 10 output rows. What I get is this.
But even then my tLogRow is only generating one row for each 10 iterations, it seems (look at the tLogRow output after iteration 9 above; why not 10 items?). I had thought I would be getting 10 rows for each single row coming in, and I would see this in the tLogRow.
What I need to do is take a value from a field coming in, do some regexp parsing and split it into an array, and then for each item in the array create lines in the output flow, i.e. 1 row coming in can be turned into x rows coming out using a String.split() method.
Can someone explain the behaviour above, and also advise on the best approach to get one value coming in, do some java manipulation and then generate multiple rows coming out?
Any advice appreciated.
Yes, you are not using it correctly.
The Start code is for initializing variables (executed once, before the first row).
The Main code holds your per-row logic (executed once for each row).
The End code is where you store results, for example in a global variable (executed once, after the last row).
The Main code is executed for each row in a tJavaFlex, so don't put a for loop inside it; you can do it as in the example in the screenshot.
Your tJavaFlex behaviour is normal: you have ten rows, so for each row the for loop is executed 10 times (i < 10).
You can use it like this:
You don't need to create your own loop.
By putting the for loop in the Start code, your Main code is driven both by the loop and by the incoming rows, so it is executed n*r times.
The behaviour of a subjob containing a tJavaFlex shows that the component before the tJavaFlex is included in its Start code and the component after it is included in its End code, though that may depend on many conditions such as data propagation and trigger type.
Start code:
// Executed once, before the first row: initialize the counter.
System.out.print("tJavaFlex is starting...");
int i = 0;
Main code:
// Executed once per incoming row: count the row and copy it to the output flow.
i++;
System.out.print("tJavaFlex inside Main Code...iteration:"+i);
row8.ITEM_NAME = row7.ITEM_NAME;
row8.ITEM_COUNT = row7.ITEM_COUNT;
End code:
// Executed once, after the last row: row7 still holds the last processed row.
System.out.print("tJavaFlex is ending...");
System.out.print(row7.ITEM_NAME);
Instead of a main flow in row5, try using an iterate flow to connect the tJavaFlex.
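As for turning one incoming value into several output rows, here is a standalone Java sketch of the split logic only (the class name, sample values, and delimiter are illustrative); in a Talend job this fan-out is typically done by splitting the field and then normalizing it into rows, since the Main code of a tJavaFlex emits at most one row per incoming row:

import java.util.Arrays;
import java.util.List;

public class SplitFanOut {
    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("a;b;c", "d;e");
        for (String value : incoming) {
            // Regex split of one input value into an array.
            String[] items = value.split(";");
            for (String item : items) {
                // Each array element becomes one output row.
                System.out.println("output row: " + item);
            }
        }
    }
}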

Compare current value with previous value in DataStage

I have input like below:
empid salary
10 1000
20 2000
30 3000
40 4000
The output I require in a sequential file is like below; that is, prevsal should have the salary of the previous row:
empid salary prevsal
10 1000 null
20 2000 1000
30 3000 2000
40 4000 3000
I tried using a Transformer, giving a stage variable as prevsal = inputlink.salary and then defining the output column as prevsal. I know that doesn't work logically, and indeed it didn't work. Can anyone find me a solution for this?
You are on the right track: a Transformer and stage variables are the way to go.
Remember that within the Transformer the data is processed top-down. This means the first (topmost) stage variable is processed first, then the second, and so on, and finally the data is put on the output links.
Given your input column inputlink.salary, assume two stage variables: svPrevSalary (topmost) and a second one, svCurrentSalary.
Try the following assignments in the stage variable section (written as derivation (=) stage variable):
1. svCurrentSalary (=) svPrevSalary
2. inputlink.salary (=) svCurrentSalary
Use svPrevSalary as the derivation of the output link field.
Please note that the (=) just illustrates the idea; in the grid you specify only the derivation, i.e. svCurrentSalary for the first stage variable.
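A standalone Java sketch of why the top-down order gives the previous value (only an illustration of the logic, not DataStage code):

import java.util.Arrays;
import java.util.List;

public class PrevValueSketch {
    public static void main(String[] args) {
        List<Integer> salaries = Arrays.asList(1000, 2000, 3000, 4000);
        Integer svCurrentSalary = null;
        Integer svPrevSalary;

        for (Integer salary : salaries) {
            // Topmost stage variable evaluated first: it still holds the previous row's value.
            svPrevSalary = svCurrentSalary;
            // Then the current row's salary overwrites svCurrentSalary.
            svCurrentSalary = salary;
            // Output column prevsal is derived from svPrevSalary.
            System.out.println("salary=" + svCurrentSalary + " prevsal=" + svPrevSalary);
        }
    }
}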
I was facing the same problem when I started, and I was not getting the expected result.
For this question, we have to note two things:
1. The type of job you are building: server, sequential, or parallel. I am working in a parallel job.
2. The order of execution of the stage variables and links, i.e. the input/output order.
Stage variable derivations: curr_salary -> Prev_Sal and link.salary -> curr_Salary; then use Prev_Sal as the derivation of the prev_salary output column.
Note
If you are working in a parallel job, you have to set every stage to sequential execution mode:
go to Stage -> Advanced -> Execution mode -> Sequential.
I think it should work; I did this practically. Transformer_Image
Thanks

Get unretrieved rows only in a DB2 select

I have a BPM application where I am polling rows from a DB2 database every 5 minutes with a scheduler R1, using the query below:
- select * from Table where STATUS = 'New'
Based on the rows returned, I do some processing and then change the status of these rows to 'Read'.
But this processing takes more than 5 minutes, and in the meantime scheduler R1 runs again and picks up some of the cases already picked up in the last run.
How can I ensure that every scheduler run picks up only the rows which were not selected in the last run? What changes do I need to make to my select statement? Please help.
How can I ensure that every scheduler picks up the rows which were not selected in last run
You will need to make every scheduler aware of what was selected by other schedulers. You can do this, for example, by locking the selected rows (SELECT ... FOR UPDATE). Of course, you will then need to handle lock timeouts.
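A minimal sketch of the locking option, reusing the query from the question (the exact locking behaviour depends on your isolation level):

-- Lock the selected rows so an overlapping scheduler run waits (or times out)
-- instead of reading them again; keep the transaction open while processing.
SELECT * FROM Table WHERE STATUS = 'New' FOR UPDATE;

-- ... process the rows ...

UPDATE Table SET STATUS = 'Read' WHERE STATUS = 'New';
COMMIT;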
Another option, allowing for better concurrency, would be to update the record status before processing the records. You might introduce an intermediary status, something like 'In progress', and include the status in the query condition.
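And a minimal sketch of the intermediary-status option, keeping the table and STATUS column from the question; the 'In progress' value and the RUN_ID column used to tag each run are illustrative additions, not part of the original schema:

-- Claim the new rows in a single statement and tag them with this run's identifier,
-- so an overlapping run cannot pick them up again.
UPDATE Table SET STATUS = 'In progress', RUN_ID = 42 WHERE STATUS = 'New';
COMMIT;

-- Process only the rows this run claimed.
SELECT * FROM Table WHERE STATUS = 'In progress' AND RUN_ID = 42;

-- Mark them as done once processing finishes.
UPDATE Table SET STATUS = 'Read' WHERE RUN_ID = 42;
COMMIT;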