Closed loop does not work in a Talend Job - talend

I have a Talend Job in which the components somehow form a closed loop. Image is as follows:
The schemas of both tMap outputs are the same. After connecting either tMap to the tUnite, when I try to connect the second tMap, it does not connect.
I heard that Talend does not allow a closed loop in a Job. Is that true? If so, why?
Someone had a similar question here, but found no answers.

Talend actually creates a Java program; essentially that is the reason for the limitation you've encountered.
tUnite takes all the data provided by each of its inputs in turn, i.e. all of A, then all of B, then all of C.
It cannot take row 1 from A, then row 1 from B, then row 1 from C, then row 2 from A, then row 2 from B, etc., because of the nature of the programming loops used for each flow.
However, a tMap with multiple outputs or a tReplicate does send row 1 to A, then row 1 to B, then row 1 to C, then row 2 to A, then row 2 to B, etc.
This is why you cannot split and then rejoin flows.
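The two orderings described above can be compared in a small sketch. This is only an illustration of the loop shapes, not Talend's generated code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlowOrderSketch {
    // A splitting component (tMap with several outputs, tReplicate):
    // one pass over the rows, each row is handed to every output link in turn.
    static List<String> splitOrder(List<String> rows, List<String> links) {
        List<String> order = new ArrayList<>();
        for (String row : rows) {
            for (String link : links) {
                order.add(row + "->" + link);
            }
        }
        return order;
    }

    // A union component (tUnite): one full loop per input link,
    // so all rows of A are consumed before any row of B.
    static List<String> uniteOrder(List<String> rows, List<String> links) {
        List<String> order = new ArrayList<>();
        for (String link : links) {
            for (String row : rows) {
                order.add(row + "->" + link);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row1", "row2");
        List<String> links = Arrays.asList("A", "B");
        System.out.println(splitOrder(rows, links)); // [row1->A, row1->B, row2->A, row2->B]
        System.out.println(uniteOrder(rows, links)); // [row1->A, row2->A, row1->B, row2->B]
    }
}
```

The two lists contain the same pairs in a different order, which is the mismatch that prevents a split flow from being rejoined within the same subjob.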

PreetyK has explained the why. I'll explain how to work around this limitation.
You can store the output from tMap_10 and tMap_11 in a tHashOutput each. On the 2nd tHashOutput you must check the "Link with a tHashOutput" checkbox and then select the other tHashOutput from the drop-down list. This tells it to write to the same buffer as the 1st tHashOutput, effectively making a "union" of your tMap_10 and tMap_11 outputs.
In the next subjob, you use a tHashInput to read from your tHashOutput (you must use a single tHashInput, as the 2 outputs share the same data).
Here are some screenshots :
Then the tHashInput:
Note that by default these components are hidden. You have to go to File > Project Settings > Designer > Palette settings, and then move them from the left pane to the right pane as shown below. You will then find them in your palette.
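The shared-buffer idea behind the two linked tHashOutput components can be pictured with the sketch below. It is a plain Java analogy under the assumption that both outputs append to one in-memory list; `SharedBufferSketch` and `runJob` are hypothetical names, not Talend APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SharedBufferSketch {
    static List<String> runJob(List<String> fromFirstMap, List<String> fromSecondMap) {
        // Stand-in for the buffer that both tHashOutput components write to.
        List<String> sharedBuffer = new ArrayList<>();

        // First subjob: each branch writes its rows to the same buffer,
        // so the flows never have to be rejoined while rows are in flight.
        sharedBuffer.addAll(fromFirstMap);
        sharedBuffer.addAll(fromSecondMap);

        // Next subjob: a single reader (the tHashInput role) sees the union.
        return new ArrayList<>(sharedBuffer);
    }

    public static void main(String[] args) {
        List<String> union = runJob(Arrays.asList("a1", "a2"), Arrays.asList("b1"));
        System.out.println(union); // [a1, a2, b1]
    }
}
```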

Parallel job is adding extra columns when outputting to a dataset

The last job before my dataset is written is a transformation. It's a lot more complex than this, but the basics are:
input = A Integer, B Integer and C Integer
output = A Integer, if B > 10 then C else 0 -> C Integer
So, to clarify, column A is just passed through and columns B and C are used to perform a transformation that is called "C" in the final output link.
When I examine the columns being written to the dataset I see A and C. I can save the table definition and this is also just columns A and C. However, when I actually run the job, column B also ends up in the dataset, so I end up with (in any order) columns A, B and C.
I've tried deleting my output dataset, then recreating it, giving it a new name, but it always ends up with that "working column" B in it for some reason I don't fully understand. I don't see how it's picking up a column that isn't in the final output link and choosing to add it against my wishes.
I don't want column B in my dataset, it's wasteful to store it and it's confusing for developers as it shouldn't be there in the first place. How do I stop DataStage from writing it?
It seems you have RCP (Runtime Column Propagation) activated. It transfers all available columns, independent of the ones you specified.
Go to the stage (Transformer) > Properties > Output tab, where there is a Runtime Column Propagation checkbox. Remove the check mark.
In other stages it may be located on the Columns tab instead.
In the job properties of your job there is also a setting that enables RCP for new links. Remove this check mark as well to avoid this problem in future job extensions.
For more details on RCP check out this.
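As a rough picture of the effect described in this answer, the sketch below models a row as a map of column names to values. It is a conceptual analogy only, not how DataStage implements RCP; `RcpSketch`, `transform` and the `propagate` flag are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RcpSketch {
    // Applies the question's derivation: A passes through, C = (B > 10 ? C : 0).
    // With propagate = true, unmapped input columns ride along to the output.
    static Map<String, Integer> transform(Map<String, Integer> in, boolean propagate) {
        Map<String, Integer> out = new LinkedHashMap<>();
        if (propagate) {
            out.putAll(in); // this is how the "working column" B leaks into the output
        }
        out.put("A", in.get("A"));
        out.put("C", in.get("B") > 10 ? in.get("C") : 0);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> row = new LinkedHashMap<>();
        row.put("A", 1);
        row.put("B", 20);
        row.put("C", 7);
        System.out.println(transform(row, true));  // {A=1, B=20, C=7}
        System.out.println(transform(row, false)); // {A=1, C=7}
    }
}
```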

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the outcome of the process.
You can do this, but it takes a bit of extra work in your recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter rows on one chosen value between 1 and 1,000 as the rows you'll keep, and your output will only have that randomized subset. Just have the last step of the recipe remove this temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that create a random number within a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, use randbetween(1,1000), then pick a single value x with 1 <= x <= 1000 to filter on, because that keeps about 1/1000 of the data (a million out of a billion).
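The sampling logic can be checked outside Dataprep with a small simulation. This is a plain Java sketch of the idea, not Dataprep recipe syntax; `RandomSampleSketch` and its parameters are hypothetical, and a fixed seed is used only to make runs repeatable.

```java
import java.util.Random;

class RandomSampleSketch {
    // Assigns each row a random bucket in 1..buckets (the RANDBETWEEN role)
    // and counts the rows whose bucket equals the one chosen to keep.
    static long sample(long totalRows, int buckets, int keepBucket, long seed) {
        Random random = new Random(seed);
        long kept = 0;
        for (long row = 0; row < totalRows; row++) {
            int bucket = random.nextInt(buckets) + 1; // inclusive range 1..buckets
            if (bucket == keepBucket) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // One million simulated rows, 1,000 buckets: roughly 1,000 rows survive.
        System.out.println(sample(1_000_000L, 1000, 1, 42L));
    }
}
```

The kept count is only approximately totalRows / buckets, so this approach suits prototyping where an exact row count is not required.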
Alternatively, if you just want to have a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, regardless of how many rows there are,
you can use 2 of these 3 row filtering methods (top rows / range):
P.S.
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter / keep a portion of the data (as per the first scenario) in 1 step (i.e. without creating an additional column).
BTW, an easy way to discover how-tos in Trifacta is to type what you're looking for in the "search-transformation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Execute pentaho job completely for first row before starting job for second row

My root job has two steps:
a Transformation Executor (to copy rows to results) and a Job Executor (executing for each input row).
What I want is for my sub-job to execute completely for the first incoming row before it starts executing for the second row.
Click on the Job executor step and check the box Execute for every input row.
Tell me if it is not what you need.
Unless you specify a value other than 1 for Change Number Of Copies To Start (right-click on any Transformation Entry to see that option), that will always be the expected behavior.
If the number is greater than 1, the Job Executor will have that number of copies running in parallel, distributing the input rows among them (for example, with 100 input rows and 10 copies, each copy will execute 10 rows no matter what).
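The 100-rows-over-10-copies arithmetic can be sketched as follows. The round-robin assignment used here is an assumption made for illustration; the answer only states that rows are distributed evenly, and `CopyDistributionSketch` is a hypothetical name.

```java
import java.util.Arrays;

class CopyDistributionSketch {
    // Counts how many input rows each copy receives when rows are
    // handed out in turn (row 0 to copy 0, row 1 to copy 1, and so on).
    static int[] rowsPerCopy(int inputRows, int copies) {
        int[] counts = new int[copies];
        for (int row = 0; row < inputRows; row++) {
            counts[row % copies]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rowsPerCopy(100, 10))); // ten entries, each 10
        System.out.println(Arrays.toString(rowsPerCopy(100, 1)));  // [100]
    }
}
```

With a single copy, every row goes through the same executor one after another, which is the sequential behaviour the question asks for.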

tJavaFlex behaviour when changing loop position

I am having some problems in a job, and I suspect they are due to a lack of understanding of tJavaFlex. I am generating 10 rows in this test job, and am generating a loop inside a tJavaFlex:
So there are 10 rows coming in, and a loop in the Start and End section. I was expecting that for each row coming in, it would generate 10 identical rows coming out. And that I would see iterations 0,1,2,3....9 for each row.
What I got was this. This looks to me like the entire job is running 10 times, and so I have 100 random values coming through the flow from the tRowGenerator.
If I move the for loop into the Main Code section, I get close to the behaviour I was expecting. I am expecting each row when it comes in to be repeated 10 times, and for 1 row coming in to produce 10 output rows. What I get is this.
But even then my tLogRow seems to generate only one row for every 10 iterations (look at the tLogRow output after iteration 9 above: why not 10 items?). I had thought I would be getting 10 rows for each single row coming in, and that I would see this in the tLogRow.
What I need to do is take a value from a field coming in, do some reg exp parsing and split into an array, and then for each item in the array create lines in the output flow. i.e. 1 row coming in can be turned into x number of rows coming out using a string.split() method.
Can someone explain the behaviour above, and also advise on the best approach to get one value coming in, do some java manipulation and then generate multiple rows coming out?
Any advice appreciated.
Yes, you are not using it correctly.
The Start code is for initializing variables (executed once, before the first row).
The Main code is where your per-row logic goes (executed once for each row).
The End code is where you can, for example, store a result in a global variable (executed once, after the last row).
The Main code is executed for each row in a tJavaFlex, so don't put a for loop inside it; you can do it like the example in the screenshot.
Your tJavaFlex behaviour is normal: you have ten rows, so for each row the for loop is executed 10 times (i<10).
You can use it like:
You don't need to create your own loop.
By putting the for loop in the Start code, your main code is triggered both by the loop and by the incoming rows, so it is executed n*r times.
The behaviour of a subjob that contains a tJavaFlex reveals that the component before the tJavaFlex is included in its Start code, and the component after it is included in its End code, but that may depend on many conditions such as data propagation and trigger type.
start code :
System.out.print("tJavaFlex is starting...");
int i = 0;
Main code :
i++;
System.out.print("tJavaFlex inside Main Code...iteration:"+i);
row8.ITEM_NAME = row7.ITEM_NAME;
row8.ITEM_COUNT = row7.ITEM_COUNT;
End code :
System.out.print("tJavaFlex is ending...");
System.out.print(row7.ITEM_NAME);
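The n*r effect mentioned in this answer can be reproduced in plain Java. The nesting below is a simplified model of what happens when a for loop is opened in the Start code and closed in the End code; it is not the code Talend generates, and the names are hypothetical.

```java
class StartLoopSketch {
    // Counts how often the Main code runs when the Start code opens a loop
    // of loopCount iterations around a flow of rowCount incoming rows.
    static int mainCodeRuns(int loopCount, int rowCount) {
        int runs = 0;
        for (int i = 0; i < loopCount; i++) {          // loop opened in the Start code
            for (int row = 0; row < rowCount; row++) { // rows produced by the upstream component
                runs++;                                // Main code, once per row per iteration
            }
        }                                              // loop closed in the End code
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(mainCodeRuns(10, 10)); // 100
        System.out.println(mainCodeRuns(1, 10));  // 10
    }
}
```

This matches the question's observation of 100 values arriving from a 10-row generator.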
Instead of a main flow in row5, try using an iterate flow to connect the tJavaFlex.
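For the last part of the question (one incoming value split into several output rows), the Java side of the work might look like the sketch below. It is plain Java, independent of any Talend component; the comma delimiter, the `id` column and the `SplitRowsSketch` class are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRowsSketch {
    // Turns one incoming row (id, delimited field) into one output row per item.
    static List<String[]> explode(String id, String field) {
        List<String[]> rows = new ArrayList<>();
        // Split on commas, tolerating whitespace around each delimiter.
        for (String item : field.trim().split("\\s*,\\s*")) {
            if (!item.isEmpty()) {
                rows.add(new String[] {id, item});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : explode("42", "a, b ,c")) {
            System.out.println(row[0] + "|" + row[1]);
        }
        // prints 42|a, 42|b and 42|c on separate lines
    }
}
```

For the simple single-delimiter case, Talend's tNormalize component may be enough without custom code.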

Macro that will copy only cells with data and paste in a different worksheet

I need to create a macro that will copy the cells from A2 to O2 in the worksheet DL and continue for a varying number of rows (it depends on the month). I need this pasted in the worksheet Efficiency, starting at A2 to O2.
Because the number of rows of data changes every time I create the report, I'm running into issues with creating an effective macro.
Also, some of the rows don't have information in every column, but I still want the blank cells to be copied in this case. Basically, if there is data in column A, I want the rest of the row up to column O to be copied as well.
Do any of those suggestions help you out?
Also, I'm a bit confused about what you are referring to when you write "rows". A row has a number (1 to 1048576 in Office 2010 and up) and a column has a letter (A to XFD in Office 2010 and up). So what do you mean by "row A2 to O2"?
EDIT: Also, to copy from A to O every time, you simply set the range from A to O. For example, Range("A2:O10") would include every cell from A2 to O10. Since you set the range with simple strings, you can select the row numbers with the help of one of the suggestions in the link I gave you.