To achieve an output from input in datastage tool - datastage

I have an input file with data
GGN,IBM
BNGLR, IBM
GGN,HCL
NOIDA,HCL
BNGLR,HCL
I want output like
IBM,GGN,BNGLR
HCL,GGN,NOIDA,BNGLR
using datastage tool.
Thanks in advance

You haven't given us many details to work with, so I'm making a few assumptions about the job type you're using (server/parallel) and your DataStage version. In the job design I've named the first of your columns "Value" and the second "Key".
Here is a basic job design, notice the partitioning: Job design image
Here is the first transformer setup. I know it's inefficient to add a second transformer just for a trim, but a limitation of the LastRowInGroup() function is that it only accepts columns as parameters. So any transformation of the column it uses must be done before the column is passed to the function: first transformer image
Here is the second transformer setup. The stage variable order matters, and don't forget the constraint: Second transformer image
In the second transformer, be sure to set the partitioning and constraint as detailed in the picture: second transformer properties image
Your output data will look like this: output stage data image
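Since the job itself is only shown in images, here is a rough sketch in Python (not DataStage code) of the logic the two transformers appear to implement: trim the columns first, accumulate values per key, and emit one row when the last row of a group is reached. The function name `concat_by_key` is made up for illustration, and the sketch assumes the rows arrive already grouped by key, as in the sample data.

```python
def concat_by_key(rows):
    """rows: list of (value, key) pairs, assumed already grouped by key."""
    # Step 1 (first transformer): trim both columns before any comparison.
    rows = [(value.strip(), key.strip()) for value, key in rows]

    output = []
    accumulated = []  # plays the role of the accumulating stage variable
    for i, (value, key) in enumerate(rows):
        accumulated.append(value)
        # Equivalent of the "last row in group" test: the next row has a
        # different key, or there is no next row.
        last_in_group = i == len(rows) - 1 or rows[i + 1][1] != key
        if last_in_group:  # the constraint: only emit on the last row
            output.append(",".join([key] + accumulated))
            accumulated = []
    return output


sample = [
    ("GGN", "IBM"),
    ("BNGLR", " IBM"),  # note the leading space, as in the input file
    ("GGN", "HCL"),
    ("NOIDA", "HCL"),
    ("BNGLR", "HCL"),
]
result = concat_by_key(sample)
for line in result:
    print(line)
```

Without the trim step, " IBM" and "IBM" would be treated as two different groups, which is why the trim has to happen before the group test.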
Hope that helps and is clear, look through the images closely. I'm using images as they speak more than words.
Regards,
Sam Gribble
#InforgeAcademy

Related

Best Practice to Store Simulation Results

Dear Anylogic Community,
I am struggling with finding the right approach for storing my simulation results. I have datasets created that keep track of every value I am interested in. They live in Main (see below)
My aim is to do a parameter variation experiment. In every run, I change the value for p_nDrones (see below)
After the experiment, I would like to store all the datasets in one excel sheet.
However, when I do the parameter variation experiment and afterwards check the log of the dataset (datasets_log), the changed values do not even show up (2 is the value I did set up in the normal simulation).
Now my question. Do I need to create another type of dataset if I want to track the values that are produced in the experiments? Why are they not stored after executing the experiment?
I would really appreciate it if someone could share the best way to set up this export of experiment results. I would like to store the whole time series for every dataset.
Thank you!
The best option would be to write the outputs to an external file at the end of each model run.
You can use Excel, which has a nice excelFile.writeDataSet() function, although I personally would not advise it.
I would rather write the data to a text file: you have much more control over the writing and over the file itself, it is thread-safe, and it is usable on many more platforms than Microsoft Excel.
See my example below:
Set up parameters of type TextFile in your model; the data is written to them at the end of the model run. Here I used the model's On destroy code to write out the data from the datasets.
Here you can immediately see the benefit of using the text file! You can add the number of drones we are simulating (or scenario name or any other parameter) in a column, whereas with Excel this would be a pain...
Now you can pass your specific text file to the model to use by adding it to the parameter variation page, providing it to the model through the parameters.
You will see that I also set up some headers for the text file in the Initial Experiment setup part, and then at the very end of the experiment, I close the text files in the After experiment section so that the text files can be used.
Here is the result if you simply right-click on the text files and open them in Excel. (Excel will always have a purpose, even if it is just to open text files ;-) )
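The original screenshots are not reproduced here, so as a language-neutral illustration of the idea (header written once before the experiment, one block of rows appended per run, scenario parameter stored in its own column), here is a small Python sketch. It is not AnyLogic code; the function names and the column names `n_drones`, `time` and `value` are assumptions made for the example.

```python
import csv
import io


def write_header(out):
    # Done once, before the first run (the "initial experiment setup" step).
    csv.writer(out).writerow(["n_drones", "time", "value"])


def write_run(out, n_drones, series):
    # Done at the end of each run: every (time, value) sample of the
    # dataset is written with the scenario parameter in the first column.
    writer = csv.writer(out)
    for time, value in series:
        writer.writerow([n_drones, time, value])


# An in-memory buffer stands in for the text file shared by all runs.
buf = io.StringIO()
write_header(buf)
write_run(buf, 2, [(0.0, 1.5), (1.0, 2.5)])  # hypothetical run with 2 drones
write_run(buf, 3, [(0.0, 1.0)])              # hypothetical run with 3 drones
lines = buf.getvalue().splitlines()
print("\n".join(lines))
```

Because every row carries the parameter value, all runs can share one file and still be filtered or pivoted by scenario afterwards.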

Talend Component with multiple inputs and unrelated outputs

Using Talend Open Studio, I have a data-processing component, and I'd appreciate your advice on how to make the following possible (a) in a single component and (b) without a dirty workaround. Thanks.
Relating part (a):
I have two different inputs:
One Input (with exactly one row) defines some kind of metadata for my processing.
One Input (with 1...n rows) defines the core data to process.
Currently, I solved this first requirement using two components and passing my metadata to the second component using the globalMap. But it would be nice, if I could integrate both connections into one component.
Relating part (b): After I have read all my input rows, I need to process them all at once. So far, so easy, I could use the end-section - my problem comes here: After that processing, I need to create a number of output-rows for a single output connection. Problem is, that Output-rows can only be created in the main-part and there I don't know when the last row was read...
Currently, I solved this counting the input-rows in advance and then, after that number is reached, I create that output. But this seems a really dirty workaround to me, so maybe someone has a solution for that, too?
Thank you for any useful tips!

talend how can we estimate taggregateSortedRow recordcount parameter value

We are trying out Talend and we want to aggregate some sorted data on a few keys.
Simple enough, but when we try to use tAggregateSortedRow it asks for the exact number of rows to be specified.
I am not sure how anyone can input this on the fly. Doesn't this value change for every run? Am I missing something? Surely they can't expect us to know the total number of records before we run the job.
This has to do with the way the Talend component tAggregateSortedRow is programmed. To keep it from omitting data, you need to provide the record count. A few other users have asked the same question:
https://www.talendforge.org/forum/viewtopic.php?id=50094
https://www.talendforge.org/forum/viewtopic.php?id=54231
https://www.talendforge.org/forum/viewtopic.php?id=7641
which I found simply by using Google.
Anyway, if you need to do sorting and aggregating, consider using the components tSortRow and tAggregateRow separately. It should work fine.
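To show why a separate sort followed by a separate aggregation needs no row count up front, here is a minimal Python sketch of that two-step approach. It is an illustration only, not Talend code; the function name and the (key, amount) row layout are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def sort_then_aggregate(rows):
    """rows: iterable of (key, amount) pairs in any order."""
    # Step 1: sort on the key (the role of the sort component).
    ordered = sorted(rows, key=itemgetter(0))
    # Step 2: aggregate each run of equal keys (the role of the
    # aggregation component). The total row count is never needed.
    return [
        (key, sum(amount for _, amount in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


totals = sort_then_aggregate([("b", 2), ("a", 1), ("b", 3)])
print(totals)
```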

BIRT: Using information from one Dataset as parameter of an other

I'm creating some BIRT reports with Eclipse, and I've run into the following problem.
I've got two datasets (one named diag, the other named risk). In my report I produce a region with a diag_id for every row in diag. Now I'm trying to use this diag_id as an input parameter for the second dataset (risk). Is this possible, and if so, how?
To link one dataset to another in BIRT, you can either:
Create a subreport within your report that links one dataset to another via an input parameter - see this Eclipse tutorial.
or:
Create a joint dataset that explicitly links the two datasets together - see the answer to this StackOverflow question.
Alternatively, if both datasets come from the same relational database, you could simply combine the two queries into a single query.
If you are using scripted data sources, you could use variables.
Add a variable through the Eclipse UI called "diag_id".
In the fetch script of diag, set diag_id:
vars["diag_id"] = ...; // store value in Variable.
Then, in the open script of risk, use the diag_id however you need to.
diag_id = vars["diag_id"];
This requires the risk report elements to be nested inside the diag repeating element, so that diag.fetch happens before each risk.open.
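The ordering described above (the outer fetch sets the variable, then the nested dataset opens and reads it) can be mimicked outside BIRT. The following Python sketch is only an analogy, not BIRT scripting: the rows, the `shared_vars` dictionary and the `open_risk` function are all hypothetical stand-ins.

```python
# Hypothetical in-memory stand-ins for the two datasets.
diag_rows = [{"diag_id": 1, "name": "A"}, {"diag_id": 2, "name": "B"}]
risk_rows = [
    {"diag_id": 1, "risk": "low"},
    {"diag_id": 2, "risk": "high"},
    {"diag_id": 2, "risk": "mid"},
]

shared_vars = {}  # plays the role of the report variable store


def open_risk():
    # Equivalent of the "open" script of risk: read the variable that the
    # enclosing diag row stored, and select the matching rows.
    wanted = shared_vars["diag_id"]
    return [row["risk"] for row in risk_rows if row["diag_id"] == wanted]


report = []
for diag in diag_rows:
    # Equivalent of the "fetch" script of diag: store the current id.
    shared_vars["diag_id"] = diag["diag_id"]
    # The nested risk element is evaluated once per diag row.
    report.append((diag["name"], open_risk()))

print(report)
```

If the risk element were not nested inside the diag loop, `open_risk` would run only once and see at most one id, which is the point of the nesting requirement.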

Split file to more files in talend

I'm looking for a way to split job execution in Talend Studio according to the current file row: I'd like to process file rows starting with "DEBUG" in one job branch and all other rows in another job branch. Is that possible?
To do this, use a tMap component. Your job will look like this
t*Input--row-->tMap--out1--->tFileOutput*
                   --out2--->tFileOutput*
In the tMap component, you have the input on the left and the outputs on the right. In each output table, select "Activate expression filter" and use the text box to define your filter; only rows that match that filter will be output on that connection. You can have as many output tables and filters as you need.
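The effect of two filtered outputs can be sketched in a few lines of Python. This is an illustration of the routing logic only, not Talend code; the function name `route` and the sample lines are made up.

```python
def route(lines):
    """Send lines starting with "DEBUG" to one output, the rest to another."""
    debug_out, other_out = [], []
    for line in lines:
        # Each output has its own filter; here the second filter is simply
        # the negation of the first, so every row goes to exactly one output.
        if line.startswith("DEBUG"):
            debug_out.append(line)
        else:
            other_out.append(line)
    return debug_out, other_out


debug_rows, other_rows = route(["DEBUG start", "INFO ready", "DEBUG done"])
print(debug_rows)
print(other_rows)
```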
Using tMap is cool, but if the number of output streams is not known and fixed in advance, tMap is not a good choice.
In that case, using an iterate link or tJavaFlex can help you:
Have a look at this tutorial on "how to split a file into many files regarding a key on each record", which explains how to solve this kind of task. It is currently only available in French. The tutorial shows three different techniques to achieve this.
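The linked tutorial is not reproduced here, but the underlying idea (one output per distinct key value, with the number of keys unknown in advance) can be sketched in Python. This is not one of the tutorial's techniques and not Talend code; the function name, the `;` separator and the sample records are assumptions.

```python
from collections import defaultdict


def split_by_key(records, key_index=0, sep=";"):
    """Group delimited records by the field at key_index.

    Each dictionary entry stands for one output file; new "files" are
    created on demand, so the number of outputs need not be known ahead.
    """
    groups = defaultdict(list)
    for record in records:
        key = record.split(sep)[key_index]
        groups[key].append(record)
    return dict(groups)


files = split_by_key(["A;1", "B;2", "A;3"])
print(files)
```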
In the end I used the tExtractRegexFields component and simply defined a regex for matching lines. The most important thing (which I didn't know before) is that you can connect components with different types of connections. I right-clicked the component and chose Row > Reject to create the new branch in the job, as described in the question.
We can do it by using tFileOutputDelimited and tFileInputDelimited.
In the advanced settings of tFileOutputDelimited there is an option to split the output into several files; check it.