How to split data into multiple output files based on the value of a given column - Talend

Using Talend Open Studio for Data Integration.
How can I split one Excel file into multiple outputs based on the values of a given column?
Example
Example of data in input.xlsx:
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files:
AAA.xlsx contains IDs 1 and 2
BBB.xlsx contains ID 3
CCC.xlsx contains ID 4
What I tried:
tFileList --> tFileInputExcel --> tUniqRow --> tFlowToIterate --> tFileInputExcel --> tFilterRow --> tLogRow
In order to perform these actions :
Browse a folder of Excel files
Iterate to open each Excel file
Get the unique values in the Excel files (on the column used for the split)
Iterate to generate the split files from the unique values, with tFilterRow filtering the Excel file; that's where I get an error about the Garbage Collector:
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Does someone have an idea how to optimize this Talend workflow and solve the GC error? Thanks for the support.

Finally, I think we must not iterate over an Excel input, as opening the same file twice is a problem both on Windows and in the designed job, so a workaround could be:
Talend diagram for the job

There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading (Database, CSV, Hash, etc).
An alternative approach is to aggregate -> iterate -> normalize the data like so:
In tAggregateRow you want to group by the field containing the 'base' of your file name (Category in this case):
The aggregate function should be 'list' (with an appropriate delimiter not already contained in your Id column):
Feed the aggregated output into a tFlowToIterate to loop over each Category:
tFixedFlow can be used to output each of the aggregates to an independent flow:
Use tNormalize to dump the single Category row into one row per Id by normalizing the 'list' column:
Set the tFileOutputExcel file name to the current iteration's Category as defined in tFlowToIterate:
Final result is one file per Category with one row per Id:
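The aggregate -> iterate -> normalize approach above can be sketched in plain Java. This is an illustration only, not Talend-generated code; the Id/Category column names mirror the sample data, and the ";" delimiter is an assumption (pick one that never appears in your Id column):

```java
import java.util.*;

// Sketch of the aggregate -> iterate -> normalize idea in plain Java.
public class SplitByCategory {

    // tAggregateRow equivalent: group by Category, aggregate Id with 'list'
    static Map<String, String> aggregate(List<String[]> rows) {
        Map<String, String> lists = new LinkedHashMap<>();
        for (String[] row : rows) {                 // row = {id, category}
            lists.merge(row[1], row[0], (a, b) -> a + ";" + b);
        }
        return lists;
    }

    // tNormalize equivalent: turn the delimited list back into one row per Id
    static List<String> normalize(String idList) {
        return Arrays.asList(idList.split(";"));
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
            new String[]{"1", "AAA"}, new String[]{"2", "AAA"},
            new String[]{"3", "BBB"}, new String[]{"4", "CCC"});

        // tFlowToIterate equivalent: one iteration per Category; each
        // iteration would feed a tFileOutputExcel named <Category>.xlsx
        aggregate(input).forEach((category, ids) ->
            System.out.println(category + ".xlsx -> IDs " + normalize(ids)));
    }
}
```

The key point is that the whole input is read only once; everything after the aggregate works on the small per-Category lists.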

Related

Throw error on invalid lookup in Talend job that populates an output table

I have a tMap component in a Talend job. The objective is to get a row from an input table, perform a column lookup in another input table, and write an output table populating one of the columns with the retrieved value (see screenshot below).
If the lookup is unsuccessful, I generate a row in an "invalid rows" table. This works fine; however, it is not the solution I'm looking for.
Instead, I want to stop the entire process and throw an error on the first unsuccessful lookup. Is this possible in Talend? The error that is thrown should contain the value that failed the lookup.
UPDATE
A tFileOutputDelimited component would do the job.
So the flow would be: tMap -> invalid_row -> tFileOutputDelimited -> tDie
Note: you have to go to the advanced settings of the tFileOutputDelimited component and tick the "split output into multiple files" option, putting 1 rather than 1000.
For more flexibility, simply use two tMaps rather than one.
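The fail-fast behaviour the tDie provides can be sketched in plain Java: stop at the first key missing from the lookup table and report the offending value. The table contents below are made-up illustration data:

```java
import java.util.*;

// Sketch of a fail-fast lookup: abort on the first unsuccessful lookup,
// carrying the failing value in the error, as tDie would.
public class FailFastLookup {

    static Map<String, String> resolveAll(List<String> keys,
                                          Map<String, String> lookup) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : keys) {
            String value = lookup.get(key);
            if (value == null) {
                // tDie equivalent: stop the whole job with the failing value
                throw new IllegalStateException("Lookup failed for: " + key);
            }
            out.put(key, value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("FR", "France");           // illustrative lookup table
        System.out.println(resolveAll(Arrays.asList("FR"), lookup));
        try {
            resolveAll(Arrays.asList("XX"), lookup);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```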

How to export from sql server table to multiple csv files in Azure Data Factory

I have a simple clients table in SQL Server that contains 2 columns: client name and city. In Azure Data Factory, how can I export this table to multiple CSV files so that each file contains only the clients from one city, the city being the name of the file?
I already tried, and succeeded, to split it into different files using Lookup and ForEach, but the data remains unfiltered by city.
any ideas anyone?
You can use Data Flow to achieve that easily.
I made an example for you: I created a table as the source and exported it to multiple CSV files, each containing only the clients from the same city, which is also the file name.
Data Flow Source:
Data Flow Sink settings: File name options: as data in column and use auto mapping.
Check the output files and data in it:
HTH.
You would need to follow the flow chart below:
Lookup activity: Query: Select distinct city from table
ForEach activity
Input: @activity('LookUp').output.value
a) Copy activity
i) Source: dynamic query Select * from t1 where city=@{item().City}
This should generate separate files for each city, as needed.
Steps:
1) The batch job can be any number of parallel executions.
Create a parameterised dataset:
5) Result: I have 2 different Entities, so 2 files are generated.
Input :
Output:
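The Lookup + ForEach pipeline above can be sketched in plain Java to make the logic concrete: group the clients by city, one "file" per city, named after the city. ADF does this with activities rather than code, and the client data here is purely illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the per-city split: one output "file" per distinct city,
// holding only that city's clients.
public class SplitByCity {

    // File name = city; file contents = client names from that city
    static Map<String, List<String>> export(List<String[]> clients) {
        return clients.stream().collect(Collectors.groupingBy(
            c -> c[1],                                          // city column
            LinkedHashMap::new,
            Collectors.mapping(c -> c[0], Collectors.toList()))); // name column
    }

    public static void main(String[] args) {
        List<String[]> clients = Arrays.asList(
            new String[]{"Ann", "Paris"}, new String[]{"Bob", "Rome"},
            new String[]{"Cid", "Paris"});
        export(clients).forEach((city, names) ->
            System.out.println(city + ".csv -> " + names));
    }
}
```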

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading back from that file. However having to store data on an external file and reading back is messy and unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The solution to the pivot has two parts: first, a tMap that transposes the value of each input row in into the corresponding column of the output row out, and second, a tAggregateRow that groups the map's output rows by brandname.
For the tMap you'd have to fill the columns conditionally like this, example for the output column named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregateRow you'd have to group by out.brandname and aggregate each column as sum, ignoring nulls.
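The tMap expression plus the sum-ignoring-nulls aggregation can be sketched in plain Java (metric names assumed fixed, as in the answer above; sample data taken from the question):

```java
import java.util.*;

// Sketch of the pivot: each input row contributes to exactly one metric
// column of its brand's output row; missing metrics stay null.
public class PivotMetrics {

    // brandname -> (metric -> value); merge() plays the role of the
    // sum-ignoring-nulls aggregation in tAggregateRow
    static Map<String, Map<String, Integer>> pivot(List<String[]> rows) {
        Map<String, Map<String, Integer>> out = new TreeMap<>();
        for (String[] r : rows) {                 // r = {brand, metric, value}
            out.computeIfAbsent(r[0], k -> new HashMap<>())
               .merge(r[1], Integer.valueOf(r[2]), Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"A", "xyz", "2"}, new String[]{"B", "xyz", "2"},
            new String[]{"A", "abc", "3"}, new String[]{"C", "def", "1"},
            new String[]{"C", "ghi", "6"}, new String[]{"A", "ghi", "1"});
        pivot(rows).forEach((brand, metrics) ->
            System.out.println(brand + " -> " + metrics));
    }
}
```

Absent keys in the inner map correspond to the nulls in the expected output (e.g. B has no abc, def or ghi).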

How to merge multiple output from single table using talend?

I have an Employee table in the database which has a gender column, and I want to filter the Employee data by gender into three columns in Excel, like:
I'm getting this output using the Talend schema below. Structure 1:
I want to optimize the structure above and am trying the following, but I'm stuck on another scenario: here I get the Employee data gender-wise, but in three different files. Is there a way to achieve the same Excel result from one SQL input and, after mapping, get it in a single output Excel file?
Structure 2 :
NOTE: I don't want to use the same input table many times. I want to get the same output using a single table and a single output Excel file, so please suggest a component that would be useful for me.
Thanks in advance!!!

Count the number of rows for each file along with the file name in Talend

I have built a job that reads the data from a file and, based on the unique values of a particular column, splits the data set into many files.
I am able to achieve the requirement with the job below:
Now from this job which is splitting the output into multiple files, what I want is to add a sub job which would give me two columns.
In the first column I want the name of the files that I created in my main job and in the second column, I want the count of number of rows each created output file has.
To achieve this I used tFlowMeter, and to catch the count I used tFlowMeterCatcher, which gives me the correct row count for each output file, but puts the last file name in all the rows generated for the counts.
How can I get the correct file names with the corresponding row counts?
If you follow the directions below, your job will in the end have additional components like so:
Use a tJavaFlex directly after the tFileOutputDelimited on main. It should look like this:
Start Code: int countRows = 0;
Main Code: countRows = countRows + 1;
End Code: globalMap.put("rowCount", countRows);
Connect this component OnComponentOk with the first component of a new subjob. This subjob holds a tFixedFlowInput, a tJavaRow and a tBufferOutput.
The tFixedFlowInput is just here so that the OnComponentOk can be connected, nothing has to be altered. In tJavaRow you put the following:
output_row.filename = (String)globalMap.get("row7.newColumn");
//or whatever is your row variable where the filename is located
output_row.rowCount = (Integer)globalMap.get("rowCount");
In the schema, add the following elements:
Simply add a tBufferOutput now at the end of the first subjob.
Now, create another new subjob with the components tBufferInput and whatever components you may need to process and store the data. Connect the very first component of your job with a OnSubjobOk with the tBufferInput component. I used a tLogRow to show the result (with my randomly created fake data):
.---------------+--------.
| LogFileData |
|=--------------+-------=|
|filename |rowCount|
|=--------------+-------=|
|fileblerb1.txt |27 |
|fileblerb29.txt|14 |
|fileblerb44.txt|20 |
'---------------+--------'
NOTE: Keep in mind that if you add a header to the file (Include Header checked in tFileOutputDelimited), the job might need to be changed (simply set int countRows = 1; or whatever you would need). I did not test this case.
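The row-counting idea from the tJavaFlex and tBufferOutput snippets above, consolidated into a plain-Java sketch (file names are made up):

```java
import java.util.*;

// Sketch of the per-file row counting: one counter per output file,
// incremented once per row written, then handed over as
// (filename, rowCount) pairs, as the tBufferOutput subjob does.
public class FileRowCounter {

    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Main Code equivalent: called once per row written to `filename`
    void rowWritten(String filename) {
        counts.merge(filename, 1, Integer::sum);
    }

    // End Code equivalent: the pairs the next subjob would consume
    Map<String, Integer> result() {
        return counts;
    }

    public static void main(String[] args) {
        FileRowCounter counter = new FileRowCounter();
        counter.rowWritten("fileblerb1.txt");
        counter.rowWritten("fileblerb1.txt");
        counter.rowWritten("fileblerb29.txt");
        System.out.println(counter.result());
    }
}
```

Keeping the counter keyed by file name is what avoids the "last file name everywhere" problem the question describes.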
You can use the tFileProperties component to store the generated file name in an intermediate Excel file after the first subjob, and use this Excel file in your second subjob.
Thanks!