Count the number of rows for each file along with the file name in Talend

Count the number of rows for each file along with the file name in Talend - talend

I have built a job that reads the data from a file, and based on the unique data of a particular columns, splits the data set into many files.
I am able to acheive the requirement by the below job :
Now from this job which is splitting the output into multiple files, what I want is to add a sub job which would give me two columns.
In the first column I want the name of the files that I created in my main job and in the second column, I want the count of number of rows each created output file has.
To achive this I used tflowmeter and to catch the result of count i used the tFlowmeterCatcher, which is giving me correct result for the count of each rows for the correspoding output files, but is giving the last file name in all the files that i have generated for the counts.
How can I get the correct file names and the corresponding row count.

If you use the following directions, your job will in the end have additional components like so:
Use a tJavaFlex directly after the tFileOutputDelimited on main. It should look like this:
Start Code: int countRows = 0;
Main Code: countRows = countRows + 1;
End Code: globalMap.put("rowCount", countRows);
Connect this component OnComponentOk with the first component of a new subjob. This subjob holds a tFixedFlowInput, a tJavaRow and a tBufferOutput.
The tFixedFlowInput is just here so that the OnComponentOk can be connected, nothing has to be altered. In tJavaRow you put the following:
output_row.filename = (String)globalMap.get("row7.newColumn");
//or whatever is your row variable where the filename is located
output_row.rowCount = (Integer)globalMap.get("rowCount");
In the schema, add the following elements:
Simply add a tBufferOutput now at the end of the first subjob.
Now, create another new subjob with the components tBufferInput and whatever components you may need to process and store the data. Connect the very first component of your job with a OnSubjobOk with the tBufferInput component. I used a tLogRow to show the result (with my randomly created fake data):
.---------------+--------.
| LogFileData |
|=--------------+-------=|
|filename |rowCount|
|=--------------+-------=|
|fileblerb1.txt |27 |
|fileblerb29.txt|14 |
|fileblerb44.txt|20 |
'---------------+--------'
NOTE: Keep in mind that if you add a header to the file (Include Header checked in tFileOutputDelimited), the job might need to be changed (simply set int countRows = 1; or whatever you would need). I did not test this case.

You can use tFileproperties component to store file-name generated in a intermediate excel after first sub-job and use this excel in your second sub-job.
Thanks!

Related

Guid additional column is always the same

I have a Copy-Data task where I am adding an additional column called "Id" and its value is #guid(). Problem is that for every row it is importing, the Guid value is always the same and the destination/sink throws a primary key violation.
Additional column definition

The copy activity will copy the same guid() for all rows if you use additional column.
To get Unique guid() for Each row, you can follow the demonstration below.
First Give your source data to lookup activity and give its output to a ForEach activity.
This is my source data in csv format for sample, give this to lookup.
source.csv:
name
"Rakesh"
"Laddu"
"Virat"
"John"
Use another dummy dataset and give any one value to it. Use this in copy activity.
Dummy.csv:
name
"Rakesh"
ForEach activity:
Inside ForEach use Copy activity and give the dummy dataset. Create additional columns and give our source data(#item().name) and #guid().
Copy activity:
Now in sink give your database dataset. Here for sample, I have used Azure SQL database table.
Go to mapping of copy activity and click on import Schemas.Give any string value for it to import the schemas of source (Here dummy schema) and sink.
After the above, you will get like this, in this give the additional columns we created to the database columns.
Pipeline Execution:
After Executing the pipeline, you can get the desired output with Unique rows.
Output:

How to split data into multiple outputs files based on value of a given column

Using Talend Open Studio for Data integration
How can I split one Excel file into multiple outputs based on values of given column ?
Example
Example of data in input.xlsx :
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files :
AAA.xlsx contains ID 1 and 2
BBB.xslx contains ID 3
CCC.xslx contains ID 4
What I tried ?
tfilelist-->tinputexcel-->tuniqrows-->tflowtoiterate-->tfileinputexcel-->tfilterow-->tlogrow
In order to perform these actions :
Browse a folder of Excel files
Iterate to Open Excel file
Get uniques values in Excel files (on column used for the split)
Iterate to generate splitted files with the unique values and tfilterow to filter Excel file and that's where I get an error about Garbage Collector
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Do someone have an idea to optimize this talend workflow and solve GC error ? Thanks for the support

Finally, I think we must not iterate on an Excel Input as openning twice the same file is a problem both on Windows and in designed job so a workaround should be :
Talend diagram for the job

There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading (Database, CSV, Hash, etc).
An alternative approach is to aggregate -> iterate -> normalize the data like so:
In tAggregateRow you want to group by the field containing the 'base' of your file name (Category in this case):
The aggregate function should be 'list' (with an appropriate delimiter not already contained in your Id column:
Feed the aggregated output into a tFlowToIterate to loop over each Category:
tFixedFlow can be used to output each of the aggregates to an independent flow:
Use tNormalize to dump the single Category row into one row per Id by normalizing the 'list' column:
Set the tFileOutputExcel file name to be the current iterations Category as defined in tFlowToIterate:
Final result is one file per Category with one row per Id:

How to export statistics_log data cumulatively to the desired Excel sheet in AnyLogic?

I have selected to export tables at the end of model execution to an Excel file, and I would like that data to accumulate on the same Excel sheet after every stop and start of the model. As of now, every stop and start just exports that 1 run's data and overwrites what was there previously. I may be approaching the method of exporting multiple runs wrong/inefficiently but I'm not sure.

Best method is to export the raw data, as you do (if it is not too large).
However, 2 improvements:
manage your output data yourself, i.e. do not rely on the standard export tables but only write data that you really need. Check this help article to learn how to write your own data
in your custom output data tables, add additional identification columns such as date_of_run. I often use iteration and replication columns to also identify from which of those the data stems.
custom csv approach
Alternative approach is to write create your own csv file programmatically, this is possible with Java code. Then, you can create a new one (with a custom filename) after any run:
First, define a “Text file” element as below:
Then, use this code below to create your own csv with a custom name and write to it:
File outputDirectory = new File("outputs");
outputDirectory.mkdir();
String outputFileNameWithExtension = outputDirectory.getPath()+File.separator+"output_file.csv";
file.setFile(outputFileNameWithExtension, Mode.WRITE_APPEND);
// create header
file.println( "col_1"+","+"col_2");
// Write data from dbase table
List<Tuple> rows = selectFrom(my_dbase_table).list();
for (Tuple row : rows) {
file.println( row.get( my_dbase_table.col_1) + "," +
row.get( my_dbase_table.col_2));
}
file.close();

ssis how to add all columns from another source without join criteria?

I want to read properties from my source file, and the add the properties to all records in the file itself
So I'd like to join the Element reading the FileProperties to all rows from the data...
Data1 FileProperty1 Fileproperty2
Data2 FileProperty1 FileProperty2
In fact, I just want to add the columns from the property dataset, to each row from the data... how can I do that ? I try merge and lookup, but I don't have any Id to match witch, just need to append...

Create your file property variables
Add a script task to control flow:
Add variables as read/write
Add the following script (I gave you 1 property as an example)
System.IO.FileInfo fi = new System.IO.FileInfo("[Your File Path]");
Dts.Variables["FileSize"].Value = fi.Length;
Add your dataflow and connect to script task (script first)
Add a derived column to DF and add additional columns for your variables.

Use your property file as a source to a pivot transform to transform the rows into a list of columns. Now that you have all the properties in a single row, simply use the pivot output as a source, and use merge transform to merge these new columns with the other file.
More info on how to use Pivot transform

How to conditionally execute something based on previous processed number of rows?

I want to execute some subjob if the previously processed number of rows are greater than N. To do this, i'm using the following configuration:
tFixedFlowInput have some rows.
tAggregateRow uses the count function and outputs one row with the number.
tSetGlobalVar then stores this value into a global var that I can check in the Run If connector (In this case, (Integer)globalMap.get("tSetGlobalVar_1") > 3 ).
tMsgBox then shows if the condition is true.
What I would like is to do the same, but in a more elegant way, using the minimum components required. I would like to connect tAggregateRow with the Run If connector directly (or even tFixedFlow) with tMsgBox, but I haven't found a way to refer to the number of rows previously processed without using the output row2.count variable.
How could I do something like this?
What should I put in the If condition to refer to the tAggregateRow operation result without connecting it to another meaningless component like exposed at the beginning?

for any talend component look under outline tab under the left side workspace pane at the bottom. this lists down the properties available via global variables for that component. Some properties like count of records inserted by output components are only available once the component is executed completely (After).
for your case you can try directly using ((Integer)globalMap.get("tFixedFlowInput_1_NB_LINE")) which gives number of lines (after) given by tFixedFlowInput.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Count the number of rows for each file along with the file name in Talend - talend

You can use tFileproperties component to store file-name generated in a intermediate excel after first sub-job and use this excel in your second sub-job. Thanks!

Related

Guid additional column is always the same

How to split data into multiple outputs files based on value of a given column

How to export statistics_log data cumulatively to the desired Excel sheet in AnyLogic?

ssis how to add all columns from another source without join criteria?

How to conditionally execute something based on previous processed number of rows?

Categories

Resources