How to get row count in file using Azure Lookup Activity - azure-data-factory

I am reading a data file together with a RecordCount file that holds the number of records in the data file. I am using a Lookup activity to get the count from the data file and comparing it with the count from the RecordCount file. This approach works well and I can compare the records when the count is less than 5000. When the data file has more than 5000 records, the Lookup considers only 5000 of them and my pipeline aborts because of a count mismatch.
e.g.:
Data file count: 7500
RecordCount file: 7500
Although the counts are equal, the Lookup considers only 5000 records and reports a mismatch.
How can I compare the counts when the file has more than 5000 records?

Add a Data Flow to your pipeline before the Lookup. Source = ADLS Gen2, Sink = ADLS Gen2. Add a Surrogate Key transformation and name the new column "mycounter". Add an Aggregate transformation and name the new column "rowcount" with a formula of max(mycounter). In the Sink, output just the "rowcount" column. You'll now have a new dataset that contains just the row count from any file. You can consume that row count with a single-row Lookup activity in the pipeline directly after your data flow.
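To compare the two counts downstream, the Lookup outputs can be fed into an If Condition with an expression along these lines (a sketch; the activity names LookupDataFileCount and LookupRecordCount and the column names are assumptions for illustration, adjust them to your pipeline):
@equals(activity('LookupDataFileCount').output.firstRow.rowcount, activity('LookupRecordCount').output.firstRow.RecordCount)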

Related

How to split data into multiple output files based on the value of a given column

I am using Talend Open Studio for Data Integration.
How can I split one Excel file into multiple output files based on the values of a given column?
Example
Example of data in input.xlsx:
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files:
AAA.xlsx contains ID 1 and 2
BBB.xlsx contains ID 3
CCC.xlsx contains ID 4
What I tried:
tFileList-->tFileInputExcel-->tUniqRow-->tFlowToIterate-->tFileInputExcel-->tFilterRow-->tLogRow
In order to perform these actions:
Browse a folder of Excel files
Iterate to open each Excel file
Get the unique values in the Excel files (on the column used for the split)
Iterate to generate the split files from the unique values, using tFilterRow to filter the Excel file, and that's where I get an error about the Garbage Collector
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Does someone have an idea to optimize this Talend workflow and solve the GC error? Thanks for the support.
Finally, I think we must not iterate over an Excel input, as opening the same file twice is a problem both on Windows and in the designed job, so a workaround should be:
Talend diagram for the job
There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading it (database, CSV, hash, etc.).
An alternative approach is to aggregate -> iterate -> normalize the data like so:
In tAggregateRow you want to group by the field containing the 'base' of your file name (Category in this case).
The aggregate function should be 'list' (with an appropriate delimiter that is not already contained in your Id column).
Feed the aggregated output into a tFlowToIterate to loop over each Category.
tFixedFlowInput can be used to output each of the aggregates to an independent flow.
Use tNormalize to turn the single Category row into one row per Id by normalizing the 'list' column.
Set the tFileOutputExcel file name to the current iteration's Category, as defined in tFlowToIterate.
The final result is one file per Category with one row per Id; a sketch of the flow follows.
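A rough sketch of the whole flow in the same notation as the question (the exact wiring and names will depend on your job, so treat this as an outline rather than the exact design):
tFileInputExcel-->tAggregateRow-->tFlowToIterate-->(iterate)-->tFixedFlowInput-->tNormalize-->tFileOutputExcel
For the tFileOutputExcel file name, an expression along the lines of ((String)globalMap.get("row2.Category")) + ".xlsx" picks up the Category stored by tFlowToIterate, where "row2" stands for whatever the flow entering tFlowToIterate is called in your job.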

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data in the file after running the pipeline:
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column (with header 0) was not present in the original file.
When you were configuring the GCS source, did you specify the Format as CSV or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column that you see in the output data. When you specify the format to be CSV, you have to specify the output schema of the file.
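To illustrate the difference (a rough sketch of the schemas, not the exact plugin output; the field names other than the offset mentioned above are assumptions):
Format = Text: offset (long), body (string) - the whole CSV line lands in body and the offset becomes the extra first column.
Format = CSV: Employee ID (int), Employee First Name (string), Employee Last Name (string), Employee Joining Date (string), Employee location (string) - declared by you in the output schema.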

Set row as a header Azure Data Factory [mapping data flow]

Currently, I have an Excel file that I'm processing using a mapping data flow to remove some null values.
This is my input file:
and after removing the null values I have:
I'm sinking my data into a Cosmos DB, but I need to change the names of the columns and set my first row as the headers...
I need to do this (first row as header) in the step before the Sink, and I can't use the mapping option to set the column names manually because the positions of some of these columns can change.
Any idea how to do this?
Thanks
First row as a header can only be checked in the dataset connection.
As a workaround, you can save your Excel data to blob storage (CSV format) after removing the null values.
Then create a Copy data activity or a data flow, use this CSV file as the source (check first row as header), and Cosmos DB as the sink.
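For reference, the first-row-as-header setting on the intermediate CSV lives on the dataset itself; a DelimitedText dataset for the staged file could look roughly like this (the dataset, container, file, and linked service names are invented for the sketch):
{
  "name": "StagingCsv",
  "properties": {
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "type": "DelimitedText",
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "staging", "fileName": "cleaned.csv" },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}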
Update
Setting of sink in data flow:
Data preview of sink:
Result:

Pipeline not picking up all rows/records from source

The StreamSets pipeline is not picking up all the records/rows from the source. Am I missing the obvious?
For example,
Source Informix
39,136 rows
StreamSets
Input = 38,926 rows
Output = 38,928 rows

How can I filter my source dataset to copy only specific values to my sink?

I have a CSV file with 2 columns (id and name). The CSV file has over 1 million names. I'm struggling to work out how I can filter my results to copy only the rows where column 2 has the name 'mary' in it.
Can anyone advise?
Add a Data Flow activity to your ADF pipeline. In that data flow, point the Source to your CSV dataset. Next, add a Filter transformation and write an expression such as name == 'mary'. Next, add a Sink. This will copy only the rows that have 'mary' as the value in the name column.
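If the match needs to be case-insensitive, or 'mary' can appear as part of a longer value, the filter expression can be loosened to something like the following (a sketch using data flow expression functions; adjust the column name to your dataset):
lower(name) == 'mary'
or, to match 'mary' anywhere in the value:
instr(lower(name), 'mary') > 0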