Talend: while moving employee data from source to target, I want to remove duplicates and move them to a separate file

A record counts as a duplicate only if the full record is identical.
There are two scenarios:
Duplicate records within the source file.
Records in the source file that duplicate rows already in the existing table.
Duplicates within the source file:
If there are 3 identical records, keep the first one and move it to the target, and move the other 2 to another file (perhaps adding a surrogate key to identify them uniquely).
Duplicates compared with table data:
Compare each record with the table data and, if it is a duplicate, move it to another file.
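
The two scenarios above can be sketched outside Talend in a few lines of Python. This is only an illustration of the logic, not a Talend job: the function name `split_duplicates`, the reason labels, and the use of the source line number as the surrogate key are all assumptions made for the example.

```python
def split_duplicates(source_rows, existing_rows):
    """Split source rows into rows to load and rejected duplicates.

    A row is a duplicate only if every field matches (full-record comparison).
    Rejected rows are prefixed with their source line number, which serves as
    a hypothetical surrogate key, and with the reason for rejection.
    """
    table_keys = {tuple(row) for row in existing_rows}
    seen = set(table_keys)
    unique, rejects = [], []
    for line_no, row in enumerate(source_rows, start=1):
        key = tuple(row)
        if key in seen:
            reason = "in_table" if key in table_keys else "in_source"
            rejects.append((line_no, reason, *row))
        else:
            seen.add(key)      # the first occurrence goes to the target
            unique.append(row)
    return unique, rejects


source = [["1", "Ann"], ["1", "Ann"], ["2", "Bob"], ["1", "Ann"], ["3", "Cy"]]
existing = [["3", "Cy"]]
to_load, to_reject_file = split_duplicates(source, existing)
```

Here `to_load` would be written to the target and `to_reject_file` to the separate duplicates file.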

Related

Update column in a dataset only if matching record exists in another dataset in Tableau Prep Builder

Is there any way to do this? I am essentially trying to replicate a SQL UPDATE ... SET when a matching record for one or more key fields exists in another dataset.
I tried using joins and Merge. Joins seem to require more steps, and Merge appends records instead of updating the corresponding rows.

How to split data into multiple outputs files based on value of a given column

I am using Talend Open Studio for Data Integration.
How can I split one Excel file into multiple output files based on the values of a given column?
Example of data in input.xlsx:
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files:
AAA.xlsx contains IDs 1 and 2
BBB.xlsx contains ID 3
CCC.xlsx contains ID 4
What I tried:
tFileList --> tFileInputExcel --> tUniqRow --> tFlowToIterate --> tFileInputExcel --> tFilterRow --> tLogRow
This is meant to perform the following actions:
Browse a folder of Excel files
Iterate to open each Excel file
Get the unique values in the Excel file (on the column used for the split)
Iterate over the unique values to generate the split files, using tFilterRow to filter the Excel file. This is where I get a garbage collector error:
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Does anyone have an idea how to optimize this Talend workflow and solve the GC error? Thanks for the support.
Finally, I think we must not iterate over an Excel input, since opening the same file twice is a problem both on Windows and in the designed job, so a workaround should be:
Talend diagram for the job
There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading (Database, CSV, Hash, etc).
An alternative approach is to aggregate -> iterate -> normalize the data, like so:
In tAggregateRow, group by the field containing the 'base' of your file name (Category in this case).
The aggregate function should be 'list' (with an appropriate delimiter that does not already occur in your Id column).
Feed the aggregated output into a tFlowToIterate to loop over each Category.
tFixedFlowInput can be used to output each of the aggregates to an independent flow.
Use tNormalize to expand the single Category row into one row per Id by normalizing the 'list' column.
Set the tFileOutputExcel file name to the current iteration's Category as defined in tFlowToIterate.
The final result is one file per Category with one row per Id.
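
To make the aggregate -> iterate -> normalize flow concrete, here is a minimal Python sketch of the same data movement. It is not Talend code and writes no Excel files; the function name `split_by_category` and the `|` delimiter are assumptions chosen for the example.

```python
from collections import defaultdict


def split_by_category(rows, delimiter="|"):
    """Mimic aggregate -> iterate -> normalize on (id, category) rows."""
    # Aggregate: group by Category and build a delimited 'list' of Ids.
    grouped = defaultdict(list)
    for row_id, category in rows:
        grouped[category].append(str(row_id))
    aggregated = {cat: delimiter.join(ids) for cat, ids in grouped.items()}

    # Iterate over each Category, then normalize the list back to one Id per row.
    outputs = {}
    for category, id_list in aggregated.items():
        outputs[f"{category}.xlsx"] = id_list.split(delimiter)
    return outputs


rows = [(1, "AAA"), (2, "AAA"), (3, "BBB"), (4, "CCC")]
files = split_by_category(rows)
```

Each key of `files` stands for one output file name and each value for the rows that file would contain. The source file is read only once, which is the point of this design.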

Matching the columns in a source file with the sink table columns using Azure Data Factory

I have an Azure Data Factory trigger that fires when a file is placed in blob storage. The trigger starts pipeline execution and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file exist in the sink table. There is an identity column in the sink table that should not be part of the comparison. I am not sure how to tackle this task. I've read about the 'derived column' activity; is that the route I should take?
You can select or filter which columns are written to the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation, but the sink transformation already provides mapping by default, set to "Auto mapping". There you can add or remove the columns that are written to the sink.
In the example below, the column "id" can be treated as the equivalent of the identity column in your table, assuming all the files have the same columns:
Once you have modified the mapping as needed, you can confirm it from the "Inspect" tab before running.
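
The comparison the question asks for (header columns versus sink columns, ignoring the identity column) can be sketched in plain Python. This is not ADF code and uses no ADF API; the function name `compare_columns`, the case-insensitive matching, and the sample column names are all assumptions for illustration.

```python
def compare_columns(header, sink_columns, identity="id"):
    """Return (columns only in the file, columns only in the sink).

    The identity column is removed from the sink side before comparing.
    Names are compared case-insensitively, which is an assumption here.
    """
    file_cols = {c.strip().lower() for c in header}
    sink_cols = {c.strip().lower() for c in sink_columns} - {identity.lower()}
    return sorted(file_cols - sink_cols), sorted(sink_cols - file_cols)


only_in_file, only_in_sink = compare_columns(
    ["Name", "Price", "Colour"],      # hypothetical header row
    ["id", "name", "price", "qty"],   # hypothetical sink table columns
)
```

If `only_in_file` is empty, every header column has a matching sink column.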
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need, “[CustomerID]_[TableName]_[FileID].csv”, and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline, “Get Files”, gets the list of CSV files from the source folder (Get Metadata activity) and then, inside a ForEach loop, calls the second pipeline, “Process File”, for each file, passing the CSV file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to hold the name of the destination table, which is derived from the file name.
Its first activity splits the file name to extract the parts needed to compose the destination table name.
In the expression below, we split the file name on the “_” separator and then use the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
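
As a quick check of the split-and-concatenate logic described above (keep the first and second underscore-separated parts of the file name), here is an equivalent sketch in Python. The function name `table_name_from_file` is hypothetical and exists only for this illustration.

```python
def table_name_from_file(file_name):
    """Build the destination table name from '[CustomerID]_[TableName]_[FileID].csv'."""
    parts = file_name.split("_")
    # parts[0] is the customer ID and parts[1] is the table name;
    # the file ID and extension (parts[2]) are not used.
    return parts[0] + "_" + parts[1]


names = [table_name_from_file(f) for f in ("A_inventory_0001.csv", "B_sales_0004.csv")]
```

With the sample files, this yields `A_inventory` and `B_sales`. Whether the difference in letter case from “A_Inventory” matters depends on the collation of the target database.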
The second activity then copies the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. without specifying column names, since these will vary per file.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
If you still need to save the CustomerID and FileID in the database tables, you can use the available parameters (file name) and build a JSON with the dynamic mapping in the Mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

Capturing Each Skipped Record - Copy Data Activity

I'm doing a large-scale project with multiple pipelines, millions of records per pipeline. I'm trying to develop a generic skipped row capture process.
What I need to do is: for every source row skipped due to any error encountered on the attempted load, I want to capture a key column value from the row and write it to a distinct log file (or separate DB table row). This can't be summary data: for each individual row that fails, I need to capture the row key from that row so we can review/re-load later (I will add in system variable values to identify pipeline, component, time stamp, etc). Pipeline must complete with all successful rows loaded, all unsuccessful rows logged.
This is no-brainer functionality in most ETL tools; I have to be overlooking something in ADF, because I can't find a way to do this. Appreciate any/all suggestions.
You can enable Fault tolerance and choose the Skip incompatible rows option. This skips rows that are incompatible between the source and target stores during the copy, e.g. because of a type or field mismatch or a PK violation.
Then you can enable the session log and choose the Warning log level in the copy activity to log the skipped rows. Finally, you can save the log file in Azure Storage or Azure Data Lake Storage Gen2.
Reference:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
In your first copy activity, check the fault tolerance option under 'Settings' to log skipped rows.
Make sure to place your row's key column first in the mapping definition.
Get the copy activity's logFilePath from the activity output into a variable.
Add another copy activity to load the skipped rows into a relational table.
Its source path will be the variable that holds logFilePath.
Set the file path type to 'Wildcard file path'.
Keep the 'Wildcard file path' empty; the variable will be the value in 'Wildcard file name'.
Make sure that the delimited file dataset's escape character is set to the quotation mark.
The OperationItem field of the log file holds your record's fields separated by commas; because we placed the row ID first in the mapping, it will appear first in OperationItem as well.
Good luck.
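
To illustrate the last step, here is a minimal Python sketch that pulls the key (the first field of OperationItem) out of a skipped-row log. The column names, the `TabularRowSkip` operation name, and the sample log lines are assumptions about the session log layout and should be checked against a real log file; the function name `skipped_row_keys` is hypothetical.

```python
import csv
import io

# Hypothetical sample of a copy activity session log (layout is an assumption).
SAMPLE_LOG = (
    'Timestamp,Level,OperationName,OperationItem,Message\n'
    '2024-01-01 00:00:01,Warning,TabularRowSkip,"""1001"",""abc"",""x""","Conversion failed"\n'
    '2024-01-01 00:00:02,Warning,TabularRowSkip,"""1002"",""def"",""y""","Violation of PRIMARY KEY constraint"\n'
)


def skipped_row_keys(log_text, operation="TabularRowSkip"):
    """Return the first field of OperationItem for every skipped row."""
    keys = []
    reader = csv.DictReader(io.StringIO(log_text), skipinitialspace=True)
    for record in reader:
        if record["OperationName"] != operation:
            continue
        # OperationItem is itself a comma-separated, quoted record.
        fields = next(csv.reader([record["OperationItem"]], skipinitialspace=True))
        keys.append(fields[0])
    return keys


keys = skipped_row_keys(SAMPLE_LOG)
```

The inner parse only yields the key because the key column was mapped first, as described in the steps above.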

Greenplum COPY not filtering duplicate entries

I have a problem loading contents into a Greenplum table using the COPY command. I have a three-column table, say A, B, C, and the table should not accept duplicate rows. So I have made a composite key combining the three columns:
PRIMARY KEY ( A , B , C )
But the input file I am using to load the table has duplicate entries. All I want is for the COPY command to filter out the duplicates and continue loading the data. In my case, however, whenever COPY encounters a duplicate entry, it aborts the load. Any leads on how to proceed?
Thanks
Ganesh.R
COPY doesn't work like that.
The first thing I'd try is the system sort.
sort -u old_filename > new_filename
The '-u' argument tells sort to output only unique lines.
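
If the original row order needs to be preserved (sort -u reorders the file), the same pre-filtering can be sketched in Python. This is an alternative illustration rather than what the answer above proposes; the function name `unique_lines` is hypothetical, and it assumes each line holds exactly the key columns A, B, C so that identical lines mean identical keys.

```python
def unique_lines(lines):
    """Keep the first occurrence of each line, preserving input order."""
    seen = set()
    result = []
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


deduped = unique_lines(["1,2,3\n", "4,5,6\n", "1,2,3\n"])
```

The deduplicated lines can then be written to a new file and loaded with COPY.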