Is there any way to compare two files and generate output with only the field-level differences using DataStage?

I have two files,
File A
id,name,address,phone
01,abc,cde,345
02,efg,ghi,654
File B
id,name,address,phone
01,abc,edc,231
02,abc,ghi,789
The output file should contain the data in the format below for each field where the records differ.
Output File
id,fields,value from File A,value from File B
01,address,cde,edc
01,phone,345,231
02,name,efg,abc
02,phone,654,789
The output shouldn't have 01,name,abc,abc because it matches in both files. The key column will be id.
Any design based on DataStage will be very helpful. Thanks in advance.

Yes it is - I suggest checking out the Change Capture stage.
Alternatively, the Difference stage might also be an option.
Both stages will tell you whether a row from one file, compared to the other, is a Copy, Delete, Insert or Edit. You can filter that down to just what you are interested in.
Which stage is best suited depends on whether you need the "before" or "after" values returned (which in turn depends on how you define file A and file B).
The documentation shows some examples.
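Note that both stages flag differences at the row level; to get the one-row-per-changed-field layout shown in the question, the compared columns still need to be pivoted afterwards (for example with a Pivot Enterprise or Transformer stage). Purely as an illustration of that per-field comparison logic - this is not DataStage code, and the file names, fixed header and single id key column are assumptions - it looks roughly like this:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (not DataStage): emit one output row per field
// that differs between two files keyed by "id".
public class FieldLevelDiff {

    public static void main(String[] args) throws Exception {
        Map<String, String[]> fileA = load("fileA.csv");
        Map<String, String[]> fileB = load("fileB.csv");
        String[] header = {"id", "name", "address", "phone"};

        System.out.println("id,fields,value from File A,value from File B");
        for (Map.Entry<String, String[]> a : fileA.entrySet()) {
            String[] b = fileB.get(a.getKey());
            if (b == null) continue;                  // key exists only in File A
            for (int i = 1; i < header.length; i++) { // skip the key column itself
                if (!a.getValue()[i].equals(b[i])) {  // emit only fields that differ
                    System.out.println(a.getKey() + "," + header[i] + ","
                            + a.getValue()[i] + "," + b[i]);
                }
            }
        }
    }

    // Read a CSV file (header in the first line) into a map keyed by the id column.
    private static Map<String, String[]> load(String path) throws Exception {
        Map<String, String[]> rows = new LinkedHashMap<>();
        List<String> lines = Files.readAllLines(Paths.get(path));
        for (String line : lines.subList(1, lines.size())) {
            String[] cols = line.split(",");
            rows.put(cols[0], cols);
        }
        return rows;
    }
}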

Related

Dynamic output filename in Data Flows in Azure Data Factory results in folders instead of files

I am setting up a Data Flow in ADF that takes an Azure Table Dataset as Source, adds a Derived Column that adds a column with the name "filename" and a dynamic value, based on a data field from the source schema.
Then the output is sent to a sink that is linked to a DataSet that is attached to Blob Storage (tried ADLS Gen2 and standard Blob storage).
However, after executing the pipeline, instead of finding multiple files in my container, I see that folders are created with names like filename=ABC123.csv, each of which contains other files (it makes me think of parquet files):
- filename=ABC123.csv
+ _started_UNIQUEID
+ part-00000-tid-UNIQUEID-guids.c000.csv
So, I'm clearly missing something, as I would need to have single files listed in the dataset container with the name I have specified in the pipeline.
This is what the pipeline looks like:
The Optimize tab of the Sink shape looks like this:
Here you can see the settings of the Sink shape:
And this is the code of the pipeline (however some parts are edited out):
source(output(
        PartitionKey as string,
        RowKey as string,
        Timestamp as string,
        DeviceId as string,
        SensorValue as double
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    inferDriftedColumnTypes: true) ~> devicetable
devicetable derive(filename = Isin + '.csv') ~> setoutputfilename
setoutputfilename sink(allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn:'filename',
    mapColumn(
        RowKey,
        Timestamp,
        DeviceId,
        SensorValue
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> distributetofiles
Any suggestions or tips? (I'm rather new to ADF, so bear with me)
I recently struggled through something similar to your scenario (but not exactly the same). There are a lot of options and moving parts here, so this post is not meant to be exhaustive. Hopefully something in it will steer you towards the solution you are after.
Step 1: Source Partitioning
In Data Flow, you can group like rows together via Set Partitioning. One of the many options is by Key (a column in the source):
In this example, we have 51 US States (50 states + DC), and so will end up with 51 partitions.
Step 2: Sink Settings
As you found out, the "As data in column" option results in a structured folder name like {columnName}={columnValue}. I've been told this is because it is a standard in Hadoop/Spark type environments. Inside that folder will be a set of files, typically with non-human-friendly GUID based names.
"Default" will give much the same result you currently have, just without the column-based folder name. "Output to Single File" is pretty self-explanatory, and the farthest thing from the solution you are after. If you want control over the final file names, the best option I have found is the "Pattern" option. This will generate file(s) with the specified name and a variable number [n]. I honestly don't know what "Per partition" would generate, but it may get close to the results you are after: 1 file per column value.
Some caveats:
The folder name is defined in the Sink Dataset, NOT in the Data Flow, so Dataset parameters are really probably "Step 0". For Blob-type output, you could probably hard-code the folder name like "myfolder/fileName-[n]". YMMV.
Unfortunately, none of these options will permit you to use a derived column to generate the file name. [If you open the expression editor, you'll find that "Incoming schema" is not populated.]
Step 3: Sink Optimize
The last piece you may experiment with is Sink Partitioning under the Optimize tab:
"Use current partitioning" will group the results based on the partition set in the Source configuration. "Single partition" will group all the results into a single output group (almost certainly NOT what you want). "Set partitioning" will allow you to re-group the Sink data based on a Key column. Unlike the Sink settings, this WILL permit you to access the derived column name, but my guess is that you will end up with the same folder naming problem you have now.
At the moment, this is all I know. I believe there is a combination of these options that will produce what you want, or something close to it. You may need to approach this in multiple steps, such as having this flow output the incorrectly named folders to a staging location, then having another pipeline/flow process each folder and collapse the results into the desired file name.
You're seeing the ghost files left behind by the Spark process in your dataset folder path. When you use 'As data in column', ADF will write the file using your field value starting at the container root.
You'll see this noted on the 'Column with file name' property:
So, if you navigate to your storage container root, you should see the ABC123.csv file.
Now, if you want to put that file in a folder, just prepend that folder name in your Derived Column transformation formula, something like this:
"output/folder1/{Isin}.csv"
The double-quotes activate ADF's string interpolation. You can combine literal text with formulas that way.

Spark - Update target data if primary keys match?

Is it possible to overwrite a record in the target if specific conditions are met, using Spark, without reading the target into a DataFrame? For example, I know we can do this if both sets of data are loaded into DataFrames, but I would like to know if there is a way to perform this action without loading the target into a DataFrame. Basically, a way to specify overwrite/update conditions.
I am guessing no, but I figured I would ask before I dive into this project. I know we have the write options of append and overwrite. What I really want is: if values for a few specific columns already exist in the target, overwrite that record and fill in the other columns with the new data. For example:
File1:
id,name,date,score
1,John,"1-10-17",35
2,James,"1-11-17",43
File2:
id,name,date,score
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
I would like the result to look like this:
id,name,date,score
1,John,"1-10-17",35
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
Basically, the Name and Date columns act like primary keys in this scenario. I want updates to occur when those two columns match; otherwise a new record should be created. As you can see, id 4 overwrites id 2, but id 5 is appended because the date column did not match. Thanks ahead, guys!
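As far as the plain append/overwrite write modes go, there does not appear to be a way to do this without reading the target. Since the question notes that the DataFrame route is already known, here is a minimal sketch of that fallback for reference (file paths and the CSV format are assumptions): target rows whose (name, date) key also appears in the new file are dropped via an anti-join, then all new rows are appended.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of the DataFrame-based merge on the (name, date) key described above.
public class UpsertByKey {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("upsert-sketch").getOrCreate();

        Dataset<Row> target  = spark.read().option("header", "true").csv("file1.csv");
        Dataset<Row> updates = spark.read().option("header", "true").csv("file2.csv");

        Dataset<Row> merged = target
            .join(updates,
                  target.col("name").equalTo(updates.col("name"))
                        .and(target.col("date").equalTo(updates.col("date"))),
                  "left_anti")          // keep target rows not matched by the new data
            .unionByName(updates);      // then append every new row

        merged.write().mode("overwrite").option("header", "true").csv("merged_output");
        spark.stop();
    }
}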

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. The job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your value column (column 2), based on the value of the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in the 2 rows shown.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't allow storing Array objects directly, make sure to store the split String as an Object. Then cast that object to an array on the output side of the tMap.
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might get unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
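For reference, here is a plain-Java illustration of the store-as-Object-then-cast trick the tMap uses above; the delimiter matches the one above, while the sample values and names are made up and this is not the actual tMap expression syntax.

// Plain Java illustration of what the tMap Var/output expressions do.
public class SplitIllustration {
    public static void main(String[] args) {
        String aggregated = "a;b;c;d;e;f";      // result of tDenormalize for one id (sample values)
        Object stored = aggregated.split(";");  // stored as an Object in the tMap Var section
        String[] parts = (String[]) stored;     // cast back to an array on the output side
        for (int i = 0; i < parts.length; i++) {
            System.out.println("column" + (i + 1) + " = " + parts[i]);
        }
    }
}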
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value".
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

can a modify stage bulk rename columns based on wildcards?

I need to modify columns based on business rules using RCP. For example, all source columns that end with '_ID' must be changed to '_KEY' to meet the target.
An example: Test_ID in source becomes Test_KEY in target
I have multiple tables, some with 2 "ID" columns and some with 20. Is there a way to configure the Modify stage to bulk rename columns based on a wildcard?
If not, is there another way?
Thanks.
I doubt that there is an option to do this with wildcards in the Modify stage.
One alternative could be a schema file, which can be used with any of the following stages:
Sequential File, File Set, External Source, External Target, Column Import, Column Export
This schema file could also be generated or modified to adjust the column names as needed.
Another way could be to generate the appropriate SQL statement if the data resides in a database or is written to one.
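As a sketch of that SQL route (table and column names are illustrative and not tied to any particular database), a generated SELECT that aliases every *_ID column to *_KEY might be built like this:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: build a SELECT that renames every column ending in _ID to _KEY.
public class RenameIdColumns {

    static String buildSelect(String table, List<String> columns) {
        String projection = columns.stream()
                .map(c -> c.endsWith("_ID")
                        ? c + " AS " + c.substring(0, c.length() - 3) + "_KEY"
                        : c)
                .collect(Collectors.joining(", "));
        return "SELECT " + projection + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildSelect("SRC_TABLE",
                Arrays.asList("Test_ID", "Test_Name", "Customer_ID", "Amount")));
        // SELECT Test_ID AS Test_KEY, Test_Name, Customer_ID AS Customer_KEY, Amount FROM SRC_TABLE
    }
}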

get duplicate record in large file using MapReduce

I have a large file containing more than 10 million lines. I want to find the duplicate lines using MapReduce.
How can I solve this problem?
Thanks for the help.
You need to make use of the fact that the default behaviour of MapReduce is to group values based on a common key.
So the basic steps required are:
Read each line of your file into your mapper, probably using something like the TextInputFormat.
Set the output key (a Text object) to the value of each line. The content of the output value doesn't really matter; you can just set it to a NullWritable if you want.
In the reducer, check the number of values grouped for each key. If you have more than one value, you know you have a duplicate.
If you just want the duplicate values, write out the keys that have multiple values.
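Putting those steps together, a minimal sketch of such a job might look like this (class names are illustrative; input and output paths come from the command line):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Each line becomes the map output key; the reducer emits only keys seen more than once.
public class DuplicateLines {

    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The whole line is the key; the value carries no information.
            context.write(line, NullWritable.get());
        }
    }

    public static class DuplicateReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (NullWritable ignored : values) {
                count++;
                if (count > 1) {                  // duplicate: write the line once and stop
                    context.write(line, NullWritable.get());
                    break;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "duplicate lines");
        job.setJarByClass(DuplicateLines.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(DuplicateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}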