Sort unmatched records using joinkeys - jcl

I have two GDG files (the -1 and 0 versions). From these two files I need to generate a flat file containing Insert records (records which are not in the -1 version but are in the 0 version), Delete records (records which are in the -1 version but not in the 0 version) and Update records (records which are in both versions, but where the 0 version may have changes in some of the fields). How can I get those Update records? Can I do it using JOINKEYS, and if yes, how?
Note: The update can be anywhere from column 1 to the last column of the file(+0 version of the GDG)

It is a simple JOINKEYS:
OPTION COPY
JOINKEYS F1=INA,FIELDS=(4,80),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(4,80),SORTED,NOSEQCK
JOIN UNPAIRED
REFORMAT FIELDS=(F1:1,227,F2:1,227,?)
The OPTION COPY is for the Main Task, the bit which runs after the joined file is produced. SORT FIELDS=COPY is equivalent to OPTION COPY.
The assumption is that your data is in key order already. If not, remove the SORTED,NOSEQCKs but bear in mind that you may get "spurious" matches, by equal keys not in the same position on the file relative to inserts and deletes.
JOIN UNPAIRED gives you matches and both types of mismatch. JOIN UNPAIRED,F1,F2 is equivalent.
The REFORMAT statement defines the records on the joined file. You want all the data from both/either record, and you want to know whether there was a match, and if no match, which input file had the record. That is what the question-mark (?) is. It will contain 'B' (on both files), '1' (on F1, or the first physically present JOINKEYS, only) or '2' (on the other JOINKEYS file only).
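To make the match indicator concrete, here is a minimal Python sketch of what JOIN UNPAIRED does to two inputs that are already in key order. It is only an illustration of the logic, not how the sort product works internally. The function name `join_unpaired` and the sample records are made up, keys are assumed unique within each input, and Python string ordering stands in for the mainframe collating sequence.

```python
def join_unpaired(f1, f2, key=lambda rec: rec[3:83]):
    """Merge two key-sorted record lists in the manner of JOIN UNPAIRED,F1,F2.

    Yields (f1_record, f2_record, indicator) where indicator is
    'B' (key on both inputs), '1' (F1 only) or '2' (F2 only).
    The default key mirrors FIELDS=(4,80): columns 4 to 83.
    Assumption: keys are unique within each input.
    """
    i = j = 0
    while i < len(f1) and j < len(f2):
        k1, k2 = key(f1[i]), key(f2[j])
        if k1 == k2:
            yield f1[i], f2[j], "B"
            i += 1
            j += 1
        elif k1 < k2:
            yield f1[i], None, "1"   # key present on F1 only
            i += 1
        else:
            yield None, f2[j], "2"   # key present on F2 only
            j += 1
    # Whatever is left on either input has no partner.
    for rec in f1[i:]:
        yield rec, None, "1"
    for rec in f2[j:]:
        yield None, rec, "2"


# Hypothetical sample data with a 3-character key in columns 1 to 3.
ina = ["AAA old", "BBB same", "DDD gone"]
inb = ["AAA new", "BBB same", "CCC added"]
joined = list(join_unpaired(ina, inb, key=lambda rec: rec[:3]))
```

In the real joined record the missing side is not `None` but filler (blanks by default), which is why the indicator byte is needed to tell the cases apart.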
Then you need to output the data. I'll assume you want the data in separate places:
OUTFIL FNAMES=INSERT,
INCLUDE=(455,1,CH,EQ,C'1'),
BUILD=(1,227)
OUTFIL FNAMES=DELETE,
INCLUDE=(455,1,CH,EQ,C'2'),
BUILD=(228,227)
OUTFIL FNAMES=CHANGE,
INCLUDE=(455,1,CH,EQ,C'B',
AND,
1,227,CH,NE,228,227,CH),
BUILD=(1,454)
OUTFIL FNAMES=UNCHNGE,
SAVE,
BUILD=(1,227)
INCLUDE= (or OMIT=) includes or omits the data from the "OUTFIL Group". OUTFILs "run" concurrently (as in the same record is presented to each in turn, then the next record, etc).
FNAMES gives you the DDname to put in the JCL.
For CHANGE, the INCLUDE is for the first record (known to match due to the test for 'B') not being equal to the second. It is not exactly clear what output you want here. Currently those are output as F2 appended to F1, and the entire (twice the size) record is written. You could also write the records in "pairs" (BUILD=(1,227,/,228,227)) or just one or the other of the records.
SAVE is a thing which says "if this record hasn't appeared on any OUTFIL, output it here". It is certainly useful for testing, even if you don't want it in the final code.
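The routing done by the four OUTFILs can be sketched in Python as follows. This is a rough model under assumptions: the function name `route` is invented, each joined record is a string laid out as F1 data, F2 data, then the one-byte indicator (as in the REFORMAT above), and the demo uses 4-byte records instead of 227 to keep it readable.

```python
def route(joined, n=227):
    """Split joined records into groups named after the OUTFIL DDnames.

    Each joined record is F1 data (n bytes), F2 data (n bytes) and a
    one-byte indicator, mirroring REFORMAT FIELDS=(F1:1,n,F2:1,n,?).
    """
    out = {"INSERT": [], "DELETE": [], "CHANGE": [], "UNCHNGE": []}
    for rec in joined:
        first, second, flag = rec[:n], rec[n:2 * n], rec[2 * n]
        if flag == "1":
            out["INSERT"].append(first)           # BUILD=(1,n)
        elif flag == "2":
            out["DELETE"].append(second)          # BUILD=(n+1,n)
        elif flag == "B" and first != second:
            out["CHANGE"].append(first + second)  # BUILD=(1,2n)
        else:
            out["UNCHNGE"].append(first)          # plays the role of SAVE
    return out


# Hypothetical 4-byte records; the unpaired side is blank-filled.
demo = [
    "AAA1" + "    " + "1",
    "    " + "BBB2" + "2",
    "CCC3" + "CCC4" + "B",
    "DDD5" + "DDD5" + "B",
]
groups = route(demo, n=4)
```

The final `else` branch is the analogue of SAVE: anything no other group claimed lands in UNCHNGE.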


Capturing Each Skipped Record - Copy Data Activity

I'm doing a large-scale project with multiple pipelines, millions of records per pipeline. I'm trying to develop a generic skipped row capture process.
What I need to do is: for every source row skipped due to any error encountered on the attempted load, I want to capture a key column value from the row and write it to a distinct log file (or separate DB table row). This can't be summary data: for each individual row that fails, I need to capture the row key from that row so we can review/re-load later (I will add in system variable values to identify pipeline, component, time stamp, etc). Pipeline must complete with all successful rows loaded, all unsuccessful rows logged.
This is no-brainer functionality in most ETL tools; I have to be overlooking something in ADF, because I can't find a way to do this. Appreciate any/all suggestions.
You can enable Fault tolerance and choose the Skip incompatible rows option. It will skip rows that are incompatible between the source and target store during the copy, e.g. a type or field mismatch or a PK violation.
Then you can enable session log and choose Warning log level in copy activity to log skipped rows. Finally, you can save your log file in Azure Storage or Azure Data Lake Storage Gen2.
Reference:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
With your first copy activity, check the fault tolerance option in 'Settings' to log skipped (faulty) rows.
Make sure to place your row's key column first in the mapping definition.
Get the copy activity's logFilePath from the activity output into a variable.
Add another copy activity to load the skipped rows into a relational table.
Its source path will be the variable that holds logFilePath.
Set the file path type to 'Wildcard file path'.
Keep the 'Wildcard file path' empty.
The variable will be the value in 'Wildcard file name'.
Make sure that the delimited file dataset's escape character is set to quotation marks.
The OperationItem field of the log file holds your record's fields separated by commas; because we placed the row ID first in the mapping, it will appear first in OperationItem as well.
Good luck.
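If you would rather post-process the log file outside ADF, the idea of pulling the key out of OperationItem can be sketched in Python like this. Treat it as an assumption-laden sketch: the column names and the `TabularRowSkip` operation name are taken from the documented session log layout, the sample rows are made up, and you should check them against a real log file from your pipeline. The function name `skipped_row_keys` is hypothetical.

```python
import csv
import io


def skipped_row_keys(log_text):
    """Return the first field of OperationItem for every skipped row.

    Assumes a CSV log with a header line containing OperationName and
    OperationItem columns, and that the key column was mapped first.
    """
    reader = csv.DictReader(io.StringIO(log_text), skipinitialspace=True)
    keys = []
    for row in reader:
        if row["OperationName"] != "TabularRowSkip":
            continue
        # OperationItem is itself a comma-separated, quoted list of fields.
        fields = next(csv.reader([row["OperationItem"]], skipinitialspace=True))
        keys.append(fields[0])
    return keys


# Made-up sample in the assumed log layout.
SAMPLE_LOG = '''Timestamp,Level,OperationName,OperationItem,Message
2020-02-26 06:22:56.3190846, Warning, TabularRowSkip, """1001"", ""data2"", ""data3"",", "Cannot convert data3 to type DateTime."
2020-02-26 06:22:57.1000000, Warning, TabularRowSkip, """1002"", ""x"", ""y"",", "Cannot convert y to type DateTime."
'''
keys = skipped_row_keys(SAMPLE_LOG)
```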

Merge Rows diff step flags "changed" when actually they should be identical

I'm using Merge Rows diff() step to compare the two data sets. Set A are the records from source(compare rows) and set B are the records from target (reference rows).
Merge rows diff raises a flag "changed" when the records are actually "identical".
I'm inserting or updating a record in my target table using "Synchronize after merge" step which inserts when records are "new" and updates the record when the records are "changed".
So every time I execute my transformation it shows the flag as "changed", which should not happen.
My two data sets are from a Postgres database.
I used a "Sort rows" step to sort the two data sets on a primary key field in the transformation.
In the Merge rows step, I used the "key field" to match the records in both streams, and compared the value fields.
I expect this exact behavior from the Merge rows diff flag:
If the value is changed, I want to see the flag as "changed".
If the record value is identical, I want to see the flag as "identical".
When there is a new record coming from the source, I want to see the flag as "new".
sorted source
sorted target
Double check; triple check. There is some difference, somewhere. And don't forget about leading or trailing whitespace.
Are you absolutely sure that systemmmodstamp has the same value in both cases?
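One way to do the double check is to dump one "changed" row from each stream and compare it field by field outside the tool. A small Python sketch of that idea follows. The function name `differing_fields` and the sample rows are made up; using `repr` makes trailing blanks and type differences visible.

```python
def differing_fields(source_row, target_row):
    """Return {field: (repr(source), repr(target))} for fields that differ."""
    diffs = {}
    for name in source_row.keys() | target_row.keys():
        a, b = source_row.get(name), target_row.get(name)
        if a != b:
            # repr() exposes whitespace and str-versus-number differences.
            diffs[name] = (repr(a), repr(b))
    return diffs


# Hypothetical rows: the only difference is a trailing blank.
source = {"id": 1, "name": "Ann "}
target = {"id": 1, "name": "Ann"}
diffs = differing_fields(source, target)
```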

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
....and a lot of more fields. As you can imagine the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the CSV (csvInputAdapter.Start() returns) I read all events, which are stored in the statement's NewEvents stream.
Using 10 entries in the CSV file everything works fine. Using 1 million lines it takes way too long. I tried without an EPL statement (so just the CSV import) and it took about 5 sec. With the first statement (not the complex pattern statement) I always stop after 20 minutes, so I am not sure how long it would take.
Then I changed my EPL of the first statement: I introduce a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
1) Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
2) If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
3) Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, but your EPL design is probably a little inefficient. You would want to understand how patterns work: they use filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use CSV file for large stuff. Make sure a listener is attached.
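When measuring, it helps to have a baseline to compare the EPL results against. Below is a plain-Python sketch (not NEsper) of what the pattern statement appears to count: per (Ref, Fieldname), how often an empty value is later followed by a non-empty one. The function name `count_fills` and the sample events are invented, and the restart behaviour of `every` is approximated rather than reproduced exactly.

```python
from collections import defaultdict


def count_fills(events):
    """Count 'no value' -> 'a value' transitions per (ref, fieldname).

    events is an iterable of (ref, fieldname, value) tuples in log order.
    """
    waiting = set()            # keys whose last relevant event was empty
    counts = defaultdict(int)
    for ref, fieldname, value in events:
        key = (ref, fieldname)
        if value == "":
            waiting.add(key)
        elif key in waiting:
            counts[key] += 1   # empty value has now been filled
            waiting.discard(key)
    return dict(counts)


# Hypothetical log events.
events = [
    ("O1", "Consignee name", ""),
    ("O1", "Consignee name", "ACME"),
    ("O1", "Orderdate", "2020-01-01"),
    ("O2", "Consignee name", ""),
    ("O1", "Consignee name", "ACME Ltd"),
]
fills = count_fills(events)
```

A single pass like this over a million lines is cheap, so it gives a rough lower bound on how fast the job could be.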

GROUP BY CLAUSE using SYNCSORT

I have some content in a file on which I must generate statistics such as how many records are of type 1, type 2, etc. The number of types can change and is unknown to the code until the file arrives. In a SQL system, I can do this using COUNT and a GROUP BY clause. But I am not sure if I can do this using SYNCSORT or a COBOL program. Would anyone here have an idea of how I can implement a 'GROUP BY' type query on a file using SYNCSORT?
Sample Data:
TYPE001 SUBTYPE001 TYPE01-DESC
TYPE001 SUBTYPE002 TYPE01-DESC
TYPE001 SUBTYPE003 TYPE01-DESC
TYPE002 SUBTYPE001 TYPE02-DESC
TYPE002 SUBTYPE004 TYPE02-DESC
TYPE002 SUBTYPE008 TYPE02-DESC
I want to get the information such as TYPE001 ==> 3 Records, TYPE002 ==> 3 Records. What the code doesn't know until runtime is the TYPENNN value
You show data already in sequence, so there is no need to sort the data itself, which makes SUM FIELDS= with SORT a poor solution if anyone suggests it (plus code for the formatting).
MERGE with a single input file and SUM FIELDS= would be better, but still require the code for formatting.
The simplest way to produce output which may suit you is to use OUTFIL reporting functions:
OPTION COPY
OUTFIL NODETAIL,
REMOVECC,
SECTIONS=(1,7,
TRAILER3=(1,7,
' ==> ',
COUNT=(M10,LENGTH=3),
' Records'))
The NODETAIL says "remove all the data lines". The REMOVECC says "although it is a report, don't use printer-control characters on position one of the output records". The SECTIONS says "we're going to use control-breaks, and here they (it in this case) are". In this case, your control-field is 1,7. The TRAILER3 defines the output which will be produced at each control-break: COUNT here is the number of records in that particular break. M10 is an editing mask which will change leading zeros to blanks. The LENGTH gives a length to the output of COUNT, three is chosen from your sample data with sub-types being unique and having three digits as the unique part of the data. Change to whatever suits your actual data.
You've not been clear, and perhaps you want the output "floating" (3bb instead of bb3, where b represents a blank)? That would require more code...
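For anyone who wants to check the expected report off the mainframe, the control-break logic can be sketched in Python. This is only a model of SECTIONS plus TRAILER3 with COUNT, under the same assumption as above that the input is already in sequence on the control field. The function name `count_by_type` is made up, and the width of 3 mirrors LENGTH=3.

```python
from itertools import groupby


def count_by_type(records, start=0, length=7, width=3):
    """Emit one trailer line per control break on records[start:start+length].

    Assumes records are already in sequence on the control field, as
    groupby only groups adjacent equal keys.
    """
    control = lambda rec: rec[start:start + length]
    lines = []
    for key, group in groupby(records, key=control):
        count = sum(1 for _ in group)
        # Right-aligned count, leading zeros shown as blanks.
        lines.append(f"{key} ==> {count:>{width}} Records")
    return lines


sample = [
    "TYPE001 SUBTYPE001 TYPE01-DESC",
    "TYPE001 SUBTYPE002 TYPE01-DESC",
    "TYPE001 SUBTYPE003 TYPE01-DESC",
    "TYPE002 SUBTYPE001 TYPE02-DESC",
    "TYPE002 SUBTYPE004 TYPE02-DESC",
    "TYPE002 SUBTYPE008 TYPE02-DESC",
]
report = count_by_type(sample)
```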

Migrating Low-Values in flat file to RDB

I have an indexed file where a particular field holds alphanumeric values, and this field is part of the key. That column has LOW-VALUES in one row and SPACES in another row. These two rows are treated as having unique keys in the indexed file, but when I try to migrate this to an RDB I get a unique key violation, since LOW-VALUES in the RDB is treated as spaces. Has anyone faced a similar situation, and how did you handle it?
Note: Right now, I'm just planning to replace LOW-VALUES with "RANDOM" text. I just want to know whether there is any other way to handle LOW-VALUES in an RDB.
It is a bit odd that a record key would contain either spaces or low-values. Strikes me that you may be migrating some "bad data".
However, if these are valid values, then you need to replace one of them: Low-values (probably binary zeros) or spaces with something else that will not conflict with any currently existing or likely to exist value for that key.
Keys on one file are often held as references in other files - you will need to track down and convert all of these as well. Failing to do so will lead to a corrupted database (broken RI constraints etc).
This does not look like a "pretty" situation.
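If you do go the replacement route, the two points above (no conflict with existing keys, and the same conversion everywhere the key is referenced) can be sketched in Python. This is a hypothetical outline only: the function name `remap_low_values`, the sentinel value and the sample keys are invented, and LOW-VALUES is assumed to be binary zeros.

```python
LOW_VALUES = "\x00"  # assumption: LOW-VALUES is binary zeros


def remap_low_values(keys, sentinel):
    """Replace all-LOW-VALUES keys with a sentinel that is not already used.

    Returns (converted_keys, mapping). Apply the same mapping to every
    file that holds these keys as references.
    """
    if sentinel in keys:
        raise ValueError("sentinel collides with an existing key")
    mapping = {k: sentinel for k in keys if k and set(k) == {LOW_VALUES}}
    converted = [mapping.get(k, k) for k in keys]
    return converted, mapping


# Hypothetical 3-byte keys: LOW-VALUES, SPACES and a normal value.
keys = ["\x00\x00\x00", "   ", "ABC"]
converted, mapping = remap_low_values(keys, sentinel="~~~")
```

Keeping the mapping around is the important part, since the referencing files need exactly the same substitution.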