How to capture the last record in a file - DataStage

I have a requirement to split a sequential file into 3 parts: Header, Data, and Trailer. I have the Header and Data worked out.
Is there a way, in a Transformer, to determine whether you have the last record in a sequential file? I tried using LastRow(), but that gives me the last row for each node. I need to keep parallel execution on.
Thanks in advance for any help.

You have no a priori knowledge about which node the trailer row will come through on. There is therefore no solution in a Transformer stage if you want to retain parallel execution.
One way to do it is to have a reject link on the Sequential File stage. This will capture any row that does not match the defined metadata. Set up the stage with the metadata for your Data rows, then the Header and Trailer will be captured onto the reject link. It should be pretty obvious from their data which is which, and you can process them further and perhaps even rejoin them to your Data rows.
You could also capture the last row separately (e.g. via tail -1 filename) and compare it against every row processed to determine whether it's the last, but that is computationally heavy for very little gain.

Related

Detect and remove duplicate HL7 messages in a log

I'm trying to populate a new EMR with data from an existing environment. I am pulling a log of all activity for a given interface and feeding it into the inbound channel in the new environment. The problem is our existing channel has duplicates of the messages, which will create duplicate reports in the patient records.
Beyond looking through what feels like the entire internet I've tried pushing text around in Iguana, PowerShell and Excel and I'm not familiar enough with MirthConnect to make use of it. I'm not married to any one solution, I just need a solution and PDQ.
I found a fairly good starting point at https://www.secretgeek.net/ps_duplicates and I've been massaging it but still no complete solution. At this point I've basically reset it to zero because nothing I've done has improved it (mostly I broke it repeatedly).
$hash = @{} #Define an empty hashtable
gc "c:\Samples\Q12019.txt" | #Send the content of the file into the pipeline...
% {
    if ($hash.$_ -eq $null) { #if that line isn't a key in the hash table
        # $_ is data from the pipe
        $_ #send the data down the pipe
    };
    $hash.$_ = 1 #add that line to the hash so it doesn't resend
} > "c:\Samples\RadHx Test Q12019.txt"
This does some trippy stuff I don't understand. It ingests the file and the output has a new space B E T W E E N every single character in the file. I can't even tell if it's removing duplicates and I haven't been able to get it to stop doing this. I'm also not sure it's reading an entire message including all of its segments. Example 2 at
https://healthstandards.com/blog/2007/09/10/variations-of-the-hl7-orur01-message-format/
looks close enough to what I'm dealing with as an example of ingest, just add 2000 more in a text file.
Simplified explanation:
I have a text file with several blocks of related text. Each block has the same starting sequence of characters, say 'ABC'. The blocks have an arbitrary length and don't necessarily end with the same string but all blocks end with CRLF. Problem:
Each block may not be unique but I need to eliminate repeating blocks of text so the file only contains one instance of each block of text.
Mirth should be able to easily debatch the file for you. If the messages are exact duplicates, you can probably just keep track as you go of a few of the MSH fields that should guarantee uniqueness.
If they were resends of the same data, mostly identical but with some fields (especially in the MSH segment) updated, you'll probably want to exclude some of the segments, then hash the message and track that instead (perhaps along with a patient id, to cover the rare case of a hash collision).
You can store information in the globalChannelMap to compare values across messages. The map exists in memory only and won't survive a mirth restart, but that shouldn't be a problem for your one time conversion. If you need something more persistent, store the values in a database.
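If you end up doing the one-time cleanup in PowerShell rather than Mirth, the same idea can be sketched roughly as below: debatch on the MSH segment, hash each message (optionally minus the MSH segment), and keep only the first occurrence. This is only a sketch under assumptions; the paths are placeholders and you may want a different hash input. As an aside, the space between every character in your current output is almost certainly an encoding artifact: the > redirection in Windows PowerShell writes UTF-16, which looks like spaced-out text in ASCII-only tools, so write the output with Out-File -Encoding ascii (or utf8) instead.
$inPath  = 'c:\Samples\Q12019.txt'           # hypothetical paths - adjust to your files
$outPath = 'c:\Samples\Q12019.deduped.txt'

$raw = Get-Content $inPath -Raw
# Debatch: split the file into messages at every line that starts with "MSH|".
$messages = $raw -split '(?m)(?=^MSH\|)' -ne ''

$md5  = [System.Security.Cryptography.MD5]::Create()
$seen = @{}
$unique = foreach ($msg in $messages) {
    # Assumption: drop the MSH segment before hashing so resends that differ only
    # in MSH fields (control id, timestamp) still count as duplicates.
    $body = ($msg -split '\r?\n' | Where-Object { $_ -notmatch '^MSH\|' }) -join "`n"
    $key  = [BitConverter]::ToString($md5.ComputeHash([Text.Encoding]::UTF8.GetBytes($body)))
    if (-not $seen.ContainsKey($key)) {
        $seen[$key] = $true
        $msg    # first occurrence only goes down the pipeline
    }
}
($unique -join '') | Out-File $outPath -Encoding ascii    # avoids the UTF-16 spacing issue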

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of excel files where there is a line of text (and blank row) above the header row for the table.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And end up with the data from multiple similar files
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic (you can add more files to process, but you need to know how many files in advance of building your job), and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date.
But here is how my sample works. Each Excel file is set up as two inputs. For the header, the tFileInputExcel is configured to read only the first line, while the body tFileInputExcel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for the Machine A and Machine B Excel files, then those tMaps are combined with a tUnite for the final output.
As you can see in the tLogRow output, the data is combined and includes the header info.

ItemWriter output using same order that ItemReader used to read file

We have a Spring Batch job that reads a file (FlatFileItemReader), processes it, and writes data to a queue (JmsItemWriter).
We have another job that reads the queue (JmsItemReader) and writes a file (FlatFileItemWriter). It's an asynchronous process (in between the execution of the two jobs, there is some manual process that must be performed).
The flat file content doesn't have a line identifier, and we use a multi-threaded approach when reading the file ("throttle-limit"). So the messages queued do not maintain the same order that they had in the flat file.
The problem is that we should generate an output file respecting the original order. So line 33 of the incoming file should be line 33 of the outgoing file (it will have the contents of the original line, plus some data).
Does Spring Batch natively provide a way to order the output, respecting the original read order? I say "native" because one solution we thought of is to create an additional step just to add a line number to the file and use it at the end, but I was wondering if this would be reinventing the wheel...
We are using SB 3.0.3
TIA,
Bob
The use case you are describing asks that you maintain order across multiple jobs, which is not supported. In theory (while not guaranteed), a single, single-threaded step would retain the order of the input file.
Since you are reading in a multithreaded manner, there really isn't a good way to guarantee the order of the items as they are being read. The best you could do is synchronize the read method and add an id as the items are being read. If the bottleneck you're attempting to address with multithreading is in the processor or writer, this may not be a bad option.

Powershell - Copying CSV, Modifying Headers, and Continuously Updating New CSV

We have a log that tracks faxes sent through our fax server. It is a .csv that contains Date_Time, Duration, CallerID, Direction (i.e. inbound/outbound), Dialed#, and Answered#. This file is overwritten every 10 minutes with any new info that was tracked on the fax server. This cannot be changed to be appended.
Sometimes our faxes fail, and the duration on those will be equal to 00:00:00. We really don't know if they are failing until users let us know that they are getting complaints about missing faxes. I am trying to create a PowerShell script that can read the file and notify us via email if there are n failures.
I started working on it, but it quickly became a big mess as I ran into more problems. One issue I was trying to overcome was having it email us over and over about the same failures. Since I can't save anything on the original .csv's, I was trying to perform these ideas in the script.
Copy .csv with a new header titled "LoggedFailure". Create file if it doesn't exist.
Compare the two files, and add different data (i.e. updates on the original) to the copy.
Check copied .csv for Durations equal to 00:00:00. If it is, mark the LoggedFailure header as "Yes" or some value.
If there are n amount of failures, email us.
Have this script run as a scheduled task (every hour or so).
I'm having difficulty with maintaining the data. I haven't done a lot of work with scripting or programming, so I'm having trouble getting the logic right. I can look up cmdlets and understand them, but my main issue is the logic. Does anyone have any tips or ideas on how best to update the data, track failures so as not to send duplicate information, and have it run?
I'd use a hash table with the Dialed# as the key. Create PSCustomObjects that have LastFail date and FailCount properties as the values. Read through the log in chronological order, and add/increment a new entry in the hash table every time it finds an entry with Duration of 00:00:00 that's newer than what's already in the hash table. If it finds a successful delivery event, delete the entry with that Dialed# key from the hash table if it exists.
When it's done, the hash table keys will be a collection of the dialed numbers that are failing, and the objects in the values will tell you how many failures there have been and when the last one was. Use that to determine if an alert needs to be sent, and what numbers to report.
When a problem with a given fax number is resolved, a successful fax to that number will clear the entry from the hash table, and stop the alerts.
Save the hash table between runs by exporting it as CLIXML, and re-import it at the beginning of each run.
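A minimal sketch of that flow, assuming Windows PowerShell, that the CSV columns are named exactly Date_Time, Duration, and Dialed#, and that the paths, threshold, addresses, and SMTP server below are placeholders you'd replace:
$logPath   = 'C:\FaxLogs\faxlog.csv'      # hypothetical paths and threshold
$statePath = 'C:\FaxLogs\failstate.xml'
$threshold = 3

# Re-import the state from the previous run, or start fresh.
$failures = @{}
if (Test-Path $statePath) { $failures = Import-Clixml $statePath }

# Read the log in chronological order.
$log = Import-Csv $logPath | Sort-Object { [datetime]$_.Date_Time }

foreach ($row in $log) {
    $number = $row.'Dialed#'
    $when   = [datetime]$row.Date_Time
    if ($row.Duration -eq '00:00:00') {
        $entry = $failures[$number]
        if (-not $entry) {
            # first recorded failure for this number
            $failures[$number] = [pscustomobject]@{ LastFail = $when; FailCount = 1 }
        } elseif ($when -gt $entry.LastFail) {
            # only count failures newer than what we've already seen,
            # so re-reading the same rows later doesn't double-count
            $entry.LastFail = $when
            $entry.FailCount++
        }
    } elseif ($failures.ContainsKey($number)) {
        # a successful fax to this number clears its failure history
        $failures.Remove($number)
    }
}

# Alert on numbers that have reached the threshold.
$toReport = $failures.GetEnumerator() | Where-Object { $_.Value.FailCount -ge $threshold }
if ($toReport) {
    $body = ($toReport | ForEach-Object {
        "$($_.Key): $($_.Value.FailCount) failures, last at $($_.Value.LastFail)" }) -join "`n"
    Send-MailMessage -To 'helpdesk@example.com' -From 'faxmonitor@example.com' `
        -Subject 'Fax failures detected' -Body $body -SmtpServer 'smtp.example.com'
}

# Persist the state for the next scheduled run.
$failures | Export-Clixml $statePath
Run it as an hourly scheduled task; the CLIXML file carries the failure state from one run to the next, so alerts stop once a number faxes successfully again.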

tFileList catches only one of the 6 files

I am trying to display some results from several files in a directory. I use a tFileList and 2 tFileInputDelimited components, both linked to the tFileList. I don't know why, but at the end of the processing my results come from just one of the 6 files I want; it appears the results are from the last file in the directory.
Each tFileInputDelimited uses ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as its file name.
Here is my tMap:
Your job is set up so that your lookup is iterative, which causes some issues, as Talend only seems to use the last iteration rather than doing what you might expect and iterating through every step for everything it needs (although this might be more complicated than you first think).
One option is to rework the job so you use your iterate part of the job as the main input to the tMap rather than the lookup.
Alternatively, you could iterate the data into a tBufferOutput component and then, OnSubjobOk, link the job as before but replace the iterative part with a tBufferInput component, as it will store all of the data from all of the files iterated through.