Detect and remove duplicate HL7 messages in a log - PowerShell

I'm trying to populate a new EMR with data from an existing environment. I am pulling a log of all activity for a given interface and feeding it into the inbound channel in the new environment. The problem is that our existing channel has duplicates of the messages, which will create duplicate reports in the patient records.
Beyond looking through what feels like the entire internet, I've tried pushing text around in Iguana, PowerShell, and Excel, and I'm not familiar enough with MirthConnect to make use of it. I'm not married to any one solution; I just need a solution, and PDQ.
I found a fairly good starting point at https://www.secretgeek.net/ps_duplicates and I've been massaging it, but still no complete solution. At this point I've basically reset it to zero because nothing I've done has improved it (mostly I broke it repeatedly).
$hash = @{} # Define an empty hashtable
gc "c:\Samples\Q12019.txt" | # Send the content of the file into the pipeline...
% {
    if ($hash.$_ -eq $null) { # if that line isn't a key in the hash table
        # $_ is data from the pipe
        $_ # send the data down the pipe
    };
    $hash.$_ = 1 # add that line to the hash so it doesn't resend
} > "c:\Samples\RadHx Test Q12019.txt"
This does some trippy stuff I don't understand. It ingests the file, and the output has a new space B E T W E E N every single character in the file. I can't even tell if it's removing duplicates, and I haven't been able to get it to stop doing this. I'm also not sure it's reading an entire message including all of its segments. Example 2 at
https://healthstandards.com/blog/2007/09/10/variations-of-the-hl7-orur01-message-format/
looks close enough to what I'm dealing with as an example of ingest, just add 2000 more in a text file.
Simplified explanation:
I have a text file with several blocks of related text. Each block has the same starting sequence of characters, say 'ABC'. The blocks have an arbitrary length and don't necessarily end with the same string but all blocks end with CRLF. Problem:
Each block may not be unique but I need to eliminate repeating blocks of text so the file only contains one instance of each block of text.

Mirth should be able to easily debatch the file for you. If the messages are exact duplicates, you can probably just keep track as you go of a few of the MSH fields that should guarantee uniqueness.
If they were resends of the same data, where the content is mostly the same but some fields (especially in the MSH segment) may be updated, you'll probably want to exclude some of the segments, then hash the message and track the hash instead (maybe along with a patient id or something, in the rare case of a hash collision).
You can store information in the globalChannelMap to compare values across messages. The map exists in memory only and won't survive a Mirth restart, but that shouldn't be a problem for your one-time conversion. If you need something more persistent, store the values in a database.
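If you'd rather stay in PowerShell than build a Mirth channel, here's a minimal sketch of the same idea. It assumes each message begins with a line starting with MSH| and that segments sit one per line; the paths come from the question, and dropping the MSH segment before keying is just one way to treat near-identical resends as duplicates. As a side note, the "space between every character" you're seeing is almost certainly Windows PowerShell's > redirection writing UTF-16; writing the output with Set-Content -Encoding Ascii (or utf8) avoids it.

# Split the log into whole messages (everything from one MSH line to the next),
# then keep only the first occurrence of each message body.
$inFile  = 'c:\Samples\Q12019.txt'
$outFile = 'c:\Samples\RadHx Test Q12019.txt'

$messages = @()
$current  = @()
foreach ($line in Get-Content $inFile) {
    if ($line -match '^MSH\|' -and $current.Count -gt 0) {
        $messages += ($current -join "`r`n")   # previous message is complete
        $current = @()
    }
    if ($line.Trim().Length -gt 0) { $current += $line }
}
if ($current.Count -gt 0) { $messages += ($current -join "`r`n") }

$seen   = @{}
$unique = foreach ($msg in $messages) {
    # Key on everything except the MSH segment, so a resend that only differs
    # in the MSH control id/timestamp is still treated as a duplicate.
    $key = ($msg -split "`r`n" | Where-Object { $_ -notmatch '^MSH\|' }) -join "`r`n"
    if (-not $seen.ContainsKey($key)) {
        $seen[$key] = $true
        $msg
    }
}

# '>' in Windows PowerShell writes UTF-16, which is where the extra "spaces" come
# from; write the output with an explicit single-byte encoding instead.
# Messages are separated by a blank line for readability.
($unique -join "`r`n`r`n") | Set-Content -Path $outFile -Encoding Ascii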

Related

How to do a duplicate file check in DataStage?

For instance:
File A is loaded, then the next day
File B is loaded, then the next day
File A is received again; this time the sequence should abort.
Can anyone help me out with this?
Thanks
There are multiple ways to solve this, but please don't build intentional aborts into the job, as they tend to come back like boomerangs.
Keep track of filenames and file hashes (like an MD5 sum) in a table and compare against that list before loading. If the file is known, handle/ignore it (see the sketch after this answer).
Just read the file again as if it was new or updated. Compare old data with new data using the Change Capture stage, handle data as needed, e.g. write changed and new data to target. (recommended)
I would not recommend writing a sequence that "should abort", as that is not the goal of an ETL process. If the file contains the very same content that is already known, just ignore it. If it has updated data, handle it as needed. Only abort if there is a technical issue, e.g. the file is badly formatted. An abort of a job should indicate that something is wrong with the job itself. When you get a file twice, it's not the job that failed.
If an error is found in the data that needs to be fixed by others, write the information about it to a table. Have another independent process monitor that table and tell the data producer about it (via dashboard, email, ...).
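To make the first option concrete, here's a rough sketch of the filename/hash tracking idea, shown in PowerShell purely for illustration (inside DataStage you'd do the equivalent with a lookup against the tracking table). The tracking CSV, paths, and column names are all made up for the example.

# Illustrative only: track which files have already been loaded by name + MD5 hash.
# "loaded_files.csv" stands in for the tracking table; all paths are hypothetical.
$trackingTable = 'C:\etl\loaded_files.csv'
$incomingFile  = 'C:\etl\incoming\FileA.txt'

$known = if (Test-Path $trackingTable) { Import-Csv $trackingTable } else { @() }
$hash  = (Get-FileHash -Path $incomingFile -Algorithm MD5).Hash

if ($known | Where-Object { $_.Hash -eq $hash }) {
    Write-Output "File already loaded - ignoring it."   # handle/ignore, don't abort
} else {
    Write-Output "New file - hand it to the load job."
    [pscustomobject]@{
        FileName = Split-Path $incomingFile -Leaf
        Hash     = $hash
        LoadedOn = Get-Date -Format 's'
    } | Export-Csv $trackingTable -NoTypeInformation -Append
}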

How to capture the last record in a file

I have a requirement to split a sequential file into 3 parts: Header, Data, and Trailer. I have the Header and Data worked out.
Is there a way, in a Transformer, to determine if you have the last record in a sequential file? I tried using LastRow(), but that gives me the last row for each node. I need to leave parallel execution on.
Thanks in advance for any help.
You have no a priori knowledge about which node the trailer row will come through on. There is therefore no solution in a Transformer stage if you want to retain parallel execution.
One way to do it is to have a reject link on the Sequential File stage. This will capture any row that does not match the defined metadata. Set up the stage with the metadata for your Data rows, then the Header and Trailer will be captured onto the reject link. It should be pretty obvious from their data which is which, and you can process them further and perhaps even rejoin them to your Data rows.
You could also capture the last row separately (e.g. via tail -1 filename) and compare it against every row processed to determine whether it's the last. Computationally heavy for very little gain.

Powershell - Copying CSV, Modifying Headers, and Continuously Updating New CSV

We have a log that tracks faxes sent through our fax server. It is a .csv that contains Date_Time, Duration, CallerID, Direction (i.e. inbound/outbound), Dialed#, and Answered#. This file is overwritten every 10 minutes with any new info that was tracked on the fax server. This cannot be changed to be appended.
Sometimes our faxes fail, and the duration on those will be equal to 00:00:00. We really don't know if they are failing until users let us know that they are getting complaints about missing faxes. I am trying to create a PowerShell script that can read the file and notify us via email if there are n failures.
I started working on it, but it quickly became a big mess as I ran into more problems. One issue I was trying to overcome was having it email us over and over about failures it had already reported. Since I can't save anything to the original .csv, I was trying to implement these ideas in the script:
Copy the .csv with a new header titled "LoggedFailure"; create the file if it doesn't exist.
Compare the two files, and add any differing data (i.e. updates to the original) to the copy.
Check the copied .csv for Durations equal to 00:00:00. If found, set the LoggedFailure column to "Yes" or some value.
If there are n failures, email us.
Have this script run as a scheduled task (every hour or so).
I'm having difficulty with maintaining the data. I haven't done a lot of work with scripting or programming, so I'm having trouble with making the correct logic. I can look up cmdlets and understand them, but my main issue is logic. Does anyone have any tips, or could you provide some ideas on how best to update the data, track failures so as not to send duplicate information, and have it run?
I'd use a hash table with the Dialed# as the key. Create PSCustomObjects that have LastFail date and FailCount properties as the values. Read through the log in chronological order, and add/increment a new entry in the hash table every time it finds an entry with Duration of 00:00:00 that's newer than what's already in the hash table. If it finds a successful delivery event, delete the entry with that Dialed# key from the hash table if it exists.
When it's done, the hash table keys will be a collection of the dialed numbers that are failing, and the objects in the values will tell you how many failures there have been and when the last one was. Use that to determine if an alert needs to be sent, and what numbers to report.
When a problem with a given fax number is resolved, a successful fax to that number will clear the entry from the hash table, and stop the alerts.
Save the hash table between runs by exporting it as CLIXML, and re-import it at the beginning of each run.
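A rough sketch of that approach follows. The column names (Date_Time, Duration, Dialed#) come from the question; the paths, failure threshold, and mail settings are placeholders you would replace.

# Sketch only: paths, the failure threshold, and the mail settings are placeholders.
$logFile   = 'C:\FaxServer\faxlog.csv'
$stateFile = 'C:\Scripts\faxFailures.clixml'
$threshold = 3

# Re-import the tracking table from the previous run, if there is one.
$failures = if (Test-Path $stateFile) { Import-Clixml $stateFile } else { @{} }

# Column names come from the question; the cast assumes Date_Time is .NET-parsable.
$rows = Import-Csv $logFile | Sort-Object { [datetime]$_.Date_Time }
foreach ($row in $rows) {
    $key  = $row.'Dialed#'
    $when = [datetime]$row.Date_Time
    if ($row.Duration -eq '00:00:00') {
        if (-not $failures.ContainsKey($key)) {
            # first failure seen for this number
            $failures[$key] = [pscustomobject]@{ LastFail = $when; FailCount = 1 }
        } elseif ($when -gt $failures[$key].LastFail) {
            # only count failures newer than what we already recorded, since the
            # log is rewritten every 10 minutes and old rows will reappear
            $failures[$key].LastFail = $when
            $failures[$key].FailCount++
        }
    } else {
        # a successful fax to this number clears its failure history
        $failures.Remove($key)
    }
}

# Alert on any number that has reached the threshold.
$toReport = $failures.GetEnumerator() | Where-Object { $_.Value.FailCount -ge $threshold }
if ($toReport) {
    $body = ($toReport | ForEach-Object {
        '{0}: {1} failures, last at {2}' -f $_.Key, $_.Value.FailCount, $_.Value.LastFail
    }) -join "`r`n"
    Send-MailMessage -To 'helpdesk@example.com' -From 'faxmon@example.com' `
        -Subject 'Fax failures detected' -Body $body -SmtpServer 'smtp.example.com'
}

# Persist the table so the next scheduled run picks up where this one left off.
$failures | Export-Clixml $stateFile

Because a successful fax removes the key, a resolved number stops generating alerts, and the Export-Clixml/Import-Clixml pair keeps the state across scheduled runs.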

How to increment a number from a csv and write over it

I'm wondering how to increment a number "extracted" from a field in a csv, and then rewrite the file with the number incremented.
I need this counter in a tMap.
Is the design below a good way to do it ?
EDIT: I'm trying a new method; see the design of my subjob below, but I have an error when I link the tJavaRow to my main tMap in the main job:
Exception in component tMap_1
java.lang.NullPointerException
at mod_file_02.file_02_0_1.FILE_02.tFileList_1Process(FILE_02.java:9157)
at mod_file_02.file_02_0_1.FILE_02.tRowGenerator_5Process(FILE_02.java:8226)
at mod_file_02.file_02_0_1.FILE_02.tFileInputDelimited_2Process(FILE_02.java:7340)
at mod_file_02.file_02_0_1.FILE_02.runJobInTOS(FILE_02.java:12170)
at mod_file_02.file_02_0_1.FILE_02.main(FILE_02.java:11954)
2014-08-07 12:43:35|bm9aSI|bm9aSI|bm9aSI|MOD_FILE_02|FILE_02|Default|6|Java
Exception|tMap_1|java.lang.NullPointerException:null|1
[statistics] disconnected
(screenshot of the subjob design)
You should be able to do this mid flow in a tMap or a tJavaRow.
Simply read the number in as an integer (or other numeric data type) and then add your increment to it.
A really simple example: take a tFixedFlowInput with some hard-coded values for the job, run it through a tMap where we add 1 to the age column, and finally output it to the console in a table.
EDIT:
As Gabriele B has pointed out, this doesn't exactly work when reading and writing to the same flat file as Talend claims an exclusive read-write lock on the file when reading and keeps it open throughout the job.
Instead you would have to write the incremented data to some other place such as a temporary file, a database or even just to the buffer and then read that data in to a separate job which would then output the file you want and clean up anything temporary.
The problem with that is you can't do the output in the same process. I've just tried reading the file in one child job, passing the data back to a parent job using a tBufferOutput, then passing that data to another child job as a context variable and trying to output to the file. Unfortunately the file lock remains, so you can't do this all in one self-contained job (even using a parent job and several child jobs).
If this sounds horrible to you (it is) and you absolutely need this to happen (I'd suggest a database table sounds like a better match for this functionality than a flat file) then you could raise a feature request on the Talend Jira for the tFileInputDelimited to not hold the file open or to not insist on an exclusive read-write lock on the file.
Once again, I strongly recommend that you move to using a database table for this because even without the file lock issue, this is definitely not the right use of a flat file and this use case perfectly fits a database, even something as lightweight as an embedded H2 database.

What are some options for keeping track of temporary results and re-use them after restart, in case the program dies while running?

(Suggestions for improving the title of this question are welcomed.)
I have a perl script that uses web APIs to fetch a user's "liked" posts on various sites (tumblr, reddit, etc.), then download some portion of each post (for example, an image that's linked from the post).
Right now, I have a JSON-encoded file that keeps track of the posts that have already been fetched (for tumblr, it just records the total number of likes; for reddit, it records the "id" of the last post fetched) so that the script can just pick up with the newly "liked" items the next time it runs. This means that after the program is finished archiving a new batch of links, the new "stopping point" is recorded in the JSON file.
However, if the program croaks for some reason (or is killed with ctrl+c, say), the progress is not recorded (since the progress is only recorded at the end of the "fetching"). So the next time the program runs, it looks in the tracking file and gets the last recorded stopping point (the last time it successfully completed fetching and recorded the progress), and picks up there again, downloading duplicates up to the point where it croaked the last time.
My question is, what's the best (i.e. simplest, most efficient, take your pick--I'm open to options here) way to record progress with each incremental archived item, so that if the program dies for some reason, it always knows exactly where to pick up where it left off? Adapting the current method (literally print-ing to the tracking file at the end of each fetch) to do the same thing after each individual item is definitely not the best solution because it's got to be pretty inefficient.
Edited for clarity
Let me make clearer that the file used to track the downloaded posts is not large, and does not grow appreciably with each "fetch" operation. There is only one element for each api (tumblr, etc.) that contains either the total number of likes for the account (in other words, the number that we have already downloaded, so we query the api for the current total, subtract the number in the file, and we know how many new items to fetch), or the ID of the last item fetched (reddit uses this, so we can ask the api for all items "after" the one in the file and only get the new stuff).
My problem is not an ever growing list of fetched posts, rather it is writing to the tracking file every time one single post is downloaded (and there could be thousands of posts downloaded in a single run).
Some ideas I would consider:
Write to the file more often, or use an interrupt handler to 'safely' handle the interrupt signal. When it's called, let the script write to your file so it's as current as possible, then quit gracefully.
Use a better storage mechanism than writing to a flat file. Depending on the need, I would consider using a database to store the ids. I groan when a database comes into play due to the complexity it adds, but it doesn't have to be complex. I've used SQLite for queuing, but also consider DBD::CSV, which just writes to a CSV while allowing SQL syntax (I haven't used it myself). In your code you could then check whether the id is already in the database and know to skip it. I would imagine that SQLite is also more 'efficient' than reading/writing a flat file and, imo, would be easier to code than having to write code to read a file yourself.
I'd just use a hash, tied to an NDBM file, to keep track of what is loaded and what isn't.
When you start a new batch of URLs, you delete the NDBM file.
Then, in your code, at the start of the program, you do
use NDBM_File; use Fcntl;   # for the NDBM_File tie and the O_* flags
tie(%visited, 'NDBM_File', 'visitedurls', O_RDWR|O_CREAT, 0666);
(don't worry about the O_CREAT, the file will remain intact if it exists unless you pass O_TRUNC as well)
Assuming your main loop looks like this:
while ($id = <INFILE>) {
    my $url = id_to_url($id);
    my $results = fetch($url);
    save_results($url, $results);
}
you change that to
while ($id = <INFILE>) {
    my $url = id_to_url($id);
    my $results;
    if ($visited{$url}) {
        $results = $visited{$url};
    } else {
        $results = fetch($url);
        $visited{$url} = $results;
    }
    save_results($url, $results);
}
So whenever you fetch a new URL, you write the results to the NDBM file, and whenever you restart your program, the results that have already been fetched will be in the NDBM file and fetched from there instead of reading the URL.
This assumes $results is a scalar, else you won't be able to store/retrieve it in this way. But as you're producing JSON anyway, the "partial json" for each URL will probably be what you want to store.