Add a label indicating duplicate names [duplicate] - sed

This question already has answers here:
Add double quotation on duplicated name
(4 answers)
Closed 5 years ago.
I tried to use
sed 's/ */:/' file | awk -F: '{ if (arr[$1":"$2]) print "\""$1"\":"$2; else { arr[$1":"$2]++; print $0 }}'
but cannot get ideal output. Thanks.
The following is the file information and the desired output that I want.
Text File:
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
Karen Evich:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Required output: Add stars indicating duplicated names
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
*Karen Evich*:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500

Give a test to this. Seems to work ok.
$ awk -F":" 'NR==FNR{a[$1]++;next}(a[$1]>1){sub($1,"*" $1 "*")}1' file1 file1
Explanation:
This code reads the same file twice. This maybe has a performance penalty depending on the filesize.
-F":" : Global Input Fields Delimiter is defined as :
NR==FNR{a[$1]++;next} : The code in { } is executed when NR==FNR = the first file is read by awk
a[$1]++ : Creates an array a with index $1 and value ++ => +1 for each $1 found. So for record 1 we have a[Jon DeLoach]=1. For Record2 a[Karen Evich]=1, for record 3 a[Karen Evich]++ => 2,etc
next : instructs awk to go to the next record and skip the rest script.
(a[$1]>1){sub($1,"*" $1 "*")}1 : This condition & action is performed on the second file. For each a[$1] found in second file with a value >1 (as has been finalized when the first file finished), we insert * around $1 using awk sub function. sub function applies substitution directly to $0 = Whole record.
1 : prints the whole record of the second file.

Related

SED Multiple search items

SED question
I need to print any lines that have contain 11 for November or 12 for December.
My two questions are:
How do I search for more than one item I.E. print lines with the value 11 and 12?
How do I tell the search to look in column 4 which has the dates?
What I have so far:
sed -n -e '/11/,/12/p' datebook
File datebook:
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:23400
Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
Jennifer Cowan:548-834-2348:583 Laurel Ave., Kingsville, TX 83745:10/1/35:58900
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
Karen Evich:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Ephram Hardy:293-259-5395:235 CarltonLane, Joliet, IL 73858:8/12/20:56700
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
Lesley Kirstin:408-456-1234:4 Harvard Square, Boston, MA 02133:4/22/62:52600
William Kopf:846-836-2837:6937 Ware Road, Milton, PA 93756:9/21/46:43500
Sir Lancelot:837-835-8257:474 Camelot Boulevard, Bath, WY 28356:5/13/69:24500
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Arthur Putie:923-835-8745:23 Wimp Lane, Kensington, DL 38758:8/31/69:126000
Popeye Sailor:156-454-3322:945 Bluto Street, Anywhere, USA 29358:3/19/35:22350
Jose Santiago:385-898-8357:38 Fife Way, Abilene, TX 39673:1/5/58:95600
Tommy Savage:408-724-0140:1222 Oxbow Court, Sunnyvale, CA 94087:5/19/66:34200
Yukio Takeshida:387-827-1095:13 Uno Lane, Ashville, NC 23556:7/1/29:57000
Vinh Tranh:438-910-7449:8235 Maple Street, Wilmington, VM 29085:9/23/63:68900
How do I tell the search to look in column 4 which has the dates?
This is an indication that you should use awk because sed doesn't have the concept of fields. An awk solution would be
awk -v FS=":" '$4 ~ /^1[12]\/.*/{print}' datebook
Output
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
Deciphering the solution
FS=":" sets the the field/column delimiter to colon.
$4 represents the column four in your input file which is the date in the format mm/dd/yy
The ~ in $4 ~ /^1[12]\/.*/ means we do a regex match in which
^ represents the beginning of the string
[12] can match either one or two.
Since the regex part itself is delimited by / you need to escape any literal / as in \/
It appears that you want to select lines where the first characters after the third colon on the line are 11/ or 12/ (since the data formats appear to be pre-Y2K-style US-format dates with mm/dd/yy notation). So you write:
$ sed -n '/^\([^:]*:\)\{3\}1[12]\//p' datebook
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
$
The ^ matches at the start of a line; the \([^:]*]:\) part looks for a series of zero or more non-colons followed by a colon; the \{3\} requires 3 of them; the 1[12]\/ demands 11/ or 12/ after that; the p prints.
I observe that the initial statement says 'contain 11 for November or 12 for December', but your first numbered question says 'value 11 and 12'. These are contradictory; a given date field can only start with one or the other, not both. I've assumed that 'or' is what you intended.

Mongodb find documents

I have a MongoDB instance which contains a translation of texts:
{
"_id" : ObjectId("57c68ba415f4d42b6ecd9ee7"),
"en" : "Adana (pronounced [aˈda.na]) is a major city in southern Turkey. The city is situated on the Seyhan river, 35 km (22 mi) inland from the Mediterranean Sea, in south-central Anatolia. It is the administrative seat of the Adana Province and has a population of 1.7 million,[1] making it the fifth most populous city in Turkey. Adana-Mersin polycentric metropolitan area, with a population of 3 million, stretches over 70 km (43 mi) east-west and 25 km (16 mi) north-south; encompassing the cities of Mersin, Tarsus and Adana.",
"sw" : "Adana (Kigiriki Άδανα) ni mji mkubwa katika nchi ya Uturuki. Kwa mujibu wa sensa iliyofanyika mwaka wa 2000, mji una wakazi wapatao 1,130,710 waishio huko,[2] na kuufanya kuwa mmoja kati ya miji mitano mikubwa ya Uturuku (baada ya Istanbul, Ankara, İzmir na Bursa). Mwaka wa 2006 mji wa Adana umekadiriwa kufikia iadadi ya wakazi wapatao 1,271,894. Huu ndiyo mji mkuu wa Mkoa wa Adana."
}
{
"_id" : ObjectId("57c68ba915f4d42b6ecd9eea"),
"en" : "Addis Ababa or Addis Abeba (the spelling used by the official Ethiopian Mapping Authority),(Amharic: አዲስ አበባ? Addis Abäba IPA: [adˈdis ˈabəba] ( listen), \"new flower\"; Oromo: Finfinne,[3][4] [fɪnˈfɪ́n.nɛ́] \"Natural Spring(s)\"), is the capital and largest city of Ethiopia. Finfinne is its Oromo name. It has a population of 3,384,569 according to the 2007 population census, with annual growth rate of 3.8%. This number has been increased from the originally published 2,738,248 figure and appears to be still largely underestimated.[2][5]",
"sw" : "Addis Ababa (pia Addis Abeba; kwa Kiamhara አዲስ አበባ, \"Ua Jipya\"; kwa Kioromo Finfinne) ni mji mkuu wa Ethiopia na wa Umoja wa Afrika."
}
{
"_id" : ObjectId("57c68bab15f4d42b6ecd9eec"),
"en" : "Adelaide of Italy (931 – 16 December 999), also called Adelaide of Burgundy, was the second wife of Holy Roman Emperor Otto the Great[2] and was crowned as the Holy Roman Empress with him by Pope John XII in Rome on February 2, 962. Empress Adelaide was perhaps the most prominent European woman of the 10th century; she was regent of the Holy Roman Empire as the guardian of her grandson in 991-995.[2]",
"sw" : "Adelaide wa Italia (takriban 931 – 16 Desemba, 999) alikuwa binti wa Rudolf II, mfalme wa Burgundia. Kwanza aliolewa na Lothar, mfalme wa Italia. Alipofariki Lothar, Adelaide aliolewa na Otto I, mfalme wa Ujerumani. Aliishi maisha matakatifu. Sikukuu yake ni 16 Desemba."
}
What I would like to do is to select one specific record. For example I expect to select the last record by doing this:
db.wiki.find({"sw": "Adelaide wa Italia"}).pretty();
But the mongo shell returns nothing.
Indeed, I know that I can create an index and do something like:
db.wiki.find({$text: {$search: "\"Adelaide wa Italia\""}}).pretty();
which indeed returns the record as expected.
What am I doing wrong in the non-index searching please?
In this case you should use search with regex:
db.wiki.find({"sw": /Adelaide wa Italia/}).pretty();
The way you are doing it by:
db.wiki.find({"sw": "Adelaide wa Italia"}).pretty();
you simply tell Mongo to return you all documents where sw is equal to Adelaide wa Italia but you want to get all documents which contains this phrase in sw field instead.

SAS: Unable to read dates in the program. How to fix it?

Here's my code. I am unable to read the dates from the input, it keeps giving me incorrect format, I tried changing a few times to mmddyy10. mmddyy8. and others but it still does not read them in correctly.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date MMDDYYYY10. gender $1.
insurance_code $49-51 updated_date mmddyyyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Could you please point out where I am going wrong? Thanks for any help.
I recommend a specific informat, rather than anydtdte though it helps you get started. It will ensure that your data is correct.
data master_patients;
infile datalines;
informat date updated_date mmddyy10.;
format date updated_date date9.;
input account_number name $ 8-16 address $ 17-34 date gender $1.
insurance_code $ 49-51 updated_date;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
run;
There are two main problems. First the informat name does not have 4 Y's in it. Just 2. Second you don't have the column pointer in the right place when you are trying to read 10 characters as a date so that you are getting a blank and then the first 9 characters of the date. SAS cannot represents dates in the second or third century AD. Try MDY(12,21,197) and see what happens.
data master_patients;
infile datalines firstobs=2;
input account_number name $8-16 address $17-34 #36 date MMDDYY10.
gender $1. insurance_code $49-51 #53 updated_date mmddyy10.
;
datalines;
----+----1----+----2----+----3----+----4----+----5----+----6----+
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
For modified list input for this problem.Just add ":" between variable name and informat.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date : mmddyy10. gender $1.
insurance_code $49-51 updated_date : mmddyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Please note if you don't add ":" , just change mmddyy10. to anydtdte. , the data read into dataset may Not correct.

subtraction between two columns with awk or sed

I have some text files. I need to do the subtraction between second and fourth columns in each file. The subtracted values should print to the original files as fifth column. How can I do this with awk or sed?
HII 62.0 HII 35.1
MEE 21.3 MEE 21.3
GLL 42.3 GLL 18.5
ASS 105.9 ASS 105.9
RRG 65.6
GLL 48.3
SES 83.5
Desired output
HII 62.0 HII 35.1 26.9
MEE 21.3 MEE 21.3 0
GLL 42.3 GLL 18.5 23.8
ASS 105.9 ASS 105.9 0
RRG 65.6
GLL 48.3
SES 83.5
If the third and fourth columns are blank, no need to subtract.
awk 'NF == 2 { print }
NF == 4 { print $0, $2 - $4 }'
That could all be fitted onto one line, but it clearer what it is doing when it is spread over two lines.
If you want more control over the format, you can use printf() instead of just print.
After sanitizing trailing spaces in the data, it produces:
HII 62.0 HII 35.1 26.9
MEE 21.3 MEE 21.3 0
GLL 42.3 GLL 18.5 23.8
ASS 105.9 ASS 105.9 0
RRG 65.6
GLL 48.3
SES 83.5
This might work for you (GNU sed & Bash):
sed -ri '/^\S+\s+(\S+)\s+\S+\s+(\S+)/s//echo "&\t$(echo \1-\2|bc)"/e' file

Perl: Search & Replace within a foreach loop

perhaps someone can help me out. I need to do a search and replace on a given string, finding any occurance of one of a list of things, and inserting a carriage return before it.
I'm providing a sample string, and my attempt at solving the problem.
Sample Input:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NEPID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^
G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
And my attempt:
$/ = undef; #tells perl to ignore newlines when reading input
$input = <STDIN>; #read entire input into $input
$input =~ s/\R/ /g; #remove all newlines from input. \R matches \r, \n, \r\n
#validSegHdrs = ( "ABS", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS", "AL1",
"APR", "ARQ", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS",
"AL1", "APR", "ARQ", "ARV", "AUT", "BHS", "BLC", "BLG", "BPO", "BPX",
"BTS", "BTX", "CDM", "CER", "CM0", "CM1", "CM2", "CNS", "CON", "CSP",
"CSR", "CSS", "CTD", "CTI", "DB1", "DG1", "DMI", "DRG", "DSC", "DSP",
"ECD", "ECR", "EDU", "EQP", "EQU", "ERR", "EVN", "FAC", "FHS", "FT1",
"FTS", "GOL", "GP1", "GP2", "GT1", "IAM", "IIM", "ILT", "IN1", "IN2",
"IN3", "INV", "IPC", "IPR", "ISD", "ITM", "IVC", "IVT", "LAN", "LCC",
"LCH", "LDP", "LOC", "LRL", "MFA", "MFE", "MFI", "MRG", "MSA", "MSH",
"NCK", "NDS", "NK1", "NPU", "NSC", "NST", "NTE", "OBR", "OBX", "ODS",
"ODT", "OM1", "OM2", "OM3", "OM4", "OM5", "OM6", "OM7", "ORC", "ORG",
"OVR", "PCE", "PCR", "PD1", "PDA", "PDC", "PEO", "PES", "PID", "PKG",
"PMT", "PR1", "PRA", "PRB", "PRC", "PRD", "PSG", "PSH", "PSL", "PSS",
"PTH", "PV1", "PV2", "PYE", "QAK", "QID", "QPD", "QRD", "QRF", "QRI",
"RCP", "RDF", "RDT", "REL", "RF1", "RFI", "RGS", "RMI", "ROL", "RQ1",
"RQD", "RXA", "RXC", "RXD", "RXE", "RXG", "RXO", "RXR", "SAC", "SCD",
"SCH", "SCP", "SDD", "SFT", "SID", "SLT", "SPM", "STF", "STZ", "TCC",
"TCD", "TQ1", "TQ2", "TXA", "UAC", "UB1", "UB2", "URD", "URS", "VAR",
"VND"
);
foreach (#validSegHdrs) {
$input =~ s/$_/\r$_/g;
}
print $input;
-
For what it's worth, I'm working with HL7. HL7 consists of "segments" each on its own line. The segment beginning with "MSH" is always first, and there must be a carriage return preceding each additional segment.
My input may have line breaks (or carriage returns) in the middle of a segment, which is not allowed. I also may encounter a new segment beginning on the same line as another one, which is also not allowed.
I intend to parse the input, first strip all line breaks, and find any matches of valid segment headers, and insert a carriage return before them. I have defined an array with all valid segment headers, and am attempting to use a foreach loop to do a simple search and replace to insert the \r before each match. I think it may be a good idea to match for each string plus '|', eg match on 'PV1|' to be more precise.
I'm not getting the expected output, so I humbly ask for some expertise. Thanks much!
#validSegHdrs = ( "ABS", # .....
);
my $regex = join ("|", #validSegHdrs);
while (<>) {
s/\R/ /g;
s/($regex)/\r$1/g;
print;
}
I used this script from the command line:
perl -e 'print "\n"; local $/; $in=<>; $in=~s/\R//g; my #blk = qw(ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1 PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2 PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND); $in=~s/$_/\n$_/ for #blk; print $in, "\n";'
And got this output:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NE
PID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230
ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
If the script were written indented, it would look like this:
local $/;
$in=<>;
$in=~s/\R//g;
my #blk = qw(
ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP
AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS
CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN
FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM
IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC
NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1
PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2
PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA
RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD
TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND);
$in=~s/$_/\n$_/ for #blk;
print $in, "\n";
You would replace the \n with a \r I guess.
I don't know what the real difference between our scripts is, but it works for me??
Do note that using a hash could be more efficient (O(n) → O(1) where n is the number of header sequences):
my %hash = map {$_ => 1} #blk;
# Test if $1 is a header sequence, if so, print newline
$in =~ s/( [A-Z0-9]{3} )/ $hash{$1} ? "\n$1" : $1 /xeg;