Perl: Search & Replace within a foreach loop


Perhaps someone can help me out. I need to do a search and replace on a given string, finding every occurrence of any item from a list and inserting a carriage return before it.
I'm providing a sample string and my attempt at solving the problem.
Sample Input:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NEPID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^
G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
And my attempt:
$/ = undef; #tells perl to ignore newlines when reading input
$input = <STDIN>; #read entire input into $input
$input =~ s/\R/ /g; #remove all newlines from input. \R matches \r, \n, \r\n
@validSegHdrs = ( "ABS", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS", "AL1",
"APR", "ARQ", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS",
"AL1", "APR", "ARQ", "ARV", "AUT", "BHS", "BLC", "BLG", "BPO", "BPX",
"BTS", "BTX", "CDM", "CER", "CM0", "CM1", "CM2", "CNS", "CON", "CSP",
"CSR", "CSS", "CTD", "CTI", "DB1", "DG1", "DMI", "DRG", "DSC", "DSP",
"ECD", "ECR", "EDU", "EQP", "EQU", "ERR", "EVN", "FAC", "FHS", "FT1",
"FTS", "GOL", "GP1", "GP2", "GT1", "IAM", "IIM", "ILT", "IN1", "IN2",
"IN3", "INV", "IPC", "IPR", "ISD", "ITM", "IVC", "IVT", "LAN", "LCC",
"LCH", "LDP", "LOC", "LRL", "MFA", "MFE", "MFI", "MRG", "MSA", "MSH",
"NCK", "NDS", "NK1", "NPU", "NSC", "NST", "NTE", "OBR", "OBX", "ODS",
"ODT", "OM1", "OM2", "OM3", "OM4", "OM5", "OM6", "OM7", "ORC", "ORG",
"OVR", "PCE", "PCR", "PD1", "PDA", "PDC", "PEO", "PES", "PID", "PKG",
"PMT", "PR1", "PRA", "PRB", "PRC", "PRD", "PSG", "PSH", "PSL", "PSS",
"PTH", "PV1", "PV2", "PYE", "QAK", "QID", "QPD", "QRD", "QRF", "QRI",
"RCP", "RDF", "RDT", "REL", "RF1", "RFI", "RGS", "RMI", "ROL", "RQ1",
"RQD", "RXA", "RXC", "RXD", "RXE", "RXG", "RXO", "RXR", "SAC", "SCD",
"SCH", "SCP", "SDD", "SFT", "SID", "SLT", "SPM", "STF", "STZ", "TCC",
"TCD", "TQ1", "TQ2", "TXA", "UAC", "UB1", "UB2", "URD", "URS", "VAR",
"VND"
);
foreach (@validSegHdrs) {
    $input =~ s/$_/\r$_/g;
}
print $input;
For what it's worth, I'm working with HL7. HL7 consists of "segments" each on its own line. The segment beginning with "MSH" is always first, and there must be a carriage return preceding each additional segment.
My input may have line breaks (or carriage returns) in the middle of a segment, which is not allowed. I also may encounter a new segment beginning on the same line as another one, which is also not allowed.
I intend to parse the input, first stripping all line breaks, then finding any matches of valid segment headers and inserting a carriage return before them. I have defined an array with all valid segment headers, and am attempting to use a foreach loop to do a simple search and replace to insert the \r before each match. I think it may be a good idea to match each string plus '|', e.g. match on 'PV1|', to be more precise.
I'm not getting the expected output, so I humbly ask for some expertise. Thanks much!

my @validSegHdrs = ( "ABS", # .....
);
my $regex = join ("|", @validSegHdrs);
while (<>) {
    s/\R/ /g;
    s/($regex)/\r$1/g;
    print;
}
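The asker's own idea of matching each header together with the following '|' fits naturally with the single-alternation approach. The following is only a sketch in Python under stated assumptions: the header list is a hypothetical five-item subset of the real one, and the name split_segments is made up for illustration.

```python
import re

# Hypothetical subset; the full list of HL7 segment headers is much longer.
VALID_SEG_HDRS = ["MSH", "PID", "PV1", "ORC", "OBR"]

# One alternation for all headers; the lookahead requires the '|' field
# separator right after the header without consuming it.
SEG_RE = re.compile("(?:" + "|".join(map(re.escape, VALID_SEG_HDRS)) + r")(?=\|)")


def split_segments(raw):
    """Drop stray line breaks, then put a carriage return before each header."""
    flat = re.sub(r"\r\n|\r|\n", "", raw)
    marked = SEG_RE.sub(lambda m: "\r" + m.group(0), flat)
    # MSH is always first, so the leading carriage return is not wanted.
    return marked.lstrip("\r")
```

Requiring the '|' reduces false positives but does not remove them: a field value that happens to end in a header name directly before a '|' would still be split.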

I used this script from the command line:
perl -e 'print "\n"; local $/; $in=<>; $in=~s/\R//g; my @blk = qw(ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1 PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2 PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND); $in=~s/$_/\n$_/ for @blk; print $in, "\n";'
And got this output:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NE
PID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230
ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
If the script were written indented, it would look like this:
local $/;
$in=<>;
$in=~s/\R//g;
my @blk = qw(
ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP
AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS
CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN
FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM
IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC
NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1
PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2
PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA
RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD
TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND);
$in=~s/$_/\n$_/ for @blk;
print $in, "\n";
You would replace the \n with a \r, I guess.
I don't know what the real difference between our scripts is, but it works for me.
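Do note that using a hash could be more efficient, since each lookup is O(1) instead of O(n), where n is the number of header sequences:
my %hash = map { $_ => 1 } @blk;
# Test every 3-character window; if it is a header sequence, insert a newline before it.
# The lookahead keeps the match zero-width, so a header glued to preceding
# alphanumerics (e.g. the PID in "NEPID") is not skipped.
$in =~ s/(?= ( [A-Z0-9]{3} ) )/ $hash{$1} ? "\n" : "" /xeg;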
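To make the hash-lookup idea concrete, here is a rough Python equivalent. It is a sketch, not the answerer's code: HEADERS is a hypothetical subset of the header list, mark_headers is an illustrative name, and the zero-width lookahead is a deliberate choice so that every overlapping three-character window is tested.

```python
import re

# Hypothetical subset of the header list; set membership tests are O(1).
HEADERS = frozenset(["MSH", "PID", "PV1", "ORC", "OBR"])

# Zero-width match at every position that is followed by three characters
# from A-Z/0-9; group 1 holds that three-character window.
WINDOW_RE = re.compile(r"(?=([A-Z0-9]{3}))")


def mark_headers(text):
    """Insert a newline before every three-character window that is a known header."""
    return WINDOW_RE.sub(lambda m: "\n" if m.group(1) in HEADERS else "", text)
```

A pattern that consumes three characters at a time would read "NEPID" as "NEP" followed by "ID" and never test "PID"; the lookahead avoids that at the cost of checking more windows.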

Related

Unicode letters with more than 1 alphabetic latin character?

I'm not really sure how to express it, but I'm searching for Unicode letters that look like more than one Latin letter.
I found this in Word so far:
Ǳ
ǲ
ǳ
Ǌ
ǈ
Ǉ
ǋ
ǌ
Any others?

Here are some of the characters I've found. I first did this manually by looking at some probable blocks. However, I later wrote a Python script to do this automatically; you can find it at the end of this answer.
Digraphs
Two Glyphs | Digraph | Unicode Code Point | HTML
DZ, Dz, dz | Ǳ, ǲ, ǳ | U+01F1 U+01F2 U+01F3 | Ǳ ǲ ǳ
DŽ, Dž, dž | Ǆ, ǅ, ǆ | U+01C4 U+01C5 U+01C6 | Ǆ ǅ ǆ
IJ, ij | Ĳ, ĳ | U+0132 U+0133 | Ĳ ĳ
LJ, Lj, lj | Ǉ, ǈ, ǉ | U+01C7 U+01C8 U+01C9 | Ǉ ǈ ǉ
NJ, Nj, nj | Ǌ, ǋ, ǌ | U+01CA U+01CB U+01CC | Ǌ ǋ ǌ
Ligatures
Non-ligature | Ligature | Unicode | HTML
AA, aa | Ꜳ, ꜳ | U+A732, U+A733 | Ꜳ ꜳ
AE, ae | Æ, æ | U+00C6, U+00E6 | Æ æ
AO, ao | Ꜵ, ꜵ | U+A734, U+A735 | Ꜵ ꜵ
AU, au | Ꜷ, ꜷ | U+A736, U+A737 | Ꜷ ꜷ
AV, av | Ꜹ, ꜹ | U+A738, U+A739 | Ꜹ ꜹ
AV, av (with bar) | Ꜻ, ꜻ | U+A73A, U+A73B | Ꜻ ꜻ
AY, ay | Ꜽ, ꜽ | U+A73C, U+A73D | Ꜽ ꜽ
et | 🙰 | U+1F670 | 🙰
ff | ﬀ | U+FB00 | ﬀ
ffi | ﬃ | U+FB03 | ﬃ
ffl | ﬄ | U+FB04 | ﬄ
fi | ﬁ | U+FB01 | ﬁ
fl | ﬂ | U+FB02 | ﬂ
OE, oe | Œ, œ | U+0152, U+0153 | Œ œ
OO, oo | Ꝏ, ꝏ | U+A74E, U+A74F | Ꝏ ꝏ
ſs, ſz | ẞ, ß | U+1E9E, U+00DF | ß
st | ﬆ | U+FB06 | ﬆ
ſt | ﬅ | U+FB05 | ﬅ
TZ, tz | Ꜩ, ꜩ | U+A728, U+A729 | Ꜩ ꜩ
ue | ᵫ | U+1D6B | ᵫ
VY, vy | Ꝡ, ꝡ | U+A760, U+A761 | Ꝡ ꝡ
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:
Non-ligature | Ligature | Unicode | HTML
db | ȸ | U+0238 | ȸ
dz | ʣ | U+02A3 | ʣ
IJ, ij | Ĳ, ĳ | U+0132, U+0133 | Ĳ ĳ
ls | ʪ | U+02AA | ʪ
lz | ʫ | U+02AB | ʫ
qp | ȹ | U+0239 | ȹ
ts | ʦ | U+02A6 | ʦ
ui | ꭐ | U+AB50 | ꭐ
turned ui | ꭑ | U+AB51 | ꭑ
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols besides ℻ and ℡, like the ones the OP found in the comment:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+338x ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱... Probably the ones with most characters are ㎉ and ㎯
Unicode even has code points for Roman numerals. Here another 4-letter-like character can be found: Ⅷ
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+215x ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
If normal numbers can be considered then there are some other code points for multiple digits like ⒆ ⒇ ⓳ ⓴ in enclosed alphanumerics
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+246x ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+240x ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be considered uppercase i or lowercase L (like your 〷 example which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL) and we have
Vai syllable see ꔖ 0xa516
Large triple vertical bar operator ⫼ 0x2afc
Counting rod tens digit three: 𝍫 0x1d36b
Suzhou numerals 〢 〣
Chinese river 川
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
import unicodedata
for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():
        print("U+%04X (%s): %s\n" % (c, chr(c), d))
It won't be able to find many ligatures like æ or œ because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result in Unicode 11.0.0 (checked with unicodedata.unidata_version)
U+0132 (Ĳ): IJ
U+0133 (ĳ): ij
U+01C7 (Ǉ): LJ
U+01C8 (ǈ): Lj
U+01C9 (ǉ): lj
U+01CA (Ǌ): NJ
U+01CB (ǋ): Nj
U+01CC (ǌ): nj
U+01F1 (Ǳ): DZ
U+01F2 (ǲ): Dz
U+01F3 (ǳ): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ﬀ): ff
U+FB01 (ﬁ): fi
U+FB02 (ﬂ): fl
U+FB03 (ﬃ): ffi
U+FB04 (ﬄ): ffl
U+FB05 (ﬅ): st
U+FB06 (ﬆ): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
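Since the decomposition-based script cannot see characters that have no compatibility decomposition, one complementary approach is to search the formal character names instead. The following is only a sketch: the function name is made up, and filtering on the words LIGATURE and DIGRAPH is an assumption about which names are interesting, not an exhaustive rule.

```python
import unicodedata


def latin_ligatures_by_name(limit=0x110000):
    """Return (code point, name) pairs for Latin characters named as ligatures or digraphs."""
    hits = []
    for cp in range(limit):
        # Unassigned code points and surrogates have no name; "" is the fallback.
        name = unicodedata.name(chr(cp), "")
        if name.startswith("LATIN") and ("LIGATURE" in name or "DIGRAPH" in name):
            hits.append((cp, name))
    return hits


if __name__ == "__main__":
    for cp, name in latin_ligatures_by_name():
        print("U+%04X (%s): %s" % (cp, chr(cp), name))
```

This picks up œ (LATIN SMALL LIGATURE OE) and the phonetic digraphs such as ʣ, but æ is still missed, because its formal name is LATIN SMALL LETTER AE.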

Add a label indicating duplicate names [duplicate]

I tried to use
sed 's/ */:/' file | awk -F: '{ if (arr[$1":"$2]) print "\""$1"\":"$2; else { arr[$1":"$2]++; print $0 }}'
but it does not produce the desired output. Thanks.
The following is the file information and the desired output that I want.
Text File:
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
Karen Evich:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Required output: Add stars indicating duplicated names
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
*Karen Evich*:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500

Give this a try; it seems to work OK.
$ awk -F":" 'NR==FNR{a[$1]++;next}(a[$1]>1){sub($1,"*" $1 "*")}1' file1 file1
Explanation:
This code reads the same file twice, which may carry a performance penalty depending on the file size.
-F":" : Sets the input field delimiter to : globally.
NR==FNR{a[$1]++;next} : The code in { } is executed while NR==FNR, i.e. while awk is reading the first file.
a[$1]++ : Creates an array a indexed by $1 and increments its value by 1 for each $1 found. So for record 1 we have a[Jon DeLoach]=1, for record 2 a[Karen Evich]=1, for record 3 a[Karen Evich]++ => 2, etc.
next : Instructs awk to go to the next record and skip the rest of the script.
(a[$1]>1){sub($1,"*" $1 "*")}1 : This condition and action are applied on the second pass over the file. For each $1 whose count a[$1] is greater than 1 (the counts are final once the first pass has finished), we insert * around $1 using awk's sub function, which applies the substitution directly to $0, the whole record.
1 : Prints the whole record of the second file.
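The same two-pass idea (count names first, then mark the repeated ones) can be sketched in Python. This is an illustration rather than a replacement for the awk one-liner; star_duplicates is a made-up name and the records are assumed to be already read into a list of strings.

```python
from collections import Counter


def star_duplicates(lines):
    """Wrap the name field in * for every record whose name occurs more than once."""
    # Pass 1: count how often each name (the text before the first ':') occurs.
    counts = Counter(line.split(":", 1)[0] for line in lines)
    # Pass 2: rewrite only the records whose name was seen more than once.
    out = []
    for line in lines:
        name, sep, rest = line.partition(":")
        if counts[name] > 1:
            name = "*" + name + "*"
        out.append(name + sep + rest)
    return out
```

Because the name is split off with partition rather than matched as a pattern, names containing regex metacharacters need no special handling.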

matlab import binary data [duplicate]

I'd like to import binary data into MATLAB. The specification of the binary data is:
First Byte: Start of package
Second Byte: Command Value
Command Data: (consisting of:)
Data format:
"%1B(Hours)%1B(Minutes)%4F(Seconds)%4F(NormAccelX)%4F(NormAccelY)%4F(NormAccelZ)%4F(OrientPitch)%4F(OrientYaw)%4F(OrientRoll)%4F(UOrientPitch)%4F(UOrientYaw)%4F(UOrientRoll)%4F(GyroX)%4F(GyroY)%4F(GyroZ)%4U(ChipTimeMS)%4U(ChipTimeMS)%4F(RawGyroX)%4F(RawGyroY)%4F(RawGyroZ)%4F(RawAccelX)%4F(RawAccelY)%4F(RawAccelZ)"
Last Byte: Checksum (sum of all other bytes except first)
Data is stored in big-endian format!
Now I would like to read the data into an array in MATLAB. The data is the output of an IMU.
I would really appreciate some help! Thanks in advance!
[The first few lines of the binary file were pasted here; as raw binary they are not readable as text.]

The fread function should do the trick:
http://www.mathworks.com/help/matlab/ref/fread.html
You could use the following:
binData = fread(fileID, sizeA, '*bit8', 0, 'b');
To get sizeA for the file, see the answers to this question:
How do you get the size of a file in MATLAB?
Use the bytes field of the struct s returned by dir:
s = dir(filename);
sizeA = s.bytes;
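Reading everything as 8-bit values still leaves the mixed field types to decode. As a language-neutral illustration of the record layout from the question, here is a Python sketch. The type mapping (%1B as an unsigned byte, %4F as a 32-bit float, %4U as an unsigned 32-bit integer), the 8-bit wrap-around of the checksum, and the dictionary keys are all assumptions, not something the question confirms.

```python
import struct

# Assumed field encodings (not stated explicitly in the question):
#   %1B = unsigned 8-bit integer, %4F = 32-bit IEEE-754 float,
#   %4U = unsigned 32-bit integer; ">" selects big-endian with no padding.
# Layout: start, command, hours, minutes, 13 floats (Seconds ... GyroZ),
#         2 x ChipTimeMS, 6 floats (RawGyro/RawAccel), checksum.
RECORD = struct.Struct(">BB BB 13f 2I 6f B")


def parse_record(buf):
    """Decode one fixed-size record into a dict (key names are illustrative)."""
    fields = RECORD.unpack(buf)
    # Assumption: the checksum is the 8-bit wrapped sum of every byte except
    # the first (start) byte and the checksum byte itself.
    expected = sum(buf[1:-1]) & 0xFF
    return {
        "start": fields[0],
        "command": fields[1],
        "hours": fields[2],
        "minutes": fields[3],
        "seconds": fields[4],
        "floats": fields[5:17],         # NormAccelX ... GyroZ
        "chip_time_ms": fields[17:19],  # the two ChipTimeMS values
        "raw": fields[19:25],           # RawGyroX ... RawAccelZ
        "checksum_ok": fields[25] == expected,
    }
```

Under these assumptions one record is 89 bytes long. In MATLAB the equivalent would be a sequence of fread calls with the 'uint8', 'float32' and 'uint32' precisions and the 'b' machine format.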

How to read reference line (start with RN,RT,RA,RC,RX,RP,RL) and print all

Hello Everyone,
I have a problem with a Perl module that I am using to retrieve specific lines from a flat file containing multiple sets of information, as shown in the code below (it is based on the example code for Bio::Parse::SwissProt.pm). Whenever I run this code, the refs statement fails with the error: Modification of a read-only value attempted at c:/wamp/bin/perl/site/lib/bio/parse/swissprot.pm line 345. The input file looks like this:
Input file (flat file):
ID P72354_STAAU Unreviewed; 575 AA.
AC P72354;
DT 01-FEB-1997, integrated into UniProtKB/TrEMBL.
DT 01-FEB-1997, sequence version 1.
DT 29-MAY-2013, entry version 79.
DE SubName: Full=ATP-binding cassette transporter A;
GN Name=abcA;
OS Staphylococcus aureus.
OC Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcus.
OX NCBI_TaxID=1280;
RN [1]
RP NUCLEOTIDE SEQUENCE.
RC STRAIN=NCTC 8325;
RX PubMed=8878592;
RA Henze U.U., Berger-Bachi B.;
RT "Penicillin-binding protein 4 overproduction increases beta-lactam
RT resistance in Staphylococcus aureus.";
RL Antimicrob. Agents Chemother. 40:2121-2125(1996).
RN [2]
RP NUCLEOTIDE SEQUENCE.
RC STRAIN=NCTC 8325;
RX PubMed=9158759;
RA Henze U.U., Roos M., Berger-Bachi B.;
RT "Effects of penicillin-binding protein 4 overproduction in
RT Staphylococcus aureus.";
RL Microb. Drug Resist. 2:193-199(1996).
CC -!- SIMILARITY: Belongs to the ABC transporter superfamily.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; X91786; CAA62898.1; -; Genomic_DNA.
DR ProteinModelPortal; P72354; -.
DR SMR; P72354; 335-571.
DR GO; GO:0016021; C:integral to membrane; IEA:InterPro.
DR GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR GO; GO:0042626; F:ATPase activity
DR GO; GO:0006200; P:ATP catabolic process; IEA:GOC.
DR InterPro; IPR003593; AAA+_ATPase.
DR InterPro; IPR003439; ABC_transporter-like.
DR InterPro; IPR017871; ABC_transporter_CS.
DR InterPro; IPR017940; ABC_transporter_type1.
DR InterPro; IPR001140; ABC_transptr_TM_dom.
DR InterPro; IPR011527; ABC_transptrTM_dom_typ1.
DR InterPro; IPR027417; P-loop_NTPase.
DR Pfam; PF00664; ABC_membrane; 1.
DR Pfam; PF00005; ABC_tran; 1.
DR SMART; SM00382; AAA; 1.
DR SUPFAM; SSF90123; ABC_TM_1; 1.
DR SUPFAM; SSF52540; SSF52540; 1.
DR PROSITE; PS50929; ABC_TM1F; 1.
DR PROSITE; PS00211; ABC_TRANSPORTER_1; 1.
DR PROSITE; PS50893; ABC_TRANSPORTER_2; 1.
PE 3: Inferred from homology;
KW ATP-binding; Nucleotide-binding.
SQ SEQUENCE 575 AA; 64028 MW; F7E30A85971719B9 CRC64;
MKRENPLFFL FKKLSWPVGL IVAAITISSL GSLSGLLVPL FTGRIVDKFS VSHINWNLIA
LFGGIFVINA LLSGLGLYLL SKIGEKIIYA IRSVLWEHII QLKMPFFDKN ESGQLMSRLT
DDTKVINEFI SQKLPNLLPS IVTLVGSLIM LFILDWKMTL LTFITIPIFV LIMIPLGRIM
QKISTSTQSE IANFSGLLGR VLTEMRLVKI SNTERLELDN AHKNLNEIYK LGLKQAKIAA
VVQPISGIVM LLTIAIILGF GALEIATGAI TAGTLIAMIF YVIQLSMPLI NLSTLVTDYK
KAVGASSRIY EIMQEPIEPT EALEDSENVL IDDGVLSFEH VDFKYDVKKI LDDVSFQIPQ
GQVSAFVGPS GSGKSTIFNL IERMYEIESG DIKYGLESVY DIPLSKWRRK IGYVMQSNSM
MSGTIRDNIL YGINRHVSDE ELINYAKLAN CHDFIMQFDE GYDTLVGERG LKLSGGQRQR
IDIARSFVKN PDILLLDEAT ANLDSESELK IQEALETLME GRTTIVIANR LSTIKKAGQI
IFLDKGQVTG KGTHSELMAS HAKYKNFVVS QKLTD
//
Script part:
#!C:/wamp/bin/perl/bin/perl.exe
use strict;
use warnings;
use Data::Dumper;
use SWISS::Entry;
use Bio::Parse::SwissProt;
my $sp = Bio::Parse::SwissProt->new(FILE =>"me.txt")or die $!;
# Read in all the entries and fill %entries
my $entry_name = $sp->entry_name( );
print "$entry_name\n";
my $seq_len = $sp->seq_len( );
print "$seq_len\n";
$refs = $sw->refs();
$refs = $sw->refs(TITLE => 1, AUTH => 1);
for my $i (0..$#{$refs}) {
    print "@{$refs->[$i]}\n";
}
The output should look like this:
[1]
NUCLEOTIDE SEQUENCE.
STRAIN=NCTC 8325;
PubMed=8878592;
Henze U.U., Berger-Bachi B.;
"Penicillin-binding protein 4 overproduction increases beta-lactam
resistance in Staphylococcus aureus.";
Antimicrob. Agents Chemother. 40:2121-2125(1996).
[2]
NUCLEOTIDE SEQUENCE.
STRAIN=NCTC 8325;
PubMed=9158759;
Henze U.U., Roos M., Berger-Bachi B.;
"Effects of penicillin-binding protein 4 overproduction in
Staphylococcus aureus.";
Microb. Drug Resist. 2:193-199(1996).

After some searching on the internet, it appears that you are using SWISS::Entry from the Swissknife package, and it appears you (or someone) downloaded Bio::Parse::SwissProt as an independent project (not part of BioPerl) from sourceforge. I am not familiar with either of these projects, but you can get the information you want by simply using Bio::SeqIO from BioPerl. Here is an example to get the refs:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;
my $usage = "perl $0 swiss-file\n";
my $infile = shift or die $usage;
my $io = Bio::SeqIO->new(-file => $infile, -format => 'swiss');
my $seqio = $io->next_seq;
my $anno_collection = $seqio->annotation;
for my $key ( $anno_collection->get_all_annotation_keys ) {
    my @annotations = $anno_collection->get_Annotations($key);
    for my $value ( @annotations ) {
        # Only reference annotations carry authors, title, location, etc.
        if ($value->tagname eq "reference") {
            my $hash_ref = $value->hash_tree;
            for my $key (keys %{$hash_ref}) {
                print $key,": ",$hash_ref->{$key},"\n" if defined $hash_ref->{$key};
            }
        }
    }
}
Running this gives the information you wanted:
authors: Henze U.U., Berger-Bachi B.
location: Antimicrob. Agents Chemother. 40:2121-2125(1996).
title: "Penicillin-binding protein 4 overproduction increases beta-lactam resistance in Staphylococcus aureus."
pubmed: 8878592
authors: Henze U.U., Roos M., Berger-Bachi B.
location: Microb. Drug Resist. 2:193-199(1996).
title: "Effects of penicillin-binding protein 4 overproduction in Staphylococcus aureus."
pubmed: 9158759
The BioPerl Feature Annotation HOWTO is a helpful page for parsing these types of files. If you want to fetch the entries and then parse them, you can use Bio::DB::Swissprot and add just a couple of lines of code to the above example. I know that is not an answer to your specific problem but it is a solution and you'll find that many people can help you with BioPerl.
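If only the text of the reference lines is needed, in the layout shown under the desired output in the question, a plain line filter may already be enough. Here is a sketch in Python, assuming each line starts with its two-letter line code followed by whitespace; reference_lines is a made-up helper name and no Swiss-Prot library is involved.

```python
# Two-letter line codes that make up a reference block in the flat file.
REF_CODES = {"RN", "RP", "RC", "RX", "RA", "RT", "RL"}


def reference_lines(lines):
    """Return the content of every reference line with its line code removed."""
    out = []
    for line in lines:
        # Split at the first space: the line code on the left, content on the right.
        code, _, rest = line.partition(" ")
        if code in REF_CODES:
            out.append(rest.strip())
    return out
```

Unlike the Bio::SeqIO approach, this does not join continued RT lines or group the fields per reference; it only reproduces the lines as they appear in the file.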

DNA to RNA and Getting Proteins with Perl

I am working on a project (which I have to implement in Perl, though I am not good at it) that reads a DNA sequence and finds its RNA, then divides that RNA into triplets to get the corresponding amino acids. I will explain the steps:
1) Transcribe the following DNA to RNA, then use the genetic code to translate it to a sequence of amino acids
Example:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
2) To transcribe the DNA, first substitute each DNA base with its counterpart (i.e., G for C, C for G, T for A and A for T):
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
AGTATTATGCAAAACATAAGCGGTCGCGAAGCCACA
Next, remember that the Thymine (T) bases become a Uracil (U). Hence our sequence becomes:
AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA
To use the genetic code, split the sequence into triplets like this:
AGU AUU AUG CAA AAC AUA AGC GGU CGC GAA GCC ACA
Then look up each triplet (codon) in the genetic code table. So AGU becomes Serine, which we can write as Ser, or just S. AUU becomes Isoleucine (Ile), which we write as I. Carrying on in this way, we get:
SIMQNISGREAT
I will give the protein table:
So how can I write this in Perl? I will edit my question and add the code that I have written so far.

Try the script below. It accepts input on STDIN (or from a file given as a parameter) and reads it line by line. I also presume that "STOP" in the attached image is a stop state. I hope I read everything correctly from that picture.
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
LINE: while (<>) {
    chomp;
    y/GCTA/CGAU/; # translate (point 1&2 mixed)
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
        else {
            print "Whoops, stop state?\n";
            next LINE;
        }
    }
    print "\n"
}
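For comparison, the same steps (complement and T-to-U in one pass, then a codon lookup) can be sketched in Python. The function names are illustrative, the codon table is the standard genetic code written out compactly, and stopping quietly at a stop codon is a design choice of this sketch rather than something taken from the script above.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (UUU, UUC, UUA, UUG, UCU, ...); '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Complement each base and write U instead of T, as in steps 1 and 2 of the question."""
    return dna.translate(str.maketrans("GCTA", "CGAU"))


def translate(rna):
    """Translate codon by codon; stop at the first stop codon, ignore a trailing partial codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

With the example from the question, translate(transcribe("TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT")) gives SIMQNISGREAT.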
