substituting chemical atomic numbers using sed - sed

I am trying to substitute some patterns of atomic numbers in a single file. The file contains a series of atomic numbers in a column, as shown in the first column below. I want to replace each entry of the first column with the corresponding entry of the second column, line by line.
C1 C21
C2 C22
C4 C23
C5 C24
C6 C25
C7 C26
C8 C27
C9 C28
C10 C29
C11 C30
C12 C31
C13 C32
C14 C33
O1 O11
O2 O12
O3 O13
O4 O14
O5 O15
O6 O16
H1 H31
H2 H32
H3 H33
H4 H34
H5 H35
H6 H36
H7 H37
H8 H38
H9 H39
H10 H40
H11 H41
H12 H42
H13 H43
H14 H44
H15 H45
H16 H46
H17 H47
H18 H48
H19 H49
H20 H50
H21 H51
H22 H52
H23 H53
H24 H54
H25 H55
H26 H56
H27 H57
H28 H58
To achieve this I tried the sed command as below
sed -i -e 's/C1/C21/;s/C2/C22/;s/C3/C23/;s/C4/C24/;s/C5/C25/;s/C6/C26/;s/C7/C27/;s/C8/C28/;s/C9/C29/;s/C10/C30/;s/C11/C31/;s/C12/C32/;s/C13/C33/;s/C14/C34/;s/O1/O11/;s/O2/O12/;s/O3/O13/;s/O4/O14/;s/O5/O15/;s/O6/O16/;s/H1/H31/;s/H2/H32/;s/H3/H33/;s/H4/H34/;s/H5/H35/;s/H6/H36/;s/H7/H37/;s/H8/H38/;s/H9/H39/;s/H10/H40/;s/H11/H41/;s/H12/H42/;s/H13/H43/;s/H14/H44/;s/H15/H45/;s/H16/H46/;s/H17/H47/;s/H18/H48/;s/H19/H49/;s/H20/H50/;s/H21/H51/;s/H22/H52/;s/H23/H53/;s/H24/H54/;s/H25/H55/;s/H26/H56/;s/H27/H57/;s/H28/H58/' FILE_NAME
Unfortunately, what I get is multiple substitutions, such as C3328.
Can anyone show me the correct way of doing this? Thanks in advance.

It's still not clear but I THINK this is what you want:
$ cat tst.awk
BEGIN { cnt["C"]=21; cnt["O"]=11; cnt["H"]=31 }
NF { c=substr($0,1,1); $0=c cnt[c]++ }
{ print }
.
$ awk -f tst.awk file
C21
C22
C23
C24
C25
C26
C27
C28
C29
C30
C31
C32
C33
O11
O12
O13
O14
O15
O16
H31
H32
H33
H34
H35
H36
H37
H38
H39
H40
H41
H42
H43
H44
H45
H46
H47
H48
H49
H50
H51
H52
H53
H54
H55
H56
H57
H58
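For comparison, the same counter-per-element idea can be sketched in Python. This is only a minimal sketch: it assumes single-letter element symbols (as the awk script does with substr($0,1,1)), and the function name renumber and the sample labels are made up for illustration.

```python
def renumber(labels, start):
    """Give each label the next free number for its element symbol."""
    counters = dict(start)          # copy, so the caller's dict is untouched
    out = []
    for label in labels:
        element = label[0]          # assumption: one-letter element symbols
        out.append(f"{element}{counters[element]}")
        counters[element] += 1
    return out


labels = ["C1", "C2", "C4", "O1", "H1", "H2"]
renumbered = renumber(labels, {"C": 21, "O": 11, "H": 31})
print(renumbered)
```

Because each label is rewritten exactly once from its element letter, there is no way for an already renumbered label to be substituted a second time.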

The problem is that sed will attempt to carry out all substitutions in order, which results in multiple substitutions. So you need to rearrange your substitutions from most specific to least specific. For example:
echo "C1" | sed -n 's/C1/C21/p; s/C2/C22/p; s/C3/C23/p'
C21
C221
echo "C1" | sed -n 's/C3/C23/p; s/C2/C22/p; s/C1/C21/p'
C21

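Putting [^0-9] after each pattern should work, because C1 then no longer matches the start of C10 or C21. To automate this, generate the sed script from the pattern file:
awk '$0{printf("s/%s\\([^0-9]\\)/%s\\1/g\n", $1, $2)}' <pattern-file >sedscr
Run this one-liner on the pattern file; cat sedscr should then show:
s/C1\([^0-9]\)/C21\1/g
s/C2\([^0-9]\)/C22\1/g
s/C4\([^0-9]\)/C23\1/g
...
After that, run sed with the generated script on your sample files. Note that [^0-9] needs one character after the label, so a label at the very end of a line is left unchanged.
sed -f sedscr sample-files...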
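Another way to avoid chained substitutions is to do all replacements in a single pass, so that replaced text is never scanned again. The following is a minimal Python sketch of that idea, not the method from the answers above; relabel, the shortened mapping and the sample text are all hypothetical.

```python
import re


def relabel(text, mapping):
    """Replace every whole label found in `mapping` in one pass over `text`."""
    # Longest labels first, so that C10 is tried before C1.
    labels = sorted(mapping, key=len, reverse=True)
    # The lookbehind/lookahead stop C1 from matching inside XC1 or C10.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + "|".join(map(re.escape, labels)) + r")(?![0-9])"
    )
    # Each match is looked up once; the replacement is not rescanned.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Hypothetical subset of the mapping and made-up sample data.
mapping = {"C1": "C21", "C2": "C22", "C10": "C29", "O1": "O11"}
sample = "C1 0.0\nC2 1.5\nC10 2.0\nO1 3.1\nC21 9.9\n"
result = relabel(sample, mapping)
print(result)
```

Unlike the [^0-9] approach, the lookahead (?![0-9]) also matches at the end of a line, since it only requires that no digit follows.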


Unicode letters with more than 1 alphabetic latin character?

I'm not really sure how to express it, but I'm searching for Unicode characters that look like more than one Latin letter.
I found this in Word so far:
DZ
Dz
dz
NJ
Lj
LJ
Nj
nj
Any others?
Here are some of the characters I've found. I first did this manually by looking at some probable blocks. Later I wrote a Python script to do this automatically, which you can find at the end of this answer.
Digraphs
Two Glyphs | Digraph | Unicode Code Point | HTML
DZ, Dz, dz | DZ, Dz, dz | U+01F1 U+01F2 U+01F3 | DZ Dz dz
DŽ, Dž, dž | DŽ, Dž, dž | U+01C4 U+01C5 U+01C6 | DŽ Dž dž
IJ, ij | IJ, ij | U+0132 U+0133 | IJ ij
LJ, Lj, lj | LJ, Lj, lj | U+01C7 U+01C8 U+01C9 | LJ Lj lj
NJ, Nj, nj | NJ, Nj, nj | U+01CA U+01CB U+01CC | NJ Nj nj
Ligatures
Non-ligature | Ligature | Unicode | HTML
AA, aa | Ꜳ, ꜳ | U+A732, U+A733 | Ꜳ ꜳ
AE, ae | Æ, æ | U+00C6, U+00E6 | Æ æ
AO, ao | Ꜵ, ꜵ | U+A734, U+A735 | Ꜵ ꜵ
AU, au | Ꜷ, ꜷ | U+A736, U+A737 | Ꜷ ꜷ
AV, av | Ꜹ, ꜹ | U+A738, U+A739 | Ꜹ ꜹ
AV, av (with bar) | Ꜻ, ꜻ | U+A73A, U+A73B | Ꜻ ꜻ
AY, ay | Ꜽ, ꜽ | U+A73C, U+A73D | Ꜽ ꜽ
et | 🙰 | U+1F670 | 🙰
f‌f | ff | U+FB00 | ff
f‌f‌i | ffi | U+FB03 | ffi
f‌f‌l | ffl | U+FB04 | ffl
f‌i | fi | U+FB01 | fi
f‌l | fl | U+FB02 | fl
OE, oe | Œ, œ | U+0152, U+0153 | Œ œ
OO, oo | Ꝏ, ꝏ | U+A74E, U+A74F | Ꝏ ꝏ
ſs, ſz | ẞ, ß | U+1E9E, U+00DF | ß
st | st | U+FB06 | st
ſt | ſt | U+FB05 | ſt
TZ, tz | Ꜩ, ꜩ | U+A728, U+A729 | Ꜩ ꜩ
ue | ᵫ | U+1D6B | ᵫ
VY, vy | Ꝡ, ꝡ | U+A760, U+A761 | Ꝡ ꝡ
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:
Non-ligature | Ligature | Unicode | HTML
db | ȸ | U+0238 | ȸ
dz | ʣ | U+02A3 | ʣ
IJ, ij | IJ, ij | U+0132, U+0133 | IJ ij
ls | ʪ | U+02AA | ʪ
lz | ʫ | U+02AB | ʫ
qp | ȹ | U+0239 | ȹ
ts | ʦ | U+02A6 | ʦ
ui | ꭐ | U+AB50 | ꭐ
turned ui | ꭑ | U+AB51 | ꭑ
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols besides ℻ and ℡, like the ones the OP found in the comment:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+XXXX | 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+338x | ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x | ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax | ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx | ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx | ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx | ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔ ㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱... Probably the ones with the most characters are ㎉ and ㎯.
Unicode even has code points for Roman numerals. Here another 4-letter-like character can be found: Ⅷ
U+XXXX | 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+215x | ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x | Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x | ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x | ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
If normal numbers can be considered then there are some other code points for multiple digits like ⒆ ⒇ ⓳ ⓴ in enclosed alphanumerics
U+XXXX | 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+246x | ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x | ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x | ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x | ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax | ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx | ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx | Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx | ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex | ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx | ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX | 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+240x | ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x | ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x | ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be considered uppercase i or lowercase L (like your 〷 example which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL) and we have
Vai syllable see ꔖ 0xa516
Large triple vertical bar operator ⫼ 0x2afc
Counting rod tens digit three: 𝍫 0x1d36b
Suzhou numerals 〢 〣
Chinese river 川
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
import unicodedata

for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():
        print("U+%04X (%s): %s\n" % (c, chr(c), d))
It won't be able to find many ligatures like æ or œ because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result in Unicode 11.0.0 (checked with unicodedata.unidata_version)
U+0132 (IJ): IJ
U+0133 (ij): ij
U+01C7 (LJ): LJ
U+01C8 (Lj): Lj
U+01C9 (lj): lj
U+01CA (NJ): NJ
U+01CB (Nj): Nj
U+01CC (nj): nj
U+01F1 (DZ): DZ
U+01F2 (Dz): Dz
U+01F3 (dz): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
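Since the NFKD script above skips characters without a compatibility decomposition, a complementary heuristic is to search the official character names instead. The sketch below is only an illustration under that assumption: named_ligatures and the keyword list are made up here, and the heuristic is not exhaustive. For example, œ is named LATIN SMALL LIGATURE OE and is found, while æ is named LATIN SMALL LETTER AE and is still missed.

```python
import unicodedata


def named_ligatures(keywords=("LIGATURE", "DIGRAPH")):
    """Yield (code point, name) for Latin characters whose name has a keyword."""
    for cp in range(0x110000):
        # Unnamed or unassigned code points get the default "".
        name = unicodedata.name(chr(cp), "")
        if name.startswith("LATIN") and any(k in name for k in keywords):
            yield cp, name


found = dict(named_ligatures())
for cp, name in sorted(found.items()):
    print("U+%04X (%s): %s" % (cp, chr(cp), name))
```

The exact list depends on the Unicode version bundled with the Python interpreter (see unicodedata.unidata_version).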

How to replace a value with "." in sed

I want to replace all instances of "a number followed by any number of spaces followed by a period and possibly more spaces" with the number and period only.
For example, '14 . x' will become '14.x'.
My test data is:
1. c4 e5 2. g3 c6 { good move. } 3. Bg2 Nf6 4. Nc3 $6 d5 5. cxd5 cxd5 6. Qb3 Nc6 $1.. Nxd5 Nd4
8. Nxf6+ Qxf6 9. Qd1.f5 10. d3 Rc8 (10... Bb4+ $5 11. Bd2 Bxd2+ 12. Qxd2 Qa6 $1.3. Rc1.xa2
14. Bxb7 $2 Rb8 15. Qb4 Bd7) 11. Kf1.c5 12. Nf3 O-O
How can I do that?
If you want any number of spaces removed from either side of the period, you should try s/\([0-9]\) *\. */\1./g:
$ echo '11. A 12 .B 13 . C 14.D 15 . E' | sed 's/\([0-9]\) *\. */\1./g'
11.A 12.B 13.C 14.D 15.E
For your test data, the results are:
1.c4 e5 2.g3 c6 { good move. } 3.Bg2 Nf6 4.Nc3 $6 d5 5.cxd5 cxd5 6.Qb3 Nc6 $1.. Nxd5 Nd4
8.Nxf6+ Qxf6 9.Qd1.f5 10.d3 Rc8 (10... Bb4+ $5 11.Bd2 Bxd2+ 12.Qxd2 Qa6 $1.3.Rc1.xa2
14.Bxb7 $2 Rb8 15.Qb4 Bd7) 11.Kf1.c5 12.Nf3 O-O
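The same substitution can be sketched with Python's re module, in case sed is not available. The function name tighten_periods is made up for this example; the pattern is a direct translation of the sed expression above.

```python
import re


def tighten_periods(text):
    """Remove spaces around a period that directly follows a digit."""
    # ([0-9]) keeps the digit; " *\. *" swallows the spaces and the period.
    return re.sub(r"([0-9]) *\. *", r"\1.", text)


cleaned = tighten_periods("11. A 12 .B 13 . C 14.D 15 . E")
print(cleaned)
```

As with the sed version, a period that does not follow a digit (for example in "{ good move. }") is left alone.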

diff ignore white spaces or the same string on a different line

I need to diff two files, but if the same lines appear in both files at different positions, I don't want any output.
Example:
File1:
cc aaaw
bb bbbw
aa cccw
File2:
cc aaaw
bb bbbw
aa cccw
diff file1 file2:
2d1
< bb bbbw
3a3
> bb bbbw
-> I don't want any output
But if file1 is the one above and file2 is:
cc aaaw
bb bbbw
aa cccw
ddddddd
I want this output:
4a5
> ddddddd
Thanks.
You might use diff -B to ignore empty/blank lines.
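diff -B only ignores blank lines. If the order of the lines should be ignored as well (an assumption about what the question asks), one option is to compare the two files as multisets of non-blank lines. The sketch below is not a diff replacement: unordered_diff and the sample data are hypothetical, and the output loosely imitates diff's < and > markers without line numbers.

```python
from collections import Counter


def line_multiset(lines):
    """Count each non-blank line, ignoring its position in the file."""
    return Counter(line.rstrip("\n") for line in lines if line.strip())


def unordered_diff(lines1, lines2):
    """Return (only in first, only in second), ignoring order and blank lines."""
    a, b = line_multiset(lines1), line_multiset(lines2)
    removed = sorted((a - b).elements())
    added = sorted((b - a).elements())
    return removed, added


# Made-up sample data; with real files, pass open(path) instead of a list.
file1 = ["cc aaaw\n", "\n", "bb bbbw\n", "aa cccw\n"]
file2 = ["cc aaaw\n", "aa cccw\n", "bb bbbw\n", "ddddddd\n"]
removed, added = unordered_diff(file1, file2)
for line in removed:
    print("< " + line)
for line in added:
    print("> " + line)
```

Using Counter rather than set means a line that occurs twice in one file and once in the other is still reported.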

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces and I need to put a space into every pair, eg
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g'
I also tried both with A-Z replaced by .., but it never splits the pairs as I'd like (it puts spaces before or after the pair, splits every other pair, or similar!).
I assume that this will work for you, however it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches whitespace (\s) followed by an upper-case character ([A-Z]) and puts them in a group (\(...\)), then matches another upper-case character and stores it in a second group. The match is then replaced by the first group (\1), a space, and the second group (\2). Since only letters are matched, the 00 field is left as it is.
NOTE:
This fails when you have sequences that are longer than 2 characters.
A solution using awk, which modifies only fields that are exactly two upper-case letters and might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[A-Z][A-Z]$/) {
            $i = substr($i, 1, 1) " " substr($i, 2, 1)
        }
    }
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This is actually easier in python than in awk:
echo caca | python -c 'import sys
for line in sys.stdin: print(" ".join(line.strip()))'
c a c a
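The one-liner above puts a space between every character. To split only the two-character fields, as the question asks, a minimal sketch could look like this; split_pairs is a made-up name, and treating every field of length 2 as a pair is an assumption about the data.

```python
def split_pairs(line):
    """Insert a space into every whitespace-separated field of length 2."""
    fields = line.split()
    # " ".join("CT") gives "C T"; other fields are kept unchanged.
    return " ".join(" ".join(f) if len(f) == 2 else f for f in fields)


row = "1 Affx-14150122 0 75891 00 CT TT CT TT CT"
spaced = split_pairs(row)
print(spaced)
```

Note that line.split() collapses runs of whitespace, so the original column alignment is not preserved.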

DNA to RNA and Getting Proteins with Perl

I am working on a project (which I have to implement in Perl, though I am not good at it) that reads a DNA sequence and finds its RNA. It then divides the RNA into triplets to get the corresponding amino acids. I will explain the steps:
1) Transcribe the following DNA to RNA, then use the genetic code to translate it to a sequence of amino acids
Example:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
2) To transcribe the DNA, first substitute each base for its counterpart (i.e., G for C, C for G, T for A and A for T):
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
AGTATTATGCAAAACATAAGCGGTCGCGAAGCCACA
Next, remember that the Thymine (T) bases become a Uracil (U). Hence our sequence becomes:
AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA
Split into triplets for the genetic code, this reads:
AGU AUU AUG CAA AAC AUA AGC GGU CGC GAA GCC ACA
Then look each triplet (codon) up in the genetic code table. So AGU becomes Serine, which we can write as Ser, or just S. AUU becomes Isoleucine (Ile), which we write as I. Carrying on in this way, we get:
SIMQNISGREAT
I will give the protein table:
So how can I write that code in Perl? I will edit my question and add the code I have written so far.
Try the script below. It accepts input on STDIN (or from a file given as a parameter) and reads it line by line. I also presume that "STOP" in the attached image is a stop state. I hope I read everything correctly from that picture.
#!/usr/bin/perl
use strict;
use warnings;

my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;

LINE: while (<>) {
    chomp;
    y/GCTA/CGAU/; # translate (point 1&2 mixed)
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
        else {
            print "Whoops, stop state?\n";
            next LINE;
        }
    }
    print "\n"
}
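For readers more comfortable with Python, the same two steps can be sketched as follows. This is not a translation of the Perl script line by line: the names transcribe and translate are made up, the codon table is built from the standard genetic code rather than copied from the question's image, and translation simply stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Complement each base and write U instead of T, as in the question."""
    return dna.upper().translate(str.maketrans("GCTA", "CGAU"))


def translate(rna):
    """Translate codon by codon; stop at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):   # a trailing partial codon is ignored
        amino = CODON_TABLE[rna[i:i + 3]]  # KeyError on non-ACGU characters
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


dna = "TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT"
rna = transcribe(dna)
protein = translate(rna)
print(rna)
print(protein)
```

For the example sequence from the question this reproduces the RNA and the amino acid string given there.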