I'm not really sure how to express it but I'm searching for unicode letters which are more than one visual latin letter.
I found this in Word so far:
DZ
Dz
dz
NJ
Lj
LJ
Nj
nj
Any others?
Here are some of the characters I've found. I'd first done this manually by looking at some probable blocks. However I've later written a Python script to do this automatically that you can find at the end of this answer
Digraphs
Two Glyphs
Digraph
Unicode Code Point
HTML
DZ, Dz, dz
DZ, Dz, dz
U+01F1 U+01F2 U+01F3
DZ Dz dz
DŽ, Dž, dž
DŽ, Dž, dž
U+01C4 U+01C5 U+01C6
DŽ Dž dž
IJ, ij
IJ, ij
U+0132 U+0133
IJ ij
LJ, Lj, lj
LJ, Lj, lj
U+01C7 U+01C8 U+01C9
LJ Lj lj
NJ, Nj, nj
NJ, Nj, nj
U+01CA U+01CB U+01CC
NJ Nj nj
Ligatures
Non-ligature
Ligature
Unicode
HTML
AA, aa
Ꜳ, ꜳ
U+A732, U+A733
Ꜳ ꜳ
AE, ae
Æ, æ
U+00C6, U+00E6
Æ æ
AO, ao
Ꜵ, ꜵ
U+A734, U+A735
Ꜵ ꜵ
AU, au
Ꜷ, ꜷ
U+A736, U+A737
Ꜷ ꜷ
AV, av
Ꜹ, ꜹ
U+A738, U+A739
Ꜹ ꜹ
AV, av (with bar)
Ꜻ, ꜻ
U+A73A, U+A73B
Ꜻ ꜻ
AY, ay
Ꜽ, ꜽ
U+A73C, U+A73D
Ꜽ ꜽ
et
🙰
U+1F670
🙰
ff
ff
U+FB00
ff
ffi
ffi
U+FB03
ffi
ffl
ffl
U+FB04
ffl
fi
fi
U+FB01
fi
fl
fl
U+FB02
fl
OE, oe
Œ, œ
U+0152, U+0153
Œ œ
OO, oo
Ꝏ, ꝏ
U+A74E, U+A74F
Ꝏ ꝏ
ſs, ſz
ẞ, ß
U+1E9E, U+00DF
ß
st
st
U+FB06
st
ſt
ſt
U+FB05
ſt
TZ, tz
Ꜩ, ꜩ
U+A728, U+A729
Ꜩ ꜩ
ue
ᵫ
U+1D6B
ᵫ
VY, vy
Ꝡ, ꝡ
U+A760, U+A761
Ꝡ ꝡ
There are a few other ligatures that are used for phonetic transcription but looks like Latin characters
Non-ligature
Ligature
Unicode
HTML
db
ȸ
U+0238
ȸ
dz
ʣ
U+02A3
ʣ
IJ, ij
IJ, ij
U+0132, U+0133
IJ ij
ls
ʪ
U+02AA
ʪ
lz
ʫ
U+02AB
ʫ
qp
ȹ
U+0239
ȹ
ts
ʦ
U+02A6
ʦ
ui
ꭐ
U+AB50
ꭐ
turned ui
ꭑ
U+AB51
ꭑ
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols beside ℻ and ℡ like what the OP found in the comment:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+XXXX
0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
U+338x
㎀
㎁
㎂
㎃
㎄
㎅
㎆
㎇
㎈
㎉
㎊
㎋
㎌
㎍
㎎
㎏
U+339x
㎐
㎑
㎒
㎓
㎔
㎕
㎖
㎗
㎘
㎙
㎚
㎛
㎜
㎝
㎞
㎟
U+33Ax
㎠
㎡
㎢
㎣
㎤
㎥
㎦
㎧
㎨
㎩
㎪
㎫
㎬
㎭
㎮
㎯
U+33Bx
㎰
㎱
㎲
㎳
㎴
㎵
㎶
㎷
㎸
㎹
㎺
㎻
㎼
㎽
㎾
㎿
U+33Cx
㏀
㏁
㏂
㏃
㏄
㏅
㏆
㏇
㏈
㏉
㏊
㏋
㏌
㏍
㏎
㏏
U+33Dx
㏐
㏑
㏒
㏓
㏔
㏕
㏖
㏗
㏘
㏙
㏚
㏛
㏜
㏝
㏞
㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱... Probably the ones with most characters are ㎉ and ㎯
Unicode even have codepoints for Roman numerals. Here another 4-letter-like character can be found: Ⅷ
U+XXXX
0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
U+215x
⅐
⅑
⅒
⅓
⅔
⅕
⅖
⅗
⅘
⅙
⅚
⅛
⅜
⅝
⅞
⅟
U+216x
Ⅰ
Ⅱ
Ⅲ
Ⅳ
Ⅴ
Ⅵ
Ⅶ
Ⅷ
Ⅸ
Ⅹ
Ⅺ
Ⅻ
Ⅼ
Ⅽ
Ⅾ
Ⅿ
U+217x
ⅰ
ⅱ
ⅲ
ⅳ
ⅴ
ⅵ
ⅶ
ⅷ
ⅸ
ⅹ
ⅺ
ⅻ
ⅼ
ⅽ
ⅾ
ⅿ
U+218x
ↀ
ↁ
ↂ
Ↄ
ↄ
ↅ
ↆ
ↇ
ↈ
↉
↊
↋
If normal numbers can be considered then there are some other code points for multiple digits like ⒆ ⒇ ⓳ ⓴ in enclosed alphanumerics
U+XXXX
0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
U+246x
①
②
③
④
⑤
⑥
⑦
⑧
⑨
⑩
⑪
⑫
⑬
⑭
⑮
⑯
U+247x
⑰
⑱
⑲
⑳
⑴
⑵
⑶
⑷
⑸
⑹
⑺
⑻
⑼
⑽
⑾
⑿
U+248x
⒀
⒁
⒂
⒃
⒄
⒅
⒆
⒇
⒈
⒉
⒊
⒋
⒌
⒍
⒎
⒏
U+249x
⒐
⒑
⒒
⒓
⒔
⒕
⒖
⒗
⒘
⒙
⒚
⒛
⒜
⒝
⒞
⒟
U+24Ax
⒠
⒡
⒢
⒣
⒤
⒥
⒦
⒧
⒨
⒩
⒪
⒫
⒬
⒭
⒮
⒯
U+24Bx
⒰
⒱
⒲
⒳
⒴
⒵
Ⓐ
Ⓑ
Ⓒ
Ⓓ
Ⓔ
Ⓕ
Ⓖ
Ⓗ
Ⓘ
Ⓙ
U+24Cx
Ⓚ
Ⓛ
Ⓜ
Ⓝ
Ⓞ
Ⓟ
Ⓠ
Ⓡ
Ⓢ
Ⓣ
Ⓤ
Ⓥ
Ⓦ
Ⓧ
Ⓨ
Ⓩ
U+24Dx
ⓐ
ⓑ
ⓒ
ⓓ
ⓔ
ⓕ
ⓖ
ⓗ
ⓘ
ⓙ
ⓚ
ⓛ
ⓜ
ⓝ
ⓞ
ⓟ
U+24Ex
ⓠ
ⓡ
ⓢ
ⓣ
ⓤ
ⓥ
ⓦ
ⓧ
ⓨ
ⓩ
⓪
⓫
⓬
⓭
⓮
⓯
U+24Fx
⓰
⓱
⓲
⓳
⓴
⓵
⓶
⓷
⓸
⓹
⓺
⓻
⓼
⓽
⓾
⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX
0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
U+240x
␀
␁
␂
␃
␄
␅
␆
␇
␈
␉
␊
␋
␌
␍
␎
␏
U+241x
␐
␑
␒
␓
␔
␕
␖
␗
␘
␙
␚
␛
␜
␝
␞
␟
U+242x
␠
␡
␢
␣

␥
␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be considered uppercase i or lowercase L (like your 〷 example which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL) and we have
Vai syllable see ꔖ 0xa516
Large triple vertical bar operator ⫼ 0x2afc
Counting rod tens digit three: 𝍫 0x1d36b
Suzhou numerals 〢 〣
Chinese river 川
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
import unicodedata
for c in range(0, 0x10FFFF + 1):
d = unicodedata.normalize('NFKD', chr(c))
if len(d) > 1 and d.isascii() and d.isalpha():
print("U+%04X (%s): %s\n" % (c, chr(c), d))
It won't be able to find many ligatures like æ or œ because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result in Unicode 11.0.0 (checked with unicodedata.unidata_version)
U+0132 (IJ): IJ
U+0133 (ij): ij
U+01C7 (LJ): LJ
U+01C8 (Lj): Lj
U+01C9 (lj): lj
U+01CA (NJ): NJ
U+01CB (Nj): Nj
U+01CC (nj): nj
U+01F1 (DZ): DZ
U+01F2 (Dz): Dz
U+01F3 (dz): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
I am trying to substitute some patterns of atomic numbers in a single file. That file contain a series of atomic numbers in a column as shown in the first column. Now I want to substitute the first column of numbers with the series of numbers as in the second column line after line.
C1 C21
C2 C22
C4 C23
C5 C24
C6 C25
C7 C26
C8 C27
C9 C28
C10 C29
C11 C30
C12 C31
C13 C32
C14 C33
O1 O11
O2 O12
O3 O13
O4 O14
O5 O15
O6 O16
H1 H31
H2 H32
H3 H33
H4 H34
H5 H35
H6 H36
H7 H37
H8 H38
H9 H39
H10 H40
H11 H41
H12 H42
H13 H43
H14 H44
H15 H45
H16 H46
H17 H47
H18 H48
H19 H49
H20 H50
H21 H51
H22 H52
H23 H53
H24 H54
H25 H55
H26 H56
H27 H57
H28 H58
To achieve this I tried the sed command as below
sed -i -e 's/C1/C21/;s/C2/C22/;s/C3/C23/;s/C4/C24/;s/C5/C25/;s/C6/C26/;s/C7/C27/;s/C8/C28/;s/C9/C29/;s/C10/C30/;s/C11/C31/;s/C12/C32/;s/C13/C33/;s/C14/C34/;s/O1/O11/;s/O2/O12/;s/O3/O13/;s/O4/O14/;s/O5/O15/;s/O6/O16/;s/H1/H31/;s/H2/H32/;s/H3/H33/;s/H4/H34/;s/H5/H35/;s/H6/H36/;s/H7/H37/;s/H8/H38/;s/H9/H39/;s/H10/H40/;s/H11/H41/;s/H12/H42/;s/H13/H43/;s/H14/H44/;s/H15/H45/;s/H16/H46/;s/H17/H47/;s/H18/H48/;s/H19/H49/;s/H20/H50/;s/H21/H51/;s/H22/H52/;s/H23/H53/;s/H24/H54/;s/H25/H55/;s/H26/H56/;s/H27/H57/;s/H28/H58/' FILE_NAME
Unfortunately, what I get is multiple substitutions like C3328 and so on.
Can anyone help me to address the correct way of doing this? Appreciate in advance.
It's still not clear but I THINK this is what you want:
$ cat tst.awk
BEGIN { cnt["C"]=21; cnt["O"]=11; cnt["H"]=31 }
NF { c=substr($0,1,1); $0=c cnt[c]++ }
{ print }
.
$ awk -f tst.awk file
C21
C22
C23
C24
C25
C26
C27
C28
C29
C30
C31
C32
C33
O11
O12
O13
O14
O15
O16
H31
H32
H33
H34
H35
H36
H37
H38
H39
H40
H41
H42
H43
H44
H45
H46
H47
H48
H49
H50
H51
H52
H53
H54
H55
H56
H57
H58
The problem is that sed will attempt to carry out all substitutions in order, which results in multiple substitutions. So you need to rearrange your substitutions from most specific to least specific. For example:
echo "C1" | sed -n 's/C1/C21/p; s/C2/C22/p; s/C3/C23/p'
C21
C221
echo "C1" | sed -n 's/C3/C23/p; s/C2/C22/p; s/C1/C21/p'
C21
put [^0-9] after each pattern should work fine, to automate this process:
awk '$0{printf("s/%s\\([^0-9]\\)/%s\\1/g\n", $1, $2)}' <pattern-file >sedscr
run this one-liner for the pattern file, cat sedscr, then you would get:
s/C1\([^0-9]\)/C21\1/g
s/C2\([^0-9]\)/C22\1/g
s/C4\([^0-9]\)/C23\1/g
...
after that you run sed with the generated script for your sample files.
sed -f sedscr sample-files...
perhaps someone can help me out. I need to do a search and replace on a given string, finding any occurance of one of a list of things, and inserting a carriage return before it.
I'm providing a sample string, and my attempt at solving the problem.
Sample Input:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NEPID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^
G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
And my attempt:
$/ = undef; #tells perl to ignore newlines when reading input
$input = <STDIN>; #read entire input into $input
$input =~ s/\R/ /g; #remove all newlines from input. \R matches \r, \n, \r\n
#validSegHdrs = ( "ABS", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS", "AL1",
"APR", "ARQ", "ACC", "ADD", "ADJ", "AFF", "AIG", "AIL", "AIP", "AIS",
"AL1", "APR", "ARQ", "ARV", "AUT", "BHS", "BLC", "BLG", "BPO", "BPX",
"BTS", "BTX", "CDM", "CER", "CM0", "CM1", "CM2", "CNS", "CON", "CSP",
"CSR", "CSS", "CTD", "CTI", "DB1", "DG1", "DMI", "DRG", "DSC", "DSP",
"ECD", "ECR", "EDU", "EQP", "EQU", "ERR", "EVN", "FAC", "FHS", "FT1",
"FTS", "GOL", "GP1", "GP2", "GT1", "IAM", "IIM", "ILT", "IN1", "IN2",
"IN3", "INV", "IPC", "IPR", "ISD", "ITM", "IVC", "IVT", "LAN", "LCC",
"LCH", "LDP", "LOC", "LRL", "MFA", "MFE", "MFI", "MRG", "MSA", "MSH",
"NCK", "NDS", "NK1", "NPU", "NSC", "NST", "NTE", "OBR", "OBX", "ODS",
"ODT", "OM1", "OM2", "OM3", "OM4", "OM5", "OM6", "OM7", "ORC", "ORG",
"OVR", "PCE", "PCR", "PD1", "PDA", "PDC", "PEO", "PES", "PID", "PKG",
"PMT", "PR1", "PRA", "PRB", "PRC", "PRD", "PSG", "PSH", "PSL", "PSS",
"PTH", "PV1", "PV2", "PYE", "QAK", "QID", "QPD", "QRD", "QRF", "QRI",
"RCP", "RDF", "RDT", "REL", "RF1", "RFI", "RGS", "RMI", "ROL", "RQ1",
"RQD", "RXA", "RXC", "RXD", "RXE", "RXG", "RXO", "RXR", "SAC", "SCD",
"SCH", "SCP", "SDD", "SFT", "SID", "SLT", "SPM", "STF", "STZ", "TCC",
"TCD", "TQ1", "TQ2", "TXA", "UAC", "UB1", "UB2", "URD", "URS", "VAR",
"VND"
);
foreach (#validSegHdrs) {
$input =~ s/$_/\r$_/g;
}
print $input;
-
For what it's worth, I'm working with HL7. HL7 consists of "segments" each on its own line. The segment beginning with "MSH" is always first, and there must be a carriage return preceding each additional segment.
My input may have line breaks (or carriage returns) in the middle of a segment, which is not allowed. I also may encounter a new segment beginning on the same line as another one, which is also not allowed.
I intend to parse the input, first strip all line breaks, and find any matches of valid segment headers, and insert a carriage return before them. I have defined an array with all valid segment headers, and am attempting to use a foreach loop to do a simple search and replace to insert the \r before each match. I think it may be a good idea to match for each string plus '|', eg match on 'PV1|' to be more precise.
I'm not getting the expected output, so I humbly ask for some expertise. Thanks much!
#validSegHdrs = ( "ABS", # .....
);
my $regex = join ("|", #validSegHdrs);
while (<>) {
s/\R/ /g;
s/($regex)/\r$1/g;
print;
}
I used this script from the command line:
perl -e 'print "\n"; local $/; $in=<>; $in=~s/\R//g; my #blk = qw(ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1 PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2 PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND); $in=~s/$_/\n$_/ for #blk; print $in, "\n";'
And got this output:
MSH|^~\&|PCM|A|NSG|A|20120613081122|DoNotBundle|ORM^O01|1133316|P|2.2|||AL|NE
PID|1|1234567^PI^PE|345235^ST02A^MR^A~02340395^ST02^PI||HSM^AERHART||19510418000000|F||||||||||1215200001^A|111-22-3333
PV1|1|I|CCU^W207^A^A||||12342^ALI^ROGERS^M^MD^MD|||SUR|||||||16532^ALI^ROGERS^M^MD^MD|INP||B|||||||||||||||||||A|||||20120531145230
ORC|PA|11109489^PCM|11109489^PCM|94986|SC||1^Continuous^INDEF^20120613081900^1||20120613081958|RGYIDDER^YIDDER^ROBERT^GSYSTEM ADM^SA||16532^ALI^ROGERS^MMD^MD|CCU||20120613081958|||CCU|RGYIDDER^YIDDER^ROBERT^G^SYSTEM ADM^SA
OBR|1|11109489^PCM|11109489^PCM|DNR ON^Hard of Hearing^NSG||20120613081122||||||||||16532^ALI^ROGERS^M^MD^MD|||||||||||1^Continuous^INDEF^20120613081900^1
If the script were written indented, it would look like this:
local $/;
$in=<>;
$in=~s/\R//g;
my #blk = qw(
ABS ACC ADD ADJ AFF AIG AIL AIP AIS AL1 APR ARQ ACC ADD ADJ AFF AIG AIL AIP
AIS AL1 APR ARQ ARV AUT BHS BLC BLG BPO BPX BTS BTX CDM CER CM0 CM1 CM2 CNS
CON CSP CSR CSS CTD CTI DB1 DG1 DMI DRG DSC DSP ECD ECR EDU EQP EQU ERR EVN
FAC FHS FT1 FTS GOL GP1 GP2 GT1 IAM IIM ILT IN1 IN2 IN3 INV IPC IPR ISD ITM
IVC IVT LAN LCC LCH LDP LOC LRL MFA MFE MFI MRG MSA MSH NCK NDS NK1 NPU NSC
NST NTE OBR OBX ODS ODT OM1 OM2 OM3 OM4 OM5 OM6 OM7 ORC ORG OVR PCE PCR PD1
PDA PDC PEO PES PID PKG PMT PR1 PRA PRB PRC PRD PSG PSH PSL PSS PTH PV1 PV2
PYE QAK QID QPD QRD QRF QRI RCP RDF RDT REL RF1 RFI RGS RMI ROL RQ1 RQD RXA
RXC RXD RXE RXG RXO RXR SAC SCD SCH SCP SDD SFT SID SLT SPM STF STZ TCC TCD
TQ1 TQ2 TXA UAC UB1 UB2 URD URS VAR VND);
$in=~s/$_/\n$_/ for #blk;
print $in, "\n";
You would replace the \n with a \r I guess.
I don't know what the real difference between our scripts is, but it works for me??
Do note that using a hash could be more efficient (O(n) → O(1) where n is the number of header sequences):
my %hash = map {$_ => 1} #blk;
# Test if $1 is a header sequence, if so, print newline
$in =~ s/( [A-Z0-9]{3} )/ $hash{$1} ? "\n$1" : $1 /xeg;
I am working on a project(I have to implement it in Perl but I am not good at it) that reads DNA and finds its RNA. Divide that RNA's into triplets to get the equivalent protein name of it. I will explain the steps:
1) Transcribe the following DNA to RNA, then use the genetic code to translate it to a sequence of amino acids
Example:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
2) To transcribe the DNA, first substitute each DNA for it’s counterpart (i.e., G for C, C for G, T for A and A for T):
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
AGTATTATGCAAAACATAAGCGGTCGCGAAGCCACA
Next, remember that the Thymine (T) bases become a Uracil (U). Hence our sequence becomes:
AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA
Using the genetic code is like that
AGU AUU AUG CAA AAC AUA AGC GGU CGC GAA GCC ACA
then look each triplet (codon) up in the genetic code table. So AGU becomes Serine, which we can write as Ser, or
just S. AUU becomes Isoleucine (Ile), which we write as I. Carrying on in this way, we get:
SIMQNISGREAT
I will give the protein table:
So how can I write that code in Perl? I will edit my question and write the code that what I did.
Try the script below, it accepts input on STDIN (or in file given as parameter) and read it by line. I also presume, that "STOP" in the image attached is some stop state. Hope I read it all well from that picture.
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
LINE: while (<>) {
chomp;
y/GCTA/CGAU/; # translate (point 1&2 mixed)
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
else {
print "Whoops, stop state?\n";
next LINE;
}
}
print "\n"
}
I am trying to use unicode variable names in g++.
It does not appear to work.
Does g++ not support unicode variable names, ... or is there some subset of unicode (from which I'm not testing in).
Thanks!
You have to specify the -fextended-identifiers flag when compiling, you also have to use \uXXXX or \uXXXXXXXX for unicode(atleast in gcc it's unicode)
Identifiers (variable/class names etc) in g++ can't be of utf-8/utf-16 or whatever encoding,
they have to be:
identifier:
nondigit
identifier nondigit
identifier digit
a nondigit is
nondigit: one of
universalcharactername
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
and a universalcharactername is
universalcharactername:
\UXXXXXXXX
\uXXXX
Thus, if you save your source file as UTF-8, you cannot have a variable like e.g.:
int høyde = 10;
it had to be written like:
int h\u00F8yde = 10;
(which imo would beat the whole purpose - so just stick with a-z)
A one-line patch to the cpp preprocessor allows UTF-8 input. Details for gcc are given at
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
however, since the preprocessor is shared, the same patch should work for g++ as well. In particular, the patch needed, as of gcc-5.2 is
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
Note that for the above patch to work, a recent version of iconv needs to be installed that supports C99 conversions. Type iconv --list to verify this, otherwise, you can install a new version of iconv along with gcc as described in the link above. Change the configure command to
$ ../gcc-5.2.0/configure -v --disable-multilib \
--with-libiconv-prefix=/usr/local/gcc-5.2 \
--prefix=/usr/local/gcc-5.2 \
--enable-languages="c,c++"
if you are building for x86 and want to include the c++ compiler as well.