I am post-processing a log file arranged in the following format:
Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.6 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.10 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.8 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.2 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.3 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.4 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.7 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.5 SarsCov2_structure31R_nsp5holo_rep1.pdb
1.9 SarsCov2_structure31R_nsp5holo_rep1.pdb
6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 ND2 SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/A UNL 1 N SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 2HD2 3.419 2.541
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 NE2 SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/A UNL 1 O SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 1HE2 2.883 2.159
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/? HIS 163 NE2 SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/A UNL 1 O no hydrogen
From this log I need to keep everything from the 3rd line onward, and then delete all duplicated occurrences of the pattern "SarsCov2_structure31R_nsp5holo_rep1.pdb". Can I use a regex with sed that detects any such name in the log (anything ending in .pdb), so that it is removed automatically for each processed log?
So the expected output should be:
Models used:
1.1
1.6
1.10
1.8
1.2
1.3
1.4
1.7
1.5
1.9
6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.3/? ASN 142 ND2 #1.3/A UNL 1 N #1.3/? ASN 142 2HD2 3.419 2.541
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 1HE2 2.883 2.159
#1.6/? HIS 163 NE2 #1.6/A UNL 1 O no hydrogen 3.299 N/A
#1.7/? GLN 189 NE2 #1.7/A UNL 1 O #1.7/? GLN 189 1HE2 3.109 2.147
#1.9/? ASN 142 ND2 #1.9/A UNL 1 O #1.9/? ASN 142 1HD2 3.032 2.319
#1.10/? GLN 189 NE2 #1.10/A UNL 1 O #1.10/? GLN 189 1HE2 3.054 2.125
Here is some example without regex, which does not work yet :-)
cat test.log | tail -n +2 | sed -e "/SarsCov2_structure31R_nsp5holo_rep1.pdb/d" >> ./test2.log
You may use this sed:
sed -E '1,2d; s/[[:blank:]]*[^[:blank:]]+\.pdb//g' file
Models used:
1.1
1.6
1.10
1.8
1.2
1.3
1.4
1.7
1.5
1.9
6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.3/? ASN 142 ND2 #1.3/A UNL 1 N #1.3/? ASN 142 2HD2 3.419 2.541
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 1HE2 2.883 2.159
#1.6/? HIS 163 NE2 #1.6/A UNL 1 O no hydrogen
Details:
1,2d: Removes first 2 lines
s/[[:blank:]]*[^[:blank:]]+\.pdb//g: Removes 0 or more blank characters followed by 1+ non-blank characters ending in .pdb, globally on each line
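For comparison, the same two steps (drop the first two lines, then delete every token ending in .pdb) can be sketched in Python. This is only an illustration under assumptions: the function name strip_pdb_names and its skip parameter are made up here, and unlike the sed command it also trims the blank left at the start of the H-bond lines.

```python
import re

# Same idea as the sed pattern: optional blanks, then a run of
# non-blank characters that ends in ".pdb".
PDB_TOKEN = re.compile(r"[ \t]*[^ \t\n]+\.pdb")


def strip_pdb_names(lines, skip=2):
    """Drop the first `skip` lines and remove every *.pdb token from the rest."""
    cleaned = []
    for line in lines[skip:]:
        # strip() also removes the blank that remains when the token
        # was the first thing on the line.
        cleaned.append(PDB_TOKEN.sub("", line).strip())
    return cleaned
```

Because the pattern only looks for the .pdb suffix, the sketch does not depend on the particular file name in the log.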
Using sed
$ sed 's/[[:alnum:]_]*\.pdb//g;1,2d' input_file
Models used:
1.1
1.6
1.10
1.8
1.2
1.3
1.4
1.7
1.5
1.9
6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.3/? ASN 142 ND2 #1.3/A UNL 1 N #1.3/? ASN 142 2HD2 3.419 2.541
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 1HE2 2.883 2.159
#1.6/? HIS 163 NE2 #1.6/A UNL 1 O no hydrogen
With your shown samples, please try the following awk code. The condition FNR>2 makes the commands inside the block run only from the third line onward. Inside the block, gsub globally replaces [[:space:]]*SarsCov2_structure31R_nsp5holo_rep1\.pdb with an empty string, and print then outputs the edited line.
awk '
FNR>2{
gsub(/[[:space:]]*SarsCov2_structure31R_nsp5holo_rep1\.pdb/,"")
print
}
' Input_file
I am dealing with a log consisting of many lines in the following format:
06I: 31 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.1/? THR 26 N #1.1/A UNL 1 O #1.1/? THR 26 H 3.515 2.716
#1.1/? ASN 142 ND2 #1.1/A UNL 1 O #1.1/? ASN 142 2HD2 3.227 2.305
#1.1/A UNL 1 N #1.1/? THR 26 O #1.1/A UNL 1 H 3.463 2.652
#1.2/A UNL 1 N #1.2/? PHE 140 O #1.2/A UNL 1 H 2.987 2.200
#1.4/? THR 26 N #1.4/A UNL 1 S #1.4/? THR 26 H 4.354 3.371
#1.4/? HIS 163 NE2 #1.4/A UNL 1 N no hydrogen 3.137 N/A
#1.4/A UNL 1 N #1.4/? ARG 188 O #1.4/A UNL 1 H 3.000 2.081
#1.5/? HIS 163 NE2 #1.5/A UNL 1 N no hydrogen 3.330 N/A
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 2HE2 3.029 2.132
#1.6/A UNL 1 N #1.6/? ARG 188 O #1.6/A UNL 1 H 2.984 2.064
#1.8/? ASN 142 ND2 #1.8/A UNL 1 N #1.8/? ASN 142 2HD2 3.164 2.395
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.031 2.180
#1.8/? GLN 189 NE2 #1.8/A UNL 1 O #1.8/? GLN 189 1HE2 3.276 2.553
#1.8/A UNL 1 N #1.8/? THR 190 O #1.8/A UNL 1 H 3.257 2.407
#1.9/A UNL 1 N #1.9/? THR 190 O #1.9/A UNL 1 H 2.913 2.037
#1.10/? SER 144 OG #1.10/A UNL 1 S #1.10/? SER 144 HG 4.246 3.845
#1.10/? HIS 163 NE2 #1.10/A UNL 1 S no hydrogen 3.700 N/A
#1.10/A UNL 1 N #1.10/? THR 190 O #1.10/A UNL 1 H 3.008 2.091
#1.12/? GLN 189 NE2 #1.12/A UNL 1 O #1.12/? GLN 189 1HE2 2.929 2.152
#1.12/A UNL 1 N #1.12/? PHE 140 O #1.12/A UNL 1 H 2.912 2.012
#1.13/? ASN 142 ND2 #1.13/A UNL 1 O #1.13/? ASN 142 2HD2 3.063 2.291
#1.14/? HIS 41 NE2 #1.14/A UNL 1 S no hydrogen 3.919 N/A
#1.14/? ASN 142 ND2 #1.14/A UNL 1 O #1.14/? ASN 142 2HD2 2.802 1.872
#1.14/A UNL 1 N #1.14/? THR 190 O #1.14/A UNL 1 H 2.927 1.987
#1.16/? GLN 189 NE2 #1.16/A UNL 1 N #1.16/? GLN 189 1HE2 3.456 2.669
#1.16/? GLN 189 NE2 #1.16/A UNL 1 O #1.16/? GLN 189 1HE2 3.079 2.177
#1.16/A UNL 1 N #1.16/? THR 190 O #1.16/A UNL 1 H 2.967 1.987
#1.17/? ASN 142 ND2 #1.17/A UNL 1 N #1.17/? ASN 142 2HD2 3.218 2.294
#1.17/A UNL 1 N #1.17/? THR 190 O #1.17/A UNL 1 H 3.364 2.469
#1.18/? ASN 142 ND2 #1.18/A UNL 1 O #1.18/? ASN 142 2HD2 3.117 2.142
#1.20/? ASN 142 ND2 #1.20/A UNL 1 N #1.20/? ASN 142 2HD2 3.245 2.560
-----------------------------------------------------------------------------
structure30R: 21 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.4/? GLN 189 NE2 #1.4/A UNL 1 O #1.4/? GLN 189 1HE2 3.139 2.374
#1.5/? GLN 189 NE2 #1.5/A UNL 1 N #1.5/? GLN 189 2HE2 3.296 2.365
#1.7/? CYS 145 SG #1.7/A UNL 1 O #1.7/? CYS 145 HG 3.466 2.762
#1.7/A UNL 1 O #1.7/? LEU 141 O #1.7/A UNL 1 H 2.951 2.048
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.660 3.073
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 1HD2 2.965 2.162
#1.8/? CYS 145 SG #1.8/A UNL 1 O #1.8/? CYS 145 HG 3.480 2.556
#1.9/? HIS 163 NE2 #1.9/A UNL 1 O no hydrogen 3.272 N/A
#1.9/A UNL 1 O #1.9/? GLN 189 OE1 #1.9/A UNL 1 H 2.915 2.341
#1.10/? ASN 142 ND2 #1.10/A UNL 1 O #1.10/? ASN 142 2HD2 3.100 2.185
#1.10/? GLN 189 NE2 #1.10/A UNL 1 O #1.10/? GLN 189 1HE2 3.180 2.408
#1.10/A UNL 1 O #1.10/? GLU 166 O #1.10/A UNL 1 H 3.246 2.639
#1.11/? ASN 142 ND2 #1.11/A UNL 1 O #1.11/? ASN 142 2HD2 3.122 2.204
#1.11/? HIS 163 NE2 #1.11/A UNL 1 O no hydrogen 3.313 N/A
As you can see, some lines (those containing "no hydrogen" followed by some number of spaces) are out of format: their last two values are shifted far to the right, e.g. no hydrogen 3.137 N/A
Since the number of spaces between these elements may vary, I could not find a simple sed expression that removes all of those useless spaces, e.g.
sed -e "s/no hydrogen //g"
will match only one particular line.
Could you suggest a regular expression that can be used with sed to match all the lines containing "no hydrogen" and remove the unused spaces?
Here is the expected output:
06I: 31 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.1/? THR 26 N #1.1/A UNL 1 O #1.1/? THR 26 H 3.515 2.716
#1.1/? ASN 142 ND2 #1.1/A UNL 1 O #1.1/? ASN 142 2HD2 3.227 2.305
#1.1/A UNL 1 N #1.1/? THR 26 O #1.1/A UNL 1 H 3.463 2.652
#1.2/A UNL 1 N #1.2/? PHE 140 O #1.2/A UNL 1 H 2.987 2.200
#1.4/? THR 26 N #1.4/A UNL 1 S #1.4/? THR 26 H 4.354 3.371
#1.4/? HIS 163 NE2 #1.4/A UNL 1 N no hydrogen 3.137 N/A
#1.4/A UNL 1 N #1.4/? ARG 188 O #1.4/A UNL 1 H 3.000 2.081
#1.5/? HIS 163 NE2 #1.5/A UNL 1 N no hydrogen 3.330 N/A
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 2HE2 3.029 2.132
#1.6/A UNL 1 N #1.6/? ARG 188 O #1.6/A UNL 1 H 2.984 2.064
#1.8/? ASN 142 ND2 #1.8/A UNL 1 N #1.8/? ASN 142 2HD2 3.164 2.395
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.031 2.180
#1.8/? GLN 189 NE2 #1.8/A UNL 1 O #1.8/? GLN 189 1HE2 3.276 2.553
#1.8/A UNL 1 N #1.8/? THR 190 O #1.8/A UNL 1 H 3.257 2.407
#1.9/A UNL 1 N #1.9/? THR 190 O #1.9/A UNL 1 H 2.913 2.037
#1.10/? SER 144 OG #1.10/A UNL 1 S #1.10/? SER 144 HG 4.246 3.845
#1.10/? HIS 163 NE2 #1.10/A UNL 1 S no hydrogen 3.700 N/A
#1.10/A UNL 1 N #1.10/? THR 190 O #1.10/A UNL 1 H 3.008 2.091
#1.12/? GLN 189 NE2 #1.12/A UNL 1 O #1.12/? GLN 189 1HE2 2.929 2.152
#1.12/A UNL 1 N #1.12/? PHE 140 O #1.12/A UNL 1 H 2.912 2.012
#1.13/? ASN 142 ND2 #1.13/A UNL 1 O #1.13/? ASN 142 2HD2 3.063 2.291
#1.14/? HIS 41 NE2 #1.14/A UNL 1 S no hydrogen 3.919 N/A
#1.14/? ASN 142 ND2 #1.14/A UNL 1 O #1.14/? ASN 142 2HD2 2.802 1.872
#1.14/A UNL 1 N #1.14/? THR 190 O #1.14/A UNL 1 H 2.927 1.987
#1.16/? GLN 189 NE2 #1.16/A UNL 1 N #1.16/? GLN 189 1HE2 3.456 2.669
#1.16/? GLN 189 NE2 #1.16/A UNL 1 O #1.16/? GLN 189 1HE2 3.079 2.177
#1.16/A UNL 1 N #1.16/? THR 190 O #1.16/A UNL 1 H 2.967 1.987
#1.17/? ASN 142 ND2 #1.17/A UNL 1 N #1.17/? ASN 142 2HD2 3.218 2.294
#1.17/A UNL 1 N #1.17/? THR 190 O #1.17/A UNL 1 H 3.364 2.469
#1.18/? ASN 142 ND2 #1.18/A UNL 1 O #1.18/? ASN 142 2HD2 3.117 2.142
#1.20/? ASN 142 ND2 #1.20/A UNL 1 N #1.20/? ASN 142 2HD2 3.245 2.560
-----------------------------------------------------------------------------
structure30R: 21 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.4/? GLN 189 NE2 #1.4/A UNL 1 O #1.4/? GLN 189 1HE2 3.139 2.374
#1.5/? GLN 189 NE2 #1.5/A UNL 1 N #1.5/? GLN 189 2HE2 3.296 2.365
#1.7/? CYS 145 SG #1.7/A UNL 1 O #1.7/? CYS 145 HG 3.466 2.762
#1.7/A UNL 1 O #1.7/? LEU 141 O #1.7/A UNL 1 H 2.951 2.048
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.660 3.073
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 1HD2 2.965 2.162
#1.8/? CYS 145 SG #1.8/A UNL 1 O #1.8/? CYS 145 HG 3.480 2.556
#1.9/? HIS 163 NE2 #1.9/A UNL 1 O no hydrogen 3.272 N/A
#1.9/A UNL 1 O #1.9/? GLN 189 OE1 #1.9/A UNL 1 H 2.915 2.341
#1.10/? ASN 142 ND2 #1.10/A UNL 1 O #1.10/? ASN 142 2HD2 3.100 2.185
#1.10/? GLN 189 NE2 #1.10/A UNL 1 O #1.10/? GLN 189 1HE2 3.180 2.408
#1.10/A UNL 1 O #1.10/? GLU 166 O #1.10/A UNL 1 H 3.246 2.639
#1.11/? ASN 142 ND2 #1.11/A UNL 1 O #1.11/? ASN 142 2HD2 3.122 2.204
#1.11/? HIS 163 NE2 #1.11/A UNL 1 O no hydrogen
Using sed
$ sed 's/\(no hydrogen \{12\}\)[[:space:]]\+/\1/' input_file
06I: 31 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.1/? THR 26 N #1.1/A UNL 1 O #1.1/? THR 26 H 3.515 2.716
#1.1/? ASN 142 ND2 #1.1/A UNL 1 O #1.1/? ASN 142 2HD2 3.227 2.305
#1.1/A UNL 1 N #1.1/? THR 26 O #1.1/A UNL 1 H 3.463 2.652
#1.2/A UNL 1 N #1.2/? PHE 140 O #1.2/A UNL 1 H 2.987 2.200
#1.4/? THR 26 N #1.4/A UNL 1 S #1.4/? THR 26 H 4.354 3.371
#1.4/? HIS 163 NE2 #1.4/A UNL 1 N no hydrogen 3.137 N/A
#1.4/A UNL 1 N #1.4/? ARG 188 O #1.4/A UNL 1 H 3.000 2.081
#1.5/? HIS 163 NE2 #1.5/A UNL 1 N no hydrogen 3.330 N/A
#1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 2HE2 3.029 2.132
#1.6/A UNL 1 N #1.6/? ARG 188 O #1.6/A UNL 1 H 2.984 2.064
#1.8/? ASN 142 ND2 #1.8/A UNL 1 N #1.8/? ASN 142 2HD2 3.164 2.395
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.031 2.180
#1.8/? GLN 189 NE2 #1.8/A UNL 1 O #1.8/? GLN 189 1HE2 3.276 2.553
#1.8/A UNL 1 N #1.8/? THR 190 O #1.8/A UNL 1 H 3.257 2.407
#1.9/A UNL 1 N #1.9/? THR 190 O #1.9/A UNL 1 H 2.913 2.037
#1.10/? SER 144 OG #1.10/A UNL 1 S #1.10/? SER 144 HG 4.246 3.845
#1.10/? HIS 163 NE2 #1.10/A UNL 1 S no hydrogen 3.700 N/A
#1.10/A UNL 1 N #1.10/? THR 190 O #1.10/A UNL 1 H 3.008 2.091
#1.12/? GLN 189 NE2 #1.12/A UNL 1 O #1.12/? GLN 189 1HE2 2.929 2.152
#1.12/A UNL 1 N #1.12/? PHE 140 O #1.12/A UNL 1 H 2.912 2.012
#1.13/? ASN 142 ND2 #1.13/A UNL 1 O #1.13/? ASN 142 2HD2 3.063 2.291
#1.14/? HIS 41 NE2 #1.14/A UNL 1 S no hydrogen 3.919 N/A
#1.14/? ASN 142 ND2 #1.14/A UNL 1 O #1.14/? ASN 142 2HD2 2.802 1.872
#1.14/A UNL 1 N #1.14/? THR 190 O #1.14/A UNL 1 H 2.927 1.987
#1.16/? GLN 189 NE2 #1.16/A UNL 1 N #1.16/? GLN 189 1HE2 3.456 2.669
#1.16/? GLN 189 NE2 #1.16/A UNL 1 O #1.16/? GLN 189 1HE2 3.079 2.177
#1.16/A UNL 1 N #1.16/? THR 190 O #1.16/A UNL 1 H 2.967 1.987
#1.17/? ASN 142 ND2 #1.17/A UNL 1 N #1.17/? ASN 142 2HD2 3.218 2.294
#1.17/A UNL 1 N #1.17/? THR 190 O #1.17/A UNL 1 H 3.364 2.469
#1.18/? ASN 142 ND2 #1.18/A UNL 1 O #1.18/? ASN 142 2HD2 3.117 2.142
#1.20/? ASN 142 ND2 #1.20/A UNL 1 N #1.20/? ASN 142 2HD2 3.245 2.560
-----------------------------------------------------------------------------
structure30R: 21 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
#1.4/? GLN 189 NE2 #1.4/A UNL 1 O #1.4/? GLN 189 1HE2 3.139 2.374
#1.5/? GLN 189 NE2 #1.5/A UNL 1 N #1.5/? GLN 189 2HE2 3.296 2.365
#1.7/? CYS 145 SG #1.7/A UNL 1 O #1.7/? CYS 145 HG 3.466 2.762
#1.7/A UNL 1 O #1.7/? LEU 141 O #1.7/A UNL 1 H 2.951 2.048
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 2HD2 3.660 3.073
#1.8/? ASN 142 ND2 #1.8/A UNL 1 O #1.8/? ASN 142 1HD2 2.965 2.162
#1.8/? CYS 145 SG #1.8/A UNL 1 O #1.8/? CYS 145 HG 3.480 2.556
#1.9/? HIS 163 NE2 #1.9/A UNL 1 O no hydrogen 3.272 N/A
#1.9/A UNL 1 O #1.9/? GLN 189 OE1 #1.9/A UNL 1 H 2.915 2.341
#1.10/? ASN 142 ND2 #1.10/A UNL 1 O #1.10/? ASN 142 2HD2 3.100 2.185
#1.10/? GLN 189 NE2 #1.10/A UNL 1 O #1.10/? GLN 189 1HE2 3.180 2.408
#1.10/A UNL 1 O #1.10/? GLU 166 O #1.10/A UNL 1 H 3.246 2.639
#1.11/? ASN 142 ND2 #1.11/A UNL 1 O #1.11/? ASN 142 2HD2 3.122 2.204
#1.11/? HIS 163 NE2 #1.11/A UNL 1 O no hydrogen
\(no hydrogen \{12\}\) - Creates a capture group inside the escaped parentheses \(..\) using sed's back-referencing, which can later be returned with \1. The command could also have been written \(no hydrogen[[:space:]]\{12\}\) to emphasize the presence of a space. The group holds the words no hydrogen plus the 12 spaces after them, so that part is returned by the back-reference.
[[:space:]]\+ - As this is not part of the group, it is excluded from the replacement. It matches all the remaining whitespace after the matched words and the 12 spaces retained inside the group.
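The capture-and-pad idea can also be sketched in Python. This is an illustration rather than a drop-in equivalent: the name fix_no_hydrogen and the pad parameter are hypothetical, the default of 12 is taken from the sed command above, and unlike that command the sketch normalises any run of blanks after "no hydrogen" to exactly pad spaces, including runs shorter than 12.

```python
import re

# "no hydrogen" followed by at least one blank; the words are captured
# so they can be written back unchanged.
NO_HYDROGEN = re.compile(r"(no hydrogen)[ \t]+")


def fix_no_hydrogen(line, pad=12):
    """Replace the blanks after 'no hydrogen' with exactly `pad` spaces."""
    return NO_HYDROGEN.sub(lambda m: m.group(1) + " " * pad, line)
```

Lines without "no hydrogen", or with it at the very end of the line, pass through unchanged.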
File1 is a fixed-format pdb file containing protein coordinates:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
....................... plus many more lines .................................
File2 is a list of representative lines obtained from fields 4, 5, and 6 of the above
pdb file. To keep it simple, let's consider just two lines:
GLU A 2
GLY A 124
The desired output is:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
i.e. a modified pdb with 00.00 in the 11th field wherever a File1 line contains a
File2 occurrence.
I already know how to do that with a Bash while-read loop and awk, but because these tools
change the format and require reformatting and/or specifying the output format, they are
not practical in this particular case, which involves hundreds of files.
To avoid these problems I decided to look for a solution based on sed.
I got a working solution if I explicitly give a single search pattern. i.e. the
following code works:
digits=00.00
sed "/GLU A 2/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb
but the following does not (the File1 lines are unchanged) and I did not manage
to figure out why:
digits=00.00
while read pattern; do
sed "/$pattern/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb ;
done < File2.txt
Sorry for the lengthy message. Thanks in advance for any help.
@anubhava:
using my real data, this is what happens at the first substitution site:
ATOM 293 CE1 HIS A 38 -18.278 19.735 13.486 1.00 67.94 C
ATOM 294 NE2 HIS A 38 -18.518 18.594 14.144 1.00 67.94 N
ATOM 295 N GLY A 39 -13.836 00.00 9.206 1.00 71.50 N
ATOM 296 CA GLY A 39 -12.628 00.00 8.447 1.00 71.50 C
ATOM 297 C GLY A 39 -11.358 00.00 9.286 1.00 71.50 C
ATOM 298 O GLY A 39 -11.411 18.636 10.344 1.00 00.00 O
ATOM 299 N PRO A 40 -10.180 17.577 8.797 1.00 71.93 N
ATOM 300 CA PRO A 40 -8.908 17.719 9.520 1.00 71.93 C
ATOM 301 C PRO A 40 -8.580 19.169 9.912 1.00 71.93 C
In this case the site is /GLY A 39/. As you can see, some lines are shifted and there are unwanted substitutions in the 8th field.
Strangely enough, such problems occur only for the first replacement, i.e. the remaining output is just perfect. Thanks.
Using sed in a while loop that reads file2 line by line, you can target only the lines matching a pattern from file2 and carry out the substitution on those lines, where:
s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/ - Captures everything up to the last number of the form NN.NN as group \1, drops that number, and captures everything from the following whitespace to the end of the line as group \2. The replacement \100.00\2 then puts 00.00 where the number was.
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
$ while read -r line; do sed -i.bak "/$line/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/" file1; done < file2
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
awk suits this role better:
awk 'FNR==NR {a[$1,$2,$3]; next} ($4,$5,$6) in a {$11="00.00"} 1' file2 file1 | column -t
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
Used column -t for tabular output display only.
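If the original column layout has to survive untouched, one option is to edit by character position instead of by field. A minimal sketch in Python, assuming the files follow the standard fixed-column PDB layout (residue name in columns 18-20, chain in column 22, residue number in columns 23-26) and that, as in the question's sed command, the five characters after the first 61 hold the value to overwrite. The names load_residues and zero_bfactor are hypothetical.

```python
def load_residues(lines):
    """Turn File2-style lines such as 'GLU A 2' into a set of (name, chain, number)."""
    return {tuple(line.split()) for line in lines if line.strip()}


def zero_bfactor(lines, residues, digits="00.00"):
    """Overwrite characters 62-66 on ATOM/HETATM lines whose residue is listed."""
    out = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line.rstrip("\n")) >= 66:
            # Read the residue from its fixed columns, not from split fields.
            key = (line[17:20].strip(), line[21], line[22:26].strip())
            if key in residues:
                # Slice by position so every other column stays exactly as it was.
                line = line[:61] + digits + line[66:]
        out.append(line)
    return out
```

Since nothing is split and re-joined, the spacing of every line is preserved and no reformatting step is needed afterwards.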
I am using Clojure in Emacs with cider and the cider repl (0.7.0). This works pretty well, but whenever I run cider-refresh (or hit C-c C-x), I get an exception:
ClassNotFoundException clojure.tools.namespace.repl java.net.URLClassLoader$1.run (URLClassLoader.java:372)
1. Unhandled java.lang.ClassNotFoundException
clojure.tools.namespace.repl
URLClassLoader.java: 372 java.net.URLClassLoader$1/run
URLClassLoader.java: 361 java.net.URLClassLoader$1/run
AccessController.java: -2 java.security.AccessController/doPrivileged
URLClassLoader.java: 360 java.net.URLClassLoader/findClass
DynamicClassLoader.java: 61 clojure.lang.DynamicClassLoader/findClass
ClassLoader.java: 424 java.lang.ClassLoader/loadClass
ClassLoader.java: 357 java.lang.ClassLoader/loadClass
Class.java: -2 java.lang.Class/forName0
Class.java: 340 java.lang.Class/forName
RT.java: 2065 clojure.lang.RT/classForName
Compiler.java: 978 clojure.lang.Compiler$HostExpr/maybeClass
Compiler.java: 756 clojure.lang.Compiler$HostExpr/access$400
Compiler.java: 6583 clojure.lang.Compiler/macroexpand1
Compiler.java: 6613 clojure.lang.Compiler/macroexpand
Compiler.java: 6687 clojure.lang.Compiler/eval
Compiler.java: 6666 clojure.lang.Compiler/eval
core.clj: 2927 clojure.core/eval
main.clj: 239 clojure.main/repl/read-eval-print/fn
main.clj: 239 clojure.main/repl/read-eval-print
main.clj: 257 clojure.main/repl/fn
main.clj: 257 clojure.main/repl
RestFn.java: 1096 clojure.lang.RestFn/invoke
interruptible_eval.clj: 56 clojure.tools.nrepl.middleware.interruptible-eval/evaluate/fn
AFn.java: 152 clojure.lang.AFn/applyToHelper
AFn.java: 144 clojure.lang.AFn/applyTo
core.clj: 624 clojure.core/apply
core.clj: 1862 clojure.core/with-bindings*
RestFn.java: 425 clojure.lang.RestFn/invoke
interruptible_eval.clj: 41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate
interruptible_eval.clj: 171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval/fn/fn
core.clj: 2402 clojure.core/comp/fn
interruptible_eval.clj: 138 clojure.tools.nrepl.middleware.interruptible-eval/run-next/fn
AFn.java: 22 clojure.lang.AFn/run
ThreadPoolExecutor.java: 1142 java.util.concurrent.ThreadPoolExecutor/runWorker
ThreadPoolExecutor.java: 617 java.util.concurrent.ThreadPoolExecutor$Worker/run
Thread.java: 745 java.lang.Thread/run
What is the reason for this, and how can I fix it?
It seems that this exception was a bug, that has now been fixed in cider.
Try adding [org.clojure/tools.namespace "0.2.5"] to your project.clj
I have a folder that contains 200 pdb files. I would like to arrange the atom lines of each PDB file in ascending order based on the 6th column, editing each file in the folder in place. Your help would be appreciated.
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
Desired output
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
Use sort.
sort -n -k 6 inputfile
-n performs a numeric sort, and -k 6 makes the sort key start at the 6th field.
EDIT: For in-place sorting, use the -o option:
sort -n -k 6 inputfile -o inputfile
I use a hash whose key is the 6th field plus a counter, appended at the end, that increments on each line. This avoids overwriting duplicated entries and keeps a stable order. Then the asorti() function (a gawk extension) sorts by that 6th field, and each line of the original array is printed.
Content of script.awk:
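{
++n
# Key: 6th field, a space, then the line counter. The space keeps
# pairs such as (1, 11) and (11, 1) from producing the same key.
data[ $6 " " n ] = $0;
}
END {
# "@ind_num_asc" (gawk only) orders the indices numerically, ascending.
asorti( data, mod_data, "@ind_num_asc" )
l = length( data )
for ( i = 1; i <= l; i++ ) {
print data[ mod_data[i] ]
}
}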
Run it like:
awk -f script.awk infile
That yields:
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
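The same stable sort on the 6th field, plus the in-place rewrite asked for in the question, can be sketched in Python. The sketch assumes every line has at least six whitespace-separated fields and an integer in the 6th; the function names are hypothetical.

```python
def sort_by_residue_number(lines):
    """Return lines ordered by the integer in the 6th whitespace-separated field."""
    # sorted() is stable, so lines sharing a residue number keep their input order.
    return sorted(lines, key=lambda line: int(line.split()[5]))


def sort_file_in_place(path):
    """Read a file, sort its lines, and write the result back to the same path."""
    with open(path) as handle:
        lines = handle.readlines()
    with open(path, "w") as handle:
        handle.writelines(sort_by_residue_number(lines))
```

For a whole folder, sort_file_in_place could be called once per file, for example on each path returned by glob.glob("*.pdb"). Files containing header or TER lines would need those lines filtered out first.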
I am stuck at one point in my project. I am a biomedical scientist, so I don't know much Perl programming.
I have a file that describes protein interactions with ligands. The file looks as shown below:
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
As you can see, O12 appears in two rows, and similarly CB2 appears twice. O12 and CB2 are atom symbols; O12 denotes oxygen number 12. I need to count how many different atom symbols there are in the file, and I have to use a Perl script to do that. I am reading the file line by line: while (my $line = <MYFILE>) { }; How can I count the distinct atom symbols while reading the file this way? I hope I have explained my problem clearly. Waiting for a kind reply...
How the problem is best solved depends on how your data is delimited. As it looks like fixed width, I'll present that solution first:
use strict;
use warnings;
my %atom;
while (<DATA>) {
my (undef,$atom) = unpack "A34A4 ", $_;
$atom{$atom}++;
}
print scalar keys %atom;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
Note here that I estimated the offset used by unpack, so you may need to tweak that to fit your data.
If your data is tab-delimited, you'll need to split on tab, or better yet use Text::CSV to parse your data. Basic script is the same:
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
sep_char => "\t",
});
my %atom;
while (<DATA>) {
$csv->parse($_);
my $atom = ($csv->fields())[9];
next unless defined $atom;
$atom{$atom}++;
}
You can also use the loop condition while (my $aref = $csv->getline(*DATA)), which is more efficient, but also breaks if your csv data is not consistent.
A simpler and possibly as valid (depending on how complex your data can be) solution is using split:
while (<DATA>) {
my $atom = (split /\t/)[9]; # implicitly splits $_
$atom{$atom}++;
}
If your data is space delimited, simply remove /\t/ from the above.
Note that I assumed all spaces were tabs in your input, so if they are not, my count may need to be tweaked.
In command line (no perl):
cat yourfile | awk '{print $10}' | sort | uniq | wc -l
Works on your input.
Have a look at this Perl Cookbook recipe.
While you're reading the file line by line you want to split/extract the atom symbols and count them in a hash.
use strict;
use warnings;
# open FILE goes here...
my %seen; # we use this to count
while (<FILE>) {
next unless m/--[>-]\s+(\w+)\s/; # fetch the atom symbol after the arrow; skip lines without one
$seen{$1}++;
}
close FILE;
print scalar(keys %seen), "\n"; # number of unique atom symbols
print join(', ', keys %seen), "\n"; # list as string
Or in perl:
#!/usr/bin/perl
while(my $line = <DATA>){
my $atom = (split / +/, $line)[9];
$atoms{$atom}++;
}
print "$_: $atoms{$_}\n" for keys %atoms;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
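For readers more at home in Python, the count-in-a-hash idea used by the answers above can be sketched with collections.Counter. This is only an illustration: it assumes, like the regex answer, that the atom symbol is the first word after the --> or --- arrow, and the name count_atom_symbols is hypothetical.

```python
import re
from collections import Counter

# The atom symbol is taken to be the first word after "-->" or "---".
ARROW_ATOM = re.compile(r"--[>-]\s+(\w+)")


def count_atom_symbols(lines):
    """Count how often each atom symbol after the arrow occurs."""
    counts = Counter()
    for line in lines:
        match = ARROW_ATOM.search(line)
        if match:  # lines without an arrow are skipped
            counts[match.group(1)] += 1
    return counts
```

len(count_atom_symbols(lines)) then gives the number of distinct symbols, and the Counter itself gives the number of rows per symbol.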