I have a folder that contains 200 PDB files. I would like to sort the ATOM lines of each file in ascending order of the 6th column, editing every file in the folder in place. Your help would be appreciated.
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
Desired output
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
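Use sort.
sort -s -n -k6,6 inputfile
-n performs a numeric sort, and -k6,6 restricts the sort key to the 6th field only (a bare -k 6 uses everything from field 6 to the end of the line). -s, supported by GNU and BSD sort, keeps lines with equal keys in their original order.
EDIT: For in-place sorting, use the -o option, which may name the input file as the output:
sort -s -n -k6,6 -o inputfile inputfile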
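The question asks for all 200 files to be edited in place. As a rough illustration of that step, here is a minimal Python sketch. It assumes every non-blank line in each file is an ATOM record with at least six whitespace-separated columns; the function names and the `*.pdb` pattern are illustrative choices, not anything given in the post.

```python
from pathlib import Path


def sort_atom_lines(lines):
    """Sort lines by the integer in their 6th whitespace-separated column.

    sorted() is stable, so lines with the same residue number keep
    their original relative order.
    """
    return sorted(lines, key=lambda line: int(line.split()[5]))


def sort_pdb_in_place(path):
    """Rewrite one file with its non-blank lines sorted by column 6."""
    path = Path(path)
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    path.write_text("\n".join(sort_atom_lines(lines)) + "\n")


def sort_folder(folder):
    """Sort every *.pdb file found directly inside folder."""
    for pdb in sorted(Path(folder).glob("*.pdb")):
        sort_pdb_in_place(pdb)
```

Calling `sort_folder("path/to/folder")` would rewrite each file. Real PDB files usually also contain HEADER, TER or END records, which this sketch does not handle, so it is worth trying on a copy first.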
I use an array whose key is the 6th field followed by a zero-padded line counter as a decimal fraction. The counter keeps lines with the same 6th field from overwriting each other and preserves their original order. Then the asorti() function (GNU awk) sorts the keys numerically, and each line of the original array is printed in that order.
Content of script.awk:
{
    ++n
    # e.g. a 6th field of 65 on line 7 gives the key "65.000000007"
    data[ sprintf( "%d.%09d", $6, n ) ] = $0
}
END {
    l = asorti( data, mod_data, "@ind_num_asc" )
    for ( i = 1; i <= l; i++ ) {
        print data[ mod_data[i] ]
    }
}
Run it like:
awk -f script.awk infile
That yields:
ATOM 392 C GLU B 65 23.248 10.071 -7.321 1.00 48.26 C
ATOM 393 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 394 O GLU B 65 24.465 10.200 -7.158 1.00 46.53 O
ATOM 422 C SER A 205 70.124 -29.955 8.226 1.00 55.81 C
ATOM 423 O SER A 205 70.901 -29.008 8.438 1.00 46.60 O
ATOM 303 N MET A 231 61.031 -38.086 -3.054 1.00 52.32 N
ATOM 304 CA MET A 231 60.580 -39.074 -4.047 1.00 64.11 C
ATOM 81 N ASN A 248 38.791 -16.708 12.507 1.00 52.04 N
ATOM 82 CA ASN A 248 39.443 -17.018 11.206 1.00 54.49 C
File1 is a fixed-format PDB file containing protein coordinates:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
....................... plus many more lines .................................
File2 is a list of representative lines obtained from fields 4, 5, and 6 of the above
PDB file. To keep it simple, let's consider just two lines:
GLU A 2
GLY A 124
The desired output is:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
i.e. a modified PDB file with 00.00 in the 11th field of every File1 line that
contains a File2 entry.
I already know how to do that with a Bash while-read loop and awk, but those tools
change the column layout, so the output has to be reformatted or its format specified
explicitly. With hundreds of files to process, that is not practical.
To avoid these problems I decided to look for a solution based on sed.
I have a working solution if I give a single search pattern explicitly, i.e. the
following code works:
digits=00.00
sed "/GLU A 2/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb
but the following does not (the File1 lines are unchanged) and I have not managed
to figure out why:
digits=00.00
while read pattern; do
sed "/$pattern/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb ;
done < File2.txt
Sorry for the lengthy message. Thanks in advance for any help.
@anubhava:
Using my real data, this is what happens at the first substitution site:
ATOM 293 CE1 HIS A 38 -18.278 19.735 13.486 1.00 67.94 C
ATOM 294 NE2 HIS A 38 -18.518 18.594 14.144 1.00 67.94 N
ATOM 295 N GLY A 39 -13.836 00.00 9.206 1.00 71.50 N
ATOM 296 CA GLY A 39 -12.628 00.00 8.447 1.00 71.50 C
ATOM 297 C GLY A 39 -11.358 00.00 9.286 1.00 71.50 C
ATOM 298 O GLY A 39 -11.411 18.636 10.344 1.00 00.00 O
ATOM 299 N PRO A 40 -10.180 17.577 8.797 1.00 71.93 N
ATOM 300 CA PRO A 40 -8.908 17.719 9.520 1.00 71.93 C
ATOM 301 C PRO A 40 -8.580 19.169 9.912 1.00 71.93 C
In this case the site is /GLY A 39/. As you can see, some lines are shifted and there are unwanted substitutions in the 8th field.
Strangely enough, such problems occur only for the first replacement, i.e. the remaining output is just perfect. Thanks.
Using sed in a while loop that reads file2 line by line, you can target only the lines of file1 that match an entry in file2 and carry out the substitution on those lines, where:
s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/ - The first group captures everything up to the last number of the form dd.dd and is returned with back-reference \1. The number itself is left out of both groups. The second group captures everything from the following whitespace to the end of the line and is returned with back-reference \2.
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
$ while read -r line; do sed -i.bak "/$line/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/" file1; done < file2
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
awk suits this task better:
awk 'FNR==NR {a[$1,$2,$3]; next} ($4,$5,$6) in a {$11="00.00"} 1' file2 file1 | column -t
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
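Assigning to $11 makes awk rebuild the line with single spaces between fields, so the original column layout is lost; column -t is used here only to display the output as a table.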
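Since the question is about changing the 11th field without disturbing the layout, the following Python sketch shows one way to overwrite a field at its original character position. It assumes the fields are separated by whitespace (in real PDB files adjacent columns can touch, in which case fixed column offsets would be needed instead); the function names are illustrative.

```python
import re


def load_residues(lines):
    """Parse lines such as 'GLU A 2' into a set of (name, chain, number)."""
    return {tuple(line.split()) for line in lines if line.strip()}


def zero_bfactor(line, residues, new_value="00.00"):
    """Overwrite the 11th field of an ATOM line whose residue is listed.

    Only the characters of that field are replaced, so the spacing of
    the rest of the line is left exactly as it was.
    """
    fields = list(re.finditer(r"\S+", line))
    if len(fields) < 11 or fields[0].group() != "ATOM":
        return line
    # Fields 4, 5 and 6 (residue name, chain, residue number) form the key.
    key = tuple(m.group() for m in fields[3:6])
    if key not in residues:
        return line
    start, end = fields[10].span()
    # Right-align within the original width so later columns do not move.
    return line[:start] + new_value.rjust(end - start) + line[end:]
```

Because the key is compared as a whole tuple, an entry such as GLU A 2 does not accidentally match GLU A 20, which a plain text search for the pattern would do. If the original field is narrower than the new value, the line grows by the difference.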
I currently have a text file that starts like this,
ATOM 277 N DOPC 3 2.637 5.546 17.667 1.00 0.00 MEMB
ATOM 278 C12 DOPC 3 2.869 5.398 19.176 1.00 0.00 MEMB
ATOM 279 H12A DOPC 3 3.729 6.005 19.418 1.00 0.00 MEMB
ATOM 280 H12B DOPC 3 3.176 4.394 19.427 1.00 0.00 MEMB
ATOM 281 C13 DOPC 3 1.352 4.873 17.275 1.00 0.00 MEMB
ATOM 282 H13A DOPC 3 1.380 5.091 16.217 1.00 0.00 MEMB
ATOM 283 H13B DOPC 3 1.415 3.810 17.452 1.00 0.00 MEMB
ATOM 284 H13C DOPC 3 0.491 5.261 17.799 1.00 0.00 MEMB
ATOM 285 C14 DOPC 3 3.791 4.845 16.976 1.00 0.00 MEMB
ATOM 286 H14A DOPC 3 4.692 4.989 17.554 1.00 0.00 MEMB
ATOM 287 H14B DOPC 3 3.563 3.790 17.025 1.00 0.00 MEMB
ATOM 288 H14C DOPC 3 3.875 5.097 15.930 1.00 0.00 MEMB
ATOM 289 C15 DOPC 3 2.627 6.991 17.324 1.00 0.00 MEMB
ATOM 290 H15A DOPC 3 1.812 7.530 17.785 1.00 0.00 MEMB
.
.
I'm wondering if there is any way, using sed or awk, to reorder the lines so that the ordering goes from [1,2,3...14...] to [1,2,5,9,13,3,4,6,7,8,10,11,12,14...], simply by using their line numbers?
Here is the desired output,
ATOM 277 N DOPC 3 2.637 5.546 17.667 1.00 0.00 MEMB
ATOM 278 C12 DOPC 3 2.869 5.398 19.176 1.00 0.00 MEMB
ATOM 281 C13 DOPC 3 1.352 4.873 17.275 1.00 0.00 MEMB
ATOM 285 C14 DOPC 3 3.791 4.845 16.976 1.00 0.00 MEMB
ATOM 289 C15 DOPC 3 2.627 6.991 17.324 1.00 0.00 MEMB
ATOM 279 H12A DOPC 3 3.729 6.005 19.418 1.00 0.00 MEMB
ATOM 280 H12B DOPC 3 3.176 4.394 19.427 1.00 0.00 MEMB
ATOM 284 H13C DOPC 3 0.491 5.261 17.799 1.00 0.00 MEMB
ATOM 286 H14A DOPC 3 4.692 4.989 17.554 1.00 0.00 MEMB
ATOM 287 H14B DOPC 3 3.563 3.790 17.025 1.00 0.00 MEMB
ATOM 288 H14C DOPC 3 3.875 5.097 15.930 1.00 0.00 MEMB
ATOM 290 H15A DOPC 3 1.812 7.530 17.785 1.00 0.00 MEMB
.
.
Thanks!
In awk:
$ awk -v order="1,2,5,9,13,3,4,6,7,8,10,11,12,14" '{
a[NR]=$0 # hash records to a with NR as index
}
END {
n=split(order,o,/,/) # split the given order to a mapping
for(i=1;i<=n;i++) { # iterate the map indexes
print a[o[i]] # output
# delete a[o[i]] # uncomment these
}
# for(i=1;i<=NR;i++) # to print any leftovers
# if(i in a) # that were not in the order list
# print a[i]
}' file
Use this Perl one-liner:
perl -ne 'push @a, $_; END { print $a[$_-1] for ( 1,2,5,9,13,3,4,6,7,8,10,11,12,14, 15..($#a+1) ); }' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
push @a, $_; : Add the current line to the array @a (which is empty initially) as the next element.
$#a : The index of the last element of the array @a.
END { ... } : After all input lines have been read, execute the code inside the block. Here, print the lines in the specified order.
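The same idea, storing every line and then printing them in a given order of line numbers, can be sketched in Python. This is only an illustration: the function name is made up, and appending the unlisted lines at the end is an assumption about what should happen to them.

```python
def reorder(lines, order):
    """Return lines rearranged so the 1-based numbers in order come first.

    Duplicate numbers are used once, numbers beyond the end of the input
    are ignored, and lines not named in order follow in their original
    sequence.
    """
    picked = [n for n in dict.fromkeys(order) if 1 <= n <= len(lines)]
    chosen = set(picked)
    rest = [n for n in range(1, len(lines) + 1) if n not in chosen]
    return [lines[n - 1] for n in picked + rest]
```

For a file, `reorder(open("in_file").readlines(), [1, 2, 5, 9, 13, 3, 4, 6, 7, 8, 10, 11, 12, 14])` would give the lines in the requested order, followed by line 15 onwards.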
I have two files with thousands of proteins: (1) protein ID + amino acid sequence; (2) protein ID + nucleotide sequence. A third file gives the positions of the domains in these proteins, which relate to both the amino acid and the nucleotide sequence files. I related these three files with this code:
acids.txt file contains:
ENST00000274849|Q9ULW3
MEAEESEKAATEQEPLEGTEQTLDAEEEQEESEEAACGSKKRVVPGIVYLGHIPPRFRPL HVRNLLSAYGEVGRVFFQAEDRFVRRKKKAAAAAGGKKRSYTKDYTEGWVEFRDKRIAKR VAASLHNTPMGARRRSPFRYDLWNLKYLHRFTWSHLSEHLAFERQVRRQRLRAEVAQAKR ETDFYLQSVERGQRFLAADGDPARPDGSWTFAQRPTEQELRARKAARPGGRERARLATAQ DKARSNKGLLARIFGAPPPSESMEGPSLVRDS*
nucleotides.txt file contains:
ENST00000274849|Q9ULW3
ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACCTCAGCGAGCACCTC GCCTTTGAGCGCCAGGTGCGCAGGCAGCGCTTGAGAGCGGAGGTTGCTCAAGCCAAGCGT GAGACCGACTTCTATCTTCAAAGTGTGGAACGGGGACAACGCTTTCTTGCGGCCGATGGG GACCCTGCTCGCCCAGATGGCTCCTGGACATTTGCCCAGCGTCCTACTGAGCAGGAACTG AGGGCCCGTAAAGCAGCACGGCCAGGGGGACGTGAACGGGCTCGCCTGGCAACTGCCCAG GACAAGGCCCGCTCCAACAAAGGGCTCCTGGCCAGGATCTTTGGAGCCCCGCCACCCTCA GAGAGCATGGAGGGACCTTCCCTTGTCAGGGACTCCTGA
domain.txt file contains:
Q9ULW3; 46 142
NOTE: these numbers give the position of the domain in my sequence
script:
use strict;
use Bio::SeqIO;
####################################################
#MODULE 1: read protein file, and save it in a hash#
####################################################
my %hash1;
my $sequence = "acids.txt";
my $multifasta = Bio::SeqIO ->new (-file => "<$sequence",-format=> "fasta");
while (my $seq= $multifasta->next_seq()) {
my $na= $seq->display_id; #Saves the ID in $na
my $ss = $seq->seq;
$hash1{$na} = $ss;
}
#############################################################
#MODULE 2: read nucleotide file, and save it in another hash#
#############################################################
my %hash2;
my $genes = "nucleotides.txt";
my $multifasta = Bio::SeqIO ->new (-file => "<$genes",-format=> "fasta");
while (my $seq= $multifasta->next_seq()) {
my $na= $seq->display_id; #Saves the ID in $na
my $des=$seq->description;
my $ss = $seq->seq;
$hash2{$na} = $ss;
}
#####################
#MODULE 3: my $name;#
#####################
my $name; # Read from standard input
chomp $name;
##############################################################################
#MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
##############################################################################
foreach my $name (keys %hash1) {
my $ac = (split(/\s*\|/, $name))[1];
#print "$ac\n" ;
####################################################
#MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS#
####################################################
open(FILE, "<" ,"domain.txt");
my @array = (<FILE>);
my @lines = grep (/$ac/, @array);
print for @lines;
close (FILE);
############################################################
#MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
############################################################
my @array1 = split(//, $hash1{$name}); #CUT AMINO ACID SEQUENCE
my @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #CUT NUCLEOTIDE SEQUENCE
my $number = $#array1; # index of the last amino acid
foreach (my $count = 0; $count <= $number; $count++) {
print "$count\t$array1[$count]\t$array2[$count]\n";
}
}
And here is my FILE which I got after running the script:
Q9ULW3; 46 142
0 M ATG
1 E GAG
2 A GCA
3 E GAG
4 E GAA
5 S TCG
6 E GAG
7 K AAG
8 A GCC
9 A GCA
10 T ACG
11 E GAG
12 Q CAA
13 E GAG
14 P CCG
15 L CTG
16 E GAA
17 G GGG
18 T ACA
19 E GAA
20 Q CAG
21 T ACA
22 L CTA
23 D GAT
24 A GCG
25 E GAG
26 E GAG
27 E GAG
28 Q CAG
29 E GAG
30 E GAA
31 S TCC
32 E GAA
33 E GAA
34 A GCG
35 A GCC
36 C TGT
37 G GGC
38 S AGC
39 K AAG
40 K AAA
41 R CGG
42 V GTA
43 V GTG
44 P CCA
45 G GGT
46 I ATT
47 V GTG
48 Y TAC
49 L CTG
50 G GGC
51 H CAT
52 I ATC
53 P CCG
54 P CCG
55 R CGC
56 F TTC
57 R CGG
58 P CCC
59 L CTG
60 H CAC
61 V GTC
62 R CGC
63 N AAC
64 L CTT
65 L CTC
66 S AGC
67 A GCC
68 Y TAT
69 G GGC
70 E GAG
71 V GTC
72 G GGA
73 R CGC
74 V GTC
75 F TTC
76 F TTT
77 Q CAG
78 A GCT
79 E GAG
80 D GAC
81 R CGG
82 F TTC
83 V GTG
84 R AGA
85 R CGC
86 K AAG
87 K AAG
88 K AAG
89 A GCA
90 A GCA
91 A GCA
92 A GCT
93 A GCC
94 G GGA
95 G GGA
96 K AAA
97 K AAG
98 R CGG
99 S TCC
100 Y TAC
101 T ACC
102 K AAG
103 D GAC
104 Y TAC
105 T ACC
106 E GAG
107 G GGA
108 W TGG
109 V GTG
110 E GAG
111 F TTC
112 R CGT
113 D GAC
114 K AAG
115 R CGC
116 I ATA
117 A GCC
118 K AAG
119 R CGC
120 V GTG
121 A GCG
122 A GCC
123 S AGT
124 L CTA
125 H CAC
126 N AAC
127 T ACG
128 P CCT
129 M ATG
130 G GGT
131 A GCC
132 R CGC
133 R AGG
134 R CGC
135 S AGC
136 P CCC
137 F TTC
138 R CGT
139 Y TAT
140 D GAT
141 L CTT
142 W TGG
143 N AAC
144 L CTC
145 K AAG
146 Y TAC
147 L TTG
148 H CAC
149 R CGT
150 F TTC
151 T ACC
152 W TGG
153 S TCC
154 H CAC
155 * TGA
Now I need to add a fourth column containing 'YES' or 'NOT': YES for codons that lie inside a domain, NOT for those that do not. Here the domain spans positions 46 to 142. I would like to get this OUTPUT FILE:
Q9ULW3; 46 142
0 M ATG NOT
1 E GAG NOT
2 A GCA NOT
3 E GAG NOT
4 E GAA NOT
5 S TCG NOT
6 E GAG NOT
7 K AAG NOT
8 A GCC NOT
9 A GCA NOT
10 T ACG NOT
11 E GAG NOT
12 Q CAA NOT
13 E GAG NOT
14 P CCG NOT
15 L CTG NOT
16 E GAA NOT
17 G GGG NOT
18 T ACA NOT
19 E GAA NOT
20 Q CAG NOT
21 T ACA NOT
22 L CTA NOT
23 D GAT NOT
24 A GCG NOT
25 E GAG NOT
26 E GAG NOT
27 E GAG NOT
28 Q CAG NOT
29 E GAG NOT
30 E GAA NOT
31 S TCC NOT
32 E GAA NOT
33 E GAA NOT
34 A GCG NOT
35 A GCC NOT
36 C TGT NOT
37 G GGC NOT
38 S AGC NOT
39 K AAG NOT
40 K AAA NOT
41 R CGG NOT
42 V GTA NOT
43 V GTG NOT
44 P CCA NOT
45 G GGT NOT
46 I ATT YES
47 V GTG YES
48 Y TAC YES
49 L CTG YES
50 G GGC YES
51 H CAT YES
52 I ATC YES
53 P CCG YES
54 P CCG YES
55 R CGC YES
56 F TTC YES
57 R CGG YES
58 P CCC YES
59 L CTG YES
60 H CAC YES
61 V GTC YES
62 R CGC YES
63 N AAC YES
64 L CTT YES
65 L CTC YES
66 S AGC YES
67 A GCC YES
68 Y TAT YES
69 G GGC YES
70 E GAG YES
71 V GTC YES
72 G GGA YES
73 R CGC YES
74 V GTC YES
75 F TTC YES
76 F TTT YES
77 Q CAG YES
78 A GCT YES
79 E GAG YES
80 D GAC YES
81 R CGG YES
82 F TTC YES
83 V GTG YES
84 R AGA YES
85 R CGC YES
86 K AAG YES
87 K AAG YES
88 K AAG YES
89 A GCA YES
90 A GCA YES
91 A GCA YES
92 A GCT YES
93 A GCC YES
94 G GGA YES
95 G GGA YES
96 K AAA YES
97 K AAG YES
98 R CGG YES
99 S TCC YES
100 Y TAC YES
101 T ACC YES
102 K AAG YES
103 D GAC YES
104 Y TAC YES
105 T ACC YES
106 E GAG YES
107 G GGA YES
108 W TGG YES
109 V GTG YES
110 E GAG YES
111 F TTC YES
112 R CGT YES
113 D GAC YES
114 K AAG YES
115 R CGC YES
116 I ATA YES
117 A GCC YES
118 K AAG YES
119 R CGC YES
120 V GTG YES
121 A GCG YES
122 A GCC YES
123 S AGT YES
124 L CTA YES
125 H CAC YES
126 N AAC YES
127 T ACG YES
128 P CCT YES
129 M ATG YES
130 G GGT YES
131 A GCC YES
132 R CGC YES
133 R AGG YES
134 R CGC YES
135 S AGC YES
136 P CCC YES
137 F TTC YES
138 R CGT YES
139 Y TAT YES
140 D GAT YES
141 L CTT YES
142 W TGG YES
143 N AAC NOT
144 L CTC NOT
145 K AAG NOT
146 Y TAC NOT
147 L TTG NOT
148 H CAC NOT
149 R CGT NOT
150 F TTC NOT
151 T ACC NOT
152 W TGG NOT
153 S TCC NOT
154 H CAC NOT
155 * TGA NOT
This is an example for one protein; I have to do it for thousands of proteins. Please, do you have any suggestions?
Thank you!
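One possible way to produce the fourth column is sketched below in Python. It is a sketch under assumptions: the domain line is taken to hold an accession followed by start/end pairs, and the positions are compared against the same 0-based counter that the script above prints, as in the desired output. If the domain coordinates are really 1-based residue numbers, subtract 1 before comparing. All function names are illustrative.

```python
def parse_domain_line(line):
    """Split a line like 'Q9ULW3; 46 142' into (accession, [(start, end)])."""
    accession, rest = line.split(";", 1)
    numbers = [int(x) for x in rest.split()]
    # Assumption: numbers come in start/end pairs, one pair per domain.
    return accession.strip(), list(zip(numbers[0::2], numbers[1::2]))


def annotate(protein, cds, domains):
    """Pair each amino acid with its codon and flag rows inside a domain.

    protein and cds are sequence strings (embedded whitespace is ignored);
    domains is a list of inclusive (start, end) pairs.
    """
    protein = "".join(protein.split())
    cds = "".join(cds.split())
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    rows = []
    for index, (aa, codon) in enumerate(zip(protein, codons)):
        inside = any(start <= index <= end for start, end in domains)
        rows.append((index, aa, codon, "YES" if inside else "NOT"))
    return rows
```

Each row can then be printed with `print("\t".join(map(str, row)))`, looping over the proteins in the same way as the Perl script loops over `%hash1`.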
I am stuck at one point in my project. I am a biomedical scientist, so I don't know much Perl programming.
I have a file that explains proteins interactions with ligands. The file looks as shown below:
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
You can see that O12 appears in two rows, and likewise CB2 appears twice. O12 and CB2 are atom symbols; O12, for example, denotes oxygen 12. I need to count how many different atom symbols there are in the file, and I have to use a Perl script to do that. I am reading the file line by line: while (my $line = <MYFILE>) { }; so the count has to be made while reading the file line by line. I hope I have explained my problem clearly enough. Waiting for a kind reply...
How the problem is best solved depends on how your data is delimited. As it looks like fixed width, I'll present that solution first:
use strict;
use warnings;
my %atom;
while (<DATA>) {
my (undef,$atom) = unpack "A34A4 ", $_;
$atom{$atom}++;
}
print scalar keys %atom;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
Note here that I estimated the offset used by unpack, so you may need to tweak that to fit your data.
If your data is tab-delimited, you'll need to split on tab, or better yet use Text::CSV to parse your data. Basic script is the same:
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
sep_char => "\t",
});
my %atom;
while (<DATA>) {
$csv->parse($_);
my $atom = ($csv->fields())[9];
next unless defined $atom;
$atom{$atom}++;
}
You can also use the loop condition while (my $aref = $csv->getline(*DATA)), which is more efficient, but also breaks if your csv data is not consistent.
A simpler and possibly as valid (depending on how complex your data can be) solution is using split:
while (<DATA>) {
my $atom = (split /\t/)[9]; # implicitly splits $_
$atom{$atom}++;
}
If your data is space delimited, simply remove /\t/ from the above.
Note that I assumed all spaces were tabs in your input, so if they are not, my count may need to be tweaked.
In command line (no perl):
awk '{print $10}' yourfile | sort -u | wc -l
Works on your input.
Have a look at this Perl Cookbook recipe.
While you're reading the file line by line you want to split/extract the atom symbols and count them in a hash.
use strict;
use warnings;
# open FILE goes here...
my %seen; # we use this to count
while (<FILE>) {
# fetch the atom symbol after the arrow-thing; skip lines that do not match
next unless m/--[>-]\s+(\w+)\s/;
$seen{$1}++;
}
close FILE;
print scalar(keys %seen), "\n"; # number of unique atom symbols
print join(', ', keys %seen), "\n"; # List as string
Or in perl:
#!/usr/bin/perl
while(my $line = <DATA>){
my $atom = (split / +/, $line)[9];
$atoms{$atom}++;
}
print "$_: $atoms{$_}\n" for keys %atoms;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
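For comparison, the same count can be sketched in Python. This assumes, like the split-based answers above, that the fields are separated by whitespace and that the ligand atom symbol is the 10th field (index 9); the function name is illustrative.

```python
from collections import Counter


def count_ligand_atoms(lines, column=9):
    """Count how often each atom symbol occurs in the given field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank or short lines that have no such field.
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts
```

`len(count_ligand_atoms(open("yourfile")))` then gives the number of different atom symbols, and the returned counter also holds how many times each one appears.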
I have a file which looks like this.
In the Perl code, I am using an array @query = ('A+80', 'A+40', 'A+202', 'B+130', 'B+268', 'B+211', 'A+35');
What I want to do is: for each element of the array, scan the lines in the file shown below and print out something like this:
A+80 - HELIX
A+40 - SHEET
A+202 - HELIX
B+130 - HELIX
B+268 - SHEET
B+211 - SHEET
A+35 - LOOP
The logic behind this output is to split each array entry into its first part, i.e. A or B, and its second part, i.e. the number associated with it. Consider the first entry in the array: A+80. On the third line of the file, the number 80 lies between 78 (the 6th column) and 90 (the 9th column), and the letter A matches in both places. Hence the program prints HELIX for this query.
Consider the 2nd element: A+40. The number lies in the range given on this line
SHEET 2 B 3 ARG A 37 VAL A 43 1
i.e. between the numbers listed in columns 7 and 10, and the letter matches too. Hence for this entry the output is: SHEET
For other cases, like B+211, the line given below matches the number and the letter associated with it.
SHEET 3 F 4 ILE B 211 VAL B 214 1
Hence the output for this entry is: SHEET
Also, for entries whose letter and number do not match any of the lines in the file, the code outputs LOOP, e.g. A+35 - LOOP
What is an efficient way to do this in Perl?
Since I am a beginner with Perl, for now I am splitting each entry of @query into its parts and comparing the letter and number with the relevant columns of each line. But somehow I am unable to get the desired output.
Please help...
HELIX 1 1 GLY A 9 GLN A 30 1
HELIX 2 2 ASP A 47 ILE A 63 1
HELIX 3 3 GLU A 78 GLU A 90 1
HELIX 4 4 THR A 111 ALA A 117 1
HELIX 5 5 PRO A 120 LYS A 122 5
HELIX 6 6 SER A 129 ARG A 137 1
HELIX 7 7 CYS A 147 THR A 159 1
HELIX 8 8 GLY A 178 ASN A 188 1
HELIX 9 9 LEU A 202 LYS A 208 1
HELIX 10 10 GLY A 224 TRP A 226 5
HELIX 11 11 TYR A 258 GLU A 263 1
HELIX 12 12 VAL A 275 PHE A 294 1
HELIX 13 13 GLY B 9 GLN B 30 1
HELIX 14 14 ASP B 47 ILE B 63 1
HELIX 15 15 GLU B 78 GLU B 90 1
HELIX 16 16 THR B 111 ALA B 117 1
HELIX 17 17 PRO B 120 LYS B 122 5
HELIX 18 18 SER B 129 ARG B 137 1
HELIX 19 19 CYS B 147 THR B 159 1
HELIX 20 20 GLY B 178 TRP B 187 1
HELIX 21 21 LEU B 202 LYS B 208 1
HELIX 22 22 GLY B 224 TRP B 226 5
HELIX 23 23 TYR B 258 GLU B 263 1
HELIX 24 24 GLY B 276 PHE B 294 5
SHEET 1 A 2 GLU A 5 LEU A 7 0
SHEET 2 A 2 PHE A 267 THR A 269 1
SHEET 1 B 3 LYS A 66 LEU A 72 0
SHEET 2 B 3 ARG A 37 VAL A 43 1
SHEET 3 B 3 GLY A 96 VAL A 99 1
SHEET 1 C 4 THR A 191 CYS A 195 0
SHEET 2 C 4 HIS A 167 VAL A 171 1
SHEET 3 C 4 ILE A 211 VAL A 214 1
SHEET 4 C 4 ILE A 232 ASP A 235 1
SHEET 1 D 2 GLU B 5 LEU B 7 0
SHEET 2 D 2 PHE B 267 THR B 269 1
SHEET 1 E 3 LYS B 66 LEU B 72 0
SHEET 2 E 3 ARG B 37 VAL B 43 1
SHEET 3 E 3 GLY B 96 VAL B 99 1
SHEET 1 F 4 THR B 191 CYS B 195 0
SHEET 2 F 4 HIS B 167 VAL B 171 1
SHEET 3 F 4 ILE B 211 VAL B 214 1
SHEET 4 F 4 ILE B 232 ASP B 235 1
SHEET 1 G 2 ASN B 239 PRO B 242 0
SHEET 2 G 2 ARG B 250 VAL B 253 -1
The program below seems to do what you need. It reads data from within the source file using the DATA file handle for convenience: you must arrange to open and read the appropriate data file.
The entirety of the file is read into memory for straightforward access. If the file is enormous (say, hundreds of megabytes) then this approach may be inappropriate and you will have to come back for more help.
The records vary in length, so the algorithm locates the relevant fields relative to the first three-letter field found.
The hash %categories contains all of the required file data. It is indexed by the key letter - A or B here - and the value of each element is an array of anonymous hashes containing the label (column 1), the letter, and the start and end of the range covered by each record.
Building the output is straightforward, and uses map and grep to find the 'label' of all the relevant entries in the hash. If none are found the text "LOOP" is added as a default.
use strict;
use warnings;
my @query = qw/ A+80 A+40 A+202 B+130 B+268 B+211 A+35 /;
my %categories;
while (<DATA>) {
    next unless /\S/;
    my @data = split;
    my @indices = grep $data[$_] =~ /^[A-Z]{3}$/, 0 .. $#data;
    my %info;
    @info{qw/ label letter start end /} = @data[ 0, $indices[0]+1, $indices[0]+2, $indices[1]+2 ];
    push @{ $categories{$info{letter}} }, \%info;
}
for my $item (@query) {
    my ($letter, $value) = split /\+/, $item;
    my @matches = map $_->{label},
        grep { $value >= $_->{start} and $value <= $_->{end} }
        @{ $categories{$letter} };
    @matches = ('LOOP') unless @matches;
    warn qq(Multiple categories for query "$item") unless @matches == 1;
    printf "%s - %s\n", $item, $_ for @matches;
}
__DATA__
HELIX 1 1 GLY A 9 GLN A 30 1
HELIX 2 2 ASP A 47 ILE A 63 1
HELIX 3 3 GLU A 78 GLU A 90 1
HELIX 4 4 THR A 111 ALA A 117 1
HELIX 5 5 PRO A 120 LYS A 122 5
HELIX 6 6 SER A 129 ARG A 137 1
HELIX 7 7 CYS A 147 THR A 159 1
HELIX 8 8 GLY A 178 ASN A 188 1
HELIX 9 9 LEU A 202 LYS A 208 1
HELIX 10 10 GLY A 224 TRP A 226 5
HELIX 11 11 TYR A 258 GLU A 263 1
HELIX 12 12 VAL A 275 PHE A 294 1
HELIX 13 13 GLY B 9 GLN B 30 1
HELIX 14 14 ASP B 47 ILE B 63 1
HELIX 15 15 GLU B 78 GLU B 90 1
HELIX 16 16 THR B 111 ALA B 117 1
HELIX 17 17 PRO B 120 LYS B 122 5
HELIX 18 18 SER B 129 ARG B 137 1
HELIX 19 19 CYS B 147 THR B 159 1
HELIX 20 20 GLY B 178 TRP B 187 1
HELIX 21 21 LEU B 202 LYS B 208 1
HELIX 22 22 GLY B 224 TRP B 226 5
HELIX 23 23 TYR B 258 GLU B 263 1
HELIX 24 24 GLY B 276 PHE B 294 5
SHEET 1 A 2 GLU A 5 LEU A 7 0
SHEET 2 A 2 PHE A 267 THR A 269 1
SHEET 1 B 3 LYS A 66 LEU A 72 0
SHEET 2 B 3 ARG A 37 VAL A 43 1
SHEET 3 B 3 GLY A 96 VAL A 99 1
SHEET 1 C 4 THR A 191 CYS A 195 0
SHEET 2 C 4 HIS A 167 VAL A 171 1
SHEET 3 C 4 ILE A 211 VAL A 214 1
SHEET 4 C 4 ILE A 232 ASP A 235 1
SHEET 1 D 2 GLU B 5 LEU B 7 0
SHEET 2 D 2 PHE B 267 THR B 269 1
SHEET 1 E 3 LYS B 66 LEU B 72 0
SHEET 2 E 3 ARG B 37 VAL B 43 1
SHEET 3 E 3 GLY B 96 VAL B 99 1
SHEET 1 F 4 THR B 191 CYS B 195 0
SHEET 2 F 4 HIS B 167 VAL B 171 1
SHEET 3 F 4 ILE B 211 VAL B 214 1
SHEET 4 F 4 ILE B 232 ASP B 235 1
SHEET 1 G 2 ASN B 239 PRO B 242 0
SHEET 2 G 2 ARG B 250 VAL B 253 -1
output
A+80 - HELIX
A+40 - SHEET
A+202 - HELIX
B+130 - HELIX
B+268 - SHEET
B+211 - SHEET
A+35 - LOOP
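The lookup described above (find the two three-letter residue names, take the chain letter and number after each, then test whether the query number falls in that range) can also be sketched in Python. This is an illustrative sketch, not the answer's own code: the function names are made up, residue numbers are assumed to be plain integers, and only the first matching record is reported.

```python
import re


def parse_records(lines):
    """Map chain letter -> list of (label, start, end) from HELIX/SHEET lines."""
    by_chain = {}
    for line in lines:
        fields = line.split()
        # Positions of the three-letter residue names; the chain letter and
        # residue number follow each of them.
        idx = [i for i, f in enumerate(fields) if re.fullmatch(r"[A-Z]{3}", f)]
        if len(idx) < 2:
            continue
        chain = fields[idx[0] + 1]
        start, end = int(fields[idx[0] + 2]), int(fields[idx[1] + 2])
        by_chain.setdefault(chain, []).append((fields[0], start, end))
    return by_chain


def classify(query, by_chain):
    """Return the label for a query such as 'A+80', or 'LOOP' if none fits."""
    chain, number = query.split("+")
    number = int(number)
    for label, start, end in by_chain.get(chain, []):
        if start <= number <= end:
            return label
    return "LOOP"
```

With the records parsed once, `[f"{q} - {classify(q, by_chain)}" for q in queries]` would build the report lines for a list of query strings.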