Distance between one point to all other in a PDB file - perl

I have a PDB file. Now it has two parts separated by TER. Before TER I call it part 1. I want to take x,y,z of ATOM 1 of first part i.e before TER and find distance to all x,y,z co ordinates after TER and then second ATOM of part one to all ATOMS of part second. This has to be repeated for all ATOMS of first part= to all ATOMS of second part. I have to automate it for 20 files. names of my files begin like 1_0.pdb,2_0.pdb....20_0.pdb.
This is a distance calculation. I have tried something in PERL but its very rough. Can someone help a bit.
The File looks like:
----long file (I truncated it)----
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
The code is: In the end it finds the maximum distance and its co ordinates
my #points = ();
open(IN, #ARGV[0]) or die "$!";
while (my $line = <IN>) {
chomp($line);
my #array = (split (/\s+/, $line))[5, 6, 7];
print "#array\n";
push #points, [ #array ];
}
close(IN);
$max=0;
for my $i1 ( 0 .. $#points )
{
my ( $x1, $y1, $z1 ) = #{ $points[$i1] };
my $dist = sqrt( ($x1+1.925)**2 + ($y1+11.270)**2 + ($z1-1.404)**2 );
print "distance from (-1.925 -11.270 1.404) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max )
{ $max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";

Not sure I clearly understand what you want, but how about:
#!/usr/local/bin/perl
use strict;
use warnings;
my (#refer, #points);
my $part = 0;
while (my $line = <DATA>) {
chomp($line);
if ($line =~ /^TER/) {
$part++;
next;
}
my #array = (split (/\s+/, $line))[5, 6, 7];
if ($part == 0) {
push #refer, [ #array ];
} else {
push #points, [ #array ];
}
}
my %max = (val=>0, x=>0, y=>0, z=>0);
foreach my $ref(#refer) {
my ($x1, $y1, $z1) = #{$ref};
foreach my $atom(#points) {
my ($x, $y, $z) = #{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
if ($dist > $max{val}) {
$max{val} = $dist;
$max{x} = $x;
$max{y} = $y;
$max{z} = $z;
}
}
}
print "max is $max{val}; coord: x=$max{x}, y=$max{y}, z=$max{z}\n";
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
output:
max is 35.9813670807545; coord: x=30.175, y=4.389, z=5.768

The main issue here is reading the data. First, note that one cannot use split with PDB text files since the fields are defined by position and not by separators. See Coordinate File Description (PDB Format).
To separate the ATOM record of different polymer chains you can start with a simplified version like
my $iblock = 0;
my #atoms = ();
while (my $line = <IN>) {
chomp($line);
# Switch blocks at TER lines
if ($line =~ /^TER/) {
$iblock++;
# Read ATOM lines
} elsif ($line =~ m/^ATOM/) {
my #xyz = (substr($line,7-1,9),substr($line,16-1,9),substr($line,25-1,9));
printf "Block %d: atom at (%s)\n",$iblock,join (",",#xyz);
push #{$atoms[$iblock]},\#xyz;
# Parse additional line types (if needed)
} else {
...
}
}
Followed by a loop over all pairs of coordinates from different blocks, structured as follows:
# 1st block
for my $iblock1 (0..$#atoms) {
# 2nd block
for my $iblock2 ($iblock1+1..$#atoms) {
# Compare all pairs of atoms
...
my $xyz1 (#{$atoms[$iblock1]}) {
for my $xyz2 (#{$atoms[$iblock2]}) {
# Calculate distance and compare with $max_dist
...
}
}
# Print the maximal distance between these two blocks
...
}
}
Of course, the code could be more general if a more elaborate data structure is used or by applying one of the available PDB parsers, such as Bioperl's.

With proper encapsulation, this is pretty simple, and requires minor modifications of your code.
ETA: Added fixed width solution I had on hand. It would probably be best to read all the fields instead of discarding the first 31 chars, and then return them all in a hash reference. That way, you could process all the lines with the same subroutine, and simply switch between parts when the first field turns out to be TER. It should be easy for you to extrapolate this from the given code.
You'll note that the reference values are read in with a loop, because we need to break the loop at the break point. The rest of the values are slurped up with a map statement. Then we simply feed the data to the subroutine we made from your initial code (with some improvements). I used the same names for the lexical variables to make it easier to read the code.
use strict;
use warnings;
my #points;
while (<DATA>) {
last if /^TER$/;
push #points, getpoints($_);
}
my #ref = map getpoints($_), <DATA>;
for my $p (#points) {
getcoords($p, \#ref);
}
sub getpoints {
my $line = shift;
my #data = unpack "A31 A8 A8 A8", $line;
shift #data;
return \#data;
}
sub getcoords {
my ($p, $ref) = #_;
my ($p1,$p2,$p3) = #$p;
my $max=0;
my ($x,$y,$z);
for my $aref ( #$ref ) {
my ( $x1, $y1, $z1 ) = #$aref;
my $dist = sqrt(
($x1-$p1)**2 +
($y1-$p2)**2 +
($z1-$p3)**2
);
print "distance from ($p1 $p2 $p3) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max ) {
$max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}
}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
}
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008

Related

calculating size of the result and moving files into another folder if it fulfills conditions [duplicate]

I am trying to calculate distance between each coordinate of protein atom (ATOM) and ligand atom (HETATM). I have number of files that look like this:
ATOM 1592 HD13 LEU D 46 11.698 -10.914 2.183 1.00 0.00 H
ATOM 1593 HD21 LEU D 46 11.528 -8.800 5.301 1.00 0.00 H
ATOM 1594 HD22 LEU D 46 12.997 -9.452 4.535 1.00 0.00 H
ATOM 1595 HD23 LEU D 46 11.722 -8.718 3.534 1.00 0.00 H
HETATM 1597 N1 308 A 1 0.339 6.314 -9.091 1.00 0.00 N
HETATM 1598 C10 308 A 1 -0.195 5.226 -8.241 1.00 0.00 C
HETATM 1599 C7 308 A 1 -0.991 4.254 -9.133 1.00 0.00 C
HETATM 1600 C1 308 A 1 -1.468 3.053 -8.292 1.00 0.00 C
So I am trying to calculate distances between ATOM1 and all other HETATM1, between ATOM1 and all other 'HETATM2' and so on. I have written a script in perl, but I cannot figure it out what is wrong with the script, it doesnt give me any error it just does not print anything.
I am also not sure how to add it in the script and if it is possible, that if the result of each calculation is more then 5 then delete this both lines that were included into calculation. If it is <= then 5 then keep it.
#!/usr/local/bin/perl
use strict;
use warnings;
open(IN, $ARGV[0]) or die "$!";
my (#refer, #points);
my $part = 0;
my $dist;
while (my $line = <IN>) {
chomp($line);
if ($line =~ /^HETATM/) {
$part++;
next;
}
my #array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
# print "#array\n";
if ($part == 0) {
push #refer, [ #array ];
} elsif ($part ==1){
push #points, [ #array ];
}
}
foreach my $ref(#refer) {
my ($x1, $y1, $z1) = #{$ref};
foreach my $atom(#points) {
my ($x, $y, $z) = #{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
print $dist;
}
}
When seeing a line with HETATM you increment $part and skip to the next input line. Your array #refer will therefor be empty.
Remove the line with next; after incrementing $part.
And your test should be } elsif( $part ) { ... } since you increment $part for each line of HETATM.
OK, I have to say - I'd rewrite your code, to work a bit differently.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %coordinates;
#use types to track different types. Unclear if you need to handle anything other than 'not ATOM' but this is in case you do.
my %types;
#read STDIN or files specified on command line - like how grep/sed do it.
while ( <> ) {
my ( $type, $id, undef, undef, undef, undef, $x, $y, $z ) = split; # splits on white space.
$coordinates{$type}{$id} = [$x, $y, $z];
$types{$type}++ if $type ne 'ATOM';
}
#print for debugging:
print Dumper \%coordinates;
print Dumper \%types;
#iterate each element of "ATOM"
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) {
#iterate all the types (HETATM)
foreach my $type ( sort keys %types ) {
#iterate each id within the data structure.
foreach my $id ( sort keys %{$coordinates{$type}} ) {
my $dist = 0;
#take square of x - x1, y - y1, z - z1
#do it iteratively, using 'for' loop.
$dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2;
$dist = sqrt $dist;
print "$atom_id -> $type $id $dist\n";
}
This is:
Using <> to read STDIN or named files on command line instead of manually opening ARGV[0] which accomplishes a similar result. (but means you can pipe stuff through it too).
Reads your data into a hash first.
Then iterates all the possible pairings, calculating your distance.
Conditionally prints if they match the criteria (all your results seem to?)
This gives as results:
1592 -> HETATM 1597 23.5145474334506
1592 -> HETATM 1598 22.5965224094328
1592 -> HETATM 1599 22.7844420822631
1592 -> HETATM 1600 21.8665559702483
1595 -> HETATM 1597 22.6919443415499
1595 -> HETATM 1598 21.7968036647578
1595 -> HETATM 1599 22.1437585337268
1595 -> HETATM 1600 21.2693868505888
1594 -> HETATM 1597 24.3815421169376
1594 -> HETATM 1598 23.509545380547
1594 -> HETATM 1599 23.8816415683679
1594 -> HETATM 1600 23.0248383056212
1593 -> HETATM 1597 23.6802952050856
1593 -> HETATM 1598 22.74957513889
1593 -> HETATM 1599 23.1402816102138
1593 -> HETATM 1600 22.2296935201545
Now you mention wanting to delete lines that are 'too far' - that's a bit complicated, because you've a compound criteria (and you'll delete all your lines).
The problem is - you don't know if your ATOM lines have too much "distance" until you've tested every single pairing in the file.
You could perhaps do this by:
#iterate each element of "ATOM"
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) {
#iterate all the types (HETATM)
foreach my $type ( sort keys %types ) {
#iterate each id within the data structure.
foreach my $id ( sort keys %{$coordinates{$type}} ) {
my $dist = 0;
#take square of x - x1, y - y1, z - z1
#do it iteratively, using 'for' loop.
$dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2;
$dist = sqrt $dist;
print "### $atom_id -> $type $id $dist\n";
##note - this will print out multiple times if there's multiple pairings.
if ( $dist <= 5 ) {
print $lines{'ATOM'}{$atom_id};
print $lines{$type}{$id};
}
}
}
}
Which will - for each pairing-comparison print both the ATOM and HETATM lines that had a distance of <= 5. But it will do so multiple times if multiple pairings exist.
But I think your core error is in mishandling the $part and next clauses.
You only ever increment $part and whilst you initialise it at 0, you never reset it to zero. So it'll be 1,2,3,4 for each successive HETATM.
You use next after incrementing part which means you skip the if ( $part == 1 clause entirely.
I would use this approach:
#!/usr/bin/env perl
use strict;
use warnings;
use constant POZITION => ( 6, 7, 8 ); # X, Y, Z
sub dist {
my ( $a, $b ) = #_;
my $s = 0;
for my $i ( 0 .. $#$a ) {
$s += ( $a->[$i] - $b->[$i] )**2;
}
return sqrt($s);
}
# Record format
use constant {
LINE => 0,
POZ => 1,
KEEP => 2,
};
my ( #refer, #points );
while ( my $line = <> ) {
my ( $type, #poz ) = ( split ' ', $line )[ 0, POZITION ];
print STDERR join( ',', $type, #poz ), "\n";
if ( $type eq 'ATOM' ) {
push #refer, [ $line, \#poz ];
}
elsif ( $type eq 'HETATM' ) {
push #points, [ $line, \#poz ];
}
}
for my $ref (#refer) {
for my $atom (#points) {
my $dist = dist( $ref->[POZ], $atom->[POZ] );
print STDERR "$ref->[LINE]$atom->[LINE]dist: $dist\n";
next if $dist > 5;
$ref->[KEEP] ||= 1;
$atom->[KEEP] ||= 1;
}
}
print $_->[LINE] for grep $_->[KEEP], #refer, #points;
Unfortunately, your data doesn't contain any ATOM and HETATM pair with distance <= 5. (Note that split ' ' is word split. It means split /\s+/ with omitting any leading and trailing whitespaces.)
It works as a filter with debugging output to the STDERR.

Moving an input file into another folder whenever it encounters specific result

I have written a script in perl that reads each file in current directory and calculates distances between protein and ligand atom. Whenever it encounters condition that the result is <=5 it should break the loop (last) and move this file into another directory (mp) that was previously made. For some reason I have a problem with this script; it calcluates the distances correctly but does not move these files into another folder. It does not give me any error tough. I am trying to figure out what seems to be a problem.
This is the script that I have been using:
#!/usr/local/bin/perl
use strict;
use warnings;
use File::Glob;
mkdir "mp";
my $txt;
for my $txt ( glob '*.txt' )
{
open my $fh, '<', $txt;
my $out_fh;
my (#refer, #points);
my $part = 0;
my $dist;
while (my $line = <$fh>)
{
chomp($line);
$part++ if ($line =~ /^HETATM/);
my #array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
#print "#array\n";
if ($part == 0)
{
push #refer, [ #array ];
}
elsif ($part)
{
push #points, [ #array ];
}
my $atom;
foreach my $ref(#refer)
{
my ($x1, $y1, $z1) = #{$ref};
foreach my $atom(#points)
{
my ($x, $y, $z) = #{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
if ($dist <= 5)
{
print "Distance for calculation between $ref and $atom is $dist\n";
last;
system ("mv $fh mp");
}
}
}
}
}
The content of my input file looks like this:
ATOM 1593 HD21 LEU D 46 11.528 -8.800 5.301 1.00 0.00 H
ATOM 1594 HD22 LEU D 46 12.997 -9.452 4.535 1.00 0.00 H
ATOM 1595 HD23 LEU D 46 11.722 -8.718 3.534 1.00 0.00 H
HETATM 1597 N1 308 A 1 0.339 6.314 -9.091 1.00 0.00 N
HETATM 1598 C10 308 A 1 -0.195 5.226 -8.241 1.00 0.00 C
The result that script gives me in terminal looks like this:
Distance for calculation between ARRAY(0x1c61fa8) and ARRAY(0x1c6f950) is 4.98553437456809
Distance for calculation between ARRAY(0x1c62098) and ARRAY(0x1c6ffe0) is 4.98962253081333
But it does not move the files.

How to extract line after line with matching pattern in perl

I am new in perl. I am trying this code below to extract line after the line that has string "length' in my text file merge.txt. I want to get the result as shown below. can someone please help me with what I need to add to my code. Thanks
my #array;
While (#array = ‘/Users/Desktop/merged.txt’) {
foreach my $line (#array) {
if ($line =~ m/length/) {
my $a = $array[$i+1];
push (#array, $a);
}
my $wanted_type = "$1";
print "$wanted_type\n";
}
}
}
merge.txt file has
owner#owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/merged.fastq 32 32
length number A C G T
32 14824945 1992856 1576607 2413263 8756583
owner#owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/merged.fastq 33 33
length number A C G T
33 58619575 1415093 3274505 5499169 48070172
owner#owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/test.fastq 34 34
length number A C G T
34 13018196 1047476 903554 1695778 9296236
result I want is shown below. I also want to grab the filename from line
that has /c92ed9e9-3d94-497c-bb2e-514a4806bbcd/ as shown below.
merged.fastq
32 14824945 1992856 1576607 2413263 8756583
33 58619575 1415093 3274505 5499169 48070172
test.fastq
34 13018196 1047476 903554 1695778 9296236
my $print;
while (<>) {
if ( my $qfn = /^\s*%\s*\S+\s+(\S+)/ ) {
( my $fn = $qfn ) =~ s{^.*/}{}s;
print("$fn\n");
}
print if $print;
$print = /length/;
}

Calculating distances in PDB file

With reference to the question Calculating the distance between atomic coordinates, where the input is
ATOM 920 CA GLN A 203 39.292 -13.354 17.416 1.00 55.76 C
ATOM 929 CA HIS A 204 38.546 -15.963 14.792 1.00 29.53 C
ATOM 939 CA ASN A 205 39.443 -17.018 11.206 1.00 54.49 C
ATOM 947 CA GLU A 206 41.454 -13.901 10.155 1.00 26.32 C
ATOM 956 CA VAL A 207 43.664 -14.041 13.279 1.00 40.65 C
.
.
.
ATOM 963 CA GLU A 208 45.403 -17.443 13.188 1.00 40.25 C
there is an answer reported as
use strict;
use warnings;
my #line;
while (<>) {
push #line, $_; # add line to buffer
next if #line < 2; # skip unless buffer is full
print proc(#line), "\n"; # process and print
shift #line; # remove used line
}
sub proc {
my #a = split ' ', shift; # line 1
my #b = split ' ', shift; # line 2
my $x = ($a[6]-$b[6]); # calculate the diffs
my $y = ($a[7]-$b[7]);
my $z = ($a[8]-$b[8]);
my $dist = sprintf "%.1f", # format the number
sqrt($x**2+$y**2+$z**2); # do the calculation
return "$a[3]-$b[3]\t$dist"; # return the string for printing
}
The output of above code is the distance between the first CA to the second one and second to third and so on...
How to modify this code to find the distance between first CA to rest of the CAs (2, 3, ..) and from second CA to rest of the CAs (3, 4, ..) and so on and printing only those which is less then 5 Angstrom?
I found that push #line, $_; statement should be altered to increase the array size but not clear how to do that.
To get the pairs, read the file into an array, #data_array. Then loop over the entries.
Update: Added file opening and load #data_array.
open my $fh, '<', 'atom_file.pdb' or die $!;
my #data_array = <$fh>;
close $fh or die $!;
for my $i (0 .. $#data_array) {
for my $j ($i+1 .. $#data_array) {
process(#data_array[$i,$j]);
}
}
May be try this:
use strict;
use warnings;
my #alllines = ();
while(<DATA>) { push(#alllines, $_); }
#Each Current line
for(my $i=0; $i<=$#alllines+1; $i++)
{
#Each Next line
for(my $j=$i+1; $j<=$#alllines; $j++)
{
if($alllines[$i])
{
#Split the line into tab delimits
my ($line1_tb_1,$line1_tb_2,$line1_tb_3) = split /\t/, $alllines[$i];
print "Main_Line: $line1_tb_1\t$line1_tb_2\t$line1_tb_3";
if($alllines[$j])
{
#Split the line into tab delimits
my ($line_nxt_tb1,$line_nxt_tb2,$line_nxt_tb3) = split /\t/, $alllines[$j];
print "Next_Line: $line_nxt_tb1\t$line_nxt_tb2\t$line_nxt_tb3";
#Do it your coding/regex here
}
}
#system 'pause'; Testing Purpose!!!
}
}
__DATA__
tab1 123 456
tab2 789 012
tab3 345 678
tab4 901 234
tab5 567 890
I hope this will help you.

Print header row and each row in file the file, side by side in columns

Hi I have a tsv file that i am trying to print the header row and each line of the file side by side ie in columns.
Unfortunatley i am bit confused on how to join the lines in a print statement.
#!/usr/bin/perl
use strict;
use warnings;
local $" = "'\n'";
my #temp;
while (<DATA>) {
chomp;
my #columns = join "\t", $_;
push #temp, #columns;
}
my $Header_row = shift (#temp);
my #head;
my $abc;
my #abc = split(/\t/,$Header_row);
for my $abc(#abc) {
push #head, $abc ."\n";
}
my #roows;
my $elements;
foreach (#temp){
chomp;
my $line = $_;
my #elements = split ("\t", $line);
for $elements(#elements){
push #roows, $elements ."\n";
}
}
#print #head, $abc ."\n";
#print #roows, $elements ."\n";
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
required Output
OUTPUT
Year 88
Tonn 61
Class T
Cargo Rice
Type Truck
Year 89
Tonn 55
Class G
Cargo Corn
Type Train
Year 92
Tonn 93
Class S
Cargo Peas
Type Ship
Based on your source, this should do the trick:
#!/usr/bin/env perl
use strict;
use warnings;
#read the header line into #header;
my $header_line = <DATA>;
chomp $header_line;
chomp ( my #header = split ( ' ', $header_line );
#iteraate data fh
while ( <DATA> ) {
#strip linefeed
chomp;
#read this row into a hash
my %row; #row{#header} = split;
#print this hash in the same order as the header.
#note - $_ is set to each element of header in turn when doing this.
print "$_\t$row{$_}\n" for #header;
#insert extra linefeed
print "\n";
}
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
Note - you can condense further that 'read headers' to:
chomp ( my #header = split ( ' ', <DATA> ) );
Which does the same thing, but might be a bit harder to follow.
There's really no need to read all the lines into #temp before looping through to print them out. It would be more efficient to read just the first line to get the headings and then loop through the remaining lines printing them immediately:
#!/usr/bin/perl
use strict;
use warnings;
my #temp;
my $line = <DATA>;
chomp($line);
my #head = split "\t", $line;
foreach $line (<DATA>) {
chomp($line);
my #elements = split ("\t", $line);
foreach my $i (0..$#head) {
print $head[$i], "\t", $elements[$i], "\n";
}
print "\n";
}
__DATA__
Year Tonn Class Cargo Type
88 61 T Rice Truck
89 55 G Corn Train
92 93 S Peas Ship
The print line could also be written as:
print "$head[$i]\t$elements[$i]\n";
I just thought it was a little clearer to separate out all the parts.