Moving an input file into another folder whenever a specific result is encountered - perl

I have written a script in Perl that reads each file in the current directory and calculates the distances between protein and ligand atoms. Whenever it finds that a result is <=5 it should break the loop (last) and move that file into another directory (mp) that was made beforehand. For some reason I have a problem with this script: it calculates the distances correctly but does not move the files into the other folder, and it does not give me any error either. I am trying to figure out what the problem is.
This is the script that I have been using:
#!/usr/local/bin/perl
use strict;
use warnings;
use File::Glob;
mkdir "mp";
my $txt;
for my $txt ( glob '*.txt' )
{
open my $fh, '<', $txt;
my $out_fh;
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <$fh>)
{
chomp($line);
$part++ if ($line =~ /^HETATM/);
my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
#print "@array\n";
if ($part == 0)
{
push @refer, [ @array ];
}
elsif ($part)
{
push @points, [ @array ];
}
my $atom;
foreach my $ref(@refer)
{
my ($x1, $y1, $z1) = @{$ref};
foreach my $atom(@points)
{
my ($x, $y, $z) = @{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
if ($dist <= 5)
{
print "Distance for calculation between $ref and $atom is $dist\n";
last;
system ("mv $fh mp");
}
}
}
}
}
The content of my input file looks like this:
ATOM 1593 HD21 LEU D 46 11.528 -8.800 5.301 1.00 0.00 H
ATOM 1594 HD22 LEU D 46 12.997 -9.452 4.535 1.00 0.00 H
ATOM 1595 HD23 LEU D 46 11.722 -8.718 3.534 1.00 0.00 H
HETATM 1597 N1 308 A 1 0.339 6.314 -9.091 1.00 0.00 N
HETATM 1598 C10 308 A 1 -0.195 5.226 -8.241 1.00 0.00 C
The result that script gives me in terminal looks like this:
Distance for calculation between ARRAY(0x1c61fa8) and ARRAY(0x1c6f950) is 4.98553437456809
Distance for calculation between ARRAY(0x1c62098) and ARRAY(0x1c6ffe0) is 4.98962253081333
But it does not move the files.
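For reference, a minimal sketch of the move itself, assuming the intent is to relocate the whole file by name once any distance is <= 5: File::Copy::move wants the file name held in $txt rather than the filehandle $fh, and a statement placed after last never runs, so the move has to happen after leaving the loops (the flag name here is made up for illustration).

use File::Copy qw(move);
# ...
my $close_contact = 0;
# inside the distance loops:
#     if ($dist <= 5) { $close_contact = 1; last; }
# after the loops, once the file has been read:
if ($close_contact) {
    close $fh;
    move($txt, 'mp') or warn "Cannot move $txt into mp/: $!";
}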

Related

calculating size of the result and moving files into another folder if it fulfills conditions [duplicate]

I am trying to calculate the distance between each coordinate of a protein atom (ATOM) and a ligand atom (HETATM). I have a number of files that look like this:
ATOM 1592 HD13 LEU D 46 11.698 -10.914 2.183 1.00 0.00 H
ATOM 1593 HD21 LEU D 46 11.528 -8.800 5.301 1.00 0.00 H
ATOM 1594 HD22 LEU D 46 12.997 -9.452 4.535 1.00 0.00 H
ATOM 1595 HD23 LEU D 46 11.722 -8.718 3.534 1.00 0.00 H
HETATM 1597 N1 308 A 1 0.339 6.314 -9.091 1.00 0.00 N
HETATM 1598 C10 308 A 1 -0.195 5.226 -8.241 1.00 0.00 C
HETATM 1599 C7 308 A 1 -0.991 4.254 -9.133 1.00 0.00 C
HETATM 1600 C1 308 A 1 -1.468 3.053 -8.292 1.00 0.00 C
So I am trying to calculate the distances between ATOM 1 and HETATM 1, between ATOM 1 and HETATM 2, and so on. I have written a script in Perl, but I cannot figure out what is wrong with it; it doesn't give me any error, it just does not print anything.
I am also not sure how to add this to the script, or whether it is possible at all: if the result of a calculation is more than 5, then delete both lines that went into that calculation; if it is <= 5, then keep them.
#!/usr/local/bin/perl
use strict;
use warnings;
open(IN, $ARGV[0]) or die "$!";
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <IN>) {
chomp($line);
if ($line =~ /^HETATM/) {
$part++;
next;
}
my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
# print "@array\n";
if ($part == 0) {
push @refer, [ @array ];
} elsif ($part ==1){
push @points, [ @array ];
}
}
foreach my $ref(@refer) {
my ($x1, $y1, $z1) = @{$ref};
foreach my $atom(@points) {
my ($x, $y, $z) = @{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
print $dist;
}
}
When seeing a line with HETATM you increment $part and then skip to the next input line, so the HETATM coordinates are never stored and your array @points will therefore stay empty.
Remove the line with next; after incrementing $part.
And your test should be } elsif( $part ) { ... } since you increment $part for each line of HETATM.
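With those two changes, the reading loop from the question would look something like this (a sketch, keeping the original variable names):

while (my $line = <IN>) {
    chomp($line);
    $part++ if $line =~ /^HETATM/;   # no 'next', so the coordinates below still get stored
    my @array = (substr($line, 30, 8), substr($line, 38, 8), substr($line, 46, 8));
    if ($part == 0) {
        push @refer, [ @array ];     # protein (ATOM) coordinates
    } elsif ($part) {
        push @points, [ @array ];    # ligand (HETATM) coordinates
    }
}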
OK, I have to say - I'd rewrite your code, to work a bit differently.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %coordinates;
#use types to track different types. Unclear if you need to handle anything other than 'not ATOM' but this is in case you do.
my %types;
#read STDIN or files specified on command line - like how grep/sed do it.
while ( <> ) {
my ( $type, $id, undef, undef, undef, undef, $x, $y, $z ) = split; # splits on white space.
$coordinates{$type}{$id} = [$x, $y, $z];
$types{$type}++ if $type ne 'ATOM';
}
#print for debugging:
print Dumper \%coordinates;
print Dumper \%types;
#iterate each element of "ATOM"
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) {
#iterate all the types (HETATM)
foreach my $type ( sort keys %types ) {
#iterate each id within the data structure.
foreach my $id ( sort keys %{$coordinates{$type}} ) {
my $dist = 0;
#take square of x - x1, y - y1, z - z1
#do it iteratively, using 'for' loop.
$dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2;
$dist = sqrt $dist;
print "$atom_id -> $type $id $dist\n";
}
}
}
This is:
Using <> to read STDIN or named files on the command line instead of manually opening $ARGV[0], which accomplishes a similar result (but means you can pipe stuff through it too).
Reads your data into a hash first.
Then iterates all the possible pairings, calculating your distance.
Conditionally prints if they match the criteria (all your results seem to?)
This gives as results:
1592 -> HETATM 1597 23.5145474334506
1592 -> HETATM 1598 22.5965224094328
1592 -> HETATM 1599 22.7844420822631
1592 -> HETATM 1600 21.8665559702483
1595 -> HETATM 1597 22.6919443415499
1595 -> HETATM 1598 21.7968036647578
1595 -> HETATM 1599 22.1437585337268
1595 -> HETATM 1600 21.2693868505888
1594 -> HETATM 1597 24.3815421169376
1594 -> HETATM 1598 23.509545380547
1594 -> HETATM 1599 23.8816415683679
1594 -> HETATM 1600 23.0248383056212
1593 -> HETATM 1597 23.6802952050856
1593 -> HETATM 1598 22.74957513889
1593 -> HETATM 1599 23.1402816102138
1593 -> HETATM 1600 22.2296935201545
Now you mention wanting to delete lines that are 'too far' - that's a bit complicated, because you have a compound criterion (and you'll delete all your lines).
The problem is - you don't know if your ATOM lines have too much "distance" until you've tested every single pairing in the file.
You could perhaps do this by:
#iterate each element of "ATOM"
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) {
#iterate all the types (HETATM)
foreach my $type ( sort keys %types ) {
#iterate each id within the data structure.
foreach my $id ( sort keys %{$coordinates{$type}} ) {
my $dist = 0;
#take square of x - x1, y - y1, z - z1
#do it iteratively, using 'for' loop.
$dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2;
$dist = sqrt $dist;
print "### $atom_id -> $type $id $dist\n";
##note - this will print out multiple times if there's multiple pairings.
if ( $dist <= 5 ) {
print $lines{'ATOM'}{$atom_id};
print $lines{$type}{$id};
}
}
}
}
Which will - for each pairing-comparison print both the ATOM and HETATM lines that had a distance of <= 5. But it will do so multiple times if multiple pairings exist.
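The %lines hash used in that snippet is not shown being filled; a minimal sketch of populating it alongside %coordinates in the reading loop (the hash is an assumption, keyed the same way as %coordinates):

my %lines;
while ( <> ) {
    my ( $type, $id ) = split;
    $lines{$type}{$id} = $_;    # keep the raw line so it can be printed back out later
    # ... the existing code that fills %coordinates and %types ...
}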
But I think your core error is in mishandling the $part and next clauses.
You only ever increment $part and whilst you initialise it at 0, you never reset it to zero. So it'll be 1,2,3,4 for each successive HETATM.
You use next after incrementing $part, which means you skip the elsif ( $part == 1 ) clause entirely.
I would use this approach:
#!/usr/bin/env perl
use strict;
use warnings;
use constant POZITION => ( 6, 7, 8 ); # X, Y, Z
sub dist {
my ( $a, $b ) = @_;
my $s = 0;
for my $i ( 0 .. $#$a ) {
$s += ( $a->[$i] - $b->[$i] )**2;
}
return sqrt($s);
}
# Record format
use constant {
LINE => 0,
POZ => 1,
KEEP => 2,
};
my ( @refer, @points );
while ( my $line = <> ) {
my ( $type, @poz ) = ( split ' ', $line )[ 0, POZITION ];
print STDERR join( ',', $type, @poz ), "\n";
if ( $type eq 'ATOM' ) {
push @refer, [ $line, \@poz ];
}
elsif ( $type eq 'HETATM' ) {
push @points, [ $line, \@poz ];
}
}
for my $ref (@refer) {
for my $atom (@points) {
my $dist = dist( $ref->[POZ], $atom->[POZ] );
print STDERR "$ref->[LINE]$atom->[LINE]dist: $dist\n";
next if $dist > 5;
$ref->[KEEP] ||= 1;
$atom->[KEEP] ||= 1;
}
}
print $_->[LINE] for grep $_->[KEEP], @refer, @points;
Unfortunately, your data doesn't contain any ATOM and HETATM pair with distance <= 5. (Note that split ' ' is word split: it behaves like split /\s+/ but also discards leading whitespace.)
It works as a filter with debugging output to the STDERR.

How to delete lines that in specific column match elements from another file

I have a folder with many text files that look like this:
ATOM 5132 HG22 ILE B 162 -10.906 60.208 9.028 1.00 0.00 H
ATOM 5133 HG23 ILE B 162 -11.193 58.585 9.650 1.00 0.00 H
ATOM 5134 HD11 ILE B 162 -9.888 57.413 9.161 1.00 0.00 H
ATOM 5135 HD12 ILE B 162 -8.448 57.195 8.181 1.00 0.00 H
ATOM 5136 HD13 ILE B 162 -9.913 56.300 7.799 1.00 0.00 H
HETATM 5138 ZN ZN A 190 30.757 32.494 -1.721 1.00 0.00 ZN
HETATM 5139 C1 UQ1 B 501 2.889 33.364 18.810 1.00 0.00 C
HETATM 5140 O1 UQ1 B 501 2.849 32.140 19.037 1.00 0.00 O
HETATM 5141 C2 UQ1 B 501 4.162 33.930 18.303 1.00 0.00 C
HETATM 5142 O2 UQ1 B 501 5.209 33.069 18.099 1.00 0.00 O
HETATM 5143 CM2 UQ1 B 501 5.802 32.349 19.180 1.00 0.00 C
HETATM 5144 C3 UQ1 B 501 4.270 35.396 18.017 1.00 0.00 C
I have a file ions_solvents_cofactors that contains a number of symbols and looks like this:
ZN
008
03S
06C
0KA
0NG
0NM
0QE
144
1CL
1SA
1TP
202
21H
2A6
2BM
2F2
2HE
2HP
2MO
2NO
2PA
2PN
2PO
2T8
I wrote a program that should open and read each .txt file in the current folder and delete those lines where column 4 matches any value from the file ions_solvents_cofactors when column 1 is HETATM.
It gives me this error
rm: cannot remove `ATOM': No such file or directory
rm: cannot remove `1459': No such file or directory
rm: cannot remove `HB': No such file or directory
rm: cannot remove `ILE': No such file or directory
This is the script
#!/usr/local/bin/perl
use strict;
use warnings;
$dirname = '.';
opendir( DIR, $dirname ) or die "cannot open directory";
@files = grep( /\.txt$/, readdir( DIR ) );
foreach $files ( @files ) {
open( FH, $files ) or die "could not open $files\n";
@file_each = <FH>;
close FH;
close DIR;
my @ion = ();
my $ionfile = 'ions_solvents_cofactors';
open( ION, $ionfile ) or die "Could not open $ionfile, $!";
my @ion = <ION>;
close ION;
for ( my $line = 0; $line <= $#file_each; $line++ ) {
chomp( $file_each[$line] );
if ( $file_each[$line] =~ /^HETATM/ ) {
@is = split '\s+', $file_each[$line];
chomp $is[3];
}
foreach ( $file_each[$line] ) { # line 39
if ( "@ion" =~ $is[3] ) {
system( "rm $file_each[$line]" );
}
}
}
}
I want the script to overwrite each text file, reading only the fourth column of the lines that start with HETATM. If that column matches any of the elements from the file ions_solvents_cofactors then the line should be deleted.
So, for example
HETATM 5138 ZN ZN A 190 30.757 32.494 -1.721 1.00 0.00 ZN
this line should be deleted from the file because ZN matches.
There are a number of needed improvements, and a few direct errors.
First, a simple working version of the code, with some assumptions taken from the question:
use warnings;
use strict;
use feature 'say';
#use File::Glob ':bsd_glob'; # using \Q..\E in glob, no need for this
use File::Copy qw(move);
use List::MoreUtils qw(any);
my $dirname = shift @ARGV || '.';
my $ionfile = 'ions_solvents_cofactors';
open my $fh, '<', $ionfile or die "Can't open $ionfile: $!";
my @ion_terms = <$fh>;
chomp @ion_terms;
my @files = glob "\Q$dirname\E/*.txt";
foreach my $file (@files) {
open my $fh, '<', $file or do {
warn "Can't open $file: $!";
next;
};
my $outfile = $file . '_new';
open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
while (<$fh>) {
next if not /^HETATM/;
my @fields = split;
next if any { $fields[3] =~ /$_/ } @ion_terms;
print $fh_out $_;
}
# Uncomment to overwrite, when thoroughly tested
#move $outfile, $file or warn "Can't move $outfile to $file: $!"
}
Comments
The reference file needs to be opened only once; get it out of the loop
No purpose in "initializing" an array to empty, like my @ion = (). You get that when you declare it with my @ion. (If you need to clear an array, then @ary = (); makes sense)
Use lexical filehandles, open my $fh, ..., instead of typeglobs FH. Use lexical filehandles. Use lexical filehandles. See end of Typeglobs and Filehandles and read open
There is practically never a need for a C-style foreach loop. If you need to iterate over index, for my $i (0..$#ary) is great. But most of the time you need elements, like here
Instead of the \s+ pattern used in split you should use ' ', which is also split's default. This is why the code above doesn't need it, as split; is the same as split ' ', $_; (see the short example after this list)
The @file_each is not a good name for lines in the file
Direct error: you are attempting to rm what is a line in a file! Better naming would've helped
Your use of opendir and readdir is fine (except for DIR instead of a lexical filehandle!!), but glob is better here. Edit: I use \Q..\E in glob, to prevent possible injection bug whereby an unusual directory name triggers unintended processing. Since those also escape spaces the File::Glob with its bsd_glob() isn't needed any more
I use List::MoreUtils::any to find whether any element from @ion_terms satisfies the condition in the block, to match $fields[3]. This can be done with grep as well. Also, if your list of terms is as short as shown, you can assemble a regex pattern with it
my $re = join '|', map { quotemeta } @ion_terms; # before the loop
next if $fields[3] =~ /$re/;
Some of the code above can be written more concisely and simply
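To illustrate the split note above, on a line with leading whitespace the two forms behave differently (a standalone example):

my $line = "  HETATM  5138  ZN  ZN";
my @a = split /\s+/, $line;   # ('', 'HETATM', '5138', 'ZN', 'ZN') - leading empty field
my @b = split ' ',   $line;   # ('HETATM', '5138', 'ZN', 'ZN')     - leading whitespace discarded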
I'm sorry if I was unclear with my advice regarding your previous question, How to delete lines that match elements from another file. I suggested that you should post another question because you brought up new issues, but I intended that you should work from where we had got to, whereas you seem to have discarded all of that and started again with your original code, including shelling out to rm in the mistaken belief that it would delete a line from a file.
Now that you have shown a full version of ions_solvents_cofactors I can see that my assumptions were correct, and the only other problem that you raised was that only lines starting HETATM should be removed from the PDB file, which you didn't say in your question
This is very similar to my previous solution, but I have added a check for HETATM data. I have also improved the log output so that it says which value from ions_solvents_cofactors was matched to cause the deletion
Please try this new code, and report if you find any problems
use strict;
use warnings 'all';
use File::Glob ':bsd_glob';
use Tie::File;
my %matches = do {
open my $fh, '<', 'ions_solvents_cofactors';
local $/;
map { $_ => 1 } split ' ', <$fh>;
};
for my $pdb ( glob '*.txt' ) {
tie my @file, 'Tie::File', $pdb or die $!;
for ( my $i = 0; $i < @file; ) {
my ($id, undef, undef, $col4) = split ' ', $file[$i];
if ( $id eq 'HETATM' and $col4 and $matches{$col4} ) {
printf qq{Removing line %d from "%s" (matches %s)\n},
$i+1, $pdb, $col4;
splice @file, $i, 1;
}
else {
++$i;
}
}
}
output
Removing line 6 from "test.txt" (matches ZN)

Calculating distances in PDB file

With reference to the question Calculating the distance between atomic coordinates, where the input is
ATOM 920 CA GLN A 203 39.292 -13.354 17.416 1.00 55.76 C
ATOM 929 CA HIS A 204 38.546 -15.963 14.792 1.00 29.53 C
ATOM 939 CA ASN A 205 39.443 -17.018 11.206 1.00 54.49 C
ATOM 947 CA GLU A 206 41.454 -13.901 10.155 1.00 26.32 C
ATOM 956 CA VAL A 207 43.664 -14.041 13.279 1.00 40.65 C
.
.
.
ATOM 963 CA GLU A 208 45.403 -17.443 13.188 1.00 40.25 C
there is an answer reported as
use strict;
use warnings;
my @line;
while (<>) {
push @line, $_; # add line to buffer
next if @line < 2; # skip unless buffer is full
print proc(@line), "\n"; # process and print
shift @line; # remove used line
}
sub proc {
my @a = split ' ', shift; # line 1
my @b = split ' ', shift; # line 2
my $x = ($a[6]-$b[6]); # calculate the diffs
my $y = ($a[7]-$b[7]);
my $z = ($a[8]-$b[8]);
my $dist = sprintf "%.1f", # format the number
sqrt($x**2+$y**2+$z**2); # do the calculation
return "$a[3]-$b[3]\t$dist"; # return the string for printing
}
The output of the above code is the distance from the first CA to the second one, from the second to the third, and so on...
How do I modify this code to find the distance from the first CA to the rest of the CAs (2, 3, ...), from the second CA to the rest of the CAs (3, 4, ...), and so on, printing only those which are less than 5 Angstrom?
I found that the push @line, $_; statement should be altered to increase the array size, but I am not clear how to do that.
To get the pairs, read the file into an array, @data_array. Then loop over the entries.
Update: Added file opening and loading of @data_array.
open my $fh, '<', 'atom_file.pdb' or die $!;
my @data_array = <$fh>;
close $fh or die $!;
for my $i (0 .. $#data_array) {
for my $j ($i+1 .. $#data_array) {
process(@data_array[$i,$j]);
}
}
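process is not defined in this answer; one possible sketch, reusing the proc logic from the question but printing only pairs that are less than 5 Angstrom apart, as asked:

sub process {
    my @a = split ' ', shift;    # first CA line
    my @b = split ' ', shift;    # second CA line
    my $dist = sqrt( ($a[6]-$b[6])**2 + ($a[7]-$b[7])**2 + ($a[8]-$b[8])**2 );
    printf "%s %s - %s %s\t%.1f\n", $a[3], $a[5], $b[3], $b[5], $dist if $dist < 5;
}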
Maybe try this:
use strict;
use warnings;
my @alllines = ();
while(<DATA>) { push(@alllines, $_); }
#Each Current line
for(my $i=0; $i<=$#alllines+1; $i++)
{
#Each Next line
for(my $j=$i+1; $j<=$#alllines; $j++)
{
if($alllines[$i])
{
#Split the line into tab delimits
my ($line1_tb_1,$line1_tb_2,$line1_tb_3) = split /\t/, $alllines[$i];
print "Main_Line: $line1_tb_1\t$line1_tb_2\t$line1_tb_3";
if($alllines[$j])
{
#Split the line into tab delimits
my ($line_nxt_tb1,$line_nxt_tb2,$line_nxt_tb3) = split /\t/, $alllines[$j];
print "Next_Line: $line_nxt_tb1\t$line_nxt_tb2\t$line_nxt_tb3";
#Do it your coding/regex here
}
}
#system 'pause'; Testing Purpose!!!
}
}
__DATA__
tab1 123 456
tab2 789 012
tab3 345 678
tab4 901 234
tab5 567 890
I hope this will help you.

How can I get a differently named text file in each subfolder using Perl?

I have nine subfolders in a main folder. Each subfolder includes a .pdb file, which has 10 columns. I want to get a new .log file for each subfolder, and each new .log file must be in its own folder. I am trying to create a differently named .log file in each subfolder (like 1.log, 2.log, 3.log, ...), but I get two .log files in each folder. How can I get a differently named .log file in each subfolder?
.pdb file:
ATOM 1 O LIG A 1 -4.657 -0.947 -1.014 1.00 0.00 O
ATOM 2 N LIG A 1 -0.173 0.679 0.052 1.00 0.00 N1+
ATOM 3 N LIG A 1 3.135 -0.678 -0.977 1.00 0.00 N
ATOM 4 N LIG A 1 3.331 0.341 1.198 1.00 0.00 N
ATOM 5 N LIG A 1 1.046 -0.695 -2.103 1.00 0.00 N
ATOM 6 C LIG A 1 -1.086 -0.167 0.546 1.00 0.00 C
ATOM 7 C LIG A 1 -2.430 0.177 0.537 1.00 0.00 C
ATOM 8 C LIG A 1 -3.476 -0.737 1.080 1.00 0.00 C
ATOM 9 C LIG A 1 1.209 0.327 0.061 1.00 0.00 C
ATOM 10 C LIG A 1 -2.803 1.409 0.008 1.00 0.00 C
.log file:
O -4.657 -0.947 -1.014
N -0.173 0.679 0.052
N 3.135 -0.678 -0.977
N 3.331 0.341 1.198
N 1.046 -0.695 -2.103
C -1.086 -0.167 0.546
C -2.430 0.177 0.537
C -3.476 -0.737 1.080
C 1.209 0.327 0.061
C -2.803 1.409 0.008
my code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
my $search_text = qr/ATOM/;
my @fullpath= <*/*.pdb>;
foreach my $file (@fullpath) {
print $file . "\n";
my $dir = dirname($file);
for (my $i=1; $i < 10; $i++) {
open(DATA, $file);
open(out_file, ">", "$dir/$i.log") or die "Failed to open $dir/$i.log: $!";
}
while (my $line = <DATA>) {
my @fields = split /\s+/, $line;
if ($line =~ m/$search_text/) {
print out_file join("\t", @fields[2,6,7,8]), "\n";
}
}
}
close(out_file);
The main problem is that you have an inner for loop that opens the input file nine times ($i is 1 .. 9 because of the < 10 condition) and opens nine different output files, and this is happening for every .pdb file found.
You need a file-scoped counter to ensure that you name each log file differently.
You may also find it useful to consider these points
It's best to restrict the import list of external modules as far as possible. By default, File::Basename exports fileparse, fileparse_set_fstype, basename, and dirname. You want only one of these, so you should write
use File::Basename qw/ dirname /;
I have mentioned this loop already
for (my $i=1; $i < 10; $i++) { ... }
but in addition you should remember that most accomplished Perl programmers use for instead of foreach to reduce noise (they are identical in every way except their spelling). And if all you want is a range of numbers then you should use the range operator instead, like this
for my $i (1 .. 9) { ... }
The cases where a C-style for loop is needed are very rare.
You shouldn't use the DATA file handle as it has a built-in purpose. In fact you should use lexical file handles everywhere, like this
open my $out_file, '>', "$dir/$i.log"
But well done for choosing the three-parameter form of open
If your processing is brief and straightforward then it is best to leave the <> operator to perform its default behaviour of putting each line into $_. That makes chomp, split, print, regex matches, and others much more concise and therefore more readable.
I believe this program does what you need.
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw/ dirname /;
my $search_text = qr/ATOM/;
my $infile_number;
while (my $infile = glob '*/*.pdb') {
++$infile_number;
my $dir = dirname($infile);
my $outfile = "$dir/$infile_number.log";
print "$infile => $outfile\n";
open my $in_fh, '<', $infile or die qq{Failed to open "$infile" for reading: $!};
open my $out_fh, '>', $outfile or die qq{Failed to open "$outfile" for writing: $!};
while (<$in_fh>) {
next unless /$search_text/;
my @fields = split;
print $out_fh join("\t", @fields[2,6,7,8]), "\n";
}
}
Update
"The columns in log file aren't ordered like columns in .pdb files. The point/dot of decimal numbers in .log files isn't top and bottom like in .pdb files. There is this problem when minus and plus decimal numbers are top and bottom."
I think you mean that the decimal points aren't aligned in the same column. This variant of my original program will fix that. But are you sure you want to use tab characters between the columns? Tabs are usually used so that computers can separate the columns correctly, and aren't much use in documents that are meant for people to read.
use strict;
use warnings;
use File::Basename qw/ dirname /;
use Scalar::Util qw/ looks_like_number /;
my $search_text = qr/ATOM/;
my $infile_number;
while (my $infile = glob '*/*.pdb') {
++$infile_number;
my $dir = dirname($infile);
my $outfile = "$dir/$infile_number.log";
print "$infile => $outfile\n";
open my $in_fh, '<', $infile or die qq{Failed to open "$infile" for reading: $!};
open my $out_fh, '>', $outfile or die qq{Failed to open "$outfile" for writing: $!};
while (<$in_fh>) {
next unless /$search_text/;
my @fields = split;
for (@fields) {
$_ = sprintf '%6.3f', $_ if looks_like_number($_);
}
print $out_fh join("\t", @fields[2,6,7,8]), "\n";
}
}
output
O -4.657 -0.947 -1.014
N -0.173 0.679 0.052
N 3.135 -0.678 -0.977
N 3.331 0.341 1.198
N 1.046 -0.695 -2.103
C -1.086 -0.167 0.546
C -2.430 0.177 0.537
C -3.476 -0.737 1.080
C 1.209 0.327 0.061
C -2.803 1.409 0.008

Distance between one point to all other in a PDB file

I have a PDB file. It has two parts separated by TER; I call the part before TER part 1. I want to take the x, y, z of ATOM 1 of the first part (i.e. before TER) and find its distance to all x, y, z coordinates after TER, then do the same for the second ATOM of part one against all ATOMs of the second part, and repeat this for all ATOMs of the first part against all ATOMs of the second part. I have to automate it for 20 files; my file names are like 1_0.pdb, 2_0.pdb, ..., 20_0.pdb.
This is a distance calculation. I have tried something in Perl but it's very rough. Can someone help a bit?
The File looks like:
----long file (I truncated it)----
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
This is the code; in the end it finds the maximum distance and its coordinates:
my @points = ();
open(IN, $ARGV[0]) or die "$!";
while (my $line = <IN>) {
chomp($line);
my @array = (split (/\s+/, $line))[5, 6, 7];
print "@array\n";
push @points, [ @array ];
}
close(IN);
$max=0;
for my $i1 ( 0 .. $#points )
{
my ( $x1, $y1, $z1 ) = @{ $points[$i1] };
my $dist = sqrt( ($x1+1.925)**2 + ($y1+11.270)**2 + ($z1-1.404)**2 );
print "distance from (-1.925 -11.270 1.404) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max )
{ $max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
Not sure I clearly understand what you want, but how about:
#!/usr/local/bin/perl
use strict;
use warnings;
my (@refer, @points);
my $part = 0;
while (my $line = <DATA>) {
chomp($line);
if ($line =~ /^TER/) {
$part++;
next;
}
my @array = (split (/\s+/, $line))[5, 6, 7];
if ($part == 0) {
push @refer, [ @array ];
} else {
push @points, [ @array ];
}
}
my %max = (val=>0, x=>0, y=>0, z=>0);
foreach my $ref(@refer) {
my ($x1, $y1, $z1) = @{$ref};
foreach my $atom(@points) {
my ($x, $y, $z) = @{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
if ($dist > $max{val}) {
$max{val} = $dist;
$max{x} = $x;
$max{y} = $y;
$max{z} = $z;
}
}
}
print "max is $max{val}; coord: x=$max{x}, y=$max{y}, z=$max{z}\n";
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
output:
max is 35.9813670807545; coord: x=30.175, y=4.389, z=5.768
The main issue here is reading the data. First, note that one cannot use split with PDB text files since the fields are defined by position and not by separators. See Coordinate File Description (PDB Format).
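As an illustration of the fixed-column point (using the standard layout from the linked format description, where x, y and z occupy columns 31-38, 39-46 and 47-54; this question's file uses its own fixed columns, but the same idea applies): once a coordinate fills its whole field it can touch its neighbour, and a whitespace split then merges the two values while a positional read does not.

my $line = "ATOM   1279  C   ALA    81     -99.925-111.270   1.404";
my @fields = split ' ', $line;                         # '-99.925-111.270' comes back as one field
my @xyz    = map { substr($line, $_, 8) } 30, 38, 46;  # (' -99.925', '-111.270', '   1.404')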
To separate the ATOM record of different polymer chains you can start with a simplified version like
my $iblock = 0;
my @atoms = ();
while (my $line = <IN>) {
chomp($line);
# Switch blocks at TER lines
if ($line =~ /^TER/) {
$iblock++;
# Read ATOM lines
} elsif ($line =~ m/^ATOM/) {
my @xyz = (substr($line,7-1,9),substr($line,16-1,9),substr($line,25-1,9));
printf "Block %d: atom at (%s)\n",$iblock,join (",",@xyz);
push @{$atoms[$iblock]},\@xyz;
# Parse additional line types (if needed)
} else {
...
}
}
Followed by a loop over all pairs of coordinates from different blocks, structured as follows:
# 1st block
for my $iblock1 (0..$#atoms) {
# 2nd block
for my $iblock2 ($iblock1+1..$#atoms) {
# Compare all pairs of atoms
...
for my $xyz1 (@{$atoms[$iblock1]}) {
for my $xyz2 (@{$atoms[$iblock2]}) {
# Calculate distance and compare with $max_dist
...
}
}
# Print the maximal distance between these two blocks
...
}
}
Of course, the code could be more general if a more elaborate data structure is used or by applying one of the available PDB parsers, such as Bioperl's.
With proper encapsulation, this is pretty simple, and requires minor modifications of your code.
ETA: Added fixed width solution I had on hand. It would probably be best to read all the fields instead of discarding the first 31 chars, and then return them all in a hash reference. That way, you could process all the lines with the same subroutine, and simply switch between parts when the first field turns out to be TER. It should be easy for you to extrapolate this from the given code.
You'll note that the reference values are read in with a loop, because we need to break the loop at the break point. The rest of the values are slurped up with a map statement. Then we simply feed the data to the subroutine we made from your initial code (with some improvements). I used the same names for the lexical variables to make it easier to read the code.
use strict;
use warnings;
my @points;
while (<DATA>) {
last if /^TER$/;
push @points, getpoints($_);
}
my @ref = map getpoints($_), <DATA>;
for my $p (@points) {
getcoords($p, \@ref);
}
sub getpoints {
my $line = shift;
my @data = unpack "A31 A8 A8 A8", $line;
shift @data;
return \@data;
}
sub getcoords {
my ($p, $ref) = @_;
my ($p1,$p2,$p3) = @$p;
my $max=0;
my ($x,$y,$z);
for my $aref ( @$ref ) {
my ( $x1, $y1, $z1 ) = @$aref;
my $dist = sqrt(
($x1-$p1)**2 +
($y1-$p2)**2 +
($z1-$p3)**2
);
print "distance from ($p1 $p2 $p3) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max ) {
$max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}
}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
}
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
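A possible sketch of the hash-reference variant described in the ETA note above (the sub name, the hash keys, and the reuse of the "A31 A8 A8 A8" template for the leading fields are assumptions made for illustration):

sub getrecord {
    my $line = shift;
    my ( $head, $x, $y, $z ) = unpack "A31 A8 A8 A8", $line;
    my ( $type, $serial, $name, $res, $resnum ) = split ' ', $head;
    return { type => $type, name => $name, res => $res, xyz => [ $x, $y, $z ] };
}

# Reading loop: every line goes through the same subroutine,
# and the TER record switches from the first part to the second.
my ( @part1, @part2 );
my $seen_ter = 0;
while (<DATA>) {
    my $rec = getrecord($_);
    if ( $rec->{type} eq 'TER' ) { $seen_ter = 1; next; }
    push @{ $seen_ter ? \@part2 : \@part1 }, $rec;
}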