How to print the same row data multiple times from a PDB file in Perl - perl

I am new to Perl. I am trying to write a program that reads PDB files from a directory (I have 3000 files) and saves the output to another directory (another folder).
Code:
open( filehandler, "Document1.txt" ) or die $!; #Input file
my @file1 = <filehandler>;
my $OutputDir = 'C:\test_result_file';
foreach my $line (@file1) {
    chomp $line;
    open( fh, "$line" ) or die $!;
    open( out, ">$OutputDir/$line.pdb" ) or die $!;
    while ( $file = <fh> ) {
        if ( $file =~ /^ATOM.{9}(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)/ ) {
            $hash{$1}{$2}++;
        }
        foreach $key ( sort { $hash{$1} <=> $hash{$2} or $1 cmp $2 } keys %hash ) {
            print out $key;
        }
    }
    print "Completed", "\n";
}
For example, an input file:
ATOM 1752 CG TYR A 248 89.088 39.843 51.944 1.00 32.03 C
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1757 CZ TYR A 248 87.708 39.285 49.599 1.00 35.16 C
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
Expected output:
A chain:
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1752 CG TYR A 248 89.088 39.843 51.944 1.00 32.03 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1757 CZ TYR A 248 87.708 39.285 49.599 1.00 35.16 C
B chain:
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
The chain IDs may range from A to H. The rule can be seen in the expected output above: the first four rows are printed as a block, then the next block repeats starting from the second row and adds the next new row, and so on (a sliding window of four rows).
I am unable to write code that solves this problem; can anyone please help?

I'm afraid I have to say your code is rather confusing, but what I gather from the data samples and the further explanation is:
There should be a window of four lines moving through the input lines.
The window contents (all four lines) should be printed as you advance to each new line.
The window should be reset each time a new sequence is encountered.
Each line is a series of space delimited fields and the fifth field identifies the sequence.
The window contents can be stored in a simple Perl array (see @window in the snippet below). You simply append data to it with push and remove the first line with shift as you move to the next line. When the sequence changes, print the current window and reset it. In the sample code below I assume that sequences do not intermix. If this is not the case, you need to read all the input beforehand and sort it as necessary.
use strict;
use warnings;

my $win_len = 4;
my @window = ();
my $prev_chain = "";

while (<>) {
    my ($atom_name, $chain) = (split)[2, 4];
    next unless $atom_name =~ /\b(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)\b/;
    if ($chain eq $prev_chain) {
        if (@window == $win_len) {
            print_window();
            shift @window;
        }
        push @window, $_;
    } else {
        print_window() if @window;
        @window = ($_);
        $prev_chain = $chain;
    }
}
print_window() if @window;

sub print_window {
    print foreach @window;
    print "\n";
}
Demo: https://ideone.com/oOB8bF
The script reads data from STDIN and prints the result to STDOUT for the sake of simplicity. Your code sample suggests that you store the list of files to process in Document1.txt and that the actual input is read from those files. In that case you need an extra loop:
use strict;
use warnings;

my $OutputDir = 'C:/test_result_file';

open my $dir, "Document1.txt" or die "Failed to open Document1.txt: $!";
chomp(my @files = <$dir>);

foreach my $file (@files) {
    my $win_len = 4;
    my @window = ();
    my $prev_chain = "";
    open my $input, $file or die "failed to open $file: $!\n";
    open my $output, '>', "$OutputDir/$file.pdb" or die "failed to open $OutputDir/$file.pdb: $!\n";
    while (<$input>) {
        my ($atom_name, $chain) = (split)[2, 4];
        next unless $atom_name =~ /\b(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)\b/;
        if ($chain eq $prev_chain) {
            if (@window == $win_len) {
                print_window($output, @window);
                shift @window;
            }
            push @window, $_;
        } else {
            print_window($output, @window) if @window;
            @window = ($_);
            $prev_chain = $chain;
        }
    }
    print_window($output, @window) if @window;
}

sub print_window {
    my $fh = shift;
    print $fh $_ foreach @_;
    print $fh "\n";
}

Related

Why is the output the way it is? - Splitting and chop

I'm having trouble understanding the output of the code below.
1. Why is the output "J o", "A l", "C h" and "S a"? Doesn't chop remove the last character of a string and return that character, so shouldn't the output be "n", "i", "n" and "y"?
2. What is the purpose of the $firstline=0; line in the code?
3. What exactly is happening at the lines
foreach (@data) {
    ($name,$age) = split(//,$_);
    print "$name $age \n";
}
The output of the following code is
Data in file is:
J o
A l
C h
S a
The file contents are:
NAME AGE
John 26
Ali 21
Chen 22
Sally 25
The code:
#!/usr/bin/perl
my ($firstline,
    @data,
    $data);
open (INFILE,"heading.txt") or die $.;
while (<INFILE>)
{
    if ($firstline)
    {
        $firstline=0;
    }
    else
    {
        chop(@data=<INFILE>);
    }
    print "Data in file is: \n";
    foreach (@data)
    {
        ($name,$age)=split(//,$_);
        print "$name $age\n";
    }
}
There are a few issues with this script, but first I will answer your points:
chop will remove the last character of a string and return the character chopped. In your data file "heading.txt" every line probably ends with \n, so chop is removing that \n. It is always recommended to use chomp instead.
You can verify what is the last character of the line by running the command below:
od -bc heading.txt
0000000 116 101 115 105 040 101 107 105 012 112 157 150 156 040 062 066
N A M E A G E \n J o h n 2 6
0000020 012 101 154 151 040 062 061 012 103 150 145 156 040 062 062 012
\n A l i 2 1 \n C h e n 2 2 \n
0000040 123 141 154 154 171 040 062 065 012
S a l l y 2 5 \n
0000051
You can see the \n at the end of every line.
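As a quick illustration of the difference between chop and chomp (a minimal sketch, not taken from the question):
use strict;
use warnings;

my $with_newline    = "John 26\n";
my $without_newline = "John 26";

chop $with_newline;        # now "John 26"  - the trailing newline was removed
chop $without_newline;     # now "John 2"   - the "6" was removed: silent data loss

my $line1 = "John 26\n";
my $line2 = "John 26";
chomp $line1;              # now "John 26"  - newline removed
chomp $line2;              # still "John 26" - chomp only strips a trailing newline

print "$with_newline|$without_newline|$line1|$line2\n";   # John 26|John 2|John 26|John 26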
There is no use of $firstline because it is never set to 1, so you can remove the if/else block.
The first line reads the elements of the array @data one by one. The second line splits the content of each element into individual characters, captures the first two, assigns them to the $name and $age variables, and discards the rest. The last line prints those captured characters.
IMO, in line 2 we should split on a space to actually capture the name and the age.
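A minimal sketch (not from the original code) showing the difference between the two kinds of split:
use strict;
use warnings;

my $line = "John 26";

# split(//, ...) splits between every character, so the first two "fields"
# are just the first two characters of the line
my ($name_chars, $age_chars) = split(//, $line);
print "$name_chars $age_chars\n";    # prints "J o"

# split(' ', ...) splits on runs of whitespace, giving the real fields
my ($name, $age) = split(' ', $line);
print "$name $age\n";                # prints "John 26"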
So the final script should look like this:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
open (INFILE,"heading.txt") or die "Can't open heading.txt: $!";
while (<INFILE>) {
    chomp(@data = <INFILE>);
}
close(INFILE);

print "Data in file is: \n";
foreach (@data) {
    my ($name,$age) = split(/ /,$_);
    print "$name $age\n";
}
Output:
Data in file is:
John 26
Ali 21
Chen 22
Sally 25

Print and use the data from two files simultaneously

pdb1.pdb
ATOM 709 CA THR 25 -29.789 33.001 72.164 1.00 0.00
ATOM 711 CB THR 25 -29.013 31.703 72.370 1.00 0.00
ATOM 734 CG THR 25 -29.838 30.458 72.573 1.00 0.00
ATOM 768 CE THR 25 -28.541 28.330 71.361 1.00 0.00
pdb2.pdb
ATOM 765 N ALA 25 -30.838 33.150 73.195 1.00 0.00
ATOM 764 N LEU 26 -29.457 33.193 69.767 1.00 0.00
ATOM 783 N VAL 27 -30.286 31.938 66.438 1.00 0.00
ATOM 798 N GLY 28 -28.076 30.044 64.519 1.00 0.00
output desired
709 CA 765 N 1.477 -29.789 33.001 72.164 -30.838 33.150 73.195
709 CA 764 N 2.427 -29.789 33.001 72.164 -29.457 33.193 69.767
709 CA 783 N 5.844 -29.789 33.001 72.164 -30.286 31.938 66.438
and so on.
The task is to read the values in columns 2, 3, 6, 7 and 8 from pdb1.pdb and pdb2.pdb, and then use columns 6, 7 and 8 to do distance calculations.
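For reference, the value in the first output row is the 3D Euclidean distance between the two coordinate triples. A minimal sketch of that arithmetic (added for illustration, using the first line of each file):
use strict;
use warnings;

# atom 709 CA from pdb1.pdb and atom 765 N from pdb2.pdb
my ($x1, $y1, $z1) = (-29.789, 33.001, 72.164);
my ($x2, $y2, $z2) = (-30.838, 33.150, 73.195);

my $dist = sqrt( ($x1 - $x2)**2 + ($y1 - $y2)**2 + ($z1 - $z2)**2 );
printf "%.3f\n", $dist;    # prints 1.478, matching the 1.477/1.478 in the outputs up to rounding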
I tried this, but the output is not getting printed.
Perl
open( f1, "pdb1.pdb" or die $! );
open( f2, "pdb2.pdb" or die $! );
while ( ( $line1 = <$f1> ) and ( $line2 = <$f2> ) ) {
#splitted = split( ' ', $line1 );
my #fields = split / /, $line1;
print $fields[1], "\n";
my $atom1 = #{ [ $line1 =~ m/\S+/g ] }[2];
my $no1 = #{ [ $line1 =~ m/\w+/g ] }[3];
my $x1 = #{ [ $line1 =~ m/\w+/g ] }[6];
my $y1 = #{ [ $line1 =~ m/\w+/g ] }[7];
my $z1 = #{ [ $line1 =~ m/\w+/g ] }[8];
my $atom2 = #{ [ $line2 =~ m/\w+/g ] }[2];
my $no2 = #{ [ $line2 =~ m/\w+/g ] }[3];
my $x2 = #{ [ $line2 =~ m/\w+/g ] }[6];
my $y2 = #{ [ $line2 =~ m/\w+/g ] }[7];
my $z2 = #{ [ $line2 =~ m/\w+/g ] }[8];
print $atom1;
for ( $f1, $f2 ) {
print $atom1 $no1 $x1 $y1 $z1 $atom2 $no2 $x2 $y2 $z2 "\n";
}
}
close( $f1 );
close( $f2 );
It's probably simplest to read both files into memory unless they're enormous.
This solution calls the subroutine read_file to build an array of hashes holding the five fields of interest from each file. It then calculates the delta and reformats the data for output.
use strict;
use warnings 'all';

my $f1 = read_file('file1.txt');
my $f2 = read_file('file2.txt');

for my $r1 ( @$f1 ) {
    for my $r2 ( @$f2 ) {
        my ($dx, $dy, $dz) = map { $r1->{$_} - $r2->{$_} } qw/ x y z /;
        my $delta = sqrt( $dx * $dx + $dy * $dy + $dz * $dz );
        my @rec = (
            @{$r1}{qw/ id name /},
            @{$r2}{qw/ id name /},
            sprintf('%5.3f', $delta),
            @{$r1}{qw/ x y z /},
            @{$r2}{qw/ x y z /},
        );
        print "@rec\n";
    }
}

sub read_file {
    my ($file_name) = @_;
    open my $fh, '<', $file_name or die qq{Unable to open "$file_name" for input: $!};
    my @records;
    while ( <$fh> ) {
        next unless /\S/;
        my %record;
        @record{qw/ id name x y z /} = (split)[1,2,5,6,7];
        push @records, \%record;
    }
    \@records;
}
output
709 CA 765 N 1.478 -29.789 33.001 72.164 -30.838 33.150 73.195
709 CA 764 N 2.427 -29.789 33.001 72.164 -29.457 33.193 69.767
709 CA 783 N 5.845 -29.789 33.001 72.164 -30.286 31.938 66.438
709 CA 798 N 8.374 -29.789 33.001 72.164 -28.076 30.044 64.519
711 CB 765 N 2.471 -29.013 31.703 72.370 -30.838 33.150 73.195
711 CB 764 N 3.032 -29.013 31.703 72.370 -29.457 33.193 69.767
711 CB 783 N 6.072 -29.013 31.703 72.370 -30.286 31.938 66.438
711 CB 798 N 8.079 -29.013 31.703 72.370 -28.076 30.044 64.519
734 CG 765 N 2.938 -29.838 30.458 72.573 -30.838 33.150 73.195
734 CG 764 N 3.937 -29.838 30.458 72.573 -29.457 33.193 69.767
734 CG 783 N 6.327 -29.838 30.458 72.573 -30.286 31.938 66.438
734 CG 798 N 8.255 -29.838 30.458 72.573 -28.076 30.044 64.519
768 CE 765 N 5.646 -28.541 28.330 71.361 -30.838 33.150 73.195
768 CE 764 N 5.199 -28.541 28.330 71.361 -29.457 33.193 69.767
768 CE 783 N 6.348 -28.541 28.330 71.361 -30.286 31.938 66.438
768 CE 798 N 7.069 -28.541 28.330 71.361 -28.076 30.044 64.519
Your code has a lot of syntax errors. I have made some changes to your code, and this should get you started toward what you want.
First of all, use strict and use warnings; this way a lot of the noise would already have been removed.
use strict;
use warnings;

open(my $f1, "pdb1.pdb") or die $!;
open(my $f2, "pdb2.pdb") or die $!;

while(defined(my $line1 = <$f1>) and defined(my $line2 = <$f2>))
{
    # print "Iam here";
    my @splitted = split(' ',$line1);
    my @fields = split / /, $line1;
    #print $fields[1], "\n";
    my $atom1 = @{[$line1 =~ m/\S+/g]}[2];
    my $no1 = @{[$line1 =~ m/\w+/g]}[3];
    my $x1 = @{[$line1 =~ m/\w+/g]}[6];
    my $y1 = @{[$line1 =~ m/\w+/g]}[7];
    my $z1 = @{[$line1 =~ m/\w+/g]}[8];
    my $atom2 = @{[$line2 =~ m/\w+/g]}[2];
    my $no2 = @{[$line2 =~ m/\w+/g]}[3];
    my $x2 = @{[$line2 =~ m/\w+/g]}[6];
    my $y2 = @{[$line2 =~ m/\w+/g]}[7];
    my $z2 = @{[$line2 =~ m/\w+/g]}[8];
    #print $atom1;
    for ($f1, $f2) {
        print "$atom1 $no1 $x1 $y1 $z1 $atom2 $no2 $x2 $y2 $z2 \n";
    }
}
close ($f1);
close ($f2);
Now coming to your question: your expected output seems to be different from what your logic does. You are looping over the two files simultaneously, which gives a one-to-one iteration rather than pairing each line from file1 with all the lines in file2. So I think you need to look at the looping part.
And the next thing you need to know about is column splitting.
my @splitted = split(' ', $line1);
If you split a line in the way shown above, you get all of the columns in the array. So column 1 is at index zero, column 2 at index one, and so on.
So to get the first column you should do
my $col1 = $splitted[0];
If you are using those regexes just for getting columns, then they are not needed, since you are already splitting the line and have each column independently in the array.
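For instance, a minimal sketch (using a sample line from pdb1.pdb) of pulling out all the needed columns with a single split instead of repeated regex matches:
use strict;
use warnings;

my $line1 = "ATOM 709 CA THR 25 -29.789 33.001 72.164 1.00 0.00";

# one split gives every whitespace-separated column at once
my @cols = split ' ', $line1;

my ($no1, $atom1)  = @cols[1, 2];      # 709, CA
my ($x1, $y1, $z1) = @cols[5, 6, 7];   # -29.789, 33.001, 72.164

print "$no1 $atom1 $x1 $y1 $z1\n";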
Update:
The problem you were hitting was that you were iterating over the filehandles directly, which was causing the issue.
use strict;
use warnings;

open(my $f1, "<pdb1.pdb") or die "$!" ;
open(my $f2, "<pdb2.pdb") or die "$!" ;

my @in1 = <$f1>;
my @in2 = <$f2>;

foreach my $file1 (@in1) { # use array to iterate
    chomp($file1);
    #print "File1 $file1\n";
    my $atomno1   = (split " ", $file1)[1];
    my $atomname1 = (split " ", $file1)[2];
    my $xx = (split " ", $file1)[5];
    my $yy = (split " ", $file1)[6];
    foreach my $file2 (@in2) {
        chomp($file2);
        #print "File2 $file2\n";
        my $atomno2   = (split " ", $file2)[1];
        my $atomname2 = (split " ", $file2)[2];
        my $x = (split " ", $file2)[5];
        my $y = (split " ", $file2)[6];
        my $dis = sqrt((($x-$xx)*($x-$xx)) + (($y-$yy)*($y-$yy)));
        print "$atomno1 $atomname1 $atomno2 $atomname2 $dis $xx $yy $x $y\n" ;
    }
    #$file1++;
}
close ($f1);
close ($f1);

find text sequences and create new files with replacement text [closed]

I'm trying to find a way to write a script that does the following:
Open and detect the first use of a three-letter sequence that is repeated in the input file
Edit and permute this three letter sequence 19 times, giving 19 outputs each with a different three letter code that corresponds to a list of 19 possible three letter codes
Essentially, this is a fairly straightforward find and replace problem that I know how to do. The problem is that I then need to loop this so that, after creating the 19 files from the previous line, the next line with a different three letter code has the same replacement done to it.
I'm struggling to find a way to have the script recognize sequences of text when it can be one of twenty different things.
Let me know if anyone has any ideas on how I could go about doing this, I'll provide any clarification if necessary too!
Here is an example of an input file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
Where an output would look like this:
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
On the first pass, the SER should be changed to a series of twenty different text sequences, the first being ALA. The issue I'm having is that I'm not sure how to write a script that will change more than one line of text.
My current script can form the 19 mutations of the first SER, but that's where it will stop. It won't mutate the next one, and it won't mutate a different three letter code, for example it wouldn't change the GLU. Is there any easy way to integrate this functionality?
Currently, the way I've approached this is to do a simple text transformation using sed, but as this seems more complicated than what sed can bring to the table, I think perl is likely the way to go. I can add the sed code, but I didn't think it would be of much help.
Your question and comments aren't entirely clear, but I believe this script will do what you want. It parses a PDB file until it reaches the amino acid of interest. A set of 19 files is produced in which this AA is substituted by each of the other 19 AAs. From there onwards, every time an AA differs from the AA in the previous line, another set of 19 files is generated.
#!/usr/bin/perl
use warnings;
use strict;

# we're going to start mutating when we find this residue.
my $target = 'GLU';
my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );
my $prev = '';
my $line_no = 0;
my @lines;
my %changes;

# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file
# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
    # split the line into columns (assuming it is tab-delimited;
    # switch this for "\s+" if it is separated with whitespace.
    my @cols = split "\t";
    if ($target && $cols[3] eq $target) {
        # Found our target residue! unset $target so that the following
        # set of tests are performed
        undef $target;
    }
    # see if this AA is the same as the AA in the previous line
    if (! $target && $prev ne $cols[3]) {
        # if it isn't, store the line number and the amino acid
        $changes{ $line_no } = $cols[3];
        # update $prev to reflect the new AA
        $prev = $cols[3];
    }
    # store all the lines
    push @lines, $_;
    # increment the line number
    $line_no++;
}

# now, for each of the changes, create substitute files
for (keys %changes) {
    create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}

sub create_substitutes {
    # arguments: line no, $res: residue, $aas: array of amino acids,
    # $all_lines: all lines in the file
    my ($line_no, $res, $aas, $all_lines) = @_;
    # this is the target line that we want to substitute
    my @target = split "\t", $all_lines->[$line_no];

    # for each AA in the list of AAs, create a new file called 'XXX-##.txt',
    # where XXX is the amino acid and ## is the line number where the
    # substituted residue is.
    for (@$aas) {
        next if $_ eq $res;
        open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
        # print out all lines up to the changed line
        print { $fh } @{$all_lines}[0..$line_no-1];
        # print out the changed line, substituting in the AA
        print { $fh } join "\t", @target[0..2], $_, @target[4..$#target];
        # print out the rest of the lines.
        print { $fh } @{$all_lines}[$line_no+1 .. $#{$all_lines}];
    }
}
__DATA__
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
ATOM 18 CA ARG A 4 36.131 -10.951 -6.967 1.00 56.64 C
This example data will produce a set of files for the first GLU found (line 6), then another set for line 15 (PRO residue), and another set for line 17 (ARG residue).
Example of ALA-6.txt file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N ALA A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
(etc.)
If this isn't the correct behaviour, you'll have to edit your question as it isn't very clear!
Because your question isn't very clear (more precisely, it is totally unclear), I created the following:
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;

my $residues_file = "input2.txt"; # residue names, one per line
my $molfile = "m1.pdb";           # molecule file

# read the residues
my (@residues) = path($residues_file)->lines({chomp => 1});

my $m = Bio::PDB::Structure::Molecule->new;
for my $res (@residues) {          # for each residue name from the file "input2.txt"
    $m->read("m1.pdb");            # read the molecule
    my $atom = $m->atom(0);        # get the 1st atom
    $atom->residue_name($res);     # change the residue to the one from the file
    # create the output filename
    my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
    # write the result
    $m->print($outfile);
}
For example, if input2.txt contains
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
then, from your input, it generates 20 files in which the residue of the 1st atom is changed (according to your output example), like:
==> m1_ala.pdb <==
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_arg.pdb <==
ATOM 1 N ARG A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asn.pdb <==
ATOM 1 N ASN A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asp.pdb <==
ATOM 1 N ASP A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_cys.pdb <==
ATOM 1 N CYS A 2 37.396 -5.247 -4.830 1.00 65.06
... etc, 20 times...

How to print the values and value's original line in perl?

My file is like this:
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
values 290 MR1 1 1.000000 0.000000
values 290 MR2 1 0.000000 1.000000
values 290 MR3 1 0.000000 0.000000
values 290 MR1 2 -1.000000 0.000000
values 290 MR2 2 0.000000 -1.000000
values 290 MR3 2 0.000000 0.000000
values 290 MR1 3 -1.000000 0.000000
SEE FOR THE AUTHOR PROVIDED AND/OR PROGRAM GENERATED ASSEMBLY INFORMATION.
THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
BURIED SURFACE AREA.
350 COMPLETE MULTIMER REPRESENTING THE KNOWN
350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
350 GENERATED BY APPLYING BIOMT TRANSFORMATIONS
350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
350 OPERATIONS ARE GIVEN.
350
350 BIOMOLECULE: 1
350 AUTHOR DETERMINED BIOLOGICAL UNIT
VALUES 944 CA SER A 124 19.929 15.508 41.001 1.00 27.16 C
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C
VALUES 946 O SER A 124 18.305 16.949 42.074 1.00 29.52 O
VALUES 947 CB SER A 124 20.209 16.197 39.656 1.00 27.72 C
VALUES 948 OG SER A 124 19.168 16.143 38.688 1.00 29.83 O
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C
My script is below:
use warnings;
use strict;

print "Enter the filename >> ";
chomp(my $s = <>);
die "error openng file" unless (open('i',"$s"));
my @a = <i>;
my @grep  = grep{s/^VALUES.*\w{3}\s\w//g} @a;
my @grep2 = grep{s/^values.*MR\d\s//g} @a;
my @x1;
my @y1;
my $y;
my $x;
foreach (@grep)
{
    $x = (split)[1],$_;
    $y = (split)[2],$_;
    push (@x1,$x);
    push (@y1,$y);
}
my @x2;
my @y2;
foreach (@grep2)
{
    $x = (split)[1],$_;
    $y = (split)[2],$_;
    push (@x2,$x);
    push (@y2,$y);
}
my @x;
my @y;
my @tot;
my $i; my $j;
for ($i=0 ; $i<@x1 ; $i++)
{
    for ($j=0 ; $j<@x2 ; $j++)
    {
        my $m = $x1[$i] - $x2[$j];
        my $v = $m/2;
        push (@x , $v);
    }
}
for ($i=0 ; $i<@y1 ; $i++)
{
    for ($j=0 ; $j<@y2 ; $j++)
    {
        my $m = $y1[$i] - $y2[$j];
        my $v = $m/2;
        push (@y,$v);
    }
}
for ($i=0 ; $i< scalar @x ; $i++)
{
    my $total = $x[$i] + $y[$i];
    print "$total\n";
    push (@tot,$total);
}
#Below script i get confused
for(@grep)
{
    my @mk = @tot <='17';
    print "$_ \tWHICH ANSWER IS >> @mk\n";
}
A mathematical function is applied to the 'values' and 'VALUES' lines. I am confused about how to print the totals that are less than '17' together with the 'VALUES' lines they come from. How do I do it?
The output I expect is:
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
And how do I avoid the 'Useless use of a variable in void context' warning on some lines?
The following line and others like it are giving you the Useless use of a variable in void context message:
$x = (split)[1],$_;
Your trailing ,$_ is meaningless. You want:
$x = (split)[1];
And if you want to be clearer still about your intent, I'd combine the two lines assigning $x and $y:
(undef, $x, $y) = split;
You have got yourself a little tied up here. Your main problem (and what took me so long to work out what it is you were aiming for) is that you create elements of @x and @y for every combination of @grep and @grep2 instead of just pairing them up one for one.
I take that back. On reflection, the biggest problem in understanding and fixing your code is your dreadful choice of variable names! I don't know what to call the data labelled VALUES and values, so I have just used the arrays @VALUES and @values, but you should rename them appropriately.
I have come up with this program, which does what I think you want. It produces only three output records, which is far smaller than your example required output, but I think that output corresponds to a bigger input file? You should show the output you expected for the example input, otherwise we have no way of testing our solutions.
I hope this helps
use strict;
use warnings;
use autodie;

print "Enter the filename: ";
chomp(my $filename = <>);

open my $in_fh, '<', $filename;

my (@VALUES, @values);

while (<$in_fh>) {
    chomp;
    if ( /^values/ ) {
        push @values, [ $_, (split)[4,5] ];
    }
    elsif ( /^VALUES/ ) {
        push @VALUES, [ $_, (split)[6,7] ];
    }
}

for my $i (0 .. $#VALUES) {
    my $total;
    for my $j (1, 2) {
        $total += ( $VALUES[$i][$j] - $values[$i][$j] ) / 2;
    }
    if ($total <= 17.0) {
        printf "%s WHICH ANSWER IS >> %s\n", $VALUES[$i][0], $total;
    }
}
output
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187

Deleting lines with sed or awk

I have a file data.txt like this.
>1BN5.txt
207
208
211
>1B24.txt
88
92
I have a folder F1 that contains text files.
1BN5.txt file in F1 folder is shown below.
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C
1B24.txt file in F1 folder is shown below.
ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C
I need only the lines containing 207, 208, 211 (6th column) in the 1BN5.txt file. I want to delete the other lines in 1BN5.txt. Likewise, I need only the lines containing 88, 92 in the 1B24.txt file.
Desired output
1BN5.txt file
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
1B24.txt file
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Here's one way using GNU awk. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
    file = substr($1,2)
    next
}
{
    a[file][$1]
}
END {
    for (i in a) {
        while ( ( getline line < ("./F1/" i) ) > 0 ) {
            split(line,b)
            for (j in a[i]) {
                if (b[6]==j) {
                    print line > "./F1/" i ".new"
                }
            }
        }
        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
If you have an older version of awk, older than GNU Awk 4.0.0, you could try the following. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
    file = substr($1,2)
    next
}
{
    a[file]=( a[file] ? a[file] SUBSEP : "") $1
}
END {
    for (i in a) {
        split(a[i],b,SUBSEP)
        while ( ( getline line < ("./F1/" i) ) > 0 ) {
            split(line,c)
            for (j in b) {
                if (c[6]==b[j]) {
                    print line > "./F1/" i ".new"
                }
            }
        }
        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
Please note that this script does exactly as you describe. It expects files like 1BN5.txt and 1B24.txt to reside in the folder F1 in the present working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system() call. HTH.
Results:
Contents of F1/1BN5.txt:
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
Contents of F1/1B24.txt:
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:
cat 1bn5.txt | awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' > output.txt
assuming gnu awk, run this command from the directory containing data.txt:
awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt
this parses data.txt for filenames and search terms, then calls grep from inside awk to append the matches from each file and term listed in data.txt to a new file in F1 called originalfilename.txt_results.
if you want to replace the original files completely, you could then run this command:
grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;
This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1
mv F1 backup &&
mkdir F1 &&
awk '
NR==FNR {
    if (sub(/>/,"")) {
        file=$0
        ARGV[ARGC++] = "backup/" file
    }
    else {
        tgt[file,$0] = "F1/" file
    }
    next
}
(FILENAME,$6) in tgt {
    print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup
If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.." at the end (do that during testing anyway).
EDIT: FYI, this is one case where you could argue that getline is not completely incorrect, since it's parsing a first file that's totally unlike the rest of the files in structure and intent, so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:
mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
    while ( (getline line < data) > 0 ) {
        if (sub(/>/,"",line)) {
            file=line
            ARGV[ARGC++] = "backup/" file
        }
        else {
            tgt[file,line] = "F1/" file
        }
    }
}
(FILENAME,$6) in tgt {
    print > tgt[FILENAME,$6]
}
' &&
rm -rf backup
but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).
This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.
awk '
BEGIN {RS=">"}
FNR == 1 {
    # since the first char in data.txt is the record separator,
    # there is an empty record before the real data starts
    next
}
{
    n = split($0, a, "\n")
    file = "F1/" a[1]
    newfile = file ".new"
    RS="\n"
    while (getline < file) {
        for (i=2; i<n; i++) {
            if ($6 == a[i]) {
                print > newfile
                break
            }
        }
    }
    RS=">"
    system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
}
' data.txt
Definitely a job for awk:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Redirect to save the output:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt
I don't think you can do this with just sed alone. You need a loop to read your file data.txt. For example, using a bash script:
#!/bin/bash
# First remove all possible "problematic" characters from data.txt, storing result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading >, and ..
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt

# Next determine which lines to keep:
cat data.clean.txt | while read line; do
    if [[ "${line:0:1}" == ">" ]]; then
        # If input starts with ">", set remainder to be the current file
        file="${line:1}"
    else
        # If value is in sixth column, add "keep" to end of line
        # Columns assumed separated by one or more spaces
        # "+" is a GNU extension, so we need the -r switch
        sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" $file
    fi
done

# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
    if [[ "${line:0:1}" == ">" ]]; then
        sed -i -n "/keep/{s/keep//g;p;}" ${line:1}
    fi
done