find text sequences and create new files with replacement text [closed] - perl

Closed 8 years ago.
I'm trying to find a way to write a script that does the following:
Open and detect the first use of a three-letter sequence that is repeated in the input file
Edit and permute this three letter sequence 19 times, giving 19 outputs each with a different three letter code that corresponds to a list of 19 possible three letter codes
Essentially, this is a fairly straightforward find and replace problem that I know how to do. The problem is that I then need to loop this so that, after creating the 19 files from the previous line, the next line with a different three letter code has the same replacement done to it.
I'm struggling to find a way to have the script recognize sequences of text when it can be one of twenty different things.
Let me know if anyone has any ideas on how I could go about doing this, I'll provide any clarification if necessary too!
Here is an example of an input file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
Where an output would look like this:
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
On the first pass, the SER should be changed to a series of twenty different text sequences, the first being ALA. The issue I'm having is that I'm not sure how to write a script that will change more than one line of text.
My current script can form the 19 mutations of the first SER, but that's where it will stop. It won't mutate the next one, and it won't mutate a different three letter code, for example it wouldn't change the GLU. Is there any easy way to integrate this functionality?
Currently, the way I've approached this is to do a simple text transformation using sed, but as this seems more complicated than what sed can bring to the table, I think perl is likely the way to go. I can add the sed code, but I didn't think it would be of much help.

Your question and comments aren't entirely clear, but I believe this script will do what you want. It parses a PDB file until it reaches the amino acid of interest. A set of 19 files are produced where this AA is substituted by the other 19 AAs. From there onwards, every time an AA differs from the AA in the previous line, another set of 19 files will be generated.
#!/usr/bin/perl
use warnings;
use strict;
# we're going to start mutating when we find this residue.
my $target = 'GLU';
my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );
my $prev = '';
my $line_no = 0;
my @lines;
my %changes;
# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file
# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
# split the line into columns (assuming it is tab-delimited;
# switch this for /\s+/ if it is separated with whitespace).
my @cols = split "\t";
if ($target && $cols[3] eq $target) {
# Found our target residue! unset $target so that the following
# set of tests are performed
undef $target;
}
# see if this AA is the same as the AA in the previous line
if (! $target && $prev ne $cols[3]) {
# if it isn't, store the line number and the amino acid
$changes{ $line_no } = $cols[3];
# update $prev to reflect the new AA
$prev = $cols[3];
}
# store all the lines
push @lines, $_;
# increment the line number
$line_no++;
}
# now, for each of the changes, create substitute files
for (keys %changes) {
create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}
sub create_substitutes {
# arguments: line no, $res: residue, $aas: array of amino acids,
# $all_lines: all lines in the file
my ($line_no, $res, $aas, $all_lines) = @_;
# this is the target line that we want to substitute
my @target = split "\t", $all_lines->[$line_no];
# for each AA in the list of AAs, create a new file called 'XXX-##.txt',
# where XXX is the amino acid and ## is the line number where the
# substituted residue is.
for (@$aas) {
next if $_ eq $res;
open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
# print out all lines up to the changed line
print { $fh } @$all_lines[0..$line_no-1];
# print out the changed line, substituting in the AA
print { $fh } join "\t", @target[0..2], $_, @target[4..$#target];
# print out the rest of the lines.
print { $fh } @$all_lines[$line_no+1 .. $#{$all_lines}];
}
}
__DATA__
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
ATOM 18 CA ARG A 4 36.131 -10.951 -6.967 1.00 56.64 C
This example data will produce a set of files for the first GLU found (line 6), then another set for line 15 (PRO residue), and another set for line 17 (ARG residue).
Example of ALA-6.txt file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N ALA A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
(etc.)
If this isn't the correct behaviour, you'll have to edit your question as it isn't very clear!
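For readers who prefer Python, the same approach can be sketched as below. This is only an illustration under stated assumptions: the helper names (residue_starts, mutate_line, variants) are made up for this sketch, columns are assumed to be whitespace-separated with the residue name in the fourth column, and the changed line is re-joined with single spaces, so fixed-width PDB spacing would not be preserved.

```python
# Same order as the amino acid list in the Perl script.
AMINO_ACIDS = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]


def residue_starts(lines, target):
    """Return (line index, residue) for the first line of each residue run,
    beginning at the first line whose residue equals `target`."""
    starts = []
    seen_target = False
    prev = None
    for i, line in enumerate(lines):
        res = line.split()[3]  # assumption: residue name is the 4th column
        if not seen_target and res == target:
            seen_target = True
        if seen_target and res != prev:
            starts.append((i, res))
            prev = res
    return starts


def mutate_line(line, new_res):
    """Replace the 4th whitespace-separated column with `new_res`."""
    cols = line.split()
    cols[3] = new_res
    return " ".join(cols)


def variants(lines, target):
    """Yield (filename, lines) pairs, one per substitution, named like the
    Perl script's output files (XXX-<line index>.txt)."""
    for idx, res in residue_starts(lines, target):
        for aa in AMINO_ACIDS:
            if aa == res:
                continue  # skip the residue that is already there
            mutated = lines[:idx] + [mutate_line(lines[idx], aa)] + lines[idx + 1:]
            yield f"{aa}-{idx}.txt", mutated
```

Each yielded pair could then be written out with an ordinary open(name, "w") call. Keeping the file writing separate from the generation makes the logic easy to check on a handful of lines first.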

Because your question isn't very clear (more precisely, it is totally unclear), I wrote the following:
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;
my $residues_file = "input2.txt"; #residue names, one per line
my $molfile = "m1.pdb"; #molecule file
#read the residues
my(@residues) = path($residues_file)->lines({chomp => 1});
my $m= Bio::PDB::Structure::Molecule->new;
for my $res (@residues) { #for each residue name from the file "input2.txt"
$m->read($molfile); #read the molecule
my $atom = $m->atom(0); #get the 1st atom
$atom->residue_name($res); #change the residue to the one from the file
#create output filename
my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
#write the result
$m->print($outfile);
}
For example, if input2.txt contains
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
then, from your input, it generates 20 files where the residue in the 1st atom is changed (according to your output example), like:
==> m1_ala.pdb <==
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_arg.pdb <==
ATOM 1 N ARG A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asn.pdb <==
ATOM 1 N ASN A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asp.pdb <==
ATOM 1 N ASP A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_cys.pdb <==
ATOM 1 N CYS A 2 37.396 -5.247 -4.830 1.00 65.06
... etc, 20 times...

Related

How to print same row data in multiple time from pdb file in perl

I am new to Perl. I am trying to write a program that reads PDB files from a directory (I have 3000 files) and saves the output in another directory (another folder).
Code:
open( filehandler, "Document1.txt" ) or die $!; #Input file
my @file1 = <filehandler>;
my $OutputDir = 'C:\test_result_file';
foreach my $line (@file1) {
chomp $line;
open( fh, "$line" ) or die $!;
open( out, ">$OutputDir/$line.pdb" ) or die $!;
while ( $file = <fh> ) {
if ( $file =~ /^ATOM.{9}(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)/ ) {
$hash{$1}{$2}++;
}
foreach $key ( sort { $hash{$1} <=> $hash{$2} or $1 cmp $2 } keys %hash ) {
print out $key;
}
}
print "Completed", "\n";
}
for example input file:
ATOM 1752 CG TYR A 248 89.088 39.843 51.944 1.00 32.03 C
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1757 CZ TYR A 248 87.708 39.285 49.599 1.00 35.16 C
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
Expected output:
A chain:
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1752 CG TYR A 248 89.088 39.843 51.944 1.00 32.03 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1753 CD1 TYR A 248 89.759 39.356 50.810 1.00 37.15 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1754 CD2 TYR A 248 87.727 40.049 51.864 1.00 32.81 C
ATOM 1755 CE1 TYR A 248 89.078 39.081 49.646 1.00 36.00 C
ATOM 1756 CE2 TYR A 248 87.035 39.774 50.706 1.00 35.66 C
ATOM 1757 CZ TYR A 248 87.708 39.285 49.599 1.00 35.16 C
B chain:
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7395 O GLN B 331 37.728 73.730 36.607 1.00 31.73 O
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
ATOM 7396 CB GLN B 331 37.467 76.222 34.712 1.00 27.88 C
ATOM 7397 CG GLN B 331 36.515 76.825 33.693 1.00 32.42 C
ATOM 7398 CD GLN B 331 35.390 75.877 33.328 1.00 35.70 C
ATOM 7394 C GLN B 331 37.664 74.934 36.854 1.00 22.75 C
The chain ID may be A to H. The rule, as in the expected output above, is: the first four rows are unique; then the fifth output row repeats the second row, and one new row is added as the eighth row.
I am unable to write code to solve this problem. Can anyone please help?
I'm afraid I have to say your code is rather confusing but what I get from the data samples and further explanation is:
There should be a window of four lines moving through the input lines.
The window contents (all four lines) should be printed as you advance to each new line.
The window should be reset each time a new sequence is encountered.
Each line is a series of space delimited fields and the fifth field identifies the sequence.
The window contents can be stored in a simple Perl array (see @window in the snippet below). You simply append data to it with push and remove the first line with shift as you move to the next line. When the sequence changes, print the current window and reset it. In the sample code below I assume that sequences do not intermix. If this is not the case, you need to read all the input beforehand and sort it as necessary.
use strict;
use warnings;
my $win_len = 4;
my @window = ();
my $prev_chain = "";
while (<>) {
my ($atom_name, $chain) = (split)[2, 4];
next unless $atom_name =~ /\b(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)\b/;
if ($chain eq $prev_chain) {
if (@window == $win_len) {
print_window();
shift @window;
}
push @window, $_;
} else {
print_window() if @window;
@window = ($_);
$prev_chain = $chain;
}
}
print_window() if @window;
sub print_window {
print foreach @window;
print "\n";
}
Demo: https://ideone.com/oOB8bF
The script reads data from STDIN and prints result to STDOUT for the sake of simplicity. Your code sample suggests you store a list of files to process in the Document1.txt and the actual input is read from these files. In this case you need an extra loop:
use strict;
use warnings;
my $OutputDir = 'C:/test_result_file';
open my $dir, "Document1.txt" or die "Failed to open Document1.txt:$!";
chomp(my @files = <$dir>);
foreach my $file (@files) {
my $win_len = 4;
my @window = ();
my $prev_chain = "";
open my $input, $file or die "failed to open $file: $!\n";
open my $output, '>', "$OutputDir/$file.pdb" or die "failed to open $OutputDir/$file.pdb: $!\n";
while (<$input>) {
my ($atom_name, $chain) = (split)[2, 4];
next unless $atom_name =~ /\b(?:CG|CD1|CD2B|CE1|CE2|CZ|C|O|CB|CG|CD)\b/;
if ($chain eq $prev_chain) {
if (@window == $win_len) {
print_window($output, @window);
shift @window;
}
push @window, $_;
} else {
print_window($output, @window) if @window;
@window = ($_);
$prev_chain = $chain;
}
}
print_window($output, @window) if @window;
}
sub print_window {
my $fh = shift;
print $fh $_ foreach @_;
print $fh "\n";
}
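The moving-window idea described in this answer can also be sketched in Python. This is a rough illustration, not a port of the script: the function name windows is invented here, the chain ID is assumed to be the fifth whitespace-separated field, chains are assumed not to intermix, and the atom-name filter from the Perl code is left out to keep the sketch short.

```python
def windows(lines, size=4):
    """Yield lists of up to `size` consecutive lines that share a chain ID.

    A full window is emitted each time a new line of the same chain arrives,
    then slides forward by one line. The window is flushed and reset when
    the chain ID changes, and flushed once more at the end of the input.
    """
    window = []
    prev_chain = None
    for line in lines:
        chain = line.split()[4]  # assumption: chain ID is the 5th field
        if chain == prev_chain:
            if len(window) == size:
                yield list(window)   # emit the current full window
                window.pop(0)        # drop the oldest line
            window.append(line)
        else:
            if window:
                yield list(window)   # flush what is left of the old chain
            window = [line]
            prev_chain = chain
    if window:
        yield list(window)


if __name__ == "__main__":
    import sys
    for block in windows(sys.stdin.read().splitlines()):
        print("\n".join(block))
        print()
```

Run as a script, it reads from standard input and prints each window followed by a blank line, in the same spirit as print_window above.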

How to print the values and value's original line in perl?

My file looks like this:
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
values 290 MR1 1 1.000000 0.000000
values 290 MR2 1 0.000000 1.000000
values 290 MR3 1 0.000000 0.000000
values 290 MR1 2 -1.000000 0.000000
values 290 MR2 2 0.000000 -1.000000
values 290 MR3 2 0.000000 0.000000
values 290 MR1 3 -1.000000 0.000000
SEE FOR THE AUTHOR PROVIDED AND/OR PROGRAM GENERATED ASSEMBLY INFORMATION.
THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
BURIED SURFACE AREA.
350 COMPLETE MULTIMER REPRESENTING THE KNOWN
350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
350 GENERATED BY APPLYING BIOMT TRANSFORMATIONS
350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
350 OPERATIONS ARE GIVEN.
350
350 BIOMOLECULE: 1
350 AUTHOR DETERMINED BIOLOGICAL UNIT
VALUES 944 CA SER A 124 19.929 15.508 41.001 1.00 27.16 C
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C
VALUES 946 O SER A 124 18.305 16.949 42.074 1.00 29.52 O
VALUES 947 CB SER A 124 20.209 16.197 39.656 1.00 27.72 C
VALUES 948 OG SER A 124 19.168 16.143 38.688 1.00 29.83 O
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C
my script below
use warnings;
use strict;
print "Enter the filename >> ";
chomp(my $s = <>);
die "error openng file" unless (open('i',"$s"));
my @a=<i>;
my @grep = grep{s/^VALUES.*\w{3}\s\w//g} @a;
my @grep2 = grep{s/^values.*MR\d\s//g} @a;
my @x1;
my @y1;
my $y;
my $x;
foreach (@grep)
{
$x = (split)[1],$_;
$y = (split)[2],$_;
push (@x1,$x);
push (@y1,$y);
}
my @x2;
my @y2;
foreach (@grep2)
{
$x = (split)[1],$_;
$y = (split)[2],$_;
push (@x2,$x);
push (@y2,$y);
}
my @x;
my @y;
my @tot;
my $i; my $j;
for ($i=0 ; $i<@x1 ; $i++)
{
for ($j=0 ; $j<@x2 ; $j++)
{
my $m = $x1[$i] - $x2[$j];
my $v = $m/2;
push (@x , $v);
}
}
for ($i=0 ; $i<@y1 ; $i++)
{
for ($j=0 ; $j<@y2 ; $j++)
{
my $m = $y1[$i] - $y2[$j];
my $v = $m/2;
push (@y,$v);
}
}
for ($i=0 ; $i< scalar @x ; $i++)
{
my $total = $x[$i] + $y[$i];
print "$total\n";
push (@tot,$total);
}
#Below script i get confused
for(@grep)
{
my @mk = @tot <='17';
print "$_ \tWHICH ANSWER IS >> @mk\n";
}
A mathematical function is applied to the 'values' and 'VALUES' records. I am confused about how to print the results that are less than 17 together with the 'VALUES' lines they came from. How do I do it?
#I expect output is
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
And how do I avoid the 'Useless use of a variable in void context' warning on some lines?
The following line and others like it are giving you the Useless use of a variable in void context message:
$x = (split)[1],$_;
Your trailing ,$_ is meaningless. You want:
$x = (split)[1];
And if you want to be clearer still about your intent, I'd combine the two lines assigning $x and $y:
(undef, $x, $y) = split;
You have yourself a little tied up here. Your main problem (and what took me so long to work out what it is you were aiming for) is that you create elements for @x and @y for every combination of @grep and @grep2 instead of just pairing them up one for one.
I take that back. On reflection, the biggest obstacle to understanding and fixing your code is your dreadful choice of variable names! I don't know what to call the data labelled VALUES and values, so I have just used arrays @VALUES and @values, but you should rename them appropriately.
I have come up with this program which does what I think you want. It produces only three output records which is far smaller than your example required output, but I think that output corresponds to a bigger input file? You should show the output you expected for the example input, otherwise we have no way of testing our solutions
I hope this helps
use strict;
use warnings;
use autodie;
print "Enter the filename: ";
chomp(my $filename = <>);
open my $in_fh, '<', $filename;
my (@VALUES, @values);
while (<$in_fh>) {
chomp;
if ( /^values/ ) {
push @values, [ $_, (split)[4,5] ];
}
elsif ( /^VALUES/ ) {
push @VALUES, [ $_, (split)[6,7] ];
}
}
for my $i (0 .. $#VALUES) {
my $total;
for my $j (1, 2) {
$total += ( $VALUES[$i][$j] - $values[$i][$j] ) / 2;
}
if ($total <= 17.0) {
printf "%s WHICH ANSWER IS >> %s\n", $VALUES[$i][0], $total;
}
}
output
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
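The one-for-one pairing used in this answer can also be sketched in Python. This is only an illustration: the function name close_records and its limit parameter are invented for the sketch, and the field positions (7th and 8th field of a VALUES line, 5th and 6th of a values line) are taken from the Perl code above. Unlike the Perl loop, zip simply stops at the shorter of the two lists.

```python
def close_records(lines, limit=17.0):
    """Pair the i-th VALUES record with the i-th values record and return
    (original VALUES line, total) for every total that is <= limit."""
    upper = []  # (line, x, y) taken from VALUES records
    lower = []  # (x, y) taken from values records
    for line in lines:
        fields = line.split()
        if line.startswith("VALUES"):
            upper.append((line, float(fields[6]), float(fields[7])))
        elif line.startswith("values"):
            lower.append((float(fields[4]), float(fields[5])))

    results = []
    for (line, x1, y1), (x2, y2) in zip(upper, lower):
        # half the x difference plus half the y difference, as in the answer
        total = (x1 - x2) / 2 + (y1 - y2) / 2
        if total <= limit:
            results.append((line, total))
    return results
```

Keeping the original line next to each number is what makes it possible to print the source record alongside the result, which was the part the question got stuck on.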

Edit text columns

I have a text file (the first two lines are character spacings):
1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM 1 N1 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N1 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N1 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N1 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N1 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N1 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N1 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N1 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N1 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N1 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
I want to edit it and make it like:
1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
The number of spaces between each column are important and the list of atoms needs to go up to 190 (N001-N190). Thus I would like to replace characters 13-16 (" N1 ") in file 1 with ("N001") and keep the remainder of the file in the original spacing.
You don't need 10 long lines of sample input to demonstrate the problem or the solution:
$ cat file
ATOM 1 N1 SPINA 3
ATOM 2 N1 SPINA 3
ATOM 10 N1 SPINA 3
$ awk '{print substr($0,1,12) sprintf("N%03d",$2) substr($0,17)}' file
ATOM 1 N001 SPINA 3
ATOM 2 N002 SPINA 3
ATOM 10 N010 SPINA 3
I'm assuming we could use $2 as the numeric part of the 3rd field. It seems to increment sequentially with your line numbers. Using NR might be an alternative. If neither of those is actually what you want, post some more representative sample input/output.
Also, note that any solution that involves assigning to a field (e.g. $3=...) WILL cause awk to rebuild the record using the value of OFS as the field separator and so will change your spacing.
Oh, and if those 2 initial lines of character spacings are really present in your files, this is the tweak:
$ cat file
1 2
12345678901234567890123456
ATOM 1 N1 SPINA 3
ATOM 2 N1 SPINA 3
ATOM 10 N1 SPINA 3
$ awk 'NR>2{$0 = substr($0,1,12) sprintf("N%03d",$2) substr($0,17)} 1' file
1 2
12345678901234567890123456
ATOM 1 N001 SPINA 3
ATOM 2 N002 SPINA 3
ATOM 10 N010 SPINA 3
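The same fixed-column replacement can be sketched in Python by slicing the line instead of touching fields, which keeps the rest of the spacing intact. The function name rename_atom and its start/end parameters are hypothetical, and the sketch assumes, as the awk command does, that characters 13-16 hold the atom name and that the second field is the number to use. Lines that do not start with ATOM (such as the two ruler lines) are passed through unchanged.

```python
def rename_atom(line, start=12, end=16):
    """Replace characters start+1..end (1-based: 13-16 by default) with
    'N' plus the zero-padded atom serial taken from the second field."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ATOM":
        return line  # ruler lines or anything else: leave untouched
    serial = int(fields[1])
    return line[:start] + f"N{serial:03d}" + line[end:]


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        print(rename_atom(raw.rstrip("\n")))
```

Because only a four-character slice is swapped for another four characters, every column after it stays at its original position.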
Try :
$ awk '{$3=substr($3,1,1) sprintf("%03d",$2)}1' OFS=\\t file
Note : OFS will be tab
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
if you want to increment with line
$ awk '{$3=substr($3,1,1) sprintf("%03d",NR)}1' OFS=\\t file
Here is yet another way:
awk 'sub(/.$/,sprintf("%03d",NR),$3)' OFS='\t' file
Output:
$ awk 'sub(/.$/,sprintf("%03d",NR),$3)' OFS='\t' file
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
If you are interested in solving it with pure shell (bash), here is the code:
while IFS= read -r line
do
n=${line:9:3}
printf "%sN%03d%s\n" "${line:0:12}" $n "${line:16}"
done < file
awk '$3="N"sprintf("%03d",$2)' OFS='\t' infile.txt
Result
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00SN
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00SN
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00SN
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00SN
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00SN
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00SN
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00SN
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00SN
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00SN
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00SN

Field manipulation

I have some text files and I need to remove the first character from the fourth column only if the column has four characters
file1 as follows
ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA AMET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C AMET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N AARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2 as follows
ATOM 41 CA ATRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA BTRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB ASER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB BSER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG CHIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
Desired output
file1
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2
ATOM 41 CA TRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA TRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB SER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB SER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG HIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
This might work for you (GNU sed):
sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1 \3/' file
This replaces the first character of the fourth column with a space if that column has four non-space characters.
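The same substitution can be sketched in Python with the re module. The function name blank_fourth_column_prefix is made up for this illustration. As in the sed command, the first character of a four-character fourth column is replaced by a space so that the column positions do not shift.

```python
import re

# Three leading columns with their trailing whitespace, then one character
# to drop, then exactly three non-space characters followed by whitespace.
_FOURTH_COL = re.compile(r"^((?:\S+\s+){3})\S(\S{3}\s)")


def blank_fourth_column_prefix(line):
    """Blank out the first character of the 4th column when that column
    has exactly four non-space characters; otherwise return line as is."""
    return _FOURTH_COL.sub(r"\1 \2", line, count=1)
```

A line whose fourth column has three characters does not match the pattern at all, so it is returned unchanged.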
Use the length() function to find the length of the column and the substr() function to print the substring you need:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
Piping to column -t rebuilds a nice table format. To store the changes in a new file, use the redirection operator:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t > new_file
With sed you could do:
$ sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
To store the changes back to the original file you can use the -i option:
$ sed -ri 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file

match a pattern and print subsequent lines

There are 200 files named File1_0.pdb, File1_60.pdb, etc. Each looks like:
ATOM 1 N VAL 1 8.897 -21.545 -7.276 1.00 0.00
ATOM 2 H1 VAL 1 9.692 -22.015 -6.868 1.00 0.00
ATOM 3 H2 VAL 1 9.228 -20.766 -7.827 1.00 0.00
ATOM 4 H3 VAL 1 8.289 -22.236 -7.693 1.00 0.00
TER
ATOM 5 CA VAL 1 8.124 -20.953 -6.203 1.00 0.00
ATOM 6 HA VAL 1 8.072 -19.874 -6.345 1.00 0.00
ATOM 7 CB VAL 1 6.693 -21.515 -6.176 1.00 0.00
ATOM 8 HB VAL 1 6.522 -22.024 -5.227 1.00 0.00
ATOM 9 CG1 VAL 1 5.684 -20.370 -6.330 1.00 0.00
ATOM 10 1HG1 VAL 1 5.854 -19.861 -7.279 1.00 0.00
I have to extract the part after TER and put it in a different file. This has to be done for all 200 files. I did something like sed '1,/TER/d' File1_0.pdb > 1_0.pdb, but this only works for one file at a time. Is there a solution for all 200 files in one go? The output file has the same name, only with "File" removed.
for i in *.pdb; do sed '1,/TER/d' "$i" > "${i/File/}"; done
This might work:
seq 0 200 | xargs -I{} cp File1_{}.pdb 1_{}.pdb # copy files under the new names
sed -si '1,/TER/d' 1_{0..200}.pdb # edit files separately inline
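For comparison, the batch job can be sketched in Python with pathlib. The function names lines_after_ter and process_directory are invented for this sketch. It assumes the TER record starts its line, as in the sample, and that input files match File*.pdb. Like the sed command, it writes nothing but an empty file when no TER line is found.

```python
from pathlib import Path


def lines_after_ter(lines):
    """Return the lines that follow the first TER record (empty if none)."""
    for i, line in enumerate(lines):
        if line.startswith("TER"):  # assumption: TER begins the line
            return lines[i + 1:]
    return []


def process_directory(directory="."):
    """For each File*.pdb, write the part after TER to a file whose name
    is the same with the first 'File' removed."""
    for src in sorted(Path(directory).glob("File*.pdb")):
        dst = src.with_name(src.name.replace("File", "", 1))
        kept = lines_after_ter(src.read_text().splitlines(keepends=True))
        dst.write_text("".join(kept))


if __name__ == "__main__":
    process_directory(".")
```

The originals are left in place, so the step can be rerun safely if the output needs to be regenerated.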