Field manipulation - sed

I have some text files and I need to remove the first character from the fourth column, but only if that column has four characters.
file1 is as follows:
ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA AMET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C AMET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N AARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2 as follows
ATOM 41 CA ATRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA BTRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB ASER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB BSER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG CHIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
Desired output
file1
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2
ATOM 41 CA TRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA TRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB SER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB SER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG HIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C

This might work for you (GNU sed):
sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1 \3/' file
This replaces the first character of the fourth column with a space if that column has four non-space characters.

Use the length() function to find the length of the column and the substr() function to print the substring you need:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
Piping to column -t rebuilds a nice table format. To store the changes in a new file, use the redirection operator:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t > new_file
With sed you could do:
$ sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
To store the changes back to the original file you can use the -i option:
$ sed -ri 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
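For readers who would rather do this in Python, here is a minimal sketch of the same idea as the sed version that deletes the character. The function name strip_altloc is made up for illustration, and the code assumes whitespace-separated columns as in the samples above.

```python
import re

# Three whitespace-delimited columns, then a fourth column of exactly four characters.
FOUR_CHAR_COL4 = re.compile(r'^((?:\S+\s+){3})\S(\S{3})(?=\s|$)')

def strip_altloc(line: str) -> str:
    """Drop the first character of column 4 when that column has exactly four characters."""
    return FOUR_CHAR_COL4.sub(r'\1\2', line, count=1)

samples = [
    "ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N",
    "ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O",
]
for sample in samples:
    print(strip_altloc(sample))
```

Like the sed command that uses \1\3, this shortens the line by one character instead of padding it with a space, so fixed-width alignment is not preserved.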

find text sequences and create new files with replacement text [closed]

I'm trying to find a way to write a script that does the following:
Open and detect the first use of a three-letter sequence that is repeated in the input file
Edit and permute this three letter sequence 19 times, giving 19 outputs each with a different three letter code that corresponds to a list of 19 possible three letter codes
Essentially, this is a fairly straightforward find and replace problem that I know how to do. The problem is that I then need to loop this so that, after creating the 19 files from the previous line, the next line with a different three letter code has the same replacement done to it.
I'm struggling to find a way to have the script recognize sequences of text when it can be one of twenty different things.
Let me know if anyone has any ideas on how I could go about doing this, I'll provide any clarification if necessary too!
Here is an example of an input file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
Where an output would look like this:
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
On the first pass, the SER should be changed to a series of twenty different text sequences, the first being ALA. The issue I'm having is that I'm not sure how to write a script that will change more than one line of text.
My current script can form the 19 mutations of the first SER, but that's where it will stop. It won't mutate the next one, and it won't mutate a different three letter code, for example it wouldn't change the GLU. Is there any easy way to integrate this functionality?
Currently, the way I've approached this is to do a simple text transformation using sed, but as this seems more complicated than what sed can bring to the table, I think perl is likely the way to go. I can add the sed code, but I didn't think it would be of much help.
Your question and comments aren't entirely clear, but I believe this script will do what you want. It parses a PDB file until it reaches the amino acid of interest. A set of 19 files is produced in which this AA is substituted by each of the other 19 AAs. From there onwards, every time an AA differs from the AA in the previous line, another set of 19 files is generated.
#!/usr/bin/perl
use warnings;
use strict;
# we're going to start mutating when we find this residue.
my $target = 'GLU';
my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );
my $prev = '';
my $line_no = 0;
my @lines;
my %changes;
# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file
# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
    # split the line into columns (assuming it is tab-delimited;
    # switch this for "\s+" if it is separated with whitespace).
    my @cols = split "\t";
    if ($target && $cols[3] eq $target) {
        # Found our target residue! unset $target so that the following
        # set of tests are performed
        undef $target;
    }
    # see if this AA is the same as the AA in the previous line
    if (! $target && $prev ne $cols[3]) {
        # if it isn't, store the line number and the amino acid
        $changes{ $line_no } = $cols[3];
        # update $prev to reflect the new AA
        $prev = $cols[3];
    }
    # store all the lines
    push @lines, $_;
    # increment the line number
    $line_no++;
}
# now, for each of the changes, create substitute files
for (keys %changes) {
    create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}
sub create_substitutes {
    # arguments: line no, $res: residue, $aas: array of amino acids,
    # $all_lines: all lines in the file
    my ($line_no, $res, $aas, $all_lines) = @_;
    # this is the target line that we want to substitute
    my @target = split "\t", $all_lines->[$line_no];
    # for each AA in the list of AAs, create a new file called 'XXX-##.txt',
    # where XXX is the amino acid and ## is the line number where the
    # substituted residue is.
    for (@$aas) {
        next if $_ eq $res;
        open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
        # print out all lines up to the changed line
        print { $fh } @$all_lines[0..$line_no-1];
        # print out the changed line, substituting in the AA
        print { $fh } join "\t", @target[0..2], $_, @target[4..$#target];
        # print out the rest of the lines.
        print { $fh } @$all_lines[$line_no+1 .. $#{$all_lines}];
    }
}
__DATA__
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
ATOM 18 CA ARG A 4 36.131 -10.951 -6.967 1.00 56.64 C
This example data will produce a set of files for the first GLU found (line 6, counting from 0), then another set for line 15 (the PRO residue), and another set for line 17 (the ARG residue).
Example of ALA-6.txt file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N ALA A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
(etc.)
If this isn't the correct behaviour, you'll have to edit your question as it isn't very clear!
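The approach described in the Perl answer (find each line where the residue changes, then produce one copy of the file per substitute residue) can also be sketched in Python. This is only an illustration under assumptions: the names residue_starts and variants are hypothetical, columns are taken to be whitespace-separated, the changed line is rejoined with single spaces (so fixed-width PDB alignment is not preserved), and unlike the Perl script there is no $target to delay the start.

```python
from typing import Dict, List

AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN", "GLY", "HIS", "ILE",
               "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]

def residue_starts(lines: List[str]) -> List[int]:
    """Return 0-based indices of lines whose column 4 differs from the previous line's."""
    starts = []
    prev = None
    for i, line in enumerate(lines):
        cols = line.split()
        if len(cols) < 4:
            continue  # not an ATOM-style record
        if cols[3] != prev:
            starts.append(i)
            prev = cols[3]
    return starts

def variants(lines: List[str], index: int) -> Dict[str, List[str]]:
    """Map each of the 19 other residues to a copy of lines with only lines[index] changed."""
    cols = lines[index].split()
    original = cols[3]
    result = {}
    for aa in AMINO_ACIDS:
        if aa == original:
            continue
        changed = cols[:3] + [aa] + cols[4:]
        copy = list(lines)
        copy[index] = " ".join(changed)
        result[aa] = copy
    return result

demo = [
    "ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N",
    "ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C",
    "ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N",
]
for start in residue_starts(demo):
    for aa, new_lines in variants(demo, start).items():
        print(f"{aa}-{start}: {new_lines[start]}")
```

Writing each variant to a file named like XXX-##.txt, as the Perl script does, is left out here so the sketch has no side effects.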
Because your question isn't very clear (more precisely, it is totally unclear), I created the following:
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;
my $residues_file = "input2.txt"; # residue names, one per line
my $molfile = "m1.pdb"; # molecule file
# read the residues
my (@residues) = path($residues_file)->lines({chomp => 1});
my $m = Bio::PDB::Structure::Molecule->new;
for my $res (@residues) { # for each residue name from the file "input2.txt"
    $m->read($molfile); # read the molecule
    my $atom = $m->atom(0); # get the 1st atom
    $atom->residue_name($res); # change the residue to the one from the file
    # create output filename
    my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
    # write the result
    $m->print($outfile);
}
For example, if input2.txt contains
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
then, from your input, it generates 20 files in which the residue of the 1st atom is changed (according to your output example), like this:
==> m1_ala.pdb <==
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_arg.pdb <==
ATOM 1 N ARG A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asn.pdb <==
ATOM 1 N ASN A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asp.pdb <==
ATOM 1 N ASP A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_cys.pdb <==
ATOM 1 N CYS A 2 37.396 -5.247 -4.830 1.00 65.06
... etc, 20 times...

How to print the values and value's original line in perl?

My file like this
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*CP*TP*TP*TP*TP*CP*TP*TP*TP*TP*AP*AP*AP*AP*AP*GP*TP*GP*GP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
(*UP*CP*AP*GP*CP*CP*AP*CP*UP*UP*UP*UP*UP*AP*AP*AP*AP*GP*AP
values 290 MR1 1 1.000000 0.000000
values 290 MR2 1 0.000000 1.000000
values 290 MR3 1 0.000000 0.000000
values 290 MR1 2 -1.000000 0.000000
values 290 MR2 2 0.000000 -1.000000
values 290 MR3 2 0.000000 0.000000
values 290 MR1 3 -1.000000 0.000000
SEE FOR THE AUTHOR PROVIDED AND/OR PROGRAM GENERATED ASSEMBLY INFORMATION.
THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
BURIED SURFACE AREA.
350 COMPLETE MULTIMER REPRESENTING THE KNOWN
350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
350 GENERATED BY APPLYING BIOMT TRANSFORMATIONS
350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
350 OPERATIONS ARE GIVEN.
350
350 BIOMOLECULE: 1
350 AUTHOR DETERMINED BIOLOGICAL UNIT
VALUES 944 CA SER A 124 19.929 15.508 41.001 1.00 27.16 C
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C
VALUES 946 O SER A 124 18.305 16.949 42.074 1.00 29.52 O
VALUES 947 CB SER A 124 20.209 16.197 39.656 1.00 27.72 C
VALUES 948 OG SER A 124 19.168 16.143 38.688 1.00 29.83 O
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C
my script below
use warnings;
use strict;
print "Enter the filename >> ";
chomp(my $s = <>);
die "error openng file" unless (open('i',"$s"));
my @a=<i>;
my @grep = grep{s/^VALUES.*\w{3}\s\w//g} @a;
my @grep2 = grep{s/^values.*MR\d\s//g} @a;
my @x1;
my @y1;
my $y;
my $x;
foreach (@grep)
{
    $x = (split)[1],$_;
    $y = (split)[2],$_;
    push (@x1,$x);
    push (@y1,$y);
}
my @x2;
my @y2;
foreach (@grep2)
{
    $x = (split)[1],$_;
    $y = (split)[2],$_;
    push (@x2,$x);
    push (@y2,$y);
}
my @x;
my @y;
my @tot;
my $i; my $j;
for ($i=0 ; $i<@x1 ; $i++)
{
    for ($j=0 ; $j<@x2 ; $j++)
    {
        my $m = $x1[$i] - $x2[$j];
        my $v = $m/2;
        push (@x , $v);
    }
}
for ($i=0 ; $i<@y1 ; $i++)
{
    for ($j=0 ; $j<@y2 ; $j++)
    {
        my $m = $y1[$i] - $y2[$j];
        my $v = $m/2;
        push (@y,$v);
    }
}
for ($i=0 ; $i< scalar @x ; $i++)
{
    my $total = $x[$i] + $y[$i];
    print "$total\n";
    push (@tot,$total);
}
#Below script i get confused
for(@grep)
{
    my @mk = @tot <='17';
    print "$_ \tWHICH ANSWER IS >> @mk\n";
}
A mathematical function is applied to the 'values' and 'VALUES' records. I am confused about how to print the results that are less than 17 together with the 'VALUES' line each one came from. How do I do it?
The output I expect is:
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 15.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.756
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 15.687
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
And how do I avoid the 'Useless use of a variable in void context' warning on some lines?
The following line and others like it are giving you the Useless use of a variable in void context message:
$x = (split)[1],$_;
Your trailing ,$_ is meaningless. You want:
$x = (split)[1];
And if you want to be clearer still about your intent, I'd combine the two lines assigning $x and $y:
(undef, $x, $y) = split;
You have yourself a little tied up here. Your main problem (and what took me so long to work out what you were aiming for) is that you create elements for @x and @y for every combination of @grep and @grep2 instead of just pairing them up one for one.
I take that back. On reflection, the biggest obstacle to understanding and fixing your code is your dreadful choice of variable names! I don't know what to call the data labelled VALUES and values, so I have just used arrays @VALUES and @values, but you should rename them appropriately.
I have come up with this program, which does what I think you want. It produces only three output records, far fewer than your example required output, but I think that output corresponds to a bigger input file? You should show the output you expect for the example input, otherwise we have no way of testing our solutions.
I hope this helps
use strict;
use warnings;
use autodie;
print "Enter the filename: ";
chomp(my $filename = <>);
open my $in_fh, '<', $filename;
my (@VALUES, @values);
while (<$in_fh>) {
    chomp;
    if ( /^values/ ) {
        push @values, [ $_, (split)[4,5] ];
    }
    elsif ( /^VALUES/ ) {
        push @VALUES, [ $_, (split)[6,7] ];
    }
}
for my $i (0 .. $#VALUES) {
    my $total;
    for my $j (1, 2) {
        $total += ( $VALUES[$i][$j] - $values[$i][$j] ) / 2;
    }
    if ($total <= 17.0) {
        printf "%s WHICH ANSWER IS >> %s\n", $VALUES[$i][0], $total;
    }
}
output
VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C WHICH ANSWER IS >> 16.6965
VALUES 949 N LYS A 125 17.556 14.956 41.380 1.00 26.42 N WHICH ANSWER IS >> 16.256
VALUES 950 CA LYS A 125 16.202 15.172 41.869 1.00 26.36 C WHICH ANSWER IS >> 16.187
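The one-for-one pairing used in the program above can be sketched in Python as well. This is a hedged illustration, not a replacement: pair_totals is a hypothetical name, the field positions (6 and 7 for VALUES records, 4 and 5 for values records) are copied from the Perl program, and the limit of 17 comes from the question.

```python
from typing import List, Tuple

def pair_totals(lines: List[str], limit: float = 17.0) -> List[Tuple[str, float]]:
    """Pair the n-th VALUES record with the n-th values record and keep totals <= limit."""
    upper = []  # (original line, x, y) for each VALUES record
    lower = []  # (x, y) for each values record
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        if line.startswith("VALUES"):
            upper.append((line, float(fields[6]), float(fields[7])))
        elif line.startswith("values"):
            lower.append((float(fields[4]), float(fields[5])))
    results = []
    # zip stops at the shorter list, so unpaired records are ignored
    for (line, x1, y1), (x2, y2) in zip(upper, lower):
        total = (x1 - x2) / 2 + (y1 - y2) / 2
        if total <= limit:
            results.append((line, total))
    return results

demo = [
    "values 290 MR1 1 1.000000 0.000000",
    "values 290 MR2 1 0.000000 1.000000",
    "VALUES 944 CA SER A 124 19.929 15.508 41.001 1.00 27.16 C",
    "VALUES 945 C SER A 124 18.528 15.865 41.525 1.00 27.86 C",
]
for line, total in pair_totals(demo):
    print(f"{line} WHICH ANSWER IS >> {total:.4f}")
```

Keeping the original line next to its coordinates is what makes it possible to print the source record together with the computed total.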

Edit text columns

I have a text file (the first two lines are character spacings):
1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM 1 N1 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N1 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N1 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N1 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N1 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N1 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N1 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N1 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N1 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N1 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
I want to edit it and make it like:
1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
The number of spaces between each column are important and the list of atoms needs to go up to 190 (N001-N190). Thus I would like to replace characters 13-16 (" N1 ") in file 1 with ("N001") and keep the remainder of the file in the original spacing.
You don't need 10 long lines of sample input to demonstrate the problem or the solution:
$ cat file
ATOM 1 N1 SPINA 3
ATOM 2 N1 SPINA 3
ATOM 10 N1 SPINA 3
$ awk '{print substr($0,1,12) sprintf("N%03d",$2) substr($0,17)}' file
ATOM 1 N001 SPINA 3
ATOM 2 N002 SPINA 3
ATOM 10 N010 SPINA 3
I'm assuming we could use $2 as the numeric part of the 3rd field. It seems to increment sequentially with your line numbers. Using NR might be an alternative. If neither of those is actually what you want, post some more representative sample input/output.
Also, note that any solution that involves assigning to a field (e.g. $3=...) WILL cause awk to recompile the line using the value of OFS as the field separator and so will change your spacing.
Oh, and if those 2 initial lines of character spacings are really present in your files, this is the tweak:
$ cat file
1 2
12345678901234567890123456
ATOM 1 N1 SPINA 3
ATOM 2 N1 SPINA 3
ATOM 10 N1 SPINA 3
$ awk 'NR>2{$0 = substr($0,1,12) sprintf("N%03d",$2) substr($0,17)} 1' file
1 2
12345678901234567890123456
ATOM 1 N001 SPINA 3
ATOM 2 N002 SPINA 3
ATOM 10 N010 SPINA 3
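For comparison, the same fixed-width edit can be sketched in Python by slicing the line rather than splitting it into fields, which leaves the rest of the spacing untouched. The function name rename_atom is hypothetical, and the sketch assumes the standard PDB layout in which the atom serial number sits in columns 7-11 and the atom name in columns 13-16.

```python
def rename_atom(line: str) -> str:
    """Replace columns 13-16 with 'N' plus the zero-padded serial number from columns 7-11."""
    if not line.startswith("ATOM"):
        return line  # ruler lines and other records pass through unchanged
    serial = int(line[6:11])  # int() ignores the padding spaces
    return line[:12] + f"N{serial:03d}" + line[16:]

# Build sample records with the serial number right-aligned in columns 7-11.
for serial in (1, 10, 190):
    record = f"ATOM  {serial:>5}  N1  SPINA   3"
    print(rename_atom(record))
```

Because only four characters are swapped for four characters, the line length and every other column position stay the same.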
Try :
$ awk '{$3=substr($3,1,1) sprintf("%03d",$2)}1' OFS=\\t file
Note : OFS will be tab
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
if you want to increment with line
$ awk '{$3=substr($3,1,1) sprintf("%03d",NR)}1' OFS=\\t file
Here is yet another way:
awk 'sub(/.$/,sprintf("%03d",NR),$3)' OFS='\t' file
Output:
$ awk 'sub(/.$/,sprintf("%03d",NR),$3)' OFS='\t' file
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00 S N
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00 S N
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00 S N
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00 S N
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00 S N
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00 S N
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00 S N
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00 S N
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00 S N
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00 S N
If you are interested in solving it with pure shell (bash), here is the code:
while IFS= read -r line
do
    n=${line:6:5} # atom serial number, columns 7-11
    printf "%sN%03d%s\n" "${line:0:12}" $n "${line:16}"
done < file
awk '$3="N"sprintf("%03d",$2)' OFS='\t' infile.txt
Result
ATOM 1 N001 SPINA 3 30.616 29.799 14.979 1.00 20.00SN
ATOM 2 N002 SPINA 3 28.146 28.381 13.950 1.00 20.00SN
ATOM 3 N003 SPINA 3 27.605 28.239 14.037 1.00 20.00SN
ATOM 4 N004 SPINA 3 30.333 29.182 15.464 1.00 20.00SN
ATOM 5 N005 SPINA 3 29.608 29.434 14.333 1.00 20.00SN
ATOM 6 N006 SPINA 3 29.303 29.830 13.317 1.00 20.00SN
ATOM 7 N007 SPINA 3 28.963 31.116 13.472 1.00 20.00SN
ATOM 8 N008 SPINA 3 28.859 28.743 13.828 1.00 20.00SN
ATOM 9 N009 SPINA 3 29.699 30.575 14.564 1.00 20.00SN
ATOM 10 N010 SPINA 3 29.518 29.194 15.301 1.00 20.00SN

Deleting lines with sed or awk

I have a file data.txt like this.
>1BN5.txt
207
208
211
>1B24.txt
88
92
I have a folder F1 that contains text files.
1BN5.txt file in F1 folder is shown below.
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C
1B24.txt file in F1 folder is shown below.
ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C
I need only the lines whose 6th column contains 207, 208, or 211 in the 1BN5.txt file, and I want to delete the other lines from that file. Likewise, I need only the lines containing 88 or 92 in the 1B24.txt file.
Desired output
1BN5.txt file
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
1B24.txt file
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Here's one way using GNU awk. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
    file = substr($1,2)
    next
}
{
    a[file][$1]
}
END {
    for (i in a) {
        while ( ( getline line < ("./F1/" i) ) > 0 ) {
            split(line,b)
            for (j in a[i]) {
                if (b[6]==j) {
                    print line > "./F1/" i ".new"
                }
            }
        }
        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
If you have an older version of awk, older than GNU Awk 4.0.0, you could try the following. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
    file = substr($1,2)
    next
}
{
    a[file]=( a[file] ? a[file] SUBSEP : "") $1
}
END {
    for (i in a) {
        split(a[i],b,SUBSEP)
        while ( ( getline line < ("./F1/" i) ) > 0 ) {
            split(line,c)
            for (j in b) {
                if (c[6]==b[j]) {
                    print line > "./F1/" i ".new"
                }
            }
        }
        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
Please note that this script does exactly as you describe. It expects files like 1BN5.txt and 1B24.txt to reside in the folder F1 in the present working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system() call. HTH.
Results:
Contents of F1/1BN5.txt:
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
Contents of F1/1B24.txt:
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
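The two steps the awk scripts perform (read data.txt into a per-file set of wanted values, then keep only lines whose 6th column is in that set) can be sketched in Python too. The names parse_wanted and keep_lines are hypothetical, and the sketch works on lists of lines so that it does not touch any files.

```python
from typing import Dict, Iterable, List, Set

def parse_wanted(data_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each '>filename' header in data.txt to the set of values listed under it."""
    wanted: Dict[str, Set[str]] = {}
    current = None
    for raw in data_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            wanted.setdefault(current, set())
        elif current is not None:
            wanted[current].add(line)
    return wanted

def keep_lines(pdb_lines: Iterable[str], residues: Set[str]) -> List[str]:
    """Keep only the lines whose 6th whitespace-separated column is in residues."""
    kept = []
    for line in pdb_lines:
        cols = line.split()
        if len(cols) >= 6 and cols[5] in residues:
            kept.append(line)
    return kept

data = [">1BN5.txt", "207", "208", "211", ">1B24.txt", "88", "92"]
pdb = [
    "ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C",
    "ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C",
    "ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H",
]
for line in keep_lines(pdb, parse_wanted(data)["1BN5.txt"]):
    print(line)
```

To apply it to the real files, one would read each F1/<name> file, pass its lines through keep_lines, and write the result to a new file before replacing the original, which mirrors the .new and mv steps in the awk version.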
Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:
awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' 1BN5.txt > output.txt
Assuming GNU awk, run this command from the directory containing data.txt:
awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt
This parses data.txt for filenames and search terms, then calls grep from inside awk to append the matches from each file and term listed in data.txt to a new file in F1 called originalfilename.txt_results. Note that grep matches the term anywhere on the line, not only in the 6th column.
If you want to replace the original files completely, you could then run this command:
grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;
This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1
mv F1 backup &&
mkdir F1 &&
awk '
NR==FNR {
    if (sub(/>/,"")) {
        file=$0
        ARGV[ARGC++] = "backup/" file
    }
    else {
        tgt["backup/" file,$0] = "F1/" file
    }
    next
}
(FILENAME,$6) in tgt {
    print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup
If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.." at the end (do that during testing anyway).
EDIT: FYI this is one case where you could argue the case for getline not being completely incorrect since it's parsing a first file that's totally unlike the rest of the files in structure and intent so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:
mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
    while ( (getline line < data) > 0 ) {
        if (sub(/>/,"",line)) {
            file=line
            ARGV[ARGC++] = "backup/" file
        }
        else {
            tgt["backup/" file,line] = "F1/" file
        }
    }
}
(FILENAME,$6) in tgt {
    print > tgt[FILENAME,$6]
}
' &&
rm -rf backup
but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).
This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.
awk '
BEGIN {RS=">"}
FNR == 1 {
    # since the first char in data.txt is the record separator,
    # there is an empty record before the real data starts
    next
}
{
    n = split($0, a, "\n")
    file = "F1/" a[1]
    newfile = file ".new"
    RS="\n"
    # compare against 0 so a missing file does not cause an endless loop
    while ((getline < file) > 0) {
        for (i=2; i<n; i++) {
            if ($6 == a[i]) {
                print > newfile
                break
            }
        }
    }
    RS=">"
    system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
}
' data.txt
Definitely a job for awk:
$ awk '$6==207||$6==208||$6==211 { print }' 1BN5.txt
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Redirect to save the output:
$ awk '$6==207||$6==208||$6==211 { print }' 1BN5.txt > output.txt
I don't think you can do this with just sed alone. You need a loop to read your file data.txt. For example, using a bash script:
#!/bin/bash
# First remove all possible "problematic" characters from data.txt, storing result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading >, and "."
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt
# Next determine which lines to keep:
cat data.clean.txt | while read line; do
    if [[ "${line:0:1}" == ">" ]]; then
        # If input starts with ">", set remainder to be the current file
        file="${line:1}"
    else
        # If value is in sixth column, add "keep" to end of line
        # Columns assumed separated by one or more spaces
        # "+" is a GNU extension, so we need the -r switch
        sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" "$file"
    fi
done
# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
    if [[ "${line:0:1}" == ">" ]]; then
        sed -i -n "/keep/{s/keep//g;p;}" "${line:1}"
    fi
done

match a pattern and print subsequent lines

There are 200 files named File1_0.pdb, File1_60.pdb, etc. Each one looks like this:
ATOM 1 N VAL 1 8.897 -21.545 -7.276 1.00 0.00
ATOM 2 H1 VAL 1 9.692 -22.015 -6.868 1.00 0.00
ATOM 3 H2 VAL 1 9.228 -20.766 -7.827 1.00 0.00
ATOM 4 H3 VAL 1 8.289 -22.236 -7.693 1.00 0.00
TER
ATOM 5 CA VAL 1 8.124 -20.953 -6.203 1.00 0.00
ATOM 6 HA VAL 1 8.072 -19.874 -6.345 1.00 0.00
ATOM 7 CB VAL 1 6.693 -21.515 -6.176 1.00 0.00
ATOM 8 HB VAL 1 6.522 -22.024 -5.227 1.00 0.00
ATOM 9 CG1 VAL 1 5.684 -20.370 -6.330 1.00 0.00
ATOM 10 1HG1 VAL 1 5.854 -19.861 -7.279 1.00 0.00
I have to extract the part after TER and put it in a different file, and this has to be done for all 200 files. I did something like sed '1,/TER/d' File1_0.pdb > 1_0.pdb, but this works on only one file at a time. Is there a solution for all 200 files in one go? Each output file has the same name with only "File" removed.
for i in File*.pdb; do sed '1,/TER/d' "$i" > "${i/File/}"; done
This might work:
seq 0 200 | xargs -I{} cp File1_{}.pdb 1_{}.pdb # copy the files under the new names
sed -si '1,/TER/d' 1_{0..200}.pdb # edit files separately inline
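If a Python version is more convenient, the following sketch mimics sed '1,/TER/d' on a list of lines and derives the output name the same way as ${i/File/}. Both function names are hypothetical. As with the sed range, the search for TER starts on the second line, and if no TER is found nothing is kept.

```python
from typing import List

def after_first_ter(lines: List[str]) -> List[str]:
    """Return the lines after the first TER line found at or after line 2."""
    for i in range(1, len(lines)):
        if "TER" in lines[i]:
            return lines[i + 1:]
    return []  # no TER: the sed range would delete everything

def output_name(name: str) -> str:
    """Remove the first occurrence of 'File' from the file name."""
    return name.replace("File", "", 1)

demo = [
    "ATOM 4 H3 VAL 1 8.289 -22.236 -7.693 1.00 0.00",
    "TER",
    "ATOM 5 CA VAL 1 8.124 -20.953 -6.203 1.00 0.00",
    "ATOM 6 HA VAL 1 8.072 -19.874 -6.345 1.00 0.00",
]
print(output_name("File1_0.pdb"))
for line in after_first_ter(demo):
    print(line)
```

Looping over the real files would mean reading each File*.pdb, calling after_first_ter on its lines, and writing the result to output_name(...), which keeps the originals intact.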