I'm trying to find a way to write a script that does the following:
Open and detect the first use of a three-letter sequence that is repeated in the input file
Edit and permute this three letter sequence 19 times, giving 19 outputs each with a different three letter code that corresponds to a list of 19 possible three letter codes
Essentially, this is a fairly straightforward find and replace problem that I know how to do. The problem is that I then need to loop this so that, after creating the 19 files from the previous line, the next line with a different three letter code has the same replacement done to it.
I'm struggling to find a way to have the script recognize sequences of text when it can be one of twenty different things.
Let me know if anyone has any ideas on how I could go about doing this, I'll provide any clarification if necessary too!
Here is an example of an input file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
Where an output would look like this:
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
On the first pass, the SER should be changed to a series of twenty different text sequences, the first being ALA. The issue I'm having is that I'm not sure how to write a script that will change more than one line of text.
My current script can form the 19 mutations of the first SER, but that's where it will stop. It won't mutate the next one, and it won't mutate a different three letter code, for example it wouldn't change the GLU. Is there any easy way to integrate this functionality?
Currently, the way I've approached this is to do a simple text transformation using sed, but as this seems more complicated than what sed can bring to the table, I think perl is likely the way to go. I can add the sed code, but I didn't think it would be of much help.
Your question and comments aren't entirely clear, but I believe this script will do what you want. It parses a PDB file until it reaches the amino acid of interest. A set of 19 files is then produced in which this AA is replaced by each of the other 19 AAs. From there onwards, every time an AA differs from the AA in the previous line, another set of 19 files is generated.
#!/usr/bin/perl
use warnings;
use strict;
# we're going to start mutating when we find this residue.
my $target = 'GLU';
my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );
my $prev = '';
my $line_no = 0;
my @lines;
my %changes;
# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file
# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
    # split the line into columns (assuming it is tab-delimited;
    # switch this for /\s+/ if it is separated by other whitespace)
    my @cols = split "\t";
    if ($target && $cols[3] eq $target) {
        # Found our target residue! unset $target so that the following
        # set of tests are performed
        undef $target;
    }
    # see if this AA is the same as the AA in the previous line
    if (! $target && $prev ne $cols[3]) {
        # if it isn't, store the line number and the amino acid
        $changes{ $line_no } = $cols[3];
        # update $prev to reflect the new AA
        $prev = $cols[3];
    }
    # store all the lines
    push @lines, $_;
    # increment the line number
    $line_no++;
}
# now, for each of the changes, create substitute files
for (keys %changes) {
    create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}
sub create_substitutes {
    # arguments: line no, $res: residue, $aas: array of amino acids,
    # $all_lines: all lines in the file
    my ($line_no, $res, $aas, $all_lines) = @_;
    # this is the target line that we want to substitute
    my @target = split "\t", $all_lines->[$line_no];
    # for each AA in the list of AAs, create a new file called 'XXX-##.txt',
    # where XXX is the amino acid and ## is the line number where the
    # substituted residue is.
    for (@$aas) {
        next if $_ eq $res;
        open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
        # print out all lines up to the changed line
        print { $fh } @$all_lines[0..$line_no-1];
        # print out the changed line, substituting in the AA
        print { $fh } join "\t", @target[0..2], $_, @target[4..$#target];
        # print out the rest of the lines.
        print { $fh } @$all_lines[$line_no+1 .. $#{$all_lines}];
    }
}
__DATA__
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
ATOM 10 O GLU A 3 34.927 -10.911 -4.473 1.00 59.23 O
ATOM 11 CB GLU A 3 33.328 -8.094 -4.789 1.00 62.49 C
ATOM 12 CG GLU A 3 32.291 -7.994 -3.693 1.00 66.67 C
ATOM 13 CD GLU A 3 31.552 -9.302 -3.426 1.00 71.93 C
ATOM 14 OE1 GLU A 3 32.177 -10.254 -2.892 1.00 73.96 O
ATOM 15 OE2 GLU A 3 30.329 -9.364 -3.723 1.00 74.25 O
ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N
ATOM 17 CA PRO A 4 36.131 -10.951 -6.967 1.00 56.64 C
ATOM 18 CA ARG A 4 36.131 -10.951 -6.967 1.00 56.64 C
This example data will produce a set of files for the first GLU found (line 6), then another set for line 15 (PRO residue), and another set for line 17 (ARG residue).
Example of ALA-6.txt file:
ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N
ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C
ATOM 3 C SER A 2 36.918 -7.555 -3.786 1.00 64.14 C
ATOM 4 O SER A 2 37.287 -8.576 -3.177 1.00 64.31 O
ATOM 5 CB SER A 2 38.251 -5.804 -2.552 1.00 65.31 C
ATOM 6 OG SER A 2 37.122 -5.210 -1.918 1.00 66.94 O
ATOM 7 N ALA A 3 35.705 -7.438 -4.342 1.00 62.82 N
ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C
ATOM 9 C GLU A 3 35.126 -9.833 -5.033 1.00 59.71 C
(etc.)
If this isn't the correct behaviour, you'll have to edit your question as it isn't very clear!
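The two steps of that script (finding the lines where the residue changes, then building the 19 substitutes for each such line) can also be sketched in Python. This is only an illustration under stated assumptions: the names residue_starts and substitutes are made up here, the lines are split on any whitespace rather than tabs, and the variants are built in memory instead of being written to files. Real PDB files are fixed-width, so whitespace splitting may not be safe for every record.

```python
# Sketch only: function and variable names are illustrative, not from the answer.
AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN", "GLY", "HIS", "ILE",
               "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]


def residue_starts(lines, target):
    """Return {line_index: residue} for the first line holding `target` and
    for every later line whose residue differs from the previous line's."""
    starts = {}
    prev = None
    started = False
    for i, line in enumerate(lines):
        res = line.split()[3]  # assumes the residue is the 4th whitespace field
        if not started and res == target:
            started = True
        if started and res != prev:
            starts[i] = res
            prev = res
    return starts


def substitutes(lines, index, residue):
    """Yield (file_name, new_lines) for each of the 19 other amino acids."""
    cols = lines[index].split()
    for aa in AMINO_ACIDS:
        if aa == residue:
            continue
        new_line = " ".join(cols[:3] + [aa] + cols[4:])
        yield f"{aa}-{index}.txt", lines[:index] + [new_line] + lines[index + 1:]


# A trimmed subset of the example input from the question.
sample = [
    "ATOM 1 N SER A 2 37.396 -5.247 -4.830 1.00 65.06 N",
    "ATOM 2 CA SER A 2 37.881 -6.354 -3.929 1.00 64.88 C",
    "ATOM 7 N GLU A 3 35.705 -7.438 -4.342 1.00 62.82 N",
    "ATOM 8 CA GLU A 3 34.716 -8.539 -4.306 1.00 61.94 C",
    "ATOM 16 N PRO A 4 35.663 -9.732 -6.280 1.00 57.83 N",
]
for index, residue in sorted(residue_starts(sample, "GLU").items()):
    variants = list(substitutes(sample, index, residue))
    print(index, residue, len(variants))
```

As in the Perl version, line indices are counted from 0, and each variant changes only the first line of a residue.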
Because your question isn't very clear (more precisely, it is totally unclear), I created the following:
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;
my $residues_file = "input2.txt"; # residue names, one per line
my $molfile = "m1.pdb";           # molecule file
# read the residues
my (@residues) = path($residues_file)->lines({chomp => 1});
my $m = Bio::PDB::Structure::Molecule->new;
for my $res (@residues) {        # for each residue name from the file "input2.txt"
    $m->read($molfile);          # read the molecule
    my $atom = $m->atom(0);      # get the 1st atom
    $atom->residue_name($res);   # change the residue to the one from the file
    # create output filename
    my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
    # write the result
    $m->print($outfile);
}
for example, if the input2.txt contains
ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
then, from your input, it generates 20 files in which the residue of the 1st atom is changed (as in your output example), like this:
==> m1_ala.pdb <==
ATOM 1 N ALA A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_arg.pdb <==
ATOM 1 N ARG A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asn.pdb <==
ATOM 1 N ASN A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_asp.pdb <==
ATOM 1 N ASP A 2 37.396 -5.247 -4.830 1.00 65.06
==> m1_cys.pdb <==
ATOM 1 N CYS A 2 37.396 -5.247 -4.830 1.00 65.06
... etc, 20 times...
I have some text files and I need to remove the first character from the fourth column only if the column has four characters
file1 as follows
ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA AMET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C AMET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N AARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2 as follows
ATOM 41 CA ATRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA BTRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB ASER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB BSER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG CHIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
Desired output
file1
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
file2
ATOM 41 CA TRP A 6 -18.975 -29.894 -7.425 0.50 19.50 C
ATOM 42 CA TRP A 6 -18.979 -29.890 -7.428 0.50 19.16 C
ATOM 43 C HIS A 6 -18.091 -29.845 -8.669 1.00 19.84 C
ATOM 44 O HIS A 6 -17.015 -30.452 -8.696 1.00 20.10 O
ATOM 45 CB SER A 9 -18.499 -28.879 -6.370 0.50 19.73 C
ATOM 46 CB SER A 9 -18.565 -28.837 -6.367 0.50 19.13 C
ATOM 47 CG HIS A 12 -19.421 -27.711 -6.216 0.50 21.30 C
This might work for you (GNU sed):
sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1 \3/' file
This replaces the first character of the fourth column with a space if that column has four non-space characters.
Use the length() function to find the length of the column and the substr() function to print the substring you need:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
Piping to column -t rebuilds a nice table format. To store the changes in a new file, use the redirection operator:
$ awk 'length($4)==4{$4=substr($4,2)}1' file | column -t > new_file
With sed you could do:
$ sed -r 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
ATOM 5181 N MET K 406 12.440 6.552 25.691 0.50 7.37 N
ATOM 5182 CA MET K 406 13.685 5.798 25.578 0.50 5.87 C
ATOM 5183 C MET K 406 14.045 5.179 26.909 0.50 5.07 C
ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5185 CB MET K 406 14.812 6.674 25.044 0.50 6.80 C
ATOM 5202 N ARG K 408 12.186 3.982 29.147 0.50 6.55 N
To store the changes back to the original file you can use the -i option:
$ sed -ri 's/^((\S+\s+){3})\S(\S{3}\s)/\1\3/' file
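The same rule (drop the first character of the fourth column only when that column is four characters long) can be sketched in Python for comparison. The function name strip_altloc is made up for this illustration, and like the awk version it rebuilds each line with single spaces, so the original column alignment is not preserved.

```python
def strip_altloc(line):
    """Drop the first character of the 4th whitespace-separated field
    when that field is exactly four characters long."""
    fields = line.split()
    if len(fields) > 3 and len(fields[3]) == 4:
        fields[3] = fields[3][1:]
    return " ".join(fields)


# Two rows taken from file1 in the question.
rows = [
    "ATOM 5181 N AMET K 406 12.440 6.552 25.691 0.50 7.37 N",
    "ATOM 5184 O MET K 406 14.595 4.083 27.003 0.50 7.07 O",
]
for row in rows:
    print(strip_altloc(row))
```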
My data looks like this,
1 20010101 945 A 6
1 20010101 946 B 4
1 20010101 947 P 3.5
1 20010101 950 A 5
1 20010101 951 P 4
1 20010101 952 P 4
1 20010101 1010 A 4
1 20010101 1011 P 4
2 20010101 940 A 3.5
2 20010101 1015 A 3
2 20010101 1113 B 3.5
2 20010101 1114 P 3.2
2 20010101 1115 B 3.4
2 20010101 1116 P 3.1
2 20010101 1119 P 3.6
I am trying to print every line with P, followed by the most recent A and B values, matched on the first two columns (e.g., 1 and 20010101).
The result is expected to be like this,
1 20010101 947 P 3.5 6 4
1 20010101 951 P 4 5 4
1 20010101 952 P 4 5 4
1 20010101 1011 P 4 4 4
2 20010101 1114 P 3.2 3 3.5
2 20010101 1116 P 3.1 3 3.4
2 20010101 1119 P 3.6 3 3.4
Does this need sorting or a hash in Perl? I am out of ideas; could anybody give me a hint? It would be much appreciated!
perl -ane 'if($F[3] eq "P"){ s/$/ $la $lb/; print; }else{ ($la,$lb) = ($F[3] eq "A")?($F[4],$lb):($la,$F[4]) }' data.txt
This is most simply solved with an if-elsif structure:
use strict;
use warnings;
my ($A, $B);
while (<DATA>) {
    my @data = split;
    if ($data[3] eq "A") {
        $A = $data[4];
    } elsif ($data[3] eq "B") {
        $B = $data[4];
    } elsif ($data[3] eq "P") {
        print join("\t", @data, $A, $B), "\n";
    }
}
__DATA__
1 20010101 945 A 6
1 20010101 946 B 4
1 20010101 947 P 3.5
1 20010101 950 A 5
1 20010101 951 P 4
1 20010101 952 P 4
1 20010101 1010 A 4
1 20010101 1011 P 4
2 20010101 940 A 3.5
2 20010101 1015 A 3
2 20010101 1113 B 3.5
2 20010101 1114 P 3.2
2 20010101 1115 B 3.4
2 20010101 1116 P 3.1
2 20010101 1119 P 3.6
Output:
1 20010101 947 P 3.5 6 4
1 20010101 951 P 4 5 4
1 20010101 952 P 4 5 4
1 20010101 1011 P 4 4 4
2 20010101 1114 P 3.2 3 3.5
2 20010101 1116 P 3.1 3 3.4
2 20010101 1119 P 3.6 3 3.4
You might want to compensate for possible empty/undefined/old values in $A and $B.
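One way to handle stale or missing values is to reset the remembered A and B whenever the first two columns change, and to print a placeholder when no value has been seen yet. The Python sketch below illustrates that idea; the function name attach_latest and the "NA" placeholder are assumptions made for this example, not part of the answer above.

```python
def attach_latest(rows, missing="NA"):
    """For each P row, append the most recent A and B values seen within
    the same (column 1, column 2) group; use `missing` if none was seen."""
    out = []
    key = None
    latest = {"A": None, "B": None}
    for row in rows:
        fields = row.split()
        if (fields[0], fields[1]) != key:
            # New group: forget values carried over from the previous group.
            key = (fields[0], fields[1])
            latest = {"A": None, "B": None}
        kind, value = fields[3], fields[4]
        if kind in latest:
            latest[kind] = value
        elif kind == "P":
            a = latest["A"] if latest["A"] is not None else missing
            b = latest["B"] if latest["B"] is not None else missing
            out.append("\t".join(fields + [a, b]))
    return out


# Trimmed sample: group 2 has no B row before its P row.
rows = [
    "1 20010101 945 A 6",
    "1 20010101 946 B 4",
    "1 20010101 947 P 3.5",
    "2 20010101 940 A 3.5",
    "2 20010101 1114 P 3.2",
]
for line in attach_latest(rows):
    print(line)
```

Without the reset, the last row would pick up B = 4 from group 1, which is the kind of old value the note above warns about.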
I have code that loads a cell array and converts it to a matrix.
The matrix currently shows 4 digits after the decimal point, for example:
0 5 15 1 51,9000 3,4000
0 5 15 1 51,9000 3,4000
0 5 15 1 51,9000 3,4000
How can I change all of the rows to show just 2 digits after the decimal point?
Please note that I want to change the matrix itself, not just how it is printed in the command window!
If you want to see it in the command window/editor for debugging purposes, use bank format:
format bank;
Example:
A =[ 51.213123 6.132434]
format bank
disp(A);
Will result in :
A =
51.21 6.13
Also, you can use sprintf
A = [51.900 3.4000];
disp(sprintf('%2.2f ',A));
x = [0 5 15 1 51.9000 3.4000
0 5 15 1 51.9000 3.4000
0 5 15 1 51.9000 3.4000];
fprintf([repmat('%.2f ',1,size(x,2)) '\n'], x')
0.00 5.00 15.00 1.00 51.90 3.40
0.00 5.00 15.00 1.00 51.90 3.40
0.00 5.00 15.00 1.00 51.90 3.40
There are 200 files named File1_0.pdb, File1_60.pdb, etc. Each one looks like this:
ATOM 1 N VAL 1 8.897 -21.545 -7.276 1.00 0.00
ATOM 2 H1 VAL 1 9.692 -22.015 -6.868 1.00 0.00
ATOM 3 H2 VAL 1 9.228 -20.766 -7.827 1.00 0.00
ATOM 4 H3 VAL 1 8.289 -22.236 -7.693 1.00 0.00
TER
ATOM 5 CA VAL 1 8.124 -20.953 -6.203 1.00 0.00
ATOM 6 HA VAL 1 8.072 -19.874 -6.345 1.00 0.00
ATOM 7 CB VAL 1 6.693 -21.515 -6.176 1.00 0.00
ATOM 8 HB VAL 1 6.522 -22.024 -5.227 1.00 0.00
ATOM 9 CG1 VAL 1 5.684 -20.370 -6.330 1.00 0.00
ATOM 10 1HG1 VAL 1 5.854 -19.861 -7.279 1.00 0.00
I have to extract the part after TER and put it in a different file, and this has to be done for all 200 files. I tried sed '1,/TER/d' File1_0.pdb > 1_0.pdb, but that handles only one file at a time. Is there a way to process all 200 files in one go? Each output file should keep the same name with only "File" removed.
for i in File*.pdb; do sed '1,/TER/d' "$i" > "${i/File/}"; done
This might work:
seq 0 200 | xargs -i -n1 cp File1_{}.pdb 1_{}.pdb # backup files
sed -si '1,/TER/d' 1_{0..200}.pdb # edit files separately inline
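For comparison, the same batch job can be sketched in Python. The helper names after_ter and split_all are made up for this illustration. The sketch roughly mirrors sed '1,/TER/d': it keeps only the lines after the first line that starts with TER, and writes nothing if a file has no TER line. The demo runs in a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def after_ter(lines):
    """Return the lines that follow the first line starting with TER."""
    for i, line in enumerate(lines):
        if line.startswith("TER"):
            return lines[i + 1:]
    return []  # no TER line: nothing is kept


def split_all(directory):
    """Write the post-TER part of every File*.pdb in `directory` to a
    sibling file whose name has the leading 'File' removed."""
    written = []
    for src in sorted(Path(directory).glob("File*.pdb")):
        dst = src.with_name(src.name.replace("File", "", 1))
        lines = src.read_text().splitlines(keepends=True)
        dst.write_text("".join(after_ter(lines)))
        written.append(dst.name)
    return written


# Demo in a throwaway directory with one small made-up input file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "File1_0.pdb").write_text("ATOM 1 N VAL 1\nTER\nATOM 5 CA VAL 1\n")
    written = split_all(tmp)
    result = Path(tmp, "1_0.pdb").read_text()
print(written, repr(result))
```

Calling split_all on the directory that holds the 200 files would leave the originals in place and create 1_0.pdb, 1_60.pdb and so on next to them.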