I want to find out the longest possible protein sequence translated from cds in 6 forward and reverse frame.
This is the example input format:
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
I would like to find out all the strings which start from "M" and stop at "X", count the each length of the strings and select the longest.
For example, in the case above:
the script will find,
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
Then count each match's length, and print the string and number of longest matches which is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
But it prints out no answer. Does anyone know how to fix it? Any suggestion will be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my #pep=();
my $i=();
my #Xnum=();
my $n=();
my %hash=();
my #k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
chomp;
if($_=~/^[^\>]/){
#pep=split(//, $_);
if($_ =~ /(X)/){
push(#Xnum, $1);
if($n >= 0 && $n <= $#Xnum){
if(#pep eq "M"){
for($i=1; $i<=$#pep; $i++){
$seq=join("",#pep);
$hash{$i}=$seq;
push(#k, $i);
}
}
elsif(#pep eq "X"){
$n=$n+1;
}
foreach (sort {$a cmp $b} #k){
print "$hash{$k[0]}\t$k[0]";
}
}
}
}
elsif($_=~/^\>/){
print "$_\n";
}
}
close IN;
Check out this Perl one-liner
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$
There's more than one way to do it!
Try this too:
print and next if /^>/;
chomp and my #z = $_ =~ /(M[^X]*X)/g;
my $m = "";
for my $s (#z) {
$m = $s if length $s > length $m
}
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
uses >=5.14 and make sure to run script with perl -n
As a one-liner:
perl -E 'print and next if /^>/; chomp and my #z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (#z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
Here is solution using reduce from List::Util.
Edit: mistakenly used maxstr which gave results but is not what was needed. Have reedited this post to use reduce (correctly) instead.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
chomp;
if (/^>/) {
$id = $_;
}
else {
my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
print "$id\n$data\t" . length($data) . "\n" if $data;
}
}
Here's my take on it.
I like fasta files tucked into a hash, with the fasta name as the key. This way you can just add descriptions to it, e.g. base composition etc...
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };
close $fh;
chomp $string;
my #fasta = grep {/./} split (">", $string);
for my $aa (#fasta){
my ($key, $value) = split ("\n", $aa);
$value =~ s/[A-Z]*(M.*M)[A-Z]/$1/;
$prot{$key}->{'len'} = length($value);
$prot{$key}->{'prot'} = $value;
}
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output
>2343
MJJJJJJJJJJJJJJM 16
I have written a script which uses a subroutine to call percentage of nucleotides in a given sequence. When I run the script the output for each nucleotide percentage is always shown to be zero.
Here's my code;
#!/usr/bin/perl
use strict;
use warnings;
#### Subroutine to report percentage of each nucleotide in DNA sequence ####
my $input = $ARGV[0];
my $nt = $ARGV[1];
my $args = $#ARGV +1;
if($args != 2){
print "Error!!! Insufficient number of arguments\n";
print "Usage: $0 <input fasta file>\n";
}
my($FH, $line);
open($FH, '<', $input) || die "Could\'nt open file: $input\n";
$line = do{
local $/;
<$FH>;
};
$line =~ s/>(.*)//g;
$line =~ s/\s+//g;
my $perc = perc_nucleotide($line , $nt);
printf("The percentage of $nt nucleotide in given sequence is %.0f", $perc);
print "\n";
sub perc_nucleotide {
my($line, $nt) = #_;
print "$nt\n";
my $count = 0;
if( $nt eq "A" || $nt eq "T" || $nt eq "G" || $nt eq "C"){
$count++;
}
my $total_len = length($line);
my $perc = ($count/$total_len)*100;
}
I think that I am setting the $count variable wrong. I tried different ways but can't figure it out.
This is the input file
>XM_024894547.1 Trichoderma citrinoviride Redoxin (BBK36DRAFT_1163529), partial mRNA
ATGGCCTTCCGTCTCCCTCTGCGCCGCATTGCCCTGGCCCGCCCCGCCACCGTTGCGCGTGGCTTCCACT
CGACGCCCCGCGCCCTGGTCAAGGTCGGCGACGAGGTCCCGAGCTTGGAGCTGTTCGAGAAGTCGGCCGC
CAGCAAGATCAACCTGGCCGACGAGTTCAAGAAGGGCGACGGCTACATTGTCGGCGTCCCGGGCGCCTTC
TCCGGCACCTGCTCCGGCACCCACGTCCCGTCGTACATCAACCACCCTGACATCAAGACGGCCGGCCAGG
TCTTTGTCGTCTCCGTCAACGACCCCTTTGTCATGAAGGCTTGGGCAGACCAGCTGGATCCCGCCGGAGA
GACAGGAATCCGGTTCGTTGCCGACCCCACGGCTGAGTTCACAAAGGCTCTGGAACTGGGATTCGACGAC
GCTGCTCCTCTGTTCGGAGGCACCCGAAGCAAGCGCTATGCTCTCAAGGTTAAGGATGGCAAGGTCACTG
CCGCCTTTGTTGAGCCCGACAACACGGGCACTTCCGTGTCAATGGCCGACAAGGTCCTCAGCTAA
The problem is here:
my $perc = perc_nucleotide($line , $nt);
printf("The percentage of $nt nucleotide in given sequence is %.0f", $perc);
perc_nucleotide is returning 0.18018018018018 but the format %.0f says to print it with no decimal places. So it gets truncated to 0. You should probably use something more like %.2f.
It's also worth noting that perc_nucleotide does not have a return. It still works, but for reasons that might not be obvious.
perc_nucleotide sets my $perc = ($count/$total_len)*100; but never uses that $perc. The $perc in the main program is a different variable.
perc_nucleotide does return something, every Perl subroutine without an explicit return returns the "last evaluated expression". In this case it's my $perc = ($count/$total_len)*100; but the last evaluated expression rules can get a bit tricky.
It's easier to read and safer to have an explicit return. return ($count/$total_len)*100;
I corrected the script and it gave me right answers.
#!/usr/bin/perl
use strict;
use warnings;
##### Subroutine to calculate percentage of all nucleotides in a DNA sequence #####
my $input = $ARGV[0];
my $nt = $ARGV[1];
my $args = $#ARGV + 1;
if($args != 2){
print "Error!!! Insufficient number of arguments\n";
print "Usage: $0 <input_fasta_file> <nucleotide>\n";
}
my($FH, $line);
open($FH, '<', $input) || die "Couldn\'t open input file: $input\n";
$line = do{
local $/;
<$FH>;
};
chomp $line;
#print $line;
$line =~ s/>(.*)//g;
$line =~ s/\s+//g;
#print "$line\n";
my $total_len = length($line);
my $perc_of_nt = perc($line, $nt);
**printf("The percentage of nucleotide $nt in a given sequence is %.2f%%", $perc_of_nt);
print "\n";**
#print "$total_len\n";
sub perc{
my($line, $nt) = #_;
my $char; my $count = 0;
**foreach $char (split //, $line){
if($char eq $nt){
$count += 1;
}
}**
**return (($count/$total_len)*100)**
}
The answer for the above input file is:
Total_len = 555
The percentage of nucleotide A in a given sequence is 18.02%
The percentage of nucleotide T in a given sequence is 18.74%
The percentage of nucleotide G in a given sequence is 28.47%
The changes which I made are in bold.
Thanks for amazing insight!!!
I am using perl to extract "Yes," or "No," from a large CSV, and output to a file using this code
open my $fin, "leads.csv";
my $str;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
}
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
This is working correctly, and outputting data like this http://pastebin.com/r7Lwwz8p, however I need to break
to a new line after the 16th element so it looks like this on output: http://pastebin.com/xC8Lyk5R
Any tips/tricks greatly appreciated!
The following splits a line by commas, and then regroups them by 16 elements:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my #fields = split ',', $line;
while (my #data = splice #fields, 0, 16) {
print join(',', #data), "\n";
}
}
__DATA__
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineH,2,3,4,5,6,7,8,9,10,11,12
Outputs:
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineH,2,3,4,5,6,7,8,9,10,11,12
Use a variable to count the number of yes/no matches that you find, and then use the mod (%) operator to insert a newline into the string.
#!/usr/bin/perl
use strict;
use warnings;
open my $fin, "leads.csv";
my $str;
my $count = 0;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
$count++;
}
$str .= "\n" unless ($count % 16);
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
keyword harry /
sally/
tally/
want that whenever the string matches with keyword it should also look for "/" character.This signifies continuation of line
Then I want output as
keyword harry sally tally
==========================
My current code:
#!/usr/bin/perl
open (file2, "trial.txt");
$keyword_1 = keyword;
foreach $line1 (<file2>) {
s/^\s+|\s+$//g;
if ($line1 =~ $keyword_1) {
$line2 =~ (s/$\//g, $line1) ;
print " $line2 " ;
}
}
If the ===== lines in your question are supposed to be in the output, then use
#! /usr/bin/env perl
use strict;
use warnings;
*ARGV = *DATA; # for demo only; delete
sub print_line {
my($line) = #_;
$line =~ s/\n$//; # / fix Stack Overflow highlighting
print $line, "\n",
"=" x (length($line) + 1), "\n";
}
my $line = "";
while (<>) {
$line .= $line =~ /^$|[ \t]$/ ? $_ : " $_";
if ($line !~ s!/\n$!!) { # / ditto
print_line $line;
$line = "";
}
}
print_line $line if length $line;
__DATA__
keyword jim-bob
keyword harry /
sally/
tally/
Output:
keyword jim-bob
================
keyword harry sally tally
==========================
You did not specify what to do with the lines that do not contain the keyword. You might use this code as an inspiration, though:
#!/usr/bin/perl
use warnings;
use strict;
my $on_keyword_line;
while (<>) {
if (/keyword/ or $on_keyword_line) {
chomp;
if (m{/$}) {
chop;
$on_keyword_line = 1;
} else {
$on_keyword_line = 0;
}
print;
} else {
$on_keyword_line = 0;
print "\n";
}
}
A redo is useful when dealing with concatenating continuation lines.
my $line;
while ( defined( $line = <DATA> )) {
chomp $line;
if ( $line =~ s{/\s*$}{ } ) {
$line .= <DATA>;
redo unless eof(DATA);
}
$line =~ s{/}{};
print "$line\n";
}
__DATA__
keyword harry /
sally/
tally/
and
done!!!
$ ./test.pl
keyword harry sally tally and
done!!!
I think you need to simply concatenate all lines that end in a slash, regardless of the keyword.
I suggest this code.
Updated to account for the OP's comment that continuation lines are terminated by backslashes.
while (<>) {
s|\\\s*\z||;
print;
}
If I open a file with strings like "233445", how can I then split that string into digits "2 3 3 4 4 5" and add each one to each other "2 + 3 + 3 etc..." and print out the result.
My code so far looks like this:
use strict;
#open (FILE, '<', shift);
#my #strings = <FILE>;
#strings = qw(12243434, 345, 676744); ## or a contents of a file
foreach my $numbers (#strings) {
my #done = split(undef, $numbers);
print "#done\n";
}
But I don't know where to start for the actual add function.
use strict;
use warnings;
my #strings = qw( 12243434 345 676744 );
for my $string (#strings) {
my $sum;
$sum += $_ for split(//, $string);
print "$sum\n";
}
or
use strict;
use warnings;
use List::Util qw( sum );
my #strings = qw( 12243434 345 676744 );
for my $string (#strings) {
my $sum = sum split(//, $string);
print "$sum\n";
}
PS — Always use use strict; use warnings;. It would have detected your misuse of commas in qw, and it would have dected your misuse of undef for split's first argument.
use strict;
my #done;
#open (FILE, '<', shift);
#my #strings = <FILE>;
my #strings = qw(12243434, 345, 676744); ## or a contents of a file
foreach my $numbers (#strings) {
#done = split(undef, $numbers);
print "#done\n";
}
my $tot;
map { $tot += $_} #done;
print $tot, "\n";
No one suggested an eval solution?
my #strings = qw( 12243434 345 676744 );
foreach my $string (#strings) {
my $sum = eval join '+',split //, $string;
print "$sum\n";
}
If your numbers are in a file, a one-liner might be nice:
perl -lnwe 'my $sum; s/(\d)/$sum += $1/eg; print $sum' numbers.txt
Since addition only uses numbers, it is safe to ignore all other characters. So just extract them one at the time with the regex and sum them up.
TIMTOWTDI:
perl -MList::Util=sum -lnwe 'print sum(/\d/g);' numbers.txt
perl -lnwe 'my $a; $a+=$_ for /\d/g; print $a' numbers.txt
Options:
-l auto-chomp input and add newline to print
-n implicit while(<>) loop around program -- open the file name given as argument and read each line into $_.