Pick up the longest peptide using perl - perl

I want to find out the longest possible protein sequence translated from cds in 6 forward and reverse frame.
This is the example input format:
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
I would like to find out all the strings which start from "M" and stop at "X", count the each length of the strings and select the longest.
For example, in the case above:
the script will find,
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
Then count each match's length, and print the string and number of longest matches which is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
But it prints out no answer. Does anyone know how to fix it? Any suggestion will be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my #pep=();
my $i=();
my #Xnum=();
my $n=();
my %hash=();
my #k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
chomp;
if($_=~/^[^\>]/){
#pep=split(//, $_);
if($_ =~ /(X)/){
push(#Xnum, $1);
if($n >= 0 && $n <= $#Xnum){
if(#pep eq "M"){
for($i=1; $i<=$#pep; $i++){
$seq=join("",#pep);
$hash{$i}=$seq;
push(#k, $i);
}
}
elsif(#pep eq "X"){
$n=$n+1;
}
foreach (sort {$a cmp $b} #k){
print "$hash{$k[0]}\t$k[0]";
}
}
}
}
elsif($_=~/^\>/){
print "$_\n";
}
}
close IN;

Check out this Perl one-liner
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$

There's more than one way to do it!
Try this too:
print and next if /^>/;
chomp and my #z = $_ =~ /(M[^X]*X)/g;
my $m = "";
for my $s (#z) {
$m = $s if length $s > length $m
}
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
uses >=5.14 and make sure to run script with perl -n
As a one-liner:
perl -E 'print and next if /^>/; chomp and my #z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (#z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt

Here is solution using reduce from List::Util.
Edit: mistakenly used maxstr which gave results but is not what was needed. Have reedited this post to use reduce (correctly) instead.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
chomp;
if (/^>/) {
$id = $_;
}
else {
my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
print "$id\n$data\t" . length($data) . "\n" if $data;
}
}

Here's my take on it.
I like fasta files tucked into a hash, with the fasta name as the key. This way you can just add descriptions to it, e.g. base composition etc...
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };
close $fh;
chomp $string;
my #fasta = grep {/./} split (">", $string);
for my $aa (#fasta){
my ($key, $value) = split ("\n", $aa);
$value =~ s/[A-Z]*(M.*M)[A-Z]/$1/;
$prot{$key}->{'len'} = length($value);
$prot{$key}->{'prot'} = $value;
}
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output
>2343
MJJJJJJJJJJJJJJM 16

Related

Perl script grep

The script is printing the amount of input lines, I want it to print the amount of input lines that are present in another file
#!/usr/bin/perl -w
open("file", "text.txt");
#todd = <file>;
close "file";
while(<>){
if( grep( /^$_$/, #todd)){
#if( grep #todd, /^$_$/){
print $_;
}
print "\n";
}
if for example file contains
1
3
4
5
7
and the input file that will be read from contains
1
2
3
4
5
6
7
8
9
I would want it to print 1,3,4,5 and 7
but 1-9 are being printed instead
UPDATE******
This is my code now and I am getting this error
readline() on closed filehandle todd at ./may6test.pl line 3.
#!/usr/bin/perl -w
open("todd", "<text.txt");
#files = <todd>; #file looking into
close "todd";
while( my $line = <> ){
chomp $line;
if ( grep( /^$line$/, #files) ) {
print $_;
}
print "\n";
}
which makes no sense to me because I have this other script that is basically doing the same thing
#!/usr/bin/perl -w
open("file", "<text2.txt"); #
#file = <file>; #file looking into
close "file"; #
while(<>){
$temp = $_;
$temp =~ tr/|/\t/; #puts tab between name and id
my ($name, $number1, $number2) = split("\t", $temp);
if ( grep( /^$number1$/, #file) ) {
print $_;
}
}
print "\n";
OK, the problem here is - grep sets $_ too. So grep { $_ } #array will always give you every element in the array.
At a basic level - you need to:
while ( my $line = <> ) {
chomp $line;
if ( grep { /^$line$/ } #todd ) {
#do something
}
}
But I'd suggest instead that you might want to consider building a hash of your lines instead:
open( my $input, '<', "text.txt" ) or die $!;
my %in_todd = map { $_ => 1 } <$input>;
close $input;
while (<>) {
print if $in_todd{$_};
}
Note - you might want to watch for trailing linefeeds.

Perl : Need to append two columns if the ID's are repeating

If id gets repeated I am appending app1, app2 and printing it once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My Code:
#! usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos=0;
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;
my $line = readline(FH);
unless ($counter) {
chomp $line;
print OUTFILE $line;
print OUTFILE "\n";
}
while ($line = readline(FH)) {
chomp $line;
my #obj = split('\|',$line);
if($RepeatNumber{$obj[0]}++) {
my $str1= join("|",$obj[0]);
my $str2=join(",",$obj[2],$obj[3]);
print OUTFILE join("|",$str1,$str2);
print OUTFILE "\n";
}
}
This should do the trick:
use strict;
use warnings;
my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line
my %out;
while (<FF>)
{
my ($id, $Name, $app1, $app2) = split /\|/, $_;
$out{$id}[0] = $Name;
push #{$out{$id}[1]}, $app1;
push #{$out{$id}[2]}, $app2;
}
foreach my $key (keys %out)
{
print $key, "|", $out{$key}[0], "|", join (",", #{$out{$key}[1]}), "|", join (",", #{$out{$key}[2]}), "\n";
}
EDIT
To see what the %out contains (in case it's not clear), you can use
use Data::Dumper;
and print it via
print Dumper(%out);
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;
my %stuff;
#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (#header) later.
my ( $id, #header ) = split( /\|/, <DATA> =~ s/\n//r );
while (<DATA>) {
#turn this row into a hash of key-values.
my %row;
( $id, #row{#header} ) = split(/\|/);
#print for diag
print Dumper \%row;
#iterate each key, and insert into $row.
foreach my $key ( keys %row ) {
push( #{ $stuff{$id}{$key} }, $row{$key} );
}
}
#print for diag
print Dumper \%stuff;
print join ("|", "id", #header ),"\n";
#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
#join this record by '|'.
print join('|',
$id,
#turn inner arrays into comma separated via map.
map {
my %seen;
#use grep to remove dupes - e.g. "abc,abc" -> "abc"
join( ",", grep !$seen{$_}++, #$_ )
} #{ $stuff{$id} }{#header}
),
"\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitary numbers of duplicates. I'll coalesce them though - so the two abc entries don't end up abc,abc.
Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
Another way of doing it which doesn't use a hash (in case you want to be more memory efficient), my contribution lies under the opens:
#!/usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));
my #lastobj = (undef);
foreach my $obj (sort {$a->[0] <=> $b->[0]}
map {chomp;[split('|')]} <FH>) {
if(defined($lastobj[0]) &&
$obj[0] eq $lastobj[0])
{#lastobj = (#obj[0..1],
$lastobj[2].','.$obj[2],
$lastobj[3].','.$obj[3])}
else
{
if($lastobj[0] ne '')
{print(join('|',#lastobj),"|\n")}
#lastobj = #obj[0..3];
}
}
print(join('|',#lastobj),"|\n");
Note that split, without it's third argument ignores empty elements, which is why you have to add the last bar. If you don't do a chomp, you won't need to supply the bar or the trailing hard return, but you would have to record $obj[4].

Split a line on every 16th comma

I am using perl to extract "Yes," or "No," from a large CSV, and output to a file using this code
open my $fin, "leads.csv";
my $str;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
}
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
This is working correctly, and outputting data like this http://pastebin.com/r7Lwwz8p, however I need to break
to a new line after the 16th element so it looks like this on output: http://pastebin.com/xC8Lyk5R
Any tips/tricks greatly appreciated!
The following splits a line by commas, and then regroups them by 16 elements:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my #fields = split ',', $line;
while (my #data = splice #fields, 0, 16) {
print join(',', #data), "\n";
}
}
__DATA__
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineH,2,3,4,5,6,7,8,9,10,11,12
Outputs:
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineH,2,3,4,5,6,7,8,9,10,11,12
Use a variable to count the number of yes/no matches that you find, and then use the mod (%) operator to insert a newline into the string.
#!/usr/bin/perl
use strict;
use warnings;
open my $fin, "leads.csv";
my $str;
my $count = 0;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
$count++;
}
$str .= "\n" unless ($count % 16);
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);

String manipulation in Perl

I am trying to do string manipulation in perl like so.
/q1.pl asad566sads asad575sads
then this prints
asad566sads
asad567sads
asad568sads
...
...
asad575sads
I am thinking somehow separate the string, then join it back up again?
#!/usr/bin/perl -w
if (#ARGV != 2){
print "error\n";
}else{
my $numb1 = $ARGV[0];
my $numb2 = $ARGV[1];
$numb1 =~ s/(\d*)\D/$1/g;
$numb2 =~ s/(\d*)\D/$1/g;
print "$numb1 \n";
print "$numb2 \n";
for ($i=$numb1; $i<$numb2; $i++){
#my $numb_2_print;
my $string_to_print = $ARGV[0];
$string_to_print =~ s/(\D*)\d(\D*)/$1$i$2/g;
print "$string_to_print\n";
}
}
EDIT: assume numbers on appear once in string for this question. sorry about that
#!/usr/bin/perl -w
use strict;
use warnings;
die "Invalid number of parameters" if #ARGV != 2;
my ($pref1, $num1, $suf1) = $ARGV[0] =~ /^(\D*)(\d+)(\D*)$/ or die "Invalid Parameter";
my ($pref2, $num2, $suf2) = $ARGV[1] =~ /^(\D*)(\d+)(\D*)$/ or die "Invalid Parameter";
die "Prefixes don't match" if ($pref1 ne $pref2);
die "Suffixes don't match" if ($suf1 ne $suf2);
print "$pref1$_$suf1\n" for ($num1..$num2);
Addendum: If you care about the numbers being the same length, you can use this printf statement instead
printf "%s%0" . length($num2) . "d%s\n", $pref1, $_, $suf1 for ($num1..$num2);
use Algorithm::Loops qw( NestedLoops );
my $s = 'a1a1';
my $e = 'a2a2';
my #s_parts = $s =~ /(\d+|\D+)/g;
my #e_parts = $e =~ /(\d+|\D+)/g;
die if #s_parts != #e_parts;
my #loops;
for my $i (0..$#s_parts) {
if ($s_parts[$i] =~ /^\d/) {
die if $s_parts[$i] > $e_parts[$i];
push #loops, [ $s_parts[$i] .. $e_parts[$i] ];
} else {
die if $s_parts[$i] ne $e_parts[$i];
push #loops, [ $s_parts[$i] ];
}
}
NestedLoops(\#loops, sub {
print(#_, "\n");
});
a1a1
a1a2
a2a1
a2a2
The range operator is your friend (http://perldoc.perl.org/perlop.html#Range-Operators):
#!/usr/bin/perl
use strict;
use warnings;
my ($start, $end) = #ARGV;
$start =~ s/([a-zA-Z]+)$//;
$end =~ s/([a-zA-Z]+)$//;
my $trailing = $1;
foreach ( $start..$end ) {
print "$_$trailing\n";
}
__END__
asad566sads
asad567sads
asad568sads
asad569sads
asad570sads
asad571sads
asad572sads
asad573sads
asad574sads
asad575sads

How many different ways are there to concatenate two files line by line using Perl?

Suppose file1 looks like this:
bye bye
hello
thank you
And file2 looks like this:
chao
hola
gracias
The desired output is this:
bye bye chao
hello hola
thank you gracias
I myself have already come up with five different approaches to solve this problem. But I think there must be more ways, probably more concise and more elegant ways, and I hope I can learn more cool stuff :)
The following is what I have tried so far, based on what I've learnt from the many solutions of my previous problems. Also, I'm trying to sort of digest or internalize the knowledge I've acquired from the Llama book.
Code 1:
#!perl
use autodie;
use warnings;
use strict;
open my $file1,'<','c:/file1.txt';
open my $file2,'<','c:/file2.txt';
while(defined(my $line1 = <$file1>)
and defined(my $line2 = <$file2>)){
die "Files are different sizes!\n" unless eof(file1) == eof(file2);
$line1 .= $line2;
$line1 =~ s/\n/ /;
print "$line1 \n";
}
Code 2:
#!perl
use autodie;
use warnings;
use strict;
open my $file1,'<','c:/file1.txt';
my #file1 = <$file1>;
open my $file2,'<','c:/file2.txt';
my #file2 =<$file2>;
for (my $n=0; $n<=$#file1; $n++) {
$file1[$n] .=$file2[$n];
$file1[$n]=~s/\n/ /;
print $file1[$n];
}
Code 3:
#!perl
use autodie;
use warnings;
use strict;
open my $file1,'<','c:/file1.txt';
open my $file2,'<','c:/file2.txt';
my %hash;
while(defined(my $line1 = <$file1>)
and defined(my $line2 = <$file2>)) {
chomp $line1;
chomp $line2;
my ($key, $val) = ($line1,$line2);
$hash{$key} = $val;
}
print map { "$_ $hash{$_}\n" } sort keys %hash;
Code 4:
#!perl
use autodie;
use warnings;
use strict;
open my $file1,'<','c:/file1.txt';
open my $file2,'<','c:/file2.txt';
while(defined(my $line1 = <$file1>)
and defined(my $line2 = <$file2>)) {
$line1 =~ s/(.+)/$1 $line2/;
print $line1;
}
Code 5:
#!perl
use autodie;
use warnings;
use strict;
open my $file1,'<','c:/file1.txt';
my #file1 =<$file1>;
open my $file2,'<','c:/file2.txt';
my #file2 =<$file2>;
while ((#file1) && (#file2)){
my $m = shift (#file1);
chomp($m);
my $n = shift (#file2);
chomp($n);
$m .=" ".$n;
print "$m \n";
}
I have tried something like this:
foreach $file1 (#file2) && foreach $file2 (#file2) {...}
But Perl gave me a syntactic error warning. I was frustrated. But can we run two foreach loops simultaneously?
Thanks, as always, for any comments, suggestions and of course the generous code sharing :)
This works for any number of files:
use strict;
use warnings;
use autodie;
my #handles = map { open my $h, '<', $_; $h } #ARGV;
while (#handles){
#handles = grep { ! eof $_ } #handles;
my #lines = map { my $v = <$_>; chomp $v; $v } #handles;
print join(' ', #lines), "\n";
}
close $_ for #handles;
The most elegant way doesn't involve perl at all:
paste -d' ' file1 file2
If I were a golfing man, I could rewrite #FM's answer as:
($,,$\)=(' ',"\n");#_=#ARGV;open $_,$_ for #_;print
map{chomp($a=<$_>);$a} #_=grep{!eof $_} #_ while #_
which you might be able to turn into a one-liner but that is just evil. ;-)
Well, here it is, under 100 characters:
C:\Temp> perl -le "$,=' ';#_=#ARGV;open $_,$_ for #_;print map{chomp($a =<$_>);$a} #_=grep{!eof $_ }#_ while #_" file1 file2
If it is OK to slurp (and why the heck not — we are looking for different ways), I think I have discovered the path the insanity:
#_=#ARGV;chomp($x[$.-1]{$ARGV}=$_) && eof
and $.=0 while<>;print "#$_{#_}\n" for #x
C:\Temp> perl -e "#_=#ARGV;chomp($x[$.-1]{$ARGV}=$_) && eof and $.=0 while<>;print qq{#$_{#_}\n} for #x" file1 file2
Output:
bye bye chao
hello hola
thank you gracias
An easier alternative to your Code 5 which allows for an arbitrary number of lines and does not care if files have different numbers of lines (hat tip #FM):
#!/usr/bin/perl
use strict; use warnings;
use File::Slurp;
use List::AllUtils qw( each_arrayref );
my #lines = map [ read_file $_ ], #ARGV;
my $it = each_arrayref #lines;
while ( my #lines = grep { defined and chomp and length } $it->() ) {
print join(' ', #lines), "\n";
}
And, without using any external modules:
#!perl
use autodie; use warnings; use strict;
my ($file1, $file2) = #ARGV;
open my $file1_h,'<', $file1;
my #file1 = grep { chomp; length } <$file1_h>;
open my $file2_h,'<', $file2;
my #file2 = grep { chomp; length } <$file2_h>;
my $n_lines = #file1 > #file2 ? #file1 : #file2;
for my $i (0 .. $n_lines - 1) {
my ($line1, $line2) = map {
defined $_ ? $_ : ''
} $file1[$i], $file2[$i];
print $line1, ' ', $line2, "\n";
}
If you want to concatenate only the lines that appear in both files:
#!perl
use autodie; use warnings; use strict;
my ($file1, $file2) = #ARGV;
open my $file1_h,'<', $file1;
my #file1 = grep { chomp; length } <$file1_h>;
open my $file2_h,'<', $file2;
my #file2 = grep { chomp; length } <$file2_h>;
my $n_lines = #file1 < #file2 ? #file1 : #file2;
for my $i (0 .. $n_lines - 1) {
print $file1[$i], ' ', $file2[$i], "\n";
}
An easy one with minimal error checking:
#!/usr/bin/perl -w
use strict;
open FILE1, '<file1.txt';
open FILE2, '<file2.txt';
while (defined(my $one = <FILE1>) or defined(my $twotemp = <FILE2>)){
my $two = $twotemp ? $twotemp : <FILE2>;
chomp $one if ($one);
chomp $two if ($two);
print ''.($one ? "$one " : '').($two ? $two : '')."\n";
}
And no, you can't run two loops simultaneous within the same thread, you'd have to fork, but that would not be guaranteed to run synchronously.