Perl Regex to find all strings formed using substring - perl

Using Perl RegEx, How to find largest superstring in a sentence, when superstring is a repeatation of 1 or more substring.
For Ex:
$sentence = "zsabcxyzabcabcabccde_xdrabcabcrte__23abcerabcabccbabacxyz";
$subStr = "abc";
I want to find all the occurrences of abc and largest one in that.
Output:
abc
abcabcabc
abcabc
abc
abcabc
Largest string is abcabcabc

Compile a regex using a quantifier. + says 'one or more'.
So your "substr" becomes ((?:abc)+) the outer brackets to 'capture', the inner brackets non capturing. Otherwise you'll also get the 'partial' hits in the array - although the net result isn't changed much, because the longest hit will still sort to the top.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $sentence = "zsabcxyzabcabcabccde_xdrabcabcrte__23abcerabcabccbabacxyz";
my $substring = "abc";
my $regex = qr/((?:$substring)+)/;
my #matches = $sentence =~ m/$regex/g;
print Dumper \#matches;
#Then sort it:
my ( $longest ) = sort { length ( $b ) <=> length ( $a ) } #matches;
print $longest,"\n";

Try as below one
use warnings;
use strict;
my $sentence = "zsabcxyzabcabcabccde_xdrabcabcrte__23abcerabcabccbabacxyz";
my ($larg) = sort{length($b)<=>length($a)} $sentence =~ m/((?:abc)+)/g;
print $larg,"\n";
If don't want to store it means, make a loop
use warnings;
use strict;
my $sentence = "zsabcxyzabcabcabccde_xdrabcabcrte__23abcerabcabccbabacxyzabcabcabcabc";
my $longstr;
my $len = 0;
while($sentence=~m/((?:abc)+)/g)
{
$longstr = $1 and $len = length($1) if(length($1) > $len)
}
the above one is in single regex with (?{}) but not recommended
my $sentence = "zsabcxyzabcabcabccde_xdrabcabcrte__23abcerabcabccbabacxyzabcabcabcabc";
my $lar = 0;
my $larg;
$sentence=~m/((?:abc)+)(?{ $larg = $1 and $lar=(length $1) if(length $1 > $lar )}) \G/x;

Thanks to #mkHun and #Sobrique...
I have used #matches = $str =~ m/(abc)+/ng; and then sorting.
/n makes it looks much simpler which is available from 5.22, i think.

The best way is next: ((?:abc)+) with sorting after finding matches

Related

How to fix the error of "Use of unitialized value in addition..." in perl script?

Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (#ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my #a = ();
my $x = length $a;
#a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( #a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it will generate the correct results, however it also presents an error of "Use of unitialized value in addition (+) at molecular_weight.pl line36", which is just localized in line of "$b += $data{$i};" how to fix this bug ? Thanks in advance !
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)

Count Characters in Perl

I need to count the letter "e" in the string
$x="012ei ke ek ek ";
So far, I've tried with a for-loop:
$l=length($x);
$a=0;
for($i=0;$i<$l;$i++)
{$s=substr($x,$i,1);
if($s=="e")
{$a++;}
print $a;
Your code has some problems. You forgot to close the for loop brace,
and in Perl == is supposed to compare numbers. Use eq for strings.
It is also recommended that you use warnings and enable strict mode,
which would have helped you debugging this. In your case, since e
would be treated as 0, so the other one char substrings, 1 and 2
would be the only characters not equal to e when compared with ==. A
cleaned up version of your code could be written as:
use warnings;
use strict;
my $x = "012ei ke ek ek ";
my $l = length $x;
my $count = 0;
for(my $i = 0; $i < $l; $i++) {
my $s = substr($x, $i, 1);
$count++ if ($s eq "e");
}
print $count;
There are multiple ways to achieve this. You could use a match with a
group, which if global returns all the occurrences in list context.
Since you want the number, take this result in scalar context. You can
achieve this for example with:
my $count = () = $string =~ /(e)/g;
Or:
my $count = #{[ $string =~ /(e)/g ]}
Another way is to split the string into characters and grep those that
are e:
my $count = grep $_ eq 'e', split //, $string;
And probably the most compact is to use tr which returns the count of
characters in scalar context, although this does restrict this usage to
counting characters only:
my $count = $string =~ tr/e//;
You compare characters with the numeric operator (==) when you should use the string comparison eq. If you had used the warnings pragma you would have seen that.
You code should have looked like:
#!/usr/bin/env perl
use strict;
use warnings;
my $x = "012ei ke ek ek ";
my $l = length($x);
my $a = 0;
for ( my $i = 0; $i < $l; $i++ ) {
my $s = substr( $x, $i, 1 );
if ( $s eq "e" ) {
$a++;
}
}
print "$a\n";
Proper indentation and the use of the strict and warnings pragmas will avoid and/or catch unintentional, dumb errors.
A much more Perl-ish (and shorter) way to achieve your answer is:
perl -le '$x="012ei ke ek ek";#count=$x=~m/e/g;print scalar #count'
4
This matches globally and collects all the matches in list context. The scalar value of the list gives the number of occurrences you seek.
Another way is to use tr
perl -le '$x="012ei ke ek ek";print scalar $x=~tr/e//'
4
#sidyll Already mentioned what is the problem in your script and all of the possible ways, but TIMTOWTDI.
$x="012ei ke ek ek ";
my $count;
$count++ while($x=~/e/g);
print $count;

Perl int to array

I have a number stored in a Perl variable and I want to 'pass/convert/store' its digits in the different positions of an array. An example for a better sight:
I have, let's say, this number stored:
$hello = 429384
And I need a new array with the digits stored in it, so:
$hello2[0] = 4
$hello2[1] = 2
$hello2[2] = 9
Etc..
I can probably make it with a couple of loops, but I want to know if there is an efficient and fast way to do it. Thx in advance!
my #hello = split //, $hello;
In Perl if you use number in a string operator, the conversion is done automatically
$hello = 429384;
#hello = split //, $hello;
print $hello[0];
Using only Regex and without using any inbuilt function:
#!/usr/bin/perl
use strict;
use warnings;
my $string=429384;
my #numbers = $string =~ /./g; # dot matches a single character at a time
#and returns it
print "#numbers \n";
this is significantly faster than the regexp way:
$string = '1234567890';
$_-=48 for #digits = unpack 'C*',$string;
benchmark:
use Time::HiRes;
$string = '1234567890';
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
$_-=48 for #digits= unpack 'C*',$string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
#digits = split //, $string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
output:
0.265814
0.314735

Trying to find the index of the first number in a string using perl

I'm trying to find the index of the first occurrence of a number from 0-9.
Let's say that:
$myString = "ABDFSASF9fjdkasljfdkl1"
I want to find the position where 9 is.
I've tried this:
print index($myString,[0-9]);
And:
print index($myString,\d);
Use regex Positional Information:
use strict;
use warnings;
my $myString = "ABDFSASF9fjdkasljfdkl1";
if ($myString =~ /\d/) {
print $-[0];
}
Outputs:
8
You can try even below perl code:
use strict;
use warnings;
my $String = "ABDFSASF9fjdkasljfdkl11";
if($String =~ /(\d)/)
{
print "Prematch string of first number $1 is $`\n";
print "Index of first number $1 is " . length($`);
}
You can try this:
perl -e '$string="ABDFSASF9fjdkasljfdkl1";#array=split(//,$string);for $i (0..$#array) {if($array[$i]=~/\d/){print $i;last;}}'

finding the substring present in string and also count the number of occurrences

Could anyone tel me what is the mistake? As the program is for finding the substrings in a given string and count there number of occurrences for those substrings. but the substring must check the occurrences for every three alphabets.
for eg: String: AGAUUUAGA (i.e. for AGA, UUU, AGA)
output: AGA-2
UUU-1
print"Enter the mRNA Sequence\n";
$count=0;
$count1=0;
$seq=<>;
chomp($seq);
$p='';
$ln=length($seq);
$j=$ln/3;
for($i=0,$k=0;$i<$ln,$k<$j;$k++) {
$fra[$k]=substr($seq,$i,3);
$i=$i+3;
if({$fra[$k]} eq AGA) {
$count++;
print"The number of AGA is $count";
} elseif({$fra[$k]} eq UUU) {
$count1++;
print" The number of UUU is $count1";
}
}
This is a Perl FAQ:
perldoc -q count
This code will count the occurrences of your 2 strings:
use warnings;
use strict;
my $seq = 'AGAUUUAGA';
my $aga_cnt = () = $seq =~ /AGA/g;
my $uuu_cnt = () = $seq =~ /UUU/g;
print "The number of AGA is $aga_cnt\n";
print "The number of UUU is $uuu_cnt\n";
__END__
The number of AGA is 2
The number of UUU is 1
If you use strict and warnings, you will get many messages pointing out errors in your code.
Here is another approach which is more scalable:
use warnings;
use strict;
use Data::Dumper;
my $seq = 'AGAUUUAGA';
my %counts;
for my $key (qw(AGA UUU)) {
$counts{$key} = () = $seq =~ /$key/g;
}
print Dumper(\%counts);
__END__
$VAR1 = {
'AGA' => 2,
'UUU' => 1
};
Have a try with this, that avoids overlaps:
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Data::Dumper;
my $str = q!AGAUUUAGAGAAGAG!;
my #list = $str =~ /(...)/g;
my ($AGA, $UUU);
foreach(#list) {
$AGA++ if $_ eq 'AGA';
$UUU++ if $_ eq 'UUU';
}
say "number of AGA is $AGA and number of UUU is $UUU";
output:
number of AGA is 2 and number of UUU is 1
This is an example of how quickly you can get things done in Perl. Grouping the strands together as a alternation is one way to make sure there is no overlap. Also a hash is a great way to count occurrences of they key.
$values{$_}++ foreach $seq =~ /(AGA|UUU)/g;
print "AGA-$values{AGA} UUU-$values{UUU}\n";
However, I generally want to generalize it to something like this, thinking that this might not be the only time you have to do something like this.
use strict;
use warnings;
use English qw<$LIST_SEPARATOR>;
my %values;
my #spans = qw<AGA UUU>;
my $split_regex
= do { local $LIST_SEPARATOR = '|';
qr/(#spans)/
}
;
$values{$_}++ foreach $seq =~ /$split_regex/g;
print join( ' ', map { "$_-$values{$_}" } #spans ), "\n";
Your not clear on how many "AGA" the string "AGAGAGA" contains.
If 2,
my $aga = () = $seq =~ /AGA/g;
my $uuu = () = $seq =~ /UUU/g;
If 3,
my $aga = () = $seq =~ /A(?=GA)/g;
my $uuu = () = $seq =~ /U(?=UU)/g;
If I understand you correctly (and certainly that is questionable; almost every answer so far is interpreting your question differently than every other answer):
my %substring;
$substring{$1}++ while $seq =~ /(...)/;
print "There are $substring{UUU} UUU's and $substring{AGA} AGA's\n";