How does this searching work in Perl? - perl

my ($len, $longest) =0;
length > $len and ($longest, $len)=($_, length) for #matches;
#matches stores some substrings. This code catches the longest substring from #matches, then stores it in $longest.
Source code:
#!usr/bin/perl
use strict;
use Data::Dumper;
my $needle = "axibidm";
my $haystack = "axididm";
my #matches;
for my $start (0..length $needle) {
for my $len (1 .. ((length $needle)-$start)) {
my $substr = substr($needle, $start, $len);
push #matches, $haystack =~ m[($substr)]g;
print "$substr\t";
print "#matches\t\n";
}
}
my ($len, $longest) = 0;
length > $len and ($longest, $len) = ($_, length) for #matches;
print "The longest common substring between\n", $needle, "\nand\n", $haystack, "\nis '$longest'\n";

Someone was being far too clever, or they made a typo. Or both. Probably both.
There's some things going on with that code which don't do what it appears to be doing. This does not initialize both variables to zero.
my ($len, $longest) = 0;
Its a deceptive way to write this.
my $len = 0;
my $longest;
for $matches is silly, there's only one thing to iterate over so why use a loop? This idiom is very occasionally used to put the value into $_ and use it in various default constructs, but that's not much use here.
Next, the following is a very tortured way to write do this if that. It's written that way to make it a single expression which will work within the for loop statement modifier.
length > $len and ($longest, $len)=($_, length)
It is much better written like this.
if( length > $len ) {
$longest = $_;
$len = length;
}
Expanding it out, and removing the useless for loop, we get...
my $len = 0;
my $longest;
if( length $match > $len ) {
$longest = $match;
$len = length $match;
}
The other option is that $matches is an array references and they meant for #$matches. for $matches would still "work" but it would always return a length of 21 because an array reference stringifies as something like ARRAY(0x7fc07c800468).

That looks almost deliberately obfuscated. Here's a more verbose expression of the same logic.
my $len = 0;
my $longest;
foreach my $match (#matches) {
if (length($match) > $len) {
$longest = $match;
$len = length($match);
}
}
So let's compare.
my ($len, $longest) = 0;
This declares the two lexical (my) variables $len and $longest, and sets the first one ($len) to 0, leaving $longest at its default value of undef.
This structure:
(code goes here) for #matches;
is the same as this:
for (#matches) {
(code goes here)
}
So we're iterating over the #matches array and running the code once per element. Within the body of the code, the special variable $_ will hold the current element.
length > $len and ($longest, $len) = ($_, length);
First, (expression) and (code) is a shorthand way of writing if ( (expression) ) { (code) }. It works because in Perl, and is evaluated left-to-right in a short-circuiting fashion. That is, if the left side expression is false, Perl doesn't bother evaluating the right side, since its value doesn't matter; false and anything is false.
When length is called without an argument, it means length($_), so that's the length of the current element of #matches that is being examined.
And ($var1, $var2) = ($val1, $val2) is parallel assignment that sets $var1 to $val1 and $var2 to $val2.

EXPR for LIST;
is roughly the same as
for (LIST) { EXPR; }
EXPR1 and EXPR2;
is roughly the same as
if (EXPR1) { EXPR2; }
(This is not a generally accepted practice, except when EXPR2 is a flow control expression (next, die, etc).)
length defaults to using $_ as its argument (length($_)).
( $x, $y ) = ( EXPR1, EXPR2 )
is roughly the same as
$x = EXPR1;
$y = EXPR2;
(One notable difference is that you can do ($x,$y)=($y,$x) to swap values, but that's used here.)
(Using a list assignment when two scalar assignments would do is also not a generally accepted practice.)
A more conventional way of writing the code would be:
my $len = 0;
my $longest;
for (#matches) {
if (length($_) > $len) {
$longest = $_;
$len = length($_);
}
}

Related

Can I use the size of an array without having to place it in a variable

Currently I am doing this to read the individual contents of an array
my $size = #words;
for(my $x = 0; $x < $size, $x++)
{
print $words[$x];
}
Is there away to skip the $size assignment? A way to cast the array and have one less line?
i.e.
for(my $x = 0; $x < $(#word), $x++)
{
print $words[$x];
}
Can't seem to find the right syntax.
Thanks
Replace
for (my $i = 0; $i < $(#words), $i++) { ... $words[$i] ... }
with
for (my $i = 0; $i < #words; $i++) { ... $words[$i] ... }
Just like in your assignment, an array evaluated in scalar context produces its size.
That said, using a C-style loop is complex and wasteful.
A better solution if you need the index:
for my $i (0..$#words) { ... $words[$i] ... }
A better solution if you don't need the index:
for my $word (#words) { ... $word ... }
Yes, for has a built in array iterator and for and foreach are synonyms.
for my $word (#words) {
print $word;
}
This is the preferred way to iterate through arrays in Perl. C style 3 statement for-loops are discouraged unless necessary. They're harder to read and lead to bugs, like this one.
for(my $x = 0; $x < $size, $x++)
^
should be a ;
Better to use foreach, but to your specific question, #foo in scalar context resolves to the length of the array, and $#foo resolves to the index of the last element:
foreach my $word (#words) { ... } # preferred
for(my $i = 0; $i < #words; ++$i) { my $word = $words[$i]; ... } # ok sometimes
for(my $i = 0; $i <= $#words; ++$i) { my $word = $words[$i]; ... } # same thing
(assuming that you haven't played with $[, which you shouldn't do.)
The syntax that you are searching for is actually no syntax at all. If you use an array variable anywhere where Perl knows you should be using a scalar value (like as an operand to a comparison operator) then Perl gives you the number of elements in the array.
So, based on your example, this will work:
# Note: I've corrected a syntax error here.
# I replaced a comma with a semicolon
for (my $x = 0; $x < #words; $x++)
{
print $words[$x];
}
But there are several ways that we can improve this. Firstly, let's get rid of the ugly and potentially confusing C-style for loop and replace it with a far easier to understand foreach.
foreach my $x (0 .. #words - 1)
{
print $words[$x];
}
We can also improve on that #words - 1. Instead, we can use $#words which gives the final index in the array #words.
foreach my $x (0 .. $#words)
{
print $words[$x];
}
Finally, we don't really need the index number here as we're just using it to access each element of the array in turn. Far better to iterate over the elements of the array rather than the indexes.
foreach my $element (#words)
{
print $element;
}

Count Characters in Perl

I need to count the letter "e" in the string
$x="012ei ke ek ek ";
So far, I've tried with a for-loop:
$l=length($x);
$a=0;
for($i=0;$i<$l;$i++)
{$s=substr($x,$i,1);
if($s=="e")
{$a++;}
print $a;
Your code has some problems. You forgot to close the for loop brace,
and in Perl == is supposed to compare numbers. Use eq for strings.
It is also recommended that you use warnings and enable strict mode,
which would have helped you debugging this. In your case, since e
would be treated as 0, so the other one char substrings, 1 and 2
would be the only characters not equal to e when compared with ==. A
cleaned up version of your code could be written as:
use warnings;
use strict;
my $x = "012ei ke ek ek ";
my $l = length $x;
my $count = 0;
for(my $i = 0; $i < $l; $i++) {
my $s = substr($x, $i, 1);
$count++ if ($s eq "e");
}
print $count;
There are multiple ways to achieve this. You could use a match with a
group, which if global returns all the occurrences in list context.
Since you want the number, take this result in scalar context. You can
achieve this for example with:
my $count = () = $string =~ /(e)/g;
Or:
my $count = #{[ $string =~ /(e)/g ]}
Another way is to split the string into characters and grep those that
are e:
my $count = grep $_ eq 'e', split //, $string;
And probably the most compact is to use tr which returns the count of
characters in scalar context, although this does restrict this usage to
counting characters only:
my $count = $string =~ tr/e//;
You compare characters with the numeric operator (==) when you should use the string comparison eq. If you had used the warnings pragma you would have seen that.
You code should have looked like:
#!/usr/bin/env perl
use strict;
use warnings;
my $x = "012ei ke ek ek ";
my $l = length($x);
my $a = 0;
for ( my $i = 0; $i < $l; $i++ ) {
my $s = substr( $x, $i, 1 );
if ( $s eq "e" ) {
$a++;
}
}
print "$a\n";
Proper indentation and the use of the strict and warnings pragmas will avoid and/or catch unintentional, dumb errors.
A much more Perl-ish (and shorter) way to achieve your answer is:
perl -le '$x="012ei ke ek ek";#count=$x=~m/e/g;print scalar #count'
4
This matches globally and collects all the matches in list context. The scalar value of the list gives the number of occurrences you seek.
Another way is to use tr
perl -le '$x="012ei ke ek ek";print scalar $x=~tr/e//'
4
#sidyll Already mentioned what is the problem in your script and all of the possible ways, but TIMTOWTDI.
$x="012ei ke ek ek ";
my $count;
$count++ while($x=~/e/g);
print $count;

Subkey comparison function for sorting

I need a Perl comparison function that can be used with sort.
Each key is a text string that has an arbitrary number of subkeys, separated by delimiter characters (dot, colon, space, and slash). Some subkeys are numeric, and need to be sorted numerically. The key format and number of subkeys varies. Therefore, the comparison has to handle one key being longer than the other, and has to handle the case where a subkey is numeric in one key but not in another (in which case a textual comparison is appropriate for that subkey).
This works, but I bet there are better solutions:
use warnings;
use strict;
use Scalar::Util qw[looks_like_number];
sub hier_cmp {
my $aa = $a;
my $bb = $b;
# convert all delims (. : / space) to the same delim
$aa =~ tr/.:\/ /::::/;
$bb =~ tr/.:\/ /::::/;
my #lista = split(":", $aa);
my #listb = split(":", $bb);
my $result;
for my $ix (0 .. min($#lista, $#listb)) {
if (exists($lista[$ix]) && exists($listb[$ix])) {
if ( looks_like_number($lista[$ix]) && looks_like_number($listb[$ix])) {
# compare numerically
$result = ($lista[$ix] <=> $listb[$ix]);
} else {
# compare as strings
$result = ($lista[$ix] cmp $listb[$ix]);
}
if ($result == 0) {
next;
}
return $result;
} elsif (exists($lista[$ix])) {
return 1;
} else {
return -1;
}
}
}
For my purposes, readability is more important than speed. This is just for an internal tool, and lists will rarely have more than hundreds of elements. However, any opportunity to learn something is good.
As you can see, I'm not a perl wizard. Even trivial improvements on my code would be appreciated.
Thanks!
That looks like natural sorting. There are several modules on CPAN that already do that such as Sort::Naturally or Sort::Key::Natural.
For instance:
use Sort::Key::Natural qw(natsort);
my #sorted = natsort #data;
It would help if you gave us some data to test with, but this code passes a few basic tests and it looks right.
It simplifies the problem by using the List::MoreUtils function pairwise to create an array of field pairs.
Then it is just a matter of checking whether only one is defined, when one of the lists has come to an end before the other and should be sorted first; if they are both numeric, when they should be compared with a numeric comparison; or otherwise simply compare them as strings.
If the end of the array of pairs is reached then everything has matched and zero is returned to indicate equiality.
Update
I have changed this code to remove the dependency on List::MoreUtils::pairwise.
use strict;
use warnings;
use Scalar::Util 'looks_like_number';
sub hier_cmp {
our ($a, $b);
my #a = split m|[.: /]+|, $a;
my #b = split m|[.: /]+|, $b;
for my $i (0 .. $#a > $#b ? $#a : $#b) {
my #ab = ( $a[$i], $b[$i] );
if (grep defined, #ab < 2) {
return defined $ab[0] ? 1 : -1;
}
else {
my $numeric = grep(looks_like_number($_), #ab) == 2;
my $result = $numeric ? $ab[0] <=> $ab[1] : $ab[0] cmp $ab[1];
return $result if $result;
}
}
return 0;
}

Saving a transliteration table in perl

I want to transliterate digits from 1 - 8 with 0 but not knowing the number at compile time. Since transliterations do not interpolate variables I'm doing this:
#trs = (sub{die},sub{${$_[0]} =~ tr/[0,1]/[1,0]/},sub{${$_[0]} =~ tr/[0,2]/[2,0]/},sub{${$_[0]} =~ tr/[0,3]/[3,0]/},sub{${$_[0]} =~ tr/[0,4]/[4,0]/},sub{${$_[0]} =~ tr/[0,5]/[5,0]/},sub{${$_[0]} =~ tr/[0,6]/[6,0]/},sub{${$_[0]} =~ tr/[0,7]/[7,0]/},sub{${$_[0]} =~ tr/[0,8]/[8,0]/});
and then index it like:
$trs[$character_to_transliterate](\$var_to_change);
I would appreciate if anyone can point me to a best looking solution.
Any time that you are repeating yourself, you should see if what you are doing can be done in a loop. Since tr creates its tables at compile time, you can use eval to access the compiler at runtime:
my #trs = (sub {die}, map {eval "sub {\$_[0] =~ tr/${_}0/0$_/}"} 1 .. 8);
my $x = 123;
$trs[2]($x);
print "$x\n"; # 103
There is also no need to use references here, subroutine arguments are already passed by reference.
If you do not want to use string eval, you need to use a construct that supports runtime modification. For that you can use the s/// operator:
sub subst {$_[0] =~ s/($_[1]|0)/$1 ? 0 : $_[1]/ge}
my $z = 1230;
subst $z => 2;
print "$z\n"; # 1032
The tr/// construct is faster than s/// since the latter supports regular expressions.
I'd suggest simply ditching tr in favor of something that actually permits a little bit of metaprogramming like s///. For example:
# Replace $to_swap with 0 and 0 with $to_swap, and leave
# everything else alone.
sub swap_with_0 {
my ($digit, $to_swap) = #_;
if ($digit == $to_swap) {
return 0;
} elsif ($digit == 0) {
return $to_swap;
} else {
return $digit;
}
}
# Swap 0 and $to_swap throughout $string
sub swap_digits {
my ($string, $to_swap) = #_;
$string =~ s/([0$to_swap])/swap_with_0($1, $to_swap)/eg;
return $string;
}
which is surprisingly straightforward. :)
Here's a short subroutine that uses substitution instead of transliteration:
sub swap_digits {
my ($str, $digit) = #_;
$str =~ s{ (0) | $digit }{ defined $1 ? $digit : 0 }gex;
return $str;
}

How can I count overlapping substrings in Perl?

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows
sub countnmstr
{
$count =0;
$count++ while $_[0] =~ /$_[1]/g;
return $count;
}
$count = countnmstr("aaa","aa");
print "$count\n";
now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.
can anyone suggest how to implement such a function??
Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.
The Perl idiom for counting matches is then:
my $count = () = $_[0] =~ /($pattern)/g;
The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.
Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:
my $count = () = 'aaa' =~ /((?<=a)a)/g;
Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.
See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.
You can use positive lookahead as suggested by others, and write the function as:
sub countnmstr {
my ($haystack, $needle) = #_;
my ($first, $rest) = $needle =~ /^(.)(.*)$/;
return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}
You can also use pos to adjust where the next search picks up from:
#!/usr/bin/perl
use strict; use warnings;
sub countnmstr {
my ($haystack, $needle) = #_;
my $adj = length($needle) - 1;
die "Search string cannot be empty!" if $adj < 0;
my $count = 0;
while ( $haystack =~ /\Q$needle/g ) {
pos $haystack -= $adj;
$count += 1;
}
return $count;
}
print countnmstr("aaa","aa"), "\n";
Output:
C:\Temp> t
2
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}
$count = countnmstr("aaa","aa");
print "$count\n";
A few points:
//g in list context matches as many times as possible.
\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.
Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.
This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.
$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().
Update: s///g is faster, though not as fast as using index:
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( $string =~ s/(?=\Q$substr\E)//g );
}
You could use a lookahead assertion in the regular expression:
sub countnmstr {
my #matches = $_[0] =~ /(?=($_[1]))/g;
return scalar #matches;
}
I suspect Sinan's suggestion will be quicker though.
you can try this, no more regex than needed.
$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
$ind = index($haystack,$needle);
if ( $ind == -1 ) {last};
$haystack = substr($haystack,$ind+1);
$count++;
}
print "Total count: $count\n";
output
$ ./perl.pl
Total count: 4
If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.
use strict;
use warnings;
sub countnmstr_regex {
my ($haystack, $needle) = #_;
return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}
sub countnmstr_index {
my ($haystack, $needle) = #_;
my $i = 0;
my $tally = 0;
while (1){
$i = index($haystack, $needle, $i);
last if $i == -1;
$tally ++;
$i ++;
}
return $tally;
}
use Benchmark qw(cmpthese);
my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';
cmpthese( -2, {
countnmstr_regex => sub { countnmstr_regex($h, $n) },
countnmstr_index => sub { countnmstr_index($h, $n) },
} );
__END__
# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 93701/s -- -66%
countnmstr_index 271893/s 190% --
# Result using a large haystack ($size = 100).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 929/s -- -81%
countnmstr_index 4960/s 434% --