Differences beteween switch- and if-statements - perl

Do these two statements behave equally or could they yield different results?
if ( ... ) {...}
elsif( ... ) {... }
elsif( ... ) { ... }
else { ... }
.
given ( ... ) {
when ( ... ) { ... }
when ( ... ) { ... }
default { ... }
}
I've found the problem - with a modified ninth "when" it works now.
...
no warnings qw(numeric);
my $c = &getch();
given ( $c ) {
when ( $c == $KEY_LEFT and 1 > 0 ) { say 1; say $c }
when ( $c == $KEY_RIGHT ) { say 2; say $c }
when ( $c eq "\cH" or $c eq "\c?" ) { say 3; say $c }
when ( $c eq "\cC" ) { say 4; say $c }
when ( $c eq "\cX" or $c eq "\cD" ) { say 5; say $c }
when ( $c eq "\cA" ) { say 6; say $c }
when ( $c eq "\cE" ) { say 7; say $c }
when ( $c eq "\cL" ) { say 8; say $c }
when ( not( not $SpecialKey{$c} ) ) { say 9; say $c }
when ( ord( $c ) >= 32 ) { say 10; say $c }
default { say 11; say $c }
}
if ( $c == $KEY_LEFT and 1 > 0 ) { say 1; say $c }
elsif ( $c == $KEY_RIGHT ) { say 2; say $c }
elsif ( $c eq "\cH" or $c eq "\c?" ) { say 3; say $c }
elsif ( $c eq "\cC" ) { say 4; say $c }
elsif ( $c eq "\cX" or $c eq "\cD" ) { say 5; say $c }
elsif ( $c eq "\cA" ) { say 6; say $c }
elsif ( $c eq "\cE" ) { say 7; say $c }
elsif ( $c eq "\cL" ) { say 8; say $c }
elsif ( $SpecialKey{$c} ) { say 9; say $c }
elsif ( ord( $c ) >= 32 ) { say 10; say $c }
else { say 11; say $c }
close TTYIN;

Your supposedly "fixed" version now does different things in the two versions of the code. Checking if a key exists in a hash is completely different to checking whether the associated value is true.
There are three different truth values you can get from a hash - whether the key exists in the hash, whether the associated value is defined and whether the associated value is true or false. This code should demonstrate the difference between the three:
#!/usr/bin/perl
use strict;
use warnings;
my %hash = (
key1 => undef,
key2 => 0,
key3 => 1,
);
foreach (qw(key1 key2 key3 key4)) {
check_key($_);
}
sub check_key {
my $k = shift;
print "Key $k ";
if (exists $hash{$k}) {
print 'exists. ';
} else {
print "doesn't exist. ";
}
print 'Value ';
if (defined $hash{$k}) {
print 'is defined ';
} else {
print 'is not defined ';
}
print 'and is ';
if ($hash{$k}) {
print "true\n";
} else {
print "false\n";
}
}

Whatever you can do with the given/when, you can do with if/elsif/else. The idea is that when/given is suppose to be easier to read, and can automatically use smartmatching by default while you have to specify smart matching with the if/else statement.
I didn't parse through your code to make sure they're exact equivalents, but it looks like you've got more or less the right idea about if/elsif/else and given/when.
I never really understood the desire for what was referred to as the switch statement. It was something Perl coders always complained about -- the lack of a switch statement in Perl. Maybe it's a C thing that most Perl developers fondly remember. But I never found the if/elsif/else statements that bad.
What really confuses me is that when they finally implemented the switch statement, they didn't call it switch. Why the heck not?
And why say instead of printnl?
And, why last and next instead of break and continue. Why not simply use the standard names that other languages already use?
But enough of this Seinfeld style whining...
As davorg said, there's a big difference between a hash key that doesn't exist, a hash key that exists, but isn't defined, and a hash key that's defined, but gets evaluated to false:
For example:
use strict;
use warnings;
no warnings qw(uninitialized);
my %hash;
$hash{FOO} = "bar";
if (not exists($hash{BAR})) {
print "\$hash{FOO} doesn't exist\n";
}
if (not defined($hash{BAR})) {
print "\$hash{BAR} is undefined\n";
}
if (not $hash{BAR}) {
print "\$hash{BAR} evaluates to false\n";
}
if ($hash{BAR} eq undef) {
print "\$hash{BAR} is equal to 'undef'\n";
}
You can see that $hash{BAR} doesn't even exist as a key in the %hash hash, but it is also undefined, and it evaluates as false. And, you can also see that it also evaluates as undef (notice I had to set no warnings qw(uninitialized); to prevent it from complaining about $hash{BAR} being uninitialized in my last if statement).
However, if I do this:
$hash{FOO} = bar;
$hash{BAR} = undef;
The first if statement no longer evaluates as true because the key BAR now does exist in the hash, but the value is still undefined and still evaluates as false.
And, if I did this:
$hash{FOO} = bar;
$hash{BAR} = 0;
$hash{BAR} now exists as a key in %hash, and it is no longer undefined, but it still evaluates as false.
I like simply being able to say this:
if (not $hash{BAR}) {
because it is short and sweet, and probably does what I want. However, I do have to understand the difference between existence of a key in a hash, evaluation to false, and not being defined as three separate things. It can be important if you have a subroutine that could return a NULL string or zero value:
if (not foo($bar)) {
die qq(Error of some sort\n);
}
sub foo {
$bar = <FOO> or return;
return chomp($bar);
}
If there's a blank line in my file, it'll return a NULL string, but the subroutine will still return a defined value. The above probably does not do what I want.

Adding another answer as I've just realised what the real problem is.
Your "when ($SpecialKey{$c})" is equivalent to "if ($_ ~~ $SpecialKey{$c})". And as "given" has set $_ to $c that's the same as "if ($c ~~ $SpecialKey{$c})". When comparing two scalar values (and I assume that's what we've got here) the smart match operator looks for values that are numbers and uses "eq" or "==" as appropriate.
So your code is effectively equivalent to "if ($c == $SpecialKey{$c})". And that's completely different to "if ($SpecialKey{$c})".

They behave totally equal.

given / when has implicit smart matching:
Most of the power comes from implicit smart matching:
when($foo)
is exactly equivalent to
when($_ ~~ $foo)
I don't think if/else does that(?)
See perlsyn
EDIT:
Guess this doesn't really matter when using given/when the way OP is, but it still answers the question :)

Related

Advice on Perl map function

I'm having difficulty understanding the behaviour of the map function in the Perl script below
my #list = qw/A C T G/;
my #curr = (0, undef, undef, 1);
my #res = map { $list[ $_ ] } #curr;
print #res; #prints AAAC
The value of #res is AAAC. I had expected it be just AC as the 2 middle values of #curr are undef.
My understanding of map is that each item from the list becomes the value of $_ in turn. So I can understand how $list[0] returns A and how $list[1] returns C. I can't understand why/how the undef values have returned A? To my mind $list[undef] would not be a value?
I'm missing something obvious here probably, but would be really grateful for some help
The script that I take this code from is below. I'm stepping through it in the debugger, and I can see this behaviour in the last line returned by sub gen_permutate
my #DNA = qw/A C T G/;
my $seq = gen_permutate(14, #DNA);
while ( my $strand = $seq->() ) {
print "$strand\n";
}
sub gen_permutate {
my ($max, #list) = #_;
my #curr;
return sub {
if ( (join '', map { $list[ $_ ] } #curr) eq $list[ -1 ] x #curr ) {
#curr = (0) x (#curr + 1);
else {
my $pos = #curr;
while ( --$pos > -1 ) {
++$curr[ $pos ], last if $curr[ $pos ] < $#list;
$curr[ $pos ] = 0;
}
}
return undef if #curr > $max;
return join '', map { $list[ $_ ] } #curr;
};
}
J
This is a part of "DWIM" -- an array index need be a number, so what is passed in is converted to one, as best as the interpreter can do/guess. In case of undef at least it's clear -- becomes a 0.
But it's really a programming error, or at least sloppiness.
To do what you expect:
my #res = map { $list[ $_ ] } grep { defined } #curr;
or, in one iteration over the list
my #res = map { defined $_ ? $list[$_] : () } #curr;
Here the empty list, indicated by (), gets flattened into the return list, thus disappearing; so we filter inside map as well. But the point of that ternary is to be able to put something meaningful instead, and if some input need be just removed then having an explicit grep first is clearer.
NOTE   With use warnings; in place you'd hear all about it
Use of uninitialized value $_ in array element at ...
(and it would convert it to a number)
Each script really must begin with use warnings; (and use strict;). Sometimes the biggest part in inheriting older scripts is to stick use warnings; on top right away (and work through the screenfulls of messages :(.
I think to answer my own question, that undef is sometimes treated as zero. That is why $list[undef] is the same as $list[0]

Use of logical operators AND OR when comparing conditionals

I am facing some doubts about the use (and return values I guess) of the logical operators &&, and, ||, or.
$number = 5;
$numberA = 5;
$numberB = 1;
$string = "x";
$stringA = "x";
$stringB = "y";
If two numbers are compared:
$x=5;
if ( $x == $number ) { print '$x == $number', "\n"; }
If two strings are compared:
$x="x";
if ( $x eq $string ) { print '$x eq $string', "\n"; }
But I'm not sure what is the best way to evaluate two numbers/strings to a number/string. Is this correct?
$x=5; $y=5;
if ( ($x && $y) == $number ) { print '($x && $y) == $number', "\n"; }
$x="x"; $y="x";
if ( ($x and $y) eq $string ) { print '($x and $y) eq $string', "\n"; }
And what is the rule when two logicals are evaluated in the same condition? Should the conditions itself be compared as numbers (&&,||) or strings (and,or)?
$x=5; $y=1;
if ( ($x == $numberA) && ($y == $numberB) ) { print '&& or and here?', "\n"; }
$x="x"; $y="y";
if ( ($x eq $stringA) and ($y eq $stringB) ) { print 'and or or here?', "\n"; }
( $foo && $bar ) == $baz
does not do what you think it does; it first evaluates the && operation, getting the value of $foo if $foo is true and otherwise getting the value of $bar, then compares that to $baz. You need to explicitly spell it out as $foo == $baz && $bar == $baz to test both.
If you have many values (preferably in an array, not a bunch of separate variables), grep can be useful:
if ( 2 == grep $_ == $baz, $foo, $bar ) {
List::MoreUtils provides a convenient all method, too:
use List::MoreUtils 'all';
if ( all { $_ == $baz } $foo, $bar ) {
and/or and &&/|| are not string or numeric operators; the alphabetic ones function exactly the same as the equivalent symbolic ones. The only difference is that they have different precedence; &&/|| have a higher precedence, such that they are useful within an expression; and/or have a lower precedence, such that they are useful for flow control between what are essentially different expressions. Some examples:
my $x = $y || 'default_value';
equivalent to:
my $x = ( $y || 'default_value' );
vs.
my #a = get_lines() or die "expected some lines!";
equivalent to:
( my #a = get_lines() ) or die "expected some lines!";

How can I check if all elements of an array are identical in Perl?

I have an array #test. What's the best way to check if each element of the array is the same string?
I know I can do it with a foreach loop but is there a better way to do this? I checked out the map function but I'm not sure if that's what I need.
If the string is known, you can use grep in scalar context:
if (#test == grep { $_ eq $string } #test) {
# all equal
}
Otherwise, use a hash:
my %string = map { $_, 1 } #test;
if (keys %string == 1) {
# all equal
}
or a shorter version:
if (keys %{{ map {$_, 1} #test }} == 1) {
# all equal
}
NOTE: The undefined value behaves like the empty string ("") when used as a string in Perl. Therefore, the checks will return true if the array contains only empty strings and undefs.
Here's a solution that takes this into account:
my $is_equal = 0;
my $string = $test[0]; # the first element
for my $i (0..$#test) {
last unless defined $string == defined $test[$i];
last if defined $test[$i] && $test[$i] ne $string;
$is_equal = 1 if $i == $#test;
}
Both methods in the accepted post give you the wrong answer if #test = (undef, ''). That is, they declare an undefined value to be equal to the empty string.
That might be acceptable. In addition, using grep goes through all elements of the array even if a mismatch is found early on and using the hash more than doubles the memory used by elements of array. Neither of these would be a problem if you have small arrays. And, grep is likely to be fast enough for reasonable list sizes.
However, here is an alternative that 1) returns false for (undef, '') and (undef, 0), 2) does not increase the memory footprint of your program and 3) short-circuits as soon as a mismatch is found:
#!/usr/bin/perl
use strict; use warnings;
# Returns true for an empty array as there exist
# no elements of an empty set that are different
# than each other (see
# http://en.wikipedia.org/wiki/Vacuous_truth)
sub all_the_same {
my ($ref) = #_;
return 1 unless #$ref;
my $cmpv = \ $ref->[-1];
for my $i (0 .. $#$ref - 1) {
my $this = \ $ref->[$i];
return unless defined $$cmpv == defined $$this;
return if defined $$this
and ( $$cmpv ne $$this );
}
return 1;
}
However, using List::MoreUtils::first_index is likely to be faster:
use List::MoreUtils qw( first_index );
sub all_the_same {
my ($ref) = #_;
my $first = \ $ref->[0];
return -1 == first_index {
(defined $$first != defined)
or (defined and $_ ne $$first)
} #$ref;
}
TIMTOWTDI, and I've been reading a lot of Mark Jason Dominus lately.
use strict;
use warnings;
sub all_the_same {
my $ref = shift;
return 1 unless #$ref;
my $cmp = $ref->[0];
my $equal = defined $cmp ?
sub { defined($_[0]) and $_[0] eq $cmp } :
sub { not defined $_[0] };
for my $v (#$ref){
return 0 unless $equal->($v);
}
return 1;
}
my #tests = (
[ qw(foo foo foo) ],
[ '', '', ''],
[ undef, undef, undef ],
[ qw(foo foo bar) ],
[ '', undef ],
[ undef, '' ]
);
for my $i (0 .. $#tests){
print "$i. ", all_the_same($tests[$i]) ? 'equal' : '', "\n";
}
You can check how many times the element in the array (#test) is repeated by counting it in a hash (%seen). You can check how many keys ($size) are present in the hash (%seen). If more than 1 key is present, you know that the elements in the array are not identical.
sub all_the_same {
my #test = #_;
my %seen;
foreach my $item (#test){
$seen{$item}++
}
my $size = keys %seen;
if ($size == 1){
return 1;
}
else{
return 0;
}
}
I think, we can use List::MoreUtils qw(uniq)
my #uniq_array = uniq #array;
my $array_length = #uniq_array;
$array_length == 1 ? return 1 : return 0;
I use List::Util::first for all similar purposes.
# try #0: $ok = !first { $_ ne $string } #test;
# try #1: $ok = !first { (defined $_ != defined $string) || !/\A\Q$string\E\z/ } #test;
# final solution
use List::Util 'first';
my $str = shift #test;
my $ok = !first { defined $$_ != defined $str || defined $str && $$_ ne $str } map \$_, #test;
I used map \$_, #test here to avoid problems with values that evaluate to false.
Note. As cjm noted fairly, using map defeats the advantage of first short-circuiting. So I tip my hat to Sinan with his first_index solution.

find extra, missing, invalid strings when comparing two lists in perl

List-1 List-2
one one
two three
three three
four four
five six
six seven
eight eighttt
nine nine
Looking to output
one | one PASS
two | * FAIL MISSING
three | three PASS
* | three FAIL EXTRA
four | four PASS
five | * FAIL MISSING
six | six PASS
* | seven FAIL EXTRA
eight | eighttt FAIL INVALID
nine | nine PASS
Actually the return from my current solution is a reference to the two modified lists and a reference to a "fail" list describing the failure for the index as either "no fail", "missing", "extra", or "invalid" which is also (obviously) fine output.
My current solution is:
sub compare {
local $thisfound = shift;
local $thatfound = shift;
local #thisorig = #{ $thisfound };
local #thatorig = #{ $thatfound };
local $best = 9999;
foreach $n (1..6) {
local $diff = 0;
local #thisfound = #thisorig;
local #thatfound = #thatorig;
local #fail = ();
for (local $i=0;$i<scalar(#thisfound) || $i<scalar(#thatfound);$i++) {
if($thisfound[$i] eq $thatfound[$i]) {
$fail[$i] = 'NO_FAIL';
next;
}
if($n == 1) { # 1 2 3
next unless __compare_missing__();
next unless __compare_extra__();
next unless __compare_invalid__();
} elsif($n == 2) { # 1 3 2
next unless __compare_missing__();
next unless __compare_invalid__();
next unless __compare_extra__();
} elsif($n == 3) { # 2 1 3
next unless __compare_extra__();
next unless __compare_missing__();
next unless __compare_invalid__();
} elsif($n == 4) { # 2 3 1
next unless __compare_extra__();
next unless __compare_invalid__();
next unless __compare_missing__();
} elsif($n == 5) { # 3 1 2
next unless __compare_invalid__();
next unless __compare_missing__();
next unless __compare_extra__();
} elsif($n == 6) { # 3 2 1
next unless __compare_invalid__();
next unless __compare_extra__();
next unless __compare_missing__();
}
push #fail,'INVALID';
$diff += 1;
}
if ($diff<$best) {
$best = $diff;
#thisbest = #thisfound;
#thatbest = #thatfound;
#failbest = #fail;
}
}
return (\#thisbest,\#thatbest,\#failbest)
}
sub __compare_missing__ {
my $j;
### Does that command match a later this command? ###
### If so most likely a MISSING command ###
for($j=$i+1;$j<scalar(#thisfound);$j++) {
if($thisfound[$j] eq $thatfound[$i]) {
$diff += $j-$i;
for ($i..$j-1) { push(#fail,'MISSING'); }
#end = #thatfound[$i..$#thatfound];
#thatfound = #thatfound[0..$i-1];
for ($i..$j-1) { push(#thatfound,'*'); }
push(#thatfound,#end);
$i=$j-1;
last;
}
}
$j == scalar(#thisfound);
}
sub __compare_extra__ {
my $j;
### Does this command match a later that command? ###
### If so, most likely an EXTRA command ###
for($j=$i+1;$j<scalar(#thatfound);$j++) {
if($thatfound[$j] eq $thisfound[$i]) {
$diff += $j-$i;
for ($i..$j-1) { push(#fail,'EXTRA'); }
#end = #thisfound[$i..$#thisfound];
#thisfound = #thisfound[0..$i-1];
for ($i..$j-1) { push (#thisfound,'*'); }
push(#thisfound,#end);
$i=$j-1;
last;
}
}
$j == scalar(#thatfound);
}
sub __compare_invalid__ {
my $j;
### Do later commands match? ###
### If so most likely an INVALID command ###
for($j=$i+1;$j<scalar(#thisfound);$j++) {
if($thisfound[$j] eq $thatfound[$j]) {
$diff += $j-$i;
for ($i..$j-1) { push(#fail,'INVALID'); }
$i=$j-1;
last;
}
}
$j == scalar(#thisfound);
}
But this isn't perfect ... who wants to simplify and improve? Specifically ... within a single data set, one order of searching is better for a subset and another order is better for a different subset.
If the arrays contain duplicate values, the answer is quite a bit more complicated than that.
See e.g. Algorithm::Diff or read about Levenshtein distance.
From perlfaq4's answer to How can I tell whether a certain element is contained in a list or array?:
(portions of this answer contributed by Anno Siegel and brian d foy)
Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data. Hashes are designed to answer this question quickly and efficiently. Arrays aren't.
That being said, there are several ways to approach this. In Perl 5.10 and later, you can use the smart match operator to check that an item is contained in an array or a hash:
use 5.010;
if( $item ~~ #array )
{
say "The array contains $item"
}
if( $item ~~ %hash )
{
say "The hash contains $item"
}
With earlier versions of Perl, you have to do a bit more work. If you are going to make this query many times over arbitrary string values, the fastest way is probably to invert the original array and maintain a hash whose keys are the first array's values:
#blues = qw/azure cerulean teal turquoise lapis-lazuli/;
%is_blue = ();
for (#blues) { $is_blue{$_} = 1 }
Now you can check whether $is_blue{$some_color}. It might have been a good idea to keep the blues all in a hash in the first place.
If the values are all small integers, you could use a simple indexed array. This kind of an array will take up less space:
#primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
#is_tiny_prime = ();
for (#primes) { $is_tiny_prime[$_] = 1 }
# or simply #istiny_prime[#primes] = (1) x #primes;
Now you check whether $is_tiny_prime[$some_number].
If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:
#articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (#articles) { vec($read,$_,1) = 1 }
Now check whether vec($read,$n,1) is true for some $n.
These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.
If you are testing only once, the standard module List::Util exports the function first for this purpose. It works by stopping once it finds the element. It's written in C for speed, and its Perl equivalent looks like this subroutine:
sub first (&#) {
my $code = shift;
foreach (#_) {
return $_ if &{$code}();
}
undef;
}
If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list. This does have the benefit of telling you how many matches it found, though.
my $is_there = grep $_ eq $whatever, #array;
If you want to actually extract the matching elements, simply use grep in list context.
my #matches = grep $_ eq $whatever, #array;
sub compare {
local #d = ();
my $this = shift;
my $that = shift;
my $distance = _levenshteindistance($this, $that);
my #thisorig = #{ $this };
my #thatorig = #{ $that };
my $s = $#thisorig;
my $t = $#thatorig;
#this = ();
#that = ();
#fail = ();
while($s>0 || $t>0) {
# deletion, insertion, substitution
my $min = _minimum($d[$s-1][$t],$d[$s][$t-1],$d[$s-1][$t-1]);
if($min == $d[$s-1][$t-1]) {
unshift(#this,$thisorig[$s]);
unshift(#that,$thatorig[$t]);
if($d[$s][$t] > $d[$s-1][$t-1]) {
unshift(#fail,'INVALID');
} else {
unshift(#fail,'NO_FAIL');
}
$s -= 1;
$t -= 1;
} elsif($min == $d[$s][$t-1]) {
unshift(#this,'*');
unshift(#that,$thatorig[$t]);
unshift(#fail,'EXTRA');
$t -= 1;
} elsif($min == $d[$s-1][$t]) {
unshift(#this,$thisorig[$s]);
unshift(#that,'*');
unshift(#fail,'MISSING');
$s -= 1;
} else {
die("Error! $!");
}
}
return(\#this, \#that, \#fail);
}
sub _minimum {
my $ret = 2**53;
foreach $in (#_) {
$ret = $ret < $in ? $ret : $in;
}
$ret;
}
sub _levenshteindistance {
my $s = shift;
my $t = shift;
my #s = #{ $s };
my #t = #{ $t };
for(my $i=0;$i<scalar(#s);$i++) {
$d[$i] = ();
}
for(my $i=0;$i<scalar(#s);$i++) {
$d[$i][0] = $i # deletion
}
for(my $j=0;$j<scalar(#t);$j++) {
$d[0][$j] = $j # insertion
}
for(my $j=1;$j<scalar(#t);$j++) {
for(my $i=1;$i<scalar(#s);$i++) {
if ($s[$i] eq $t[$j]) {
$d[$i][$j] = $d[$i-1][$j-1];
} else {
# deletion, insertion, substitution
$d[$i][$j] = _minimum($d[$i-1][$j]+1,$d[$i][$j-1]+1,$d[$i-1][$j-1]+1);
}
}
}
foreach $a (#d) {
#a = #{ $a };
foreach $b (#a) {
printf STDERR "%2d ",$b;
}
print STDERR "\n";
}
return $d[$#s][$#t];
}
The trick in Perl (and similar languages) is the hash, which doesn't care about order.
Suppose that the first array is the one that hold the valid elements. Construct a hash with those values as keys:
my #valid = qw( one two ... );
my %valid = map { $_, 1 } #valid;
Now, to find the invalid elements, you just have to find the ones not in the %valid hash:
my #invalid = grep { ! exists $valid{$_} } #array;
If you want to know the array indices of the invalid elements:
my #invalid_indices = grep { ! exists $valid{$_} } 0 .. $#array;
Now, you can expand that to find the repeated elements too. Not only do you check the %valid hash, but also keep track of what you have already seen:
my %Seen;
my #invalid_indices = grep { ! exists $valid{$_} && ! $Seen{$_}++ } 0 .. $#array;
The repeated valid elements are the ones with a value in %Seen that is greater than 1:
my #repeated_valid = grep { $Seen{$_} > 1 } #valid;
To find the missing elements, you look in %Seen to check what isn't in there.
my #missing = grep { ! $Seen{$_ } } #valid;
From perlfaq4's answer to How do I compute the difference of two arrays? How do I compute the intersection of two arrays?:
Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:
#union = #intersection = #difference = ();
%count = ();
foreach $element (#array1, #array2) { $count{$element}++ }
foreach $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}