I'm trying to construct a permutation program in Perl using the NestedLoops function. Here's my code:
use strict;
use warnings;
use Algorithm::Loops qw(NestedLoops);
my @a = 'a'..'o';
my $length = 5;
my $start = 0;
my $depth = 2;
NestedLoops([
[0..$length],
( sub {
$start = 0 if $start == $depth;
$start++;
[$start * $length..$start * $length + $length - 1]
}) x $depth,
], \&permute,);
sub permute {
my @ind = @_;
foreach my $i (@ind) {
print $a[$i];
}
print "\n";
}
So I've got an array that holds the letters 'a' to 'o' (size being 15). I'm treating the array as if it had 3 rows, so my imagination of the array is this:
abcde
fghij
klmno
Then each loop corresponds to each row... and I want to build permutations like:
afk
afl
afm
afn
afo
agk // fails here... I end up getting agg
...
It works for the first 5 values (the entire run of the lowest for loop), but then the second run fails because the last row's value of $start gets reset to 0... this is a problem because that breaks everything.
So what I want to know is, how can I keep the value of $start persistent based on the level... So what I'm asking for is essentially having constants. My loops really should look like this:
for my $a (0..5) { # 0 at this level and never change
for my $b (5..10) { # $start should be 5 at this level and never change
for my $c (10..15) { # $start should be 10 at this level and never change
permute($a, $b, $c);
}
}
}
Now, because I will have a variable length of for loops, I can't hard code each start value, so I'm looking for a way to initially create those start values, and then keep them for when the loop gets reset.
I realize this is a confusing question, so please ask questions, and I will help clarify.
You are making this harder than it has to be.
Part of the problem is that the documentation for NestedLoops doesn't go into much detail about how a subroutine reference in the first argument will be used.
For the following examples, assume this is written somewhere above them.
use strict;
use warnings;
use Algorithm::Loops qw'NestedLoops';
Really the simplest way to call NestedLoops to get what you want is like this:
NestedLoops(
[
['a'..'e'],
['f'..'j'],
['k'..'o'],
],
\&permute
);
sub permute {
print @_, "\n";
}
If you really want the arguments to NestedLoops to be generated on the fly, I would recommend using part from List::MoreUtils.
use List::MoreUtils qw'part';
my @a = 'a'..'o';
my $length = 5;
my $index;
NestedLoops(
[
part {
$index++ / $length
} @a
],
\&permute
);
sub permute {
print @_, "\n";
}
If for some reason you want to call NestedLoops with indexes into the array, it is still easy with part.
use List::MoreUtils qw'part';
my @a = 'a'..'o';
my $length = 5;
NestedLoops(
[
part {
$_ / $length
} 0..@a-1
],
\&permute
);
sub permute {
print map { $a[$_] } @_;
print "\n";
}
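If List::MoreUtils isn't installed, the same row-splitting can be sketched with core Perl only. This is just an illustration of the idea, not what part does internally; the names @rows and @remaining are made up for this example.

```perl
use strict;
use warnings;

my @a      = 'a' .. 'o';
my $length = 5;

# Work on a copy, because splice removes elements from its array.
my @remaining = @a;
my @rows;
push @rows, [ splice @remaining, 0, $length ] while @remaining;

# Each row is an array reference holding $length letters.
print join('', @$_), "\n" for @rows;
```

Passing \@rows as the first argument to NestedLoops should then behave like the hard-coded list of three ranges shown earlier.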
Really the main problem you're having is that the two subroutine references that you give to NestedLoops are modifying the same variables, and they are both called multiple times.
The best way to fix this is to rely on the last value given to the subroutine when it is called. ( From looking at the implementation, this seems to be closer to how it was meant to be used. )
my @a = 'a'..'o';
my $length = 5;
my $depth = 3;
NestedLoops(
[
[0..$length-1],
(sub{
return unless @_;
my $last = pop;
my $part = int( $last / $length ) + 1; # current partition
my $start = $part * $length; # start of this partition
my $end = $start + $length;
[$start..$end-1] # list of variables in this partition
}) x ($depth-1)
],
\&permute
);
sub permute {
print map { $a[$_] } @_;
print "\n";
}
When you use a subroutine to generate the range of a loop, it is called every time that one of the nested loops must start. That means once for each iteration of the containing loop. Before each call $_ is set to the current value of the containing loop's variable, and the values of all the containing loop variables are passed as parameters.
To clarify this, the NestedLoops statement you have coded is equivalent to
sub loop_over {
$start = 0 if $start == $depth;
$start++;
[$start * $length..$start * $length + $length - 1]
};
NestedLoops([
[0..$length],
(\&loop_over) x $depth,
], \&permute,);
which, in raw Perl, looks something like
for my $i (0 .. $length) {
$_ = $i;
my $list = loop_over($i);
for my $j (@$list) {
$_ = $j;
my $list = loop_over($i, $j);
for my $k (@$list) {
permute($i, $j, $k);
}
}
}
so perhaps it is clearer now that your calculation of $start is wrong? It is reevaluated several times for the innermost level before execution ascends to restart the containing loop.
Since the parameters passed to the subroutine consist of all the values of the containing loop variables, the size of @_ can be checked to see for which level of the loop to generate a range. For instance, in the code above, if @_ contains two values they are $i and $j, so the values for $k must be returned; alternatively, if there is only one parameter then it is the value of $i, and the returned value must be the range for $j. So the correct value for your $start is simply the number of elements in @_ and can be set using my $start = @_;.
Using this method the subroutine can return the range for the outermost loop as well. The code looks like this
use strict;
use warnings;
use Algorithm::Loops qw(NestedLoops);
my @a = 'a'..'o';
my $length = 5;
my $start = 0;
my $depth = 2;
NestedLoops([
(sub {
$start = @_;
[$start * $length .. $start * $length + $length - 1];
}) x ($depth + 1)
], \&permute,);
sub permute {
print map { $a[$_] } @_;
print "\n";
}
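For comparison, here is a rough sketch of the same idea without the module: each level derives its range from how many enclosing loop values it has been given, so nothing is shared between levels. The helper nested_rows is hypothetical (it is not part of Algorithm::Loops), and it collects the results in @words only so they are easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical helper: level $d iterates over the indexes
# $d * $length .. $d * $length + $length - 1.
sub nested_rows {
    my ($length, $depth, $callback, @chosen) = @_;
    my $level = @chosen;    # number of enclosing loop values so far
    if ($level == $depth) {
        $callback->(@chosen);
        return;
    }
    my $start = $level * $length;
    for my $index ($start .. $start + $length - 1) {
        nested_rows($length, $depth, $callback, @chosen, $index);
    }
    return;
}

my @a = 'a' .. 'o';
my @words;
nested_rows(5, 3, sub { push @words, join('', @a[@_]) });

print "$_\n" for @words[0 .. 5];    # afk afl afm afn afo agk
```

The start value is recomputed from the argument count on every call, which is the same trick as my $start = @_; in the NestedLoops version.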
As a biology student, I'm trying to extend my programming knowledge and I ran into a problem with Perl.
I'm trying to create a program that generates random DNA strings and performs analysis work on the generated data.
In the first part of the program, I am able to print out the strings stored in the array, but in the second part I can retrieve only one of the array's elements.
Could this be part of the scoping rules of Perl?
#!usr/bin/perl
# generate a random DNA strings and print it to file specified by the user.
$largearray[0] = 0;
print "How many nucleotides for the string?\n";
$n = <>;
$mylong = $n;
print "how many strings?\n";
$numstrings = <>;
# @largearray =();
$j = 0;
while ( $j < $numstrings ) {
$numstring = ''; # start with the empty string;
$dnastring = '';
$i = 0;
while ( $i < $n ) {
$numstring = int( rand( 4 ) ) . $numstring; # generate a new random integer
# between 0 and 3, and concatenate
# it with the existing $numstring,
# assigning the result to $numstring.
$i++; # increase the value of $i by one.
}
$dnastring = $numstring;
$dnastring =~ tr/0123/actg/; # translate the numbers to DNA characters.
#print $dnastring;
#print "\n";
$largearray[j] = $dnastring; #append generated string to end of array
#print $largearray[j];
#print $j;
#IN HERE THERE ARE GOOD ARRAY VALUES
#print "\n";
$j++;
}
# ii will be used to continuously take the next couple of strings from largearray
# for LCS matching.
$mytotal = 0;
$ii = 0;
while ( $ii < $numstrings ) {
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
print "\n";
$ii++;
@string1 = split( //, $line );
$line = $largearray[ii];
#print $largearray[ii];
#print "\n";
$ii++;
chomp $line;
@string2 = split( //, $line );
$n = @string1; #assigning a list to a scalar just assigns the
#number of elements in the list to the scalar.
$m = @string2;
$v = 1;
$Cm = 0;
$Im = 0;
$V[0][0] = 0; # Assign the 0,0 entry of the V matrix
for ( $i = 1; $i <= $n; $i++ ) { # Assign the column 0 values and print
# String 1 See section 5.2 of Johnson
# for loops
$V[$i][0] = -$Im * $i;
}
for ( $j = 1; $j <= $m; $j++ ) { # Assign the row 0 values and print String 2
$V[0][$j] = -$Im * $j;
}
for ( $i = 1; $i <= $n; $i++ ) { # follow the recurrences to fill in the V matrix.
for ( $j = 1; $j <= $m; $j++ ) {
# print OUT "$string1[$i-1], $string2[$j-1]\n"; # This is here for debugging purposes.
if ( $string1[ $i - 1 ] eq $string2[ $j - 1 ] ) {
$t = 1 * $v;
}
else {
$t = -1 * $Cm;
}
$max = $V[ $i - 1 ][ $j - 1 ] + $t;
# print OUT "For $i, $j, t is $t \n"; # Another debugging line.
if ( $max < $V[$i][ $j - 1 ] - 1 * $Im ) {
$max = $V[$i][ $j - 1 ] - 1 * $Im;
}
if ( $V[ $i - 1 ][$j] - 1 * $Im > $max ) {
$max = $V[ $i - 1 ][$j] - 1 * $Im;
}
$V[$i][$j] = $max;
}
} #outer for loop
print $V[$n][$m];
$mytotal += $V[$n][$m]; # append current result to the grand total
print "\n";
} # end while loop
print "the average LCS value for length ", $mylong, " strings is: ";
print $mytotal/ $numstrings;
This isn't a scoping issue. You have declared none of your variables, which has the effect of implicitly making them all global and accessible everywhere in your code
I reformatted your Perl program so that I could read it, and then added this to the top of your program
use strict;
use warnings 'all';
which are essential in every Perl program you write
Then I added
no strict 'vars';
which is a very bad idea, and lets you get away without declaring any variables
The result is this
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 60.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 61.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 67.
Argument "j" isn't numeric in array element at E:\Perl\source\dna.pl line 42.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 60.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 61.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 67.
Bareword "j" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 42.
Execution of E:\Perl\source\dna.pl aborted due to compilation errors.
Line 42 (of my reformatted version) is
$largearray[j] = $dnastring
and lines 60, 61 and 67 are
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
and
$line = $largearray[ii];
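You are using j and ii as array indexes. Those are barewords, not variables: without strict, Perl treats each one as a string (or as a subroutine call if such a subroutine exists), and that string evaluates to zero as an index, so every access hits element 0. Adding use strict would have stopped this from compiling unless you had also declared sub ii and sub j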
You might get away with it if you just change j and ii to $j and $ii, but you are certain to get into further problems
Please make the same changes to your own code, and declare every variable that you need using my as close as possible to the first place they are used
You should also improve your variable naming. Things like @largearray are pointless: the @ says that it's an array, and whether it's large or not is relative, and of little use in understanding your code. If you have no better description of its purpose then @table or @data are probably a little better
Likewise, please avoid capital letters and most single-letter names. @V, $Cm and $Im are meaningless, and you would need fewer comments if those names were better
You certainly wouldn't need comments like # end while loop and # outer for loop if you had indented your blocks properly and kept them short enough so that both the beginning and the end can be seen on the screen at the same time, and the fewer comments you can get away with the better, because they badly clutter the code structure
Finally, it's worth noting that the C-style for loop is rarely the best choice in Perl. Your
for ( $i = 1; $i <= $n; $i++ ) { ... }
is much clearer as
for my $i ( 1 .. $n ) { ... }
and declaring the control variable at that point makes it unnecessary to invent new names like $ii for each new loop
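To see why the bareword index always lands on the same element, here is a small sketch. It deliberately switches off strict 'subs' and warnings inside one block to imitate code written without them; the array @data and the variable names are invented for the demonstration.

```perl
use strict;
use warnings;

my @data = ('first', 'second', 'third');
my $j    = 2;

my $with_sigil = $data[$j];    # uses the variable $j, so element 2

my $with_bareword;
{
    no strict 'subs';    # allow the bareword, as in code without strict
    no warnings;         # silence the "isn't numeric" warning for this demo
    # The bareword j is the string "j", which is 0 when used as a number.
    $with_bareword = $data[j];
}

print "$with_sigil $with_bareword\n";    # third first
```

With strict enabled everywhere, the bareword line would refuse to compile, which is exactly the error report you want.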
I think you have a typo in your code:
ii => must be $ii
don't forget to put this at the beginning of your code:
use strict;
use warnings;
in order to avoid this kind of error (and others)
I'm having a problem coding my first Perl program.
What I'm trying to do here is get the maximum, minimum, total and average of a list of numbers, using a subroutine for each value and another subroutine to print the final values. I'm using "private" (my) variables throughout, but I still can't print my values.
Here is my code:
&max(<>);
&print_stat(<>);
sub max {
my ($mymax) = shift @_;
foreach (@_) {
if ( $_ > $mymax ) {
$mymax = $_;
}
}
return $mymax;
}
sub print_stat {
print max($mymax);
}
Please try this one:
use strict;
use warnings;
my @list_nums = qw(10 21 30 42 50 63 70);
ma_xi(@list_nums);
sub ma_xi
{
my @list_ele = @_;
my $set_val_max = '0'; my $set_val_min = '0';
my $add_all_vals = '0';
foreach my $each_ele(@list_ele)
{
$set_val_max = $each_ele if($set_val_max < $each_ele);
$set_val_min = $each_ele if($set_val_min eq '0');
$set_val_min = $each_ele if($set_val_min > $each_ele);
$add_all_vals += $each_ele;
}
my $set_val_avg = $add_all_vals / scalar(@list_ele); # average = total / number of elements
print "MAX: $set_val_max\n";
print "MIN: $set_val_min\n";
print "TOT: $add_all_vals\n";
print "AVG: $set_val_avg\n";
#Return these values into array and get into the new sub routine's
}
Some notes
Use plenty of whitespace to lay out your code. I have tidied the Perl code in your question so that I could read it more easily, without changing its semantics
You must always use strict and use warnings 'all' at the top of every Perl program you write
Never use an ampersand & in a subroutine call. That hasn't been necessary or desirable since Perl 4 over twenty-five years ago. Any tutorial that tells you otherwise is wrong
Using <> in a list context (such as the parameters to a subroutine call) will read all of the file and exhaust the file handle. Thereafter, a call to <> has nothing left to read: it returns undef in scalar context and an empty list in list context
You should use chomp to remove the newline from each line of input
You declare $mymax within the scope of the max subroutine, but then try to print it in print_stat where it doesn't exist. use strict and use warnings 'all' would have caught that error for you
Your max subroutine returns the maximum value that it calculated, but you never use that return value
Below is a fixed version of your code.
Note that I've read the whole file into array @values and then chomped them all at once. In general it's best to read and process input one line at a time, which would be quite possible here but I wanted to stay as close to your original code as possible
I've also saved the return value from max in variable $max, and then passed that to print_stat. It doesn't make sense to try to read the file again and pass all of those values to print_stat, as your code does
I hope this helps
use strict;
use warnings 'all';
my @values = <>;
chomp @values;
my $max = max(@values);
print_stat( $max );
sub max {
my $mymax = shift;
for ( @_ ) {
if ( $_ > $mymax ) {
$mymax = $_;
}
}
return $mymax;
}
sub print_stat {
my ($val) = @_;
print $val, "\n";
}
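The earlier note about reading a handle in list context can be checked with a short sketch. I've used an in-memory file handle with made-up data in place of the magic <>, so the example runs without any input file; the variable names are only for illustration.

```perl
use strict;
use warnings;

# An in-memory file handle stands in for a real input file.
my $input = "3\n9\n4\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @first  = <$fh>;    # list context: reads every remaining line
my @second = <$fh>;    # the handle is already at end of file

chomp @first;
printf "first read: %d lines, second read: %d lines\n",
    scalar @first, scalar @second;
```

This is why the original &print_stat(<>) call had nothing to work with: the first call had already consumed all of the input.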
Update
Here's a version that calculates all of the statistics that you mentioned. I don't think subroutines are a help in this case as the solution is short and no code is reusable
Note that I've added the data at the end of the program file, after __DATA__, which lets me read it from the DATA file handle. This is often handy for testing
use strict;
use warnings 'all';
my ($n, $max, $min, $tot);
while ( <DATA> ) {
next unless /\S/; # Skip blank lines
chomp;
if ( not defined $n ) {
$max = $min = $tot = $_;
}
else {
$max = $_ if $max < $_;
$min = $_ if $min > $_;
$tot += $_;
}
++$n;
}
my $avg = $tot / $n;
printf "\$n = %d\n", $n;
printf "\$max = %d\n", $max;
printf "\$min = %d\n", $min;
printf "\$tot = %d\n", $tot;
printf "\$avg = %.2f\n", $avg;
__DATA__
7
6
1
5
1
3
8
7
output
$n = 8
$max = 8
$min = 1
$tot = 38
$avg = 4.75
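If you do want the statistics wrapped in a subroutine, one possible sketch uses List::Util, which ships with Perl. The subroutine name stats and the hash keys are my own choices here, not anything from your code.

```perl
use strict;
use warnings;
use List::Util qw(max min sum);

# Returns a list of key/value pairs, or an empty list for no input.
sub stats {
    my @values = @_;
    return unless @values;
    my $total = sum(@values);
    return (
        n   => scalar @values,
        max => max(@values),
        min => min(@values),
        tot => $total,
        avg => $total / @values,    # @values in numeric context is the count
    );
}

my %s = stats(7, 6, 1, 5, 1, 3, 8, 7);
printf "%s = %s\n", $_, $s{$_} for qw(n max min tot avg);
```

With the same eight numbers as the __DATA__ section, this should report the same values as the output shown earlier.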
I am writing this loop where I initialize the loop variable from another variable rather than an absolute value. Why doesn't the value change? Or is that not allowed in a for loop?
for($i = $one, $i > $top_level,$i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
Why can't I assign $i = $one?
The variable $one keeps changing, so that's the reason I initialize $i from $one. When I print the individual variables their values are correct; it's just that in the for loop the value is not being assigned.
Use semicolons, not commas.
for($i = $one; $i > $top_level; $i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
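Avoid C-style loops in Perl, if possible. The range below starts at $top_level + 1 because the original condition, $i > $top_level, excludes $top_level itself.
for my $i (reverse $top_level + 1 .. $one) {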
print $i,"\n";
print "One is:",$one,"\n";
}
The separator inside the for() statement is ";", not ",".
You've accidentally stumbled on using for with a list, by using commas instead of semi-colons. Watch what happens when you add a third print statement:
print "\$_=[$_]\n";
with $one as 1, I get:
$_=[0]
...
$_=[]
...
$_=[1]
And that is because there are three items in the list. What you wanted to do would have looked like this:
for my $i ( $one..( $top_level - 1 )) {
In this loop, $i is set to each member of the range and then the loop is executed. If we don't supply the variable, Perl uses $_ instead.
for ( $one..( $top_level - 1 )) {
Perl builds the whole list before the first iteration, so every term has been evaluated before the loop body runs even once:
$i = $one;
# push actual $i returned by expression $i = $one
push @a, $i;
# result = [ $i=1 ]
# push boolean false => ''
push @a, ( $i > $top_level );
# result = [ $i=1, '' ]
# push the return of $i--, NOT $i
push @a, ( $i );
# result = [ $i=1, '', 1 ]
# decrement $i
$i--;
# result = [ $i=0, '', 1 ]
You can find this out using a tied scalar:
package Monitored;
use Carp; # needed for the Carp::carp calls below
sub TIESCALAR {
my ( $class, $name, $value ) = @_;
return bless { name => $name, value => $value }, $class;
}
sub FETCH {
my $self = shift;
Carp::carp "Reading \$$self->{name}...";
return $self->{value};
}
sub STORE {
my $self = shift;
my $value = shift;
Carp::carp "Storing \$$self->{name}=${\(defined( $value ) ? $value : 'undef')} ";
$self->{value} = $value;
}
And this initialization in the mainline:
my $one = 1;
tie my $i, 'Monitored', 'i';
tie my $top_level, 'Monitored', 'top_level', 5;
for($i = $one, $i > $top_level,$i--) # line 30
{
print "*LOOP*\n";
#print "\$i=$i\n"; <-- commented out to reduce noise
#print "\$one=$one\n";
print "\$_=[$_]\n"; # line 35
}
Then in running the loop the output is:
Storing $i=1 at - line 30
Reading $top_level... at - line 30
Reading $i... at - line 30
Reading $i... at - line 30
Reading $i... at - line 30
Storing $i=0 at - line 30
*LOOP*
Reading $i... at - line 35
$_=[0]
*LOOP*
$_=[]
*LOOP*
$_=[1]
Note that only one time, at line 35, do we access $i after the looping starts.
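The difference between the two separators can be reproduced in a few lines. This is a minimal sketch using the same values as above ($one is 1 and $top_level is 5); @seen and $passes are names added just for the demonstration.

```perl
use strict;
use warnings;

my ($one, $top_level) = (1, 5);
my $i;

# Commas: the parentheses hold a three-element list, so this is a foreach loop.
my @seen;
for ($i = $one, $i > $top_level, $i--) {
    push @seen, "[$_]";
}
print "comma version:     @seen\n";

# Semicolons: a C-style loop. 1 > 5 is false, so the body never runs.
my $passes = 0;
for ($i = $one; $i > $top_level; $i--) {
    $passes++;
}
print "semicolon version: $passes passes\n";
```

The comma version visits the same three values as the $_ trace above, while the semicolon version tests the condition first and, with these values, does nothing.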
What on earth are you trying to do? Is this a normal 'for' loop, or are you trying to do something exotic? Note the use of semicolons, NOT commas. With commas, Perl does something completely different: it treats the parentheses as a list and runs the loop body once for each value in it.
Normally it would be:
$one = 1;
for($i = $one; $i > $top_level;$i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
It is Perl, right?
Having heard about Perl for years I decided to give it a few hours of my time to see how much I could pick up. I got through the basics fine and then got to loops. As a test I wanted to see if I could build a script to recurse through all alphanumerical values of up to 4 characters. I had written PHP code that did the same thing some time ago so I took the same concept and used it. However, when I run the script it puts "a" as the first 3 values and then only loops through the last digit. Anyone see what I am doing wrong?
#!/usr/local/bin/perl
$chars = "abcdefghijklmnopqrstuvwxyz";
$chars .= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
$chars .= "0123456789";
@charset = split(//, $chars);
$charset_length = scalar(@charset);
sub recurse
{
($width, $position, $base_string) = @_;
for ($i = 0; $i < $charset_length; ++$i) {
$base = $base_string . $charset[$i];
if ($position < $width - 1) {
$pos = $position + 1;
recurse($width, $pos, $base);
}
print $base;
print "\n";
}
}
recurse(4, 0, '');
This is what I get when I run it:
aaaa
aaab
aaac
aaad
aaae
aaaf
aaag
aaah
aaai
aaaj
aaak
aaal
aaam
aaan
aaao
aaap
aaaq
aaar
aaas
aaat
aaau
aaav
aaaw
aaax
aaay
aaaz
aaaA
aaaB
aaaC
aaaD
aaaE
aaaF
aaaG
aaaH
aaaI
aaaJ
aaaK
aaaL
aaaM
aaaN
aaaO
aaaP
aaaQ
aaaR
aaaS
aaaT
aaaU
aaaV
aaaW
aaaX
aaaY
aaaZ
aaa0
aaa1
aaa2
aaa3
aaa4
aaa5
aaa6
aaa7
aaa8
aaa9
aaa9
aaa9
aaa9
You've been bitten by non strict scoping, this code does what it should (note the use strict at the top and the subsequent use of my to guarantee variable scoping).
#!/usr/bin/env perl
use strict;
use warnings;
my $chars = "abcdefghijklmnopqrstuvwxyz";
$chars .= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
$chars .= "0123456789";
my @charset = split(//, $chars);
my $charset_length = scalar(@charset);
sub recurse {
my ($width, $position, $base_string) = @_;
for (my $i = 0; $i < $charset_length; ++$i) {
my $base = $base_string . $charset[$i];
if ($position < $width - 1) {
my $pos = $position + 1;
recurse($width, $pos, $base);
}
print $base;
print "\n";
}
}
recurse(4, 0, '');
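To check the fix by hand, here is a scaled-down sketch of the same recursion with a two-character set and a width of 2. The name recurse_small and the @out array are mine; the results are collected rather than printed so the order is easy to see.

```perl
use strict;
use warnings;

my @charset = ('a', 'b');
my @out;

sub recurse_small {
    my ($width, $position, $base_string) = @_;    # fresh variables per call
    for my $char (@charset) {
        my $base = $base_string . $char;
        recurse_small($width, $position + 1, $base) if $position < $width - 1;
        push @out, $base;    # recorded after the deeper levels, as in the original
    }
}

recurse_small(2, 0, '');
print "@out\n";    # aa ab a ba bb b
```

Because every call has its own $position and loop variable, the outer loop carries on with 'b' after the inner calls return instead of finding its counter already used up.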
Already well answered, but a more idiomatic approach would be:
use strict;
use warnings;
sub recurse {
my ($width, $base_string, $charset) = @_;
if (length $base_string) {
print "$base_string\n";
}
if (length($base_string) < $width) {
recurse($width, $base_string . $_, $charset) for @$charset;
}
}
my #charset = ('a'..'z', 'A'..'Z', '0'..'9');
recurse(4, '', \@charset);
There's no need to pass position; it's implicit in the width of the base string passed in. The charset, on the other hand, should be passed in rather than having the subroutine use an external variable.
Alternatively, since the width and character set stay constant, generate a closure that references them:
use strict;
use warnings;
sub make_recurser {
my ($width, $charset) = @_;
my $recurser;
$recurser = sub {
my ($base_string) = @_;
if (length $base_string) {
print "$base_string\n";
}
if (length($base_string) < $width) {
$recurser->($base_string . $_) for @$charset;
}
}
}
my @charset = ('a'..'z', 'A'..'Z', '0'..'9');
my $recurser = make_recurser(4, \@charset);
$recurser->('');
Alternatively, just:
print "$_\n" for glob(('{' . join(',', 'a'..'z', 'A'..'Z', '0'..'9') . '}') x 4);
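The glob trick relies on brace expansion, which is easier to follow with a tiny character set. This is a small sketch with three letters and a width of 2; note that, unlike the recursive versions, it produces only strings of exactly the requested length, and the full 62-character version builds all 14,776,336 strings in memory at once.

```perl
use strict;
use warnings;

my @charset = ('a', 'b', 'c');
my $braces  = '{' . join(',', @charset) . '}';    # the string "{a,b,c}"

# Two brace groups side by side expand to every two-character combination.
my @pairs = glob($braces x 2);
print "@pairs\n";
```

No files are involved: a pattern containing only braces and no wildcards simply expands to the listed alternatives.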
It has to do with the scope of the variables, you're still changing the same vars when you're calling the recursion. The keyword 'my' declares the variables local to the subroutine.
(http://perl.plover.com/FAQs/Namespaces.html)
I always use perl with 'use strict;' declared, forcing me to decide on the scope of the variables.
sub recurse {
my ($width, $position, $base_string) = @_;
for (my $i = 0; $i < $charset_length; ++$i) {
my $base = $base_string . $charset[$i];
if ($position < $width - 1) {
my $pos = $position + 1;
recurse($width, $pos, $base);
}
print $base;
print " ";
}
}
You seem to be running into some scoping issues. Perl is very flexible, so it is taking a guess at what you want because you haven't told it what you want. One of the first things you'll learn is to add use strict; as your first statement after the shebang. It will point out the variables that are not being explicitly declared, as well as any variables that are accessed before being created (helps with misspelled variables, etc).
If you make your code look like this, you'll see why you are getting your errors:
sub recurse {
($width, $position, $base_string) = @_;
for ($i = 0; $i < $charset_length; ++$i) {
$base = $base_string . $charset[$i];
if ($position < $width - 1) {
$pos = $position + 1;
recurse($width, $pos, $base);
}
# print "$base\n";
}
print "$position\n";
}
This should output:
3
3
3
3
Because you are not scoping $position correctly with my, you aren't getting a new variable each recurse, you are re-using the same one. Toss a use strict; in there, and fix the errors you get, and the code should be good.
I realize that you're just tinkering with recursion. But as long as you're having fun comparing implementations between two languages you may as well also see how the CPAN can extend your tool set.
If you don't care about the order, you can generate all 13,388,280 permutations of ( 'a'..'z', 'A'..'Z', '0'..'9' ) taken four at a time with the CPAN module Algorithm::Permute.
Here is an example of how that code may look.
use strict;
use warnings;
use Algorithm::Permute;
my $p = Algorithm::Permute->new(
[ 'a' .. 'z', 'A' .. 'Z', '0' .. '9' ], # Set of...
4 # <---- at a time.
);
while ( my @res = $p->next ) {
print @res, "\n";
}
The new() method accepts an array ref that enumerates the character set or list of what to permute. Its second argument is how many at a time to include in the permutation. So you're essentially taking 62 items 4 at a time. Then use the next() method to iterate through the permutations. The rest is just window dressing.
The same thing could be reduced to the following Perl one-liner:
perl -MAlgorithm::Permute -e '$p=Algorithm::Permute->new(["a".."z","A".."Z",0..9],4);print @r, "\n" while @r=$p->next;'
There is also a section on permutation, along with additional examples in perlfaq4. It includes several examples and lists some additional modules that handle the details for you. One of Perl's strengths is the size and completeness of the Comprehensive Perl Archive Network (the CPAN).
As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for the mistake. I overlooked that using a my variable inside countit(x, q{}) is a big mistake: the string is evaluated inside the Benchmark module, where @str was empty. This solution is not as fast as I first presented. See the correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless @_;
return $_[0] if @_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (@_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (@_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless @_;
my $prefix = shift;
for (@_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my @str = map { chomp; $_ } <DATA>;
is( LCP::LCP(@str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(@str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(@str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(@str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means the pure Perl solution using substr is about 20% faster than Roy's solution on your test case, and finding one prefix takes about 50 microseconds. There is no need to use XS unless your data or performance expectations are bigger.
The reference already given by Brett Daniel to the Wikipedia entry on the "Longest common substring problem" is a very good general reference (with pseudocode) for your question as stated. However, the general algorithm can be expensive, and it looks like you might actually want the longest common prefix, which is a much simpler problem.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|@ ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (@_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@str), "\n"; # use the strings read from DATA
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my @str = @_;
my @pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push @{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my @loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push @loop, $pos[$_]{$char};
}
my $next = NestedLoops([@loop]);
while (my @char_map = $next->()) {
my $key = join '-', @char_map;
$map{$key} = $char;
}
}
my @pile;
for my $seq (keys %map) {
push @pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my @offset = split /-/, $seq;
$_ += $dir for @offset;
my $next = join '-', @offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for @offset;
$next = join '-', @offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} @pile;
}
my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
@array = split(//, $string);
(splitting on an empty pattern puts each character into its own element of the array)
Then do a loop, perhaps overall:
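$n = 0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);
# eq compares strings (== would compare numerically); the bounds checks stop at the end of the shortest string
while($n < @array1 && $n < @array2 && $n < @array3 && $array1[$n] eq $array2[$n] && $array2[$n] eq $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); # $n is the number of matching characters, so no -1 is needed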
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return the first i characters (substr(s, 0, i) in Perl, where the third argument is a length).
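A minimal Perl sketch of that loop might look like the following. The subroutine name common_prefix_loop is made up for this example, and it makes no attempt to be the fastest possible version.

```perl
use strict;
use warnings;

sub common_prefix_loop {
    my @strings = @_;
    return '' unless @strings;
    my $first = shift @strings;
    my $i     = 0;
    CHAR: while ($i < length $first) {
        my $ch = substr $first, $i, 1;
        for my $s (@strings) {
            # Stop at the end of a shorter string or at the first mismatch.
            last CHAR if $i >= length($s) || substr($s, $i, 1) ne $ch;
        }
        $i++;
    }
    return substr $first, 0, $i;    # the first $i characters
}

my @urls = (
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
print common_prefix_loop(@urls), "\n";    # file:///home/gms8994/Music/
```

The loop stops as soon as any string disagrees, so the cost is roughly the prefix length times the number of strings.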
From http://forums.macosxhints.com/showthread.php?t=33780
my @strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( @strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift @_;
while ($_ = shift @_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = length($1) ? substr($comm, -length($1)) : ''; # substr with offset -0 would return the whole string
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift @_;
while ($_ = shift @_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}
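The core of the XOR trick is easier to see in isolation. In this small sketch (xor_prefix_length is a name I've invented), both operands are forced to be strings, because ^ works on numbers instead if its operands are numeric.

```perl
use strict;
use warnings;

# Returns how many leading characters two strings share.
sub xor_prefix_length {
    my ($left, $right) = @_;
    my $diff = "$left" ^ "$right";     # string XOR: equal bytes become "\0"
    my ($nuls) = $diff =~ /^(\0*)/;    # leading run of NUL bytes
    return length $nuls;
}

print xor_prefix_length('abcd', 'abxy'), "\n";    # 2
```

Matching positions XOR to a NUL byte, so the length of the leading run of NULs is the length of the common prefix; common_prefix above applies the same idea to each string in turn.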