looping in perl - perl

I am writing this loop where in the initializing i am intializing using a variable and not a absolute value.Why doesn't the value change? Or is it not allowed in the for loop?
enter code here
for($i = $one, $i > $top_level,$i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
Why can't i assign $i= $one.
The variable $one keeps changing so thats the reason why i declared $i to $one.When i print the individual values of the variables the values are correct, its just that in the for loop the value is not being assigned.

Use semicolons, not commas.
for($i = $one; $i > $top_level; $i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
Avoid C-Style loops in perl, if possible.
for my $i (reverse $top_level..$one) {
print $i,"\n";
print "One is:",$one,"\n";
}

The separator inside the for() statement is ";", not ",".

You've accidentally stumbled on using for with a list, by using commas instead of semi-colons. Watch what happens when you add a third print statement:
print "\$_=[$_]\n";
with $one as 1, I get:
$_=[0]
...
$_=[]
...
$_=[1]
And that is because there are three items in the list. First of all what you wanted to do, would have looked like this:
for my $i ( $one..( $top_level - 1 )) {
In this loop, $i is set to each member of the range and then the loop is executed. If we don't suppy the variable, perl assigns it to $_.
for ( $one..( $top_level - 1 )) {
And perl does not wait to compile the list, so before it ever execute the first time it goes through all the terms.
$i = $one;
# push actual $i returned by expression $i = $one
push #a, $i;
# result = [ $i=1 ]
# push boolean false => ''
push #a, ( $i > $top_level );
# result = [ $i=1, '' ]
# push the return of $i--, NOT $i
push #a, ( $i );
# result = [ $i=1, '', 1 ]
# decrement $i
$i--;
# result = [ $i=0, '', 1 ]
You can find this out using a TIE-ed scalar:
package Monitored;
sub TIESCALAR {
my ( $class, $name, $value ) = #_;
return bless { name => $name, value => $value }, $class;
}
sub FETCH {
my $self = shift;
Carp::carp "Reading \$$self->{name}...";
return $self->{value};
}
sub STORE {
my $self = shift;
my $value = shift;
Carp::carp "Storing \$$self->{name}=${\(defined( $value ) ? $value : 'undef')} ";
$self->{value} = $value;
}
And this initialization in the mainline:
my $one = 1;
tie my $i, 'Monitored', 'i';
tie my $top_level, 'Monitored', 'top_level', 5;
for($i = $one, $i > $top_level,$i--) # line 30
{
print "*LOOP*\n";
#print "\$i=$i\n"; <-- commented out to reduce noise
#print "\$one=$one\n";
print "\$_=[$_]\n"; # line 35
}
Then in running the loop the output is:
Storing $i=1 at - line 30
Reading $top_level... at - line 30
Reading $i... at - line 30
Reading $i... at - line 30
Reading $i... at - line 30
Storing $i=0 at - line 30
*LOOP*
Reading $i... at - line 35
$_=[0]
*LOOP*
$_=[]
*LOOP*
$_=[1]
Note that only one time, at line 35, do we access $i after the looping starts.

What on earth are you trying to do? Is this a normal 'for' loop or are you trying to do something exotic. Note use of semicolon, NOT comma. Using a comma in any loop does something completely different: runs each bit of code each time.
Normally it would be:
$one = 1;
for($i = $one; $i > $top_level;$i--)
{
print $i,"\n";
print "One is:",$one,"\n";
}
It is Perl right?

Related

Scoping in Perl

As a biology student, I'm trying to extend my programming knowledge and I ran into a problem with Perl.
I'm trying to create a program that generates random DNA strings and performs analysis work on the generated data.
In the first part of the program, I am able to print out the strings stored in the array, but the second part I cannot retrieve all but one of the elements of the array.
Could this be part of the scoping rules of Perl?
#!usr/bin/perl
# generate a random DNA strings and print it to file specified by the user.
$largearray[0] = 0;
print "How many nucleotides for the string?\n";
$n = <>;
$mylong = $n;
print "how many strings?\n";
$numstrings = <>;
# #largearray =();
$j = 0;
while ( $j < $numstrings ) {
$numstring = ''; # start with the empty string;
$dnastring = '';
$i = 0;
while ( $i < $n ) {
$numstring = int( rand( 4 ) ) . $numstring; # generate a new random integer
# between 0 and 3, and concatenate
# it with the existing $numstring,
# assigning the result to $numstring.
$i++; # increase the value of $i by one.
}
$dnastring = $numstring;
$dnastring =~ tr/0123/actg/; # translate the numbers to DNA characters.
#print $dnastring;
#print "\n";
$largearray[j] = $dnastring; #append generated string to end of array
#print $largearray[j];
#print $j;
#IN HERE THERE ARE GOOD ARRAY VALUES
#print "\n";
$j++;
}
# ii will be used to continuously take the next couple of strings from largearray
# for LCS matching.
$mytotal = 0;
$ii = 0;
while ( $ii < $numstrings ) {
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
print "\n";
$ii++;
#string1 = split( //, $line );
$line = $largearray[ii];
#print $largearray[ii];
#print "\n";
$ii++;
chomp $line;
#string2 = split( //, $line );
$n = #string1; #assigning a list to a scalar just assigns the
#number of elements in the list to the scalar.
$m = #string2;
$v = 1;
$Cm = 0;
$Im = 0;
$V[0][0] = 0; # Assign the 0,0 entry of the V matrix
for ( $i = 1; $i <= $n; $i++ ) { # Assign the column 0 values and print
# String 1 See section 5.2 of Johnson
# for loops
$V[$i][0] = -$Im * $i;
}
for ( $j = 1; $j <= $m; $j++ ) { # Assign the row 0 values and print String 2
$V[0][$j] = -$Im * $j;
}
for ( $i = 1; $i <= $n; $i++ ) { # follow the recurrences to fill in the V matrix.
for ( $j = 1; $j <= $m; $j++ ) {
# print OUT "$string1[$i-1], $string2[$j-1]\n"; # This is here for debugging purposes.
if ( $string1[ $i - 1 ] eq $string2[ $j - 1 ] ) {
$t = 1 * $v;
}
else {
$t = -1 * $Cm;
}
$max = $V[ $i - 1 ][ $j - 1 ] + $t;
# print OUT "For $i, $j, t is $t \n"; # Another debugging line.
if ( $max < $V[$i][ $j - 1 ] - 1 * $Im ) {
$max = $V[$i][ $j - 1 ] - 1 * $Im;
}
if ( $V[ $i - 1 ][$j] - 1 * $Im > $max ) {
$max = $V[ $i - 1 ][$j] - 1 * $Im;
}
$V[$i][$j] = $max;
}
} #outer for loop
print $V[$n][$m];
$mytotal += $V[$n][$m]; # append current result to the grand total
print "\n";
} # end while loop
print "the average LCS value for length ", $mylong, " strings is: ";
print $mytotal/ $numstrings;
This isn't a scoping issue. You have declared none of your variables, which has the effect of implicitly making them all global and accessible everywhere in your code
I reformatted your Perl program so that I could read it, and then added this to the top of your program
use strict;
use warnings 'all';
which are essential in every Perl program you write
Then I added
no strict 'vars';
which is a very bad idea, and lets you get away without declaring any variables
The result is this
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 60.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 61.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 67.
Argument "j" isn't numeric in array element at E:\Perl\source\dna.pl line 42.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 60.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 61.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 67.
Bareword "j" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 42.
Execution of E:\Perl\source\dna.pl aborted due to compilation errors.
Line 42 (of my reformatted version) is
$largearray[j] = $dnastring
and lines 60, 61 and 67 are
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
and
$line = $largearray[ii];
You are using j and ii as array indexes. Those are Perl subroutine calls, not variables. Adding use strict would have stopped this from compiling unless you had also declared sub ii and sub j
You might get away with it if you just change j and ii to $j and $ii, but you are certain to get into further problems
Please make the same changes to your own code, and declare every variable that you need using my as close as possible to the first place they are used
You should also improve your variable naming. Things like #largearray are pointless: the # says that it's an array, and whether it's large or not is relative, and of little use in understanding your code. If you have no better description of its purpose then #table or #data are probably a little better
Likewise, please avoid capital letters and most single-letter names. #V, $Cm and $Im are meaningless, and you would need fewer comments if those names were better
You certainly wouldn't need comments like # end while loop and # outer for loop if you had indented your blocks properly and kept them short enough so that both the beginning and the end can be seen on the screen at the same time, and the fewer comments you can get away with the better, because they badly clutter the code structure
Finally, it's worth noting that the C-style for loop is rarely the best choice in Perl. Your
for ( $i = 1; $i <= $n; $i++ ) { ... }
is much clearer as
for my $i ( 1 .. $n ) { ... }
and declaring the control variable at that point makes it unnecessary to invent new names like $ii for each new loop
I think you have a typo in your code:
ii => must be $ii
don't forget to put this at the beginning of your code:
use strict;
use warnings;
in order to avoid this (and others) kind of errors

How to break out of a nested until loop, without affecting the for loop its initially in

Got a very technical coding problem with perl. Basically I am search for some specific DNA sequences.
my $seq = '...';
my #dna = split //, $seq;
my $amount = scalar #dna;
for my $index (0 .. $amount - 1){
if ($dna[$index] eq 'A' and $dna[$index+1] eq 'T' and
$dna[$index+2] eq 'G' and $dna[$index+3] eq 'C'
) {
do {
print $dna[$index++];
} until ($dna[$index] eq 'C' and $dna[$index+1] eq 'C' and $dna[$index+2] eq'G')
}
}
The problem is,is that the "until" portion is in an infinite loop. I am getting all my sequences, but for some reason the until loop wont stop looping. I have tried
until ($dna[$index] eq 'C' and $dna[$index+1] eq 'C' and $dna[$index+2] eq 'G' ){last;}
But that breaks out of the "for" loop.
Is there a way to break out of do-until loop without breaking the for loop?
Case 1
my $seq = 'xxxxxATGCyyyyyCCGzzzzz'; # Ouputs ATGCyyyyy
When every ATGC is followed by a CCG, your code works.
Case 2
my $seq = 'xxxxxATGCyyyyy'; # Infinite loop
When there's no CCG following a ATGC, you keep incrementing $index past the end of #dna, causing an infinite loop.
Case 3
my $seq = 'wwwwwATGCxxxxxATGCyyyyyCCGzzzzz'; # Ouputs ATGCxxxxxATGCyyyyy and ATGCyyyyy
You start looking for another match where the previous match started rather than where it ended, so you can end up with overlapping matches. This is partly due to a bad assumption the loop will continue from the value to which you changed $index.
Solution
my $seq = '...';
my #seq = split //, $seq;
my $seq_len = #seq;
for (my $i = 0; $i < $seq_len-7; ++$i) {
if ( $seq[$i+0] eq 'A'
&& $seq[$i+1] eq 'T'
&& $seq[$i+2] eq 'G'
&& $seq[$i+3] eq 'C'
) {
my $start = $i;
$i += 4;
for (; $i < $seq_len-3; ++$i) {
if ( $seq[$i+0] eq 'C'
&& $seq[$i+1] eq 'C'
&& $seq[$i+2] eq 'G'
) {
my $end = $i;
print(join('', #seq[$start .. $end-1]), "\n");
last;
}
}
}
}
substr simplifies things.
my $seq = '...';
my $seq_len = length($seq);
for (my $i = 0; $i < $seq_len-7; ++$i) {
if (substr($seq, $i, 4) eq 'ATGC') {
my $start = $i;
$i += 4;
for (; $i < $seq_len-3; ++$i) {
if (substr($seq, $i, 3) eq 'CCG') {
my $end = $i;
print(substr($seq, $start, $end-$start), "\n");
last;
}
}
}
}
But regular expressions simply things much further.
my $seq = '...';
while ($seq =~ / ( ATGC .*? ) (?= CCG ) /xsg) {
print("$1\n");
}
If you wanted to output the remainder of the string when there is no CCG, you can use the following:
my $seq = '...';
while ($seq =~ / ( ATGC (?:(?! CCG ).)* ) /xsg) {
print("$1\n");
}
As mentioned in perlsyn
For "last", you have to be more elaborate:
LOOP: {
do {
last if $x = $y**2;
# do something here
} while $x++ <= $z;
}
If the sequence doesn't contain CCG somewhere later than ATGC, the loop won't terminate. Add or $index == $#dna to the condition.
Although I definitely endorse #ikegami's regular expression-based approach, let's try to salvage what you have. Some specific issues:
Use the C-style for/foreach loop so that adjustments to $index carry over to the outer loop
You can't use last or next directly in a do {} until () or do {} while () loop -- these aren't while () or until () loops, they are do constructs and these keywords don't apply. See some of the other answers with respect to wrapping the do construct in a one-shot loop to make these keywords viable.
You need to deal with the possiblity of running out of data before finding the things you're looking for.
Here's my rework of your code to get it to run and clear #ikegami's test hurdles:
use constant { START => 'ATGC', STOP => 'CCG' };
use constant { START_LENGTH => length(START), STOP_LENGTH => length(STOP) };
my $sequence = 'wwwwwATGCxxxxxATGCyyyyyCCGzzzzz';
my #dna = split //, $sequence;
for (my $index = 0; $index < #dna - START_LENGTH; $index++) {
if (join('', #dna[$index .. $index + START_LENGTH - 1]) eq START) {
do {
print $dna[$index++];
} until ($index > $#dna or ($index < #dna - STOP_LENGTH and join('', #dna[$index .. $index + STOP_LENGTH - 1]) eq STOP));
print("\n");
}
}
(Yes, the $#dna messes with the SO formatter as it thinks it's introducing a comment. I'm sure someone knows how to fix this and will edit this answer accordingly.)

Perl: Using Algorithm::Loops

I'm trying to construct a permutation program in Perl using the NestedLoops function. Here's my code:
use strict;
use warnings;
use Algorithm::Loops qw(NestedLoops);
my #a = 'a'..'o';
my $length = 5;
my $start = 0;
my $depth = 2;
NestedLoops([
[0..$length],
( sub {
$start = 0 if $start == $depth;
$start++;
[$start * $length..$start * $length + $length - 1]
}) x $depth,
], \&permute,);
sub permute {
my #ind = #_;
foreach my $i (#ind) {
print $a[$i];
}
print "\n";
}
So I've got an array that holds the letters 'a' to 'o' (size being 15). I'm treating the array as if it had 3 rows, so my imagination of the array is this:
abcde
fghij
klmno
Then each loop corresponds to each row... and I want to build permutations like:
afk
afl
afm
afn
afo
agk // fails here... I end up getting agg
...
It works for the first 5 values (the entire run of the lowest for loop), but then the second run fails because the last row's value of $start gets reset to 0... this is a problem because that breaks everything.
So what I want to know is, how can I keep the value of $start persistent based on the level... So what I'm asking for is essentially having constants. My loops really should look like this:
for my $a (0..5) { # 0 at this level and never change
for my $b (5..10) { # $start should be 5 at this level and never change
for my $c (10..15) { # $start should be 10 at this level and never change
permute($a, $b, $c);
}
}
}
Now, because I will have a variable length of for loops, I can't hard code each start value, so I'm looking for a way to initially create those start values, and then keep them for when the loop gets reset.
I realize this is a confusing question, so please ask questions, and I will help clarify.
You are making this harder than it has to be.
Part of the problem is that the documentation for NestedLoops doesn't go into much detail about how a subroutine reference in the first argument, will be used.
For the following examples, assume this is written somewhere above them.
use strict;
use warnings;
use Algorithm::Loops qw'NestedLoops';
Really the simplest way to call NestedLoops to get what you want is like this:
NestedLoops(
[
['a'..'e'],
['f'..'j'],
['k'..'o'],
],
\&permute
);
sub permute {
print #_, "\n";
}
If you really want the arguments to NestedLoops to be generated on the fly, I would recommend using part from List::MoreUtils.
use List::MoreUtils qw'part';
my #a = 'a'..'o';
my $length = 5;
my $index;
NestedLoops(
[
part {
$index++ / $length
} #a
],
\&permute
);
sub permute {
print #_, "\n";
}
If for some reason you want to call NestedLoops with indexes into the array, It is still easy with part.
use List::MoreUtils qw'part';
my #a = 'a'..'o';
my $length = 5;
NestedLoops(
[
part {
$_ / $length
} 0..#a-1
],
\&permute
);
sub permute {
print map { $a[$_] } #_;
print "\n";
}
Really the main problem you're having is that the two subroutine references that you give to NestedLoops are modifying the same variables, and they are both called multiple times.
The best way to fix this is to rely on the last value given to the subroutine when it is called. ( From looking at the implementation, this seems to be closer to how it was meant to be used. )
my #a = 'a'..'o';
my $length = 5;
my $depth = 3;
NestedLoops(
[
[0..$length-1],
(sub{
return unless #_;
my $last = pop;
my $part = int( $last / $length ) + 1; # current partition
my $start = $part * $length; # start of this partition
my $end = $start + $length;
[$start..$end-1] # list of variables in this partition
}) x ($depth-1)
],
\&permute
);
sub permute {
print map { $a[$_] } #_;
print "\n";
}
When you use a subroutine to generate the range of a loop, it is called every time that one of the nested loops must start. That means once for each iteration of the containing loop. Before each call $_ is set to the current value of the containing loop's variable, and the values of all the containing loop variables are passed as parameters.
To clarify this, the NestedLoops statement you have coded is equivalent to
sub loop_over {
$start = 0 if $start == $depth;
$start++;
[$start * $length..$start * $length + $length - 1]
};
NestedLoops([
[0..$length],
(\&loop_over) x $depth,
], \&permute,);
which, in raw Perl, looks something like
for my $i (0 .. $length) {
$_ = $i;
my $list = loop_over($i);
for my $j (#$list) {
$_ = $j;
my $list = loop_over($i, $j);
for my $k (#$list) {
permute($i, $j, $k);
}
}
}
so perhaps it is clearer now that your calculation of $start is wrong? It is reevaluated several times for the innermost level before execution ascends to restart the containing loop.
Since the parameters passed to the subroutine consist of all the values of the containing loop variables, the size of #_ can be checked to see for which level of the loop to generate a range. For instance, in the code above, if #_ contains two values they are $i and $j, so the values for $k must be returned; alternatively, if there is only one parameter then it is the value of $i, and the returned value must be the range for $j. So the correct value for your $start is simply the number of elements in #_ and can be set using my $start = #_;.
Using this method the subroutine can return the range for the outermost loop as well. The code looks like this
use strict;
use warnings;
use Algorithm::Loops qw(NestedLoops);
my #a = 'a'..'o';
my $length = 5;
my $start = 0;
my $depth = 2;
NestedLoops([
(sub {
$start = #_;
[$start * $length .. $start * $length + $length - 1];
}) x ($depth + 1)
], \&permute,);
sub permute {
print map { $a[$_] } #_;
print "\n";
}

Find the word with most letters in common with other words

I want Perl (5.8.8) to find out what word has the most letters in common with the other words in an array - but only letters that are in the same place. (And preferably without using libs.)
Take this list of words as an example:
BAKER
SALER
BALER
CARER
RUFFR
Her BALER is the word that has the most letters in common with the others. It matches BAxER in BAKER, xALER in SALER, xAxER in CARER, and xxxxR in RUFFR.
I want Perl to find this word for me in an arbitrary list of words with the same length and case. Seems I've hit the wall here, so help is much appreciated!
What I've tried until now
Don't really have much of a script at the moment:
use strict;
use warnings;
my #wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (#wordlist) {
my #letters = split(//, $word);
# now trip trough each iteration and work magic...
}
Where the comment is, I've tried several kinds of code, heavy with for-loops and ++ varables. Thus far, none of my attempts have done what I need it to do.
So, to better explain: What I need is to test word for word against the list, for each letterposition, to find the word that has the most letters in common with the others in the list, at that letter's position.
One possible way could be to first check which word(s) has the most in common at letter-position 0, then test letter-position 1, and so on, until you find the word that in sum has the most letters in common with the other words in the list. Then I'd like to print the list like a matrix with scores for each letterposition plus a total score for each word, not unlike what DavidO suggest.
What you'd in effect end up with is a matrix for each words, with the score for each letter position, and the sum total score fore each word in the matrix.
Purpose of the Program
Hehe, I might as well say it: The program is for hacking terminals in the game Fallout 3. :D My thinking is that it's a great way to learn Perl while also having fun gaming.
Here's one of the Fallout 3 terminal hacking tutorials I've used for research: FALLOUT 3: Hacking FAQ v1.2, and I've already made a program to shorten the list of words, like this:
#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings;
my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my #checkletters = split(//, $checkword); #/
my #wordlist = qw(
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
);
print "$checkword has $match letters in common with:\n";
foreach my $word (#wordlist) {
next if $word eq $checkword;
my #letters = split(//, $word);
my $length = #letters; # determine length of array (how many letters to check)
my $eq_letters = 0; # reset to 0 for every new word to be tested
for (my $i = 0; $i < $length; $i++) {
if ($letters[$i] eq $checkletters[$i]) {
$eq_letters++;
}
}
if ($eq_letters == $match) {
print "$word\n";
}
}
# Now to make a script on to find the best word to check in the first place...
This script will yield CONSTRUCTION and TRANSMISSION as its result, just as in the game FAQ. The trick to the original question, though (and the thing I didn't manage to find out on my own), is how to find the best word to try in the first place, i.e. APPRECIATION.
OK, I've now supplied my own solution based on your help, and consider this thread closed. Many, many thanks to all the contributers. You've helped tremendously, and on the way I've also learned a lot. :D
Here's one way. Having re-read your spec a couple of times I think it's what you're looking for.
It's worth mentioning that it's possible there will be more than one word with an equal top score. From your list there's only one winner, but it's possible that in longer lists, there will be several equally winning words. This solution deals with that. Also, as I understand it, you count letter matches only if they occur in the same column per word. If that's the case, here's a working solution:
use 5.012;
use strict;
use warnings;
use List::Util 'max';
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
my #scores;
foreach my $word ( #words ) {
my $score;
foreach my $comp_word ( #words ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
say "Words with most matches:";
say for #words[#max_ixs];
This solution counts how many times per letter column each word's letters match other words. So for example:
Words: Scores: Because:
ABC 1, 2, 1 = 4 A matched once, B matched twice, C matched once.
ABD 1, 2, 1 = 4 A matched once, B matched twice, D matched once.
CBD 0, 2, 1 = 3 C never matched, B matched twice, D matched once.
BAC 0, 0, 1 = 1 B never matched, A never matched, C matched once.
That gives you the winners of ABC and ABD, each with a score of four positional matches. Ie, the cumulative times that column one, row one matched column one row two, three, and four, and so on for the subsequent columns.
It may be able to be optimized further, and re-worded to be shorter, but I tried to keep the logic fairly easy to read. Enjoy!
UPDATE / EDIT
I thought about it and realized that though my existing method does exactly what your original question requested, it did it in O(n^2) time, which is comparatively slow. But if we use hash keys for each column's letters (one letter per key), and do a count of how many times each letter appears in the column (as the value of the hash element), we could do our summations in O(1) time, and our traversal of the list in O(n*c) time (where c is the number of columns, and n is the number of words). There's some setup time too (creation of the hash). But we still have a big improvement. Here is a new version of each technique, as well as a benchmark comparison of each.
use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;
my #words = qw/
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
/;
# Just a test run for each solution.
my( $top, $indexes_ref );
($top, $indexes_ref ) = find_top_matches_force( \#words );
print "Testing force method: $top matches.\n";
print "#words[#$indexes_ref]\n";
( $top, $indexes_ref ) = find_top_matches_hash( \#words );
print "Testing hash method: $top matches.\n";
print "#words[#$indexes_ref]\n";
my $count = 20000;
cmpthese( $count, {
'Hash' => sub{ find_top_matches_hash( \#words ); },
'Force' => sub{ find_top_matches_force( \#words ); },
} );
sub find_top_matches_hash {
my $words = shift;
my #scores;
my $columns;
my $max_col = max( map { length $_ } #{$words} ) - 1;
foreach my $col_idx ( 0 .. $max_col ) {
$columns->[$col_idx]{ substr $_, $col_idx, 1 }++
for #{$words};
}
foreach my $word ( #{$words} ) {
my $score = sum(
map{
$columns->[$_]{ substr $word, $_, 1 } - 1
} 0 .. $max_col
);
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
sub find_top_matches_force {
my $words = shift;
my #scores;
foreach my $word ( #{$words} ) {
my $score;
foreach my $comp_word ( #{$words} ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if
substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
The output is:
Testing force method: 39 matches.
APPRECIATION
Testing hash method: 39 matches.
APPRECIATION
Rate Force Hash
Force 2358/s -- -74%
Hash 9132/s 287% --
I realize your original spec changed after you saw some of the other options provided, and that's sort of the nature of innovation to a degree, but the puzzle was still alive in my mind. As you can see, my hash method is 287% faster than the original method. More fun in less time!
As a starting point, you can efficiently check how many letters they have in common with:
$count = ($word1 ^ $word2) =~ y/\0//;
But that's only useful if you loop through all possible pairs of words, something that isn't necessary in this case:
use strict;
use warnings;
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
# you want a hash to indicate which letters are present how many times in each position:
my %count;
for my $word (#words) {
my #letters = split //, $word;
$count{$_}{ $letters[$_] }++ for 0..$#letters;
}
# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:
my %max_common_letters_count;
my %max_common_letters_words;
for my $word (#words) {
my #letters = split //, $word;
my $total;
for my $position (0..$#letters, 'total') {
my $count;
if ( $position eq 'total' ) {
$count = $total;
}
else {
$count = $count{$position}{ $letters[$position] } - 1;
$total += $count;
}
if ( ! $max_common_letters_count{$position} || $count >= $max_common_letters_count{$position} ) {
if ( $max_common_letters_count{$position} && $count == $max_common_letters_count{$position} ) {
push #{ $max_common_letters_words{$position} }, $word;
}
else {
$max_common_letters_count{$position} = $count;
$max_common_letters_words{$position} = [ $word ];
}
}
}
}
# then show the maximum words for each position and in total:
for my $position ( sort { $a <=> $b } grep $_ ne 'total', keys %max_common_letters_count ) {
printf( "Position %s had a maximum of common letters of %s in words: %s\n",
$position,
$max_common_letters_count{$position},
join(', ', #{ $max_common_letters_words{$position} })
);
}
printf( "The maximum total common letters was %s in words(s): %s\n",
$max_common_letters_count{'total'},
join(', ', #{ $max_common_letters_words{'total'} })
);
Here's a complete script. It uses the same idea that ysth mentioned (although I had it independently). Use bitwise xor to combine the strings, and then count the number of NULs in the result. As long as your strings are ASCII, that will tell you how many matching letters there were. (That comparison is case sensitive, and I'm not sure what would happen if the strings were UTF-8. Probably nothing good.)
use strict;
use warnings;
use 5.010;
use List::Util qw(max);
sub findMatches
{
my ($words) = #_;
# Compare each word to every other word:
my #matches = (0) x #$words;
for my $i (0 .. $#$words-1) {
for my $j ($i+1 .. $#$words) {
my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;
$matches[$i] += $m;
$matches[$j] += $m;
}
}
# Find how many matches in the best word:
my $max = max(#matches);
# Find the words with that many matches:
my #wanted = grep { $matches[$_] == $max } 0 .. $#matches;
wantarray ? #$words[#wanted] : $words->[$wanted[0]];
} # end findMatches
my #words = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
say for findMatches(\#words);
Haven't touched perl in a while, so pseudo-code it is. This isn't the fastest algorithm, but it will work fine for a small amount of words.
totals = new map #e.g. an object to map :key => :value
for each word a
for each word b
next if a equals b
totals[a] = 0
for i from 1 to a.length
if a[i] == b[i]
totals[a] += 1
end
end
end
end
return totals.sort_by_key.last
Sorry about the lack of perl, but if you code this into perl, it should work like a charm.
A quick note on run-time: this will run in time number_of_words^2 * length_of_words, so on a list of 100 words, each of length 10 characters, this will run in 100,000 cycles, which is adequate for most applications.
Here's a version that relies on transposing the words in order to count the identical characters. I used the words from your original comparison, not the code.
This should work with any length words, and any length list. Output is:
Word score
---- -----
BALER 12
SALER 11
BAKER 11
CARER 10
RUFFR 4
The code:
use warnings;
use strict;
my #w = qw(BAKER SALER BALER CARER RUFFR);
my #tword = t_word(#w);
my #score;
push #score, str_count($_) for #tword;
#score = t_score(#score);
my %total;
for (0 .. $#w) {
$total{$w[$_]} = $score[$_];
}
print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total{$_}\n" for (sort { $total{$b} <=> $total{$a} } keys %total);
# transpose the words
sub t_word {
my #w = #_;
my #tword;
for my $word (#w) {
my $i = 0;
while ($word =~ s/(.)//) {
$tword[$i++] .= $1;
}
}
return #tword;
}
# turn each character into a count
sub str_count {
my $str = uc(shift);
while ( $str =~ /([A-Z])/ ) {
my $chr = $1;
my $num = () = $str =~ /$chr/g;
$num--;
$str =~ s/$chr/$num /g;
}
return $str;
}
# sum up the character counts
# while reversing the transpose
sub t_score {
my #count = #_;
my #score;
for my $num (#count) {
my $i = 0;
while( $num =~ s/(\d+) //) {
$score[$i++] += $1;
}
}
return #score;
}
Here is my attempt at an answer. This will also allow you to see each individual match if you need it. (ie. BALER matches 4 characters in BAKER). EDIT: It now catches all matches if there is a tie between words (I added "CAKER" to the list to test).
#! usr/bin/perl
use strict;
use warnings;
my #wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);
my %wordcomparison;
#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (#wordlist) {
my #letters = split(//, $word);
foreach my $otherword (#wordlist) {
my $count;
next if $otherword eq $word;
my #otherwordletters = split (//, $otherword);
foreach my $i (0..$#letters) {
$count++ if ( $letters[$i] eq $otherwordletters[$i] );
}
$wordcomparison{"$word"}{"$otherword"} = $count;
}
}
# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) {
foreach ( sort keys %{ $wordcomparison{$word} }) {
$wordcomparison{$word}{total} += $wordcomparison{$word}{$_};
}
}
#Want $word with highest total
my #max_match = (sort { $wordcomparison{$b}{total} <=> $wordcomparison{$a}{total} } keys %wordcomparison );
#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (#max_match) {
print "$_\n" if ($wordcomparison{$_}{total} >= $wordcomparison{$maximum}{total} )
}
The output is simply: CAKER BALER and BAKER.
The hash %wordcomparison looks like:
'SALER'
{
'RUFFR' => 1,
'BALER' => 4,
'BAKER' => 3,
'total' => 11,
'CARER' => 3
};
You can do this, using a dirty regex trick to execute code if a letter matches in its place, but not otherwise, thankfully it's quite easy to build the regexes as you go:
An example regular expression is:
(?:(C(?{ $c++ }))|.)(?:(A(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)(?:(E(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)
This may or may not be fast.
use 5.12.0;
use warnings;
use re 'eval';
my #words = qw(BAKER SALER BALER CARER RUFFR);
my ($best, $count) = ('', 0);
foreach my $word (#words) {
our $c = 0;
foreach my $candidate (#words) {
next if $word eq $candidate;
my $regex_str = join('', map {"(?:($_(?{ \$c++ }))|.)"} split '', $word);
my $regex = qr/^$regex_str$/;
$candidate =~ $regex or die "did not match!";
}
say "$word $c";
if ($c > $count) {
$best = $word;
$count = $c;
}
}
say "Matching: first best: $best";
Using xor trick will be fast but assumes a lot about the range of characters you might encounter. There are many ways in which utf-8 will break with that case.
Many thanks to all the contributers! You've certainly shown me that I still have a lot to learn, but you have also helped me tremendously in working out my own answer. I'm just putting it here for reference and possible feedback, since there are probably better ways of doing it. To me this was the simplest and most straight forward approach I could find on my own. Enjøy! :)
#!/usr/bin/perl
use strict;
use warnings;
# a list of words for testing
my #list = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
# populate two dimensional array with the list,
# so we can compare each letter with the other letters on the same row more easily
my $list_length = #list;
my #words;
for (my $i = 0; $i < $list_length; $i++) {
my #letters = split(//, $list[$i]);
my $letters_length = #letters;
for (my $j = 0; $j < $letters_length; $j++) {
$words[$i][$j] = $letters[$j];
}
}
# this gives a two-dimensionla array:
#
# #words = ( ["B", "A", "K", "E", "R"],
# ["S", "A", "L", "E", "R"],
# ["B", "A", "L", "E", "R"],
# ["C", "A", "R", "E", "R"],
# ["R", "U", "F", "F", "R"],
# );
# now, on to find the word with most letters in common with the other on the same row
# add up the score for each letter in each word
my $word_length = #words;
my #letter_score;
for my $i (0 .. $#words) {
for my $j (0 .. $#{$words[$i]}) {
for (my $k = 0; $k < $word_length; $k++) {
if ($words[$i][$j] eq $words[$k][$j]) {
$letter_score[$i][$j] += 1;
}
}
# we only want to add in matches outside the one we're testing, therefore
$letter_score[$i][$j] -= 1;
}
}
# sum each score up
my #scores;
for my $i (0 .. $#letter_score ) {
for my $j (0 .. $#{$letter_score[$i]}) {
$scores[$i] += $letter_score[$i][$j];
}
}
# find the highest score
my $max = $scores[0];
foreach my $i (#scores[1 .. $#scores]) {
if ($i > $max) {
$max = $i;
}
}
# and print it all out :D
for my $i (0 .. $#letter_score ) {
print "$list[$i]: $scores[$i]";
if ($scores[$i] == $max) {
print " <- best";
}
print "\n";
}
When run, the script yields the following:
BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}