How to tell if a string can be transformed into another string by removing or uppercasing lowercase letters? - perl

In perl, I have two input strings, for this example, ahueFFggLKy and HFFGLK. I want to be able to iterate through all of the possible combinations of my input without lowercase letter groups (a, h, u, e, g...ah, au...hegy, etc) so in each iteration lowercase letters are removed and the remaining lowercase letters are uppercased:
ah:
ueFFggLKy (UEFFGGLKY)
^^
au:
h eFFggLKy (HEFFGGLKY)
^ ^
hegy:
a u FF gLKy | a u FFg LKy (AUFFGLKY)
^ ^ ^ | ^ ^ ^
auegy:
h FF gLK | h FFg LK (HFFGLK)
^ ^^ ^ ^ ^ ^^ ^ ^ -^--^-
The last option (auegy) is the answer, and I want to be able to iterate over letters to determine if I am able to convert ahueFFggLKy to HFFGLK without modifying any of the capital letters. This example would return "YES".
If inputs like fOoBar and BAR come up, I am not successfully able to convert fOoBar to BAR because the O in fOoBar is capitalized. My program would return "NO".
Can someone provide me with a perl example of how this would be done?

I think I have understood your requirement: the first string may be transformed by either deleting or upper-casing any lower-case letter, and you wish to know whether the second string can be derived from the first in this way.
I suggest that you can transform the second string to a regex pattern to achieve this. If every upper-case letter in the second string must be matched by the corresponding upper or lower-case letter in the first, with any number of intervening lower-case letters, then the transformation is possible. Otherwise it is not.
This program implements the idea:
use strict;
use warnings 'all';
use feature 'say';
my @pairs = (
    [ qw/ ahueFFggLKy HFFGLK / ],
    [ qw/ fOoBar BAR / ],
);
for my $pair ( @pairs ) {
    my ($s1, $s2) = @$pair;
    printf "%s => %s -- %s\n", $s1, $s2, contains($s1, $s2) ? 'YES' : 'NO';
}
sub contains {
    my ($s1, $s2) = @_;
    my $re = join ' \p{Ll}* ', map { "(?i: $_ )" } $s2 =~ /\p{Lu}/g;
    $re = qr/ ^ \p{Ll}* $re \p{Ll}* $ /x;
    $s1 =~ $re;
}
output
ahueFFggLKy => HFFGLK -- YES
fOoBar => BAR -- NO
To read an array like @pairs from STDIN you could write something like this:
my @pairs;
{
    local $/;
    my @input = split ' ', <>;
    push @pairs, [ splice @input, 0, 2 ] while @input > 1;
}
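To make the regex idea in this answer easier to inspect, here is a minimal sketch that builds the pattern as a plain string and prints it. The helper name pattern_for is hypothetical (it is not part of the program above), and the pattern is written without /x, so it carries no padding spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the pattern source for a target string.
# Each upper-case letter of the target becomes a case-insensitive group,
# and \p{Ll}* around the groups allows any run of deleted lower-case letters.
sub pattern_for {
    my ($target) = @_;
    my @groups = map { "(?i:$_)" } $target =~ /\p{Lu}/g;
    return '^\p{Ll}*' . join('\p{Ll}*', @groups) . '\p{Ll}*$';
}

for my $pair ( [ 'ahueFFggLKy', 'HFFGLK' ], [ 'fOoBar', 'BAR' ] ) {
    my ($source, $target) = @$pair;
    my $pattern = pattern_for($target);
    printf "%s\n  %s => %s\n", $pattern, $source,
        $source =~ /$pattern/ ? 'YES' : 'NO';
}
```

Printing the pattern shows why fOoBar fails: the upper-case O can only be consumed by a group for the letter O, and BAR contains none.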

A rather inelegant solution, but it seems to output what you need.
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ all };
my ($str1, $str2) = qw( ahueFFggLKy HFFGLK );
my @small_indices;
push @small_indices, pos($str1) - 1 while $str1 =~ /[[:lower:]]/g;
my @present = (0) x @small_indices;
until (all { $_ } @present) {
my $try = $str1;
for my $i (reverse 0 .. $#present) {
substr $try, $small_indices[$i], 1,
$present[$i] ? substr $str1, $small_indices[$i], 1
: q();
}
if (uc $try eq $str2) {
print $present[$_] ? q() : substr $str1, $small_indices[$_], 1
for 0 .. $#present;
print ":\n";
my $j = 0;
for my $i (0 .. length($str1) - 1) {
my $char = substr $str1, $i, 1;
if ($char eq uc $char || $present[$j++]) {
print $char;
} else {
print '.';
}
}
print "\n";
}
my $idx = 0;
$present[$idx] = 0, ++$idx while $present[$idx];
$present[$idx] = 1;
}
It builds an indicator function @present, which says which lowercase letters are present in the string. All possible values of @present are iterated by adding 1 to the binary number corresponding to the function.
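The binary-counter idea can also be sketched with an integer bitmask instead of an array of flags. This is only an illustration under stated assumptions: the name can_transform is hypothetical, bit i of the mask is taken to mean "delete the i-th lowercase letter", and the search is exponential in the number of lowercase letters, so it suits short inputs only.

```perl
use strict;
use warnings;

# Hypothetical helper: try every subset of lowercase positions to delete.
sub can_transform {
    my ($from, $to) = @_;
    my @chars = split //, $from;
    my @lower = grep { $chars[$_] =~ /[[:lower:]]/ } 0 .. $#chars;

    # Counting from 0 to 2**n - 1 visits every on/off combination once.
    for my $mask ( 0 .. 2**scalar(@lower) - 1 ) {
        my @keep = (1) x @chars;                  # keep every position by default
        for my $bit ( 0 .. $#lower ) {
            # A set bit means: delete that lowercase letter.
            $keep[ $lower[$bit] ] = 0 if ( $mask >> $bit ) & 1;
        }
        my $candidate = uc( join '', @chars[ grep { $keep[$_] } 0 .. $#chars ] );
        return 1 if $candidate eq $to;
    }
    return 0;
}

printf "%s => %s -- %s\n", @$_, can_transform(@$_) ? 'YES' : 'NO'
    for [ 'ahueFFggLKy', 'HFFGLK' ], [ 'fOoBar', 'BAR' ];
```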

Related

Perl - longest common prefix of 2 or more strings?

How can I create a Perl subroutine that takes an array and finds the longest common prefix shared by two or more of its elements (strings)?
I have this code:
sub longest_common_prefix {
    $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}
But it only works if you are looking for the longest common prefix of all strings.
For example, if I pass an array with the following strings:
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
I want it to return aaa as the answer.
Thanks!
I'd use a modified trie.
Normally, one could use the following to add to a trie:
sub add {
my $p = \shift;
my $s = shift;
$p = \( $$p->{$_} ) for split(//, $s);
$$p->{''} = 1;
}
But we need two modifications:
All prefixes of a string must be added when adding a string. For example, adding abc should also add a and ab to the trie.
When adding to the trie, we want to return the length of previously-existing part of the path taken.
So we need:
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
$p = \( $$p->{$_} );
++$cp_len if $$p->{''};
$$p->{''} = 1;
}
return $cp_len;
}
Combine (an optimized version of) this with an algorithm to find the longest strings in a list and with an algorithm to remove duplicate strings from a list to get the following solution:
use strict;
use warnings;
use feature qw( say );
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
++$cp_len if exists($$p->{$_});
$p = \( $$p->{$_} );
}
return $cp_len;
}
my $t;
my $lcp_len = 0; # lcp = longest common prefix
my %lcps;
while (<>) {
chomp;
my $cp_len = add($t, $_)
or next;
if ($cp_len >= $lcp_len) {
if ($cp_len > $lcp_len) {
$lcp_len = $cp_len;
%lcps = ();
}
$lcps{ substr($_, 0, $cp_len) } = 1;
}
}
my @lcps = sort keys %lcps;
if (@lcps) {
say "Longest common prefix(es): @lcps";
} else {
say "No common prefix";
}
Data:
abc
abc
abcd
abcde
hijklx
hijkly
mnopqx
mnopqy
Output:
Longest common prefix(es): hijkl mnopq
The time taken by the above is proportional to the number of input characters.
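To see what add() returns for each line, here is a small sketch that feeds it the four strings from the question. The sub is copied from the solution above; the loop and the printed wording are only illustrative.

```perl
use strict;
use warnings;

# Same add() as in the solution above: walk the trie one character at a
# time and count how many leading characters were already present.
sub add {
    my $p = \shift;
    my $s = shift;
    my $cp_len = 0;
    for (split(//, $s)) {
        ++$cp_len if exists($$p->{$_});
        $p = \( $$p->{$_} );
    }
    return $cp_len;
}

my $trie;
for my $word (qw( aaaBGFB aaaJJJJ jjfkBBB aaaHGHG )) {
    my $shared = add($trie, $word);
    printf "%-8s shares %d leading character(s) with earlier input\n",
        $word, $shared;
}
```

The second and fourth strings report 3, which corresponds to the aaa prefix the question asks for.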
One way would be to store the information in a hash. In this example, I set the hash key to the length of each prefix, and the value being the actual prefix found.
Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort() takes care of finding the longest one).
The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()ed into a scalar and put into the hash.
use warnings;
use strict;
my %prefixes;
while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}
my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";
__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
Output:
aaa
You can keep a hash of an array of words keyed by the first character. By definition, if you have words starting with the same letter, those words share at least a one character common prefix of that one letter. Then reduce to the single longest prefix by stepping through the hash by character:
use strict; use warnings;
sub lcp {
(join("\0", @_) =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx)[0];
}
my %HoA;
my $longest='';
while (my $line=<DATA>){
$line =~ s/^\s+|\s+$//g ;
push @{ $HoA{substr $line, 0, 1} }, $line if $line=~/^[a-zA-Z]/;
}
for my $key ( sort (keys %HoA )) {
if (scalar @{ $HoA{$key} } > 1){
my $lon=lcp(@{ $HoA{$key} });
my $s = join ', ', map { qq/"$_"/ } @{ $HoA{$key} };
print "lcp: \"$lon\" for ($s)\n";
if (length($lon) > length($longest)) {
$longest=$lon;
}
}
else{
print "$key: no common prefix\n";
}
}
print "\nlongest common prefix is \"$longest\"\n";
__DATA__
aardvark
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
interspecies
interstellar
interstate
Prints:
lcp: "aa" for ("aardvark", "aaaBGFB", "aaaJJJJ", "aaaHGHG")
lcp: "inters" for ("interspecies", "interstellar", "interstate")
j: no common prefix
longest common prefix is "inters"
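The lcp() regex is the least obvious part of this answer, so here is a minimal sketch that runs it on its own. It assumes, as the answer does, that the input strings never contain a NUL character.

```perl
use strict;
use warnings;

# Same regex as lcp() above: join the words with NUL, capture a leading
# run in $1, and require every later word to start with that same run (\1).
# Backtracking shortens $1 until all words agree.
sub lcp {
    return ( join( "\0", @_ ) =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx )[0];
}

print lcp(qw( interspecies interstellar interstate )), "\n";    # inters
print lcp(qw( aardvark aaaBGFB aaaJJJJ aaaHGHG )), "\n";        # aa
```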

Splitting and tallying substrings within mixed integer-string data

Input Data (example):
40A3B35A3C
30A5B28A2C2B
Desired output (per-line) is a single number determined by the composition of the code 40A3B35A3C and the following rules:
if A - add the preceding number to the running total
if B - add the preceding number to the running total
if C - subtract the preceding number from the running total
40A 3B 35A 3C would thus produce 40 + 3 + 35 - 3 = 75.
Output from both lines:
75
63
Is there an efficient way to achieve this for a particular column (such as $F[2]) in a tab-delimited .txt file using a one-liner? I have considered splitting the entire code into individual characters, then performing if statement checks to detect A/B/C, but my Perl knowledge is limited and I am unsure how to go about this.
When you use split with a capture, the captured group is returned from split, too.
perl -lane '
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * ("C" eq $op ? -1 : 1) while ($n, $op) = splice @ar, 0, 2;
print $s
' < input
Or maybe more declarative:
BEGIN { %one = ( A => 1,
B => 1,
C => -1 ) }
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * $one{$op} while ($n, $op) = splice @ar, 0, 2;
print $s
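To show what split returns when the separator is captured, here is a small standalone sketch. The helper name tally is hypothetical; it wraps the same splice loop as the one-liners above.

```perl
use strict;
use warnings;

# With a capturing group in the pattern, split returns the separators too.
my @parts = split /([ABC])/, '40A3B35A3C';
print join( ',', @parts ), "\n";    # 40,A,3,B,35,A,3,C

# Hypothetical helper: consume the list two items at a time (number, letter).
sub tally {
    my ($code) = @_;
    my @pieces = split /([ABC])/, $code;
    my $sum = 0;
    while ( my ( $n, $op ) = splice @pieces, 0, 2 ) {
        $sum += $op eq 'C' ? -$n : $n;
    }
    return $sum;
}

print tally($_), "\n" for qw( 40A3B35A3C 30A5B28A2C2B );    # 75 then 63
```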
When working through a string like this, it's useful to know that regular expressions can return a list of results.
E.g.
my @matches = $str =~ m/(\d+[A-C])/g; #will catch repeated instances
So you can do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my $total;
#break the string into digit+letter groups.
for (m/(\d+[A-C])/g) {
#separate out this group into num and code.
my ( $num, $code ) = m/(\d+)([A-C])/;
print "\t",$num, " => ", $code, "\n";
if ( $code eq "C" ) {
$total -= $num;
}
else {
$total += $num;
}
}
print $total, " => ", $_;
}
__DATA__
40A3B35A3C
30A5B28A2C2B
perl -lne 'push @a,/([\d]+)[AB]/g;
push @b,/([\d]+)[C]/g;
$sum+=$_ for(@a);$sum-=$_ for(@b);
print $sum;@a=@b=();undef $sum' Your_file
How it works:
use the command line arg as the input
set the hash "%op" to the operations per letter
substitute the letters for operators in the input
evaluate the substituted input as an expression
use strict;
use warnings;
my %op=qw(A + B + C -);
$ARGV[0] =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
print eval($ARGV[0]);
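Here is a sketch of the same substitute-then-eval idea with the intermediate expression made visible. The name evaluate_code and the input check are assumptions added for illustration; since eval runs arbitrary Perl, the check rejects anything that is not made purely of number+letter groups.

```perl
use strict;
use warnings;

my %op = qw(A + B + C -);

# Hypothetical helper: return the generated expression and its value.
sub evaluate_code {
    my ($code) = @_;
    # Guard: refuse input that is not solely digit groups followed by A, B or C.
    die "unexpected input: $code\n" unless $code =~ /\A(?:\d+[ABC])+\z/;
    ( my $expr = $code ) =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
    return ( $expr, eval $expr );
}

for my $code (qw( 40A3B35A3C 30A5B28A2C2B )) {
    my ( $expr, $value ) = evaluate_code($code);
    print "$code -> '$expr' = $value\n";
}
```

For 40A3B35A3C the generated expression is "+ 40+ 3+ 35- 3", which evaluates to 75.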

Alternate between upper and lowercase, PERL

I want to alternate between upper and lower case, however I only managed to get the whole string upper or lower, or the first character.
I have not found a proper function to execute what I need. Please have a look and help me out. Cheers.
#!/usr/bin/perl
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my @myarray = split("", $mystring);
print ucfirst("@myarray");
A more general approach using function factory
use strict;
use warnings;
sub periodic {
    my @subs = @_;
    my $i = 0;
    return sub {
        $i = 0 if $i > $#subs;
        return $subs[$i++]->(@_);
    };
}
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my $f = periodic(
sub { uc pop },
sub { lc pop },
# sub { .. },
# sub { .. },
);
$mystring =~ s/([a-z])/ $f->($1) /egi;
print $mystring, "\n";
output
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAsE
How about:
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my @myarray = split("", $mystring);
my $cnt = 1;
for (@myarray) {
    next unless /[a-z]/i;
    $_ = ($cnt%2 ? uc($_) : lc($_));
    $cnt++;
}
say join('',@myarray);
Output:
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAsE
My first thought was to use a regex substitution. Try this:
use strict;
use warnings;
my $str = "this string, I will change";
# Ignore whitespace and punctuation.
$str =~ s/(\w)(\w)/\L$1\U$2/g;
# Or include all characters in the uc/lc alternation.
# $str =~ s/(.)(.)/\L$1\U$2/g;
print $str, "\n";
If, for some reason, you wish to avoid regexes, try:
my $str = "this string, I will change";
my @ary;
my $count = 0;
for my $glyph ( split //, lc $str ) {
    $glyph = uc $glyph if $count % 2;
    push @ary, $glyph;
    $count++;
}
print join( "", @ary ), "\n";
Try this:
use strict;
use warnings;
use 5.016;
use Data::Dumper;
my $str = 'hello';
my $x = 0;
$str =~ s/(.)/($x++ % 2 == 0) ? "\U$1" : "\L$1"/eg;
say $str;
--output:--
HeLlO
Save script below with name alter.pl
#!/usr/bin/perl
print $ARGV[0] =~ s/([a-z])([^a-z]*)([a-z])/uc($1).$2.lc$3/egri
And run script by command
$ perl alter.pl "this is my string I want each character to alternate between upper and lowercase"
Output
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAse
You have some good answers already but I thought I'd chip in because I hadn't seen map yet.
print map { $c++ % 2 ? lc : uc } split ( //, $mystring );
splits $mystring into characters (split //);
uses map to apply a function to each letter.
uses $c++ to autoincrement, then take a modulo 2 to decide if this should be uppercase or lower case.
join the resultant array.
Gives:
#!c:\Strawberry\perl\bin
use strict;
use warnings;
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my $c;
print join ( "", map { $c++ % 2 ? lc : uc } split ( //, $mystring ));
Prints:
ThIs iS My sTrInG I WaNt eAcH ChArAcTeR To aLtErNaTe bEtWeEn uPpEr aNd lOwErCaSe
map is a useful function that applies some code to each element in a list, and then 'returns' the list that's produced. So if we treat your string as a list of characters, it works nicely.
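The map version above counts every character, including spaces, which is why its output differs from the other answers. A minimal sketch of a variant that advances the counter only on letters follows; the name alternate_letters is hypothetical and the letter test is limited to ASCII a-z.

```perl
use strict;
use warnings;

# Hypothetical helper: alternate case on letters only, leaving other
# characters untouched and not counting them.
sub alternate_letters {
    my ($text) = @_;
    my $count = 0;
    return join '', map {
        my $ch = $_;
        $ch =~ /[a-z]/i
            ? ( $count++ % 2 ? lc($ch) : uc($ch) )
            : $ch;
    } split //, $text;
}

print alternate_letters('this is my string'), "\n";    # ThIs Is My StRiNg
```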
Try this. A simple if/else condition is enough for this:
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
@xz = split( '', $mystring );
for ( $i = 0; $i < scalar @xz; $i++ ) {
if ( $i % 2 ) {
print uc "$xz[$i]";
}
else {
print "$xz[$i]";
}
}

Why if/elsif in Perl execute only the first block?

I am new to Perl. I have an assignment to write a Perl program that accept a countable word from a command line and then generates its plural form. I have composed the following code below, and it shows no errors of compilation. When I execute it from the command line:
(perl plural.pl, for example), it prompts me to enter a noun, then whatever noun I feed as input, the plural form is the same. It doesn't execute the remaining if statements.
For example, if I enter the word "cat", the plural is generated as "cats". But when I enter the word "church", the plural is generated as "churchs", and "fly" becomes "flys".
Here is the code:
#!/usr/bin/perl
$suffix1 = 's';
$suffix2 = 'es';
$suffix3 = 'ies';
print "Enter a countable noun to get plural: ";
$word = <STDIN>;
chomp($word);
if(substr $word, -1 == 'b' or 'd' or 'c' or 'g' or 'r' or 'j' or 'k' or 'l' or 'm' or 'n' or 'p' or 'q' or 'r' or 't' or 'v' or 'w' or 'e' or 'i' or 'o' or 'u') {
$temp = $word.$suffix1;
print "The plural form of the word \"$word\" is: $temp \n";
}
elsif (substr $word, -1 == 's' or 'sh' or 'ch' or 'x' or 'z') {
$temp = $word.$suffix2;
print "The plural form of the word \"$word\" is: $temp \n";
}
elsif (substr $word, -1 == 'y') {
chop($word);
$temp = $word.$suffix3;
print "The plural form of the word \"$word\" is: $temp \n";
}
Could you help me make the code execute all three branches?
First of all, always use use strict; use warnings;.
Strings are compared using eq, not ==.
substr $word, -1 eq 'b' means substr $word, (-1 eq 'b') when you meant substr($word, -1) eq 'b'. You'll face lots of problems if you omit parens around function calls.
substr($word, -1) eq 'b' or 'd' means the same as (substr($word, -1) eq 'b') or ('d'). 'd' is always true. You'd need to use substr($word, -1) eq 'b' or substr($word, -1) eq 'd'. (Preferably, you'd save substr $word, -1 in a variable to avoid doing it repeatedly.)
substr $word, -1 will never equal ch or sh.
The match operator makes this easy:
if ($word =~ /[bdcgrjklmnpqrtvweiou]\z/) {
...
}
elsif ($word =~ /(?:[sxz]|[sc]h)\z/) {
...
}
elsif ($word =~ /y\z/) {
...
}
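The three branches above can be filled in as a small function. This is only a sketch: the name pluralize is hypothetical, the suffix rules are the simplified ones from the question (real English has many exceptions, such as "day"), and words that match none of the patterns yield undef.

```perl
use strict;
use warnings;

# Hypothetical helper following the three regex branches in order.
sub pluralize {
    my ($word) = @_;
    return $word . 's'  if $word =~ /[bdcgrjklmnpqrtvweiou]\z/;
    return $word . 'es' if $word =~ /(?:[sxz]|[sc]h)\z/;
    return substr( $word, 0, -1 ) . 'ies' if $word =~ /y\z/;
    return;    # no rule matched (for example a word ending in "a")
}

for my $word (qw( cat church fly )) {
    my $plural = pluralize($word);
    print "$word -> ", ( defined $plural ? $plural : '(no rule)' ), "\n";
}
```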
In Perl, we use eq for string comparison instead of ==.
You can't use or like this. It should be like if (substr($word, -1) eq 'b' or substr ($word, -1) eq 'd'). Otherwise you could use an array containing all the string that you would like to compare and grep from that array.
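The array-plus-grep suggestion might look like the following sketch. The helper name ends_with_any and the list names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word ends with any of the given endings.
sub ends_with_any {
    my ( $word, @endings ) = @_;
    return scalar grep {
        length($word) >= length($_)
            && substr( $word, -length($_) ) eq $_
    } @endings;
}

my @es_endings = qw( s sh ch x z );

for my $word (qw( church cat box )) {
    printf "%s: %s\n", $word,
        ends_with_any( $word, @es_endings ) ? 'add "es"' : 'other rule';
}
```

Because each ending is compared against a substring of its own length, two-character endings such as ch and sh work as well.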
Duskast is right. Perl uses symbols for numeric comparisons, and strings for string comparisons.
== eq
!= ne
< lt
<= le
> gt
>= ge
<=> cmp
Also, your use of or, though a good try, doesn't work. The keyword or has weak precedence, and so the expression
substr $word, -1 == 'b' or 'd' or 'c' or
'g' or 'r' or 'j' or
'k' or 'l' or 'm' or
'n' or 'p' or 'q' or
'r' or 't' or 'v' or
'w' or 'e' or 'i' or
'o' or 'u'
is interpreted as
substr ($word, (-1 == 'b')) or 'd' or 'c' or
'g' or 'r' or 'j' or
'k' or 'l' or 'm' or
'n' or 'p' or 'q' or
'r' or 't' or 'v' or
'w' or 'e' or 'i' or
'o' or 'u'
I'm not sure what the substr works out to, but if it's false, the expression continues to the or 'd', which is interpreted as true. Have you seen regular expressions yet? This is much more idiomatically done as
if ($word =~ /[bdcgrjklmnpqrtvweiou]$/) {...}
# Does $word match any of those characters followed by
# the end of the line or string?
Look in the Perl docs for string substitution and the s/.../.../ construct.
By the way, if you were paid to do this instead of being a student, you'd use the Lingua modules instead.
First of all, always, always include use strict; and use warnings;.
Second, use indentations. I've taught Perl courses at work and refuse to accept any assignment that was not indented correctly. In fact, I'm very, very strict about this because I want users to learn to code to the standard (4 space indent, etc.). It makes your program easier to read and to support.
While we're at it, break overly long lines -- especially on StackOverflow. It's hard to read a program when you have to scroll back and forth.
Quick look at your program:
In Perl, strings and numerics use two different sets of boolean operations. This is because strings can contain only digits, but still be strings. Imagine inventory item numbers like 1384 and 993. If I'm sorting these as strings, the 1384 item comes first. If I am sorting them numerically, 993 should come first. Your program has no way of knowing this except by the boolean operation you use:
Boolean Operation      Numeric  String
=================      =======  ======
Equals                 ==       eq
Not Equals             !=       ne
Greater Than           >        gt
Less Than              <        lt
Greater than/Equals    >=       ge
Less than/Equals       <=       le
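The inventory example can be checked directly. This short sketch sorts the two item numbers both ways and also shows the two equality operators disagreeing on the same pair of values.

```perl
use strict;
use warnings;

my @items = ( 1384, 993 );

# cmp compares character by character, so "1..." sorts before "9...".
my @as_strings = sort { $a cmp $b } @items;
# <=> compares numeric values.
my @as_numbers = sort { $a <=> $b } @items;

print "string order:  @as_strings\n";    # 1384 993
print "numeric order: @as_numbers\n";    # 993 1384

# The same pair can be equal as numbers but different as strings.
print "numerically equal\n" if '10' == '10.0';
print "different strings\n" if '10' ne '10.0';
```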
The other issue is that or, and, || and && each combine exactly two boolean expressions. This won't work:
if ( $a > $b or $c ) {
What this is saying is this:
if ( ( $a > $b ) or $c ) {
So, if $c is a non-zero value, then $c will be true, and the whole statement would be true. You have to do your statement this way:
if ( $a > $b or $a > $c ) {
Another thing, use qq(..) and q() when quoting strings that contain quotation marks. This way, you don't have to put a backslash in front of them.
print "The word is \"swordfish\"\n";
print qq(The word is "swordfish"\n);
And, if you use use feature qw(say); at the top of your program, you get the bonus command of say which is like print, except the ending new line is assumed:
say qq(The word is "swordfish");
When you use substr $foo, -1, you are only looking at the last character. It cannot ever be a two character string:
if ( substr $word, -1 eq "ch" ) {
will always be false.
Long ifs are hard to maintain. I would use a for loop (actually not, but let's pretend for now..):
#! /usr/bin/env perl
#
# Use these in ALL of your programs
#
use strict;
use warnings;
use feature qw(say);
#
# Use better, more descriptive names
#
my $standard_plural_suffix = 's';
my $uncommon_plural_suffix = 'es';
my $y_ending_plural_suffix = 'ies';
print "Enter a countable noun to get plural: ";
chomp (my $word = <STDIN>);
my $plural_form;
#
# Instead of a long, long "if", use a for loop for testing. Easier to maintain
#
for my $last_letter ( qw( b d c g r j k l m n p q r t v w e i o u ) ) {
if ( substr($word, -1) eq $last_letter ) {
$plural_form = $word . $standard_plural_suffix;
last;
}
}
#
# Is it an "uncommon plural" test (single character)
#
if ( not $plural_form ) {
for my $last_letter ( qw(s x z) ) {
if ( substr($word, -1) eq $last_letter ) {
$plural_form = $word . $uncommon_plural_suffix;
last;
}
}
}
#
# Is it an "uncommon plural" test (double character)
#
if ( not $plural_form ) {
for my $last_two_letters ( qw(sh ch) ) {
if ( substr($word, -2) eq $last_two_letters ) {
$plural_form = $word . $uncommon_plural_suffix;
last;
}
}
}
if ( not $plural_form ) {
if ( substr($word, -1) eq 'y' ) {
chop ( my $chopped_word = $word );
$plural_form = $chopped_word . $y_ending_plural_suffix;
}
}
if ( $plural_form ) {
say qq(The plural of "$word" is "$plural_form");
}
else {
say qq(Could not find plural form of "$word");
}
Do you know about regular expressions? Those would work a lot better than using substr because you can test multiple things at once. Plus, I wouldn't use chop, but a regular expression substitution:
#! /usr/bin/env perl
#
# Use these in ALL of your programs
#
use strict;
use warnings;
use feature qw(say);
#
# Use better, more descriptive names
#
my $standard_plural_suffix = 's';
my $uncommon_plural_suffix = 'es';
my $y_ending_plural_suffix = 'ies';
print "Enter a countable noun to get plural: ";
chomp (my $word = <STDIN>);
my $plural_form;
#
# Standard plural (adding plain ol' 's')
#
if ( $word =~ /[bdcgrjklmnpqrtvweiou]$/ ) {
$plural_form = $word . $standard_plural_suffix;
}
#
# Uncommon plural (adding es)
#
elsif ( $word =~ /([sxz]|[sc]h)$/ ) {
$plural_form = $word . $uncommon_plural_suffix;
}
#
# Final 'y' rule: Replace y with ies
#
elsif ( $word =~ /y$/ ) {
$plural_form = $word;
$plural_form =~ s/y$/ies/;
}
if ( $plural_form ) {
say qq(The plural of "$word" is "$plural_form");
}
else {
say qq(Could not find plural form of "$word");
}
I have changed your code a bit. I'm using regular expression:
#!/usr/bin/perl
$suffix1 = 's';
$suffix2 = 'es';
$suffix3 = 'ies';
print "Enter a countable noun to get plural: ";
$word = <STDIN>;
chomp($word);
if ( $word =~ m/(s|sh|ch|x|z)$/) {
$temp = $word . $suffix2;
}
elsif ( substr( $word, -1 ) eq 'y' ) {
chop($word);
$temp = $word . $suffix3;
}
else {
$temp = $word . $suffix1;
}
print "The plural form of the word \"$word\" is: $temp \n";
Also I recommend you always use strict; and use warnings;

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I apologize for the mistake. I overlooked that using a my variable inside countit(x, q{}) is a big mistake: the string is evaluated inside the Benchmark module, and @str was empty there. This solution is not as fast as I originally presented. See the correction below.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i = 0;
    my $first = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
    INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
    use_ok('LCP');
}
sub test_lcp : Test(6) {
    is( LCP::LCP(), '', 'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that the pure Perl solution using substr is about 20% faster than Roy's solution on your test case, and finding one prefix takes about 50 µs. There is no need to use XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}
my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
    my @str = @_;
    my @pos;
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}
my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
@array = split(//, $string);
(splitting on an empty pattern puts each character into its own element of the array)
Then do a loop, perhaps overall:
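$n = 0;
@array1 = split(//, $string1);
@array2 = split(//, $string2);
@array3 = split(//, $string3);
# Compare with eq (string equality) and stop at the end of the shortest string.
while ($n < @array1 && $n < @array2 && $n < @array3
       && $array1[$n] eq $array2[$n] && $array2[$n] eq $array3[$n]) {
    $n++;
}
$sameString = substr($string1, 0, $n); # $n counts the matching characters, so no n-1 is needed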
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
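The index loop described above can be sketched in Perl as follows. The name common_prefix_by_index is hypothetical. Note that Perl's substr takes a length as its third argument, so with i counting the matching characters the call is substr($s, 0, $i) rather than i-1.

```perl
use strict;
use warnings;

# Hypothetical helper: advance an index while every string agrees.
sub common_prefix_by_index {
    my @strings = @_;
    return '' unless @strings;
    my $first = $strings[0];
    my $i     = 0;
    CHAR: while ( $i < length $first ) {
        my $ch = substr $first, $i, 1;
        for my $s (@strings) {
            # Stop at the end of any string or at the first mismatch.
            last CHAR if $i >= length($s) || substr( $s, $i, 1 ) ne $ch;
        }
        $i++;
    }
    return substr $first, 0, $i;
}

print common_prefix_by_index(
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
), "\n";
```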
From http://forums.macosxhints.com/showthread.php?t=33780
my @strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( @strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,-length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,-length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /(\0*)$/) {
            # substr($comm, -0) would return the whole string, so handle "no common suffix" explicitly
            $comm = length($1) ? substr($comm, -length($1)) : '';
        } else {
            return undef;
        }
    }
    return $comm;
}
sub common_prefix {
    my $comm = shift @_;
    while ($_ = shift @_) {
        $_ = substr($_,0,length($comm)) if (length($_) > length($comm));
        $comm = substr($comm,0,length($_)) if (length($_) < length($comm));
        if (( $_ ^ $comm ) =~ /^(\0*)/) {
            $comm = substr($comm,0,length($1));
        } else {
            return undef;
        }
    }
    return $comm;
}
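The XOR trick these two subs rely on can be seen in isolation in the sketch below. The name shared_prefix_length is hypothetical. The operands are stringified on purpose so that ^ acts bytewise on strings; this also assumes the bitwise feature (which makes ^ numeric-only) is not enabled.

```perl
use strict;
use warnings;

# Hypothetical helper: count the leading bytes two strings have in common.
sub shared_prefix_length {
    my ( $left, $right ) = @_;
    # XOR of two strings yields "\0" at every position where the bytes agree.
    my $diff = "$left" ^ "$right";
    $diff =~ /^(\0*)/;    # always matches, possibly with an empty capture
    return length $1;
}

print shared_prefix_length( 'abcdef', 'abcxyz' ), "\n";    # 3
print shared_prefix_length( 'abc',    'xbc' ),    "\n";    # 0
```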