Saving a transliteration table in perl

Saving a transliteration table in perl - perl

I want to transliterate digits from 1 - 8 with 0 but not knowing the number at compile time. Since transliterations do not interpolate variables I'm doing this:
#trs = (sub{die},sub{${$_[0]} =~ tr/[0,1]/[1,0]/},sub{${$_[0]} =~ tr/[0,2]/[2,0]/},sub{${$_[0]} =~ tr/[0,3]/[3,0]/},sub{${$_[0]} =~ tr/[0,4]/[4,0]/},sub{${$_[0]} =~ tr/[0,5]/[5,0]/},sub{${$_[0]} =~ tr/[0,6]/[6,0]/},sub{${$_[0]} =~ tr/[0,7]/[7,0]/},sub{${$_[0]} =~ tr/[0,8]/[8,0]/});
and then index it like:
$trs[$character_to_transliterate](\$var_to_change);
I would appreciate if anyone can point me to a best looking solution.

Any time that you are repeating yourself, you should see if what you are doing can be done in a loop. Since tr creates its tables at compile time, you can use eval to access the compiler at runtime:
my #trs = (sub {die}, map {eval "sub {\$_[0] =~ tr/${_}0/0$_/}"} 1 .. 8);
my $x = 123;
$trs[2]($x);
print "$x\n"; # 103
There is also no need to use references here, subroutine arguments are already passed by reference.
If you do not want to use string eval, you need to use a construct that supports runtime modification. For that you can use the s/// operator:
sub subst {$_[0] =~ s/($_[1]|0)/$1 ? 0 : $_[1]/ge}
my $z = 1230;
subst $z => 2;
print "$z\n"; # 1032
The tr/// construct is faster than s/// since the latter supports regular expressions.

I'd suggest simply ditching tr in favor of something that actually permits a little bit of metaprogramming like s///. For example:
# Replace $to_swap with 0 and 0 with $to_swap, and leave
# everything else alone.
sub swap_with_0 {
my ($digit, $to_swap) = #_;
if ($digit == $to_swap) {
return 0;
} elsif ($digit == 0) {
return $to_swap;
} else {
return $digit;
}
}
# Swap 0 and $to_swap throughout $string
sub swap_digits {
my ($string, $to_swap) = #_;
$string =~ s/([0$to_swap])/swap_with_0($1, $to_swap)/eg;
return $string;
}
which is surprisingly straightforward. :)

Here's a short subroutine that uses substitution instead of transliteration:
sub swap_digits {
my ($str, $digit) = #_;
$str =~ s{ (0) | $digit }{ defined $1 ? $digit : 0 }gex;
return $str;
}

Related

Count Characters in Perl

I need to count the letter "e" in the string
$x="012ei ke ek ek ";
So far, I've tried with a for-loop:
$l=length($x);
$a=0;
for($i=0;$i<$l;$i++)
{$s=substr($x,$i,1);
if($s=="e")
{$a++;}
print $a;

Your code has some problems. You forgot to close the for loop brace,
and in Perl == is supposed to compare numbers. Use eq for strings.
It is also recommended that you use warnings and enable strict mode,
which would have helped you debugging this. In your case, since e
would be treated as 0, so the other one char substrings, 1 and 2
would be the only characters not equal to e when compared with ==. A
cleaned up version of your code could be written as:
use warnings;
use strict;
my $x = "012ei ke ek ek ";
my $l = length $x;
my $count = 0;
for(my $i = 0; $i < $l; $i++) {
my $s = substr($x, $i, 1);
$count++ if ($s eq "e");
}
print $count;
There are multiple ways to achieve this. You could use a match with a
group, which if global returns all the occurrences in list context.
Since you want the number, take this result in scalar context. You can
achieve this for example with:
my $count = () = $string =~ /(e)/g;
Or:
my $count = #{[ $string =~ /(e)/g ]}
Another way is to split the string into characters and grep those that
are e:
my $count = grep $_ eq 'e', split //, $string;
And probably the most compact is to use tr which returns the count of
characters in scalar context, although this does restrict this usage to
counting characters only:
my $count = $string =~ tr/e//;

You compare characters with the numeric operator (==) when you should use the string comparison eq. If you had used the warnings pragma you would have seen that.
You code should have looked like:
#!/usr/bin/env perl
use strict;
use warnings;
my $x = "012ei ke ek ek ";
my $l = length($x);
my $a = 0;
for ( my $i = 0; $i < $l; $i++ ) {
my $s = substr( $x, $i, 1 );
if ( $s eq "e" ) {
$a++;
}
}
print "$a\n";
Proper indentation and the use of the strict and warnings pragmas will avoid and/or catch unintentional, dumb errors.
A much more Perl-ish (and shorter) way to achieve your answer is:
perl -le '$x="012ei ke ek ek";#count=$x=~m/e/g;print scalar #count'
4
This matches globally and collects all the matches in list context. The scalar value of the list gives the number of occurrences you seek.
Another way is to use tr
perl -le '$x="012ei ke ek ek";print scalar $x=~tr/e//'
4

#sidyll Already mentioned what is the problem in your script and all of the possible ways, but TIMTOWTDI.
$x="012ei ke ek ek ";
my $count;
$count++ while($x=~/e/g);
print $count;

Perl printf to use commas as thousands-separator

Using awk, I can print a number with commas as thousands separators.
(with a export LC_ALL=en_US.UTF-8 beforehand).
awk 'BEGIN{printf("%\047d\n", 24500)}'
24,500
I expected the same format to work with Perl, but it does not:
perl -e 'printf("%\047d\n", 24500)'
%'d
The Perl Cookbook offers this solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
However I am assuming that since the printf option works in awk, it should also work in Perl.

The apostrophe format modifier is a non-standard POSIX extension.
The documentation for Perl's printf has this to say about such extensions
Perl does its own "sprintf" formatting: it emulates the C
function sprintf(3), but doesn't use it except for
floating-point numbers, and even then only standard modifiers
are allowed. Non-standard extensions in your local sprintf(3)
are therefore unavailable from Perl.
The Number::Format module will do this for you, and it takes its default settings from the locale, so is as portable as it can be
use strict;
use warnings 'all';
use v5.10.1;
use Number::Format 'format_number';
say format_number(24500);
output
24,500

A more perl-ish solution:
$a = 12345678; # no comment
$b = reverse $a; # $b = '87654321';
#c = unpack("(A3)*", $b); # $c = ('876', '543', '21');
$d = join ',', #c; # $d = '876,543,21';
$e = reverse $d; # $e = '12,345,678';
print $e;
outputs 12,345,678.

I realize this question was from almost 4 years ago, but since it comes up in searches, I'll add an elegant native Perl solution I came up with. I was originally searching for a way to do it with sprintf, but everything I've found indicates that it can't be done. Then since everyone is rolling their own, I thought I'd give it a go, and this is my solution.
$num = 12345678912345; # however many digits you want
while($num =~ s/(\d+)(\d\d\d)/$1\,$2/){};
print $num;
Results in:
12,345,678,912,345
Explanation:
The Regex does a maximal digit search for all leading digits. The minimum number of digits in a row it'll act on is 4 (1 plus 3). Then it adds a comma between the two. Next loop if there are still 4 digits at the end (before the comma), it'll add another comma and so on until the pattern doesn't match.
If you need something safe for use with more than 3 digits after the decimal, use this modification: (Note: This won't work if your number has no decimal)
while($num =~ s/(\d+)(\d\d\d)([.,])/$1\,$2$3/){};
This will ensure that it will only look for digits that ends in a comma (added on a previous loop) or a decimal.

Most of these answers assume that the format is universal. It isn't. CLDR uses Unicode information to figure it out. There's a long thread in How to properly localize numbers?.
CPAN has the CLDR::Number module:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my $locale = $ARGV[0] // 'en';
my #numbers = qw(
123
12345
1234.56
-90120
);
my $cldr = CLDR::Number->new( locale => $locale );
my $decf = $cldr->decimal_formatter;
foreach my $n ( #numbers ) {
say $decf->format($n);
}
Here are a few runs:
$ perl comma.pl
123
12,345
1,234.56
-90,120
$ perl comma.pl es
123
12.345
1234,56
-90.120
$ perl comma.pl bn
১২৩
১২,৩৪৫
১,২৩৪.৫৬
-৯০,১২০
It seems heavyweight, but the output is correct and you don't have to allow the user to change the locale you want to use. However, when it's time to change the locale, you are ready to go. I also prefer this to Number::Format because I can use a locale that's different from my local settings for my terminal or session, or even use multiple locales:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my #locales = qw( en pt bn );
my #numbers = qw(
123
12345
1234.56
-90120
);
my #formatters = map {
my $cldr = CLDR::Number->new( locale => $_ );
my $decf = $cldr->decimal_formatter;
[ $_, $cldr, $decf ];
} #locales;
printf "%10s %10s %10s\n" . '=' x 32 . "\n", #locales;
foreach my $n ( #numbers ) {
printf "%10s %10s %10s\n",
map { $_->[-1]->format($n) } #formatters;
}
The output has three locales at once:
en pt bn
================================
123 123 ১২৩
12,345 12.345 ১২,৩৪৫
1,234.56 1.234,56 ১,২৩৪.৫৬
-90,120 -90.120 -৯০,১২০

Here's an elegant Perl solution I've been using for over 20 years :)
1 while $text =~ s/(.*\d)(\d\d\d)/$1\.$2/g;
And if you then want two decimal places:
$text = sprintf("%0.2f", $text);

1 liner: Use a little loop whith a regex:
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
Example:
use strict;
use warnings;
my #numbers = (12321, 12.12, 122222.3334, '1234abc', '1.1', '1222333444555,666.77');
for(#numbers) {
my $number = $_;
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
print "$_ -> $number\n";
}
Output:
12321 -> 12,321
12.12 -> 12.12
122222.3334 -> 122,222.3334
1234abc -> 1,234abc
1.1 -> 1.1
1222333444555,666.77 -> 1,222,333,444,555,666.77
Pattern:
(\d+)(\d{3})
-> Take all numbers but the last 3 in group 1
-> Take the remaining 3 numbers in group2 on the beginning of $number
-> Followed is ignored
Substitution
$1,$2
-> Put a seperator sign (,) between group 1 and 2
-> The rest remains unchanged
So if you have 12345.67 the numers the regex uses are 12345. The '.' and all followed is ignored.
1. run (12345.67):
-> matches: 12345
-> group 1: 12,
group 2: 345
-> substitute 12,345
-> result: 12,345.67
2. run (12,345.67):
-> does not match!
-> while breaks.

Parting from #Laura's answer, I tweaked the pure perl, regex-only solution to work for numbers with decimals too:
while ($formatted_number =~ s/^(-?\d+)(\d{3}(?:,\d{3})*(?:\.\d+)*)$/$1,$2/) {};
Of course this assumes a "," as thousands separator and a "." as decimal separator, but it should be trivial to use variables to account for that for your given locale(s).

I used the following but it does not works as of perl v5.26.1
sub format_int
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
The form that worked for me was:
sub format_int
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}
But to use negative numbers the code must be:
sub format_int
{
if ( $val >= 0 ) {
return scalar reverse join ",", unpack( "(A3)*", reverse int($val) );
} else {
return "-" . scalar reverse join ",", unpack( "(A3)*", reverse int(-$val) );
}
}

Did somebody say Perl?
perl -pe '1while s/(\d+)(\d{3})/$1,$2/'
This works for any integer.

# turning above answer into a function
sub format_float
# returns number with commas..... and 2 digit decimal
# so format_float(12345.667) returns "12,345.67"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num)))) . sprintf(".%02d",int(100*(.005+($num - int($num)))));
}
sub format_int
# returns number with commas.....
# so format_int(12345.667) returns "12,345"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}

I wanted to print numbers it in a currency format. If it turned out even, I still wanted a .00 at the end. I used the previous example (ty) and diddled with it a bit more to get this.
sub format_number {
my $num = shift;
my $result;
my $formatted_num = "";
my #temp_array = ();
my $mantissa = "";
if ( $num =~ /\./ ) {
$num = sprintf("%0.02f",$num);
($num,$mantissa) = split(/\./,$num);
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.'. $mantissa;
} else {
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.00';
}
return $result;
}
# Example call
# ...
printf("some amount = %s\n",format_number $some_amount);
I didn't have the Number library on my default mac OS X perl, and I didn't want to mess with that version or go off installing my own perl on this machine. I guess I would have used the formatter module otherwise.
I still don't actually like the solution all that much, but it does work.

This is good for money, just keep adding lines if you handle hundreds of millions.
sub commify{
my $var = $_[0];
#print "COMMIFY got $var\n"; #DEBUG
$var =~ s/(^\d{1,3})(\d{3})(\.\d\d)$/$1,$2$3/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3$4/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4$5/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4,$5$6/;
#print "COMMIFY made $var\n"; #DEBUG
return $var;
}

A solution that produces a localized output:
# First part - Localization
my ( $thousands_sep, $decimal_point, $negative_sign );
BEGIN {
my ( $l );
use POSIX qw(locale_h);
$l = localeconv();
$thousands_sep = $l->{ 'thousands_sep' };
$decimal_point = $l->{ 'decimal_point' };
$negative_sign = $l->{ 'negative_sign' };
}
# Second part - Number transformation
sub readable_number {
my $val = shift;
#my $thousands_sep = ".";
#my $decimal_point = ",";
#my $negative_sign = "-";
sub _readable_int {
my $val = shift;
# a pinch of PERL magic
return scalar reverse join $thousands_sep, unpack( "(A3)*", reverse $val );
}
my ( $i, $d, $r );
$i = int( $val );
if ( $val >= 0 ) {
$r = _readable_int( $i );
} else {
$r = $negative_sign . _readable_int( -$i );
}
# If there is decimal part append it to the integer result
if ( $val != $i ) {
( undef, $d ) = ( $val =~ /(\d*)\.(\d*)/ );
$r = $r . $decimal_point . $d;
}
return $r;
}
The first part gets the symbols used in the current locale to be used on the second part.
The BEGIN block is used to calculate the sysmbols only once at the beginning.
If for some reason there is need to not use POSIX locale, one can ommit the first part and uncomment the variables on the second part to hardcode the sysmbols to be used ($thousands_sep, $thousands_sep and $thousands_sep)

Find all possible starting positions of a regular expression match in perl, including overlapping matches?

Is there a way to find all possible start positions for a regex match in perl?
For example, if your regex was "aa" and the text was "aaaa", it would return 0, 1, and 2, instead of, say 0 and 2.
Obviously, you could just do something like return the first match, and then delete all characters up to and including that starting character, and perform another search, but I'm hoping for something more efficient.

Use lookahead:
$ perl -le 'print $-[0] while "aaaa" =~ /a(?=a)/g'
In general, put everything except the first character of the regex inside of the (?=...).

Update:
I thought about this one a bit more, and came up with this solution using an embedded code block, which is nearly three times faster than the grep solution:
use 5.010;
use warnings;
use strict;
{my #pos;
my $push_pos = qr/(?{push #pos, $-[0]})/;
sub with_code {
my ($re, $str) = #_;
#pos = ();
$str =~ /(?:$re)$push_pos(?!)/;
#pos
}}
and for comparison:
sub with_grep { # old solution
my ($re, $str) = #_;
grep {pos($str) = $_; $str =~ /\G(?:$re)/} 0 .. length($str) - 1;
}
sub with_while { # per Michael Carman's solution, corrected
my ($re, $str) = #_;
my #pos;
while ($str =~ /\G.*?($re)/) {
push #pos, $-[1];
pos $str = $-[1] + 1
}
#pos
}
sub with_look_ahead { # a fragile "generic" version of Sean's solution
my ($re, $str) = #_;
my ($re_a, $re_b) = split //, $re, 2;
my #pos;
push #pos, $-[0] while $str =~ /$re_a(?=$re_b)/g;
#pos
}
Benchmarked and sanity checked with:
use Benchmark 'cmpthese';
my #arg = qw(aa aaaabbbbbbbaaabbbbbaaa);
my $expect = 7;
for my $sub qw(grep while code look_ahead) {
no strict 'refs';
my #got = &{"with_$sub"}(#arg);
"#got" eq '0 1 2 11 12 19 20' or die "$sub: #got";
}
cmpthese -2 => {
grep => sub {with_grep (#arg) == $expect or die},
while => sub {with_while (#arg) == $expect or die},
code => sub {with_code (#arg) == $expect or die},
ahead => sub {with_look_ahead(#arg) == $expect or die},
};
Which prints:
Rate grep while ahead code
grep 49337/s -- -20% -43% -65%
while 61293/s 24% -- -29% -56%
ahead 86340/s 75% 41% -- -38%
code 139161/s 182% 127% 61% --

I know you asked for a regex, but there is actually a simple builtin function that does something quite similar, the function index (perldoc -f index). From that we can build up a simple solution to your direct question, though if you really need a more complicated search than your example this will not work as it only looks for substrings (after an index given by the third parameter).
#!/usr/bin/env perl
use strict;
use warnings;
my $str = 'aaaa';
my $substr = 'aa';
my $pos = -1;
while (1) {
$pos = index($str, $substr, $pos + 1);
last if $pos < 0;
print $pos . "\n";
}

You can use global matching with the pos() function:
my $s1 = "aaaa";
my $s2 = "aa";
while ($s1 =~ /aa/g) {
print pos($s1) - length($s2), "\n";
}

How can I count overlapping substrings in Perl?

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows
sub countnmstr
{
$count =0;
$count++ while $_[0] =~ /$_[1]/g;
return $count;
}
$count = countnmstr("aaa","aa");
print "$count\n";
now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.
can anyone suggest how to implement such a function??

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.
The Perl idiom for counting matches is then:
my $count = () = $_[0] =~ /($pattern)/g;
The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.
Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:
my $count = () = 'aaa' =~ /((?<=a)a)/g;
Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.
You can use positive lookahead as suggested by others, and write the function as:
sub countnmstr {
my ($haystack, $needle) = #_;
my ($first, $rest) = $needle =~ /^(.)(.*)$/;
return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}
You can also use pos to adjust where the next search picks up from:
#!/usr/bin/perl
use strict; use warnings;
sub countnmstr {
my ($haystack, $needle) = #_;
my $adj = length($needle) - 1;
die "Search string cannot be empty!" if $adj < 0;
my $count = 0;
while ( $haystack =~ /\Q$needle/g ) {
pos $haystack -= $adj;
$count += 1;
}
return $count;
}
print countnmstr("aaa","aa"), "\n";
Output:
C:\Temp> t
2

sub countnmstr
{
my ($string, $substr) = #_;
return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}
$count = countnmstr("aaa","aa");
print "$count\n";
A few points:
//g in list context matches as many times as possible.
\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.
Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.
This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.
$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().
Update: s///g is faster, though not as fast as using index:
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( $string =~ s/(?=\Q$substr\E)//g );
}

You could use a lookahead assertion in the regular expression:
sub countnmstr {
my #matches = $_[0] =~ /(?=($_[1]))/g;
return scalar #matches;
}
I suspect Sinan's suggestion will be quicker though.

you can try this, no more regex than needed.
$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
$ind = index($haystack,$needle);
if ( $ind == -1 ) {last};
$haystack = substr($haystack,$ind+1);
$count++;
}
print "Total count: $count\n";
output
$ ./perl.pl
Total count: 4

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.
use strict;
use warnings;
sub countnmstr_regex {
my ($haystack, $needle) = #_;
return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}
sub countnmstr_index {
my ($haystack, $needle) = #_;
my $i = 0;
my $tally = 0;
while (1){
$i = index($haystack, $needle, $i);
last if $i == -1;
$tally ++;
$i ++;
}
return $tally;
}
use Benchmark qw(cmpthese);
my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';
cmpthese( -2, {
countnmstr_regex => sub { countnmstr_regex($h, $n) },
countnmstr_index => sub { countnmstr_index($h, $n) },
} );
__END__
# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 93701/s -- -66%
countnmstr_index 271893/s 190% --
# Result using a large haystack ($size = 100).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 929/s -- -81%
countnmstr_index 4960/s 434% --

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";

Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Saving a transliteration table in perl - perl

Here's a short subroutine that uses substitution instead of transliteration: sub swap_digits { my ($str, $digit) = #_; $str =~ s{ (0) | $digit }{ defined $1 ? $digit : 0 }gex; return $str; }

Related

Count Characters in Perl

Perl printf to use commas as thousands-separator

Find all possible starting positions of a regular expression match in perl, including overlapping matches?

How can I count overlapping substrings in Perl?

How do I determine the longest similar portion of several strings?

Categories

Resources