Using perl to find median, mode, Standard deviation? - perl

I have an array of numbers. What is the easiest way to calculate the Median, Mode, and Std Dev for the data set?

Statistics::Basic::Mean
Statistics::Basic::Median
Statistics::Basic::Mode
Statistics::Basic::StdDev

#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other test, need
# not be on the same or different lines, and
# can be in scientific notion (avagadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# tchrist#perl.com
use strict;
use warnings;
use List::Util qw< min max >;
sub by_number {
if ($a < $b){ -1 } elsif ($a > $b) { 1 } else { 0 }
}
#
my $number_rx = qr{
# leading sign, positive or negative
(?: [+-] ? )
# mantissa
(?= [0123456789.] )
(?:
# "N" or "N." or "N.N"
(?:
(?: [0123456789] + )
(?:
(?: [.] )
(?: [0123456789] * )
) ?
|
# ".N", no leading digits
(?:
(?: [.] )
(?: [0123456789] + )
)
)
)
# abscissa
(?:
(?: [Ee] )
(?:
(?: [+-] ? )
(?: [0123456789] + )
)
|
)
}x;
my $n = 0;
my $sum = 0;
my #values = ();
my %seen = ();
while (<>) {
while (/($number_rx)/g) {
$n++;
my $num = 0 + $1; # 0+ is so numbers in alternate form count as same
$sum += $num;
push #values, $num;
$seen{$num}++;
}
}
die "no values" if $n == 0;
my $mean = $sum / $n;
my $sqsum = 0;
for (#values) {
$sqsum += ( $_ ** 2 );
}
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
my $stdev = sqrt($sqsum);
my $max_seen_count = max values %seen;
my #modes = grep { $seen{$_} == $max_seen_count } keys %seen;
my $mode = #modes == 1
? $modes[0]
: "(" . join(", ", #modes) . ")";
$mode .= ' # ' . $max_seen_count;
my $median;
my $mid = int #values/2;
my #sorted_values = sort by_number #values;
if (#values % 2) {
$median = $sorted_values[ $mid ];
} else {
$median = ($sorted_values[$mid-1] + $sorted_values[$mid])/2;
}
my $min = min #values;
my $max = max #values;
printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n",
$mode, $median, $mean, $stdev;

Depending on how in depth you need to go, erickb's answer may work. However for numerical functionality in Perl there is PDL. You would create a piddle (the object containing your data) using the pdl function. From there you can use the operations on this page to do the statistics you need.
Edit: Looking around I found two function calls that do EXACTLY what you need. statsover gives statistics on one dimension of a piddle, while stats does the same over the whole piddle.
my $piddle = pdl #data;
my ($mean,$prms,$median,$min,$max,$adev,$rms) = statsover $piddle;

Related

Sum the odd and even indices of an array separately - Perl

I have an array of 11 elements. Where I want to sum the odd elements including the first and last elements as one scalar and the evens as another.
This is my code I am trying to use map adding 2 to each index to achieve the result but I think I have got it wrong.
use strict;
use warnings;
use Data::Dumper;
print 'Enter the 11 digiet serial number: ';
chomp( my #barcode = //, <STDIN> );
my #sum1 = map { 2 + $_ } $barcode[1] .. $barcode[11];
my $sum1 = sum Dumper( \#sum1 );
# sum2 = l2 + l4 + l6 + r8 + r10;
printf "{$sum1}";
What is a good way to achieve this?
Sum of even/odd indicies (what you asked for, but not what you want[1]):
use List::Util qw( sum ); # Or: sub sum { my $acc; $acc += $_ for #_; $acc }
my $sum_of_even_idxs = sum grep { $_ % 2 == 0 } 0..$#nums;
my $sum_of_odd_idxs = sum grep { $_ % 2 == 1 } 0..$#nums;
Sum of even/odd values (what you also asked for, but not what you want[1]):
use List::Util qw( sum ); # Or: sub sum { my $acc; $acc += $_ for #_; $acc }
my $sum_of_even_vals = sum grep { $_ % 2 == 0 } #nums;
my $sum_of_odd_vals = sum grep { $_ % 2 == 1 } #nums;
Sum of values at even/odd indexes (what you appear to want):
use List::Util qw( sum ); # Or: sub sum { my $acc; $acc += $_ for #_; $acc }
my $sum_of_vals_at_even_idxs = sum #nums[ grep { $_ % 2 == 0 } 0..$#nums ];
my $sum_of_vals_at_odd_idxs = sum #nums[ grep { $_ % 2 == 1 } 0..$#nums ];
Given that you know how many elements you have, you could use the following:
use List::Util qw( sum ); # Or: sub sum { my $acc; $acc += $_ for #_; $acc }
my $sum_of_vals_at_even_idxs = sum #nums[0,2,4,6,8,10];
my $sum_of_vals_at_odd_idxs = sum #nums[1,3,5,7,9];
I included these in case someone needing these lands on this Q&A.
Add up values at odd and at even indices
perl -wE'#ary = 1..6;
for (0..$#ary) { $_ & 1 ? $odds += $ary[$_] : $evens += $ary[$_] };
say "odds: $odds, evens: $evens"
'
Note for tests: with even indices (0,2,4) we have (odd!) values (1,3,5), in this (1..6) example
You can use the fact that the ?: operator is assignable
print 'Enter the 11 digiet serial number: ';
chomp( my #barcode = //, <STDIN> );
my $odd = 0;
my $even = 0;
for (my $index = 0; $index < #barcode; $index++) {
($index % 2 ? $even : $odd) += $barcode[$index];
}
This works by indexing over #barcode and taking the mod 2 of the index, ie dividing the index by 2 and taking the remainder, and if the remainder is 1 adding that element of #barcode to $even otherwise to $odd.
That looks strange until you remember that arrays are 0 based so your first number of the barcode is stored in $barcode[0] which is an even index.
chomp( my #barcode = //, <STDIN> ); presumably was supposed to have a split before the //?
#barcode will have all the characters in the line read, including the newline. The chomp will change the final element from a newline to an empty string.
Better to chomp first so you just have your digits in the array:
chomp(my $barcode = <STDIN>);
my #barcode = split //, $barcode;
Another Perl, if the string is of length 11 and contains only digits
$ perl -le ' $_="12345678911"; s/(.)(.)|(.)$/$odd+=$1+$3;$even+=$2/ge; print "odd=$odd; even=$even" '
odd=26; even=21
$
with different input
$ perl -le ' $_="12121212121"; s/(.)(.)|(.)$/$odd+=$1+$3;$even+=$2/ge; print "odd=$odd; even=$even" '
odd=6; even=10
$

How can I normalize my results in perl with foreach control structure?

I have this output:
10dvex2_miRNA_ce.out.data|6361
10dvex2_miRNA_ce.out.data|6361
10dvex2_misc_RNA_ce.out.data|0
10dvex2_rRNA_ce.out.data|239
with this script in Perl:
#!/usr/bin/perl
use warnings;
use strict;
open(MYINPUTFILE, $ARGV[0]); # open for input
my #lines = <MYINPUTFILE>; # read file into list
my $count = 0;
print "Frag"."\t"."ncRNA"."\t"."Amount"."\n";
foreach my $lines (#lines){
my $pattern = $lines;
$pattern =~ s/(.*)dvex\d_(.*)_(.*).(out.data)\|(.*)/$1 $2 $3 $5/g;
$count += $5;
print $1."\t".$2.$3."\t".$5."\n";
}
close(MYINPUTFILE);
exit;
I extract this information:
Frag ncRNA Amount
10 miRNAce 6361
10 misc_RNAce 0
10 rRNAce 239
but in the Amount column I want to report those numbers divided by the total result (6600). In this case I want this output:
Frag ncRNA Amount
10 miRNAce 0.964
10 misc_RNAce 0
10 rRNAce 0.036
My problem is extract the TOTAL result in the loop...to normalize this data. Some ideas?
Perhaps the following will be helpful:
use strict;
use warnings;
my ( %hash, $total, %seen, #array );
while (<>) {
next if $seen{$_}++;
/(\d+).+?_([^.]+).+\|(\d+)$/;
$hash{$1}{$2} = $3;
$total += $3;
}
print "Frag\tncRNA\tAmount\n";
while ( my ( $key1, $val1 ) = each %hash ) {
while ( my ( $key2, $val2 ) = each %$val1 ) {
my $frac = $val2 / $total == 0 ? 0 : sprintf( '%.3f', $val2 / $total );
push #array, "$key1\t$key2\t$frac\n";
}
}
print map { $_->[0] }
sort { $b->[1] <=> $a->[1] }
map { [ $_, (split)[2] ] }
#array;
Output from your data set:
Frag ncRNA Amount
10 miRNA_ce 0.964
10 rRNA_ce 0.036
10 misc_RNA_ce 0
Identical lines are skipped, and then the required elements are captured from each line. A running total is kept for the subsequent calculation. Your desired output showed sorting from high to low, which is why each record is pushed onto #array. However, if sorting isn't necessary, you can just print that line and omit the Schwartzian transform on #array.
Hope this helps!
To do this you will need two passes over the data.
#! /usr/bin/env perl
use warnings;
use strict;
print join("\t",qw'Frag ncRNA Amount'),"\n";
my #data;
my $total = 0;
# parse the lines
while( <> ){
my #elem = /(.+?)(?>dvex)\d_(.+)_([^._]+)[.]out[.]data[|](d+)/;
next unless #elem;
# running total
$total += $elem[-1];
# combine $2 and $3
splice #elem, 1, 2, $2.$3; # $elem[1].$elem[2];
push #data, \#elem;
}
# print them
for( #data ){
my #copy = #$_;
$copy[-1] = $copy[-1] / $total;
$copy[-1] = sprintf('%.3f', $copy[-1]) if $copy[-1];
print join("\t",#copy),"\n";
}

Possible combinations made?

In Perl, how can I test for all possible combinations in a number.
For example, the combination I am interested in is separating.
E.g: 53 could be "5 3" or just "53"
E.g: 215 could be "21 5" or "2 15"
In fact, you are distributing spaces to all the positions between characters. On each position, the space either is realized or not for each combination. Therefore, you can represent it as a binary number, 1 means space present, 0 means space not present.
#!/usr/bin/perl
use warnings;
use strict;
my $num = shift;
my #digits = split //, $num;
my $length = length($num) - 1;
if ($length == 0) {
print "$num\n";
exit;
}
for my $i (0 .. 2 ** $length - 1) {
my $mask = sprintf "%0${length}b", $i;
my #replace_arr = split //, $mask;
my $idx = 0;
for (#replace_arr, '') {
print $digits[$idx];
print ' ' if $_;
$idx++;
}
print "\n";
}

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}