I am trying to write a subroutine that will receive an array reference and then delete some of the elements of the array. For example:
use strict;
use warnings;
my #a = (1, 2, 3, 6);
func1 (\#a);
sub func1 {
my $a = shift;
my #b = (2, 6);
for my $val_to_remove (#b) {
for my $i (0..$#$a) {
my $val = $a->[$i];
if ( $val == $val_to_remove ) {
splice #$a, $i, 1;
last;
}
}
}
}
This seems, to say the least, a little awkward using two for loops.
Is it possible to simplify this?
I also tried
use strict;
use warnings;
my #a = (1, 2, 3, 6);
my $temp = \#a;
func2 (\$temp );
sub func2 {
my $a = shift;
$$a = [2, 6];
}
but then #a is not modified, but rather $temp will be..
I would also rather like to avoid passing a reference to a reference, since that will mess up the calling syntax for other modules.
Use a hash as an indicator function for identifying efficiently the items to be removed; use a grep for filtering them out:
sub func1 {
my $a = shift;
my %b = map { ($_ => 1) } (2, 6);
#$a = grep { !$b{$_} } #$a;
}
Loic's solution works well and is quite readable. I would recommend it unless you're working with large arrays that cause the grep to eat a lot of memory, or if performance is absolutely critical.
You can get a bit of a performance boost by using splice:
use strict;
use warnings;
use Data::Dump;
my #haystack = (1, 2, 3, 6);
my %needle = map { $_ => 1 } (2, 6);
foreach my $i (reverse 0 .. $#haystack) {
splice #haystack, $i, 1 if exists $needle{ $haystack[$i] };
}
dd \#haystack;
Output:
[1, 3]
Note that you must iterate through #haystack in reverse order, since every time you remove an element, the remaining elements shift to the left, changing the array indexes.
Benchmark
Here are the results from a slightly modified version of BrowserUk's corrected benchmark, written in response to foreach array - delete current row ? on PerlMonks. The original benchmark included several other methods for removing elements from an array, which I've left out for simplicity.
$ ./benchmark -N=1e2
Rate grep for_splice
grep 40959/s -- -37%
for_splice 65164/s 59% --
$ ./benchmark -N=1e3
Rate grep for_splice
grep 4072/s -- -38%
for_splice 6515/s 60% --
$ ./benchmark -N=1e4
Rate grep for_splice
grep 366/s -- -33%
for_splice 550/s 50% --
$ ./benchmark -N=1e5
Rate grep for_splice
grep 32.7/s -- -38%
for_splice 52.9/s 62% --
$ ./benchmark -N=1e6
(warning: too few iterations for a reliable count)
Rate grep for_splice
grep 2.36/s -- -28%
for_splice 3.28/s 39% --
And the benchmark code itself:
#!/usr/bin/perl -sl
use strict;
use warnings;
use Benchmark 'cmpthese';
our $N //= 1e3;
our $I //= -1;
# 10% the size of the haystack
my $num_needles = int($N / 10) || 1;
our #as;
#{ $as[ $_ ] } = 1 .. $N for 0 .. 4;
our %needle = map { int(rand($N)) => 1 } 1 .. $num_needles;
cmpthese $I, {
for_splice => q[
my $ar = $as[0];
foreach my $i (reverse 0 .. $#$ar) {
splice #$ar, $i, 1 if exists $needle{ $ar->[$i] };
}
$I == 1 and print "0: ", "#$ar";
],
grep => q[
my $ar = $as[1];
#$ar = grep { ! exists $needle{$_} } #$ar;
$I == 1 and print "1: ", "#$ar";
],
};
You can't use a simple for (LIST) loop to iterate over the indices of an array if you're also modifying the contents of the array. That's because the index of the last item may change, and you will also skip over elements if you delete the current element and increment the counter.
A while loop is required instead, or the equivalent C-style for.
This program demonstrates, as well as uing List::Util::any to check whether an array elemnent should be deleted
use strict;
use warnings;
use List::Util 'any';
my #a = (1, 2, 3, 6);
func1 (\#a);
use Data::Dump;
dd \#a;
sub func1 {
my ($a) = #_;
my #b = (2, 6);
for ( my $i = 0; $i < #$a; ) {
if ( any { $a->[$i] == $_ } #b ) {
splice #$a, $i, 1;
}
else {
++$i;
}
}
}
output
[1, 3]
With problems of this nature, it is often easier to build the array you want rather than delete from an existing one.
my #a = ( 1, 2, 3, 6 );
sub func1 {
my $aref = shift #_;
my #b = ( 2, 6 );
my #results = ();
for my $item ( #$aref ){
if( grep { $item == $_ } #b ){
next;
}
push #results, $item;
}
return #results;
}
my #results = func1( \#a );
say "#results";
Related
I have this perl code below:
use strict;
use warnings;
sub powerset(&#) {
my $callback = shift;
my $bitmask = '';
my $bytes = #_/8;
{
my #indices = grep vec($bitmask, $_, 1), 0..$#_;
$callback->( #_[#indices] );
++vec($bitmask, $_, 8) and last for 0 .. $bytes;
redo if #indices != #_;
}
}
powerset { print "[#_]\n" } 1..21;
I'm trying to figure out how I can only run n times (say 6 times the powerset subroutine). i tried use it:
$x = 0;
while ($x <= 6) {
$x ++;
powerset() ;
}
On run 6 times I mean run to the sixth printed line, in this case to powerset {print "[#_]\n"} 1..21; matches up to the impression of
[]
[1]
[2]
[1 2]
[3]
[1 3]
[2 3]
But I don't know where I would apply it $x = 0; while ($x <= 6) { $x ++; powerset() ; }, if inside the sub powerset() or outside, it seems to me that inside the sub because where the routine is happening.
i know i should use die or return to get out of running a subroutine and that somewhere in while i should use maybe else, but i still don't see how to structure this in the code.
I feel that I should better understand the function of the variables $bitmask, $bytes, $callback, because as it is a lazy evaluation, the way of counting the execution is different.
Further research on how to run a perl subroutine leads me to questions about using timers that depend on the elapsed time in seconds rather than the number of times the subroutine is executed.
Here is an example of how you can constrain the number of times the callback is called:
use feature qw(state);
use strict;
use warnings;
sub powerset(&#) {
my $callback = shift;
my $bitmask = '';
my $bytes = #_/8;
{
my #indices = grep vec($bitmask, $_, 1), 0..$#_;
$callback->( #_[#indices] );
++vec($bitmask, $_, 8) and last for 0 .. $bytes;
redo if #indices != #_;
}
}
my $limit = 6;
powerset
{
state $counter = 0;
die "limit reached" if ++$counter > $limit;
print "[#_]\n"
}
1..21;
I have two arrays:
#array1 = (A,B,C,D,E,F);
#array2 = (A,C,H,D,E,G);
The arrays could be of different size. I want to find how many mismatches are there between the arrays. The indexes should be the same. In this case there are three mismatch :b->c,c->h and F->G.(i.e , The 'C' in $array[2] should not be considered a match to 'C' in $array[1]) I would like to get the number of mismatches as well as the mismatch.
foreach my $a1 ( 0 .. $#array1) {
foreach my $a2( 0 .. $#array2)
if($array1[$a1] ne $array2[$a2]) {
}
}
}
my %array_one = map {$_, 1} #array1;
my #difference = grep {!$array_one {$_}} #array1;
print "#difference\n";
Ans: gives me H, G but not C.
with my little Perl knowledge I tried this, with no result. Could you suggest me how I should deal this? Your suggestions and pointers would be very helpful.
You shouldn't have nested loops. You only need to go through the indexes once.
use List::Util qw( max );
my #mismatches;
for my $i (0..max($#array1, $#array2)) {
push #mismatches, $i
if $i >= #array1
|| $i >= #array2
|| $array1[$i] ne $array2[$i];
}
}
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Since you mentioned grep, this is how you'd replace the for with grep:
use List::Util qw( max );
my #mismatches =
grep { $_ >= #array1
|| $_ >= #array2
|| array1[$_] ne $array2[$_] }
0 .. max($#array1, $#array2);
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Here's an example using each_arrayref from List::MoreUtils.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my #out;
my $iter = each_arrayref(#_);
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}
This version should be faster if you are going to use this for determining the number of differences often. The output is exactly the same. It just doesn't have to work as hard when returning a number.
Read about wantarray for more information.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my $iter = each_arrayref(#_);
if( wantarray ){
# return structure
my #out;
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}else{
# only return a count of differences
my $out = 0;
while( my #current = $iter->() ){
++$out unless all_same #current;
}
return $out;
}
}
diff_array uses the subroutine all_same to determine if all of the current list of elements are the same.
sub all_same{
my $head = shift;
return undef unless #_; # not enough arguments
for( #_ ){
return 0 if $_ ne $head; # at least one mismatch
}
return 1; # all are the same
}
To get just the number of differences:
print scalar diff_array \#array1, \#array2;
my $count = diff_array \#array1, \#array2;
To get a list of differences:
my #list = diff_array \#array1, \#array2;
To get both:
my $count = my #list = diff_array \#array1, \#array2;
The output for the input you provided:
(
[ 1, 'B', 'C' ],
[ 2, 'C', 'H' ],
[ 5, 'F', 'G' ]
)
Example usage
my #a1 = qw'A B C D E F';
my #a2 = qw'A C H D E G';
my $count = my #list = diff_array \#a1, \#a2;
print "There were $count differences\n\n";
for my $group (#list){
my $index = shift #$group;
print " At index $index\n";
print " $_\n" for #$group;
print "\n";
}
You're iterating over both arrays when you don't want to be doing so.
#array1 = ("A","B","C","D","E","F");
#array2 = ("A","C","H","D","E","G");
foreach my $index (0 .. $#array1) {
if ($array1[$index] ne $array2[$index]) {
print "Arrays differ at index $index: $array1[$index] and $array2[$index]\n";
}
}
Output:
Arrays differ at index 1: B and C
Arrays differ at index 2: C and H
Arrays differ at index 5: F and G
Well, first, you're going to want to go over each element of one of the arrays, and compare it to the same element of the other array. List::MoreUtils provides an easy way to do this:
use v5.14;
use List::MoreUtils qw(each_array);
my #a = qw(a b c d);
my #b = qw(1 2 3);
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
say "a = $a, b = $b, idx = ", $ea->('index');
}
You can extend that to find where there is a non-match by checking inside that while loop (note: this assumes your arrays don't have undefs at the end, or that if they do, undef is the same as having a shorter array):
my #mismatch;
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
if (defined $a != defined $b || $a ne $b) {
push #mismatch, $ea->('index');
}
}
and then:
say "Mismatched count = ", scalar(#mismatch), " items are: ", join(q{, }, #mismatch);
The following code builds a list of mismatched pairs, then prints them out.
#a1 = (A,B,C,D,E,F);
#a2 = (A,C,H,D,E,G);
#diff = map { [$a1[$_] => $a2[$_]] }
grep { $a1[$_] ne $a2[$_] }
(0..($#a1 < $#a2 ? $#a1 : $#a2));
print "$_->[0]->$_->[1]\n" for #diff
You have the right idea, but you only need a single loop, since you are looking at each index and comparing entries between the arrays:
foreach my $a1 ( 0 .. $#array1) {
if($array1[$a1] ne $array2[$a1]) {
print "$a1: $array1[$a1] <-> $array2[$a1]\n";
}
}
I have two arrays, #a and #b. I want to do a compare among the elements of the two arrays.
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
If any element matches then set a flag. Is there any simple way to do this?
First of all, your 2 arrays need to be written correctly.
#a = ("abc","def","efg","ghy","klm","ghn");
#b = ("def","efg","ghy","klm","ghn","klm");
Second of all, for arbitrary arrays (e.g. arrays whose elements may be references to other data structures) you can use Data::Compare.
For arrays whose elements are scalar, you can do comparison using List::MoreUtils pairwise BLOCK ARRAY1 ARRAY2, where BLOCK is your comparison subroutine. You can emulate pairwise (if you don't have List::MoreUtils access) via:
if (#a != #b) {
$equals = 0;
} else {
$equals = 1;
foreach (my $i = 0; $i < #a; $i++) {
# Ideally, check for undef/value comparison here as well
if ($a[$i] != $b[$i]) { # use "ne" if elements are strings, not numbers
# Or you can use generic sub comparing 2 values
$equals = 0;
last;
}
}
}
P.S. I am not sure but List::Compare may always sort the lists. I'm not sure if it can do pairwise comparisons.
List::Compare
if ( scalar List::Compare->new(\#a, \#b)->get_intersection ) {
…
}
Check to create an intersect function, which will return a list of items that are present in both lists. Then your return value is dependent on the number of items in the intersected list.
You can easily find on the web the best implementation of intersect for Perl. I remember looking for it a few years ago.
Here's what I found :
my #array1 = (1, 2, 3);
my #array2 = (2, 3, 4);
my %original = ();
my #isect = ();
map { $original{$_} = 1 } #array1;
#isect = grep { $original{$_} } #array2;
This is one way:
use warnings;
use strict;
my #a = split /,/, "abc,def,efg,ghy,klm,ghn";
my #b = split /,/, "def,ghy,jgk,lom,com,klm";
my $flag = 0;
my %a;
#a{#a} = (1) x #a;
for (#b) {
if ($a{$_}) {
$flag = 1;
last;
}
}
print "$flag\n";
From the requirement that 'if any element matches', use the intersection of sets:
sub set{
my %set = map { $_, undef }, #_;
return sort keys %set;
}
sub compare{
my ($listA,$listB) = #_;
return ( (set(#$listA)-set(#$listB)) > 0)
}
my #a = qw' abc def efg ghy klm ghn ';
my #b = qw' def ghy jgk lom com klm ';
my $flag;
foreach my $item(#a) {
$flag = #b~~$item ? 0 : 1;
last if !$flag;
}
Note that you will need Perl 5.10, or later, to use the smart match operator (~~) .
Brute force should do the trick for small a n:
my $flag = 0;
foreach my $i (#a) {
foreach my $k (#b) {
if ($i eq $k) {
$flag = 1;
last;
}
}
}
For a large n, use a hash table:
my $flag = 0;
my %aa = ();
$aa{$_} = 1 foreach (#a);
foreach my $i (#b) {
if ($aa{$i}) {
$flag = 1;
last;
}
}
Where a large n is |#a| + |#b| > ~1000 items
IMHO, you should use List::MoreUtils::pairwise. However, if for some reason you cannot, then the following sub would return a 1 for every index where the value in the first array compares equal to the value in the second array. You can generalize this method as much as you want and pass your own comparator if you want to, but at that point, just installing List::MoreUtils would be a more productive use of your time.
use strict; use warnings;
my #a = qw(abc def ghi jkl);
my #b = qw(abc dgh dlkfj jkl kjj lkm);
my $map = which_ones_equal(\#a, \#b);
print join(', ', #$map), "\n";
sub which_ones_equal {
my ($x, $y, $compare) = #_;
my $last = $#$x > $#$y ? $#$x : $#$y;
no warnings 'uninitialized';
return [ map { 0 + ($x->[$_] eq $y->[$_]) } $[ .. $last ];
}
This is Perl. The 'obvious' solution:
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
print "arrays equal\n"
if #a == #b and join("\0", #a) eq join("\0", #b);
given "\0" not being in #a.
But thanks for confirming that there is no other generic solution than rolling your own.
my #a1 = qw|a b c d|;
my #a2 = qw|b c d e|;
for my $i (0..$#a1) {
say "element $i of array 1 was not found in array 2"
unless grep {$_ eq $a1[$i]} #a2
}
If you would consider the arrays with different order to be different, you may use Array::Diff
if (Array::Diff->diff(\#a, \#b)->count) {
# not_same
} else {
# same
}
This question still could mean two things where it states "If any element matches then set a flag":
Elements at the same position, i.e $a[2] eq $b[2]
Values at any position, i.e. $a[3] eq $b[5]
For case 1, you might do this:
# iterate over all positions, and compare values at that position
my #matches = grep { $a[$_] eq $b[$_] } 0 .. $#a;
# set flag if there's any match at the same position
my $flag = 1 if #matches;
For case 2, you might do that:
# make a hash of #a and check if any #b are in there
my %a = map { $_ => 1 } #a;
my #matches = grep { $a{$_} } #b;
# set flag if there's matches at any position
my $flag = 1 if #matches;
Note that in the first case, #matches holds the indexes of where there are matching elements, and in the second case #matches holds the matching values in the order in which they appear in #b.
As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}
Let's make this very easy. What I want:
#array = qw/one two one/;
my #duplicates = duplicate(#array);
print "#duplicates"; # This should now print 'one'.
How to print duplicate values of a array/hash?
sub duplicate {
my #args = #_;
my %items;
for my $element(#args) {
$items{$element}++;
}
return grep {$items{$_} > 1} keys %items;
}
# assumes inputs can be hash keys
#a = (1, 2, 3, 3, 4, 4, 5);
# keep count for each unique input
%h = ();
map { $h{$_}++ } #a;
# duplicate inputs have count > 1
#dupes = grep { $h{$_} > 1 } keys %h;
# should print 3, 4
print join(", ", sort #dupes), "\n";
The extra verbose, extra readable version of what you want to do:
sub duplicate {
my %value_hash;
foreach my $val (#_) {
$value_hash{$val} +=1;
}
my #arr;
while (my ($val, $num) = each(%value_hash)) {
if ($num > 1) {
push(#arr, $val)
}
}
return #arr;
}
This can be shortened considerably, but I intentionally left it verbose so that you can follow along.
I didn't test it, though, so watch out for my typos.
Use a dictionary, put the value in the key, and the count in the value.
Ah, just noticed you've tagged as perl
while ([...]) {
$hash{[dbvalue]}++
}
Unspecified in the question is the order in which the duplicates should be returned.
I can think of several possibilities: don't care; by order of first/second/last occurrence in the input list; sorted.
I'm going golfing!
sub duplicate {
my %count;
grep $count{$_}++, #_;
}
#array = qw/one two one/;
my #duplicates = duplicate(#array);
print "#duplicates"; # This should now print 'one'.
# or if returning *exactly* 1 occurrence of each duplicated item is important
sub duplicate {
my %count;
grep ++$count{$_} == 2, #_;
}