perl: how to make compact name from a numbered sequence

perl: how to make compact name from a numbered sequence - perl

[perl 5.8.8]
I have a sequence of names of things like:
names='foobar1304,foobar1305,foobar1306,foobar1307'
where the names differ only by a contiguous string of digits somewhere in the name. The strings of digits in any sequence are all of the same length, and the digit strings form a continuous numeric sequence with no skips, e.g. 003,004,005.
I want a compact representation like:
compact_name='foobar1304-7'
(The compact form is just a name, so it's exact form is negotiable.)
There will usually only be <10 things, though some sets might span a decade, e.g.
'foobaz2205-11'
Is there some concise way to do this in perl? I'm not a big perl hacker, so be a little gentle...
Bonus points for handling embedded sequences like:
names='foobar33-pqq,foobar34-pqq,foobar35-pqq'
The ideal script would neatly fall back to 'firstname2301-lastname9922' in case it can't identify a sequence in the names.

I am not sure I got your specification, but it works somehow:
#!/usr/bin/perl
use warnings;
use strict;
use Test::More;
sub compact {
my $string = shift;
my ($name, $value) = split /=/, $string;
$name =~ s/s$// or die "Cannot create compact name for $name.\n"; #/ SO hilite bug
$name = 'compact_' . $name;
$value =~ s/^'|'$//g; #/ SO hilite bug
my #values = split /,/, $value; #/ SO hilite bug
my ($prefix, $first, $suffix) = $values[0] =~ /^(.+?)([0-9]+)(.*)$/;
my $last = $first + $#values;
my $same = 0;
$same++ while substr($first, 0, $same) eq substr($last, 0, $same);
$last = substr $last, $same - 1;
for my $i ($first .. $first + $#values) {
$values[$i - $first] eq ($prefix . $i . $suffix)
or die "Invalid sequence at $values[$i-$first].\n";
}
return "$name='$prefix$first-$last$suffix'";
}
is( compact("names='foobar1304,foobar1305,foobar1306,foobar1307'"),
"compact_name='foobar1304-7'");
is( compact("names='foobaz2205,foobaz2206,foobaz2207,foobaz2208,foobaz2209,foobaz2210,foobaz2211'"),
"compact_name='foobaz2205-11'");
is( compact("names='foobar33-pqq,foobar34-pqq,foobar35-pqq'"),
"compact_name='foobar33-5-pqq'");
done_testing();

Someone sure will post an more elegant solution, but the following
use strict;
use warnings;
my $names='foobar1308-xy,foobar1309-xy,foobar1310-xy,foobar1311-xy';
my #names = split /,/,$names;
my $pfx = lcp(#names);
my #nums = map { m/$pfx(\d*)/; $1 } #names;
my $first=shift #nums;
my $last = pop #nums;
my $suf=$names[0];
$suf =~ s/$pfx\d*//;
print "$pfx\{$first-$last}$suf\n";
#https://gist.github.com/3309172
sub lcp {
my $match = shift;
substr($match, (($match ^ $_) =~ /^\0*/, $+[0])) = '' for #_;
$match;
}
prints:
foobar13{08-11}-xy

Related

How to fix the error of "Use of unitialized value in addition..." in perl script?

Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (#ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my #a = ();
my $x = length $a;
#a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( #a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it will generate the correct results, however it also presents an error of "Use of unitialized value in addition (+) at molecular_weight.pl line36", which is just localized in line of "$b += $data{$i};" how to fix this bug ? Thanks in advance !

You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)

Difficulty in finding error in perl script

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
Now what I want to do is, in a simplistic way,
take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.
Here is the code what I've tried so far.
#!/usr/bin/perl -w
use strict;
use warnings;
my $in = $ARGV[0];
open(IN, $in);
my #short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
if($short_reads[$i+1] =~ /^$end_of_kmers/){
my $end_res = substr($short_reads[$i], -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n\n";
At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.

There are three huge problems in IT.
Naming of things.
Off by one error.
And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.
#!/usr/bin/env perl
use strict;
use warnings;
use constant LENGTH => 6;
my $read = <>;
chomp $read;
while (<>) {
chomp;
last unless length > LENGTH;
if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
$read .= substr( $_, LENGTH );
}
else {last}
}
print $read, "\n";
I didn't get this ARGV[0] thing. It is useless and inflexible.
$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data
ATGGAAGTCGCGGAATC
But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict
last unless length == LENGTH + 1;
Edit:
If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < #a; $i++) is more idiomatic.)
my #short_reads = <>;
chomp #short_reads;
my #all_end_res;
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
push #all_end_res, $1;
}
}
print $short_reads[0], join('', #all_end_res), "\n";
The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.
my #short_reads = <>;
chomp #short_reads;
my $read = shift #short_reads;
for my $curr_read (#short_reads) {
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.
Edit2:
It is almost ten times faster using substr and eq then using regular expressions.
$ time ./code.pl data.out > data.test
real 0m0.480s
user 0m0.468s
sys 0m0.008s
$ time ./code2.pl data.out > data2.test
real 0m4.520s
user 0m4.516s
sys 0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test

with minor modification:
use warnings;
use strict;
open my $in, '<', $ARGV[0] or die $!;
chomp(my #short_reads = <$in>);
my $first_read = $short_reads[0];
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
my ($next_read) = $short_reads[$i+1];
if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
my $end_res = substr($next_read, -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n";
ATGGAAGTCGCGGAATC

How to get consecutive pairs of words in Perl

With this sentence:
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
We want to get all possible consecutive pairs of words.
my $var = ['Mapping and',
'and quantifying',
'quantifying mammalian',
'mammalian transcriptomes',
'transcriptomes RNA-Seq'];
Is there a compact way to do it?

Yes.
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
my #pairs = $sent =~ /(?=(\S+\s+\S+))\S+/g;

A variation that (perhaps unwisely) relies on operator evaluation order but doesn't rely on fancy regexes or indices:
my #words = split /\s+/, $sent;
my $last = shift #words;
my #var;
push #var, $last . ' ' . ($last = $_) for #words;

This works:
my #sent = split(/\s+/, $sent);
my #var = map { $sent[$_] . ' ' . $sent[$_ + 1] } 0 .. $#sent - 1;
i.e. just split the original string into an array of words, and then use map to iteratively produce the desired pairs.

I don't have it as a single line, but the following code should give you somewhere to start. Basically does it with a push and a regext with /g.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
my $t1 = 'aa bb cc dd ee ff';
my $t2 = 'aa bb cc dd ee';
foreach my $txt ( $t1, $t2 )
{
my #a;
push( #a, $& ) while( $txt =~ /\G\S+(\s+\S+|)\s*/g );
print Dumper( \#a );
}
One liner thanks to the syntax from #ysth
my #a = $txt =~ /\G(\S+(?:\s+\S+|))\s*/g;
My regex is slightly different in that if you have an odd number of words, the last word still gets an entry.

finding the substring present in string and also count the number of occurrences

Could anyone tel me what is the mistake? As the program is for finding the substrings in a given string and count there number of occurrences for those substrings. but the substring must check the occurrences for every three alphabets.
for eg: String: AGAUUUAGA (i.e. for AGA, UUU, AGA)
output: AGA-2
UUU-1
print"Enter the mRNA Sequence\n";
$count=0;
$count1=0;
$seq=<>;
chomp($seq);
$p='';
$ln=length($seq);
$j=$ln/3;
for($i=0,$k=0;$i<$ln,$k<$j;$k++) {
$fra[$k]=substr($seq,$i,3);
$i=$i+3;
if({$fra[$k]} eq AGA) {
$count++;
print"The number of AGA is $count";
} elseif({$fra[$k]} eq UUU) {
$count1++;
print" The number of UUU is $count1";
}
}

This is a Perl FAQ:
perldoc -q count
This code will count the occurrences of your 2 strings:
use warnings;
use strict;
my $seq = 'AGAUUUAGA';
my $aga_cnt = () = $seq =~ /AGA/g;
my $uuu_cnt = () = $seq =~ /UUU/g;
print "The number of AGA is $aga_cnt\n";
print "The number of UUU is $uuu_cnt\n";
__END__
The number of AGA is 2
The number of UUU is 1
If you use strict and warnings, you will get many messages pointing out errors in your code.
Here is another approach which is more scalable:
use warnings;
use strict;
use Data::Dumper;
my $seq = 'AGAUUUAGA';
my %counts;
for my $key (qw(AGA UUU)) {
$counts{$key} = () = $seq =~ /$key/g;
}
print Dumper(\%counts);
__END__
$VAR1 = {
'AGA' => 2,
'UUU' => 1
};

Have a try with this, that avoids overlaps:
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Data::Dumper;
my $str = q!AGAUUUAGAGAAGAG!;
my #list = $str =~ /(...)/g;
my ($AGA, $UUU);
foreach(#list) {
$AGA++ if $_ eq 'AGA';
$UUU++ if $_ eq 'UUU';
}
say "number of AGA is $AGA and number of UUU is $UUU";
output:
number of AGA is 2 and number of UUU is 1

This is an example of how quickly you can get things done in Perl. Grouping the strands together as a alternation is one way to make sure there is no overlap. Also a hash is a great way to count occurrences of they key.
$values{$_}++ foreach $seq =~ /(AGA|UUU)/g;
print "AGA-$values{AGA} UUU-$values{UUU}\n";
However, I generally want to generalize it to something like this, thinking that this might not be the only time you have to do something like this.
use strict;
use warnings;
use English qw<$LIST_SEPARATOR>;
my %values;
my #spans = qw<AGA UUU>;
my $split_regex
= do { local $LIST_SEPARATOR = '|';
qr/(#spans)/
}
;
$values{$_}++ foreach $seq =~ /$split_regex/g;
print join( ' ', map { "$_-$values{$_}" } #spans ), "\n";

Your not clear on how many "AGA" the string "AGAGAGA" contains.
If 2,
my $aga = () = $seq =~ /AGA/g;
my $uuu = () = $seq =~ /UUU/g;
If 3,
my $aga = () = $seq =~ /A(?=GA)/g;
my $uuu = () = $seq =~ /U(?=UU)/g;

If I understand you correctly (and certainly that is questionable; almost every answer so far is interpreting your question differently than every other answer):
my %substring;
$substring{$1}++ while $seq =~ /(...)/;
print "There are $substring{UUU} UUU's and $substring{AGA} AGA's\n";

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";

Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

perl: how to make compact name from a numbered sequence - perl

Related

How to fix the error of "Use of unitialized value in addition..." in perl script?

Difficulty in finding error in perl script

How to get consecutive pairs of words in Perl

finding the substring present in string and also count the number of occurrences

How do I determine the longest similar portion of several strings?

Categories

Resources