Parsing a string into a hash structure in perl

Parsing a string into a hash structure in perl - perl

I have the following string:
$str = "list
XYZ
status1 : YES
value1 : 100
status2 : NO
value2 : 200
Thats all";
I want to convert it into a hash using a function which takes this string as input and returns a hash with status1 as key and YES as value for example.
How to do so?
And how to reference the returned hash?

Like always, there's more than one way to do it. Here come five.
Pure regular expressions (YEAH!)
I think this is the coolest one. The regex returns a list of all captures which is exactly the list we want to initialize the hash with:
my %regex = $str =~ /(\S+)\s*:\s*(\S+)/g;
Iterative
This is the most straightforward way for most programmers, I think:
my #lines = split /\R/ => $str;
my %iterative = ();
for (#lines) {
next unless /(\S+)\s*:\s*(\S+)/;
$iterative{$1} = $2;
}
Nothing to explain here. I first split the string in lines, then iterate over them, leaving out lines that don't look like foo : bar. Done.
List processing
Writing everything as a big list expression feels a little bit hackish, but maybe this is interesting to learn more ways to express stuff:
my %list = map { /(\S+)\s*:\s*(\S+)/ and $1 => $2 }
grep { /:/ }
split /\R/ => $str;
Read from right to left: Like in the example above we start with splitting the string in lines. grep filters the lines for : and in the final map I transform matching line strings in a list of length two, with a key and a value.
List reducing
Non-trivial use-cases of List::Util's reduce function are very rare. Here's one, based on the list approach from above, returning a hash reference:
my $reduced = reduce {
$a = { $a =~ /(\S+)\s*:\s*(\S+)/ } unless ref $a;
$a->{$1} = $2 if $b =~ /(\S+)\s*:\s*(\S+)/;
return $a;
} grep { /:/ } split /\R/ => $str;
State machine
Here's a funny one with regex usage for white-space separation only. It needs to keep track of a state:
# preparations
my $state = 'idle';
my $buffer = undef;
my %state = ();
my #words = split /\s+/ => $str;
# loop over words
for my $word (#words) {
# last word was a key
if ($state eq 'idle' and $word eq ':') {
$state = 'got_key';
}
# this is a value for the key in buffer
elsif ($state eq 'got_key') {
$state{$buffer} = $word;
$state = 'idle';
$buffer = undef;
}
# remember this word
else {
$buffer = $word;
}
}

Just for fun (note that I recommend using one of memowe's) here is one that (ab)uses the YAML:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML;
my $str = "list
XYZ
status1 : YES
value1 : 100
status2 : NO
value2 : 200
Thats all";
$str = join "\n", grep { /:/ } split "\n", $str;
my $hash = Load "$str\n";

#!/usr/bin/perl
use warnings;
$\="\n";
sub convStr {
my $str = $_[0];
my %h1=();
while ($str =~m/(\w+)\s+:\s+(\w+)/g) {
$h1{$1} =$2;
}
return \%h1;
}
my $str = "list
XYZ
status1 : YES
value1 : 100
status2 : NO
value2 : 200
Thats all";
my $href=convStr($str);
foreach (keys(%$href)) {
print $_ , "=>", $href->{$_};
}
On running this, I get:
status2=>NO
value1=>100
status1=>YES
value2=>200

my %hhash;
my #lines = split /\s+\n/, $str;
foreach (#lines)
{
$_=~s/^\s+//g;
if(/:/)
{
$key=(split(/:/))[0];
$value=(split(/:/))[1];
$hhash{$key}=$value;
}
}

Related

Perl - longest common prefix of 2 or more strings?

How can i create a Perl subroutine which would take in an array and find the longest common prefix for 2 or more of its elements? (strings)
I have this code:
sub longest_common_prefix {
$prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
But it only works if you are looking for the longest common prefix of all strings.
For example, if i pass an array with the following strings:
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
I want it to return aaa as the answer.
Thanks!

I'd use a modified trie.
Normally, one could use the following to add to a trie:
sub add {
my $p = \shift;
my $s = shift;
$p = \( $$p->{$_} ) for split(//, $s);
$$p->{''} = 1;
}
But we need two modifications:
All prefixes of a string must be added when adding a string. For example, adding abc should also add a and ab to the trie.
When adding to the trie, we want to return the length of previously-existing part of the path taken.
So we need:
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
$p = \( $$p->{$_} );
++$cp_len if $$p->{$_}{''};
$$p->{''} = 1;
}
return $cp_len;
}
Combine (an optimized version of) this with an algorithm to find the longest strings in a list and with an algorithm to remove duplicate strings from a list to get the following solution:
use strict;
use warnings;
use feature qw( say );
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
++$cp_len if exists($$p->{$_});
$p = \( $$p->{$_} );
}
return $cp_len;
}
my $t;
my $lcp_len = 0; # lcp = longest common prefix
my %lcps;
while (<>) {
chomp;
my $cp_len = add($t, $_)
or next;
if ($cp_len >= $lcp_len) {
if ($cp_len > $lcp_len) {
$lcp_len = $cp_len;
%lcps = ();
}
$lcps{ substr($_, 0, $cp_len) } = 1;
}
}
my #lcps = sort keys %lcps;
if (#lcps) {
say "Longest common prefix(es): #lcps";
} else {
say "No common prefix";
}
Data:
abc
abc
abcd
abcde
hijklx
hijkly
mnopqx
mnopqy
Output:
Longest common prefix(es): hijkl mnopq
The time taken by the above is proportional to the number of input characters.

One way would be to store the information in a hash. In this example, I set the hash key to the length of each prefix, and the value being the actual prefix found.
Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort() takes care of finding the longest one).
The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()ed into a scalar and put into the hash.
use warnings;
use strict;
my %prefixes;
while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}
my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";
__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
Output:
aaa

You can keep a hash of an array of words keyed by the first character. By definition, if you have words starting with the same letter, those words share at least a one character common prefix of that one letter. Then reduce to the single longest prefix by stepping through the hash by character:
use strict; use warnings;
sub lcp {
(join("\0", #_) =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx)[0];
}
my %HoA;
my $longest='';
while (my $line=<DATA>){
$line =~ s/^\s+|\s+$//g ;
push #{ $HoA{substr $line, 0, 1} }, $line if $line=~/^[a-zA-Z]/;
}
for my $key ( sort (keys %HoA )) {
if (scalar #{ $HoA{$key} } > 1){
my $lon=lcp(#{ $HoA{$key} });
my $s = join ', ', map { qq/"$_"/ } #{ $HoA{$key} };
print "lcp: \"$lon\" for ($s)\n";
if (length($lon) > length($longest)) {
$longest=$lon;
}
}
else{
print "$key: no common prefix\n";
}
}
print "\nlongest common prefix is \"$longest\"\n";
__DATA__
aardvark
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
interspecies
interstellar
interstate
Prints:
lcp: "aa" for ("aardvark", "aaaBGFB", "aaaJJJJ", "aaaHGHG")
lcp: "inters" for ("interspecies", "interstellar", "interstate")
j: no common prefix
longest common prefix is "inters"

Handling Nested Delimiters in perl

use strict;
use warnings;
my %result_hash = ();
my %final_hash = ();
Compare_results();
foreach my $key (sort keys %result_hash ){
print "$key \n";
print "$result_hash{$key} \n";
}
sub Compare_results
{
while ( <DATA> )
{
my($instance,$values) = split /\:/, $_;
$result_hash{$instance} = $values;
}
}
__DATA__
1:7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\ d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2:7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Output
1
7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2
7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Iam trying to fetch value of each key and again trying to split the comma seperated value from result hash , if i find a semicolon in any value i would want to store the left and right values in separate hash keys.
Something like below
1.#split the value of result_hash{$key} again by , and see whether any chunk is seperated by ;
2. #every chunk without ; and value on left with ; should be stored in
#{$final_hash{"eto"}} = ['7802315095\d\d','7802315098\d\d','7802025002\d\d','7802025003\d\d','7802025004\d\d','7802025005\d\d','7802025006\d\d','7802025007\d\d'] ;
3.#Anything found on the right side of ; has to be stored in
#{$final_hash{"pro"}} = ['7802025001\d\d'] ;
Is there a way that i can handle everything in the subroutine? Can i make the code more simpler
Update :
I tried splitting the string in a single shot, but its just picking the values with semicolon and ignoring everything
foreach my $key (sort keys %result_hash ){
# print "$key \n";
# print "$result_hash{$key} \n";
my ($o,$t) = split(/,|;/, $result_hash{$key});
print "Left : $o \n";
print "Left : $t \n";
#push #{$final_hash{"eto"}}, $o;
#push #{$final_hash{"pro"}} ,$t;
}
}
My updated code after help
sub Compare_results
{
open my $fh, '<', 'Data_File.txt' or die $!;
# split by colon and further split by , and ; if any (done in insert_array)
my %result_hash = map { chomp; split ':', $_ } <$fh> ;
foreach ( sort { $a <=> $b } (keys %result_hash) )
{
($_ < 21)
? insert_array($result_hash{$_}, "west")
: insert_array($result_hash{$_}, "east");
}
}
sub insert_array()
{
my ($val,$key) = #_;
foreach my $field (split ',', $val)
{
$field =~ s/^\s+|\s+$//g; # / turn off editor coloring
if ($field !~ /;/) {
push #{ $file_data{"pto"}{$key} }, $field ;
}
else {
my ($left, $right) = split ';', $field;
push #{$file_data{"pto"}{$key}}, $left if($left ne '') ;
push #{$file_data{"ero"}{$key}}, $right if($right ne '') ;
}
}
}
Thanks

Update Added a two-pass regex, at the end
Just proceed systematically, analyze the string step by step. The fact that you need consecutive splits and a particular separation rule makes it unwieldy to do in one shot. Better have a clear method than a monster statement.
use warnings 'all';
use strict;
use feature 'say';
my (%result_hash, %final_hash);
Compare_results();
say "$_ => $result_hash{$_}" for sort keys %result_hash;
say '---';
say "$_ => [ #{$final_hash{$_}} ]" for sort keys %final_hash;
sub Compare_results
{
%result_hash = map { chomp; split ':', $_ } <DATA>;
my (#eto, #pro);
foreach my $val (values %result_hash)
{
foreach my $field (split ',', $val)
{
if ($field !~ /;/) { push #eto, $field }
else {
my ($left, $right) = split ';', $field;
push #eto, $left;
push #pro, $right;
}
}
}
$final_hash{eto} = \#eto;
$final_hash{pro} = \#pro;
return 1; # but add checks above
}
There are some inefficiencies here, and no error checking, but the method is straightforward. If your input is anything but smallish please change the above to process line by line, what you clearly know how to do. It prints
1 => ... (what you have in the question)
---
eto => [ 7802315095\d\d 7802315098\d\d 7802025002\d\d 7802025003\d\ d ...
pro => [ 7802025001\d\d ]
Note that your data does have one loose \d\ d.
We don't need to build the whole hash %result_hash for this but only need to pick the part of the line after :. I left the hash in since it is declared global so you may want to have it around. If it in fact isn't needed on its own this simplifies
sub Compare_results {
my (#eto, #pro);
while (<DATA>) {
my ($val) = /:(.*)/;
foreach my $field (split ',', $val)
# ... same
}
# assign to %final_hash, return from sub
}
Thanks to ikegami for comments.
Just for the curiosity's sake, here it is in two passes with regex
sub compare_rx {
my #data = map { (split ':', $_)[1] } <DATA>;
$final_hash{eto} = [ map { /([^,;]+)/g } #data ];
$final_hash{pro} = [ map { /;([^,;]+)/g } #data ];
return 1;
}
This picks all characters which are not , or ;, using the negated character class, [^,;]. So that is up to the first either of them, left to right. It does this globally, /g, so it keeps going through the string, collecting all fields that are "left of" , or ;. Then it cheats a bit, picking all [^,;] that are right of ;. The map is used to do this for all lines of data.
If %result_hash is needed build it instead of #data and then pull the values from it with my #values = values %hash_result and feed the map with #values.
Or, broken line by line (again, you can build %result_hash instead of taking $data directly)
my (#eto, #pro);
while (<DATA>) {
my ($data) = /:(.*)/;
push #eto, $data =~ /([^,;]+)/g;
push #pro, $data =~ /;([^,;]+)/g;
}

Force auto-increment to treat its argument as string

I need Perl's auto-increment magic for strings, but some strings (such as those composed entirely of digits) are interpreted as numbers and a normal increment is performed instead. How would I force Perl to treat a value passed to ++ as a string?

Here's the related question about how auto incrementing works: Autoincrementing letters in Perl
Like the docs explained, basically you need the variable to
match the regex /^[a-zA-Z]*[0-9]*\z/ and
only be used in string contexts.
Because you have variables that don't match the regex, those ones will be treated as numbers. You can write your own increment function to get your desired functionality. Here's an idea I had about how it could work to get you started.
#!/usr/bin/perl
use strict;
use warnings;
my $test = "1000";
for (0..100) {
$test = increment($test);
}
print $test . "\n";
$test = "M2V3";
for (0..100) {
$test = increment($test);
}
print $test . "\n";
sub increment {
my ($str) = #_;
my #letters = reverse split //, $str;
my $add = "";
my $increment = 1;
my $result = "";
for my $let (#letters) {
if ( $increment == 1 ) {
++$let;
}
if ( $let =~ /(.)(.)/ ) {
$add = $2;
$increment = 1;
} else {
$add = $let;
$increment = 0;
}
$result = $add . $result;
}
return $result;
}
This outputs:
1101
M3F4
I didn't calculate to confirm that M3F4 is the correct result but it seems close.

How to skip splitting for some part of the line

Say I have a line lead=george wife=jane "his boy"=elroy. I want to split with space but that does not include the "his boy" part. I should be considered as one.
With normal split it is also splitting "his boy" like taking "his" as one and "boy" as second part. How to escape this
Following this i tried
split " ", $_
Just came to know that this will work
use strict; use warnings;
my $string = q(hi my name is 'john doe');
my #parts = $string =~ /'.*?'|\S+/g;
print map { "$_\n" } #parts;
But it does not looks good. Any other simple thing with split itself?

You could use Text::ParseWords for this
use Text::ParseWords;
$list = "lead=george wife=jane \"his boy\"=elroy";
#words = quotewords('\s+', 0, $list);
$i = 0;
foreach (#words) {
print "$i: <$_>\n";
$i++;
}
ouput:
0: <lead=george>
1: <wife=jane>
2: <his boy=elroy>

sub split_space {
my ( $text ) = #_;
while (
$text =~ m/
( # group ($1)
\"([^\"]+)\" # first try find something in quotes ($2)
|
(\S+?) # else minimal non-whitespace run ($3)
)
=
(\S+) # then maximum non-whitespace run ($4)
/xg
) {
my $key = defined($2) ? $2 : $3;
my $value = $4;
print( "key=$key; value=$value\n" );
}
}
split_space( 'lead=george wife=jane "his boy"=elroy' );
Outputs:
key=lead; value=george
key=wife; value=jane
key=his boy; value=elroy

PP posted a good solution. But just to make it sure, that there is a cool other way to do it, comes my solution:
my $string = q~lead=george wife=jane "his boy"=elroy~;
my #split = split / (?=")/,$string;
my #split2;
foreach my $sp (#split) {
if ($sp !~ /"/) {
push #split2, $_ foreach split / /, $sp;
} else {
push #split2,$sp;
}
}
use Data::Dumper;
print Dumper #split2;
Output:
$VAR1 = 'lead=george';
$VAR2 = 'wife=jane';
$VAR3 = '"his boy"=elroy';
I use a Lookahead here for splitting at first the parts which keys are inside quotes " ". After that, i loop through the complete array and split all other parts, which are normal key=values.

You can get the required result using a single regexp, which extract the keys and the values and put the result inside a hash table.
(\w+|"[\w ]+") will match both a single and multiple word in the key side.
The regexp captures only the key and the value, so the result of the match operation will be a list with the following content: key #1, value #1, key #2, value#2, etc.
The hash is automatically initiated with the appropriate keys and values, when the match result is assigned to it.
here is the code
my $str = 'lead=george wife=jane "hello boy"=bye hello=world';
my %hash = ($str =~ m/(?:(\w+|"[\w ]+")=(\w+)(?:\s|$))/g);
## outputs the hash content
foreach $key (keys %hash) {
print "$key => $hash{$key}\n";
}
and here is the output of this script
lead => george
wife => jane
hello => world
"hello boy" => bye

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.

My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.

If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).

From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";

Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Parsing a string into a hash structure in perl - perl

my %hhash; my #lines = split /\s+\n/, $str; foreach (#lines) { $_=~s/^\s+//g; if(/:/) { $key=(split(/:/))[0]; $value=(split(/:/))[1]; $hhash{$key}=$value; } }

Related

Perl - longest common prefix of 2 or more strings?

Handling Nested Delimiters in perl

Force auto-increment to treat its argument as string

How to skip splitting for some part of the line

How do I determine the longest similar portion of several strings?

Categories

Resources