I'm analysing a log file with Perl 5.8.8.[1] I'm looking for days on which at least one of two trigger patterns occurs, possibly just one of them, possibly both (I changed the actual patterns in the code snippet shown below). I'm interested in the number of occurrences per day; the next step will be to make a spreadsheet from it, which is why the output is formatted with tabs.
Because only one of the patterns may occur on a given day, I need a way to combine the keys of both hashes. I did that by generating a new hash. Is there a built-in function for that? I searched the web and Stack Overflow without any result; the only hit I got here was "Build a string from 2 hashes", but in that case the key sets were identical.
#!/usr/bin/perl -w
use strict;
use warnings;
use locale;
# input analysis: searching for two patterns:
my %pattern_a = ();
my %pattern_b = ();
foreach my $line (<>) {
if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
my $day = $1;
my $what = $2;
if ($what =~ m/beendet/) {
$pattern_a{$day} ++;
} elsif ($what =~ m/ohne/) {
$pattern_b{$day} ++;
}
}
}
# generate the union of hash keys: <-- In Question
my %union = ();
$union{$_} = 1 for keys %pattern_a;
$union{$_} = 1 for keys %pattern_b;
# formatted output sorted by day:
foreach my $day (sort keys %union) {
print join "\t", $day,
($pattern_a{$day} || 0),
($pattern_b{$day} || 0)."\n";
}
The expected output would look like this:
2017-02-01 0 1
2017-02-18 0 592
2017-02-19 2 0
[1] I'm aware that this Perl version is quite outdated. I use Perl rarely, and when I do, it has to go fast, so figuring out Perl versions and so on gets done later. The Perl version is not so important for the actual question, at least I hope so...
Wouldn't it be easier to use a single hash?
#!/usr/bin/perl
use strict;
use warnings;
my %stats;
while (my $line = readline) {
my ($day, $pattern) = $line =~ /^(\d{4}-\d{2}-\d{2}).*(beendet|ohne)/
or next;
$stats{$day}{$pattern}++;
}
for my $day (sort keys %stats) {
printf "%s\t%d\t%d\n",
$day,
$stats{$day}{beendet} // 0,
$stats{$day}{ohne} // 0;
}
If you're using a perl before 5.10, replace // by ||; it makes no effective difference in this case. (But consider upgrading: 5.8.8 is from 2006 and is now more than a decade old. At the time of writing, the officially maintained perl versions are 5.22 (2015) and 5.24 (2016).)
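If you would rather keep the two hashes from the question, core Perl has no dedicated function for a key union as far as I know, but the usual idiom is a throwaway hash built with map. The following is a minimal sketch with made-up sample counts (the data and the %seen / @union names are assumptions for illustration); it should also run on 5.8.

```perl
use strict;
use warnings;

# Sample counts standing in for the hashes filled while reading the log.
my %pattern_a = ( '2017-02-19' => 2 );
my %pattern_b = ( '2017-02-01' => 1, '2017-02-18' => 592 );

# Union of keys: duplicate keys collapse automatically in %seen.
my %seen  = map { $_ => 1 } keys(%pattern_a), keys(%pattern_b);
my @union = sort keys %seen;

# Same tab-separated output as in the question.
for my $day (@union) {
    print join( "\t", $day, $pattern_a{$day} || 0, $pattern_b{$day} || 0 ), "\n";
}
```

This is the same technique as the two `$union{$_} = 1 for keys ...` lines in the question, just written as one expression.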
It's easier to structure your data first by day, then by pattern. That can be done using a hash reference.
use strict;
use warnings;
my %matches;
while ( my $line = <DATA> ) {
if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
my $day = $1;
my $what = $2;
if ($what =~ m/beendet/) {
$matches{$day}->{a} ++;
} elsif ($what =~ m/ohne/) {
$matches{$day}->{b} ++;
}
}
}
# formatted output sorted by day:
foreach my $day (sort keys %matches) {
print join(
"\t",
$day,
$matches{$day}->{a} || 0,
$matches{$day}->{b} || 0,
), "\n";
}
__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los
That program produces output as follows
2017-02-01 0 1
2017-02-02 1 0
2017-02-03 0 1
To understand the data structure, you can use Data::Dumper to output it (though I suggest using Data::Printer instead, as that's intended for human consumption and not as a serialization).
use Data::Dumper;
print Dumper \%matches;
__END__
$VAR1 = {
'2017-02-03' => {
'b' => 1
},
'2017-02-02' => {
'a' => 1
},
'2017-02-01' => {
'b' => 1
}
};
As you can see, the data is structured first by date. Each key represents one day. Inside each, there is an additional hash reference that, in this sample, holds only one key: the pattern. Later we iterate over the days first. Then we get
{
'b' => 1
}
in the first iteration. Then we would iterate over all the patterns. The above program does this not by actually looping, but by explicitly naming each possible key. If the key is there, its count is used; if it's missing, the || operator substitutes 0.
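For comparison, here is a small sketch that really does loop over the pattern keys instead of naming them one by one. The sample %matches data and the @patterns column list are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in for the structure built while reading the log.
my %matches = (
    '2017-02-01' => { b => 1 },
    '2017-02-02' => { a => 1 },
);
my @patterns = qw(a b);    # assumed fixed column order

my @lines;
for my $day ( sort keys %matches ) {
    # One column per pattern; a missing key falls back to 0.
    push @lines, join "\t", $day, map { $matches{$day}{$_} || 0 } @patterns;
}
print "$_\n" for @lines;
```

Adding a third pattern then only means adding one element to @patterns.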
The program can be further simplified to use arbitrary patterns. If you don't care about the order of the patterns in the output, include a header and you can easily add more patterns later.
I used a config hash for the patterns, and Text::Table to create the output.
use strict;
use warnings;
use Text::Table;
my %matches;
my %patterns = (
beendet => qr/beendet/,
ohne => qr/ohne/,
komplex => qr/foo\sbar?/, # or whatever
);
while ( my $line = <DATA> ) {
if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
my $day = $1;
my $what = $2;
foreach my $name ( sort keys %patterns ) {
if ( $what =~ $patterns{$name} ) {
$matches{$day}->{$name}++ ;
last;
}
}
}
}
# formatted output sorted by day:
my @head = sort keys %patterns;
my $tb = Text::Table->new( 'Tag', @head );
foreach my $day (sort keys %matches) {
$tb->load([ $day, map { $matches{$day}->{$_} || 0 } @head ]);
}
print $tb;
__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los
This prints
Tag beendet komplex ohne
2017-02-01 0 0 1
2017-02-02 1 0 0
2017-02-03 0 0 1
If you don't want to install an additional module, maybe just create a CSV file. Since you're from Germany, I suggest a semicolon ; as the separator, because German Excel uses that as the default.
Here is a verbose example of how to do this instead of Text::Table.
my @head = sort keys %patterns;
print join( ';', @head ), "\n";
foreach my $day (sort keys %matches) {
    my @cols;
    push @cols, $matches{$day}->{$_} || 0 for @head;
    print join ';', $day, @cols;
    print "\n";
}
And the output is
beendet;komplex;ohne
2017-02-01;0;0;1
2017-02-02;1;0;0
2017-02-03;0;0;1
But you should also look into Text::CSV if you don't want this to go to the screen.
Here is some Perl code which takes two files as input. The files contain TCP packets. It trains itself for the normal packets using the packets in first file and then prints the anomalous packets in the second file.
while (<>) {
if (($time, $to, $port, $from, $duration, $flags, $length, $text) = /(.{19}) (.{15}):(\d+) (.{15}):\d+ \+(\d+) (\S+) (\d+) (.*)/) {
$text =~ s/\^M//g;
$text =~ s/\^ /\n/g;
if (($port == 25 || $port == 80) && $text =~ /\n\n/) {$text = "$`\n";}
$text =~ s/^\^#//;
if ($time =~ /(\d\d)\/(\d\d)\/\d\d\d\d (\d\d):(\d\d):(\d\d)/) {
$now = ((($1 * 31 + $2) * 24 + $3) * 60 + $4) * 60 + $5;
}
foreach ($text =~ /.*\n/g) {
if (($k, $v) = /(\S*)(.*)/) {
$k = substr($k, 0, 30);
$v = substr($v, 0, 100);
$score = 0;
$comment = "";
&alarm($port, $k);
&alarm($to, $flags);
&alarm("To", "$to:$port");
&alarm($to, $from);
&alarm("$to:$port", $from);
if ($score > 30000) {
$score = log($score) / (10 * log(10));
printf(" # 0 $time $to %8.6f \#%s\n", $score, substr($comment, 0, 300));
}
}
}
}
}
sub alarm {
local ($key, $val, $sc) = @_;
if ($now < 10300000) {
++$n{$key};
if (++$v{$key . $val} == 1) {
++$r{$key};
$t{$key} = $now;
}
} elsif ($n{$key} > 0 && !$v{$key . $val}) {
$score += ($now - $t{$key}) * $n{$key} / $r{$key};
$comment .= " $key=$val";
$t{$key} = $now;
}
}
exit;
I am new to Perl, and a small part of my project requires the anomaly score to be printed for all the packets in the second file. Can anybody tell me how to modify the code?
From what I can see here, it looks as if the code (as it is now) looks for packets before some cutoff time, and stores whether or not it has seen certain conditions in the %n and %v hashes.
Why not give your alarm function an extra flag called $training? If true, just account for the packet values; otherwise, calculate a score for this anomaly (if it is one), and return that value. If there is no anomaly, or if you're in training mode, just return zero:
sub alarm {
    my ($key, $val, $training) = @_;
    my $score = 0;
    if ( $training ) {
        # ...do your accounting...
    } else {
        # ...do your comparisons & set score accordingly...
    }
    return $score;
}
Throw your big while into a subroutine, and have that subroutine take a filename and whether it is in training mode or not.
sub examine {
    my ($file, $training) = @_;
    if ( open my $fh, '<', $file ) {
        while (<$fh>) {
            # ...this is your big while loop...
            # ...pass $training along to your alarm() calls...
        }
    } else {
        die "Failed to open $file: $!\n";
    }
}
Your main program is now:
use constant TRAINING => 1;
examine('file1', TRAINING);
examine('file2', !TRAINING);
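To make the training flag concrete, here is a deliberately simplified sketch. It only distinguishes "seen during training" from "never seen" and leaves out the time-weighted scoring of the original; the names check_packet_value, %n_seen and %value_seen are hypothetical.

```perl
use strict;
use warnings;

my ( %n_seen, %value_seen );    # hypothetical names for the %n and %v counters

# In training mode, record the key/value pair and return 0.
# Otherwise return 1 if the key is known but this value was never seen.
sub check_packet_value {
    my ( $key, $val, $training ) = @_;
    if ($training) {
        $n_seen{$key}++;
        $value_seen{"$key\0$val"}++;
        return 0;
    }
    return ( $n_seen{$key} && !$value_seen{"$key\0$val"} ) ? 1 : 0;
}

check_packet_value( 80, 'GET',  1 );    # training pass
check_packet_value( 80, 'POST', 1 );
my $known   = check_packet_value( 80, 'GET',   0 );    # seen in training
my $unknown = check_packet_value( 80, 'TRACE', 0 );    # never seen
print "known=$known unknown=$unknown\n";
```

The real function would return the weighted score instead of 1, but the control flow stays the same.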
More notes:
Use my() instead of local; though it doesn't materially affect this program, it's a good habit to get into.
Don't use a well known function name alarm when it really isn't doing anything of the kind, instead name it something like check_packet_values -- or something that makes sense to you and your team.
Stop using magic numbers:
use constant {
CUTOFF_TIME => 10300000,
ANOMALY_SCORE => 30000
};
Use a real date/time parser so that your values have some meaning. str2time from Date::Parse would give you your time in epoch seconds (seconds since Jan 1, 1970).
Use variable names that mean something. %n and %v are hard to understand in this code, but %n_seen and %value_seen (as well as %first_seen_time instead of %t) explain themselves. Remember, your code doesn't run faster if you use shorter variable names.
Stop using global variables when feasible. The counters can be global, but your comment should be built only in the routine which is initializing and printing the comment. So, instead of doing what you are doing, how about:
$to_score = check_packet_value($to, $flags)
    and push @comments, "$to=$flags";
...
$score = $to_score + $from_score + ...
if ( !$training && $score > ANOMALY_SCORE ) {
    print "blah blah blah @comments\n";
}
Also, never, ever use $` -- it causes huge performance penalties in your entire script (even if it never calls this function). Instead of:
if ( $text =~ /\n\n/ ) { $text = $` }
Use
if ( $text =~ /\A(.*?)\n\n/s ) { # /s lets . match newlines, so $1 is everything before the first blank line
$text = $1;
}
(Edit: added warning about $`)
I may have misunderstood your question and comment, so forgive me if this isn't what you're asking...
Your printf function currently resides inside this if ($score > 30000) check, so you'll only get the output if the $score is > 30000.
if ($score>30000) {
$score=log($score)/(10*log(10));
printf(" # 0 $time $to %8.6f \#%s\n", $score, substr($comment, 0, 300));
}
If you want to print the output regardless of the $score, you just need to move the printf line outside this if check.
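One thing to watch if the log() line moves out of the check as well: Perl dies on log(0), and a packet without any anomaly has a score of 0. A minimal sketch of a guard follows; scaled_score is a hypothetical helper name, and returning 0 for non-positive scores is an assumption about what you want to print.

```perl
use strict;
use warnings;

# Scale a raw score the way the original code does, but return 0 for
# raw scores that are not positive, because log(0) is a fatal error.
sub scaled_score {
    my ($raw) = @_;
    return 0 unless $raw > 0;
    return log($raw) / ( 10 * log(10) );
}

# Print a scaled score for every value, not only those above 30000.
printf "%8.6f\n", scaled_score($_) for 0, 30000, 1_000_000;
```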
I have a file in below format.
DATE Time, v1,v2,v3
05:33:25,n1,n2,n3
05:34:25,n4,n5,n5
05:35:24,n6,n7,n8
and so on up to 05:42:25.
I want to calculate the sums of v1, v2 and v3 for every 5 min interval. I have written the sample code below.
while (<STDIN>) {
my ($dateTime, $v1, $v2, $v3) = split /,/, $_;
my ($date, $time) = split / /, $dateTime;
}
I can read all the values but need help summing them for every 5 min interval. Can anyone please suggest code that groups the times and adds up the values for every 5 min?
Required output
05:33 v1(sum 05:33 to 05:37) v2(sum 05:33 to 05:37) v3(sum 05:33 to 05:37)
05:38 v1(sum 05:38 to 05:42) v2(sum 05:38 to 05:42) v3(sum 05:38 to 05:42)
and so on..
The code is a variation of the previous answer by Sinan Ünür below, except:
(1) Function timelocal will allow you to read in Day,Month,Year -- so you can sum any five minute gap.
(2) Should deal with case where final time gap is < 5 minutes.
#!/usr/bin/perl -w
use strict;
use warnings;
use Time::Local;
use POSIX qw(strftime);
my ( $start_time, $end_time, $current_time );
my ( $totV1, $totV2, $totV3 ); #totals in time bands
while (<DATA>) {
my ( $hour, $min, $sec, $v1, $v2, $v3 ) =
( $_ =~ /(\d+)\:(\d+)\:(\d+)\,(\d+),(\d+),(\d+)/ );
#convert time to epoch seconds
$current_time =
timelocal( $sec, $min, $hour, (localtime)[ 3, 4, 5 ] ); #sec,min,hr
if ( !$end_time ) {
$start_time = $current_time;
$end_time = $start_time + 5 * 60; #plus 5 min
}
if ( $current_time <= $end_time ) {
$totV1 += $v1;
$totV2 += $v2;
$totV3 += $v3;
}
else {
print strftime( "%H:%M:%S", localtime($start_time) ),
" $totV1,$totV2,$totV3\n";
$start_time = $current_time;
$end_time = $start_time + 5 * 60; #plus 5 min
( $totV1, $totV2, $totV3 ) = ( $v1, $v2, $v3 );
}
}
#Print results of final loop (if required)
if ( $current_time <= $end_time ) {
print strftime( "%H:%M:%S", localtime($start_time) ),
" $totV1,$totV2,$totV3\n";
}
__DATA__
05:33:25,29,74,96
05:34:25,41,69,95
05:35:25,24,38,55
05:36:25,96,63,70
05:37:25,84,65,74
05:38:25,78,58,93
05:39:25,51,38,19
05:40:25,86,40,64
05:41:25,80,68,65
05:42:25,4,93,81
Output:
05:33:25 352,367,483
05:39:25 221,239,229
Obviously, not tested much, for lack of sample data. For parsing the CSV, use either Text::CSV_XS or Text::xSV rather than the naive split below.
Note:
This code does not make sure the output has all consecutive five minute blocks if the input data has gaps.
You will have problems if there are time stamps from multiple days. In fact, if the time stamps are not in 24-hour format, you will have problems even if the data are from a single day.
With those caveats, it should still give you a starting point.
#!/usr/bin/perl
use strict;
use warnings;
my $split_re = qr/ ?, ?/;
my @header = split $split_re, scalar <DATA>;
my @data;
my $time_block = 0;
while ( my $data = <DATA> ) {
    last unless $data =~ /\S/;
    chomp $data;
    my ($ts, @vals) = split $split_re, $data;
    my ($hr, $min, $sec) = split /:/, $ts;
    my $secs = 3600*$hr + 60*$min + $sec;
    # start a new block once more than five minutes have passed
    if ( $secs > $time_block + 300 ) {
        $time_block = $secs;
        push @data, [ $time_block ];
    }
    for my $i (1 .. @vals) {
        $data[-1]->[$i] += $vals[$i - 1];
    }
}
print join(', ', @header);
for my $row ( @data ) {
    my $ts = shift @$row;
    print join(', ',
        # $ts is seconds since midnight, so use gmtime to avoid a time zone shift
        sprintf('%02d:%02d', (gmtime($ts))[2,1])
        , @$row
    ), "\n";
}
__DATA__
DATE Time, v1,v2,v3
05:33:25,1,3,5
05:34:25,2,4,6
05:35:24,7,8,9
05:55:24,7,8,9
05:57:24,7,8,9
Output:
DATE Time, v1, v2, v3
05:33, 10, 15, 20
05:55, 14, 16, 18
This is a good problem for Perl to solve. The hardest part is taking the value from the datetime field and identifying which 5 minute bucket it belongs to. The rest is just hashes.
my (%v1,%v2,%v3);
while (<STDIN>) {
my ($datetime,$v1,$v2,$v3) = split /,/, $_;
my ($date,$time) = split / /, $datetime;
my $bucket = &get_bucket_for($time);
$v1{$bucket} += $v1;
$v2{$bucket} += $v2;
$v3{$bucket} += $v3;
}
foreach my $bucket (sort keys %v1) {
print "$bucket $v1{$bucket} $v2{$bucket} $v3{$bucket}\n";
}
Here's one way you could implement &get_bucket_for:
my $first_hhmm;
sub get_bucket_for {
my ($time) = @_;
my ($hh,$mm) = split /:/, $time; # looks like seconds are not important
# buckets are five minutes apart, but not necessarily at multiples of 5 min
# (i.e., buckets could go 05:33,05:38,... instead of 05:30,05:35,...)
# Use the value from the first time this function is called to decide
# what the starting point of the buckets is.
if (!defined $first_hhmm) {
$first_hhmm = $hh * 60 + $mm;
}
my $bucket_index = int(($hh * 60 + $mm - $first_hhmm) / 5);
my $bucket_start = $first_hhmm + 5 * $bucket_index;
return sprintf "%02d:%02d", $bucket_start / 60, $bucket_start % 60;
}
I'm not sure why you would use the times starting from the first time, instead of round 5 minute intervals (00 - 05, 05 - 10, etc), but this is a quick and dirty way to do it your way:
my %output;
my $last_min = -10; # -10 + 5 is less than any positive int.
while (<STDIN>) {
my ($dt, $v1, $v2, $v3) = split(/,/, $_);
my ($h, $m, $s) = split(/:/, $dt);
my $ts = $m + ($h * 60);
if (($last_min + 5) < $ts) {
$last_min = $ts;
}
$output{$last_min}{1} += $v1;
$output{$last_min}{2} += $v2;
$output{$last_min}{3} += $v3;
}
foreach my $ts (sort {$a <=> $b} keys %output) {
my $hour = int($ts / 60);
my $minute = $ts % 60;
printf("%01d:%02d v1(%i) v2(%i) v3(%i)\n", (
$hour,
$minute,
$output{$ts}{1},
$output{$ts}{2},
$output{$ts}{3},
));
}
Not sure why you would do it this way, but there you go, in procedural Perl, as an example. If you need more on the printf formatting, see the sprintf documentation (perldoc -f sprintf).
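For the round five-minute intervals mentioned above (05:30, 05:35, ...), the bucket can be computed directly from the time without remembering the first time stamp. A minimal sketch, assuming HH:MM:SS times in 24-hour format from a single day; bucket_start is a hypothetical helper name and the three input lines are sample data.

```perl
use strict;
use warnings;

# Map an HH:MM:SS time to the start of its fixed five-minute bucket,
# for example 05:33:25 -> 05:30.
sub bucket_start {
    my ($time) = @_;
    my ( $h, $m ) = split /:/, $time;
    my $minutes = $h * 60 + $m;
    $minutes -= $minutes % 5;    # round down to a multiple of 5
    return sprintf '%02d:%02d', int( $minutes / 60 ), $minutes % 60;
}

my %sum;
for my $line ( '05:33:25,1,3,5', '05:34:25,2,4,6', '05:35:24,7,8,9' ) {
    my ( $time, @vals ) = split /,/, $line;
    my $bucket = bucket_start($time);
    # Add each value to the running total of its column in this bucket.
    $sum{$bucket}[$_] += $vals[$_] for 0 .. $#vals;
}
print "$_ @{ $sum{$_} }\n" for sort keys %sum;
```

Because the bucket label is zero-padded, a plain string sort keeps the output in time order.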
I was wondering if anyone had any suggestions for improving the following code (if possible) so that it doesn't need the repeated (my @a = $time =~ ...), possibly using case/switch or given/when or some other idea that I'm missing?
my $time = '12:59pm';
if( my @a = $time =~ m/^(\d\d?)(am|pm)$/ ) { tell_time( $a[0], 0, $a[1] ) }
if( my @a = $time =~ m/^(\d\d?):(\d\d)(am|pm)$/ ) { tell_time( @a ) }
if( my @a = $time =~ m/^(\d\d?):(\d\d)$/ ) { tell_time( @a ) }
sub tell_time
{
my $hour = shift;
my $minute = shift || '00';
my $ampm = shift || ( $hour > 12 ? 'pm' : 'am' ); # parenthesized because || binds tighter than ?:
print "Hour: $hour, Minute: $minute, AMPM: $ampm\n";
}
I've tried playing around with Switch and the 5.10 given/when but can't seem to be able to do something like:
given( $time )
{
when( /^(\d\d?)(am|pm)$/ ) { tell_time( $_[0], 0, $_[1] ) }
when( /^(\d\d?):(\d\d)(am|pm)$/ ) { tell_time( @_ ) }
when( /^(\d\d?):(\d\d)$/ ) { tell_time( @_ ) }
}
That doesn't fly because @_ appears to be storing $time.
Also note that I'm more interested in the syntax of the problem than in the problem the code solves. I'm well aware that I could use Time::ParseDate to figure out the various parts of a string formatted like a time or date.
Your regex uses () to extract matches, but you don't have to store these in an array. If you want, they're stored in $1, $2, $3, and so on. Lookie:
given( $time )
{
when( /^(\d\d?)(am|pm)$/ ) { tell_time( $1, 0, $2 ) }
when( /^(\d\d?):(\d\d)(am|pm)$/ ) { tell_time( $1, $2, $3 ) }
when( /^(\d\d?):(\d\d)$/ ) { tell_time( $1, $2 ) }
}
Does exactly what I think you want to do.
If you want to add to the syntax, I would write tell_time() to simply take the time as a string, and have the function parse the result itself, rather than make the user of your code parse it himself. Alternatively, you could use this given() block as the start of a new function that does exactly that - parses a time string and passes it correctly to tell_time(). But that's just me. I don't know what you need your code to do, so by all means go for it.
Well, without using switch/case, I'd just use a single regex to capture all the variations...
#!/usr/bin/perl
tell_time ("12:59am"); # matches time format 1
tell_time ("2:59pm"); # matches time format 1
tell_time ("12am"); # matches time format 2
tell_time ("12:59"); # matches time format 3
tell_time ("14:59"); # matches time format 3
tell_time ("12:59:59am"); # produces no output, does not match any known time formats.
sub tell_time
{
my $timearg = shift;
# note: (?: ... ) creates a non-capturing group, which is not reflected in
# the returned array.
my ($hour , $minute, $ampm) = ( $timearg =~ m/^(\d\d?)(?::(\d\d?))?(am|pm)?$/ ) ;
# only continue if we captured all required fields (i.e. hour)
if($hour)
{
# set default values for optional fields (i.e. minute, ampm) if necessary
$minute ||= '00';
$ampm ||= ( $hour > 12 ) ? 'pm' : 'am';
print "Hour: $hour, Minute: $minute, AMPM: $ampm\n";
}
}
I can explain it further if necessary, but I think if you can read perl it should be clear what it's doing...
Since you are using 5.10, you might as well use named captures in your regex:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
my $hour24 = qr/(?<hour>[1-9]|1[0-9]|2[0-3])/;
my $hour12 = qr/(?<hour>[1-9]|1[0-2])/;
my $minute = qr/(?<minute>[0-5][0-9])/;
my $meridiem = qr/(?<meridiem>am|AM|pm|PM)/;
for my $time (qw(5pm 10am 5:59pm 10:00pm 5:00 22:00 24:00)) {
given($time) {
when(/ ^ $hour12 $meridiem $ /x) {
my $hour = $+{hour};
$hour += 12 if 'pm' eq lc $+{meridiem};
tell_time($hour, "00")
}
when(/ ^ $hour12 : $minute $meridiem $ /x) {
my $hour = $+{hour};
$hour += 12 if 'pm' eq lc $+{meridiem};
tell_time($hour, $+{minute})
}
when(/ ^ $hour24 : $minute $ /x) {
tell_time($+{hour}, $+{minute})
}
default {
say "bad time: $time";
}
}
}
sub tell_time {
my ($hour, $minute) = @_;
say "it is $hour:$minute";
}
Chris Lutz already covered the switch syntax using Perl 5.10. In older versions of Perl you can use loop aliasing to emulate one:
for ($time) {
/^(\d\d?)(am|pm)$/ && do { tell_time( $1, 0, $2 ); last };
/^(\d\d?):(\d\d)(am|pm)$/ && do { tell_time( $1, $2, $3 ); last };
/^(\d\d?):(\d\d)$/ && do { tell_time( $1, $2 ); last };
}
I am not sure if the given/when aspect is important here. I would just combine the possible patterns in a single regex. Combined with the special variable %+ and the defined-or operator, we can make the code more succinct.
#!/usr/bin/perl
use 5.010; # named captures and // need Perl 5.10 or later
use strict;
use warnings;
my @times = qw( 12:59pm 12 1pm 13:11 11 11pm);
my $hour_pat = '(?<hour>[0-9]{1,2})';
my $minute_pat = '(?<minute>[0-9]{2})';
my $ampm_pat = '(?<ampm>am|pm)';
# the outer (?: ... ) makes \A and \z apply to every alternative
my $re = qr{
    \A
    (?:
        (?:$hour_pat : $minute_pat $ampm_pat)
        |
        (?:$hour_pat : $minute_pat)
        |
        (?:$hour_pat $ampm_pat)
        |
        (?:$hour_pat)
    )
    \z
}x;
for my $time ( @times ) {
    if ( $time =~ $re ) {
        tell_time( %+ );
    }
}
sub tell_time {
    my %time = @_;
    printf( "Hour: %2.2d, Minute: %2.2d, AMPM: %s\n",
        $time{hour},
        $time{minute} // 0,
        $time{ampm} // ( $time{hour} >= 12 ? 'pm' : 'am' ),
    );
    return;
}
I create switch with a block label, like this:
my $time = '12:59pm';
SWITCH: {
$time =~ /^(\d\d?)(am|pm)$/ && do {
tell_time($1,0,$2);
last SWITCH;
};
$time =~ /^(\d\d?):(\d\d)(am|pm)$/ && do {
tell_time($1,$2,$3);
last SWITCH;
};
$time =~ /^(\d\d?):(\d\d)$/ && do {
tell_time($1,$2);
};
}
As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I apologize for a mistake. I overlooked that using a my variable inside countit(x, q{}) is a big mistake: the string is evaluated inside the Benchmark module, where @str was empty. This solution is not as fast as I first presented. See the correction below.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i = 0;
    my $first = shift;
    my $min_length = length($first);
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }
    INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr($string, $i, 1) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my @str = map { chomp; $_ } <DATA>;
is( LCP::LCP(@str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(@str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(@str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(@str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that the pure Perl solution using substr is about 20% faster than Roy's solution on your test case, and finding one prefix takes about 50 µs. XS is not necessary unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
    # longest_common_prefix( $|@ ): returns $
    # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
    # find longest common prefix of scalar list
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
    }
    return $prefix;
}
my @str = map {chomp; $_} <DATA>;
print longest_common_prefix(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
    my @str = @_;
    my @pos;
    for my $i (0 .. $#str) {
        my $line = $str[$i];
        for (0 .. length($line) - 1) {
            my $char= substr($line, $_, 1);
            push @{$pos[$i]{$char}}, $_;
        }
    }
    my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str;
    my %map;
    CHAR:
    for my $char (split //, $sh_str) {
        my @loop;
        for (0 .. $#pos) {
            next CHAR if ! $pos[$_]{$char};
            push @loop, $pos[$_]{$char};
        }
        my $next = NestedLoops([@loop]);
        while (my @char_map = $next->()) {
            my $key = join '-', @char_map;
            $map{$key} = $char;
        }
    }
    my @pile;
    for my $seq (keys %map) {
        push @pile, $map{$seq};
        for (1 .. 2) {
            my $dir = $_ % 2 ? 1 : -1;
            my @offset = split /-/, $seq;
            $_ += $dir for @offset;
            my $next = join '-', @offset;
            while (exists $map{$next}) {
                $pile[-1] = $dir > 0 ?
                    $pile[-1] . $map{$next} : $map{$next} . $pile[-1];
                $_ += $dir for @offset;
                $next = join '-', @offset;
            }
        }
    }
    return reduce {length($a) > length($b) ? $a : $b} @pile;
}
my @str = map {chomp; $_} <DATA>;
print LCS(@str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
@array = split(//, $string);
(splitting on an empty pattern puts each character into its own element of the array)
Then do a loop, perhaps overall:
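my $n = 0;
my @array1 = split(//, $string1);
my @array2 = split(//, $string2);
my @array3 = split(//, $string3);
# compare with eq (string equality) and stop at the end of the shortest string
while ($n < @array1 && $n < @array2 && $n < @array3
    && $array1[$n] eq $array2[$n] && $array2[$n] eq $array3[$n]) {
    $n++;
}
my $sameString = substr($string1, 0, $n); # $n is the length of the common prefix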
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 until the ith characters don't all match and return the first i characters (substr(s, 0, i) in Perl).
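The loop described above could be sketched in Perl roughly like this, for any number of strings. The function name common_prefix_of is made up for this example, and the inputs are the three URLs from the question.

```perl
use strict;
use warnings;

# Longest common prefix of a list of strings, compared character by character.
sub common_prefix_of {    # hypothetical name
    my @strings = @_;
    return '' unless @strings;
    my $first = shift @strings;
    my $i     = 0;
    CHAR: while ( $i < length $first ) {
        my $ch = substr $first, $i, 1;
        for my $s (@strings) {
            # Stop at the end of a shorter string or at the first mismatch.
            last CHAR if $i >= length($s) || substr( $s, $i, 1 ) ne $ch;
        }
        $i++;
    }
    return substr $first, 0, $i;    # $i characters matched in every string
}

print common_prefix_of(
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
), "\n";
```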
From http://forums.macosxhints.com/showthread.php?t=33780
my @strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( @strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift @_;
while ($_ = shift @_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift @_;
while ($_ = shift @_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}
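To see why the xor trick works, here is a tiny stand-alone sketch for two strings only. It assumes plain byte strings and that the bitwise feature (which makes ^ numeric) is not enabled; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $x = 'file:///home/gms8994/Music/t.A.T.u./';
my $y = 'file:///home/gms8994/Music/nina%20sky/';

# Bitwise string XOR: equal bytes at the same offset become "\0".
my $xor = $x ^ $y;

# The run of leading NUL bytes is as long as the common prefix.
my ($nuls) = $xor =~ /^(\0*)/;
my $prefix = substr $x, 0, length $nuls;

print "$prefix\n";
```

The subs above repeat this comparison against each further string, shrinking $comm as they go.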