Cant get Weighted Cosine Similarity to work - perl

I'm trying to get the Weighted Cosine Similarity of two documents. I'm using Text::Document and Text::DocumentCollection. My code seems to work but it isn't returning a number as I expected.
Here is my code
use strict;
use warnings;
use Text::Document;
use Text::DocumentCollection;
my $newfile = shift #ARGV;
my $newfile2 = shift #ARGV;
##This is in another file.
my $t1 = countFreq($newfile);
my $t2 = countFreq($newfile2);
my $collection = Text::DocumentCollection->new(file => 'coll.db');
$collection->Add("One", $t1);
$collection->Add("Two", $t2);
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
print "\nWeighted Cosine Sim is: $wSim\n";
All this returns is Weighted Cosine Sim is: without anything following the colon.
Here is the code for countFreq:
sub countFreq{
my ($file) = #_;
my $t1 = Text::Document->new();
open (my $info, $file) or die "Could not open file.";
while (my $line = <$info>) {
chomp $line;
$line =~ s/[[:punct:]]//g;
foreach my $str (split /\s+/, $line) {
if (!defined $sp{lc($str)}) {
$t1 -> AddContent ($str);
}
}
}
return $t1;
}

###Update
Here's an example program that works fine. It's based on looking at the test code in the distribution for inspiration
I was expecting the test to be much less sensitive, so I was getting zeroes from two wildly different text sources. This example adds three short sentences $d1, $d1, and $d3, to a collection $c, and then compares each of the three documents to $d1
Comparing $d1 to itself produces 1 -- an exact match, as expected, while comparing $d2and $d3 gives 0.087 and 0 respectively -- a partial match and no match at all
I hope this helps you to resolve your specific issue?
use strict;
use warnings 'all';
use Text::Document;
use Text::DocumentCollection;
my $d1 = Text::Document->new;
$d1->AddContent( 'my heart belongs to sally webster' );
my $d2 = Text::Document->new;
$d2->AddContent( 'my heart belongs to the girl next door' );
my $d3 = Text::Document->new;
$d3->AddContent( 'I want nothing to do with my neighbours' );
my $c = Text::DocumentCollection->new( file => 'coll2.db' );
$c->Add('one', $d1);
$c->Add('two', $d2);
$c->Add('three', $d3);
for my $doc ( $d1, $d2, $d3 ) {
my $wcs = $d1->WeightedCosineSimilarity(
$doc,
\&Text::DocumentCollection::IDF,
$c
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wcs;
print $wcs, "\n";
}
###output
1
0.0874311036726221
0
This is the code for Text::Document::WeightedCosineSimilarity
# this is rather rough
sub WeightedCosineSimilarity
{
my $self = shift;
my ($e,$weightFunction,$rock) = #_;
my ($Dv,$Ev) = ($self->{terms}, $e->{terms});
# compute union
my %union = %{$self->{terms}};
my #keyse = keys %{$e->{terms}};
#union{#keyse} = #keyse;
my #allkeys = keys %union;
# weighted D
my #Dw = map(( defined( $Dv->{$_} )?
&{$weightFunction}( $rock, $_ )*$Dv->{$_} : 0.0 ),
#allkeys
);
# weighted E
my #Ew = map(( defined( $Ev->{$_} )?
&{$weightFunction}( $rock, $_ )*$Ev->{$_} : 0.0 ),
#allkeys
);
# dot product of D and E
my $dotProduct = 0.0;
map( $dotProduct += $Dw[$_] * $Ew[$_] , 0..$#Dw );
# norm of D
my $nD = 0.0;
map( $nD += $Dw[$_] * $Dw[$_] , 0..$#Dw );
$nD = sqrt( $nD );
# norm of E
my $nE = 0.0;
map( $nE += $Ew[$_] * $Ew[$_] , 0..$#Ew );
$nE = sqrt( $nE );
# dot product scaled by norm
if( ($nD==0) || ($nE==0) ){
return undef;
} else {
return $dotProduct / $nD / $nE;
}
}
I'm afraid I don't understand the theory behind what it is doing, but it looks like your problem is that either $nD ("norm of D") or $nE ("norm of D") is zero
All I can suggest is that your two text samples may be too similar/different, or perhaps they are too long/short?
Either way, your code should look like this so as to catch an invalid return value from the cosine function:
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wSim;
print "\nWeighted Cosine Sim is: $wSim\n";

Related

Perl Expectation Maximization of log-odds scores of DNA sequences

My overall goal for this code is to find the motif (smaller sequence) in each DNA sequence that will report a maximum log-odds score based on a log-odds scoring matrix. The .txt file I am searching through looks like this:
>Sequence 1
TCGACGATCAGACAG
>Sequence 2
TGGGACTTGCACG
.... and so on.
I am working on the maximization step of my code at the moment, and I am struggling to calculate a log-odds score of a motif within my DNA sequence. I have code that creates the log-odds scoring matrix - see the following:
#!/usr/bin/perl -w
#Above line used in Unix and Linux
#Grome Programming Assignment 2
#E-M
use strict;
use warnings;
# usage: script.pl {motif_width} {dnafile}
#USER SPECIFICATIONS
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1;
}
else
{
$id2seq{$id} .= $_;
}
}
close FILE;
foreach $id (keys %id2seq)
{
print "$id2seq{$id}\n\n";
}
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Default width is 3 (arbitrary number chosen)
if ($width eq '')
{
$width = 3;
}
elsif ($width <=0)
{
print "Please enter a number greater than zero:\n";
$width = <STDIN>;
chomp $width;
}
#User specifies number of initial random
#starting alignments
print "Please enter the number of initial random
starting alignments:\n";
my $start = <STDIN>;
#Remove newline from file
chomp $start;
#Default start is 50
if ($start eq '')
{
$start = 50;
}
elsif ($start <=0)
{
print "Please enter a number greater than zero:\n";
$start = <STDIN>;
chomp $start;
}
#User specifies number of iterations to
#perform expectation-maximization
print "Please enter the number of iterations for
expectation-maximization:\n";
my $iteration = <STDIN>;
#Remove newline from file
chomp $iteration;
#Default iteration = 500
if($iteration eq '')
{
$iteration = 500;
}
elsif ($iteration <=0)
{
print "Please enter a number greater than zero:\n";
$iteration = <STDIN>;
chomp $iteration;
}
#EXPECTATION
#Initialize counts for motif positions
#Incorporate pseudocounts initially
my %mot = map { $_ => [ (1) x $width ] } qw( A C G T );
# Initialize background counts
my %bg = map { $_ => 0 } qw( A C G T );
#Fill background and motif counts
foreach $id (keys %id2seq)
{
#Generate random start site in the sequence
#for motif to start from
my $ms = int(rand(length($id2seq{$id})-$width));
# Within a motif, count the bases at the positions
for my $pos (0..length($id2seq{$id})-1)
{
my $base = substr($id2seq{$id}, $pos, 1);
if ($pos >= $ms && $pos < $ms + $width)
{
++$mot{$base}[$pos-$ms]
if exists($mot{$base});
}
else
{
++$bg{$base}
if exists($bg{$base});
}
}
}
#Print the background and motif counts
for my $base (qw( A C G T ))
{
print "$base #{$mot{$base}}\n";
}
print "\n";
for my $base (qw( A C G T ))
{
print "bg$base = $bg{$base}\n";
}
#Create frequency table of the motifs
#Get sum of the background
my $bgsum = 0;
for my $base (qw( A C G T))
{
$bgsum = $bgsum + $bg{$base};
}
print "\n$bgsum\n\n";
#Create background frequency table
my %bgfreq = map { $_ => 0 } qw( A C G T );
for my $base (qw( A C G T))
{
$bgfreq{$base} = $bg{$base} / $bgsum;
print "bgfreq$base = $bgfreq{$base}\n";
}
#Get sum of each motif position
my #motsum = ( (0) x $width );
for my $base (qw( A C G T))
{
for my $arrpos (0.. ($width-1))
{
$motsum[$arrpos] = $motsum[$arrpos] + #{$mot{$base}}[$arrpos];
}
}
#Create motif frequency table
my %motfreq = map { $_ => [ (0) x $width ]} qw( A C G T );
for my $base (qw( A C G T))
{
for my $arrpos (0.. ($width-1))
{
$motfreq{$base}[$arrpos] = $mot{$base}[$arrpos] / $motsum[$arrpos];
}
print "motfreq$base #{$motfreq{$base}}\n";
}
#Create odds table of motifs
my %odds = map { $_ => [ (0) x ($width) ]} qw( A C G T );
for my $base (qw( A C G T))
{
for my $arrpos (0.. ($width-1))
{
$odds{$base}[$arrpos] = $motfreq{$base}[$arrpos] / $bgfreq{$base};
}
print "odds$base #{$odds{$base}}\n";
}
#Create log-odds table of motifs
my %logodds = map { $_ => [ (0) x ($width) ]} qw( A C G T );
for my $base (qw( A C G T))
{
for my $arrpos (0.. ($width-1))
{
$logodds{$base}[$arrpos] = log2($odds{$base}[$arrpos]);
}
print "logodds$base #{$logodds{$base}}\n";
}
#####################################################
sub log2
{
my $n = shift;
return log($n)/log(2);
}
Now, I need to calculate the log-odds score of a motif within each DNA sequence. Then, I will iterate through all possible positions in the sequence and find the maximum score for each sequence. Future work requires me to recall the motif with the max score, but I have not made an attempt at that yet (just wanted to give scope to these max scores).
Strategy: I am going to create a hash of log-odds scores and maximum scores to hold the max scores of each sequence as iterated. To calculate the log-odds score, I will see which element of the log-odds scoring matrix matches the elements in the motif.
#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
#Calculate logodds for the motif
#Create hash of logodds scores and hash of maxscores
#so each id has a logodds score and max score
my %loscore = map { $_ => [ (0) x (length($id2seq{$id})-$width) ]} qw( $id ); #Not sure if $id is correct, but I want a loscore for each $id
my %maxscore = map { $_ => [ (0) x (length($id2seq{$id})-$width) ]} qw( $id ); #Not sure if $id is correct, but I want a maxscore for each $id
foreach $id (keys %loscore, %maxscore, %id2seq)
{
my $len = length($id2seq{$id});
for my $base (qw( A C G T ))
{
for my $pos (0..$len-1)
{
if ($id2seq{$id}[$pos] = $mot{$base})
{
for my $motpos (0..$width-1)
{
$loscore{$id} = $loscore{$id} + $logodds{$base}[$motpos];
if ($loscore{$id} > $maxscore{$id})
{
$maxscore{$id} = $loscore{$id};
}
}
}
}
}
print "$id2seq{$id} logodds score: $maxscore{$id}\n";
}
#####################################################
sub log2
{
my $n = shift;
return log($n)/log(2);
}
When I uncomment the maximization step from the code an run it, nothing prints from the maximization section. I do not get any errors, but nothing new prints. I understand that my expectation step can be simplified (I will clean everything up after it works), but I am focused on this maximization step first. I know that there are plenty of flaws in the maximization code, especially when trying to create the log-odds score and max score hashes. Any and all input helps! Thank you in advance for your patience and advice. Let me know if you need any clarification.
foreach $id (keys %loscore, %maxscore, %id2seq) doesn't iterate over all three hashes. You probably meant foreach $id (keys %loscore, keys %maxscore, keys %id2seq)
if ($id2seq{$id}[$pos] = $mot{$base}) assigns the value to $id2seq{$id}[$pos]. To check for eqaulity you need if ($id2seq{$id}[$pos] eq $mot{$base}). But this is probably wrong too since $id2seq{$id} should be a string and $mot{$base} is an array.
It is not clear how your code should work and it is hard to find errors in such long code samples.

Specifically Iterating through a file in perl

So, I have this script :
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Math::Vector::Real;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ OUT4 OUTABA12 /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
open my $out_fh, '>', $out or die qq{Unable to open "$out" for output: $!\n};
my #data;
push #data, V(split) while <$in_fh>;
my #aoa;
for my $i ( 0 .. $#data ) {
for my $j ( 0 .. $#data ) {
my $val1 = $data[$i];
my $val2 = $data[$j];
if ($val1 != $val2) {
my $math = sqrt(($val1->[0] - $val2->[0])**2 +
($val1->[1] - $val2->[1])**2 +
($val1->[2] - $val2->[2])**2);
if ($math < 2.2) {
print "#$val1 #$val2 $math\n";
push #aoa, [#$val1, #$val2, $math];
}
}
}
}
for my $k ( 0 .. $#aoa-1 ) {
my $aoadata1 = $aoa[$k];
my $aoadata2 = $aoa[$k+1];
my $vect1 = [ #{ $aoa[$k] }[0..2] ];
my $vect2 = [ #{ $aoa[$k+1] }[0..2] ];
my $vect3 = [ #{ $aoa[$k] }[3..5] ];
my $vect4 = [ #{ $aoa[$k+1] }[3..5] ];
my $math1 = [ #{ $aoa[$k] }[6] ];
my $math2 = [ #{ $aoa[$k+1] }[6] ];
my #matha = #$math1;
my #mathb = #$math2;
my #vecta = #$vect1;
my #vectb = #$vect2;
my #vectc = #$vect3;
my #vectd = #$vect4;
if ( #vecta != #vectb ) {
print "1\n";
}
}
Which runs on a test file like so:
18.474525 20.161419 20.33903
21.999333 20.220667 19.786734
18.333228 21.649157 21.125111
20.371077 19.675844 19.77649
17.04323 19.3106 20.148842
22.941106 19.105412 19.069893
and it calculates the distance between each point and every other point, and if it's below a threshold push it to an array for later. (For testing purposes, I also have it printing it. )
What I've been stuck on is the bottom half - I'm trying to eventually get to the point where the bottom half of the script script will iterate between rows like so:
If the first triple set of values on row 1 is not identical to the first set of triple values on row two, print 180, but only if this is the only instance of this line. If there is ever a point where row one's values were equal to row two's, do not print 180 whatsoever.
I cannot, for the life of me, get it to work. Any help is appreciated.
I think this is correct
I've used the facilities of the Math::Vector::Real module as I described in a comment
I leave the vectors as objects and avoid accessing their their contents directly
I calculate the distance between $vec1 and $vec2 as abs($vec1 - $vec2
I use the class's stringify ability to display it instead of extracting the individual values in my code
I have also changed the intermediate data format. The distance is no longer kept because it's not necessary, and the array #groups now contains an array for each group of vector pairs that have a common first vector. Each group is of the form
[ $vec1, $vec2, $vec2, $vec2, ... ]
and I use the first function from List::Util to find the group that each new vector pair belongs in. If an existing group is found with a matching first value then the second vector is just pushed onto the end of the group; otherwise a new group is created that looks like [ $vec1, $vec2 ]
Once the #groups array is built, it is processed again to generate the output
If there are only two values in the group then they are $vec1 and $vec2 of a unique point. $vec1 is printed with 180
If there are more than two elements then a line of output is generated for every pair of $vec2 values, each containing the value $vec1 together with the angle between the two $vec2 vectors in the pair
use strict;
use warnings;
use Math::Vector::Real qw/ V /;
use List::Util qw / first /;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ OUT4 OUTABA12 /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
my #data = map V(split), <$in_fh>;
my #groups;
for my $vec1 ( #data ) {
for my $vec2 ( #data ) {
next if abs($vec1 - $vec2) > 2.2 or $vec1 == $vec2;
my $group = first { $_->[0] == $vec1 } #groups;
if ( $group ) {
push #$group, $vec2;
}
else {
push #groups, [ $vec1, $vec2 ];
}
}
}
open my $out_fh, '>', $out or die qq{Unable to open "$out" for output: $!\n};
select $out_fh;
for my $group ( #groups ) {
my ($vec1, #vec2) = #$group;
if ( #vec2 == 1 ) {
print "$vec1 180\n";
next;
}
for my $i ( 0 .. $#vec2-1 ) {
for my $j ( $i+1 .. $#vec2 ) {
my ($vec2a, $vec2b) = #vec2[$i, $j];
my $angle = atan2( $vec2a, $vec2b ) * DEG_PER_RAD;
print "$vec1 $angle\n";
}
}
}
output
{18.474525, 20.161419, 20.33903} 4.96831567625921
{18.474525, 20.161419, 20.33903} 1.65052300046634
{18.474525, 20.161419, 20.33903} 4.80888617553369
{21.999333, 20.220667, 19.786734} 4.22499387836286
{18.333228, 21.649157, 21.125111} 180
{20.371077, 19.675844, 19.77649} 5.09280172684743
{17.04323, 19.3106, 20.148842} 180
{22.941106, 19.105412, 19.069893} 180

Looping variables

I'm working with perl to make a script that will work with Dot products/assorted vector math. I've got a working script ( Still very much in progress/needs refinement ) that will do what I ask.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Math::Vector::Real;
use 5.010;
use Math::Trig;
my $source = "./IN";
my $out = "./OUT";
open(IN, '<', $source) or die "Couldn't open $source: $!\n";
open(OUT, '>', $out) or die "Couldn't open $out: $!\n";
my(#data);
while (<IN>) {
push #data, [ split ];
}
my $v;
my $u;
$v = V(0, 0, 0);
$u = V(0, 0, 0);
my $i = 0;
sub maths {
my $dot = $v * $u;
my $mag1 = (sqrt ((#$v[0])**2 + (#$v[1])**2 + (#$v[2])**2 ));
my $mag2 = (sqrt ((#$u[0])**2 + (#$u[1])**2 + (#$u[2])**2 ));
my $prefinal = acos( ( $dot ) / ( $mag1 * $mag2 ) );
my $degrees = ( ( 180 / 3.14159 ) * ( $prefinal ) );
return ($degrees);
}
my $ref1 = $data[$i][0];
my $ref2 = $data[$i][1];
my $ref3 = $data[$i][2];
my $ref4 = $data[$i+1][0];
my $ref5 = $data[$i+1][1];
my $ref6 = $data[$i+1][2];
$v->[0] = $ref1;
$v->[1] = $ref2;
$v->[2] = $ref3;
$u->[0] = $ref4;
$u->[1] = $ref5;
$u->[2] = $ref6;
my $result = maths;
print "$result\n";
A lot of stuff in the script is vestigial and for ease to follow (For me).
What I desire it to do, is to have the script rotate through each line of the input file and perform the calculations on it.
Something akin to having :
foreach $i (#data) {
my $ref1 = $data[$i][0];
my $ref2 = $data[$i][1];
my $ref3 = $data[$i][2];
my $ref4 = $data[$i+1][0];
my $ref5 = $data[$i+1][1];
my $ref6 = $data[$i+1][2];
$v->[0] = $ref1;
$v->[1] = $ref2;
$v->[2] = $ref3;
$u->[0] = $ref4;
$u->[1] = $ref5;
$u->[2] = $ref6;
my $result = maths;
print "$result\n";
}
Anything is appreciated.
As per my comment, most of the functionality of your program is provided by the Math::Vector::Real module that you're already using
It looks like you want the angle in degrees between successive pairs of 3D vectors in your file. This code creates vectors from each line in the file until there are two vectors in #pair, and then uses atan2 to calculate the angle between them. DEG_PER_RAD is precalculated as a constant from the value of atan2(1,1) which is π ÷ 4 by definition
#!/usr/bin/perl
use strict;
use warnings;
use Math::Vector::Real;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ IN OUT /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
open my $out_fh, '<', $out or die qq{Unable to open "$out" for output: $!\n};
select $out_fh;
my #pair;
while ( <$in_fh> ) {
push #pair, V(split);
if ( #pair == 2 ) {
my $degrees = atan2(#pair) * DEG_PER_RAD;
print "$degrees\n";
shift #pair;
}
}
Update
To do the calculation for every possible pair of vectors, it is simplest to read the whole of the file into an array of vectors and process them from there with two nested loops
This code demonstrates. Insert it after the select call in the original program. It labels each line of output with the two array indices that gave rise to the angle
my #data;
push #data, V(split) while <$in_fh>;
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i+1 .. $#data ) {
my $degrees = atan2(#data[$i,$j]) * DEG_PER_RAD;
print "[$i,$j] $degrees\n";
}
}
You could use the following:
my $v = V( #{ $data[$i+0] } );
my $u = V( #{ $data[$i+1] } );
Cleaned up:
#!/usr/bin/perl
use strict;
use warnings;
use Math::Trig qw( acos );
use Math::Vector::Real qw( V );
sub maths {
my ($v, $u) = #_;
my $dot = $v * $u;
my $mag1 = sqrt( $v->[0]**2 + $v->[1]**2 + $v->[2]**2 );
my $mag2 = sqrt( $u->[0]**2 + $u->[1]**2 + $u->[2]**2 );
my $prefinal = acos( $dot / ( $mag1 * $mag2 ) );
my $degrees = ( 180 / 3.14159 ) * $prefinal;
return $degrees;
}
{
my #data = map { V( split ) } <>;
for my $i (0..$#data-1) {
for my $j ($i+1..$#data) {
my $maths = maths(#data[$i, $j]);
print("$i,$j: $maths\n");
}
}
}
Or with Borodin's alternate method:
#!/usr/bin/perl
use strict;
use warnings;
use Math::Trig qw( acos atan2 );
use Math::Vector::Real qw( V );
use constant DEG_PER_RAD => 45 / atan2(1, 1);
{
my #data = map { V( split ) } <>;
for my $i (0..$#data-1) {
for my $j ($i+1..$#data) {
my $maths = atan2(#data[$i, $j]) * DEG_PER_RAD;
print("$i,$j: $maths\n");
}
}
}
Usage:
perl script.pl ./IN >./OUT

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and #str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.
Perl can be fast:
use strict;
use warnings;
package LCP;
sub LCP {
return '' unless #_;
return $_[0] if #_ == 1;
my $i = 0;
my $first = shift;
my $min_length = length($first);
foreach (#_) {
$min_length = length($_) if length($_) < $min_length;
}
INDEX: foreach my $ch ( split //, $first ) {
last INDEX unless $i < $min_length;
foreach my $string (#_) {
last INDEX if substr($string, $i, 1) ne $ch;
}
}
continue { $i++ }
return substr $first, 0, $i;
}
# Roy's implementation
sub LCP2 {
return '' unless #_;
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
1;
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;
Test::LCP->runtests;
package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);
sub test_use : Test(startup => 1) {
use_ok('LCP');
}
sub test_lcp : Test(6) {
is( LCP::LCP(), '', 'Without parameters' );
is( LCP::LCP('abc'), 'abc', 'One parameter' );
is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
'abcd', 'Some common prefix' );
my #str = map { chomp; $_ } <DATA>;
is( LCP::LCP(#str),
'file:///home/gms8994/Music/', 'Test data prefix' );
is( LCP::LCP2(#str),
'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
my $t = countit( 1, sub{LCP::LCP(#str)} );
diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
$t = countit( 1, sub{LCP::LCP2(#str)} );
diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) # 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) # 16746.73/s (n=17919)
That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.
The reference given already by Brett Daniel for the Wikipedia entry on "Longest common substring problem" is very good general reference (with pseudocode) for your question as stated. However, the algorithm can be exponential. And it looks like you might actually want an algorithm for longest common prefix which is a much simpler algorithm.
Here's the one I use for longest common prefix (and a ref to original URL):
use strict; use warnings;
sub longest_common_prefix {
# longest_common_prefix( $|# ): returns $
# URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl
# find longest common prefix of scalar list
my $prefix = shift;
for (#_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
my #str = map {chomp; $_} <DATA>;
print longest_common_prefix(#ARGV), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
If you truly want a LCSS implementation, refer to these discussions (Longest Common Substring and Longest Common Subsequence) at PerlMonks.org. Tree::Suffix would probably be the best general solution for you and implements, to my knowledge, the best algorithm. Unfortunately recent builds are broken. But, a working subroutine does exist within the discussions referenced on PerlMonks in this post by Limbic~Region (reproduced here with your data).
#URLref: http://www.perlmonks.org/?node_id=549876
#by Limbic~Region
use Algorithm::Loops 'NestedLoops';
use List::Util 'reduce';
use strict; use warnings;
sub LCS{
my #str = #_;
my #pos;
for my $i (0 .. $#str) {
my $line = $str[$i];
for (0 .. length($line) - 1) {
my $char= substr($line, $_, 1);
push #{$pos[$i]{$char}}, $_;
}
}
my $sh_str = reduce {length($a) < length($b) ? $a : $b} #str;
my %map;
CHAR:
for my $char (split //, $sh_str) {
my #loop;
for (0 .. $#pos) {
next CHAR if ! $pos[$_]{$char};
push #loop, $pos[$_]{$char};
}
my $next = NestedLoops([#loop]);
while (my #char_map = $next->()) {
my $key = join '-', #char_map;
$map{$key} = $char;
}
}
my #pile;
for my $seq (keys %map) {
push #pile, $map{$seq};
for (1 .. 2) {
my $dir = $_ % 2 ? 1 : -1;
my #offset = split /-/, $seq;
$_ += $dir for #offset;
my $next = join '-', #offset;
while (exists $map{$next}) {
$pile[-1] = $dir > 0 ?
$pile[-1] . $map{$next} : $map{$next} . $pile[-1];
$_ += $dir for #offset;
$next = join '-', #offset;
}
}
}
return reduce {length($a) > length($b) ? $a : $b} #pile;
}
my #str = map {chomp; $_} <DATA>;
print LCS(#str), "\n";
__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
It sounds like you want the k-common substring algorithm. It is exceptionally simple to program, and a good example of dynamic programming.
My first instinct is to run a loop, taking the next character from each string, until the characters are not equal. Keep a count of what position in the string you're at and then take a substring (from any of the three strings) from 0 to the position before the characters aren't equal.
In Perl, you'll have to split up the string first into characters using something like
#array = split(//, $string);
(splitting on an empty character sets each character into its own element of the array)
Then do a loop, perhaps overall:
$n =0;
#array1 = split(//, $string1);
#array2 = split(//, $string2);
#array3 = split(//, $string3);
while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){
$n++;
}
$sameString = substr($string1, 0, $n); #n might have to be n-1
Or at least something along those lines. Forgive me if this doesn't work, my Perl is a little rusty.
If you google for "longest common substring" you'll get some good pointers for the general case where the sequences don't have to start at the beginning of the strings.
Eg, http://en.wikipedia.org/wiki/Longest_common_substring_problem.
Mathematica happens to have a function for this built in:
http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean contiguous subsequence, ie, substring, which is what you want.)
If you only care about the longest common prefix then it should be much faster to just loop for i from 0 till the ith characters don't all match and return substr(s, 0, i-1).
From http://forums.macosxhints.com/showthread.php?t=33780
my #strings =
(
'file:///home/gms8994/Music/t.A.T.u./',
'file:///home/gms8994/Music/nina%20sky/',
'file:///home/gms8994/Music/A%20Perfect%20Circle/',
);
my $common_part = undef;
my $sep = chr(0); # assuming it's not used legitimately
foreach my $str ( #strings ) {
# First time through loop -- set common
# to whole
if ( !defined $common_part ) {
$common_part = $str;
next;
}
if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/)
{
$common_part = $1;
}
}
print "Common part = $common_part\n";
Faster than above, uses perl's native binary xor function, adapted from perlmongers solution (the $+[0] didn't work for me):
sub common_suffix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,-length($comm)) if (length($_) > length($comm));
$comm = substr($comm,-length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /(\0*)$/) {
$comm = substr($comm, -length($1));
} else {
return undef;
}
}
return $comm;
}
sub common_prefix {
my $comm = shift #_;
while ($_ = shift #_) {
$_ = substr($_,0,length($comm)) if (length($_) > length($comm));
$comm = substr($comm,0,length($_)) if (length($_) < length($comm));
if (( $_ ^ $comm ) =~ /^(\0*)/) {
$comm = substr($comm,0,length($1));
} else {
return undef;
}
}
return $comm;
}