cosine similarity between strings perl


I have a file that contains, for example, this text:
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
I found a module that calculates cosine similarity: http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm
I ran a simple test first:
my $cosine = Bag::Similarity::Cosine->new;
my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;
The result was 0.4.
The problem is that when I read from the file and calculate the cosine between each pair of lines, the results are different. This is the code:
open(F,"/home/ahmed/FILE.txt") or die "Cannot open file";
my @data; # holds one line of the file per element
while(<F>) {
chomp;
push @data, $_;
}
#print join " ", @data;
my $cosine = Bag::Similarity::Cosine->new;
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i + 1 .. $#data ) {
my $similarity = $cosine->similarity($data[$i],$data[$j]);
print "line $i has a similarity of $similarity with line $j\n";
$i + 1,
$j + 1;
}
}
The results:
line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of 0.917585834612093 with line 2
line 1 has a similarity of 0.945092544842746 with line 3
line 2 has a similarity of 0.908826679128811 with line 3
The similarity between the first and second lines of the file should be 0.4.
I changed the file like this:
['perl','java','python','php','scala']
['java','pascal','perl','ruby','ada']
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']
but I got the same result.
Thank you.

There is a syntax error in your program. Were you trying to use printf and wrote print by mistake? Either way, the code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;
my $cosine = Bag::Similarity::Cosine->new;
my @data;
while ( <DATA> ) {
push @data, { map { $_ => 1 } split };
}
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i + 1 .. $#data ) {
my $similarity = $cosine->similarity($data[$i],$data[$j]);
print "line $i has a similarity of $similarity with line $j\n";
}
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
Output:
line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3

I know nothing at all about this module. But I can read the documentation.
It looks to me like the module has two methods. similarity() is used for comparing two strings and from_bags() is used to compare two references to arrays containing strings. I expect that when you call similarity passing it two array references, then what gets compared is actually the stringification of the two references.
Try switching to from_bags() and see if that's any better.
Update: On investigating further, I see that similarity() will compare any kind of input (strings, array refs or hash refs).
This demonstrates using similarity() to compare the lines as text and as arrays of words.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Bag::Similarity::Cosine;
chomp(my @data = <DATA>);
my $cos = Bag::Similarity::Cosine->new;
for my $i (0 .. $#data - 1) {
for my $j (1 .. $#data) {
next if $i == $j;
say "$i -> $j: strings ", $cos->similarity($data[$i], $data[$j]);
say "$i -> $j: array refs ", $cos->similarity([split /\s+/, $data[$i]], [split /\s+/, $data[$j]]);
}
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
And it gives this output:
$ perl similar
0 -> 1: strings 0.88602000346543
0 -> 1: array refs 0.4
0 -> 2: strings 0.89566858950296
0 -> 2: array refs 0.6
0 -> 3: strings 0.852802865422442
0 -> 3: array refs 0.6
1 -> 2: strings 0.872356744289958
1 -> 2: array refs 0.4
1 -> 3: strings 0.884721984738799
1 -> 3: array refs 0.4
2 -> 1: strings 0.872356744289958
2 -> 1: array refs 0.4
2 -> 3: strings 0.753778361444409
2 -> 3: array refs 0.2
I don't know which version gives you the information you want. I suspect it might be the array reference version.
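To see where the 0.4 comes from, here is a minimal sketch that computes the cosine by hand on bags of words, without the module. The helper name cosine_words is my own and is not part of Bag::Similarity. The first two lines share 2 words out of 5 each, so the cosine is 2 / sqrt(5 * 5) = 0.4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cosine similarity of two bags of words, computed by hand.
# cosine_words() is a hypothetical helper, not part of Bag::Similarity.
sub cosine_words {
    my ($words_a, $words_b) = @_;
    my (%count_a, %count_b);
    $count_a{$_}++ for @$words_a;
    $count_b{$_}++ for @$words_b;

    # Dot product over the words the two bags share.
    my $dot = 0;
    for my $word (keys %count_a) {
        $dot += $count_a{$word} * $count_b{$word} if exists $count_b{$word};
    }

    # Squared Euclidean norms of the two count vectors.
    my ($norm_a, $norm_b) = (0, 0);
    $norm_a += $_ ** 2 for values %count_a;
    $norm_b += $_ ** 2 for values %count_b;

    return 0 if $norm_a == 0 || $norm_b == 0;
    return $dot / sqrt($norm_a * $norm_b);
}

# Split each line into words before comparing, instead of passing the raw string.
my @first  = split ' ', 'perl java python php scala';
my @second = split ' ', 'java pascal perl ruby ada';
printf "%.1f\n", cosine_words(\@first, \@second);   # prints 0.4
```

The point for the original code is the split: each line read from the file has to be turned into a list of words before it is compared.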

Related

Perl: calculate jackknife error of each column of a multi-column file

I am trying to calculate the jackknife average and error of each column in a multi-column file.
My example data file looks like this:
$ cat data.HW2
1.1 2.1 3.1 4.1
1.2 2.2 3.2 4.2
1.3 2.3 3.3 4.3
1.4 2.4 3.4 4.4
My attempted solution is to define arrays that will eventually be the same size as the number of columns (in this case 4) and iterate over them line by line:
$ cat jackknife.pl
#! /usr/bin/perl
use warnings; use strict;
my @n=0;
my @x;
my $j;
my $i;
my $dg;
my @x_jack;
my @x_tot=0;
my $cols;
my $col_start=0;
# read in the data
while(<>)
{
my @column = split();
$cols=@column;
foreach my $j ($col_start .. $#column) {
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
$n[$j]++;
}
}
# Do the jackknife estimates
for ($j=$col_start; $j<$cols; $j++)
{
for ($i = 0; $i < $n[$j]; $i++)
{
$x_jack[$i][$j] = ($x_tot[$j] - $x[$i][$j]) / ($n[$j] - 1);
}
# Do the final jackknife estimate
my @g_jack_av=0;
my @g_jack_err=0;
for ($i = 0; $i < $n[$j]; $i++)
{
$dg = $x_jack[$i][$j];
$g_jack_av[$j] += $dg;
$g_jack_err[$j] += $dg**2;
}
$g_jack_av[$j] /= $n[$j];
$g_jack_err[$j] /= $n[$j];
$g_jack_err[$j] = sqrt(($n[$j] - 1) * abs($g_jack_err[$j] - $g_jack_av[$j]**2));
printf "%e %e ", $g_jack_av[$j], $g_jack_err[$j];
}
printf "\n";
It gives me the following two warnings:
$ cat data.HW2 | perl jackknife.pl
Use of uninitialized value within @n in array element at cols_jacknife.pl line 19, <> line 1.
Use of uninitialized value within @n in array element at cols_jacknife.pl line 20, <> line 1.
It is complaining at the following two lines:
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
But I want to set the size of @n dynamically depending on the size of the data file.
How do I remove this warning?
Any other suggestions on my Perl usage are also welcome and much appreciated since I am trying to learn best practices.

This part of your code
my @n=0;
....
foreach my $j ($col_start .. $#column) {
$x[$n[$j]][$j] = $column[$j];
$x_tot[$j] += $x[$n[$j]][$j];
$n[$j]++;
}
will trigger the warning once for every value of $j larger than 0, because only the first element of @n is defined: $n[0] = 0. Each remaining element only becomes defined at the end of its first loop iteration, when the increment operator $n[$j]++ sets it to 1.
Technically, the code will still work as you expect, because undef is treated as 0 in numeric context, so it should be safe to ignore the warning. To avoid it, you can do something like this inside your loop:
$n[$j] //= 0; # keep $n[$j] if it is defined, otherwise set it to 0
This is equivalent to
if (not defined($n[$j])) {
$n[$j] = 0;
}
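Put together, a minimal sketch of the reading loop with the counters initialised on first use could look like this. The helper name read_columns and the integer sample rows are my own, chosen only for illustration.

```perl
use strict;
use warnings;

# read_columns() is a hypothetical helper: it stores each value of each
# whitespace-separated row and keeps a per-column count and running total.
sub read_columns {
    my @rows = @_;
    my (@n, @x, @x_tot);
    for my $row (@rows) {
        my @column = split ' ', $row;
        for my $j (0 .. $#column) {
            $n[$j] //= 0;                    # define the counter before it is used as an index
            $x[ $n[$j] ][$j] = $column[$j];  # no "uninitialized value" warning here now
            $x_tot[$j] += $column[$j];
            $n[$j]++;
        }
    }
    return (\@n, \@x_tot, \@x);
}

my ($n, $x_tot) = read_columns('1 2 3 4', '5 6 7 8');
print "counts: @$n\n";       # counts: 2 2 2 2
print "totals: @$x_tot\n";   # totals: 6 8 10 12
```

With this, @n grows to whatever number of columns the data has, so there is no need to know the size up front.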

How to print a string element when adding array elements in a loop in Perl?

use warnings;
use strict;
my @a = qw(1 2 3 4 'c');
my @b = qw(5 6 7 8);
my $i;
for ($i=0; $i < scalar @a; $i++)
{
my $ax = $a[$i] + $b[$i];
print "$ax\n";
}
How can I print the string element c when using addition?
Expected output:
6
8
10
12
c

In other languages, e.g. JavaScript and, to some extent, Python, + does both numeric addition and string concatenation.
But not in Perl.
In Perl, string concatenation is done with . and numeric addition with +.
That is why this script produces the "isn't numeric in addition" warning.
The "use of uninitialized value" warning comes from going beyond the end of the @b array.
These are warnings. Execution did not stop. Note that the script decided that "c" was equivalent to 0 and undefined was equivalent to 0, and produced 0 as the addition on the last line.
6
8
10
12
Use of uninitialized value in addition (+) at test.pl line 8.
Argument "'c'" isn't numeric in addition (+) at test.pl line 8.
0
It is possible to overload '+' within a custom class, and it would be possible to write a custom class that adds scalars that match a numeric regex with + and concats others. But modifying core functionality would be unusual. In any event, these techniques are beyond the scope of this answer.
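A simpler route than overloading is to check each pair before adding. The sketch below assumes that is what the question wants: add when both values look numeric, otherwise print the value as it is. The helper name add_or_keep is my own. looks_like_number comes from the core module Scalar::Util. Note that qw() does not need quotes around c.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# add_or_keep() is a hypothetical helper: it adds the two values when both
# look numeric, and otherwise returns whichever value is defined, unchanged.
sub add_or_keep {
    my ($left, $right) = @_;
    return $left + $right
        if looks_like_number($left) && looks_like_number($right);
    return $left // $right;
}

my @a = qw(1 2 3 4 c);
my @b = qw(5 6 7 8);
print add_or_keep($a[$_], $b[$_]), "\n" for 0 .. $#a;
```

This should print 6, 8, 10, 12 and then c, without the two warnings, because the non-numeric and the missing element never reach the + operator.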

fast way to compare rows in a dataset

I asked this question for R and got a lot of answers, but all of them either crash my computer (4 GB of RAM) after running for a few hours or take a very long time to finish.
The R question: faster way to compare rows in a data frame
Some people said that this is not a job for R. As I don't know C and I'm somewhat fluent in Perl, I'll ask here.
I'd like to know if there is a fast way to compare each row of a large dataset with the other rows, identifying the rows with a specific degree of homology. Let's say for the simple example below that I want homology >= 3.
data:
sample_1,10,11,10,13
sample_2,10,11,10,14
sample_3,10,10,8,12
sample_4,10,11,10,13
sample_5,13,13,10,13
The output should be something like:
output
sample duplicate matches
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3

A match is counted when both lines have the same number in the same position:
perl -F',' -lane'
$k = shift @F;
for my $kk (@o) {
$m = grep { $h{$kk}[$_] == $F[$_] } 0 .. $#F;
$m >=3 or next;
print ++$i, " $kk $k $m";
}
push @o, $k;
$h{$k} = [ @F ];
' file
Output:
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
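For readers who find the one-liner dense, here is the same pairwise comparison spelled out as a script. This is a sketch under my own naming (count_matches, @names, %values, $threshold are not from the answer), with the sample data inlined instead of read from a file.

```perl
use strict;
use warnings;

# count_matches() is a hypothetical helper: it counts the positions at
# which two equally long rows hold the same number.
sub count_matches {
    my ($row_a, $row_b) = @_;
    return scalar grep { $row_a->[$_] == $row_b->[$_] } 0 .. $#$row_a;
}

my @lines = (
    'sample_1,10,11,10,13',
    'sample_2,10,11,10,14',
    'sample_3,10,10,8,12',
    'sample_4,10,11,10,13',
    'sample_5,13,13,10,13',
);
my $threshold = 3;      # minimum number of matching positions to report
my (@names, %values);   # samples seen so far, and their fields

for my $line (@lines) {
    my ($name, @fields) = split /,/, $line;
    # Compare the new row against every row read before it.
    for my $seen (@names) {
        my $matches = count_matches($values{$seen}, \@fields);
        print "$seen $name $matches\n" if $matches >= $threshold;
    }
    push @names, $name;
    $values{$name} = \@fields;
}
```

Like the one-liner, this compares every row with every earlier row, so the running time grows quadratically with the number of rows.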

This solution provides an alternative to direct comparison, which will be slow for large amounts of data.
The basic idea is to build an inverted index while reading the data.
This makes comparison faster if there are a lot of different values per column.
For each row, you look up the index and count the matches - this way you only consider the samples where this value actually occurs.
You might still have a memory problem because the index gets as large as your data.
To overcome that, you can shorten the sample name and use a persistent index (using DB_File, for example).
use strict;
use warnings;
use 5.010;
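my @h; # $h[$i]{$value} lists the samples that have $value in column $i
my $LIMIT_HOMOLOGY = 3;
while(my $line = <>) {
chomp $line; # keep the newline out of the last column's value
my @arr = split /,/, $line;
my $sample_no = shift @arr;
my %sim; # number of matching columns per earlier sample
foreach my $i (0..$#arr) {
my $value = $arr[$i];
our $l;
*l = \$h[$i]->{$value}; # alias $l to the index slot for this column and value
foreach my $s (@$l) {
$sim{$s}++;
}
push @$l, $sample_no;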
}
foreach my $s (keys %sim) {
if ($sim{$s}>=$LIMIT_HOMOLOGY) {
say "$sample_no: $s. Matches: $sim{$s}";
}
}
}
For 25000 rows with 26 columns of random integer values between 1 and 100, the program took 69 seconds to finish on my MacBook Air.

how do you select column from a text file using perl

I want to subtract the values in one column from those in another column and add up the differences. How do I do this in Perl? I am new to Perl, so I am unable to figure out how to go about it. Kindly help me.

The first thing is to separate the data into columns. In this case, the columns are separated by a space. split(/ /) will return a list of the columns.
To subtract one from the other, pull the values out of the list and subtract them.
Then add the difference to the running sum, and repeat for every line of the data.
#!/usr/bin/perl
use strict;
my $sum = 0;
while(<DATA>) {
my @vals = split(/ /);
my $diff = $vals[1] - $vals[0];
$sum += $diff;
}
print $sum,"\n";
__DATA__
1 3
3 5
5 7
This will print out 6 --- (3 - 1) + (5 - 3) + (7 - 5)

FYI, if you combine the autosplit (-a), loop (-n) and command-line program (-e) options (see perlrun), you can shorten this to a one-liner, much like awk:
perl -ane '$sum += $F[1] - $F[0]; END { print $sum }' filename

How to subtract values in 2 different arrays in perl?

Hi, I have two arrays containing 4 columns, and I want to subtract the value in column 1 of array2 from the value in column 1 of array1, the value in column 2 of array2 from the value in column 2 of array1, and so on. Example:
my @array1=(4.3,0.2,7,2.2,0.2,2.4)
my @array2=(2.2,0.6,5,2.1,1.3,3.2)
so the required output is
2.1 -0.4 2 # [4.3-2.2] [0.2-0.6] [7-5]
0.1 -1.1 -0.8 # [2.2-2.1] [0.2-1.3] [2.4-3.2]
For this the code I used is
my @diff = map {$array1[$_] - $array2[$_]} (0..2);
print OUT join(' ', @diff), "\n";
and the output I am getting now is
2.1 -0.4 2
2.2 -1.1 3.8
In the second row, the values from the first row of array1 are used again instead of those from its second row. How can I overcome this problem?
I need the output in rows of 3 columns, as shown above, which is why I filled my arrays in rows of 3 values.

This will produce the requested output. However, I suspect (based on your comments), that we could produce a better solution if you simply showed your raw input.
use strict;
use warnings;
my @a1 = (4.3,0.2,7,2.2,0.2,2.4);
my @a2 = (2.2,0.6,5,2.1,1.3,3.2);
my @out = map { $a1[$_] - $a2[$_] } 0 .. $#a1;
print "@out[0..2]\n";
print "@out[3..$#a1]\n";

First of all, your code doesn't even compile. Perl arrays aren't space separated - you need a qw() to turn those into arrays. Not sure how you got your results.
Perl doesn't have 2D arrays. 2.2 is NOT column 1 of row 1 of @array1 - it's the element with index 3 of @array1. As far as Perl is concerned, your newline is just another whitespace separator, NOT something that magically turns a 1-d array into a table as you seem to think.
To get the result you want (process those 6 elements as 2 3-element arrays), you can either store them in an array of arrayrefs (Perl's implementation of C multidimensional arrays):
my @array1=( [ 4.3, 0.2, 7 ],
[ 2.2, 0.2, 2.4] );
my @array2=( [ 2.2, 0.6, 5 ],
[ 2.1, 1.3, 3.2] );
for(my $idx1=0; $idx1 < 2; $idx1++) {
for(my $idx2=0; $idx2 < 3; $idx2++) {
print $array1[$idx1]->[$idx2] - $array2[$idx1]->[$idx2];
print " ";
}
print "\n";
}
or you can simply fake it by using offsets, the same way pointer arithmetic works in C's multidimensional arrays:
my @array1=( 4.3, 0.2, 7, # index 0..2
2.2, 0.2, 2.4); # index 3..5
my @array2=( 2.2, 0.6, 5, # index 0..2
2.1, 1.3, 3.2); # index 3..5
for(my $idx1=0; $idx1 < 2; $idx1++) {
for(my $idx2=0; $idx2 < 3; $idx2++) {
print $array1[$idx1 * 3 + $idx2] - $array2[$idx1 * 3 + $idx2];
print " ";
}
print "\n";
}
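If the row width might change, a third option is to subtract the flat lists first and cut the result into rows afterwards. This is only a sketch. The helper name diff_rows and its $width parameter are my own, not from the question.

```perl
use strict;
use warnings;

# diff_rows() is a hypothetical helper: it subtracts two flat lists
# element by element and groups the differences into rows of $width.
sub diff_rows {
    my ($first, $second, $width) = @_;
    my @diff = map { $first->[$_] - $second->[$_] } 0 .. $#$first;
    my @rows;
    # Take $width values at a time until the list of differences is empty.
    push @rows, [ splice @diff, 0, $width ] while @diff;
    return @rows;
}

my @array1 = (4.3, 0.2, 7, 2.2, 0.2, 2.4);
my @array2 = (2.2, 0.6, 5, 2.1, 1.3, 3.2);
print "@$_\n" for diff_rows(\@array1, \@array2, 3);
```

The arrays are passed by reference so that both lists arrive intact. Passing them directly would flatten them into a single argument list.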
