I will explain the problem here.
Suppose i am having list of 1000 words. Say it is a dictionary. User will input some word and it will match with exact match if the word is correct or give the closest match. Just like Google search as we enter something and it gives the closest match.
Algorithm that i thought is
Read the word list one by one
split our input word string into characters
take the first word from the list and match character wise
similarly do it for other words in the list
I know this is the long way and it will take lot of time. Do anyone know how to implement better algorithm
Sort the words in an array
When a word comes in => binary search (log(n)) (we are doing that because if you use a hash table it will be good for direct match but poor for adjacent)
If perfect match return it
Else compute a levensthein distance of the requested word with the adjacent words and their neighbors (to be defined) and add them to a list of return (if they are satisfying)
Return the list of adjacent words selected
Quick and dirty implementation with /usr/share/dict/words (you still have to do the levensthein distance part and selection)
DISCLAIMER: Binary search code borrowed from http://www.perlmonks.org/?node_id=503154
open(FILE, "<", "/usr/share/dict/words");
my #lines = <FILE>;
my $word = $ARGV[0];
sub BinSearch
{
my ($target, $cmp) = #_;
my #array = #{$_[2]};
my $posmin = 0;
my $posmax = $#array;
return -0.5 if &$cmp (0, \#array, $target) > 0;
return $#array + 0.5 if &$cmp ($#array, \#array, $target) < 0;
while (1)
{
my $mid = int (($posmin + $posmax) / 2);
my $result = &$cmp ($mid, \#array, $target);
if ($result < 0)
{
$posmin = $posmax, next if $mid == $posmin && $posmax != $posmin;
if ($mid == $posmin){
return "Not found, TODO find close match\n";
}
$posmin = $mid;
}
elsif ($result > 0)
{
$posmax = $posmin, next if $mid == $posmax && $posmax != $posmin;
if ($mid == $posmax){
return "Not found, TODO find close match\n";
}
$posmax = $mid;
}
else
{
return "Found: ".#array[$mid];
}
}
}
sub cmpFunc
{
my ($index, $arrayRef, $target) = #_;
my $item = $$arrayRef[$index];
$item =lc($item);
$target =lc($target);
$a = $item cmp $target;
return $a;
}
print BinSearch($word."\n", \&cmpFunc, \#lines)."\n";
Usage (if the script is called find_words.pl):
perl find_words.pl word
Where word is the word you want to search for.
A common algorithm for this sort of "fuzzy" word search is Levenshtein distance. It doesn't really find similar words but calculates the similarity of words. This similarity score (or Levenshtein distance) can then be used by a sorting or filter function to select similar words.
How the distance is measured is simple: how many characters need to be changed from the target word to the matched word. For example, a distance of 3 means that the difference between the words are 3 edits (not necessarily characters since it also includes the act of adding and removing characters).
The Rosetta Code site has a listing of Levenshtein distance algorithms implemented in various languages including tcl and perl: http://rosettacode.org/wiki/Levenshtein_distance
There is a page on the tcler's wiki that discusses similarity algorithms which includes several implementations of Levenshtein distance: similarity
For perl, there's also a CPAN module that you can use: Text::Levenshtein
So in perl you can simply do:
use Text::Levenshtein;
my %word_distance;
#word_distance{#dictionary} = distance($word,#dictionary);
Then iterate through the word_distance hash to find the most similar words.
The problem with using a simple binary search to get a neighbourhood of similar words and then using the Levenshtein algorithm to refine is that errors can occur early in the word as well as late; you run the risk of completely missing words where there's an early error. A more effective technique might be to use the Soundex algorithm to create collation keys in your word list so that you search by basic similarity. Then you can use Levenshtein to refine, but weighting that similarity measure by the rarity of words in the underlying source corpus; assuming that users are more likely to want a common word than a rare one is a useful measure.
(This assumes you've got a source corpus, but if you're wanting to emulate Google then you've definitely got to have one of those.)
It might be better to instead look at ways to use some sort of map-reduce mechanism to run a weighted Levenshtein distance metric over the entire set of words. This is more of a “throw hardware at the problem” approach, but avoids the problems associated with potential problems with words getting missed due to the initial filter. Alas, this does mean that you're going to end up with something that can't be pushed as part of a simple piece of software (provisioning systems to support something like this is unlikely to be something that you'd want to foist on a normal user) but it is likely to be practical to deploy behind a service.
Related
I am still trying to work on permutation match, and I wonder if anyone has better way to do it. I want to match all patterns in an array in any order, i.e., match permutations of items (string or other objects) in an array. E.g., if array is (1,2,3), then it is true if a string contains 1 and 2 and 3 in any order; i.e, true if a string contains permutation of (1,2,3).
What I have now is this:
my #x = < one eins uno yi two zwei dos er one one one two two two >;
my #z = < one eins uno yi two zwei dos er one one one two two two foo >;
my $y = "xxx one eins uno yi two zwei dos er xxx";
sub matchAllWords($aString, #anArray) {
my $arraySize = #anArray.elems;
if $arraySize == 0 { False; }
elsif $arraySize == 1 {
($aString.match(/:i "#anArray[0]" /)).Bool;
} else {
my $firstCheck = ($aString.match(/:i "#anArray[0]"/)).Bool;
if $firstCheck {
$firstCheck
and
(matchAllWords($aString, #anArray[1..*]));
} else {
return False;
}
}
}
say matchAllWords($y, #x);
# result is True, but it should NOT be True because $y should not
# match permutations of #x which contains multiple identical elements
# of "one" and "two"
say matchAllWords($y, #z); # False as expected;
The problems is that my function matches all unique words in the array, but is unable to differentiate permutations of duplicate words. I can add more and more codes to tell if a word has been matched, but more codes to accomplish a simple idea, "permutation match", is un-perl-ly. Any suggestions? Thanks
New answer
Based on everyone's comments, here's a restatement of the problem as I now understand it, followed by a new solution:
Test that Y, a string, contains all of the strings in Z, a Bag (multiset) of strings, with correct copy count / multiplicity.
my \Z = < one eins uno yi two zwei dos er two > .Bag ;
my \Y = "xxx one eins uno yi two zwei dos er two xxx" ;
sub string-matches-bag ($string, $bag) {
for $bag.kv -> $sub-string, $copy-count {
fail unless ($string ~~ m:g/ $sub-string /).elems == $copy-count
}
True
}
say string-matches-bag Y, Z
Old answer
say so $y.words.all eq #z.any
An explanation for this line of code is in the last part of this answer.
I found your question pretty confusing. But I'm hopeful this answer is either what you want or at least enough to move things in the right direction.
I found your data confusing. There are two 'xxx' words in your $y but none in either array. So that bit can't match. There's a 'foo' in your #z. Was that supposed to be 'xxx'? There's a 'one' in your $y but both arrays have at least two 'one's. Is that an issue?
I found your narrative confusing too.
For this answer I've assumed that #z has a xxx at the end, and that the key comment is:
a simple idea, "permutation match"
say so $y.words.all eq #z.any
so returns the boolean evaluation (True or False) of the expression on its right.
The expression on so's right uses Junctions. An English prose summary of it is 'all of the "words" in $y, taken one at a time, are string equal to at least one element of #z'.
Is this the simple solution you're asking for?
I have a loop for example :
for my $something ( #place[1..$#thing] ) {
}
I don't get this statement 1..$#thing
I know that # is for comments but my IDE doesn't color #thing as comment. Or is it really just a comment for someone to know that what is in "$" is "thing" ? And if it's a comment why was the rest of the line not commented out like ] ) { ?
If it has other meanings, i will like to know. Sorry if my question sounds odd, i am just new to perl and perplexed by such an expression.
The $# is the syntax for getting the highest index of the array in question, so $#thing is the highest index of the array #thing. This is documented in perldoc perldata
.. is the range operator, and 1 .. $#thing means a list of numbers, from 1 to whatever the highest index of #thing is.
Using this list inside array brackets with the # sigill denotes that this is an array slice, which is to say, a selected number of elements in the #place array.
So assuming the following:
my #thing = qw(foo bar baz);
my #place = qw(home work restaurant gym);
then #place[1 .. $#thing] (or 1 .. 2) would expand into the list work, restaurant.
It is correct that # is used for comments, but not in this case.
it's how you define a range. From starting value to some other value.
for my $something ( #place[1..3] ) {
# Takes the first three elements
}
Binary ".." is the range operator, which is really two different
operators depending on the context. In list context, it returns a list
of values counting (up by ones) from the left value to the right
value. If the left value is greater than the right value then it
returns the empty list. The range operator is useful for writing
foreach (1..10) loops and for doing slice operations on arrays. In the
current implementation, no temporary array is created when the range
operator is used as the expression in foreach loops, but older
versions of Perl might burn a lot of memory when you write something
like this:
http://perldoc.perl.org/perlop.html#Range-Operators
I am very new to Perl. Recently I wrote a code to calculate the coefficient of correlation between the atoms between two structures. This is a brief summary of my program.
for($i=1;$i<=2500;$i++)
{
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
print $Cij;
}
}
This program prints all the correlations serially in a single column. But I need to print the correlations in the form of a matrix, something like..
Atom1 Atom2 Atom3 Atom4
Atom1 0.5 -0.1 0.6 0.8
Atom2 0.1 0.2 0.3 -0.5
Atom3 -0.8 0.9 1.0 0.0
Atom4 0.3 1.0 0.8 -0.8
I don't know, how it can be done. Please help me with a solution or suggest me how to do it !
Simple issue you're having. You need to print a NL after you finish printing a row. However, while i have your attention, I'll prattle on.
You should store your data in a matrix using references. This way, the way you store your data matches the concept of your data:
my #atoms; # Storing the data in here
my $i = 300;
my $j = 400;
my $value = ...; # Calculating what the value should be at column 300, row 400.
# Any one of these will work. Pick one:
my $atoms[$i][$j] = $value; # Looks just like a matrix!
my $atoms[$i]->[$j] = $value; # Reminds you this isn't really a matrix.
my ${$atoms[$1]}[$j] = $value; # Now this just looks ridiculous, but is technically correct.
My preference is the second way. It's just a light reminder that this isn't actually a matrix. Instead it's an array of my rows, and each row points to another array that holds the column data for that particular row. The syntax is still pretty clean although not quite as clean as the first way.
Now, let's get back to your problem:
my #atoms; # I'll store the calculated values here
....
my $atoms[$i]->[$j] = ... # calculated value for row $i column $j
....
# And not to print out my matrix
for my $i (0..$#atoms) {
for my $j (0..$#{ $atoms[$i] } ) {
printf "%4.2f ", $atoms[$i]->[$j]; # Notice no "\n".
}
print "\n"; # Print the NL once you finish a row
}
Notice I use for my $i (0..$#atoms). This syntax is cleaner than the C style three part for which is being discouraged. (Python doesn't have it, and I don't know it will be supported in Perl 6). This is very easy to understand: I'm incrementing through my array. I also use $#atom which is the length of my #atoms array -- or the number of rows in my Matrix. This way, as my matrix size changes, I don't have to edit my program.
The columns [$j] is a bit tricker. $atom[$i] is a reference to an array that contains my column data for row $i, and doesn't really represent a row of data directly. (This is why I like $atoms[$i]->[$j] instead of $atoms[$i][$j]. It gives me this subtle reminder.) To get the actual array that contains my column data for row $i, I need to dereference it. Thus, the actual column values are stored in row $i in the array array #{$atoms[$i]}.
To get the last entry in an array, you replace the # sigil with $#, so the last index in my
array is $#{ $atoms[$i] }.
Oh, another thing because this isn't a true matrix: Each row could have a different numbers of entries. You can't have that with a real matrix. This makes using an Array of Arrays in Perl a bit more powerful, and a bit more dangerous. If you need a consistent number of columns, you have to manually check for that. A true matrix would automatically create the required columns based upon the largest $j value.
Disclaimer: Pseudo Code, you might have to take care of special cases and especially the headers yourself.
for($i=1;$i<=2500;$i++)
{
print "\n"; # linebreak here.
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
printf "\t%4f",$Cij; # print a tab followed by your float giving it 4
# spaces of room. But no linebreak here.
}
}
This is of course a very crude and quick and dirty solution. But if you save the output into a .csv file, most csv-able spreadsheet programs (OpenOfice) should easily be able to read it into a proper table. If the spreadsheet viewer of your choice can not understand tabs as delimeter, you could easily add ; or / or whatever it can use into the printf string.
I have a one-dimensional PDL that I'd like to perform calculations on each half of; i.e. split it, then do calculations on the first half, and the same calculations on the second half.
Is there an easier/nicer/elegant way to simply split the PDL in half than getting the number of elements (with nelem), dividing that in two, then doing two lots of slices?
Thanks
Yes, in so far as you don't need to directly invoke slice to get what you want. You could chain splitdim and dog with something like this:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
That, of course, is easily extended to more than two divisions. However, if you want to extend it to higher-dimensional piddles (i.e. a collection of time series all stored in one piddle), you would need to be a little more subtle. If you want to split along the first dimension (which has index 0), you would say this instead:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->dim(0)/2)->mv(1, -1)->dog;
The splitdim operation splits the 0th dimension into two dimensions, the 0th being dim(0)/2 in length, the 1st being 2 in length (because we divided it into two pieces). Since dog operates on the last dimension, we move the 1st dimension to the end before invoking dog.
However, even with the single-dimensional solution, there's a caveat. Due to the way that $data->splitdim works, it will truncate the last piece of data if you have an odd number of elements. Try that operation on a piddle with 21 elements and you'll see what I mean:
my $data = sequence(20);
say "data is $data"; # lists 0-19
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19
$data = sequence(21);
say "data is $data"; # lists 0-20, i.e. 21 elements
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19!!
If you want to avoid that, you can produce your own method that splits the first dimension in half without truncation. It would probably look something like this:
sub PDL::split_in_half {
my $self = shift;
# the int() isn't strictly necessary, but should make things a
# tad faster
my $left = $self->slice(':' . int($self->dim(0)/2-1) );
my $right = $self->slice(int($self->dim(0)/2) . ':');
return ($left, $right);
}
Here I have also used the int built-in to make sure we don't have the .5 if dim(0) is odd. It's a little more complicated, but we're burying this complexity into a method precisely so we don't have to think about the complexity, so we may as well buy ourselves a few clock cycles while we're at it.
Then you could easily invoke the method thus:
my ($left, $right) = $data->split_in_half;
Thank you in advance for indulging an amateur Perl question. I'm extracting some data from a large, unformatted text file, and am having trouble combining the use of a 'while' loop and regular expression matching over multiple lines.
First, a sample of the data:
01-034575 18/12/2007 258,750.00 11,559.00 36 -2 0 6 -3 2 -2 0 2 1 -1 3 0 5 15
-13 -44 -74 -104 -134 -165 -196 -226 -257 -287 -318 -349 -377 -408 -438
-469 -510 -541 -572 -602 -633 -663
Atraso Promedio ---> 0.94
The first sequence, XX-XXXXXX is a loan ID number. The date and the following two numbers aren't important. '36' is the number of payments. The following sequence of positive and negative numbers represent how late/early this client was for this loan at each of the 36 payment periods. The '0.94' following 'Atraso Promedio' is the bank's calculation for average delay. The problem is it's wrong, since they substitute all negative (i.e. early) payments in the series with zeros, effectively over-stating how risky a client is. I need to write a program that extracts ID and number of payments, and then dynamically calculates a multi-line average delay.
Here's what I have so far:
#Create an output file
open(OUT, ">out.csv");
print OUT "Loan_ID,Atraso_promedio,Atraso_alt,N_payments,\n";
open(MYINPUTFILE, "<DATA.txt");
while(<MYINPUTFILE>){
chomp($_);
if($ID_select != 1 && m/(\d{2}\-\d{6})/){$Loan_ID = $1, $ID_select = 1}
if($ID_select == 1 && m/\d{1,2},\d{1,3}\.00\s+\d{1,2},\d{1,3}\.00\s+(\d{1,2})/) {$N_payments = $1, $Payment_find = 1};
if($Payment_find == 1 && $ID_select == 1){
while(m/\s{2,}(\-?\d{1,3})/g){
$N++;
$SUM = $SUM + $1;
print OUT "$Loan_ID,$1\n"; #THIS SHOWS ME WHAT NUMBERS THE CODE IS GRABBING. ACTUAL OUTPUT WILL BE WRITTEN BELOW
print $Loan_ID,"\n";
}
if(m/---> *(\d*.\d*)/){$Atraso = $1, $Atraso_select = 1}
if($ID_select == 1 && $Payment_find == 1 && $Atraso_select == 1){
...
There's more, but the while loop is where the program is breaking down. The problem is with the pattern modifier, 'g,' which performs a global search of the string. This makes the program grab numbers that I don't want, such as the '1' in loan ID and the '36' for the number of payments. I need the while loop to start from wherever the previous line in the code left off, which should be right after it has identified the number of loans. I've tried every pattern modifier that I've been able to look up, and only 'g' keeps me out of an infinite loop. I need the while loop to go to the end of the line, then start on the next one without combing over the parts of the string already fed through the program.
Thoughts? Does this make sense? Would be immensely grateful for any help you can offer. This work is pro-bono, unpaid: just trying to help out some friends in a micro-lending institution conduct a risk analysis.
Cheers,
Aaron
The problem is probably easier using split, for instance something like this:
use strict;
use warnings;
open DATA, "<DATA.txt" or die "$!";
my #payments;
my $numberOfPayments;
my $loanNumber;
while(<DATA>)
{
if(/\b\d{2}-\d{6}\b/)
{
($loanNumber, undef, undef, undef, $numberOfPayments, #payments) = split;
}
elsif(/Atraso Promedio/)
{
my (undef, undef, undef, $atrasoPromedio) = split;
# Calculate average of payments and print results
}
else
{
push(#payments, split);
}
}
If the data's clean enough, I might approach it by using split instead of regular expressions. The first line is identifiable if field[0] matches the form of a loan number and field[1] matches the format of a date; then the payment dates are an array slice of field[5..-1]. Similarly testing the first field of each line tells you where you are in the data.
Peter van her Heijden's answer is a nice simplification for a solution.
To answer the OP's question about getting the regexp to continue where it left off, see Perl operators - regexp-quote-like operators, specifically the section "Matching in list context" and the "\G assertion" section just after that.
Essentially, you can use m//gc along with the \G assertion to use regexps match where previous matches left off.
The example in the "\G assertion" section about lex-like scanners would seem to apply to this question.