Test Suite for Double Metaphone?

I've translated Double-Metaphone into ActionScript3 and I want to test it (obviously) before I release the source to ... um ... the open.
I'm looking for a long list of names with their primary and secondary codes. Google turns up nothing except one list of name pairs (presumably each pair should produce matching codes).
Thanks

You could find someone else's double metaphone implementation, run it on the same long list of words, and compare the results to your own.
For long lists of words, I like infochimps. They have lots of word lists, like this one of 350,000 English words or this one of place names, and many more.
Here are implementations you can compare your results against. Here is an online example, except that it tests only one word at a time - I guess you'll have to download and run one of the scripts to test a large list of words.
For each word, two codes will be returned; you'll probably want to test that both codes match the ones returned by another implementation. You probably know that the reference implementation is here, with full source code here, but I'm including the links anyway for others' benefit.
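To make that concrete, here is a minimal sketch of such a comparison in Python, assuming the third-party metaphone package (its doublemetaphone function returns a (primary, secondary) tuple); the file names are made up for the example:

# build_fixture.py -- compute expected Double Metaphone codes for a word list
from metaphone import doublemetaphone  # assumed: pip install metaphone

# one word per line, e.g. a name list downloaded from infochimps
with open("words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

# write "word<TAB>primary<TAB>secondary" lines; have the AS3 port emit
# the same format, then diff the two files to find disagreements
with open("expected_codes.tsv", "w") as out:
    for word in words:
        primary, secondary = doublemetaphone(word)
        out.write(f"{word}\t{primary}\t{secondary}\n")

Any line that differs between the fixture and your port's output is a word worth a closer look.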


Lisp - Extracting info from a list of comma separated values

I've tried searching for this but have yet to find anything that suits my needs. I'm trying to create an AutoCAD LISP routine that takes a text file, which is a list of comma-separated values, and places a block at coordinates defined by the list. BUT only for items in the list where the last entry starts with "HP".
So that sounds a bit complex, but the text file is basically a UTM survey output, and looks like this:
1000,Easting,Northing,Elevation,Identifier
1001,Easting,Northing,Elevation,Identifier
Etc.
The identifier takes a variety of values, but I want to extract the Northing, Easting, and Elevation, and insert a block (this last part I've got) at that location when the identifier begins with "HP". The list can be long and the number of HPs can be 1 or 5000. I'm assuming there's a "for x=1:end, do" type of loop that can be made that reuses the same variables over and over.
I'm a newbie to LISP so I'm stuck in that spot between "here are I've-never-programmed-before tutorials to make hello world" and "here is a library of the 3000 different commands in alphabetical order"
I believe the functions you need to solve this are open, read-line or read-char, close, strlen, and substr. The first four relate to AutoLisp reading and writing files. The last two manipulate the string variables pulled from the file; with them, you can find the "HP" within the text. To loop through the same code, three options come to mind: repeat, while, and foreach.
For a list of functions to quickly reference, with their descriptions, here's a good starting point. This particular page has the information broken up by category instead of alphabetical order.
https://help.solidworks.com/2022/English/api/draftsightlispreference/html/lisp_functions_overview.htm
Here are a few tutorials where AutoLisp code is used to write and read other files:
https://www.afralisp.net/autolisp/tutorials/file-handling.php
https://www.afralisp.net/autolisp/tutorials/external-data.php
Lastly, here's an example of AutoLisp writing and reading attributes from and to blocks.
https://github.com/GitHubUser5376/AttributeImportExport
You can use Lee Mac's Read-CSV function to get a list of the CSV values.
And for the "HP" detection yes you might have to go through(using loop options mentioned above like while, repeat,foreach) each and use
(substr Identifier 1 2)
to validate
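Tying the pieces together, here is a minimal sketch (the block name HPBLOCK and the file path are assumptions, str->list is a small helper defined below, and the -INSERT prompts can vary between AutoCAD versions):

;; Insert HPBLOCK at every point whose Identifier starts with "HP"
(defun c:hpblocks (/ f line fields id e n z)
  (setq f (open "C:\\survey\\points.txt" "r")) ; path is an assumption
  (while (setq line (read-line f))
    (setq fields (str->list line ",")
          id     (nth 4 fields))
    (if (and id (= (substr id 1 2) "HP"))
      (progn
        (setq e (atof (nth 1 fields))   ; Easting   = X
              n (atof (nth 2 fields))   ; Northing  = Y
              z (atof (nth 3 fields)))  ; Elevation = Z
        (command "_.-INSERT" "HPBLOCK" (list e n z) 1 1 0))))
  (close f)
  (princ))

;; Split string s on a one-character delimiter using substr
(defun str->list (s delim / pos out)
  (while (setq pos (vl-string-search delim s))
    (setq out (cons (substr s 1 pos) out)
          s   (substr s (+ pos 2))))
  (reverse (cons s out)))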

CLUTO doc2mat specified stop word list not working

I am trying to convert my documents into vector-space format using doc2mat.
The website says I can use my own stop-word list, given as a text file whose words are white-space separated or on multiple lines. So I run something like this:
./doc2mat -mystoplist=stopword.txt -skipnumeric mydocuments.txt myvectorspace.txt
However, when I check the output .clabel file, it still contains stop words that are in stopword.txt.
I really do not know how to fix this. Can someone help me out, please? Thank you!
There's one important thing I should remember: I should include ALL the unwanted words in my stop list. This is somewhat difficult since there are always variations around...
For example, if I want to exclude method, I add it to my list. However, the resulting vocabulary may still contain method, because doc2mat stems words by default: the documents contain words like methods and methodist, which stem back to method, so method shows up in the output anyway.
Another thing is to make sure the -nostop option is provided along with the user-specified stop list.
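Putting both fixes together (and keeping the file names from above), the invocation becomes:
./doc2mat -nostop -mystoplist=stopword.txt -skipnumeric mydocuments.txt myvectorspace.txt
with stopword.txt extended to list method, methods, methodist, and any other variants you want gone.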

What is the fastest way to search through strings in Objective-C?

I am implementing a sort of autocomplete for an iOS app. The data I am using for the autocomplete values is a comma-separated text file with about 100,000 strings. This is what I am doing now:
Read the text file, and create an NSArray with 100,000 NSStrings.
As the user types, do [array containsObject:text]
Surely there is a better/faster way to do this lookup. Any thoughts?
Absolutely, there is! It's not "in Objective-C" though: most likely, you would need to code it yourself.
The idea is to convert your list of strings into a trie (prefix tree), a data structure that lets you search by prefix very quickly. Finding possible completions in a trie is very fast, but the structure itself is not easy to build. A quick search on the internet revealed no readily available implementation in Objective-C, but you may be able to port an implementation from another language, use a C implementation, or even write your own if you are not particularly pressed for time.
Perhaps an easier approach would be to sort your strings alphabetically and run a binary search on the prefix that has been entered so far. Though not as efficient as a trie, the sorted-array approach is acceptable for 100K strings, because binary search gets you to the right spot in at most seventeen comparisons (log2 of 100,000 is about 16.6).
The simplest is probably binary search. See -[NSArray indexOfObject:inSortedRange:options:usingComparator:].
In particular, I'd try something like this:
Pre-sort the array that you save to the file
When you load the array, possibly re-sort it with compare: (if you are worried about it being accidentally unsorted, or about the Unicode sort order changing for some edge cases). This should be approximately O(n) if the array is mostly sorted already.
To find the first potential match, use [array indexOfObject:searchString inSortedRange:NSMakeRange(0, array.count) options:NSBinarySearchingInsertionIndex|NSBinarySearchingFirstEqual usingComparator:^(id a, id b){ return [a compare:b]; }] (the method takes an NSComparator block, not a selector).
Walk down the array until the entries no longer contain searchString as a prefix. You probably want to do case/diacritic/width-insensitive comparisons to determine whether it is a prefix (NSAnchoredSearch|NSCaseInsensitiveSearch|NSDiacriticInsensitiveSearch|NSWidthInsensitiveSearch)
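A minimal sketch of those last two steps, assuming words is already sorted with compare: and prefix holds the text typed so far (note that mixing a case-sensitive sort with insensitive prefix tests can miss some entries; see the locale caveat below):

// find the insertion point for the prefix (first potential match)
NSUInteger i = [words indexOfObject:prefix
                      inSortedRange:NSMakeRange(0, words.count)
                            options:NSBinarySearchingInsertionIndex | NSBinarySearchingFirstEqual
                    usingComparator:^(id a, id b) { return [a compare:b]; }];

// walk forward while entries still start with the prefix
NSMutableArray *matches = [NSMutableArray array];
while (i < words.count) {
    NSString *candidate = words[i];
    NSRange r = [candidate rangeOfString:prefix
                                 options:NSAnchoredSearch | NSCaseInsensitiveSearch |
                                         NSDiacriticInsensitiveSearch | NSWidthInsensitiveSearch];
    if (r.location == NSNotFound) break; // left the run of prefix matches
    [matches addObject:candidate];
    i++;
}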
This may not "correctly" handle all locales (Turkish in particular), but neither will replacing compare: with localizedCompare:, nor will naïve string-folding. (It is only 9 lines long, but took about a day of work time to get right and has about 40 lines of code and 200 lines of test, so I probably shouldn't share it here.)

Perl module for text comparison

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings: "I love you" and "I boht you". I want functionality that will compare these two strings, taking into account numerous parameters: matching of words in the correct order (love as the first word of one string should not "match" love as the fourth word of the other, even though both strings contain that word), words that do not match but are spelt almost similarly (say love and loge), the number of words, etc., and return an index, say a number from 0 to 1, representing the degree of similarity between the two strings. Is there any such Perl module?
There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of utf8 support. I don't have a better recommendation, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
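As a starting point, here is a minimal sketch using Text::Levenshtein (with the caveats above in mind), normalizing edit distance to the 0-to-1 scale the question asks for; it compares whole strings, so any word-order weighting would still be up to you:

use strict;
use warnings;
use Text::Levenshtein qw(distance);
use List::Util qw(max);

# 1 means identical, 0 means nothing in common
sub similarity {
    my ($x, $y) = @_;
    my $len = max(length $x, length $y) or return 1;  # both empty => identical
    return 1 - distance($x, $y) / $len;
}

printf "%.2f\n", similarity('I love you', 'I boht you');  # 0.70 (distance 3 over length 10)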
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone
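And a tiny example down the phonetic route, assuming Text::Metaphone from those searches; words that sound alike produce the same key, which catches spoken similarity that edit distance can miss:

use strict;
use warnings;
use Text::Metaphone;

print Metaphone('Smith'), "\n";  # SM0
print Metaphone('Smyth'), "\n";  # SM0 again, despite the spelling change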

Tool to compare/diff HTML in bulk

I have a lot of HTML files (10,000's and GBs worth) scraped from a server, and I want to check that the server produces the same results after some modifications, while ignoring the kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of numbers, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity.
Because it compares essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or that one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone-detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A that are essentially clones (exact or near misses) of files in system B. As a general rule, if file A is a variant of file B (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
It doesn't run under Linux, but I assume your problem is hard enough to solve that the operating system isn't what you are optimizing for.
I use WinMerge a lot on Windows, and from what I can see some people enjoy Meld on Linux, so perhaps that could do the trick for you:
http://meld.sourceforge.net/
Other examples I saw from a quick googling were Kompare, xxdiff.sourceforge.net, and kdiff3.sourceforge.net.
Beyond Compare is paid software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI-based but handles thousands of files very well. It lets you mark unimportant changes with regular expressions, as well as whitespace (at the beginning, middle, and end of a line). The feature set is very extensive; check out the trial download.
I do not work for this company; I just use Beyond Compare every day at work and enjoy it every time!