hash function to index similar text - hash

I'm searching about a sort of hash function to index similar text. So for example if we have two very long text called "A" and "B" where A and B differ not so much, then the hash function (called H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar text.
I tried the "DoubleMetaphone" (I use italian language text), but I saw that it depends very strong from the string prefixes. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
And this is not so good for me, beacause strings with the same prefix could be compared as similar and I don't want this.
Could anyone suggest me any other way?

see http://en.wikipedia.org/wiki/Locality_sensitive_hashing

You problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform a string into another string via a sequence of 1-distance transformations:
"AAAA" -> "AAAB" -> "AAABC"
according to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So all the strings will have to have the same hash, if we allow a pair with distance=1 to have the same hash value.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that each string in each equivalence class has the same hash. A possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), and the distance itself would be the hash value. Probably this approach would not be the best in your case, but maybe with some extra info on the problem we can come up with a better equivalence relation.

Related

How to find common patterns in thousands of strings?

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings. Say going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for small (multiple word) patterns that occur at least twice in the text.
By playing around with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to any number of words (in the above case to 3: original + 2 repetitions) and you will get different results.
(?=.+\1)+ is a look ahead pattern that will not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str="If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not."
const rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g, r={};
let t;
while (t=rx.exec(str)) r[t[1]]=(rx.lastIndex+=1-t[1].length);
const res=Object.keys(r).map(p=>
[p,[...str.matchAll(p)].length]).sort((a,b)=>b[1]-a[1]||b[0].localeCompare(a[0]));
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and alphabet:
console.log(res);
I extended my snippet a little bit by collecting all the matches as keys in an object (r). At the end I list all the keys of this object alphabetically with Object.keys(r).sort().
In the while loop I also reset the rx.lastIndex property to start the search for that next pattern immediately after the start of the last one found: rx.lastIndex+=1-t[1].length.

multiple-value hash common lisp

How can I hash pairs or triples of 'eq-able objects like symbols or ints?
In python I can use tuples as dictionary keys, is there a way to do this in lisp without resorting to an 'equal test?
While some implementations might provide provisions for custom hash table functions, the standard only defines four:
18.1.1 Hash-Table Operations
There are four kinds of hash tables: those whose keys are compared with eq, those whose keys are compared with eql, those whose keys are
compared with equal, and those whose keys are compared with equalp.
That means that if you want to use the standard hash tables, then you'll probably need to use an equal or equalp hash table. I do notice that you wrote:
How can I hash pairs or triples of 'eq-able objects like symbols or
ints?
While symbols can be compared reliably with eq, you shouldn't compare numbers with eq. The documentation of eq says:
numbers with the same value need not be eq, … An implementation is permitted to make "copies" of characters and numbers at any time. The effect is that Common Lisp makes no guarantee that eq is true even when both its arguments are "the same thing" if that thing is a character or number.
and gives this example:
(eq 3 3)
; => true
; OR=> false
However, if you are working with (small) tuples of integers, you could easily hash on a function of them. E.g., the tuple (a,b,c) could be mapped to 2a×3b×5c. Since a function like that would generate unique numbers which are comparable with eql, you could use an eql hash table.
Another option for such a mapping function (that would work with symbols, too) would be to use sxhash. It's a standard hashing function that should produce identical values for equal values. How it works, and what exactly it does is not really specified at all, but it has the advantage that it's stable across Lisp images of the same implementation (e.g., run one version of SBCL today and tomorrow, and sxhash will return the same result for an equal object). Of course, it's possible that an equal-hash-table is just doing this for you already, so your mileage might vary.

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, where I then calculate the difference between a given text and a known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
func Parse(text string, n int) map[string]int {
chars := make([]rune, 2 * n)
table := make(map[string]int)
k := 0
for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
chars[n + k] = chars[k]
k = (k + 1) % n
table[string(chars[k:k+n])]++
}
return table
}

Why can't CASE be used on string values and only symbol values?

In book 'land of lisp' I read
Because the case command uses eq for comparisons, it is usually used
only for branching on symbol values. It cannot be used to branch on
string values, among other things.
Please explain why?
The other two excellent answers do answer the question asked. I will try to answer the natural next question - why does case use eql?
The reason is actually the same as in C (where the corresponding switch statement uses numeric comparison): the case forms in Lisp are usually compiled to something like goto, so (case x (1 ...) (2 ...) (3 ...)) is much more efficient than the corresponding cond. This is often accomplished by compiling case to a hash table lookup which maps the value being compared to the clause directly.
That said, the next question would be - why not have a case variant with equal hash table clause lookup instead of eql? Well, this is not in the ANSI standard, but implementations can provide such extensions, e.g., ext:fcase in CLISP.
See also why eql is the default comparison.
Two strings with the same content "foo" and "foo" are not EQL. CASE uses EQL as a comparison (not EQ as in your question). Usually one might want different tests: string comparison case and case insensitive, for example. But for CASE on cannot use another test. EQL is built-in. EQL compares for pointer equality, numbers and characters. But not string contents. You can test if two strings are the identical data objects, though.
So, two strings "FOO" and "FOO" are usually two different objects.
But two symbols FOO and FOO are usually really the same object. That's a basic feature of Lisp. Thus they are EQL and CASE can be used to compare them.
Because (eq "foo" "foo") is not necessarily true. Each time you type a string literal, it may create a fresh, unique string. So when CASE is comparing the value with the literals in the cases with EQ, they won't match.

How can I generate this hash?

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and changes whenever I add or remove points into a talent. I want to be able to recreate this part, so that I can then link my users directly to wowhead to view their talent trees, but I havn't the foggiest idea how to do that. Can anyone provide some guidance?
The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
0 1 2 3 4 5
0 0 o b h L x
1 z k d u p t
2 M R r G T g
3 c s f I j e
4 m a w N n v
5 V q i A y E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
If you go to http://www.wowhead.com/?talent and start using the talent tree you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash but some kind of encoded structure data.
As the code is built up as you click the logic for building the code will be in the JavaScript on that page.
So my advice is do a view source on the page, download the JavaScript files and have a look at them.
I think it isn't a hash value, because hash values are normally one-ways values. This means you cannot (easily) restore the original information from which the hash code was generated.
Best thing would be to contact someone from wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help of the developers from wowhead.com it is almost impossible to figure out what information is encoded into this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look into the post data your browser sends to the server, it may contain a hidden field with the value you are searching for (you can use Tamper Data Firefox Addon).
I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from it database. Since you don't have access to the database you don't know which key is needed, let alone how to encode it.
You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash wikipedia
Good luck learning how to program!
These hashes are hard to 'reverse engineer' unless you know how it was generated.
For example, it could be:
s1 = "random_string-" + score;
hash = encrypt(s1)
...etc
so it is hard to get the original data back from the hash (that is the whole point anyway).
your best bet would be link to the profile that would have the latest score ..etc