How can I hash pairs or triples of 'eq-able objects like symbols or ints?
In Python I can use tuples as dictionary keys; is there a way to do this in Lisp without resorting to an 'equal test?
Some implementations support custom hash table test functions, but the standard defines only four kinds of hash table:
18.1.1 Hash-Table Operations
There are four kinds of hash tables: those whose keys are compared with eq, those whose keys are compared with eql, those whose keys are
compared with equal, and those whose keys are compared with equalp.
That means that if you want to use the standard hash tables, then you'll probably need to use an equal or equalp hash table. I do notice that you wrote:
How can I hash pairs or triples of 'eq-able objects like symbols or
ints?
While symbols can be compared reliably with eq, you shouldn't compare numbers with eq. The documentation of eq says:
numbers with the same value need not be eq, … An implementation is permitted to make "copies" of characters and numbers at any time. The effect is that Common Lisp makes no guarantee that eq is true even when both its arguments are "the same thing" if that thing is a character or number.
and gives this example:
(eq 3 3)
; => true
; OR=> false
However, if you are working with (small) tuples of non-negative integers, you could easily hash on a function of them. E.g., the tuple (a,b,c) could be mapped to 2^a × 3^b × 5^c. Since prime factorizations are unique, a function like that generates distinct integers for distinct tuples, and integers are comparable with eql, so you could use an eql hash table.
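The prime-power mapping can be sketched in Python (the language the question mentions) as follows. The names encode_triple and decode_triple are made up for this illustration, and the sketch assumes non-negative integer components; the resulting keys grow quickly, so it is only practical for small values.

```python
PRIMES = (2, 3, 5)

def encode_triple(a, b, c):
    """Map a triple of non-negative ints to the single integer 2^a * 3^b * 5^c."""
    if min(a, b, c) < 0:
        raise ValueError("components must be non-negative")
    return 2**a * 3**b * 5**c

def decode_triple(n):
    """Recover (a, b, c) by counting how often each prime divides n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    exponents = []
    for p in PRIMES:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        exponents.append(e)
    return tuple(exponents)

# The encoded integer can serve as a key that is compared by value.
table = {encode_triple(1, 2, 3): "some value"}
```

Because distinct triples give distinct integers, the encoded value plays the role of the key that an eql hash table would compare.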
Another option for such a mapping function (that would work with symbols, too) would be to use sxhash. It's a standard hashing function that should produce identical values for equal objects. How it works, and what exactly it does, is not really specified at all, but it has the advantage that it's stable across Lisp images of the same implementation (e.g., run one version of SBCL today and tomorrow, and sxhash will return the same result for an equal object). Of course, it's possible that an equal hash table is just doing this for you already, so your mileage might vary.
It seems like most hashes (usually in base16/hex) could be easily represented in base32 in a lossless way, resulting in much shorter (and more easily readable) hash strings.
I understand that naive implementations might mix "O"s, "0"s, "1"s, and "I"s, but one could easily choose alphabetic characters without such problems. There are also enough characters to keep hashes case-insensitive. I know that shorter hash algorithms exist (like crc32), but this idea could be applied to those too for even shorter hashes.
Why, then, do most (if not all) hash algorithm implementations not output in base32, or at least provide an option to do so?
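To put a number on "much shorter", here is a small Python sketch that encodes the same digest both ways using only the standard library. The helper name digest_encodings is hypothetical, and stripping the "=" padding is a choice made for this illustration.

```python
import base64
import hashlib

def digest_encodings(data: bytes, algorithm: str = "sha256"):
    """Return (hex_text, base32_text) for the digest of data."""
    digest = hashlib.new(algorithm, data).digest()
    hex_text = digest.hex()
    # b32encode uses the RFC 4648 alphabet (A-Z, 2-7); padding is dropped here.
    b32_text = base64.b32encode(digest).decode("ascii").rstrip("=")
    return hex_text, b32_text

hex_text, b32_text = digest_encodings(b"hello")
print(len(hex_text), len(b32_text))
```

For a 32-byte SHA-256 digest this gives 64 hex characters versus 52 base32 characters, so the saving is roughly a fifth of the length.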
The Alexandria Manual includes a boolean function for testing the length of sequences:
Function: length= &rest sequences
Takes any number of sequences or integers in any order. Returns true iff the length of all the sequences and the integers are equal. Hint: there’s a compiler macro that expands into more efficient code if the first argument is a literal integer.
The first sentence talks about "integers" (plural). Is this simply for testing whether several computed integers are the same, at the same time as testing for sequence lengths? Or is there some deeper significance?
The third sentence offers an optimization. Does this mean that counting over a list will stop when the literal index is reached, making it potentially more efficient than (= (length lst) 3) if lst is lengthy?
The first sentence talks about "integers" (plural). Is this simply for testing whether several computed integers are the same, at the same time as testing for sequence lengths? Or is there some deeper significance?
There is no deeper significance. It is probably just for symmetry. Basically, (length= ...) with only integer arguments is simply a slower =. But the primary use case for this is (length= 3 (some-list)), i.e., the test whether some sequence has a specific length ("has the sequence value produced by (some-list) a length of 3?").
The third sentence offers an optimization. Does this mean that counting over a list will stop when the literal index is reached, making it potentially more efficient than (= (length lst) 3) if lst is lengthy?
Yes, this is actually the case; the compiler macro expands into a call to sequence-of-length-p which (for lists) does something akin to that (via nthcdr).
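The early-exit idea can be sketched in Python. This is not Alexandria's code, only an illustration of why a bounded walk can beat computing the full length; has_length is a hypothetical name.

```python
from itertools import count, islice

def has_length(iterable, n):
    """True iff iterable yields exactly n items; consumes at most n + 1 of them."""
    if n < 0:
        return False
    # Looking at n + 1 items is enough to tell "exactly n" from "more than n".
    return sum(1 for _ in islice(iterable, n + 1)) == n

print(has_length([1, 2, 3], 3))
# Terminates even on an endless iterator, because only 4 items are examined.
print(has_length(count(), 3))
```

The cost is bounded by n rather than by the length of the sequence, which is the same reason a bounded walk down a long list is cheaper than calling length first.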
In the book 'Land of Lisp' I read
Because the case command uses eq for comparisons, it is usually used
only for branching on symbol values. It cannot be used to branch on
string values, among other things.
Please explain why?
The other two excellent answers do answer the question asked. I will try to answer the natural next question - why does case use eql?
The reason is actually the same as in C (where the corresponding switch statement uses numeric comparison): the case forms in Lisp are usually compiled to something like goto, so (case x (1 ...) (2 ...) (3 ...)) is much more efficient than the corresponding cond. This is often accomplished by compiling case to a hash table lookup which maps the value being compared to the clause directly.
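As a rough analogy in Python, the difference between testing clauses one by one and looking the clause up in a table might look like this. The function names are made up, and note that Python's dict compares keys with == and hash rather than with anything like eql, so this only illustrates the lookup strategy, not Lisp's comparison semantics.

```python
def classify_chain(x):
    # Analogue of cond: each test runs in turn until one matches.
    if x == 1:
        return "one"
    elif x == 2:
        return "two"
    elif x == 3:
        return "three"
    return "other"

# Analogue of a compiled case: one table lookup selects the clause directly.
CLAUSES = {1: "one", 2: "two", 3: "three"}

def classify_table(x):
    return CLAUSES.get(x, "other")
```

Both functions return the same results; the table version does a single lookup regardless of how many clauses there are.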
That said, the next question would be - why not have a case variant with equal hash table clause lookup instead of eql? Well, this is not in the ANSI standard, but implementations can provide such extensions, e.g., ext:fcase in CLISP.
See also why eql is the default comparison.
Two strings with the same content "foo" and "foo" are not EQL. CASE uses EQL as a comparison (not EQ as in your question). Usually one might want different tests: case-sensitive and case-insensitive string comparison, for example. But for CASE one cannot use another test. EQL is built-in. EQL compares for pointer equality, numbers and characters. But not string contents. You can test if two strings are the identical data objects, though.
So, two strings "FOO" and "FOO" are usually two different objects.
But two symbols FOO and FOO are usually really the same object. That's a basic feature of Lisp. Thus they are EQL and CASE can be used to compare them.
Because (eq "foo" "foo") is not necessarily true. Each time you type a string literal, it may create a fresh, unique string. So when CASE compares the value with the literals in the clauses (it actually uses EQL, which for strings behaves like EQ), they won't match.
Is it possible to convert a church numeral to an integer representation without using a language primitive such as add1?
All the examples I've come across use a primitive to dechurch to int
Example:
plus1 = lambda x: x + 1
church2int = lambda n: n(plus1)(0)
Example 2:
(define (church-numeral->int cn)
  ((cn add1) 0))
I'm experimenting with a micro Lisp interpreter (using only John McCarthy's 10 rules) and would like to understand if that can be done without adding a primitive.
The integer numeric type is not part of McCarthy's list of Lisp elementary primitive procedures: at that level you only have symbols, lists, and functions; no numeric data type exists. That's why integers would need to be represented as functions (for instance, using Church numerals) if we were to adhere strictly to such a minimalistic definition of Lisp. So the answer is no. You can't convert to a data type that doesn't exist yet.
Now suppose that we add integers as atoms in the language (notice that adding a new data type to the language goes beyond the 7-10 primitive procedures mentioned). To simplify even more, suppose that we just add a single number, the number zero. Then we'd still need the add1 operation to build the rest of the integers, as per the Peano axioms, which require the existence of the successor operation for the natural numbers to exist. Again, we can't convert from Church numerals to integers without at least having the number zero as an atom and the add1 function.
No. int, as you describe it, is a primitive type of value, not a function. You can't manipulate such ints at all without primitives (without add1, how are you ever going to get to 1 from 0?).
However, you certainly can convert between two different Church-encodings of natural numbers without using primitives, as long as your language is Turing-complete without those primitives.
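As an illustration of converting between encodings, here is a Python sketch in the style of the question's own example. It turns a Church numeral into a unary encoding made of nested one-element tuples, using only pairing (the analogue of cons and nil) and no arithmetic. The unary encoding and the names church_to_unary and unary_to_church are assumptions chosen for this sketch.

```python
# Church numerals: n is a function that applies f to x exactly n times.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church_to_unary(n):
    # 0 is (), 1 is ((),), 2 is (((),),), and so on: one wrapper per application.
    return n(lambda rest: (rest,))(())

def unary_to_church(u):
    # Peel off one wrapper at a time, applying succ once per layer.
    n = zero
    while u != ():
        n = succ(n)
        u = u[0]
    return n

three = succ(succ(succ(zero)))
print(church_to_unary(three))
```

No add1 is needed here because the target is itself a structural encoding of the naturals rather than a primitive integer type.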
I'm looking for a kind of hash function to index similar texts. For example, if we have two very long texts called "A" and "B" that differ only slightly, then the hash function (called H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar text.
I tried "DoubleMetaphone" (my texts are in Italian), but I saw that it depends very strongly on the string prefixes. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
And this is not good for me, because strings with the same prefix would be treated as similar, and I don't want that.
Could anyone suggest another way?
See http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Your problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform a string into another string via a sequence of 1-distance transformations:
"AAAA" -> "AAAB" -> "AAABC"
According to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So all the strings will have to have the same hash, if we allow a pair with distance=1 to have the same hash value.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that each string in each equivalence class has the same hash. A possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), and the distance itself would be the hash value. Probably this approach would not be the best in your case, but maybe with some extra info on the problem we can come up with a better equivalence relation.
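A minimal Python sketch of the "distance to a predefined string" idea, assuming Levenshtein edit distance and the reference string "AAAAA" from the example. The names edit_distance and distance_hash are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

REFERENCE = "AAAAA"  # the predefined string from the example

def distance_hash(s, reference=REFERENCE):
    # Every string at the same distance from the reference shares a hash value.
    return edit_distance(s, reference)

for text in ("AAAA", "AAAB", "AAABC"):
    print(text, distance_hash(text))
```

By the triangle inequality, two strings at edit distance 1 from each other get hash values that differ by at most 1, but they are not guaranteed to be equal, and unrelated strings can collide. Whether that trade-off is acceptable depends on the application.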