Perfect hashtable for a static set of integer keys

I'm looking for a hash function that can exploit the following properties:
- N distinct integer values will be stored in the hashtable
- at any given point in time there will be no more than M values present in the hashtable
- the hashtable stays static across queries (i.e. at some point the whole hashtable is initialized, and subsequent calls only read from it)
- the largest possible key value K is known when the hashtable is initialized (K >> N)
- every queried key is guaranteed to be present in the hashtable
So far I'm using a hash function like:
h(k) = 7 * k % M
with M = PRIME_CLOSE_TO(7*N)
7 is somewhat arbitrary.
Do you have any suggestions on how to improve this?

This is a starting point: http://en.wikipedia.org/wiki/Perfect_hash_function
In practice, any ordinary hash function would be fine. But if you want a minimal perfect hash for some reason, you may look into a library that does perfect hashing, such as: CMPH - C Minimal Perfect Hashing Library
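Since the table is static, one simple alternative to a full perfect-hashing library is to keep the h(k) = a * k % M scheme but search for a multiplier a that happens to be collision-free for your particular key set. Below is a minimal sketch in Scala, under the assumption that keys fit in a Long and that a table of roughly 2N slots is acceptable; the names StaticHash and findPerfectMultiplier are illustrative, not from CMPH or any other library:

import scala.util.Random

object StaticHash {
  // Search for a multiplier a such that h(k) = (a * k) % m is collision-free
  // for the given static key set. Long multiplication may wrap around, which
  // is fine as long as the same wrap happens again on lookup.
  def findPerfectMultiplier(keys: Seq[Long], m: Int, tries: Int = 100000): Option[Long] = {
    val rnd = new Random(42)
    (1 to tries).iterator
      .map(_ => 1L + rnd.nextInt(Int.MaxValue))
      .find { a =>
        val slots = keys.map(k => ((a * k) % m + m) % m) // keep the slot index non-negative
        slots.distinct.size == keys.size                 // no two keys share a slot
      }
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(3L, 17L, 145L, 999983L, 1234567L)
    val m = 2 * keys.size + 1
    println(findPerfectMultiplier(keys, m)) // Some(a) if a collision-free multiplier was found
  }
}

This brute-force search only scales to small key sets (for larger N the chance that a random multiplier is injective drops off quickly); for a genuinely minimal table, a CMPH-style construction is the better tool.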

Related

Matlab: Conceptual difficulty in how to create multiple hash tables in Locality Sensitive Hashing

The key idea of Locality Sensitive Hashing (LSH) is that neighboring points v are more likely to be mapped to the same bucket, while points far from each other are more likely to be mapped to different buckets. Using random projection, if the database contains N samples, each of high dimension d, then theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension, denoted as g(v) = (h_1(v), h_2(v), ..., h_k(v)). So any vector point v is mapped to a k-dimensional vector by the g-function. The hash code is then the vector of reduced length/dimension k and is regarded as a bucket. Now, to increase the probability of collision, theory says that we should have L such g-functions g_1, g_2, ..., g_L, chosen at random. This is the part that I do not understand.
Question: How do we create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et al. (Link to Code)
In the file Coding.m
dim = size(X_train, 2);
R = randn(dim, bit);
% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >=0);
X_query is the set of query data, each point of dimension d, and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The outputs B_query and B_base are N strings of length k taking 0/1 values.
Does this create multiple hash tables, i.e. is N the number of hash tables? I am confused as to how. A detailed explanation would be very helpful.
How to create multiple hash tables?
LSH creates a hash table using (amplified) hash functions built by concatenation:
g(p) = [h_1(p), h_2(p), ..., h_k(p)], where each h_i is drawn uniformly at random from the hash family H
g() is an (amplified) hash function and it corresponds to one hashtable. So we map the data, via g(), into that hashtable, and with high probability the close points will fall into the same bucket and the non-close points will fall into different buckets.
We do that L times, thus creating L hashtables. Note that every g() should (with high probability) be different from the other g() hash functions.
Note: large k ⇒ a larger gap between P1 and P2. Small P1 ⇒ a larger L is needed in order to find the neighbors. A practical choice is L = 5 (or 6). Here P1 is the collision probability for near points and P2 the collision probability for far points.
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question; how about sqrt(N), where N is the number of points in the dataset? Check this: Number of buckets in LSH
The code of Yan Xia
I am not familiar with that code, but from what you said, I believe that the query data are 1000 in number because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash the query to see in which bucket of a hashtable it will be mapped. The points inside that bucket are potential (approximate) Nearest Neighbors.
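To make the construction above concrete, here is a minimal sketch in Scala (not the paper's Matlab code) of L hash tables, each defined by its own set of k random hyperplanes; a point's bucket key in table j is the k-bit sign pattern of its projections, i.e. g_j(v). The class name LshIndex and the packing of the bits into an Int (which assumes k <= 31) are illustrative choices:

import scala.util.Random
import scala.collection.mutable

class LshIndex(d: Int, k: Int, L: Int, seed: Long = 0L) {
  private val rnd = new Random(seed)
  // One k x d random projection matrix per hash table (one g-function per table).
  private val projections: Array[Array[Array[Double]]] =
    Array.fill(L, k, d)(rnd.nextGaussian())
  // L hash tables: bucket key (the k sign bits packed into an Int) -> point ids.
  private val tables: Array[mutable.Map[Int, mutable.Buffer[Int]]] =
    Array.fill(L)(mutable.Map.empty)

  // g_j(v): concatenate the k sign bits of v projected onto table j's hyperplanes.
  private def bucketKey(j: Int, v: Array[Double]): Int = {
    var key = 0
    for (i <- 0 until k) {
      val dot = (0 until d).map(c => projections(j)(i)(c) * v(c)).sum
      if (dot >= 0) key |= (1 << i)
    }
    key
  }

  def insert(id: Int, v: Array[Double]): Unit =
    for (j <- 0 until L)
      tables(j).getOrElseUpdate(bucketKey(j, v), mutable.Buffer.empty) += id

  // Candidate (approximate) nearest neighbors: union of the query's buckets over all L tables.
  def query(v: Array[Double]): Set[Int] =
    (0 until L).flatMap(j => tables(j).getOrElse(bucketKey(j, v), Nil)).toSet
}

In the code from the question, B_base corresponds to the bucket keys of a single table (one random matrix R, i.e. one g-function); repeating the same construction with L different random matrices R gives the L tables.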

Universal hashing: should I get the same hash value for the same key?

I mean, I have implemented a universal hashing function using this expression:
h(k) = ((a*k + b) mod p) mod m (from Cormen)
where:
- p is a big prime number greater than the key k;
- a and b are two randomly chosen numbers, the first in the range [1, p-1] and the second in the range [0, p-1].
Now, I implemented this, and for the random function I chose the seed equal to k. That's because, if I don't do this, when I insert a value with the key k it will generate a hash value that depends on the default seed of the Random function (probably the time). So if I then want to search for the key again, I can't, because the universal hashing function now returns me another value. So, I would appreciate it if you could tell me whether my reasoning is correct or not.
My doubt is that, by doing so, two elements that have the same key will inevitably be stored in the same linked list (which I am not sure is correct).
Thanks in advance.
I think you have a slight misunderstanding about how universal hashing works. Rather than choosing a and b at random every time you compute the hash, instead, before you do any hashing at all, select a random a and b. Once you've done that, every time you need to compute the hash, go and compute it using the formula above based on the input value k and the values a and b that you chose initially.
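Here is a minimal sketch in Scala of what this answer describes: a and b are drawn once when the hash function is constructed (no per-call seeding based on k), so the same key always maps to the same slot. The class name UniversalHash and the particular prime p are illustrative; the sketch assumes keys are non-negative, smaller than p, and that p fits in an Int:

import scala.util.Random

// h(k) = ((a*k + b) mod p) mod m, with a and b fixed once at construction time.
class UniversalHash(m: Int, p: Long = 2147483647L /* a prime larger than any 31-bit key */) {
  private val rnd = new Random()
  private val a: Long = 1L + rnd.nextInt((p - 1).toInt) // a in [1, p-1]
  private val b: Long = rnd.nextInt(p.toInt).toLong     // b in [0, p-1]

  def hash(k: Long): Int = (((a * k + b) % p) % m).toInt
}

object UniversalHashDemo {
  def main(args: Array[String]): Unit = {
    val h = new UniversalHash(m = 128)
    // The same instance always returns the same slot for the same key:
    println(h.hash(42) == h.hash(42)) // true
    // A different instance (fresh a and b) will usually map the key elsewhere,
    // which is exactly why a and b must not be re-drawn on every call.
  }
}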

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function, besides generating a large set of values and looking at their distribution?
By efficiency I mean that the keys generated by your hash function are distributed evenly. Is there a way to prove this without actually testing with actual values?
A hash function is only "even" in the context of the data being hashed.
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (e.g. mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set.
However, apply it to the second set and there are collisions everywhere.
Hash = (x * 37) mod 256
is much better for the second set, but may not suit the first set quite so well... especially when partitioning the hash into, e.g., a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should have gathered enough data well before the cost of rehashing becomes prohibitive, so you can still change your hash function.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data. Let's further suppose that the hash function is supposed to take byte streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b : datastream) hash = hash ^ b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b : datastream) hash = (31 * hash + b) % 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random bytes in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set, and as a result it only generates hashes in the range 0-127; there is also a higher likelihood of some "hot" values because of the letter frequencies in English words and the cancelling effect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak for collisions due to how the English language is structured... something that only testing with real data can show.
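If you do end up testing against data you expect to handle, a small harness makes the comparison concrete. Here is an illustrative Scala sketch (not from the answer) that buckets sample inputs with the two hash functions above and reports a chi-squared-style statistic against the uniform distribution (lower means more even); the sample data and bucket count are arbitrary toy choices:

object HashEvaluation {
  // XOR-fold hash, as described above.
  def xorHash(data: Array[Byte]): Int = data.foldLeft(0)((h, b) => h ^ (b & 0xff))

  // The 37/31 "pair of primes" hash, as described above.
  def primeHash(data: Array[Byte]): Int =
    data.foldLeft(37)((h, b) => (31 * h + (b & 0xff)) % 256)

  // Chi-squared-style statistic of the bucket counts against a uniform distribution.
  def chiSquared(samples: Seq[Array[Byte]], hash: Array[Byte] => Int, buckets: Int = 256): Double = {
    val counts = new Array[Int](buckets)
    samples.foreach(s => counts(hash(s) % buckets) += 1)
    val expected = samples.size.toDouble / buckets
    counts.map(c => (c - expected) * (c - expected) / expected).sum
  }

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(1)
    // Uniformly random byte strings versus English-like US-ASCII strings.
    val randomData = Seq.fill(2000)(Array.fill(16)(rnd.nextInt(256).toByte))
    val words = Seq("the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "hello", "world")
    val englishData = Seq.fill(2000)(Seq.fill(4)(words(rnd.nextInt(words.size))).mkString(" ").getBytes("US-ASCII"))
    println(f"xor / random:    ${chiSquared(randomData, xorHash)}%.1f")
    println(f"prime / random:  ${chiSquared(randomData, primeHash)}%.1f")
    println(f"xor / english:   ${chiSquared(englishData, xorHash)}%.1f")
    println(f"prime / english: ${chiSquared(englishData, primeHash)}%.1f")
  }
}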

Generating k pairwise independent hash functions

I'm trying to implement a Count-Min Sketch algorithm in Scala, and so I need to generate k pairwise independent hash functions.
This is lower-level than anything I've ever programmed before, and I don't know much about hash functions except from Algorithms classes, so my question is: how do I generate these k pairwise independent hash functions?
Am I supposed to use a hash function like MD5 or MurmurHash? Do I just generate k hash functions of the form f(x) = ax + b (mod p), where p is a prime and a and b are random integers? (i.e., the universal hashing family everyone learns in algorithms 101)
I'm looking more for simplicity than raw speed (e.g., I'll take something 5x slower if it's simpler to implement).
Scala already has MurmurHash implemented (it's scala.util.MurmurHash). It's very fast and very good at distributing values. A cryptographic hash is overkill--you'll just take tens or hundreds of times longer than you need to. Just pick k different seeds to start with and, since it's nearly cryptographic in quality, you'll get k largely independent hash codes. (In 2.10, you should probably switch to using scala.util.hashing.MurmurHash3; the usage is rather different but you can still do the same thing with mixing.)
If you only need near values to be mapped to randomly far values this will work; if you want to avoid collisions (i.e. if A and B collide using hash 1 they will probably not also collide using hash 2), then you'll need to go at least one more step and hash not the whole object but subcomponents of it so there's an opportunity for the hashes to start out different.
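A minimal sketch of this suggestion in Scala 2.10+ (scala.util.hashing.MurmurHash3), assuming string keys: the k functions are just MurmurHash3 with k different seeds, each reduced to a column index in [0, width). The class name SeededHashes and the width/depth naming are illustrative, not part of any Count-Min Sketch library:

import scala.util.hashing.MurmurHash3
import scala.util.Random

// k hash functions obtained from MurmurHash3 with k different seeds,
// each mapping a key to a column index in [0, width).
class SeededHashes(k: Int, width: Int, masterSeed: Int = 12345) {
  private val seeds: Array[Int] = {
    val rnd = new Random(masterSeed)
    Array.fill(k)(rnd.nextInt())
  }

  // Column index of `key` under the i-th hash function.
  def index(i: Int, key: String): Int = {
    val h = MurmurHash3.stringHash(key, seeds(i))
    ((h % width) + width) % width // keep the index non-negative
  }
}

object SeededHashesDemo {
  def main(args: Array[String]): Unit = {
    val hashes = new SeededHashes(k = 4, width = 1 << 16)
    // The same key yields the same 4 indices on every call, one per row of the sketch:
    println((0 until 4).map(i => hashes.index(i, "hello")))
  }
}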
Probably the simplest approach is to take some cryptographic hash function and "seed" it with different sequences of bytes. For most practical purposes, the results should be independent, as this is one of the key properties a cryptographic hash function should have (if you replace any part of a message, the hash should be completely different).
I'd do something like:
// for each 0 <= i < k generate a sequence of random numbers
val randomSeeds: Array[Array[Byte]] = ... ; // initialize by random sequences
def hash(i: Int, value: Array[Byte]): Array[Byte] = {
val dg = java.security.MessageDigest.getInstance("SHA-1");
// "seed" the digest by a random value based on the index
dg.update(randomSeeds(i));
return dg.digest(value);
// if you need integer hash values, just take 4 bytes
// of the result and convert them to an int
}
Edit:
I don't know the precise requirements of the Count-Min Sketch; maybe a simple hash function would suffice, but it doesn't seem to be the simplest solution.
I suggested a cryptographic hash function, because there you have quite strong guarantees that the resulting hash functions will be very different, and it's easy to implement, just use the standard libraries.
On the other hand, if you have two hash functions of the form f1(x) = ax + b (mod p) and f2(x) = cx + d (mod p), then you can compute one using another (without knowing x) using a simple linear formula f2(x) = c / a * (f1(x) - b) + d (mod p), which suggests that they aren't very independent. So you could run into unexpected problems here.

Using hash functions with Bloom filters

A Bloom filter uses a hash function (or many) to generate a value between 0 and m given an input string X. My question is how you use a hash function to generate a value in this way. For example, an MD5 hash is typically represented as a 32-character hex string; how would I use an MD5 hashing algorithm to generate a value between 0 and m, where I can specify m? I'm using Java at the moment, so an example using the MessageDigest functionality it offers would be great, though a generic description of how to go about it would be fine too.
Thanks
You should first convert the hash output to an unsigned integer, then reduce it modulo m. It looks like this:
MessageDigest md = MessageDigest.getInstance("MD5");
// hash data...
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)
I have assumed that m is a BigInteger instance. If necessary, use BigInteger.valueOf(). Similarly, use n.intValue() or n.longValue() to get the value of n as one of the primitive types of Java.
The modular reduction is somewhat biased, but the bias is very small if m is substantially smaller than 2^128.
The simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.
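For completeness, here is a minimal Scala sketch of that idea, combined with the trick of prefixing the input with a per-function byte so that the same digest algorithm yields several different indices (a Bloom filter normally needs more than one hash). The function name bloomIndex and the use of MD5 are illustrative assumptions:

import java.security.MessageDigest

object BloomIndices {
  // Map `data` to an index in [0, m) with hash function number `i`:
  // prefix the input with a one-byte "seed", digest it with MD5, interpret the
  // digest bytes as a non-negative integer, and reduce it modulo m.
  def bloomIndex(data: Array[Byte], i: Int, m: Int): Int = {
    val md = MessageDigest.getInstance("MD5")
    md.update(i.toByte)       // makes the i-th function differ from the others
    val digest = md.digest(data)
    (BigInt(1, digest) mod BigInt(m)).toInt
  }

  def main(args: Array[String]): Unit = {
    val m = 1000              // size of the Bloom filter's bit array
    val numHashes = 3
    val indices = (0 until numHashes).map(i => bloomIndex("example".getBytes("UTF-8"), i, m))
    println(indices)          // three indices in [0, 1000)
  }
}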