Generating k pairwise independent hash functions - Scala

I'm trying to implement a Count-Min Sketch algorithm in Scala, and so I need to generate k pairwise independent hash functions.
This is lower-level than anything I've ever programmed before, and I don't know much about hash functions beyond what I saw in algorithms classes, so my question is: how do I generate these k pairwise independent hash functions?
Am I supposed to use a hash function like MD5 or MurmurHash? Do I just generate k hash functions of the form f(x) = ax + b (mod p), where p is a prime and a and b are random integers? (i.e., the universal hashing family everyone learns in algorithms 101)
I'm looking more for simplicity than raw speed (e.g., I'll take something 5x slower if it's simpler to implement).
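For concreteness, here is a minimal sketch in Scala of the family described above; the prime p = 2^31 - 1, k = 10, and the reduction to w buckets are illustrative assumptions, not anything mandated by Count-Min Sketch:

import scala.util.Random

val p = 2147483647L // the Mersenne prime 2^31 - 1 (an assumed choice)
val k = 10
// k random (a, b) pairs with 1 <= a < p and 0 <= b < p
val coeffs: Array[(Long, Long)] = Array.fill(k) {
  ((1 + Random.nextInt(Int.MaxValue - 1)).toLong, Random.nextInt(Int.MaxValue).toLong)
}

// i-th hash of x (assumed to fit in 31 bits), reduced to w buckets
// for a width-w sketch row
def hash(i: Int, x: Long, w: Int): Int = {
  val (a, b) = coeffs(i)
  (((a * x + b) % p) % w).toInt
}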

Scala already has MurmurHash implemented (it's scala.util.MurmurHash). It's very fast and very good at distributing values. A cryptographic hash is overkill; you'd just take tens or hundreds of times longer than you need to. Just pick k different seeds to start with and, since MurmurHash is nearly cryptographic in quality, you'll get k largely independent hash codes. (In 2.10, you should probably switch to using scala.util.hashing.MurmurHash3; the usage is rather different, but you can still do the same thing with mixing.)
If you only need near values to be mapped to randomly far values, this will work. If you want to avoid collisions (i.e., if A and B collide using hash 1, they will probably not also collide using hash 2), then you'll need to go at least one more step and hash not the whole object but subcomponents of it, so there's an opportunity for the hashes to start out different.
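For instance, a minimal sketch of the seeding approach (the value of k and the use of stringHash are illustrative assumptions):

import scala.util.hashing.MurmurHash3
import scala.util.Random

val k = 10
// one seed per hash function; different seeds give largely independent hashes
val seeds: Array[Int] = Array.fill(k)(Random.nextInt())

// the i-th hash function: same algorithm, different seed
def hash(i: Int, s: String): Int = MurmurHash3.stringHash(s, seeds(i))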

Probably the simplest approach is to take some cryptographic hash function and "seed" it with different sequences of bytes. For most practical purposes, the results should be independent, as this is one of the key properties a cryptographic hash function should have (if you replace any part of a message, the hash should be completely different).
I'd do something like:
// for each 0 <= i < k, generate a sequence of random bytes
val k = 10
val randomSeeds: Array[Array[Byte]] = Array.fill(k) {
  val seed = new Array[Byte](16) // 16 random bytes per seed
  scala.util.Random.nextBytes(seed)
  seed
}

def hash(i: Int, value: Array[Byte]): Array[Byte] = {
  val dg = java.security.MessageDigest.getInstance("SHA-1")
  // "seed" the digest with a random value based on the index
  dg.update(randomSeeds(i))
  dg.digest(value)
  // if you need integer hash values, just take 4 bytes
  // of the result and convert them to an Int
}
Edit:
I don't know the precise requirements of the Count-Min Sketch; maybe a simple hash function would suffice, but it doesn't seem to be any simpler to implement than the approach above.
I suggested a cryptographic hash function, because there you have quite strong guarantees that the resulting hash functions will be very different, and it's easy to implement, just use the standard libraries.
On the other hand, if you have two hash functions of the form f1(x) = ax + b (mod p) and f2(x) = cx + d (mod p), then you can compute one from the other (without knowing x) using a simple linear formula: f2(x) = c * a^(-1) * (f1(x) - b) + d (mod p), where a^(-1) is the multiplicative inverse of a modulo p. This suggests that they aren't very independent, so you could run into unexpected problems here.

Related

Matlab: Conceptual difficulty in how to create multiple hash tables in Locality Sensitive Hashing

The key idea of Locality Sensitive Hashing (LSH) is that neighboring points v are more likely to be mapped to the same bucket, while points far from each other are more likely to be mapped to different buckets. Using random projection, if the database contains N samples each of dimension d, then theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension, denoted as g(v) = (h_1(v), h_2(v), ..., h_k(v)). So, for any vector point v, the point is mapped to a k-dimensional vector by the g-function. The hash code is then this vector of reduced length/dimension k and is regarded as a bucket. Now, to increase the probability of collision, theory says that we should have L such g-functions g_1, g_2, ..., g_L, each generated at random. This is the part that I do not understand.
Question : How to create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et al. (Link to Code)
In the file Coding.m
dim = size(X_train, 2);
R = randn(dim, bit);
% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >= 0);
X_query is the set of query data, each sample of dimension d, and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The outputs B_query and B_base are N strings of length k taking 0/1 values.
Does this way create multiple hash tables, i.e., is N the number of hash tables? I am confused as to how. A detailed explanation would be very helpful.
How to create multiple hash tables?
LSH creates a hash table using (amplified) hash functions formed by concatenation:
g(p) = [h_1(p), h_2(p), ..., h_k(p)], where each h_i is drawn uniformly at random from the family H
g() is a hash function and it corresponds to one hash table. So we map the data via g() to that hash table, and with high probability the close points will fall into the same bucket while the non-close points will fall into different buckets.
We do that L times, thus creating L hash tables. Note that every g() is (or at least should most likely be) different from the other g() hash functions.
Note: a large k means a larger gap between P1 and P2, while a small P1 means a larger L is needed to find the neighbors. A practical choice is L = 5 (or 6). Here P1 is (a lower bound on) the probability that two near points fall into the same bucket under a single hash function, and P2 is (an upper bound on) the probability that two far points do.
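To make the construction concrete, here is a minimal sketch of building L hash tables from random hyperplane projections (all names and parameter values are illustrative assumptions, not code from the paper):

import scala.util.Random
import scala.collection.mutable

val d = 128 // input dimension (assumed)
val k = 16  // bits per g-code
val L = 5   // number of hash tables

// each table gets its own k random hyperplanes
val planes: Array[Array[Array[Double]]] =
  Array.fill(L, k)(Array.fill(d)(Random.nextGaussian()))

// g(v) for table t: concatenate the k sign bits h_i(v) into one bucket id
def code(t: Int, v: Array[Double]): Int =
  (0 until k).foldLeft(0) { (acc, i) =>
    val dot = planes(t)(i).zip(v).map { case (a, b) => a * b }.sum
    (acc << 1) | (if (dot >= 0) 1 else 0)
  }

val tables = Array.fill(L)(mutable.Map.empty[Int, mutable.Buffer[Array[Double]]])

def insert(v: Array[Double]): Unit =
  for (t <- 0 until L)
    tables(t).getOrElseUpdate(code(t, v), mutable.Buffer.empty) += v

A query is hashed the same way, and only the points found in the matching bucket of each of the L tables are examined as candidate neighbors.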
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question; how about sqrt(N), where N is the number of points in the dataset? Check this: Number of buckets in LSH
The code of Yan Xia
I am not familiar with that code, but from what you said, I believe that the query data are 1000 in number because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash each query to see in which bucket of a hash table it will be mapped. The points inside that bucket are the potential (approximate) nearest neighbors.

Which hash functions are orthogonal to each other?

I'm interested in multi-level data integrity checking and correcting, where multiple error correcting codes are being used (they can be two of the same type of code). I'm under the impression that a system using two codes would achieve maximum effectiveness if the two hash codes being used were orthogonal to each other.
Is there a list of which codes are orthogonal to what? Or do you need to use the same hashing function but with different parameters or usage?
I expect that the first-level ECC will be a Reed-Solomon code, though I do not actually have control over this first function, hence I cannot use a single code with improved capabilities.
Note that I'm not concerned with encryption security.
Edit: This is not a duplicate of When are hash functions orthogonal to each other?, since that question essentially asks what the definition of orthogonal hash functions is. I want examples of hash functions that are orthogonal.
I'm not certain it is even possible to enumerate all orthogonal hash functions. However, you only asked for some examples, so I will endeavour to provide some as well as some intuition as to what properties seem to lead to orthogonal hash functions.
From a related question, these two functions are orthogonal to each other:
Domain: Reals --> Codomain: Reals
f(x) = x + 1
g(x) = x + 2
This is a pretty obvious case. It is easier to determine orthogonality if the hash functions are (both) perfect hash functions such as these are. Please note that the term "perfect" is meant in the mathematical sense, not in the sense that these should ever be used as hash functions.
It is a more or less trivial case for perfect hash functions to satisfy orthogonality requirements. Whenever the functions are injective they are perfect hash functions and are thus orthogonal. Similar examples:
Domain: Integers --> Codomain: Integers
f(x) = 2x
g(x) = 3x
Each of these is injective but not bijective: every element in the domain maps to exactly one element in the codomain, but there are many elements in the codomain that are not mapped to at all. These are still adequate for both perfect hashing and orthogonality. (Note that if the domain/codomain were the Reals, these would be bijections.)
Functions that are not injective are more tricky to analyze. However, it is always the case that if one function is injective and the other is not, they are not orthogonal:
Domain: Reals --> Codomain: Reals
f(x) = e^x // Injective -- every x produces a unique value
g(x) = x^2 // Not injective -- every number other than 0 can be produced by two different x's
So one trick is to know that one function is injective and the other is not. But what if neither is injective? I do not presently know of an algorithm for the general case that will determine this other than brute force.
Domain: Naturals --> Codomain: Naturals
j(x) = ceil(sqrt(x))
k(x) = ceil(x / 2)
Neither function is injective, in this case because the non-injective ceil is combined with sqrt or halving on a restricted domain. (In practice most hash functions will not have a domain more permissive than the integers.) Testing out values will show that j has non-unique results where k does not, and vice versa:
j(1) = ceil(sqrt(1)) = ceil(1) = 1
j(2) = ceil(sqrt(2)) = ceil(~1.41) = 2
k(1) = ceil(1 / 2) = ceil(0.5) = 1
k(2) = ceil(2 / 2) = ceil(1) = 1
But what about these functions?
Domain: Integers --> Codomain: Reals
m(x) = cos(x^3) % 117
n(x) = ceil(e^x)
In these cases, neither of the functions is injective (due to the modulus and the ceil), but when do they have a collision? More importantly, for which pairs of values of x do they both have a collision? These questions are hard to answer. I would suspect they are not orthogonal, but without a specific counterexample, I'm not sure I could prove it.
These are not the only hash functions you could encounter, of course. So the trick to determining orthogonality: first, see if both functions are injective; if so, they are orthogonal. Second, see if exactly one is injective; if so, they are not orthogonal. Third, see if you can spot the pieces of each function that make it non-injective, work out their periods or special cases (such as x = 0), and try to come up with counterexamples. Fourth, visit Math Stack Exchange and hope someone can tell you where they break orthogonality, or prove that they don't.
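When neither function is injective, one practical fallback is to brute-force a finite slice of the domain. A minimal Scala sketch, assuming the definition from the linked question (f and g are orthogonal iff no two distinct inputs collide under both simultaneously):

// returns true iff no distinct pair x != y has f(x) == f(y) && g(x) == g(y)
def orthogonalOn(f: Long => Long, g: Long => Long, domain: Seq[Long]): Boolean =
  !domain.combinations(2).exists { case Seq(x, y) => f(x) == f(y) && g(x) == g(y) }

val j = (x: Long) => math.ceil(math.sqrt(x.toDouble)).toLong
val k = (x: Long) => math.ceil(x / 2.0).toLong

println(orthogonalOn(j, k, 1L to 1000L))

Under this definition, the check immediately flags j and k above as non-orthogonal: j(3) = j(4) = 2 and k(3) = k(4) = 2, so both collide on the pair (3, 4).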

Does a string hash exist which can ignore the order of chars in this string

Does a string hash exist which can ignore the order of chars in the string? E.g., "helloword" and "wordhello" would map into the same bucket.
There are a number of different approaches you can take.
You can add the values of the characters together. (a + b + c is equal to a + c + b.) Unfortunately, this is the least desirable approach, since strings like "ac" and "bb" will generate the same hash value.

To reduce the possibility of hash code collisions, you can XOR the values together. (a ^ b ^ c is equal to a ^ c ^ b.) Unfortunately, this will not give a very broad distribution of random bits, so it will still give a high chance of collisions for different strings.

To reduce the possibility of hash code collisions even further, you can multiply the values of the characters together. (a * b * c is equal to a * c * b.)

If that's not good enough either, then you can sort all the characters in the string before applying the default string hashing function offered by whatever language you are using. (So both "helloword" and "wordhello" would become "dehlloorw" before hashing, thus generating the same hash code.) The only disadvantage of this approach is that it is computationally more expensive than the others.
Although the other suggestions of multiplying or adding characters would work, notice that such a hash function is not secure at all.
The reason is that it will introduce a ton of collisions, and one of the main properties a hash function should have is a low probability of collisions.
For example, a + b + c is the same as c + b + a. However, it is also the same as a + a + d, since the sums of the ASCII codes are the same (97 + 98 + 99 = 97 + 97 + 100 = 294). Similar unintended collisions arise when multiplying or XOR-ing the values.
In sum, you can achieve a hash function which ignores order, but it will introduce a ton of collisions, which will potentially make your program faulty and insecure.

Perfect hashtable for a static set of integers

I'm looking for a hash function which satisfies the following requirements:
N distinct integer values will be stored in the hashtable
At any given point in time there will be no more than M values present in the hashtable
Hashtable stays static for several queries (i.e. at some point the whole hashtable will be initialized and the following calls only read from the hash table)
The largest possible key value K is known at the initialization of the hashtable (K >> N)
Every queried key-value pair is present in the hashtable
So far I'm using a hash-function like:
h(k) = 7 * k % M
with M = PRIME_CLOSE_TO(7*N)
7 is somewhat arbitrary.
Do you have any suggestions on how to improve this?
This is a starting point: http://en.wikipedia.org/wiki/Perfect_hash_function
In practice, any ordinary hash function would be fine. But if you want a minimal perfect hash for some reason, you may look into a library that does perfect hashing, such as: CMPH - C Minimal Perfect Hashing Library
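Since the table is static, another option is to keep the h(k) = a * k % M shape but search for a multiplier a that happens to be collision-free on the actual key set. A minimal Scala sketch (an assumed approach, not CMPH; note that when M is only a small constant factor above N, such a multiplier may not exist, in which case a two-level scheme or a dedicated library is the way to go):

import scala.util.Random

// overflow in a * k just wraps, which is fine for hashing
def h(a: Long, k: Long, m: Long): Long = Math.floorMod(a * k, m)

// retry random odd multipliers until one is injective on the given keys
def findPerfectMultiplier(keys: Set[Long], m: Long): Long =
  Iterator
    .continually(Random.nextLong() | 1L)
    .find(a => keys.map(k => h(a, k, m)).size == keys.size)
    .get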

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function, besides generating a large set of values and looking at their distribution?
By efficiency I mean that the keys generated by your hash function distribute evenly. Is there a way to prove this without actually testing for actual values?
A hash function is only evenly distributed in the context of the data being hashed
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (e.g., mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set. However, apply it to the second set and there are collisions everywhere.
Hash = (x * 37) mod 256
This is much better for the second set but may not suit the first set quite so well, especially when partitioning the hash into, for example, a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should be able to gather enough real data well before the cost of rehashing becomes prohibitive, leaving you free to change your hash function.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data. Let's further suppose that the hash function is supposed to take byte streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream) hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random numbers in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set, and as a result it only generates hashes in the range 0-127; there is also a higher likelihood of some "hot" values because of the letter frequencies in English words and the cancelling effect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak against collisions due to how the English language is structured, something that only testing with real data can show.
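To see this kind of effect without waiting for production data, one can code both hashes and compare bucket usage on synthetic inputs. A minimal sketch in Scala (the sampling scheme is an assumption for illustration):

import scala.util.Random

def xorHash(bytes: Array[Byte]): Int =
  bytes.foldLeft(0)((h, b) => h ^ (b & 0xFF))

def primeHash(bytes: Array[Byte]): Int =
  bytes.foldLeft(37)((h, b) => (31 * h + (b & 0xFF)) % 256)

// random variable-length byte streams, as in the first scenario above
val samples = Seq.fill(100000) {
  val bs = new Array[Byte](1 + Random.nextInt(32))
  Random.nextBytes(bs)
  bs
}

// crude uniformity check: how many of the 256 possible buckets get used
def bucketsUsed(hash: Array[Byte] => Int): Int = samples.map(hash).distinct.size

println(s"xor: ${bucketsUsed(xorHash)}, prime: ${bucketsUsed(primeHash)}")

Feeding US-ASCII text through the same harness instead of random bytes makes the weakness visible: the XOR hash never sets the top bit, so at most 128 of its buckets can ever be used.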