Is it possible to implement universal hashing for the complete range of integers?

I am reading about universal hashing on integers. The mandatory precondition seems to be that we choose a prime number p greater than the maximum possible key.
I am not clear on this point.
If our set of keys is of type int, this means that the prime number needs to be of the next bigger data type, e.g. long.
But eventually whatever we get as the hash would need to be down-cast to an int to index the hash table. Doesn't this down-casting somehow affect the quality of the universal hashing (I am referring to the distribution of the keys over the buckets)?

If our set of keys are integers then this means that the prime number
needs to be of the next bigger data type e.g. long.
That is not a problem; sometimes it is even necessary, otherwise the hash family cannot be universal. See below for more information.
But eventually whatever we get as the hash would need to be
down-casted to an int to index the hash table.
Doesn't this down-casting affect the quality of the Universal Hashing
(I am referring to the distribution of the keys over the buckets)
somehow?
The answer is no. I will try to explain.
Whether p has another data type or not is not important for the hash family to be universal. What matters is that p is greater than or equal to u, the size of the universe of keys (and therefore larger than every key). In other words, p has to be big enough (i.e. >= u).
A hash family is universal when the collision probability of any two distinct keys is at most 1/m.
So the idea is to hold that constraint.
The value of p, in theory, can be as big as a long or more. It just needs to be an integer and prime.
u is the size of the domain/universe (or the number of keys). Given the universe U = {0, ..., u-1}, u denotes the size |U|.
m is the number of bins or buckets
p is a prime which must be greater than or equal to u
the hash family is defined as H = {h_(a,b)} with h_(a,b)(x) = ((a * x + b) mod p) mod m. Here a is chosen uniformly at random from {1, ..., p-1} and b from {0, ..., p-1} (so both are integers modulo p, which can make them either smaller or larger than m, the number of bins/buckets); here too the data type (i.e. the domain of values) does not matter. See Hashing integers on Wikipedia for the notation.
Follow the proof on Wikipedia and you conclude that the collision probability is at most floor(p/m) * 1/(p-1), where floor(.) means truncating the decimals. For p >> m (p considerably bigger than m) this bound tends to 1/m (but this does not mean that the probability gets better the larger p is).
In other terms, answering your question: p being of a bigger data type is not a problem here and can even be required. p has to be greater than or equal to u, and a and b have to be randomly chosen integers modulo p (with a != 0), no matter the number of buckets m. With these constraints you can construct a universal hash family.
Maybe a mathematical example could help
Let U be the universe of integers that correspond to unsigned char (in C for example). Then U = {0, ..., 255}
Let p be the next prime greater than or equal to 256. Note that p could be stored in any of these types (short, int, long, signed or unsigned). The point is that the data type does not play a role (in programming, a type mainly denotes a domain of values). Whether 257 is a short, an int or a long does not matter for the correctness of the mathematical proof. Also, we could have chosen a larger p (i.e. a bigger data type); this does not change the proof's correctness.
The next possible prime number would be 257.
We say we have 25 buckets, i.e. m = 25. This means a hash family is universal if the collision probability is at most 1/25 = 0.04.
Plug in the values for floor(p/m) * 1/(p-1): floor(257/25) * 1/256 = 10/256 = 0.0390625, which is smaller than 0.04. It is a universal hash family with the chosen parameters.
We could have chosen m = u = 256 buckets. Then the bound is floor(257/256) * 1/256 = 1/256 = 0.00390625, which is exactly 1/m, so the hash family is still universal.
Let's try with m bigger than p, e.g. m = 300. The bound becomes floor(257/300) * 1/256 = 0, which is smaller than 1/300 ≈ 0.0033. Trivially universal: we have more buckets than possible keys, so there are no collisions at all.
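If you want to check this empirically, here is a small Python sketch (my own illustration, not part of the proof) that enumerates every admissible (a, b) pair for p = 257 and m = 25 and counts how often a fixed pair of distinct keys collides; the observed fraction stays below 1/m:

# Brute-force check of the bound for h_(a,b)(x) = ((a*x + b) mod p) mod m:
# for one fixed pair of distinct keys, count collisions over all (a, b).
p, m = 257, 25            # prime >= u = 256, and 25 buckets
x, y = 10, 200            # two distinct keys from U = {0, ..., 255}
collisions = total = 0
for a in range(1, p):             # a drawn from {1, ..., p-1}
    for b in range(p):            # b drawn from {0, ..., p-1}
        total += 1
        if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
            collisions += 1
print(collisions / total, "<=", 1 / m)   # observed collision probability vs. 1/m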
Implementation detail example
We have the following:
x of type int (an element of U)
a, b, p of type long
m we'll see later in the example
Choose p so that it is greater than or equal to u (i.e. bigger than every element of U); p is of type long.
Choose a and b randomly (modulo p, with a != 0). They are of type long, but always < p.
For an x (of type int from U) calculate ((a*x+b) mod p). a*x is of type long, (a*x+b) is also of type long, and so ((a*x+b) mod p) is also of type long. Note that the result of ((a*x+b) mod p) is < p. Let's denote that result h_a_b(x).
h_a_b(x) is now taken modulo m, which means that at this step it depends on the data type of m whether there will be downcasting or not. However, it does not really matter: the value h_a_b(x) mod m is < m, because we take it modulo m. Hence it fits into m's data type, and if it has to be down-cast there is no loss of value. And so you have mapped a key to a bin/bucket.
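To make these steps concrete, here is a minimal Python sketch of the same mapping (Python integers are arbitrary precision, so the int/long distinction disappears; the helper name make_hash is my own):

import random

# Universal hash family h_(a,b)(x) = ((a*x + b) mod p) mod m for the
# universe U = {0, ..., 255} (unsigned char), following the steps above.
u = 256     # size of the universe
p = 257     # a prime >= u
m = 25      # number of buckets

def make_hash():
    a = random.randrange(1, p)    # a in {1, ..., p-1}
    b = random.randrange(p)       # b in {0, ..., p-1}
    def h(x):
        # the result is always in {0, ..., m-1}, so it fits any bucket-index type
        return ((a * x + b) % p) % m
    return h

h = make_hash()
print(h(42), h(255))              # two valid bucket indices, both < m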

Related

Hashing using division method

For the hash function : h(k) = k mod m;
I understand that m = 2^n will always give just the n least significant bits of k. I also understand that m = 2^p - 1, when K is a string converted to an integer using radix 2^p, will give the same hash value for every permutation of the characters in K. But why exactly is "a prime not too close to an exact power of 2" a good choice? What if I choose 2^p - 2 or 2^p - 3? Why are these choices considered bad?
Following is the text from CLRS:
"A prime not too close to an exact power of 2 is often a good choice for m. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly n D 2000 character strings, where a character has 8 bits.
We don’t mind examining an average of 3 elements in an unsuccessful search, and
so we allocate a hash table of size m D 701. We could choose m D 701 because
it is a prime near 2000=3 but not near any power of 2."
Suppose we work with radix 2^p.
The 2^p - 1 case:
Why is it a bad idea to use 2^p - 1? Let us see,
k = ∑ a_i 2^(ip)
and if we take k modulo 2^p - 1, then since 2^p ≡ 1 (mod 2^p - 1) we just get
k = ∑ a_i 2^(ip) ≡ ∑ a_i (mod 2^p - 1)
so, as addition is commutative, we can permute the digits and get the same result.
The 2^p - b case:
Quote from CLRS:
A prime not too close to an exact power of 2 is often a good choice for m.
k = ∑ a_i 2^(ip) ≡ ∑ a_i b^i (mod 2^p - b), because 2^p ≡ b (mod 2^p - b).
So changing the least significant digit by one will change the hash by one, and changing the second least significant digit by one will change the hash by b. To really change the hash we would need to change digits of higher significance. So, for small b we face a problem similar to the case where m is a power of 2, namely that we depend on the distribution of the least significant digits.
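To see the 2^p - 1 effect in practice, here is a small Python sketch (my own illustration): characters are treated as digits in radix 2^p = 256, and the string is hashed with m = 255 = 2^8 - 1 versus m = 701 (the prime from the CLRS example):

# Interpret a string as a number in radix 2^p with p = 8 (one digit per character).
def radix_value(s, radix=256):
    v = 0
    for ch in s:
        v = v * radix + ord(ch)
    return v

for m in (255, 701):   # 255 = 2^8 - 1 (bad choice), 701 = prime not near a power of 2
    print(m, [radix_value(s) % m for s in ("stop", "pots", "tops")])
# With m = 255 all permutations of the same characters collide;
# with m = 701 they generally do not (here: three different values).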

Matlab : Conceptual difficulty in How to create multiple hash tables in Locality sensitive Hashing

The key idea of Locality Sensitive Hashing (LSH) is that neighboring points v are more likely
to be mapped to the same bucket, while points far from each other are more likely to be mapped to different buckets. Using random projection, if the database contains N samples, each of high dimension d, then the theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension, denoted as g(v) = (h_1(v), h_2(v), ..., h_k(v)). So, for any vector point v, the point is mapped to a k-dimensional vector by the g-function. Then the hash code is the vector of reduced length/dimension k and is regarded as a bucket. Now, to increase the probability of collision, the theory says that we should have L such g-functions g_1, g_2, ..., g_L, chosen at random. This is the part that I do not understand.
Question : How to create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et al. (Link to Code)
In the file Coding.m
dim = size(X_train, 2);
R = randn(dim, bit);
% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >=0);
X_query is the set of query data, each of dimension d, and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The outputs B_query and B_base are N binary strings of length k (0/1 values).
Does this way create multiple hash tables i.e. N is the number of hash tables? I am confused as to how. A detailed explanation will be very helpful.
How to create multiple hash tables?
LSH creates one hash table per (amplified) hash function, where each amplified function is a concatenation:
g(p) = [h_1(p), h_2(p), ..., h_k(p)], with each h_i drawn at random from the family H
g() is a hash function and it corresponds to one hash table. So we map the data via g() to that hash table and, with higher probability, the close points will fall into the same bucket while the non-close points will fall into different buckets.
We do that L times, thus we create L hash tables. Note that every g() should (with high probability) be different from the other g() hash functions.
Note: large k ⇒ larger gap between P1 and P2; small P1 ⇒ larger L needed to find the neighbors. A practical choice is L = 5 (or 6). Here P1 is the minimum probability that two nearby points collide under a single hash function h, and P2 is the maximum probability that two far-apart points collide.
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question, how about sqrt(N) where N is the number of points in the dataset. Check this: Number of buckets in LSH
The code of Yan Xia
I am not familiar with that code, but from what you say, I believe the query data are 1000 in number because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash the query to see in which bucket of a hashtable it will be mapped. The points inside that bucket are potential (approximate) Nearest Neighbors.
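Here is a minimal Python sketch of that construction (the function and variable names are mine, not taken from the Yan Xia code): L tables, each indexed by a k-bit code obtained from its own random projection matrix, just like B_base in the snippet above but repeated L times:

import numpy as np
from collections import defaultdict

# Build L hash tables; table i is indexed by the k-bit code
# g_i(v) = [sign(v . r_1), ..., sign(v . r_k)] from its own random matrix R_i.
def build_tables(X, k=16, L=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    projections = [rng.standard_normal((d, k)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for R, table in zip(projections, tables):
        codes = (X @ R >= 0)                   # n x k binary codes (like B_base)
        for idx, code in enumerate(codes):
            table[code.tobytes()].append(idx)  # bucket key = the k-bit code
    return projections, tables

def query(q, projections, tables):
    # Union of the L buckets the query falls into (one bucket per table);
    # these are the candidate (approximate) nearest neighbors to re-rank exactly.
    candidates = set()
    for R, table in zip(projections, tables):
        candidates.update(table.get((q @ R >= 0).tobytes(), []))
    return candidates

# Usage sketch:
# X = np.random.randn(10000, 128)
# projs, tabs = build_tables(X)
# cands = query(X[0], projs, tabs)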

Why should the string length be greater than or equal to the number of states in the pumping lemma?

If L is a regular language, then there exists a constant n (which depends on L) such that every string w in the language L whose length is greater than or equal to n can be divided into three strings, w = xyz.
Here |w| is the length of the string and n is the number of states.
Why should we pick w with length greater than or equal to n?
And what is the pumping length?
If you look at the complete statement of the lemma (http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages), you can see that it actually states that every sufficiently long string in the language is formed by a prefix x, a part y that can be repeated any number of times, and a suffix z. Now it is obvious that, in the shortest case (when the repeating part is taken only once), the length of w equals the number of states needed for the language. This Wikipedia image is very useful:
http://en.wikipedia.org/wiki/File:Pumping-Lemma_xyz_svg.svg
You seem to be misunderstanding the lemma (which you also have not stated completely), and mixing aspects of a proof with what you did state. The lemma says that for every regular language L, there is a constant p such that every string of at least p symbols that belongs to L has a non-empty substring of length no greater than p that can be "pumped", always yielding another element of L. The constant p is the (a) "pumping length".
This can be proved by observing that if a language is regular then there is a finite state automaton that accepts it, and taking p to be the number of states in that automaton (details omitted).
That does not imply, however, that the number of states in the smallest FSA that recognizes a given regular language is the smallest possible pumping length for that language. For instance, consider the language consisting of the union of { a^n : n >= 0 } and { b^n : n >= 0 }. You need a four-state FSA to recognize this language, but its minimum pumping length is 1.
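To make the last claim concrete (my own worked example): take L = { a^n : n >= 0 } ∪ { b^n : n >= 0 } and pumping length p = 1. Any w in L with |w| >= 1 is a^n or b^n with n >= 1; write w = xyz with x = ε, y = the first symbol and z = the rest. Then |xy| = 1 <= p, |y| >= 1, and x y^i z = a^(n-1+i) (or b^(n-1+i)) stays in L for every i >= 0, so p = 1 works. Yet the minimal DFA needs four states: the start state, one state for "seen only a's", one for "seen only b's", and a dead state.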

Expected chain length after rehashing - Linear Hashing

There is one point of confusion I have about the load factor. Some sources say that it is just the number of keys in the hash table divided by the total number of slots, which is the same as the expected chain length for each slot. But that is only under simple uniform hashing, right?
Suppose hash table T has n elements and we expand T into T1 by redistributing the elements in slot T[0], rehashing them using h'(k) = k mod 2m. The hash function of T1 is h(k) = k mod 2m if h(k) < 1 and k mod m if h(k) >= 1. Many sources say that we "expand and rehash to maintain the load factor" (does this imply the expected chain length is still the same?). Since this is not simple uniform hashing, I think the probability that any key k enters a slot is (1/4 + 1/(2(m-1))).
For a randomly selected key k, h(k) is first evaluated (there is a 50-50 chance whether it is less than 1 or greater than or equal to 1). If it is less than 1, key k has just two options, slot 0 or slot m, hence probability 1/4 (1/2 * 1/2). But if it is greater than or equal to 1, it has m-1 slots and could enter any of them, hence probability 1/2 * 1/(m-1). So the expected chain length would now be n/4 + n/(2(m-1)). Am I on the right track?
The calculation for linear hashing should be the same as for "non-linear" hashing. With a certain initial number of buckets, a uniform distribution of hash values results in uniform placement. With enough expansions to double the size of the table, each of those values is randomly split over the larger space via the incremental re-hashing, and new values are also distributed over the larger space. Incrementally, each key is equally likely to end up at (initial bucket position) or at (initial bucket position + initial table size) as the table expands to that length.
There is a paper here which goes into detail about the chain length calculation under different circumstances (not just the average), specifically for linear hashing.
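If you want to see this numerically, here is a small Python simulation sketch (all names are mine): it inserts random keys with h0(k) = k mod m, performs one full round of linear splits with h1(k) = k mod 2m, and reports how uniform the chains are afterwards:

import random
from collections import defaultdict

# Linear-hashing simulation: n keys into m buckets with h0(k) = k mod m,
# then split buckets 0, 1, ..., m-1 in order using h1(k) = k mod 2m.
# After the full doubling the chains should again be close to n / (2m).
def simulate(n=100_000, m=16, seed=1):
    random.seed(seed)
    buckets = defaultdict(list)
    for _ in range(n):
        k = random.getrandbits(32)
        buckets[k % m].append(k)
    for j in range(m):                       # split bucket j
        for k in buckets.pop(j, []):
            buckets[k % (2 * m)].append(k)   # key lands in bucket j or bucket j + m
    lengths = [len(chain) for chain in buckets.values()]
    print("expected chain length:", n / (2 * m))
    print("min / max observed chain length:", min(lengths), "/", max(lengths))

simulate()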

Difference bloom filters and FM-sketches

What is the difference between bloom filters and hash sketches (also FM-sketches) and what is their use?
Hash sketches/Flajolet-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic counting algorithms for data base applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./Flajolet, P. (2003): Loglog Counting of Large Cardinalities, in: Springer LNCS 2832, Algorithms ESA 2003, pp. 605–617.
Hash sketches are used to count the number of distinct elements in a set.
given:
a bit array B[] of length l
a (single) hash function h() that maps to {0, 1, ..., 2^l - 1}
a function r() that gives the position (0-indexed) of the least-significant 1-bit in the binary representation of its input (e.g. 000101 returns 0, 001000 returns 3)
insertion of element x:
pn := h(x) returns a pseudo-random number
apply r(pn) to get the position of the bit array to set to 1
since the output of h() is pseudo-random, every bit i is set to 1 roughly n/(2^(i+1)) times (over n distinct insertions)
number of distinct elements in the set:
find the position p of the lowest-index 0 in the bit array (the right-most 0 when the array is written like a binary number)
p ≈ log2(n); solve for n (i.e. n ≈ 2^p) to get the approximate number of distinct elements in the set;
the result might be off by up to 1.83 orders of magnitude
usage:
in Data Mining, P2P/distributed applications, estimation of the document frequency, etc.
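A compact Python sketch of the counting procedure described above (the choice of SHA-1 as the pseudo-random hash and the lack of bias correction and averaging are simplifications of mine; the real FM algorithm combines many sketches and applies a correction constant):

import hashlib

NBITS = 32
bitmap = 0                          # the bit array B[], kept as an integer bit mask

def h(x):
    # pseudo-random 32-bit hash of x (SHA-1 truncated; an illustrative choice)
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")

def r(v):
    # position of the least-significant 1-bit, 0-indexed (r(0) treated as NBITS)
    return (v & -v).bit_length() - 1 if v else NBITS

for x in range(10_000):             # insert 10,000 distinct elements
    bitmap |= 1 << r(h(x))

p = 0                               # lowest bit position that is still 0
while bitmap & (1 << p):
    p += 1
print("estimate:", 2 ** p, "true count:", 10_000)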
Bloom filters
Bloom, B. H. (1970): Space/time trade-offs in hash coding with allowable errors, in: Communications of the ACM, Vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to test whether an element is a member of a set.
given:
a bit array B[] of length m
k different hash functions h_k() that map to [0, ..., m-1], i.e. to one of the positions of the m-bit array
insertion of element x:
apply h_k to x (h_k(x)), for all k, i.e. you get k values
set the resulting bits in the array B to 1 (if already set to 1, don't change anything)
check if y is already in the set:
get the positions p_k to check using all the hash functions h_k (h_k(y)), i.e. for each function h_k you get a position p_k
if one of the positions p_k is set to 0 in the array B, the element y is definitively not in the set
if all positions given by p_k are 1, the element y might (!) be in the set
false positive rate is approximately (1 - e^(-kn/m))^k, no false negatives are possible!
up to a point, increasing the number of hash functions decreases the false positive rate; however, at the same time your Bloom filter gets slower; the optimal value of k is k = (m/n) ln(2)
usage:
in the beginning used as a cheap filter in databases to filter out elements that do not match a query
various applications today, e.g. in Google BigTable, but also in networking for IP lookups, etc.
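A minimal Python sketch of the scheme just described (deriving the k positions from two hash values, i.e. double hashing, is an implementation convenience of mine):

import hashlib

# Minimal Bloom filter: an m-bit array and k positions per element.
class BloomFilter:
    def __init__(self, m=10_000, k=7):
        self.m, self.k = m, k
        self.bits = bytearray(m)             # B[], one byte per bit for simplicity

    def _positions(self, x):
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, x):
        # any 0 bit -> definitely not in the set; all bits 1 -> might be in the set
        return all(self.bits[pos] for pos in self._positions(x))

bf = BloomFilter()
for w in ("foo", "bar", "baz"):
    bf.add(w)
print("bar" in bf, "qux" in bf)              # True, (almost certainly) False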
The Bloom filter is a data structure used for membership lookups, while the FM sketch is primarily used for counting distinct elements. Both data structures trade accuracy of the result for a large reduction in the space required to perform the lookup/computation.