I need to hash a sequence S of n^2 numbers, where each number is the sum of two numbers, one taken from each of the sequences {x_1, ..., x_n} and {y_1, ..., y_n}.
I am using universal hashing, so the question is: how do I find the universe of keys U if there are infinitely many possibilities for the members of S?
For example, 20986 and 96208 should generate the same key (but not 09862 or 9862: a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest or largest sorted permutation of the digits and use the sorted number as the hash key, but sorting is too costly for my case. I need to generate the key in O(1) time.
Another idea is to traverse the number, count the frequency of each digit, and then derive a hash function from those counts. What is the best function to combine the frequencies, given that 0 <= Summation(f[i]) <= no_of_digits?
To create an order-insensitive hash, simply hash each value (in your case, the digits of the number) and then combine the results using a commutative function (e.g. addition, multiplication or XOR). XOR keeps the hash output at a constant size and is very fast, but note that it cancels any digit that occurs an even number of times, so addition modulo the word size is safer when digits can repeat.
Also, you will want to strip away any leading 0's before hashing the number.
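A minimal sketch of that idea in Python, under a few assumptions of mine: the per-digit hash is a table of fixed pseudo-random 64-bit constants (DIGIT_HASH and the seed are made up for illustration), and the values are combined by addition modulo 2^64 rather than XOR so that repeated digits do not cancel. Taking the input as an integer drops leading zeros automatically. The cost is one table lookup per digit, which is constant only in the sense that a fixed-width number has a bounded digit count.

```python
import random

MASK = (1 << 64) - 1

# One fixed pseudo-random 64-bit value per decimal digit (the seed is arbitrary).
_rng = random.Random(2024)
DIGIT_HASH = [_rng.getrandbits(64) for _ in range(10)]


def digit_multiset_hash(n: int) -> int:
    """Hash a non-negative integer so that every reordering of its digits collides."""
    if n == 0:
        return DIGIT_HASH[0]
    h = 0
    while n > 0:
        n, d = divmod(n, 10)
        # Addition is commutative, so the order in which digits are visited is irrelevant.
        h = (h + DIGIT_HASH[d]) & MASK
    return h


same = digit_multiset_hash(20986) == digit_multiset_hash(96208)
different = digit_multiset_hash(20986) != digit_multiset_hash(9862)
```

Here 20986 and 96208 share a key, while 9862 (the form 09862 takes once its leading zero is stripped) gets a different one because it lacks the digit 0.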
The key idea of locality-sensitive hashing (LSH) is that neighboring points are more likely to be mapped to the same bucket, while points far from each other are more likely to be mapped to different buckets. With random projection, if the database contains N samples, each of dimension d, then theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension, combined as g(v) = (h_1(v), h_2(v), ..., h_k(v)). So any point v is mapped to a k-dimensional vector by a g-function. The hash code is then this vector of reduced length k, and it is regarded as a bucket. Now, to increase the probability of collision, theory says that we should draw L such g-functions g_1, g_2, ..., g_L at random. This is the part that I do not understand.
Question : How to create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et al. (Link to Code)
In the file Coding.m
dim = size(X_train, 2);
R = randn(dim, bit);
% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >=0);
X_query is the set of query data, each sample of dimension d, and there are 1000 query samples; R is the random projection matrix and bit is the target reduced dimensionality. The outputs B_query and B_base are N strings of length k taking 0/1 values.
Does this create multiple hash tables, i.e. is N the number of hash tables? I am confused as to how. A detailed explanation would be very helpful.
How to create multiple hash tables?
LSH creates a hash table using (amplified) hash functions built by concatenation:
g(p) = [h_1(p), h_2(p), ..., h_k(p)], h_i ∈_R H (each h_i is drawn at random from the family H)
g() is a hash function and it corresponds to one hash table. So we map the data via g() to that hash table, and with high probability the close points will fall into the same bucket and the non-close ones will fall into different buckets.
We do that L times, thus creating L hash tables. Note that every g() will most likely be different from the other g() hash functions.
Note: Large k ⇒ larger gap between P1 and P2. Small P1 ⇒ larger L, so as to find neighbors. A practical choice is L = 5 (or 6). P1 and P2, defined in the image below, are the probabilities that a single h maps two near points, respectively two far points, to the same value.
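To make the "L tables, one g() each" picture concrete, here is a small pure-Python sketch. It assumes sign-of-random-projection hashes (the same kind of bit the MATLAB snippet in the question computes with X*R >= 0); the function names make_g, build_tables and candidates are mine, not from any library or from the paper's code. Each g() yields a k-bit tuple, so with this family one table can hold at most 2^k non-empty buckets.

```python
import random


def make_g(dim, k, rng):
    """One amplified hash g: k random hyperplanes, each contributing one sign bit."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def g(v):
        return tuple(
            int(sum(a * b for a, b in zip(plane, v)) >= 0) for plane in planes
        )

    return g


def build_tables(points, dim, k, L, seed=0):
    """Build L hash tables; table j maps the bucket key g_j(p) to a list of point indices."""
    rng = random.Random(seed)
    gs = [make_g(dim, k, rng) for _ in range(L)]
    tables = [dict() for _ in range(L)]
    for idx, p in enumerate(points):
        for g, table in zip(gs, tables):
            table.setdefault(g(p), []).append(idx)
    return gs, tables


def candidates(query, gs, tables):
    """Union of the buckets the query falls into, one bucket per table."""
    found = set()
    for g, table in zip(gs, tables):
        found.update(table.get(g(query), []))
    return found


points = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [-1.0, -0.2, -0.7]]
gs, tables = build_tables(points, dim=3, k=4, L=5)
near = candidates(points[0], gs, tables)
```

Read this way, a single projection matrix R with bit columns corresponds to one g() and therefore one table; getting L tables would mean drawing L independent matrices.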
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question. How about sqrt(N), where N is the number of points in the dataset? Check this: Number of buckets in LSH
The code of Yan Xia
I am not familiar with that, but from what you said, I believe that the query data you see are 1000 in number, because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash the query to see in which bucket of a hashtable it will be mapped. The points inside that bucket are potential (approximate) Nearest Neighbors.
What is the difference between bloom filters and hash sketches (also FM-sketches) and what is their use?
Hash sketches/Flajolet-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic counting algorithms for data base applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./Flajolet, P. (2003): Loglog Counting of Large Cardinalities, in: Springer LNCS 2832, Algorithms ESA 2003, pp. 605–617.
Hash sketches are used to count the number of distinct elements in a set.
given:
a bit array B[] of length l
a (single) hash function h() that maps to [0, 1, ..., 2^l)
a function r() that gives the position of the least-significant 1-bit in the binary representation of its input, counting positions from 0 (e.g. 000101 returns 0, 001000 returns 3)
insertion of element x:
pn := h(x) returns a pseudo-random number
apply r(pn) to get the position of the bit array to set to 1
since output of h() is pseudo-random every bit i is set to 1 ~n/(2^(i+1)) times
number of distinct elements in the set:
find the position p of the lowest-index (least-significant) 0 in the bit array
p ≈ log2(n); solve for n (n ≈ 2^p) to estimate the number of distinct elements in the set;
the result might be up to 1.83 magnitudes off
usage:
in Data Mining, P2P/distributed applications, estimation of the document frequency, etc.
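The insertion and estimation steps above can be sketched as follows. This is an illustration only: h() is built here from SHA-256 (my choice, any well-mixing hash would do), bit positions are counted from 0, and the estimate divides 2^p by the correction factor 0.77351 from the Flajolet-Martin paper. A single bitmap like this is noisy; the paper averages over many bitmaps to tighten the estimate.

```python
import hashlib


def h(x, l=32):
    """Pseudo-random value in [0, 2^l) derived from SHA-256 of the element."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << l)


def r(v, l=32):
    """0-based position of the least-significant 1-bit of v (l if v has no 1-bit)."""
    if v == 0:
        return l
    return (v & -v).bit_length() - 1


class FMSketch:
    def __init__(self, l=32):
        self.l = l
        self.bits = [0] * l

    def add(self, x):
        pos = r(h(x, self.l), self.l)
        if pos < self.l:
            self.bits[pos] = 1  # setting an already-set bit changes nothing

    def estimate(self):
        # p is the position of the lowest-index 0 in the bit array.
        p = next((i for i, b in enumerate(self.bits) if b == 0), self.l)
        return 2 ** p / 0.77351


sketch = FMSketch()
for i in range(1000):
    sketch.add(i)
    sketch.add(i)  # duplicates leave the sketch unchanged
approx_distinct = sketch.estimate()
```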
Bloom filters
Bloom, H. (1970): Space/time trade-offs in hash coding with allowable errors, in: Communications of the ACM, Vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to test whether an element is a member of a set.
given:
a bit array B[] of length m
k different hash functions h_1(), ..., h_k() that map to [0,...,m-1], i.e. to one of the positions of the m-bit array
insertion of element x:
apply every h_i to x (h_i(x)), for i = 1, ..., k, i.e. you get k values
set the resulting bits in the array B to 1 (if already set to 1, don't change anything)
check if y is already in the set:
get the positions p_i to check using all the hash functions (p_i = h_i(y)), i.e. for each function h_i you get a position p_i
if one of the positions p_i is set to 0 in the array B, the element y is definitely not in the set
if all positions p_i are set to 1, the element y might (!) be in the set
false positive rate is approximately (1 - e^(-kn/m))^k, no false negatives are possible!
by increasing the number of hash functions up to the optimum, the false positive rate can be decreased; however, at the same time your bloom filter gets slower, and beyond the optimum the rate rises again because the array fills up; the optimal value of k is k = (m/n)ln(2)
usage:
in the beginning used as a cheap filter in databases to filter out elements that do not match a query
various applications today, e.g. in Google BigTable, but also in networking for IP lookups, etc.
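The Bloom filter steps above can be sketched as follows. The k hash functions are simulated here by salting SHA-256 with the function index, which is an assumption of this example and not part of Bloom's paper; the class and helper names are likewise made up for illustration.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Simulate k hash functions by prefixing the item with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "possibly in the set".
        return all(self.bits[p] for p in self._positions(item))


def optimal_k(m, n):
    """k = (m/n) ln 2, rounded to a usable integer."""
    return max(1, round((m / n) * math.log(2)))


def false_positive_rate(m, n, k):
    """Approximation (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1000, k=optimal_k(1000, 100))
items = [f"item-{i}" for i in range(100)]
for item in items:
    bf.add(item)
```

With m = 1000 bits and n = 100 elements, the formula gives k = 7 and a false positive rate of a little under 1%.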
The Bloom filter is a data structure used for membership lookup, while the FM sketch is primarily used for counting distinct elements. Both optimize the space required to perform the lookup or computation, and the trade-off is the accuracy of the result.
Typically, we do hashing by calculating the integer or string according to a rule, then return hash(int-or-str) % m as the index in the hash table, but how do we choose the modulo m? Is there any convention to follow?
There are two possible conventions. One is to use a prime number, which yields good performance with quadratic probing.
The other is to use a power of two, since n mod m where m = 2^k is a fast operation; it's a bitwise AND with m-1. Of course, the modulus must be equal to the size of the hash table, and powers of two mean your hash table must double in size whenever it's overcrowded. This gives you amortized O(1) insertion in a similar way that a dynamic array does.
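A small Python illustration of the power-of-two case (index_pow2 is a name made up for this example). It also shows the catch: with m = 2^k only the low k bits of the hash matter, so hashes that differ only in their high bits all collide, whereas a prime modulus spreads them out.

```python
def index_pow2(h: int, m: int) -> int:
    """Table index for a power-of-two table size m, using a bitwise AND."""
    assert m > 0 and m & (m - 1) == 0, "m must be a power of two"
    return h & (m - 1)


hashes = (0x10, 0x20, 0x30)          # differ only above the low 4 bits
pow2_slots = [index_pow2(h, 16) for h in hashes]
prime_slots = [h % 17 for h in hashes]
print(pow2_slots)   # [0, 0, 0]
print(prime_slots)  # [16, 15, 14]
```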
Since [val modulo m] is used as an index into the table, m is the number of elements in that table. Are you free to choose that? Then use a big enough prime number. If you need to resize the table, you can either choose a bigger prime number, or (if you choose to double the table when resizing) you'd better make sure that your hash function has enough entropy in the lower bits.
I'm trying to write a generator that produces Pearson perfect hashes. Note that I don't need a minimal perfect hash. Wikipedia says that a Pearson perfect hash can be found in O(|S|) time using a randomized algorithm (where S is the set of keys). However, I haven't been able to find such an algorithm online. Is this even possible?
Note: I don't want to use gperf/cmph/etc., I'd rather write my own implementation.
Pearson's original paper outlines an algorithm to construct a permutation table T for perfect hashing:
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between −10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31 in alphabetic order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process:
1. A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
2. One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
3. For each word, the first candidate considered for exchange was T[h[n − 1] ⊕ C[n]], the last table element referenced in the computation of the hash function for that word.
4. A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
5. If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n − 2] ⊕ C[n − 1]].
The procedure is not always successful. For example, using the ASCII character codes, if the word “a” hashes to 0 and the word “i” hashes to 15, it turns out that the word “in” must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).
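If a non-minimal perfect hash is all you need, a much simpler randomized approach than the exchange procedure quoted above is to reshuffle T until the keys stop colliding. The sketch below is that plain retry search, not Pearson's procedure and not necessarily the algorithm Wikipedia has in mind; find_perfect_table, the seed and the retry limit are made up for the example. It is only practical for small key sets, since the chance that a random table is collision-free drops quickly as the number of keys approaches 256.

```python
import random


def pearson_hash(key: bytes, table: list) -> int:
    """Pearson's 8-bit hash: h = T[h XOR c] for each byte c of the key."""
    h = 0
    for c in key:
        h = table[h ^ c]
    return h


def find_perfect_table(keys, seed=0, max_tries=10_000):
    """Shuffle the permutation table until all keys hash to distinct values."""
    rng = random.Random(seed)
    table = list(range(256))
    for _ in range(max_tries):
        rng.shuffle(table)
        if len({pearson_hash(k, table) for k in keys}) == len(keys):
            return list(table)
    raise RuntimeError("no collision-free table found within max_tries")


keys = [b"a", b"and", b"are", b"as", b"at", b"be", b"but", b"by"]
T = find_perfect_table(keys)
slots = [pearson_hash(k, T) for k in keys]
```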