Pearson perfect hashing - hash

I'm trying to write a generator that produces Pearson perfect hashes. Note that I don't need a minimal perfect hash. Wikipedia says that a Pearson perfect hash can be found in O(|S|) time using a randomized algorithm (where S is the set of keys). However, I haven't been able to find such an algorithm online. Is this even possible?
Note: I don't want to use gperf/cmph/etc., I'd rather write my own implementation.

Pearson's original paper outlines an algorithm to construct a permutation table T for perfect hashing:
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between −10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31 in alphabetic order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process:
A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
For each word, the first candidate considered for exchange was T[h[n − 1] ⊕ C[n]], the last table element referenced in the computation of the hash function for that word.
A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n − 2] ⊕ C[n − 1]].
The procedure is not always successful. For example, using the ASCII character codes, if the word “a” hashes to 0 and the word “i” hashes to 15, it turns out that the word “in” must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).

Related

how can I create a hash function in which different permutaions of digits of an integer form the same key?

for example 20986 and 96208 should generate the same key (but not 09862 or 9862 as leading zero means it not even a 5 digit number so we igore those).
One option is to get the least/max sorted permutation and then the sorted number is the hashkey, but sorting is too costly for my case. I need to generate key in O(1) time.
Other idea I have is to traverse the number and get frequency of each digits and the then get a hash function out of it. Now whats the best function to combine the frequencies given that 0<= Summation(f[i]) <= no_of_digits.
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.

values for MAD compression method?

I am stuck trying to implement the perfect hashing technique using universal hashing at each level from Cormen. Specifically, with the compression method (at least, I think here is my problem).
I am working on strings, I think short strings (between 8 and 150), and for that, I have my set of hash functions with Murmur3/2, xxhash, FNV1, Cityhash and Spookyhash, using the 64-bits keys (for those hash functions like spookyhash I am getting the lower 64-bits), the problem is that there exist collissions with only three unique strings (two of 10 characters and one of 11 characters) in 9 buckets.
I am using the Cormen's hash compression method for that:
h_ab(k) = ((ak+b)mod p) mod m
with a = 3, p = 4294967291 (largest 32-bit prime), b = 5 and m = 9 (because m_j should be the square of n_j). As "k" I am using the hash value returned by the hash function (like murmur).
If for example, I am using a hash function like murmur2 (64-bit version), the p number should be the largest 64-prime number? I that way, I am covering all possibles hashes that murmur could return, is that right?
Which other hash compressions techniques (apart of division) exist and do you recommend?
Any reference, hint, book, paper, help is pretty welcome.
Sorry for the silly question, I am pretty newbie with hash functions and hash tables.
Thanks in advance.

Matlab non-linear binary Minimisation

I have to set up a phoneme table with a specific probability distribution for encoding things.
Now there are 22 base elements (each with an assigned probability, sum 100%), which shall be mapped on a 12 element table, which has desired element probabilities (sum 100%).
So part of the minimisation is to merge several base elements to get 12 table elements. Each base element must occur exactly once.
In addition, the table has 3 rows. So the same 12 element composition of the 22 base elements must minimise the error for 3 target vectors. Let's say the given target vectors are b1,b2,b3 (dimension 12x1), the given base vector is x (dimension 22x1) and they are connected by the unknown matrix A (12x22) by:
b1+err1=Ax
b2+err2=Ax
b3+err3=Ax
To sum it up: A is to be found so that dot_prod(err1+err2+err3, err1+err2+err3)=min (least squares). And - according to the above explanation - A must contain only 1's and 0's, while having exactly one 1 per column.
Unfortunately I have no idea how to approach this problem. Can it be expressed in a way different from the matrix-vector form?
Which tools in matlab could do it?
I think I found the answer while parsing some sections of the Matlab documentation.
First of all, the problem can be rewritten as:
errSum=err1+err2+err3=3Ax-b1-b2-b3
=> dot_prod(errSum, errSum) = min(A)
Applying the dot product (least squares) yields a quadratic scalar expression.
Syntax-wise, the fmincon tool within the optimization box could do the job. It has constraints parameters, which allow to force Aij to be binary and each column to be 1 in sum.
But apparently fmincon is not ideal for binary problems algorithm-wise and the ga tool should be used instead, which can be called in a similar way.
Since the equation would be very long in my case and needs to be written out, I haven't tried yet. Please correct me, if I'm wrong. Or add further solution-methods, if available.

Difference bloom filters and FM-sketches

What is the difference between bloom filters and hash sketches (also FM-sketches) and what is their use?
Hash sketches/Flajolet-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic counting algorithms for data base applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./Flajolet, P. (2003): Loglog Counting of Large Cardinalities, in: Springer LNCS 2832, Algorithms ESA 2003, pp. 605–617.
Hash sketches are used to count the number of distinct elements in a set.
given:
a bit array B[] of length l
a (single) hash function h() that maps to [0,1,...2^l)
a function r() that gives the position of the least-significant 1-bit in the binary representation of its input (e.g. 000101 returns 1, 001000 returns 4)
insertion of element x:
pn := h(x) returns a pseudo-random number
apply r(pn) to get the position of the bit array to set to 1
since output of h() is pseudo-random every bit i is set to 1 ~n/(2^(i+1)) times
number of distinct elements in the set:
find the position p of the right-most 0 in the bit array
p = log2(n), solve for n to get the number of distinct element in the set;
the result might be up to 1.83 magnitudes off
usage:
in Data Mining, P2P/distributed applications, estimation of the document frequency, etc.
Bloom filters
Bloom, H. (1970): Space/time trade-offs in hash coding with allowable errors, in: Communications of the ACM, Vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to test whether an element is a member of a set.
given:
a bit array B[] of length m
k different hash functions h_k() that map to [0,...,m-1], i.e. to one of the position of the m-bit array
insertion of element x:
apply h_k to x (h_k(x)), for all k, i.e. you get k values
set the resulting bits in the array B to 1 (if already set to 1, don't change anything)
check if y is already in the set:
get the positions p_k to check using all the hash functions h_k (h_k(y)), i.e. for each function h_k you get a position p_k
if one of the positions p_k is set to 0 in the array B, the element y is definitively not in the set
if all positions given by p_k are 1, the element y might (!) be in the set
false positive rate is approximately (1 - e^(-kn/m))^k, no false negatives are possible!
by increasing the number of hashing functions, the false positive rate can be decreased; however, at the same time your bloom filter gets slower; the optimal value of k is k = (m/n)ln(2)
usage:
in the beginning used as a cheap filter in databases to filter out elements that do not match a query
various applications today, e.g. in Google BigTable, but also in networking for IP lookups, etc.
The Bloom Filter is a data structure used for Membership lookup while FM Sketch is primarily used for counting of elements. These two data structures provide the respective solutions optimizing over the space required to perform the lookup/computation and the trade off is the accuracy of the result.

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function besides generating a large set of values and see the distribution of values?
By efficiency I mean that the keys generated by your hash function distribute evenly. Is there a way to prove this without actually testing for actual values?
A hash function is only even in the context of the data being hashed
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (ie mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set
However apply it to the second set and there are collisions everywhere
Hash = (x * 37) mod 256
Is much better for the second set but may not suit the first set quite so well... Especially when partitioning the hash for eg a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should get enough data well before the cost of rehashing becomes prohibitive to change your hash function
Update
Lets suppose we are looking for a hash function that generates an 8 bit hash of the input data. Lets further suppose that the hash function is supposed to take byte-streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now lets change the data set from being variable length strings of random numbers in the range 0 to 255 to being variable length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set and as a result only generates hashes in the range 0-127, also there is a higher likelyhood of some "hot" values because of the letter frequency in english words and the cancelling affect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak for collisions due to how English language is structured... Something that only testing with real data can show.