Is key mod TableSize a good hash function in this particular case?

Suppose a user is designing a hash table and knows that all the keys will be multiples of 4 between 0 and 10,000, evenly distributed. Is the following hash function good?
hash(key) = key mod TableSize
where TableSize is some prime number.
My intuition is that this function is highly flawed because only 1/4 of the possible keys actually occur. But when I ran tests the hash values were about evenly distributed.
Am I missing something?

Good enough, if the keys are effectively random apart from being multiples of 4. By the way, why don't you divide each key by 4 (>> 2) before putting it into the hash table?
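You're not missing anything: since 4 shares no factor with an odd prime TableSize, consecutive multiples of 4 cycle through every residue mod TableSize, so the spread really is even. A minimal C sketch to check it (97 is just an arbitrary prime picked for illustration):

#include <stdio.h>

#define TABLE_SIZE 97  /* arbitrary prime, purely for illustration */

int main(void) {
    int counts[TABLE_SIZE] = {0};

    /* keys are the multiples of 4 between 0 and 10,000 */
    for (int key = 0; key <= 10000; key += 4)
        counts[key % TABLE_SIZE]++;

    int min = counts[0], max = counts[0];
    for (int i = 1; i < TABLE_SIZE; i++) {
        if (counts[i] < min) min = counts[i];
        if (counts[i] > max) max = counts[i];
    }

    /* 4 and 97 are coprime, so the bucket counts differ by at most 1 */
    printf("buckets: %d  min: %d  max: %d\n", TABLE_SIZE, min, max);
    return 0;
}

With 2,501 keys and 97 buckets, every count comes out as 25 or 26.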

how can I create a hash function in which different permutations of digits of an integer form the same key?

For example, 20986 and 96208 should generate the same key (but not 09862 or 9862, since a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest (or largest) sorted permutation and use that sorted number as the hash key, but sorting is too costly for my case. I need to generate the key in O(1) time.
Another idea I have is to traverse the number, get the frequency of each digit, and then build a hash function out of that. Now, what is the best function to combine the frequencies, given that 0 <= sum(f[i]) <= no_of_digits?
To create an order-insensitive hash, simply hash each value (in your case, the digits of the number) and then combine the results with a commutative operation (e.g. addition, multiplication, or XOR). XOR is probably the most appropriate, as it keeps the hash output size constant and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.
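A minimal sketch of that idea in C; mix_digit is just a hypothetical per-digit mixing step (the multiplier is arbitrary), and the +1 offset is there so that a zero digit still contributes something to the XOR:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical per-digit mixing step: anything that spreads single digits
   over the whole word will do; the multiplier is an arbitrary odd constant. */
static uint32_t mix_digit(uint32_t d) {
    d *= 0x9E3779B1u;
    return d ^ (d >> 16);
}

/* Order-insensitive hash: combine the per-digit hashes with a commutative
   operation (XOR here; addition works too), so every permutation of the
   digits produces the same value. */
uint32_t digit_hash(uint32_t n) {
    uint32_t h = 0;
    do {
        h ^= mix_digit(n % 10 + 1);  /* +1 so a 0 digit still contributes */
        n /= 10;
    } while (n > 0);                 /* integers carry no leading zeros */
    return h;
}

int main(void) {
    printf("%08x %08x\n", (unsigned)digit_hash(20986), (unsigned)digit_hash(96208)); /* equal  */
    printf("%08x\n", (unsigned)digit_hash(9862));                                    /* differs */
    return 0;
}

This runs in time proportional to the number of digits, which for fixed-width inputs like your 5-digit case is effectively O(1), and it never sees a leading zero because the input is already an integer.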

how to "explain" the following hash function is bad

We have a hash table of size 16, using the double hashing method.
h1(k) = k mod 16
h2(k) = 2*(k mod 8)
I know that the h2 hash function is bad, probably because of the mod 8 and the times 2, but I don't know how to explain it. Is there an explanation along the lines of "h2 should mod a prime or it will cause ____ problem"?
It is bad because it increases the number of collisions.
The (mod 8) means that you are only using 8 pigeonholes in your 16-pigeonhole table.
Multiplying it by 2 just spreads those 8 pigeonholes out so that you don’t have to search too many slots past the hashed index to find an empty hole...
You should always compute modulo the size of your table.
h(x) ::= x (mod N) // where N is the table size
The purpose of making the table size a prime number just has to do with how common powers of two are in computer science. If your data is random, then the size of the table doesn't matter, as long as it is big enough for your expected load factor. A 16-element table is very small; you shouldn't expect to store more than 6-12 random values in it without a high probability of collisions.
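If you want to demonstrate it rather than just assert it, a small sketch that counts the distinct slots a probe sequence can reach makes the problem visible: the step h2(k) = 2*(k mod 8) is always even (sometimes zero), so from any starting slot the probe only ever reaches slots of one parity, i.e. at most half of the 16-slot table. A C sketch, with arbitrary example keys:

#include <stdio.h>

#define TABLE_SIZE 16

static int h1(int k) { return k % TABLE_SIZE; }
static int h2(int k) { return 2 * (k % 8); }   /* the "bad" step function */

int main(void) {
    int keys[] = { 5, 8, 12 };   /* arbitrary example keys */
    for (int j = 0; j < 3; j++) {
        int k = keys[j];
        int visited[TABLE_SIZE] = {0};
        int count = 0;

        /* double hashing probes slot (h1 + i*h2) mod TABLE_SIZE */
        for (int i = 0; i < TABLE_SIZE; i++) {
            int slot = (h1(k) + i * h2(k)) % TABLE_SIZE;
            if (!visited[slot]) { visited[slot] = 1; count++; }
        }
        /* h2 is always even (or zero), so at most 8 of the 16 slots
           are ever reachable for any key */
        printf("key %2d: h2 = %2d, distinct slots probed = %d\n",
               k, h2(k), count);
    }
    return 0;
}

The key with h2(k) = 0 is the degenerate case: the probe never advances at all, so an insertion that collides there loops forever.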
A very good linked thread is What is a good Hash Function?, which is totally worth a read just for the links to further reading alone.

how mod x hashing function uses only lower bits of key?

Assume there is a hash table of size 16 and a hash function h(k) = k % 16.
It is said that the above hash function is not a good option.
There might be many reasons, but two major problems are usually given.
Firstly, if the input keys are all even numbers, then only 50% of the hash table can be used.
Secondly, the above hash function uses only the last 4 bits of the key.
But I cannot understand the second problem.
How can we know that only the lower 4 bits of the key are used?
I appreciate your help in advance.
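One way to see the second point: 16 is 2^4, so for a non-negative key, dividing by 16 is just a right shift by 4 bits and the remainder is whatever sits in the bottom 4 bits, i.e. k % 16 == (k & 0xF). A small C sketch with arbitrary keys that differ only in their upper bits:

#include <stdio.h>

int main(void) {
    /* 16 = 2^4, so for a non-negative key, k % 16 is just the low 4 bits */
    unsigned int keys[] = { 23, 39, 55, 1000007 };

    for (int i = 0; i < 4; i++) {
        unsigned int k = keys[i];
        printf("k = %7u   k %% 16 = %2u   k & 0xF = %2u\n", k, k % 16, k & 0xF);
    }
    return 0;
}

All four keys end in the same 4 bits, so they all land in bucket 7 no matter how different their upper bits are; that is the sense in which the upper bits of the key are simply ignored.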

values for MAD compression method?

I am stuck trying to implement the perfect hashing technique from Cormen, using universal hashing at each level. Specifically, I think my problem is with the compression method.
I am working on strings, mostly short ones (between 8 and 150 characters), and for that I have a set of hash functions: Murmur3/2, xxhash, FNV1, CityHash and SpookyHash, using 64-bit keys (for hash functions like SpookyHash I take the lower 64 bits). The problem is that there are collisions with only three unique strings (two of 10 characters and one of 11 characters) in 9 buckets.
I am using Cormen's hash compression method for that:
h_ab(k) = ((ak+b)mod p) mod m
with a = 3, p = 4294967291 (largest 32-bit prime), b = 5 and m = 9 (because m_j should be the square of n_j). As "k" I am using the hash value returned by the hash function (like murmur).
If, for example, I am using a hash function like Murmur2 (the 64-bit version), should p be the largest 64-bit prime? That way I would be covering all possible hashes that Murmur could return; is that right?
Which other hash compression techniques (apart from division) exist, and which do you recommend?
Any reference, hint, book, paper, help is pretty welcome.
Sorry for the silly question, I am pretty new to hash functions and hash tables.
Thanks in advance.
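Not a silly question. As I read CLRS, the ((a*k + b) mod p) mod m family is only universal if a and b are drawn at random (a from 1..p-1, b from 0..p-1) with p a prime larger than any key, and in the perfect-hashing construction you re-draw them whenever a second-level table still has collisions; a fixed a = 3, b = 5 gives you a single function, and with three keys in nine buckets an occasional collision is just birthday odds. For a raw 64-bit hash, p would have to be a prime above 2^64 (which needs 128-bit arithmetic), so one convenient shortcut, assumed in the C sketch below (which also relies on GCC/Clang's __int128), is to first fold the hash below a smaller prime such as the Mersenne prime 2^61 - 1:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Mersenne prime 2^61 - 1: it fits in 64 bits and products of two values
   below it fit in 128 bits.  This choice is an assumption made for the
   sketch, not something CLRS prescribes. */
#define P ((1ULL << 61) - 1)

typedef struct { uint64_t a, b; } mad_params;

/* Draw a and b at random: a in [1, p-1], b in [0, p-1].
   rand() is only a placeholder; use a real RNG in earnest code. */
static mad_params mad_random(void) {
    mad_params f;
    f.a = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % (P - 1) + 1;
    f.b = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % P;
    return f;
}

/* h_{a,b}(k) = ((a*k + b) mod p) mod m, with k first folded below p. */
static uint64_t mad_compress(mad_params f, uint64_t k, uint64_t m) {
    unsigned __int128 t = (unsigned __int128)f.a * (k % P) + f.b;
    return (uint64_t)(t % P) % m;
}

int main(void) {
    srand(12345);            /* fixed seed, just so the demo is repeatable */
    mad_params f = mad_random();

    /* arbitrary stand-ins for 64-bit values returned by murmur/xxhash/... */
    uint64_t hashes[] = { 0x9ae16a3b2f90404fULL, 0xc3a5c85c97cb3127ULL,
                          0xb492b66fbe98f273ULL };
    for (int i = 0; i < 3; i++)
        printf("bucket %llu\n", (unsigned long long)mad_compress(f, hashes[i], 9));
    return 0;
}

If a freshly drawn (a, b) still produces a collision in a given second-level table, you simply draw again; that retry loop is what makes the CLRS construction work, and it is the part a fixed a and b cannot give you.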

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function, besides generating a large set of values and looking at their distribution?
By efficiency I mean that the keys generated by your hash function are distributed evenly. Is there a way to prove this without actually testing with real values?
A hash function is only "even" in the context of the data being hashed.
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (i.e. mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set.
However, apply it to the second set and there are collisions everywhere.
Hash = (x * 37) mod 256
Is much better for the second set, but may not suit the first set quite so well... especially when partitioning the hash into, say, a small number of buckets.
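To make that concrete, a small C sketch can count, for each set, how many keys land in a bucket that is already occupied; the bucket counts simply mirror the example above, so this is an illustration, not a general benchmark:

#include <stdio.h>

/* Count how many of the n keys collide (land in an already-used bucket). */
static int collisions(const unsigned long *keys, int n,
                      unsigned long (*h)(unsigned long), int buckets) {
    int used[256] = {0};      /* enough for both example hashes */
    int c = 0;
    for (int i = 0; i < n; i++) {
        unsigned long slot = h(keys[i]) % buckets;
        if (used[slot]) c++;
        used[slot] = 1;
    }
    return c;
}

static unsigned long mod10(unsigned long x) { return x % 10; }
static unsigned long mul37(unsigned long x) { return (x * 37) % 256; }

int main(void) {
    unsigned long set1[] = { 1, 3, 6, 2, 7, 9, 5, 8, 4 };
    unsigned long set2[] = { 65355, 96424664, 86463624, 133, 643564,
                             24232, 88677, 865747, 2224 };

    printf("set1: mod10 -> %d collisions, (x*37) mod 256 -> %d collisions\n",
           collisions(set1, 9, mod10, 10), collisions(set1, 9, mul37, 256));
    printf("set2: mod10 -> %d collisions, (x*37) mod 256 -> %d collisions\n",
           collisions(set2, 9, mod10, 10), collisions(set2, 9, mul37, 256));

    /* folding the 256-value hash down to 10 buckets shows the
       partitioning caveat on set 1 */
    printf("set1: (x*37) mod 256, folded to 10 buckets -> %d collisions\n",
           collisions(set1, 9, mul37, 10));
    return 0;
}

Run over these two sets, mod 10 comes out collision-free on set 1 but collides repeatedly on set 2, while (x * 37) mod 256 separates set 2 cleanly yet picks up a collision on set 1 once its output is folded down to 10 buckets.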
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should have gathered enough data well before the cost of rehashing becomes prohibitive, so you can still change your hash function.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data. Let's further suppose that the hash function is supposed to take byte streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream) hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random numbers in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set, so it only generates hashes in the range 0-127; there is also a higher likelihood of some "hot" values because of the letter frequencies in English words and the cancelling effect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak for collisions due to how the English language is structured... something that only testing with real data can show.
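If you want to see that effect rather than take it on faith, hashing a few English strings with both functions shows the XOR variant pinned below 128 (US-ASCII never sets the top bit) while the multiply-and-add variant can land anywhere in 0-255. A C sketch with arbitrary sample strings:

#include <stdio.h>

/* XOR-fold every byte together. */
static unsigned xor_hash(const char *s) {
    unsigned h = 0;
    for (; *s; s++) h ^= (unsigned char)*s;
    return h;
}

/* Multiply-and-add with a pair of primes, reduced to 8 bits. */
static unsigned prime_hash(const char *s) {
    unsigned h = 37;
    for (; *s; s++) h = (31 * h + (unsigned char)*s) % 256;
    return h;
}

int main(void) {
    /* arbitrary English sample sentences */
    const char *samples[] = {
        "the quick brown fox jumps over the lazy dog",
        "hashing is harder than it looks",
        "premature optimisation is the root of all evil",
        "a good hash spreads its inputs evenly",
    };

    for (int i = 0; i < 4; i++)
        printf("xor = %3u   primes = %3u   \"%s\"\n",
               xor_hash(samples[i]), prime_hash(samples[i]), samples[i]);

    /* every xor value is below 128 because US-ASCII never sets the top bit */
    return 0;
}

Measuring the clustering caused by English letter frequencies takes a much larger corpus, which is exactly the "only testing with real data can show" point above.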