how to pick a modulo for integer or string hash? - hash

Typically, we do hashing by calculating the integer or string according to a rule, then return hash(int-or-str) % m as the index in the hash table, but how do we choose the modulo m? Is there any convention to follow?

There are two possible conventions. One is to use a prime number, which yields good performance with quadratic probing.
The other is to use a power of two, since n mod m where m = 2^k is a fast operation; it's a bitwise AND with m-1. Of course, the modulus must be equal to the size of the hash table, and powers of two mean your hash table must double in size whenever it's overcrowded. This gives you amortized O(1) insertion in a similar way that a dynamic array does.

Since [val modulo m] is used as an index into the table, m is the number of elements in that table. Are you free to choose that ? Then use a big enough prime number. If you need to resize the table, you can either chose to use a bigger prime number, or (if you choose doubling the table for resizing) you'd better make sure that your hash function has enough entropy in the lower bits.

Related

how can I create a hash function in which different permutaions of digits of an integer form the same key?

for example 20986 and 96208 should generate the same key (but not 09862 or 9862 as leading zero means it not even a 5 digit number so we igore those).
One option is to get the least/max sorted permutation and then the sorted number is the hashkey, but sorting is too costly for my case. I need to generate key in O(1) time.
Other idea I have is to traverse the number and get frequency of each digits and the then get a hash function out of it. Now whats the best function to combine the frequencies given that 0<= Summation(f[i]) <= no_of_digits.
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.

how to "explain" the following hash function is bad

we have a hash table with size 16, using double hashing method.
h1(x) = k mod 16
h2(x) = 2*(k mod 8)
I know that h2 hash function is bad, probably because mod 8 and times 2, but I don't know how to explain it. is there any explanation like "h2 hash function should mod prime or it will cause ____ problem "
It is bad because it increases the number of collisions.
The (mod 8) means that you are only looking for 8 pigeonholes in your 16-pigeonhole table.
Multiplying it by 2 just spreads those 8 pigeonholes out so that you don’t have to search too many slots past the hashed index to find an empty hole...
You should always compute modulo the size of your table.
h(x) ::= x (mod N) // where N is the table size
The purpose of making the table size a prime number just has to do with how powers of two are very common in computer science. If your data is random, then the size of the table doesn’t matter.
— As long as it is big enough for your expected load factor. A 16-element table is very small. You shouldn’t expect to store more than 6-12 random values in your table without a high-probability of collisions.
A very good linked thread is What is a good Hash Function?, which is totally worth a read just for the links to further reading alone.

How to convert values to N bits of resolution in MATLAB?

My computer uses 32 bits of resolution as default. I'm writing a script that involves taking measurements with a multimeter that has N bits of resolution. How do I convert the values to that?
For example, if I have a RNG that gives 1000 values
nums = randn(1,1000);
and I use an N-bit multimeter to read those values, how would I get the values to reflect that?
I currently have
meas = round(nums,N-1);
but it's giving me N digits, not N bits. The original random numbers are unbounded, but the resolution of the multimeter is the limitation; how to implement the limitation is what I'm looking for.
Edit I: I'm talking about the resolution of measurement, not the bounds of the numbers. The original values are unbounded. The accuracy of the measured values should be limited by the resolution.
Edit II: I revised the question to try to be a bit clearer.
randn doesn’t produce bounded numbers. Let’s say you are producing 32-bit integers instead:
mums = randi([0,2^32-1],1,n);
To drop the bottom 32-N bits, simply divide by an appropriate value and round (or take the floor):
nums = round(nums/(2^(32-N)));
Do note that we only use floating-point arithmetic here, numbers are integer-valued, but not actually integers. You can do a similar operation using actual integers if you need that.
Also, obviously, N should be lower than 32. You cannot invent new bits. If N is larger, the code above will add zero bits at the bottom of the number.
With a multimeter, it is likely that the range is something like -M V to M V with a a constant resolution, and you can configure the M selecting the range.
This is fixed point math. My answer will not use it because I don't have the toolbox available, if you have it you could use it to have simpler code.
You can generate the integer values with the intended resolution, then rescale it to the intended range.
F=2^N-1 %Maximum integer value
X=randi([0,F],100,1)
X*2*M/F-M %Rescale, divide by the integer range, multiply by the intended range. Then offset by intended minimum.

Hashing using division method

For the hash function : h(k) = k mod m;
I understand that m=2^n will always give the last n LSB digits. I also understand that m=2^p-1 when K is a string converted to integers using radix 2^p will give same hash value for every permutation of characters in K. But why exactly "a prime not too close to an exact power of 2" is a good choice? What if I choose 2^p - 2 or 2^p-3? Why are these choices considered bad?
Following is the text from CLRS:
"A prime not too close to an exact power of 2 is often a good choice for m. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly n D 2000 character strings, where a character has 8 bits.
We don’t mind examining an average of 3 elements in an unsuccessful search, and
so we allocate a hash table of size m D 701. We could choose m D 701 because
it is a prime near 2000=3 but not near any power of 2."
Suppose we work with radix 2p.
2p-1 case:
Why that is a bad idea to use 2p-1? Let us see,
k = ∑ai2ip
and if we divide by 2p-1 we just get
k = ∑ai2ip = ∑ai mod 2p-1
so, as addition is commutative, we can permute digits and get the same result.
2p-b case:
Quote from CLRS:
A prime not too close to an exact power of 2 is often a good choice for m.
k = ∑ai2ip = ∑aibi mod 2p-b
So changing least significant digit by one will change hash by one. Changing second least significant bit by one will change hash by two. To really change hash we would need to change digits with bigger significance. So, in case of small b we face problem similar to the case then m is power of 2, namely we depend on distribution of least significant digits.

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function besides generating a large set of values and see the distribution of values?
By efficiency I mean that the keys generated by your hash function distribute evenly. Is there a way to prove this without actually testing for actual values?
A hash function is only even in the context of the data being hashed
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (ie mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set
However apply it to the second set and there are collisions everywhere
Hash = (x * 37) mod 256
Is much better for the second set but may not suit the first set quite so well... Especially when partitioning the hash for eg a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should get enough data well before the cost of rehashing becomes prohibitive to change your hash function
Update
Lets suppose we are looking for a hash function that generates an 8 bit hash of the input data. Lets further suppose that the hash function is supposed to take byte-streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now lets change the data set from being variable length strings of random numbers in the range 0 to 255 to being variable length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set and as a result only generates hashes in the range 0-127, also there is a higher likelyhood of some "hot" values because of the letter frequency in english words and the cancelling affect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak for collisions due to how English language is structured... Something that only testing with real data can show.