Don't you get a random number after doing modulo on a hashed number? - hash

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?

Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identity hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional omission), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but it works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this"; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes.)
On the other hand, if you have a hash function that, say, takes the address of the object being hashed, and the objects are all 8 byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, having 8 times more collisions than you might expect. Not very random, and it doesn't work out well. But if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two bucket counts so it can utilise a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle and MOD can take e.g. 30-40 cycles, depending on your exact CPU).
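The alignment effect described above can be checked with a small sketch. It is only an illustration under stated assumptions: the "addresses" are synthetic multiples of 8 rather than real pointers, and buckets_used is a hypothetical helper name, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Count how many distinct buckets are hit when `count` synthetic
// 8-byte-aligned addresses (0, 8, 16, ...) are mapped to buckets with
// address % bucket_count. Hypothetical helper for illustration only.
std::size_t buckets_used(std::size_t bucket_count, std::size_t count) {
    assert(bucket_count > 0);
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < count; ++i) {
        std::uint64_t address = 8 * static_cast<std::uint64_t>(i);
        used.insert(static_cast<std::size_t>(address % bucket_count));
    }
    return used.size();
}
```

With 1000 such addresses, buckets_used(64, 1000) reports that only 8 of the 64 buckets are ever touched, while buckets_used(61, 1000) reports that all 61 buckets of the prime-sized table are used.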
When all the inputs are known at compile time, and there's not too many of them, then it's generally possible to create a perfect hash function (GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function" (or at least one that has very few collisions) over some large numerical hashed-to range will provide minimal collisions in actual usage in a hash table; indeed, this question is about coming to grips with the falsehood of that notion. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining with a load factor of 1.0: 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) will have one element, 1/(2!e) (~18.4%) two elements, 1/(3!e) (~6.1%) three elements, 1/(4!e) (~1.5%) four elements, 1/(5!e) (~0.3%) five, and so on. The average chain length of non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
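Those percentages come from the Poisson distribution with mean equal to the load factor. A minimal sketch of the arithmetic, assuming an idealised random-but-repeatable hash; the function names bucket_probability and mean_nonempty_chain are made up for this illustration:

```cpp
#include <cassert>
#include <cmath>

// Probability that a bucket holds exactly k elements, using the Poisson
// approximation with mean = load factor (assumes an idealised hash).
double bucket_probability(double load_factor, unsigned k) {
    assert(load_factor >= 0.0);
    double p = std::exp(-load_factor);  // P(0 elements)
    for (unsigned i = 1; i <= k; ++i) {
        p *= load_factor / i;           // P(i) = P(i-1) * load_factor / i
    }
    return p;
}

// Average chain length over non-empty buckets only:
// (elements per bucket) / (fraction of buckets that are non-empty).
double mean_nonempty_chain(double load_factor) {
    assert(load_factor > 0.0);
    return load_factor / (1.0 - bucket_probability(load_factor, 0));
}
```

At a load factor of 1.0, bucket_probability(1.0, 0) is about 0.368, bucket_probability(1.0, 2) is about 0.184, and mean_nonempty_chain(1.0) is about 1.58, independent of the table size.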
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising the same result means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned about in-use keys stored in the table, if there's a non-uniform statistical distribution to the use of keys. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in post-modulo mapping, but overall they'll be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
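As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[256][8] array; you could then hash any 8 byte key (e.g. a double) by XORing the random numbers, one per byte position:
#include <cstdint>
#include <cstring>
uint32_t a[256][8];  // filled once with random numbers before first use
uint32_t h(double d) {
    uint8_t i[8];
    std::memcpy(i, &d, 8);  // copy out the 8 raw bytes of the key
    uint32_t result = 0;
    for (int n = 0; n < 8; ++n) result ^= a[i[n]][n];  // a[byte value][byte position]
    return result;
}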
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining when 75% of buckets have one or more elements linked therefrom (which is very likely to mean the load factor (which many people may think of when you talk about how "full" the table is) is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.

I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets would be the main factor for the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the less chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.

Related

How do I truncate a 64-bit hash into a 32-bit hash? [duplicate]

We're trying to settle an internal debate on our dev team:
We're looking for a 64-bit PHP hash function. We found a PHP implementation of MurmurHash3, but MurmurHash3 is either 32-bit or 128-bit, not 64-bit.
Co-worker #1 believes that to produce a 64-bit hash from MurmurHash3, we can simply slice the first (or last, or any) 64 bits of the 128-bit hash and that it will be as collision-proof as a native 64-bit hash function.
Co-worker #2 believes that we must find a native 64-bit hash function to reduce collisions and that 64-bit slices of a 128-bit hash will not be as collision proof as a native 64-bit hash.
Who's correct?
Does the answer change if we take the first (or last, or any) 64-bits of a cryptographic hash like SHA1 instead of Murmur3?
If you had real random, uniformly distributed values, then "slicing" would yield exactly the same results as if you had started with the smaller value right from the start. To see why, consider this very simple example: Let's say your random generator outputs 3 random bits, but you only need one random bit to work with. Let's assume the output is
b1 b2 b3
The possible values are
000, 001, 010, 011, 100, 101, 110, 111
and all are to occur with equal probability of 1/8. Now whatever bit you slice from those three for your purpose - the first, second or third - the probability of having a '1' is always going to be 1/2, regardless of the position - and the same is true for a '0'.
You can easily scale this experiment to the 64 out of 128 bit case: regardless of which bits you slice, the probability of ending up with a one or a zero in a certain position is going to be one half. What this means is that if you had a sample taken from a uniformly distributed random variable, then slicing wouldn't make the probability for collisions more or less likely.
Now a good question is whether random functions are really the best we can do to prevent collisions. But as it turns out, it can be shown that the probability of finding collisions increases whenever a function deviates from random.
Cryptographic hash functions: co-worker #1 wins
The problem in real life is that hash functions are not random at all, on the contrary, they are boringly deterministic. But a design goal of cryptographic hash functions is as follows: if we didn't know their initial state, then their output would be computationally indistinguishable from a real random function, that is there's no computationally efficient way to tell the difference between the hash output and real random values. This is why you'd consider a hash already as kind of broken if you can find a "distinguisher", a method to tell the hash from real random values with a probability higher than one half. Unfortunately, we can't really prove these properties for existing cryptographic hashes, but unless somebody breaks them, we may assume these properties hold with some confidence. Here is an example of a paper about a distinguisher for one of the SHA-3 submissions that illustrates the process.
To summarize, unless a distinguisher is found for a given cryptographic hash, slicing is perfectly fine and does not increase the probability of a collision.
Non-cryptographic hash functions: co-worker #2 might win
Non-cryptographic hashes do not have to satisfy the same set of requirements as cryptographic hashes do. They are usually defined to be very fast and satisfy certain properties "under sane/benevolent conditions", but they might easily fall short if somebody tries to maliciously manipulate them. A good example for what this means in practice is the computational complexity attack on hash table implementations (hashDoS) presented earlier this year. Under normal conditions, non-crypto hashes work perfectly fine, but their collision resistance may be severely undermined by some clever inputs. This can't happen with cryptographic hash functions, because their very definition requires them to be immune to all sorts of clever inputs.
Because it is possible, sometimes even quite easy, to find a distinguisher like above for the output of non-cryptographic hashes, we can immediately say that they do not qualify as cryptographic hash functions. Being able to tell the difference means that somewhere there is a pattern or bias in the output.
And this fact alone implies that they deviate more or less from a random function, and thus (after what we said above) collisions are probably more likely than they would be for random functions. Finally, since collisions occur with higher probability for the full 128 bits already, this will not get better with shorter outputs; collisions will probably be even more likely in that case.
tl;dr You're safe with a cryptographic hash function when truncating it. But you're better off with a "native" 64 bit cryptographic hash function compared to truncating a non-cryptographic hash with a larger output to 64 bits.
Due to the avalanche effect, a strong hash is one where a single bit of change in the source results in half the bits of the hash flipping on average. For a good hash, then, the "hashness" is evenly distributed, and so each section or slice is affected by an equal and evenly distributed amount of source bits, and therefore is just as strong as any other slice of the same bit length could be.
I would agree with co-worker 1 as long as the hash has good properties and even distribution.
This question seems incomplete without this being mentioned:
Some hashes are provably perfect hashes for a specific class of inputs (eg., for input of length n for some reasonable value of n). If you truncate that hash then you are likely to destroy that property, in which case you are, by definition, increasing the rate of collisions from zero to non-zero and you have weakened the hash in that use case.
It's not the general case, but it's an example of a legitimate concern when truncating hashes.

Is it safe to cut the hash?

I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns the hash as a series of hexadecimal digits (like an md5 hash).
As far as I understand, this means that I need the hash to be exactly 8 symbols in length, because such a hash would be capable of hashing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be ok to store 2 billion strings. But I would need to compare 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd better ask before I do that.
The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits, or 16 bytes when stored in its efficient binary form. If your problem still applies, you can certainly consider truncating the number (by either coercing it into a word or bit-shifting first). Good hash algorithms distribute evenly to all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get, so it's good to plan for that happening at this phase.
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
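For a rough idea of how huge, here is a sketch of the standard occupancy estimate, assuming an ideal uniform hash: after hashing n items into d equally likely values, about d * (1 - (1 - 1/d)^n) distinct values are in use, and every remaining item has landed on a value that was already taken. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Expected number of distinct hash values occupied after hashing n items
// into d equally likely values: d * (1 - (1 - 1/d)^n).
double expected_distinct(double n, double d) {
    assert(d > 1.0 && n >= 0.0);
    // log1p/expm1 keep precision when 1/d is tiny.
    return -d * std::expm1(n * std::log1p(-1.0 / d));
}

// Expected number of items that land on a value some earlier item took.
double expected_collisions(double n, double d) {
    return n - expected_distinct(n, d);
}
```

For n = 2^31 and d = 2^32 this gives roughly 4.6 * 10^8 colliding items, i.e. more than a fifth of the 2 billion strings would share their 32-bit hash with an earlier string.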
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)

choosing a hash function

I was wondering: what are maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For md5, sha-*, maybe even crc32 or adler32.
Your question isn't clear. By "maximum number of bytes" you mean "maximum number of items"? The size of the files being hashed has no relation with the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will always have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function that outputs a uniform distribution. You may assume that, given enough items to be hashed and cryptographic hash functions (like md5 and sha) or good hashes (like Murmur3, Jenkins, City, and Spooky Hash).
And also assumes no malevolent adversary actively fabricating collisions. Then you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, NOT to minimize expected collisions. They have properties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not such good statistical properties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5kb and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for the SHA-2 hashes to date, for example).
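The entries in that probability table can be approximated with the usual birthday-problem estimate n ~= sqrt(2 * d * ln(1 / (1 - p))), where d is the number of possible hash values. A minimal sketch, assuming an ideal uniform hash; max_items is a hypothetical name:

```cpp
#include <cassert>
#include <cmath>

// Approximate number of items that can be hashed with an ideal `bits`-bit
// hash before the probability of any collision reaches p, using the
// birthday-problem approximation n ~= sqrt(2 * d * ln(1 / (1 - p))).
double max_items(int bits, double p) {
    assert(bits > 0 && p > 0.0 && p < 1.0);
    double d = std::ldexp(1.0, bits);             // 2^bits possible values
    return std::sqrt(2.0 * d * -std::log1p(-p));  // -log1p(-p) = ln(1/(1-p))
}
```

max_items(128, 1e-15) comes out at about 8.2 * 10^11, which matches the "about 820 billion documents" figure quoted above, and max_items(32, 0.5) is about 77,000.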
You are basically looking at the Birthday paradox, only looking at really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the amount of possibilities before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

Choosing a minimum hash size for a given allowable number of collisions

I am parsing a large amount of network trace data. I want to split the trace into chunks, hash each chunk, and store a sequence of the resulting hashes rather than the original chunks. The purpose of my work is to identify identical chunks of data - I'm hashing the original chunks to reduce the data set size for later analysis. It is acceptable in my work that we trade off the possibility that collisions occasionally occur in order to reduce the hash size (e.g. 40 bit hash with 1% misidentification of identical chunks might beat 60 bit hash with 0.001% misidentification).
My question is, given a) number of chunks to be hashed and b) allowable percentage of misidentification, how can one go about choosing an appropriate hash size?
As an example:
1,000,000 chunks to be hashed, and we're prepared to have 1% misidentification (1% of hashed chunks appear identical when they are not identical in the original data). How do we choose a hash with the minimal number of bits that satisfies this?
I have looked at materials regarding the Birthday Paradox, though this is concerned specifically with the probability of a single collision. I have also looked at materials which discuss choosing a size based on an acceptable probability of a single collision, but have not been able to extrapolate from this how to choose a size based on an acceptable probability of n (or fewer) collisions.
Obviously, the quality of your hash function matters, but some easy probability theory will probably help you here.
The question is what exactly you are willing to accept: is it good enough that the expected number of collisions is only 1% of the data? Or do you demand that the probability of the number of collisions exceeding some bound be below some threshold? If it's the first, then a back-of-the-envelope calculation will do:
The expected number of pairs that hash to the same thing out of your set is (1,000,000 C 2) * P(any two are a pair). Let's assume that second number is 1/d, where d is the number of possible hash values. (Note: expectations are linear, so I'm not cheating very much so far.) Now, you say you want 1% collisions, so that is 10,000 total. Well, you have (1,000,000 C 2)/d = 10,000, so d = (1,000,000 C 2)/10,000, which according to Google is about 50,000,000.
So, you need about 50 million possible hash values. That is less than 2^26, so you will get your desired performance with somewhere around 26 bits of hash (depending on the quality of the hashing algorithm). I probably have a factor of 2 mistake in there somewhere, so, you know, it's rough.
If this is an offline task, you can't be that space constrained.
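The back-of-the-envelope calculation in this answer can be written as a small function. This is a sketch under the same assumptions (an ideal, uniformly distributed hash, and "collisions" counted as colliding pairs); min_hash_bits is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Smallest hash width (in bits) such that the expected number of colliding
// pairs among n items, C(n, 2) / 2^bits, stays at or below max_pairs.
// Assumes an ideal, uniformly distributed hash.
int min_hash_bits(double n, double max_pairs) {
    assert(n >= 2.0 && max_pairs > 0.0);
    double pairs = n * (n - 1.0) / 2.0;        // C(n, 2)
    double needed_values = pairs / max_pairs;  // d in the calculation above
    int bits = static_cast<int>(std::ceil(std::log2(needed_values)));
    return bits < 0 ? 0 : bits;
}
```

min_hash_bits(1e6, 1e4) returns 26, agreeing with the estimate of about 50 million hash values.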
Sounds like a fun exercise!
Someone else might have a better answer, but I'd go the brute force route, provided that there's ample time:
Run the hashing calculation using incremental hash size and record the collision percentage for each hash size.
You might want to use binary search to reduce the search space.

Is there a collision rate difference between one 32-bit hash vs two 16 bit hashes?

I am working on a system where hash collisions would be a problem. Essentially there is a system that references items in a hash-table+tree structure. However the system in question first compiles text-files containing paths in the structure into a binary file containing the hashed values instead. This is done for performance reasons. However because of this collisions are very bad as the structure cannot store 2 items with the same hash value; the part asking for an item would not have enough information to know which one it needs.
My initial thought is that 2 hashes, either using 2 different algorithms, or the same algorithm twice, with 2 salts would be more collision resistant. Two items having the same hash for different hashing algorithms would be very unlikely.
I was hoping to keep the hash value 32-bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...
I know that switching to two 32-bit hashes would be more collision resistant, but I am wondering if switching to 2 16-bit hashes has at least some gain over a single 32-bit hash? I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to brute force it...
Some background on the system:
Items are given names by humans, they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by getting the value of a/b/c; the compiled request would be 3 hash values in immediate sequence, the hash values of a, b, and then c.
There is only a problem when there is a collision on a given level. A collision between an item at the top level and a lower level is fine. You can have { a => {a => {...}}}, almost guaranteeing collisions that are on different levels (not a problem).
In practice any given level will likely have less than 100 values to hash, and none will be duplicates on the same level.
To test the hashing algorithm I adopted (forgot which one, but I did not invent it) I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one searching for collisions. I encountered 0 collisions. That means that the algorithm has a different hash value for each unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
If you have 2 16 bit hashes, that are producing uncorrelated values, then you have just written a 32-bit hash algorithm. That will not be better or worse than any other 32-bit hash algorithm.
If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written to merely be fast to compute, this is not what you want), and increase the size of your hash until you are comfortable.
This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n * (n-1) / 2 pairs of things that could collide. If you're using a k bit hash, the odds of a single pair colliding are 2^-k. If you have a lot of things, then the odds of different pairs colliding is almost uncorrelated. This is exactly the situation that the Poisson distribution describes.
Thus the number of collisions that you will see should approximately follow the Poisson distribution with λ = n * (n-1) * 2^(-k-1). From that the probability of no hash collisions is about e^-λ. With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one in a million chances will add up.
But note that if you have many normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceding things - more things equals higher risk of collision. So, for instance, a single level with 1000 data items has about 1 chance in 10,000 of failing - which is about the same risk as 100 levels with 100 data items.
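To plug your own numbers into the Poisson estimate, something like the following sketch works. It assumes an ideal k-bit hash, and collision_probability is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Probability of at least one collision among n items hashed with an ideal
// k-bit hash, using the Poisson approximation with
// lambda = n * (n - 1) / 2 * 2^-k and P(no collision) = e^-lambda.
double collision_probability(double n, int k) {
    assert(n >= 0.0 && k >= 0);
    double lambda = n * (n - 1.0) / 2.0 * std::ldexp(1.0, -k);
    return -std::expm1(-lambda);  // 1 - e^-lambda, accurate for tiny lambda
}
```

collision_probability(100, 32) is about 1.15 in a million, and collision_probability(1000, 32) is about 1.16 in 10,000, in line with the figures above.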
If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.
Using those facts and your projections for what the usage of your application is, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.