According to my understanding, hashing is a process of producing a unique, fixed-length output (let's assume 64-bit) from an input of ANY length. (Correct me if I am wrong.)
So if I take all of the x possible 64-bit hash values that a hash function can produce and append a 0 or a 1 to the end of each one, I get a list of 2x values (where each entry is 65 bits long).
If I give all 2x combinations as input to the same hash function, how can it generate a unique hash for every input?
You are correct. This is called a hash collision, and it's a real thing. The reason it's not a bigger deal is that the number of hashes is so overwhelmingly large that these types of collisions are rare. Your example of 64 bits is a little unrealistic, though. 256 bits or 512 bits is a more likely scenario. (Even 128 is no longer considered strong enough.) And the range of hashes in this case is so large that finding inputs that create a hash collision is very difficult.
By the pigeonhole principle, hash collisions are inevitable. That is, there must exist two distinct messages m1 != m2 whose hashes are equal, H(m1) = H(m2).
Therefore one cannot guarantee unique hashes for the inputs. Even among fewer than 2^64 inputs there can be a collision for a hash function with 64-bit output, although the probability of hitting one is very small (negligible).
Better to use a hash function like SHA3-512 or BLAKE2b, and if you really want the outputs to be unique, compare each new hash against the previous hashes you have generated. If you ever find a collision, you will be famous.
The SHA-3 family can generate 224-, 256-, 384-, or 512-bit outputs.
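To see how quickly a shortened output collides, here is a minimal sketch (Python with hashlib; the 32-bit truncation is just for illustration) that finds a collision by brute force. With the full 256-bit output the same loop would never realistically terminate.

    import hashlib
    import os

    def truncated_hash(data: bytes, bits: int = 32) -> int:
        # keep only the top `bits` bits of the SHA-256 digest
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest, "big") >> (256 - bits)

    seen = {}                    # truncated hash -> an input that produced it
    while True:
        msg = os.urandom(16)     # arbitrary distinct inputs
        h = truncated_hash(msg)
        if h in seen and seen[h] != msg:
            print("collision:", seen[h].hex(), "and", msg.hex(), "->", hex(h))
            break
        seen[h] = msg

By the birthday bound this loop terminates after roughly 2^16 random inputs for a 32-bit truncation.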
In hashing, we take the input and apply some complex hashing algorithm. Then, we do mod n to find the bucket or server into which this input needs to be sent.
Hash the input x -> compute Hash(x) -> take the remainder: Hash(x) mod n gives the location of the bucket.
If we take the input directly without hashing, it is equivalent to using an identity hash function, Hash(x) = x, followed by mod n. Wikipedia calls this a 'trivial' hash function.
Generally, hash(x) is a complex hashing algorithm such as MD5, SHA, etc.
Q1) Regardless of how we hash it, it just boils down to a value between 0 and n-1 (the remainder when divided by n). So, how does the choice of hashing function matter?
Q2) I know that an ideal hash function distributes the input values uniformly across the buckets. In this aspect, are those complex hashing functions superior to the hash identity function?
Assume that the input is always an integer.
What is the advantage of applying a complex hash function and then taking mod n instead of simply doing mod n for the input?
Let's look at a simple example. Say our keys are 100 pointers to some objects in memory that are 8-byte aligned: that means the 3 least-significant bits are always 0. Our table size is currently 128 buckets. If we mod the pointer values by 128 before hashing, we're effectively taking:
32-bit pointer bits   xxxxxxxx xxxxxxxx xxxxxxxx xxxxx000
mod 128               00000000 00000000 00000000 0xxxx000
Notice that only 4 potentially meaningful bits of the pointer make it through to our hash function, which means at most 16 distinct values ever reach it: our 100 pointers will collide into at most 16 buckets, so collision chains will typically be 6 or 7 deep even for the strongest hash function. That's woeful given we had 128 buckets for 100 keys: we should have had mostly 0, 1 or 2 keys mapped to any given bucket.
Now, what would have happened if we'd had 100 pointers to memory mapped areas, each 4096-byte page aligned? They all would have mapped to the same bucket.
Not doing the mod operation until after hashing ensures the higher-order bits in the keys can help randomise the lower-order bit positions in the hash value, and those lower-significance bits then determine which bucket the key maps to. (Another thing that can help a little is making the table size a prime number, but that's best used in combination with doing the mod after hashing. As a random sampling, GNU's C++ compiler uses prime bucket counts for Standard Library hash tables, while Visual C++ uses powers of two (and, for long strings, faster but weaker hash functions).)
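A rough sketch of that effect (Python; the "pointers" below are just simulated 8-byte-aligned addresses, and SHA-256 stands in for whatever real hash function the table uses):

    import hashlib
    import random

    def buckets_used(keys, hash_first):
        used = set()
        for k in keys:
            if hash_first:
                # hash the whole key, then mod by the table size
                h = int.from_bytes(hashlib.sha256(k.to_bytes(8, "big")).digest()[:8], "big")
            else:
                h = k                # identity "hash" (or mod-before-hash)
            used.add(h % 128)        # 128-bucket table
        return len(used)

    keys = [random.randrange(2**28) * 8 for _ in range(100)]    # 8-byte aligned "pointers"
    print("identity then mod 128:", buckets_used(keys, False))  # at most 16 buckets used
    print("hash then mod 128:    ", buckets_used(keys, True))   # typically ~70 buckets used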
Q1) Regardless of how we hash it, it just boils down to a value between 0 and n-1 (the remainder when divided by n). So, how does the choice of hashing function matter?
Obviously if our hash function was h(key) { return 0 } every key would collide at bucket 0. At the other extreme, a cryptographic hash function should effectively randomly-but-repeatably map any given key to a given bucket, such that a bit changing anywhere in the key creates a completely uncorrelated mapping. That helps protect you from excessive collisions with keys that don't vary at many bit positions. But strong hash functions tend to take longer to calculate, and the reduction in collisions may or may not result in a net performance win. It's sometimes worth choosing the strength of the hash function based on knowledge of how much the keys are likely to differ from each other.
Q2) I know that an ideal hash function distributes the input values uniformly across the buckets. In this aspect, are those complex hashing functions superior to the hash identity function?
At the extreme, identity hash functions hope that the input numbers will map onto distinct buckets with higher probability than a cryptographic-strength hash function would: for example, if we hash 5, 6, 7, 8, 10 into a table using an identity function, they're dense (close to each other) and span just 6 values (5 through 10), so as long as the table size is >= 6 (e.g. the prime value 7) they're guaranteed not to collide. But identity hash functions given collision-prone inputs (e.g. pointers cast to numbers) are a disaster, as they've done nothing to mix the more-significant bits into the less-significant bits before the mod kicks in - the same problem explained for pointers above.
In summary, identity hash functions can have better average-case performance for common integer keys, but far worse worst-case performance for non-dense, non-random / collision-prone keys.
If I have some data that I hash with SHA256 like this: hash = SHA256(data)
And then I copy only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32?
If I need to reduce a hash of some data to a smaller size (n bits), is there any way to ensure the search space is 2^n?
I think you're actually interested in three things.
The first thing you need to understand is the entropy distribution of the hash. If the output of a hash function is n bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than the low bits (or vice versa).
However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)
The second is the search space: it is not dependent on the hash function at all. Searching for an input that creates a given output of a hash function is more commonly known as a brute-force attack. The number of inputs that have to be searched does not depend on the hash function itself; how could it? Every hash function output is the same size: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generated each possible 256-bit output. Unfortunately, this would take up a minimum storage space of 256 * 2^256 ≈ 3 * 10^79 bits for just the hash values themselves (i.e. not counting the inputs needed to generate them), which vastly eclipses the entire hard drive capacity of the world.
Therefore, the search space depends on the complexity and length of the input to the hash function. If your data is 8-character ASCII strings, then you're pretty well guaranteed never to have a collision, BUT the search space for those hash values is only 2^(7*8) = 2^56 ≈ 7.2 * 10^16, which is small enough to brute-force with determined (e.g. GPU-accelerated) effort. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
Third, you're interested in knowing the collision resistance. As GregS' linked article points out, the collision resistance of a space is much more limited than the input search space due to the pigeonhole principle.
Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2^256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2N/2 (or sqrt(2N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute force attack, it is typically considered a flaw in the hash function.
So consider what happens when you examine and store only the first 8 bytes (one fourth) of your output. Your collision resistance has dropped from 2^(256/2) = 2^128 to 2^(64/2) = 2^32. How much smaller is 2^32 than 2^128? It's a whole lot smaller, as it turns out: roughly 10^-27 percent of the size, at best.
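A small sketch of those numbers (Python; the input string is arbitrary):

    import hashlib

    digest = hashlib.sha256(b"some data").digest()
    truncated = digest[:8]                   # first 8 of 32 bytes

    full_bits = len(digest) * 8              # 256
    trunc_bits = len(truncated) * 8          # 64
    print("full collision resistance:      ~2^%d" % (full_bits // 2))   # 2^128
    print("truncated collision resistance: ~2^%d" % (trunc_bits // 2))  # 2^32
    print("ratio: %.1e %%" % (100 * 2.0 ** (trunc_bits // 2 - full_bits // 2)))  # ~1.3e-27 %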
I would like to store hashes for approximately 2 billion strings, and for that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns hash as series of hexadecimal digits (like an md5 hash).
As far as I understand the idea, this means I need the hash to be no less and no more than 8 symbols long, because such a hash would be capable of representing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16 = 16^8) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK for storing 2 billion strings, but I would need to compare 2 billion hashes with their 2 billion truncated versions. That doesn't seem trivial to me, so I'd rather ask before I do it.
The hash is a number, not a string of hexadecimal characters. In the case of MD5 it is 128 bits, or 16 bytes when stored in its compact binary form. If your problem still applies, you can certainly consider truncating the number (for example by coercing it into a machine word, or by bit-shifting it first). Good hash algorithms distribute entropy evenly across all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get, but it's good to plan for that happening at this stage.
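A minimal sketch of that pattern (Python; the names and the 8-byte cut are illustrative, not prescribed by the question): store the truncated MD5 per string and compare the actual strings whenever the truncated hashes match.

    import hashlib

    index = {}   # truncated hash (8 bytes) -> list of original strings

    def add(s: str) -> None:
        key = hashlib.md5(s.encode()).digest()[:8]
        index.setdefault(key, []).append(s)

    def contains(s: str) -> bool:
        key = hashlib.md5(s.encode()).digest()[:8]
        return s in index.get(key, [])       # verify the real string on a hash hit

    add("hello")
    print(contains("hello"), contains("world"))   # True False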
Whether or not it's safe to store x values in a hash domain capable of representing only 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (about 2 billion) hashes with only 2^32 (about 4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%, and so on.) I can't find the formula for calculating the probable number of collisions from these numbers, but I suspect it is a huge number.
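For a rough idea of that "huge number", the usual birthday-problem approximations (k items thrown into N equally likely values) give something like this; a back-of-the-envelope sketch in Python, figures approximate:

    import math

    k = 2**31      # ~2 billion hashed strings
    N = 2**32      # ~4 billion possible truncated hash values

    p_no_collision = math.exp(-k * (k - 1) / (2 * N))    # underflows to 0: a collision is essentially certain
    expected_colliding_pairs = k * (k - 1) / (2 * N)     # ~5.4e8 pairs sharing a value
    expected_distinct = N * (1 - math.exp(-k / N))       # ~1.7e9 distinct hash values

    print("P(no collision)          ~", p_no_collision)
    print("expected colliding pairs ~ %.2e" % expected_colliding_pairs)
    print("expected distinct values ~ %.2e" % expected_distinct)

So hundreds of millions of the 2 billion strings would share a truncated hash value with at least one other string.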
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
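A small sketch of that approach using the Keccak-based SHAKE extendable-output functions in Python's hashlib, where you simply request the number of bytes you need (this prefix property is what the XOFs give you; the fixed-length SHA3-224/256/384/512 variants are parameterised differently):

    import hashlib

    data = b"some string"
    h4 = hashlib.shake_256(data).digest(4)   # 4-byte (32-bit) output
    h8 = hashlib.shake_256(data).digest(8)   # 8-byte (64-bit) output

    print(h4.hex(), h8.hex())
    print(h8.startswith(h4))                 # True: the longer output extends the shorter one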
I have code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If I use small hash values (7-8 bits) then there are some collisions, i.e. different n-grams map to the same hash value. If I increase the number of bits in the hash value to, say, 31, then there are 0 collisions - all n-grams map to different hash values.
I want to know why this is so? Do the collisions depend on the number of n-grams in the text or the number of different characters that an n-gram can have or is it the size of an n-gram?
How does one choose the number of bits for the hash value when hashing n-grams (using rolling hashes)?
How Length Affects Collisions
This is simply a question of permutations.
If I use small hash values (7-8 bits) then there are some collisions
Well, let's analyse this. With 8 bits, there are 2^8 possible binary sequences that can be generated for any given input. That is 256 possible hash values, so by the pigeonhole principle 257 distinct n-grams are guaranteed to produce at least one collision (and, as the birthday paradox discussed below shows, collisions are expected far sooner).
If I increase the bits in the hash value to say 31, then there are 0 collisions - all n-grams map to different hash values.
Well, let's apply the same logic. With 31 bit precision, we have 2^31 possible combinations. That is 2147483648 possible combinations. And we can generalise this to:
Let N denote the number of bits we use.
Number of different hash values we can generate: X = 2^N
Assuming repetition of values is allowed (which it is in this case!)
This is exponential growth, which is why with 8 bits you found a lot of collisions and with 31 bits you found very few.
How does this affect collisions?
Well, with a very small number of possible values, and an equal chance of each of those values being produced for an input, you have the following:
Let A denote the number of different values already generated.
Chance of a collision is: A / X
Where X is the possible number of outputs the hashing algorithm can generate.
When X equals 256, you have a 1/256 chance of a collision the first time around. Then you have a 2/256 chance of a collision once a second distinct value has been generated. Eventually, once you have generated 255 different values, you have a 255/256 chance of a collision, and the next time around it obviously becomes 256/256, i.e. a certainty. It usually won't get that far: a collision will likely occur much more often than every 256 digests. In fact, the birthday paradox tells us that we can start to expect a collision after about 2^(N/2) message digest values have been generated - following our example, after we've created only about 16 distinct hashes. We also know it is guaranteed to happen within 257 hashes at most. Which isn't good!
What this means, on a mathematical level, is that the chance of a collision is inversely proportional to the possible number of outputs, which is why we need to increase the size of our message digest to a reasonable length.
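A quick simulation of the difference (Python; Buzhash itself is not reproduced here, so a truncated SHA-256 stands in as an arbitrary hash over 5000 made-up n-grams):

    import hashlib

    def trunc_hash(ngram: bytes, bits: int) -> int:
        # keep only the top `bits` bits of an arbitrary stand-in hash
        h = int.from_bytes(hashlib.sha256(ngram).digest(), "big")
        return h >> (256 - bits)

    ngrams = [("token%06d" % i).encode() for i in range(5000)]   # 5000 distinct n-grams

    for bits in (8, 31):
        distinct = {trunc_hash(g, bits) for g in ngrams}
        print("%2d bits: %4d n-grams lost to collisions" % (bits, len(ngrams) - len(distinct)))
    # 8 bits: ~4700 (only 256 values to go around); 31 bits: almost always 0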
A note on hashing algorithms
Collisions are completely unavoidable. This is because there is an astronomically large number of possible inputs (every possible sequence of character codes) and only a finite number of possible outputs (as demonstrated above).
If you have hash values of 8 bits, the total possible number of values is 256 - that means that if you hash 257 different n-grams there will for sure be at least one collision (and very likely you will get many more collisions, even with far fewer than 257 n-grams) - and this will happen regardless of the hashing algorithm or the data being hashed.
If you use 32 bits the total possible number of values is around 4 billion - and so the likelihood of a collision is much less.
'How does one choose the number of bits?': I guess it depends on the use of the hash. If it is used to store the n-grams in some kind of hashed data structure (a dictionary) then it should be related to the possible number of 'buckets' in the data structure - e.g. if the dictionary has fewer than 256 buckets then an 8-bit hash is OK.
See this for some background
If counting from 1 to X, where X is the first number to have an md5 collision with a previous number, what number is X?
I want to know if I'm using md5 for serial numbers, how many units I can expect to be able to enumerate before I get a collision.
Theoretically, you can expect collisions for X around 2^64. For a hash function with an output of n bits, first collisions appear when you have accumulated about 2^(n/2) outputs (it does not matter how you choose the inputs; sequential integer values are nothing special in that respect).
Of course, MD5 has been shown not to be a good hash function. Also, the 2^(n/2) is only an average. So, why don't you try it? Take an MD5 implementation, hash your serial numbers, and see if you get a collision. A basic MD5 implementation should be able to hash a few million values per second, and, with a reasonable hard disk, you could accumulate a few billion outputs, sort them, and see if there is a collision.
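A sketch of that experiment (Python; scaled down to a million serial numbers so it fits in memory - with full 128-bit MD5 digests you will not see a collision at this scale, the loop just shows the mechanics):

    import hashlib

    seen = set()
    for serial in range(1_000_000):          # scale up (with external sorting) as suggested above
        d = hashlib.md5(str(serial).encode()).digest()
        if d in seen:
            print("collision at serial", serial)
            break
        seen.add(d)
    else:
        print("no collision among", len(seen), "digests (as expected)")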
I can't answer your question, but what you are looking for is a UUID. UUID serial numbers can be unique for millions of products, but you might need to check a database to mitigate the tiny chance of a collision.
I believe no one has done such a test.
Besides, if you have a simple incremental number, you don't need to hash it.
As far as I know there are no known MD5 collisions within 2^32 inputs (the size of an integer).
It really depends on the size of your input: even an ideal hash function is forced into collisions once there are more possible inputs than possible hash values (roughly, once the inputs carry more bits of information than the hash length).
If your input is small, collisions are fairly unlikely; so far only a single one-block MD5 collision has been found.
I realize this is an old question but I stumbled upon it, found a much better approach, and figured I'd share it.
You have an upper bound for your ordinal number N, so let's take advantage of that. Let's say N < 2^32 ≈ 4.3 * 10^9. Now each time you need a new identifier, you just pick a random 32-bit number R and concatenate it with R xor N (zero-padding before concatenation). This yields a random-looking, unique 64-bit identifier which you could denote with just 16 hexadecimal digits.
This approach prevents collisions completely because two identifiers that happen to have the same random component necessarily have distinct xor-ed components.
Bonus feature: you can split such a 64-bit identifier into two 32-bit numbers and xor them with each other to recover the original ordinal number.
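A direct sketch of that scheme (Python; the function names are mine):

    import secrets

    def make_id(n: int) -> str:
        assert 0 <= n < 2**32
        r = secrets.randbits(32)
        return "%08x%08x" % (r, r ^ n)       # 16 hex digits: R then R xor N, zero-padded

    def recover(identifier: str) -> int:
        r = int(identifier[:8], 16)
        mixed = int(identifier[8:], 16)
        return r ^ mixed                     # xor the halves to get N back

    ident = make_id(123456789)
    print(ident, recover(ident))             # recovers 123456789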