How can I calculate the impact on collision probability when truncating a hash? - hash

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.

You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).

#mypetition's answer is great.
I found a few other equations that are more-or-less accurate and/or simplified here, along with a great explanation and a handy comparison of real-world probabilities:
1−e^((−k(k−1))/2N) - sample plot here
(k(k-1))/2N - sample plot here
k^2/2N - sample plot here
...where k is the number of ID's you'll be generating (the "messages") and N is the largest number that can be produced by the hash digest or the largest number that your truncated hexadecimal number could represent (technically + 1, to account for 0).
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars), and you truncate it down to, say, 12 hex chars (e.g.: "38BF05A71DDF"), the largest number you could produce in hexadecimal is "0xFFFFFFFFFFFF" (281474976710655 - which is 16^12-1 (or 256^6 if you prefer to think in terms of bytes). But since "0" itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).

Related

How can I find the average length of a codeword encoded in Huffman if there are N(10 or more) symbols?

I'm practicing for an exam and I found a problem which asks to find the average length of codewords which are encoded in Huffman.
This usually wouldn't be hard, but in this problem we have to encode 100 symbols which all have the same probability (1/100).
Since there is obviously no point in trying to encode 100 symbols by hand I was wondering if there is a method to find out the average length without actually going through the process of encoding.
I'm guessing this is possible since all the probabilities are equal, however I couldn't find anything online.
Any help is appreciated!
For 100 symbols with equal probability, some will be encoded with six bits, some with seven bits. A Huffman code is a complete prefix code. "Complete" means that all possible bits patterns are used.
Let's say that i codes are six bits long and j codes are seven bits long. We know that i + j = 100. There are 64 possible six-bit codes, so after i get used up, there are 64 - i left. Adding one bit to each of those to make them seven bits long doubles the number of possible codes. So now we can have up to 2(64 - i) seven-bit codes.
For the code to be complete, all of those codes must be used, so j = 2(64 - i). We now have two equations in two unknowns. We get i = 28 and j = 72.
Since all symbols are equally probable, the average number of bits used per symbol is (28x6 + 72x7) / 100, which is 6.72. Not too bad, considering the entropy of each symbol is 6.64 bits.

Design for max hash size given N-digit numerical input and collision related target

Assume a hacker obtains a data set of stored hashes, salts, pepper, and algorithm and has access to unlimited computing resources. I wish to determine a max hash size so that the certainty of determining the original input string is nominally equal to some target certainty percentage.
Constraints:
The input string is limited to exactly 8 numeric characters
uniformly distributed. There is no inter-digit relation such as a
checksum digit.
The target nominal certainty percentage is 1%.
Assume the hashing function is uniform.
What is the maximum hash size in bytes so there are nominally 100 (i.e. 1% certainty) 8-digit values that will compute to the same hash? It should be possible to generalize to N numerical digits and X% from the accepted answer.
Please include whether there are any issues with using the first N bytes of the standard 20 byte SHA1 as an acceptable implementation.
It is recognized that this approach will greatly increase susceptibility to a brute force attack by increasing the possible "correct" answers so there is a design trade off and some additional measures may be required (time delays, multiple validation stages, etc).
It appears you want to ensure collisions, with the idea that if a hacker obtained everything, such that it's assumed they can brute force all the hashed values, then they will not end up with the original values, but only a set of possible original values for each hashed value.
You could achieve this by executing a precursor step before your normal cryptographic hashing. This precursor step simply folds your set of possible values to a smaller set of possible values. This can be accomplished by a variety of means. Basically, you are applying an initial hash function over your input values. Using modulo arithmetic as described below is a simple variety of hash function. But other types of hash functions could be used.
If you have 8 digit original strings, there are 100,000,000 possible values: 00000000 - 99999999. To ensure that 100 original values hash to the same thing, you just need to map them to a space of 1,000,000 values. The simplest way to do that would be convert your strings to integers, perform a modulo 1,000,000 operation and convert back to a string. Having done that the following values would hash to the same bucket:
00000000, 01000000, 02000000, ....
The problem with that is that the hacker would not only know what 100 values a hashed value could be, but they would know with surety what 6 of the 8 digits are. If the real life variability of digits in the actual values being hashed is not uniform over all positions, then the hacker could use that to get around what you're trying to do.
Because of that, it would be better to choose your modulo value such that the full range of digits are represented fairly evenly for every character position within the set of values that map to the same hashed value.
If different regions of the original string have more variability than other regions, then you would want to adjust for that, since the static regions are easier to just guess anyway. The part the hacker would want is the highly variable part they can't guess. By breaking the 8 digits into regions, you can perform this pre-hash separately on each region, with your modulo values chosen to vary the degree of collisions per region.
As an example you could break the 8 digits thus 000-000-00. The prehash would convert each region into a separate value, perform a modulo, on each, concatenate them back into an 8 digit string, and then do the normal hashing on that. In this example, given the input of "12345678", you would do 123 % 139, 456 % 149, and 78 % 47 which produces 123 009 31. There are 139*149*47 = 973,417 possible results from this pre-hash. So, there will be roughly 103 original values that will map to each output value. To give an idea of how this ends up working, the following 3 digit original values in the first region would map to the same value of 000: 000, 139, 278, 417, 556, 695, 834, 973. I made this up on the fly as an example, so I'm not specifically recommending these choices of regions and modulo values.
If the hacker got everything, including source code, and brute forced all, he would end up with the values produced by the pre-hash. So for any particular hashed value, he would know that that it is one of around 100 possible values. He would know all those possible values, but he wouldn't know which of those was THE original value that produced the hashed value.
You should think hard before going this route. I'm wary of anything that departs from standard, accepted cryptographic recommendations.

Understanding cyclic polynomial hash collisions

I have a code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If i use small hash values (7-8 bits) then there are some collisions i.e. different n-grams map to the same hash value. If i increase the bits in the hash value to say 31, then there are 0 collisions - all ngrams map to different hash values.
I want to know why this is so? Do the collisions depend on the number of n-grams in the text or the number of different characters that an n-gram can have or is it the size of an n-gram?
How does one choose the number of bits for the hash value when hashing n-grams (using rolling hashes)?
How Length effects Collisions
This is simply a question of permutations.
If i use small hash values (7-8 bits) then there are some collisions
Well, let's analyse this. With 8 bits, there are 2^8 possible binary sequences that can be generated for any given input. That is 256 possible hash values that can be generated, which means that in theory, every 256 message digest values generated guarantee a collision. This is called the birthday problem.
If i increase the bits in the hash value to say 31, then there are 0 collisions - all ngrams map to different hash values.
Well, let's apply the same logic. With 31 bit precision, we have 2^31 possible combinations. That is 2147483648 possible combinations. And we can generalise this to:
Let N denote the amount of bits we use.
Amount of different hash values we can generate (X) = 2^N
Assuming repetition of values is allowed (which it is in this case!)
This is an exponential growth, which is why with 8 bits, you found a lot of collisions and with 31 bits, you've found very little collisions.
How does this effect collisions?
Well, with a very small amount of values, and an equal chance for each of those values being mapped to an input, you have it that:
Let A denote the number of different values already generated.
Chance of a collision is: A / X
Where X is the possible number of outputs the hashing algorithm can generate.
When X equals 256, you have a 1/256 chance of a collision, the first time around. Then you have a 2/256 chance of a collision when a different value is generated. Until eventually, you have generated 255 different values and you have a 255/256 chance of a collision. The next time around, obviously it becomes a 256/256 chance, or 1, which is a probabilistic certainty. Obviously it usually won't reach this point. A collision will likely occur a lot more than every 256 cycles. In fact, the Birthday paradox tells us that we can start to expect a collision after 2^N/2 message digest values have been generated. So following our example, that's after we've created 16 unique hashes. We do know, however, that it has to happen, at minimum, every 256 cycles. Which isn't good!
What this means, on a mathematical level, is that the chance of a collision is inversely proportional to the possible number of outputs, which is why we need to increase the size of our message digest to a reasonable length.
A note on hashing algorithms
Collisions are completely unavoidable. This is because, there are an extremely large number of possible inputs (2^All possible character codes), and a finite number of possible outputs (as demonstrated above).
If you have hash values of 8 bits the total possible number of values is 256 - that means that if you hash 257 different n-grams there will be for sure at least one collision (...and very likely you will get many more collisions, even with less that 257 n-grams) - and this will happen regardless of the hashing algorithm or the data being hashed.
If you use 32 bits the total possible number of values is around 4 billion - and so the likelihood of a collision is much less.
'How does one choose the number of bits': I guess depends on the use of the hash. If it is used to store the n-grams in some kind of hashed data structure (a dictionary) then it should be related to the possible number of 'buckets' of the data structure - e.g. if the dictionary has less than 256 buckets that a 8 bit hash is OK.
See this for some background

md5 hash collisions.

If counting from 1 to X, where X is the first number to have an md5 collision with a previous number, what number is X?
I want to know if I'm using md5 for serial numbers, how many units I can expect to be able to enumerate before I get a collision.
Theoretically, you can expect collisions for X around 264. For a hash function with an output of n bits, first collisions appear when you have accumulated about 2n/2 outputs (it does not matter how you choose the inputs; sequential integer values are nothing special in that respect).
Of course, MD5 has been shown not to be a good hash function. Also, the 2n/2 is only an average. So, why don't you try it ? Take a MD5 implementation, hash your serial numbers, and see if you get a collision. A basic MD5 implementation should be able to hash a few million values per second, and, with a reasonable hard disk, you could accumulate a few billions of outputs, sort them, and see if there is a collision.
I can't answer your question, but what you are looking for is a uuid. UUID serial numbers can be unique for millions of products, but you might need to check a database to mitigate the tiny chance of a collision.
I believe no one has done some test on this
Considering that if you have a simple incremental number you don't need to hash it
As far as i know there are no known collisions in md5 for 2^32 (size of an integer)
It really depends on the size of your input. A perfect hash function has collisions every (input_length / hash_length) hashes.
If your input is small collisions are fairly unlikely, so far there has only been a single one-block collision.
I realize this is an old question but I stumbled upon it, found a much better approach, and figured I'd share it.
You have an upper boundary for your ordinal number N so let's take advantage of that. Let's say N < 232 ≈ 4.3*1010. Now each time you need a new identifier you just pick a random 32-bit number R and concatenate it with R xor N (zero-pad before concatenation). This yields a random looking unique 64-bit identifier which you could denote with just 16 hexadecimal digits.
This approach prevents collisions completely because two identifiers that happen to have the same random component necessarily have distinct xor-ed components.
Bonus feature: you can split such a 64-bit identifier into two 32-bit numbers and xor them with each other to recover the original ordinal number.

Why are 5381 and 33 so important in the djb2 algorithm?

The djb2 algorithm has a hash function for strings.
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
Why are 5381 and 33 so important?
This hash function is similar to a Linear Congruential Generator (LCG - a simple class of functions that generate a series of psuedo-random numbers), which generally has the form:
X = (a * X) + c; // "mod M", where M = 2^32 or 2^64 typically
Note the similarity to the djb2 hash function... a=33, M=2^32. In order for an LCG to have a "full period" (i.e. as random as it can be), a must have certain properties:
a-1 is divisible by all prime factors of M (a-1 is 32, which is divisible by 2, the only prime factor of 2^32)
a-1 is a multiple of 4 if M is a multiple of 4 (yes and yes)
In addition, c and M are supposed to be relatively prime (which will be true for odd values of c).
So as you can see, this hash function somewhat resembles a good LCG. And when it comes to hash functions, you want one that produces a "random" distribution of hash values given a realistic set of input strings.
As for why this hash function is good for strings, I think it has a good balance of being extremely fast, while providing a reasonable distribution of hash values. But I've seen many other hash functions which claim to have much better output characteristics, but involved many more lines of code. For instance see this page about hash functions
EDIT: This good answer explains why 33 and 5381 were chosen for practical reasons.
33 was chosen because:
1) As stated before, multiplication is easy to compute using shift and add.
2) As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.
3) The shift of 5 is relatively prime to 32 (the number of bits in the register), which helps with avalanching. While there are enough characters left in the string, each bit of an input byte will eventually interact with every preceding bit of input.
4) The shift of 5 is a good shift amount when considering ASCII character data. An ASCII character can sort of be thought of as a 4-bit character type selector and a 4-bit character-of-type selector. E.g. the digits all have 0x3 in the first 4 bits. So an 8-bit shift would cause bits with a certain meaning to mostly interact with other bits that have the same meaning. A 4-bit or 2-bit shift would similarly produce strong interactions between like-minded bits. The 5-bit shift causes many of the four low order bits of a character to strongly interact with many of the 4-upper bits in the same character.
As stated elsewhere, the choice of 5381 isn't too important and many other choices should work as well here.
This is not a fast hash function since it processes it's input a character at a time and doesn't try to use instruction level parallelism. It is, however, easy to write. Quality of the output divided by ease of writing the code is likely to hit a sweet spot.
On modern processors, multiplication is much faster than it was when this algorithm was developed and other multiplication factors (e.g. 2^13 + 2^5 + 1) may have similar performance, slightly better output, and be slightly easier to write.
Contrary to an answer above, a good non-cryptographic hash function doesn't want to produce a random output. Instead, given two inputs that are nearly identical, it wants to produce widely different outputs. If you're input values are randomly distributed, you don't need a good hash function, you can just use an arbitrary set of bits from your input. Some of the modern hash functions (Jenkins 3, Murmur, probably CityHash) produce a better distribution of outputs than random given inputs that are highly similar.
On 5381, Dan Bernstein (djb2) says in this article:
[...] practically any good multiplier works. I think you're worrying
about the fact that 31c + d doesn't cover any reasonable range of hash
values if c and d are between 0 and 255. That's why, when I discovered
the 33 hash function and started using it in my compressors, I started
with a hash value of 5381. I think you'll find that this does just as
well as a 261 multiplier.
The whole thread is here if you're interested.
Ozan Yigit has a page on hash functions which says:
[...] the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Maybe because 33 == 2^5 + 1 and many hashing algorithms use 2^n + 1 as their multiplier?
Credit to Jerome Berger
Update:
This seems to be borne out by the current version of the software package djb2 originally came from: cdb
The notes I linked to describe the heart of the hashing algorithm as using h = ((h << 5) + h) ^ c to do the hashing... x << 5 is a fast hardware way to use 2^5 as the multiplier.