Is it safe to ignore the possibility of SHA collisions in practice?

Let's say we have a billion unique images, one megabyte each.
We calculate the SHA-256 hash for the contents of each file.
The possibility of collision depends on:
the number of files
the size of each individual file
How far can we go in treating this possibility as if it were zero?

The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.
If we have a "perfect" hash function with an n-bit output, and we have p messages to hash (individual message length is not important), then the probability of a collision is about p^2 / 2^(n+1) (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2^(n/2)). For instance, with SHA-256 (n = 256) and one billion messages (p = 10^9), the probability is about 4.3*10^-60.
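For anyone who wants to check that arithmetic, here is a minimal sketch in Python (the helper name is mine, not part of the answer):

    # Approximate birthday bound: P(collision) ~= p^2 / 2^(n+1),
    # valid while p is much smaller than 2^(n/2).
    def collision_probability(p: float, n: int) -> float:
        return p * p / 2 ** (n + 1)

    print(collision_probability(1e9, 256))  # ~4.3e-60, the SHA-256 figure above
    print(collision_probability(1e9, 128))  # ~1.5e-21 for a 128-bit hash, for comparison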
A mass-murderer space rock happens about once every 30 million years on average. This puts the probability of such an event occurring in the next second at about 10^-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.
In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.
Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.

The possibility of a collision does not depend on the size of the files, only on their number.
This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all the hard disks ever produced on Earth can't hold enough 1MB files to get a likelihood of a collision of even 0.01% for SHA-256.
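Running those numbers, under the assumption of a uniformly distributed hash (the worldwide-storage figure below is a rough, generous guess for illustration, not a sourced number):

    # How many distinct 1 MB files are needed before SHA-256 reaches a 0.01%
    # collision probability? Uses the birthday bound n ~= sqrt(2*d*ln(1/(1-p))).
    from math import log, sqrt

    d = 2.0 ** 256                            # possible SHA-256 outputs
    p = 0.0001                                # target probability: 0.01%
    n = sqrt(2 * d * log(1 / (1 - p)))
    print(f"files needed: {n:.3g}")           # ~4.8e36 files
    print(f"bytes needed: {n * 1e6:.3g}")     # ~4.8e42 bytes at 1 MB per file
    # All storage ever manufactured is (very generously) on the order of
    # 1e23 bytes -- roughly twenty orders of magnitude short.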
Basically, you can simply ignore the possibility.
Edit: if (some of) the files are potentially provided or manipulated by an adversary who could profit from provoking a collision, then the above of course only holds true as long as the hash algorithm is cryptographically strong without any known attacks.

First of all, it is not zero, but very close to zero.
The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dire and so you can ignore the collision possibility.
Also don't forget that your software (or a tiny part of it) might be deployed and simultaneously used on a gazillion computers (including the tiny embedded microcomputers that are almost everywhere nowadays). In that case you need to multiply the estimate you've got by the largest possible number of copies.

Related

Choosing a hash function

I was wondering: what are maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For md5, sha-*, maybe even crc32 or adler32.
Your question isn't clear. By "maximum number of bytes" do you mean "maximum number of items"? The size of the files being hashed has no bearing on the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will always have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function that outputs a uniform distribution. You can assume that distribution if you use cryptographic hash functions (like MD5 and SHA) or good non-cryptographic hashes (like Murmur3, Jenkins, City, and Spooky Hash).
It also assumes no malevolent adversary actively fabricating collisions; against one of those, you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, not to minimize expected collisions. They have properties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical properties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5 kB and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA hashes to date, for example).
You are basically looking at the birthday paradox, only with really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the number of possible outputs before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low the risks of collision are with SHA-256: consider your risk of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap in its path, out of a human population of 6.5 billion, then the risk of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take ten thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep with you a loaded shotgun at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. The SHA-256 algorithm is effectively a random mapping, and the collision probability doesn't depend on input length. Even a 1-bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (multiple of 1024 for SHA-512). Taking a 12 byte input (as Thomas used in his example), when using SHA-256, there are 2^96 possible sequences of length 64 bytes.
As an example, a 12 byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits followed by the 64 bit representation of the length of the input in bits which is 0x0000000000000060 to form a 512 bit padded message. This 512 bit message is used as the input for computing the hash.
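A small sketch in Python of that padding layout (hashlib performs this padding internally, of course; the manual construction below is just for illustration):

    import hashlib

    msg = b"Hello There!"                       # 12 bytes = 96 bits
    bit_len = len(msg) * 8                      # 96 = 0x60

    # Append a single '1' bit (as the byte 0x80), then zero bits until the
    # length is 56 bytes mod 64, then the 64-bit big-endian message length.
    padded = msg + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += bit_len.to_bytes(8, "big")

    assert len(padded) * 8 == 512               # exactly one 512-bit block
    print(padded.hex())                         # ...6521, 0x80, zeros, 0000000000000060
    print(hashlib.sha256(msg).hexdigest())      # hash of the original message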
More details can be found in RFC 4634 "US Secure Hash Algorithms (SHA and HMAC-SHA)", http://www.ietf.org/rfc/rfc4634.txt
No, message length does not affect the likelihood of a collision.
If it did, the algorithm would be broken.
You can try it for yourself by running SHA against all one-byte inputs, then against all two-byte inputs and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia).
The hash is 256 bits long, so there must be collisions for anything longer than 256 bits.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2^256 possible outputs there are a lot of different hashes, but they are not collision free; that is impossible.
Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed still won't amount to much, but the chances of that data being in precomputed hash tables are pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
It doesn't make sense to combine them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it cannot get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
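As a rough illustration of that idea in Python (zlib provides both checksums; whether the wider space actually helps depends on your data):

    import zlib

    def combined64(data: bytes) -> int:
        # Adler-32 in the high 32 bits, CRC-32 in the low 32 bits.
        return (zlib.adler32(data) << 32) | zlib.crc32(data)

    print(hex(combined64(b"http://example.com/some/url")))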
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programming-related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you should expect collisions once you approach 2^32 URLs; more precisely, for n URLs the probability of a collision is roughly n^2 / 2^65. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
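A minimal sketch in Python of both 64-bit derivations mentioned here, plain truncation and XOR-folding (either result fits a 64-bit unsigned integer column such as MySQL's BIGINT UNSIGNED):

    import hashlib

    def md5_trunc64(url: str) -> int:
        """First 64 bits of the MD5 digest."""
        return int.from_bytes(hashlib.md5(url.encode("utf-8")).digest()[:8], "big")

    def md5_fold64(url: str) -> int:
        """XOR of the high and low 64-bit halves of the MD5 digest."""
        d = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(d[:8], "big") ^ int.from_bytes(d[8:], "big")

    print(hex(md5_trunc64("https://example.com/")))
    print(hex(md5_fold64("https://example.com/")))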
You have tagged this as "birthday-paradox", I think you know the answer already.
P(collision) = 1 - (2^64)! / ((2^64)^n * (2^64 - n)!)
where n is 1 billion in your case.
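That factorial form can't be evaluated directly for a 2^64-sized space, but the usual approximation 1 - exp(-n(n-1)/2^65) can; a quick sketch in Python:

    from math import exp

    def p_collision_64bit(n: float) -> float:
        # Birthday approximation for a 64-bit hash space.
        return 1 - exp(-n * (n - 1) / 2 ** 65)

    print(p_collision_64bit(1e9))   # ~0.027, about a 2.7% chance for 1 billion URLs
    print(p_collision_64bit(1e8))   # ~0.00027 for 100 million URLs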
You will be a bit better off using something other than MD5, because MD5 has practical collision problems.
From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URLs that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
If you have 2^n hash possibilities, there's roughly a 50% chance of collision once you have about 2^(n/2) items.
E.g. if your hash is 64 bits, you have 2^64 hash possibilities, so you'd have about a 50% chance of collision if you have 2^32 items in a collection.
Just by using a hash, there is always a chance of collisions. And you don't know beforehand whether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. It's like throwing a die 10 or 100 times: what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and if collisions are acceptable, hashes are still the right way to go; find a 64-bit hashing algorithm instead of relying on "half an MD5" having a good distribution. (Though it probably does...)

Understanding sha-1 collision weakness

According to various sources, attacks looking for sha-1 collisions have been improved to 2^52 operations:
http://www.secureworks.com/research/blog/index.php/2009/6/3/sha-1-collision-attacks-now-252/
What I'd like to know is the implication of these discoveries for systems that are not under attack. Meaning if I hash random data, what are the statistical odds of a collision? Said another way, does the recent research indicate that a brute-force birthday attack has a higher chance of finding collisions than originally thought?
Some writeups, like the one above, say that obtaining a SHA-1 collision via brute force would require 2^80 operations. Most sources say that 2^80 is a theoretical number (I assume because no hash function is really distributed perfectly evenly over its digest space).
So are any of the announced sha1 collision weaknesses in the fundamental hash distribution? Or are the increased odds of collision only the result of guided mathematical attacks?
I realize that in the end it is just a game of odds, and that there is an infinitesimally small chance that your first and second messages will result in a collision. I also realize that even 2^52 is a really big number, but I still want to understand the implications for a system not under attack. So please don't answer with "don't worry about it".
Well, good hash functions are resistant to three different types of attack (as the article states).
The most important resistance in a practical sense is second pre-image resistance. This basically means: given a message M1 and Hash(M1) = H1, it is hard to find an M2 (different from M1) such that Hash(M2) = H1.
If someone found a way to do that efficiently, that would be bad. Further, a pre-image attack is not susceptible to the birthday paradox, since the message M1 is fixed for us.
This is not a pre-image or second pre-image attack, merely a collision finding attack.
To answer your question: no, a brute-force attack does NOT have a higher chance of finding collisions. What this means is that the naive brute-force method, combined with the researchers' methods, results in finding collisions after about 2^52 operations. A standard brute-force attack still takes 2^80.
The result announced in your link is an attack, a sequence of careful, algorithmically-chosen steps that generate collisions with greater probability than a random attack would. It is not a weakness in the hash function's distribution. Well, ok, it is, but not of the sort that makes a random attack likely to succeed in on the order of 2^52 operations.
If no one is trying to generate collisions in your hash outputs, this result does not affect you.
The key question is "Can the attacker modify both messages m1 and m2?". If so, the attacker needs to find m1, m2 such that hash(m1) = hash(m2). This is the birthday attack, and the complexity reduces significantly -- it becomes a square root. If the hash output is 128 bits (MD5), the complexity is 2^64, well within reach of current computing power.
The usual example given is that the seller asks his secretary to type the message "I will sell it for 10 million dollars". The scheming secretary creates 2 documents, one that says "I will sell it for 10 million dollars" and another that says "I will sell it for x million dollars", where x is much less than 10, and modifies both messages by adding spaces, capitalizing words etc., and adjusts x, until hash(m1) = hash(m2). Now, the secretary shows the correct message m1 to the seller, and he signs it using his private key, producing a signature s over hash(m1). The secretary switches the message and sends out (m2, s); since hash(m2) = hash(m1), the signature still verifies. Only the seller has access to his private key, and so he cannot repudiate it and say that he didn't sign the message.
For SHA1, which outputs 160 bits, the birthday attack reduces the complexity to 2^80. This should be safe for 30 years or more. New government regulations and 4G 3GPP specs are starting to require SHA-256.
But if, in your use case, the attacker cannot modify both of the messages (the preimage or second-preimage scenarios), then for SHA1 the complexity is 2^160. That should be safe for eternity, unless a non-brute-force attack is discovered.