What is the maximum number of SHA-1 hashes?

Clearly, since SHA-1 hashing produces 40 hexadecimal characters each time, there is a finite number of possible hashes. Does anyone know exactly how many?

SHA-1 hashes have 160 bits, so there are 2^160 of them.
(2^160 = 1461501637330902918203684832716283019655932542976 ~= 1.46 x 10^48)
Note that since you have a much larger message space than possible hashes, collisions are bound to occur.
Also note that the probability of collision is much higher than you might think. At around 2^80 messages the probability of a collision reaches 50%, thanks to the birthday paradox (i.e. with just 23 people, the probability that two of them share a birthday exceeds 50%).
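To make those numbers concrete, here is a short Python sketch (standard library only) that prints 2^160 and evaluates the usual birthday approximation:

import math

total = 2 ** 160
print(total)           # 1461501637330902918203684832716283019655932542976
print(f"{total:.2e}")  # ~1.46e+48

# Birthday approximation: p(collision) ~ 1 - exp(-k^2 / 2^(bits+1))
k = 2 ** 80
p = 1 - math.exp(-k * k / 2 ** 161)
print(round(p, 2))     # 0.39 at exactly 2^80; k ~ 1.18 * 2^80 reaches 50%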

SHA-1 produces 160-bit outputs, and it should be able to produce just about any sequence of 160 bits. There are 2^160 such sequences, i.e. close to 1461 billion billion billion billion billions. That's kind of big.
However, we have no proof that every single one of them is reachable. It would be bad for SHA-1 security if the number of possible outputs were significantly lower than 2^160; for instance, if only 1/4 of them were reachable (2^158), security against preimage attacks would be divided by 4, and security against collisions would be halved. No such issue is currently known with SHA-1 (there are known weaknesses of SHA-1 when it comes to resistance to collisions, but not that one).
It is possible (but it would be at least mildly surprising) that a few 160-bit outputs cannot be reached. It is expected that this will remain unknowable. To some extent, being able to prove that SHA-1's possible outputs cover the whole 160-bit space would be worrisome: such a proof would require a good deal of analysis of the mathematical structure of SHA-1, and the security of SHA-1 largely relies on such an analysis being intractable.

A SHA-1 hash is made up of five 32-bit integers.
That's 4294967296^5, or 2^160,
or 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 possibilities.
To put that into perspective:
Total possible SHA-1 values: 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976
Total gallons of water on Earth: 365,904,000,000,000,000,000
That includes every ocean, sea, lake, swimming pool, bathtub, etc.
The possibility of collisions is only theoretical at this point; no one has yet reported finding one.
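For the curious, comparing the two figures quoted above takes only a couple of lines of Python:

sha1_values = 2 ** 160
gallons = 365_904_000_000_000_000_000  # gallons of water on Earth, as quoted
print(sha1_values // gallons)          # ~4 * 10^27 possible hashes per gallon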

Related

Does halving a SHA-224 hash by combining every two characters into one byte introduce a higher collision risk?

Let's say I have strings that need not be reversible, and let's say I use SHA-224 to hash them.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b: 56 hexadecimal characters, encoding a 28-byte (224-bit) digest.
What if I convert every two characters into a single byte by adding their character codes?
In Python I'd do something like this:
shalist = "2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b"
# Add the character codes of each pair of hex digits to form one byte.
halved = "".join(chr(ord(a) + ord(b)) for a, b in zip(shalist[0::2], shalist[1::2]))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94: 28 bytes, effectively halving the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. Those 56 hex characters encode a 224-bit (28-byte) hash; halving it to 112 bits multiplies the chance of collision by 2^112, which still leaves the chance of any given pair colliding at about 1 in 2^112. (Strictly speaking, adding character codes as in your snippet throws away even more information than a clean truncation, since different digit pairs can yield the same sum, e.g. '0'+'3' and '1'+'2'.)
Your use of that truncation can still be perfectly legitimate, depending on what it is for. Git, for example, shows only the first few bytes of a commit hash, and for most practical purposes the short form works fine.
A "perfect" hash should retain a proportional number of "effective" bits when you truncate it. For example, 32 bits of a SHA-256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes, while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system secure; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of the full 28 bytes. That hash has 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve it to 1 byte and you are guaranteed a collision within 257 distinct strings, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)), at 1 in 256.
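Here is a quick Python sketch of that shrunken example, truncating SHA-224 digests to a single byte (the counter inputs are just an arbitrary illustration):

import hashlib
from itertools import count

# Keep only the first byte of each SHA-224 digest: a collision is
# guaranteed within 257 distinct inputs, and the birthday effect
# usually produces one after only about 20.
seen = {}
for i in count():
    h = hashlib.sha224(str(i).encode()).digest()[:1]
    if h in seen:
        print(f"collision after {i + 1} hashes: {seen[h]} vs {i}")
        break
    seen[h] = i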
Long hashes are used to make brute-forcing them truly impractical, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough for certificate signing; by 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm), the more effective bits remain to keep the hash secure for all practical purposes.

Choosing a hash function

I was wondering: what is the maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For MD5, SHA-*, maybe even CRC32 or Adler-32.
Your question isn't clear. By "maximum number of bytes" do you mean "maximum number of items"? The size of the files being hashed has no relation to the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will always have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, at 128 bits, should stay within that range up to about 820 billion documents, even if its possible outputs are many more.
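As a sanity check on that 820 billion figure, a small Python computation can invert the birthday approximation p ~ k^2 / 2^(n+1), with n = 128 for MD5 and p = 10^-15 (the disk error rate quoted above):

import math

n, p = 128, 1e-15
k = math.sqrt(p * 2 ** (n + 1))   # solve p ~ k^2 / 2^(n+1) for k
print(f"{k:.2e}")                 # ~8.25e+11, i.e. roughly 820 billion documents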
This assumes a hash function that outputs a uniform distribution, which you may take for granted, given enough items to be hashed, for cryptographic hash functions (like MD5 and SHA) or good non-cryptographic hashes (like MurmurHash3, Jenkins, CityHash, and SpookyHash).
It also assumes no malevolent adversary actively fabricating collisions; against one, you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, not to minimize expected collisions. They have properties like "detects all bit-zeroing bursts of size < X or > Y for inputs up to Z kbytes", but not as good statistical properties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5 kB and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA-1 or SHA-2 to date, for example).
You are basically looking at the birthday paradox, just with really big numbers.
Given a normal 'distribution' of your data, you can go to roughly the square root of the number of possibilities (the birthday bound) before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

Can you show me two actual, non-trivial strings that produce the same MD5 or SHA1 hash?

...and if not, why not?
So here's the question behind the question.
I understand that the likelihood of accidental collisions in MD5 and SHA1 is small (though less likely in SHA1 than in MD5). I also understand that deliberate collisions are theoretically possible. Is it practically possible? Could I go through some process to deliberately generate two messages with the same hash, in either of these algorithms? What process would I go through?
Collisions necessarily exist for a given hash function, in a mathematical sense: there are more possible inputs than possible outputs, so there must be two inputs which map to the same output. Now proving the existence of a collision, and actually finding one, are two different things. If I drop a diamond in the middle of the ocean, I positively know that there is now a diamond somewhere in the ocean -- but I am quite at a loss if I want to recover it.
For a "generic" hash function with an output of n bits, there are generic methods to find a collision, with average cost 2n/2 evaluations of the function (see this page). Depending on n, this can range from the easy to the totally unfeasible. MD5 has an output of 128 bits, and 264 is "quite high": you can do it, but it will require a few thousands of machines and months of computations.
Now there are known weaknesses in MD5, i.e. some internal structure which can be exploited to produce collisions much more easily. The best attack on MD5 known so far requires a bit less than 2^21 function invocations, so this is a matter of a few seconds (at most) on a basic PC. Omri, in his response, points to a great example of an MD5 collision, in which the colliding messages are actually executable files with widely different behaviors.
For SHA-1, the output has size 160 bits. This means that a generic collision attack has cost about 2^80, which is not attainable with existing technology (well, Mankind could do it, but certainly not discreetly: it should be doable with, say, the equivalent of one year of budget for the whole US Army). However, SHA-1, like MD5, has known weaknesses. Right now, these weaknesses are still theoretical, in that they lead to a collision attack with cost 2^61, which is too expensive for any single crypto research lab, and thus has not been fully conducted yet (there was an announced attack with cost 2^51 but it seems that it was a dud -- the analysis was flawed). So there is no actual collision to show (but researchers are pretty sure that the 2^61 attack is correct and would work, if someone found the budget).
With SHA-256, there is no known weakness, and the 256-bit output size implies a generic cost of 2^128, far beyond what is feasible with today's and tomorrow's technology.

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
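As a rough check of that intuition, the expected number of colliding pairs among 2^96 random 256-bit values can be estimated in two lines of Python:

# E ~ C(2^96, 2) / 2^256 ~ 2^191 / 2^256 = 2^-65 expected colliding pairs
log2_expected = (2 * 96 - 1) - 256
print(log2_expected, 2.0 ** log2_expected)  # -65, i.e. about 2.7e-20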
Just to illustrate how low the risks of collision are with SHA-256: consider your risk of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is on its path, out of a human population of 6.5 billion, then the risk of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take ten thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep a loaded shotgun with you at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. The SHA-256 algorithm is effectively a random mapping, and the collision probability doesn't depend on the input length. Even a 1-bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (a multiple of 1024 bits for SHA-512). Taking a 12-byte input (as Thomas used in his example), with SHA-256 there are 2^96 possible inputs, each padded to a distinct sequence of length 64 bytes.
As an example, the 12-byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits, followed by the 64-bit representation of the input length in bits, which is 0x0000000000000060, to form a 512-bit padded message. This 512-bit message is used as the input for computing the hash.
More details can be found in RFC 4634, "US Secure Hash Algorithms (SHA and HMAC-SHA)": http://www.ietf.org/rfc/rfc4634.txt
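For illustration, here is a Python sketch that builds that padded block by hand; hashlib performs the same padding internally:

import hashlib

msg = b"Hello There!"           # 12 bytes = 96 bits
bit_len = len(msg) * 8          # 96 = 0x60

# Append a one bit (0x80), then zero bytes, then the 64-bit
# big-endian message length, reaching a multiple of 64 bytes.
padded = msg + b"\x80" + b"\x00" * ((56 - len(msg) - 1) % 64)
padded += bit_len.to_bytes(8, "big")

print(len(padded))                      # 64, i.e. a single 512-bit block
print(padded.hex()[-16:])               # 0000000000000060
print(hashlib.sha256(msg).hexdigest())  # hashlib pads the same way internally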
No, the message length does not affect the likelihood of a collision.
If that were the case, the algorithm would be broken.
You can try it for yourself by running SHA against all one-byte inputs, then against all two-byte inputs, and so on, and seeing whether you get a collision. You probably won't, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia).
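That experiment is small enough to actually run; a Python sketch covering all 1-byte and 2-byte inputs (65,792 hashes in total):

import hashlib

# Hash every 1-byte and 2-byte input and look for duplicate digests.
inputs = [bytes([a]) for a in range(256)]
inputs += [bytes([a, b]) for a in range(256) for b in range(256)]
digests = {hashlib.sha256(m).digest() for m in inputs}
print(len(inputs), len(digests))  # 65792 65792 -- no collision found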
The hash is 256 bits long; among inputs longer than 256 bits, collisions necessarily exist.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2^256 possible outputs there are a lot of different hashes, but they are not collision-free; that is impossible.
It depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed probably won't amount to much, but the chances of that data being in precomputed hash tables are pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.
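A minimal Python sketch of that small-domain attack, assuming a hypothetical two-value protocol: the observer precomputes the entire table and reverses any hash seen on the wire:

import hashlib

# Hypothetical two-value domain: precompute every possible hash.
table = {hashlib.sha256(m).hexdigest(): m for m in (b"YES", b"NO")}

observed = hashlib.sha256(b"YES").hexdigest()  # as seen on the wire
print(table[observed])                         # b'YES' -- hash reversed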

Is it safe to ignore the possibility of SHA collisions in practice?

Let's say we have a billion unique images, one megabyte each.
We calculate the SHA-256 hash for the contents of each file.
The possibility of collision depends on:
the number of files
the size of the single file
How far can we go ignoring this possibility, assuming it is zero?
The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.
If we have a "perfect" hash function with output size n, and we have p messages to hash (individual message length is not important), then probability of collision is about p2/2n+1 (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2n/2). For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability is about 4.3*10-60.
A mass-murderer space rock happens about once every 30 million years on average. This puts the probability of such an event occurring in the next second at about 10^-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary, then your priorities are wrong.
In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.
Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.
The possibility of a collision does not depend on the size of the files, only on their number.
This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all hard disks ever produced on Earth can't hold enough 1 MB files to get a likelihood of a collision of even 0.01% for SHA-256.
Basically, you can simply ignore the possibility.
Edit: if (some of) the files are potentially provided or manipulated by an adversary who could profit from provoking a collision, then the above of course only holds true as long as the hash algorithm is cryptographically strong without any known attacks.
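To put a number on that, inverting the birthday approximation for a 0.01% collision chance with SHA-256 (a rough Python estimate):

import math

n, p = 256, 1e-4
k = math.sqrt(p * 2 ** (n + 1))   # files needed for p ~ k^2 / 2^(n+1)
print(f"{k:.1e}")  # ~4.8e+36 one-megabyte files, far more than all storage ever built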
First of all, it is not zero, but very close to zero.
The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dire and so you can ignore the collision possibility.
Also don't forget that your software (or a tiny part of it) might be deployed and simultaneously used on a gazillion computers (including the tiny embedded microcomputers that are almost everywhere nowadays). In that case you need to multiply the estimate you've got by the largest possible number of copies.