Is there a limit on the message size for SHA-256?

When hashing a string, like a password, with SHA-256, is there a limit to the length of the string I am hashing? For example, is it only "safe" to hash strings that are smaller than 64 characters?

There is technically a limit, but it's quite large. The padding scheme used for SHA-256 requires that the size of the input (in bits) be expressed as a 64-bit number. Therefore, the maximum size is (2^64 - 1)/8 bytes, roughly 2^61 bytes or 2,097,152 tebibytes (about 2.3 exabytes).
That renders the limit almost entirely theoretical, not practical.
Most people don't have the storage for nearly that much data anyway, but even if they did, processing it all serially to produce a single hash would take an amount of time most would consider prohibitive.
A quick back-of-the-envelope calculation indicates that even with the fastest enterprise SSDs currently[1] listed on Tom's Hardware, striped 16 wide to improve bandwidth, just reading that quantity of data would still take about 220 years.
[1] As of April 2016.

There is no such limit, other than the maximum message size of 2^64 - 1 bits. SHA-2 is frequently used to generate hashes for executables, which tend to be much larger than a few dozen bytes.

The upper limit is given in the NIST standard FIPS 180-4. The reason for the upper limit is the padding scheme: the message length l is appended to the message during padding (Merkle-Damgård strengthening), a countermeasure against attacks that exploit artifacts of the Merkle-Damgård construction. In the words of the standard:
"Then append the 64-bit block that is equal to the number l expressed using a binary representation."
Therefore, by the NIST standard, the maximum message size that can be hashed with SHA-256 is 2^64-1 bits (approximately 2.3 exabytes; that is close to the lower range of the estimated capacity of the NSA's data center in Utah, so you don't need to worry).
NIST permits hashing the zero-length message, so the message length ranges from 0 to 2^64-1 bits.
If you need to hash messages larger than 2^64-1 bits, either use SHA-512, which has a 2^128-1 bit limit, or SHA-3, which has no length limit.
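In practice the limit never matters because the input is processed incrementally. Here is a minimal Python sketch (the file name is hypothetical) that streams an arbitrarily large file through hashlib's SHA-256 without ever holding it in memory:

import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    # Feed the file to SHA-256 one 1 MiB chunk at a time; hashlib keeps
    # only a fixed-size internal state, so memory use stays constant and
    # the input length is bounded only by the standard's 2^64-1 bits.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of_file("huge_disk_image.bin"))  # hypothetical file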

Related

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b, which is 56 hex characters (56 bytes as an ASCII string, encoding a 28-byte digest).
What if I convert every two chars to its numerical representation and make a single byte out of them?
In Python I'd do something like this:
shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
halved = ""
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    # add the character codes of each pair and keep a single character
    halved += chr(ord(first_byte) + ord(next_byte))
print(repr(halved))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94: 28 bytes, effectively halving the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by a factor of 2 for every bit that goes missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8), which still leaves the chance of any particular collision at 1:2^(28*8). (Strictly speaking, the 56 hex characters encode only a 28-byte digest, so decoding the hex properly would lose nothing; the character-code sums above, however, map different digit pairs to the same byte and do lose information.)
Your use of that truncation can still be perfectly legitimate, depending on what it is. Git, for example, shows only the first few bytes of a commit hash, and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system sound; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 257 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)), at 1:256.
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.
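To see the truncation effect empirically, here is a small Python sketch (the function name is just for illustration) that keeps only two bytes of SHA-224 and counts how many random inputs it takes to hit a collision; the birthday bound predicts one after roughly sqrt(2^16) = 256 attempts:

import hashlib
import os

def attempts_until_collision(truncate_to=2):
    # Hash random inputs, keep only `truncate_to` bytes, and stop when
    # two different inputs produce the same truncated value.
    seen = {}
    attempts = 0
    while True:
        attempts += 1
        msg = os.urandom(8)
        tag = hashlib.sha224(msg).digest()[:truncate_to]
        if tag in seen and seen[tag] != msg:
            return attempts
        seen[tag] = msg

print(attempts_until_collision())  # typically a few hundred for 16 bits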

SHA collision probability when removing bytes

I'm implementing some program which uses id's with variable length. These id's identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
However, I want to have an idea of how much will this increase the collisions. So this is what I got until now:
I found out that for a "perfect" hash we have the formula p^2 / 2^(n+1) describing the probability of collisions, where p is the number of messages and n is the size of the hash in bits. Here is where my problem starts. I'm assuming that after removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. Assuming this, I get:
5160^2 / 2^(192+1) ≈ 2.12x10^-51
where 5160 is the peak number of messages and 192 is the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" after removing some bytes?
If so, and since the probability is really small, which bytes should I remove? Most or least significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.
However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it, at least if all byte values are valid and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" after removing some bytes?
Pretty much. Truncating hashes should conserve all their important properties, obviously at the reduced security level corresponding to the smaller output size.
If so, and since the probability is really small, which bytes should I remove? Most or least significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant, SHA-224, which takes the first 28 bytes of a SHA-256 computation run with a different initial state.
My recommendation is to use SHA-256, keeping the first 24 bytes. Finding a collision then requires around 2^96 hash-function calls, which is currently infeasible even for extremely powerful attackers, and accidental collisions are essentially impossible.
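As a sketch of that recommendation (the identifiers are made up), truncating with Python's hashlib looks like this:

import hashlib

def broker_id(message_id: bytes) -> bytes:
    # Hash the variable-length id and keep the first 24 bytes (192 bits),
    # which still leaves a ~2^96 birthday bound for collisions.
    return hashlib.sha256(message_id).digest()[:24]

print(broker_id(b"some-long-variable-length-id").hex())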

What are some of the best hashing algorithms to use for data integrity and deduplication?

I'm trying to hash a large number of files with binary data inside of them in order to:
(1) check for corruption in the future, and
(2) eliminate duplicate files (which might have completely different names and other metadata).
I know about md5 and sha1 and their relatives, but my understanding is that these are designed for security and therefore are deliberately slow in order to reduce the efficacy of brute force attacks. In contrast, I want algorithms that run as fast as possible, while reducing collisions as much as possible.
Any suggestions?
You are mostly right. If your system does not have any adversary, using cryptographic hash functions is overkill given their security properties.
Collisions depend on the number of bits, b, of your hash function and the number of hash values, N, you expect to compute. Academic literature argues that this collision probability must be below the hardware error probability, so that a collision with the hash function is less likely than a hardware error while comparing the data byte by byte [ref1,ref2,ref3,ref4,ref5]. Hardware error probability is in the range of 2^-12 to 2^-15 [ref6]. If you expect to generate N = 2^q hash values, then your collision probability, which already takes into account the birthday paradox, is given by this equation:
P ≈ N^2 / 2^(b-1) = 2^(2q - b + 1)
The number of bits of your hash function is directly proportional to its computational complexity, so you are interested in finding a hash function with the minimum number of bits possible while keeping the collision probability at acceptable values.
Here's an example on how to make that analysis:
Let's say you have f=2^15 files;
The average size of each file lf is 2^20 bytes;
You intend to divide each file into chunks of average size lc equal to 2^10 bytes;
Each file will be divided into c=lf/lc=2^10 chunks;
You will then hash N = f*c = 2^25 objects, so q = 25.
From that equation the collision probability for several hash sizes is the following:
P(hash = 64 bits) = 2^(2*25 - 64 + 1) = 2^-13 (less than 2^-12)
P(hash = 128 bits) = 2^(2*25 - 128 + 1) = 2^-77 (far below 2^-12)
Now you just need to decide which non-cryptographic hash function of 64 or 128 bits to use, knowing that 64 bits is pretty close to the hardware error probability (but will be faster) and 128 bits is a much safer option (though slower).
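As a quick sanity check of those numbers, here is the equation above in Python (the function name is just for illustration):

def collision_prob_exponent(q, b):
    # P ~ 2^(2q - b + 1) for N = 2^q hashed objects and a b-bit hash,
    # per the birthday-paradox approximation used above.
    return 2 * q - b + 1

print(collision_prob_exponent(25, 64))   # -13, i.e. P ~ 2^-13
print(collision_prob_exponent(25, 128))  # -77, i.e. P ~ 2^-77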
Below you can find a small list, taken from Wikipedia, of non-cryptographic hash functions. I know MurmurHash3, and it is much faster than any cryptographic hash function:
Fowler–Noll–Vo : 32, 64, 128, 256, 512 and 1024 bits
Jenkins : 64 and 128 bits
MurmurHash : 32, 64, 128, and 160 bits
CityHash : 64, 128 and 256 bits
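For the deduplication task itself, here is a minimal sketch; MurmurHash3 would need the third-party mmh3 package, so this example uses Python's built-in BLAKE2 at a 128-bit output size to stay self-contained (it is a cryptographic function, hence somewhat slower than the options listed above):

import hashlib
from collections import defaultdict

def find_duplicates(paths):
    # Group files by a 128-bit content hash; files sharing a digest are
    # duplicates, up to the ~2^-77 collision probability computed above.
    groups = defaultdict(list)
    for path in paths:
        h = hashlib.blake2b(digest_size=16)  # 128-bit output
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        groups[h.digest()].append(path)
    return [g for g in groups.values() if len(g) > 1]

print(find_duplicates(["a.bin", "b.bin", "c.bin"]))  # hypothetical paths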
MD5 and SHA-1 are not deliberately slow, no; it is password-hashing schemes, not general-purpose hashes, that are made intentionally slow to blunt brute force. I've used MD5 for deduplication myself (with Python), and performance was just fine.
This article claims machines today can compute the MD5 hash of 330 MB of data per second.
SHA-1 was developed as a stronger alternative to MD5; it has since been discovered that you can craft inputs that hash to the same MD5 value, but I think for your purposes MD5 will work fine. It certainly did for me.
If security is not a concern for you, you can take one of the secure hash functions and reduce the number of rounds. This makes it cryptographically unsound but still perfectly fine for equality testing.
Skein is very strong. It has 72 to 80 rounds, depending on the variant. Try reducing to 10 or so.
Or encrypt with AES and XOR the output blocks together. AES is hardware-accelerated on modern CPUs and insanely fast.
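A minimal sketch of that AES idea, assuming the third-party pycryptodome package; note this is for equality testing only, and XORing ECB blocks ignores block order and cancels repeated blocks, so it is a very weak checksum:

from Crypto.Cipher import AES  # pycryptodome

def aes_xor_digest(data, key=b"\x00" * 16):
    # Encrypt each 16-byte block under a fixed key and XOR all
    # ciphertext blocks into a single 16-byte checksum.
    cipher = AES.new(key, AES.MODE_ECB)
    data += b"\x00" * (-len(data) % 16)  # zero-pad to a block multiple
    digest = bytes(16)
    for i in range(0, len(data), 16):
        block = cipher.encrypt(data[i:i + 16])
        digest = bytes(a ^ b for a, b in zip(digest, block))
    return digest

print(aes_xor_digest(b"some file contents").hex())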

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low risks of collision are with SHA-256: consider your risks of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is on his path, out of a human population of 6.5 billion, then risks of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take ten thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep a loaded shotgun with you at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. SHA-256 algorithm is effectively a random mapping and collision probability doesn't depend on input length. Even a 1 bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (a multiple of 1024 bits for SHA-512). Taking a 12-byte input (as Thomas used in his example), when using SHA-256 there are 2^96 possible padded sequences of length 64 bytes.
As an example, the 12-byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a single one bit, followed by 351 zero bits, followed by the 64-bit representation of the input length in bits, which is 0x0000000000000060, to form a 512-bit padded message. This 512-bit message is used as the input for computing the hash.
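That padding is easy to reproduce; here is a short Python check (hashlib applies the same padding internally when computing the digest):

msg = b"Hello There!"                           # 12 bytes = 96 bits
padded = (msg
          + b"\x80"                             # a '1' bit plus seven '0' bits
          + b"\x00" * 43                        # the remaining 344 zero bits
          + (8 * len(msg)).to_bytes(8, "big"))  # 64-bit length field, 0x...60
assert len(padded) * 8 == 512                   # exactly one SHA-256 block
print(padded.hex())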
More details can be found in RFC 4634, "US Secure Hash Algorithms (SHA and HMAC-SHA)": http://www.ietf.org/rfc/rfc4634.txt
No, message length does not affect the likelihood of a collision.
If that were the case, the algorithm would be broken.
You can try for yourself by running SHA against all one-byte inputs, then against all two-byte inputs and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia)
The hash is 256 bits long; among inputs longer than 256 bits, collisions necessarily exist.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2^256 possible outputs there are a lot of different hashes, but they are not collision-free; that is impossible.
Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed still won't amount to much, but the chances of that data being in precomputed hash tables are pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.
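To make the small-domain problem concrete, here is a tiny Python sketch (the observed value is simulated) showing how an eavesdropper could invert hashes drawn from a two-word domain:

import hashlib

# Precompute the hash of every value in the tiny answer domain.
table = {hashlib.sha256(ans).hexdigest(): ans for ans in (b"YES", b"NO")}

observed = hashlib.sha256(b"YES").hexdigest()  # hash seen on the wire
print(table.get(observed))                     # b'YES'; the domain is too small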

Hash length reduction?

I know that, given say an MD5/SHA-1 hash of a value, reducing it from X bits (e.g. 128) to Y bits (e.g. 64) increases the possibility of birthday attacks, since information has been lost. Is there any easy-to-use tool/formula/table that will say what the probability of a "correct" guess will be when that length reduction occurs (compared to its original guess probability)?
Crypto is hard. I would recommend against trying to do this sort of thing. It's like cooking pufferfish: Best left to experts.
So just use the full length hash. And since MD5 is broken and SHA-1 is starting to show cracks, you shouldn't use either in new applications. SHA-2 is probably your best bet right now.
I would definitely recommend against reducing the bit count of a hash. There are too many issues at stake here. Firstly, how would you decide which bits to drop?
Secondly, it would be hard to predict how the dropping of those bits would affect the distribution of outputs in the new "shortened" hash function. A (well-designed) hash function is meant to distribute inputs evenly across the whole of the output space, not a subset of it.
By dropping half the bits you are effectively taking a subset of the original hash function, which might not have nearly the desirable properties of a properly designed hash function, and may lead to further weaknesses.
Well, since every extra bit in the hash doubles the number of possible hashes, every time you shorten the hash by a bit there are only half as many possible hashes, and thus the chance of guessing that random number is doubled.
128 bits = 2^128 possibilities
64 bits = 2^64 possibilities
So by cutting it in half, only 2^64 / 2^128 = 2^-64 of the original possibilities remain; equivalently, a correct random guess becomes 2^64 times more likely.