SHA collision probability when removing bytes


I'm implementing some program which uses id's with variable length. These id's identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
However, I want to have an idea of how much this will increase the collisions. This is what I have so far:
I found out that for a "perfect" hash, the formula p^2 / 2^(n+1) describes the probability of collisions, where p is the number of messages and n is the size of the hash in bits. Here is where my problem starts. I'm assuming that after removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. Assuming this, I get:
5160^2 / 2^(192+1) ≈ 2.12 x 10^-51
Where 5160 is the peak number of messages and 192 is the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.

However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it. At least if all bytes are valid, and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
Pretty much. Truncating hashes should conserve all their important properties, obviously at the reduced security level corresponding to the smaller output size.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant, called SHA-224, which takes the first 28 bytes of SHA-256 using a different initial state for the hash calculation.
My recommendation is to use SHA-256, keeping the first 24 bytes. Finding one collision then requires around 2^96 hash-function calls, which is currently infeasible even for extremely powerful attackers, and accidental collisions are essentially impossible.
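As a rough sketch of that recommendation, the snippet below hashes an id with SHA-256, keeps the first 24 bytes, and evaluates the birthday approximation from the question. The helper names `short_id` and `collision_probability` and the sample id are made up for illustration, and the sketch assumes the broker accepts arbitrary bytes.

```python
import hashlib

def short_id(message_id: str, length: int = 24) -> bytes:
    """Hash an id with SHA-256 and keep only the first `length` bytes."""
    return hashlib.sha256(message_id.encode("utf-8")).digest()[:length]

def collision_probability(messages: int, bits: int) -> float:
    """Birthday approximation p^2 / 2^(n+1) for p messages and n hash bits."""
    return messages ** 2 / 2 ** (bits + 1)

example = short_id("some-variable-length-message-id")  # hypothetical id
probability = collision_probability(5160, 24 * 8)

print(len(example))   # 24
print(probability)    # roughly 2.12e-51
```

If the broker only accepts text, the same idea applies, but hex or Base64 encoding would shrink the number of hash bits that fit in 24 characters.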

Related

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes.
What if I convert every two chars to its numerical representation and make a single byte out of them?
In Python I'd do something like this:
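shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
halved = []
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    # Add the character codes of each pair of hex digits to get one value
    halved.append(chr(ord(first_byte) + ord(next_byte)))
result = "".join(halved)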
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?

The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
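One detail may be worth checking before reasoning about truncation: the 56-character string is hexadecimal text, and each pair of hex digits already stands for one byte. The sketch below compares plain hex decoding with the pair-summing scheme from the question (`sum_pairs` is an illustrative name). Decoding loses nothing, while summing character codes maps different digit pairs to the same value.

```python
hex_digest = "2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b"

# Each pair of hex digits encodes exactly one byte, so decoding is lossless.
raw = bytes.fromhex(hex_digest)

def sum_pairs(text: str) -> bytes:
    """The question's scheme: add the character codes of each digit pair."""
    return bytes(ord(a) + ord(b) for a, b in zip(text[0::2], text[1::2]))

print(len(hex_digest), len(raw))            # 56 28
print(raw.hex() == hex_digest)              # True: nothing was lost
print(sum_pairs("12") == sum_pairs("21"))   # True: two pairs, one value
```

So the summing scheme discards information that ordinary decoding keeps, even though both produce 28 bytes.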
Your use of that truncation can be still perfectly legit, depending what it is. Git for example shows only the first few bytes from a commit hash and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 256 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1:256.
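The shrunken example can be tried directly. This sketch uses SHA-256 cut down to one or two bytes as a stand-in for the small hash, which is an assumption made for illustration, and counts how many inputs are hashed before the first repeat.

```python
import hashlib
from itertools import count

def first_collision(keep_bytes: int, offset: int = 0) -> int:
    """Hash "1", "2", ... and return how many inputs were hashed
    when a truncated hash value first repeated."""
    seen = set()
    for n in count(1):
        digest = hashlib.sha256(str(n).encode()).digest()
        short = digest[offset:offset + keep_bytes]
        if short in seen:
            return n
        seen.add(short)

# One byte has only 256 values, so a repeat must appear within 257 inputs,
# whichever byte of the digest is kept.
print(first_collision(1))
print(first_collision(1, offset=1))
# Two bytes have 65536 values, so the bound is 65537 inputs.
print(first_collision(2))
```

In practice the first repeat tends to show up much earlier than the pigeonhole bound, which is the birthday effect discussed elsewhere on this page.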
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.

Crypto - Express.js is PBKDF2 HMAC-SHA1 enough?

Using the Express.js framework and crypto to hash a password with pbkdf2, I read that the default algorithm is HMAC-SHA1, but I don't understand why it hasn't been upgraded to one of the other families of SHA.
crypto.pbkdf2(password, salt, iterations, keylen, callback)
Is the keylen that we provide the variation of the SHA we want, like SHA-256, 512, etc.?
Also, how does HMAC change the output?
And lastly, is it strong enough when SHA1 is broken?
Sorry if I am mixing things up.

Is the keylen that we provide the variation of the SHA we want, like SHA-256, 512, etc.?
As you state you're hashing a password in particular, @CodesInChaos is right: keylen (i.e. the length of the output from PBKDF2) should be at most the number of bits of your HMAC's native hash function.
For SHA-1, that's 160 bits (20 bytes)
For SHA-256, that's 256 bits (32 bytes), etc.
The reason for this is that if you ask for a longer hash (keylen) than the hash function supports, the first native length is identical, so an attacker only needs to attack that first native-length block. This is the problem 1Password fixed after the Hashcat team found it.
Example as a proof:
Here's 22 bytes worth of PBKDF2-HMAC-SHA-1 - that's one native hash size + 2 more bytes (taking a total of 8192 iterations! - the first 4096 iterations generate the first 20 bytes, then we do another 4096 iterations for the set after that!):
pbkdf2 sha1 "password" "salt" 4096 22
4b007901b765489abead49d926f721d065a429c12e46
And here's just getting the first 20 bytes of PBKDF2-HMAC-SHA-1 - i.e. exactly one native hash output size (taking a total of 4096 iterations)
pbkdf2 sha1 "password" "salt" 4096 20
4b007901b765489abead49d926f721d065a429c1
Even if you store 22 bytes of PBKDF2-HMAC-SHA-1, an attacker only needs to compute 20 bytes... which takes about half the time, as to get bytes 21 and 22, another entire set of HMAC values is calculated and then only 2 bytes are kept.
Yes, you're correct; 21 bytes takes twice the time 20 does for PBKDF2-HMAC-SHA-1, and 40 bytes takes just as long as 21 bytes in practical terms. 41 bytes, however, takes three times as long as 20 bytes, since 41/20 is between 2 and 3, exclusive.
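The same comparison can be reproduced with Python's standard library, shown here as a sketch rather than the tool used above. `hashlib.pbkdf2_hmac` is a real standard-library function; the helper names `pbkdf2_sha1` and `blocks_needed` are illustrative.

```python
import hashlib
import math

def pbkdf2_sha1(length: int) -> bytes:
    """PBKDF2-HMAC-SHA-1 of "password" / "salt" with 4096 iterations."""
    return hashlib.pbkdf2_hmac("sha1", b"password", b"salt", 4096, dklen=length)

def blocks_needed(length: int, native: int = 20) -> int:
    """Number of native-size blocks (full iteration runs) PBKDF2 computes."""
    return math.ceil(length / native)

dk20 = pbkdf2_sha1(20)
dk22 = pbkdf2_sha1(22)

print(dk20.hex())          # 4b007901b765489abead49d926f721d065a429c1
print(dk22[:20] == dk20)   # True: the first 20 bytes are identical
print([blocks_needed(n) for n in (20, 21, 40, 41)])   # [1, 2, 2, 3]
```

The block count is what drives the defender's cost, while an attacker checking guesses only ever needs the first block.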
Also how does HMAC change the output?
HMAC (RFC 2104) is a way of keying hash functions, particularly those with weaknesses when you simply concatenate key and text together. HMAC-SHA-1 is SHA-1 used in an HMAC; HMAC-SHA-512 is SHA-512 used in an HMAC.
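A small sketch with Python's `hmac` module shows the distinction. The key and message are placeholder values, and the last comparison only illustrates that HMAC is not the same as hashing the key and text concatenated.

```python
import hashlib
import hmac

key = b"secret key"      # placeholder key for illustration
message = b"message"     # placeholder message

tag_sha1 = hmac.new(key, message, hashlib.sha1).digest()
tag_sha512 = hmac.new(key, message, hashlib.sha512).digest()
naive = hashlib.sha1(key + message).digest()   # plain concatenation, not HMAC

print(len(tag_sha1), len(tag_sha512))   # 20 64
print(tag_sha1 != naive)                # True
```

The output length follows the underlying hash, which is why PBKDF2-HMAC-SHA-1 has a 20-byte native block and PBKDF2-HMAC-SHA-512 a 64-byte one.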
And lastly is it strong enough when SHA1 is broken?
If you have enough iterations (upper tens of thousands to lower hundreds of thousands or more in 2014) then it should be all right. PBKDF2-HMAC-SHA-512 in particular has an advantage that it does much worse on current graphics cards (i.e. many attackers) than it does on current CPUs (i.e. most defenders).
For the gold standard, see the answer @ThomasPornin gave in Is SHA-1 secure for password storage?, a tiny part of which is "The known attacks on MD4, MD5 and SHA-1 are about collisions, which do not impact preimage resistance. It has been shown that MD4 has a few weaknesses which can be (only theoretically) exploited when trying to break HMAC/MD4, but this does not apply to your problem. The 2^106 second preimage attack in the paper by Kelsey and Schneier is a generic trade-off which applies only to very long inputs (2^60 bytes; that's a million terabytes -- notice how 106+60 exceeds 160; that's where you see that the trade-off has nothing magic in it)."

SHA-1 is broken, but that does not mean it's unsafe to use; SHA-256 (SHA-2) is more or less for future-proofing and a long-term substitute. Broken only means faster than brute force, but not necessarily feasible or practically possible (yet).
See also this answer: https://crypto.stackexchange.com/questions/3690/no-sha-1-collision-yet-sha1-is-broken
A function getting broken often only means that we should start
migrating to other, stronger functions, and not that there is
practical danger yet. Attacks only get stronger, so it's a good idea
to consider alternatives once the first cracks begin to appear.

Is there a limit on the message size for SHA-256?

When hashing a string, like a password, with SHA-256, is there a limit to the length of the string I am hashing? For example, is it only "safe" to hash strings that are smaller than 64 characters?

There is technically a limit, but it's quite large. The padding scheme used for SHA-256 requires that the size of the input (in bits) be expressed as a 64-bit number. Therefore, the maximum size is (2^64-1)/8 bytes ~= 2'097'152 terabytes.
That renders the limit almost entirely theoretical, not practical.
Most people don't have the storage for nearly that much data anyway, but even if they did, processing it all serially to produce a single hash would take an amount of time most would consider prohibitive.
A quick back-of-the-envelope kind of calculation indicates that even with the fastest enterprise SSDs currently [1] listed on Tom's Hardware, and striping them 16 wide to improve bandwidth, just reading that quantity of data would still take about 220 years.
[1] As of April 2016.
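The size limit is easy to check with integer arithmetic. This sketch only restates the 64-bit length field as a byte count, using binary terabytes (TiB) and decimal exabytes.

```python
max_bits = 2 ** 64 - 1        # largest length the 64-bit field can express
max_bytes = max_bits // 8     # whole bytes that fit under that limit

print(max_bytes)              # 2305843009213693951
print(max_bytes / 2 ** 40)    # about 2,097,152 TiB
print(max_bytes / 10 ** 18)   # about 2.3 exabytes
```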

There is no such limit, other than the maximum message size of 2^64-1 bits. SHA2 is frequently used to generate hashes for executables, which tend to be much larger than a few dozen bytes.

The upper limit is given in the NIST Standard FIPS 180-4. The reason for the upper limit is the padding scheme, which is a countermeasure against the MOV attack, an artifact of the Merkle-Damgard construction. The message length l is appended to the message as the last step of padding.
Then append the 64-bit block that is equal to the number l expressed using a binary representation
Therefore, by the NIST standard, the maximum file size that can be hashed with SHA-256 is 2^64-1 bits (approx. 2.305 exabytes; that is close to the lower range of the estimated capacity of the NSA's data center in Utah, so you don't need to worry).
NIST allows hashing a message of size zero. Therefore the message length ranges from 0 to 2^64-1 bits.
If you need to hash files larger than 2^64-1 bits, then either use SHA-512, which has a limit of 2^128-1 bits, or use SHA3, which has no limit.

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?

A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low risks of collision are with SHA-256: consider your risks of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is on his path, out of a human population of 6.5 billion, then risks of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take ten thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep with you a loaded shotgun at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
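Two of the numbers above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that treats SHA-256 as a random function; it gives the expected number of colliding pairs among all 12-byte inputs, then the gorilla odds.

```python
import math

# Expected colliding pairs among N = 2^96 inputs mapped into 2^256 outputs
# is about N^2 / 2^(256+1); work with exponents to keep the numbers small.
inputs_log2 = 96
outputs_log2 = 256
expected_pairs_log2 = 2 * inputs_log2 - (outputs_log2 + 1)
print(expected_pairs_log2)            # -65, i.e. about 2^-65 pairs

# One gorilla incident every 6 years among 6.5 billion people, per day.
person_days = 6 * 365 * 6.5e9
print(round(math.log2(person_days), 1))   # 43.7
```

An expected count of about 2^-65 is what makes it plausible that no two 12-byte inputs collide at all.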

There is no minimum input size. SHA-256 algorithm is effectively a random mapping and collision probability doesn't depend on input length. Even a 1 bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (multiple of 1024 for SHA-512). Taking a 12 byte input (as Thomas used in his example), when using SHA-256, there are 2^96 possible sequences of length 64 bytes.
As an example, a 12 byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits followed by the 64 bit representation of the length of the input in bits which is 0x0000000000000060 to form a 512 bit padded message. This 512 bit message is used as the input for computing the hash.
More details can be found in RFC 4634, "US Secure Hash Algorithms (SHA and HMAC-SHA)", http://www.ietf.org/rfc/rfc4634.txt
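The padding described above can be sketched in a few lines. `sha256_pad` is an illustrative helper, not part of any library, and it handles whole-byte messages only.

```python
def sha256_pad(message: bytes) -> bytes:
    """Apply SHA-256 padding: a one bit, zero bits, then the 64-bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                      # a one bit plus 7 zero bits
    padded += b"\x00" * ((56 - len(padded)) % 64)   # zero-fill to 56 mod 64 bytes
    padded += bit_length.to_bytes(8, "big")         # message length in bits
    return padded

block = sha256_pad(b"Hello There!")
zero_bits = 7 + 8 * block[13:56].count(0)   # zeros between the one bit and the length

print(len(block) * 8)      # 512
print(zero_bits)           # 351
print(block[-8:].hex())    # 0000000000000060
```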

No, message length does not affect the likelihood of a collision.
If that were the case, the algorithm would be broken.
You can try for yourself by running SHA against all one-byte inputs, then against all two-byte inputs and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia).
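That experiment is small enough to run for one-byte and two-byte inputs. The sketch below hashes all 65,792 of them with SHA-256 and counts repeated digests; `count_collisions` is an illustrative name.

```python
import hashlib
from itertools import product

def count_collisions(max_length: int) -> int:
    """Hash every input of 1..max_length bytes and count repeated digests."""
    seen = set()
    collisions = 0
    for length in range(1, max_length + 1):
        for combo in product(range(256), repeat=length):
            digest = hashlib.sha256(bytes(combo)).digest()
            if digest in seen:
                collisions += 1
            seen.add(digest)
    return collisions

print(count_collisions(2))   # 0
```

Going to three-byte inputs means about 16.8 million hashes, which is still feasible on a desktop but needs more memory for the set.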

The hash is 256 bits long, so there is a collision for anything longer than 256 bits.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2 to the power of 256 possible values there are a lot of different hashes, but they are not collision free; that is impossible.

Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed probably won't amount to much, but the chances of that data being in precomputed hash tables are pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.
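A minimal sketch of the small-domain problem described above: with only two possible messages, an observer can precompute both hashes and look up whatever appears on the wire. The richer message format and its field values are made up for illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Precomputed table covering the entire (tiny) answer domain.
table = {sha256_hex(answer): answer for answer in ("YES", "NO")}

observed = sha256_hex("NO")       # what an eavesdropper sees
print(table.get(observed))        # NO

# Hypothetical message with date, name and amount mixed in.
richer = sha256_hex("NO|2016-04-01T12:00:00|hypothetical-applicant|100000")
print(table.get(richer))          # None
```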

Is it okay to truncate a SHA256 hash to 128 bits?

MD5 and SHA-1 hashes have weaknesses against collision attacks. SHA256 does not but it outputs 256 bits. Can I safely take the first or last 128 bits and use that as the hash? I know it will be weaker (because it has less bits) but otherwise will it work?
Basically I want to use this to uniquely identify files in a file system that might one day contain a trillion files. I'm aware of the birthday problem and a 128 bit hash should yield about a 1 in a trillion chance on a trillion files that there would be two different files with the same hash. I can live with those odds.
What I can't live with is if somebody could easily, deliberately, insert a new file with the same hash and the same beginning characters of the file. I believe in MD5 and SHA1 this is possible.

Yeah that will work. Theoretically it's better to XOR the two halves together but even truncated SHA256 is stronger than MD5. You should still consider the result a 128 bit hash rather than a 256 bit hash though.
My particular recommendation in this particular case is to store and reference using HASH + uniquifier where uniquifier is the count of how many distinct files you've seen with this hash before. This way you don't absolutely fall down flat if somebody tries to store future discovered collision vectors for SHA256.
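Both ways of getting 128 bits out of SHA-256 are short to write down. The sketch below shows plain truncation and XOR folding of the two halves, with illustrative function names. It also evaluates the usual birthday approximation n^2 / 2^(b+1) for a trillion files, which comes out well below one in a trillion.

```python
import hashlib

def truncated_128(data: bytes) -> bytes:
    """First 16 bytes (128 bits) of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:16]

def folded_128(data: bytes) -> bytes:
    """XOR the two 16-byte halves of the SHA-256 digest together."""
    digest = hashlib.sha256(data).digest()
    return bytes(a ^ b for a, b in zip(digest[:16], digest[16:]))

files = 10 ** 12
estimate = files ** 2 / 2 ** 129   # birthday approximation for 128 bits

print(truncated_128(b"example file contents").hex())
print(folded_128(b"example file contents").hex())
print(estimate)                    # roughly 1.5e-15
```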

But is it worth it? If you have a hash for each file, then you essentially have an overhead for each file. Let's say that each file must take up at least 512 bytes (a typical disk sector) and that you're storing these hashes compactly enough so as to not have each hash take up much more than the hash size.
So, even if all your files are 512 bytes, the smallest, you're talking either 16 / 512 = 3.1% or 32 / 512 = 6.3%. In reality, I'd bet your average file size is higher (unless all your files are 1 sector...), so that overhead would be less.
Now, the amount of space you need for hashes scales linearly with the number of files you have. Is that extra space worth that much? Even if you had your mentioned trillion files - that's 1 000 000 000 000 * 32 = ~29 TiB, which is a lot of space, but keep in mind: your data would be 1 000 000 000 000 * 512 = ~465 TiB. The numbers are worthless, really, since it's still 3% or 6% overhead. But at this level, where you have half a petabyte of storage, does 15 terabytes matter? At any level, does a 3% savings mean anything? And remember, if the files are larger, you save less. (Which they probably are: good luck getting a 512 byte sector size at that hard disk size.)
So, is this 3% or less disk savings worth the potential risk in security. (Which I'll leave unanswered, as it's waaay not my cup of tea.)
Alternatively, could you, say, group files together in some logical fashion, so that you have fewer files? (I mean, if you have trillions of 512 byte files, do you really want to hash every byte on disk?)
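The overhead figures in this answer can be recomputed as follows. The sketch assumes one 512-byte sector per file, a trillion files, and binary units (TiB).

```python
FILES = 10 ** 12
SECTOR = 512
TIB = 2 ** 40

def tib(n_bytes: int) -> float:
    """Convert a byte count to tebibytes."""
    return n_bytes / TIB

for hash_bytes in (16, 32):
    overhead = hash_bytes / SECTOR          # fraction of a one-sector file
    total = tib(FILES * hash_bytes)         # space taken by all hashes
    print(f"{hash_bytes} bytes: {overhead:.3%} overhead, {total:.1f} TiB of hashes")

print(f"{tib(FILES * SECTOR):.1f} TiB of data")
```

On these assumptions, 16-byte hashes take about 14.6 TiB and 32-byte hashes about 29.1 TiB, against roughly 465.7 TiB of file data.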

Yes, that will work.
For the record, there are known in-use collision attacks against MD5, but the SHA-1 attacks are at this point completely theoretical (no SHA-1 collision has ever been found... yet).

Crypto does something similar. For example, Ethereum addresses are the low-order 160 bits of the Keccak (precursor to SHA-3) hash.
