Crypto - Express.js: is PBKDF2 HMAC-SHA1 enough?

Using the Express.js framework and crypto to hash a password with pbkdf2, I read that the default algorithm is HMAC-SHA1, but I don't understand why it hasn't been upgraded to one of the other families of SHA.
crypto.pbkdf2(password, salt, iterations, keylen, callback)
Is the keylen that we provide the variation of the SHA we want, like SHA-256, SHA-512, etc.?
Also, how does HMAC change the output?
And lastly, is it strong enough when SHA-1 is broken?
Sorry if I am mixing things up.

Is the keylen that we provide the variation of the SHA we want, like SHA-256, SHA-512, etc.?
As you state you're hashing a password in particular, #CodesInChaos is right - keylen (i.e. the length of the output from PBKDF2) should be at most the number of bits of your HMAC's native hash function.
For SHA-1, that's 160 bits (20 bytes)
For SHA-256, that's 256 bits (32 bytes), etc.
The reason for this is that if you ask for a longer hash (keylen) than the native hash function produces, the first native-length block of output is identical, so an attacker only needs to attack those bits. This is the problem 1Password had, and fixed, when the Hashcat team found it.
Example as a proof:
Here's 22 bytes worth of PBKDF2-HMAC-SHA-1 - that's one native hash size + 2 more bytes (taking a total of 8192 iterations! - the first 4096 iterations generate the first 20 bytes, then we do another 4096 iterations for the set after that!):
pbkdf2 sha1 "password" "salt" 4096 22
4b007901b765489abead49d926f721d065a429c12e46
And here's just getting the first 20 bytes of PBKDF2-HMAC-SHA-1 - i.e. exactly one native hash output size (taking a total of 4096 iterations):
pbkdf2 sha1 "password" "salt" 4096 20
4b007901b765489abead49d926f721d065a429c1
Even if you store 22 bytes of PBKDF2-HMAC-SHA-1, an attacker only needs to compute 20 bytes... which takes about half the time, as to get bytes 21 and 22, another entire set of HMAC values is calculated and then only 2 bytes are kept.
Yes, you're correct; 21 bytes takes twice the time 20 does for PBKDF2-HMAC-SHA-1, and 40 bytes takes just as long as 21 bytes in practical terms. 41 bytes, however, takes three times as long as 20 bytes, since 41/20 is between 2 and 3, exclusive.
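If you want to verify this yourself, here is a minimal sketch using Python's hashlib.pbkdf2_hmac, with the same password, salt and iteration count as the example above; the expected outputs are the two values shown there:
import hashlib
# 22-byte output: PBKDF2 must run two full HMAC-SHA-1 blocks (2 x 4096 iterations)
dk22 = hashlib.pbkdf2_hmac('sha1', b'password', b'salt', 4096, 22)
# 20-byte output: exactly one native SHA-1 block (4096 iterations)
dk20 = hashlib.pbkdf2_hmac('sha1', b'password', b'salt', 4096, 20)
print(dk22.hex())         # 4b007901b765489abead49d926f721d065a429c12e46
print(dk20.hex())         # 4b007901b765489abead49d926f721d065a429c1
print(dk22[:20] == dk20)  # True - the first block is identical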
Also, how does HMAC change the output?
HMAC (RFC 2104) is a way of keying hash functions, particularly those with weaknesses when you simply concatenate key and text together. HMAC-SHA-1 is SHA-1 used in an HMAC; HMAC-SHA-512 is SHA-512 used in an HMAC.
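As a small illustration (the key and message below are arbitrary), Python's hmac module shows the difference between SHA-1 keyed through the HMAC construction and a naive key-and-message concatenation:
import hashlib, hmac
key = b'secret-key'              # arbitrary example key
msg = b'message to authenticate'
# SHA-1 keyed through the HMAC construction (RFC 2104)
print(hmac.new(key, msg, hashlib.sha1).hexdigest())
# Naive keying by concatenation - the pattern HMAC is designed to avoid
print(hashlib.sha1(key + msg).hexdigest())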
And lastly, is it strong enough when SHA-1 is broken?
If you have enough iterations (upper tens of thousands to lower hundreds of thousands or more, as of 2014) then it should be all right. PBKDF2-HMAC-SHA-512 in particular has the advantage that it runs much worse on current graphics cards (i.e. many attackers) than it does on current CPUs (i.e. most defenders).
For the gold standard, see the answer #ThomasPornin gave in Is SHA-1 secure for password storage?, a tiny part of which is "The known attacks on MD4, MD5 and SHA-1 are about collisions, which do not impact preimage resistance. It has been shown that MD4 has a few weaknesses which can be (only theoretically) exploited when trying to break HMAC/MD4, but this does not apply to your problem. The 2^106 second preimage attack in the paper by Kelsey and Schneier is a generic trade-off which applies only to very long inputs (2^60 bytes; that's a million terabytes -- notice how 106+60 exceeds 160; that's where you see that the trade-off has nothing magic in it)."

SHA-1 is broken, but that does not mean it's unsafe to use; SHA-256 (SHA-2) is more or less for future-proofing and as a long-term substitute. Broken only means faster than brute force, but not necessarily feasible or practically possible (yet).
See also this answer: https://crypto.stackexchange.com/questions/3690/no-sha-1-collision-yet-sha1-is-broken
A function getting broken often only means that we should start migrating to other, stronger functions, and not that there is practical danger yet. Attacks only get stronger, so it's a good idea to consider alternatives once the first cracks begin to appear.

Related

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash them.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes.
What if I convert every two chars to their numerical representations and make a single byte out of them?
In Python I'd do something like this:
shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
halved = ""
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    halved += chr(ord(first_byte) + ord(next_byte))  # sum the two character codes into one character
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94 -- 28 bytes; the input is effectively halved.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of collision at 1 in 2^(28*8).
Your use of that truncation can still be perfectly legit, depending on what it is. Git, for example, shows only the first few bytes of a commit hash, and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 256 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1 in 256.
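To make that concrete, here is a quick sketch (the numbered test strings are arbitrary) that truncates SHA-224 to a single byte and reports the first collision, which the pigeonhole principle guarantees within 257 distinct inputs:
import hashlib
seen = {}
for i in range(300):
    msg = ('message-%d' % i).encode()
    digest = hashlib.sha224(msg).digest()[:1]  # keep only the first byte
    if digest in seen:
        print('collision after', i + 1, 'inputs:', seen[digest], 'and', msg)
        break
    seen[digest] = msg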
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.

SHA collision probability when removing bytes

I'm implementing some program which uses ids with variable length. These ids identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending it to the broker) with SHA and removing some bytes until it is only 24 bytes long.
However, I want to have an idea of how much will this increase the collisions. So this is what I got until now:
I found out that for a "perfect" hash we have the formula p^2 / 2^(n+1) to describe the probability of collisions, where p is the number of messages and n is the size of the hash in bits. Here is where my problem starts. I'm assuming that by removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. So assuming this I get:
5160^2 / 2^(192+1) = 2.12x10^-51
Where 5160 is the peak number of messages and 192 is the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
If so, and since the probability is really small, which bytes should I remove? Most or least significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.
However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it. At least if all bytes are valid, and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
Pretty much. Truncating hashes should conserve all their important properties, obviously at the reduced security level corresponding to the smaller output size.
If so, and since the probability is really small, which bytes should I remove? Most or least significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant, called SHA-224, which takes the first 28 bytes of SHA-256 using a different initial state for the hash calculation.
My recommendation is to use SHA-256, keeping the first 24 bytes. This requires around 2^96 hash-function calls to find one collision. Which is currently infeasible, even for extremely powerful attackers, and essentially impossible for accidental collisions.
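A minimal sketch of that recommendation (the function name and sample input are illustrative):
import hashlib
def truncated_id(message: bytes, length: int = 24) -> bytes:
    # First `length` bytes of SHA-256; 24 bytes keeps 192 bits of the digest.
    # Note: this is not the same value as SHA-224, which uses a different initial state.
    return hashlib.sha256(message).digest()[:length]
broker_id = truncated_id(b'example-message-id')
print(len(broker_id), broker_id.hex())  # 24 bytes -> 48 hex characters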

What are some of the best hashing algorithms to use for data integrity and deduplication?

I'm trying to hash a large number of files with binary data inside of them in order to:
(1) check for corruption in the future, and
(2) eliminate duplicate files (which might have completely different names and other metadata).
I know about md5 and sha1 and their relatives, but my understanding is that these are designed for security and therefore are deliberately slow in order to reduce the efficacy of brute force attacks. In contrast, I want algorithms that run as fast as possible, while reducing collisions as much as possible.
Any suggestions?
You are mostly right. If your system does not face an adversary, using cryptographic hash functions is overkill given their security properties.
Collisions depend on the number of bits, b, of your hash function and the number of hash values, N, you expect to compute. The academic literature argues that this collision probability must be below the hardware error probability, so that a hash collision is less likely than a hardware error during a byte-by-byte comparison of the data [ref1,ref2,ref3,ref4,ref5]. Hardware error probability is in the range of 2^-12 to 2^-15 [ref6]. If you expect to generate N=2^q hash values, then your collision probability, which already takes the birthday paradox into account, is approximately P = 2^(2q-b+1).
The number of bits of your hash function is directly proportional to its computational complexity. So you are interested in finding a hash function with the minimum number of bits possible, while being able to maintain the collision probability at acceptable values.
Here's an example on how to make that analysis:
Let's say you have f=2^15 files;
The average size of each file lf is 2^20 bytes;
You intend to divide each file into chunks of average size lc equal to 2^10 bytes;
Each file will be divided into c=lf/lc=2^10 chunks;
You will then hash q = f*c =2^25 objects.
From that equation the collision probability for several hash sizes is the following:
P(hash=64 bits) = 2^(2*25-64+1) = 2^-13 (less than 2^-12)
P(hash=128 bits) = 2^(2*25-128+1) = 2^-77 (much less than 2^-12)
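As a small sketch, the same numbers can be reproduced with the approximation P = 2^(2q-b+1) used above:
def collision_probability_exponent(q, b):
    # N = 2^q hashed objects, b-bit hash; returns x such that P ~= 2^x
    return 2 * q - b + 1
q = 25  # 2^15 files x 2^10 chunks per file = 2^25 objects
for b in (64, 128):
    print('%d-bit hash: P ~= 2^%d' % (b, collision_probability_exponent(q, b)))
# 64-bit hash: P ~= 2^-13
# 128-bit hash: P ~= 2^-77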
Now you just need to decide which non-cryptographic hash function of 64 or 128 bits you will use, knowing that 64 bits is pretty close to the hardware error probability (but will be faster) and 128 bits is a much safer option (though slower).
Below you can find a small list, taken from Wikipedia, of non-cryptographic hash functions. I know MurmurHash3, and it is much faster than any cryptographic hash function:
Fowler–Noll–Vo : 32, 64, 128, 256, 512 and 1024 bits
Jenkins : 64 and 128 bits
MurmurHash : 32, 64, 128, and 160 bits
CityHash : 64, 128 and 256 bits
MD5 and SHA-1 are general-purpose cryptographic hashes, not password-hashing schemes, so they are not deliberately slowed down and are not really very slow in practice. I've used MD5 for deduplication myself (with Python), and performance was just fine.
This article claims machines today can compute the MD5 hash of 330 MB of data per second.
SHA-1 was developed as a stronger alternative to MD5; it has since been discovered that you can craft inputs that hash to the same value with MD5, but I think for your purposes MD5 will work fine. It certainly did for me.
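For what it's worth, here is a minimal Python sketch of this kind of hash-based deduplication (the helper names are illustrative; hashlib.md5 can be swapped for hashlib.sha256 or hashlib.blake2b):
import hashlib
def file_digest(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so large files never sit fully in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
def find_duplicates(paths):
    # Group files by digest; any group with more than one path holds duplicates
    groups = {}
    for p in paths:
        groups.setdefault(file_digest(p), []).append(p)
    return [ps for ps in groups.values() if len(ps) > 1]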
If security is not a concern for you, you can take one of the secure hash functions and reduce the number of rounds. This makes it cryptographically unsound but still perfectly fine for equality testing.
Skein is very strong. It has 80 rounds. Try reducing to 10 or so.
Or encrypt with AES and XOR the output blocks together. AES is hardware-accelerated on modern CPUs and insanely fast.

60bit hashing algorithm

Is there a cryptographically secure hashing algorithm which gives a message digest of 60 bits?
I have a unique string (id + timestamp), I need to generate a 60 bit hash from it. What will be the best algorithm to create such a hash?
You can always take a hash algorithm with a larger output size, e.g. sha256, and truncate it to 60 bits. Whether that is appropriate for your needs I cannot say without much more information. 60 bits is generally considered way too short for most security needs.
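For example (a sketch only; the input string is arbitrary), you could keep the first 64 bits of SHA-256 and mask them down to 60:
import hashlib
def hash60(data: bytes) -> int:
    # Top 60 bits of SHA-256: take the first 8 bytes, then drop the low 4 bits
    return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big') >> 4
print('%015x' % hash60(b'some-id|2014-06-01T12:00:00Z'))  # 60 bits = 15 hex digits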
There is no standard hash algorithm with a 60-bit output; standard digest sizes are all larger.
I suggest using sha1 to create the hash. It is 160 bits (40 hex characters):
hash=sha1(id + timestamp)
If you must (not recommended) compress this, use a substring to reduce it to 64 bits:
smallHash=substr(hash, 0, 8)
(8 bytes = 64 bits; if the digest is hex-encoded, keep the first 16 characters instead)
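In Python, with a hex-encoded SHA-1 digest, the equivalent would be (sketch only; the id and timestamp values are placeholders):
import hashlib
hash_hex = hashlib.sha1(b'some-id' + b'2014-06-01T12:00:00Z').hexdigest()
small_hash = hash_hex[:16]  # 16 hex characters = 64 bits
print(small_hash)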
Any hashing algorithm that has a 60-bit output size can at maximum provide only 30 bits of collision resistance (by the birthday paradox). 30 bits is much too short to be useful in security nowadays.

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low risks of collision are with SHA-256: consider your risks of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is in his path, out of a human population of 6.5 billion, then risks of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take 10 thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep a loaded shotgun with you at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. The SHA-256 algorithm is effectively a random mapping, and the collision probability doesn't depend on input length. Even a 1-bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (multiple of 1024 for SHA-512). Taking a 12 byte input (as Thomas used in his example), when using SHA-256, there are 2^96 possible sequences of length 64 bytes.
As an example, a 12 byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits followed by the 64 bit representation of the length of the input in bits which is 0x0000000000000060 to form a 512 bit padded message. This 512 bit message is used as the input for computing the hash.
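A small sketch of that padding, built by hand and checked against the description above:
msg = b'Hello There!'                          # 12 bytes = 96 bits
bit_len = len(msg) * 8                         # 96 = 0x60
padded = msg + b'\x80'                         # a single 1 bit, then seven 0 bits
padded += b'\x00' * ((56 - len(padded)) % 64)  # more 0 bits up to 448 mod 512
padded += bit_len.to_bytes(8, 'big')           # 64-bit big-endian length field
print(len(padded) * 8)  # 512 - one full block
print(padded.hex())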
More details can be found in RFC 4634, "US Secure Hash Algorithms (SHA and HMAC-SHA)", http://www.ietf.org/rfc/rfc4634.txt
No, message length does not affect the likelihood of a collision.
If that were the case, the algorithm would be considered broken.
You can try it for yourself by running SHA against all one-byte inputs, then against all two-byte inputs, and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia).
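That experiment is cheap to run; here is a sketch that hashes every 1- and 2-byte input with SHA-256 and checks for duplicates:
import hashlib
from itertools import product
digests = {}
for length in (1, 2):
    for combo in product(range(256), repeat=length):
        msg = bytes(combo)
        d = hashlib.sha256(msg).digest()
        if d in digests:
            print('collision:', digests[d], msg)
        digests[d] = msg
print('inputs hashed:', len(digests))  # 256 + 65536 = 65792, no collisions expected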
The hash is 256 bits long; by the pigeonhole principle, collisions must exist among inputs longer than 256 bits.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2^256 possible outputs there are a lot of different hashes, but they are not collision-free; that is impossible.
Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed probably won't amount to much, but the chances of that data being in precomputed hash tables are pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.