How to calculate the risk of conflicts of a 64 bit hash?

I need globally unique ids for my application. I know there is a UUID standard for this, but I wonder if I really need 128 bits.
So I'm thinking about writing my own generator that uses the system time, a random number, and the machine's network address to generate an id that fits into 64 bits and can therefore be stored in the unsigned long long int datatype in C++.
How can I determine if 64 bits is enough for me?
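As a rough illustration of that approach, here is a minimal Python sketch (the 32/16/16-bit layout is a hypothetical choice, not a recommendation):
import os, time, uuid
def make_id_64() -> int:
    # Hypothetical layout: 32 bits of seconds, 16 bits of the MAC address,
    # 16 random bits; the result fits in an unsigned 64-bit integer.
    seconds = int(time.time()) & 0xFFFFFFFF
    node16 = uuid.getnode() & 0xFFFF
    rand16 = int.from_bytes(os.urandom(2), "big")
    return (seconds << 32) | (node16 << 16) | rand16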

64 bits gives 2^64 = 18,446,744,073,709,551,616 possible values, which is around 18.4 quintillion.
So, by the birthday approximation, if you're generating 1.92 million hashes the odds of a collision are about 1 in 10 million.
Source: http://preshing.com/20110504/hash-collision-probabilities
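That figure follows from the usual birthday approximation, p ≈ k*(k-1)/(2*N); a quick Python check:
import math
N = 2 ** 64        # number of possible 64-bit values
k = 1_920_000      # number of hashes generated
p = -math.expm1(-k * (k - 1) / (2 * N))   # 1 - exp(...), kept precise for tiny p
print(p)           # about 1e-7, i.e. roughly 1 in 10 million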

Related

How to hash two 32 bit integers into one 32 bit integer without collision?

I am looking for a one-way hash to combine two 32-bit integers into one 32-bit integer. I'm not sure whether it's feasible to do this without collisions.
Edit:
I think my integers are generally small. One of them rarely takes more than 14 bits, and the other one rarely takes more than 20 bits.
Edit 2: Thanks for the help in the comments. For cases where the combination of the two integers takes more than 32 bits, I can do something different and not hash them. With that in mind, how should I hash my integers?
Thanks!
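If the two values usually fit in 32 bits together, as the sizes above suggest, one collision-free option for those cases is to pack the bits instead of hashing them, falling back to the separate handling mentioned in Edit 2 otherwise; a minimal Python sketch with a hypothetical 12/20-bit split:
def pack32(a: int, b: int) -> int:
    # Collision-free only while a fits in 12 bits and b fits in 20 bits;
    # pairs that don't fit need the fallback path instead.
    if a >= (1 << 12) or b >= (1 << 20):
        raise ValueError("pair does not fit losslessly into 32 bits")
    return (a << 20) | b
def unpack32(x: int):
    return x >> 20, x & ((1 << 20) - 1)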

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible, and let's say I use SHA224 to hash them.
The SHA224 hex digest of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes (56 hex characters).
What if I convert every two chars to their numerical representations and make a single byte out of them?
In Python I'd do something like this:
import hashlib
shalist = list(hashlib.sha224(b"hello world").hexdigest())  # 56 hex characters
halved = ""
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    halved += chr(ord(first_byte) + ord(next_byte))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of a collision between any two given inputs at 1 in 2^(28*8).
Your use of that truncation can still be perfectly legitimate, depending on what it's for. Git, for example, shows only the first few bytes of a commit hash, and for most practical purposes the short form works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system is sound; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash has 65,536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and a collision is guaranteed within at most 257 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (a factor of 2^(1 byte * 8 bits)) and is 1 in 256.
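To see that concretely, here is a small sketch (assuming Python's hashlib and random 16-byte inputs) that truncates SHA-224 to a single byte and reports when the first collision appears:
import hashlib, os
seen = {}
for i in range(1, 1001):
    data = os.urandom(16)
    truncated = hashlib.sha224(data).digest()[:1]   # keep 1 of the 28 bytes
    if truncated in seen and seen[truncated] != data:
        print("collision after", i, "inputs")       # typically a few dozen
        break
    seen[truncated] = data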
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing; by 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm), the more effective bits should remain to keep the hash secure for all practical purposes.

Crypto - Express.js is PBKDF2 HMAC-SHA1 enough?

Using the Express.js framework and crypto to hash a password with pbkdf2, I read that the default algorithm is HMAC-SHA1, but I don't understand why it hasn't been upgraded to one of the other families of SHA.
crypto.pbkdf2(password, salt, iterations, keylen, callback)
Is the keylen that we provide the variation of SHA we want, like SHA-256, SHA-512, etc.?
Also how does HMAC change the output?
And lastly is it strong enough when SHA1 is broken?
Sorry if I am mixing things up.
Is the keylen that we provide the variation of SHA we want, like SHA-256, SHA-512, etc.?
As you state you're hashing a password in particular, @CodesInChaos is right - keylen (i.e. the length of the output from PBKDF2) should be at most the native output size of your HMAC's hash function:
For SHA-1, that's 160 bits (20 bytes)
For SHA-256, that's 256 bits (32 bytes), etc.
The reason for this is that if you ask for a longer output (keylen) than the hash function natively produces, the first native-length block of output is identical either way, so an attacker only needs to attack that first native-length block's worth of bits. This is the problem 1Password had, and fixed, after the Hashcat team found it.
Example as a proof:
Here's 22 bytes' worth of PBKDF2-HMAC-SHA-1 - that's one native hash size plus 2 more bytes (taking a total of 8192 iterations: the first 4096 iterations generate the first 20 bytes, then we do another 4096 iterations for the set after that):
pbkdf2 sha1 "password" "salt" 4096 22
4b007901b765489abead49d926f721d065a429c12e46
And here's just the first 20 bytes of PBKDF2-HMAC-SHA-1 - i.e. exactly one native hash output size (taking a total of 4096 iterations):
pbkdf2 sha1 "password" "salt" 4096 20
4b007901b765489abead49d926f721d065a429c1
Even if you store 22 bytes of PBKDF2-HMAC-SHA-1, an attacker only needs to compute 20 bytes... which takes about half the time, as to get bytes 21 and 22, another entire set of HMAC values is calculated and then only 2 bytes are kept.
Yes, you're correct; 21 bytes takes twice the time 20 does for PBKDF2-HMAC-SHA-1, and 40 bytes takes just as long as 21 bytes in practical terms. 41 bytes, however, takes three times as long as 20 bytes, since 41/20 is between 2 and 3, exclusive.
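For a quick local check of those outputs, here's a sketch using Python's hashlib.pbkdf2_hmac with the same parameters as the commands above:
import hashlib
# 20 bytes: exactly one native SHA-1 block, so one set of 4096 iterations.
print(hashlib.pbkdf2_hmac("sha1", b"password", b"salt", 4096, dklen=20).hex())
# 22 bytes: spills into a second block, so another full 4096 iterations run,
# and the first 20 bytes of the output match the line above.
print(hashlib.pbkdf2_hmac("sha1", b"password", b"salt", 4096, dklen=22).hex())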
Also how does HMAC change the output?
HMAC (RFC 2104) is a way of keying hash functions, particularly those with weaknesses when you simply concatenate key and text together. HMAC-SHA-1 is SHA-1 used in an HMAC; HMAC-SHA-512 is SHA-512 used in an HMAC.
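As a small illustration using Python's hmac module - the construction is the same, only the underlying hash changes:
import hashlib, hmac
print(hmac.new(b"secret key", b"message", hashlib.sha1).hexdigest())    # HMAC-SHA-1
print(hmac.new(b"secret key", b"message", hashlib.sha512).hexdigest())  # HMAC-SHA-512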
And lastly is it strong enough when SHA1 is broken?
If you use enough iterations (upper tens of thousands to lower hundreds of thousands or more, as of 2014) then it should be all right. PBKDF2-HMAC-SHA-512 in particular has the advantage that it performs much worse on current graphics cards (i.e. many attackers' hardware) than it does on current CPUs (i.e. most defenders' hardware).
For the gold standard, see the answer @ThomasPornin gave in Is SHA-1 secure for password storage?, a tiny part of which is "The known attacks on MD4, MD5 and SHA-1 are about collisions, which do not impact preimage resistance. It has been shown that MD4 has a few weaknesses which can be (only theoretically) exploited when trying to break HMAC/MD4, but this does not apply to your problem. The 2^106 second preimage attack in the paper by Kelsey and Schneier is a generic trade-off which applies only to very long inputs (2^60 bytes; that's a million terabytes -- notice how 106+60 exceeds 160; that's where you see that the trade-off has nothing magic in it)."
SHA-1 is broken, but that does not mean it's unsafe to use; SHA-256 (SHA-2) is more or less for future-proofing and as a long-term substitute. "Broken" only means faster than brute force, but not necessarily feasible or practically possible (yet).
See also this answer: https://crypto.stackexchange.com/questions/3690/no-sha-1-collision-yet-sha1-is-broken
A function getting broken often only means that we should start migrating to other, stronger functions, and not that there is practical danger yet. Attacks only get stronger, so it's a good idea to consider alternatives once the first cracks begin to appear.

How are datatypes that need more than 32 bits stored in a 32 bit OS

How are datatypes that would need more than 32 bits stored in the system?
For example, consider a 64-bit type such as a long long, whose value can be greater than 2 to the power of 32; how is it stored in memory?
Any OS, or compiler, will use the number of bits that it needs. So if an OS or language has a need for 64-bit integers, it will just store such integers into an 8-byte representation.
There are standards for this, for integers as well as floating point numbers. See this article on Wikipedia for more: http://en.wikipedia.org/wiki/Computer_numbering_formats
The 32 bits in a 32-bit architecture refer to the number of bits the CPU registers are wide (there are some exceptions, such as floating-point registers). This does not mean that the system can't handle datatypes larger than this, only that it must deal with those datatypes 32 bits at a time.
For example, machines have an "Add With Carry" instruction, which allows the machine to chain multiple adds together so that arbitrarily sized numbers, say two 512-bit numbers, can be added in 16 steps (512/32).
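A rough Python sketch of the same idea, treating a 512-bit number as sixteen 32-bit limbs and propagating the carry by hand the way chained Add With Carry instructions would:
MASK32 = (1 << 32) - 1
def to_limbs(n, count=16):
    # Split an integer into 32-bit limbs, least-significant limb first.
    return [(n >> (32 * i)) & MASK32 for i in range(count)]
def add_512(a_limbs, b_limbs):
    result, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        result.append(total & MASK32)   # low 32 bits stay in this limb
        carry = total >> 32             # overflow becomes the next step's carry
    return result, carry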

Does truncating a sha-160 hash produce a reasonable hash?

I have a sha-160 computation that gives me a 160 bit hash of my data, but I expect this is way larger than necessary. So I'm thinking I could truncate the resulting hash down to say the low 64 bits and use that.
Does taking the low 64 bits of a sha-160 hash computation give a reasonably random 64 bit hash?
Part of what it means for something to be a good hash is that any fixed subset of its bits is also (so far as possible, given how many bits) a good hash. The low 64 bits of a SHA-160 hash should be a good 64-bit hash, in so far as there is such a thing.
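For example, a minimal sketch (assuming Python's hashlib) that keeps just the low 64 bits of the digest as an unsigned integer:
import hashlib
def hash64(data: bytes) -> int:
    # Low 64 bits of the SHA-1 digest (its last 8 bytes).
    return int.from_bytes(hashlib.sha1(data).digest()[-8:], "big")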
Note that for some purposes 64 bits really isn't all that many. For instance, if anything breaks in your application when someone finds two different things with the same hash, you probably want something longer: on average it will only take a modest number of billions of trials to find two things with the same 64-bit hash, no matter what your hashing algorithm.
What bad thing would happen if you just used all 160 bits?