The maximum length of a message that can be hashed with WHIRLPOOL - hash

I'm just wondering what is the maximum length. I read from Wikipedia that it takes a message of any length less than 2^256 bits. Does that mean 2 to the power of 256? Also, would it be more secure to hash a password multiple times? Example:
WHIRLPOOL(WHIRLPOOL(WHIRLPOOL(WHIRLPOOL("passw0rd"))))
Or does that increase the risk of collisions?

Yes, this does mean 2^256 bits. Of course, as there are 2^3 bits in a byte, you will thus have a maximum space of 2^253 bytes. Nothing to worry about.
Yes, it is better to hash multiple times. No, you don't have to worry about "cycles" (much). Many pseudo random number generators are using hashes the same way. Hash algorithms should not loose too much information and should not have a short cycle time.
Passwords hashes are should however be calculated using password based key derivation functions. The "key" is then stored. PBKDF's may use hashes (e.g. PBKDF2) or keyed block ciphers (bcrypt). Most KDF's are using message authentication codes (HMAC or MAC) instead of directly using the underlying hash algorithm or block cipher.
Input to PBKDF's is a salt and iteration count. The iteration count is used to make it harder for an attacker to brute force the system by trying out all kinds of passwords. It's basically the same as what you did above with WHIRLPOOL. Only the iteration count is normally somewhere between 1 and 10 thousand. More data is normally mixed in as well in each iteration.
More importantly, the (password specific) salt is used to make sure that duplicate passwords cannot be detected and to avoid attacks using rainbow tables. Normally the salt is about 64 to 128 bits. The salt and iteration count should be stored with the "hash".
Finally, it is probably better to use a NIST vetted hash algorithm such as SHA-512 instead of WHIRLPOOL.

Related

Hash that generates Decimal output for Swift

I want to hashed a String into a hashed object which has some numerical values NSNumber/Int as an output instead of alpha-numeric values.
The problem is that after digging through swift and some 3rd party library, I'm not able to find any library that suffices our need.
I'm working on a Chat SDK and it takes NSNumber/Int as unique identifier to co-relate Chat Message and Conversation Message.
My company demand is not to store any addition field onto the database
or change the schema that we have which complicates thing.
A neat solution my team came with was some sort of hashed function that generates number.
func userIdToConversationNumber(id:String) -> NSNumber
We can use that function to convert String to NSNumber/Int. This Int should be produced by that function and probability of colliding should be negligible. Any suggestion on any approach.
The key calculation you need to perform is the birthday bound. My favorite table is the one in Wikipedia, and I reference it regularly when I'm designing systems like this one.
The table expresses how many items you can hash for a given hash size before you have a certain expectation of a collision. This is based on a perfectly uniform hash, which a cryptographic hash is a close approximation of.
So for a 64-bit integer, after hashing 6M elements, there is a 1-in-a-million chance that there was a single collision anywhere in that list. After hashing 20M elements, there is a 1-in-a-thousand chance that there was a single collision. And after 5 billion elements, you should bet on a collision (50% chance).
So it all comes down to how many elements you plan to hash and how bad it is if there is a collision (would it create a security problem? can you detect it? can you do anything about it like change the input data?), and of course how much risk you're willing to take for the given problem.
Personally, I'm a 1-in-a-million type of person for these things, though I've been convinced to go down to 1-in-a-thousand at times. (Again, this is not 1:1000 chance of any given element colliding; that would be horrible. This is 1:1000 chance of there being a collision at all after hashing some number of elements.) I would not accept 1-in-a-million in situations where an attacker can craft arbitrary things (of arbitrary size) for you to hash. But I'm very comfortable with it for structured data (email addresses, URLs) of constrained length.
If these numbers work for you, then what you want is a hash that is highly uniform in all its bits. And that's a SHA hash. I'd use a SHA-2 (like SHA-256) because you should always use SHA-2 unless you have a good reason not to. Since SHA-2's bits are all independent of each other (or at least that's its intent), you can select any number of its bits to create a shorter hash. So you compute a SHA-256, and take the top (or bottom) 64-bits as an integer, and that's your hash.
As a rule, for modest sized things, you can get away with this in 64 bits. You cannot get away with this in 32 bits. So when you say "NSNumber/Int", I want you to mean explicitly "64-bit integer." For example, on a 32-bit platform, Swift's Int is only 32 bits, so I would use UInt64 or uint64_t, not Int or NSInteger. I recommend unsigned integers here because these are really unique bit patterns, not "numbers" (i.e. it is not meaningful to add or multiply them) and having negative values tends to be confusing in identifiers unless there is some semantic meaning to it.
Note that everything said about hashes here is also true of random numbers, if they're generated by a cryptographic random number generator. In fact, I generally use random numbers for these kinds of problems. For example, if I want clients to generate their own random unique IDs for messages, how many bits do I need to safely avoid collisions? (In many of my systems, you may not be able to use all the bits in your value; some may be used as flags.)
That's my general solution, but there's an even better solution if your input space is constrained. If your input space is smaller than 2^64, then you don't need hashing at all. Obviously, any Latin-1 string up to 8 characters can be stored in a 64-bit value. But if your input is even more constrained, then you can compress the data and get slightly longer strings. It only takes 5 bits to encode 26 symbols, so you can store a 12 letter string (of a single Latin case) in a UInt64 if you're willing to do the math. It's pretty rare that you get lucky enough to use this, but it's worth keeping in the back of your mind when space is at a premium.
I've built a lot of these kinds of systems, and I will say that eventually, we almost always wind up just making a longer identifier. You can make it work on a small identifier, but it's always a little complicated, and there is nothing as effective as just having more bits.... Best of luck till you get there.
Yes, you can create a hashes that are collision resistant using a cryptographic hash function. The output of such a hash function is in bits if you follow the algorithms specifications. However, implementations will generally only return bytes or an encoding of the byte values. A hash does not return a number, as other's have indicated in the comments.
It is relatively easy to convert such a hash into a number of 32 bites such as an Int or Int32. You just take the leftmost bytes of the hash and interpret those to be an unsigned integer.
However, a cryptographic hash has a relatively large output size precisely to make sure that the chance of collisions is small. Collisions are prone to the birthday problem, which means that you only have to try about 2 to the power of hLen divided by 2 inputs to create a collision within the generated set. E.g. you'd need 2^80 tries to create a collision of RIPEMD-160 hashes.
Now for most cryptographic hashes, certainly the common ones, the same rule counts. That means that for 32 bit hash that you'd only need 2^16 hashes to be reasonably sure that you have a collision. That's not good, 65536 tries are very easy to accomplish. And somebody may get lucky, e.g. after 256 tries you'd have a 1 in 256 chance of a collision. That's no good.
So calculating a hash value to use it as ID is fine, but you'd need the full output of a hash function, e.g. 256 bits of SHA-2 to be very sure you don't have a collision. Otherwise you may need to use something line a serial number instead.

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes.
What if I convert every two chars to its numerical representation and make a single byte out of them?
In Python I'd do something like this:
shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
for first_byte,next_byte in zip(shalist[0::2],shalist[1::2]):
chr(ord(first_byte)+ord(next_byte))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes you get the chance of collision increased 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
Your use of that truncation can be still perfectly legit, depending what it is. Git for example shows only the first few bytes from a commit hash and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this it will be difficult to prove your system, you're probably better of using a shorter but complete hash.
Lets shrink the size to make sense of it and use 2 bytes hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Half that to 1 bytes and you will get a collision after at most 256 strings hashed, regardless do you take the first or the second byte. So your chance of collision is 256 greater (2^(1byte*8bits)) and is 1:256.
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.

choosing a hash function

I was wondering: what are maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For md5, sha-*, maybe even crc32 or adler32.
Your question isn't clear. By "maximum number of bytes" you mean "maximum number of items"? The size of the files being hashed has no relation with the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will aways have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function that outputs a uniform distribution. You may assume that, given enough items to be hashed and cryptographic hash functions (like md5 and sha) or good hashes (like Murmur3, Jenkins, City, and Spooky Hash).
And also assumes no malevolent adversary actively fabricating collisions. Then you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, NOT minimizing expected collisions. They have proprieties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical proprieties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5kb and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA hashes till this date, for example).
You are basically looking at the Birthday paradox, only looking at really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the amount of possibilities before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

is it possible to retrieve a password from a (partial) MD5 hash?

Suppose I have only the first 16 characters of a MD5 hash. If I use brute force attack or rainbow tables or any other method to retrieve the original password, how many compatible candidates have I to expect? 1? (I do not think) 10, 100, 1000, 10^12? Even a rough answer is welcome (for the number, but please be coherent with hash theory and methodology).
The output of MD5 is 16 bytes (128 bits). I suppose that you are talking about an hexadecimal representation, hence as 32 characters. Thus, "16 characters" means "64 bits". You are considering MD5 with its output truncated to 64 bits.
MD5 accepts inputs up to 264 bits in length; assuming that MD5 behaves as a random function, this means that the 218446744073709551616 possible input strings will map more or less uniformly among the 264 outputs, hence the average number of candidates for a given output is about 218446744073709551552, which is close to 105553023288523357112.95.
However, if you consider that you can find at least one candidate, then this means that the space of possible passwords that you consider is much reduced. A rainbow table is a special kind of precomputed table which accepts a compact representation (at the expense of a relatively expensive lookup procedure), but if it covers N passwords, then this means that, at some point, someone could apply the hash function N times. In practice, this severely limits the size N. Assuming N=260 (which means that the table builder had about one hundred NVidia GTX 580 GPU and could run them for six months; also, the table will use quite a lot of hard disks), then, on average, only 1/16th of 64-bit outputs have a matching password in the table. For those passwords which are in the table, there is a 93.75% probability that there is no other password in the table which leads to the same output; if you prefer, if you find a matching password, then you will find, on average, 0.0625 other candidates (i.e. most of the time, no other candidate).
In brief, the answer to your question depends on the size N of the space of possible passwords that you consider (those which were covered during rainbow table construction); but, in practice with Earth-based technology, if you can find one matching password for a 64-bit output, chances are that you will not be able to find another (although there are are really many others).
You should never ever be able to get a password from a partial hash.

How are hash functions like MD5 unique?

I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness, however there are approximately 3.402823669209387e+38 different values in a 32 digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (I.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.
Edit: (given recent news re: SHA1 hashes)
The answer above, still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 Hashes as 2^32 (over 4 billion) times less likely to collide, and it has been demonstrated that it is possible to contrive an input to produce the same value. (This was demonstrated against MD5 quite some time ago). If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days, you need at SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security check context, MD5 still has it's usefulness.
The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because using the same hash on 2 values/objects must yield the same value. Hashes are deterministic. There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well there was a 1 in 2128 chance of collision (ideal case), so we can assume O = P. This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
Cryptographic one-way hash functions are, by nature of definition, not Injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affects their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a prie-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.
While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2128 possible hashes total so the chance of two random strings producing the same hash is theoretically close to 1 in 1038).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always make sure to add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to result in the same output string because otherwise the raw input string that matches the output will fail to match after the automated salting (i.e. you can't just "reverse" the MD5 and use it to log in because the reversed MD5 hash will most likely not be the salted string that originally resulted in the creation of the hash).
So hashes are not unique, but the authentication mechanism can be made to make it sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that results in the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically every one else) said, its not perfect, but it does the job, and collision performance really depends on the algo (which is actually pretty good).
What is of real interest is automatic manipulation of files or data to keep the same hash with different data, see this Demo
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.