I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle these values surely cannot be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness; however, there are approximately 3.4 x 10^38 different 32-digit hex values (16^32 = 2^128). That means that, assuming the math behind the algorithm gives a good distribution, the odds of a duplicate are phenomenally small. You do have to keep in mind that it IS possible to get a duplicate when you're thinking about how it will be used. MD5 is generally used to determine whether something has been changed (i.e. it's a checksum). It would be ridiculously unlikely that something could be modified and still produce the same MD5 checksum.
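To make that concrete, here's a minimal sketch using Python's standard hashlib (the byte strings are just made-up sample data): the same input always produces the same 32-hex-digit checksum, while a one-character change produces a completely different one.

```python
import hashlib

original = b"the quick brown fox jumps over the lazy dog"
modified = b"the quick brown fox jumps over the lazy cog"  # one character changed

# MD5 as a change-detection checksum: equal inputs give equal digests,
# and even a tiny modification gives a drastically different digest.
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(modified).hexdigest())
```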
Edit: (given recent news re: SHA1 hashes)
The answer above still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 hashes are 2^32 (over 4 billion) times less likely to collide, yet it has been demonstrated that it is possible to contrive an input to produce the same value (this was demonstrated against MD5 quite some time ago). If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days you need at least SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security-check context, MD5 still has its uses.
The argument could be made that an SHA-2 hash is cheap enough to compute that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you most likely won't have O itself but rather its hash hO, which came with P). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because hashing the same value must always yield the same result: hashes are deterministic. There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well, there was a 1 in 2^128 chance of collision (ideal case), so we can assume O = P." This may work for passwords if you restrict their length and character set, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
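Here's a rough sketch of that decision in Python, with hashlib's MD5 standing in for the hash and short byte strings standing in for the objects O and P:

```python
import hashlib

def probably_equal(o: bytes, h_o: str, p: bytes, paranoid: bool = False) -> bool:
    """Decide whether P equals O, given O's stored hash h_o."""
    h_p = hashlib.md5(p).hexdigest()
    if h_p != h_o:
        return False        # different hashes: O and P are definitely different (no false negatives)
    if paranoid:
        return o == p       # option b: full comparison to rule out a false positive
    return True             # option a: accept the ~1-in-2^128 risk and treat them as equal

o = b"contents of object O"
h_o = hashlib.md5(o).hexdigest()
print(probably_equal(o, h_o, b"something else entirely"))  # False
print(probably_equal(o, h_o, b"contents of object O"))     # True
```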
Cryptographic one-way hash functions are, by definition, not injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affect their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute-force attack (finding a pre-image for a given hash image). We may care about how hard it is to find a collision: two pre-images that have the same hash image, to be used in a birthday attack.
While you are bound to get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2^128 possible hashes in total, so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to produce the same output. In other words, you can't just "reverse" the MD5 and use the result to log in, because the brute-forced input will most likely not be the salted string that originally produced the hash.
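A minimal sketch of that salting scheme, using Python's hashlib and os.urandom (MD5 appears only because it is the subject here, not as a recommendation):

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Return (salt, MD5(salt + password)); draws a fresh random salt if none is supplied."""
    if salt is None:
        salt = os.urandom(16)              # per-user random salt
    digest = hashlib.md5(salt + password.encode()).hexdigest()
    return salt, digest

salt, stored = hash_password("hunter2")

# Verification must re-use the stored salt. A brute-forced preimage of the *unsalted*
# hash would not match, because it would be salted again before comparison.
_, attempt = hash_password("hunter2", salt)
print(attempt == stored)   # True only when both the password and the salt match
```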
So hashes are not unique, but the authentication mechanism can be made sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that produces the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).
Bigger hashes mean a larger set of possible outputs for the same input set, and therefore a lower chance of overlap; until processing power advances enough to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically everyone else) said, it's not perfect, but it does the job, and collision behaviour really depends on the algorithm (which in this case is actually pretty good).
What is of real interest is the automated manipulation of files or data to keep the same hash while the data differs; see this demo.
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.
I was just messing around with some code in Python and realized it was not that difficult to find out what a password is if you have its MD5 hash (basically a brute-force attack: take the MD5 hash, go through millions of passwords, convert each to MD5 and check whether it matches, then output the password). What is difficult is getting the MD5 hash in the first place. I did some digging and all I found were some videos of people taking randomly generated MD5 hashes of passwords and then finding out which password each corresponded to. What I was wondering was whether there is any way you could find the MD5 hash of a password without having the original password. Thanks!
If anything is unclear just tell me in the comments and I will clean it up.
You're correct that you can brute-force an MD5 hash to retrieve the original password, provided that the original password and one of the brute-force attempts hash to the same value. To compensate for this, password systems often use what's known as a "salt" to make this significantly more difficult. (See also: What is SALT and how do I use it?)
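For reference, the brute-force idea from the question is only a few lines in Python (hashlib plus a made-up word list):

```python
import hashlib

def crack_md5(target_hex, candidates):
    """Hash each candidate password and return the one matching the target MD5 digest."""
    for word in candidates:
        if hashlib.md5(word.encode()).hexdigest() == target_hex:
            return word
    return None   # not found in the word list

wordlist = ["letmein", "password", "qwerty", "hunter2"]
target = hashlib.md5(b"qwerty").hexdigest()    # pretend this digest leaked from a database
print(crack_md5(target, wordlist))             # -> "qwerty"
```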
The answer to your question, in general, is no, there is no easy way to obtain the hash of some value without having that value first.
Originally, hashing algorithms were designed to take some input and manipulate it so that the output could be used as an index into a table of values. The goal is a 1:1 hash that is ideally extremely fast (hopefully constant time). This means that given some input value x, y = hash(x) should be such that ONLY x hashes to y. In other words, y1 = hash(x1) = hash(x) if, and only if, x1 = x.
As time went on, algorithms were developed with other properties. Since hashing algorithms came to be used for things like password storage and quick comparison, one of the things valued in a hashing algorithm is how strongly small changes to the input affect the output. In other words, hash(x) should change whether x changes entirely (as in the case of not(x)) or by only a single bit.
One corollary is that, if hash(x) changes significantly when you change a single bit (as in the case of hash(x + 0x000001)), then comparisons get faster (since, in the average case, you only need to check the higher-order bits to determine whether two objects differ). It also means that you can't easily compute the hashes of sequential items simply by iterating through previous hashes, i.e. "guess" the output of hash(x) without first having x.
I wish to compare hashes to check for collisions (yes, I know it is time-consuming, but never mind that). In checking for collisions, hashes need to be compared. Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
I would prefer the first option because it is much faster, but is there a recommended method? Are you less likely to find a collision by using the first method?
Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
Neither.
I would prefer the first option because it is much faster, but is there a recommended method?
I don't understand why you think the first method might work, but then you haven't fully explained your situation. Still, if you want to detect hash values that repeat, you do indeed need to keep track of already-seen hash values. To do that you don't want to search linearly through a list; you should use a set container to store the seen hashes. A hash table (as suggested in a comment by gnasher729 a few hours back) would give O(1) performance (e.g. in C++, if your hashes are 64-bit, std::unordered_set<uint64_t>), or a balanced binary tree gives O(log N) performance (e.g. C++ std::set<uint64_t>).
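The same idea in Python, if that's easier to picture (a built-in set plays the role of std::unordered_set, and the hash values are just example integers):

```python
def find_repeated_hashes(hash_values):
    """Return every hash value that has been seen before, tracking all seen values in a set."""
    seen = set()            # average O(1) membership test, like std::unordered_set
    repeats = []
    for h in hash_values:
        if h in seen:
            repeats.append(h)
        seen.add(h)
    return repeats

print(find_repeated_hashes([0x1A2B, 0x3C4D, 0x1A2B, 0x5E6F]))   # [6699], i.e. 0x1A2B repeated
```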
Are you less likely to find a collision by using the first method?
You're very likely to miss collisions.
All that said, you may want to reexamine your premise. The chance of a good (cryptographic quality) hash function producing collisions closely follows the odds described by the "birthday paradox". As a rule of thumb, if you have 2^N distinct values to hash, you're statistically unlikely to see collisions if your hashes are comfortably more than 2*N bits wide: if you allow enough "comfort", you're more likely to be hit on the noggin by a meteor than to have your program see a collision. You mentioned MD5, so I'd expect 128 bits: unless you're storing on the order of a quadrillion values or more (literally), it's pretty safe to ignore the potential for collisions.
Do note one important use of hash values where collisions happen more often for a different reason, and that's in hash tables, where even non-colliding hash values may collide at the same bucket index after they're "wrapped" - often a la h % N when N is the number of buckets. In general, it's impractical to ignore the potential for collisions in a hash table, and very unwise to try.
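A two-line illustration of that bucket-level collision, with made-up 64-bit hash values and a small table:

```python
N = 16                          # number of buckets in a small hash table
h1 = 0x9E3779B97F4A7C15         # two distinct (made-up) 64-bit hash values...
h2 = 0x5851F42D4C957F25
print(h1 % N, h2 % N)           # ...that nevertheless land in the same bucket: 5 5
```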
OK, so here's the use case. I have lots of somewhat lengthy (200-500 character) strings that I'd like to have a smaller deterministic hash for. Since I can store the full 160-bit SHA-1 value in a mere 20 bytes, this yields an order-of-magnitude space improvement per string.
But of course one has to worry about collisions when hashing strings, even with a crypto hash with decent avalanche effects. I know the chances are infinitesimally small, but I'd like to be more conservative. If I do something like this:
hash(input) = CONCAT(HF1(input),HF2(input))
where HF1 is some suitably robust hash function and HF2 is another, distinct, robust hash function, does this effectively make the chance of a collision near impossible (at the cost of 40 bytes now instead of 20)?
NOTE: I am not concerned with the security/crypto implications of SHA-1 for my use case.
CLARIFICATION: the original question asked about hashing the concatenated hash value (not concatenating hashes), which DOES NOT change the collision probability of the outer hash function.
Assuming "reasonable" hash functions, then by concatenating, all you're doing is creating a hash function with a larger output space. So yes, this reduces the probability of collision.
But either way, it's probably not worth worrying about. 2^320 is something like the number of particles in the universe. So you only need to worry if you're expecting attackers.
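A minimal sketch of the concatenation from the question, assuming Python's hashlib with SHA-1 and SHA-256 standing in for HF1 and HF2 (so 52 bytes rather than the 40 mentioned above):

```python
import hashlib

def concat_hash(data):
    """hash(input) = CONCAT(HF1(input), HF2(input)), here with SHA-1 and SHA-256."""
    return hashlib.sha1(data).digest() + hashlib.sha256(data).digest()

digest = concat_hash(b"one of the 200-500 character strings")
print(len(digest))   # 20 + 32 = 52 bytes; a collision now requires both hashes to collide at once
```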
I asked the wrong question initially. This was probably the question I was looking for:
Probability of SHA1 collisions
This was also illuminating
Understanding sha-1 collision weakness
I guess it's fair to ask whether, if I had two hash functions whose concatenated size was smaller than 20 bytes (say two distinct 32-bit hash functions), concatenating those would produce a collision probability small enough to ignore in practice, since two (or even three) of them concatenated would still be smaller than SHA-1.
I was wondering: what is the maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For MD5, SHA-*, maybe even CRC32 or Adler-32.
Your question isn't clear. By "maximum number of bytes" do you mean "maximum number of items"? The size of the files being hashed has no relation to the number of collisions (assuming all the files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will always have collisions, as expected.
As for the answer to the question "how many items can I hash while keeping the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function with a uniform output distribution. You may assume that for cryptographic hash functions (like MD5 and SHA) and for good general-purpose hashes (like MurmurHash3, Jenkins, CityHash, and SpookyHash).
It also assumes no malevolent adversary actively fabricating collisions; against one of those you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, not to minimize expected collisions. They have properties like "detects all bit zeroings of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical properties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5 kB and get the same SHA-512, though it is extremely unlikely (no SHA-512 collision has ever been found to date, for example).
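As a rough sanity check of the 820-billion-documents figure quoted above, the usual birthday-bound approximation n ≈ sqrt(2 * 2^b * ln(1/(1-p))) can be evaluated directly:

```python
import math

def items_for_collision_probability(bits, p):
    """Approximate item count n at which an ideal `bits`-bit hash collides with probability p."""
    return math.sqrt(2 * (2 ** bits) * -math.log1p(-p))   # birthday-bound approximation

# 128-bit MD5 at the top of the disk-error range (10^-15):
print(f"{items_for_collision_probability(128, 1e-15):.2e}")   # ~8.25e+11, i.e. about 820 billion
```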
You are basically looking at the birthday paradox, only with really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the amount of possibilities before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)
I was reading this question on MD5 hash values and the accepted answer confuses me. One of the main properties, as I understand it, of a cryptographic hash function is that it is infeasible to find two different messages (inputs) with the same hash value.
Yet the consensus answer to the question "Why aren't MD5 hash values reversible?" is "Because an infinite number of input strings will generate the same output." This seems completely contradictory to me.
Also, what perplexes me somewhat is the fact that the algorithms are public, yet the hash values are still irreversible. Is this because there is always data loss in a hash function so there's no way to tell which data was thrown away?
What happens when the input data size is smaller than the fixed output data size (e.g., hashing a password "abc")?
EDIT:
OK, let me see if I have this straight:
It is really, really hard to infer the input from the hash because there are an infinite number of input strings that will generate the same output (the irreversibility property).
However, finding even a single instance of multiple input strings that generate the same output is also really, really hard (collision resistant property).
Warning: Long answer
I think all of these answers are missing a very important property of cryptographic hash functions: Not only is it impossible to compute the original message that was hashed to get a given hash, it's impossible to compute any message that would hash to a given hash value. This is called preimage resistance.
(By "impossible" - I mean that no one knows how to do it in less time than it takes to guess every possible message until you guess the one that was hashed into your hash.)
(Despite popular belief in the insecurity of MD5, MD5 is still preimage resistant. Anyone who doesn't believe me is free to give me anything that hashes to 2aaddf751bff2121cc51dc709e866f19. What MD5 doesn't have is collision resistance, which is something else entirely.)
Now, if the only reason you couldn't "work backwards" in a cryptographic hash function were that the hash function discards data to create the hash, then preimage resistance would not be guaranteed: you could still "work backwards" and just insert random data wherever the hash function discards data, and while you wouldn't come up with the original message, you'd still come up with a message that hashes to the desired hash value. But you can't.
So the question becomes: Why not? (Or, in other words, how do you make a function preimage resistant?)
The answer is that cryptographic hash functions simulate chaotic systems. They take your message, break it into blocks, mix those blocks around, have some of the blocks interact with each other, mix those blocks around, and repeat that a lot of times (well, one cryptographic hash function does that; others have their own methods). Since the blocks interact with each other, block C not only has to interact with block D to produce block A, but it has to interact with block E to produce block B. Now, sure, you can find values of blocks C, D, E that would produce the blocks A and B in your hash value, but as you go further back, suddenly you need a block F that interacts with C to make D, and with E to make B, and no such block can do both at the same time! You must have guessed wrong values for C, D, and E.
While not all cryptographic hash functions are exactly as described above with block interaction, they have the same idea: That if you try to "work backwards", you're going to end up with a whole lot of dead ends, and the time it takes for you to try enough values to generate a preimage is on the order of hundreds to millions of years (depending on the hash function), not much better than the time it would take just to try messages until you find one that works.
1: The primary purpose of a hash is to map a very, very large space to a smaller but still very large space (e.g., MD5, which will take 'anything' and convert it into a space of size 2^128 -- big, but not nearly as big as aleph-0.)
In addition to other features, good hashes fill the destination space homogeneously. Bad hashes fill the space in a clumpy fashion, coming up with the same hash for many common inputs.
Imagine the idiotic hash function sum(), which just adds all the digits of the input number: it succeeds in mapping down, but there are a bunch of collisions (inputs with the same output, like 3 and 12 and 21) at the low end of the output space, and the upper end of the space is nearly empty. As a result it makes very poor use of the space, is easy to crack, etc.
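That toy function in code, to make the clumping visible (a deliberately bad hash, purely for illustration):

```python
def digit_sum_hash(n):
    """The 'idiotic' hash: just add up the decimal digits of the input."""
    return sum(int(d) for d in str(n))

print(digit_sum_hash(3), digit_sum_hash(12), digit_sum_hash(21))   # 3 3 3 -- easy collisions
print(digit_sum_hash(999999))   # 54: even a large input stays near the bottom of the output space
```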
So a good hash that makes even use of the destination space will make it difficult to find two inputs with the same output, just by the odds: if MD5 were perfect, the odds that two inputs would have the same output would be 2^-128. That's pretty decent odds: the best you can do without resorting to a larger output space. (In truth MD5 isn't perfect, which is one of the things that makes it vulnerable.)
But it will still be true that a huge number of inputs will map to any given hash, because the input space is 'infinite', and dividing infinity by 2^128 still gives you infinity.
2: Yes, hashes always cause data loss, except in the case where your output space is the same as, or larger than, your input space -- and in that case you probably didn't need to hash!
3: For smaller inputs, best practice is to salt the input. Actually, that's good practice for any cryptographic hashing, because otherwise an attacker can feed you specific inputs and try to figure out which hash you are using. 'Salt' is just a set of additional information that you append (or prepend) to your input; you then hash the result.
edit: In cryptography, it is also important that the hash function is resistant to preimage attacks; intuitively, that means it is hard to guess the input for a given output even knowing many other input/output pairs. The "sum" function could probably be guessed rather easily (though, since it destroys data, it still might not be easy to reverse).
You may be confused, because the answer to the question you cite is confusing.
One of the requirements for a cryptographic hash function is that it should be preimage resistant. That is, if you know MD5(x) but not the message x, then it is difficult to find any x' (whether equal to x or different from x) such that MD5(x') = MD5(x).
Being preimage resistant is a different property than being reversible. A function is reversible if, given y = f(x), there is exactly one x which fits (whether finding it is easy or not). For example, define f(x) = x mod 10.
Then f is not reversible: from f(x) = 7 you can't determine whether x was 17, 27 or something else. But f is not preimage resistant either, since values x' such that f(x') = 7 are easy to find: x' = 17, 27, 12341237, etc. all work.
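The same point in a few lines of code, using the toy f(x) = x mod 10 from above:

```python
def f(x):
    return x % 10              # toy function: not reversible, and not preimage resistant

# Not reversible: many inputs share the output 7, so 7 alone doesn't tell you which x it was.
print(f(17), f(27), f(12341237))    # 7 7 7

# Not preimage resistant either: finding *some* x' with f(x') = 7 is trivial.
print(next(x for x in range(100) if f(x) == 7))   # 7
```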
When doing crypto you usually need functions that are preimage resistant (and other properties such as collision resistance), not just something that is not reversible.
These are the properties of hash functions in general.
A word of caution though: MD5 shouldn't be used anymore because of vulnerabilities that have been found in it. Check the 'Vulnerabilities' section and the external links detailing these attacks: http://en.wikipedia.org/wiki/Md5. You can make an MD5 collision by changing only 128 bits in a message.
SHA-1 is safe for simple hashing, although there are some attacks that would make it weaker against well-funded entities (governments, large corporations).
SHA-256 is a safe starting point against technology for the next couple decades.
Yet the consensus answer to the question "why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output."
This is true for any hash function, but it is not the essence of a cryptographic hash function.
For short input strings such as passwords it is theoretically possible to reverse a cryptographic hash function, but it ought to be computationally infeasible. I.e. your computation would run too long to be useful.
The reason for this infeasibility is that the input is so thoroughly "mixed together" in the hash value that it becomes impossible to disentangle it with any less effort than the brute-force attack of computing the hash value for all inputs.
"why aren't MD5 hash values reversible?" is because "an infinite number of input strings >will generate the same output"
this is the reason that it isn't possible to reverse the hash function (get the same input).
cryptographic hash functions are collision resistant, that means that it's also hard to find another input value that maps to the same output (if your hash function was mod 2 : 134 mod 2 = 0; now you can't get the 134 back from the result, but we can stil find number 2 with the same output value (134 and 2 collide)).
When the input is smaller than the block size, padding is used to bring it up to the block size.
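Concretely, MD5's padding (per RFC 1321) appends a single 1 bit (a 0x80 byte), then zero bytes until the length is 56 mod 64, then the original length in bits as a 64-bit little-endian value; a sketch:

```python
import struct

def md5_pad(message):
    """Pad a message out to a multiple of MD5's 64-byte block size, per RFC 1321."""
    bit_length = (len(message) * 8) % (1 << 64)      # original length in bits, mod 2^64
    padded = message + b"\x80"                       # a single 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded)) % 64)    # zero-fill until length is 56 mod 64
    padded += struct.pack("<Q", bit_length)          # append the bit length, little-endian
    return padded

print(len(md5_pad(b"abc")))   # 64: even a 3-byte input becomes one full 512-bit block
```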