What prevents me from reversing a hash function? [closed]


Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
What actually prevents me from reversing a hash function and generating a possible input from a hash that will have the same hash?
I understand that hash functions are one-way functions, which means I cannot recover the real input from its hash.
I googled it a lot, and I found many people explaining this simple example hash function:
hash(x) = x % 7
I can't recover the input (x) from the hash here, but if I know the hash, I can generate a possible input from it that will have the same hash:
unhash(h) = some_random_integer * 7 + h
The value of some_random_integer does not matter at all. unhash(3) will be for example: 24 , and hash(24) is: 3 !
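A quick Python sketch of the toy example above. The names toy_hash and unhash are only illustrative, and the range of the random multiplier is an arbitrary assumption:

```python
import random

def toy_hash(x: int) -> int:
    # The toy hash from the example: only the remainder modulo 7 survives.
    return x % 7

def unhash(h: int) -> int:
    # Any multiple of 7 plus h hashes back to h, so the multiplier is arbitrary.
    some_random_integer = random.randrange(1, 1_000_000)
    return some_random_integer * 7 + h

candidate = unhash(3)
# candidate is almost certainly not the "original" input, but it hashes to 3.
print(candidate, toy_hash(candidate))
```

Every value returned by unhash(3) is a valid preimage of 3 under this toy function, which is exactly the property a cryptographic hash must not have.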
One more example that I found is:
hash(x, y) = x * y
So like the previous example, I cannot find the real input (x and y) from the hash but I can find a possible input that will have the same hash:
x = hash / some_random_integer
y = hash / x
When for example, a malicious hacker gains access to a database full of hashed passwords, he would be able to log in to a hacked user only by generating a possible input that will generate the same hash as his password! It does not have to be the exact original password.
I know that real hash functions are a lot more complicated than these examples, but I cannot think of a math operation that cannot be reversed this way. (Or maybe there are some?)
What actually prevents me from reversing real hash functions this way? (like MD5, SHA1, etc...)

By hash function I assume you are referring to cryptographic hash functions such as the SHA family.
The design of a cryptographic hash function is what keeps you from reversing it; that is a basic criterion of the design.
There are other types of hash functions, such as dictionary hash functions, that may be quite simple, but even these usually lose portions of the input. hash(x) = x % 7 is an example of such a simple hash function.
In the case of password hashing, brute forcing must be taken into account, that is, trying passwords from lists of frequently used passwords and fuzzing. The usual solution is to use a hash function that consumes substantial CPU time, often by iterating a hash function for about 100 ms. PBKDF2 is such a function and is recommended by NIST for password hashing.
Additionally, the input may be larger than the output, so some information is intentionally lost.
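As a rough illustration of the password-hashing point, here is a minimal sketch using PBKDF2 from Python's standard library. The function names, the 16-byte salt, and the default iteration count are assumptions made for the example, not recommendations from this answer; in practice the iteration count would be tuned to the target delay on your own hardware.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 100_000) -> Tuple[bytes, bytes]:
    # A fresh random salt per password (16 bytes is an assumed size).
    if salt is None:
        salt = os.urandom(16)
    # PBKDF2 iterates HMAC-SHA256 many times to make each guess expensive.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 100_000) -> bool:
    _, digest = hash_password(password, salt, iterations)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected)
```

The salt and iteration count have to be stored next to the digest so the same computation can be repeated at login.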

Related

Can we repeatedly hash an input and hash it indefinitely?

Was just wondering that if we are given an input string x and we hash it with function f to get f(x) can we repeat this process indefinitely i.e f(f(x)) and so on. Because most hash functions generate a different fixed output that is not the same as the input.
So by this premise, would we be able to carry this out indefinitely? One possible issue I can think of is that the output has a fixed length and hashes are usually shorter than the input?
Please correct me if I am wrong. Would love an explanation!

Yes you absolutely can hash the prior hash output.
When we do this with cryptographic keys it’s called ratcheting.
The output size of the hashing algo will determine how many outputs you can rehash before you get a collision.
Thus for a 256-bit hash function we will see a collision with 50% probability after 2^128 hashing calls.
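A small sketch of both points, assuming SHA-256 from Python's hashlib. The helper names are made up for the example, and the second function truncates the digest to 2 bytes only so that a repeat shows up quickly; with the full 256-bit output the same loop would not finish in any realistic time.

```python
import hashlib

def iterate_hash(data: bytes, rounds: int) -> bytes:
    # Feed each digest back in as the next input: f(f(...f(x)...)).
    digest = data
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest

def steps_until_repeat(seed: bytes, out_bytes: int = 2) -> int:
    # Iterate a deliberately truncated SHA-256 until some value repeats.
    seen = set()
    value = hashlib.sha256(seed).digest()[:out_bytes]
    steps = 1
    while value not in seen:
        seen.add(value)
        value = hashlib.sha256(value).digest()[:out_bytes]
        steps += 1
    return steps

print(iterate_hash(b"x", 1000).hex())
print(steps_until_repeat(b"seed"))
```

With a 16-bit output the repeat typically appears after a few hundred steps, far fewer than the 65536 possible values, which is the same square-root effect that gives roughly 2^128 for a 256-bit hash.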

Hash functions and polynomial division [closed]

Closed 5 years ago.
I understand that a CRC verifies data integrity by producing a checksum, which is the result of polynomial long division. I've heard hash values referred to as hash checksums, so my question is whether hash functions use some sort of polynomial division as well? I know they break the data up into block ciphers, so my guess would be that the hash functions create some relationship between the polynomial check value and how it's divided into the different blocks. Can someone let me know if I'm way off base here?

A CRC is a hash function, but there are many other ways to implement a hash function. The other ways generally do not use polynomial division, though there are some that use a CRC as a part of the hash calculation, in order to make use of hardware CRC instructions. Most hash functions use a long, convoluted series of ands, nots, exclusive-ors, integer additions, multiplications, and modulos.
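To make the polynomial-division side concrete, here is a sketch of the common CRC-32 (the variant computed by zlib) done one bit at a time. The function name is illustrative; the constant is the bit-reflected form of the CRC-32 generator polynomial.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # bit-reflected CRC-32 generator polynomial

def crc32_bitwise(data: bytes) -> int:
    # Polynomial long division over GF(2), processing one bit at a time.
    crc = 0xFFFFFFFF  # initial register value used by this CRC-32 variant
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                # Low bit set: "subtract" (XOR) the polynomial after shifting.
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final XOR

print(hex(crc32_bitwise(b"123456789")), hex(zlib.crc32(b"123456789")))
```

Everything here is shifts and XORs, which is why a CRC is good at catching accidental errors but offers no resistance to someone deliberately constructing an input with a chosen checksum.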

Is it possible to find the md5 hash of a password without actually having the original password

I was just messing around with some code in Python and realized it was not that difficult to find out what a password is if you have the md5 (basically a brute force attack: inputting the md5, going through millions of passwords, converting them to md5 and checking if it matches, and then outputting the password), but what is difficult is getting the md5. I did some digging and all I found was some videos of people using randomly generated md5 hashes of passwords and then finding out what password each corresponded to. What I was wondering was if there was any way you could find the md5 hash of a password without having the original password. Thx
-If anything is unclear just tell me in the comments and i will clean it up

You're correct that you can brute-force an md5 hash to retrieve the original password, provided that the original password and the brute-force attempts hash to the same value. To compensate for this, often password systems use what's known as "salt" to make this significantly more difficult. (See also: What is SALT and how do i use it?)
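A minimal sketch of that brute-force idea and of what a salt changes. The word list, the helper names, and the salt-prepended-to-password layout are all assumptions for the example; real systems differ in how they combine salt and password.

```python
import hashlib
import os

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def dictionary_attack(target_hex, candidates, salt=b""):
    # Hash every candidate the same way the system did and compare.
    for word in candidates:
        if md5_hex(salt + word.encode()) == target_hex:
            return word
    return None

wordlist = ["123456", "password", "letmein", "qwerty"]

# Unsalted: one precomputed table of hashes works against every user.
leaked = md5_hex(b"letmein")
print(dictionary_attack(leaked, wordlist))

# Salted: the same password gives a different hash for each salt,
# so the guessing work has to be redone for every user separately.
salt_a, salt_b = os.urandom(8), os.urandom(8)
print(md5_hex(salt_a + b"letmein") == md5_hex(salt_b + b"letmein"))
```

Note that the salt does not stop an attacker who knows it from guessing one user's password; it only removes the shortcut of precomputing hashes once for everyone.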
The answer to your question, in general, is no, there is no easy way to obtain the hash of some value without having that value first.
Originally, hashing algorithms were designed to take some input and manipulate it so the output of the algorithm can be used as an index into a table of values. The goal is to have a 1:1 hash (ideally that's extremely fast, hopefully constant time). This means that given some input value x, y = hash(x) should be such that ONLY x hashes to y. In other words, y1 = hash(x1) = hash(x) if, and only if, x1 = x.
As time went on, algorithms were developed that had other properties. Since it became common for hashing algorithms to be used for things like password storage and quick comparison, one of the things that came to be valued in a hashing algorithm is that small changes to the input lead to large differences in the output. In other words, the output of hash(x) should change substantially whether x changes entirely (as in the case of not(x)) or by a single bit.
One corollary is that, if hash(x) changes significantly when you change a single bit (as in the case of hash(x+0x000001)), then it makes the comparison function much faster (since you really only need to check the higher order bits to determine if two objects are the same, in the average case). This means that you can't easily compute the hash of sequential items simply by iterating through hashes (i.e. "guessing" the hash output of the function hash(x) without first having x).
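The single-bit sensitivity described above can be checked with a short sketch. MD5 is used only because the question is about it, and the helper names and the sample message are made up for the example:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    # Count the bit positions where two equal-length byte strings differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def flip_lowest_bit(data: bytes) -> bytes:
    # Change exactly one bit: the lowest bit of the last byte.
    return data[:-1] + bytes([data[-1] ^ 0x01])

message = b"hello world"
tweaked = flip_lowest_bit(message)

d1 = hashlib.md5(message).digest()
d2 = hashlib.md5(tweaked).digest()
print(bit_difference(d1, d2), "of 128 output bits differ")
```

For a well-mixed hash the count should land somewhere around half of the output bits, so knowing hash(x) tells you essentially nothing about the hash of a neighbouring input.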

What's the difference between NOT second preimage resistant and NOT collision resistant [closed]

Closed 8 years ago.
By definition, Not 2nd-preimage resistant means: there exists at least one x (which is known) such that it is easy to find another x', such that h(x) = h(x').
While, Not collision resistant indicates: it is easy to find at least one such pair (x, x') that h(x) = h(x')
I don't see any difference here, anyone can tell? Or do I give the wrong definitions?
And, it is said that "Not collision resistant not necessarily means Not 2nd-preimage resistant", why is that?

Putting this into another answer because it's just too much to type for a comment.
The definition of 2nd-preimage-resistant is you have h(x) and x, and can't create x'.
The definition of preimage-resistant (without second!) means you have only h(x), and can't create x.
And the definition of collision resistant is you have nothing, and may choose any h(x), x and x'.
If you use the hash to sign a plaintext message, you need 2nd-preimage resistance, but not collision resistance. It doesn't matter to you if someone can find two colliding messages that produce a hash that is different from yours, but you want to make sure no one is able to craft a different message that has your hash, even if they know your plaintext.
If you use the hash to store hashed passwords, you don't care about collision resistance, and you don't care about 2nd-preimage-resistance, preimage-resistance is all you need. If an attacker knows one password, you don't really care if he can use that password to find a different one.
So these were two examples where collision resistance is not required, but preimage-resistance or 2nd-preimage-resistance is.
As to "Not collision resistant not necessarily means Not 2nd-preimage resistant", why is that?, consider the hash function: if x has fewer than 24 bits, then h(x)=0, else h(x)=sha256(x). This is very obviously not collision resistant (choose any 2 words that have fewer than 3 letters), but, as long as your text is longer, this function is preimage-resistant and 2nd-preimage-resistant (assuming sha256 hasn't been broken yet).
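The example function from the last paragraph is easy to write down. This is only a sketch: the name weak_h is made up, and h(x)=0 is represented here as 32 zero bytes so that both branches return the same type.

```python
import hashlib

def weak_h(x: bytes) -> bytes:
    # Inputs shorter than 24 bits (3 bytes) all map to the same value.
    if len(x) * 8 < 24:
        return bytes(32)
    # Everything else is handed to SHA-256 unchanged.
    return hashlib.sha256(x).digest()

# Trivial collision: two different short inputs, same output.
print(weak_h(b"a") == weak_h(b"zz"))

# For longer inputs, finding a second preimage is as hard as for SHA-256.
print(weak_h(b"abc") == weak_h(b"abd"))
```

So an attacker can name a colliding pair immediately, yet gains nothing against a specific message of three bytes or more.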

2nd preimage resistant means there's no (easy) way to find a 2nd x (called x') when you have only h(x), and maybe x.
Collision resistant means there's no (easy) way to find any pair (x, x') with h(x)=h(x').
So finding a collision is the easier task for an attacker, because both inputs may be chosen freely, and a function can lose collision resistance while still being 2nd preimage resistant. Think about what happened to MD5 a while ago: there's an algorithm that finds pairs of inputs that produce the same output. But this works only for specifically constructed input, not for random input. So, while it is possible to find messages that have a collision, the generic case "x is some specific message, find a second message that has the same MD5 as x" is not solved yet.

What are the important points about cryptographic hash functions?

I was reading this question on MD5 hash values and the accepted answer confuses me. One of the main properties, as I understand it, of a cryptographic hash function is that it is infeasible to find two different messages (inputs) with the same hash value.
Yet the consensus answer to the question Why aren't MD5 hash values reversible? is Because an infinite number of input strings will generate the same output. This seems completely contradictory to me.
Also, what perplexes me somewhat is the fact that the algorithms are public, yet the hash values are still irreversible. Is this because there is always data loss in a hash function so there's no way to tell which data was thrown away?
What happens when the input data size is smaller than the fixed output data size (e.g., hashing a password "abc")?
EDIT:
OK, let me see if I have this straight:
It is really, really hard to infer the input from the hash because there are an infinite amount of input strings that will generate the same output (irreversible property).
However, finding even a single instance of multiple input strings that generate the same output is also really, really hard (collision resistant property).

Warning: Long answer
I think all of these answers are missing a very important property of cryptographic hash functions: Not only is it impossible to compute the original message that was hashed to get a given hash, it's impossible to compute any message that would hash to a given hash value. This is called preimage resistance.
(By "impossible" - I mean that no one knows how to do it in less time than it takes to guess every possible message until you guess the one that was hashed into your hash.)
(Despite popular belief in the insecurity of MD5, MD5 is still preimage resistant. Anyone who doesn't believe me is free to give me anything that hashes to 2aaddf751bff2121cc51dc709e866f19. What MD5 doesn't have is collision resistance, which is something else entirely.)
Now, if the only reason you can't "work backwards" in a cryptographic hash function was because the hash function discards data to create the hash, then it would not guarantee preimage resistance: You can still "work backwards", and just insert random data wherever the hash function discards data, and while you wouldn't come up with the original message, you'd still come up with a message that hashes to the desired hash value. But you can't.
So the question becomes: Why not? (Or, in other words, how do you make a function preimage resistant?)
The answer is that cryptographic hash functions simulate chaotic systems. They take your message, break it into blocks, mix those blocks around, have some of the blocks interact with each other, mix those blocks around, and repeat that a lot of times (well, one cryptographic hash function does that; others have their own methods). Since the blocks interact with each other, block C not only has to interact with block D to produce block A, but it has to interact with block E to produce block B. Now, sure, you can find values of blocks C, D, E that would produce the blocks A and B in your hash value, but as you go further back, suddenly you need a block F that interacts with C to make D, and with E to make B, and no such block can do both at the same time! You must have guessed wrong values for C, D, and E.
While not all cryptographic hash functions are exactly as described above with block interaction, they have the same idea: That if you try to "work backwards", you're going to end up with a whole lot of dead ends, and the time it takes for you to try enough values to generate a preimage is on the order of hundreds to millions of years (depending on the hash function), not much better than the time it would take just to try messages until you find one that works.
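To get a feel for "try messages until you find one that works", here is a sketch that brute-forces a preimage for SHA-256 truncated to 2 bytes. The truncation, the helper names, and the choice of counting numbers as guesses are all assumptions made so the search finishes quickly; nothing here attacks the real, full-length hash.

```python
import hashlib
from itertools import count
from typing import Tuple

def truncated_sha256(data: bytes, n_bytes: int) -> bytes:
    # Keep only the first n_bytes of the digest to shrink the output space.
    return hashlib.sha256(data).digest()[:n_bytes]

def brute_force_preimage(target: bytes) -> Tuple[bytes, int]:
    # No working backwards: just hash guess after guess until one matches.
    n_bytes = len(target)
    for attempts in count(1):
        guess = str(attempts).encode()
        if truncated_sha256(guess, n_bytes) == target:
            return guess, attempts

target = truncated_sha256(b"some secret message", 2)
guess, attempts = brute_force_preimage(target)
print(guess, attempts)
```

With 16 output bits the search needs on the order of 2^16 guesses on average, and each additional output bit doubles that, which is where the impractical running times for 128-bit and larger hashes come from.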

1: The primary purpose of a hash is to map a very, very large space to a smaller but still very large space (e.g., MD5, which will take 'anything' and convert it into a space of size 2^128 -- big, but not nearly as big as aleph-0.)
In addition to other features, good hashes fill the destination space homogeneously. Bad hashes fill the space in a clumpy fashion, coming up with the same hash for many common inputs.
Imagine the idiotic hash function sum(), which just adds all the digits of the input number: it succeeds in mapping down, but there are a bunch of collisions (inputs with the same output, like 3 and 12 and 21) at the low end of the output space and the upper end of the space is nearly empty. As a result it makes very poor use of the space, is easy to crack, etc.
So a good hash that makes even use of the destination space will make it difficult to find two inputs with the same output, just by the odds: if MD5 were perfect, the odds that two inputs would have the same output would be 2^-128. That's pretty decent odds: the best you can do without resorting to a larger output space. (In truth MD5 isn't perfect, which is one of the things that makes it vulnerable.)
But it will still be true that a huge number of inputs will map to any given hash, because the input space is 'infinite', and dividing infinity by 2^128 still gives you infinity.
2: Yes, hashes always cause data loss, except in the case where your output space is the same as, or larger than, your input space -- and in that case you probably didn't need to hash!
3: For smaller inputs, best practice is to salt the input. Actually, that's good practice for any cryptographic hashing, because otherwise an attacker can feed you specific inputs and try to figure out which hash you are using. 'Salt' is just a set of additional information that you append (or prepend) to your input; you then hash the result.
edit: In cryptography, it is also important that the hash function is resistant to preimage attacks; intuitively, that it is hard to guess the input for a given output even knowing many other input/output pairs. The "sum" function could probably be guessed rather easily (but since it destroys data it still might not be easy to reverse).
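The clumpy behaviour of the digit-sum function from point 1 can be seen with a short sketch. The input range 0..9999, the 37 buckets, and the helper names are choices made for this example only:

```python
import hashlib
from collections import Counter

def digit_sum(n: int) -> int:
    # The "idiotic" hash: add up the decimal digits.
    return sum(int(d) for d in str(n))

def sha_bucket(n: int, buckets: int) -> int:
    # Reduce a SHA-256 digest to the same number of buckets for comparison.
    digest = hashlib.sha256(str(n).encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

inputs = range(10_000)
buckets = 37  # digit sums of 0..9999 fall in 0..36

digit_counts = Counter(digit_sum(n) for n in inputs)
sha_counts = Counter(sha_bucket(n, buckets) for n in inputs)

# Largest and smallest bucket for each function.
print(max(digit_counts.values()), min(digit_counts.values()))
print(max(sha_counts.values()), min(sha_counts.values()))
```

The digit sum piles most inputs into the middle buckets and leaves the extremes almost empty, while the SHA-256 based buckets come out close to equal in size.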

You may be confused, because the answer to the question you cite is confusing.
One of the requirements for a cryptographic hash function is that it should be preimage resistant. That is, if you know MD5(x) but not the message x, then it is difficult to find any x' (either equal to x or different from x) such that MD5(x') = MD5(x).
Being preimage resistant is a different property than being reversible. A function is reversible if given y = f(x) there is exactly one x which fits (whether this is easy or not). For example define f(x) = x mod 10.
Then f is not reversible. From f(x) = 7 you can't determine whether x was 17, 27 or something else. But f is not preimage resistant, since values x' such that f(x') = 7 are easy to find. x' = 17, 27, 12341237 etc. all work.
When doing crypto you usually need functions that are preimage resistant (and other properties such as collision resistance), not just something that is not reversible.

These are the properties of hash functions in general.
A word of caution though, MD5 shouldn't be used anymore because of vulnerabilities that have been found in it. Check the 'Vulnerabilities' section and external links detailing these attacks. http://en.wikipedia.org/wiki/Md5 You can make an MD5 collision by changing only 128 bits in a message.
SHA-1 is safe for simple hashing although there are some attacks that would make it weaker against well-funded entities (Governments, large corporations)
SHA-256 is a safe starting point against technology for the next couple decades.

Yet the consensus answer to the question "why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output."
This is true for any hash function, but it is not the essence of a cryptographic hash function.
For short input strings such as passwords it is theoretically possible to reverse a cryptographic hash function, but it ought to be computationally infeasible. I.e. your computation would run too long to be useful.
The reason for this infeasibility is that the input is so thoroughly "mixed together" in the hash value that it becomes impossible to disentangle it with any less effort than the brute force attack of computing the hash value for all inputs.

"why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output"
This is the reason that it isn't possible to reverse the hash function (get the same input).
Cryptographic hash functions are collision resistant, which means that it's also hard to find another input value that maps to the same output (if your hash function were mod 2: 134 mod 2 = 0; now you can't get the 134 back from the result, but we can still find the number 2 with the same output value (134 and 2 collide)).
When the input is smaller than the block size, padding is used to fit it to the block size.
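As a sketch of what such padding looks like, here is the scheme MD5 uses (a 0x80 byte, zero bytes, then the message length in bits as a 64-bit little-endian integer), with the 64-byte block size MD5 works on. The function name is made up, and this only reproduces the padding step, not the hash itself.

```python
import hashlib
import struct

BLOCK_SIZE = 64  # MD5 processes the message in 64-byte blocks

def md5_style_pad(message: bytes) -> bytes:
    bit_length = (len(message) * 8) % (1 << 64)
    # A single 1 bit (0x80), then zeros until 8 bytes short of a full block.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % BLOCK_SIZE)
    # The original length in bits fills the last 8 bytes of the block.
    return padded + struct.pack("<Q", bit_length)

print(len(md5_style_pad(b"abc")))          # padded input length in bytes
print(len(hashlib.md5(b"abc").digest()))   # digest length in bytes
```

So a 3-byte password like "abc" is first stretched to a full block, and the digest still comes out at the fixed 16 bytes.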