What's the difference between NOT second preimage resistant and NOT collision resistant [closed]

By definition, not 2nd-preimage resistant means: there is at least one known x for which it is easy to find a different x' such that h(x) = h(x').
Meanwhile, not collision resistant means: it is easy to find at least one pair (x, x'), with x ≠ x', such that h(x) = h(x').
I don't see any difference here; can anyone explain? Or am I using the wrong definitions?
Also, it is said that "not collision resistant does not necessarily mean not 2nd-preimage resistant". Why is that?

Putting this into another answer because it's just too much to type for a comment.
The definition of 2nd-preimage resistance is: you are given both x and h(x), and you can't create a different x' with h(x') = h(x).
The definition of preimage resistance (without "second"!) is: you are given only h(x), and you can't create any x that hashes to it.
And the definition of collision resistance is: you are given nothing, and you may choose any x and x' yourself, as long as h(x) = h(x').
If you use the hash to sign a plaintext message, you need 2nd-preimage resistance, but not collision resistance. It doesn't matter to you if someone can find two colliding messages whose hash is different from yours, but you want to make sure no one is able to craft a different message that has your hash, even if they know your plaintext.
If you use the hash to store hashed passwords, you don't care about collision resistance, and you don't care about 2nd-preimage resistance; preimage resistance is all you need. If an attacker knows one password, you don't really care whether he can use that password to find a different one.
So these were two examples where collision resistance is not required, but preimage resistance or 2nd-preimage resistance is.
As to "not collision resistant does not necessarily mean not 2nd-preimage resistant", why is that?: consider the hash function where h(x) = 0 if x has fewer than 24 bits, and h(x) = sha256(x) otherwise. This is very obviously not collision resistant (choose any two strings of at most 2 characters), but as long as your text is longer than that, this function is preimage resistant and 2nd-preimage resistant (assuming sha256 hasn't been broken yet).
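A quick sketch of this contrived function in Python (the 32-byte zero output for short inputs is an arbitrary illustrative choice):

import hashlib

def contrived_hash(x: bytes) -> bytes:
    # Inputs shorter than 24 bits all hash to zero: trivial collisions.
    if len(x) * 8 < 24:
        return b"\x00" * 32
    # Longer inputs fall through to SHA-256 and inherit its resistance.
    return hashlib.sha256(x).digest()

assert contrived_hash(b"a") == contrived_hash(b"hi")            # easy collision
assert contrived_hash(b"longer text") != contrived_hash(b"other text")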

2nd-preimage resistant means there's no (easy) way to find a second input x' with h(x') = h(x) when you are given h(x), and possibly x itself.
Not collision resistant means there's an (easy) way to find some pair (x, x') with h(x) = h(x').
So the collision property is the easier one to break. Think about what happened to MD5 a while ago: there's an algorithm that finds pairs of inputs that produce the same output. But this works only for specifically constructed input, not for arbitrary input. So, while it is possible to find messages that collide, the generic case ("x is some specific message; find a second message that has the same MD5 as x") is not solved yet.


Is there a way to verify a common seed to a cumulative sequence of hashes with unknown repetitions between each value presented?

I am writing a variant of the Cuckoo Cycle that uses an adjacency list for presenting solutions from two pairs of 8-bit coordinates. I am not having any problems finding what I think should be an optimal solver for it: it uses two pairs of head/tail binary search trees to keep track of possible solution nodes and rejected (branch) nodes, plus a binary tree that keeps a list of the candidate cycles as they are being assembled (as I understand it, binary search trees shorten the amount of processing for finding duplicates). But I need to refine the verifier function for solutions.
I see that Cuckoo has some process by which it modifies the edges with XOR functions and masks to identify a valid cycle, but I have two issues.
One is that each hash is generated from the previous hash, starting with the nonce. Proving that all offered node/edge pairs are valid derivatives of the nonce seems to require the verifier to repeat the hash function, checking for a match each time until it gets a hit, which could take up to several thousand iterations in the worst case. Is there some property that can be used to shortcut this identification process, since, unlike protection against DoS, we are providing the salt of the hash?
Second, even if the presented cycle is perfectly valid, it is possible that one or more of the node/edge pairs in the cycle has a duplicate coordinate. The hashes are 32 bits long, and each coordinate is 8 bits. The answer to this probably relates to the previous question as well, since having the seed of a hash function is a known security risk because of collisions. So, as well as verifying that the nodes are part of a cycle in the lowest possible values of the finite field, I need a way to be sure that a pair does not overlap with another possible, branching pair.
I will be studying the verifier in the Cuckoo Cycle implementation more closely to see if I can figure out how the algorithm ensures it is not approving a cycle that actually has a branch (and is thus invalid), but I thought I'd pop the question on this site in case someone knows better the ways of recognising hashes from a common seed, and whether there is any way to recognise a 50% collision between a given coordinate and another one.
Note: After thinking about it for a while, I realised that I could solve the 'fake cycle' problem (one or more nodes having a branch) by simply splitting the heads and tails into separate, subsequent (odd then even) hashes, such as 16-bit Murmur3 hashes.
Thinking about it further, I realised that Cuckoo Cycle is actually a special type of hash collision search, one that seeks only collisions that occur once in the low order of the finite field. I am devising a new scheme called Hummingbird, which will not target the smallest numbers (which is also what hashcash does) but instead the hashes in a chain most proximate to the seed nonce. This means that attempts to insert branched nodes into the graph of the solution will be discovered in the verification, which will probably take about 2-5 seconds depending on how deep the chain goes. These bogus solutions could be eliminated by specifying a maximum hash chain length as part of the consensus.
I just wanted to add that I answered my own question: I realised that what my algorithm is looking for is, essentially, a hash collision, and the simplest solution, with the least bit-twiddling, was to make each coordinate a distinct hash in a hash chain (hash of nonce, then hash of hash, etc.); see the sketch below.
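A minimal sketch of that hash-chain idea (hypothetical helper names, with SHA-256 standing in for whatever hash the implementation actually uses):

import hashlib

def hash_chain(nonce: bytes, length: int) -> list[bytes]:
    # Each link is the hash of the previous one, starting from the nonce.
    links, current = [], nonce
    for _ in range(length):
        current = hashlib.sha256(current).digest()
        links.append(current)
    return links

def verify_link(nonce: bytes, position: int, link: bytes) -> bool:
    # The verifier recomputes the chain and checks the claimed position.
    return hash_chain(nonce, position + 1)[position] == link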
I didn't fully understand that Cuckoo Cycle is essentially a search for partial hash collisions, and when that dawned on me, I realised that the simple solution is to just turn it into a search for hash collisions.
I have, from this realisation, moved very quickly forward to figuring out how my variation of Cuckoo can be much more simply implemented, as well as how to structure the B-tree based progressive search algorithm, the difficulty adjustment, and the rest.
I wasn't aware there was a Stack Exchange specialist site for math or cryptography, or I would have posted this there instead. I studied FEC a few months ago, and that opened the floodgates to a whole bunch of other ideas that led to me getting so worked up about Cuckoo Cycle. I believe I have generalised the Cuckoo Cycle into a generic, parameterisable graph-theoretic proof of work, and I will get back to finishing my implementation.
Thanks to everyone who submitted an answer, I will upvote as I deem correct, though I have zero or nearly zero rep, for what it's worth.

What prevents me from reversing a hash function? [closed]

What actually prevents me from reversing a hash function and generating a possible input that will produce the same hash?
I understand that hash functions are one-way functions, which means I cannot recover the real input from its hash.
I googled this a lot, and I found a lot of people explaining it with this simple example hash function:
hash(x) = x % 7
I can't recover the input (x) from the hash here, but if I know the hash, I can generate a possible input from it that will have the same hash:
unhash(h) = some_random_integer * 7 + h
The value of some_random_integer does not matter at all. unhash(3) might be, for example, 24, and hash(24) is 3!
One more example that I found is:
hash(x, y) = x * y
So, as in the previous example, I cannot find the real input (x and y) from the hash, but I can find a possible input that will produce the same hash (as long as some_random_integer divides the hash exactly):
x = hash / some_random_integer
y = hash / x
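Both toy examples can be checked directly in Python (the function names here are just for illustration):

def hash7(x: int) -> int:
    return x % 7

def unhash7(h: int, some_random_integer: int = 3) -> int:
    return some_random_integer * 7 + h      # any non-negative integer works

assert hash7(unhash7(3)) == 3               # 24 hashes back to 3

def hash_mul(x: int, y: int) -> int:
    return x * y

def unhash_mul(h: int, divisor: int = 2) -> tuple[int, int]:
    assert h % divisor == 0                 # the divisor must divide h exactly
    return divisor, h // divisor

x, y = unhash_mul(hash_mul(6, 7))           # (2, 21) also multiplies to 42
assert x * y == 42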
When, for example, a malicious hacker gains access to a database full of hashed passwords, he would be able to log in as a hacked user just by generating a possible input that produces the same hash as that user's password! It does not have to be the exact original password.
I know that real hash functions are a lot more complicated than these examples, but I cannot think of a math operation that cannot be reversed this way. (Or maybe there are some?)
What actually prevents me from reversing real hash functions this way (like MD5, SHA1, etc.)?
By "hash function", the assumption is that you are referring to cryptographic hash functions such as the SHA family.
The design of a cryptographic hash function keeps you from reversing it; that is a basic criterion of the design.
There are other types of hash functions, such as dictionary hash functions, that may be quite simple, but even these usually lose portions of the input. hash(x) = x % 7 is an example of such a simple hash function.
In the case of password hashing, brute forcing must be taken into account, that is, trying passwords from lists of frequently used passwords, plus fuzzing. The usual solution is to use a hash function that consumes substantial CPU time, often by iterating a hash function for about 100 ms; PBKDF2 is such a function and is recommended by NIST for password hashing.
Additionally, the input may be larger than the output, and some information is intentionally lost.
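As a sketch, Python's hashlib exposes PBKDF2 directly; the iteration count below is illustrative and should be tuned so that one hash takes on the order of 100 ms on your hardware:

import hashlib
import os

def hash_password(password: str, iterations: int = 600_000) -> tuple[bytes, bytes]:
    # A fresh random salt per password defeats precomputed tables.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest   # store both; verify by recomputing with the stored salt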

Sha-1 hash fixed point

How hard is it to find x
where
sha1(x) = x?
where x is of the form 'c999303647068a6abaca25717850c26c9cd0d89c'
I think the fact that there are SHA-1 collisions makes this possible, but how easy (or hard) is it to find an example?
Read Cryptanalysis of SHA-1 on Wikipedia. There's more information than you need in that article and its references combined.
Edit:
how hard is it to find x where sha1(x) = x?
Such an attack is known as a preimage attack, and finding such an x is usually much harder than a general collision attack, i.e. finding arbitrary x1 and x2 such that sha1(x1) = sha1(x2).
SHA-1 collisions can be found in 2^63 operations. I would say it's rather hard. You could go about brute forcing it. Get the book Applied Cryptography and sit down for a read. Look into the birthday paradox, which can be used to find collisions.
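To make the brute-force option concrete, here is a toy search loop (Python, purely illustrative): with a success probability of about 2^-160 per candidate, it will not terminate in practice, and a fixed point is not even guaranteed to exist.

import hashlib
import os

def search_fixed_point(max_tries: int = 1_000_000) -> str | None:
    for _ in range(max_tries):
        candidate = os.urandom(20).hex()     # random 40-char hex string
        if hashlib.sha1(candidate.encode()).hexdigest() == candidate:
            return candidate                 # a fixed point: sha1(x) == x
    return None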
The single most important reason for the existence of cryptographic hash functions (of which the SHA family functions are examples) is to make finding inputs corresponding to a given digest difficult. A cryptographic hash function producing N-bit digests is considered good if, to find a matching input, one must perform 2^N / 2 = 2^(N-1) operations on average, that is, if no way more reliable than brute force is possible.
So you are searching for a mathematical invariant of the SHA-1 transformation; see the invariant subspace problem. :-)

Understanding sha-1 collision weakness

According to various sources, attacks looking for sha-1 collisions have been improved to 2^52 operations:
http://www.secureworks.com/research/blog/index.php/2009/6/3/sha-1-collision-attacks-now-252/
What I'd like to know is the implication of these discoveries for systems that are not under attack. Meaning: if I hash random data, what are the statistical odds of a collision? Said another way, does the recent research indicate that a brute-force birthday attack has a higher chance of finding collisions than originally proposed?
Some writeups, like the one above, say that obtaining a SHA-1 collision via brute force would require 2^80 operations. Most sources say that 2^80 is a theoretical number (I assume because no hash function is really distributed perfectly evenly over its digest space).
So are any of the announced sha1 collision weaknesses in the fundamental hash distribution? Or are the increased odds of collision only the result of guided mathematical attacks?
I realize that in the end it is just a game of odds, and that there is an infinitesimally small chance that your first and second messages will result in a collision. I also realize that even 2^52 is a really big number, but I still want to understand the implications for a system not under attack. So please don't answer with "don't worry about it".
Well, good hash functions are resistant to three different types of attacks (as the article states).
The most important resistance in a practical sense is 2nd-preimage resistance. This basically means: given a message M1 and Hash(M1) = H1, it is hard to find an M2 ≠ M1 such that Hash(M2) = H1.
If someone found a way to do that efficiently, that would be bad. Further, a preimage attack is not susceptible to the birthday paradox, since message M1 is fixed for us.
This is not a preimage or second-preimage attack, merely a collision-finding attack.
To answer your question: no, a brute-force attack does NOT have a higher chance of finding collisions. What this means is that the naive brute-force method, combined with the researchers' methods, results in finding collisions after 2^52 operations. A standard brute-force attack still takes 2^80.
The result announced in your link is an attack: a sequence of careful, algorithmically chosen steps that generate collisions with greater probability than a random attack would. It is not a weakness in the hash function's distribution. Well, OK, it is, but not of the sort that makes a random attack likely to succeed on the order of 2^52 operations.
If no one is trying to generate collisions in your hash outputs, this result does not affect you.
The key question is: "Can the attacker modify both messages m1 and m2?" If so, the attacker needs to find m1, m2 such that hash(m1) = hash(m2). This is the birthday attack, and the complexity reduces significantly: it becomes the square root of the brute-force cost. If the hash output is 128 bits (MD5), the complexity is 2^64, well within reach of current computing power.
The usual example given is that a seller asks his secretary to type the message "I will sell it for 10 million dollars". The scheming secretary creates two documents, one that says "I will sell it for 10 million dollars" and another that says "I will sell it for x million dollars", where x is much less than 10. She modifies both messages by adding spaces, capitalizing words, etc., and varies x, until hash(m1) = hash(m2). Now the secretary shows the correct message m1 to the seller, and he signs it using his private key, resulting in hash h. The secretary then switches the message and sends out (m2, h). Only the seller has access to his private key, so he cannot repudiate the message and say that he didn't sign it.
For SHA1, which outputs 160 bits, the birthday attack reduces the complexity to 2^80. This should be safe for 30 years or more. New government regulations and the 4G 3GPP specs are starting to require SHA256.
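As a rough sanity check on these square-root figures, here is the standard 50%-probability birthday bound computed in Python (the 1.1774 constant is sqrt(2 ln 2)):

import math

def birthday_bound(bits: int) -> float:
    # Approximate hashes needed for a 50% collision chance among b-bit digests.
    return 1.1774 * math.sqrt(2.0 ** bits)

print(f"MD5 (128-bit): about 2^{math.log2(birthday_bound(128)):.1f} hashes")
print(f"SHA-1 (160-bit): about 2^{math.log2(birthday_bound(160)):.1f} hashes")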
But if, in your use case, the attacker cannot modify both messages (the preimage and second-preimage scenarios), then for SHA1 the complexity is 2^160. That should be safe for eternity, unless a non-brute-force attack is discovered.

What are the important points about cryptographic hash functions?

I was reading this question on MD5 hash values and the accepted answer confuses me. One of the main properties, as I understand it, of a cryptographic hash function is that it is infeasible to find two different messages (inputs) with the same hash value.
Yet the consensus answer to the question "Why aren't MD5 hash values reversible?" is "Because an infinite number of input strings will generate the same output." This seems completely contradictory to me.
Also, what perplexes me somewhat is the fact that the algorithms are public, yet the hash values are still irreversible. Is this because there is always data loss in a hash function so there's no way to tell which data was thrown away?
What happens when the input data size is smaller than the fixed output data size (e.g., hashing a password "abc")?
EDIT:
OK, let me see if I have this straight:
It is really, really hard to infer the input from the hash, because there are an infinite number of input strings that will generate the same output (the irreversible property).
However, finding even a single instance of multiple input strings that generate the same output is also really, really hard (the collision-resistant property).
Warning: Long answer
I think all of these answers are missing a very important property of cryptographic hash functions: Not only is it impossible to compute the original message that was hashed to get a given hash, it's impossible to compute any message that would hash to a given hash value. This is called preimage resistance.
(By "impossible" - I mean that no one knows how to do it in less time than it takes to guess every possible message until you guess the one that was hashed into your hash.)
(Despite popular belief in the insecurity of MD5, MD5 is still preimage resistant. Anyone who doesn't believe me is free to give me anything that hashes to 2aaddf751bff2121cc51dc709e866f19. What MD5 doesn't have is collision resistance, which is something else entirely.)
Now, if the only reason you can't "work backwards" in a cryptographic hash function was because the hash function discards data to create the hash, then it would not guarantee preimage resistance: You can still "work backwards", and just insert random data wherever the hash function discards data, and while you wouldn't come up with the original message, you'd still come up with a message that hashes to the desired hash value. But you can't.
So the question becomes: Why not? (Or, in other words, how do you make a function preimage resistant?)
The answer is that cryptographic hash functions simulate chaotic systems. They take your message, break it into blocks, mix those blocks around, have some of the blocks interact with each other, mix those blocks around, and repeat that a lot of times (well, one cryptographic hash function does that; others have their own methods). Since the blocks interact with each other, block C not only has to interact with block D to produce block A, but it has to interact with block E to produce block B. Now, sure, you can find values of blocks C, D, E that would produce the blocks A and B in your hash value, but as you go further back, suddenly you need a block F that interacts with C to make D, and with E to make B, and no such block can do both at the same time! You must have guessed wrong values for C, D, and E.
While not all cryptographic hash functions are exactly as described above with block interaction, they have the same idea: That if you try to "work backwards", you're going to end up with a whole lot of dead ends, and the time it takes for you to try enough values to generate a preimage is on the order of hundreds to millions of years (depending on the hash function), not much better than the time it would take just to try messages until you find one that works.
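To make the block-mixing intuition concrete, here is a deliberately insecure toy hash in Python. Real designs use carefully analysed compression functions; treat this purely as an illustration of the "split into blocks, mix repeatedly" shape described above:

def toy_hash(message: bytes, rounds: int = 64) -> int:
    state = 0x6A09E667                      # arbitrary 32-bit initial value
    # Split into 4-byte blocks, padding the tail with 0x80 then zeros.
    padded = message + b"\x80" + b"\x00" * (-(len(message) + 1) % 4)
    blocks = [int.from_bytes(padded[i:i + 4], "big")
              for i in range(0, len(padded), 4)]
    for _ in range(rounds):
        for block in blocks:
            state = (state ^ block) & 0xFFFFFFFF
            state = ((state << 5) | (state >> 27)) & 0xFFFFFFFF   # rotate left 5
            state = (state * 0x9E3779B1 + 1) & 0xFFFFFFFF         # odd-constant mix
    return state

Trying to run this backwards from the 32-bit output forces you to guess how the blocks interacted at every round, which is the dead-end effect described above.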
1: The primary purpose of a hash is to map a very, very large space to a smaller but still very large space (e.g., MD5, which will take 'anything' and convert it into a space of size 2^128 -- big, but not nearly as big as aleph-0.)
In addition to other features, good hashes fill the destination space homogeneously. Bad hashes fill the space in a clumpy fashion, coming up with the same hash for many common inputs.
Imagine the idiotic hash function sum(), which just adds all the digits of the input number: it succeeds in mapping down, but there are a bunch of collisions (inputs with the same output, like 3 and 12 and 21) at the low end of the output space, and the upper end of the space is nearly empty. As a result, it makes very poor use of the space, is easy to crack, etc.
So a good hash that makes even use of the destination space will make it difficult to find two inputs with the same output, just by the odds: if MD5 were perfect, the odds that two inputs would have the same output would be 2^-128. That's pretty decent odds: the best you can do without resorting to a larger output space. (In truth MD5 isn't perfect, which is one of the things that makes it vulnerable.)
But it will still be true that a huge number of inputs will map to any given hash, because the input space is 'infinite', and dividing infinity by 2^128 still gives you infinity.
2: Yes, hashes always cause data loss, except in the case where your output space is the same as, or larger than, your input space -- and in that case you probably didn't need to hash!
3: For smaller inputs, best practice is to salt the input. Actually, that's good practice for any cryptographic hashing, because otherwise an attacker can feed you specific inputs and try to figure out which hash you are using. 'Salt' is just a set of additional information that you append (or prepend) to your input; you then hash the result.
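A minimal sketch of the salting practice described above (illustrative only; for password storage specifically, a deliberately slow function such as PBKDF2 is preferable to one fast SHA-256 pass):

import hashlib
import os

def salted_hash(data: bytes) -> tuple[bytes, str]:
    salt = os.urandom(16)                    # random per-input salt
    digest = hashlib.sha256(salt + data).hexdigest()
    return salt, digest                      # keep the salt alongside the digest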
edit: In cryptography, it is also important that the hash function be resistant to preimage attacks; intuitively, it should be hard to guess the input for a given output, even knowing many other input/output pairs. The "sum" function could probably be guessed rather easily (but since it destroys data, it still might not be easy to reverse).
You may be confused, because the answer to the question you cite is confusing.
One of the requirements for a cryptographic hash function is that it should be preimage resistant. That is, if you know MD5(x) but not the message x, then it is difficult to find any x' (whether equal to x or different from x) such that MD5(x') = MD5(x).
Being preimage resistant is a different property from being reversible. A function is reversible if, given y = f(x), there is exactly one x which fits (whether this is easy to compute or not). For example, define f(x) = x mod 10.
Then f is not reversible: from f(x) = 7 you can't determine whether x was 17, 27 or something else. But f is not preimage resistant either, since values x' such that f(x') = 7 are easy to find: x' = 17, 27, 12341237, etc. all work.
When doing crypto you usually need functions that are preimage resistant (and other properties such as collision resistance), not just something that is not reversible.
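A tiny illustration of this distinction, using the f(x) = x mod 10 example (the names are just for this sketch):

def f(x: int) -> int:
    return x % 10

def some_preimage(y: int) -> int:
    # f is not reversible: many inputs fit. But finding *some* preimage
    # is trivial, so f is not preimage resistant either.
    return 10 * 1234 + y

assert f(some_preimage(7)) == 7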
These are the properties of hash functions in general.
A word of caution, though: MD5 shouldn't be used anymore because of the vulnerabilities that have been found in it. Check the 'Vulnerabilities' section and the external links detailing these attacks: http://en.wikipedia.org/wiki/Md5. You can make an MD5 collision by changing only 128 bits in a message.
SHA-1 is safe for simple hashing, although there are some attacks that would make it weaker against well-funded entities (governments, large corporations).
SHA-256 is a safe starting point against technology for the next couple of decades.
Yet the consensus answer to the question "why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output."
This is true for any hash function, but it is not the essence of a cryptographic hash function.
For short input strings such as passwords, it is theoretically possible to reverse a cryptographic hash function, but it ought to be computationally infeasible, i.e. your computation would run too long to be useful.
The reason for this infeasibility is that the input is so thoroughly "mixed together" in the hash value that it becomes impossible to disentangle it with any less effort than the brute-force attack of computing the hash value for all inputs.
"why aren't MD5 hash values reversible?" is because "an infinite number of input strings >will generate the same output"
this is the reason that it isn't possible to reverse the hash function (get the same input).
cryptographic hash functions are collision resistant, that means that it's also hard to find another input value that maps to the same output (if your hash function was mod 2 : 134 mod 2 = 0; now you can't get the 134 back from the result, but we can stil find number 2 with the same output value (134 and 2 collide)).
When the input is smaller than the block size, padding is used to fit it to the block size.
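For instance, MD5 and SHA-1 pad the message to a multiple of their 64-byte block like this (a sketch; SHA-1 appends the length big-endian, MD5 little-endian):

def md_pad(message: bytes, block_size: int = 64) -> bytes:
    bit_len = len(message) * 8
    padded = message + b"\x80"               # a single 1 bit, then zeros
    padded += b"\x00" * ((block_size - 8 - len(padded)) % block_size)
    return padded + bit_len.to_bytes(8, "big")   # original length in bits

assert len(md_pad(b"abc")) % 64 == 0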