Does any published research indicate that preimage attacks on MD5 are imminent? - hash

I keep on reading on SO that MD5 is broken, bust, obsolete and never to be used. That angers me.
The fact is that collision attacks on MD5 are now fairly easy. Some people have collision attacks down to an art and can even use them to predict elections.
I find most of the examples of MD5's "brokenness" less interesting. Even the famous CA certificate hack was a collision attack, meaning that it's provable that the party generated the GOOD and EVIL certificates at the same time. This means that if the EVIL CA certificate found its way into the wild, it is provable that it leaked from the person who had the GOOD certificate and thus was trusted anyway.
What would be a lot more concerning is a preimage or second preimage attack.
How likely is a preimage attack on MD5? Is there any current research to indicate that it is imminent? Does the fact that MD5 is vulnerable to collision attacks make it more likely to suffer a preimage attack?

In cryptography, recommendations are generally not made by predicting the future, since that is impossible. Rather, cryptographers try to evaluate what is already known and published. To adjust for potential future attacks, cryptosystems are generally designed with some safety margin: cryptographic keys, for example, are chosen a little longer than strictly necessary. For the same reason, algorithms are abandoned once weaknesses are found, even if these weaknesses are merely certificational.
In particular, RSA Labs recommended abandoning MD5 for signatures as early as 1996, after Dobbertin found collisions in the compression function. Collisions in the compression function do not imply that collisions in the hash function exist, but we can't find collisions for MD5 unless we can find collisions for its compression function. Thus RSA Labs decided that they no longer had confidence in MD5's collision resistance.
Today, we are in a similar situation. If we are confident that a hash function is collision resistant, then we can also be confident that it is preimage resistant. But MD5 has significant weaknesses. Hence many cryptographers (including people like Arjen Lenstra) think that MD5 no longer has the necessary safety margin to be used even in applications that only rely on preimage resistance, and hence recommend no longer using it. Cryptographers can't predict the future (so don't look for papers doing just that), but they can recommend reasonable precautions against potential attacks. Recommending that MD5 no longer be used is one such reasonable precaution.

We don't know.
This kind of advance tends to come 'all of a sudden' - someone makes a theoretical breakthrough, and finds a method that's 2^10 (or whatever) times better than the previous best.
It does seem that preimage attacks might still be a bit far off; a recent paper claims a complexity of 2^96 for a preimage on a reduced, 44-round version of MD5. However, this isn't a question of likelihood but rather whether someone is clever enough to go that final step and bring the complexity for the real deal into a realistic margin.
That said, since collision attacks are very real already (one minute on a typical laptop), and preimage attacks might (or might not) be just around the corner, it's generally considered prudent to switch to something stronger now, before it's too late.
If collisions aren't a problem for you, you might have time to wait for the NIST SHA-3 competition to come up with something new. But if you have the processing power and bits to spare, using SHA-256 or similar is probably a prudent precaution.
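If you do make the switch, it is usually a one-line change. A minimal sketch, assuming your code currently uses Python's standard hashlib module for MD5 (the variable names are just for illustration):

    import hashlib

    data = b"message to fingerprint"

    # Old, discouraged: 128-bit MD5 digest
    md5_digest = hashlib.md5(data).hexdigest()

    # Prudent replacement: 256-bit SHA-256 digest
    sha256_digest = hashlib.sha256(data).hexdigest()

    print(md5_digest, sha256_digest)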

Cryptographically speaking, MD5's pre-image resistance is already broken; see this paper from Eurocrypt 2009. In this formal context "broken" means faster than brute force, i.e. attacks having a complexity of less than (2^128)/2 on average. Sasaki and Aoki presented an attack with a complexity of 2^123.4, which is so far only theoretical, but every practical attack is built on a less potent theoretical attack, so even a theoretical break casts serious doubt on MD5's medium-term security. What is also interesting is that they reuse a lot of the research that has gone into collision attacks on MD5. That nicely illustrates Accipitridae's point that MD5's safety margin on pre-image resistance is gone with the collision attacks.
Another reason why the use of MD5 was strongly discouraged in 2009, and why the use of SHA-1 is strongly discouraged now, for any application is that most people do not understand exactly which property the security of their use case relies on. You unfortunately proved my point in your question by stating that the 2008 CA attack did not rely on a failure of collision resistance, as caf has pointed out.
To elaborate a bit: every time a (trusted) CA signs a certificate, it also signs possibly malicious data coming from a customer in the form of a certificate signing request (CSR). Now, in most cases all the data that is going to be signed can be pre-calculated from the CSR and some external conditions. This has the fatal side effect that the state the hash function will be in when it hashes the untrusted data from the CSR is completely known to the attacker, which facilitates a collision attack. Thus an attacker can precompute a CSR that will force the CA to hash and sign data that collides with a shadow certificate known only to the attacker. The CA cannot check, for the shadow certificate, the preconditions that it would usually check before signing (for example, that the new certificate does not claim to be a root certificate), as it only has access to the legitimate CSR the attacker provided. Generally speaking, once collision attacks exist and part of your data is controlled by an attacker, you no longer know what else you might be signing besides the data you see.

Related

How to equalize hash creation speed across different processors?

I wondered a while ago why no technology exists to equalize hash creation speed across different CPUs/GPUs. I have no idea whether this is feasible, which is why I'm asking here. The idea is to make the proof of work a contest between just two parties, each with a 50% chance of creating the winning hash (equal hashing speed!). In combination with an easier-to-find nonce, this solution would be more energy-friendly than existing proof-of-work technologies while still meeting the desired goal.
This is more or less impossible for the simple reason that a faster machine is just … faster. If one of the two parties buys a faster machine, then they will compute the hash faster. That's just the way it is.
However, there is something we can do. Bitcoin, for example, is based on SHA-256 (the 256 bit long version of SHA-2). SHA-2 is specifically designed to be fast, and to be easy to speed up by specialized hardware. And that is exactly what we see happen in the Bitcoin mining space with the move from pure software-based mining to CPUs with built-in acceleration for SHA-2 to GPUs to FPGAs to ASICs.
The reason for this is that SHA-2 is designed as a general cryptographic hash function, and one of the main uses of cryptographic hashes is as the basis for TLS/SSL and digital signatures, where large amounts of data need to be hashed in a short amount of time.
But, there are other use cases for cryptographic hash functions, in particular, password hashing. For password hashing, we want the hash function to be slow and hard to speed up, since a legitimate user only needs to hash a very small amount of data (the password) once (when logging in), whereas an attacker needs to hash large numbers of passwords over and over again, for a brute force attack.
Examples of such hash functions are PBKDF2, bcrypt, scrypt, Catena, Lyra2, yescrypt, Makwa, and Argon2 (the latter being the winner of the 2013–2015 Password Hashing Competition). Scrypt in particular is designed to be hard to speed up using GPUs, FPGAs, and ASICs, as well as through space-time or time-space tradeoffs. Scrypt uses a cryptographically secure pseudo-random number generator to initialize a huge array of pseudo-random numbers in memory, and then uses another CSPRNG to generate indices for accesses into this array, thus making both the memory contents and the memory access patterns pseudo-random.
Theoretically, of course, it would be possible to pre-compute the result; after all, accessing an array in some specific order is the same as accessing a much larger array in linear order. However, scrypt is designed in such a way that this pre-computed array would be prohibitively large. Plus, scrypt has a simple work-factor parameter that can be used to exponentially increase the size of this array if memory capacity increases. So, trading space for time is not possible.
Likewise, it would be possible to create a PRNG which combines the two pseudo-random processes into one process and computes the results on the fly. However, scrypt is designed in such a way that the computing time for this would be prohibitively long, and again, there is the exponential work-factor which can be used to drastically increase the computing time without changes to the algorithm. So, trading time for space is not possible.
The pseudo-random access pattern to the memory also defeats any sort of branch-prediction, memory prefetching or caching scheme of the CPU.
And lastly, since the large array is a shared global mutable state, and there is no way to sensibly divide the work into independent units, the algorithm is not sensibly parallelizable, which means you can't speed it up using GPUs.
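To make the work-factor knobs described above concrete, here is a minimal sketch using Python's hashlib.scrypt. The parameter values are just plausible examples for illustration, not a tuning recommendation:

    import hashlib, os

    password = b"correct horse battery staple"
    salt = os.urandom(16)  # unique random salt per password

    # n: CPU/memory cost (must be a power of two), r: block size, p: parallelism.
    # Memory use is roughly 128 * n * r bytes, so doubling n doubles the memory.
    key = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1,
                         maxmem=64 * 1024 * 1024, dklen=32)
    print(key.hex())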
And in fact, some newer cryptocurrencies, smart contracts, blockchains etc. use an scrypt-based proof-of-work scheme.
Note, however, that running scrypt on a faster machine is still faster than running scrypt on a slower machine. There is no way around that. It just means that we cannot get the ridiculous amounts of speedup we get from using specialized hardware for SHA-2, for example. But designing cryptographic algorithms is hard, and there actually are ASIC-based scrypt miners for Litecoin out there that do get a significant speedup, though still less than the impressive ones we see for SHA-2 / Bitcoin.

Second preimage resistance using MD4 AND MD5

Let's say we have the following:
- String: str
- MD4 hash of the string: MD4(str)
- MD5 hash of the string: MD5(str)
MD4 and MD5 are cryptographically "broken" algorithms, meaning it is not difficult to:
1) find str_2 where MD4(str) = MD4(str_2) (i.e. attack on MD4)
2) find str_3 where MD5(str) = MD5(str_3) (i.e. attack on MD5)
But how hard would it be to:
3) find str_4 where MD4(str) = MD4(str_4) AND MD5(str) = MD5(str_4)
(i.e. attack on MD4 and MD5 simultaneously)?
The obvious (probably not very efficient) way would be to:
1) Find a string STR where MD4(STR) = MD4(str)
2) Check if MD5(STR) = MD5(str)
3) If so, we're done. If not, go back to step 1 and satisfy step 1 with a different string.
But the above algorithm doesn't seem fast to me (or is it?). So is it true that a string hashed by both MD4 and MD5 would be quite safe from a second preimage attack?
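For what it's worth, here is what that loop looks like as a purely structural Python sketch. The find_md4_second_preimage helper is a hypothetical stand-in for whatever MD4 attack step 1 uses, and MD4 itself is only available through hashlib.new("md4") when the underlying OpenSSL build still provides it:

    import hashlib

    def md4(data: bytes) -> bytes:
        # MD4 availability depends on the OpenSSL build backing hashlib.
        return hashlib.new("md4", data).digest()

    def md5(data: bytes) -> bytes:
        return hashlib.md5(data).digest()

    def find_md4_second_preimage(target: bytes) -> bytes:
        # Hypothetical placeholder for step 1: some MD4 attack that yields
        # a new string with the same MD4 digest as `target`.
        raise NotImplementedError

    def simultaneous_second_preimage(s: bytes) -> bytes:
        t5 = md5(s)
        while True:
            candidate = find_md4_second_preimage(s)  # step 1
            if md5(candidate) == t5:                 # step 2
                return candidate                     # step 3: done
            # otherwise, loop back to step 1 with a different candidate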
EDIT:
(1) The main concern is enhancing second pre-image resistance
(2) The main motivation is not to use outdated hashes for today's applications. Rather, it is two-fold: first, I am anticipating the day that hashes considered secure today become broken. For example, if I use only SHA-2, then the day it becomes broken is the day I will become very worried about my data. But if I use SHA-2 and BCrypt, then even if both become individually broken, it may still be infeasible to defeat the second pre-image resistance of concat(Sha2_hash, Bcrypt_Hash). Second, I want to reduce the chance of accidental collision (the server thinking two inputs are the same because their hashes just happen to be the same).
This sort of thing doesn't improve security as much as you think. The resulting (M+N) bit value is actually weaker than the output of a hash that natively generates (M+N) bits of output. This answer on crypto.stackexchange.com goes a little deeper if you want to know more details.
But the bottom line is that when constructing a hash function whose output is the concatenation of other hash functions, the output you get is, at best, as strong as the strongest constituent hash.
And I have to ask why even use MD4 or MD5 and go to this trouble to begin with? Use SHA-3. If you want to feel "extra safe" then calculate the margin of safety that you feel comfortable with, and increase it by some percentage. That is, if you feel that 384 bits are enough, then go for 512.
So, with some more information about what you are trying to do, which is to use the file contents to generate both a "quick checksum" value and a unique locator/identifier for the file at the same time, I still think that choosing a single hash is the better approach.
If you insist on using two hash functions, then I would submit that, instead of concatenating two hashes, the better approach would be to use an HMAC-style construction built from two different hash functions/algorithms. Please note that I do not have a rigorous proof that this works better, or that this construct won't generate horrible output. So take it with a grain of salt:
Let H1 and H2 be two cryptographically secure hash functions, and let P be your input data. Then, the hash & file identifier for your file is given by the construct:
HMAC(K, P) = H1((KGEN(P) ⊕ PAD1) ∥ H1((KGEN(P) ⊕ PAD2) ∥ P))
where
KGEN(P) = H2(P)
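A minimal Python sketch of that construct, with H1 = SHA-256 and H2 = SHA-1 chosen purely for illustration (the function names and pad constants are mine, not a standard API; PAD1/PAD2 follow the usual HMAC opad/ipad values):

    import hashlib

    BLOCK_SIZE = 64  # block size in bytes of H1 (SHA-256)

    def kgen(data: bytes) -> bytes:
        # KGEN(P) = H2(P), padded out to H1's block size
        return hashlib.sha1(data).digest().ljust(BLOCK_SIZE, b"\x00")

    def two_hash_mac(data: bytes) -> bytes:
        key = kgen(data)
        pad1 = bytes(b ^ 0x5C for b in key)  # outer pad
        pad2 = bytes(b ^ 0x36 for b in key)  # inner pad
        inner = hashlib.sha256(pad2 + data).digest()
        return hashlib.sha256(pad1 + inner).digest()

    print(two_hash_mac(b"file contents").hex())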
It is somewhat more difficult, because one would need to compute a collision for MD4 and for MD5 simultaneously. But "somewhat" is a lame term in cryptography. Rolling your own security scheme is enemy #1. However, there are examples where people chain algorithms, such as DES => 3DES, TrueCrypt allowing several encryption algorithms to be chained, or PBKDF2 key derivation running the same algorithm N times.
Seriously, if you need a strong hash, use SHA-2 and onwards.
The problem with finding MD4 and MD5 hash collisions is that it's possible to build a chain of devices that lets an attacker scale the number of attack attempts linearly, and given a large enough budget this sounds plausible.

Can you show me two actual, non-trivial strings that produce the same MD5 or SHA1 hash?

...and if not, why not?
So here's the question behind the question.
I understand that the likelihood of accidental collisions in MD5 and SHA1 is small (though less likely in SHA1 than in MD5). I also understand that deliberate collisions are theoretically possible. Is it practically possible? Could I go through some process to deliberately generate two messages with the same hash, in either of these algorithms? What process would I go through?
Collisions necessarily exist for a given hash function, in a mathematical sense: there are more possible inputs than possible outputs, so there must be two inputs which map to the same output. Now proving the existence of a collision, and actually finding one, are two different things. If I drop a diamond in the middle of the ocean, I positively know that there is now a diamond somewhere in the ocean -- but I am quite at a loss if I want to recover it.
For a "generic" hash function with an output of n bits, there are generic methods to find a collision, with average cost 2n/2 evaluations of the function (see this page). Depending on n, this can range from the easy to the totally unfeasible. MD5 has an output of 128 bits, and 264 is "quite high": you can do it, but it will require a few thousands of machines and months of computations.
Now there are known weaknesses in MD5, i.e. some internal structure which can be exploited to produce collisions much more easily. The best attack on MD5 known so far requires a bit less than 2^21 function invocations, so this is a matter of a few seconds (at most) on a basic PC. @Omri points in his response to a great example of an MD5 collision, in which the colliding messages are actually executable files with widely different behaviors.
For SHA-1, the output has size 160 bits. This means that a generic collision attack has cost about 2^80, which is not attainable with existing technology (well, Mankind could do it, but certainly not discreetly: it should be doable with, say, the equivalent of one year of budget for the whole US Army). However, SHA-1, like MD5, has known weaknesses. Right now, these weaknesses are still theoretical, in that they lead to a collision attack with cost 2^61, which is too expensive for any single crypto research lab, and thus has not been fully conducted yet (there was an announced attack with cost 2^51 but it seems that it was a dud -- the analysis was flawed). So no actual collision to show (but researchers are pretty sure that the 2^61 attack is correct and would work, if someone found the budget).
With SHA-256, there is no known weakness, and the 256-bit output size implies a generic cost of 2^128, far away into the undoable with today's and tomorrow's technology.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit to decreasing collision probability by combining hash functions? I especially need to know this for 32-bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting out in the wonderful world of hashing and doing a lot of reading about it. Sorry if this is a silly question; I haven't even acquired the proper "hash dialect" yet, so my Google searches on this were probably also poorly formed.
Thanks.
Combining them in series like that doesn't make sense: you are just hashing one 32-bit space onto another 32-bit space.
If there is a crc32 collision in the first step, the final result is still a collision. Then you add any potential collisions introduced by the adler32 step. So it cannot get any better; it can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
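For what it's worth, here is a minimal sketch of that 64-bit combination using Python's zlib module, which provides both checksums (the URL is just a placeholder input):

    import zlib

    def combined64(data: bytes) -> int:
        # Adler-32 in the high 32 bits, CRC-32 in the low 32 bits.
        return (zlib.adler32(data) << 32) | zlib.crc32(data)

    print(hex(combined64(b"http://example.com/some/url")))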
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.

Understanding sha-1 collision weakness

According to various sources, attacks looking for SHA-1 collisions have been improved to 2^52 operations:
http://www.secureworks.com/research/blog/index.php/2009/6/3/sha-1-collision-attacks-now-252/
What I'd like to know is the implication of these discoveries for systems that are not under attack. Meaning, if I hash random data, what are the statistical odds of a collision? Said another way, does the recent research indicate that a brute-force birthday attack has a higher chance of finding collisions than originally proposed?
Some writeups, like the one above, say that obtaining a SHA-1 collision via brute force would require 2^80 operations. Most sources say that 2^80 is a theoretical number (I assume because no hash function is really distributed perfectly evenly over its digest space).
So are any of the announced sha1 collision weaknesses in the fundamental hash distribution? Or are the increased odds of collision only the result of guided mathematical attacks?
I realize that in the end it is just a game of odds, and that there is an infinitesimally small chance that your first and second messages will result in a collision. I also realize that even 2^52 is a really big number, but I still want to understand the implications for a system not under attack. So please don't answer with "don't worry about it".
Well, good hash functions are resistant to three different types of attacks (as the article states).
The most important resistance in a practical sense is 2nd pre-image resistance. This basically means given a message M1 and Hash(M1)=H1, it is hard to find a M2 such that Hash(M2)=H1.
If someone found a way to do that efficiently, that would be bad. Further, a preimage attack is not susceptible to the birthday paradox, since message M1 is fixed for us.
The announced attack is not a pre-image or second pre-image attack, merely a collision-finding attack.
To answer your question: no, a brute-force attack does NOT have a higher chance of finding collisions. What it means is that the naive brute-force method, combined with the researchers' techniques, finds collisions after about 2^52 operations. A standard brute-force attack still takes 2^80.
The result announced in your link is an attack, a sequence of careful, algorithmically chosen steps that generate collisions with greater probability than a random attack would. It is not a weakness in the hash function's distribution. Well, OK, it is, but not of the sort that would let a random attack succeed after on the order of 2^52 attempts.
If no one is trying to generate collisions in your hash outputs, this result does not affect you.
The key question is "Can the attacker modify both m1 and m2 messages"?. If so, the attacker needs to find m1, m2 such that hash(m1) = hash(m2). This is the birthday attack and the complexity reduces significantly --- becomes square root. If hash output is 128 bits (MD5), the complexity is 2^64, well within reach with current computing power.
The usual example given is that the seller asks his secretary to type the message "I will sell it for 10 million dollars". The scheming secretary creates two documents, one that says "I will sell it for 10 million dollars" and another that says "I will sell it for x million dollars", where x is much less than 10, then modifies both messages by adding spaces, capitalizing words, etc., and tweaks x until hash(m1) = hash(m2). Now the secretary shows the correct message m1 to the seller, and he signs it using his private key, resulting in hash h. The secretary switches the message and sends out (m2, h). Only the seller has access to his private key, and so he cannot repudiate the message and say that he didn't sign it.
For SHA-1, which outputs 160 bits, the birthday attack reduces the complexity to 2^80. This should be safe for 30 years or more. New government regulations and 4G 3GPP specs are starting to require SHA-256.
But if, in your use case, the attacker cannot modify both messages (the preimage or second-preimage scenarios), then for SHA-1 the complexity is 2^160. That should be safe for eternity, unless a non-brute-force attack is discovered.