Is there a way to crack a 129 character length hash?

I was given a 129-character hash (probably Whirlpool or SHA-512) to crack. I have tried cracking it using the whirlpooldeep and hashcat tools (without wordlists, i.e. not as a dictionary attack), but with no success. I'm sorry for the ambiguous question, but this is all the information I have about my current task. Any suggestions would be gladly appreciated.

I assume this task is feasible because whoever gave you the hash has prepared it on purpose in a way so you have a chance to crack it. If this is not the case, you have no chance at current computer performance and will not within the next few decades.
So, assuming it is feasible, e.g. because the hash is of some short data like less than 10 bytes:
You are probably making a mistake in how you compare the hashes. The hash should not be of length 129. It should be of length 128. That you see a file with 129 bytes is probably because there is a line feed (\n) at the end of the file.
Moreover, you should, for performance, not compare the hashes in hexadecimal format (length 128), but instead in binary format (length 64).
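As a rough illustration of the two points above, here is a minimal Python sketch. It assumes the hash really is SHA-512 and that the file holds 128 hex characters plus a line feed; the input b"secret" and the names load_hex_digest and file_content are made up for the example.

```python
import hashlib

def load_hex_digest(text: str) -> bytes:
    """Strip surrounding whitespace (such as a trailing line feed) and decode hex to raw bytes."""
    return bytes.fromhex(text.strip())

# Hypothetical file content: 128 hex characters followed by "\n", 129 bytes in total.
file_content = hashlib.sha512(b"secret").hexdigest() + "\n"
target = load_hex_digest(file_content)

# Compare raw 64-byte digests rather than 128-character hex strings.
candidate = hashlib.sha512(b"secret").digest()
matches = candidate == target
```

After stripping, the target is 64 bytes long, so each candidate digest can be compared directly without hex-encoding it first.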

A rainbow table seems to be the perfect tool for your scenario.
There is a software called RainbowCrack to do this sort of reversing a hash. There should be rainbow tables available for SHA-512, which should be, roughly estimated, about 1TB in size (for finding original data of length up to 9 bytes).
Using this kind of hash reversal is likely to take you a day or two to get into the matter and understand it, and if it's feasible, after downloading the 1TB, you are likely to get the result in less than a day.
Before you start all this, I would ask the source of your hash how long the data is that you are searching for. This could easily end up being a kid's joke, e.g. that your source hashed 512bit of random data, in which case you would have absolutely no chance and you would waste your time.

Related

why can't json web token be reversed engineered? [duplicate]

Why can't you just reverse the algorithm like you could reverse a math function? How is it possible to make an algorithm that isn't reversible?
And if you use a rainbow table, how does a salt make the hash impossible to crack? If you build the rainbow table by brute force, it contains every possible plaintext value (up to a given length), which would end up including every possible salt combined with every possible password (the salt and the password/text would just come together as a single piece of text).
MD5 is designed to be cryptographically irreversible. In this case, the most important property is that it is computationally unfeasible to find the reverse of a hash, but it is easy to find the hash of any data. For example, let's think about just operating on numbers (binary files after all, could be interpreted as just a very long number).
Let's say we have the number "7", and we want to take the hash of it. Perhaps the first thing we try as our hash function is "multiply by two". As we'll see, this is not a very good hash function, but we'll try it, to illustrate a point. In this case, the hash of the number will be "14". That was pretty easy to calculate. But now, if we look at how hard it is to reverse it, we find that it is also just as easy! Given any hash, we can just divide it by two to get the original number! This is not a good hash, because the whole point of a hash is that it is much harder to calculate the inverse than it is to calculate the hash (this is the most important property in at least some contexts).
Now, let's try another hash. For this one, I'm going to have to introduce the idea of clock arithmetic. On a clock, there isn't an infinite number of numbers. In fact, it just goes from 0 to 11 (remember, 0 and 12 are the same on a clock). So if you "add one" to 11, you just get zero. You can extend the ideas of multiplication, addition, and exponentiation to a clock. For example, 8+7=15, but 15 on a clock is really just 3! So on a clock, you would say 8+7=3! 6*6=36, but on a clock, 36=0! So 6*6=0! Now, for the concept of powers, you can do the same thing. 2^4=16, but 16 is just 4. So 2^4=4! Now, here's how it ties into hashing. How about we try the hash function f(x)=5^x, but with clock arithmetic. As you'll see, this leads to some interesting results. Let's try taking the hash of 7 as before.
We see that 5^7=78125 but on a clock, that's just 5 (if you do the math, you see that we've wrapped around the clock 6510 times). So we get f(7)=5. Now, the question is, if I told you that the hash of my number was 5, would you be able to figure out that my number was 7? Well, it's actually very hard to calculate the reverse of this function in the general case. People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward. (EDIT: Nemo has pointed out that this in fact has not been "proven"; in fact, the only guarantee you get is that a lot of smart people have tried a long time to find an easy way to do so, and none of them have succeeded.) The problem of reversing this operation is called the "Discrete Logarithm Problem". Look it up for more in depth coverage. This is at least the beginning of a good hash function.
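The clock arithmetic above is easy to check in Python, where pow(base, exponent, modulus) does modular exponentiation. This is only a toy sketch of the 12-hour example; the names clock_hash and candidates_for_5 are made up for illustration.

```python
MODULUS = 12  # a clock face: 0 and 12 are the same position

def clock_hash(x: int) -> int:
    """Toy hash from the text: f(x) = 5^x on a 12-hour clock."""
    return pow(5, x, MODULUS)

# The worked examples from the text.
examples = {
    "8 + 7": (8 + 7) % MODULUS,
    "6 * 6": (6 * 6) % MODULUS,
    "2 ^ 4": pow(2, 4, MODULUS),
    "5 ^ 7": clock_hash(7),
}

# Reversing means searching: try every candidate and keep those that match.
candidates_for_5 = [x for x in range(MODULUS) if clock_hash(x) == 5]
```

On a clock this small, every odd x hashes to 5, so the search finds several candidates and cannot tell which one was the original 7. Real schemes use enormous moduli, which is what makes the search itself impractical.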
With real world hash functions, the idea is basically the same: You find some function that is hard to reverse. People much smarter than me have engineered MD5 and other hashes so that they are believed to be hard to reverse; as with the discrete logarithm, this has not been formally proven.
Now, perhaps earlier the thought has occurred to you: "it would be easy to calculate the inverse! I'd just take the hash of every number until I found the one that matched!" Now, for the case where the numbers are all less than twelve, this would be feasible. But for the analog of a real-world hash function, imagine all the numbers involved are huge. The idea is that it is still relatively easy to calculate the hash function for these large numbers, but searching through all possible inputs becomes infeasible very quickly. But what you've stumbled upon is still a very important idea: searching through the input space for an input which will give a matching output. Rainbow tables are a more complex variation on the idea, which use precomputed tables of input-output pairs in smart ways in order to make it possible to quickly search through a large number of possible inputs.
Now let's say that you are using a hash function to store passwords on your computer. The idea is this: The computer just stores the hash of the correct password. When a user tries to login, you compare the hash of the input password to the hash of the correct password. If they match, you assume the user has the correct password. The reason this is advantageous is because if someone steals your computer, they still don't have access to your password, just the hash of it. Because the hash function was designed by smart people to be hard to take the reverse of, they can't easily retrieve your password from it.
An attacker's best bet is a bruteforce attack, where they try a bunch of passwords. Just like you might try the numbers less than 12 in the previous problem, an attacker might try all the passwords composed only of numbers and letters and less than 7 characters long, or all words which show up in the dictionary. The important thing here is that he can't try all possible passwords, because there are way too many possible 16 character passwords, for example, to ever test. So the point is that an attacker has to restrict the possible passwords he tests, otherwise he will never even check a small percentage of them.
Now, as for a salt, the idea is this: What if two users had the same password? They would have the same hash. If you think about it, the attacker doesn't really have to crack every user's password individually. He simply goes through every possible input password, and compares the hash to all the hashes. If it matches one of them, then he has found a new password. What we'd really like to force him to do is calculate a new hash for every user+password combination he wants to check. That's the idea of a salt: you make the hash function slightly different for every user, so he can't reuse a single set of precomputed values for all users. The most straightforward way to do this is to tack on some random string to each user's password before you take the hash, where the random string is different for each user. So, for example, if my password is "shittypassword", my hash might show up as MD5("6n93nshittypassword") and if your password is "shittypassword", your hash might show up as MD5("fa9elshittypassword"). This little bit "fa9el" is called the "salt", and it's different for every user. For example, my salt is "6n93n". Now, this little bit which is tacked on to your password is just stored on your computer as well. When you try to login with the password X, the computer can just calculate MD5("fa9el"+X) and see if it matches the stored hash.
So the basic mechanics of logging in remain unchanged, but for an attacker, they are now faced with a more daunting challenge: rather than a list of MD5 hashes, they are faced with a list of MD5 sums and salts. They essentially have two options:
They can ignore the fact that the hashes are salted, and try to crack the passwords with their lookup table as is. However, the chances that they'll actually crack a password are much reduced. For example, even if "shittypassword" is on their list of inputs to check, most likely "fa9elshittypassword" isn't. In order to get even a small percentage of the probability of cracking a password that they had before, they'll need to test orders of magnitude more possible passwords.
They can recalculate the hashes on a per-user basis. So rather than calculating MD5(passwordguess), for each user X, they calculate MD5( Salt_of_user_X + passwordguess). Not only does this force them to calculate a new hash for each user they want to crack, but also most importantly, it prevents them from being able to use precalculated tables (like rainbow table, for example), because they can't know what Salt_of_user_X is before hand, so they can't precalculate the hashes to test.
So basically, if they are trying to use precalculated tables, using a salt effectively greatly increases the possible inputs they have to test in order to crack the password, and even if they aren't using precalculated tables, it still slows them down by a factor of N, where N is the number of passwords you are storing.
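The salting scheme described above can be sketched in a few lines of Python. This mirrors the MD5(salt + password) construction from the text purely for illustration; the function names, the record layout, the salt length and the sample password are all assumptions, and a real system would use a deliberately slow password hashing function rather than MD5.

```python
import hashlib
import secrets

def hash_password(password: str, salt: str) -> str:
    """MD5(salt + password), mirroring the text; illustrative only."""
    return hashlib.md5((salt + password).encode("utf-8")).hexdigest()

def new_record(password: str) -> dict:
    """Create a stored record: a fresh random salt plus the salted hash."""
    salt = secrets.token_hex(8)  # hypothetical salt length: 16 hex characters
    return {"salt": salt, "hash": hash_password(password, salt)}

def verify(record: dict, attempt: str) -> bool:
    """Recompute the hash with the stored salt and compare."""
    return secrets.compare_digest(record["hash"], hash_password(attempt, record["salt"]))

# Two users with the same (hypothetical) password end up with different hashes.
alice = new_record("password123")
bob = new_record("password123")
```

Because each record carries its own salt, a table precomputed for one user's salt is of no use against the other, even though both passwords are identical.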
Hopefully this answers all your questions.
Think of 2 numbers from 1 to 9999. Add them. Now tell me the final digit.
I can't, from that information, deduce which numbers you originally thought of. That is a very simple example of a one-way hash.
Now, I can think of two numbers which give the same result, and this is where this simple example differs from a 'proper' cryptographic hash like MD5 or SHA1. With those algorithms, it should be computationally difficult to come up with an input which produces a specific hash.
One big reason you can't reverse the hash function is because data is lost.
Consider a simple example function: 'OR'. If you apply that to your input data of 1 and 0, it yields 1. But now, if you know the answer is '1', how do you back out the original data? You can't. It could have been 1,1 or maybe 0,1, or maybe 1,0.
As for salting and rainbow tables. Yes, theoretically, you could have a rainbow table which would encompass all possible salts and passwords, but practically, that's just too big. If you tried every possible combination of lower case letters, upper case, numbers, and twelve punctuation symbols, up to 50 characters long, that's (26+26+10+12)^50 = 2.9 x 10^93 different possibilities. That's more than the number of atoms in the visible universe.
The idea behind rainbow tables is to calculate the hash for a bunch of possible passwords in advance, and passwords are much shorter than 50 characters, so it's possible to do so. That's why you want to add a salt in front: if you add '57sjflk43380h4ljs9flj4ay' to the front of the password, then while someone may have already computed the hash for "pa55w0rd", no one will have already calculated the hash for '57sjflk43380h4ljs9flj4aypa55w0rd'.
I don't think the md5 gives you the whole result - so you can't work backwards to find the original things that was md5-ed
md5 is 128bit, that's 3.4*10^38 combinations.
the total number of eight character length passwords:
only lowercase characters and numbers: 36^8 = 2.8*10^12
lower&uppercase and numbers: 62^8 = 2.18*10^14
You have to store 8 bytes for the password, 16 for the md5 value, that's 24 bytes total per entry.
So you need approx 67000G or 5200000G storage for your rainbow table.
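The storage figures above can be reproduced with a short calculation. This sketch assumes a plain lookup table with one 24-byte entry per password and takes "G" to mean 10^9 bytes; the helper name table_size_gb is made up.

```python
BYTES_PER_ENTRY = 8 + 16  # 8-byte password plus 16-byte MD5 digest

def table_size_gb(alphabet_size: int, length: int = 8) -> float:
    """Size in GB (10^9 bytes) of a full lookup table for fixed-length passwords."""
    return alphabet_size ** length * BYTES_PER_ENTRY / 1e9

lower_digits_gb = table_size_gb(36)  # lowercase letters and digits
mixed_digits_gb = table_size_gb(62)  # upper- and lowercase letters and digits
```

The results come out at roughly 67,700 GB and 5,240,000 GB, in line with the estimates above.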
The only reason that it's actually possible to figure out passwords is because people use obvious ones.

Using an ASIC to brute force MD5

Is it possible to use an Application Specific Integrated Circuit (ASIC) to brute force MD5 hashes and thus reverse them to their original form? I know there could be multiple collisions, but leaving that aside, would it be possible? The idea interests me because I happen to have ASIC Miner Block Erupters, which are ASICs used to generate SHA-256 hashes, but why not MD5?
Thanks in advance.
This is a very old question, but while working with a client and working to convince them that they couldn't use MD5 to hash passwords and needed to upgrade to something more secure, this post came up in the discussion.
While the accepted answer is technically correct, one doesn't have to calculate all possible MD5 hashes to break a password; one only has to rotate strings and positions in a methodical fashion to land on actual passwords. If we assume 8 characters in length and the common rule of uppercase, lowercase, and digits at minimum, that's only 218 trillion combinations.
Within the narrow confines of the answer, yes, it is completely impractical to brute force md5 collisions, but it is absolutely feasible to throw random smaller data sets at MD5 records and see what matches you get. Put simply, to calculate every possible MD5 for a set of passwords 5 characters in length containing letters, numbers and special characters might take two hours at 1 Mh/s.
I did that exact thing using a MacBook and some hastily written code for the aforementioned client. Within the span of the 45 minutes it took to explain the problem, and for them to point to this answer as a reason that they didn't need to bother, I had already gotten almost a thousand of the horrifyingly insecure passwords stored in their database.
Long story short, I just don't want people reading this answer and thinking that passwords hashed using MD5 are impossible to crack.
A brute force attack is futile as there are 2^128 MD5 hashes. If you could compute 10^18 (that's a billion times a billion) hashes per second, it would still take billions of years to find a single input matching a given hash (unless you are extraordinarily lucky). Terahashes per second is not nearly enough: 2^128 hashes at 1 terahash per second is on the order of 10^26 seconds, which is about 10^19 years.
MD5 is broken, but broken does not imply "feasible to brute force", only "feasible to attack in some manner (probably more sophisticated than brute force)".
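Both time estimates in the answers above follow from dividing the size of the search space by the hash rate. A minimal sketch, assuming 95 printable ASCII characters for "letters, numbers and special characters" (the answers do not state the exact alphabet) and a 365.25-day year:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def exhaustive_search_seconds(search_space: int, hashes_per_second: float) -> float:
    """Worst-case time to try every candidate once."""
    return search_space / hashes_per_second

# Every possible MD5 output at 1 terahash per second.
full_md5_years = exhaustive_search_seconds(2 ** 128, 1e12) / SECONDS_PER_YEAR

# Every 5-character password over an assumed 95-symbol alphabet at 1 Mh/s.
short_password_hours = exhaustive_search_seconds(95 ** 5, 1e6) / 3600
```

The first figure is about 10^19 years and the second a little over two hours, which is why restricting the search to plausible passwords matters so much more than raw hash rate.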

Number only Hash or Cipher decrypt

I wanted to know if it's possible to derive a method to generate a cipher or hash if I have a large data sample of the ciphered text and its corresponding ASCII text.
An example of the ciphered text is: 01jvaWf0SJRuEL2HM5xHVEV6C8pXHQpLGGg2gnnkdZU=
That would translate to: 12540991
the ASCII text contains only numbers.
I would think it is possible, since we're dealing with only numbers and I do have a sample of the ciphers and their ASCII translations.
But I am not sure where to start looking, or maybe I am wrong and such a thing is not possible.
What do you guys think ?
If you are trying to derive the original algorithm that generated the hashes from a given set of values and hashes, you could try mainstream algorithms and see if you get any hits. If not, it may be impossible, or simply take too much time, to find. The most common homegrown algorithms tend to be a combination of a global salt + a unique random salt + multiple iterations of a common hashing function such as SHA256.
If you are trying to invert a mainstream hashing function, that would be impossible: they are one-way functions, so you can't find the original text given the hash value. If you still want the original text, you would need to iterate over all the possible values to determine which one generated that hash. Since the inputs are only numbers, that isn't too bad: just build a lookup table using whichever algorithm was used, where the hash is the key and the text that generated that hash is the value. Once done, simply look up the hash to find the original text. This is a precomputed lookup-table attack, carried out offline.
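A lookup table of that kind might look like the sketch below. The scheme is a guess: the 44-character Base64 sample in the question is consistent with a 32-byte value, so the sketch assumes an unsalted SHA-256 digest encoded as Base64, which may well not be what the real system does. The names encode and build_lookup are hypothetical.

```python
import base64
import hashlib

def encode(text: str) -> str:
    """Assumed scheme: Base64 of the SHA-256 digest of the ASCII digits."""
    digest = hashlib.sha256(text.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

def build_lookup(limit: int) -> dict:
    """Map every encoded value back to the number that produced it, for 0 <= n < limit."""
    return {encode(str(n)): str(n) for n in range(limit)}

# A small table for demonstration; 8-digit numbers would need 10^8 entries.
table = build_lookup(100_000)
recovered = table.get(encode("12540"))
```

If known pairs from the data sample do not appear in a table built this way, the guess about the algorithm (or a missing salt or iteration count) is wrong, and the same loop can be retried with another candidate scheme.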
What you're describing is what's called a known-plaintext attack. This is a form of cryptanalysis, so it is certainly possible, although good one-way hash algorithms are designed to be resistant to it.
While it's possible, it is unlikely to be practical against well-known hashing algorithms unless you are an expert in cryptography and an experienced code-breaker--and even then, it's not what one might call a short-term project.
A homegrown algorithm or simple encoding scheme is another matter, of course. If your question is "Is it possible?", then the answer is "Yes."

Can two different strings generate the same MD5 hash code?

For each of our binary assets we generate an MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two different binary assets generate the same MD5 hash? So is it possible that two different strings generate the same MD5 hash?
For a set of even billions of assets, the chances of random collisions are negligibly small -- nothing that you should worry about. By the birthday paradox, a set needs on the order of 2^64 (or 18,446,744,073,709,551,616) assets before the probability of at least one MD5 collision within it approaches 50%. At this scale, you'd probably beat Google in terms of storage capacity.
However, because the MD5 hash function has been broken (it's vulnerable to a collision attack), any determined attacker can produce 2 colliding assets in a matter of seconds worth of CPU power. So if you want to use MD5, make sure that such an attacker would not compromise the security of your application!
Also, consider the ramifications if an attacker could forge a collision to an existing asset in your database. While there are no such known attacks (preimage attacks) against MD5 (as of 2011), it could become possible by extending the current research on collision attacks.
If these turn out to be a problem, I suggest looking at the SHA-2 series of hash functions (SHA-256, SHA-384 and SHA-512). The downside is that it's slightly slower and has longer hash output.
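To put numbers on the birthday estimate, the usual approximation is p ≈ 1 - exp(-n^2 / (2N)) for n items and N possible hash values. A small sketch, with the function name collision_probability chosen just for this example:

```python
import math

def collision_probability(items: int, hash_bits: int) -> float:
    """Birthday approximation: p = 1 - exp(-n^2 / (2N)) with N = 2^hash_bits."""
    space = 2 ** hash_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-(items * items) / (2 * space))

p_billion = collision_probability(10 ** 9, 128)  # a billion assets
p_2_64 = collision_probability(2 ** 64, 128)     # 2^64 assets
```

Under this approximation a billion assets give a probability around 10^-21, while 2^64 assets give roughly 0.39, so random collisions only become likely at the 2^64 scale.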
MD5 is a hash function – so yes, two different strings can absolutely generate colliding MD5 codes.
In particular, note that MD5 codes have a fixed length so the possible number of MD5 codes is limited. The number of strings (of any length), however, is definitely unlimited so it logically follows that there must be collisions.
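The counting argument is easy to see in action if the hash is made artificially short. The sketch below keeps only the first 2 bytes of MD5, so there are just 65,536 possible outputs and a collision must appear within 65,537 distinct inputs; tiny_hash and find_collision are illustrative names, not part of any library.

```python
import hashlib

def tiny_hash(data: bytes) -> bytes:
    """First 2 bytes of MD5: a deliberately tiny 16-bit hash for illustration."""
    return hashlib.md5(data).digest()[:2]

def find_collision():
    """Hash successive counters until two different inputs share a digest."""
    seen = {}
    counter = 0
    while True:
        data = str(counter).encode("ascii")
        digest = tiny_hash(data)
        if digest in seen:
            return seen[digest], data
        seen[digest] = data
        counter += 1

first, second = find_collision()
```

The same reasoning applies to the full 128-bit output; the only difference is that the number of inputs needed before a collision is forced, or even likely, is astronomically larger.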
Yes, it is possible that two different strings can generate the same MD5 hash code.
Here is a simple test using very similar binary message in hex string:
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c6b384c4968b28812b676b49d40c09f8af4ed4cc -
008ee33a9d58b51cfeb425b0959121c9
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c728d8d93091e9c7b87b43d9e33829379231d7ca -
008ee33a9d58b51cfeb425b0959121c9
They generate different SHA-1 sum, but the same MD5 hash value. Secondly the strings are very similar, so it's difficult to find the difference between them.
The difference can be found by the following command:
$ diff -u <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2 | fold -w2) <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2 | fold -w2)
--- /dev/fd/63 2016-02-05 12:55:04.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:55:04.000000000 +0000
@@ -33,7 +33,7 @@
af
bf
a2
-00
+02
a8
28
4b
@@ -53,7 +53,7 @@
6d
a0
d1
-55
+d5
5d
83
60
Above collision example is taken from Marc Stevens: Single-block collision for MD5, 2012; he explains his method, with source code (alternate link to the paper).
Another test:
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
756f3044edf52611a51a8fa7ec8f95e273f21f82 -
cee9a457e790cf20d4bdaa6d69f01e41
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
6d5294e385f50c12745a4d901285ddbffd3842cb -
cee9a457e790cf20d4bdaa6d69f01e41
Different SHA-1 sum, the same MD5 hash.
The difference is in two bytes:
$ diff -u <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2) <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2)
--- /dev/fd/63 2016-02-05 12:56:43.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:56:43.000000000 +0000
@@ -19,7 +19,7 @@
03
65
9e
-70
+74
4f
85
34
@@ -41,7 +41,7 @@
a3
f4
15
-5c
+dc
bb
86
07
Above example is adapted from Tao Xie and Dengguo Feng: Construct MD5 Collisions Using Just A Single Block Of Message, 2010.
Related:
Are there two known strings which have the same MD5 hash value? at Crypto.SE
Yes, it is possible. This is in fact a Birthday problem. However the probability of two randomly chosen strings having the same MD5 hash is very low.
See this and this questions for examples.
Yes, of course: MD5 hashes have a finite length, but there are an infinite number of possible character strings that can be MD5-hashed.
Yes, it is possible. It is called a Hash collision.
Having said that, algorithms such as MD5 are designed to minimize the probability of a collision.
The Wikipedia entry on MD5 explains some vulnerabilities in MD5, which you should be aware of.
Just to be more informative.
From a math point of view, Hash functions are not injective.
This means that distinct inputs can map to the same output, so there is no one-to-one relationship between the starting set and the resulting one.
Bijection on wikipedia
EDIT: to be complete injective hash functions exist: it's called Perfect hashing.
I think we need to be careful choosing the hashing algorithm as per our requirement, as hash collisions are not as rare as I expected. I recently found a very simple case of hash collision in my project. I am using Python wrapper of xxhash for hashing. Link: https://github.com/ewencp/pyhashxx
import pyhashxx
s1 = 'mdsAnalysisResult105588'
s2 = 'mdsAlertCompleteResult360224'
pyhashxx.hashxx(s1) # Out: 2535747266
pyhashxx.hashxx(s2) # Out: 2535747266
It caused a very tricky caching issue in the system, then I finally found that it's a hash collision.
Yes, it is! Collision will be a possibility (although, the risk is very small). If not, you would have a pretty effective compression method!
EDIT: As Konrad Rudolph says: A potentially unlimited set of inputs converted to a finite set of outputs (32 hex chars) will result in an endless number of collisions.
As other people have said, yes, there can be collisions between two different inputs. However, in your use case, I don't see that being a problem. I highly doubt that you will run into collisions - I've used MD5 for fingerprinting hundreds of thousands of image files of a number of image (JPG, bitmap, PNG, raw) formats at a previous job and I didn't have a collision.
However, if you are trying to fingerprint some kind of data, perhaps you could use two hash algorithms - the odds of one input resulting in the same output of two different algorithms is near impossible.
I realize this is old, but thought I would contribute my solution. There are 2^128 possible hash values, so by the birthday paradox a collision becomes likely after roughly 2^64 hashed items. While the solution below won't eliminate the possibility of collisions, it surely will reduce the risk by a very substantial amount.
2^64 = 18,446,744,073,709,551,616 possible combinations
What I have done is I put a few hashes together based on the input string to get a much longer resulting string that you consider your hash...
So my pseudo-code for this is:
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
That gets you to a practical improbability of a collision. But if you want to be super paranoid and can't have it happen, and storage space is not an issue (nor are computing cycles)...
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
& Hash(Reverse(SpellOutLengthWithWords(Length(string))))
& Hash(Rotate13(string)) & Hash(Hash(string)) & Hash(Reverse(Hash(string)))
Okay, not the cleanest solution, but this now gets you a lot more play with how infrequently you will run into a collision. To the point I might assume impossibility in all realistic senses of the term.
For my sake, I think the possibility of a collision is infrequent enough that I will consider this not "surefire" but so unlikely to happen that it suits the need.
Now the possible combinations goes up significantly. While you could spend a long time on how many combinations this could get you, I will say in theory it lands you SIGNIFICANTLY more than the quoted number above of
2^64 (or 18,446,744,073,709,551,616)
Likely by a hundred more digits or so. The theoretical max this could give you would be
Possible number of resulting strings:
528294531135665246352339784916516606518847326036121522127960709026673902556724859474417255887657187894674394993257128678882347559502685537250538978462939576908386683999005084168731517676426441053024232908211188404148028292751561738838396898767036476489538580897737998336
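For what it's worth, the first pseudo-code formula above could be written in Python roughly as follows. This is only a sketch of that idea: it assumes Hash is MD5 in hex form and that & means string concatenation, and the names md5_hex and combined_fingerprint are made up.

```python
import hashlib

def md5_hex(text: str) -> str:
    """MD5 of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def combined_fingerprint(text: str) -> str:
    """Hash(string) & Hash(Reverse(string)) & Hash(Length(string)), with & as concatenation."""
    return md5_hex(text) + md5_hex(text[::-1]) + md5_hex(str(len(text)))

fingerprint = combined_fingerprint("stackoverflow.com")
```

The result is 96 hex characters, three times the length of a single MD5 value.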
Understanding the theory does not seem to help much when we talk about what happens in practice, so it is worth looking at how large these binary numbers really are.
To use up all 18,446,744,073,709,551,616 hashes on one file system, every person in the world would need 18446744073709551616/8000000000 = 2305843009.21 files. If each file is 1 MB, that is 2,305,843,009 MB, or about 2,305,843 GB, or about 2,305 TB per person, which is roughly 153,722 free 15 GB Google Drive accounts for each person.
If we make the files bigger, they use more space, so fewer files fit and fewer hashes are used. And files are not getting smaller, only bigger.
Someone could calculate how big the files would need to be for all MD5 hashes to be used up.
The average file size was 3.22 MB in 2002 and 8.92 MB in 2005, and we can assume files are at least that large today. So even Google's file system would never hold that many files on one system: if each of the 8 billion people in the world filled a free 15 GB Google Drive with fairly small 3 MB files, that would make 40,000,000,000,000 files, which is only 0.00021684% of the hash count above.
As an unrelated comparison with birthdays: 2 people out of 100 days would be 2 days, or 0.02, and 2 people out of 365 days would be 2/365, about 0.00547 (0.547%).
With MD5, 2 files out of 18446744073709551616 is about 1.08 × 10^-19 of all files (about 1.08 × 10^-17 %), if that many files existed at all.
It is like asking whether Adam and Eve share a birthday in a world that does not yet have 365 people, or in a file system that does not have that many files or passwords at all.
So the number of attempts needed to hit a collision by guessing is so large that it is impossible against a real-life secured server.
If the MD5 limit is 18,446,744,073,709,551,616, then there will never be that many files in the whole world.
MD5 is an example of mapping all the strings in the world to hashes. That many strings will never exist, so the problem is just that MD5 is short. But will we really have trillions of very long strings that share the same hash?
Actually, it would be like comparing 365 babies born on different days with a 366th baby to find out whose birthday is the same.
As you can see, all the answers say yes in theory, but fail to back it up with real-life examples. If it is a password, then only a very long string might hash to the same value as a short one.
If it is file identification hashing, then use a different hash or a combination of them.
The birthday problem here is as if one person were the 4-letter word "abcd", while another person's DNA could be the same only if it were "abcdfijdfj".
If you read Wikipedia about the birthday problem, it is not only about the birthday date; here it would be more like matching birth date, hour, second, millisecond and more, closer to a DNA problem.
With a hash, can you have the same DNA and birthday, as with twins? No. With someone else, sometimes.
The birthday paradox is a probability result for 365 options (days), while a hash has how many options? Many more. So if you have 2 different strings with matching hashes, it is just because the MD5 hash is too short for so many files, so use something longer than MD5.
It is not comparing 50 babies across 365 days; it is comparing whether 2 hashes are the same when strings of many lengths are hashed, such as abcd being the same as the 25-letter abcdef...zdgdege and the 150-letter sadiasdjsadijfsdf.sdaidjsad.dfijsdf.
So if it is a password, its birthday sibling will be a much longer string that does not even exist, since no one creates a 25-letter password.
For comparing files, I'm not sure how big the probability is, but it is not 97% and not even 0.0000001%.
Ok, let's be more specific.
If it is a file, a collision can occur on a huge system, since the files will be different, but it should not be a problem in practice, because about 5 quadrillion (5 000 000 000 000 000) files would have to be on the same system, for UUIDs as for MD5.
And if it is a password, trying one guess every second would take 10 years. One could try every millisecond, but then blocking the IP for 1 minute after 3 wrong guesses would make guessing take millions of years.
When I see something wrong, then I know it's wrong. Theory promises vs reality.

How come MD5 hash values are not reversible?

One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:
If on my server, in PHP I produce:
md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.
Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?
What is it about these functions that makes the resulting strings impossible to retrace?
The input material can be of arbitrary length, whereas the output is always 128 bits long. This means that an infinite number of input strings will generate the same output.
If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?
If hash functions such as MD5 were reversible then it would have been a watershed event in the history of data compression algorithms! It's easy to see that if MD5 were reversible then arbitrary chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. Thus you would have been able to reconstruct the original message from a 128 bit number regardless of the size of the original message.
Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.
Consider this function (in PHP notation, as the question):
function simple_hash($input) {
return bin2hex(substr(str_pad($input, 16), 0, 16));
}
This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).
print simple_hash("stackoverflow.com");
This will output:
737461636b6f766572666c6f772e636f
This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.
But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin to your hash, and you have it). If your original string had at most 16 bytes, you even get the original string back (apart from the padding spaces); our 17-byte example loses only its final character. Nothing of this kind should be possible for MD5, even if you know the input was quite short (other than by trying all possible inputs until we find one that matches, i.e. a brute-force attack).
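For readers who prefer to experiment outside PHP, here is a rough Python port of simple_hash together with its inverse. The port and the name `invert_simple_hash` are my own additions for illustration; it assumes ASCII input, which matches PHP's byte-wise behaviour for the example string.

```python
def simple_hash(text: str) -> str:
    """Python port of the PHP simple_hash: pad to 16 bytes, cut, hex-encode."""
    data = text.encode("ascii")
    padded = data.ljust(16, b" ")   # like str_pad($input, 16)
    return padded[:16].hex()        # like bin2hex(substr(..., 0, 16))


def invert_simple_hash(digest: str) -> str:
    """Recover a string with the given hash (the hex2bin step)."""
    return bytes.fromhex(digest).decode("ascii")


digest = simple_hash("stackoverflow.com")
recovered = invert_simple_hash(digest)

print(digest)
print(recovered)
# The recovered string hashes to exactly the same value again.
print(simple_hash(recovered) == digest)
```

The recovered string is a valid preimage found in a single step, which is precisely what a cryptographic hash must not allow.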
The important assumptions for a cryptographic hash function are:
it is hard to find any string producing a given hash (preimage resistance)
it is hard to find any different string producing the same hash as a given string (second preimage resistance)
it is hard to find any pair of strings with the same hash (collision resistance)
Obviously my simple_hash function fulfills none of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provably second-preimage resistant and collision resistant.)
There now exist collision attacks against MD5 (i.e. it is possible to produce a pair of strings, even with a given common prefix, which have the same hash, with considerable but by no means infeasible effort), so you shouldn't use MD5 for anything critical.
There is not yet a practical preimage attack, but attacks will get better.
To answer the actual question:
What is it about these functions that makes the
resulting strings impossible to retrace?
What MD5 (and other hash functions built on the Merkle-Damgård construction) effectively do is apply an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split into blocks; each of these blocks is used to encrypt the output of the previous block, which is then XORed with its input to prevent reverse calculations.)
Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them).
They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.
For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).
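The structure described above can be sketched with a toy example. Everything below is hypothetical and insecure: `toy_encrypt` is a made-up 32-bit keyed permutation, the padding is simplified, and real MD5 uses 512-bit blocks, a 128-bit state, and modular addition rather than XOR for the feed-forward. The sketch only shows the shape: message block as key, previous state as plaintext, result combined with the input.

```python
MASK = 0xFFFFFFFF  # work on 32-bit words


def rotl(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (0 <= r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK


def toy_encrypt(key: int, block: int) -> int:
    """Toy keyed permutation (NOT secure): add, rotate, xor for 8 rounds."""
    x = block
    for rnd in range(8):
        x = (x + rotl(key, rnd)) & MASK        # mix in key material
        x = rotl(x, 7)                         # shuffle bit positions
        x ^= (0x9E3779B9 * (rnd + 1)) & MASK   # round constant
    return x


def toy_compress(state: int, msg_block: int) -> int:
    """Message block acts as the key; the old state is fed forward via XOR."""
    return toy_encrypt(msg_block, state) ^ state


def toy_hash(data: bytes) -> int:
    """Chain toy_compress over 4-byte blocks with simplified padding."""
    padded = data + b"\x80"
    padded += b"\x00" * (-len(padded) % 4)
    padded += (len(data) & MASK).to_bytes(4, "big")  # append the length
    state = 0x67452301                               # fixed initial value
    for i in range(0, len(padded), 4):
        state = toy_compress(state, int.from_bytes(padded[i:i + 4], "big"))
    return state


print(hex(toy_hash(b"stackoverflow.com")))
```

Without the feed-forward XOR, anyone who knows a message block could run `toy_encrypt` backwards; with it, the compression step is no longer a simple permutation of the state.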
Cody Brocious's answer is the right one. Strictly speaking, you cannot "invert" a hash function because many strings are mapped to the same hash. Notice, however, that either finding one string that gets mapped to a given hash, or finding two strings that get mapped to the same hash (i.e. a collision), would be major breakthroughs for a cryptanalyst. The great difficulty of both these problems is the reason why good hash functions are useful in cryptography.
MD5 does not create a unique hash value; the goal of MD5 is to quickly produce a value that changes significantly based on a minor change to the source.
E.g.,
"hello" -> "1ab53"
"Hello" -> "993LB"
"ZR#!RELSIEKF" -> "1ab53"
(Obviously those are not actual MD5 outputs.)
Most hashes (if not all) are also non-unique; rather, they're unique enough, so a collision is highly improbable, but still possible.
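To see the "changes significantly" property with real MD5 values, one can count how many of the 128 output bits differ between two nearly identical inputs. This is a small sketch using Python's standard hashlib; the helper names are my own.

```python
import hashlib


def md5_bits(text: str) -> int:
    """Return the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Count the bit positions where the two MD5 digests differ."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")


changed = differing_bits("hello", "Hello")
print(f"{changed} of 128 bits differ")
```

For a well-behaved hash one would expect roughly half of the bits to flip, even though the inputs differ in a single letter's case.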
A good way to think of a hash algorithm is to think of resizing an image in Photoshop... say you have an image that is 5000x5000 pixels and you then resize it to just 32x32. What you have is still a representation of the original image, but it is much, much smaller and has effectively "thrown away" certain parts of the image data to make it fit in the smaller size. So if you were to resize that 32x32 image back up to 5000x5000, all you'd get is a blurry mess. However, because a 32x32 image is not that large, it is theoretically conceivable that another image could be downsized to produce the exact same pixels!
That's just an analogy but it helps understand what a hash is doing.
A hash collision is much more likely than you would think. Take a look at the birthday paradox to get a greater understanding of why that is.
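The birthday effect is easy to observe on a deliberately shortened hash. The sketch below truncates MD5 to 16 bits (an assumption made purely so that the search finishes instantly) and looks for two different inputs with the same truncated value; `short_hash` and `find_collision` are illustrative names.

```python
import hashlib


def short_hash(text: str, nbytes: int = 2) -> bytes:
    """MD5 truncated to nbytes bytes, to make collisions easy to find."""
    return hashlib.md5(text.encode("utf-8")).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Try message-0, message-1, ... until two share a truncated hash."""
    seen = {}
    counter = 0
    while True:
        candidate = f"message-{counter}"
        h = short_hash(candidate, nbytes)
        if h in seen:
            return seen[h], candidate, counter + 1
        seen[h] = candidate
        counter += 1


first, second, tries = find_collision()
print(first, second, tries)
```

With 65536 possible outputs, a collision typically shows up after a few hundred attempts rather than tens of thousands. The same square-root effect is why a 128-bit hash offers only about 64 bits of collision resistance even without any cryptanalytic weakness.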
As the number of possible input files is larger than the number of 128-bit outputs, it's impossible to uniquely assign an MD5 hash to each possible file.
Cryptographic hash functions are used for checking data integrity or digital signatures (the hash being signed for efficiency). Changing the original document should therefore mean the original hash doesn't match the altered document.
These criteria are sometimes used:
1. Preimage resistance: for a given hash function and given hash, it should be difficult to find an input that has the given hash for that function.
2. Second preimage resistance: for a given hash function and input, it should be difficult to find a second, different, input with the same hash.
3. Collision resistance: for a given hash function, it should be difficult to find two different inputs with the same hash.
These criteria are chosen to make it difficult to find a document that matches a given hash; otherwise it would be possible to forge documents by replacing the original with one that matched by hash. (Even if the replacement is gibberish, the mere replacement of the original may cause disruption.)
Number 3 implies number 2.
As for MD5 in particular, it has been shown to be flawed:
How to break MD5 and other hash functions.
But this is where rainbow tables come into play.
Basically it is just a large number of values hashed separately, with the results saved to disk. The reversing bit is then "just" a lookup in a very large table.
Obviously this is only feasible for a subset of all possible input values but if you know the bounds of the input value it might be possible to compute it.
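A minimal sketch of the precompute-then-look-up idea, assuming the input is known to be exactly three lowercase letters. Note that this is a plain lookup table; real rainbow tables store hash chains to trade lookup time for far less disk space, which is not shown here. The function name is my own.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def build_table(length: int = 3) -> dict:
    """Precompute MD5 for every lowercase string of the given length."""
    table = {}
    for letters in product(ascii_lowercase, repeat=length):
        word = "".join(letters)
        table[hashlib.md5(word.encode("ascii")).hexdigest()] = word
    return table


table = build_table()

# "Reversing" a hash inside the known bounds is a dictionary lookup.
target = hashlib.md5(b"cat").hexdigest()
recovered = table.get(target)

# A hash of something outside the precomputed set is simply not found.
outside = table.get(hashlib.md5(b"stackoverflow.com").hexdigest())

print(recovered, outside)
```

The table has 26^3 = 17576 entries; each extra character multiplies that by 26, which is why the approach only works when the bounds of the input are known and small.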
Chinese scientists found a way to construct MD5 collisions, and later work extended this to so-called "chosen-prefix collisions", where two different strings are made to produce the same hash.
Here is an example: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5.exe.zip
The source code: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5_source.zip
The best way to understand what the most voted answers mean is to actually try to reverse the MD5 algorithm. I remember I tried to reverse the MD5crypt algorithm some years ago, not to recover the original message, because that is clearly impossible, but just to generate a message that would produce the same hash as the original. This, at least theoretically, would give me a way to log in to a Linux device that stored the user:password in the /etc/passwd file, using the generated message (password) instead of the original one. Since both messages would have the same resulting hash, the system would recognize my password (generated from the original hash) as valid. That didn't work at all. After several weeks, if I remember correctly, the use of salt in the initial message killed me. I had to produce not only a valid initial message, but a salted valid initial message, which I was never able to do. But the knowledge that I got from this experiment was nice.
As most have already said MD5 was designed for variable length data streams to be hashed to a fixed length chunk of data, so a single hash is shared by many input data streams.
However if you ever did need to find out the original data from the checksum, for example if you have the hash of a password and need to find out the original password, it's often quicker to just google (or whatever searcher you prefer) the hash for the answer than to brute force it. I have successfully found out a few passwords using this method.
Nowadays MD5 hashes (and other hashes, for that matter) are precomputed for a very large number of common strings and stored for easy access. Although MD5 is not reversible in theory, such databases may let you find out which text resulted in a particular hash value.
For example, try the following hash code at http://gdataonline.com/seekhash.php to find out what text I used to compute the hash
aea23489ce3aa9b6406ebb28e0cda430
f(x) = 1 is irreversible. Hash functions aren't irreversible.
This is actually required for them to fulfill their function of determining whether someone possesses an uncorrupted copy of the hashed data. This brings susceptibility to brute force attacks, which are quite powerful these days, particularly against MD5.
There's also confusion here and elsewhere among people who have mathematical knowledge but little cipherbreaking knowledge. Several ciphers simply XOR the data with the keystream, and so you could say that a ciphertext corresponds to all plaintexts of that length because you could have used any keystream.
However, this ignores that a reasonable plaintext produced from the seed password is much, much more likely than another produced by the seed Wsg5Nm^bkI4EgxUOhpAjTmTjO0F!VkWvysS6EEMsIJiTZcvsh#WI$IH$TYqiWvK!%&Ue&nk55ak%BX%9!NnG%32ftud%YkBO$U6o to the extent that anyone claiming that the second was a possibility would be laughed at.
In the same way, if you're trying to decide between the two potential passwords password and Wsg5Nm^bkI4EgxUO, it's not as difficult to do as some mathematicians would have you believe.
By definition, a cryptographic hash function should not be invertible and should have the least collisions possible.
Regarding your question: it is a one-way hash. The input (irrespective of length) is padded as the algorithm requires (to a 512-bit block boundary for MD5) and generates a fixed-size output. The information is compressed (lost), and it is practically impossible to regenerate it with reverse transforms.
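As a small illustration of the 512-bit boundary, here is a sketch of the padding step as I understand it from the MD5 specification (RFC 1321): a 0x80 byte, then zero bytes, then the message length in bits as a 64-bit little-endian integer. The function name `md5_pad` is my own, and this covers the padding only, not the hash itself.

```python
def md5_pad(message: bytes) -> bytes:
    """Pad message to a multiple of 64 bytes (512 bits), MD5 style."""
    bit_length = (8 * len(message)) % (1 << 64)  # length is taken mod 2^64
    zeros = (55 - len(message)) % 64             # leave room for 0x80 + 8 bytes
    return message + b"\x80" + b"\x00" * zeros + bit_length.to_bytes(8, "little")


for size in (0, 3, 55, 56, 64):
    print(size, len(md5_pad(b"a" * size)))
```

A 55-byte message still fits in one 64-byte block, while a 56-byte message spills into a second block because the marker byte and the length field no longer fit.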
Additional info on MD5: it is vulnerable to collisions. I have gone through this article recently,
http://www.win.tue.nl/hashclash/Nostradamus/
Open source code for crypto hash implementations (MD5 and SHA) can be found in the Mozilla code (freebl library).
I like all the various arguments.
It is obvious the real value of hashed values is simply to provide human-unreadable placeholders for strings such as passwords.
It has no specific enhanced security benefit.
Assuming an attacker gained access to a table with hashed passwords, he/she can:
Hash a password of his/her own choice and place the results inside the password table if he/she has writing/edit rights to the table.
Generate hashed values of common passwords and test the existence of similar hashed values in the password table.
In this case weak passwords cannot be protected by the mere fact that they are hashed.