Does compression change the hash value? - hash

I have a large file that I need to compress, however I need to ensure that the original file has the same hash value as the compressed one. I tried it on a smaller file, hash values are different but I am thinking that this might be because of metadata change. How do I ensure that the files don't change after compression?

It depends on which shash you are using. If you are using crc32 it's pretty trivial to make your hashes the same. MD5 might be possible already (I don't know the start of the art there), SHA1 will probably be doable in a few years. If you are using SHA256, better give up.
Snark about broken crypto aside, unless your hash algorithm knows specifically about your compression setup or your input file was very carefully crafted to provoke a hash collision: the hash will change before and after compression. That means any standard cryptographic hash will change upon compression.
All the hash algorithm sees are a stream of bits without any meaning. It does not know about compression schemes, and should not.

If your hash is a CRC-32, then you can insert or append four bytes to the compressed data, and set those to get the original CRC. For example, in a gzip stream you can insert a four-byte extra block in the header.
The whole point of cryptographic hashes, like MD5 noted as a tag to the question, is to make that exceedingly difficult, or practically impossible.

Related

Does MD5 create same hash code for two same binaries even if add few lines code of 2nd one?

I have created small automated antivirus tool it works with MD5 hash code.it takes the hash of binary then compare with existing signatures in database.
Suppose if malware author just change/add few lines of code of malware then it will generate same hash key?or different one? Does it generate same hash key even if enhance/alter the code?
The entire point of checksums is that a small change is supposed to result in a large change to the hash.
That being said, checksum collisions are a fact of life and clever people can make changes to code very carefully, such that the checksum remains the same. I'd say that in general this is a low risk, though, because doing this would take serious effort.
That is generally true for Hash function, however MD5 is not considered secure anymore as you can read Here
In March 2005, Xiaoyun Wang and Hongbo Yu of Shandong University in
China published an article in which they describe an algorithm that
can find two different sequences of 128 bytes with the same MD5 hash.
It is also not secure specifically for ps files link.
Ideally, any change in the source binary would generate a different hash. However, the MD5 algorithm is not collision resistant. It is possible (though somewhat unlikely) that 2 different binaries generate the same hash.

Is there a fast hashing algorithm which is resistant to deliberate collision, and if so, which one?

I am developing an "open distributed cloud storage system".
By open I mean that anyone can participate in hosting of files.
My current design uses a sha1 hash of the files content as global file id.
It is given that the client already knows this hash value and receives the file from a "bandwidth donor".
The client now needs to verify that the file indeed is the correct one, by generating the hash and comparing it to the expected value.
However my concern is that someone could deliberately modify a file to produce the same hash. As far as I know this is doable easily for hashes of the CRC family. Some "googling" around revealed a lot of claims that the same would be easy for MD5.
Now my question is: Is there a hashing algorithm which satisfies the criteria of beeing
fast for big amounts of data
well distributed in the hashing range (aka "unique")
has a sufficient target range ("bit length")
is resistant to deliberate collision attacks
All other means that I can think of achieving a setup that serves my needs involve a secret component, for example a secret openssl key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
What you are asking for is a one-way function, whose existence is a major open problem.
With cryptographic hash functions, the specific attack you wanted to avoid is called the "second pre-image attack".
That should help you Googling what you want, but as far as I know there is actually no known practical second pre-image attack for MD5.
First of all, you probably found that it is easy to find two arbitrary files that have the same hash, and to find two different such pairs every time you try.
But it is difficult to generate a file to disguise as some specific file - in other words, it is unlikely that one of the prementioned "two arbitrary files" actually belongs to a non-malicious agent in your storage.
If you're still not satisfied, you might want to try something like SHA-1 or SHA-2 or GOST.
First of all, a hash value can never identify a file, as there will always be collisions.
Having said that, what you are looking for is called a cryptographic hash. These are designed to not (easily, i.e. other than brute force) allow modifications of the data while keeping the hash, or producing new data with a given hash.
As such, the SHA family is ok.
For the moment, SHA1 is adequate. No collisions are known.
It would help a lot to know the average size of the thing you are hashing. But most likely, if your platforms are predominantly 64-bit, SHA512 is your best choice. You can truncate the hash and use only 256-bits of it. If your platforms are predominantly 32-bit, SHA256 is your best choice.

CRC32+Size vs MD5/SHA1

We have a storage of files and the storage uniquely identifies a file on the basis of size appended to crc32.
I wanted to know if this checksum ( crc32 + size ) would be good enough for identifying files or should we consider some other hashing technique like MD5/SHA1?
CRC is most an error detection method than a serious hash function. It helps in identify corrupting files rather than uniquely identify them.
So your choice should be between MD5 and SHA1.
If you don't have strong security needings you can choose MD5 that should be faster.
(remember that MD5 is vulnerable to collision attacks).
If you need more security you better use SHA1 or even SHA2 .
CRC-32 is not good enough; it is trivial to build collisions, i.e. two files (of the same length if you wish it so) which have the same CRC-32. Even in the absence of a malicious attacker, collisions will happen randomly once you have about 65000 distinct files with the same length.
A hash function is designed to avoid collisions. With MD5 or SHA-1, you will not get random collisions. If your setup is security-related (i.e. there is someone, somewhere, who may actively try to create collisions), then you need a secure hash function. MD5 is not secure anymore (creating collisions with MD5 is easy) and SHA-1 is somewhat weak in that respect (no actual collisions were computed, but a method for creating one is known and, while expensive, it is much less expensive than what it ought to be). The usual recommendation is to use SHA-256 or SHA-512 (SHA-256 is enough for security; SHA-512 may be a tad faster on big, 64-bit systems, but file reading bandwidth will be more limitating than hashing speed).
Note: when using a cryptographic hash function, there is no need to store and compare the file lengths; the hash is sufficient to disambiguate files.
In a non-security setup (i.e. you only fear random collisions), then MD4 can be used. It is thoroughly "broken" as a cryptographic hash function, but it still is a very good checksum, and it is really fast (on some ARM-based platforms, it is even faster than CRC-32, for a much better resistance to random collisions). Basically, you should not use MD5: if you have security issues, then MD5 must not be used (it is broken; use SHA-256); and if you do not have security issues then MD4 is faster than MD5.
The space that would be used by a CRC32+size gives you enough room for a bigger CRC which would be a much better choice. If you are not worried about malicious collision that's it in which case Thomas' answer applies.
You didn't specify a language but for example in C++ you got Boost CRC giving you CRC of the size you want (or you can afford to store).
As others have said, CRC doesn't guarantee absence of collisions. However, your problem is be solved simply by giving the files incrementing 64-bit numbers. This is guaranteed to never collide (unless you want to keep gazillion of files in one directory which is not a good idea anyway).

Purposely create two files to have the same hash?

If someone is purposely trying to modify two files to have the same hash, what are ways to stop them? Can md5 and sha1 prevent the majority case?
I was thinking of writing my own and I figure even if I don't do a good job if the user doesn't know my hash he may not be able to fool mine.
What's the best way to prevent this?
MD5 is generally considered insecure if hash collisions are a major concern. SHA1 is likewise no longer considered acceptable by the US government. There is was a competition under way to find a replacement hash algorithm, but the recommendation at the moment is to use the SHA2 family - SHA-256, SHA-384 or SHA-512. [Update: 2012-10-02 NIST has chosen SHA-3 to be the algorithm Keccak.]
You can try to create your own hash — it would probably not be as good as MD5, and 'security through obscurity' is likewise not advisable.
If you want security, hash with multiple hash algorithms. Being able to simultaneously create files that have hash collisions using a number of algorithms is excessively improbable. [And, in the light of comments, let me make it clear: I mean publish both the SHA-256 and the Whirlpool values for the file — not combining hash algorithms to create a single value, but using separate algorithms to create separate values. Generally, a corrupted file will fail to match any of the algorithms; if, perchance, someone has managed to create a collision value using one algorithm, the chance of also producing a second collision in one of the other algorithms is negligible.]
The Public TimeStamp uses an array of algorithms. See, for example, sqlcmd-86.00.tgz for an illustration.
If the user doesn't know your hashing algorithm he also can't verify your signature on a document that you actually signed.
The best option is to use public-key one-way hashing algorithms that generate the longest hash. SHA-256 creates a 256-bit hash, so a forger would have to try 2255 different documents (on average) before they created one that matched a given document, which is pretty secure. If that's still not secure enough for you, there's SHA-512.
Also, I think it's worth mentioning that a good low-tech way to protect yourself against forged digitally-signed documents is to simply keep a copy of anything you sign. That way, if it comes down to a dispute, you can show that the original document you signed was altered.
There is a hierarchy of difficulty (for an attacker) here. It is easier to find two files with the same hash than to generate one to match a given hash, and easier to do the later if you don't have to respect form/content/lengths restrictions.
Thus, if it is possible to use a well defined document structure and lengths, you can make an attackers life a bit harder no matter what underling hash you use.
Why are you trying to create your own hash algorithm? What's wrong with SHA1HMAC?
Yes, there are repeats for hashes.
Any hash that is shorter than the plaintext is necessarily less information. That means there will be some repeats. The key for hashes is that the repeats are hard to reverse-engineer.
Consider CRC32 - commonly used as a hash. It's a 32-bit quantity. Because there are more than 2^32 messages in the universe, then there will be repeats with CRC32.
The same idea applies to other hashes.
This is called a "hash collision", and the best way to avoid it is to use a strong hash function. MD5 is relatively easy to artificially build colliding files, as seen here. Similarly, it's known there is a relatively efficient method for computing colliding SH1 files, although in this case "relatively efficient" still takes hunreds of hours of compute time.
Generally, MD5 and SHA1 are still expensive to crack, but not impossible. If you're really worried about it, use a stronger hash function, like SHA256.
Writing your own isn't actually a good idea unless you're a pretty expert cryptographer. most of the simple ideas have been tried and there are well-known attacks against them.
If you really want to learn more about it, have a look at Schneier's Applied Cryptography.
I don't think coming up with your own hash algorithm is a good choice.
Another good option is used Salted MD5. For example, the input to your MD5 hash function is appended with string "acidzom!##" before passing to MD5 function.
There is also a good reading at Slashdot.

How come MD5 hash values are not reversible?

One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:
If on my server, in PHP I produce:
md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.
Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?
What is it about these functions that makes the resulting strings impossible to retrace?
The input material can be an infinite length, where the output is always 128 bits long. This means that an infinite number of input strings will generate the same output.
If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?
If hash functions such as MD5 were reversible then it would have been a watershed event in the history of data compression algorithms! Its easy to see that if MD5 were reversible then arbitrary chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. Thus you would have been able to reconstruct the original message from a 128 bit number regardless of the size of the original message.
Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.
Consider this function (in PHP notation, as the question):
function simple_hash($input) {
return bin2hex(substr(str_pad($input, 16), 0, 16));
}
This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).
print simple_hash("stackoverflow.com");
This will output:
737461636b6f766572666c6f772e636f6d
This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.
But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin on your hash, and you have it). If your original string had the length 16 (as our example), you even will get this original string. Nothing of this kind should be possible for MD5, even if you know the length of the input was quite short (other than by trying all possible inputs until we find one that matches, e.g. a brute-force attack).
The important assumptions for a cryptographic hash function are:
it is hard to find any string producing a given hash (preimage resistance)
it is hard to find any different string producing the same hash as a given string (second preimage resistance)
it is hard to find any pair of strings with the same hash (collision resistance)
Obviously my simple_hash function fulfills neither of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provable second-preimage resistant and collision resistant.)
There now exist collision attacks against MD5 (e.g. it is possible to produce a pair of strings, even with a given same prefix, which have the same hash, with quite some work, but not impossible much work), so you shouldn't use MD5 for anything critical.
There is not yet a preimage attack, but attacks will get better.
To answer the actual question:
What is it about these functions that makes the
resulting strings impossible to retrace?
What MD5 (and other hash functions build on the Merkle-Damgard construction) effectively do is applying an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split in blocks, each of this blocks is used to encrypt the output of the previous block, XORed with its input to prevent reverse calculations.)
Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them).
They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.
For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).
Cody Brocious's answer is the right one. Strictly speaking, you cannot "invert" a hash function because many strings are mapped to the same hash. Notice, however, that either finding one string that gets mapped to a given hash, or finding two strings that get mapped to the same hash (i.e. a collision), would be major breakthroughs for a cryptanalyst. The great difficulty of both these problems is the reason why good hash functions are useful in cryptography.
MD5 does not create a unique hash value; the goal of MD5 is to quickly produce a value that changes significantly based on a minor change to the source.
E.g.,
"hello" -> "1ab53"
"Hello" -> "993LB"
"ZR#!RELSIEKF" -> "1ab53"
(Obviously that's not actual MD5 encryption)
Most hashes (if not all) are also non-unique; rather, they're unique enough, so a collision is highly improbable, but still possible.
A good way to think of a hash algorithm is to think of resizing an image in Photoshop... say you have a image that is 5000x5000 pixels and you then resize it to just 32x32. What you have is still a representation of the original image but it is much much smaller and has effectively "thrown away" certain parts of the image data to make it fit in the smaller size. So if you were to resize that 32x32 image back up to 5000x5000 all you'd get is a blurry mess. However because a 32x32 image is not that large it would be theoretically conceivable that another image could be downsized to produce the exact same pixels!
That's just an analogy but it helps understand what a hash is doing.
A hash collision is much more likely than you would think. Take a look at the birthday paradox to get a greater understanding of why that is.
As the number of possible input files is larger than the number of 128-bit outputs, it's impossible to uniquely assign an MD5 hash to each possible.
Cryptographic hash functions are used for checking data integrity or digital signatures (the hash being signed for efficiency). Changing the original document should therefore mean the original hash doesn't match the altered document.
These criteria are sometimes used:
Preimage resistance: for a given hash function and given hash, it should be difficult to find an input that has the given hash for that function.
Second preimage resistance: for a given hash function and input, it should be difficult to find a second, different, input with the same hash.
Collision resistance: for a given has function, it should be difficult to find two different inputs with the same hash.
These criterial are chosen to make it difficult to find a document that matches a given hash, otherwise it would be possible to forge documents by replacing the original with one that matched by hash. (Even if the replacement is gibberish, the mere replacement of the original may cause disruption.)
Number 3 implies number 2.
As for MD5 in particular, it has been shown to be flawed:
How to break MD5 and other hash functions.
But this is where rainbow tables come into play.
Basically it is just a large amount of values hashed separetely and then the result is saved to disk. Then the reversing bit is "just" to do a lookup in a very large table.
Obviously this is only feasible for a subset of all possible input values but if you know the bounds of the input value it might be possible to compute it.
Chinese scientist have found a way called "chosen-prefix collisions" to make a conflict between two different strings.
Here is an example: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5.exe.zip
The source code: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5_source.zip
The best way to understand what all the most voted answers meant is to actually try to revert the MD5 algorithm. I remember I tried to revert the MD5crypt algorithm some years ago, not to recover the original message because it is clearly impossible, but just to generate a message that would produce the same hash as the original hash. This, at least theoretically, would provide me a way to login to a Linux device that stored the user:password in the /etc/passwd file using the generated message (password) instead of using the original one. Since both messages would have the same resulting hash, the system would recognize my password (generated from the original hash) as valid. That didn't work at all. After several weeks, if I remember correctly, the use of salt in the initial message killed me. I had to produce not only a valid initial message, but a salted valid initial message, which I was never able to do. But the knowledge that I got from this experiment was nice.
As most have already said MD5 was designed for variable length data streams to be hashed to a fixed length chunk of data, so a single hash is shared by many input data streams.
However if you ever did need to find out the original data from the checksum, for example if you have the hash of a password and need to find out the original password, it's often quicker to just google (or whatever searcher you prefer) the hash for the answer than to brute force it. I have successfully found out a few passwords using this method.
Now a days MD5 hashes or any other hashes for that matter are pre computed for all possible strings and stored for easy access. Though in theory MD5 is not reversible but using such databases you may find out which text resulted in a particular hash value.
For example try the following hash code at http://gdataonline.com/seekhash.php to find out what text i used to compute the hash
aea23489ce3aa9b6406ebb28e0cda430
f(x) = 1 is irreversible. Hash functions aren't irreversible.
This is actually required for them to fulfill their function of determining whether someone possesses an uncorrupted copy of the hashed data. This brings susceptibility to brute force attacks, which are quite powerful these days, particularly against MD5.
There's also confusion here and elsewhere among people who have mathematical knowledge but little cipherbreaking knowledge. Several ciphers simply XOR the data with the keystream, and so you could say that a ciphertext corresponds to all plaintexts of that length because you could have used any keystream.
However, this ignores that a reasonable plaintext produced from the seed password is much, much more likely than another produced by the seed Wsg5Nm^bkI4EgxUOhpAjTmTjO0F!VkWvysS6EEMsIJiTZcvsh#WI$IH$TYqiWvK!%&Ue&nk55ak%BX%9!NnG%32ftud%YkBO$U6o to the extent that anyone claiming that the second was a possibility would be laughed at.
In the same way, if you're trying to decide between the two potential passwords password and Wsg5Nm^bkI4EgxUO, it's not as difficult to do as some mathematicians would have you believe.
By definition, a cryptographic hash function should not be invertible and should have the least collisions possible.
Regarding your question: it is a one way hash. The input (irrespective of length) will generate a fixed size output, which will be padded based on algo (512 bit boundary for MD5). The information is compressed (lost) and practically not possible to generate from reverse transforms.
Additional info on MD5: it is vulnerable to collisions. I have gone through this article recently,
http://www.win.tue.nl/hashclash/Nostradamus/
Open source code for crypto hash implementations (MD5 and SHA) can be found at Mozilla code.
(freebl library).
I like all the various arguments.
It is obvious the real value of hashed values is simply to provide human-unreadable placeholders for strings such as passwords.
It has no specific enhanced security benefit.
Assuming an attacker gained access to a table with hashed passwords, he/she can:
Hash a password of his/her own choice and place the results inside the password table if he/she has writing/edit rights to the table.
Generate hashed values of common passwords and test the existence of similar hashed values in the password table.
In this case weak passwords cannot be protected by the mere fact that they are hashed.