What are the other data that can be retrieved from comparing hash values of two files?


I understand that hash functions like MD5 can be used to tell whether two files (or sets of data) are identical or not. Changing even a single bit changes the hash value of a file. Apart from this, is there any other information to be gained from comparing two hash values, such as the degree to which the two files differ or the location of the changes? Are there any hash functions that can be used to get this information?

None if the hash function is cryptographically secure.
If you are presented with two hashes coming from two files, the only thing you can tell is if the files are exactly, bit for bit, identical (same hash) or not.
One property of a cryptographic hash function is that every bit of the hash depends on multiple bits of the message, and that changing a single bit in the message results in a completely different hash, to the extent that this second hash cannot be distinguished from any other possible hash.
Even with a somewhat vulnerable hash function like MD5, the main thing an attacker could do is create a second document that hashes to the same value (a collision), not infer how closely two documents are related. For that to be possible, the hash function would have to be quite weak.
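To make the point concrete, here is a minimal Python sketch (the helper names md5_bits, flip_bit and differing_bits are made up for this illustration). It flips one bit near the start and one bit near the end of a message and counts how many of the 128 digest bits change. Under the usual assumption that MD5 output behaves like random data, both counts should land somewhere near 64, so the count tells you nothing about how much changed or where.

```python
import hashlib

def md5_bits(data: bytes) -> int:
    """Return the MD5 digest of data as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the digest bits that differ between the MD5 hashes of a and b."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")

original = b"The quick brown fox jumps over the lazy dog"
early_change = flip_bit(original, 0)                      # change in the first byte
late_change = flip_bit(original, len(original) * 8 - 1)   # change in the last byte

print(differing_bits(original, original))      # 0: identical data, identical hash
print(differing_bits(original, early_change))  # typically somewhere near 64
print(differing_bits(original, late_change))   # also near 64: no hint of location
```

The only reliable signal is the first line: a difference of zero means the inputs are (almost certainly) identical.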

Related

How do Hash-functions encode an infinite amount of data into a finite amount?

Hash-functions always create an output with a fixed length, even though the input can be infinitely large.
So how is it possible, that no information is lost here? Shouldn't some inputs result in the same output then?

Yes. Two inputs can result in the same output, resulting in a hash collision.
Hashes are designed so that hashing text is very easy, but reversing the process is difficult. The point of hashing isn't to store information. Instead, hashes are commonly used in security (and also in data structures).
For instance, websites hash users' passwords and store the hashes instead of the plaintext passwords. This way, if the website's security is breached, the attacker obtains only the hashes, which still don't let the attacker log in, because it is very difficult to recover a password from its hash.
The hash set is another application of hashing. By hashing an object and using the hash to decide where it is stored, you can check whether an object is present in the set in roughly constant time: you only have to search through the objects in the set that have the same hash as the object you are checking. As the hash set grows, so does the chance of a hash collision.
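That collisions must exist is just the pigeonhole principle, and it is easy to see with a deliberately tiny hash. The sketch below is illustrative only: tiny_hash is a made-up 8-bit hash (the first byte of the MD5 digest), so any 257 distinct inputs are guaranteed to contain at least one colliding pair.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of the MD5 digest."""
    return hashlib.md5(data).digest()[0]

def first_collision(limit: int = 257):
    """Hash the strings "0", "1", ... and return the first colliding pair."""
    seen = {}
    for i in range(limit):
        text = str(i).encode()
        h = tiny_hash(text)
        if h in seen:
            return seen[h], text, h   # two different inputs, one hash value
        seen[h] = text
    return None

pair = first_collision()
print(pair)  # with 257 inputs and only 256 possible outputs, never None
```

In practice the first collision shows up long before input number 257, which is the birthday effect at work.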

So how is it possible, that no information is lost here?
It's not possible, and lots of information is lost.
In the case of a perfect hash there are no collisions, and we could even argue that information isn't really lost (it's just not contained in the hash alone), because we know all possible inputs and know that no two of them produce the same hash. The hashes are still useful because they can serve as an index in a way that the input data itself cannot, or not as well.
In the case of a hash-based collection we use a hash code chosen to (hopefully) produce few collisions, so we get close to O(1) lookup, and have some means of handling a collision when one does happen.
In the case of a cryptographic hash, collisions exist but are extremely hard to produce deliberately, for (roughly speaking) similar reasons to why it's hard to break modern cryptography. So while two passwords could have the same hash, you couldn't easily find such a pair (especially if you aren't going to use, e.g., a password of several thousand pages of text).
In the case of a checksum hash we could have collisions, but that they're unlikely means that if we have corruption we probably won't have the matching hash.
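The checksum case can be sketched with CRC-32 from Python's standard zlib module. The payload and the flipped bit are arbitrary choices for the example. CRC-32 is guaranteed to detect any single-bit error, so the second comparison below always fails to match.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """CRC-32 checksum of data as an unsigned 32-bit integer."""
    return zlib.crc32(data)

payload = b"hash functions are lossy by design"

damaged = bytearray(payload)
damaged[5] ^= 0x04            # simulate a single bit flipped in transit
damaged = bytes(damaged)

print(crc32_of(payload) == crc32_of(payload))  # True: intact copy matches
print(crc32_of(payload) == crc32_of(damaged))  # False: corruption is detected
```

Heavier corruption can in principle produce a matching checksum, but only by accident, which is the sense of "unlikely" in the answer above.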

Hash Table Confusion - How much space is needed for Hash Table with a good (eg. Cryptographic) Hash Function?

I am learning about Hash Tables, Hash Maps etc. I have just implemented a Hash Table in C, with operations: insert(HTable, key), delete(HTable, key), initialize(HTable) and search(HTable, key).
I would like to ask something. Since in a (proper) Hash Table the computed hashed indexes could be very large, doesn't this mean that the space consumed will be like INT_MAX (which is still O(n) of course), or more? I mean given the input element that we want to store in a hash table (ie insert it in), the insert() function would call the hash function which would then compute the hashed index for the element to go in. Thus it would use the hash function to find this index.
When we use the hash function to operate on the element, the hashed index could become very large. With a proper, for example cryptographic hash function, this index could become huge (they are using prime numbers with 300 digits - Diffie Hellman public key cryptography etc.), right? I know that in normal hash functions (such as the trivial ones beginners use to learn) we apply mod operation in order for the element to fit within the hash table's bounds, but in doing so, maybe we limit the hash function potential?
So to uniquely map an element to the hash table we must use a HUGE Hash Table. How are these cryptographic hash tables implemented? They must be completely secure, right? Even the Stack Overflow tag on "cryptographichashfunction" says that it is extremely unlikely to find two inputs that will map to the same element (as such the possibility of collisions is tiny). Wouldn't this require though a HUGE array to be stored in memory (or to disk)? Therefore, the memory consumption would be huge.
Of course, the time complexity is not a problem. We just take the start address of the hash table / array, add the index to it, and go to that place in memory to get the value (the O(1) search principle of a hash table).
Am I wrong somewhere? Is there something I'm missing? I hope I made myself clear. So to conclude, I would like confirmation on this: does a good hash function require a huge array (hash table), and therefore a very large amount of memory, to be properly implemented? Is so much space justified, or is there something I don't quite get? Thanks.

In general, cryptographic hash values are not used for hash tables. Instead a fast hash is used, and only as many bits of that hash value are used as the size of the table requires. If multiple keys map to the same index, the values are stored in a separate structure, possibly with additional information to choose between them.
The hash output is not required to be unique; a table indexed by the full output of a cryptographic hash would be far too large to fit in memory. Besides that, cryptographic hashes are generally quite slow.
Cryptographic hash functions are usually built from operations that are also used in symmetric block ciphers: mixing and bitwise operations applied over a large number of rounds. Modular arithmetic of the kind used in, e.g., RSA is generally not used.
All in all, the main point is that the generated index doesn't need to be unique. Usually, if one hash leads to multiple values, they are stored in a list or set where the key can be compared by value.
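A minimal sketch of that idea, in Python rather than the asker's C, and with made-up names (ChainedHashTable, bucket_count): the hash value is reduced with a modulo to a small bucket index, and keys that land in the same bucket are kept in a list and told apart by comparing the keys themselves. It uses Python's built-in hash() as the fast, non-cryptographic hash.

```python
class ChainedHashTable:
    """Minimal hash table using separate chaining (illustrative only)."""

    def __init__(self, bucket_count: int = 8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key) -> int:
        # Reduce the (possibly huge) hash value to a valid bucket index.
        return hash(key) % len(self.buckets)

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # same key: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # new key, possibly a collision

    def search(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:          # compare keys by value
                return value
        return None

    def delete(self, key) -> bool:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return True
        return False

table = ChainedHashTable(bucket_count=4)
for n in range(20):          # 20 keys in 4 buckets: collisions are guaranteed
    table.insert(n, n * n)
print(table.search(7))       # 49
```

The memory used is proportional to the number of stored entries plus the number of buckets, not to the range of the hash function.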

Hash sensitivity to data changes

I've seen that many hash algorithms have a common feature: any change in the data produces a total change in the hash code. Even so, I would like to know if there is any known standard hash algorithm with a different behaviour, one where small changes in the data produce small changes in the hash, giving a roughly linear relation between the amount of change in the hash and the amount of change in the data.
One idea for doing this is to build a hash by concatenating several hashes calculated from parts of the data, using either small partial hashes or a bigger final hash. In any case, I would like to know if there is any algorithm with this behaviour.

I think you're looking for something like Simhash. It's actually meant for finding "near duplicates".
e.g. http://irl.cs.tamu.edu/people/sadhan/papers/cikm2011.pdf
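For a rough feel of how a Simhash-style fingerprint works, here is a simplified Python sketch. It is not the algorithm from the linked paper: it splits on whitespace, gives every token weight 1, and uses the first 8 bytes of MD5 as the per-token hash, all of which are assumptions made for brevity. Each token votes on every bit position, and the fingerprint keeps the majority for each bit, so texts sharing most tokens tend to get fingerprints that differ in only a few bits.

```python
import hashlib

def token_hash(token: str) -> int:
    """64-bit hash of a token, taken from the first 8 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str, bits: int = 64) -> int:
    """Simhash-style fingerprint: every token votes on every bit position."""
    votes = [0] * bits
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:                      # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
doc_c = "hash tables store keys in buckets chosen by a hash function"

print(hamming_distance(simhash(doc_a), simhash(doc_b)))  # usually small
print(hamming_distance(simhash(doc_a), simhash(doc_c)))  # usually much larger
```

Unlike a cryptographic hash, the Hamming distance between two fingerprints is meaningful here: it serves as a similarity estimate, not as proof of anything.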

How to generate MD5 Hash value that Hashes to itself?

Is it possible to generate a text file, the content of which is the file's hash/md5 value.
How to write the program?

If such a file exists, it is possible to generate it by trying every possible MD5 hash and checking if its MD5 hash equals it. But since all possible MD5 hashes are a finite set, such a special MD5 value might not even exist at all.
Note: you asked only if it's possible, not how much time it would take.
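The exhaustive search described above is hopeless for real MD5 (2^128 candidates), but it can be run on a shortened hash to see what it looks like. In this Python sketch, short_md5 is a made-up hash that keeps only the first 4 hex digits of the MD5 digest, so there are just 65536 candidates to try.

```python
import hashlib

def short_md5(text: str, hex_digits: int = 4) -> str:
    """Hypothetical shortened hash: the first hex_digits of the MD5 hex digest."""
    return hashlib.md5(text.encode()).hexdigest()[:hex_digits]

def find_fixed_points(hex_digits: int = 4) -> list:
    """Try every possible hash value and keep those that hash to themselves."""
    fixed = []
    for value in range(16 ** hex_digits):
        candidate = format(value, f"0{hex_digits}x")
        if short_md5(candidate, hex_digits) == candidate:
            fixed.append(candidate)
    return fixed

fixed_points = find_fixed_points()
print(fixed_points)  # may well be empty: nothing guarantees a fixed point exists
```

If the hash behaves like a random function, the expected number of fixed points is about one regardless of the digest size, so an empty result and a result with two or three entries are both unsurprising.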

I was interested too, so I wrote the following Pascal program:
program hash;
uses md5;
var a, b: string;
begin
  b := 'd41d8cd98f00b204e9800998ecf8427e'; // md5sum of /dev/null
  repeat
    // hash twice per pass; stop when a value hashes to itself
    a := md5Print(md5String(b));
    b := md5Print(md5String(a));
  until a = b;
  writeln(a);
  writeln(b);
end.
It has been running for about five days already, but there is still no result.

Note that if you wish to brute-force it, trimethoxy's approach is fundamentally flawed. Each hash effectively points to another random hash, and as the chain of hashes grows, it becomes increasingly likely that a newly visited hash will simply point back to a previously visited hash, forming a cycle millions or billions of hashes long.
Unless the entire MD5 hash space forms a single cycle, which is exceedingly unlikely, nearly all starting values lead into a comparatively short cycle that leaves the vast majority of MD5 hashes unvisited.
Basically, even if self-mapping hashes exist, that approach is far more likely to get stuck in an infinite loop than to actually find one.
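The cycle behaviour is easy to observe on a shortened hash. The Python sketch below again uses a made-up short_md5 (first 4 hex digits of MD5) and repeatedly hashes the previous result, recording when each value was first seen. The walk has to revisit a value within 65536 steps; a fixed point would show up as a cycle of length 1.

```python
import hashlib

def short_md5(text: str, hex_digits: int = 4) -> str:
    """Hypothetical shortened hash: the first hex_digits of the MD5 hex digest."""
    return hashlib.md5(text.encode()).hexdigest()[:hex_digits]

def walk_until_repeat(start: str):
    """Iterate the hash from start; return (tail_length, cycle_length)."""
    first_seen = {}
    current, step = start, 0
    while current not in first_seen:
        first_seen[current] = step
        current = short_md5(current)
        step += 1
    tail_length = first_seen[current]    # steps taken before entering the cycle
    cycle_length = step - tail_length    # steps needed to go around the cycle
    return tail_length, cycle_length

tail, cycle = walk_until_repeat("d41d")
print(tail, cycle)   # a cycle length of 1 would mean a fixed point was reached
```

For a random-looking function on N values, tail and cycle are each expected to be on the order of the square root of N, so the walk typically touches only a few hundred of the 65536 values here, and a correspondingly tiny fraction of the real MD5 space.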

How come MD5 hash values are not reversible?

One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:
If on my server, in PHP I produce:
md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.
Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?
What is it about these functions that makes the resulting strings impossible to retrace?

The input can be arbitrarily long, whereas the output is always 128 bits long. This means that an infinite number of input strings generate the same output.
If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?

If hash functions such as MD5 were reversible, it would have been a watershed event in the history of data compression algorithms! It's easy to see that if MD5 were reversible, then chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. You would be able to reconstruct the original message from a 128-bit number regardless of the size of the original message.

Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.
Consider this function (in PHP notation, as the question):
function simple_hash($input) {
return bin2hex(substr(str_pad($input, 16), 0, 16));
}
This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).
print simple_hash("stackoverflow.com");
This will output:
737461636b6f766572666c6f772e636f
This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: we can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex digits. Of course it can't be injective.
But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin to your hash, and you have it). If your original string was at most 16 bytes long, you even get the original string back (plus any padding spaces); our 17-character example loses only its final character. Nothing of this kind should be possible for MD5, even if you know the input was quite short (other than by trying all possible inputs until one matches, i.e. a brute-force attack).
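For readers who want to try this without PHP, here is a Python port of simple_hash together with its trivial inversion. The port assumes ASCII input, and invert_simple_hash is a name made up here for the hex2bin step. Since only the first 16 bytes are kept, the 17-character example string loses its final "m".

```python
def simple_hash(text: str) -> str:
    """Pad with spaces to 16 bytes, keep the first 16 bytes, hex-encode them."""
    return text.encode().ljust(16, b" ")[:16].hex()

def invert_simple_hash(digest: str) -> str:
    """Recover a preimage directly from the digest (the hex2bin step)."""
    return bytes.fromhex(digest).decode()

digest = simple_hash("stackoverflow.com")
print(digest)                      # 737461636b6f766572666c6f772e636f
print(invert_simple_hash(digest))  # stackoverflow.co
print(simple_hash("abc"))          # 61626320202020202020202020202020
```

Feeding the recovered string back into simple_hash gives the same digest again, which is exactly the preimage-finding that a cryptographic hash is supposed to make infeasible.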
The important assumptions for a cryptographic hash function are:
it is hard to find any string producing a given hash (preimage resistance)
it is hard to find any different string producing the same hash as a given string (second preimage resistance)
it is hard to find any pair of strings with the same hash (collision resistance)
Obviously my simple_hash function fulfills none of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provably second-preimage resistant and collision resistant.)
There now exist collision attacks against MD5 (e.g. it is possible to produce a pair of strings, even with a given common prefix, which have the same hash; this takes quite some work, but not an impossible amount), so you shouldn't use MD5 for anything critical.
There is not yet a practical preimage attack, but attacks will get better.
To answer the actual question:
What is it about these functions that makes the resulting strings impossible to retrace?
What MD5 (and other hash functions built on the Merkle-Damgard construction) effectively does is apply an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split into blocks; each of these blocks is used to encrypt the output of the previous block, which is then XORed with its input to prevent reverse calculations.)
Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them).
They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.
For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).
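The structure described above can be sketched with a toy in Python. Everything here is invented for illustration and is in no way secure: toy_encrypt is a small 64-bit Feistel cipher made up for this example, and toy_md_hash pads the message, splits it into 16-byte blocks, uses each block as the key to encrypt the running state, and XORs the result with the previous state. The real MD5 round function, padding, and feed-forward step differ in detail.

```python
MASK32 = 0xFFFFFFFF

def toy_encrypt(key: bytes, block: int) -> int:
    """Toy 64-bit Feistel cipher keyed by 16 bytes. NOT secure."""
    left, right = block >> 32, block & MASK32
    for rnd in range(8):
        offset = (4 * rnd) % 16
        round_key = int.from_bytes(key[offset:offset + 4], "big")
        f = ((right * 0x9E3779B1) ^ round_key ^ rnd) & MASK32
        f = ((f << 7) | (f >> 25)) & MASK32      # rotate left by 7 bits
        left, right = right, left ^ f
    return (left << 32) | right

def toy_md_hash(message: bytes) -> str:
    """Toy Merkle-Damgard style hash with a 64-bit output (16 hex digits)."""
    # Pad: a 0x80 byte, zero bytes, then the message length, to a multiple of 16.
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % 16)
    padded += len(message).to_bytes(8, "big")

    state = 0x0123456789ABCDEF                   # fixed initial "plain text"
    for i in range(0, len(padded), 16):
        block = padded[i:i + 16]                 # message block acts as the key
        state = toy_encrypt(block, state) ^ state
    return f"{state:016x}"

print(toy_md_hash(b"stackoverflow.com"))
```

Retracing this by hand means recovering the key of the cipher from a known input and output, which is the problem a well-designed cipher is built to resist (this toy cipher makes no such claim).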

Cody Brocious's answer is the right one. Strictly speaking, you cannot "invert" a hash function because many strings are mapped to the same hash. Notice, however, that either finding one string that gets mapped to a given hash, or finding two strings that get mapped to the same hash (i.e. a collision), would be major breakthroughs for a cryptanalyst. The great difficulty of both these problems is the reason why good hash functions are useful in cryptography.

MD5 does not create a unique hash value; the goal of MD5 is to quickly produce a value that changes significantly based on a minor change to the source.
E.g.,
"hello" -> "1ab53"
"Hello" -> "993LB"
"ZR#!RELSIEKF" -> "1ab53"
(Obviously that's not actual MD5 output.)
Most hashes (if not all) are also non-unique; rather, they're unique enough, so a collision is highly improbable, but still possible.

A good way to think of a hash algorithm is to think of resizing an image in Photoshop. Say you have an image that is 5000x5000 pixels and you resize it to just 32x32. What you have is still a representation of the original image, but it is much, much smaller and has effectively "thrown away" parts of the image data to make it fit in the smaller size. So if you were to resize that 32x32 image back up to 5000x5000, all you'd get is a blurry mess. And because a 32x32 image is not that large, it is conceivable that another image could be downsized to produce exactly the same pixels!
That's just an analogy but it helps understand what a hash is doing.

A hash collision is much more likely than you would think. Take a look at the birthday paradox to get a greater understanding of why that is.
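The birthday effect can be put into numbers with a short Python sketch. Both function names are made up here, and the model assumes hash values are uniformly distributed. The exact product is fine for small cases; for hash-sized numbers the standard approximation 1 - exp(-n(n-1)/(2d)) is used instead.

```python
import math

def exact_collision_probability(items: int, slots: int) -> float:
    """Exact chance that at least two of `items` uniform picks share a slot."""
    if items > slots:
        return 1.0
    no_collision = 1.0
    for i in range(items):
        no_collision *= (slots - i) / slots
    return 1.0 - no_collision

def approx_collision_probability(items: int, slots: int) -> float:
    """Approximation 1 - exp(-n(n-1)/(2d)), usable for very large values."""
    return -math.expm1(-items * (items - 1) / (2 * slots))

print(exact_collision_probability(23, 365))           # about 0.507
print(approx_collision_probability(2**64, 2**128))    # about 0.39
print(approx_collision_probability(10**9, 2**128))    # vanishingly small
```

So for a 128-bit hash, accidental collisions only become likely at around 2^64 hashed items, far fewer than the 2^128 possible values but still far more than any ordinary data set.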

As the number of possible input files is larger than the number of 128-bit outputs, it's impossible to assign a unique MD5 hash to each possible input.
Cryptographic hash functions are used for checking data integrity or digital signatures (the hash being signed for efficiency). Changing the original document should therefore mean the original hash doesn't match the altered document.
These criteria are sometimes used:
Preimage resistance: for a given hash function and given hash, it should be difficult to find an input that has the given hash for that function.
Second preimage resistance: for a given hash function and input, it should be difficult to find a second, different, input with the same hash.
Collision resistance: for a given hash function, it should be difficult to find two different inputs with the same hash.
These criteria are chosen to make it difficult to find a document that matches a given hash; otherwise it would be possible to forge documents by replacing the original with one that has a matching hash. (Even if the replacement is gibberish, the mere replacement of the original may cause disruption.)
The third criterion (collision resistance) implies the second (second preimage resistance).
As for MD5 in particular, it has been shown to be flawed:
How to break MD5 and other hash functions.

But this is where rainbow tables come into play.
Basically, it is just a large number of values hashed separately, with the results saved to disk. The reversing bit is then "just" a lookup in a very large table.
Obviously this is only feasible for a subset of all possible input values but if you know the bounds of the input value it might be possible to compute it.
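The lookup idea can be sketched in a few lines of Python. Note that this is a plain precomputed table; a real rainbow table additionally uses hash chains to trade lookup time for disk space. The 4-digit PIN input set and the function names are assumptions chosen for the example.

```python
import hashlib

def md5_hex(text: str) -> str:
    """MD5 of a string as 32 hex digits."""
    return hashlib.md5(text.encode()).hexdigest()

def build_lookup_table(candidates) -> dict:
    """Precompute hash -> input for every candidate in a bounded input set."""
    return {md5_hex(c): c for c in candidates}

# Assumption for illustration: the secret is known to be a 4-digit PIN.
pins = (f"{n:04d}" for n in range(10_000))
table = build_lookup_table(pins)

leaked_hash = md5_hex("4821")           # stands in for a hash found in a dump
print(table.get(leaked_hash))           # 4821
print(table.get(md5_hex("hello")))      # None: outside the precomputed set
```

The table "reverses" only hashes of inputs that were hashed in advance, which is why knowing the bounds of the input matters so much.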

Chinese scientists have found a way, called "chosen-prefix collisions", to produce a collision between two different strings.
Here is an example: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5.exe.zip
The source code: http://www.win.tue.nl/hashclash/fastcoll_v1.0.0.5_source.zip

The best way to understand what the most voted answers mean is to actually try to reverse the MD5 algorithm. I remember trying to reverse the MD5crypt algorithm some years ago, not to recover the original message, because that is clearly impossible, but just to generate a message that would produce the same hash as the original. This, at least theoretically, would have given me a way to log in to a Linux device that stored the user:password pair in the /etc/passwd file, using the generated message (password) instead of the original one. Since both messages would have the same resulting hash, the system would recognize my password (generated from the original hash) as valid. That didn't work at all. After several weeks, if I remember correctly, the use of a salt in the initial message killed me. I had to produce not only a valid initial message, but a salted valid initial message, which I was never able to do. But the knowledge I gained from this experiment was nice.

As most have already said MD5 was designed for variable length data streams to be hashed to a fixed length chunk of data, so a single hash is shared by many input data streams.
However, if you ever did need to find out the original data from the checksum, for example if you have the hash of a password and need to find out the original password, it's often quicker to just google the hash (or use whatever search engine you prefer) than to brute-force it. I have successfully found out a few passwords using this method.

Nowadays MD5 hashes (and other hashes, for that matter) are precomputed for a large number of common strings and stored for easy lookup. So although MD5 is not reversible in theory, such databases may let you find out which text produced a particular hash value.
For example, try the following hash at http://gdataonline.com/seekhash.php to find out what text I used to compute it:
aea23489ce3aa9b6406ebb28e0cda430

f(x) = 1 is irreversible. Hash functions aren't irreversible.
This is actually required for them to fulfill their function of determining whether someone possesses an uncorrupted copy of the hashed data. This brings susceptibility to brute force attacks, which are quite powerful these days, particularly against MD5.
There's also confusion here and elsewhere among people who have mathematical knowledge but little cipherbreaking knowledge. Several ciphers simply XOR the data with the keystream, and so you could say that a ciphertext corresponds to all plaintexts of that length because you could have used any keystream.
However, this ignores that a reasonable plaintext produced from the seed password is much, much more likely than another produced by the seed Wsg5Nm^bkI4EgxUOhpAjTmTjO0F!VkWvysS6EEMsIJiTZcvsh#WI$IH$TYqiWvK!%&Ue&nk55ak%BX%9!NnG%32ftud%YkBO$U6o to the extent that anyone claiming that the second was a possibility would be laughed at.
In the same way, if you're trying to decide between the two potential passwords password and Wsg5Nm^bkI4EgxUO, it's not as difficult to do as some mathematicians would have you believe.

By definition, a cryptographic hash function should not be invertible and should have the least collisions possible.
Regarding your question: it is a one-way hash. The input (irrespective of its length) is padded as the algorithm requires (to a 512-bit boundary for MD5) and produces a fixed-size output. The information is compressed (lost), and it is practically impossible to regenerate it by reversing the transformations.
Additional info on MD5: it is vulnerable to collisions. I have gone through this article recently,
http://www.win.tue.nl/hashclash/Nostradamus/
Open source code for crypto hash implementations (MD5 and SHA) can be found in the Mozilla code (freebl library).

I like all the various arguments.
It is obvious the real value of hashed values is simply to provide human-unreadable placeholders for strings such as passwords.
It has no specific enhanced security benefit.
Assuming an attacker gained access to a table with hashed passwords, he/she can:
Hash a password of his/her own choice and place the results inside the password table if he/she has writing/edit rights to the table.
Generate hashed values of common passwords and test the existence of similar hashed values in the password table.
In this case weak passwords cannot be protected by the mere fact that they are hashed.