Change sha1 hash of files

Let's say I have two files with two different SHA-1 hashes. Is there a way to change the hash of one file to the hash of the other file so they both have the same hash value?
The two files have different contents.

Yes.
Either:
Change the content of the file so that the two files are identical, or
Find another set of data that happens to hash to the same value (you would essentially have to brute-force search for it, which would take far more computing power than is practical).
You can't change the hash without changing the content of the file. The hash is just the result of a calculation over the content of the file.
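To make that concrete, here is a minimal Python sketch (the file names are placeholders) showing that the digest is computed purely from the bytes of the file, so the only practical way to give both files the same SHA-1 is to give them the same content:

import hashlib

def sha1_of_file(path):
    # The digest depends only on the bytes read from the file.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder file names: the two printed digests match only if the contents are identical.
print(sha1_of_file("file_a"))
print(sha1_of_file("file_b"))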

Related

How to set one md5 hash to multiple files

How can I set one md5 hash for multiple .ps files?
I know how to change an md5 hash, but I have no idea how to set it to a value of my choosing.
To create two files with the same md5 hash you can have a look here: https://natmchugh.blogspot.co.za/2014/10/how-i-made-two-php-files-with-same-md5.html?m=1 . The tricky part may be getting them to remain valid PS files. In the example, the author takes advantage of PHP's syntax to add the differing bytes in such a way that the code remains valid; perhaps something similar could be done in a PS text section. Also have a look at https://natmchugh.blogspot.co.za/2015/05/how-to-make-two-binaries-with-same-md5.html?m=1
However, usually you do not set the md5 hash but generate it from a file. This allows the recipient of the file to confirm it has not been corrupted, by generating the md5 hash on their end and comparing it with the one you gave them. If you are looking to generate one md5 hash for multiple files, you could:
Compress the files into a single archive and generate the md5 hash from that archive, or
Generate multiple md5 hashes and join them using an algorithm that always produces the same value from the same set of hashes. For example, concatenate the hashes, or create an md5 hash of all the md5 hashes (see the sketch below). The downside is that whoever uses the combined hash will also need to know the algorithm.
Have a look at Combine MD5 hashes of multiple files
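As a rough illustration of the second option above, here is a small Python sketch (file names are hypothetical) that sorts the per-file hashes, concatenates them, and hashes the result; anyone verifying the combined value would need to use the same scheme:

import hashlib

def md5_hex(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def combined_md5(paths):
    # Sort the per-file digests so the result does not depend on the order the files are listed in.
    digests = sorted(md5_hex(p) for p in paths)
    return hashlib.md5("".join(digests).encode("ascii")).hexdigest()

# Hypothetical file names.
print(combined_md5(["a.ps", "b.ps", "c.ps"]))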

Do MD5 and CRC generate "unpredictable" hash values?

Regarding integrity checking of files, I am not sure whether CRC32 or MD5 checksums generate "unpredictable" hash values:
When checking if files are identical, usually CRC32 or MD5 checksum is used. It means that each file that is possibly a duplicate of another is read from the beginning to the end and a unique number will be calculated based on its unique binary content. As a fingerprint, this number is stored and used to compare the file’s contents to other files to determine if they are truly identical. That means a tiny change in a file results in a fairly large and "unpredictable" change in the generated hash.
This is not a proper use of the term "unpredictable". The algorithms are deterministic, which means that they will always produce the same output given the same input. Therefore they are entirely predictable.
Yes, for both a small change in the input will result in a "fairly large change" in the output, on the order of half of the bits of the output.
These checks cannot be used to determine if two files "are truly identical". They can only indicate that there is a very high probability that the two files are identical. You'd need to directly compare the two files to determine if they are truly identical.
On the other hand, if the checks differ, then you know for certain that the files differ.
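Both points are easy to see with a short Python sketch: the same input always yields the same MD5 digest, while flipping a single input bit changes roughly half of the 128 output bits:

import hashlib

a = b"The quick brown fox jumps over the lazy dog"
b = bytearray(a)
b[0] ^= 0x01                      # flip a single bit of the input

da = hashlib.md5(a).digest()
db = hashlib.md5(bytes(b)).digest()

print(hashlib.md5(a).digest() == da)                       # True: the function is deterministic
print(sum(bin(x ^ y).count("1") for x, y in zip(da, db)))  # typically around 64 of 128 bits differ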

What other data can be retrieved from comparing the hash values of two files?

I understand that hash functions like md5 can be used to tell whether two files (or sets of data) are the same or not. Even changing a single bit changes the hash value of a file. Apart from this, is there any other information to be gained from comparing two hash values, such as to what degree the two files differ, or the location of the changes? Are there any hash functions that can be used to get this information?
None if the hash function is cryptographically secure.
If you are presented with two hashes coming from two files, the only thing you can tell is if the files are exactly, bit for bit, identical (same hash) or not.
Among the properties of a cryptographic hash function are that every bit of the hash depends on many bits of the message, and that a change to a single bit of the message results in a completely different hash, to the extent that this second hash cannot be distinguished from any other possible hash.
Even with a somewhat vulnerable hash function like md5, the main thing an attacker could do is create a second document hashing to the same final hash (a collision), not infer the relatedness of two documents. For that to be possible, the hash function would have to be quite weak.

How to know whether two hashes are the same or not -- fastest approach

I am getting data as a hash from some source at two different times. I need to know whether both hashes are the same or not. I do not need to know which key-value pairs differ.
I am thinking of storing the md5sum of the hash, using the Digest::MD5 module, somewhere (such as a database) and then comparing whether the next received hash's md5sum is the same as the previously stored md5sum. If it is not the same, the data in the hash differs.
My hashes are not very big, at most 50 keys in a single hash. Is there any other better and faster approach in Perl?
For such a small dataset there is no need to overoptimize things.
You could use Data::Compare:
use Data::Compare;
print 'structures of %h and %v are ',
    Compare(\%h, \%v) ? "" : "not ", "identical.\n";
I'm assuming the two hash variables are in separate processes.
Hashes (e.g. md5sums) aren't guaranteed to be unique for two different texts, so you need to follow up with a full text compare to be sure.
Hashes are useful if you're going to compare members of a large set, as they reduce the number of times you need to do a full text compare. It's just a waste of time if you only have two strings to compare.
Of course, if rare false positives are not a problem, then using a hash will save storage space.

Commutative, accumulator-based function for calculating a digest of multiple hashes

I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.
The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory, i.e. the hash should be the same if two directories contain the same files but organized in a different sub-directory structure.
The only two methods I can think of are:
Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.
Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. It can't even tell the difference between a directory which contains 1 instance of a file, and a directory which contains three instances of the same file.
It would be nice if there is a function which can be used similar to the way XOR is used in method #2, but is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring-all-the-options/intellectual-curiosity/future-applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).
Thanks.
Order independent hashing of collections of hashes (is essentially what you're looking for, non?)
It sounds like any order independent operation (like addition or multiplication) will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall if multiplication will work as well.
In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.
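A minimal Python sketch of that approach, assuming MD5 per-item hashes and a 128-bit running sum with overflow discarded (the digest algorithm and width are arbitrary choices here):

import hashlib

MASK = (1 << 128) - 1             # keep the running sum at 128 bits, i.e. add modulo 2**128

def content_hash(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def combine(hashes):
    # Addition is commutative, so the order of the items does not matter.
    total = 0
    for h in hashes:
        total = (total + h) & MASK
    return total

files = [b"first file contents", b"second file contents", b"third file contents"]
print(hex(combine(content_hash(f) for f in files)))
print(hex(combine(content_hash(f) for f in reversed(files))))   # same value as above

Unlike XOR, addition also distinguishes one copy of a file from three copies of the same file, since duplicates keep accumulating in the sum.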
As the count of items is important but the order isn't, just sort the list of hashes and then hash the list:
find . -type f -print0 | xargs -0 sha1sum | cut -c -40 | sort | sha1sum
This gives a hash value that is invariant to the directory arrangement.
I found this article: https://kevinventullo.com/2018/12/24/hashing-unordered-sets-how-far-will-cleverness-take-you/
Like @Slartibartfast says, addition is what you want. The interesting thing from the article is that it proves that no matter what "commutative" operation you use, there will always be problem elements. In the case of addition, the problem element is the item with a hash of 0.
While there are several documented approaches to defining a hash function for lists and other containers where iteration order is guaranteed, there seems to be less discussion around best practices for defining a hash function for unordered containers. One obvious approach is to simply sum (+) or xor (⊕) the hashes of the individual elements of the container. A downside to these approaches is the existence of “problem elements” which hash to 0; when such elements are inserted into any container, that container’s hash will remain unchanged. One might suspect that this is due to the structured nature of addition or xor, and that a more clever choice of hash function on the unordered container could avoid this. In fact, at the end of the post, we’ll mathematically prove a proposition which roughly states that any general purpose method for hashing unordered containers, which can be incrementally updated based on the existing hash, is essentially equivalent to one of the more “obvious” choices in that it has the same algebraic structure, and in particular has the same “problem” elements.
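To see the “problem element” concretely, continuing the additive Python sketch above: an item whose hash happens to be 0 is invisible to the sum.

existing = [0x1234, 0xabcd]
print(hex(combine(existing)))
print(hex(combine(existing + [0x0])))   # identical output: adding the zero hash changes nothing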
If you have Google Guava available, it provides a utility method, Hashing.combineUnordered(), that does what you want. (Internally, this is implemented by adding all the hashes together.)
https://code.google.com/p/guava-libraries/wiki/HashingExplained