hash collision and appending data - hash

Assume I have two strings (or byte arrays) A and B which both have the same hash (with hash I mean things like MD5 or SHA1). If I concatenate another string behind it, will A+C and B+C have the same hash H' as well? What happens to C+A and C+B?
I tested it with MD5 and in all my tests, appending something to the end made the hash the same, but appending at the beginning did not.
Is this always true (for all inputs)?
Is this true for all (well-known) hash functions? If not, is there a (well-known) hash function where A+C and B+C will not collide (and C+A and C+B will not either)?
(Apart from MD5(x + reverse(x)) and other constructed stuff, I mean.)

Details depend on the hash function H, but generally they work as follows:
1. Consume a block of input X (say, 512 bits)
2. Break the input into smaller pieces (say, 32 bits) and update the hash's internal state based on the input
3. If there's more input, go to step 1
4. At the end, spit the internal state out as the hash value H(X)
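So, if A and B leave the hash in the same internal state after they have been consumed, updating that state further with the same input C gives the same resulting hash value. This explains why H(A+C) equals H(B+C) in your tests. It is only guaranteed when A and B have the same length and end on a block boundary, as the published MD5 collision pairs do: the final hash also covers the padding and the message length, so two inputs of different lengths with H(A) = H(B) were not necessarily in the same state before the padding was added.
C+A and C+B will generally not collide. The collision between A and B was found for the state the hash is in when it starts from its initial value; after consuming C the state is different, so A and B update it in different ways, whatever the length of C.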
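The behaviour described above can be tried out on a toy hash. The following is only a sketch under stated assumptions: `toy_hash`, `compress`, the 4-byte block and the 3-byte state are all made up for illustration (the compression function is a truncated SHA-256, not anything MD5 does), and the state is kept tiny so that a colliding pair of blocks can be found by brute force.

```python
import hashlib

BLOCK = 4   # toy block size in bytes (MD5 uses 64)
STATE = 3   # toy internal state size in bytes (MD5 uses 16)
IV = b"\x00" * STATE

def compress(state: bytes, block: bytes) -> bytes:
    # Hypothetical compression function: truncated SHA-256 of state + block.
    return hashlib.sha256(state + block).digest()[:STATE]

def toy_hash(msg: bytes) -> bytes:
    # Padding in the Merkle-Damgard style: 0x80, zero fill, message length.
    zeros = (-len(msg) - 3) % BLOCK
    padded = msg + b"\x80" + b"\x00" * zeros + len(msg).to_bytes(2, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state

def find_block_collision():
    # Birthday search: two different single blocks that leave the
    # internal state identical when processed from the IV.
    seen = {}
    for i in range(1 << 32):
        block = i.to_bytes(BLOCK, "big")
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block

A, B = find_block_collision()
C = b"any suffix"
print(toy_hash(A) == toy_hash(B))          # same state, same length, same padding
print(toy_hash(A + C) == toy_hash(B + C))  # the shared state keeps evolving identically
print(toy_hash(C + A) == toy_hash(C + B))  # A and B now start from a different state
```

The first two comparisons are equal by construction. The last one starts A and B from whatever state C left behind, so it only matches by accident (about one chance in 2^24 with this toy state size).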

This depends entirely on the hash function. Also, the probability that you have those collisions is really small.

The hash functions being discussed here are typically cryptographic (SHA1, MD5). These hash functions have an Avalanche effect -- the output will change drastically with a slight change in the input.
The prefix and suffix extension of C will effectively make a longer input.
So, adding anything to the front or rear of the input should change the effective hash outputs significantly.
I do not understand how you did the MD5 check; here is my test:
echo "abcd" | md5sum
70fbc1fdada604e61e8d72205089b5eb
echo "0abcd" | md5sum
f5ac8127b3b6b85cdc13f237c6005d80
echo "abcd0" | md5sum
4c8a24d096de5d26c77677860a3c50e3
Are you saying that you located two inputs which had the same MD5 hash and then appended something to the end or beginning of the input and found that adding at the end resulted in the same MD5 as that for the original input?
Please provide samples with your test results.

Related

Prefix preserving hash function

I am looking for a hash function f() whose outputs preserve the prefix of the inputs. The detailed requirements are as follows.
f() takes variable-length bit strings as input and outputs bit strings;
assume a and b are bit strings and a is a substring of b, then f(a) is also a substring of f(b);
the length of the output bit string should be smaller than the input bit string.
Any idea?
There is no hash function that meets your criteria.
Suppose you have such a hash function Hash that preserves prefixes, then answer these questions:
(1) Hash("a") =? It could be anything, right?
(2) What about Hash("xa")=? to preserve the prefix, it has to be
|Hash("xa")-Hash("a")| + Hash("a")
(3) What about Hash("yxa")=? similarly as (2), it has to be
|Hash("yxa")-Hash("xa")| + |Hash("xa")-Hash("a")| + Hash("a")
So the hash will always be longer than the original.

case and white space insensitive hashing function

I am looking for a hashing function that is case insensitive and ignores white spaces as well.
for example:
the hash values generated for "this is a hash" and "ThisIsAHash" should be exactly the same.
does any such hash function exist?
Hash functions are whatever we make them. For example:
First, for all strings:
Step1. Lowercase them (or uppercase them).
Step2. Strip all whitespace.
By now, both strings map to: thisisahash
Step3. Now apply any hash function to the result: crc32, Java's polynomial hash, or whatever.
Given a string, you can now always do a lookup and see whether other strings hash to the same key.
Note that hash functions are one-way anyway, so applying Step1 and Step2 first does not make the result any less valid as a hash.
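The three steps can be sketched in a few lines of Python. This is just an illustration: `normalized_hash` is a made-up name, and CRC32 from the standard `zlib` module stands in for "any hash function".

```python
import zlib

def normalized_hash(text: str) -> int:
    # Step1 and Step2: lowercase and strip every whitespace character.
    canonical = "".join(text.split()).lower()
    # Step3: apply any hash function to the canonical form (CRC32 here).
    return zlib.crc32(canonical.encode("utf-8"))

print(normalized_hash("this is a hash") == normalized_hash("ThisIsAHash"))  # True
```

Both inputs are reduced to the same canonical string before hashing, so they necessarily get the same value.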

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect Client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
All other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, so I searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before, and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value produced by it is similar to those shown on Wikipedia, a hex digest, but is different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import base64
import math
import tiger  # the tiger-hash-python module mentioned above (module name as used below)

def strdivide(list,length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(list).
    # The additional maths operations are to deal with the last block which might have a smaller size
    for i in range(0,int(math.ceil(float(len(list))/length))):
        result.append(list[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data) # From the aforementioned tiger-hash-python script, 48-char hex digest
    # Representation transform: reverse the byte order within each 16-hex-digit (8-byte) group
    result = "".join([ "".join(strdivide(result[i:i+16],2)[::-1]) for i in range(0,48,16) ])
    bits = "".join([chr(int(c,16)) for c in strdivide(result,2)]) # Converting every 2 hex characters into 1 byte (a Python 2 str)
    result = base64.b32encode(bits) # Result will be 40 characters
    return result[:-1] # Leaving behind the trailing '='
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
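For reference, the same chain of steps can be written for Python 3 with only the standard library. This is a sketch: `tth_hex_to_dc` is a hypothetical name, and it assumes the input is the 48-character hex digest in the byte order quoted above, so the per-word reversal may not be needed with a library that already emits the other order.

```python
import base64

def tth_hex_to_dc(hex_digest: str) -> str:
    raw = bytes.fromhex(hex_digest)
    # Reverse the byte order inside each 8-byte (64-bit) word.
    swapped = b"".join(raw[i:i + 8][::-1] for i in range(0, len(raw), 8))
    # Base32-encode and drop the trailing '=' padding.
    return base64.b32encode(swapped).decode("ascii").rstrip("=")

# Empty-file TTH from the text above.
print(tth_hex_to_dc("8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A"))
# LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ
```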
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32-encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you are using the capital letters A-Z and the decimal digits 2 to 7, and one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means that every 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you need 39 base32 digits. The official encoding pads this to 40 characters with a trailing =, which seems to be omitted here, since you say that there are always 39 characters.
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
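A lookup table plus bit-shifting could look like the following Python sketch. The function name `base32_nopad` is made up; the alphabet is the one from RFC 4648 section 6, and the `=` padding is simply never produced.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 base32 alphabet

def base32_nopad(data: bytes) -> str:
    out = []
    acc = 0     # bit accumulator
    nbits = 0   # number of not-yet-emitted bits held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        # Emit one base32 digit for every complete group of 5 bits.
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(acc >> nbits) & 0x1F])
        acc &= (1 << nbits) - 1  # keep only the leftover bits
    if nbits:
        # Final partial group: pad with zero bits on the right.
        out.append(ALPHABET[(acc << (5 - nbits)) & 0x1F])
    return "".join(out)

print(base32_nopad(bytes.fromhex("5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6")))
# LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ
```

In practice the standard library's `base64.b32encode` does the same job; the hand-written version only shows where the 39 characters come from.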

Perl autoincrement of string not working as before

I have some code where I am converting some data elements in a flat file. I save the old:new values to a hash which is written to a file at the end of processing. On subsequent executions, I reload the file into a hash so I can reuse previously converted values on additional data files. I also save the last conversion value so that if I encounter an unconverted value, I can assign it a new converted value and add it to the hash.
I had used this code before (back in Feb) on six files with no issues. I have a variable that is set to ZCKL0 (last character is a zero) which is retrieved from a file holding the last used value. I apply the increment operator
...
$data{$olddata} = ++$dataseed;
...
and the resultant value in $dataseed is 1 instead of ZCKL1. The original starting seed value was ZAAA0.
What am I missing here?
Do you use the $dataseed variable in a numeric context in your code?
From perlop:
If you increment a variable that is
numeric, or that has ever been used in
a numeric context, you get a normal
increment. If, however, the variable
has been used in only string contexts
since it was set, and has a value that
is not the empty string and matches
the pattern /^[a-zA-Z]*[0-9]*\z/, the
increment is done as a string,
preserving each character within its
range.
As previously mentioned, ++ on strings is "magic" in that it operates differently based on the content of the string and the context in which the string is used.
To illustrate the problem and assuming:
my $s='ZCL0';
then
print ++$s;
will print:
ZCL1
while
$s+=0; print ++$s;
prints
1
NB: In other popular programming languages, the ++ is legal for numeric values only.
Using non-intuitive, "magic" features of Perl is discouraged as they lead to confusing and possibly unsupportable code.
You can write this almost as succinctly without relying on the magic ++ behavior:
s/(\d+)$/ $1 + 1 /e
The e flag makes it an expression substitution.

Hash(m1 xor m2) = Hash(m1) xor Hash (m2) Is this true in case of SHA1

Can anyone shed some knowledge on this?
My answer is no, it is not true, because SHA1 has a strong collision resistant property.
No this isn't true. (And it would only take a few seconds to actually test it yourself.)
No, it is not true. A function would have to go out of its way to have this property. SHA1 incorporates bytes from its stream one block at a time, starting with a predefined initial value. At the end it incorporates the length of the byte stream into the byte stream and pads out to the block size.
It makes no attempt to satisfy the property in question (which is a good thing!)
No. To quote from Wikipedia:
Even a small change in the message will, with overwhelming probability, result in a completely different hash due to the avalanche effect.
Here's a counterexample (0xFF xor 0x00 is 0xFF):
$ echo -ne "\xff" > 1
$ echo -ne "\x00" > 2
$ sha1sum *
85e53271e14006f0265921d02d4d736cdc580b0b *1
5ba93c9db0cff93f52b521d7420e43f6eda2784f *2
If your statement were true, the second hash would have to be 00000000..., but it is not.
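The same counterexample can be checked without the shell. A small Python sketch using the standard `hashlib` module (the helper names `sha1` and `xor_bytes` are just for illustration):

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

m1, m2 = b"\xff", b"\x00"
lhs = sha1(xor_bytes(m1, m2))        # Hash(m1 xor m2), which is Hash(m1) here
rhs = xor_bytes(sha1(m1), sha1(m2))  # Hash(m1) xor Hash(m2)
print(lhs.hex())
print(rhs.hex())
print(lhs == rhs)  # False
```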
I'm afraid this will hold only if your hash function is XOR.