How to calculate equal hash for similar strings? - hash

I am building a plagiarism detector (Antiplagiat) and I use the shingle method. For example, I have the following shingles:
I go to the cinema
I go to the cinema1
I go to th cinema
Is there a method of calculating the equal hash for these lines?
I know about Levenshtein distance. However, I do not know which string I should take as the reference to measure against. Maybe there is a better way than Levenshtein distance.

The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character that hash to different values.
Small proof:
Consider all possible strings.
Assume all of these hash to at least 2 different values.
Take any 2 strings A and B that hash to different values.
You can obviously go from A to B by just changing one character at a time.
Thus at some point the hash will change.
Thus at this point the hash will be different for a single character change.
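The argument above can be replayed in code. This is only a minimal sketch: the helper names (md5_hex, first_hash_change) are made up for illustration, MD5 is just a stand-in for "some hash", and the walk is restricted to equal-length strings so that each step is a single substitution.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hash used for the demonstration; any other hash would do."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def first_hash_change(a: str, b: str, hash_fn=md5_hex):
    """Morph a into b one character at a time.

    Returns the first pair of neighbouring strings (differing in exactly
    one position) whose hashes differ, or None if the hash never changes.
    Assumes len(a) == len(b) to keep the sketch short.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length strings")
    current = a
    for i in range(len(a)):
        if current[i] == b[i]:
            continue  # nothing to change at this position
        candidate = current[:i] + b[i] + current[i + 1:]
        if hash_fn(candidate) != hash_fn(current):
            return current, candidate
        current = candidate
    return None


pair = first_hash_change("I go to the cinema", "I go to the museum")
```

Whatever hash is plugged in as hash_fn, the function returns None only when the two endpoints (and everything in between) hash to the same value.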
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. Probably won't work too well since a single character omission will cause significant difference in the hash values.
Check a range of hashes. A hash is one dimensional, but string similarity is not, thus this probably won't work either.
All in all, hashing is probably not the way to go.

This question is a bit old, but you may be interested in this paper by two researchers at AT&T. They employ a technique that is reminiscent of the Nilsimsa hash to detect when similar SMS messages have been seen an "abnormal" number of times in a time window.
It sounds like Locality Sensitive Hashing would also be pertinent to your problem.
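One common locality-sensitive scheme for set similarity is MinHash. The sketch below is not taken from the paper mentioned above; it is a rough illustration that assumes character 3-grams as shingles and uses salted BLAKE2b from Python's hashlib as the family of hash functions. All function names are hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping character k-grams of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the smallest hash over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions; this estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


base = minhash_signature(shingles("I go to the cinema"))
typo = minhash_signature(shingles("I go to th cinema"))
other = minhash_signature(shingles("Completely unrelated sentence"))

sim_typo = estimated_similarity(base, typo)
sim_other = estimated_similarity(base, other)
```

Two similar strings do not get one equal hash, but many positions of their signatures agree, so sim_typo should come out far higher than sim_other. Grouping signature positions into bands and hashing each band is the usual next step for turning this into a lookup structure.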

Related

Comparing hashes to test for collisions

I wish to compare hashes to check for collisions (yes, I know it is time consuming, but never mind that). In checking for collisions, hashes need to be compared. Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
I would prefer the first option because it is much faster, but is there a recommended method? Are you less likely to find a collision by using the first method?
Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
Neither.
I would prefer the first option because it is much faster, but is there a recommended method?
I don't understand why you think the first method might work, but then you haven't fully explained your situation. Still, if you want to detect hash values that repeat, you do indeed need to keep track of already-seen hash values. To do that you don't want to search linearly through a list; you should use a set container to store the seen hashes. A hash table - as suggested in a comment by gnasher729 a few hours back - would give O(1) performance (e.g. in C++, if your hashes are 64 bit, std::unordered_set<uint64_t>), or a balanced binary tree would give O(logN) performance (e.g. C++ std::set<uint64_t>).
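As a rough Python counterpart to the C++ containers named above, the sketch below keeps every seen hash in a hash-based container. It uses a dict rather than a plain set so that the earlier input can be reported, and it truncates MD5 to one byte purely so that collisions show up in a small demo; both choices, and the function names, are assumptions made for illustration.

```python
import hashlib


def short_hash(data: bytes, nbytes: int = 1) -> bytes:
    """MD5 truncated to nbytes so that collisions actually occur (demo only)."""
    return hashlib.md5(data).digest()[:nbytes]


def find_collisions(items, nbytes: int = 1):
    """Return (earlier, later) pairs of distinct inputs that share a hash."""
    seen = {}  # hash value -> first input that produced it
    collisions = []
    for item in items:
        digest = short_hash(item, nbytes)
        earlier = seen.setdefault(digest, item)  # average O(1) lookup
        if earlier != item:
            collisions.append((earlier, item))
    return collisions


inputs = [str(i).encode("ascii") for i in range(300)]
collisions = find_collisions(inputs)            # 1-byte hash: collisions are unavoidable
no_collisions = find_collisions(inputs, 16)     # full 128-bit MD5
```

With 300 distinct inputs and only 256 possible one-byte hashes, at least 44 collisions must be reported. Comparing each new hash against a single stored value would miss nearly all of them.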
Are you less likely to find a collision by using the first method?
You're very likely to miss collisions.
All that said, you may want to reexamine your premise. The chance of a good (cryptographic quality) hash function producing collisions closely approaches the odds described by the "birthday paradox". As a rule of thumb, if you have 2^N distinct values to hash you're statistically unlikely to see collisions if your hashes are comfortably more than 2*N bits wide: if you allow enough "comfort", you're more likely to be hit on the noggin by a meteor than have your program see a collision. You mentioned MD5 so I'd expect 128 bits: unless you're storing order-of a quadrillion values or more (literally), it's pretty safe to ignore the potential for collisions.
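The rule of thumb can be checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2M)) for n items hashed into M possible values. The sketch below assumes an ideally uniform hash; the function name is made up for illustration.

```python
import math


def collision_probability(n_items: int, space_size: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * space_size))."""
    exponent = n_items * (n_items - 1) / (2 * space_size)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny probabilities


# A quadrillion values under an ideal 128-bit hash.
p_md5_quadrillion = collision_probability(10**15, 2**128)

# 2^64 values under a 128-bit hash: the point where 2*N bits stops being "comfortable".
p_md5_at_2_64 = collision_probability(2**64, 2**128)

# The classic birthday paradox as a sanity check: 23 people, 365 days.
p_birthdays = collision_probability(23, 365)
```

Under these assumptions the quadrillion-value case stays far below one in a hundred million, while at 2^64 values the probability is already well above a third.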
Do note one important use of hash values where collisions happen more often for a different reason, and that's in hash tables, where even non-colliding hash values may collide at the same bucket index after they're "wrapped" - often a la h % N when N is the number of buckets. In general, it's impractical to ignore the potential for collisions in a hash table, and very unwise to try.
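A tiny illustration of the wrapping effect, assuming the h % N scheme mentioned above (the function name and the values are made up):

```python
def bucket_index(hash_value: int, num_buckets: int) -> int:
    """Map a hash value onto one of num_buckets buckets, a la h % N."""
    return hash_value % num_buckets


h1, h2 = 5, 13  # two different hash values
same_bucket = bucket_index(h1, 8) == bucket_index(h2, 8)  # both land in bucket 5
```

The hash values differ, yet with 8 buckets they share a slot, which is why a hash table still has to compare keys inside a bucket.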

Hash UUIDs without requiring ordering

I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary length data inputs into fixed-width, homogenous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined, your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding the values to a cryptographic hash function, but you would be hard pressed to find something that guaranteed no collisions, even with fixed-width inputs and a 256-bit output width.
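The smaller-first concatenation can be sketched in a few lines. The question is about Go; the sketch below uses Python's uuid module only to show the idea, and pair_key is a hypothetical name.

```python
import uuid


def pair_key(m: uuid.UUID, n: uuid.UUID) -> bytes:
    """Order-independent, collision-free 256-bit key for a pair of UUIDs."""
    lo, hi = sorted((m.bytes, n.bytes))  # smaller UUID first (big-endian byte order)
    return lo + hi                       # 16 + 16 bytes = 256 bits


a = uuid.UUID("12345678-1234-5678-1234-567812345678")
b = uuid.UUID("00000000-0000-0000-0000-000000000001")
key = pair_key(a, b)
```

Because no information is discarded, the key is trivially reversible, which the question allows. The same approach carries over to Go by comparing the two 16-byte arrays and copying them into a 32-byte array in sorted order.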

Hashing Similarity

Normally, the goal of hashing is to turn a continuous function into a discrete one: a small change in the input should cause a large change in the output. However, is there any hashing algorithm that will, (very) roughly speaking, return similar (but still different) hashes for similar inputs?
(An example of the use of this would be to check whether two files are "similar" by checking their hashes for similarity. Of course, some failure is always acceptable.)
Look at Locality Sensitive Hashing (LSH). That is a probabilistic way of quickly finding a bunch of points near a given one, for example.
Given a distance function that tells you how similar or different are your objects, you can also employ distance permutations:
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70815
or sketches:
http://portal.acm.org/citation.cfm?id=1638180
For an implementation of the latter approach:
http://obsearch.net
You really don't want to see similar hashes. Hashing is meant to ensure integrity, so the slightest change in your file/app/program will produce an entirely different hash. If two different strings show the same hash, this is called a collision, and the hashing algorithm is now compromised. MD5 has some collisions but is still used today.

How are hash functions like MD5 unique?

I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness, however there are approximately 3.402823669209387e+38 different values in a 32 digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (I.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.
Edit: (given recent news re: SHA1 hashes)
The answer above still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 hashes are 2^32 (over 4 billion) times less likely to collide, but it has now been demonstrated that it is possible to contrive inputs that produce the same SHA-1 value. (This was demonstrated against MD5 quite some time ago.) If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days you need at least SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security check context, MD5 still has its usefulness.
The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because using the same hash on 2 values/objects must yield the same value. Hashes are deterministic. There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well, there was a 1 in 2^128 chance of collision (ideal case), so we can assume O = P." This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
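The two outcomes above map onto a short check. This is a sketch only: it uses SHA-256 from Python's hashlib rather than MD5, and the function name and the optional original argument are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def matches(candidate: bytes, expected_digest: str,
            original: Optional[bytes] = None) -> bool:
    """Compare by hash first; fall back to a full comparison when possible."""
    if hashlib.sha256(candidate).hexdigest() != expected_digest:
        return False  # different hashes: the objects are certainly different
    if original is not None:
        return candidate == original  # option b: rule out a false positive
    return True  # option a: trust the hash


original = b"hello"
digest = hashlib.sha256(original).hexdigest()

same = matches(b"hello", digest)                  # only the hash is available
different = matches(b"hellp", digest)             # hashes differ, no false negative
confirmed = matches(b"hello", digest, original)   # hash plus direct comparison
```

The first branch is where hashing pays off: a mismatch settles the question without ever looking at the original object.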
Cryptographic one-way hash functions are, by nature of definition, not Injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affect their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a pre-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.
While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2^128 possible hashes in total, so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always make sure to add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to result in the same output string because otherwise the raw input string that matches the output will fail to match after the automated salting (i.e. you can't just "reverse" the MD5 and use it to log in because the reversed MD5 hash will most likely not be the salted string that originally resulted in the creation of the hash).
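A minimal sketch of salted password hashing follows. Note the substitution: instead of salted MD5 as discussed above, it uses PBKDF2-HMAC-SHA256 from Python's hashlib, and the iteration count and function names are illustrative assumptions rather than recommendations.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("hunter2")
ok = verify_password("hunter2", salt, stored)
rejected = verify_password("hunter3", salt, stored)
```

Because the salt is stored next to the digest and mixed in before hashing, a string that merely happens to hash to the stored value without the salt is of no use for logging in.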
So hashes are not unique, but the authentication mechanism can be made to make it sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that results in the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically everyone else) said, it's not perfect, but it does the job, and collision performance really depends on the algorithm (which is actually pretty good).
What is of real interest is automatic manipulation of files or data to keep the same hash with different data, see this Demo
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.

Is a hash result ever the same as the source value?

This is more of a cryptography theory question, but is it possible that the result of a hash algorithm will ever be the same value as the source? For example, say I have a string:
baf34551fecb48acc3da868eb85e1b6dac9de356
If I get the SHA1 hash on it, the result is:
4d2f72adbafddfe49a726990a1bcb8d34d3da162
In theory, is there ever a case where these two values would match? I'm not asking about SHA1 specifically here - it's just my example. I'm just wondering if hashing algorithms are built in such a way as to prevent this.
Well, it would depend on the hashing algorithm - but I'd be surprised to see anything explicitly prevent this. After all, it really shouldn't matter.
I suspect it's very unlikely to happen, of course (for cryptographic hashes)... but even if it does, that shouldn't cause a problem.
For non-crypto hashes (used in hash tables etc) it would be perfectly reasonable to return the source value in some cases. For example, in Java, Integer.hashCode() just returns the embedded value.
Sure, the Python hashing algorithm for integers returns the value of the integer. So hash(1) == 1.
Given a good hashing algorithm, one that returns a seemingly random output, I believe there should be on average one input that gives itself as the output. Let's say the hash can give N possible outputs. That means there are N possible inputs for which this is possible. For each of those, the odds of the output matching the input are 1/N, so the expected number of fixed points is N*1/N, or 1.
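The expectation argument can be tried out empirically on a toy scale. The sketch below builds a family of 10-bit "hashes" by truncating salted SHA-256 (an assumption made purely for illustration, as are all the names) and counts, for each member of the family, how many inputs hash to themselves.

```python
import hashlib

BITS = 10
N = 1 << BITS  # number of possible inputs and outputs


def tiny_hash(x: int, salt: int) -> int:
    """A BITS-bit hash: SHA-256 of (salt, x), reduced modulo N. Illustrative only."""
    data = salt.to_bytes(4, "big") + x.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def count_fixed_points(salt: int) -> int:
    """Number of inputs x in [0, N) with tiny_hash(x) == x."""
    return sum(1 for x in range(N) if tiny_hash(x, salt) == x)


trials = 100
counts = [count_fixed_points(salt) for salt in range(trials)]
average = sum(counts) / trials
```

Individual salts may give zero, one or several fixed points, but if the truncated hash behaves like a random function the average over many salts should land close to 1.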
A hash function might be defined to avoid ‘fixed points’ where hash(x)==x, but your hash-quine differs a little in that you're taking the string representation in hex of the hash rather than the raw binary. It would, I think, be infeasible to design a hash that could frustrate that, and it's mathematically less interesting since it depends on the arbitrary mapping of 0-F to ASCII character codes.
See Is there an MD5 Fixed Point where md5(x) == x? for a discussion about fixed points in MD5. The probability calculation would be equally true for hex hash-quines and any other hash function with 128 bits of output.