Hashing Similar Strings to the Same Hash Value

Is there a hashing algorithm that can hash similar text documents to the same hash value?
For example,
A = "This is Sample Text 1"
B = "This is Sample Text 2"
A and B need to be hashed to the same value.
I have done a bit of research and read about SimHash and LSH algorithms.
SimHash gives similar documents similar (not necessarily identical) hash values, and similarity can be defined using Hamming distance.
Ideally I want something like: "If String A and String B differ by less than an acceptable threshold (t < tmax), hash A and B to the same hash value."

An obvious option is to use Soundex or one of its variants (depending on the language of these words).
You didn't specify what you need this for.
If you need to create some sort of hashtable variant, where you place similar strings in the same bucket, soundex variants could work, but you need to take the possibility that you could have collisions into account.
If you only need some indication of how similar two strings are, you can also look at an algorithm called Simil; see this link, or spell checking related algorithms.

Related

Hash UUIDs without requiring ordering

I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary length data inputs into fixed-width, homogenous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined, your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding the pair to a cryptographic hash function, but you would be hard pressed to find one that guarantees no collisions, even with fixed-width inputs and a 256-bit output width.
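The smaller-first concatenation described above can be sketched as follows. This is a minimal illustration in Python rather than Go, and the function name `unordered_pair_key` is hypothetical. It relies on Python's arbitrary-precision integers to hold the 256-bit result.

```python
import uuid

def unordered_pair_key(m: uuid.UUID, n: uuid.UUID) -> int:
    """Return a 256-bit key that is identical for (m, n) and (n, m).

    The two 128-bit values are sorted and then concatenated, so no
    information is lost and distinct unordered pairs get distinct keys.
    """
    lo, hi = sorted((m.int, n.int))
    return (lo << 128) | hi

# Illustrative values only.
a = uuid.UUID("12345678-1234-5678-1234-567812345678")
b = uuid.UUID("00000000-0000-0000-0000-000000000001")
key = unordered_pair_key(a, b)
```

Note that the result is trivially reversible: the two UUIDs can be read straight back out of the upper and lower 128 bits.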

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9-byte strings. There are some formal references on the Wikipedia page. A search of the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
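As a rough illustration of the search-based approach, here is a minimal sketch of a plain binary search over sorted, non-overlapping range bounds. It is not an optimal (frequency-weighted) search tree, and the table contents and the name `lookup` are hypothetical, taken from the pseudo-code in the question.

```python
from bisect import bisect_right

# Hypothetical case table: (low, high, handler name), sorted by low,
# with no overlapping ranges. Single values are ranges with low == high.
CASES = [
    (2, 2, "func1"),
    (3, 3, "func2"),
    (8, 8, "func3"),
    (33, 122, "func4"),
    (10000, 10000, "func40"),
]
LOWS = [low for low, _, _ in CASES]

def lookup(value):
    """Return the handler name for value, or None if it is uninteresting."""
    # Find the last range whose lower bound is <= value.
    i = bisect_right(LOWS, value) - 1
    if i >= 0 and value <= CASES[i][1]:
        return CASES[i][2]
    return None
```

Each lookup costs O(log N) comparisons for N cases, and the table can be rebuilt slowly whenever the set of interesting values changes.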
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

How to calculate equal hash for similar strings?

I am building an anti-plagiarism tool using the shingle method. For example, I have the following shingles:
I go to the cinema
I go to the cinema1
I go to th cinema
Is there a method of calculating the equal hash for these lines?
I know of the existence of Levenshtein distance. However, I do not know which string I should take as the source to compare against. Maybe there is a better way than using Levenshtein distance.
The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character that hash to different values.
Small proof:
Consider all possible strings.
Assume all of these hash to at least 2 different values.
Take any 2 strings A and B that hash to different values.
You can obviously go from A to B by just changing one character at a time.
Thus at some point the hash will change.
Thus at this point the hash will be different for a single character change.
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. Probably won't work too well since a single character omission will cause a significant difference in the hash values.
Check a range of hashes. A hash is one dimensional, but string similarity is not, thus this probably won't work either.
All in all, hashing is probably not the way to go.
This question is a bit old, but you may be interested in this paper by two researchers at AT&T. They employ a technique that is reminiscent of the Nilsimsa hash to detect when similar SMS messages have been seen an "abnormal" number of times in a time window.
It sounds like locality-sensitive hashing would also be pertinent to your problem.
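To make the locality-sensitive idea concrete, here is a minimal SimHash-style sketch over character shingles. The shingle size, the use of MD5 as the per-shingle hash, and all function names are assumptions chosen for illustration. Similar strings tend to produce fingerprints with a small Hamming distance, but they are not guaranteed to produce identical fingerprints.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of overlapping k-character shingles of text."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def simhash(text, k=3, bits=64):
    """Compute a SimHash-style fingerprint (bits must be a multiple of 8, <= 128)."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

d = hamming(simhash("I go to the cinema"), simhash("I go to th cinema"))
```

Two shingles would then be treated as near-duplicates when `d` falls below a threshold chosen for the application.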

Checking for string matches using hashes, without double-checking the entire string

I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?
I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)
To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.
Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.
For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash?
No, and there can't be. Think about it: the hash has a fixed, finite length, but the strings can be arbitrarily long. Say for argument's sake that the hash is 32 bits, which allows about 4.3 billion distinct values. Can you create more unique strings of the same length than that? Of course you can: any length above four bytes already allows more distinct strings than there are hash values, so at least two of them must share a hash, and comparing the hashes is not enough to guarantee equality. This argument scales to longer hashes.
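For djb2 specifically, same-length collisions are easy to exhibit even for very short strings. The sketch below assumes the common multiply-by-33 variant starting from 5381 and truncated to 32 bits; the asker's actual implementation may differ.

```python
def djb2(s: str) -> int:
    """djb2 string hash (assumed variant: h = h * 33 + byte, 32-bit wraparound)."""
    h = 5381
    for byte in s.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h

# "ab" and "bA" have the same length and the same hash: raising the first
# byte by 1 adds 33 to the total, and lowering the second byte by 33 cancels it.
collision = djb2("ab") == djb2("bA")
```

So with this hash, matching hash and matching length still do not imply matching strings, and the full comparison remains necessary.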
If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.
Some of the algorithms used for cyclic redundancy checks have guarantees like if there's exactly one bit different then the CRC is guaranteed to be different over a certain run length of bits, but that only works for relatively short runs.
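You should be safe from collisions in practice if you use a modern cryptographic hash function such as one of the Secure Hash Algorithm (SHA) variants (for example SHA-256): collisions still exist in principle, but finding one is considered infeasible.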

Is a hash result ever the same as the source value?

This is more of a cryptography theory question, but is it possible that the result of a hash algorithm will ever be the same value as the source? For example, say I have a string:
baf34551fecb48acc3da868eb85e1b6dac9de356
If I get the SHA1 hash on it, the result is:
4d2f72adbafddfe49a726990a1bcb8d34d3da162
In theory, is there ever a case where these two values would match? I'm not asking about SHA1 specifically here - it's just my example. I'm just wondering if hashing algorithms are built in such a way as to prevent this.
Well, it would depend on the hashing algorithm - but I'd be surprised to see anything explicitly prevent this. After all, it really shouldn't matter.
I suspect it's very unlikely to happen, of course (for cryptographic hashes)... but even if it does, that shouldn't cause a problem.
For non-crypto hashes (used in hash tables etc) it would be perfectly reasonable to return the source value in some cases. For example, in Java, Integer.hashCode() just returns the embedded value.
Sure, the Python hashing algorithm for integers returns the value of the integer. So hash(1) == 1.
Given a good hashing algorithm, one that returns a seemingly random output, I believe there should be on average one input that gives itself as the output. Let's say the hash can give N possible outputs. That means there are N possible inputs for which this is possible. For each of those, the odds of the output matching the input are 1/N, so the expected number of fixed points is N*1/N, or 1.
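The expected-value argument above can be checked with a small simulation. This sketch models the hash as a uniformly random function on `range(n)`, which is an idealization rather than any real hash, and the function names and parameters are hypothetical.

```python
import random

def count_fixed_points(n, rng):
    """Count x in range(n) with f(x) == x for a random function f.

    Each f(x) is drawn independently and uniformly from range(n).
    """
    return sum(1 for x in range(n) if rng.randrange(n) == x)

def average_fixed_points(n=256, trials=2000, seed=0):
    """Average number of fixed points over many random functions."""
    rng = random.Random(seed)
    return sum(count_fixed_points(n, rng) for _ in range(trials)) / trials

avg = average_fixed_points()
```

The average should come out close to 1 regardless of `n`, matching the N * 1/N calculation, although any single random function may have zero, one, or several fixed points.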
A hash function might be defined to avoid ‘fixed points’ where hash(x)==x, but your hash-quine differs a little in that you're taking the string representation in hex of the hash rather than the raw binary. It would, I think, be infeasible to design a hash that could frustrate that, and it's mathematically less interesting since it depends on the arbitrary mapping of 0-F to ASCII character codes.
See Is there an MD5 Fixed Point where md5(x) == x? for a discussion about fixed points in MD5. The probability calculation would be equally true for hex hash-quines and any other hash function with 128 bits of output.