Near Duplicate Detection in Data Streams - streaming

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But, I want to identify near duplicates and I also looked at Hashing Algorithms like LSH and MinHash that are used in Nearest Neighbour problems and Near Duplicate Detection.
I am kind of stuck and looking for pointers as to how to proceed and papers/implementations that I could look at?

First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, remove leading and trailing white space; for speed I would perform all these operations in one pass of the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64 bit integers) in a table, if it exists, it is an exact duplicate, if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
To find near duplicates the normalized string needs to be converted into potential signatures (hashes of substrings), see the SpotSigs paper and blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given the Sigs() the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to as does the simhash method.

http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes

Related

How can I address the SHA3 state vector in programming terms?

I've been working on an implementation of SHA3, and I'm getting a bit muddled on this particular aspect of the algorithm. The addressing scheme of the state vector is given by the following diagram:
My issue with the above is: How does one go about addressing this in terms of actual code? I am using a 3 dimensional array to express the state vector, but this leads to obvious issues since the conventional mapping of an array (0 index is first) differs from the above convention used in SHA3.
For example, if I wanted to address the (0,0,0) bit in the SHA3 state array, the following expression would achieve this:
state_vector[2][2][0]
I find this highly cumbersome however because when implementing the actual round algorithms, the intended x and y values do not directly map to the array indices. Addressing state_vector[0][0][0] would return the very first index in the array instead of the (0,0,0) bit in the SHA3 state array.
Is there a way I can get around this in code?
Sorry, I know this is probably a stupid question.
The way this is customarily implemented is as a 5×5 array of 64-bit words, an array of 25 64-bit words or, if you believe your architecture (say, AArch64) will have a lot of registers, as 25 individual 64-bit words. (I prefer the second option because it's simpler to work with.) Typically they are indeed ordered in the typical order for arrays, and one simply rewrites things accordingly.
Usually this isn't a problem, because the operations are specified in terms of words in relation to each other, such as in the theta and chi steps. It's common to simply code rho and pi together such that it involves reading a word, rotating it, and storing it in the destination word, and in such a case you can simply just reorder the rotation constants as you need to.
If you want to get very fancy, you can write this as an SIMD implementation, but I think it's easier to see how it works in a practical implementation if you write it as a one- or two-dimensional array of words first.

SHA-512 Length of the input string impact ?

I am implementing a way to quickly find changes in the sources of my datawarehouse.
After couple of try we have found the hashing all the attribute of a given table and comparing it to the target is one of the most efficient way to compare it.
However the non negligible issue for us is the collision risk. Because I need to trust my data 100%
My understanding is that with SHA-512 it should be close to 0 (2^-256...). But what we cannot find is if the length of my input string can influence the probaility of collision.
Because in the case of a table with 20 field I am confident it will work, but for a table with 280 fields some of them having free text ... I want to be sure.
I know the maximum length of a string is 2^128 but does hashing a longer string of 20.000 character instead of 200, will raise the probability of a collision ?
Thanks for your help.
Hashing algorithms internal functions always work with fixed-length inputs. So when hashing long strings it will split the string into data blocks that are as long as the required input length of the internal function (padding the last one if necessary). Then it will loop over the blocks and combine the output of a block with the current state, the combined output from all the previous blocks.
It's been shown that this construct makes the final hash as much resistant to collision as the internal function. Check the Merkle–Damgård construction (used in SHA-512) article.

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

How to calculate equal hash for similar strings?

I create Antiplagiat. I use a shingle method. For example, I have the following shingles:
I go to the cinema
I go to the cinema1
I go to th cinema
Is there a method of calculating the equal hash for these lines?
I know of the existence of Levenshtein distance. However, I do not know what I should take source word. Maybe there is a better way than to consider Levenshtein distance.
The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character that hash to different values.
Small proof:
Consider all possible strings.
Assume all of these hash to at least 2 different values.
Take any 2 strings A and B that hash to different values.
You can obviously go from A to B by just changing one character at a time.
Thus at some point the hash will change.
Thus at this point the hash will be different for a single character change.
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. Probably won't work too well since a single character omission will cause significant difference in the hash values.
Check a range of hashes. A hash is one dimensional, but string similarity is not, thus this probably won't work either.
All in all, hashing is probably not the way to go.
This questions is a bit old but you may be interested in this paper by two researchers at AT&T. They employ a technique that is reminiscent of the Nilsimsa hash to detect when similar sms messages have been seen an "abnormal" number of times in a time window.
It sounds Locality Sensitive hashing would also be pertinent to your problem.

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second has function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the
hashed-to bucket until you find a free bucket (modding by table
size). Displacement lists might look something like { 3, 5, 11,
19... } etc. - ideally you want to have the difference between
displacements not be the sum of a sequence of other displacements.
rehash using a different algorithm (but then you'd need yet another
algorithm if you happen to clash twice etc.)
root a container in the
buckets, such that colliding strings can be searched for. Typically
the number of buckets should be similar to or greater than the
number of elements, so elements per bucket will be fairly small and
a brute-force search through an array/vector is a reasonable
approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or support separate heap & allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) is enough to find an empty bucket so the locality of memory use is reasonable) though they're more collision prone than an alternative hashing algorithm (which should approach #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to concider using a trie instead of a hash table for dictionaries of strings.
The up side of this approach is you get O(|S|) worst case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that hash table allows you only average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when load balance is too high.
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking if the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gained a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve that rare collision cases and assure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.