Function to check similarity of a string with a list of strings - word

I have a collection of similar strings in a bucket and have multiple such buckets. What kind of function should I use on the strings to compare a random string with the buckets to find out which bucket it belongs to?
To clarify on each entity in the bucket, it is a sentence that can have multiple words.
An example:
Consider the list of strings in the bucket:
1. round neck black t-shirt
2. printed tee shirt
3. brown polo t-shirt
If we have as input, "blue high neck t-shirt", we want to check if this can be added to the same bucket. This might be a simpler example but consider doing this for a bucket of let's say 100s of strings.
Any reference to an article or paper will be really helpful.

First of all, I am thinking about two kinds of similarities: syntactic and semantic.
1) Syntactic
Levenstein distance can be used to measure distance between two sequences (of characters)
Stemming can be used to increase match probability. There are various implementations (e.g. here). You get the stem (or root) word for your random string and compare with stems from your buckets. Of course, buckets stems should be precomputed for efficiency.
2) Semantic
for general info you can read the article from Wikipedia
for actual implementation you can read this article from CodeProject. It is based on WordNet, a great ontology for English language that stores concepts in synsets and also provides various relations between these synsets.
To get more details you should tell us what kind of similarity you need.
[edit]
Based on provided information, I think you can do something like this:
1) split all strings in words => random string will be named Array1 and current bucket Array2
2) compute similarity as number_of_common_words(Array1, Array2) / count(Array2)
3) choose maximum similarity
Specificity may also be increased, by adding points to position match: Array1[i] = Array2[i]
For better performance I would store buckets as Hash tables, Dictionary etc., so that existence check is done in O(1).

Related

Questions about LSH (Locality-sensitive hashing) and minihashing implementation

I'm trying to implement this paper
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I got a couple of questions about the LHS algorithm in general and the proposed implementation:
The LSH algorithm it's used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket from what I got). If for example I have a new document and I want to calculate the similarity with the others, I have to relaunch the LHS algorithm from scratch, including the new document, correct?
In 'Mining of Massive Datasets, Ch3', it is said that for the LHS we should use one hash function per band. Each hash function creates n buckets.
So, for the first band, we are going to have n buckets. For the second band onward, Am I supposed to keep using the same hash function (so this way I keep using the same buckets as before) or another one (ending so with m>>n buckets)?
This question is related t the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem here. But If I have to use more hash functions (one different function per row), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I have to use only one hash function then here it's not a problem).
In the paper, I understood most of the algorithm except for its end.
Basically, two Signatures matrices are created (one for stable features and one for unstable features) via minhashing. Then, they use LSH on the first matrix to obtain a list of candidates pairs. So far so good.
What happens at the end? do they perform the LHS on the second matrix? How the result of the first LHS is used? I cannot see the relationship between the first and the second LHS.
The output of the final step is supposed to be a list of pairing candidates, right? and all that I have to do is performing Jaccard similarity on them and setting a threshold, right?
Thanks for your answers!
I got a partial answer to my question (still missing question 4)
No. You would keep the bucket structure and hash the new doc into it. Then compare with only those docs in one of the buckets it fell into.
No. You HAVE to use different hash functions and a different set of buckets for each hash function.
This is irrelevant because of the answer to (2).

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is a day of a week. Then you create a one-hot-encoding for each of them.
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
Feature-hashing is mostly used to allow for significant storage compression for parameter vectors: one hashes the high dimensional input vectors into a lower dimensional feature space. Now the parameter vector of a resulting classifier can therefore live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimension reduction thus usually you expect to trade a bit of decreasing of performance with significant storage benefit.
The example in wikipedia is a good one. Suppose your have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first create below document to words model. (each row is a document, each entry in the matrix indicates whether a word appears in the document).
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate below hashed features with 3 buckets. (you apply k different hash functions to the original features and count how many times the hashed value hit a bucket).
bucket1 bucket2 bucket3
doc1: 3 2 0
doc2: 2 2 0
doc3: 1 0 2
Now you successfully transformed the features in 9-dimensions to 3-dimensions.
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter but customized to each user. The naive way of doing this is to train a separate classifier for each user, which are unfeasible regarding either training (to train and update the personalized model) or serving (to hold all classifiers in memory). A smart way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag of words model now holds the common keywords and also use-specific keywords.
All words are then hashed into a low dimensioanl feature space where the document is trained and classified.
Now to answer your questions:
Yes. one-hot-encoding should come first since it is transforming a categorical feature to binary feature to make it consumable by linear models.
You can apply both on the same dataset for sure as long as there is benefit to use the compressed feature-space. Note if you can tolerate the original feature dimension, feature-hashing is not required. For example, in a common digit recognition problem, e.g., MINST, the image is represented by 28x28 binary pixels. The input dimension is only 784. For sure feature hashing won't have any benefit in this case.

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But, I want to identify near duplicates and I also looked at Hashing Algorithms like LSH and MinHash that are used in Nearest Neighbour problems and Near Duplicate Detection.
I am kind of stuck and looking for pointers as to how to proceed and papers/implementations that I could look at?
First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, remove leading and trailing white space; for speed I would perform all these operations in one pass of the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64 bit integers) in a table, if it exists, it is an exact duplicate, if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
To find near duplicates the normalized string needs to be converted into potential signatures (hashes of substrings), see the SpotSigs paper and blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given the Sigs() the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to as does the simhash method.
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes