Compare documents by sequence vector - matlab

I'm trying to classify documents by sequence vector. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integers, where each element of the vector is the position of the corresponding word in the vocabulary.
For example, if the vocab is [hello, how, are, you, today] and the document is "hello you" then I'll have the vector: [1 4]. Another document of "how are you" will result in [2 3 4].
Now I want to assess the similarity between the first and the second vector. As you can see, these vectors don't have the same length. Furthermore, comparing them directly may not make sense because they represent sequences of words. This case is different from a binary (bag-of-words) vector, which only records whether a word appears in the document (1 if it appears, 0 otherwise), and also from a frequency (word-count) vector, which records how often each word of the vocabulary appears in the document.
Can you give me a suggestion?
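For concreteness, the encoding described above could be produced like this (a small sketch using the toy vocabulary from the example):
% Map each word of a document to its position in the vocabulary
vocab = {'hello', 'how', 'are', 'you', 'today'};
doc   = {'how', 'are', 'you'};
[~, seq] = ismember(doc, vocab);   % seq = [2 3 4]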

Jaccard similarity is normally used to compare sets (in your case, text). The text is split into n-grams (shingled), and then locality-sensitive hashing can be used to estimate the Jaccard similarity of the shingle sets efficiently.
There is a whole field dedicated to this - Google is your friend!
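To make the shingling idea concrete, here is a minimal Matlab sketch that compares two small documents by the Jaccard similarity of their word-bigram shingle sets. Tokenization is just a whitespace split and the set operations are exact; a real pipeline would normalize the text and switch to MinHash/LSH once the corpus gets large.
doc1 = 'hello how are you today';
doc2 = 'how are you';
tok1 = strsplit(lower(doc1));
tok2 = strsplit(lower(doc2));
% Build word bigrams (2-shingles) as joined strings
ngrams = @(tok, n) arrayfun(@(i) strjoin(tok(i:i+n-1), ' '), ...
                            1:numel(tok)-n+1, 'UniformOutput', false);
s1 = ngrams(tok1, 2);
s2 = ngrams(tok2, 2);
% Jaccard similarity = |intersection| / |union| of the shingle sets
jaccard = numel(intersect(s1, s2)) / numel(union(s1, s2));   % 0.5 here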

Related

fasttext - Extracting and comparing pre-trained word vectors

I'm working with the German pre-trained word vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
I encountered the following problems:
To extract the vectors for my words, I started by simply searching the wiki.de.vec text file for the respective words. However, the vectors in the wiki.de.vec text file differ from those that the print-word-vectors function outputs (e.g. the vector for 'affe', meaning 'monkey', is different in the wiki.de.vec file than in the output of print-word-vectors for 'affe'). What is the reason for this? I assume it is because, in the model by Bojanowski et al., the vector for a word is computed as the sum of its character n-gram vectors; but what does the vector for 'affe' in the wiki.de.vec text file reflect then? Is it the vector for the n-gram 'affe' that also occurs in other words like 'karaffe'? So, should one always use the print-word-vectors function (i.e. sum the character n-gram vectors) when working with these vectors, rather than simply extracting vectors from the text file?
Some real German words (e.g. knatschen, resonieren) receive a null vector (even with the print-word-vectors function). How can this be if the major advantage of this subword approach is to compute vectors for out-of-vocabulary words?
The nearest-neighbors function (./fasttext nn) outputs the nearest-neighbors of a word with the cosine distance. However, this value differs from the value I obtain by getting the vectors of the individual words with print-word-vectors and computing their cosine distance manually in Matlab using pdist2(wordVector1, wordVector2, 'cosine'). Why is this the case? Is this the wrong way to obtain the cosine distance between two word vectors?
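For reference, a minimal sketch of the manual comparison (the two vectors below are made-up placeholders for the real fastText vectors; note that pdist2 with the 'cosine' option returns the cosine distance, i.e. 1 minus the cosine similarity):
wordVector1 = [0.1 -0.3 0.5 0.2 -0.1];   % placeholder vectors
wordVector2 = [0.2 -0.1 0.4 0.3  0.0];
cosDist = pdist2(wordVector1, wordVector2, 'cosine');   % cosine distance
cosSim  = dot(wordVector1, wordVector2) / ...
          (norm(wordVector1) * norm(wordVector2));      % cosine similarity
% cosSim equals 1 - cosDist up to floating-point error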
Thanks in advance for your help and suggestions!

Find nearest sparse vectors, what kind of index or DB to use?

I'd like to detect similar text documents.
There's a function that takes text as input and produces a vector as output.
text => vector
The produced vector is sparse. Its dimension is huge (can't say for sure, but probably around 10,000), yet almost all of its elements are zero. Only about 10-20 of its elements are nonzero.
vector = [0, 0, 0..., v1, 0...., v2, 0.... ]
So it makes sense to represent this sparse vector as a map instead of an array.
vector = { i1: v1, i2: v2 }
What kind of index can I use to efficiently find the N vectors closest to a given { i1: v1, i2: v2 } vector? The distance metric could be Euclidean, cosine, or something else.
There are millions of documents. What kind of DB could be used to do such kind of search? PostgreSQL? Redis?
After meditating on Machine Learning stuff, here's the answer:
There's no ready-to-use DB or index that can handle high-dimensional spaces. There are tools like https://github.com/spotify/annoy, but they can only handle dimensions < 1000.
Theoretically it's possible to handle high-dimensional spaces using tricks like partitioning, but it's very case-specific; there is no universal solution.
A better way would be to reduce the dimensionality to below 1000 using PCA; then it becomes possible to use tools like https://github.com/spotify/annoy
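A minimal Matlab sketch of that reduce-then-search idea, using random toy data; exact knnsearch stands in here for an approximate index such as annoy, and pca/knnsearch come from the Statistics and Machine Learning Toolbox:
numDocs = 1000;     % small toy corpus; the real data has millions of documents
dim     = 10000;    % sparse dimension
X  = sprand(numDocs, dim, 20/dim);   % about 20 nonzeros per document vector
Xf = full(X);
% Reduce to a few hundred dense dimensions with PCA (for millions of
% documents a randomized or incremental variant would be needed)
[coeff, score] = pca(Xf, 'NumComponents', 300);
% Project a query vector into the same space and find its 10 nearest
% neighbours under the cosine distance
q = full(sprand(1, dim, 20/dim));
qReduced = (q - mean(Xf, 1)) * coeff;
[idx, dist] = knnsearch(score, qReduced, 'K', 10, 'Distance', 'cosine');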

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
It says that I should provide two distributions P and Q of size n x nbins, but I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I assumed the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed in be aligned and thus have the same size, NBINS x N (not N x NBINS); that is, if N > 1, the number of rows of the inputs should equal the number of bins in the histograms. If you are only comparing two histograms (that is, N = 1) it doesn't really matter: you can pass either row or column vectors, as long as you are consistent and the order of the bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However, the version you link to performs the above-mentioned minimal checks and in addition allows the computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and, if there are empty bins, removes their contribution to the sum (I'm not sure this is entirely appropriate - consult an expert on the subject). You should also be aware of the exact form of the formula; note in particular the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
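Putting this together, a minimal sketch of the direct computation on two made-up histograms, normalizing the counts to probabilities first and adding a tiny epsilon so that empty bins do not produce log(0) (whether that epsilon trick is appropriate for your application is again something to check):
counts1 = [5 9 12 7 2];        % made-up histogram counts over the same bins
counts2 = [4 10 11 8 3];
P = counts1 / sum(counts1);    % normalize to probability distributions
Q = counts2 / sum(counts2);
epsVal = 1e-12;                % guards against log(0) for empty bins
KL = sum(P .* (log2(P + epsVal) - log2(Q + epsVal)));   % in bits (log2)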

Purpose of matrix length

Matlab defines the matrix function length to return
Length of largest array dimension
What is an example of a use of knowing the largest dimension? Knowing the number of rows or columns has obvious uses, but I don't know why someone would want the largest dimension regardless of whether it is rows or columns.
Thank You
In fact, most of my code wants to do things exactly once for each row, for each column or for each element.
Therefore, I typically use one of these
size(M,1)
size(M,2)
numel(V)
In particular do not depend on length to match the number of elements in a vector!
The only real convenience I found for length (in older versions of Matlab) is when I need a repeat statement rather than a while. Then it is convenient that the length of a vector usually returns at least one.
Some other uses that I had for length:
A quick rough check whether something is big.
Making something square, as mentioned by @Mike
This question addresses a good point: I have seen programs fail because the length command was applied to matrices (for looping), especially when one expects to get size(M, n) because the n-th dimension should be the largest. All in all, I cannot see an advantage in allowing length to be applied to matrices; in fact I only see risks from unexpected behavior.
If I want to know the largest dimension of any matrix, I would prefer to be more explicit and use max(size(M)), which should also be much clearer to anyone reading the code.
I am not sure whether the following example should be in this answer, but it addresses the same point.
It is also useful to be explicit about the dimension when averaging over matrices. Consider the case where you always want to average over the first dimension, i.e. over the columns of a matrix. As long as your matrix has size n x m with n greater than 1, you do not have to care about specifying a dimension. But for unforeseen cases, where your matrix happens to be a row vector, things get messy:
%// good case, where num of rows is 2 or greater
size(mean(rand(2, 4), 1)) %// [1, 4]
size(mean(rand(2, 4))) %// [1, 4]
%// bad case, where num of rows is 1
size(mean(rand(1, 4), 1)) %// [1, 4]
size(mean(rand(1, 4))) %// [1, 1], returns the average of that row
If you want to create a square matrix B that can contain a non-square input matrix A, you can take A's length, initialize B as a zero matrix whose number of rows and columns both equal that length, and then copy A into the new zeroed matrix.
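For example, a small sketch of that padding approach:
A = rand(3, 5);                      % non-square input
n = length(A);                       % = max(size(A)) = 5
B = zeros(n);                        % n-by-n square matrix of zeros
B(1:size(A,1), 1:size(A,2)) = A;     % copy A into the top-left corner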
Another example - the one I use most - is when working with vectors. There it is very convenient to work with length instead of size(vec,1) or size(vec,2) as it doesn't matter if it is a row or a column vector.
As @Dennis Jaheruddin pointed out, length gave wrong results for empty vectors in some versions of MATLAB. Using numel instead of length might therefore be convenient for better backward compatibility. The readability of the code is almost the same IMHO.
This question compares length and numel and their performance, and comes to the conclusion that they perform similarly for up to 100k elements in a vector; with more than 100k elements, numel appears to be faster. I tried to verify this with MATLAB R2014a: in my test length was a bit slower, but since the difference is in the range of microseconds, I doubt it makes any real difference in practice.
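If you want to repeat such a micro-benchmark yourself, here is a rough sketch using timeit (the numbers will of course depend on the Matlab version and hardware):
v = rand(1, 2e5);                    % vector with 200k elements
tLength = timeit(@() length(v));
tNumel  = timeit(@() numel(v));
fprintf('length: %.2e s, numel: %.2e s\n', tLength, tNumel);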

Clustering Distance Matrix in Matlab

I have a list of words and a numeric value for each group of words.
The same word is sometimes written incorrectly in my database, so if "dog" has the numeric value "120", I also want to assign the same numeric value to a misspelling like "dogg".
My idea so far was to use the Levenshtein distance to calculate a distance matrix for the words, which I have now done in Matlab. My questions: which methods would be best for clustering my (obviously symmetric) distance matrix, and, as a final step, how can I predict which numeric value should be assigned to the words of a new dataset?
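For illustration, this is the kind of clustering step I have in mind, on a made-up 3-by-3 Levenshtein distance matrix (linkage, squareform and cluster are from the Statistics and Machine Learning Toolbox; whether average linkage is the right choice is part of my question):
D = [0 1 5; 1 0 4; 5 4 0];               % pairwise Levenshtein distances
Z = linkage(squareform(D), 'average');   % squareform: matrix -> condensed vector
labels = cluster(Z, 'cutoff', 2, 'criterion', 'distance');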