Clustering Distance Matrix in Matlab

I have a list of words and a numeric value for each group of words.
The same word is written sometimes incorrectly in my database, so if "dog" has the numeric value "120", I also want to assign the same numeric value to a misspelling like "dogg".
My idea so far was to use the Levenshtein distance to calculate a distance matrix for the words, which I have now done in Matlab. My questions: Which methods would be best for clustering my (obviously symmetric) distance matrix, and how can I then, as a final step, predict which numeric value should be assigned to the words in a new dataset?
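A minimal sketch of one possible approach, assuming the symmetric Levenshtein distance matrix D (n x n), a cell array words of the known words, and a new word newWord are already available, and that a function lev(a,b) returning the Levenshtein distance exists (for example editDistance from the Text Analytics Toolbox); the cutoff value is illustrative:

% Hierarchical clustering of a precomputed symmetric distance matrix D
Z = linkage(squareform(D), 'average');   % linkage expects the condensed form
labels = cluster(Z, 'cutoff', 2, 'criterion', 'distance');  % illustrative cutoff

% Medoid of each cluster: the member word closest to all other members
medoid = zeros(max(labels), 1);
for k = 1:max(labels)
    idx = find(labels == k);
    [~, m] = min(sum(D(idx, idx), 2));
    medoid(k) = idx(m);
end

% Predict: assign a new word to the cluster of its nearest medoid and
% inherit that cluster's numeric value (lev is assumed, see above)
dNew = arrayfun(@(k) lev(newWord, words{medoid(k)}), 1:max(labels));
[~, kBest] = min(dNew);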

Related

How to use silhouette_score in Sklearn with mixed (categorical and numerical) data?

I have a mixed data set, as mentioned, and am trying unsupervised clustering.
I am trying many different experiments, including Gower's distance and k-prototypes. I want to try some of sklearn's metrics to see what values they give me.
While I was looking at silhouette_score, I saw there is an argument 'metric' with which I can decide how to compute distances. But as my data has mixed types, I would like to choose Manhattan for numerical and Hamming for categorical features. Is there a way I can use silhouette_score with both metrics in one go? If all my input data were numerical, I would have done as below:
silhouette_score(friendRecomennderData, labels, metric = 'manhattan')
Thank you in advance.
You are getting confused about the arguments that are passed to silhouette_score. If you read the documentation mentioned here, it says the following about the input data, i.e. the parameter X:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise. Array of pairwise distances between samples, or a feature array.
Thus the data can only be a numerical array consisting of distances between the samples. It's not possible to have distances as categorical values.
You need to first cluster your data to obtain the labels, then compute the distance matrix and provide it as input to silhouette_score.
You can use a distance metric like Gower's distance, which deals with mixed data types, and then pass the computed distance matrix as X together with metric = 'precomputed' in the silhouette_score function.

fasttext - Extracting and comparing pre-trained word vectors

I'm working with the German pre-trained word vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
I encountered the following problems:
To extract the vectors for my words, I started by simply searching the wiki.de.vec text file for the respective words. However, the vectors in the wiki.de.vec text file differ from those that the print-word-vectors function outputs (e.g. the vector for 'affe', meaning 'monkey', is different in the wiki.de.vec file than in the output of print-word-vectors for 'affe'). What is the reason for this? I assume it occurs because, in the model by Bojanowski et al., the vector for a word is computed as the sum of its character n-gram vectors; but what does the vector for 'affe' in the wiki.de.vec text file reflect then? Is it the vector for the n-gram 'affe' that also occurs in other words like 'karaffe'? So should one always use the print-word-vectors function (i.e. sum the character n-gram vectors) when working with these vectors, rather than simply extracting vectors from the text file?
Some real German words (e.g. knatschen, resonieren) receive a null vector (even with the print-word-vectors function). How can this be if the major advantage of this subword approach is to compute vectors for out-of-vocabulary words?
The nearest-neighbors function (./fasttext nn) outputs the nearest neighbors of a word with the cosine distance. However, this value differs from the value I obtain by extracting the vectors of the individual words with print-word-vectors and computing their cosine distance manually in Matlab using pdist2(wordVector1, wordVector2, 'cosine'). Why is this the case? Is this the wrong way to obtain the cosine distance between two word vectors? (A quick check is sketched below.)
Thanks in advance for your help and suggestions!
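Regarding point 3: pdist2 with 'cosine' returns the cosine distance, i.e. one minus the cosine similarity, and similarity tools often report the similarity itself, which is one possible source of the discrepancy. A quick manual check (a sketch; v1 and v2 are assumed to be row vectors):

% Cosine distance as computed by pdist2(v1, v2, 'cosine')
cosSim  = (v1 * v2') / (norm(v1) * norm(v2));  % cosine similarity
cosDist = 1 - cosSim;                          % what pdist2 returns
% d = pdist2(v1, v2, 'cosine');                % should match cosDist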

Compare documents by sequence vector

I'm trying to classify documents by sequence vectors. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integers such that each element in the vector corresponds to the position of a word in the vocabulary.
For example, if the vocab is [hello, how, are, you, today] and the document is "hello you" then I'll have the vector: [1 4]. Another document of "how are you" will result in [2 3 4].
Now what I want is to assess the similarity between the first and the second vector. As you can see, these vectors don't have the same length. Furthermore, comparing them directly may not make sense because they represent sequences of words. This case is different from a binary (bag-of-words) vector, which records whether a word appears in the document (1 if it appears, 0 otherwise), and also from a frequency (word-count) vector, which records how often each word of the given vocabulary appears in the document.
Can you give me a suggestion?
The Jaccard similarity is normally used to compare the similarity of sets (in your case, sets of text shingles). The text is split into n-grams (shingled), and locality-sensitive hashing can then be used to estimate the Jaccard similarity efficiently.
There is a whole field dedicated to this - Google is your friend!
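A minimal sketch of the exact (non-hashed) computation, using the two example documents from the question and unigram shingles for brevity (n would normally be 2 or 3):

% Jaccard similarity of two token sequences via n-gram shingles
shingles = @(tok, n) arrayfun(@(i) strjoin(tok(i:i+n-1), ' '), ...
                              1:numel(tok)-n+1, 'UniformOutput', false);
a = shingles({'hello', 'you'}, 1);
b = shingles({'how', 'are', 'you'}, 1);
J = numel(intersect(a, b)) / numel(union(a, b));   % 1/4 here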

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins, and I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed in be aligned and thus have the same size, NBIN x N (not N x NBIN); that is, if N>1, the number of rows in the inputs should equal the number of bins in the histograms. If you are just going to compare two histograms (that is, if N=1), it doesn't really matter: you can pass either row or column vector versions of these, as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q; it is used to perform a very minimal check that the inputs are of the same size, but it is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but it otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However, the version you link to performs the aforementioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks whether any histogram bins are empty (which would make the computation blow up numerically) and, if there are, removes their contribution to the sum (I am not sure this is entirely appropriate; consult an expert on the subject). You should also be aware of the exact form of the formula: specifically, note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
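For concreteness, a minimal end-to-end sketch, assuming kldiv.m from the File Exchange link above is on the path; the eps smoothing is one crude way to avoid empty bins, not part of the original function:

% Two empirical histograms over the same 20 bins
edges = linspace(-4, 4, 21);
P = histcounts(randn(1, 1e4), edges, 'Normalization', 'probability');
Q = histcounts(0.5 + randn(1, 1e4), edges, 'Normalization', 'probability');
P = (P + eps) / sum(P + eps);   % crude smoothing against empty bins
Q = (Q + eps) / sum(Q + eps);
bins = 1:numel(P);              % numeric bin labels, same size as P and Q
d = kldiv(bins, P, Q);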

Controlled random number/dataset generation in MATLAB

Say I have a cube of dimensions 1x1x1, spanning between coordinates (0,0,0) and (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within certain minimum and maximum distances from each other and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background about my problem (this will clarify what exactly I need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int F(x,y,z)\,dV \approx \sum_{i} F(x_i,y_i,z_i)\, V_i(x_i,y_i,z_i)$
Here, $F(x_i,y_i,z_i)$ is the value of the function at the point $(x_i,y_i,z_i)$ and $V_i$ is the weight. So to calculate the integral accurately, I need to identify a set of random points which are neither too close to each other nor too far from each other (sorry, but I don't know what this range is myself; I will only be able to figure it out with a parametric study once I have working code). Also, I need to do this for a 3D mesh which has multiple polyhedra, hence I want to avoid loops to speed things up.
Check out this nice random vectors generator with fixed sum on the File Exchange (FEX).
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use p = rand(3,10), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
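If the points must also respect a margin from the boundaries and a minimum pairwise distance, here is a simple rejection sketch (the margin and distance values are illustrative, not from the posts; a low-discrepancy sequence such as haltonset from the Statistics and Machine Learning Toolbox is another option for more even coverage):

% 10 points in the unit cube, kept away from the faces and from
% each other; margin and dmin are illustrative values
margin = 0.1; dmin = 0.2; n = 10;
P = margin + (1 - 2*margin) * rand(n, 3);   % rows are points here, as pdist expects
while min(pdist(P)) < dmin                  % redraw until sufficiently spread out
    P = margin + (1 - 2*margin) * rand(n, 3);
end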