fastText - Extracting and comparing pre-trained word vectors - MATLAB

I'm working with the German pre-trained word vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
I encountered the following problems:
To extract the vectors for my words, I started by simply searching the wiki.de.vec text file for the respective words. However, the vectors in the wiki.de.vec text file differ from those that the print-word-vectors function outputs (e.g. the vector for 'affe', meaning 'monkey', in the wiki.de.vec file is different from the print-word-vectors output for 'affe'). What is the reason for this? I assume it happens because, in the model by Bojanowski et al., the vector for a word is computed as the sum of its character n-gram vectors, but what does the vector for 'affe' in the wiki.de.vec text file reflect then? Is it the vector for the n-gram 'affe' that also occurs in other words like 'karaffe'? So, should one always use the print-word-vectors function (i.e. sum the character n-gram vectors) when working with these vectors, and not simply extract vectors from the text file?
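For reference, this is roughly how I look up a word in the text file (the file path and the assumption of 300-dimensional vectors are just what I use for the German Wikipedia model, not guaranteed):

% Rough sketch of my lookup in the .vec text file; 300 dimensions is an assumption.
fid = fopen('wiki.de.vec', 'r');
header = fgetl(fid);                       % first line holds vocabulary size and dimension
vec = [];
while ~feof(fid)
    line = fgetl(fid);
    parts = strsplit(line, ' ');
    if strcmp(parts{1}, 'affe')
        vec = str2double(parts(2:301));    % the 300 vector components
        break
    end
end
fclose(fid);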
Some real German words (e.g. knatschen, resonieren) receive a null vector, even with the print-word-vectors function. How can this be, if the major advantage of this subword approach is to compute vectors for out-of-vocabulary words?
The nearest-neighbors function (./fasttext nn) outputs the nearest neighbors of a word along with the cosine distance. However, this value differs from what I obtain by getting the vectors of the individual words with print-word-vectors and computing their cosine distance manually in MATLAB using pdist2(wordVector1, wordVector2, 'cosine'). Why is this the case? Is this the wrong way to obtain the cosine distance between two word vectors?
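This is roughly the comparison I'm doing (wordVector1 and wordVector2 are 1-by-300 row vectors taken from print-word-vectors; as far as I understand, pdist2 with 'cosine' returns a distance, i.e. 1 minus the cosine similarity, which may already account for part of the difference):

% wordVector1 and wordVector2 are 1-by-N row vectors from print-word-vectors (assumption).
d = pdist2(wordVector1, wordVector2, 'cosine');   % cosine DISTANCE = 1 - similarity
s = 1 - d;                                        % cosine similarity, the kind of value ./fasttext nn appears to print
% equivalent manual computation of the similarity:
s_manual = dot(wordVector1, wordVector2) / (norm(wordVector1) * norm(wordVector2));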
Thanks in advance for your help and suggestions!

Related

Is it possible to get codebooks from codewords in vector quantization?

I was working in MATLAB on vector quantization.
As we know, in vector quantization, if we provide a set of codewords as input we get code vectors.
So what I did: I used the LBG and Lloyd algorithms to do that, roughly like this:
trainingSet = randn(2,100);   % training set == codewords
distortion = 0.001;
codebook = vectorQuantization(trainingSet, distortion);   % my own vector quantization routine
The result was the locations of some codewords.
Now I want the locations of the codebook to be a subset of the locations of the code vectors. How can I do that?
ali
A codebook can be thought of as a 2D array.
A codeword is one row in that 2D array.
If you are given a codeword you cannot reconstruct a codebook as the codeword only contains the information held within that row.
If you know the size of the codebook is 256, and you have 256 codewords then you just have to place all the codewords in order to "reconstruct" the codebook. Alternatively, if you know the codebook was sorted by distortion values (very common) then you can calculate the distortion of each row and sort accordingly.
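In MATLAB terms, a rough sketch of both options (codewords is assumed to be an N-by-d matrix with one codeword per row, and the distortion measure used here is only a placeholder):

% Stacking the codewords row by row "reconstructs" the codebook.
codebook = codewords;
% If the original codebook was sorted by a per-row distortion value,
% recompute that value and sort accordingly (the measure below is just an example):
distortionPerRow = sum(bsxfun(@minus, codewords, mean(codewords, 1)).^2, 2);
[~, order] = sort(distortionPerRow);
codebook = codewords(order, :);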
I hope this answer is of help to you as I'm not sure I fully understand your question.

Kullback-Leibler Divergence of 2 Histograms in MATLAB

I would like a function to calculate the KL distance between two histograms in MATLAB. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
It says that I should have two distributions P and Q of size n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I assumed the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed in be aligned and thus have the same size, nbins x n (not n x nbins); that is, if n > 1, the number of rows in the inputs should equal the number of bins in the histograms. If you are just comparing two histograms (i.e. n = 1) it doesn't really matter: you can pass either row or column vectors, as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However the version you link to performs the abovementioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks whether any histogram bins are empty (which would make the computation blow up numerically) and, if there are, removes their contribution to the sum (I'm not sure this is entirely appropriate - consult an expert on the subject). You should also be aware of the exact form of the formula, specifically the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
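For reference, a minimal sketch of the manual computation, assuming p and q are histogram counts over the same bins (the normalization and the handling of empty bins are my own choices here, not something the linked function does in this exact form):

% p and q are vectors of histogram counts over the same bins (assumption).
P = p / sum(p);                           % normalize to probability distributions
Q = q / sum(q);
mask = (P > 0) & (Q > 0);                 % drop bins where either histogram is empty
KL = sum(P(mask) .* (log2(P(mask)) - log2(Q(mask))));   % KL divergence in bits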

Clustering Distance Matrix in Matlab

I have a list of words and a numeric value for each group of words.
The same word is sometimes written incorrectly in my database, so if "dog" has the numeric value "120", I also want to assign the same numeric value to a misspelling like "dogg".
My idea so far was to use the Levenshtein distance to calculate a distance matrix for the words, which I have now done in MATLAB. My questions: which methods would be best for clustering my (obviously symmetric) distance matrix, and, as a final step, how can I predict which numeric value to assign to the words in a new dataset? One option I'm considering is sketched below.
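To make the setup concrete, D is my symmetric Levenshtein distance matrix with zeros on the diagonal, and I know I could hand it to MATLAB's hierarchical clustering like this (the cutoff of 2 edits is just a guess), but I'm not sure it is the right choice:

% D is an N-by-N symmetric Levenshtein distance matrix (zero diagonal).
Z = linkage(squareform(D), 'average');                         % squareform gives the condensed distance vector linkage expects
labels = cluster(Z, 'cutoff', 2, 'criterion', 'distance');     % group words within 2 edits of each other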

Controlled random number/dataset generation in MATLAB

Say I have a cube of dimensions 1x1x1 spanning between the coordinates (0,0,0) and (1,1,1). I want to generate a random set of points (say 10 points) within this cube which are somewhat uniformly distributed (i.e. within a certain minimum and maximum distance from each other and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background about my problem (this should clarify what exactly I need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int F(x,y,z)\, dV \approx \sum_{i} F(x_i,y_i,z_i)\, V_i$
Here, $F(x_i,y_i,z_i)$ is the value of the function at the point $(x_i,y_i,z_i)$ and $V_i$ is the corresponding weight. So, to calculate the integral accurately, I need to identify a set of random points which are neither too close to each other nor too far from each other (sorry, I don't yet know what this range is; I will only be able to figure it out through a parametric study once I have a working code). Also, I need to do this for a 3D mesh with multiple polyhedra, hence I want to avoid loops to speed things up.
Check out this nice random-vectors-with-fixed-sum generator, available as a FEX file.
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use pts = rand(3,10), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
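If the points also need to stay away from the boundaries and keep a minimum spacing, one possible extension is a simple rejection loop (the margin and dmin values below are placeholders, and pdist requires the Statistics Toolbox):

% Draw 10 points inside [margin, 1-margin]^3 and redraw until the minimum
% pairwise distance is large enough (margin and dmin are assumed values).
margin = 0.1;
dmin   = 0.2;
pts = margin + (1 - 2*margin) * rand(10, 3);    % here one point per row, to match pdist
while min(pdist(pts)) < dmin
    pts = margin + (1 - 2*margin) * rand(10, 3);
end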

How to represent an array of numbers as characters

I have a sine wave which has been wavelet-thresholded (say, soft thresholding). How do I write a program so that the signal is transformed using a discrete wavelet transform and the coefficients of the signal in this new basis are then displayed using alphabetical characters rather than numbers?
For instance: $a=(\text{coeff}_1,\text{coeff}_2,\ldots,\text{coeff}_9)$, $b=(\text{coeff}_{10},\ldots,\text{coeff}_{19})$, and so on. Now, depending on how many numbers are to be represented by a single character, a rule can be formed: say the alphabet has 8 letters and the length of the signal is 1000, then how do I specify a sliding window for the assignment of characters? It is possible that there is more than one instance of the $a$ coefficients; they are not unique numbers. This is similar to a compression technique. The letters of the alphabet can be assigned by a Markov method.
Let's see if I've got your question: you have an array of numbers, and you want MATLAB to display it as letters with this strange syntax. Try something like:
a = [];
for i = 1:length(sinewave)
    a = [a sprintf('%c', sinewave(i))];   % %c prints the numeric value as a character; use %s if that is not what you want
end
a = reshape(a, 10, []).';                 % 10 letters per row (reshape fills column-wise, hence the transpose; length(a) must be divisible by 10)
Anyway, this is just a hint to get you started. Good luck!
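If the goal is instead to map ranges of coefficient values to a small alphabet (one possible reading of the rule described in the question), a rough sketch with 8 equal-width bins could look like this (coeffs is assumed to hold the thresholded wavelet coefficients; discretize needs a reasonably recent MATLAB release, and histc-style binning works similarly on older ones):

% coeffs is assumed to hold the thresholded wavelet coefficients (vector).
letters = 'abcdefgh';                                              % 8-letter alphabet
edges   = linspace(min(coeffs), max(coeffs), numel(letters) + 1);
idx     = discretize(coeffs, edges);                               % bin index (1..8) for each coefficient
symbols = letters(idx);                                            % one letter per coefficient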