Distributed representations for words: How do I generate it?

I've been reading about neural networks and how CBOW and Skip-Gram work, but I can't figure out one thing: how do I generate the word vectors themselves?
It always seems to me that I use those methods to calculate the weight matrix, and I use the word vector to adjust it, and I'm struggling to understand how I got the word vectors in the first place.
When I found the Rumelhart paper I thought I would find the answer there, but all I got was the same thing: calculate the error by comparing the expected output with the one I found, and adjust the model. But what is my expected output? How did I get it?
For example, Omer Levy and Yoav Goldberg explained in a perfectly clear way (in Linguistic Regularities in Sparse and Explicit Word Representations) how the Explicit Vector Space Representation works, but I couldn't find an explanation of how distributed representations for words work.

Related

Reverse TF-IDF vector (vec2text)

Given a doc2vec vector generated for some document, is it possible to reverse the vector back to the original document?
If so, does there exist any hash algorithm that would make the vector irreversible but still comparable to other vectors of the same type (using cosine/Euclidean distance)?

It's unclear why you've mentioned "TF-IDF vector" in your question title, but then asked about a Doc2Vec vector – which is very different from a TF-IDF approach. I'll assume your main interest is Doc2Vec vectors.
In general, a Doc2Vec vector has far too little information to actually reconstruct the document for which the vector was calculated. It's essentially a compressed summary, based on evolving (in training or inference) a vector that's good (within the limits of the model) at predicting the document's words.
For example, one commonly-used dimensionality for Doc2Vec vectors is 300. Those 300 dimensions are each represented by a 4-byte floating-point value. So the vector is 1200 bytes in total - but could be the summary vector for a document of many hundreds or thousands of words, far far larger than 1200 bytes.
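The size comparison above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 300 dimensions and 4-byte floats come from the answer, while the 2,000-word document and the 6 bytes per word are made-up numbers for illustration.

```python
import struct

DIMENSIONS = 300                          # dimensionality quoted in the answer
BYTES_PER_FLOAT = struct.calcsize("<f")   # size of a standard 32-bit float

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT

# Hypothetical document: 2,000 words at roughly 6 bytes each (assumption).
document_bytes = 2000 * 6

print(vector_bytes)                   # 1200
print(document_bytes / vector_bytes)  # 10.0
```

Even for this modest hypothetical document, the text is ten times larger than the vector that summarizes it.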
It's theoretically plausible that with a Doc2Vec vector, and the associated model from which it was trained or inferred, you could generate a ranked list of words most likely to be in the document. There's a pending feature request to offer this in Gensim (#2459), but no implementing code yet. But such a list of words wouldn't be grammatical, and the top 10 words in such a list might not be in the document at all. (It might be entirely made up of other similar words.)
With a large set of calculated vectors, as you get when training of a model has finished, you could take a vector (from that set, or from inferring a new text), and look through the set-of-vectors for whichever one has a vector closest to your query vector. That would point you at one of your known documents - but that's more of a lookup (when you already know many example documents) than reversing a vector into a document directly.
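A minimal sketch of that kind of lookup, in plain Python rather than through Gensim. The three-dimensional vectors and the document ids are invented for illustration; real Doc2Vec vectors would have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_document(query, known_vectors):
    """Return the id of the known document whose vector is most similar to query."""
    return max(known_vectors,
               key=lambda doc_id: cosine_similarity(query, known_vectors[doc_id]))

# Hypothetical vectors for three already-known documents.
known_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [-0.5, 0.3, 0.8],
}

query = [0.1, 0.9, 0.3]
best = closest_document(query, known_vectors)
print(best)  # doc_b
```

Note that this only ever returns a document you already hold; it does not reconstruct any text from the vector.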
You'd have to say more about your need for an 'irreversible' vector that is still good for document-to-document comparisons for me to make further suggestions to meet that need.
To an extent, Doc2Vec vectors already meet that need, as they can't regenerate an exact document. But given that they could generate a list of likely words (per above), if your needs are more rigorous, you might need extra steps. For example, if you used a model to calculate all needed vectors, but then threw away the model, even that theoretical capability to list most-likely words would go away.
But to the extent you still have the vectors, and potentially their mappings to full documents, a vector still implies one, or a few, closest-documents from the known set. And even if you somehow had a novel vector, without its text, simply looking among your known documents that are closest would be highly suggestive (but not dispositive) about what words are in the source document.
(If your needs are very demanding, there might be something in the genre of 'Fully Homomorphic Encryption' and/or 'Private Information Retrieval' that would help. Those use advanced cryptography to allow queries on encrypted data that only reveal final results, hiding the details of what you're doing even from the system answering your query. But those techniques are far newer and more complicated, with few if any sources of ready-to-use code, and adapting them specifically for vector-similarity style calculations might require significant custom advanced-cryptography work.)

Word Embedding to word

I am using GloVe-based pre-trained embedding vectors for the words in my input sentences to an NMT-like model. The model then generates a series of word embeddings as its output for each sentence.
How can I convert these output word embeddings to their respective words? One way I tried is using cosine similarity between each output embedding vector and all the input embedding vectors. Is there a better way than this?
Also, is there a better way to approach this than using embedding vectors?

First of all the question is lacking a lot of details like the library used for word embedding, the nature of the model, and the training data, etc ...
But I will try to give you an idea what you can do in those situations, assuming you are using some word embedding library like Gensim.
How to get the word from the vector:
We are dealing with predicted word vectors here, so a predicted vector may not be the exact vector of the original word and we have to use similarity. In Gensim you can use the similar_by_vector method of the loaded KeyedVectors object (called word_vectors here), something like
target_word_candidates = word_vectors.similar_by_vector(target_word_vector, topn=3)
That would solve the reverse lookup problem, as highlighted here: given all the word vectors, how to get the most similar word. But we still need to find the best single word according to the context.
You can use some sort of post-processing on the output target word vectors, this would be beneficial for trying to solve some problems like:
1. How to guide the translation of out-of-vocabulary terms?
2. How to enforce the presence of a given translation recommendation in the decoder’s output?
3. How to place these word(s) in the right position?
One idea is to use an external resource for the target language, i.e. a language model, to predict which combination of words is going to be used. Some other techniques incorporate the external knowledge inside the translation network itself.
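As a rough illustration of the language-model idea, the sketch below rescans a handful of similarity candidates per position and picks the word sequence that balances embedding similarity against fluency. Everything in it is hypothetical: the candidate words, the similarity values, and the toy bigram table standing in for a real target-language model.

```python
from itertools import product

# Hypothetical candidates per output position: (word, cosine similarity).
candidates = [
    [("the", 0.95), ("a", 0.90)],
    [("bank", 0.80), ("shore", 0.82)],
    [("closed", 0.88), ("close", 0.87)],
]

# Toy bigram scores standing in for a real language model (made-up numbers).
bigram_score = {
    ("the", "bank"): 0.6,
    ("bank", "closed"): 0.7,
    ("the", "shore"): 0.3,
    ("shore", "closed"): 0.05,
}

def sequence_score(sequence, lm_weight=1.0):
    """Sum of embedding similarities plus weighted bigram fluency."""
    similarity = sum(sim for _, sim in sequence)
    words = [word for word, _ in sequence]
    fluency = sum(bigram_score.get(pair, 0.0) for pair in zip(words, words[1:]))
    return similarity + lm_weight * fluency

# Similarity alone: pick the top candidate at each position independently.
greedy_words = [max(options, key=lambda c: c[1])[0] for options in candidates]

# Similarity plus fluency: score every combination (fine for tiny examples only).
best = max(product(*candidates), key=sequence_score)
best_words = [word for word, _ in best]

print(greedy_words)  # ['the', 'shore', 'closed']
print(best_words)    # ['the', 'bank', 'closed']
```

Enumerating every combination grows exponentially with sentence length, so a real system would more likely use beam search or a similar pruned search.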

Can someone tell me about the kNN search algo that Matlab uses?

I wrote a basic O(n^2) algorithm for a nearest neighbor search. As usual Matlab 2013a's knnsearch(..) method works a lot faster.
Can someone tell me what kind of optimization they used in their implementation?
I am okay with reading any documentation or paper that you may point me to.
PS: I understand the documentation on the site mentions the paper on kd trees as a reference. But as far as I understand kd trees are the default option when column number is less than 10. Mine is 21. Correct me if I'm wrong about it.

The biggest optimization MathWorks have made in implementing nearest-neighbors search is that all the hard stuff is implemented in a MEX file, as compiled C, rather than MATLAB.
With an algorithm such as kNN that (in my limited understanding) is quite recursive and difficult to vectorize, that's likely to give such an improvement that the O() analysis will only be relevant at pretty high n.
In more detail, under the hood the knnsearch command uses createns to create a NeighborSearcher object. By default, when X has 10 or fewer columns, this will be a KDTreeSearcher object, and when X has more than 10 columns it will be an ExhaustiveSearcher object (both KDTreeSearcher and ExhaustiveSearcher are subclasses of NeighborSearcher).
All objects of class NeighborSearcher have a method knnsearch (which you would rarely call directly, using instead the convenience command knnsearch rather than this method). The knnsearch method of KDTreeSearcher calls straight out to a MEX file for all the hard work. This lives in matlabroot\toolbox\stats\stats\@KDTreeSearcher\private\knnsearchmex.mexw64.
As far as I know, this MEX file performs pretty much the algorithm described in the paper by Friedman, Bentley, and Finkel referenced in the documentation page, with no structural changes. As the title of the paper suggests, this algorithm finds a match in O(log(n)) expected time per query, so searching for all n points costs roughly O(n log(n)) rather than O(n^2). Unfortunately, the contents of the MEX file are not available for inspection to confirm that.

The code builds a KD-tree space-partitioning structure to speed up nearest neighbor search, think of it like building indexes commonly used in RDBMS to speed up lookup operations.
In addition to nearest neighbor(s) searches, this structure also speeds up range-searches, which finds all points that are within a distance r from a query point.
As pointed out by @SamRoberts, the core of the code is implemented in C/C++ as a MEX-function.
Note that knnsearch chooses to build a KD-tree only under certain conditions, and falls back to an exhaustive search otherwise (by naively searching all points for the nearest one).
Keep in mind that in cases of very high-dimensional data (and few instances), the algorithm degenerates and is no better than an exhaustive search. In general as you go with dimensions d>30, the cost of searching KD-trees will increase to searching almost all the points, and could even become worse than a brute force search due to the overhead involved in building the tree.
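To make the idea concrete, here is a small kd-tree nearest-neighbor search checked against a brute-force scan. It is written in Python rather than MATLAB, and it is only a textbook-style sketch of the technique, not MathWorks' implementation (which lives in the MEX file and cannot be inspected). The random 2-D points are made up.

```python
import random

def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or squared_distance(query, node["point"]) < squared_distance(query, best):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < squared_distance(query, best):
        best = nearest(far, query, best)
    return best

def brute_force_nearest(points, query):
    """Exhaustive search: compare the query against every point."""
    return min(points, key=lambda p: squared_distance(query, p))

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
tree = build_kdtree(points)

query = (0.5, 0.5)
tree_result = nearest(tree, query)
exhaustive_result = brute_force_nearest(points, query)
print(squared_distance(query, tree_result) == squared_distance(query, exhaustive_result))
```

The pruning test in nearest is what degrades in high dimensions: the splitting plane is almost always closer than the current best match, so nearly every subtree ends up being visited.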
There are other variations of the algorithm that deal with high dimensions, such as ball trees, which partition the data into a series of nested hyper-spheres (as opposed to partitioning the data along Cartesian axes like KD-trees). Unfortunately those are not implemented in the official Statistics toolbox. If you are interested, here is a paper which presents a survey of available kNN algorithms.
(Figure: an illustration of searching a kd-tree partitioned 2-D space, borrowed from the docs.)

Different results for Fundamental Matrix in Matlab

I am implementing stereo matching and as preprocessing I am trying to rectify images without camera calibration.
I am using the SURF detector to detect and match features in the images and try to align them. After I find all matches, I remove all those that don't lie on the epipolar lines, using this function:
[fMatrix, epipolarInliers, status] = estimateFundamentalMatrix(...
matchedPoints1, matchedPoints2, 'Method', 'RANSAC', ...
'NumTrials', 10000, 'DistanceThreshold', 0.1, 'Confidence', 99.99);
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);
figure; showMatchedFeatures(I1, I2, inlierPoints1, inlierPoints2);
legend('Inlier points in I1', 'Inlier points in I2');
The problem is that if I run this function with the same data, I still get different results, which causes differences in the resulting disparity map on each run with the same data.
The putatively matched points are the same, but the inlier points differ on each run.
Here you can see that some matches are different in the result:
UPDATE: I thought that the differences were caused by the RANSAC method, but using LMedS or MSAC I still get different results on the same data.

EDIT: Admittedly, this is only a partial answer, since I am only explaining why this is even possible with these fitting methods and not how to improve the input keypoints to avoid this problem from the start. There are problems with the distribution of your keypoint matches, as noted in the other answers, and there are ways to address that at the stage of keypoint detection. But, the reason the same input can yield different results for repeated executions of estimateFundamentalMatrix with the same pairs of keypoints is because of the following. (Again, this does not provide sound advice for improving keypoints so as to solve this problem).
The reason for different results on repeated executions is related to the RANSAC method (and LMedS and MSAC). They all utilize stochastic (random) sampling and are thus non-deterministic. All methods except Norm8Point operate by randomly sampling 8 pairs of points at a time for (up to) NumTrials.
But first, note that the different results you get for the same inputs are not equally suitable (they will not have the same residuals) but the search space can easily lead to any such minimum because the optimization algorithms are not deterministic. As the other answers rightly suggest, improve your keypoints and this won't be a problem, but here is why the robust fitting methods can do this and some ways to modify their behavior.
Notice the documentation for the 'NumTrials' option (ADDED NOTE: changing this is not the solution, but this does explain the behavior):
'NumTrials' — Number of random trials for finding the outliers
500 (default) | integer
Number of random trials for finding the outliers, specified as the comma-separated pair consisting of 'NumTrials' and an integer value. This parameter applies when you set the Method parameter to LMedS, RANSAC, MSAC, or LTS.
MSAC (M-estimator SAmple Consensus) is a modified RANSAC (RANdom SAmple Consensus). Deterministic algorithms for LMedS have exponential complexity and thus stochastic sampling is practically required.
Before you decide to use Norm8Point (again, not the solution), keep in mind that this method assumes NO outliers, and is thus not robust to erroneous matches. Try using more trials to stabilize the other methods (EDIT: I mean, rather than switching to Norm8Point, but if you are able to back up in your algorithms then address the inputs -- the keypoints -- as a first line of attack). Also, to reset the random number generator, you could do rng('default') before each call to estimateFundamentalMatrix. But again, note that while this will force the same answer each run, improving your keypoint distribution is the better solution in general.
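The effect of resetting the random number generator can be seen in a much simpler setting. The sketch below is a toy RANSAC line fit in Python, not the fundamental-matrix estimator, and the data, thresholds, and seeds are all made up. It only illustrates that a fixed seed reproduces the same random samples and therefore the same inlier set.

```python
import random

def ransac_line(points, num_trials, threshold, rng):
    """Fit y = m*x + c by RANSAC; return indices of the largest inlier set found."""
    best_inliers = []
    for _ in range(num_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # random minimal sample
        if x1 == x2:
            continue  # vertical pair cannot define y = m*x + c; skip this trial
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        inliers = [i for i, (x, y) in enumerate(points)
                   if abs(y - (m * x + c)) < threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Synthetic data: noisy points near y = 2x + 1, plus uniformly scattered outliers.
data_rng = random.Random(42)
points = [(x, 2 * x + 1 + data_rng.gauss(0, 0.3)) for x in range(30)]
points += [(data_rng.uniform(0, 30), data_rng.uniform(0, 60)) for _ in range(15)]

# Same seed: the same samples are drawn, so the inlier set is identical.
run_a = ransac_line(points, 20, 0.5, random.Random(1))
run_b = ransac_line(points, 20, 0.5, random.Random(1))
print(run_a == run_b)  # True

# A different seed draws different samples and may or may not agree.
run_c = ransac_line(points, 20, 0.5, random.Random(2))
print(run_a == run_c)
```

Seeding makes the result repeatable, but it does not make it better: a poorly distributed set of matches still admits many competing models.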

I know it's too late for your answer, but I guess it would be useful for someone in the future. Actually, the problem in your case is twofold:
Degenerate location of features, i.e., the location of features is mostly localized (on you :P) and not well-spread throughout the image.
These matches are sort of on the same plane. I know you would argue that your body is not planar, but comparing it to the depth of the room, it sort of is.
Mathematically, this means you are kind of extracting E (or F) from a planar surface, which always has infinitely many solutions. To sort this out, I would suggest using some constraint on the distance between any two extracted SURF features, i.e., any two SURF features used for matching should be at least 40 or 100 pixels apart (depending on the resolution of your image).
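One simple way to apply such a distance constraint is a greedy filter, sketched here in Python. The keypoint coordinates are invented, and the assumption that they arrive sorted strongest-first is mine, not something the answer states.

```python
import math

def spread_out(points, min_distance):
    """Greedily keep points at least min_distance from every point already kept."""
    kept = []
    for p in points:
        if all(math.dist(p, q) >= min_distance for q in kept):
            kept.append(p)
    return kept

# Hypothetical keypoint locations in pixels, assumed sorted strongest-first.
keypoints = [(100, 100), (105, 102), (160, 100), (100, 139), (300, 250), (310, 260)]

selected = spread_out(keypoints, min_distance=40)
print(selected)  # [(100, 100), (160, 100), (300, 250)]
```

Because the filter is greedy, the order of the input decides which of two nearby features survives, which is why sorting by feature strength first is a sensible choice.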

Another way to get better SURF features is to set 'NumOctaves' in detectSURFFeatures(rgb2gray(I1),'NumOctaves',5); to larger values.
I am facing the same problem and this has helped (a little bit).

Functional form of 2D interpolation in Matlab

I need to construct an interpolating function from a 2D array of data. The reason I need something that returns an actual function is, that I need to be able to evaluate the function as part of an expression that I need to numerically integrate.
For that reason, "interp2" doesn't cut it: it does not return a function.
I could use "TriScatteredInterp", but that's heavyweight: my grid is equally spaced (and big), so I don't need the Delaunay triangulation.
Are there any alternatives?

(Apologies for the 'late' answer, but I have some suggestions that might help others if the existing answer doesn't help them)
It's not clear from your question how accurate the resulting function needs to be (or how big 'big' is), but one approach that you could adopt is to regress the data points that you have using a least-squares or Kalman filter-based method. You'd need to do this with a number of candidate function forms and then choose the one that is 'best', for example by using a measure such as MAE or MSE.
Of course this requires some idea of what form the underlying function could take, but your question isn't clear as to whether you have this kind of information.
Another approach that could work (and requires no knowledge of what the underlying function might be) is the use of the fuzzy transform (F-transform) to generate line segments that provide local approximations to the surface.
The method for this would be:
Define a 2D universe that includes the x and y domains of your input data
Create a 2D fuzzy partition of this universe - choosing partition sizes that give the accuracy you require
Apply the discrete F-transform using your input data to generate fuzzy data points in a 3D fuzzy space
Pass the inverse F-transform as a function handle (along with the fuzzy data points) to your integration function
If you're not familiar with the F-transform then I posted a blog a while ago about how the F-transform can be used as a universal approximator in a 1D case: http://iainism-blogism.blogspot.co.uk/2012/01/fuzzy-wuzzy-was.html
To see the mathematics behind the method and extend it to a multidimensional case, the University of Ostrava has published a PhD thesis that explains its application to various engineering problems and also provides an example of how it is constructed for the case of a 2D universe: http://irafm.osu.cz/f/PhD_theses/Stepnicka.pdf

If you want a function handle, why not define f=@(xi,yi)interp2(X,Y,Z,xi,yi) ?
It might be a little slow, but I think it should work.
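The same pattern, wrapping gridded data in a callable and handing that callable to an integrator, can be sketched in Python with a closure. This is an illustration of the idea rather than a replacement for interp2 or quad2d: the bilinear interpolation and the midpoint-rule integrator are written by hand, and the grid values are made up.

```python
from bisect import bisect_right

def make_interpolant(xs, ys, Z):
    """Return f(x, y) doing bilinear interpolation on a regular grid.

    xs and ys are ascending grid coordinates; Z[i][j] is the value at
    (xs[j], ys[i]), the same row/column layout that interp2 uses.
    """
    def cell(coords, v):
        # Index of the grid interval containing v, clamped to the valid range.
        return min(max(bisect_right(coords, v) - 1, 0), len(coords) - 2)

    def f(x, y):
        j, i = cell(xs, x), cell(ys, y)
        tx = (x - xs[j]) / (xs[j + 1] - xs[j])
        ty = (y - ys[i]) / (ys[i + 1] - ys[i])
        bottom = (1 - tx) * Z[i][j] + tx * Z[i][j + 1]
        top = (1 - tx) * Z[i + 1][j] + tx * Z[i + 1][j + 1]
        return (1 - ty) * bottom + ty * top

    return f

def integrate2d(f, x0, x1, y0, y1, n=100):
    """Midpoint-rule approximation of the double integral of f over a rectangle."""
    hx, hy = (x1 - x0) / n, (y1 - y0) / n
    total = sum(f(x0 + (a + 0.5) * hx, y0 + (b + 0.5) * hy)
                for a in range(n) for b in range(n))
    return total * hx * hy

# Made-up grid sampled from z = x + 2*y, which bilinear interpolation reproduces exactly.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0]
Z = [[x + 2 * y for x in xs] for y in ys]

f = make_interpolant(xs, ys, Z)
print(f(0.5, 0.5))                 # 1.5
print(integrate2d(f, 0, 2, 0, 1))  # close to 4.0
```

Since f is an ordinary function, it can also be combined with other terms inside a larger integrand before being passed to the integrator.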

If I understand you correctly, you want to perform a surface/line integral of 2-D data. There are ways to do it but maybe not the way you want it. I had the exact same problem and it's annoying! The only way I solved it was using the Surface Fitting Tool (sftool) to create a surface then integrating it.
After you create your fit using the tool (it has a GUI as well), it will generate an sftool object which you can then integrate (in 2-D) using quad2d.
I also tried your method of using interp2 and got results (which were similar to the sfobject), but I had no idea how to do a numerical integration (line/surface) with the data. Creating the sfobject and then integrating it was much faster.
It was the first time I had done something like this, so I confirmed it using a numerically evaluated line integral. According to Stokes' theorem, the surface integral and the line integral should be the same, and they did turn out to be the same.
I asked this question in the mathematics stackexchange, wanted to do a line integral of 2-d data, ended up doing a surface integral and then confirming the answer using a line integral!
