Word2vec model query - neural-network

I trained a word2vec model on my dataset using the word2vec gensim package. My dataset has about 131,681 unique words but the model outputs a vector matrix of shape (47629,100). So only 47,629 words have vectors associated with them. What about the rest? Why am I not able to get a 100 dimensional vector for every unique word?

The gensim Word2Vec class uses a default min_count of 5, meaning any words appearing fewer than 5 times in your corpus will be ignored. If you enable INFO level logging, you should see logged messages about this and other steps taken by the training.
Note that it's hard to learn meaningful vectors from few (or non-varied) usage examples. So while you could lower min_count to 1, you shouldn't expect those vectors to be very good, and even trying to train them may worsen your other vectors. (Low-occurrence words can be essentially noise, interfering with the training of other word-vectors, where those more-frequent words do have sufficiently numerous and varied examples to be learned well.)
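To see how a frequency threshold shrinks the vocabulary, here is a minimal sketch in plain Python. It is not gensim's actual code; the function name `surviving_vocab` and the toy corpus are made up for illustration, and it only mimics the rule described above (drop words seen fewer than min_count times).

```python
from collections import Counter

def surviving_vocab(sentences, min_count=5):
    """Return the set of words that occur at least min_count times.

    Hypothetical helper: it imitates the effect of a min_count
    threshold, not gensim's internal implementation.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy tokenized corpus (assumption, for illustration only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

all_words = surviving_vocab(corpus, min_count=1)  # every unique word
kept = surviving_vocab(corpus, min_count=2)       # only the repeated words
print(len(all_words), sorted(kept))
```

Running the same count on your own tokenized corpus should show roughly how many of the 131,681 unique words fall below the threshold.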

Related

Extract CBOW embeddings - pytorch

I am trying to train word embeddings from scratch. I decided to start out with the basics and chose the CBOW architecture from the word2vec paper. Here are the steps I used, based on my understanding of it (these are the steps after tokenization and numericalization):
1. Generate training examples using a context window. I used a context window of size 3, so I have 6 context words for every training example.
2. Simple FFNN with 1 hidden layer (dim = batch_size * 500).
3. Train the model on the data using CrossEntropyLoss() as my loss function.
The vocab size is quite small (~6k) with around 1.4M tokens available for training.
The model is trained on the task of predicting a target word given a set of 6 context words. I managed to train it to ~24% accuracy. Note, I have not used PyTorch's nn.Embedding layer. My model is defined as:
nn.Sequential(
    nn.Linear(6, 500),
    nn.Linear(500, len(vocab))
)
There is no softmax, as I am directly using nn.CrossEntropyLoss as my loss.
Now I am at a loss as to how to actually extract the embeddings from the model? If I were using an Embedding layer, it was simply a matter of passing the vocab index to the layer to get the corresponding embedding. But in my case, how do I extract the embeddings?
I realize I can simply take the weights of the hidden layer as my embedding matrix and use that for lookups but how are the keys defined? How do I know which row of the matrix maps to which word? I am confused because we have 6 context words as input, not just one word. Can anyone please help me understand this?

What model is Rasa NLU entity extraction using? Is it LSTM or just a simple neural network?

What kind of model is RASA NLU using to extract the entities and intents after word embedding?
This blog post from Rasa clarifies some aspects.
With Rasa you will first train a vectorizer that transforms each document into an N-dimensional vector, where N is the size of your vocabulary. This is exactly what scikit-learn's CountVectorizer does.
Each intent embedding is instead built as a one-hot vector (or a vector with more 1s if you have "mixed" intents). Each of these vectors has the same dimensions as a document embedding, so I guess N may actually be (vocabulary size) + (number of intents).
At that point Rasa will train a neural network (default: 2 hidden layers) where the loss function is designed to maximize the similarity between document d and intent i if d is labelled as i in the training set (and minimize d's similarity with all the other intent embeddings). The similarity is by default calculated as cosine similarity.
Each new, unseen document is embedded by the neural network and its similarity computed for each of the intents. The intent which is most similar to the new document will be returned as the predicted label.
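The prediction step described above can be sketched as follows. This is not Rasa's code: the function names, the intent names and the 3-dimensional embeddings are all hypothetical, and the vectors stand in for whatever the trained network would output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_a * norm_b)

def predict_intent(doc_embedding, intent_embeddings):
    """Return the intent whose embedding is most similar to the document."""
    return max(
        intent_embeddings,
        key=lambda name: cosine_similarity(doc_embedding, intent_embeddings[name]),
    )

# Hypothetical embeddings, made up for illustration.
intent_embeddings = {
    "greet": [1.0, 0.0, 0.0],
    "goodbye": [0.0, 1.0, 0.0],
    "order_pizza": [0.0, 0.0, 1.0],
}
doc_embedding = [0.1, 0.2, 0.9]

predicted = predict_intent(doc_embedding, intent_embeddings)
print(predicted)
```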
Old answer:
It's not an LSTM. They say their approach is inspired by Facebook's StarSpace.
I didn't find the paper above very enlightening; however, looking at StarSpace's GitHub repo, the text classification use case is said to have the same setting as their previous work, TagSpace.
The TagSpace paper is clearer and explains how they use a CNN to embed each document in a space such that its distance to the associated class vector is minimized. Words, documents and classes ("tags") are all embedded in the same d-dimensional space, and their distance is measured via cosine similarity or inner product.

Restricting output classes in multi-class classification in Tensorflow

I am building a bidirectional LSTM to do multi-class sentence classification.
I have 13 classes in total to choose from. I multiply the output of my LSTM network by a matrix of shape [2*num_hidden_unit,num_classes] and then apply softmax to get the probability of the sentence falling into each of the 13 classes.
So if we consider output[-1] as the network output:
W_output = tf.Variable(tf.truncated_normal([2*num_hidden_unit,num_classes]))
result = tf.matmul(output[-1],W_output) + bias
and I get my [1, 13] matrix (assuming I am not working with batches for the moment).
Now, I also have information that a given sentence does not fall into a given class for sure and I want to restrict the number of classes considered for a given sentence. So let's say for instance that for a given sentence, I know it can fall only in 6 classes so the output should really be a matrix of dimensionality [1,6].
One option I was thinking of is to put a mask over the result matrix, multiplying the entries for the classes I want to keep by 1 and the ones I want to discard by 0, but this way I would just lose some of the information instead of redirecting it.
Does anyone have a clue what to do in this case?
I think your best bet is, as you seem to have described, using a weighted cross entropy loss function where the weight is 0 for each "impossible" class and 1 for each possible class. Tensorflow has a weighted cross entropy loss function.
Another interesting but probably less effective method is to feed whatever information you have about which classes your sentence can or cannot fall into to the network at some point (probably towards the end).
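For the "redirecting" concern raised in the question, one common alternative (not the weighted loss suggested above) is to mask the logits before the softmax, so that the probability mass is renormalized over the allowed classes only. A minimal framework-free sketch, where `masked_softmax` and the example values are assumptions for illustration:

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over the allowed classes only.

    Disallowed classes get probability exactly 0, and the remaining
    probabilities are renormalized so they still sum to 1.
    """
    if not any(allowed):
        raise ValueError("at least one class must be allowed")
    # Subtract the largest allowed logit for numerical stability.
    max_logit = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - max_logit) if ok else 0.0
            for l, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-class example: classes 1 and 3 are known to be impossible.
logits = [2.0, 1.0, 0.5, -1.0]
allowed = [True, False, True, False]

probs = masked_softmax(logits, allowed)
print(probs)
```

In a TensorFlow graph the same effect is usually obtained by adding a large negative number to the disallowed logits before the softmax; the sketch above only shows the arithmetic.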

Pattern recognition techniques that allow input as sequences of different length

I am trying to classify water end-use events, expressed as time-series sequences, into appropriate categories (e.g. toilet, tap, shower, etc). My first attempt using an HMM shows quite promising results, with an average accuracy of 80%. I wonder if there are any other techniques that accept time-series sequences of different lengths as training input, like an HMM does, rather than a feature vector extracted from each sequence. I have tried Conditional Random Fields (CRF) and SVM; however, as far as I know, these two techniques require the input as a pre-computed feature vector, and all input vectors must have the same length for training. I am not sure if I am right or wrong on this point. Any help would be appreciated.
Thanks, Will

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each point is represented by a set of extracted features. Generally, clustering algorithms need all input data to have the same dimensions (the same number of features), that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each data point (I mean, I have a dataset X where the data points do not all have the same number of features).
Is there any way to adapt them in order to cluster them using common clustering algorithms that require data of the same dimensions?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
1. Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower-dimensional outputs.
2. Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about how your data is represented. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint with the first keypoint, etc.). A possible approach here is to derive another, uniform set of features. Typically, bag-of-words features are used in such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
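The bag-of-words idea can be sketched like this: every object, however many sub-features it has, is turned into a histogram with one bin per vocabulary "word". The codebook below is hard-coded and hypothetical; in practice it would come from clustering the sub-features (e.g. with k-means), and the function names are made up for illustration.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))

def bag_of_words(descriptors, codebook):
    """Fixed-length histogram counting how often each codebook word occurs."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, codebook)] += 1
    return histogram

# Hypothetical 3-word vocabulary of 2-D cluster centres.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Two objects with different numbers of sub-features.
object_a = [(0.1, 0.0), (0.9, 1.2)]
object_b = [(4.8, 5.1), (0.2, -0.1), (5.2, 4.9), (1.1, 0.8)]

hist_a = bag_of_words(object_a, codebook)
hist_b = bag_of_words(object_b, codebook)
print(hist_a, hist_b)
```

Both histograms have the same length, so they can be compared with cosine distance or passed to any clustering algorithm that expects fixed-size vectors.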
I see two options here:
1. Restrict yourself to those features for which all your data-points have a value.
2. See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.