I am trying to train word embeddings from scratch. I decided to start with the basics and chose the CBOW architecture from the word2vec paper. Here are the steps I followed, based on my understanding of it (these are the steps after tokenization and numericalization):
Generate training examples using a context window. I used a context window of size 3, so I have 6 context words for every training example
Simple FFNN with 1 hidden layer (dim = batch_size * 500)
Train model on data using CrossEntropyLoss() as my loss function
The vocab size is quite small (~6k) with around 1.4M tokens available for training.
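For reference, the first step (generating training examples with a context window of size 3) could look roughly like the sketch below. The function name `cbow_examples` and the toy token ids are made up for illustration, and positions without a full window on both sides are simply skipped, which is an assumption rather than something stated above.

```python
def cbow_examples(token_ids, window=3):
    """Return (context, target) pairs; only positions with a full window are kept."""
    examples = []
    for i in range(window, len(token_ids) - window):
        # `window` ids to the left of the target plus `window` ids to the right.
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        examples.append((context, token_ids[i]))
    return examples

# Toy numericalized sentence (hypothetical ids).
ids = [10, 11, 12, 13, 14, 15, 16, 17]
pairs = cbow_examples(ids, window=3)
print(pairs[0])
```

Each example has 2 * window = 6 context ids and one target id.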
The model is trained on the task of predicting a target word given a set of 6 context words. I managed to train it to ~24% accuracy. Note that I have not used PyTorch's nn.Embedding layer. My model is defined as:
nn.Sequential(
    nn.Linear(6, 500),
    nn.Linear(500, len(vocab))
)
There is no softmax because I am using nn.CrossEntropyLoss directly as my loss.
Now I am at a loss as to how to actually extract the embeddings from the model? If I were using an Embedding layer, it was simply a matter of passing the vocab index to the layer to get the corresponding embedding. But in my case, how do I extract the embeddings?
I realize I can simply take the weights of the hidden layer as my embedding matrix and use that for lookups but how are the keys defined? How do I know which row of the matrix maps to which word? I am confused because we have 6 context words as input, not just one word. Can anyone please help me understand this?
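One way to see how rows would map to words is to assume a different input encoding from the one above: each context word is fed as a one-hot vector of length len(vocab), and the context vectors are averaged, instead of passing 6 raw indices. Under that assumption the first weight matrix has one row per vocab index. The NumPy sketch below uses a made-up toy vocab and random weights standing in for trained ones.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]           # hypothetical toy vocab
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4                                  # D stands in for the 500 hidden units

rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))                           # stand-in for trained first-layer weights

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# A one-hot input times W selects exactly one row: that row is the word's embedding.
cat_vec = one_hot(word_to_idx["cat"], V) @ W

# Averaging the one-hot context vectors gives the mean of the corresponding rows,
# which is the hidden representation for that context.
context = ["the", "sat", "on"]
x = np.mean([one_hot(word_to_idx[w], V) for w in context], axis=0)
hidden = x @ W
```

So the key for row i is simply the word whose vocab index is i. Note that PyTorch's nn.Linear(len(vocab), 500) stores its weight with shape (500, len(vocab)), so there the vector for word i would be column i of the weight rather than row i.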
I am trying to cluster a dataset using an encoder, and since I am new to this field I can't tell how to do it. My main issue is how to define the loss function, since the dataset is unlabeled. From what I have seen in the literature so far, the loss function is defined as the distance between the desired output and the predicted output. My question is: since I don't have a desired output, how should I implement this?
You can use an autoencoder to pre-train your convolutional layers, as described in my question here on using a convolutional autoencoder for images.
As you can see from the code, the model is compiled with the Adam optimizer and uses accuracy and the Dice coefficient as metrics. I think you can use accuracy only, since the Dice coefficient is image-specific.
I'm not sure how well it will work for you, because you haven't described how you will transform your bibliography lists into vectors. Perhaps you will create a list of bibliography IDs sorted by the cosine distance between them.
For example, for each reference in your dataset you could build a vector of its cosine distances to every item in the bibliography list above, and use that vector as the input to the autoencoder.
Once the encoder is trained, you can remove the decoder part of the model and use the encoder's output as the input to an unsupervised clustering algorithm, for example k-means. You can find details about them here.
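The last step (clustering the encoder outputs with k-means) can be sketched as follows. This is a plain NumPy version of Lloyd's algorithm with a deterministic farthest-first initialization. The `codes` array is hypothetical and stands in for whatever the trained encoder produces. In practice a library implementation would be the usual choice.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Farthest-first initialization: start from the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[nearest.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical 2-d encoder outputs forming two well-separated groups.
codes = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(codes, k=2)
```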
I trained a word2vec model on my dataset using the word2vec gensim package. My dataset has about 131,681 unique words but the model outputs a vector matrix of shape (47629,100). So only 47,629 words have vectors associated with them. What about the rest? Why am I not able to get a 100 dimensional vector for every unique word?
The gensim Word2Vec class uses a default min_count of 5, meaning any words appearing fewer than 5 times in your corpus will be ignored. If you enable INFO level logging, you should see logged messages about this and other steps taken by the training.
Note that it's hard to learn meaningful vectors from few (or non-varied) usage examples. So while you could lower min_count to 1, you shouldn't expect those vectors to be very good, and even trying to train them may worsen your other vectors. (Low-occurrence words can be essentially noise, interfering with the training of other word-vectors, where those more-frequent words do have sufficiently numerous and varied examples to train better.)
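To see why the vocabulary shrinks, here is a small sketch of the frequency cut-off itself. It is not gensim's code, just the same idea in plain Python, and `surviving_vocab` and the toy corpus are made up for illustration.

```python
from collections import Counter

def surviving_vocab(sentences, min_count=5):
    """Return the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
kept_2 = surviving_vocab(corpus, min_count=2)   # words seen at least twice
kept_1 = surviving_vocab(corpus, min_count=1)   # every unique word
print(len(kept_2), len(kept_1))
```

Running the same count over your own tokenized corpus should tell you how many of the 131,681 unique words fall below the threshold.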
I have a project for face recognition of five people that I want my CNN to detect, and I was wondering if people could have a look at my model to see if this is a step in the right direction
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def model(number_of_faces=5):
    model = Sequential()
    # sort out the input layer later (800x800 RGB, channels-last ordering)
    model.add(Conv2D(64, (3, 3), activation='relu', input_shape=(800, 800, 3)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # flatten the feature maps before the dense layers
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(number_of_faces, activation='softmax'))
    return model
So the model will take in pictures (headshots of 5 people found on Google) in 3 channels of size 800 by 800, with 64 feature maps, pooled, then another set of feature maps, and then connected to an MLP for classification into a binary vector with 5 output neurons. My question is: is this a decent approach to classifying headshots of certain people?
For example, if I were to download one hundred pictures of a certain person and put them through this model, would the feature space created in the convolutions be big enough to capture the features of that face and four others?
thanks for the help guys
Well, it is not an engineering issue but a scientific one. It is hard to judge whether 100 pictures are enough for your purpose without seeing your current progress (for example, what is the accuracy now? Are you facing overfitting or underfitting?).
But, YES, extra face data can help your model, especially when those faces share the same context (background, light, angle, skin color, etc.) as your eventual testing data.
If you are interested in face recognition, you can start with Deep Learning Face Representation from Predicting 10,000 Classes (unofficial code here); they use 10 thousand faces as an extra dataset for training. You can search for "DeepID" for more information.
If you are an engineering guy, you can check Facial Expression Recognition with Convolutional Neural Networks. This report focuses more on implementation, which is also done in Keras.
By the way, 800*800 is extra large by the standards of the face recognition community. You might like to resize the images to a smaller size. Otherwise your program might be too gargantuan to train and consume a bunch of memory.
Face recognition is not a regular classification task. If you train your model for 5 people, even if it is a successful model, you need to re-train it when a new person joins the team. This means your new model might not be successful anymore.
Instead, we first train a regular classification model, then drop its final softmax layer and use its earlier layers to represent images. The representations are multi-dimensional vectors. We expect an image pair of the same person to have high similarity, whereas an image pair of different persons should have low similarity. We can measure the similarity of the vectors with cosine similarity or Euclidean distance.
To sum up, you no longer need to train a model yourself for a face recognition application. You just need to use a neural network to predict; the predictions are the representations.
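As a rough sketch of the comparison step, here is cosine similarity between two representation vectors in plain Python. The 4-dimensional vectors and the 0.8 threshold are made up for illustration. Real representations have many more dimensions, and the threshold has to be tuned for the model you use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(rep_a, rep_b, threshold=0.8):
    # The threshold is a hypothetical value; tune it on held-out image pairs.
    return cosine_similarity(rep_a, rep_b) >= threshold

# Hypothetical representations produced by the network.
anchor = [0.9, 0.1, 0.3, 0.2]
same = [0.8, 0.2, 0.3, 0.1]
other = [0.1, 0.9, 0.1, 0.7]

print(same_person(anchor, same), same_person(anchor, other))
```

Adding a new person then only requires storing that person's representation, not re-training the network.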
I recommend you to use deepface. It wraps state-of-the-art face recognition models such as VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID and Dlib. It also handles face detection and alignment in the background. You just need to call a line of code to apply face recognition.
#!pip install deepface
from deepface import DeepFace
models = ['VGG-Face', 'Facenet', 'OpenFace', 'DeepFace', 'DeepID', 'Dlib']
obj = DeepFace.verify("img1.jpg", "img2.jpg", model_name = models[0])
print(obj["verified"], ", ", obj["distance"])
The returned object stores the maximum threshold value and the distance found. Its verified field is True if the image pair shows the same person and False if it shows different persons.
I need to classify pairs of images and indicate whether they're the same or not. I use several descriptors, such as SIFT, LBP and more.
I now want to use LIBSVM to do the training and testing.
How can I use svmTrain?
Should I save only the distance between the 2 descriptors and then just have 1 1:SIftDelta, 2:LBPDelta?
Is this the correct way, or is there a better approach?
thanks
I'm not sure this is the right forum for this question, as it deals more with "high level" notions of learning rather than the specific implementation of it in Matlab.
Having said that, it seems like you are trying to combine multiple cues for learning, which is not a trivial task.
I can propose two methods for you:
Direct method - just concatenate all your descriptors into a single, very long, one and do the learning in this high dimensional space.
Do the learning in two stages (consequently, you'll have to partition your training data into two):
At the first stage, learn K classifiers, each using a different descriptor (assuming you wish to use K different descriptors).
Then, at the second stage (using the remainder of your training data), you classify each example using the K classifiers you have: this will give you a new K-dimensional feature vector for each sample (you can put the classification result, or use the distance from the separating hyperplane, to populate the k-th entry in the new descriptor). Now you can train a second classifier on the new K-dimensional vectors. This second classifier gives you the final output of your multi-descriptor system.
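For the direct method, the concatenated features for each image pair could be written in LIBSVM's sparse text format (a label followed by ascending 1-based index:value entries), roughly as in the Python sketch below. The helper `to_libsvm_line`, the feature values, and the +1/-1 label convention for same/different pairs are assumptions made for illustration.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with 1-based indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# Hypothetical per-descriptor differences for one image pair.
sift_delta = [0.12, 0.40]
lbp_delta = [0.05]

# Direct method: concatenate everything into one feature vector.
features = sift_delta + lbp_delta
line = to_libsvm_line(1, features)          # 1 = same, -1 = different (assumed convention)
print(line)
```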
-Enjoy!
In Matlab (Neural Network Toolbox + Image Processing Toolbox), I have written a script to extract features from images and construct a "feature vector". My problem is that some features have more data than others. I don't want these features to have more significance than others with less data.
For example, I might have a feature vector made up of 9 elements:
hProjection = [12,45,19,10];
vProjection = [3,16,90,19];
area = 346;
featureVector = [hProjection, vProjection, area];
If I construct a Neural Network with featureVector as my input, the area only makes up 10% of the input data and is less significant.
I'm using a feed-forward back-propagation network with a tansig transfer function (pattern-recognition network).
How do I deal with this?
When you present your input data to the network, each column of your feature vector is fed to the input layer as an attribute by itself.
The only bias you have to worry about is the scale of each attribute (i.e., we usually normalize the features to the [0,1] range).
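A minimal sketch of that normalization, written in Python rather than Matlab: each column (attribute) is rescaled to [0,1] using its own minimum and maximum over the samples. The first row is the feature vector from the question. The other two rows are invented so that each column has a range.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

samples = [
    [12, 45, 19, 10, 3, 16, 90, 19, 346],   # featureVector from the question
    [10, 40, 25, 12, 5, 20, 80, 15, 300],   # hypothetical second sample
    [14, 50, 22, 8, 4, 18, 85, 17, 400],    # hypothetical third sample
]
scaled = min_max_scale(samples)
```

After scaling, the area column lies in the same range as every projection column, so no attribute dominates only because of its units.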
Also, if you believe that the features are dependent/correlated, you might want to apply some kind of attribute selection technique. In your case it depends on the meaning of the hProj/vProj features...
EDIT:
It just occurred to me that, as an alternative to feature selection, you can use a dimensionality reduction technique (PCA/SVD, Factor Analysis, ICA, ...). For example, factor analysis can be used to extract a set of latent hidden variables on which those hProj/vProj features depend. So instead of these 8 features, you can get 2 features such that the original 8 are a linear combination of the new two features (plus some error term). Refer to this page for a complete example.
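As a rough illustration of the idea, here is PCA via SVD in Python with NumPy (PCA, not factor analysis, and not the example on the linked page). The data is synthetic: 8 observed features are generated from 2 hidden factors, which is an assumption made so that the reduction to 2 features is visible.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))     # two hidden factors (synthetic)
mixing = rng.normal(size=(2, 8))      # how the factors combine into 8 features
X = latent @ mixing                   # 50 samples, 8 correlated features
Z = pca_reduce(X, n_components=2)     # 50 samples, 2 features
```

Because the 8 synthetic features are exact linear combinations of 2 factors, the 2 new features keep all of the variance. With real hProj/vProj data there would be some error term left over.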