number of units in the output layer of Hierarchical Softmax - neural-network

In word2vec, there are 3 layers: input, hidden, and output layer.
If we use the traditional softmax approach,
for a vocabulary of size V, the output layer will also have V units (with a one-hot vector as input).
If we use Hierarchical Softmax,
the paper says there are only V-1 inner nodes (in the Huffman binary tree).
Does that mean there are only V-1 units in the output layer in that case?
Here is the reference I am reading:
https://arxiv.org/pdf/1411.2738.pdf
Thank you very much.

In practice, word2vec hierarchical-softmax implementations allocate an output layer with exactly as many nodes as there are vocabulary words. See, for example, this line in the original Google word2vec.c:
https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L356
Or this line in the gensim Python implementation:
https://github.com/RaRe-Technologies/gensim/blob/f3bf792ee1344ed17ad2836ab3c38b4210f59889/gensim/models/word2vec.py#L1171
You can then see how words are assigned individual Huffman codes and nodes (`points`) in the output layer in the CreateBinaryTree (C) or create_binary_tree (Python) functions.
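To see why a vocabulary of V words leads to exactly V-1 internal nodes (and hence V-1 output-layer vectors actually needed), here is a small, self-contained Python sketch of Huffman-code assignment over word counts. It is only an illustration of the idea, not the word2vec code, and the word counts are made up:

    import heapq
    import itertools

    def build_huffman_codes(word_counts):
        """Assign a binary Huffman code to each word and count the
        internal tree nodes created (always len(word_counts) - 1)."""
        counter = itertools.count()  # tie-breaker so heapq never compares subtrees
        heap = [(c, next(counter), w) for w, c in word_counts.items()]
        heapq.heapify(heap)
        internal_nodes = 0
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            internal_nodes += 1                       # one new internal node per merge
            heapq.heappush(heap, (c1 + c2, next(counter), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):               # internal node: recurse
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:                                     # leaf: a word
                codes[node] = prefix
        walk(heap[0][2], '')
        return codes, internal_nodes

    codes, internal = build_huffman_codes({'the': 50, 'cat': 10, 'sat': 8, 'mat': 5})
    print(codes)     # shortest code goes to the most frequent word ('the')
    print(internal)  # 3, i.e. V - 1 internal nodes for V = 4 words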

Related

Extract CBOW embeddings - pytorch

I am trying to train word embeddings from scratch. I decided to start out with basics and chose CBOW arch. from the word2vec paper. Here are the steps I used based on my understanding of the same (these are the steps post tokenization and numericalization):
Generate training examples using a context window. I used a context window of size 3, so I have 6 context words for every training example
Simple FFNN with 1 hidden layer (dim = batch_size * 500)
Train model on data using CrossEntropyLoss() as my loss function
The vocab size is quite small (~6k) with around 1.4M tokens available for training.
The model is trained on the task of predicting a target word given a set of 6 context words. I managed to train it to ~24% accuracy. Note, I have not used PyTorch's nn.Embedding layer. My model is defined as
nn.Sequential(
    nn.Linear(6, 500),
    nn.Linear(500, len(vocab))
)
No softmax, as I am directly using nn.CrossEntropyLoss as my loss.
Now I am at a loss as to how to actually extract the embeddings from the model. If I were using an Embedding layer, it would simply be a matter of passing the vocab index to the layer to get the corresponding embedding. But in my case, how do I extract the embeddings?
I realize I can simply take the weights of the hidden layer as my embedding matrix and use that for lookups, but how are the keys defined? How do I know which row of the matrix maps to which word? I am confused because we have 6 context words as input, not just one word. Can anyone please help me understand this?
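For concreteness, here is a minimal runnable sketch of the setup described in the question (context window of 3, the two-layer nn.Sequential model, nn.CrossEntropyLoss). The toy vocabulary, corpus and hyperparameters are made up; it only reproduces the described training loop and does not resolve the embedding-lookup question:

    import torch
    import torch.nn as nn

    # Toy vocabulary and corpus stand in for the ~6k-word vocab described above.
    vocab = {w: i for i, w in enumerate(['the', 'cat', 'sat', 'on', 'mat', 'a', 'dog', 'ran'])}
    token_ids = torch.randint(0, len(vocab), (1000,))   # pretend numericalized corpus

    # Step 1: (6 context ids, 1 target id) pairs from a window of 3 on each side.
    window = 3
    contexts, targets = [], []
    for i in range(window, len(token_ids) - window):
        ctx = torch.cat([token_ids[i - window:i], token_ids[i + 1:i + 1 + window]])
        contexts.append(ctx)
        targets.append(token_ids[i])
    contexts = torch.stack(contexts).float()    # (N, 6) -- raw ids fed straight in
    targets = torch.stack(targets)              # (N,)

    # Steps 2-3: the model exactly as described, trained with CrossEntropyLoss.
    model = nn.Sequential(
        nn.Linear(6, 500),
        nn.Linear(500, len(vocab)),
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(5):                          # a few full-batch passes, just to show the loop
        logits = model(contexts)                # (N, vocab_size)
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Note: with nn.Linear(6, 500) the first weight matrix is 500 x 6, so there is
    # no per-word row to look up -- which is exactly the difficulty raised above.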

units of neural network layer are independent?

In a neural network, there are 3 main parts: the input layer, hidden layer, and output layer. Is there any correlation between the units of the hidden layer? For example, are the 1st and 2nd neurons of the hidden layer independent of each other, or is there a relation between them? Is there any source that explains this issue?
The answer depends on many factors. From a probabilistic perspective, they are independent given the inputs and before training. If the input is not fixed, then they are heavily correlated (as two "almost linear" functions of the same input signal). Finally, after training they will be strongly correlated, and the exact correlations will depend on the initialisation and the training itself.
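A quick way to check this empirically (my own sketch, not part of the original answer): push random inputs through a freshly initialised hidden layer and look at the correlations between the unit activations.

    import numpy as np
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    hidden = nn.Sequential(nn.Linear(2, 50), nn.Tanh())   # untrained hidden layer

    x = torch.randn(10_000, 2)             # input is a random signal, not a fixed vector
    with torch.no_grad():
        h = hidden(x).numpy()              # hidden activations: (10_000, 50)

    corr = np.corrcoef(h, rowvar=False)    # 50 x 50 correlation matrix of the units
    off_diag = corr[~np.eye(50, dtype=bool)]
    print(np.mean(np.abs(off_diag)))       # clearly non-zero on average: the units are
                                           # correlated because they are all functions of x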

How to extract memnet heat maps with the caffe model?

I want to extract both the memorability score and the memorability heat map using the available memnet caffemodel by Khosla et al. at link.
Looking at the prototxt model, I can see that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance.
As described in their paper [1], the CNN (MemNet) outputs a single real-valued memorability score. So the network they made publicly available computes this single memorability score for a given input image - and not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner-product layer work with any input size, you reinterpret it as a convolution: the first FC layer becomes a convolution whose kernel covers its entire input feature map (6x6 for AlexNet's fc6), and the later FC layers become 1x1 convolutions (e.g. 4096 feature maps for an FC layer with 4096 outputs).
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file where you replace the InnerProduct layers with Convolution layers. Caffe then won't match the weights in the .caffemodel anymore, as the changed layers no longer correspond. So you load the old net and its parameters in Python, load the new net, copy the old parameters into the new net, and save it as a new .caffemodel file.
Now we can run images of any dimensions (227x227 or larger) through the network.
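A sketch of that net-surgery step with pycaffe, in the spirit of the Net Surgery tutorial. The file and layer names (memnet.prototxt, fc6/fc7/fc8 and their -conv counterparts) are assumptions and would have to be adapted to the actual MemNet prototxt:

    import caffe

    # Original MemNet, plus a copy of its prototxt in which the InnerProduct
    # layers were rewritten as Convolution layers (6x6 kernel for fc6-conv,
    # 1x1 for the rest) and renamed so Caffe skips them when loading weights.
    net = caffe.Net('memnet.prototxt', 'memnet.caffemodel', caffe.TEST)
    net_fc = caffe.Net('memnet_fullconv.prototxt', 'memnet.caffemodel', caffe.TEST)

    # Layer names below are AlexNet-style assumptions; adapt to the real prototxt.
    pairs = [('fc6', 'fc6-conv'), ('fc7', 'fc7-conv'), ('fc8', 'fc8-conv')]
    for fc, conv in pairs:
        # Same number of parameters, just a different shape: copy them over.
        net_fc.params[conv][0].data.flat = net.params[fc][0].data.flat   # weights
        net_fc.params[conv][1].data[...] = net.params[fc][1].data        # biases
    net_fc.save('memnet_fullconv.caffemodel')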
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input dimension is 227x227. They mention that a 451x451 input gives an 8x8 output, which implies a total stride of 32 for the converted network (227 + 7*32 = 451). So a simple example could be:
Scale 1: 227x227 → 1x1. (I guess they definitely use this scale.)
Scale 2: 259x259 → 2x2. (Wild guess)
Scale 3: 323x323 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
Each scale gives a memorability map at its own resolution (1x1, 2x2, 4x4 and 8x8 in this example).
So you'll just average these outputs to get your final 8x8 heatmap: upsample the lower-resolution maps to 8x8 and then average them.
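A small numpy sketch of that averaging step, assuming the four per-scale outputs from the example above; nearest-neighbour upsampling via np.kron is just one simple choice:

    import numpy as np

    # Per-scale memorability maps, as in the example scales above (values are made up).
    maps = [
        np.array([[0.6]]),          # 1x1 from the 227x227 pass
        np.random.rand(2, 2),       # 2x2
        np.random.rand(4, 4),       # 4x4
        np.random.rand(8, 8),       # 8x8 from the 451x451 pass
    ]

    target = 8
    upsampled = [np.kron(m, np.ones((target // m.shape[0], target // m.shape[1])))
                 for m in maps]                  # nearest-neighbour upsampling to 8x8
    heatmap = np.mean(upsampled, axis=0)         # final 8x8 memorability heatmap
    print(heatmap.shape)                         # (8, 8)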
From the paper, I assume that they use much higher-resolution scales, so their final heatmap will be around the same size as the original image. They write that it takes 1s on a "normal" GPU, which is quite a long time and also suggests that they upsample the input images to quite high resolutions.
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]

Hybrid SOM (with MLP)

Could someone please provide some information on how to properly combine a self organizing map with a multilayer perceptron?
I recently read some articles about this technique in comparison to regular MLPs and it performed way better in prediction tasks. So, I want to use the SOM as front-end for dimension reduction by clustering the input data and pass the results to an MLP back-end.
My current idea for implementing it is to train the SOM on a couple of training sets and determine the clusters. Afterwards, I would initialize the MLP with as many input units as there are SOM clusters. The next step would be to train the MLP using the SOM's output as input for the network - but which value? The weights of the BMU? The SOM's output for the cluster matching the input unit, and zeros for all other input units?
There is no single way of doing that. Let me list some possibilities:
The one you describe. But then, your MLP will need to have K*D inputs, where K is the number of clusters and D is the input dimension. There is no dimensionality reduction.
Similar to your idea, but instead of using the weights, just send 1 for the BMU and 0 for the remaining clusters. Then your MLP will need K inputs.
Same as above, but instead of 1 or 0, send the distance from the input vector to each cluster.
Same as above, but instead of the distance, compute a Gaussian activation for each cluster.
Since the SOM preserves topology, send only the 2D coordinates of the BMU (possibly normalized between 0 and 1). Then your MLP will need only 2 inputs and you achieve real extreme dimensionality reduction.
You can read about those ideas and some more here: Principal temporal extensions of SOM: Overview. It is not about feeding the output of a SOM to an MLP, but about feeding a SOM into itself; still, it will help you understand the various possibilities when trying to produce some output from a SOM.
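As an illustration of options 2-4 above, here is a small numpy sketch, assuming the trained SOM's unit weights are available as a K x D array (all values here are made up):

    import numpy as np

    K, D = 16, 10                        # 16 SOM units (clusters), 10-dimensional inputs
    som_weights = np.random.rand(K, D)   # stands in for the weights of a trained SOM
    x = np.random.rand(D)                # one input vector to encode for the MLP

    dists = np.linalg.norm(som_weights - x, axis=1)   # distance to every SOM unit
    bmu = np.argmin(dists)                            # index of the best matching unit

    one_hot = np.zeros(K)                             # option 2: 1 for the BMU, 0 elsewhere
    one_hot[bmu] = 1.0
    distance_code = dists                             # option 3: distances as MLP inputs
    sigma = 0.5
    gaussian_code = np.exp(-dists**2 / (2 * sigma**2))  # option 4: Gaussian activations

    # Any of these K-dimensional vectors can be fed to an MLP with K input units.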

3D coordinates as the output of a Neural Network

Neural Networks are mostly used to classify. So, the activation of a neuron in the output layer indicates the class of whatever you are classifying.
Is it possible (and correct) to design a NN to output 3D coordinates? That is, three output neurons, each taking values in a range such as [-1000.0, 1000.0].
Yes. You can use a neural network to perform linear regression, and more complicated types of regression, where the output layer has multiple nodes that can be interpreted as a 3-D coordinate (or a much higher-dimensional tuple).
To achieve this in TensorFlow, you would create a final layer with three output neurons, each corresponding to a different dimension of your target coordinates, then minimize the root mean squared error between the current output and the known value for each example.
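As a minimal sketch of that idea using the Keras API in TensorFlow (the input dimension, layer sizes and toy data are arbitrary choices, not taken from the answer above):

    import numpy as np
    import tensorflow as tf

    # Toy data: map 5 input features to a 3-D coordinate (values are made up).
    x_train = np.random.rand(256, 5).astype('float32')
    y_train = np.random.uniform(-1000.0, 1000.0, size=(256, 3)).astype('float32')

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(5,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(3),            # 3 linear outputs: x, y, z
    ])
    model.compile(optimizer='adam',
                  loss='mse',                # minimizing MSE also minimizes RMSE
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)

    pred = model.predict(x_train[:1])        # shape (1, 3): a predicted 3-D coordinate
    print(pred.shape)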