How's it even possible to use softmax for word2vec? - neural-network

How is it possible to use softmax for word2vec? Softmax outputs probabilities of all classes, which sum to 1, e.g. [0, 0.1, 0.8, 0.1]. But if my label is, for example, [0, 1, 0, 1, 0] (multiple correct classes), then it is impossible for softmax to output the correct value?
Should I use sigmoid instead? Or am I missing something?

I suppose you're talking about the Skip-Gram model (i.e., predicting the context word from the center word), because the CBOW model predicts the single center word and therefore assumes exactly one correct class.
Strictly speaking, if you were to train word2vec with the Skip-Gram model and an ordinary softmax loss, the correct label would be [0, 0.5, 0, 0.5, 0]. Alternatively, you can feed several examples per center word, with labels [0, 1, 0, 0, 0] and [0, 0, 0, 1, 0]. It's hard to say which one performs better, but the label must be a valid probability distribution per input example.
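The two labelings are in fact equivalent in terms of loss, because cross-entropy is linear in the target distribution. A small NumPy sketch (helper names are my own) illustrating this for a 5-word vocabulary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p_true, z):
    # -sum_i p_true[i] * log(softmax(z)[i])
    return -np.sum(p_true * np.log(softmax(z)))

rng = np.random.default_rng(0)
logits = rng.normal(size=5)  # scores for a 5-word vocabulary

# one example with a soft label over both context words
soft_label = np.array([0, 0.5, 0, 0.5, 0])
loss_soft = cross_entropy(soft_label, logits)

# two separate one-hot examples for the same center word
one_hot_a = np.array([0, 1, 0, 0, 0])
one_hot_b = np.array([0, 0, 0, 1, 0])
loss_split = 0.5 * (cross_entropy(one_hot_a, logits)
                    + cross_entropy(one_hot_b, logits))

print(np.isclose(loss_soft, loss_split))  # same average loss either way
```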
In practice, however, ordinary softmax is rarely used: there are too many classes, and computing the full distribution is too expensive and simply not needed (almost all probabilities are nearly zero all the time). Instead, researchers use sampled loss functions for training, which approximate the softmax loss but are much more efficient. The following loss functions are particularly popular:
Negative Sampling
Noise-Contrastive Estimation
These losses are more complicated than softmax, but if you're using TensorFlow, all of them are implemented (e.g. tf.nn.nce_loss and tf.nn.sampled_softmax_loss) and can be used just as easily.
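To give a feel for why sampling is cheap, here is a minimal NumPy sketch of the negative-sampling objective for a single (center, context) pair; the function and variable names are my own, not from any library:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, true_ctx_vec, neg_ctx_vecs):
    """Word2vec negative-sampling loss for one (center, context) pair.

    Maximizes log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c),
    pushing the true context word up and k sampled noise words down.
    """
    pos = np.log(sigmoid(true_ctx_vec @ center_vec))
    neg = np.sum(np.log(sigmoid(-(neg_ctx_vecs @ center_vec))))
    return -(pos + neg)

rng = np.random.default_rng(0)
dim, k = 50, 5
v_c = rng.normal(scale=0.1, size=dim)         # center word embedding
u_o = rng.normal(scale=0.1, size=dim)         # true context embedding
u_neg = rng.normal(scale=0.1, size=(k, dim))  # k sampled "noise" words

loss = negative_sampling_loss(v_c, u_o, u_neg)
print(loss)
```

The key point is that only k + 1 dot products are computed, instead of one per vocabulary word as a full softmax would require.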

Related

BCEWithLogitsLoss: Trying to get binary output for predicted label as a tensor, confused with output layer

Each element of my dataset has a multi-label target like [1, 0, 0, 1], with varying combinations of 1's and 0's. Since there are 4 labels, I set the output layer of my neural network to 4 neurons. With BCEWithLogitsLoss, calling model(inputs) gives me an output tensor like [3, 2, 0, 0], which does not match the format the target is expected to be in; yet when I change the number of output neurons to 2, I get a shape mismatch error. What needs to be done to fix this?
When using BCEWithLogitsLoss, you make one scalar prediction per binary output label.
In your example, you have 4 binary labels to predict, so your model outputs a 4-d vector, where each entry represents the prediction for one of the binary labels.
Using BCEWithLogitsLoss you implicitly apply Sigmoid to your outputs:
This loss combines a Sigmoid layer and the BCELoss in one single class.
Therefore, if you want the predicted probabilities of your model, you need to apply torch.sigmoid on top of your predictions. The sigmoid function converts your predicted logits to probabilities.
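The math behind that last step can be sketched in plain NumPy (the example logits are made up): the sigmoid maps each raw output to a per-label probability, and thresholding at 0.5 recovers hard 0/1 predictions in the same format as the target.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# raw model outputs (logits) for the 4 binary labels of one example
logits = np.array([3.0, -2.0, -0.5, 1.5])

probs = sigmoid(logits)            # what torch.sigmoid(model(inputs)) gives
preds = (probs > 0.5).astype(int)  # hard 0/1 labels, same format as the target

print(preds)  # [1 0 0 1]
```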

Does my Neural Net Vector Input Size Need to Match the Output Size?

I’m trying to use a neural network for binary classification. It consists of three layers: the first layer has three input neurons, the hidden layer has two neurons, and the output layer has three neurons that output a binary value of 1 or 0. Actually the output is usually a floating-point number, but it typically rounds to a whole number.
If the network only outputs vectors of 3, then shouldn't my input vectors be the same size? Otherwise, for classification, how else do you map the output to the input?
I wrote the neural network in Excel using VBA based on the following article: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/
So far it works exactly as described in the article. I don’t have access to a machine learning library at the moment so I’ve chosen to give this a try.
For example:
If the output of the network is [n, n ,n], does that mean that my input data has to be [n, n, n] also?
From what I read in here: Neural net input/output
It seems that's the way it should be. I'm not entirely sure though.
To put it simply:
for a regression task, your output usually has dimension [1] (if you predict a single value).
For a classification task, your output dimension should equal the number of classes (the outputs are probabilities, and their sum is 1).
So there is no need for input and output dimensions to be equal: a NN is just a projection from one dimension to another.
For example:
regression, predicting house prices: the input is [1, 10] (ten features of the property), the output is [1], the price;
classification, predicting whether the house will be sold: the input is [1, 11] (the same features plus the listed price), the output is [1, 2], the probabilities of class 0 (will not be sold) and class 1 (will be sold), e.g. [1, 0], [0, 1] or [0.5, 0.5]; this is binary classification.
Additionally, input and output dimensions are equal in some more specific tasks, for example in autoencoder models (where you encode your data into another dimension and then reconstruct it back in the original dimension).
Note also that the output dimension is the size of the output for a single example, not for the whole dataset.
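To make the "projection" point concrete, here is a tiny NumPy sketch of the house-sale classifier above, with randomly initialized (untrained) weights; layer sizes other than 11-in/2-out are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "will it sell?" classifier: 11 input features -> 2 class probabilities
W1 = rng.normal(size=(11, 8)); b1 = np.zeros(8)  # hidden layer
W2 = rng.normal(size=(8, 2));  b2 = np.zeros(2)  # output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=11)        # one property: 10 features + listed price
h = np.tanh(x @ W1 + b1)
probs = softmax(h @ W2 + b2)

print(x.shape, probs.shape)    # input and output sizes need not match
```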

Why is the softmax function necessary? Why not simple normalization?

I am not familiar with deep learning so this might be a beginner question.
In my understanding, the softmax function in multi-layer perceptrons is in charge of normalization and of distributing probability across the classes.
If so, why don't we use the simple normalization?
Let's say we get a vector x = (10, 3, 2, 1).
Applying softmax, the output will be y = (0.9986, 0.0009, 0.0003, 0.0001).
Applying simple normalization (dividing each element by the sum, 16),
the output will be y = (0.625, 0.1875, 0.125, 0.0625).
It seems like simple normalization could also distribute the probabilities.
So, what is the advantage of using softmax function on the output layer?
Normalization does not always produce probabilities: for example, it doesn't work when you have negative values, or when the sum of the values is zero.
Using the exponential of the logits changes that: it is in theory never zero, and it maps the full range of the logits into probabilities. So softmax is preferred because it actually works.
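The zero-sum failure case is easy to demonstrate; a short NumPy sketch (the example vector is my own) comparing the two approaches:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, -1.0])  # logits that sum to exactly zero

naive = z / z.sum()   # division by zero: infinities, not probabilities
probs = softmax(z)    # always a valid distribution

print(naive)
print(probs)
```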
This also depends on the training loss function. Many models are trained with a log-loss objective, so the values you see in that vector estimate the log of each probability. SoftMax then merely converts back to linear values and normalizes.
The empirical reason is simple: SoftMax is used where it produces better results.

Does the place of the sigmoid function matter in neural network?

I am trying to build a neural net in Python using Keras with a custom loss, and I was wondering whether having a sigmoid as the activation function of the last layer is the same as having the sigmoid at the beginning of the custom loss. Here is what I mean by that:
I have a feeling that in the second model the loss is calculated but not backpropagated through the sigmoid, whereas in the first model it is. Is that right?
Indeed, in the second case the backpropagation doesn't go through the sigmoid, and altering data inside the loss function is a really bad idea.
The reason it is bad is that you will then backpropagate an error on the output which is not the real error the network is making.
To explain with a simple case:
you have labels in binary form, say the tensor [0, 0, 1, 0].
If your sigmoid is inside your custom loss function, your raw outputs might look like [-100, 0, 20, 100]; the sigmoid in your loss will transform this into something approximately like [0, 0.5, 1, 1].
The error that is backpropagated will then be [0, -0.5, 0, -1]. The backpropagation will not take the sigmoid into account, and you will apply this error directly to the raw output. You can see that the magnitude of the error doesn't reflect the magnitude of the output's error at all: the last value is 100 and should be strongly negative, but the model will only backpropagate a small error of -1 on that layer.
To summarize, the sigmoid must be in the network so that the backpropagation takes it into account when backpropagating the error.
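The saturation described above can be reproduced numerically; a minimal NumPy sketch using the same label and output values as the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

targets = np.array([0.0, 0.0, 1.0, 0.0])
logits  = np.array([-100.0, 0.0, 20.0, 100.0])  # raw model outputs

p = sigmoid(logits)   # ~[0, 0.5, 1, 1] after the in-loss sigmoid
error = targets - p   # ~[0, -0.5, 0, -1]

# If the loss "hides" the sigmoid, this error is applied to the logits
# directly: the last logit is off by ~100, yet its error is only -1.
print(np.round(p, 3), np.round(error, 3))
```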

Plotting vector in 3D in Matlab

I'm studying Linear Algebra. I would like to visualize a vector [2, 1, 2] in 3D. I used the following command:
quiver3(0,0,0,2,1,2)
Either my understanding of linear algebra is off or I'm doing something wrong in MATLAB, but the plot looks to me as if it shows the vector [1.8, 0.9, 1.8].
By default, quiver3 will use whatever scaling that optimizes the display of the vectors.
quiver3(...,scale) automatically scales the vectors to prevent them from overlapping, and then multiplies them by scale. scale = 2 doubles their relative length, and scale = 0.5 halves them. Use scale = 0 to plot the vectors without the automatic scaling.
You'll want to specify the scale parameter as 0 to prevent this automatic scaling and to accurately represent the data that you provide:
quiver3(0, 0, 0, 2, 1, 2, 0);