If I'm using softmax in an RBM, do I need to use it in hidden units as well as in the visible ones?

As I understand it, when using a softmax over K values in the RBM's visible units, the hidden units stay binary.
If so, I'm not sure how to compute the contributions of the binary units to the visible ones. Am I supposed to relate the binary 0 state of a hidden unit to one specific state out of the K softmax states, and the 1 state to the other K-1 states? Or does a 0 in the hidden unit correspond to 0 in all K possible states of the visible unit (but doesn't that contradict the fact that exactly one of the K states must be on)?

I think I've figured out my misunderstanding: the softmax units behave as groups of binary subunits, and each subunit has its own weights to the hidden units. This means the weight matrix between the hidden and visible layers is three-dimensional instead of two-dimensional, and now it is obvious how to calculate the contributions.
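To make the bookkeeping concrete, here is a minimal NumPy sketch (the shapes and names M, K, H are illustrative assumptions, not from any particular library): each of the M visible softmax groups has K one-hot states, and the weights form an (M, K, H) tensor to the H binary hidden units.

import numpy as np

rng = np.random.default_rng(0)
M, K, H = 4, 5, 8                            # 4 softmax groups, 5 states each, 8 hidden units
W = 0.01 * rng.standard_normal((M, K, H))    # 3-D weight tensor
a = np.zeros((M, K))                         # visible biases (one per subunit)
b = np.zeros(H)                              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given_visible(v):
    # v is one-hot within each group: shape (M, K)
    return sigmoid(b + np.einsum('mk,mkh->h', v, W))

def visible_given_hidden(h):
    # unnormalized log-probabilities for each group's K states
    logits = a + np.einsum('h,mkh->mk', h, W)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # softmax per group

Sampling then proceeds as in an ordinary binary RBM, except that each visible group is sampled from its own K-way softmax rather than from a Bernoulli distribution.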

Related

Binary features with 0,0,0 in NN always return 0.5

Let's assume we have three columns with binary features (0, 1). One row in the dataset is 0,0,0 with label 0.
The problem I am facing is:
When I apply the weights to this row and pass the result through the sigmoid function, I always get 0.5, because the dot product of an all-zero input with any weights is 0, and sigmoid(0) = 0.5.
How can I overcome this issue?
In addition to multiplication by a weight matrix, you can also add a bias (which is how it is typically done in neural networks), and hence you won't necessarily get a zero vector. You could also add more hidden layers (but as I said, even adding a bias vector will resolve the issue you mentioned).
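A quick sketch of why the bias fixes this (the weight and bias values below are arbitrary, for illustration only):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.zeros(3)                   # the all-zero row
w = np.array([0.7, -1.2, 0.4])    # arbitrary weights

print(sigmoid(w @ x))             # 0.5: the dot product with zeros is 0
print(sigmoid(w @ x + (-2.0)))    # ~0.12: a learned bias shifts the output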

Cutoff on Neural Network regression predictions

Context: I have a set of documents, each of them with two associated probability values: the probability of belonging to class A and the probability of belonging to class B. The classes are mutually exclusive, and the probabilities add up to one. So, for instance, document D has the probabilities (0.6, 0.4) associated with it as ground truth.
Each document is represented by the tf-idf of the terms that it contains, normalized from 0 to 1. I also tried doc2vec (normalized from -1 to 1) and a couple of other methods.
I built a very simple Neural Network to predict this probability distribution.
- Input layer with as many nodes as features
- Single hidden layer with one node
- Output layer with softmax and two nodes
- Cross-entropy loss function

I also tried different update functions and learning rates.
This is the code I wrote using nolearn:
import lasagne
import nolearn.lasagne
from lasagne import layers

net = nolearn.lasagne.NeuralNet(
    layers=[('input', layers.InputLayer),
            ('hidden1', layers.DenseLayer),
            ('output', layers.DenseLayer)],
    input_shape=(None, X_train.shape[1]),
    hidden1_num_units=1,
    output_num_units=2,
    output_nonlinearity=lasagne.nonlinearities.softmax,
    objective_loss_function=lasagne.objectives.binary_crossentropy,
    max_epochs=50,
    on_epoch_finished=[es.EarlyStopping(patience=5, gamma=0.0001)],  # es: user-defined early-stopping helper
    regression=True,
    update=lasagne.updates.adam,
    update_learning_rate=0.001,
    verbose=2)
net.fit(X_train, y_train)
y_true, y_pred = y_test, net.predict(X_test)
My problem is: my predictions have a cutoff point and no prediction goes below that point (check the picture to understand what I mean).
This plot shows the difference between the true probability and my predictions. The closer a point is to the red line the better the prediction is. Ideally all the points would lie on the line. How can I solve this and why is this happening?
Edit: actually I solved the problem by simply removing the hidden layer:
net = nolearn.lasagne.NeuralNet(
    layers=[('input', layers.InputLayer),
            ('output', layers.DenseLayer)],
    input_shape=(None, X_train.shape[1]),
    output_num_units=2,
    output_nonlinearity=lasagne.nonlinearities.softmax,
    objective_loss_function=lasagne.objectives.binary_crossentropy,
    max_epochs=50,
    on_epoch_finished=[es.EarlyStopping(patience=5, gamma=0.0001)],
    regression=True,
    update=lasagne.updates.adam,
    update_learning_rate=0.001,
    verbose=2)
net.fit(X_train, y_train)
y_true, y_pred = y_test, net.predict(X_test)
But I still fail to understand why I had this problem and why removing the hidden layer solved it. Any ideas?
Here is the new plot:
I think your training set output values should be [0,1] or [1,0];
[0.6,0.4] is not suited for softmax/cross-entropy.

scale the loss value according to "badness" in caffe

I want to scale the loss value of each image based on how close/far the current prediction is to/from the correct label during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be smaller than if the network thinks it is a "car".
The way I am doing it is as follows:
1. I defined a matrix of the distances between the labels,
2. passed that matrix as a bottom to the "SoftmaxWithLoss" layer, and
3. multiplied each log(prob) by this value in forward_cpu to scale the loss according to badness.
However, I do not know what I should do in the backward_cpu part. I understand that the gradient (bottom_diff) has to be changed, but I am not quite sure how to incorporate the scale value here. According to the math, I have to scale the gradient by the scale factor (because it is just a scale), but I don't know how.
Also, it seems there is a loss layer in Caffe called "InfogainLoss" that does a very similar job, if I am not mistaken; however, the backward part of this layer is a little confusing:
bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;
I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like softmax loss in both the forward and backward passes?
It would be highly appreciated if someone could give me some pointers.
You are correct in observing that the scaling you are doing for the log(prob) is exactly what "InfogainLoss" layer is doing (You can read more about it here and here).
As for the derivative (back-prop): the loss computed by this layer is
L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )
If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), then, since the derivative of log(x) is 1/x, you'll see that
dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier's outputs, and a log-loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
Working a little with the "InfogainLoss" layer, I noticed that sometimes prob(j) can have a very small value, leading to an unstable estimation of the gradients.
Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
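As a rough illustration (a NumPy sketch of the math above, not Caffe's actual C++ code), differentiating the combined loss with respect to the raw predictions gives dL/dx_i = prob(i) * sum_j H[label, j] - H[label, i], which contains no division by prob and is therefore stable:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

def softmax_infogain(x, label, H):
    # forward and backward pass of softmax + infogain loss, w.r.t. raw x
    prob = softmax(x)
    loss = -np.dot(H[label], np.log(prob))   # L = -sum_j H[y,j] * log(prob(j))
    grad = prob * H[label].sum() - H[label]  # dL/dx, no 1/prob term
    return loss, grad

# sanity check: with the identity matrix this reduces to plain softmax loss,
# whose gradient is prob - one_hot(label)
x = np.array([2.0, 1.0, 0.1])
loss, grad = softmax_infogain(x, label=0, H=np.eye(3))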
PS,
Note that if you are going to use infogain loss for weighting, you should feed H (the infogain_mat) with label similarities, rather than distances.
Update:
I recently implemented this robust gradient computation and created this pull request. The PR was merged into the master branch in April 2017.

How can a well-trained ANN have a single set of weights that can represent multiple classes?

In multinomial classification, I'm using the softmax activation function for all non-linear units, and the ANN has k output nodes for k classes. Each of the k output nodes in the output layer is connected to all the weights in the preceding layer, kind of like the one shown below.
So, if the first output node intends to pull the weights in its favor, it will change all the weights that precede this layer, and the other output nodes will also pull, usually in a direction that contradicts the one the first node was pulling in. It seems like a tug of war over a single set of weights. So, do we need a separate set of weights (one that includes weights for every node of every layer) for each of the output classes, or is there a different architecture for this? Please correct me if I'm wrong.
Each node has its set of weights. Implementations and formulas usually use matrix multiplications, which can make you forget the fact that, conceptually, each node has its own set of weights, but they do.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
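For example (a small NumPy sketch; the sizes are arbitrary), each row of the weight matrix is exactly one neuron's private weight vector:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)           # outputs of the previous layer
W = rng.standard_normal((3, 5))      # 3 neurons, each with its own 5 weights
b = np.zeros(3)                      # one bias per neuron

layer_out = W @ x + b                # the usual matrix form...
node0_out = W[0] @ x + b[0]          # ...but row 0 is just neuron 0's weights
assert np.isclose(layer_out[0], node0_out)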
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: these weights will all increase or decrease in order to approach the correct values (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
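Here is a sketch of that chain-rule step in NumPy, reusing the output deltas from the example above (the hidden activations and weights are made-up numbers, and sigmoid hidden units are assumed):

import numpy as np

delta_out = np.array([-0.12, 0.12, 0.04, 0.5])   # y^ - y from above

rng = np.random.default_rng(0)
W_out = rng.standard_normal((4, 3))   # weights from 3 hidden to 4 output nodes
h = np.array([0.3, 0.7, 0.5])         # hidden activations (sigmoid outputs)

# each output neuron pulls the hidden layer back through its own weights;
# the pulls are summed, then scaled by sigmoid'(z) = h * (1 - h)
delta_hidden = (W_out.T @ delta_out) * h * (1 - h)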
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.

Which scaling technique does it use?

I have a matrix X of size 100×2000 (double). I want to know which kind of scaling technique is applied to X by the following command, and why it does not use the z-score for scaling:
X = X./repmat(sqrt(sum(X.^2)),size(X,1),1);
That scaling comes from linear algebra: it is what we call normalizing to a unit vector. Assuming that each row is an observation and each column is a feature, the command rescales each feature over all observations so that the overall length / magnitude of that feature, taken across all observations, is set to 1.
The denominator computes, for each feature, the norm or magnitude of that feature over all observations: sum(X.^2) sums the squared entries down each column, and sqrt takes the square root of each column sum. Each feature value of each observation is then divided by its feature's magnitude.
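In NumPy terms (an equivalent sketch of the MATLAB one-liner above):

import numpy as np

X = np.random.default_rng(0).standard_normal((100, 2000))

# divide each column (feature) by its Euclidean norm over the 100 observations
X_unit = X / np.sqrt((X ** 2).sum(axis=0, keepdims=True))

assert np.allclose(np.linalg.norm(X_unit, axis=0), 1.0)   # unit-length columns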
The reason unit vectors are often employed is to describe a point in feature space with respect to a set of basis vectors. Normalizing to unit vectors gives you the most compact way to represent one component in feature space, so what's probably happening here is that the observations are being transformed so that each component / feature is expressed in terms of a set of basis vectors, where each basis vector corresponds to one feature in the data.
Check out the Wikipedia article on Unit Vectors for more details: http://en.wikipedia.org/wiki/Unit_vector