How can a well trained ANN have a single set of weights that can represent multiple classes?

How can a well trained ANN have a single set of weights that can represent multiple classes? - neural-network

In multinomial classification, I'm using soft-max activation function for all non-linear units and ANN has 'k' number of output nodes for 'k' number of classes. Each of the 'k' output nodes present in output layer is connected to all the weights in preceding layer, kind of like the one shown below.
So, if the first output node intends to pull the weights in it's favor, it will change all the weights that precede this layer and the other output nodes will also pull which usually contradicts to the direction in which the first one was pulling. It seems more like a tug of war with single set of weights. So, do we need a separate set of weights(,which includes weights for every node of every layer) for each of the output classes or is there a different form of architecture present? Please, correct me if I'm wrong.

Each node has its set of weights. Implementations and formulas usually use matrix multiplications, which can make you forget the fact that, conceptually, each node has its own set of weights, but they do.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: these weights will all increase or decrease in order to approach the correct values (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.

Related

Does my Neural Net Vector Input Size Need to Match the Output Size?

I’m trying to use a Neural Network for purposes of binary classification. It consist of three layers. The first layer has three input neurons, the hidden layer has two neurons, and the output layer has three neurons that output a binary value of 1 or 0. Actually the output is usually a floating point number, but it typically rounds up to a whole number.
If the network only outputs vectors of 3, then shouldn't my input vectors be the same size? Otherwise, for classification, how else do you map the output to the input?
I wrote the neural network in Excel using VBA based on the following article: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/
So far it works exactly as described in the article. I don’t have access to a machine learning library at the moment so I’ve chosen to give this a try.
For example:
If the output of the network is [n, n ,n], does that mean that my input data has to be [n, n, n] also?
From what I read in here: Neural net input/output
It seems that's the way it should be. I'm not entirely sure though.

To speak simple,
for regression task, your output usually has the dimension [1] (if you predict single value).
For the classification task, your output should have the same number of dimensions equal to the number of classes you have (outputs are probabilities, the sum of them = 1).
So, there is no need to have equal dimensions of input and output. NN is just a projection of one dimension to another.
For example,
regression, we predict house prices: input is [1, 10] (to features of the property), the output is [1] - price
classification, we predict class (will be sold or not): input is [1, 11] (same features + listed price), output is [1, 2] (probability of class 0 (will be not sold) and 1 (will be sold); for example, [1; 0], [0; 1] or [0.5; 0.5] and so on; it is binary classification)
Additionally, equality of input-output dimensions exists in more specific tasks, for example, autoencoder models (when you need to present your data in other dimension and then represent it back, to the original dimension).
Again, the output dimension is the size of outputs for 1 batch. Only one, not of the whole dataset.

Architecture of the neural network

I was reading a paper and the authors described their network as follows:
"To train the corresponding deep network, a fully connected network with one
hidden layer is used. The network has nine binary input nodes. The hidden layer contains one sigmoid node, and in the output layer there is one inner product
function. Thus, the network has 10 variables."
The network is used to predict a continuous number (y). My problem is, I do not understand the structure of the network after the sigmoid node. What does the output layer do? What is the inner product used for?

Usually, the pre-activation functions per neuron are a combination of an inner product (or dot product in vector-vector multiplication) and one addition to introduce a bias. A single neuron can be described as
z = b + w1*x1 + x2*x2 + ... + xn*xn
= b + w'*x
h = activation(z)
where b is an additive term (the neuron's bias) and each h is the output of one layer and corresponds to the input of the following layer. In the case of the "output layer", it is that y = h. A layer might also consist of multiple neurons or - like in your example - only of single neurons.
In the described case, it seems like no bias is used. I understand it as follows:
For each input neuron x1 to x9, a single weight is used, nothing fancy here. Since there are nine inputs, this makes 9 weights, resulting in something like:
hidden_out = sigmoid(w1*x1 + w2*x2 + ... + w9*x9)
In order to connect the hidden layer to the output, the same rule applies: The output layer's input is weighted and then summed over all inputs. Since there is only one input, only one weight is to be "summed", such that
output = w10*hidden_out
Keep in mind that the sigmoid function squashes its input onto an output range of 0..1, so multiplying it with a weight re-scales it to your required output range.

scale the loss value according to "badness" in caffe

I want to scale the loss value of each image based on how close/far is the "current prediction" to the "correct label" during the training. For example if the correct label is "cat" and the network think it is "dog" the penalty (loss) should be less than the case if the network thinks it is a "car".
The way that I am doing is as following:
1- I defined a matrix of the distance between the labels,
2- pass that matrix as a bottom to the "softmaxWithLoss" layer,
3- multiply each log(prob) to this value to scale the loss according to badness in forward_cpu
However I do not know what should I do in the backward_cpu part. I understand the gradient (bottom_diff) has to be changed but not quite sure, how to incorporate the scale value here. According to the math I have to scale the gradient by the scale (because it is just an scale) but don't know how.
Also, seems like there is loosLayer in caffe called "InfoGainLoss" that does very similar job if I am not mistaken, however the backward part of this layer is a little confusing:
bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;
I am not sure why infogain_mat[] is divide by prob rather than being multiply by! If I use identity matrix for infogain_mat isn't it supposed to act like softmax loss in both forward and backward?
It will be highly appreciated if someone can give me some pointers.

You are correct in observing that the scaling you are doing for the log(prob) is exactly what "InfogainLoss" layer is doing (You can read more about it here and here).
As for the derivative (back-prop): the loss computed by this layer is
L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )
If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), you'll notice that the derivative of log(x) is 1/x this is why you see that
dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
Now, why don't you see similar expression in the back-prop of "SoftmaxWithLoss" layer?
well, as the name of that layer suggests it is actually a combination of two layers: softmax that computes class probabilities from classifiers outputs and a log loss layer on top of it. Combining these two layer enables a more numerically robust estimation of the gradients.
Working a little with "InfogainLoss" layer I noticed that sometimes prob(j) can have a very small value leading to unstable estimation of the gradients.
Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
PS,
Note that if you are going to use infogain loss for weighing, you should feed H (the infogain_mat) with label similarities, rather than distances.
Update:
I recently implemented this robust gradient computation and created this pull request. This PR was merged to master branch on April, 2017.

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value have to be backpropagated, since in the results of the numerical gradient computation are not dependent of the target data (but only of the input), while the error backpropagation depends on the target data. So, what is the value that should be used in the backpropagation part of gradient check?

Backpropagation is performed after computing the gradients analytically and then using those formulas while training. A neural network is essentially a multivariate function, where the coefficients or the parameters of the functions needs to be found or trained.
The definition of a gradient with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, and from the definition of the first derivative we can approximate the gradient of a function, including a neural network.
To check if your analytical gradient for your neural network is correct or not, it is good to check it using the numerical method.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
For i in 0 to number of rows in w_l
For j in 0 to number of columns in w_l
w_l_minus = w_l; # Copy all the weights
w_l_minus[i,j] = w_l_minus[i,j] - eps; # Change only this parameter
w_l_plus = w_l; # Copy all the weights
w_l_plus[i,j] = w_l_plus[i,j] + eps; # Change only this parameter
cost_minus = cost of neural net by replacing w_l by w_l_minus
cost_plus = cost of neural net by replacing w_l by w_l_plus
w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the (f(x+h) - f(x-h))/2h, which seems to work better for me.
Note that, you mentiond: "since in the results of the numerical gradient computation are not dependent of the target data", this is not true. As when you find the cost_minus and cost_plus above, the cost is being computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.

Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network made of 4 input neurons, 1 hidden layer made of 20 neurons and a 7 neuron output layer.
I'm trying to train it for a bcd to 7 segment algorithm. My data is normalized 0 is -1 and 1 is 1.
When the output error evaluation happens, the neuron saturates wrong. If the desired output is 1 and the real output is -1, the error is 1-(-1)= 2.
When I multiply it by the derivative of the activation function error*(1-output)*(1+output), the error becomes almost 0 Because of 2*(1-(-1)*(1-1).
How can I avoid this saturation error?

Saturation at the asymptotes of of the activation function is a common problem with neural networks. If you look at a graph of the function, it doesn't surprise: They are almost flat, meaning that the first derivative is (almost) 0. The network cannot learn any more.
A simple solution is to scale the activation function to avoid this problem. For example, with tanh() activation function (my favorite), it is recommended to use the following activation function when the desired output is in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1- tanh( 2/3 * x))
This will force the gradients into the most non-linear value range and speed up the learning. For all the details I recommend reading Yann LeCun's great paper Efficient Back-Prop.
In the case of tanh() activation function, the error would be calculated as
error = 2/3 * (1.7159 - output^2) * (teacher - output)

This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of two extremes. It's been a while since I have worked with Artificial Neural Networks but if I remember correctly, this (among many other things) is one of the limitations of using the simple back-propagation algorithm.
You could add a Momentum factor to make sure there is some correction based off previous experience, even when the derivative is zero.
You could also train it by epoch, where you accumulate the delta values for the weights before doing the actual update (compared to updating it every iteration). This also mitigates conditions where the delta values are oscillating between two values.
There may be more advanced methods, like second order methods for back propagation, that will mitigate this particular problem.
However, keep in mind that tanh reaches -1 or +1 at the infinities and the problem is purely theoretical.

Not totally sure if I am reading the question correctly, but if so, you should scale your inputs and targets between 0.9 and -0.9 which would help your derivatives be more sane.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse