I'm currently interested in using Cross Entropy Error when performing the BackPropagation algorithm for classification, where I use the Softmax Activation Function in my output layer.
From what I gather, with Cross Entropy and Softmax you can drop the derivative term so the output-layer error looks like this:
Error = targetOutput[i] - layerOutput[i]
This differs from the Mean Squared Error of:
Error = Derivative(layerOutput[i]) * (targetOutput[i] - layerOutput[i])
So, can you only drop the derivative term when your output layer uses the Softmax Activation Function for classification with Cross Entropy? For instance, if I were to do Regression using the Cross Entropy Error (with, say, the TANH activation function), I would still need to keep the derivative term, correct?
I haven't been able to find an explicit answer on this and I haven't attempted to work out the math on this either (as I am rusty).
You do not use the derivative term in the output layer because there you have the 'real' error (the difference between your output and your target); in the hidden layers you have to calculate an approximate error using backpropagation.
What we are doing is an approximation: we take the derivative of the next layer's error with respect to the current layer's weights instead of using the current layer's error (which is unknown).
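To make this concrete, here is the standard derivation (a textbook result, in my own notation rather than the question's): with softmax outputs and the cross-entropy error, the output-layer gradient with respect to the pre-activations z_i collapses:

( y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \quad E = -\sum_k t_k \log(y_k), \quad \frac{\partial E}{\partial z_i} = \sum_k \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_i} = y_i - t_i )

So the activation-function derivative cancels out of the output-layer delta for this particular pairing; with squared error and a TANH output no such cancellation happens, and the Derivative(layerOutput[i]) factor has to stay, as the question suspected.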
Best regards,
This question is quite general and mathematical.
I'm using a neural network with a single hidden layer (10 neurons) to approximate a function with 2 input and 2 output variables.
The sigmoid function is used as activation function.
Later, I'm using the derivative for something else. I approximate the derivative with a numerical method. This method uses outputs of the neural network.
I was wondering if the derivative of the approximated function can be obtained easily, since the learning algorithm already uses derivatives for every neuron.
I could get rid of the numerical method for the derivative if there were a simple way to get it out of the neural net.
I'm thinking of the chain rule, but I'm not sure if this is the right way to go and how to use it correctly.
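For what it's worth, the chain rule does give this directly. A minimal sketch under assumed conventions (W1 is the hidden weight matrix, b1 the hidden biases, W2 the output weights, and a linear output layer is assumed; with a sigmoid output layer, each row k would additionally be multiplied by y_k * (1 - y_k)):

function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// Jacobian dy/dx of a 2-in / 2-out network with one sigmoid hidden layer.
// Chain rule: dy_k/dx_j = sum_i W2[k][i] * a_i * (1 - a_i) * W1[i][j]
function networkJacobian(W1, b1, W2, x) {
  const h = W1.map((row, i) => row.reduce((s, w, j) => s + w * x[j], b1[i]));
  const a = h.map(sigmoid); // hidden activations
  return W2.map(row2 =>
    x.map((_, j) =>
      row2.reduce((s, w2, i) => s + w2 * a[i] * (1 - a[i]) * W1[i][j], 0)
    )
  );
}

This reuses the same sigmoid-derivative term a*(1-a) that the learning algorithm already computes, so no numerical differentiation is needed.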
I am learning to build neural networks for regression problems. It works well for approximating linear functions. A setup with 1-5-1 units and linear activation functions in the hidden and output layers does the trick, and the results are fast and reliable. However, when I try to feed it simple quadratic data (f(x) = x*x), here is what happens:
With a linear activation function, it tries to fit a linear function through the dataset
And with the TANH function it tries to fit a TANH curve through the dataset.
This makes me believe that the current setup is inherently unable to learn anything but a linear relation, since it repeats the shape of the activation function on the chart. But this may not be true, because I've seen other implementations learn curves perfectly well. So I may be doing something wrong. Please provide your guidance.
About my code
My weights are randomized in (-1, 1) and the inputs are not normalized. The dataset is fed in random order. Changing the learning rate or adding layers does not change the picture much.
I've created a jsfiddle; the place to play with it is this function:
function trainingSample(n) {
  return [[n], [n]];
}
It produces a single training sample: an array of an input vector array and a target vector array.
In this example it produces an f(x)=x function. Modify it to be [[n], [n*n]] and you've got a quadratic function.
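For example, the quadratic version would be:

function trainingSample(n) {
  // target is now n squared, so the net has to learn f(x) = x*x
  return [[n], [n * n]];
}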
The play button is at the upper right, and there also are two input boxes to manually input these values. If target (right) box is left empty, you can test the output of the network by feedforward only.
There is also a configuration file for the network in the code, where you can set learning rate and other things. (Search for var Config)
It's occurred to me that in the setup I am describing, it is impossible to learn non-linear functions, because of the choice of features. Nowhere in the forward pass is there an input dependency of power higher than 1; that's why I am seeing a snapshot of my activation function in the output. Duh.
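For the purely linear setup the collapse is easy to see as a worked equation (standard algebra, not from the original post):

( W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) )

Any stack of linear layers is itself a single linear map, so no choice of weights can reproduce f(x) = x*x.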
I have been reading this ebook about ANNs: https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
and I have a doubt about the effect of the sigmoid function when calculating ErrorB. The text says that if I have a threshold neuron I can use:
Target-Output
but because I have a sigmoid function involved I should include the factor:
Output(1-Output)
and end up with:
ErrorB=OutputB(1-OutputB)(TargetB-OutputB)
I mean, why should I include the O(1-O) part? I have tried it with different values, but I really do not get the intuition for why it should be that way.
Any help?
Thanks
As Kelu stated, that part of the equation is based on derivatives of your transfer function (in this case sigmoid). To understand why you need derivatives, you need to understand how the delta rule works (*):
Your overall goal is to minimize the error in the network's output using gradient descent. Gradient descent itself tries to find a minimum in the error function (E) by taking steps proportional to the negative of the gradient. A gradient is simply the derivative, and the reason you're working with derivatives mathematically is that gradients point in the direction of the greatest rate of increase of the (error) function. Conclusion: since you want to minimize the error, you go the opposite way of the gradient.
This is the intuitive reason for using gradients. If you want the mathematical derivation, you should check this basic wiki article (additional comment as it's not mentioned anywhere: the g'(x) in the article is the first derivative of g(x))
Other transfer functions can be used, e.g. linear (in this case there is no g'(x) term as the derivative is simply a constant) or hyperbolic tangent in which case the derivative is something different again.
(*) The equation is derived from the following equation, where you start by minimizing the error of the output:
( E = \frac{1}{2}(Target - Output)^2 )
Differentiating E with respect to the neuron's net input (and negating, since we descend the gradient), with Output = sigmoid(net), produces the Output(1-Output)(Target-Output) delta above.
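As a sketch in code (using the ebook's B-suffix naming from above):

function errorB(targetB, outputB) {
  // ErrorB = OutputB * (1 - OutputB) * (TargetB - OutputB)
  return outputB * (1 - outputB) * (targetB - outputB);
}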
It is like that because Output(1-Output) is the derivative of the sigmoid function (simplified). In general, this part is based on derivatives; you can try functions other than sigmoid, but then you have to use their derivatives too in order to learn properly.
If you want you can take a look at my implementation (it's far from perfect, but maybe you will get some idea from it ;)), it's a simple project I made on my university - https://github.com/kelostrada/neuron-network
When using the sigmoid activation function I understand that the derivative is calculated by output*(1-output). But how is this determined? How do I get from the sigmoid function 1/(1+e^(-x)) to determining that the derivative should be output*(1-output)?
For example if I want to determine the derivative of atan(x) or atan(x) with output scaled to the range 0-1 (atan(x)*0.3183098861837907+0.5), how do I determine this derivative for use in training the neural net?
Well, it seems to me like this is more of a maths-related question than a coding one, but here you go anyway.
For the sigmoid function:
( f(x) = \frac{1}{1 + e^{-x}} )
If you compute its derivative:
( f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} )
and
( \frac{e^{-x}}{1 + e^{-x}} = 1 - f(x) )
Thus:
( f'(x) = f(x)(1 - f(x)) )
Remember, x is the input and f(x) is the output, which is why you get your "output*(1-output)".
For other activation functions, you'll just have to compute the derivative first and then code it. Usually though, it won't have a nice form like the one above.
For the other part of your question, what you have is something of this form:
( g(x) = a \cdot u(x) + b )
If you compute its derivative (and this will work for any function u(x) that is scaled and offset), you get:
( g'(x) = a \cdot u'(x) )
Put simply, the b part is a constant so it disappears when derived and the a is a constant coefficient so it remains unchanged when derived.
In your case, since:
( g(x) = 0.3183098861837907 \cdot atan(x) + 0.5 ) and ( atan'(x) = \frac{1}{1 + x^2} )
the derivative you're looking for is:
( g'(x) = \frac{0.3183098861837907}{1 + x^2} )
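A small sketch of how both derivatives might be coded for training (hypothetical helper names):

function sigmoidDerivFromOutput(output) {
  // f'(x) = f(x) * (1 - f(x)), expressed with the already-stored output
  return output * (1 - output);
}

function scaledAtanDeriv(x) {
  // d/dx [0.3183098861837907 * atan(x) + 0.5] = 0.3183098861837907 / (1 + x^2)
  return 0.3183098861837907 / (1 + x * x);
}

Note that, unlike the sigmoid case, the scaled-atan derivative needs the original input x rather than the unit's output.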
On a personal note, this is pretty simple maths and I would strongly suggest you focus on understanding these before you start using neural networks ;)
Cheers
edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different implementations:
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to find local optima in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain with softmax (or whether I can use gradient descent at all). After a bit of researching, I couldn't find the answer, so I decided to compute the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit; the derivative of softmax provided by the toolkit returned a square matrix of size n x n, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret that output matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is done incorrectly, but I have no idea how to fix this.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
( E = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn}) )
where log is the natural logarithm, N is the number of training examples, and K the number of classes (and thus the number of units in the output layer). t_kn is the binary coding (0 or 1) of the k'th class in the n'th training example, and y_kn is the corresponding network output.
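As an aside on the n x n matrix from the MATLAB toolkit (a standard result, not specific to this answer): the full derivative of softmax is a Jacobian, because every output depends on every pre-activation:

( \frac{\partial y_k}{\partial z_i} = y_k(\delta_{ki} - y_i) )

Its diagonal (k = i) is exactly the y_i(1 - y_i) = softmax(x) - softmax(x)^2 computed by hand above. Multiplying this Jacobian into the derivative of the cross-entropy error is what collapses the whole output-layer gradient to output - target.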
Showing that the gradient is correct might be a good exercise, I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
( f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2) )
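A minimal sketch of that check in code (lossAt is a hypothetical stand-in for your error as a function of a single weight):

function numericGradient(f, x) {
  const eps = 1e-5;
  // central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
  return (f(x + eps) - f(x - eps)) / (2 * eps);
}

// Compare against the analytic gradient for a few randomly chosen weights w0;
// the difference should be on the order of eps^2:
// Math.abs(numericGradient(w => lossAt(w), w0) - analyticGradient(w0))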
Please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, /mydesire/neural folder has several softmax classifiers, some with softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced Dynamic-System Simulation, Wiley, 2007
GAK
Look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
The softmax derivative is: dyi/dzi = yi * (1.0 - yi); (this is the diagonal of the full Jacobian discussed above).