I wonder if the mlperr from the Netlab package is calculating the mean squared error.
The documentation states that it depends on the output units' activation function. How does that make sense? Shouldn't it be independent of that?
I also tried to read the source code of mlperr and I didn't see any signs that could make me think that this is a MSE error function.
Any Netlab expert here that can offer some insights? Thanks! :)
This method is used to evaluate the multilayer perceptron according to its output activation. It assumes the most common usage of each, so:
for linear output it returns the sum-of-squares error (the MSE up to a constant factor)
0.5*sum(sum((y - t).^2))
for logistic output it returns the cross entropy error
-sum(sum(t.*log(y) + (1 - t).*log(1 - y)))
for softmax output it returns the corresponding (multi-class) cross entropy error
-sum(sum(t.*log(y)))
Whole source can be seen here.
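To see how the three cases differ, here is a minimal Python sketch of the same formulas. It is not Netlab code: the function names are made up for illustration, and y and t are assumed to be lists of rows (one row per sample) holding network outputs and targets.

```python
import math

def sum_of_squares_error(y, t):
    """Linear outputs: 0.5 * sum over all entries of (y - t)^2."""
    return 0.5 * sum((yi - ti) ** 2
                     for y_row, t_row in zip(y, t)
                     for yi, ti in zip(y_row, t_row))

def binary_cross_entropy(y, t):
    """Logistic outputs: -sum(t*log(y) + (1 - t)*log(1 - y)); needs 0 < y < 1."""
    return -sum(ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
                for y_row, t_row in zip(y, t)
                for yi, ti in zip(y_row, t_row))

def multiclass_cross_entropy(y, t):
    """Softmax outputs: -sum(t*log(y)); needs y > 0."""
    return -sum(ti * math.log(yi)
                for y_row, t_row in zip(y, t)
                for yi, ti in zip(y_row, t_row))

if __name__ == "__main__":
    print(sum_of_squares_error([[0.5, 1.5]], [[1.0, 1.0]]))
    print(binary_cross_entropy([[0.5]], [[1.0]]))
    print(multiclass_cross_entropy([[0.25, 0.75]], [[0.0, 1.0]]))
```

Dividing the first value by the number of entries (and dropping the factor 0.5) would give the mean squared error, so minimizing one also minimizes the other.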
I came across some different error calculation functions for backpropagation:
Squared error function from http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
or a nice explanation for the derivation for the BP loss function
Error = Output(i) * (1 - Output(i)) * (Target(i) - Output(i))
Now, I'm wondering how many more there are, and what difference in effect it has on training?
Also, since I understand that the second example uses the derivative of the activation function used by the layer, does the first one also do this in some way? And would that be true for any loss function (if there are more)?
Finally, how to know which one to use, and when?
This was a very broad question, but I can shed some light on the error / cost function part.
Cost functions
There are many different cost functions that can be applied when working with neural networks. There are no neural-network-specific cost functions. The most common cost functions in NNs are probably the Mean Squared Error (MSE) and the Cross Entropy cost function. The latter is often the most appropriate when working with logistic or softmax output layers. The MSE cost function, on the other hand, is convenient since it does not require the output values to be in the range [0, 1].
The different cost functions exhibit different convergence properties and have their own pros and cons. You'll have to read up on those that are interesting to you.
List of cost functions
Danielle Ensign has compiled a short, nice list of cost functions over at CrossValidated.
Sidenote
You have mixed up the derivative of the squared error function with something else. The equation you've described as the derivative of the error function is actually the derivative of the error function times the derivative of your output layer's activation function. This multiplication calculates the delta of the output layer.
The squared error function and its derivative are defined as:
While the sigmoid activation function and its derivative are defined as:
The delta of the output layer is defined as:
And this is true for all cost functions.
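As a small Python sketch of that product, assume the common conventions E = 0.5 * (t - y)^2 for the squared error and y = sigmoid(net) for the output. Both are assumptions made here, and the function names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime_from_output(y):
    """Derivative of the sigmoid, written in terms of its output y."""
    return y * (1.0 - y)

def squared_error_derivative(y, t):
    """d/dy of 0.5 * (t - y)^2."""
    return y - t

def output_delta(net, t):
    """Cost derivative times activation derivative, i.e. dE/dnet."""
    y = sigmoid(net)
    return squared_error_derivative(y, t) * sigmoid_prime_from_output(y)

if __name__ == "__main__":
    net, target = 0.4, 1.0
    y = sigmoid(net)
    print(output_delta(net, target))
    # Same magnitude, opposite sign: Output * (1 - Output) * (Target - Output)
    print(y * (1.0 - y) * (target - y))
```

With these conventions the expression Output * (1 - Output) * (Target - Output) from the question is the negative of this delta, which is the direction a gradient descent update moves in.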
What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. true probabilities sum to 1) in FANN, since it doesn't seem to have an option for softmax output?
My understanding is that with sigmoid outputs, as if doing 'labeling', I wouldn't be getting the correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although it is not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are only interested in the most likely classification, you could just take the output with the highest score.
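The two cruder approaches could look roughly like this in Python. This is only a sketch applied to the raw network outputs after running the network, not FANN API code. It assumes the outputs are non-negative scores (for example from sigmoid output units in the range 0 to 1), and the helper names are hypothetical.

```python
def renormalise(outputs):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(outputs)
    if total <= 0:
        # No usable evidence: fall back to a uniform distribution.
        return [1.0 / len(outputs)] * len(outputs)
    return [o / total for o in outputs]

def most_likely_class(outputs):
    """Index of the output unit with the highest score."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

if __name__ == "__main__":
    scores = [0.5, 0.5, 1.0]
    print(renormalise(scores))
    print(most_likely_class(scores))
```

Note that the renormalised values are not calibrated probabilities in the way a trained softmax output would be. They only guarantee that the numbers sum to 1.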
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.
I have been reading this ebook about ANNs: https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
and I have a doubt about the effect of the sigmoid function when calculating the errorB. The text says that if I have a threshold neuron I can use:
Target-Output
but because I have a sigmoid function involved I should add:
Output(1-Output)
and end up with:
ErrorB=OutputB(1-OutputB)(TargetB-OutputB)
I mean, why should I add the O(1-O) part? I have tried different values, but I really do not get the intuition for why it should be that way.
Any help?
Thanks
As Kelu stated, that part of the equation is based on derivatives of your transfer function (in this case sigmoid). To understand why you need derivatives, you need to understand how the delta rule works(*):
Your overall goal is to minimize the error in the network's output using gradient descent. Gradient descent tries to find a minimum of the error function (E) by taking steps proportional to the negative of the gradient. The gradient is the vector of partial derivatives of E with respect to the weights, and the reason you work with derivatives is that the gradient points in the direction of the greatest rate of increase of the (error) function. Conclusion: since you want to minimize the error, you go the opposite way of the gradient.
This is the intuitive reason for using gradients. If you want the mathematical derivation, you should check this basic wiki article (additional comment as it's not mentioned anywhere: the g'(x) in the article is the first derivative of g(x))
Other transfer functions can be used, e.g. linear (in this case there is no g'(x) term as the derivative is simply a constant) or hyperbolic tangent in which case the derivative is something different again.
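The derivative terms for these three transfer functions can be checked numerically. The following Python sketch compares each analytic derivative with a central finite difference. The function names are illustrative and the step size h is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # the Output * (1 - Output) term

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

def linear(x):
    return x

def linear_prime(x):
    return 1.0                    # constant, so it drops out of the update

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    x = 0.3
    for name, f, f_prime in [("sigmoid", sigmoid, sigmoid_prime),
                             ("tanh", math.tanh, tanh_prime),
                             ("linear", linear, linear_prime)]:
        print(name, f_prime(x), numerical_derivative(f, x))
```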
(*) Equation is derived from following equation where you start by minimizing the error of the output:
That is because Output(1-Output) is the derivative of the sigmoid function (in simplified form). In general, this part is based on derivatives: you can try functions other than the sigmoid, but then you have to use their derivatives too to get correct weight updates.
If you want, you can take a look at my implementation (it's far from perfect, but maybe you will get some idea from it ;)). It's a simple project I made at my university: https://github.com/kelostrada/neuron-network
I'm currently interested in using Cross Entropy Error when performing the BackPropagation algorithm for classification, where I use the Softmax Activation Function in my output layer.
From what I gather, you can drop the derivative to look like this with Cross Entropy and Softmax:
Error = targetOutput[i] - layerOutput[i]
This differs from the Mean Squared Error of:
Error = Derivative(layerOutput[i]) * (targetOutput[i] - layerOutput[i])
So, can you only drop the derivative term when your output layer is using the Softmax Activation Function for classification with Cross Entropy? For instance, if I were to do Regression using the Cross Entropy Error (with say TANH activation function) I would still need to keep the derivative term, correct?
I haven't been able to find an explicit answer on this and I haven't attempted to work out the math on this either (as I am rusty).
With softmax outputs and cross entropy you do not need a separate derivative term in the output layer, since there you get the 'real' error (the difference between your output and your target). In the hidden layers you have to calculate the approximate error using backpropagation.
What we are doing there is an approximation: we take the derivative of the next layer's error with respect to the weights of the current layer, instead of the error of the current layer (which is unknown).
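For the softmax plus cross entropy case asked about, the simplification can be checked numerically. The Python sketch below assumes a one-hot target vector t and pre-activations z of the output layer (both names chosen here for illustration), and compares the closed form y - t with a finite-difference gradient of the cross entropy with respect to z.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t))

def output_error_softmax_ce(z, t):
    """Gradient of the cross entropy w.r.t. z; valid when t sums to 1."""
    return [yi - ti for yi, ti in zip(softmax(z), t)]

def numerical_gradient(z, t, h=1e-6):
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        diff = cross_entropy(softmax(z_plus), t) - cross_entropy(softmax(z_minus), t)
        grad.append(diff / (2.0 * h))
    return grad

if __name__ == "__main__":
    z, t = [1.0, 2.0, 0.5], [0.0, 1.0, 0.0]
    print(output_error_softmax_ce(z, t))
    print(numerical_gradient(z, t))
```

The form targetOutput[i] - layerOutput[i] from the question is the negative of this gradient. The cancellation relies on this specific pairing of softmax with cross entropy, so with other pairings, such as squared error with a sigmoid or tanh output, the activation derivative stays in the expression.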
Best regards,
I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what exp([rbf_sigma,boxconstraint]) means.
rbf_sigma: The SVM is using a Gaussian kernel, and rbf_sigma sets the standard deviation (roughly the size) of the kernel. To understand how kernels work, think of the SVM as putting the kernel around every sample (so that you have a Gaussian around every sample). Then the kernels are summed for the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has a higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold where one category is selected over another.)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: this is a cost/penalty on misclassified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters iteratively. The cost for a misclassified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching, the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advise trying values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advise this so that you try a wide range of values first. You can optimize further later, starting FROM the first optimum that you obtained.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values of z; since minfn evaluates the cross-validation at exp(z), the optimum rbf_sigma and boxconstraint are exp(searchmin). Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of minfn, fminsearch will take 'exponentially' big steps.
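Both points about exp(z) can be illustrated with a short Python sketch. It is not the MATLAB example itself, and the helper names are invented for illustration. It shows exponentially spaced candidate values, and how an unconstrained search variable z maps to strictly positive parameters, so that equal steps in z become equal ratios in the parameter.

```python
import math

def exponentially_spaced(first, ratio, count):
    """Values first, first*ratio, first*ratio^2, ... (constant ratio, not constant gap)."""
    return [first * ratio ** i for i in range(count)]

def to_positive_params(z):
    """Map an unconstrained z = [z1, z2] to (rbf_sigma, boxconstraint), both > 0."""
    return math.exp(z[0]), math.exp(z[1])

if __name__ == "__main__":
    # Roughly 0.2, 2, 20, 200 as in the set {2*10^(i-2)}
    print(exponentially_spaced(0.2, 10, 4))
    # Even negative z gives positive parameters
    print(to_positive_params([-3.0, 0.5]))
    # A step of +1 in z multiplies the parameter by e
    sigma_a, _ = to_positive_params([1.0, 0.0])
    sigma_b, _ = to_positive_params([2.0, 0.0])
    print(sigma_b / sigma_a)
```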
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
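A random 50/25/25 split like the one described could be sketched in Python as follows. The function name, the fractions as defaults, and the fixed seed are all illustrative choices.

```python
import random

def three_way_split(samples, train_frac=0.5, cv_frac=0.25, seed=0):
    """Shuffle once, then cut into training, cross-validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded only for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_cv = int(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]        # whatever remains
    return train, cv, test

if __name__ == "__main__":
    train, cv, test = three_way_split(range(10000))
    print(len(train), len(cv), len(test))
```

Shuffling once and slicing guarantees that the three subsets are disjoint and together cover every sample.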
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps