I'm trying to implement gradient calculation for neural networks using backpropagation.
I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.
I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. Cross entropy (CE) error with sigmoid activation gradient is computed correctly. However, when I change activation to ReLU - it fails. (I'm skipping tanh for CE as it retuls values in (-1,1) range.)
Is it because of the behavior of log function at values close to 0 (which is returned by ReLUs approx. 50% of the time for normalized inputs)?
I tried to mitiage that problem with:
log(max(y,eps))
but it only helped to bring error and gradients back to real numbers - they are still different from numerical gradient.
I verify the results using numerical gradient:
num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)
The following matlab code presents a simplified and condensed backpropagation implementation used in my experiments:
function [f, df] = backprop(W, X, Y)
% W - weights
% X - input values
% Y - target values
act_type='relu'; % possible values: sigmoid / tanh / relu
error_type = 'CE'; % possible values: SE / CE
N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);
% feedforward
X=[X ones(N,1)];
z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
z3=a2*w2'; y=act(z3,act_type);
if strcmp(error_type, 'CE') % cross entropy error - logistic cost function
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
else % squared error
f=0.5*sum(sum((y-Y).^2));
end
% backprop
if strcmp(error_type, 'CE') % cross entropy error
d3=y-Y;
else % squared error
d3=(y-Y).*dact(z3,act_type);
end
df2=d3'*a2;
d2=d3*w2(:,1:end-1).*dact(z2,act_type);
df1=d2'*X;
df=[df1(:);df2(:)];
end
function f=act(z,type) % activation function
switch type
case 'sigmoid'
f=1./(1+exp(-z));
case 'tanh'
f=tanh(z);
case 'relu'
f=max(0,z);
end
end
function df=dact(z,type) % derivative of activation function
switch type
case 'sigmoid'
df=act(z,type).*(1-act(z,type));
case 'tanh'
df=1-act(z,type).^2;
case 'relu'
df=double(z>0);
end
end
Edit
After another round of experiments, I found out that using a softmax for the last layer:
y=bsxfun(#rdivide, exp(z3), sum(exp(z3),2));
and softmax cost function:
f=-sum(sum(Y.*log(y)));
make the implementaion working for all activation functions including ReLU.
This leads me to conclusion that it is the logistic cost function (binary clasifier) that does not work with ReLU:
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
However, I still cannot figure out where the problem lies.
Every squashing function sigmoid, tanh and softmax (in the output layer)
means different cost functions.
Then makes sense that a RLU (in the output layer) does not match with the cross entropy cost function.
I will try a simple square error cost function to test a RLU output layer.
The true power of RLU is in the hidden layers of a deep net since it not suffer from gradient vanishing error.
If you use gradient descendent you need to derive the activation function to be used later in the back-propagation approach. Are you sure about the 'df=double(z>0)'?. For the logistic and tanh seems to be right.
Further, are you sure about this 'd3=y-Y' ? I would say this is true when you use the logistic function but not for the ReLu (the derivative is not the same and therefore will not lead to that simple equation).
You could use the softplus function that is a smooth version of the ReLU, which the derivative is well known (logistic function).
I think the flaw lies in comapring with the numerically computed derivatives. In your derivativeActivation function , you define the derivative of ReLu at 0 to be 0. Where as numerically computing the derivative at x=0 shows it to be
(ReLU(x+epsilon)-ReLU(x-epsilon)/(2*epsilon)) at x =0 which is 0.5. Therefore, defining the derivative of ReLU at x=0 to be 0.5 will solve the problem
I thought I'd share my experience I had with similar problem. I too have designed my multi classifier ANN in a way that all hidden layers use RELU as non-linear activation function and the output layer uses softmax function.
My problem was related to some degree to numerical precision of the programming language/platform I was using. In my case I noticed that if I used "plain" RELU not only does it kill the gradient but the programming language I used produced the following softmax output vectors (this is just a example sample):
⎡1.5068230536681645e-35⎤
⎢ 2.520367499064734e-18⎥
⎢3.2572859518007807e-22⎥
⎢ 1⎥
⎢ 5.020155103452967e-32⎥
⎢1.7620297760773188e-18⎥
⎢ 5.216008990667109e-18⎥
⎢ 1.320937038894421e-20⎥
⎢2.7854159049317976e-17⎥
⎣1.8091246170996508e-35⎦
Notice the values of most of the elements are close to 0, but most importantly notice the 1 value in the output.
I used a different cross-entropy error function than the one you used. Instead of calculating log(max(1-y, eps)) I stuck to the basic log(1-y). So given the output vector above, when I calculated log(1-y) I got the -Inf as a result of cross-entropy, which obviously killed the algorithm.
I imagine if your eps is not reasonably high enough so that log(max(1-y, eps)) -> log(max(0, eps)) doesn't yield way too small log output you might be in a similar pickle like myself.
My solution to this problem was to use Leaky RELU. Once I've started using it, I could carry on using the multi classifier cross-entropy as oppose to softmax-cost function you decided to try.
I would like to draw learning curves for a given SVM classifier. Thus, in order to do this, I would like to compute the training, cross-validation and test error, and then plot them while varying some parameter (e.g., number of instances m).
How to compute training, cross-validation and test error on libsvm when used with MATLAB?
I have seen other answers (see example) that suggest solutions for other languages.
Isn't there a compact way of doing it?
Given a set of instances described by:
a set of features featureVector;
their corresponding labels (e.g., either 0 or 1),
if a model was previously inferred via libsvm, the MSE error can be computed as follows:
[predictedLabels, accuracy, ~] = svmpredict(labels, featureVectors, model,'-q');
MSE = accuracy(2);
Notice that predictedLabels contains the labels that were predicted by the classifier for the given instances.
I have a data set of 13 attributes where some are categorical and some are continuous (can be converted to categorical). I need to use logistic regression to create a model that predicts the responses of a row and find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data you can use cross validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it, where X is the feature matrix and Labels is the class label for each case, num_shuffles is the number of repetitions of the cross-validation while num_folds is the number of folds:
for j = 1:num_shuffles
indices = crossvalind('Kfold',Labels,num_folds);
for i = 1:num_folds
test = (indices == i); train = ~test;
[b,dev,stats] = glmfit(X(train,:),Labels(train),'binomial','logit'); % Logistic regression
Fit(j,i) = glmval(b,X(test,:),'logit')';
end
end
Fit is then the fitted logistic regression estimate for each test fold. Thresholding this will yield an estimate of the predicted class for each test case. Performance measures are then calculated by comparing the predicted class label against the actual class label. Averaging the performance measures across all folds and repetitions gives an estimate of the model performance on unseen data.
originally answered by BGreene on #Stats.SE.
I am trying to get the residuals for the scatter plot of two variables. I could get the least squares linear regression line using lsline function of matlab. However, I want to get the residuals as well. How can I get this in matlab. For that I need to know the parameters a and b of the linear regression line
ax+b
Use the function polyfit to obtain the regression parameters. You can then evaluate the fitted values and calculate your residuals accordingly.
Basically polyfit performs least-squares regression for a specified degree N which, in your case will be 1 for straight line regression. The regression parameters are returned by the function and you can use the other function polyval to get the fitted values from the regression parameters
If you have the curve fitting toolbox, type cftool and press enter and the GUI will appear.
You can use this tool to find a linear polynomial fit for a given data set, as well as many other fits.