Backpropagation for rectified linear unit activation with cross entropy error - matlab

I'm trying to implement gradient calculation for neural networks using backpropagation.
I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.
I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. The gradient for cross entropy (CE) error with sigmoid activation is also computed correctly. However, when I change the activation to ReLU, it fails. (I'm skipping tanh for CE as it returns values in the (-1,1) range.)
Is it because of the behavior of log function at values close to 0 (which is returned by ReLUs approx. 50% of the time for normalized inputs)?
I tried to mitigate that problem with:
log(max(y,eps))
but it only helped to bring error and gradients back to real numbers - they are still different from numerical gradient.
I verify the results using numerical gradient:
num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)
The following matlab code presents a simplified and condensed backpropagation implementation used in my experiments:
function [f, df] = backprop(W, X, Y)
% W - weights
% X - input values
% Y - target values
act_type='relu'; % possible values: sigmoid / tanh / relu
error_type = 'CE'; % possible values: SE / CE
N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);
% feedforward
X=[X ones(N,1)];
z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
z3=a2*w2'; y=act(z3,act_type);
if strcmp(error_type, 'CE') % cross entropy error - logistic cost function
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
else % squared error
f=0.5*sum(sum((y-Y).^2));
end
% backprop
if strcmp(error_type, 'CE') % cross entropy error
d3=y-Y;
else % squared error
d3=(y-Y).*dact(z3,act_type);
end
df2=d3'*a2;
d2=d3*w2(:,1:end-1).*dact(z2,act_type);
df1=d2'*X;
df=[df1(:);df2(:)];
end
function f=act(z,type) % activation function
switch type
case 'sigmoid'
f=1./(1+exp(-z));
case 'tanh'
f=tanh(z);
case 'relu'
f=max(0,z);
end
end
function df=dact(z,type) % derivative of activation function
switch type
case 'sigmoid'
df=act(z,type).*(1-act(z,type));
case 'tanh'
df=1-act(z,type).^2;
case 'relu'
df=double(z>0);
end
end
Edit
After another round of experiments, I found out that using a softmax for the last layer:
y=bsxfun(@rdivide, exp(z3), sum(exp(z3),2));
and softmax cost function:
f=-sum(sum(Y.*log(y)));
makes the implementation work for all activation functions, including ReLU.
This leads me to the conclusion that it is the logistic cost function (binary classifier) that does not work with ReLU:
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
However, I still cannot figure out where the problem lies.

Each squashing function (sigmoid, tanh, and softmax in the output layer)
goes with a different cost function.
It therefore makes sense that a ReLU in the output layer does not match the cross entropy cost function.
I will try a simple squared error cost function to test a ReLU output layer.
The true power of ReLU is in the hidden layers of a deep net, since it does not suffer from the vanishing gradient problem.

If you use gradient descent, you need the derivative of the activation function for the back-propagation step. Are you sure about 'df=double(z>0)'? The derivatives for the logistic and tanh functions seem to be right.
Further, are you sure about 'd3=y-Y'? I would say this is true when you use the logistic function, but not for ReLU (the derivative is not the same and therefore does not lead to that simple equation).
You could use the softplus function, a smooth version of ReLU whose derivative is well known (the logistic function).
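The point about 'd3=y-Y' can be checked on a toy problem. The sketch below is an assumption-laden illustration, not the asker's network: it uses a single ReLU output unit with no hidden layer, made-up data X, Y and weights w chosen so that the output stays inside (0,1), and compares the shortcut delta with the full chain-rule delta dE/dy * dy/dz against a central-difference gradient.

```matlab
% Toy setup (hypothetical, not from the question): one ReLU output unit.
relu  = @(z) max(0, z);
drelu = @(z) double(z > 0);
ce    = @(y, Y) -sum(Y .* log(y) + (1 - Y) .* log(1 - y));

X = [0.2 0.4; 0.6 0.1; 0.3 0.3];   % 3 samples, 2 inputs
Y = [1; 0; 1];                     % binary targets
w = [0.5; 0.7];                    % chosen so that 0 < X*w < 1
cost = @(w) ce(relu(X * w), Y);

z = X * w;
y = relu(z);

% Chain rule: dE/dy = (y - Y) ./ (y .* (1 - y)), then multiply by dy/dz.
d_general  = (y - Y) ./ (y .* (1 - y)) .* drelu(z);
% Shortcut that only holds when the output unit is a sigmoid (or softmax).
d_shortcut = y - Y;

g_general  = X' * d_general;
g_shortcut = X' * d_shortcut;

% Central-difference gradient, one weight at a time.
h = 1e-6;
g_num = zeros(size(w));
for k = 1:numel(w)
    e = zeros(size(w));
    e(k) = h;
    g_num(k) = (cost(w + e) - cost(w - e)) / (2 * h);
end

disp([g_general g_shortcut g_num]);
```

In this toy case the chain-rule gradient agrees with the numerical one, while the shortcut does not. The factor 1./(y.*(1-y)) also shows why the logistic cost is fragile with a ReLU output: it blows up as y approaches 0 or 1.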

I think the flaw lies in comparing with the numerically computed derivatives. In your derivative-of-activation function (dact), you define the derivative of ReLU at 0 to be 0, whereas numerically computing the derivative at x=0 gives
(ReLU(x+epsilon)-ReLU(x-epsilon))/(2*epsilon) at x=0, which is 0.5. Therefore, defining the derivative of ReLU at x=0 to be 0.5 will solve the problem.
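The central-difference value at the kink is easy to reproduce. This is a minimal sketch; the step size epsilon and the probe points are arbitrary choices made here for illustration.

```matlab
relu = @(x) max(0, x);
central = @(x, epsilon) (relu(x + epsilon) - relu(x - epsilon)) / (2 * epsilon);

epsilon = 1e-4;
at_zero     = central(0, epsilon);      % straddles the kink
at_positive = central(1e-3, epsilon);   % both probes on the linear side
at_negative = central(-1e-3, epsilon);  % both probes on the flat side

fprintf('%g %g %g\n', at_zero, at_positive, at_negative);
```

Any pre-activation within epsilon of 0 straddles the kink, so the numerical and analytic values can differ there regardless of the convention chosen at exactly 0. With continuous-valued inputs an exact 0 is rare, so this is worth checking before assuming it explains a large mismatch.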

I thought I'd share my experience I had with similar problem. I too have designed my multi classifier ANN in a way that all hidden layers use RELU as non-linear activation function and the output layer uses softmax function.
My problem was related to some degree to the numerical precision of the programming language/platform I was using. In my case I noticed that if I used "plain" RELU, not only did it kill the gradient, but the programming language I used produced softmax output vectors like the following (this is just an example):
⎡1.5068230536681645e-35⎤
⎢ 2.520367499064734e-18⎥
⎢3.2572859518007807e-22⎥
⎢ 1⎥
⎢ 5.020155103452967e-32⎥
⎢1.7620297760773188e-18⎥
⎢ 5.216008990667109e-18⎥
⎢ 1.320937038894421e-20⎥
⎢2.7854159049317976e-17⎥
⎣1.8091246170996508e-35⎦
Notice the values of most of the elements are close to 0, but most importantly notice the 1 value in the output.
I used a different cross-entropy error function than the one you used. Instead of calculating log(max(1-y, eps)) I stuck to the basic log(1-y). So given the output vector above, when I calculated log(1-y) I got the -Inf as a result of cross-entropy, which obviously killed the algorithm.
I imagine that if your eps is not large enough, so that log(max(1-y, eps)) -> log(max(0, eps)) yields a very large negative value, you might be in a similar pickle.
My solution to this problem was to use Leaky RELU. Once I started using it, I could carry on using the multi-class cross-entropy, as opposed to the softmax cost function you decided to try.
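The -Inf described above can be reproduced with a few values shaped like that softmax output. The vector y below is a shortened, hypothetical stand-in for the one shown in the answer.

```matlab
% Hypothetical saturated softmax output: one entry is exactly 1 in double precision.
y = [1.5e-35; 2.5e-18; 1; 5.0e-32];

naive   = log(1 - y);             % third entry is log(0) = -Inf
clamped = log(max(1 - y, eps));   % third entry is log(eps), roughly -36

disp([naive clamped]);
```

The clamp keeps the cost finite, but the clamped term has zero slope with respect to y, so gradients computed through it still will not match an analytic formula that ignores the clamp.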

Related

Confused by the notation (a and z) and usage of backpropagation equations used in neural networks gradient descent training

I’m writing a neural network but I have trouble training it using backpropagation, so I suspect there is a bug or mathematical mistake somewhere in my code. I’ve spent hours reading different literature on how the equations of backpropagation should look, but I’m a bit confused by it since different books say different things, or at least use wildly confusing and contradictory notation. So, I was hoping that someone who knows with 100% certainty how it works could clear it up for me.
There are two steps in the backpropagation that confuse me. Let’s assume for simplicity that I only have a three layer feed forward net, so we have connections between input-hidden and hidden-output. I call the weighted sum that reaches a node z and the same value but after it has passed the activation function of the node a.
Apparently I’m not allowed to embed an image with the equations that my question concerns, so I will have to link it like this: https://i.stack.imgur.com/CvyyK.gif
Now. During backpropagation, when calculating the error in the nodes of the output layer, is it:
[Eq. 1] Delta_output = (output-target) * a_output through the derivative of the activation function
Or is it
[Eq. 2] Delta_output = (output-target) * z_output through the derivative of the activation function
And during the error calculation of the nodes in the hidden layer, same thing, is it:
[Eq. 3] Delta_hidden = a_h through the derivative of the activation function * sum(w_h*Delta_output)
Or is it
[Eq. 4] Delta_hidden = z_h through the derivative of the activation function * sum(w_h*Delta_output)
So the question is basically; when running a node's value through the derivative version of the activation function during backpropagation, should the value be expressed as it was before or after it passed the activation function (z or a)?
Is the first or the second equation in the image correct and similarly is the third or fourth equation in the image correct?
Thanks.
You have to compute the derivatives with the values before they have passed through the activation function. So the answer is "z".
Some activation functions simplify the computation of the derivative, like tanh:
a = tanh(z)
derivative on z of tanh(z) = 1.0 - tanh(z) * tanh(z) = 1.0 - a * a
This simplification can lead to the confusion you were talking about, but here is another activation function where no such confusion is possible:
a = sin(z)
derivative on z of sin(z) = cos(z)
You can find a list of activation functions and their derivatives on wikipedia: activation function.
Some networks don't have an activation function on the output nodes, so the derivative is 1.0, and delta_output = output - target or delta_output = target - output, depending on whether you add or subtract the weight change.
If you are using an activation function on the output nodes, then you'll have to give targets that are in the range of the activation function, like [-1,1] for tanh(z).
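A short numerical check may make the z-versus-a distinction concrete. In this sketch (the sample points are arbitrary), the tanh derivative is evaluated at z and, equivalently, rewritten in terms of a. Feeding a back into the z-form of the derivative is the mistake the question is circling around.

```matlab
z = linspace(-2, 2, 9);   % pre-activation values (arbitrary sample points)
a = tanh(z);              % post-activation values

d_from_z = 1 - tanh(z).^2;   % derivative evaluated at z
d_from_a = 1 - a.^2;         % same derivative, rewritten in terms of a
d_wrong  = 1 - tanh(a).^2;   % mistake: applying the z-formula to a

d_sin = cos(z);              % for a = sin(z), the derivative is written in terms of z

disp([d_from_z; d_from_a; d_wrong]);
```

The first two rows agree. The third differs everywhere except at z = 0.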

Kriging / Gaussian Process Conditional Simulations in Matlab

I would like to perform conditional simulations for Gaussian process (GP) models in Matlab. I have found a tutorial by Martin Kolář (http://mrmartin.net/?p=223).
sigma_f = 1.1251; %parameter of the squared exponential kernel
l = 0.90441; %parameter of the squared exponential kernel
kernel_function = @(x,x2) sigma_f^2*exp((x-x2)^2/(-2*l^2));
%This is one of many popular kernel functions, the squared exponential
%kernel. It favors smooth functions. (Here, it is defined as an anonymous
%function handle)
% we can also define an error function, which models the observation noise
sigma_n = 0.1; %known noise on observed data
error_function = @(x,x2) sigma_n^2*(x==x2);
%this is just iid gaussian noise with mean 0 and variance sigma_n^2
%kernel functions can be added together. Here, we add the error kernel to
%the squared exponential kernel
k = @(x,x2) kernel_function(x,x2)+error_function(x,x2);
X_o = [-1.5 -1 -0.75 -0.4 -0.3 0]';
Y_o = [-1.6 -1.3 -0.5 0 0.3 0.6]';
prediction_x=-2:0.01:1;
K = zeros(length(X_o));
for i=1:length(X_o)
for j=1:length(X_o)
K(i,j)=k(X_o(i),X_o(j));
end
end
%% Demo #5.2 Sample from the Gaussian Process posterior
clearvars -except k prediction_x K X_o Y_o
%We can also sample from this posterior, the same way as we sampled before:
K_ss=zeros(length(prediction_x),length(prediction_x));
for i=1:length(prediction_x)
for j=i:length(prediction_x)%We only calculate the top half of the matrix. This is an unnecessary speedup trick
K_ss(i,j)=k(prediction_x(i),prediction_x(j));
end
end
K_ss=K_ss+triu(K_ss,1)'; % We can use the upper half of the matrix and copy it to the lower half
K_s=zeros(length(prediction_x),length(X_o));
for i=1:length(prediction_x)
for j=1:length(X_o)
K_s(i,j)=k(prediction_x(i),X_o(j));
end
end
[V,D]=eig(K_ss-K_s/K*K_s');
A=real(V*(D.^(1/2)));
for i=1:7
standard_random_vector = randn(length(A),1);
gaussian_process_sample(:,i) = A * standard_random_vector+K_s/K*Y_o;
end
hold on
plot(prediction_x,real(gaussian_process_sample))
set(plot(X_o,Y_o,'r.'),'MarkerSize',20)
The tutorial generates the conditional simulations using a direct simulation method based on covariance matrix decomposition. It is my understanding that there are several methods of generating conditional simulations that may be better when the number of simulation points is large such as conditioning by Kriging using a local neighborhood. I have found information regarding several methods in J.-P. Chilès and P. Delfiner, “Chapter 7 - Conditional Simulations,” in Geostatistics: Modeling Spatial Uncertainty, Second Edition, John Wiley & Sons, Inc., 2012, pp. 478–628.
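For reference, the direct decomposition method can be written more compactly. The following is only a sketch under stated assumptions: it reuses the tutorial's kernel parameters and data, builds the covariance matrices with bsxfun instead of loops, and swaps the tutorial's eig-based square root for a Cholesky factor. It also assumes the noise at the prediction points is independent of the observation noise, so the cross-covariance K_s carries no noise term, which differs slightly from the tutorial's k. The names se, x_s, mu, C, L and n_sim are introduced here for illustration.

```matlab
sigma_f = 1.1251; l = 0.90441; sigma_n = 0.1;   % values from the tutorial code
se = @(a, b) sigma_f^2 * exp(-bsxfun(@minus, a(:), b(:)').^2 / (2 * l^2));

X_o = [-1.5 -1 -0.75 -0.4 -0.3 0]';
Y_o = [-1.6 -1.3 -0.5 0 0.3 0.6]';
x_s = (-2:0.01:1)';                             % simulation points

K    = se(X_o, X_o) + sigma_n^2 * eye(numel(X_o));   % noisy observations
K_s  = se(x_s, X_o);                                 % cross-covariance, no noise term
K_ss = se(x_s, x_s) + sigma_n^2 * eye(numel(x_s));   % noisy simulated values

mu = K_s * (K \ Y_o);            % conditional mean
C  = K_ss - K_s * (K \ K_s');    % conditional covariance
C  = (C + C') / 2;               % remove rounding asymmetry before factorizing
L  = chol(C, 'lower');           % C = L*L'

n_sim = 7;
samples = bsxfun(@plus, mu, L * randn(numel(x_s), n_sim));

plot(x_s, samples); hold on
plot(X_o, Y_o, 'r.', 'MarkerSize', 20)
```

The Cholesky factor costs O(n^3) in the number of simulation points, which is why this direct approach stops scaling at a few thousand points.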
Is there an existing Matlab toolbox that can be used for conditional simulations? I am aware of DACE, GPML, and mGstat (http://mgstat.sourceforge.net/). I believe only mGstat offers the capability to perform conditional simulations. However, mGstat also seems to be limited to only 3D models and I am interested in higher dimensional models.
Can anybody offer any advice on getting started performing conditional simulations with an existing toolbox such as GPML?
===================================================================
EDIT
I have found a few more Matlab toolboxes: STK, ScalaGauss, ooDACE
It appears STK is capable of conditional simulations using covariance matrix decomposition. However, it is limited to a moderate number (maybe a few thousand?) of simulation points due to the Cholesky factorization.
I used the STK toolbox and I recommend it for others:
http://kriging.sourceforge.net/htmldoc/
I found that if you need conditional simulations at a large number of points then you might consider generating a conditional simulation at the points in a large design of experiment (DoE) and then simply relying on the mean prediction conditional on that DoE.

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (only on the input), while the error backpropagation does depend on the target data. So, what is the value that should be used in the backpropagation part of the gradient check?
Backpropagation is performed after computing the gradients analytically and then using those formulas while training. A neural network is essentially a multivariate function, where the coefficients or the parameters of the functions needs to be found or trained.
The definition of a gradient with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, and from the definition of the first derivative we can approximate the gradient of a function, including a neural network.
To check if your analytical gradient for your neural network is correct or not, it is good to check it using the numerical method.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
    For i in 0 to number of rows in w_l
        For j in 0 to number of columns in w_l
            w_l_minus = w_l; # Copy all the weights
            w_l_minus[i,j] = w_l_minus[i,j] - eps; # Change only this parameter
            w_l_plus = w_l; # Copy all the weights
            w_l_plus[i,j] = w_l_plus[i,j] + eps; # Change only this parameter
            cost_minus = cost of neural net by replacing w_l by w_l_minus
            cost_plus = cost of neural net by replacing w_l by w_l_plus
            w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the central difference (f(x+h) - f(x-h))/(2h), which seems to work better for me.
Note that you mentioned: "since in the results of the numerical gradient computation are not dependent of the target data". This is not true: when you find cost_minus and cost_plus above, the cost is computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
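As a concrete illustration of comparing the two gradients, here is a minimal MATLAB sketch for a 2-2-1 network like the one in the question. Everything specific in it is an assumption made for the example: sigmoid units, squared error, no bias terms, XOR-style data X and T, arbitrary starting weights W, and the helper names unpack1, unpack2, forward and cost. Note that both the cost used for the numerical gradient and the backpropagated delta depend on the targets T.

```matlab
sigm = @(z) 1 ./ (1 + exp(-z));
X = [0 0; 0 1; 1 0; 1 1];               % inputs, 4 samples
T = [0; 1; 1; 0];                       % targets
W = [0.1; -0.2; 0.3; 0.4; 0.5; -0.6];   % [w1(:); w2(:)], w1 is 2x2, w2 is 1x2

unpack1 = @(W) reshape(W(1:4), 2, 2);
unpack2 = @(W) reshape(W(5:6), 1, 2);
forward = @(W) sigm(sigm(X * unpack1(W)') * unpack2(W)');
cost    = @(W) 0.5 * sum((forward(W) - T).^2);

% Analytic gradient by backpropagation
w1 = unpack1(W);  w2 = unpack2(W);
a2 = sigm(X * w1');                   % hidden activations, 4x2
y  = sigm(a2 * w2');                  % network output, 4x1
d3 = (y - T) .* y .* (1 - y);         % output delta
d2 = (d3 * w2) .* a2 .* (1 - a2);     % hidden delta
g_bp = [reshape(d2' * X, [], 1); reshape(d3' * a2, [], 1)];

% Numerical gradient: perturb one weight at a time
h = 1e-5;
g_num = zeros(size(W));
for k = 1:numel(W)
    e = zeros(size(W));
    e(k) = h;
    g_num(k) = (cost(W + e) - cost(W - e)) / (2 * h);
end

rel_err = norm(g_bp - g_num) / norm(g_bp + g_num);
fprintf('relative error: %g\n', rel_err);
```

A relative error many orders of magnitude below 1 suggests the backpropagation code is consistent with the cost. A value near 1 usually points to a wrong delta or a mismatch in how the weights are flattened.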
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.

Neural Network (FFW, BP) - function approximation

is it possible to train NN to approximate this function:
If I run the approximation for x^2 or sin or something simple, it works fine, but for this sort of function I get only a constant-valued line.
My NN has 2 inputs (x, f(x)), one hidden layer (10 neurons), 1 output (f(x))
For training I am using BP, activation functions sigmoid -> tanh
My goal is to get "smooth" function without noise, that catch function on image above.
Or is there any other way with NN or genetic algorithm, how to approximate this ?
You're going to have major problems because the input (x, f(x)) is discontinuous (not exactly, but sort of).
Therefore, your NN will have to literally memorize the x-f(x) mapping given the large discontinuities.
One approach is to use a four-layer NN, which can address the discontinuities.
But really, you may simply want to look at other smoothing methods rather than a NN for this problem.
You have a periodic function so first of all, only use one period, or you will memorize and not generalize.

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network made of 4 input neurons, 1 hidden layer made of 20 neurons and a 7 neuron output layer.
I'm trying to train it for a BCD to 7-segment algorithm. My data is normalized: 0 is mapped to -1 and 1 to 1.
When the output error evaluation happens, the neuron saturates at the wrong value. If the desired output is 1 and the real output is -1, the error is 1-(-1) = 2.
When I multiply it by the derivative of the activation function, error*(1-output)*(1+output), the error becomes almost 0 because of 2*(1-(-1))*(1+(-1)) = 0.
How can I avoid this saturation error?
Saturation at the asymptotes of the activation function is a common problem with neural networks. If you look at a graph of the function, this is not surprising: the asymptotes are almost flat, meaning that the first derivative is (almost) 0. The network cannot learn any more.
A simple solution is to scale the activation function to avoid this problem. For example, with tanh() activation function (my favorite), it is recommended to use the following activation function when the desired output is in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1 - tanh^2( 2/3 * x))
This will force the gradients into the most non-linear value range and speed up the learning. For all the details I recommend reading Yann LeCun's great paper Efficient Back-Prop.
In the case of tanh() activation function, the error would be calculated as
error = 2/3 * (1.7159 - output^2 / 1.7159) * (teacher - output)
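The scaled activation and its derivative can be sanity-checked numerically. The sketch below writes f(x) = A*tanh(B*x) with A = 1.7159 and B = 2/3, derives f'(x) = A*B*(1 - tanh(B*x)^2) by the chain rule, and also expresses it through the output o = f(x) as B*(A - o^2/A). The names A, B, df and df_from_output and the sample points are assumptions made for this illustration.

```matlab
A = 1.7159;  B = 2/3;
f  = @(x) A * tanh(B * x);
df = @(x) A * B * (1 - tanh(B * x).^2);   % derivative in terms of the input x
df_from_output = @(o) B * (A - o.^2 / A); % same derivative in terms of the output o

x = linspace(-3, 3, 13);
h = 1e-6;
df_num = (f(x + h) - f(x - h)) / (2 * h); % central-difference check

slope_at_target = df_from_output(1);      % derivative when the output equals the target +1

fprintf('f(1) = %.5f, slope at output 1 = %.4f\n', f(1), slope_at_target);
```

With this scaling an output of exactly +1 or -1 no longer sits on the flat part of the curve, so the delta for a unit that is wrong by 2 is no longer multiplied by zero.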
This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of two extremes. It's been a while since I have worked with Artificial Neural Networks but if I remember correctly, this (among many other things) is one of the limitations of using the simple back-propagation algorithm.
You could add a Momentum factor to make sure there is some correction based off previous experience, even when the derivative is zero.
You could also train it by epoch, where you accumulate the delta values for the weights before doing the actual update (compared to updating it every iteration). This also mitigates conditions where the delta values are oscillating between two values.
There may be more advanced methods, like second order methods for back propagation, that will mitigate this particular problem.
However, keep in mind that tanh reaches -1 or +1 at the infinities and the problem is purely theoretical.
Not totally sure if I am reading the question correctly, but if so, you should scale your inputs and targets between 0.9 and -0.9 which would help your derivatives be more sane.