Neural network backpropagation with RELU

I am trying to implement a neural network with ReLU.
input layer -> 1 hidden layer -> relu -> output layer -> softmax layer
Above is the architecture of my neural network.
I am confused about backpropagation through this ReLU.
For the derivative of ReLU, if x <= 0, the output is 0;
if x > 0, the output is 1.
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Can someone explain the backpropagation of my neural network architecture 'step by step'?

if x <= 0, output is 0. if x > 0, output is 1
The ReLU function is defined as: For x > 0 the output is x, i.e. f(x) = max(0,x)
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you modify the activation function to be f(x) = max(e,x) for a small e.
Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, except that instead of tanh(x), sigmoid(x) or whatever activation you use, you'll use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid activation it's literally 1 line of change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
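To make the "one line of change" concrete, here is a minimal NumPy sketch of the ReLU forward pass and its derivative. The function names relu and relu_grad and the sample values are illustrative assumptions, and f'(0) is set to 0 by convention as described above.

```python
import numpy as np

def relu(z):
    # Forward pass: f(z) = max(0, z), applied element-wise.
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative: 1 where z > 0, otherwise 0 (f'(0) is set to 0 by convention).
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])          # pre-activations of three hidden units
upstream = np.array([0.5, 0.5, 0.5])    # hypothetical gradient coming from the next layer
grad_z = upstream * relu_grad(z)        # chain rule: gradient only passes where z > 0
```

In a network that already works with sigmoid, only the activation and its derivative would be swapped for these two functions; the rest of the forward and backward pass stays the same.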

If you have a layer made out of a single ReLU, like your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will then return either 0 or, if you're using logistic units, 0.5, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
What you can do is use a "leaky ReLU", which has a small slope for inputs below 0, such as 0.01, instead of being flat there.
I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.

Here is a good example that uses ReLU to implement XOR:
Reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
# N is batch size (sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1
# Create the XOR input and output data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)
    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)  # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the hidden layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
plt.plot(loss_col)
plt.show()
For more about the derivative of ReLU, see: http://kawahara.ca/what-is-the-derivative-of-relu/

So when you calculate the gradient, does that mean I kill gradient
descent if x <= 0?
Yes! If the weighted sum of the inputs and bias of the neuron (the activation function input) is less than zero and the neuron uses the ReLU activation function, the value of the derivative is zero during backpropagation and the input weights to this neuron do not change (they are not updated).
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. It covers the complete process of one step, but you can also check only the part related to ReLU. It is similar to the architecture introduced in the question and uses one neuron in each layer for simplicity. The architecture is as follows:
f and g represent ReLU and sigmoid, respectively, and b represents the bias.
Step 1:
First, the output is calculated:
This merely represents the output calculation. "z" and "a" represent the weighted sum of the inputs to a neuron and the output of the neuron's activation function, respectively.
So h is the estimated value. Suppose the real value is y.
The weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function with respect to the weight and subtracting this gradient from the previous weight, i.e.:
In backpropagation, the gradient of the last neuron(s) of the last layer is calculated first. The chain rule is used to calculate it:
The three general terms used above are:
The difference between the actual value and the estimated value
Neuron output square
And the derivative of the activation function. Given that the activation function in the last layer is sigmoid, we have this:
And the above expression does not necessarily become zero.
Now we go to the second layer. In the second layer we will have:
It consisted of 4 main terms:
The difference between the actual value and the estimated value.
Neuron output square
The sum of the loss derivatives of the connected neurons in the next layer
The derivative of the activation function. Since the activation function is ReLU, we will have:
if z2 <= 0 (z2 is the input of the ReLU function):
Otherwise, it's not necessarily zero:
So if the input of the neuron is zero or less, the loss derivative is always zero and the weights will not update.
*To repeat: the sum of the neuron's inputs must be zero or less to kill gradient descent.
The example given is a very simple example to illustrate the backpropagation process.
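One step of this kind can be sketched in plain Python. This is only an illustration under stated assumptions: a chain of one ReLU neuron followed by one sigmoid neuron, a squared-error loss E = 0.5 * (h - y)^2, and variable names (z_hidden, z_out, backprop_step, lr) that are hypothetical and do not follow the indexing of the answer above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, w1, b1, w2, b2, lr=0.1):
    # Forward pass: ReLU neuron followed by a sigmoid neuron.
    z_hidden = w1 * x + b1
    a_hidden = max(0.0, z_hidden)          # f = ReLU
    z_out = w2 * a_hidden + b2
    h = sigmoid(z_out)                     # g = sigmoid, h is the estimate

    # Backward pass for the assumed loss E = 0.5 * (h - y)**2.
    delta_out = (h - y) * h * (1.0 - h)    # dE/dz_out (sigmoid derivative is h * (1 - h))
    relu_prime = 1.0 if z_hidden > 0 else 0.0
    delta_hidden = delta_out * w2 * relu_prime   # dE/dz_hidden, zero when the ReLU is inactive

    # Gradient descent update of every parameter.
    return (w1 - lr * delta_hidden * x,
            b1 - lr * delta_hidden,
            w2 - lr * delta_out * a_hidden,
            b2 - lr * delta_out)

# ReLU input is negative (z_hidden = -1): w1 and b1 stay exactly where they were.
inactive = backprop_step(x=1.0, y=1.0, w1=-1.0, b1=0.0, w2=0.5, b2=0.0)
# ReLU input is positive (z_hidden = 1): w1 receives a non-zero update.
active = backprop_step(x=1.0, y=1.0, w1=1.0, b1=0.0, w2=0.5, b2=0.0)
```

Note that in the inactive case the output neuron's bias still changes; only the parameters feeding the inactive ReLU are frozen for that sample.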

Yes, the original ReLU function has the problem you describe.
So a change was later made to the formula, and the result is called leaky ReLU.
In essence, leaky ReLU tilts the horizontal part of the function by a very small amount. For more information, watch this:
An explanation of activation methods, and an improved ReLU, on YouTube

Additionally, here you can find an implementation in caffe framework: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp
The negative_slope parameter specifies whether to "leak" the negative part by multiplying it by the slope value rather than setting it to 0. Of course, you should set this parameter to zero to get the classical version.
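As a rough NumPy counterpart of that idea (not the Caffe code itself), a leaky ReLU with a negative_slope parameter could look like the sketch below. The function names and the default of 0.01 are assumptions chosen for illustration.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.01):
    # Positive inputs pass through; negative inputs are scaled by negative_slope.
    return np.where(z > 0, z, negative_slope * z)

def leaky_relu_grad(z, negative_slope=0.01):
    # The gradient is 1 for positive inputs and negative_slope otherwise,
    # so it never becomes exactly zero unless negative_slope is 0.
    return np.where(z > 0, 1.0, negative_slope)

z = np.array([-2.0, 3.0])
leaky = leaky_relu(z)                          # small negative value instead of 0
classic = leaky_relu(z, negative_slope=0.0)    # reduces to the ordinary ReLU
```

With negative_slope set to 0 both functions reduce to the ordinary ReLU and its derivative.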

Related

why is 1 appended to the input layer of a neural network?

I am following this tutorial for making neural network
https://www.kaggle.com/antmarakis/another-neural-network-from-scratch
I do not understand the train part of this code where 1 is appended to the input feature vector.
def Train(X, Y, lr, weights):
    layers = len(weights)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        x = np.matrix(np.append(1, x)) # Augment feature vector
        activations = ForwardPropagation(x, weights, layers)
        weights = BackPropagation(y, activations, weights, layers)
    return weights
Any help in understanding this would be appreciated.
Forward propagation includes multiplying by weights and adding a bias term. The equation is
y = X*W + b. This can be written in a more compact form as y = [1, X] * [b; W], where * stands for matrix multiplication and [b; W] stacks the bias row on top of the weight matrix.
In the code, the weights and biases seem to have been combined into a single weight matrix W, and x is turned into an augmented vector by prepending a one to it (np.append(1, x) puts the 1 first).
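The equivalence can be checked numerically. The sketch below uses made-up shapes (3 features, 2 output units) and random values; the names W_aug and x_aug are illustrative and do not come from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)         # one sample with 3 features
W = rng.normal(size=(3, 2))    # weights for 2 output units
b = rng.normal(size=2)         # one bias per output unit

# Explicit bias term.
y_explicit = x @ W + b

# Augmented form: a leading 1 in the input picks up the bias row of the matrix.
x_aug = np.append(1, x)        # [1, x1, x2, x3], as in the Train function
W_aug = np.vstack([b, W])      # bias row stacked on top of the weights
y_augmented = x_aug @ W_aug
```

Both expressions give the same result, which is why the tutorial can keep a single weight matrix per layer.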

Why sigmoid layer gives worse result than tanh layer in 0-1 regression task?

I'm working with regression to predict an array with 0-1 value (array of bit). The neural network specification is the following (MATLAB):
layers = [
imageInputLayer([1 16 2],'Normalization','none')
fullyConnectedLayer(512)
batchNormalizationLayer
reluLayer
fullyConnectedLayer(64)
batchNormalizationLayer
reluLayer
% sigmoidLayer
tanhLayer
regressionLayer
];
I've used the following code to implement Sigmoid Layer:
classdef sigmoidLayer < nnet.layer.Layer
    methods
        function layer = sigmoidLayer(name)
            % Set layer name (the constructor takes a single argument)
            if nargin == 1
                layer.Name = name;
            end
            % Set layer description
            layer.Description = 'sigmoidLayer';
        end
        function Z = predict(layer,X)
            % Forward input data through the layer and output the result
            Z = exp(X)./(exp(X)+1);
        end
        function dLdX = backward(layer, X ,Z,dLdZ,memory)
            % Backward propagate the derivative of the loss function through
            % the layer
            dLdX = Z.*(1-Z) .* dLdZ;
        end
    end
end
The output is only 0 or 1. So why is sigmoid worse than tanh, instead of equal or better?
It depends on what you call "worse". Without more details it's hard to answer clearly.
However, one of the key differences is the function's derivative. The magnitude of the gradient update depends on the derivative of the function, so when the function saturates the derivative becomes close to 0 and the network can hardly learn anymore.
The sigmoid saturates at both 1 and 0: as x -> +inf, sigmoid -> 1, as x -> -inf, sigmoid -> 0, and in both cases d(sigmoid)/dx -> 0. Depending on your data, this might therefore cause slower or "worse" learning. In contrast, although tanh does saturate when going to 1, it does not saturate around 0 (its derivative is actually at its maximum there), so learning in this region is not problematic.
You might also want to look into label smoothing.
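A small sketch of the local derivatives makes the difference visible. It only compares the slope of each activation as a function of its own output; it is not a full explanation of the training results, and the function names are illustrative.

```python
def sigmoid_grad_from_output(s):
    # d(sigmoid)/dx expressed through the output s = sigmoid(x).
    return s * (1.0 - s)

def tanh_grad_from_output(t):
    # d(tanh)/dx expressed through the output t = tanh(x).
    return 1.0 - t * t

# Slope of each activation when its output is near the targets 0 and 1.
for output in (0.01, 0.5, 0.99):
    print(output, sigmoid_grad_from_output(output), tanh_grad_from_output(output))
```

Near an output of 0 the sigmoid's slope is close to zero while the slope of tanh is close to 1; near an output of 1 both slopes are small.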

Computing Image Saliency via Neural Network Classifier

Assume that we have a Convolutional Neural Network trained to classify (w.l.o.g. grayscale) images, in TensorFlow.
Given the trained net and a test image, one can trace which of its pixels are salient, or "equivalently" which pixels are most responsible for the output classification of the image. A nice explanation and implementation details in Theano are given in this article.
Assume that for the first layer of convolutions, which is directly linked with the input image, we do have the gradient of the classification function with respect to the parameters of every convolutional kernel.
How can one propagate the gradient back to the input layer, so as to compute a partial derivative for every pixel of the image?
Propagating and accumulating the gradient back would give us the salient pixels (they are those whose derivative is large in magnitude).
To find the gradient with respect to the kernels of the first layer, so far I did the following:
Replaced the usual loss operator with the output layer operator.
Used the "compute_gradients" function.
All in all, it looks like:
opt = tf.train.GradientDescentOptimizer(1)
grads = opt.compute_gradients(output)  # list of (gradient, variable) pairs
grad_var = [grad[0] for grad in grads]
g1 = sess.run([grad_var[0]])
Here, "output" is the max of the output layer of the NN.
And g1 is a (k, k, 1, M) tensor, since I used M convolutional kernels of size k x k on the first layer.
Now I need to find the correct way to propagate g1 to every input pixel, so as to compute the derivative of the output with respect to each of them.
To compute the gradients, you don't need to use an optimizer, and you can directly use tf.gradients.
With this function, you can directly compute the gradient of output with respect to the image input, whereas the optimizer compute_gradients method can only compute gradients with respect to Variables.
The other advantage of tf.gradients is that you can specify the gradients of the output you want to backpropagate.
So here is how to get the gradient of output[1, 1] with respect to the input image:
we have to set the output gradients to 0 everywhere except at index [1, 1].
input = tf.ones([1, 4, 4, 1])
filter = tf.ones([3, 3, 1, 1])
output = tf.nn.conv2d(input, filter, [1, 1, 1, 1], 'SAME')
grad_output = np.zeros((1, 4, 4, 1), dtype=np.float32)
grad_output[0, 1, 1, 0] = 1.
grads = tf.gradients(output, input, grad_output)
sess = tf.Session()
print(sess.run(grads[0]).reshape((4, 4)))
# prints [[ 1. 1. 1. 0.]
#         [ 1. 1. 1. 0.]
#         [ 1. 1. 1. 0.]
#         [ 0. 0. 0. 0.]]
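The same per-pixel gradient can be sanity-checked without TensorFlow. The sketch below is an assumption-laden stand-in: it re-implements a single-channel 'SAME' convolution in NumPy and estimates the gradient of one output value with finite differences instead of calling tf.gradients. The helper names are hypothetical.

```python
import numpy as np

def conv2d_same(image, kernel):
    # Single-channel cross-correlation with zero padding, so the output
    # has the same shape as the input (odd kernel sizes assumed).
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

def pixel_gradient(image, kernel, r, c, eps=1e-3):
    # Finite-difference estimate of d output[r, c] / d image[i, j] for every pixel.
    grad = np.zeros_like(image, dtype=float)
    base = conv2d_same(image, kernel)[r, c]
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            bumped = image.copy()
            bumped[i, j] += eps
            grad[i, j] = (conv2d_same(bumped, kernel)[r, c] - base) / eps
    return grad

image = np.ones((4, 4))
kernel = np.ones((3, 3))
grad = pixel_gradient(image, kernel, 1, 1)
```

For the all-ones 4x4 input and 3x3 filter used above, grad matches the matrix printed by the TensorFlow snippet: only the 3x3 neighbourhood around position [1, 1] influences that output.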

Neural Network Neurons output numbers > 1

I have read that you calculate the output of a Neuron in a Neural Net by adding up all the inputs times their corresponding weights and then smoothing it with e.g. the Sigmoid Function.
But what I don't understand is that this sum (without smoothing) could get bigger than 1.
When this happens my Sigmoid Function outputs 1.0.
The function I am using to calculate the Neuron Output (without smoothing) is:
def sum(self, inputs):
    valu = 0
    for i, val in enumerate(inputs):
        valu += float(val) * self.weights[i]
    return valu
So my question is:
Am I doing something wrong, because I have read that the output should be between 0 and 1?
The sigmoid function is not exactly a smoothing function; it is a non-linear function that maps its input to the (0, 1) range. Informally speaking, a non-linear function does not have a constant slope or, in other words, it can't be described as a straight line.
The sigmoid function squashes the input such that, as the magnitude of the input increases, the output of the sigmoid asymptotically approaches 0 (for negative input) and 1 (for positive input). So the weighted sum itself may well be larger than 1; only the value after the sigmoid lies between 0 and 1, and for large sums it is rounded to 1.0 in floating point.
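A short sketch shows both effects: a weighted sum above 1 is perfectly normal, and a large sum makes the sigmoid print as 1.0. The helper neuron_output and the sample inputs and weights are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights):
    # Weighted sum of the inputs, followed by the sigmoid activation.
    weighted_sum = sum(float(v) * w for v, w in zip(inputs, weights))
    return weighted_sum, sigmoid(weighted_sum)

# The raw sum is 3.0, well above 1; the activated output stays below 1.
weighted_sum, output = neuron_output([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])

# For a large sum, exp(-z) is smaller than double precision can resolve next to 1,
# so the result is rounded to exactly 1.0.
saturated = sigmoid(40.0)
```

If the outputs are 1.0 most of the time, the weighted sums are probably large, which usually points to large inputs or large initial weights.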

How should I use maximum likelihood classifier in Matlab? [duplicate]

I am implementing logistic regression using batch gradient descent. There are two classes into which the input samples are to be classified. The classes are 1 and 0. While training the data, I am using the following sigmoid function:
t = 1 ./ (1 + exp(-z));
where
z = x*theta
And I am using the following cost function to calculate cost, to determine when to stop training.
function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end
I am getting the cost at each step to be NaN as the values of htheta are either 1 or zero in most cases. What should I do to determine the cost value at each iteration?
This is the gradient descent code for logistic regression:
function [theta,cost_history] = batchGD(x,y,theta,alpha)
    cost_history = zeros(1000,1);
    for iter=1:1000
        htheta = sigmoid(x*theta);
        new_theta = zeros(size(theta,1),1);
        for feature=1:size(theta,1)
            new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .*x(:,feature));
        end
        theta = new_theta;
        cost_history(iter) = computeCost(x,y,theta);
    end
end
There are two possible reasons why this may be happening to you.
The data is not normalized
This is because when you apply the sigmoid / logit function to your hypothesis, the output probabilities are almost all approximately 0s or all 1s and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.
Specifically, if y = 0 for a training example and the output of your hypothesis is so close to 0 that it is rounded to exactly 0, the first part of the cost function becomes 0*log(0) = 0*(-Inf), which produces NaN. Similarly, if y = 1 for a training example and the output of your hypothesis is rounded to exactly 1, the second part becomes (1 - 1)*log(1 - 1) = 0*log(0), which again produces NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
This is most likely due to the fact that the dynamic range of each feature is widely different and so a part of your hypothesis, specifically the weighted sum of x*theta for each training example you have will give you either very large negative or positive values, and if you apply the sigmoid function to these values, you'll get very close to 0 or 1.
One way to combat this is to normalize the data in your matrix before performing training using gradient descent. A typical approach is to normalize with zero-mean and unit variance. Given an input feature x_k where k = 1, 2, ... n where you have n features, the new normalized feature x_k^{new} can be found by:
x_k^{new} = (x_k - m_k) / s_k
m_k is the mean of the feature k and s_k is the standard deviation of the feature k. This is also known as standardizing data. You can read up on more details about this on another answer I gave here: How does this code for standardizing data work?
Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:
mX = mean(x,1);
mX(1) = 0;
sX = std(x,[],1);
sX(1) = 1;
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);
The mean and standard deviations of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to you above. I won't repeat that stuff here because that isn't the scope of this post. To ensure proper normalization, I've made the mean and standard deviation of the first column to be 0 and 1 respectively. xnew contains the new normalized data matrix. Use xnew with your gradient descent algorithm instead. Now once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters learned are with respect to the statistics of the training set, you must also apply the same transformations to any test data you want to submit to the prediction model.
Assuming you have new data points stored in a matrix called xx, you would normalize them and then perform the predictions:
xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);
Now that you have this, you can perform your predictions:
pred = sigmoid(xxnew*theta) >= 0.5;
You can change the threshold of 0.5 to be whatever you believe is best that determines whether examples belong in the positive or negative class.
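For readers working in Python rather than MATLAB, the normalization above could be sketched in NumPy as follows. The function names and the tiny data set are hypothetical; ddof=1 is used so that the standard deviation matches MATLAB's default std, and the first column is assumed to be the column of ones.

```python
import numpy as np

def fit_standardizer(x):
    # Per-feature mean and standard deviation from the training set.
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    # Leave the leading column of ones (the bias column) untouched.
    mean[0], std[0] = 0.0, 1.0
    return mean, std

def standardize(x, mean, std):
    return (x - mean) / std

x_train = np.array([[1.0, 10.0, 1000.0],
                    [1.0, 20.0, 3000.0],
                    [1.0, 30.0, 2000.0]])
mean, std = fit_standardizer(x_train)
x_train_new = standardize(x_train, mean, std)

# New data must be transformed with the training statistics, not its own.
x_test_new = standardize(np.array([[1.0, 25.0, 1500.0]]), mean, std)
```

The test point is scaled with the mean and standard deviation of the training set, which mirrors the xxnew line above.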
The learning rate is too large
As you mentioned in the comments, once you normalize the data the costs appear to be finite but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate or alpha is too large, each iteration will overshoot the minimum, making the cost at each iteration oscillate or even diverge, which is what appears to be happening. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can't be represented using floating point precision.
As such, one other option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular method to determine what the best learning rate would be is to perform gradient descent on a range of logarithmically spaced values of alpha and seeing what the final cost function value is and choosing the learning rate that resulted in the smallest cost.
Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.
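The learning-rate search described above can be sketched in NumPy. Everything here is an illustrative assumption: the tiny one-feature data set, the range of alpha values, the number of iterations, and the clipping inside the cost function, which is only there to keep the reported cost finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x, y, theta):
    # Clip the probabilities so that log never receives exactly 0.
    h = np.clip(sigmoid(x @ theta), 1e-12, 1 - 1e-12)
    return float(np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)))

def batch_gd(x, y, alpha, iters=200):
    # Vectorized version of the per-feature update loop.
    theta = np.zeros(x.shape[1])
    for _ in range(iters):
        theta -= alpha * x.T @ (sigmoid(x @ theta) - y)
    return theta

# Column of ones plus one feature; the classes overlap, so the optimum is finite.
x = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1.0])

alphas = np.logspace(-4, -1, 4)    # 1e-4, 1e-3, 1e-2, 1e-1
final_costs = {a: cost(x, y, batch_gd(x, y, a)) for a in alphas}
best_alpha = min(final_costs, key=final_costs.get)
```

On real data the candidate range would usually be wider, and it is worth plotting the whole cost history for each alpha rather than looking only at the final value.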
Let's assume you have an observation where:
the true value is y_i = 1
your model is quite extreme and says that P(y_i = 1) = 1
Then your cost function will get a value of NaN because you're adding 0 * log(0), which is undefined. Hence:
Your formula for the cost function has a problem (there is a subtle 0, infinity issue)!
As @rayryeng pointed out, 0 * log(0) produces a NaN because 0 * Inf isn't kosher. This is actually a huge problem: if your algorithm believes it can predict a value perfectly, it incorrectly assigns a cost of NaN.
Instead of:
cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
You can avoid multiplying 0 by infinity by instead writing your cost function in Matlab as:
y_logical = y == 1;
cost = sum(-log(htheta(y_logical))) + sum( - log(1 - htheta(~y_logical)));
The idea is if y_i is 1, we add -log(htheta_i) to the cost, but if y_i is 0, we add -log(1 - htheta_i) to the cost. This is mathematically equivalent to -y_i * log(htheta_i) - (1 - y_i) * log(1- htheta_i) but without running into numerical problems that essentially stem from htheta_i being equal to 0 or 1 within the limits of double precision floating point.
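The same masking idea carries over to NumPy. The sketch below contrasts the naive formula with the masked one on a made-up example in which two predictions are exactly 1 and 0; the function names are illustrative.

```python
import numpy as np

def cost_naive(y, h):
    # Direct translation of the original formula; 0 * log(0) yields NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))

def cost_masked(y, h):
    # Only evaluate the log term that actually applies to each sample.
    positive = y == 1
    return np.sum(-np.log(h[positive])) + np.sum(-np.log(1 - h[~positive]))

y = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 0.0, 0.9])     # two perfect predictions, one near miss

naive = cost_naive(y, h)          # NaN because of the 0 * log(0) terms
masked = cost_masked(y, h)        # finite: only -log(0.9) contributes
```

The masked version still returns Inf when a prediction is confidently wrong (for example y = 1 and h = 0), which is the mathematically correct cost in that case.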
It happened to me because of an indeterminate form of the type:
0*log(0)
This can happen when one of the predicted values Y equals either 0 or 1.
In my case the solution was to add an if statement to the Python code as follows:
y * np.log(Y) + (1 - y) * np.log(1 - Y) if (Y != 1 and Y != 0) else 0
This way, when the actual value (y) and the predicted one (Y) are equal, no cost needs to be computed, which is the expected behavior.
(Notice that when a given Y is converging to 0, the left addend is canceled (because y = 0) and the right addend tends toward 0. The same happens when Y converges to 1, but with the opposite addend.)
(There is also a very rare scenario, which you probably won't need to worry about, where y = 0 and Y = 1 or vice versa, but if your dataset is standardized and the weights are properly initialized it won't be an issue.)