Impact of using relu for gradient descent - neural-network

What impact does the fact the relu activation function does not contain a derivative ?
How to implement the ReLU function in Numpy implements relu as maximum of (0 , matrix vector elements).
Does this mean for gradient descent we do not take derivative of relu function ?
Update :
From Neural network backpropagation with RELU
this text aids in understanding :
The ReLU function is defined as: For x > 0 the output is x, i.e. f(x)
= max(0,x)
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you
modify the activation function to be f(x) = max(e,x) for a small e.
Generally: A ReLU is a unit that uses the rectifier activation
function. That means it works exactly like any other hidden layer but
except tanh(x), sigmoid(x) or whatever activation you use, you'll
instead use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid
activation it's literally 1 line of change. Nothing about forward- or
back-propagation changes algorithmically. If you haven't got the
simpler model working yet, go back and start with that first.
Otherwise your question isn't really about ReLUs but about
implementing a NN as a whole.
But this still leaves some confusion as the neural network cost function typically takes derivative of activation function, so for relu how does this impact cost function ?

The standard answer is that the input to ReLU is rarely exactly zero, see here for example, so it doesn't make any significant difference.
Specifically, for ReLU to get a zero input, the dot product of one entire row of the input to a layer with one entire column of the layer's weight matrix would have to be exactly zero. Even if you have an all-zero input sample, there should still be a bias term in the last position, so I don't really see this ever happening.
However, if you want to test for yourself, try implementing the derivative at zero as 0, 0.5, and 1 and see if anything changes.
The PyTorch docs give a simple neural network with numpy example with one hidden layer and relu activation. I have reproduced it below with a fixed random seed and three options for setting the behavior of the ReLU gradient at 0. I have also added a bias term.
N, D_in, H, D_out = 4, 2, 30, 1
# Create random input and output data
x = x = np.random.randn(N, D_in)
x = np.c_(x, no.ones(x.shape[0]))
y = x = np.random.randn(N, D_in)
np.random.seed(1)
# Randomly initialize weights
w1 = np.random.randn(D_in+1, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0) # using ReLU as activate function
y_pred = h_relu.dot(w2)
# Compute and print loss
loss = np.square(y_pred - y).sum() # loss function
loss_col.append(loss)
print(t, loss, y_pred)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y) # the last layer's error
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T) # the second laye's error
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0 # grad at zero = 1
# grad[h <= 0] = 0 # grad at zero = 0
# grad_h[h < 0] = 0; grad_h[h == 0] = 0.5 # grad at zero = 0.5
grad_w1 = x.T.dot(grad_h)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2

Related

Reconstructing Sklearn MLP Regression in MatLab

I am using Sklearn to train a MultiLayer Perceptron Regression on 12 features and one output. The StandardScalar() is fit to the training data and applied to all input data. After a training period with architectural optimization, I get a model that is seemingly quite accurate (<10% error). I now need to extract the weights and biases in order to implement the prediction in real time on a system that interacts with a person. This is being done with my_model.coefs_ for weights and my_model.intercepts_ for the biases. The weights are appropriately shaped for the number of nodes in my model and the biases have the appropriate lengths for each layer.
The problem is now that I implement the matrix algebra in MatLab and get wildly different predictions from what my_model.predict() yields.
My reconstruction process for a 2 layer MLP (with 11 nodes in the first layer and 10 nodes in the second):
scale() % elementwise subtract feature mean and divide by feature stdev
scaled_obs = scale(raw_obs)
% Up to this point results from MatLab == Sklearn
weight1 = [12x11] % weights to transition from the input layer to the first hidden layer
weight2 = [11x10]
weight3 = [10x1]
bias1 = [11x1] % bias to add to the first layer after weight1 has been applied
bias2 = [10x1]
bias3 = [1x1]
my_prediction = ((( scaled_obs * w1 + b1') * w2 + b2') * w3 + b3);
I also tried
my_prediction2 = ((( scaled_obs * w1 .* b1') * w2 .* b2') * w3 .* b3); % because nothing worked...```
for my specific data:
Sklearn prediction = 1.731
my_prediction = -50.347
my_prediction2 = -3.2075
Is there another weight/bias that I am skipping when extracting relevant params from my_model? Is my order of operations in the reconstruction flawed?
In my opinion my_prediction = ((( scaled_obs * w1 + b1') * w2 + b2') * w3 + b3); is correct, but there is only 1 missing part and that is activation function. What was the activation function you had passed for the model. By default MLPRegressor have relu as activation function from first layer to third last layer(inclusive). Second last layer doesn't have any activation function. And output layer have a separate activation function which is identity function, basically f(x) = x so you don't have to do anything for that.
If you selected relu or if You didn't at all selected an activation (then relu is default), then you have to do something like this in numpy as np.maximum(0, your_layer1_calculation), I am not sure how this is done in matlab
So final formula would be :
layer1 = np.dot(scaled_inputs, weight0) + bias0
layer2 = np.dot(np.maximum(0, layer1), weight1) + bias1
layer......
layer(n-1) = np.dot(np.maximum(0, layer(n-2), weight(n-1)) + bias(n-1)
layer(n) = layer(n-1) # identity function

Does correction to weights include derivative of Sigmoid function also?

Let's evaluate usage of this line in the block of code given below.
L1_delta = L1_error * nonlin(L1,True) # line 36
import numpy as np #line 1
# sigmoid function
def nonlin(x,deriv=False):
if(deriv==True):
return x*(1-x)
return 1/(1+np.exp(-x))
# input dataset
X = np.array([ [0,0,1],
[0,1,1],
[1,0,1],
[1,1,1] ])
# output dataset
y = np.array([[0,0,1,1]]).T
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)
# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1
for iter in range(1000):
# forward propagation
L0 = X
L1 = nonlin(np.dot(L0,syn0))
# how much did we miss?
L1_error = y - L1
# multiply how much we missed by the
# slope of the sigmoid at the values in L1
L1_delta = L1_error * nonlin(L1,True) # line 36
# update weights
syn0 += np.dot(L0.T,L1_delta)
print ("Output After Training:")
print (L1)
I wanted to know, is the line required? Why do we need the factor of derivative of Sigmoid?
I have seen many similar logistic regression examples where derivative of Sigmoid is not used.
For example
https://github.com/chayankathuria/LogReg01/blob/master/GradientDescent.py
Yes, the line is indeed required. You need the derivative of the activation function (in this case sigmoid) because your final output is only implicitly dependent of the weights.
That's why you need to apply the chain rule where the derivative of the sigmoid will appear.
I recommend you to take a look at this post regardind backpropagation: https://datascience.stackexchange.com/questions/28719/a-good-reference-for-the-back-propagation-algorithm
It explains the mathematics behind backpropagation quite well.

Fitting a neural network with ReLUs to polynomial functions

Out of curiosity I am trying to fit neural network with rectified linear units to polynomial functions.
For example, I would like to see how easy (or difficult) it is for a neural network to come up with an approximation for the function f(x) = x^2 + x. The following code should be able to do it, but seems to not learn anything. When I run
using Base.Iterators: repeated
ENV["JULIA_CUDA_SILENT"] = true
using Flux
using Flux: throttle
using Random
f(x) = x^2 + x
x_train = shuffle(1:1000)
y_train = f.(x_train)
x_train = hcat(x_train...)
m = Chain(
Dense(1, 45, relu),
Dense(45, 45, relu),
Dense(45, 1),
softmax
)
function loss(x, y)
Flux.mse(m(x), y)
end
evalcb = () -> #show(loss(x_train, y_train))
opt = ADAM()
#show loss(x_train, y_train)
dataset = repeated((x_train, y_train), 50)
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))
println("Training finished")
#show m([20])
it returns
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
Training finished
m([20]) = Float32[1.0]
Anyone here sees how I could make the network fit f(x) = x^2 + x?
There seem to be couple of things wrong with your trial that have mostly to do with how you use your optimizer and treat your input -- nothing wrong with Julia or Flux. Provided solution does learn, but is by no means optimal.
It makes no sense to have softmax output activation on a regression problem. Softmax is used in classification problems where the output(s) of your model represent probabilities and therefore should be on the interval (0,1). It is clear your polynomial has values outside this interval. It is usual to have linear output activation in regression problems like these. This means in Flux no output activation should be defined on the output layer.
The shape of your data matters. train! computes gradients for loss(d...) where d is a batch in your data. In your case a minibatch consists of 1000 samples, and this same batch is repeated 50 times. Neural nets are often trained with smaller batches sizes, but a larger sample set. In the code I provided all batches consist of different data.
For training neural nets, in general, it is advised to normalize your input. Your input takes values from 1 to 1000. My example applies a simple linear transformation to get the input data in the right range.
Normalization can also apply to the output. If the outputs are large, this can result in (too) large gradients and weight updates. Another approach is to lower the learning rate a lot.
using Flux
using Flux: #epochs
using Random
normalize(x) = x/1000
function generate_data(n)
f(x) = x^2 + x
xs = reduce(hcat, rand(n)*1000)
ys = f.(xs)
(normalize(xs), normalize(ys))
end
batch_size = 32
num_batches = 10000
data_train = Iterators.repeated(generate_data(batch_size), num_batches)
data_test = generate_data(100)
model = Chain(Dense(1,40, relu), Dense(40,40, relu), Dense(40, 1))
loss(x,y) = Flux.mse(model(x), y)
opt = ADAM()
ps = Flux.params(model)
Flux.train!(loss, ps, data_train, opt , cb = () -> #show loss(data_test...))

Pytorch, what are the gradient arguments

I am reading through the documentation of PyTorch and found an example where they write
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
where x was an initial variable, from which y was constructed (a 3-vector). The question is, what are the 0.1, 1.0 and 0.0001 arguments of the gradients tensor ? The documentation is not very clear on that.
Explanation
For neural networks, we usually use loss to assess how well the network has learned to classify the input image (or other tasks). The loss term is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradient of loss w.r.t to the parameters, which is actually leaf node in the computation graph (by the way, these parameters are mostly the weight and bias of various layers such Convolution, Linear and so on).
According to chain rule, in order to calculate gradient of loss w.r.t to a leaf node, we can compute derivative of loss w.r.t some intermediate variable, and gradient of intermediate variable w.r.t to the leaf variable, do a dot product and sum all these up.
The gradient arguments of a Variable's backward() method is used to calculate a weighted sum of each element of a Variable w.r.t the leaf Variable. These weight is just the derivate of final loss w.r.t each element of the intermediate variable.
A concrete example
Let's take a concrete and simple example to understand this.
from torch.autograd import Variable
import torch
x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)
# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated
# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)
In the above example, the outcome of first print is
2 0 0 0
[torch.FloatTensor of size 1x4]
which is exactly the derivative of z_1 w.r.t to x.
The outcome of second print is :
0 2 0 0
[torch.FloatTensor of size 1x4]
which is the derivative of z_2 w.r.t to x.
Now if use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t to x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So no surprisingly, the output of 3rd print is:
2 2 2 2
[torch.FloatTensor of size 1x4]
It should be noted that weight vector [1, 1, 1, 1] is exactly derivative of loss w.r.t to z_1, z_2, z_3 and z_4. The derivative of loss w.r.t to x is calculated as:
d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx
So the output of 4th print is the same as the 3rd print:
2 2 2 2
[torch.FloatTensor of size 1x4]
Typically, your computational graph has one scalar output says loss. Then you can compute the gradient of loss w.r.t. the weights (w) by loss.backward(). Where the default argument of backward() is 1.0.
If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).
Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).
This means to calculate -0.1*d(loss1)/dw, d(loss2)/dw, 0.0001*d(loss3)/dw simultaneously.
Here, the output of forward(), i.e. y is a a 3-vector.
The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.
For eg. if x is the input, y = [y1, y2, y3] is an intermediate output which is used to compute the final output z,
Then,
dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx
So here, the three values to backward are
[dz/dy1, dz/dy2, dz/dy3]
and then backward() computes dz/dx
The original code I haven't found on PyTorch website anymore.
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
The problem with the code above is there is no function based on how to calculate the gradients. This means we don't know how many parameters (arguments the function takes) and the dimension of parameters.
To fully understand this I created an example close to the original:
Example 1:
a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)
print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])
I assumed our function is y=3*a + 2*b*b + torch.log(c) and the parameters are tensors with three elements inside.
You can think of the gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) like this is the accumulator.
As you may hear, PyTorch autograd system calculation is equivalent to Jacobian product.
In case you have a function, like we did:
y=3*a + 2*b*b + torch.log(c)
Jacobian would be [3, 4*b, 1/c]. However, this Jacobian is not how PyTorch is doing things to calculate the gradients at a certain point.
PyTorch uses forward pass and backward mode automatic differentiation (AD) in tandem.
There is no symbolic math involved and no numerical differentiation.
Numerical differentiation would be to calculate δy/δb, for b=1 and b=1+ε where ε is small.
If you don't use gradients in y.backward():
Example 2
a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward()
print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)
You will simply get the result at a point, based on how you set your a, b, c tensors initially.
Be careful how you initialize your a, b, c:
Example 3:
a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(a.grad) # tensor([3.3003])
print(b.grad) # tensor([0.])
print(c.grad) # tensor([inf])
If you use torch.empty() and don't use pin_memory=True you may have different results each time.
Also, note gradients are like accumulators so zero them when needed.
Example 4:
a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward(retain_graph=True)
y.backward()
print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
Lastly few tips on terms PyTorch uses:
PyTorch creates a dynamic computational graph when calculating the gradients in forward pass. This looks much like a tree.
So you will often hear the leaves of this tree are input tensors and the root is output tensor.
Gradients are calculated by tracing the graph from the root to the leaf and multiplying every gradient in the way using the chain rule. This multiplying occurs in the backward pass.
Back some time I created PyTorch Automatic Differentiation tutorial that you may check interesting explaining all the tiny details about AD.

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on 2 datasets created from the audio files:
The first set is created via wavread by extracting vectors of each file: the set is 834 by 48116 matrix. Each traning example is a 48116 vector of the wav's frequencies.
The second set is created by extracting frequencies of 3 formants of the vowels, where each formant(feature) has its' frequency range (for example, F1 range is 500-1500Hz, F2 is 1500-2000Hz and so on). Each training example is a 3-vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;
grad = ((h-y)'*X)/m;
theta_partial = theta;
theta_partial(1) = 0;
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels,
[theta] = fmincg(#(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
all_theta(c, :) = theta';
end
where num_labels = 8, lambda(regularization) is 0.1
With the first set, MaxIter = 50, and I get ~99.8% classification accuracy.
With the second set and MaxIter=50, the accuracy is poor - 62.589928
I thought about increasing MaxIter to a larger value to improve the performance, however, even at a ridiculous amount of iterations, the result doesn't go higher than 66.546763. Changing of the regularization value (lambda) doesn't seem to influence the results in any better way.
What could be the problem? I am new to machine learning and I can't seem to catch what exactly causes this drastic difference. The only reason that obviously stands out for me is that the first set's examples are very long vectors, hence, larger amount of features, and the second set's examples are represented by short 3-vectors. Is this data not enough to classify the second set? If so, what can be done about it to achieve better classification results for the second set?