Training bias weight backpropagation - neural-network

I'm trying to write an XOR solution using neural networks and the sigmoid activation function. (With True=0.9 and False=0.1)
I'm at the backpropagation part now.
The formula I was given for computing weight adjustments is:
delta_weight(l,i,j) = gamma*output(l,i)*error_signal(l,j)
For example, the weight adjustment for the link between layer 1 (hidden), node 2 and layer 2 (output), node 0 is:
delta_weight(1,2,0)
I chose gamma=0.5
Since bias weights are associated with a single node I guessed the weight adjustment formula was:
delta_weight(l,i) = gamma*output(l,i)
My program is not working, so clearly my guess was incorrect. Could someone help me along?
Thanks a bunch!
EDIT: CODE
def applyInputs(self, inps):
    for i in range(len(self.layers)-1):
        for n, node in enumerate(self.layers[i+1].nodes):
            ans = 0
            for m, mode in enumerate(self.layers[i].nodes):
                ans += self.links[stringify(i,m,i+1,n)].weight * mode.output
            if node.bias == True:
                ans+= self.links[stringify(-1,-1,i+1,n)].weight
            node.set_output(response(ans))
    return self.layers[len(self.layers)-1].nodes[0].output

def computeErrorSignals(self, out): # 'out' is the output of the entire network (only 1 output node)
    # output node error signal
    output_node = self.layers[len(self.layers)-1].nodes[0]
    fin_err = (out - output_node.output)*output_node.output*(1-output_node.output)
    output_node.set_error(fin_err)
    # hidden node error signals
    for j in range(len(self.layers[1].nodes)):
        hid_node = self.layers[1].nodes[j]
        err = (hid_node.output)*(1-hid_node.output)*self.layers[2].nodes[0].error_signal*self.links[stringify(1,j,2,0)].weight
        hid_node.set_error(err)

def computeWeightAdjustments(self):
    for i in range(len(self.layers)-1):
        for n, node in enumerate(self.layers[i+1].nodes):
            for m, mode in enumerate(self.layers[i].nodes):
                self.links[stringify(i,m,i+1,n)].weight += ((0.5)*self.layers[i+1].nodes[n].error_signal*self.layers[i].nodes[m].output)
            if node.bias == True:
                self.links[stringify(-1,-1,i+1,n)].weight += ((0.5)*self.layers[i].nodes[m].output)
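A bias is usually treated as an ordinary weight whose source node always outputs 1. Under that assumption, the given formula delta_weight(l,i,j) = gamma*output(l,i)*error_signal(l,j) reduces to gamma*error_signal(l,j) for a bias weight. In the code above, that would mean using self.layers[i+1].nodes[n].error_signal in the bias update instead of self.layers[i].nodes[m].output. A minimal sketch follows; the function names weight_delta and bias_delta are hypothetical and not part of the question's code.

```python
def weight_delta(gamma, output_i, error_signal_j):
    """Weight adjustment for a link from node i (source) to node j (target)."""
    return gamma * output_i * error_signal_j


def bias_delta(gamma, error_signal_j):
    """Bias adjustment for node j.

    Assumption: the bias is a weight fed by a constant source whose
    output is 1, so the general rule applies with output_i = 1.
    """
    return weight_delta(gamma, 1.0, error_signal_j)


# Example with gamma = 0.5, as chosen in the question.
print(weight_delta(0.5, 0.8, 0.2))
print(bias_delta(0.5, 0.2))
```

Note that the adjustment depends on the error signal of the node the bias belongs to, not on the output of any node in the previous layer.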

Related

pytorch - model.named_parameters() returns 0 after optimizer.zero_grad() step

I am trying to store the weights of the model. The code is given below:
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss)
    progress_bar.update(1)
    progress_bar.set_postfix(loss=round(loss.item(), 3))
    del outputs
    gc.collect()
    torch.cuda.empty_cache()
    if (step+1) % args.gradient_accumulation_steps == 0 or (step+1) == len(train_dataloader):
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        reference_gradient = [p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel())
                              for n, p in model.named_parameters()]
        reference_gradient = torch.cat(reference_gradient)
However, reference_gradient tensor has all zeros in it. How can I save the gradients of the entire model?
If you call zero_grad, you delete the gradient information. You cannot access the gradients after they have been reset. You need to save the gradients before calling optimizer.zero_grad().
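A minimal sketch of that ordering, assuming a toy torch.nn.Linear model and an SGD optimizer (both are stand-ins chosen for illustration, not the asker's model or setup). The gradients are copied after backward() and before optimizer.step() and optimizer.zero_grad(). In the question's loop, the equivalent snapshot would go inside the `if` branch, ahead of optimizer.step().

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the real model, optimizer, and batch (assumptions).
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Snapshot the gradients BEFORE they are reset. clone() makes the copy
# independent of any later in-place changes to p.grad.
reference_gradient = torch.cat([
    p.grad.detach().clone().view(-1) if p.grad is not None
    else torch.zeros(p.numel(), device=p.device)
    for _, p in model.named_parameters()
])

optimizer.step()
optimizer.zero_grad()  # gradients are now zeroed or set to None

print(reference_gradient)
```

Depending on the PyTorch version and the set_to_none argument, zero_grad either fills the gradients with zeros or sets them to None; in both cases the information is gone afterwards.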

Transforming Train argument in Chainer 5

How can I change this train argument (from older-version code) so that it works with trainer extensions? What changes are necessary to use this code in Chainer 5.4.0?
ValueError: train argument is not supported anymore. Use chainer.using_config
[AutoEncoder/StackedAutoEncoder/Regression.py](https://github.com/quolc/chainer-ML-examples/blob/master/mnist-stacked-autoencoder/net.py)
[Train.py](https://github.com/quolc/chainer-ML-examples/blob/master/mnist-stacked-autoencoder/train_mnist_sae.py)
for epoch in range(0, n_epoch):
    print(' epoch {}'.format(epoch+1))
    perm = np.random.permutation(N)
    permed_data = np.array(input_data[perm])
    sum_loss = 0
    start = time.time()
    for i in range(0, N, batchsize):
        x = chainer.Variable(permed_data[i:i+batchsize])
        y = chainer.Variable(permed_data[i:i+batchsize])
        optimizer.update(model, x, y)
        sum_loss += float(model.loss.data) * len(y.data)
    end = time.time()
    throughput = N / (end - start)
    print(' train mean loss={}, throughput={} data/sec'.format(sum_loss / N, throughput))
    sys.stdout.flush()
# prepare train data for next layer
x = chainer.Variable(np.array(train_data))
train_data_for_next_layer = cuda.to_cpu(ae.encode(x, train=False).data)
The errors point to two different sections:
1. optimizer.update(model, x, y)
2. The second line under "prepare train data for next layer", where the number of nodes in each layer is mismatched. The error is given below.
InvalidType:
Invalid operation is performed in: LinearFunction (Forward)
Expect: prod(in_types[0].shape[1:]) == in_types[1].shape[1]
Actual: 784 != 250
As to train argument, the details are written here: https://docs.chainer.org/en/stable/upgrade_v2.html
In v1 the train argument was used by dropout, but Chainer now uses a config to manage the phase (training or not).
So, there are two things to do.
First, remove the train arguments from the scripts.
Second, move the inference code into a using_config context:
with chainer.using_config('train', False):
    # define the inference process
Regarding "prepare train data for next layer second line where they mismatch the number of nodes in each layer":
Could you share the error messages?

Impact of using relu for gradient descent

What impact does it have that the ReLU activation function does not have a derivative?
The question "How to implement the ReLU function in Numpy" implements ReLU as the maximum of 0 and the matrix/vector elements.
Does this mean that for gradient descent we do not take the derivative of the ReLU function?
Update:
This text from "Neural network backpropagation with RELU" aids in understanding:
The ReLU function is defined as: For x > 0 the output is x, i.e. f(x) = max(0,x)
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you
modify the activation function to be f(x) = max(e,x) for a small e.
Generally: A ReLU is a unit that uses the rectifier activation
function. That means it works exactly like any other hidden layer, but
instead of tanh(x), sigmoid(x) or whatever activation you use, you'll
use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid
activation it's literally 1 line of change. Nothing about forward- or
back-propagation changes algorithmically. If you haven't got the
simpler model working yet, go back and start with that first.
Otherwise your question isn't really about ReLUs but about
implementing a NN as a whole.
But this still leaves some confusion: the gradient of the neural network cost function typically involves the derivative of the activation function, so how does ReLU affect the cost function?
The standard answer is that the input to ReLU is rarely exactly zero, see here for example, so it doesn't make any significant difference.
Specifically, for ReLU to get a zero input, the dot product of one entire row of the input to a layer with one entire column of the layer's weight matrix would have to be exactly zero. Even if you have an all-zero input sample, there should still be a bias term in the last position, so I don't really see this ever happening.
However, if you want to test for yourself, try implementing the derivative at zero as 0, 0.5, and 1 and see if anything changes.
The PyTorch docs give a simple neural network with numpy example with one hidden layer and relu activation. I have reproduced it below with a fixed random seed and three options for setting the behavior of the ReLU gradient at 0. I have also added a bias term.
import numpy as np

N, D_in, H, D_out = 4, 2, 30, 1
np.random.seed(1)  # fix the seed before any random draws
# Create random input and output data
x = np.random.randn(N, D_in)
x = np.c_[x, np.ones(x.shape[0])]  # append a column of ones as the bias term
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in+1, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)
    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)  # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # grad at zero = 1
    # grad_h[h <= 0] = 0  # grad at zero = 0
    # grad_h[h < 0] = 0; grad_h[h == 0] *= 0.5  # grad at zero = 0.5
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
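The three options can also be isolated in one small helper, which makes the choice at zero explicit. This is only a sketch: the name relu_grad and its at_zero parameter are hypothetical, introduced here for illustration.

```python
import numpy as np


def relu_grad(h, at_zero=0.0):
    """Elementwise derivative of max(h, 0).

    Returns 1 where h > 0, 0 where h < 0, and `at_zero` where h == 0
    (the derivative is undefined there, so the value is a convention).
    """
    return np.where(h > 0, 1.0, np.where(h < 0, 0.0, at_zero))


h = np.array([-2.0, 0.0, 3.0])
upstream = np.array([10.0, 10.0, 10.0])  # gradient arriving from the next layer

for at_zero in (0.0, 0.5, 1.0):
    # Chain rule: upstream gradient times the local derivative.
    print(at_zero, upstream * relu_grad(h, at_zero))
```

Only the middle entry changes between the three settings, which is why the choice rarely matters when pre-activations are almost never exactly zero.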

Back Propagation Neural Network Hidden Layer all output is 1

Hi everyone, I have created a neural network with 1600 inputs, one hidden layer with a varying number of neurons, and 24 output neurons.
My code shows that the error decreases each epoch, but the output of the hidden layer is always 1. Because of this, the adjusted weights always produce the same result for my testing data.
I have tried different numbers of neurons and different learning rates, and I also initialize my weights randomly. I use the sigmoid function as my activation function since each output is either 1 or 0.
What is the main reason the output of the hidden layer is always 1, and how should I solve it?
The purpose of this neural network is to recognize 24 hand shapes for the alphabet; I am using intensity data in the first phase of my project.
I have tried 30, 100, and even 1000 hidden neurons, but the output of the hidden layer is still 1. Because of this, the outcomes on the testing data are always similar.
I have added the code for my network.
Thanks
g = inline('logsig(x)');
[row, col] = size(input);
numofInputNeurons = col;
weight_input_hidden = rand(numofInputNeurons, numofFirstHiddenNeurons);
weight_hidden_output = rand(numofFirstHiddenNeurons, numofOutputNeurons);
epochs = 0;
errorMatrix = [];
while(true)
    if(totalEpochs > 0 && epochs >= totalEpochs)
        break;
    end
    totalError = 0;
    epochs = epochs + 1;
    for i = 1:row
        targetRow = zeros(1, numofOutputNeurons);
        targetRow(1, target(i)) = 1;
        hidden_output = g(input(1, 1:end)*weight_input_hidden);
        final_output = g(hidden_output*weight_hidden_output);
        error = abs(targetRow - final_output);
        error = sum(error);
        totalError = totalError + error;
        if(error ~= 0)
            delta_final_output = learningRate * (targetRow - final_output) .* final_output .* (1 - final_output);
            delta_hidden_output = learningRate * (hidden_output) .* (1-hidden_output) .* (delta_final_output * weight_hidden_output');
            for m = 1:numofFirstHiddenNeurons
                for n = 1:numofOutputNeurons
                    current_changes = delta_final_output(1, n) * hidden_output(1, m);
                    weight_hidden_output(m, n) = weight_hidden_output(m, n) + current_changes;
                end
            end
            for m = 1:numofInputNeurons
                for n = 1:numofFirstHiddenNeurons
                    current_changes = delta_hidden_output(1, n) * input(1, m);
                    weight_input_hidden(m, n) = weight_input_hidden(m, n) + current_changes;
                end
            end
        end
    end
    totalError = totalError / (row);
    errorMatrix(end + 1) = totalError;
    if(errorThreshold > 0 && totalEpochs == 0 && totalError < errorThreshold)
        break;
    end
end
I see a few obvious errors that need fixing in your code:
1) You have no negative weights when initialising. This is likely to get the network stuck. The weight initialisation should be something like:
weight_input_hidden = 0.2 * rand(numofInputNeurons, numofFirstHiddenNeurons) - 0.1;
2) You have not implemented bias. That will severely limit the ability of the network to learn. You should go back to your notes and figure that out. It is usually implemented as an extra column of 1's inserted into the input and activation vectors/matrices before determining the activations of each layer, with a matching additional row of weights for each of those extra 1's.
3) Your delta for output layer is wrong. This line
delta_final_output = learningRate * (targetRow - final_output) .* final_output .* (1 - final_output);
. . . is not the delta for the output layer activations. It has some extra unwanted factors.
The correct delta for logloss objective function and sigmoid activation in output layer would be:
delta_final_output = (final_output - targetRow);
There are other possibilities, depending on your objective function, which is not shown. Your original code is close to correct for mean squared error, which would probably still work if you changed the sign and removed the factor of learningRate.
4) Your delta for hidden layer is wrong. This line:
delta_hidden_output = learningRate * (hidden_output) .* (1-hidden_output) .* (delta_final_output * weight_hidden_output');
. . . is not the delta for the hidden layer activations. You have multiplied by the learningRate for some reason (combined with the other delta that means you have a factor of learningRate squared).
The correct delta would be:
delta_hidden_output = (hidden_output) .* (1-hidden_output) .* (delta_final_output * weight_hidden_output');
5) Your weight update step needs adjusting to match fixes to (3) and (4). These lines:
current_changes = delta_final_output(1, n) * hidden_output(1, m);
would need to be adjusted to get the correct sign and learning rate multiplier:
current_changes = -learningRate * delta_final_output(1, n) * hidden_output(1, m);
That's 5 bugs from looking through the code, I may have missed some. But I think that's more than enough for now.
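The five fixes can be put together in a short NumPy sketch. It is written in Python rather than MATLAB, uses logical OR as a toy dataset, and all names (train, add_bias, n_hidden and so on) are hypothetical choices for illustration, not the asker's code. It assumes a logloss objective with sigmoid outputs, as in point 3.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(a):
    """Append a constant 1 so the last weight row acts as the bias."""
    return np.append(a, 1.0)


def train(X, T, n_hidden=4, lr=0.5, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Fix 1: small weights of both signs. Fix 2: one extra row for the bias.
    W1 = rng.uniform(-0.1, 0.1, (X.shape[1] + 1, n_hidden))
    W2 = rng.uniform(-0.1, 0.1, (n_hidden + 1, T.shape[1]))
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(X, T):
            a0 = add_bias(x)
            h = sigmoid(a0 @ W1)
            a1 = add_bias(h)
            out = sigmoid(a1 @ W2)
            total += -np.sum(t * np.log(out) + (1 - t) * np.log(1 - out))
            # Fix 3: output delta for logloss + sigmoid, no learning rate.
            delta_out = out - t
            # Fix 4: hidden delta, no learning rate; the bias row of W2
            # is excluded because the bias unit has no incoming weights.
            delta_hid = h * (1 - h) * (delta_out @ W2[:-1].T)
            # Fix 5: step against the gradient, scaled once by lr.
            W2 -= lr * np.outer(a1, delta_out)
            W1 -= lr * np.outer(a0, delta_hid)
        losses.append(total / len(X))
    return W1, W2, losses


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [1.0]])  # logical OR as a toy target
W1, W2, losses = train(X, T)
print(losses[0], losses[-1])
```

Each sample in the loop uses its own row of X, and the learning rate appears only in the weight update, so it is applied exactly once per step.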

Neural Network convergence speed (Levenberg-Marquardt) (MATLAB)

I was trying to approximate a function (single input and single output) with an ANN. Using MATLAB toolbox I could see that with 5 or more neurons in the hidden layer, I can achieve a very nice result. So I am trying to do it manually.
Calculations:
As the network has only one input and one output, the partial derivative of the error (e=d-o, where 'd' is the desired output and 'o' is the actual output) with respect to a weight which connects a hidden neuron j to the output neuron will be -hj (where hj is the output of hidden neuron j);
The partial derivative of the error with respect to the output bias will be -1;
The partial derivative of the error with respect to a weight which connects the input to a hidden neuron j will be -woj*f'*i, where woj is the output weight of hidden neuron j, f' is the tanh() derivative and 'i' is the input value;
Finally, the partial derivative of the error with respect to a hidden layer bias will be the same as above (with respect to the input weight) except that here we don't have the input:
-woj*f'
The problem is:
the MATLAB algorithm always converges faster and better. I can achieve the same curve as MATLAB does, but my algorithm requires many more epochs.
I've tried to remove the pre- and postprocessing functions from the MATLAB algorithm. It still converges faster.
I've also tried to create and configure the network, and extract the weight/bias values before training so I could copy them to my algorithm to see if it converges faster, but nothing changed (is the weight/bias initialization inside the create/configure or the train function?).
Does the MATLAB algorithm have some kind of optimization inside the code?
Or might this difference be due only to the organization of the training set and the weight/bias initialization?
In case anyone wants to look at my code, here is the main training loop:
Err2 = N;
epochs = 0;
%compare MSE of error2
while ((Err2/N > 0.0003) && (u < 10000000) && (epochs < 100))
    epochs = epochs+1;
    Err = 0;
    %input->hidden weight vector
    wh = w(1:hidden_layer_len);
    %hidden->output weight vector
    wo = w((hidden_layer_len+1):(2*hidden_layer_len));
    %hidden bias
    bi = w((2*hidden_layer_len+1):(3*hidden_layer_len));
    %output bias
    bo = w(length(w));
    %start forward propagation
    for i=1:N
        %take next input value
        x = t(i);
        %propagate to hidden layer
        neth = x*wh + bi;
        %propagate through neurons
        ij = tanh(neth)';
        %propagate to output layer
        neto = ij*wo + bo;
        %propagate to output (purelin)
        output(i) = neto;
        %calculate difference from target (error)
        error(i) = yp(i) - output(i);
        %Backpropagation:
        %tanh derivative
        fhd = 1 - tanh(neth').*tanh(neth');
        %jacobian matrix
        J(i,:) = [-x*wo'.*fhd -ij -wo'.*fhd -1];
        %SSE (sum square error)
        Err = Err + 0.5*error(i)*error(i);
    end
    %calculate next error with updated weights and compare with old error
    %start error2 from error1 + 1 to enter while loop
    Err2 = Err+1;
    %while error2 is greater than old error and Mu (u) is not too large
    while ((Err2 > Err) && (u < 10000000))
        %Weight update
        w2 = w - (((J'*J + u*eye(3*hidden_layer_len+1))^-1)*J')*error';
        %New Error calculation
        %New weights to propagate
        wh = w2(1:hidden_layer_len);
        wo = w2((hidden_layer_len+1):(2*hidden_layer_len));
        %new bias to propagate
        bi = w2((2*hidden_layer_len+1):(3*hidden_layer_len));
        bo = w2(length(w));
        %calculate error2
        Err2 = 0;
        for i=1:N
            %forward propagation again
            x = t(i);
            neth = x*wh + bi;
            ij = tanh(neth)';
            neto = ij*wo + bo;
            output(i) = neto;
            error2(i) = yp(i) - output(i);
            %Error2 (SSE)
            Err2 = Err2 + 0.5*error2(i)*error2(i);
        end
        %compare MSE from error2 with a minimum
        %if greater, keep running
        if (Err2/N > 0.0003)
            %compare with old error
            if (Err2 <= Err)
                %if less, update weights and decrease Mu (u)
                w = w2;
                u = u/10;
            else
                %if greater, increment Mu (u)
                u = u*10;
            end
        end
    end
end
It's not easy to know the exact implementation of the Levenberg Marquardt algorithm in Matlab. You may try to run the algorithm one iteration at a time, and see if it is identical to your algorithm. You can also try other implementations, such as, http://www.mathworks.com/matlabcentral/fileexchange/16063-lmfsolve-m--levenberg-marquardt-fletcher-algorithm-for-nonlinear-least-squares-problems, to see if the performance can be improved. For simple learning problems, convergence speed may be a matter of learning rate. You might simply increase the learning rate to get faster convergence.
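For stepping through one iteration at a time, it can help to have a compact reference version of the loop in the question. The NumPy sketch below mirrors that loop (same parameter layout, same Jacobian terms, same multiply-or-divide-by-10 damping) under stated assumptions: it is written in Python, the target function, initial weights, and starting value of mu are arbitrary choices for illustration, and it makes no claim to match what MATLAB's toolbox does internally.

```python
import numpy as np


def forward(w, t, H):
    """1-input, H-tanh-hidden, 1-linear-output network; w = [wh, wo, bi, bo]."""
    wh, wo, bi, bo = w[:H], w[H:2 * H], w[2 * H:3 * H], w[3 * H]
    hid = np.tanh(np.outer(t, wh) + bi)      # shape (N, H)
    return hid, hid @ wo + bo                # output has shape (N,)


def jacobian(w, t, H):
    """Jacobian of e = d - o with respect to [wh, wo, bi, bo]."""
    wo = w[H:2 * H]
    hid, _ = forward(w, t, H)
    fhd = 1.0 - hid ** 2                     # tanh derivative
    return np.hstack([
        -(t[:, None] * wo * fhd),            # d e / d wh
        -hid,                                # d e / d wo
        -(wo * fhd),                         # d e / d bi
        -np.ones((t.size, 1)),               # d e / d bo
    ])


def lm_fit(t, d, H=5, mu=0.01, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5, 0.5, 3 * H + 1)    # arbitrary initialization
    e = d - forward(w, t, H)[1]
    sse = 0.5 * e @ e
    history = [sse]
    for _ in range(max_epochs):
        J = jacobian(w, t, H)
        while mu < 1e7:
            # Solve (J'J + mu*I) step = J'e instead of forming an inverse.
            step = np.linalg.solve(J.T @ J + mu * np.eye(w.size), J.T @ e)
            e_try = d - forward(w - step, t, H)[1]
            sse_try = 0.5 * e_try @ e_try
            if sse_try <= sse:               # accept and trust the model more
                w, e, sse = w - step, e_try, sse_try
                mu = max(mu / 10, 1e-10)
                break
            mu *= 10                         # reject and damp more strongly
        history.append(sse)
    return w, history


t = np.linspace(-1.0, 1.0, 50)
d = np.sin(3.0 * t)                          # toy target function
w, history = lm_fit(t, d)
print(history[0], history[-1])
```

Printing history after each epoch gives a per-iteration error trace that can be compared against another implementation started from the same weights.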