Create custom gradient descent in pytorch - neural-network

I am trying to use PyTorch autograd to implement my own batch gradient descent algorithm. I want to create a simple one-layer neural net with a linear activation function and mean squared error as the loss function. I can't seem to get my head around what exactly happens in the backward pass and how PyTorch interprets my outputs. I have coded one class specifying the linear function in the forward pass, and in the backward pass I calculated the gradients with respect to each variable. I also coded a class for the MSE function and specified the gradients with respect to its own variables in the backward pass. When I run a simple gradient descent algorithm, I get no errors, but the MSE only goes down in the first few iterations, and after that it continually goes up. This leads me to believe that I have made a mistake, but I am not sure where. Does anybody see the error in my code? Also, if somebody could explain to me what exactly grad_output stands for, that would be amazing.
Here are the functions:
import torch
from torch.autograd import Function
from torch.autograd import gradcheck

class Context:
    def __init__(self):
        self._saved_tensors = ()
    def save_for_backward(self, *args):
        self._saved_tensors = args
    @property
    def saved_tensors(self):
        return self._saved_tensors

class MSE(Function):
    @staticmethod
    def forward(ctx, yhat, y):
        ctx.save_for_backward(yhat, y)
        q = yhat.size()[0]
        mse = torch.sum((yhat-y)**2)/q
        return mse
    @staticmethod
    def backward(ctx, grad_output):
        yhat, y = ctx.saved_tensors
        q = yhat.size()[0]
        return 2*grad_output*(yhat-y)/q, -2*grad_output*(yhat-y)/q

class Linear(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        rows = X.size()[0]
        yhat = torch.mm(X,W) + b.repeat(rows,1)
        ctx.save_for_backward(yhat, X, W)
        return yhat
    @staticmethod
    def backward(ctx, grad_output):
        yhat, X, W = ctx.saved_tensors
        q = yhat.size()[0]
        p = yhat.size()[1]
        return torch.transpose(X, 0, 1), W, torch.ones(p)
And here is my gradient descent:
import torch
from torch.utils.tensorboard import SummaryWriter
from tp1moi import MSE, Linear, Context

x = torch.randn(50, 13)
y = torch.randn(50, 3)
w = torch.randn(13, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
epsilon = 0.05

writer = SummaryWriter()
for n_iter in range(100):
    linear = Linear.apply
    mse = MSE.apply
    loss = mse(linear(x, w, b), y)
    writer.add_scalar('Loss/train', loss, n_iter)
    print(f"Itérations {n_iter}: loss {loss}")
    loss.backward()
    with torch.no_grad():
        w -= epsilon*w.grad
        b -= epsilon*b.grad
        w.grad.zero_()
        b.grad.zero_()
Here is one output I got (they all look similar to this one):
Itérations 0: loss 72.99712371826172
Itérations 1: loss 7.509067535400391
Itérations 2: loss 7.309497833251953
Itérations 3: loss 7.124927997589111
Itérations 4: loss 6.955358982086182
Itérations 5: loss 6.800788402557373
Itérations 6: loss 6.661219596862793
Itérations 7: loss 6.536648750305176
Itérations 8: loss 6.427078723907471
Itérations 9: loss 6.3325090408325195
Itérations 10: loss 6.252938747406006
Itérations 11: loss 6.188369274139404
Itérations 12: loss 6.138798713684082
Itérations 13: loss 6.104228973388672
Itérations 14: loss 6.084658145904541
Itérations 15: loss 6.0800886154174805
Itérations 16: loss 6.090517520904541
Itérations 17: loss 6.115947723388672
Itérations 18: loss 6.156377792358398
Itérations 19: loss 6.2118072509765625
Itérations 20: loss 6.2822370529174805
Itérations 21: loss 6.367666721343994
Itérations 22: loss 6.468096733093262
Itérations 23: loss 6.583526611328125
Itérations 24: loss 6.713956356048584
Itérations 25: loss 6.859385967254639
Itérations 26: loss 7.019815444946289
Itérations 27: loss 7.195245742797852
Itérations 28: loss 7.385674953460693
Itérations 29: loss 7.591104507446289
Itérations 30: loss 7.811534881591797
Itérations 31: loss 8.046965599060059
Itérations 32: loss 8.297393798828125
Itérations 33: loss 8.562823295593262
Itérations 34: loss 8.843254089355469
Itérations 35: loss 9.138683319091797
Itérations 36: loss 9.449112892150879
Itérations 37: loss 9.774543762207031
Itérations 38: loss 10.114972114562988
Itérations 39: loss 10.470401763916016
Itérations 40: loss 10.840831756591797
Itérations 41: loss 11.226261138916016
Itérations 42: loss 11.626690864562988
Itérations 43: loss 12.042119979858398
Itérations 44: loss 12.472548484802246
Itérations 45: loss 12.917980194091797
Itérations 46: loss 13.378408432006836
Itérations 47: loss 13.853838920593262
Itérations 48: loss 14.344267845153809
Itérations 49: loss 14.849695205688477
Itérations 50: loss 15.370124816894531
Itérations 51: loss 15.905555725097656
Itérations 52: loss 16.455984115600586
Itérations 53: loss 17.02141571044922
Itérations 54: loss 17.601844787597656
Itérations 55: loss 18.19727325439453
Itérations 56: loss 18.807701110839844
Itérations 57: loss 19.43313217163086
Itérations 58: loss 20.07356071472168
Itérations 59: loss 20.728988647460938
Itérations 60: loss 21.3994197845459
Itérations 61: loss 22.084848403930664
Itérations 62: loss 22.7852783203125
Itérations 63: loss 23.50070571899414
Itérations 64: loss 24.23113441467285
Itérations 65: loss 24.9765625
Itérations 66: loss 25.73699188232422
Itérations 67: loss 26.512422561645508
Itérations 68: loss 27.302854537963867
Itérations 69: loss 28.108285903930664
Itérations 70: loss 28.9287166595459
Itérations 71: loss 29.764144897460938
Itérations 72: loss 30.614578247070312
Itérations 73: loss 31.48000717163086
Itérations 74: loss 32.36043930053711
Itérations 75: loss 33.2558708190918
Itérations 76: loss 34.16630172729492
Itérations 77: loss 35.091732025146484
Itérations 78: loss 36.032161712646484
Itérations 79: loss 36.98759460449219
Itérations 80: loss 37.95802307128906
Itérations 81: loss 38.943458557128906
Itérations 82: loss 39.943885803222656
Itérations 83: loss 40.959320068359375
Itérations 84: loss 41.98974609375
Itérations 85: loss 43.03517532348633
Itérations 86: loss 44.09561538696289
Itérations 87: loss 45.171043395996094
Itérations 88: loss 46.261474609375
Itérations 89: loss 47.366905212402344
Itérations 90: loss 48.487335205078125
Itérations 91: loss 49.62276840209961
Itérations 92: loss 50.773197174072266
Itérations 93: loss 51.93863296508789
Itérations 94: loss 53.11906433105469
Itérations 95: loss 54.31448745727539
Itérations 96: loss 55.524925231933594
Itérations 97: loss 56.75035095214844
Itérations 98: loss 57.990787506103516
Itérations 99: loss 59.2462158203125

Let's take a look at the implementation of MSE. The forward pass is MSE(y_hat, y) = mean((y_hat - y)²), which is straightforward. For the backward pass, we want the derivative of the final output z with respect to each input of the function; MSE has no learned parameters, so there are only the gradients with respect to y_hat and y. By the chain rule, dz/dy = d(y_hat - y)²/dy * dz/dMSE, i.e. -2*(y_hat - y)*dz/dMSE (and symmetrically +2*(y_hat - y)*dz/dMSE for y_hat). Not to confuse you here: I wrote dz/dMSE for the incoming gradient, the gradient flowing backward into the MSE node. In your notation, grad_output is dz/dMSE. So the backward pass simply returns ±2*(y_hat - y)*grad_output, normalized by the batch size q retrieved from y_hat.size(0).
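To make grad_output concrete: when the MSE value is the final scalar you call .backward() on, the incoming gradient dz/dMSE is just 1. A throwaway sketch (MSEPeek is not part of your code, it only prints what autograd hands to backward):

import torch
from torch.autograd import Function

class MSEPeek(Function):
    @staticmethod
    def forward(ctx, yhat, y):
        ctx.save_for_backward(yhat, y)
        return torch.sum((yhat - y) ** 2) / yhat.size(0)
    @staticmethod
    def backward(ctx, grad_output):
        yhat, y = ctx.saved_tensors
        q = yhat.size(0)
        print("grad_output =", grad_output)   # tensor(1.) when the loss is the final node
        return 2 * grad_output * (yhat - y) / q, -2 * grad_output * (yhat - y) / q

yhat = torch.randn(5, 3, requires_grad=True)
y = torch.randn(5, 3)
MSEPeek.apply(yhat, y).backward()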
The same thing goes for the Linear layer. It involves some more computation since, this time, the layer is parametrized by w and b. The forward pass is essentially x@w + b, while the backward pass consists in computing dz/dx, dz/dw, and dz/db. Writing f = x@w + b, after some work you can find that:
dz/dx = dz/df * d(x@w + b)/dx = dz/df @ w.T,
dz/dw = d(x@w + b)/dw * dz/df = x.T @ dz/df,
dz/db = dz/df summed over the batch, since d(x@w + b)/db = 1.
In terms of implementation this would look like (see the sketch below):
grad_output @ w.T for the gradient w.r.t. x,
x.T @ grad_output for the gradient w.r.t. w,
grad_output.sum(0) for the gradient w.r.t. b.
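If it helps, here is one way those three expressions could be wired into backward. This is only a sketch under the formulas above, not your original class (LinearFixed is an illustrative name); it saves X and W in forward so backward can use them, and gradcheck verifies the result numerically:

import torch
from torch.autograd import Function, gradcheck

class LinearFixed(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        ctx.save_for_backward(X, W)
        return torch.mm(X, W) + b            # broadcasting replaces b.repeat(rows, 1)
    @staticmethod
    def backward(ctx, grad_output):
        X, W = ctx.saved_tensors
        grad_X = grad_output.mm(W.t())       # dz/dx = dz/df @ W.T
        grad_W = X.t().mm(grad_output)       # dz/dw = X.T @ dz/df
        grad_b = grad_output.sum(0)          # dz/db = dz/df summed over the batch
        return grad_X, grad_W, grad_b

# numerical check in double precision
X = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
W = torch.randn(3, 2, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)
print(gradcheck(LinearFixed.apply, (X, W, b)))   # True if the gradients match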

Related

ValueError: Expected input batch_size (24) to match target batch_size (8)

I got many links for this and read different Stack Overflow answers related to it, but I am not able to figure it out.
My image size is torch.Size([8, 3, 16, 16]).
My architecture is as below:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # linear layer (784 -> 1 hidden node)
        self.fc1 = nn.Linear(16 * 16, 768)
        self.fc2 = nn.Linear(768, 64)
        self.fc3 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(p=.5)
    def forward(self, x):
        # flatten image input
        x = x.view(-1, 16 * 16)
        # add hidden layer, with relu activation function
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = F.log_softmax(self.fc3(x), dim=1)
        return x

# specify loss function
criterion = nn.NLLLoss()
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=.003)
# number of epochs to train the model
n_epochs = 30  # suggest training between 20-50 epochs
model.train()  # prep model for training
for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    ###################
    # train the model #
    ###################
    for data, target in trainloader:
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item()*data.size(0)
    # print training statistics
    # calculate average loss over an epoch
    train_loss = train_loss/len(trainloader.dataset)
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(
        epoch+1,
        train_loss
    ))
I am getting a value error:
ValueError: Expected input batch_size (24) to match target batch_size (8).
How do I fix it? My batch size is 8, my input image size is (16*16), and I have a 10-class classification problem here.
Your input images have 3 channels, therefore your input feature size is 16*16*3, not 16*16. Currently you treat each channel as a separate instance: after the x.view(-1, 16*16) flattening, the tensor has shape (24, 16*16), so the batch dimension becomes 8*3 = 24 instead of 8 and no longer matches the target batch size.
You could either:
Switch to a CNN to handle multi-channel inputs (here 3 channels).
Use a self.fc1 with 16*16*3 input features (see the sketch below).
If the input is RGB, you could even convert it to a 1-channel grayscale map first.
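A minimal sketch of the second option, assuming the (8, 3, 16, 16) input from the question; flattening with x.size(0) keeps the batch dimension intact instead of folding the channels into it:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16 * 16 * 3, 768)   # 3 channels flattened into the features
        self.fc2 = nn.Linear(768, 64)
        self.fc3 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(p=.5)
    def forward(self, x):
        x = x.view(x.size(0), -1)                # (8, 3, 16, 16) -> (8, 768)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        return F.log_softmax(self.fc3(x), dim=1)

print(Net()(torch.randn(8, 3, 16, 16)).shape)    # torch.Size([8, 10])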

Understanding the backward mechanism of LSTMCell in Pytorch

I want to hook into the backward pass of an LSTMCell in PyTorch, so during initialization I do the following (num_layers=4, hidden_size=1000, input_size=1000):
self.layers = nn.ModuleList([
    LSTMCell(
        input_size=input_size,
        hidden_size=hidden_size,
    )
    for layer in range(num_layers)
])
for l in self.layers:
    l.register_backward_hook(backward_hook)
In the forward pass I simply iterate the LSTMCell over the sequence length and the num_layers as follows:
for j in range(seqlen):
    input = #some tensor of size (batch_size, input_size)
    for i, rnn in enumerate(self.layers):
        # recurrent cell
        hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))
Where input is of size (batch_size, input_size), prev_hiddens[i] is size of (batch_size, hidden_size), prev_cells[i] is of size (batch_size, hidden_size).
In the backward_hook I print the size of the tensors that are input to this function:
def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print ("grad_output {}".format(grad))
    for grad in grad_input:
        print ("grad_input.size () {}".format(grad.size()))
As a result, the first time backward_hook is called, for example:
[A] For grad_output I get 2 tensors, of which the second is None. This is understandable because in the backward phase we have the gradient of the internal state (c) and the gradient of the output (h). The last step in the time dimension has no future hidden state, so its gradient is None.
[B] For grad_input I get 5 tensors (batch_size=9):
grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 1000])
grad_input.size () torch.Size([4000])
grad_input.size () torch.Size([4000])
My questions are:
(1) Is my understanding from [A] correct?
(2) How do I interpret the 5 tensors of the grad_input tuple? I thought there should be only 3, since there are only 3 inputs to the LSTMCell forward()?
Thanks
Your understanding of grad_input and grad_output is wrong. I am trying to explain it with a simpler example.
def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print ("grad_output.size {}".format(grad.size()))
    for grad in grad_input:
        if grad is None:
            print('None')
        else:
            print ("grad_input.size: {}".format(grad.size()))
    print()

model = nn.Linear(10, 20)
model.register_backward_hook(backward_hook)

input = torch.randn(8, 3, 10)
Y = torch.randn(8, 3, 20)
Y_pred = []
for i in range(input.size(1)):
    out = model(input[:, i])
    Y_pred.append(out)
loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
loss.backward()
The output is:
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
Explanation
grad_output: Gradient of the loss w.r.t. the layer output, Y_pred.
grad_input: Gradients of the loss w.r.t. the layer inputs. For a Linear layer, the inputs are the input tensor, the weight, and the bias.
So, in the output you see:
grad_input.size: torch.Size([8, 20]) # for the `bias`
None # for the `input`
grad_input.size: torch.Size([10, 20]) # for the `weight`
The Linear layer in PyTorch uses a LinearFunction which is as follows.
class LinearFunction(Function):
    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        # These needs_input_grad checks are optional and are only there to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)
        return grad_input, grad_weight, grad_bias
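As a side note, custom Functions like this are usually sanity-checked with torch.autograd.gradcheck on double-precision inputs, for example:

import torch
from torch.autograd import gradcheck

# assumes the LinearFunction class above is in scope
linear = LinearFunction.apply
inputs = (
    torch.randn(6, 4, dtype=torch.double, requires_grad=True),   # input
    torch.randn(3, 4, dtype=torch.double, requires_grad=True),   # weight: (out_features, in_features)
    torch.randn(3, dtype=torch.double, requires_grad=True),      # bias
)
print(gradcheck(linear, inputs))   # True when analytic and numeric gradients agree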
For LSTM, there are four sets of weight parameters.
weight_ih_l0
weight_hh_l0
bias_ih_l0
bias_hh_l0
So, in your case, the grad_input would be a tuple of 5 tensors. And as you mentioned, the grad_output is two tensors.
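If you want to double-check which parameters the cell actually owns, a quick sketch (note that nn.LSTMCell names them without the _l0 suffix used by nn.LSTM):

import torch.nn as nn

cell = nn.LSTMCell(input_size=1000, hidden_size=1000)
for name, p in cell.named_parameters():
    print(name, tuple(p.shape))
# weight_ih (4000, 1000)
# weight_hh (4000, 1000)
# bias_ih (4000,)
# bias_hh (4000,)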

Numerical analysis: secant method. What am I doing wrong

I'm trying to solve a problem regarding the application of the secant numerical method.
My MATLAB code is the following
function [f]= fsecante(t)
    R=24.7;
    L=2.74;
    C=0.000251;
    P1=-0.5*(R/L)*t;
    P2=t*sqrt(1/(L*C)-(R^2)/(4*L^2));
    f=2*exp(P1).*cos(P2)-1;
end
% initial iterates
x0=0;
x1=10^-4;
wanted=10^-8;
f0=fsecante(x0);
f1=fsecante(x1);
iter=0;
error=wanted;
while(error>=wanted)
    F=(x1-x0)/(f1-f0);
    xn=x1-F*f1
    error=abs(F*f1);
    iter=iter+1;
    x0=x1;
    x1=xn;
    f0=fsecante(x0);
    f1=fsecante(x1);
end
I used a calculator to get an idea of the value I should obtain, which is approximately 0.152652376.
However, using the method in MATLAB, it converges to 1.4204, which is way off what we should get.
What am I doing wrong?
My guess is that I have my error variable wrong in the loop? I also find it strange that my solution falls outside the interval [0, 1], where the solution should be. Can someone give me some clarification about what I am missing?
If this is just about solving the equation, use
fsolve(@fsecante, 0)
to find the root of the function closest to 0.
ans = 0.0257389353753764
You did nothing wrong, the secant method just does not converge for all initial points. Fast convergence is only guaranteed if convergence happens at all.
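If you want to experiment with how sensitive the plain secant iteration is to its starting points, here is a small transcription of the same function and loop (in Python for convenience; the constants are copied from fsecante):

import math

R, L, C = 24.7, 2.74, 0.000251

def f(t):
    # same function as fsecante
    return 2*math.exp(-0.5*(R/L)*t)*math.cos(t*math.sqrt(1/(L*C) - R**2/(4*L**2))) - 1

def secant(x0, x1, tol=1e-8, max_iter=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        step = f1*(x1 - x0)/(f1 - f0)
        x0, f0, x1 = x1, f1, x1 - step   # shift the iterates
        f1 = f(x1)
        if abs(step) < tol:
            break
    return x1

print(secant(0.0, 1e-4))   # the question's starting points
print(secant(0.0, 1.0))    # try other starting points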
For a method using secant steps with a guarantee of convergence, use the regula falsi method. In its Illinois variant it can be implemented as
x0=0
f0=fsecante(x0)
x1=1
f1=fsecante(x1)
wanted=10^-8;
iter=1;
while( abs(x1-x0) >= wanted)
    iter=iter+1
    F=(x1-x0)/(f1-f0);
    xn=x1-F*f1
    fn = fsecante(xn)
    if fn*f0 < 0
        x1=x0; f1=f0;
    else
        f1 = f1*0.5;
    end
    x0=xn; f0=fn;
end
and gives the result for x0=0; x1=1;
init: x0= 0 , f(x0)= 1
init: x1= 1 , f(x1)= -0.97824464599
n= 2: xn= 0.5054986510525 , f(xn)= -0.803719660003
n= 3: xn= 0.28025344639849 , f(xn)= -1.21180917676
n= 4: xn= 0.081858845659703, f(xn)= -2.38168069197
n= 5: xn= 0.007776289683393, f(xn)= 0.848005195428
n= 6: xn= 0.027227838393959, f(xn)= -0.090747663322
n= 7: xn= 0.025347489826149, f(xn)= 0.0235362327768
n= 8: xn= 0.025734738800888, f(xn)= 0.000253074462073
n= 9: xn= 0.025743020435749, f(xn)= -0.000246366400381
n=10: xn= 0.02573893523423 , f(xn)= 7.80359576957e-09
n=11: xn= 0.025738935363624, f(xn)= 2.40252262529e-13
n=12: xn= 0.025738935363632, f(xn)= -2.39919195621e-13

My keras neural network predicts the same straight line for both train and test data

I have a neural network as below:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_11 (Dense) (None, 36) 288
_________________________________________________________________
dense_12 (Dense) (None, 1) 37
=================================================================
Total params: 325
Trainable params: 325
Non-trainable params: 0
_________________________________________________________________
The activation functions for the first and second layers are "relu" and "sigmoid" respectively.
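For reference, a minimal sketch that reproduces the summary above, assuming the 7 input features implied by x_train and the tf.keras API (the original code may differ):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(36, activation='relu', input_shape=(7,)),   # 7*36 + 36 = 288 params
    layers.Dense(1, activation='sigmoid'),                   # 36 + 1 = 37 params
])
model.summary()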
My problem is that the output is a straight line (plot omitted here).
I did more investigation and found that the weights of this neural net (at least in the first layer) also form a straight line.
x_train shape is (2516, 7), and y_train shape is (280, 7)
One of the features (dimensions) of the input data is shown in a plot (omitted here), and the others are similar to it. The labels are also shown in a plot (omitted here).

Python Keras MLP for Multi-class classification value error while model fit

I'm getting a value error while running the Keras multi-class classification model using the code below:
model2 = Sequential()
model2.add(Dense(200, input_shape=(4132,), activation='relu'))
model2.add(Dense(200, activation='relu'))
model2.add(Dense(31, activation='softmax'))
SGD = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model2.compile(optimizer=SGD,
               loss='categorical_crossentropy',
               metrics=['accuracy'])
model2.fit(x_train, y_train, epochs=100, verbose=2) ---> Error on this line
Error:
Train Shape: (4132, 49)
Test Shape: (1033, 49)
Traceback (most recent call last):
File "ANN.py", line 213, in <module>
model2.fit(x_train, y_train, epochs=100, verbose=2)
File "C:\Users\C256121\AppData\Local\Programs\Python\Python36\lib\site-package
s\keras\models.py", line 960, in fit
validation_steps=validation_steps)
File "C:\Users\C256121\AppData\Local\Programs\Python\Python36\lib\site-package
s\keras\engine\training.py", line 1574, in fit
batch_size=batch_size)
File "C:\Users\C256121\AppData\Local\Programs\Python\Python36\lib\site-package
s\keras\engine\training.py", line 1407, in _standardize_user_data
exception_prefix='input')
File "C:\Users\C256121\AppData\Local\Programs\Python\Python36\lib\site-package
s\keras\engine\training.py", line 128, in _standardize_input_data
arrays[i] = array
ValueError: could not broadcast input array from shape (49,1) into shape (49)
I have 31 classes in the target variable. Please help.