Understanding the backward mechanism of LSTMCell in Pytorch - neural-network

I want to hook into the backward pass of a LSTMCell function in pytorch so in the initialization pass I do the following (num_layers=4, hidden_size=1000, input_size=1000):
self.layers = nn.ModuleList([
LSTMCell(
input_size=input_size,
hidden_size=hidden_size,
)
for layer in range(num_layers)
])
for l in self.layers:
l.register_backward_hook(backward_hook)
In the forward pass I simply iterate the LSTMCell over sequence length and the num_layers as follow:
for j in range(seqlen):
input = #some tensor of size (batch_size, input_size)
for i, rnn in enumerate(self.layers):
# recurrent cell
hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))
Where input is of size (batch_size, input_size), prev_hiddens[i] is size of (batch_size, hidden_size), prev_cells[i] is of size (batch_size, hidden_size).
In the backward_hook I print the size of the tensors that are input to this function:
def backward_hook(module, grad_input, grad_output):
for grad in grad_output:
print ("grad_output {}".format(grad))
for grad in grad_input:
print ("grad_input.size () {}".format(grad.size()))
As the results, for the first time backward_hook is called for example:
[A] For grad_output I get 2 tensors among which the second tensor is None. This is understandable because in the backward phase we have a gradient of internal states (c) and gradient of output (h). The last iteration in time dimension has no future hidden so its gradient is None.
[B] For grad_input I get 5 tensors (batch_size=9):
grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 1000])
grad_input.size () torch.Size([4000])
grad_input.size () torch.Size([4000])
My questions are:
(1) Is my understanding from [A] correct?
(2) How do I interpret the 5 tensors from the grad_input tuple? I thought there should have only 3 since there are only 3 inputs to the LSTMCell forward()?
Thanks

Your understanding of grad_input and grad_output is wrong. I am trying to explain it with a simpler example.
def backward_hook(module, grad_input, grad_output):
for grad in grad_output:
print ("grad_output.size {}".format(grad.size()))
for grad in grad_input:
if grad is None:
print('None')
else:
print ("grad_input.size: {}".format(grad.size()))
print()
model = nn.Linear(10, 20)
model.register_backward_hook(backward_hook)
input = torch.randn(8, 3, 10)
Y = torch.randn(8, 3, 20)
Y_pred = []
for i in range(input.size(1)):
out = model(input[:, i])
Y_pred.append(out)
loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
loss.backward()
The output is:
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
grad_output.size torch.Size([8, 20])
grad_input.size: torch.Size([8, 20])
None
grad_input.size: torch.Size([10, 20])
Explanation
grad_output: Gradient of the loss w.r.t. the layer output, Y_pred.
grad_input: Gradients of the loss w.r.t the layer inputs. For Linear layer, the inputs are the input tensor and the weight and the bias.
So, in the output you see:
grad_input.size: torch.Size([8, 20]) # for the `bias`
None # for the `input`
grad_input.size: torch.Size([10, 20]) # for the `weight`
The Linear layer in PyTorch uses a LinearFunction which is as follows.
class LinearFunction(Function):
# Note that both forward and backward are #staticmethods
#staticmethod
# bias is an optional argument
def forward(ctx, input, weight, bias=None):
ctx.save_for_backward(input, weight, bias)
output = input.mm(weight.t())
if bias is not None:
output += bias.unsqueeze(0).expand_as(output)
return output
# This function has only a single output, so it gets only one gradient
#staticmethod
def backward(ctx, grad_output):
# This is a pattern that is very convenient - at the top of backward
# unpack saved_tensors and initialize all gradients w.r.t. inputs to
# None. Thanks to the fact that additional trailing Nones are
# ignored, the return statement is simple even when the function has
# optional inputs.
input, weight, bias = ctx.saved_tensors
grad_input = grad_weight = grad_bias = None
# These needs_input_grad checks are optional and there only to
# improve efficiency. If you want to make your code simpler, you can
# skip them. Returning gradients for inputs that don't require it is
# not an error.
if ctx.needs_input_grad[0]:
grad_input = grad_output.mm(weight)
if ctx.needs_input_grad[1]:
grad_weight = grad_output.t().mm(input)
if bias is not None and ctx.needs_input_grad[2]:
grad_bias = grad_output.sum(0).squeeze(0)
return grad_input, grad_weight, grad_bias
For LSTM, there are four sets of weight parameters.
weight_ih_l0
weight_hh_l0
bias_ih_l0
bias_hh_l0
So, in your case, the grad_input would be a tuple of 5 tensors. And as you mentioned, the grad_output is two tensors.

Related

Neural Network Always Predicting Average Value

I'm trying to train a neural network to approximate a known scalar function of two variables; however, no matter the parameters of my training, the network always just ends up simply predicting the average value of the true outputs.
I am using an MLP and have tried:
using several network depths and widths
different optimizers (SGD and ADAM)
different activations (ReLU and Sigmoid)
changing the learning rate (several points within the range 0.1 to 0.001)
increasing the data (to 10,000 points)
increasing the number of epochs (to 2,000)
and different random seeds
to no avail.
My loss function is MSE and always plateaus to a value of about 5.14.
Regardless of changes I make, I get the following results:
Where the blue surface is the function to be approximated, and the green surface is the MLP approximation of the function, having a value that is roughly the average of the true function over that domain (the true average is 2.15 with a square of 4.64 - not far from the loss plateau value).
I feel like I could be missing something very obvious and have just been looking at it for too long. Any help is greatly appreciated! Thanks
I've attached my code here (I'm using JAX):
import jax.numpy as jnp
from jax import grad, jit, vmap, random, value_and_grad
import flax
import flax.linen as nn
import optax
seed = 2
key, data_key = random.split(random.PRNGKey(seed))
x1, x2, y= generate_data(data_key) # Data generation function
# Using Flax - define an MLP
class MLP(nn.Module):
features: Sequence[int]
#nn.compact
def __call__(self, x):
for feat in self.features[:-1]:
x = nn.relu(nn.Dense(feat)(x))
x = nn.Dense(self.features[-1])(x)
return x
# Define function that returns JITted loss function
def make_mlp_loss(input_data, true_y):
def mlp_loss(params):
pred_y = model.apply(params, input_data)
loss_vector = jnp.square(true_y.reshape(-1) - pred_y)
return jnp.average(loss_vector)
# Outer scope incapsulation saves the data and true output
return jit(mlp_loss)
# Concatenate independent variable vectors to be proper input shape
input_data = jnp.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
# Create loss function with data and true output
mlp_loss = make_mlp_loss(input_data, y)
# Create function that returns loss and gradient
loss_and_grad = value_and_grad(mlp_loss)
# Example architectures I've tried
architectures = [[16, 16, 1], [8, 16, 1], [16, 8, 1], [8, 16, 8, 1], [32, 32, 1]]
# Only using one seed but iterated over several
for seed in [645]:
for architecture in architectures:
# Create model
model = MLP(architecture)
# Initialize model with random parameters
key, params_key = random.split(key)
dummy = jnp.ones((1000, 2))
params = model.init(params_key, dummy)
# Create optimizer
opt = optax.adam(learning_rate=0.01) #sgd
opt_state = opt.init(params)
epochs = 50
for i in range(epochs):
# Get loss and gradient
curr_loss, curr_grad = loss_and_grad(params)
if i % 5 == 0:
print(curr_loss)
# Update
updates, opt_state = opt.update(curr_grad, opt_state)
params = optax.apply_updates(params, updates)
print(f"Architecture: {architecture}\nLoss: {curr_loss}\nSeed: {seed}\n\n")

ValueError: Expected input batch_size (24) to match target batch_size (8)

Got many links to solve this read different stackoverflow answer related to this but not able to figure it out .
My image size is torch.Size([8, 3, 16, 16]).
My architechture is as below
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
# linear layer (784 -> 1 hidden node)
self.fc1 = nn.Linear(16 * 16, 768)
self.fc2 = nn.Linear(768, 64)
self.fc3 = nn.Linear(64, 10)
self.dropout = nn.Dropout(p=.5)
def forward(self, x):
# flatten image input
x = x.view(-1, 16 * 16)
# add hidden layer, with relu activation function
x = self.dropout(F.relu(self.fc1(x)))
x = self.dropout(F.relu(self.fc2(x)))
x = F.log_softmax(self.fc3(x), dim=1)
return x
# specify loss function
criterion = nn.NLLLoss()
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=.003)
# number of epochs to train the model
n_epochs = 30 # suggest training between 20-50 epochs
model.train() # prep model for training
for epoch in range(n_epochs):
# monitor training loss
train_loss = 0.0
###################
# train the model #
###################
for data, target in trainloader:
# clear the gradients of all optimized variables
optimizer.zero_grad()
# forward pass: compute predicted outputs by passing inputs to the model
output = model(data)
# calculate the loss
loss = criterion(output, target)
# backward pass: compute gradient of the loss with respect to model parameters
loss.backward()
# perform a single optimization step (parameter update)
optimizer.step()
# update running training loss
train_loss += loss.item()*data.size(0)
# print training statistics
# calculate average loss over an epoch
train_loss = train_loss/len(trainloader.dataset)
print('Epoch: {} \tTraining Loss: {:.6f}'.format(
epoch+1,
train_loss
))
i am getting value error as
ValueError: Expected input batch_size (24) to match target batch_size (8).
how to fix it . My batch size is 8 and input image size is (16*16).And i have 10 class classification here .
Your input images have 3 channels, therefore your input feature size is 16*16*3, not 16*16. Currently, you consider each channel as separate instances, leading to a classifier output - after x.view(-1, 16*16) flattening - of (24, 16*16). Clearly, the batch size doesn't match because it is supposed to be 8, not 8*3 = 24.
You could either:
Switch to a CNN to handle multi-channel inputs (here 3 channels).
Use a self.fc1 with 16*16*3 input features.
If the input is RGB, maybe even convert to 1-channel grayscale map.

Does correction to weights include derivative of Sigmoid function also?

Let's evaluate usage of this line in the block of code given below.
L1_delta = L1_error * nonlin(L1,True) # line 36
import numpy as np #line 1
# sigmoid function
def nonlin(x,deriv=False):
if(deriv==True):
return x*(1-x)
return 1/(1+np.exp(-x))
# input dataset
X = np.array([ [0,0,1],
[0,1,1],
[1,0,1],
[1,1,1] ])
# output dataset
y = np.array([[0,0,1,1]]).T
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)
# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1
for iter in range(1000):
# forward propagation
L0 = X
L1 = nonlin(np.dot(L0,syn0))
# how much did we miss?
L1_error = y - L1
# multiply how much we missed by the
# slope of the sigmoid at the values in L1
L1_delta = L1_error * nonlin(L1,True) # line 36
# update weights
syn0 += np.dot(L0.T,L1_delta)
print ("Output After Training:")
print (L1)
I wanted to know, is the line required? Why do we need the factor of derivative of Sigmoid?
I have seen many similar logistic regression examples where derivative of Sigmoid is not used.
For example
https://github.com/chayankathuria/LogReg01/blob/master/GradientDescent.py
Yes, the line is indeed required. You need the derivative of the activation function (in this case sigmoid) because your final output is only implicitly dependent of the weights.
That's why you need to apply the chain rule where the derivative of the sigmoid will appear.
I recommend you to take a look at this post regardind backpropagation: https://datascience.stackexchange.com/questions/28719/a-good-reference-for-the-back-propagation-algorithm
It explains the mathematics behind backpropagation quite well.

Impact of using relu for gradient descent

What impact does the fact the relu activation function does not contain a derivative ?
How to implement the ReLU function in Numpy implements relu as maximum of (0 , matrix vector elements).
Does this mean for gradient descent we do not take derivative of relu function ?
Update :
From Neural network backpropagation with RELU
this text aids in understanding :
The ReLU function is defined as: For x > 0 the output is x, i.e. f(x)
= max(0,x)
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you
modify the activation function to be f(x) = max(e,x) for a small e.
Generally: A ReLU is a unit that uses the rectifier activation
function. That means it works exactly like any other hidden layer but
except tanh(x), sigmoid(x) or whatever activation you use, you'll
instead use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid
activation it's literally 1 line of change. Nothing about forward- or
back-propagation changes algorithmically. If you haven't got the
simpler model working yet, go back and start with that first.
Otherwise your question isn't really about ReLUs but about
implementing a NN as a whole.
But this still leaves some confusion as the neural network cost function typically takes derivative of activation function, so for relu how does this impact cost function ?
The standard answer is that the input to ReLU is rarely exactly zero, see here for example, so it doesn't make any significant difference.
Specifically, for ReLU to get a zero input, the dot product of one entire row of the input to a layer with one entire column of the layer's weight matrix would have to be exactly zero. Even if you have an all-zero input sample, there should still be a bias term in the last position, so I don't really see this ever happening.
However, if you want to test for yourself, try implementing the derivative at zero as 0, 0.5, and 1 and see if anything changes.
The PyTorch docs give a simple neural network with numpy example with one hidden layer and relu activation. I have reproduced it below with a fixed random seed and three options for setting the behavior of the ReLU gradient at 0. I have also added a bias term.
N, D_in, H, D_out = 4, 2, 30, 1
# Create random input and output data
x = x = np.random.randn(N, D_in)
x = np.c_(x, no.ones(x.shape[0]))
y = x = np.random.randn(N, D_in)
np.random.seed(1)
# Randomly initialize weights
w1 = np.random.randn(D_in+1, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0) # using ReLU as activate function
y_pred = h_relu.dot(w2)
# Compute and print loss
loss = np.square(y_pred - y).sum() # loss function
loss_col.append(loss)
print(t, loss, y_pred)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y) # the last layer's error
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T) # the second laye's error
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0 # grad at zero = 1
# grad[h <= 0] = 0 # grad at zero = 0
# grad_h[h < 0] = 0; grad_h[h == 0] = 0.5 # grad at zero = 0.5
grad_w1 = x.T.dot(grad_h)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2

Pytorch, what are the gradient arguments

I am reading through the documentation of PyTorch and found an example where they write
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
where x was an initial variable, from which y was constructed (a 3-vector). The question is, what are the 0.1, 1.0 and 0.0001 arguments of the gradients tensor ? The documentation is not very clear on that.
Explanation
For neural networks, we usually use loss to assess how well the network has learned to classify the input image (or other tasks). The loss term is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradient of loss w.r.t to the parameters, which is actually leaf node in the computation graph (by the way, these parameters are mostly the weight and bias of various layers such Convolution, Linear and so on).
According to chain rule, in order to calculate gradient of loss w.r.t to a leaf node, we can compute derivative of loss w.r.t some intermediate variable, and gradient of intermediate variable w.r.t to the leaf variable, do a dot product and sum all these up.
The gradient arguments of a Variable's backward() method is used to calculate a weighted sum of each element of a Variable w.r.t the leaf Variable. These weight is just the derivate of final loss w.r.t each element of the intermediate variable.
A concrete example
Let's take a concrete and simple example to understand this.
from torch.autograd import Variable
import torch
x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)
# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated
# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)
In the above example, the outcome of first print is
2 0 0 0
[torch.FloatTensor of size 1x4]
which is exactly the derivative of z_1 w.r.t to x.
The outcome of second print is :
0 2 0 0
[torch.FloatTensor of size 1x4]
which is the derivative of z_2 w.r.t to x.
Now if use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t to x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So no surprisingly, the output of 3rd print is:
2 2 2 2
[torch.FloatTensor of size 1x4]
It should be noted that weight vector [1, 1, 1, 1] is exactly derivative of loss w.r.t to z_1, z_2, z_3 and z_4. The derivative of loss w.r.t to x is calculated as:
d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx
So the output of 4th print is the same as the 3rd print:
2 2 2 2
[torch.FloatTensor of size 1x4]
Typically, your computational graph has one scalar output says loss. Then you can compute the gradient of loss w.r.t. the weights (w) by loss.backward(). Where the default argument of backward() is 1.0.
If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).
Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).
This means to calculate -0.1*d(loss1)/dw, d(loss2)/dw, 0.0001*d(loss3)/dw simultaneously.
Here, the output of forward(), i.e. y is a a 3-vector.
The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.
For eg. if x is the input, y = [y1, y2, y3] is an intermediate output which is used to compute the final output z,
Then,
dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx
So here, the three values to backward are
[dz/dy1, dz/dy2, dz/dy3]
and then backward() computes dz/dx
The original code I haven't found on PyTorch website anymore.
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
The problem with the code above is there is no function based on how to calculate the gradients. This means we don't know how many parameters (arguments the function takes) and the dimension of parameters.
To fully understand this I created an example close to the original:
Example 1:
a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)
print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])
I assumed our function is y=3*a + 2*b*b + torch.log(c) and the parameters are tensors with three elements inside.
You can think of the gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) like this is the accumulator.
As you may hear, PyTorch autograd system calculation is equivalent to Jacobian product.
In case you have a function, like we did:
y=3*a + 2*b*b + torch.log(c)
Jacobian would be [3, 4*b, 1/c]. However, this Jacobian is not how PyTorch is doing things to calculate the gradients at a certain point.
PyTorch uses forward pass and backward mode automatic differentiation (AD) in tandem.
There is no symbolic math involved and no numerical differentiation.
Numerical differentiation would be to calculate δy/δb, for b=1 and b=1+ε where ε is small.
If you don't use gradients in y.backward():
Example 2
a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward()
print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)
You will simply get the result at a point, based on how you set your a, b, c tensors initially.
Be careful how you initialize your a, b, c:
Example 3:
a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(a.grad) # tensor([3.3003])
print(b.grad) # tensor([0.])
print(c.grad) # tensor([inf])
If you use torch.empty() and don't use pin_memory=True you may have different results each time.
Also, note gradients are like accumulators so zero them when needed.
Example 4:
a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward(retain_graph=True)
y.backward()
print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
Lastly few tips on terms PyTorch uses:
PyTorch creates a dynamic computational graph when calculating the gradients in forward pass. This looks much like a tree.
So you will often hear the leaves of this tree are input tensors and the root is output tensor.
Gradients are calculated by tracing the graph from the root to the leaf and multiplying every gradient in the way using the chain rule. This multiplying occurs in the backward pass.
Back some time I created PyTorch Automatic Differentiation tutorial that you may check interesting explaining all the tiny details about AD.