When we use mini batch, should I call optimizer.zero_grad() before starting the iteration? Or inside the iteration? I think the second code is correct, but I'm not sure.
nb_epochs = 20
for epoch in range(nb_epochs + 1):
optimizer.zero_grad() # THIS PART!!
for batch_idx, samples in enumerate(dataloader):
x_train, y_train = samples
prediction = model(x_train)
cost = F.mse_loss(prediction, y_train)
cost.backward()
optimizer.step()
print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
epoch, nb_epochs, batch_idx+1, len(dataloader),
cost.item()
))
or
nb_epochs = 20
for epoch in range(nb_epochs + 1):
for batch_idx, samples in enumerate(dataloader):
x_train, y_train = samples
prediction = model(x_train)
optimizer.zero_grad() #THIS PART!!
cost = F.mse_loss(prediction, y_train)
cost.backward()
optimizer.step()
print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
epoch, nb_epochs, batch_idx+1, len(dataloader),
cost.item()
))
Which one is correct? The only difference is location of optimizer.zero_grad().
Gradients accumulates by default everytime you call .backward() on the computational graph.
On the first snippet, you are resetting the gradients once per epoch so all gradients will accumulate their values over time. With a total of len(dataloader) accumulated gradients, only resseting the gradients when the next epoch starts. On the second snippet, you are doing the right thing, which is to reset the gradient after every backward pass.
So your assumptions were right.
There are some instances where accumulating gradients is needed, but most times it's not.
Related
I am trying to store the weights of the model. The code is given below:
for step, batch in enumerate(train_dataloader):
outputs = model(**batch)
loss = outputs.loss
loss = loss / args.gradient_accumulation_steps
accelerator.backward(loss)
progress_bar.update(1)
progress_bar.set_postfix(loss=round(loss.item(), 3))
del outputs
gc.collect()
torch.cuda.empty_cache()
if (step+1) % args.gradient_accumulation_steps == 0 or (step+1) == len(train_dataloader):
optimizer.step()
scheduler.step()
optimizer.zero_grad()
reference_gradient = [ p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel()) for
n, p in model.named_parameters()]
reference_gradient = torch.cat(reference_gradient)
However, reference_gradient tensor has all zeros in it. How can I save the gradients of the entire model?
If you zero_grad the gradients - you delete the information. You cannot access the gradients after you set them to zero. You need to save the gradients before optimizer.zero_grad().
I have some embeddings, which are the output of a pre-trained model, saved to disk. I am trying to perform a binary classification task of accept/reject. I have trained a simple neural network to perform classification, however, I am not seeing any decrease in the loss after some time.
Here is my NN, the cached embeddings are of shape 512:
from transformers.modeling_outputs import SequenceClassifierOutput
class ClassNet(nn.Module):
def __init__(self, num_labels=2):
super(ClassNet, self).__init__()
self.num_labels = num_labels
self.classifier = nn.Sequential(
nn.Linear(512, 256, bias=True),
nn.ReLU(inplace=True),
nn.Dropout(p=.5, inplace=False),
nn.Linear(256, 128, bias=True),
nn.ReLU(inplace=True),
nn.Dropout(p=.5, inplace=False),
nn.Linear(128, num_labels, bias=True)
)
def forward(self, inputs):
return self.classifier(inputs)
This is some random architechture that I am trying to over-fit to, but it seems that the network plateau's quickly on the training data. Could it be that my data is too complicated?
here's my training loop:
optimizer = optim.Adam(model.parameters(), lr=1e-4,weight_decay=5e-3) # L2 regularization
loss_fct=nn.CrossEntropyLoss()
model.train()
for epoch in range(10): # loop over the dataset multiple times
running_loss = 0.0
for i, data in enumerate(train_loader):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data['embeddings'], data['labels']
# zero the parameter gradients
optimizer.zero_grad()
outputs = model(inputs)
logits = outputs.squeeze(1)
loss = loss_fct(logits, labels.squeeze())
loss.backward()
optimizer.step()
running_loss += loss.item()
if i % 2000 == 1999: # print every 2000 mini-batches
print('[%d, %5d] loss: %.3f' %
(epoch + 1, i + 1, running_loss / 2000))
running_loss = 0.0
print('Finished Training')
The loss is stuck at around .4 and doesn't really decrease at all after an epoch.
To give a little context, the pre-trained embeddings are the output of specially trained ViT model from HuggingFace, I am trying to perform a classification task directly on the outputs of that model by building a simple neural network on top of it.
Can anyone advise on what is going wrong? Also if anyone has any suggestions to get a better accuracy, I would love to hear it.
I implemented a simple linear regression and I’m getting some poor results. Just wondering if these results are normal or I’m making some mistake.
I tried different optimizers and learning rates, I always get bad/poor results
Here is my code:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable
class LinearRegressionPytorch(nn.Module):
def __init__(self, input_dim=1, output_dim=1):
super(LinearRegressionPytorch, self).__init__()
self.linear = nn.Linear(input_dim, output_dim)
def forward(self,x):
x = x.view(x.size(0),-1)
y = self.linear(x)
return y
input_dim=1
output_dim = 1
if torch.cuda.is_available():
model = LinearRegressionPytorch(input_dim, output_dim).cuda()
else:
model = LinearRegressionPytorch(input_dim, output_dim)
criterium = nn.MSELoss()
l_rate =0.00001
optimizer = torch.optim.SGD(model.parameters(), lr=l_rate)
#optimizer = torch.optim.Adam(model.parameters(),lr=l_rate)
epochs = 100
#create data
x = np.random.uniform(0,10,size = 100) #np.linspace(0,10,100);
y = 6*x+5
mu = 0
sigma = 5
noise = np.random.normal(mu, sigma, len(y))
y_noise = y+noise
#pass it to pytorch
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = Variable(x_data).cuda()
target = Variable(y_data).cuda()
else:
inputs = Variable(x_data)
target = Variable(y_data)
for epoch in range(epochs):
#predict data
pred_y= model(inputs)
#compute loss
loss = criterium(pred_y, target)
#zero grad and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
#if epoch % 50 == 0:
# print(f'epoch = {epoch}, loss = {loss.item()}')
#print params
for name, param in model.named_parameters():
if param.requires_grad:
print(name, param.data)
There are the poor results :
linear.weight tensor([[1.7374]], device='cuda:0')
linear.bias tensor([0.1815], device='cuda:0')
The results should be weight = 6 , bias = 5
Problem Solution
Actually your batch_size is problematic. If you have it set as one, your targetneeds the same shape as outputs (which you are, correctly, reshaping with view(-1, 1)).
Your loss should be defined like this:
loss = criterium(pred_y, target.view(-1, 1))
This network is correct
Results
Your results will not be bias=5 (yes, weight will go towards 6 indeed) as you are adding random noise to target (and as it's a single value for all your data points, only bias will be affected).
If you want bias equal to 5 remove addition of noise.
You should increase number of your epochs as well, as your data is quite small and network (linear regression in fact) is not really powerful. 10000 say should be fine and your loss should oscillate around 0 (if you change your noise to something sensible).
Noise
You are creating multiple gaussian distributions with different variations, hence your loss would be higher. Linear regression is unable to fit your data and find sensible bias (as the optimal slope is still approximately 6 for your noise, you may try to increase multiplication of 5 to 1000 and see what weight and bias will be learned).
Style (a little offtopic)
Please read documentation about PyTorch and keep your code up to date (e.g. Variable is deprecated in favor of Tensor and rightfully so).
This part of code:
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = Tensor(x_data).cuda()
target = Tensor(y_data).cuda()
else:
inputs = Tensor(x_data)
target = Tensor(y_data)
Could be written succinctly like this (without much thought):
inputs = torch.from_numpy(x).float()
target = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = inputs.cuda()
target = target.cuda()
I know deep learning has it's reputation for bad code and fatal practice, but please do not help spreading this approach.
I have a network model that is trained using batch training. Once it is trained, I want to predict the output for a single example.
Here is my model code:
model = Sequential()
model.add(Dense(32, batch_input_shape=(5, 1, 1)))
model.add(LSTM(16, stateful=True))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
I have a sequence of single inputs to single outputs. I'm doing some test code to map characters to next characters (A->B, B->C, etc).
I create an input data of shape (15,1,1) and an output data of shape (15, 1) and call the function:
model.fit(x, y, nb_epoch=epochs, batch_size=5, shuffle=False, verbose=0)
The model trains, and now I want to take a single character and predict the next character (input A, it predicts B). I create an input of shape (1, 1, 1) and call:
pred = model.predict(x, batch_size=1, verbose=0)
This gives:
ValueError: Shape mismatch: x has 5 rows but z has 1 rows
I saw one solution was to add "dummy data" to your predict values, so the input shape for the prediction would be (5,1,1) with data [x 0 0 0 0] and you would just take the first element of the output as your value. However, this seems inefficient when dealing with larger batches.
I also tried to remove the batch size from the model creation, but I got the following message:
ValueError: If a RNN is stateful, a complete input_shape must be provided (including batch size).
Is there another way? Thanks for the help.
Currently (Keras v2.0.8) it takes a bit more effort to get predictions on single rows after training in batch.
Basically, the batch_size is fixed at training time, and has to be the same at prediction time.
The workaround right now is to take the weights from the trained model, and use those as the weights in a new model you've just created, which has a batch_size of 1.
The quick code for that is
model = create_model(batch_size=64)
mode.fit(X, y)
weights = model.get_weights()
single_item_model = create_model(batch_size=1)
single_item_model.set_weights(weights)
single_item_model.compile(compile_params)
Here's a blog post that goes into more depth:
https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I've used this approach in the past to have multiple models at prediction time- one that makes predictions on big batches, one that makes predictions on small batches, and one that makes predictions on single items. Since batch predictions are much more efficient, this gives us the flexibility to take in any number of prediction rows (not just a number that is evenly divisible by batch_size), while still getting predictions pretty rapidly.
#ClimbsRocks showed a nice workaround. I cannot provide a "correct" answer in sense of "this is how Keras intends it to be done", but I can share another workaround which might help somebody depending on the use-case.
In this workaround I use predict_on_batch(). This method allows to pass a single sample out of a batch without throwing an error. Unfortunately, it returns a vector in the shape the target has according to the training-settings. However, each sample in the target yields then the prediction for your single sample.
You can access it like this:
to_predict = #Some single sample that would be part of a batch (has to have the right shape)#
model.predict_on_batch(to_predict)[0].flatten() #Flatten is optional
The result of the prediction is exactly the same as if you would pass an entire batch to predict().
Here some cod-example.
The code is from my question which also deals with this issue (but in a sligthly different manner).
sequence_size = 5
number_of_features = 1
input = (sequence_size, number_of_features)
batch_size = 2
model = Sequential()
#Of course you can replace the Gated Recurrent Unit with a LSTM-layer
model.add(GRU(100, return_sequences=True, activation='relu', input_shape=input, batch_size=2, name="GRU"))
model.add(GRU(1, return_sequences=True, activation='relu', input_shape=input, batch_size=batch_size, name="GRU2"))
model.compile(optimizer='adam', loss='mse')
model.summary()
#Summary-output:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
GRU (GRU) (2, 5, 100) 30600
_________________________________________________________________
GRU2 (GRU) (2, 5, 1) 306
=================================================================
Total params: 30,906
Trainable params: 30,906
Non-trainable params: 0
def generator(data, batch_size, sequence_size, num_features):
"""Simple generator"""
while True:
for i in range(len(data) - (sequence_size * batch_size + sequence_size) + 1):
start = i
end = i + (sequence_size * batch_size)
yield data[start : end].reshape(batch_size, sequence_size, num_features), \
data[end - ((sequence_size * batch_size) - sequence_size) : end + sequence_size].reshape(batch_size, sequence_size, num_features)
#Task: Predict the continuation of a linear range
data = np.arange(100)
hist = model.fit_generator(
generator=generator(data, batch_size, sequence_size, num_features),
steps_per_epoch=total_batches,
epochs=200,
shuffle=False
)
to_predict = np.asarray([[np.asarray([x]) for x in range(95,100,1)]]) #Only single element of a batch
correct = np.asarray([100,101,102,103,104])
print( model.predict_on_batch(to_predict)[0].flatten() )
#Output:
[ 99.92908 100.95854 102.32129 103.28584 104.20213 ]
I have the following network for training,
graph = tf.Graph()
with graph.as_default():
tf_train_dataset = tf.constant(X_train)
tf_train_labels = tf.constant(y_train)
tf_valid_dataset = tf.constant(X_test)
weights = tf.Variable(tf.truncated_normal([X_train.shape[1], 1]))
biases = tf.Variable(tf.zeros([num_labels]))
logits = tf.nn.softmax(tf.matmul(tf_train_dataset, weights) + biases)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
train_prediction = tf.nn.softmax(logits)
valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
and I ran it as follows,
num_steps = 10
with tf.Session(graph=graph) as session:
tf.initialize_all_variables().run()
print('Initialized')
for step in range(num_steps):
_, l, predictions = session.run([optimizer, loss, train_prediction])
print("Loss: ",l)
print('Training accuracy: %.1f' % sklearn.metrics.accuracy_score(predictions.flatten(), y_train.flatten()))
But it outputes as follows
Initialized
Loss: 0.0
Training accuracy: 0.5
Loss: 0.0
Training accuracy: 0.5
The shape of X_train is (213403, 25) and y_train is (213403, 1) to cope up with the logits. I didn't encode the the labels as one hot because there are only two classes , either 1 or 0. I also tried with quadratic loss function and it was still the same, and same thing happened, the loss function doesn't decrease at all. I am sensing a syntactical mistake here but I am clueless.
Your are passing a labels as a single column(without encoding).
Model is unable to get labels as factor type.
So it considers your labels as continuous value.
Loss: 0.0 means loss is zero. That means your model is perfectly fit.
This is happening because your labels are continuous(regression function) and you are using softmax_cross_entropy_with_logits loss function.
Try passing one hot encoding of labels and check.