Training a NN on top of cached embeddings from a pre-trained model, loss not going down? - neural-network

I have some embeddings, which are the output of a pre-trained model, saved to disk. I am trying to perform a binary classification task of accept/reject. I have trained a simple neural network to perform classification, however, I am not seeing any decrease in the loss after some time.
Here is my network; each cached embedding is a 512-dimensional vector:
import torch.nn as nn
from transformers.modeling_outputs import SequenceClassifierOutput

class ClassNet(nn.Module):
    def __init__(self, num_labels=2):
        super(ClassNet, self).__init__()
        self.num_labels = num_labels
        # 512-d cached embedding -> 256 -> 128 -> num_labels logits
        self.classifier = nn.Sequential(
            nn.Linear(512, 256, bias=True),
            nn.ReLU(inplace=True),
            nn.Dropout(p=.5, inplace=False),
            nn.Linear(256, 128, bias=True),
            nn.ReLU(inplace=True),
            nn.Dropout(p=.5, inplace=False),
            nn.Linear(128, num_labels, bias=True)
        )

    def forward(self, inputs):
        return self.classifier(inputs)
This is an arbitrary architecture that I am trying to overfit with, but the network plateaus quickly on the training data. Could it be that my data is too complicated?
Here's my training loop:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-3)  # L2 regularization
loss_fct = nn.CrossEntropyLoss()
model.train()
for epoch in range(10):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        # get the inputs; data is a dict with 'embeddings' and 'labels'
        inputs, labels = data['embeddings'], data['labels']
        # zero the parameter gradients
        optimizer.zero_grad()
        outputs = model(inputs)
        logits = outputs.squeeze(1)
        loss = loss_fct(logits, labels.squeeze())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training')
The loss is stuck at around 0.4 and doesn't really decrease at all after the first epoch.
To give a little context, the pre-trained embeddings are the output of a specially trained ViT model from HuggingFace. I am trying to perform a classification task directly on the outputs of that model by building a simple neural network on top of it.
Can anyone advise on what is going wrong? Also if anyone has any suggestions to get a better accuracy, I would love to hear it.

Related

NaN Loss and Quantity of Data

Hi everyone.
While training a neural network on a dataset of 40000 objects, I ran into a problem: the loss was NaN at every epoch. After sampling the dataset and using only 50% of it, the problem no longer occurred. I was wondering how the size of the training data could have an impact in this setting. I used the following function to do the training:
# Imports are an assumption; the original post did not show them
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def train_test_net_notLinearCDF(X, y, coefficient, test_input):
    # Neural network
    model = Sequential()
    model.add(Dense(80, activation="relu", input_dim=X.shape[1]))
    model.add(Dense(20, activation="tanh"))
    model.add(Dense(1, activation="linear"))
    opt_adam = Adam(clipvalue=0.5)  # clip each gradient component to [-0.5, 0.5]
    model.compile(loss='mean_squared_error', optimizer=opt_adam)
    history = model.fit(X, y, epochs=100, validation_split=0.2, batch_size=32)
    # Plot training and validation loss per epoch
    fig1 = plt.gcf()
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.suptitle('Training and validation MSE ' + coefficient)
    plt.ylabel('MSE')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
    fig1.savefig('losses_varying_alpha_' + coefficient + '.png', dpi=300)
    y_pred = model.predict(test_input)
    return y_pred
Thanks in advance.

ValueError: Expected input batch_size (24) to match target batch_size (8)

I found many links about this error and read several Stack Overflow answers, but I am not able to figure it out.
My image batch has shape torch.Size([8, 3, 16, 16]).
My architecture is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # linear layers (16*16 inputs -> 768 -> 64 -> 10 classes)
        self.fc1 = nn.Linear(16 * 16, 768)
        self.fc2 = nn.Linear(768, 64)
        self.fc3 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(p=.5)

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 16 * 16)
        # add hidden layers, with relu activation function
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = F.log_softmax(self.fc3(x), dim=1)
        return x

model = Net()
# specify loss function
criterion = nn.NLLLoss()
# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=.003)
# number of epochs to train the model
n_epochs = 30  # suggest training between 20-50 epochs
model.train()  # prep model for training
for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    ###################
    # train the model #
    ###################
    for data, target in trainloader:
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item() * data.size(0)
    # print training statistics
    # calculate average loss over an epoch
    train_loss = train_loss / len(trainloader.dataset)
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(
        epoch + 1,
        train_loss
    ))
I am getting this error:
ValueError: Expected input batch_size (24) to match target batch_size (8).
How can I fix it? My batch size is 8, the input image size is 16*16, and this is a 10-class classification problem.
Your input images have 3 channels, so your input feature size is 16*16*3, not 16*16. Currently, each channel is treated as a separate instance: after the x.view(-1, 16*16) flattening, the input to the classifier has shape (24, 16*16), so the output has 24 rows. The batch size doesn't match because it is supposed to be 8, not 8*3 = 24.
You could do one of the following:
Switch to a CNN to handle multi-channel inputs (here 3 channels).
Use a self.fc1 with 16*16*3 input features, and flatten with x.view(-1, 16*16*3).
If the input is RGB, you could even convert it to a single-channel grayscale image.
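The arithmetic behind the mismatch can be checked without PyTorch. Below is a minimal sketch in plain Python; `rows_after_view` is a hypothetical helper (not a PyTorch function) that only mimics how `view(-1, row_width)` infers the first dimension from the total element count.

```python
from math import prod

def rows_after_view(shape, row_width):
    """Rows that x.view(-1, row_width) would infer for a tensor of `shape`."""
    total = prod(shape)
    if total % row_width != 0:
        raise ValueError("element count is not divisible by row_width")
    return total // row_width

batch_shape = (8, 3, 16, 16)  # shape reported in the question

# Flattening to 16*16 columns spreads the 3 channels over the batch dimension
rows_buggy = rows_after_view(batch_shape, 16 * 16)
# Flattening to 3*16*16 columns keeps one row per image
rows_fixed = rows_after_view(batch_shape, 3 * 16 * 16)

print(rows_buggy, rows_fixed)
```

In the model itself this would correspond to the second option: nn.Linear(16 * 16 * 3, 768) for fc1 together with x.view(-1, 16 * 16 * 3) in forward.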

PyTorch mini batch, when to call optimizer.zero_grad()

When using mini-batches, should I call optimizer.zero_grad() before starting the batch loop, or inside it? I think the second snippet is correct, but I'm not sure.
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    optimizer.zero_grad()  # THIS PART!!
    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples
        prediction = model(x_train)
        cost = F.mse_loss(prediction, y_train)
        cost.backward()
        optimizer.step()
        print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader),
            cost.item()
        ))
or
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples
        prediction = model(x_train)
        optimizer.zero_grad()  # THIS PART!!
        cost = F.mse_loss(prediction, y_train)
        cost.backward()
        optimizer.step()
        print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader),
            cost.item()
        ))
Which one is correct? The only difference is location of optimizer.zero_grad().
Gradients accumulate by default every time you call .backward() on the computational graph.
In the first snippet, you reset the gradients only once per epoch, so the gradients of all len(dataloader) batches accumulate until the next epoch starts. In the second snippet, you are doing the right thing: the gradients are reset on every iteration, before each backward pass.
So your assumption was right.
There are some cases where accumulating gradients is needed, but most of the time it is not.
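To make the accumulation visible, here is a toy sketch in plain Python rather than PyTorch. `ToyParam`, `backward` and `zero_grad` are hypothetical stand-ins that only imitate the "add into .grad" behaviour described above, for the loss (w*x - y)^2. The optimizer step is left out on purpose so the two gradient histories can be compared directly.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter with a .grad buffer."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(param, x, y):
    """Add d/dw (w*x - y)^2 into param.grad (accumulates, never overwrites)."""
    param.grad += 2 * (param.value * x - y) * x

def zero_grad(param):
    """Reset the accumulated gradient."""
    param.grad = 0.0

batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up (x, y) pairs

# First snippet's pattern: reset once, before the batch loop
w = ToyParam(0.0)
zero_grad(w)
grads_reset_once = []
for x, y in batches:
    backward(w, x, y)
    grads_reset_once.append(w.grad)

# Second snippet's pattern: reset on every iteration
w = ToyParam(0.0)
grads_reset_each = []
for x, y in batches:
    zero_grad(w)
    backward(w, x, y)
    grads_reset_each.append(w.grad)

print(grads_reset_once)
print(grads_reset_each)
```

With the reset inside the loop, each entry is the gradient of the current batch only. With the reset outside, each entry is the running sum of all batches seen so far in the epoch, which is what the optimizer would then step on.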

Using SGD on MNIST dataset with Pytorch, loss not decreasing

I tried to use SGD on MNIST dataset with batch size of 32, but the loss does not decrease at all.
I checked my model, loss function and read documentation but couldn't figure out what I've done wrong.
I defined my neural network as below
class classification(nn.Module):
    def __init__(self):
        super(classification, self).__init__()
        # construct layers for a neural network
        self.classifier1 = nn.Sequential(
            nn.Linear(in_features=28*28, out_features=20*20),
            nn.Sigmoid(),
        )
        self.classifier2 = nn.Sequential(
            nn.Linear(in_features=20*20, out_features=10*10),
            nn.Sigmoid(),
        )
        self.classifier3 = nn.Sequential(
            nn.Linear(in_features=10*10, out_features=10),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, inputs):               # [batchSize, 1, 28, 28]
        x = inputs.view(inputs.size(0), -1)  # [batchSize, 28*28]
        x = self.classifier1(x)              # [batchSize, 20*20]
        x = self.classifier2(x)              # [batchSize, 10*10]
        out = self.classifier3(x)            # [batchSize, 10]
        return out
And I defined my training process as below
classifier = classification().to("cuda")
# optimizer
optimizer = torch.optim.SGD(classifier.parameters(), lr=learning_rate_value)
# loss function
criterion = nn.NLLLoss()
batch_size = 32
epoch = 30
# array to save loss history
loss_train_arr = np.zeros(epoch)
# used DataLoader to split the data into batches
batched_train = torch.utils.data.DataLoader(training_set, batch_size, shuffle=True)
for i in range(epoch):
    loss_train = 0
    # train and compute loss, accuracy
    for img, label in batched_train:
        img = img.to(device)
        label = label.to(device)
        optimizer.zero_grad()
        predicted = classifier(img)
        label_predicted = torch.argmax(predicted, dim=1)
        loss = criterion(predicted, label)
        loss.backward
        optimizer.step()
        loss_train += loss.item()
    loss_train_arr[i] = loss_train / (len(batched_train.dataset) / batch_size)
I am using a model with LogSoftmax layer, so my loss function seems right. But the loss does not decrease at all.
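If the code you posted is the exact code you use, the problem is that you never actually call backward on the loss: loss.backward is missing its parentheses and should be loss.backward(). Without the call, no gradients are computed, so optimizer.step() has nothing to apply.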
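The effect of the missing parentheses is ordinary Python behaviour and can be shown without PyTorch. In this small sketch, `FakeLoss` is a hypothetical stand-in for the loss tensor that only records whether its backward method actually ran.

```python
class FakeLoss:
    """Hypothetical stand-in for a loss tensor; records whether backward ran."""
    def __init__(self):
        self.backward_ran = False

    def backward(self):
        self.backward_ran = True

loss = FakeLoss()

loss.backward    # attribute access only: yields a bound method, nothing is executed
ran_without_call = loss.backward_ran

loss.backward()  # the parentheses perform the call
ran_with_call = loss.backward_ran

print(ran_without_call, ran_with_call)
```

Python raises no error for the bare loss.backward line, which is why this kind of bug is easy to miss: the loop runs normally, but the loss never moves.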

Simple tensorflow neural network not increasing accuracy or decreasing loss?

I have the following network for training,
graph = tf.Graph()
with graph.as_default():
    tf_train_dataset = tf.constant(X_train)
    tf_train_labels = tf.constant(y_train)
    tf_valid_dataset = tf.constant(X_test)
    weights = tf.Variable(tf.truncated_normal([X_train.shape[1], 1]))
    biases = tf.Variable(tf.zeros([num_labels]))
    logits = tf.nn.softmax(tf.matmul(tf_train_dataset, weights) + biases)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
and I ran it as follows,
num_steps = 10
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        print("Loss: ", l)
        print('Training accuracy: %.1f' % sklearn.metrics.accuracy_score(predictions.flatten(), y_train.flatten()))
But the output is as follows:
Initialized
Loss: 0.0
Training accuracy: 0.5
Loss: 0.0
Training accuracy: 0.5
The shape of X_train is (213403, 25) and y_train is (213403, 1), to match the logits. I didn't one-hot encode the labels because there are only two classes, 1 or 0. I also tried a quadratic loss function and the result was the same: the loss doesn't decrease at all. I suspect a syntactical mistake here, but I am clueless.
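You are passing the labels as a single column (without one-hot encoding).
The model cannot treat them as categorical (factor-type) labels, so it handles them as continuous values.
Loss: 0.0 does not mean the model fits perfectly (the accuracy stays at 0.5). With a single output column, softmax always returns 1.0, so the cross-entropy evaluates to 0 whatever the weights are, and there is no gradient to learn from.
This is happening because your labels are a single continuous-valued column while you are using the softmax_cross_entropy_with_logits loss function, which expects one column per class.
Try passing one-hot encoded labels, with one output column per class, and check.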
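One way to see why one-hot encoding matters here: with a single output column, softmax has only one value to normalise, so it always returns 1.0 and the cross-entropy is 0 for any weights. A minimal sketch in plain Python (not TensorFlow; `softmax` and `cross_entropy` are hand-written helpers for illustration, and the logit values are made up):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, labels):
    """Cross-entropy between softmax(logits) and a label vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(labels, probs))

# One output column, as in the question: loss is 0 for every logit and label
single_column_losses = [
    cross_entropy([z], [t]) for z in (-5.0, 0.0, 3.0) for t in (0.0, 1.0)
]

# Two output columns with one-hot labels: the loss now depends on the logits
loss_when_right = cross_entropy([2.0, -1.0], [1.0, 0.0])
loss_when_wrong = cross_entropy([2.0, -1.0], [0.0, 1.0])

print(single_column_losses)
print(loss_when_right, loss_when_wrong)
```

With two columns the loss is small when the larger logit matches the label and large when it does not, which gives gradient descent something to minimise.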