Simple tensorflow neural network not increasing accuracy or decreasing loss? - neural-network

I have the following network for training,
graph = tf.Graph()
with graph.as_default():
    tf_train_dataset = tf.constant(X_train)
    tf_train_labels = tf.constant(y_train)
    tf_valid_dataset = tf.constant(X_test)
    weights = tf.Variable(tf.truncated_normal([X_train.shape[1], 1]))
    biases = tf.Variable(tf.zeros([num_labels]))
    logits = tf.nn.softmax(tf.matmul(tf_train_dataset, weights) + biases)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
and I ran it as follows,
num_steps = 10
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        print("Loss: ",l)
        print('Training accuracy: %.1f' % sklearn.metrics.accuracy_score(predictions.flatten(), y_train.flatten()))
But it outputs the following:
Initialized
Loss: 0.0
Training accuracy: 0.5
Loss: 0.0
Training accuracy: 0.5
The shape of X_train is (213403, 25) and y_train is (213403, 1), to match the shape of the logits. I didn't one-hot encode the labels because there are only two classes, either 1 or 0. I also tried a quadratic loss function, and the same thing happened: the loss doesn't decrease at all. I sense a syntactical mistake here, but I am clueless.

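You are passing the labels as a single column (without encoding), and the model has a single output unit.
softmax_cross_entropy_with_logits expects one column per class, with each row of labels forming a distribution over the classes, so it cannot treat your single 0/1 column as a class label.
With only one column, the softmax of every row is exactly 1.0, so the cross-entropy is -y * log(1) = 0 whatever the weights are.
Loss: 0.0 therefore does not mean that your model fits perfectly. It means the loss is constant, so its gradient is zero and nothing is learned.
This is happening because your labels are a single column and you are using the softmax_cross_entropy_with_logits loss function.
Try passing a one-hot encoding of the labels (two columns, with a matching two-column weight matrix) and check.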
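To make the point above concrete, here is a minimal plain-Python sketch of the arithmetic. It does not use TensorFlow; `softmax` and `cross_entropy` are hypothetical helpers written only for this illustration, and the logit values are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_row, labels_row):
    """Cross-entropy between a label distribution and softmax(logits)."""
    probs = softmax(logits_row)
    return sum(-y * math.log(p) for y, p in zip(labels_row, probs))

# One output column, as in the question: softmax of a single logit is
# always 1.0, so the loss is 0 for any logit and any label.
single_column_losses = [
    cross_entropy([z], [y]) for z in (-3.0, 0.0, 7.5) for y in (0.0, 1.0)
]

# Two output columns with one-hot labels: the loss now depends on the logits.
loss_right = cross_entropy([2.0, -1.0], [1.0, 0.0])  # logits favour the true class
loss_wrong = cross_entropy([2.0, -1.0], [0.0, 1.0])  # logits favour the other class

print(single_column_losses)
print(loss_right, loss_wrong)
```

With one column every loss value is zero, so gradient descent has nothing to follow. With two columns the loss is small when the logits favour the true class and large otherwise.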

Related

Training a NN on top of cached embeddings from a pre-trained model, loss not going down?

I have some embeddings, which are the output of a pre-trained model, saved to disk. I am trying to perform a binary classification task of accept/reject. I have trained a simple neural network to perform classification, however, I am not seeing any decrease in the loss after some time.
Here is my NN, the cached embeddings are of shape 512:
from transformers.modeling_outputs import SequenceClassifierOutput
class ClassNet(nn.Module):
    def __init__(self, num_labels=2):
        super(ClassNet, self).__init__()
        self.num_labels = num_labels
        self.classifier = nn.Sequential(
            nn.Linear(512, 256, bias=True),
            nn.ReLU(inplace=True),
            nn.Dropout(p=.5, inplace=False),
            nn.Linear(256, 128, bias=True),
            nn.ReLU(inplace=True),
            nn.Dropout(p=.5, inplace=False),
            nn.Linear(128, num_labels, bias=True)
        )
    def forward(self, inputs):
        return self.classifier(inputs)
This is some random architecture that I am trying to overfit with, but it seems that the network plateaus quickly on the training data. Could it be that my data is too complicated?
Here's my training loop:
optimizer = optim.Adam(model.parameters(), lr=1e-4,weight_decay=5e-3) # L2 regularization
loss_fct=nn.CrossEntropyLoss()
model.train()
for epoch in range(10): # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data['embeddings'], data['labels']
        # zero the parameter gradients
        optimizer.zero_grad()
        outputs = model(inputs)
        logits = outputs.squeeze(1)
        loss = loss_fct(logits, labels.squeeze())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999: # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training')
The loss is stuck at around .4 and doesn't really decrease at all after an epoch.
To give a little context, the pre-trained embeddings are the output of a specially trained ViT model from HuggingFace. I am trying to perform a classification task directly on the outputs of that model by building a simple neural network on top of it.
Can anyone advise on what is going wrong? Also if anyone has any suggestions to get a better accuracy, I would love to hear it.

NaN Loss and Quantity of Data

Hi everyone.
While I was training a neural network with a training dataset of 40000 objects, I was having problems with the loss function being equal to NaN at each epoch. After sampling the dataset and using only 50% of it, the problem no longer occurred. I was wondering how the size of the training data could have an impact in this setting. I used the following function to do the training:
def train_test_net_notLinearCDF(X,y,coefficient,test_input):
    # Neural network
    model = Sequential()
    model.add(Dense(80, activation="relu", input_dim=X.shape[1]))
    model.add(Dense(20, activation="tanh"))
    model.add(Dense(1, activation="linear"))
    opt_adam = Adam(clipvalue=0.5)
    model.compile(loss='mean_squared_error', optimizer=opt_adam)
    history = model.fit(X, y, epochs=100, validation_split = 0.2,batch_size=32)
    fig1 = plt.gcf()
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.suptitle('Training and validation MSE ' + coefficient)
    plt.ylabel('MSE')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
    fig1.savefig('losses_varying_alpha_'+coefficient+'.png', dpi=300)
    y_pred = model.predict(test_input)
    return y_pred
Thanks in advance.

PyTorch mini batch, when to call optimizer.zero_grad()

When we use mini batch, should I call optimizer.zero_grad() before starting the iteration? Or inside the iteration? I think the second code is correct, but I'm not sure.
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    optimizer.zero_grad() # THIS PART!!
    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples
        prediction = model(x_train)
        cost = F.mse_loss(prediction, y_train)
        cost.backward()
        optimizer.step()
        print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader),
            cost.item()
        ))
or
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples
        prediction = model(x_train)
        optimizer.zero_grad() #THIS PART!!
        cost = F.mse_loss(prediction, y_train)
        cost.backward()
        optimizer.step()
        print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader),
            cost.item()
        ))
Which one is correct? The only difference is location of optimizer.zero_grad().
Gradients accumulate by default every time you call .backward() on the computational graph.
In the first snippet, you reset the gradients only once per epoch, so they keep accumulating across all len(dataloader) mini-batches of that epoch and are only reset when the next epoch starts. In the second snippet, you are doing the right thing, which is to reset the gradients for every mini-batch, before the backward pass.
So your assumptions were right.
There are some instances where accumulating gradients is needed, but most times it's not.
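The difference can be sketched without PyTorch. The following plain-Python toy mimics a single parameter `w` with a gradient buffer that is added to on every backward pass; `batch_grad` and `grads_used` are hypothetical helpers, and the data and learning rate are made up for illustration.

```python
# Toy model: prediction = w * x, loss per batch = mean((w * x - y) ** 2),
# so d(loss)/dw = mean(2 * (w * x - y) * x).

def batch_grad(w, xs, ys):
    """Gradient of the mean squared error of one mini-batch w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Two mini-batches drawn from y = 2 * x.
batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]

def grads_used(reset_every_batch, w=0.0, lr=0.01):
    """Return the gradient each update step sees during one epoch."""
    grad = 0.0  # zeroed once before the loop, as in the first snippet
    used = []
    for xs, ys in batches:
        if reset_every_batch:
            grad = 0.0  # zeroed for every mini-batch, as in the second snippet
        grad += batch_grad(w, xs, ys)  # like backward(): adds into the buffer
        used.append(grad)
        w -= lr * grad  # like optimizer.step()
    return used

fresh = grads_used(reset_every_batch=True)
stale = grads_used(reset_every_batch=False)
print(fresh)
print(stale)
```

The first step is identical in both runs. From the second mini-batch on, the run without a per-batch reset steps along the sum of the current and all earlier gradients of that epoch.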

PyTorch linear regression giving wrong results

I implemented a simple linear regression and I’m getting some poor results. Just wondering if these results are normal or I’m making some mistake.
I tried different optimizers and learning rates, I always get bad/poor results
Here is my code:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable
class LinearRegressionPytorch(nn.Module):
    def __init__(self, input_dim=1, output_dim=1):
        super(LinearRegressionPytorch, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
    def forward(self,x):
        x = x.view(x.size(0),-1)
        y = self.linear(x)
        return y
input_dim=1
output_dim = 1
if torch.cuda.is_available():
    model = LinearRegressionPytorch(input_dim, output_dim).cuda()
else:
    model = LinearRegressionPytorch(input_dim, output_dim)
criterium = nn.MSELoss()
l_rate =0.00001
optimizer = torch.optim.SGD(model.parameters(), lr=l_rate)
#optimizer = torch.optim.Adam(model.parameters(),lr=l_rate)
epochs = 100
#create data
x = np.random.uniform(0,10,size = 100) #np.linspace(0,10,100);
y = 6*x+5
mu = 0
sigma = 5
noise = np.random.normal(mu, sigma, len(y))
y_noise = y+noise
#pass it to pytorch
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Variable(x_data).cuda()
    target = Variable(y_data).cuda()
else:
    inputs = Variable(x_data)
    target = Variable(y_data)
for epoch in range(epochs):
    #predict data
    pred_y= model(inputs)
    #compute loss
    loss = criterium(pred_y, target)
    #zero grad and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    #if epoch % 50 == 0:
    #    print(f'epoch = {epoch}, loss = {loss.item()}')
#print params
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
These are the poor results:
linear.weight tensor([[1.7374]], device='cuda:0')
linear.bias tensor([0.1815], device='cuda:0')
The results should be weight = 6 , bias = 5
Problem Solution
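Actually, the shapes of your tensors are the problem. With a feature dimension of one, your target needs the same shape as the outputs (which have shape (N, 1), because the model, correctly, reshapes its input with view). As written, the output has shape (100, 1) and the target has shape (100,), so the loss broadcasts one against the other instead of comparing them element by element.
Your loss should be defined like this:
loss = criterium(pred_y, target.view(-1, 1))
With this change the network is correct.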
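The shape issue can be illustrated with NumPy, whose broadcasting rules PyTorch follows for this case. This is only a sketch: the fixed seed and the stand-in prediction `1.7 * x + 0.2` are assumptions made for the example, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.uniform(0, 10, size=100)

target = 6 * x + 5                       # shape (100,), like y_data in the question
pred = (1.7 * x + 0.2).reshape(-1, 1)    # shape (100, 1), like the model output

# Mismatched shapes: every prediction is compared against every target.
diff_broadcast = pred - target                 # (100, 1) - (100,) -> (100, 100)
# Matched shapes: each prediction is compared against its own target.
diff_matched = pred - target.reshape(-1, 1)    # (100, 1) - (100, 1) -> (100, 1)

mse_broadcast = np.mean(diff_broadcast ** 2)
mse_matched = np.mean(diff_matched ** 2)

print(diff_broadcast.shape, diff_matched.shape)
print(mse_broadcast, mse_matched)
```

The two means differ because the broadcast version averages over all 100 x 100 pairs, which is not the per-sample error the regression is supposed to minimize.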
Results
Your results will not be bias=5 (though the weight will indeed go towards 6), as you are adding random noise to the target (and as it's a single value for all your data points, only the bias will be affected).
If you want the bias to equal 5, remove the added noise.
You should increase the number of epochs as well, as your dataset is quite small and the network (a linear regression, in fact) is not really powerful. Something like 10000 should be fine, and your loss should oscillate around 0 (if you change your noise to something sensible).
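As a sanity check that does not depend on the optimizer, the learning rate or the number of epochs, a closed-form least-squares fit shows what a well-converged linear regression can reach on data generated the same way as in the question. This is a sketch under assumptions: the fixed seed is arbitrary, and `np.polyfit` is swapped in for the gradient-descent training.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Same recipe as the question: y = 6x + 5 plus Gaussian noise with sigma = 5.
x = rng.uniform(0, 10, size=100)
y_noise = 6 * x + 5 + rng.normal(0, 5, size=100)

# Degree-1 polynomial fit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y_noise, deg=1)
print(slope, intercept)
```

With noise this large and only 100 points, the estimates should land near 6 and 5 but not exactly on them, and the intercept is typically the less precise of the two. If the trained model ends up far from this reference, the training setup is the more likely culprit than the data.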
Noise
You are creating multiple gaussian distributions with different variations, hence your loss would be higher. Linear regression is unable to fit your data and find sensible bias (as the optimal slope is still approximately 6 for your noise, you may try to increase multiplication of 5 to 1000 and see what weight and bias will be learned).
Style (a little offtopic)
Please read documentation about PyTorch and keep your code up to date (e.g. Variable is deprecated in favor of Tensor and rightfully so).
This part of code:
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Tensor(x_data).cuda()
    target = Tensor(y_data).cuda()
else:
    inputs = Tensor(x_data)
    target = Tensor(y_data)
Could be written succinctly like this (without much thought):
inputs = torch.from_numpy(x).float()
target = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = inputs.cuda()
    target = target.cuda()
I know deep learning has its reputation for bad code and bad practice, but please do not help spread this approach.

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on 2 datasets created from the audio files:
The first set is created via wavread by extracting vectors of each file: the set is an 834 by 48116 matrix. Each training example is a 48116-vector of the wav's frequencies.
The second set is created by extracting the frequencies of 3 formants of the vowels, where each formant (feature) has its own frequency range (for example, the F1 range is 500-1500Hz, F2 is 1500-2000Hz and so on). Each training example is a 3-vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;
grad = ((h-y)'*X)/m;
theta_partial = theta;
theta_partial(1) = 0;
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
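Written out, the code above appears to compute the regularized logistic regression cost and gradient for one one-vs-all classifier. The notation here is an assumption made for readability: $\theta_0$ stands for MATLAB's `theta(1)`, which is excluded from the penalty, $\tilde{\theta}$ is $\theta$ with that first entry set to zero (`theta_partial`), and $\sigma$ is the sigmoid.

```latex
h = \sigma(X\theta)

J(\theta) = -\frac{1}{m}\left[\, y^{\top}\log h + (1 - y)^{\top}\log(1 - h) \,\right]
            + \frac{\lambda}{2m}\,\tilde{\theta}^{\top}\tilde{\theta}

\nabla_{\theta} J = \frac{1}{m}\, X^{\top}(h - y) + \frac{\lambda}{m}\,\tilde{\theta}
```

The code stores the gradient as a row vector, `((h-y)'*X)/m`, which is the transpose of the expression above.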
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels,
    [theta] = fmincg(@(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
    all_theta(c, :) = theta';
end
where num_labels = 8, lambda(regularization) is 0.1
With the first set, MaxIter = 50, and I get ~99.8% classification accuracy.
With the second set and MaxIter = 50, the accuracy is poor: 62.589928%.
I thought about increasing MaxIter to a larger value to improve the performance, however, even at a ridiculous amount of iterations, the result doesn't go higher than 66.546763. Changing of the regularization value (lambda) doesn't seem to influence the results in any better way.
What could be the problem? I am new to machine learning and I can't seem to catch what exactly causes this drastic difference. The only reason that obviously stands out for me is that the first set's examples are very long vectors, hence, larger amount of features, and the second set's examples are represented by short 3-vectors. Is this data not enough to classify the second set? If so, what can be done about it to achieve better classification results for the second set?