I'm building a model to detect keypoints of body parts. To do that I'm using the COCO dataset (http://cocodataset.org/#download). I'm trying to understand why I'm running into overfitting issues (training loss converges, but test loss hits a ceiling very early). In the model, I've tried adding dropout layers (gradually adding more layers with higher rates), but I quickly reach a point where training loss stops decreasing, which is just as bad. My theory is that the model I use isn't complex enough, but I'd like to know if that's the likely reason or if it could be something else. The models I've found online are all extremely deep (30+ layers).
Data
I'm using 10,000 RGB images, each of which contains a single person. They vary in size, but neither dimension exceeds 640 pixels. As a preprocessing step, I make every image 640x640 by filling any extra area (at the bottom and right of the image) with (0,0,0), i.e. black.
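For reference, a minimal sketch of that padding step (assuming the image is already an HxWx3 NumPy array; the function name is mine):

import numpy as np

def pad_to_640(img):
    # place the image in the top-left corner of a black 640x640 canvas
    h, w, c = img.shape
    padded = np.zeros((640, 640, c), dtype=img.dtype)
    padded[:h, :w, :] = img
    return padded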
Targets
The full dataset has many keypoints, but I'm only interested in the right shoulder, right elbow, and right wrist. Each body part contributes two values (an X and a Y coordinate), so my target is a vector of length 6.
Model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

activation_function = 'relu'
batch_size = 16
epoch_count = 40
loss_function = 'mean_squared_error'
opt = 'adam'

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=(3, 3), input_shape=inp_shape))
# model.add(Conv2D(filters=16, kernel_size=(3, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=32, kernel_size=(3, 3)))
# model.add(Conv2D(filters=32, kernel_size=(3, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(300, activation=activation_function))
model.add(Dropout(rate=0.1))
model.add(Dense(300, activation=activation_function))
model.add(Dense(num_targets))  # num_targets = 6 (x and y for shoulder, elbow, wrist)
model.summary()
model.compile(loss=loss_function, optimizer=opt)

hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epoch_count,
                 verbose=verbose_level,
                 validation_data=(x_valid, y_valid))
Your theory
the model I use isn't complex enough
It's a good theory. The model is pretty simple, and given that we don't know exactly how much overfitting you are suffering, it seems plausible that the problem comes from the model's limited capacity.
In the model, I've tried adding layers of dropout
Dropout is a simple but effective regularizer; beyond adding more dropout layers, though, I'd increase the dropout rate. You currently have a dropout of 0.1; try 0.5, for example, and compare whether the overfitting decreases.
Anyway, I think the best thing you can try is increasing the complexity of the model, in the convolutional part in particular, not just adding Dense layers after the Flatten. If that seems difficult, I suggest looking for pre-built general architectures for convolutional neural networks for image recognition, or even more specific designs for problems similar to yours.
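For example, here is a rough sketch of what a deeper convolutional part could look like (the filter counts are illustrative, not tuned; inp_shape and num_targets are the same variables as in your code):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=inp_shape))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(300, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_targets))
model.compile(loss='mean_squared_error', optimizer='adam')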
Tell us how it goes!
In addition to what has already been said in the other answers:
You can have several Dropout layers with different probabilities, e.g. after the pooling layers. Early layers often get a higher keep probability (i.e. a lower dropout rate), since they use fewer filters.
Image data augmentation is another route to better generalization, and in my experience it always improves the result, at least slightly (provided, of course, that the input transformations are not too severe).
Batch normalization (and related methods such as weight normalization and layer normalization) is a modern regularization method that reduces the required dropout intensity, sometimes completely, i.e. you may be able to get rid of the dropout layers. In addition, batchnorm improves the activation statistics, which often makes the network learn faster. I used it in addition to dropout and it worked pretty well.
Scaled Exponential Linear Units (SELU) were published fairly recently and are said to have implicit self-normalizing properties; the activation is already implemented in Keras.
The good old L2 or L1 regularizer is still in use. If nothing else helps, consider adding it too, but I'm fairly sure that batchnorm, SELU and a few dropout layers will be enough. A small sketch combining some of these ideas follows below.
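A minimal sketch (reusing the Keras setup from the question; the dropout rates and the L2 factor are illustrative, not tuned) of one conv block with dropout after pooling, batch normalization and an L2 penalty:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense
from keras.regularizers import l2

model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu', kernel_regularizer=l2(1e-4), input_shape=inp_shape))
model.add(BatchNormalization())   # stabilizes activation statistics
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))           # lower rate early, higher rates deeper
model.add(Flatten())
model.add(Dense(300, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_targets))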
Related
I am trying to build a SimpleRNN network with a custom loss function. I am predicting BMI based on 25 different features. My dataset is unbalanced and has outliers, and I want to predict better on the outliers; in fact, predicting well on the outliers is the more important goal.
For my custom loss function I have added a condition: if the loss is greater than 2 units, I want to penalize those observations more.
import keras.backend as K
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Dropout

def custom_loss(y_true, y_pred):
    loss = K.abs(y_pred - y_true)
    wt = loss * 5
    # weight observations whose absolute error exceeds 2 units five times more
    loss_mae = K.switch(loss > 2, wt, loss)
    return loss_mae

model = Sequential()
model.add(SimpleRNN(units=64, input_shape=(25, 1), activation="relu"))
model.add(Dense(32, activation="linear"))
model.add(Dropout(0.2))
model.add(Dense(1, activation="linear"))
model.compile(loss=custom_loss, optimizer='adam')
model.add(Dropout(0.1))
model.summary()
model.fit(train_x, train_y)
sample predictions after running this code
preds=[[16.015867], [16.022823], [15.986835], [16.69895 ], [17.537468]]
actual=[[18.68], [24.35], [18.07], [15.2 ], [13.78]]
As you can see, the predictions for the 2nd and 5th observations are still way off. Am I doing anything wrong in the code?
One thing that is very wrong is that you should never have a dropout on the output neuron. Apart from this:
The activation function of the hidden layers should not be linear (model.add(Dense(32, activation="linear")) should be model.add(Dense(32, activation="relu"))).
A neural network should always be able to overfit your training data, and that should be your debugging state before worrying about generalisation. Consequently:
Do not use dropout (it only makes fitting harder; experiment with it once you are concerned about generalisation).
Your network is somewhat tiny; try making it much wider and see if your predictions improve.
MAE is a much worse-behaved learning signal than MSE, which also automatically penalises outliers heavily, so why not use it?
Consider normalising your data; neural networks work well with their default initialisations if both inputs and targets lie in a somewhat bounded range, preferably on a [-1, 1] or [0, 1] scale. (A sketch combining these suggestions follows below.)
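Putting those suggestions together, a rough sketch (not a verified fix; the width, epoch count and batch size are placeholders, and train_x / train_y are assumed to be normalised already):

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

model = Sequential()
model.add(SimpleRNN(units=128, input_shape=(25, 1), activation="relu"))
model.add(Dense(64, activation="relu"))   # non-linear hidden activation
model.add(Dense(1, activation="linear"))  # no dropout near the output
model.compile(loss="mse", optimizer="adam")
model.fit(train_x, train_y, epochs=50, batch_size=32)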
I'm doing regression using neural networks. It should be a simple task for a NN: I have 10 features and 1 continuous output that I want to predict. I'm using PyTorch for my project, but my model is not learning well. The loss starts at a very high value (40000), drops rapidly to 6000-7000 in the first 5-10 epochs, and then gets stuck there no matter what I do. I even switched from plain PyTorch to skorch so that I could use its cross-validation functionality, but that didn't help either. I tried different optimizers and added layers and neurons to the network, but it still gets stuck at 6000, which is a very high loss value. This should be an easy regression problem, which is why it confuses me even more.
Here is my network. I tried all the possibilities here, from more complex architectures (adding layers and units) to batch normalization, changing activations, etc., and nothing has worked:
import torch.nn as nn

class BearingNetwork(nn.Module):
    def __init__(self, n_features=X.shape[1], n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(),
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.LeakyReLU(),
            nn.Linear(64, n_out),
            # nn.LeakyReLU(),
            # nn.Linear(256, 128),
            # nn.LeakyReLU(),
            # nn.Linear(128, 64),
            # nn.LeakyReLU(),
            # nn.Linear(64, n_out)
        )

    def forward(self, x):
        out = self.model(x)
        return out
and here are my settings:
Using skorch is easier than plain PyTorch. Here I'm also monitoring the R2 metric, and I made RMSE a custom metric to monitor the performance of my model as well. I also tried amsgrad for Adam, but that didn't help.
import skorch
import torch.optim as optim
from skorch import NeuralNetRegressor
from skorch.callbacks import EpochScoring, Checkpoint, EarlyStopping
from sklearn.metrics import r2_score, make_scorer

# RMSE, EVS and device are defined elsewhere in my code
R2 = EpochScoring(r2_score, lower_is_better=False, name='R2')
explained_var_score = EpochScoring(EVS, lower_is_better=False, name='EVS Metric')
custom_score = make_scorer(RMSE)
rmse = EpochScoring(custom_score, lower_is_better=True, name='rmse')

bearing_nn = NeuralNetRegressor(
    BearingNetwork,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    optimizer__amsgrad=True,
    max_epochs=5000,
    batch_size=128,
    lr=0.001,
    train_split=skorch.dataset.CVSplit(10),
    callbacks=[R2, explained_var_score, rmse, Checkpoint(), EarlyStopping(patience=100)],
    device=device
)
I also standardize the input values.
My input has the shape:
torch.Size([39006, 10])
and the shape of the output is:
torch.Size([39006, 1])
I'm using 128 as my batch size, but I also tried other values like 32, 64, 512 and even 1024. Normalizing the output shouldn't be necessary, but I tried that too and it didn't help; when I predict values, the loss is still high. Please, someone help me with this; I would appreciate any helpful advice. I'll also add a screenshot of my training and validation losses and metrics over epochs, to show how the loss decreases in the first 5 epochs and then stays forever at around 6000, which is a very high value for a loss.
Considering that your training and dev loss are decreasing over time, it seems like your model is training correctly. As for your worry about the training and dev loss values, this depends entirely on the scale of your target values (how big are they?) and the metric used to compute the losses. If your target values are large and you want smaller train and dev loss values, you can normalise the target values.
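For example, a minimal sketch (assuming scikit-learn is available; X_train, y_train and X_test are placeholder names for your arrays) of normalising the targets and mapping predictions back afterwards:

import numpy as np
from sklearn.preprocessing import StandardScaler

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train).astype(np.float32)  # y_train has shape (n, 1)

bearing_nn.fit(X_train.astype(np.float32), y_train_scaled)

preds_scaled = bearing_nn.predict(X_test.astype(np.float32))
preds = y_scaler.inverse_transform(preds_scaled)  # back to the original units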
From what I gather with respect to your experiments as well as your R2 scores, it seems that you are looking for a solution in the wrong area. To me, it seems like your features aren't strong enough considering that your R2 scores are low, which could mean that you have a data quality issue. This would also explain why your architecture tuning has not improved your model's performance as it is not your model that is the issue. So if I were you, I would think about what new useful features I could add and see if that helps. In machine learning, the general rule is that models are only as good as the data that they are trained on. I hope this helps!
The metric you should be looking at is R^2, not the magnitude of the loss function. The purpose of a loss function is just to let the optimizer know if it's going in the right direction--it's not a measure of fit that's comparable across data sets and learning setups. That's what R^2 is for.
Your R^2 scores show that you're explaining around a third of the total variance in the output, which is often a very good result for a data set with only 10 features. Actually, given the shape of your data, it's more likely that your hidden layers are considerably larger than necessary and risk overfitting.
To really evaluate this model, you'd need to know (1) how the R^2 score compares to simpler regression approaches like OLS and (2) why you should have any confidence that more than 30% of the output variance should be captured by the input variables.
For #1, the R^2 should at least not be worse. As for #2, consider the canonical digit classification example: we know that all the information necessary to recognize digits with very high accuracy (i.e. R^2 approaching 1) is present in the input, because humans can do it. That's not necessarily the case with other data sets, where important sources of variance may not be captured in the source data.
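For point #1, a quick baseline check could look like this (a sketch; X and y are placeholder names for your feature matrix and target vector):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

ols = LinearRegression()
r2_scores = cross_val_score(ols, X, y.ravel(), cv=10, scoring='r2')
print(r2_scores.mean())  # if this is close to the network's R^2, the NN adds little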
As your loss decreases from 40000 to 6000, your NN model has learnt the prevalent relation, but not all of it. You can aid this learning by transforming the predictor variables and then feeding them as derived features to your model, and see if that helps. You can also try stepwise addition of features to your NN model, adding the most influential predictors first and evaluating the model performance (i.e. training loss) at every iteration.
If the first step doesn't help, and since you are open to other approaches: depending on your data's dynamics, Gaussian process regression or quantile regression may help, as these methods are free from the assumptions of linear regression techniques. They should also help you explore different aspects of the relationship between your independent and dependent variables.
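A sketch of both alternatives with scikit-learn (X and y are placeholder names; note that exact Gaussian process regression scales poorly beyond a few thousand samples, so a subsample may be needed):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import GradientBoostingRegressor

gpr = GaussianProcessRegressor().fit(X, y.ravel())

# quantile regression: model the 90th percentile instead of the mean
gbr_q90 = GradientBoostingRegressor(loss='quantile', alpha=0.9).fit(X, y.ravel())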
I am training a neural network with dropout. It happens that as I decrease dropout from 0.9 to 0.7, the loss (cross-validation error) on the training data also decreases. I noticed also that accuracy increases as I reduce the dropout parameter.
It seems odd to me. Does it make sense?
Dropout is a regularization technique. You should use it only to reduce variance (the gap between validation and training performance). It is not intended to reduce bias, and using it that way is misleading.
Probably the reason you see this behavior is that you use a very high value for dropout. 0.9 means you neutralize too many neurons. It makes sense that once you use 0.7 instead, the network has more neurons to work with while learning on the training set, so performance increases for lower dropout values.
You usually should see the training performance drop a bit, while the performance on the validation set (or, if you do not have one, at least the test set) increases. This is the desired behavior when using dropout. The behavior you currently see comes from the very high values of dropout.
Start with 0.2 or 0.3 and compare the bias vs. variance in order to find a good value for dropout.
My clear recommendation: don't use it to reduce bias, but to reduce variance (the error on the validation set).
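For instance, one way to compare a few dropout values (build_model is a hypothetical helper that rebuilds your network with the given rate; x_train, y_train, x_val, y_val are placeholders):

for rate in [0.0, 0.2, 0.3, 0.5, 0.7]:
    model = build_model(dropout_rate=rate)          # hypothetical model builder
    hist = model.fit(x_train, y_train, epochs=20, verbose=0,
                     validation_data=(x_val, y_val))
    print(rate, hist.history['loss'][-1], hist.history['val_loss'][-1])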
In order to fit the training set better I recommend:
finding a better architecture (or changing the number of neurons per layer)
trying different optimizers
hyperparameter tuning
maybe training the network a bit longer
Hopefully this helps !
Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with a very different network structure and, in turn, making nodes in the network generally more robust to the inputs.
With a dropout rate below a certain threshold, accuracy will gradually increase and the loss will gradually decrease (that is what is happening in your case as you lower the rate).
When you increase dropout beyond that threshold, the model can no longer fit properly. Intuitively, a higher dropout rate introduces higher variance into some of the layers, which degrades training.
What you should always remember is that dropout, like all other forms of regularization, reduces effective model capacity. If you reduce the capacity too much, you are sure to get bad results.
Hope this may help you.
I'm new to neural networks and I have this problem:
I have a dataset with 300 rows and 33 columns. Each row has 3 additional columns for the results.
I'm trying to use an MLP to train a model so that, when I have a new row, it estimates those 3 result columns.
I can easily reduce the error during training to 0.001, but under cross-validation the model keeps estimating very poorly.
It estimates correctly if I use the same entries it was trained on, but with values that weren't used for training the results are very wrong.
I'm using two hidden layers with 20 neurons each, so my architecture is [33 20 20 3].
For the activation function I'm using the bipolar sigmoid.
Do you have any suggestions on what I could change to improve this?
Overfitting
As mentioned in the comments, this perfectly describes overfitting.
I strongly suggest reading the Wikipedia article on overfitting, as it describes the causes well, but I'll summarize some key points here.
Model complexity
Overfitting often happens when your model is needlessly complex for the problem. I don't know anything about your dataset, but I'm guessing [33 20 20 3] has more parameters than necessary for this prediction.
Try running your cross-validation again, this time with either fewer layers or fewer nodes per layer. Right now you are using 33*20 + 20*20 + 20*3 = 1120 parameters (weights, not counting biases) to make your prediction; is that necessary?
Regularization
A common solution to overfitting is regularization. The driving principle is KISS (keep it simple, stupid).
By applying an L1 regularizer to your weights, you express a preference for solving the problem with the smallest number of weights. The network will pull many weights to 0, as they aren't needed.
By applying an L2 regularizer to your weights, you express a preference for solutions with small weights. Practically this means your weights will be smaller numbers and the network is less able to "memorize" the data.
What are L1 and L2? They are vector norms: L1 is the sum of the absolute values of your weights, L2 is the square root of the sum of their squares (L3 would be the cube root of the sum of the absolute values cubed, L4 the fourth root of the sum of fourth powers, and so on).
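In symbols, for a weight vector w:

\|w\|_1 = \sum_i |w_i|, \qquad \|w\|_2 = \sqrt{\sum_i w_i^2}, \qquad \|w\|_p = \left(\sum_i |w_i|^p\right)^{1/p}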
Distortions
Another commonly used technique is to augment your training data with distorted versions of your training samples. This only makes sense with certain types of data. For instance, images can be rotated, scaled, shifted, or have Gaussian noise added without dramatically changing their content.
By adding distortions, your network no longer simply memorizes your data; it also learns to recognize things that look similar to your data. The digit 1 rotated by 2 degrees still looks like a 1, so the network should be able to learn from both.
Only you know your data. If this is something that can be done with your data (even just adding a little gaussian noise to each feature), then maybe this is worth looking into. But do not use this blindly without considering the implications it may have on your dataset.
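As one concrete example, a minimal sketch of augmenting a tabular training set with small Gaussian noise (X_train and y_train are placeholder names; the noise scale 0.01 is illustrative and assumes standardized features):

import numpy as np

noise = np.random.normal(loc=0.0, scale=0.01, size=X_train.shape)
X_aug = np.vstack([X_train, X_train + noise])   # original plus jittered copies
y_aug = np.vstack([y_train, y_train])           # targets are unchanged by the noise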
Careful analysis of data
I put this last because it is an indirect response to the overfitting problem. Check your data before pumping it through a black-box algorithm (like a neural network). Here are a few questions worth answering if your network doesn't work (a small sketch of the first two checks follows this list):
Are any of my features strongly correlated with each other?
How do baseline algorithms perform? (Linear regression, logistic regression, etc.)
How are my training samples distributed among classes? Do I have 298 samples of one class and only 1 sample of each of the other two?
How similar are my samples within a class? Maybe I have 100 samples for this class, but all of them are the same (or nearly the same).
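As mentioned above the list, a small sketch of the first two checks (df, feature_cols and target_cols are placeholder names for your DataFrame and its column lists):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

corr = df[feature_cols].corr().abs()   # feature-feature correlations
# top entries will include the trivial self-correlations of 1.0
print(corr.unstack().sort_values(ascending=False).head(20))

baseline = LinearRegression()
print(cross_val_score(baseline, df[feature_cols], df[target_cols], cv=5).mean())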
I am currently trying to use a neural network to make regression predictions.
However, I don't know the best way to handle this, as I've read that there are two different ways to do regression predictions with a NN.
1) Some websites/articles suggest adding a final layer which is linear:
http://deeplearning4j.org/linear-regression.html
My final layers would then, I think, look like:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get a prediction close to the mean of the batch predictions. This is the case whether I use tanh or sigmoid as the penultimate layer.
2) Other websites/articles suggest scaling the output to a [-1, 1] or [0, 1] range and using tanh or sigmoid as the final layer.
Are these two solutions acceptable? Which one should be preferred?
Thanks,
Paul
I would prefer the second approach, in which we normalize the targets, use a sigmoid as the output activation, and then scale the normalized outputs back to their actual values. This is because, in the first case, to output large values (and actual values are large in most cases), the weights mapping from the penultimate layer to the output layer would have to be large. Thus, for faster convergence, the learning rate has to be made larger, but this may also cause the learning of the earlier layers to diverge. Hence, it is advisable to work with normalized target values, so that the weights stay small and learn quickly.
In short, the first method learns slowly or may diverge if a larger learning rate is used, while the second method is comparatively safer and learns quickly.
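A minimal sketch of the second approach (in Keras, with placeholder names n_features, x_train, y_train, x_test; scikit-learn's MinMaxScaler does the [0, 1] scaling):

from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1))  # targets in [0, 1]

model = Sequential()
model.add(Dense(64, activation='tanh', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))   # output bounded to [0, 1]
model.compile(loss='mse', optimizer='adam')
model.fit(x_train, y_train_scaled, epochs=50, verbose=0)

preds = scaler.inverse_transform(model.predict(x_test))  # back to original units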