It is common to use a dropout rate of 0.5 as a default, which I also use in my fully-connected network. This advice follows the recommendations from the original dropout paper (Hinton et al.).
My network consists of fully-connected layers of size
[1000, 500, 100, 10, 100, 500, 1000, 20].
I do not apply dropout to the last layer, but I do apply it to the bottleneck layer of size 10. This does not seem reasonable given that dropout = 0.5; I guess too much information gets lost. Is there a rule of thumb for how to treat bottleneck layers when using dropout? Is it better to increase the size of the bottleneck or to decrease the dropout rate?
A dropout layer is added to prevent overfitting (regularization) in a neural network.
Dropout adds noise to the output values of a layer, which breaks up the happenstance patterns that cause overfitting.
A dropout rate of 0.5 means 50% of the values are dropped, which is a high noise ratio and a definite no for a bottleneck layer.
I would recommend that you first train your bottleneck layer without dropout, and then compare the results as you gradually increase the dropout rate.
Choose the model that performs best on your validation data.
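For example, a minimal Keras sketch of that comparison, assuming the layer sizes from the question; the input dimension and the mean-squared-error loss are placeholders, and the bottleneck dropout is a separate parameter so a run with 0.0 can be compared against 0.5:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_network(input_dim=1000, bottleneck_dropout=0.0):
    # Dropout of 0.5 after every hidden layer except (optionally) the size-10 bottleneck
    model = Sequential()
    model.add(Dense(1000, activation='relu', input_dim=input_dim))
    model.add(Dropout(0.5))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='relu'))       # bottleneck
    model.add(Dropout(bottleneck_dropout))        # 0.0 = no dropout on the bottleneck
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(20))                          # output layer, no dropout
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# Train build_network(bottleneck_dropout=0.0) and build_network(bottleneck_dropout=0.5)
# on the same data and compare their validation losses.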
Related
I am training a neural network with dropout. It happens that as I decrease dropout from 0.9 to 0.7, the loss (cross-validation error) also decreases for the training data. I also noticed that accuracy increases as I reduce the dropout parameter.
It seems odd to me. Does it make sense?
Dropout is a regularization technique. You should use it only to reduce variance (the gap between validation performance and training performance). It is not intended to reduce bias, and you should not use it in this way; that is very misleading.
The reason you probably see this behavior is that you use a very high value for dropout. A rate of 0.9 means you neutralize too many neurons. It makes sense that once you put 0.7 there instead, the network has more neurons to use while learning on the training set, so performance increases for lower dropout values.
Usually you should see the training performance drop a bit while the performance on the validation set increases (if you do not have one, at least on the test set). That is the desired behavior you are looking for when using dropout. The behavior you currently get is because of the very high values for dropout.
Start with 0.2 or 0.3 and compare the bias vs. variance in order to get a good value for dropout.
My clear recommendation: don't use it to improve bias, but to reduce variance (error on validation set).
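As an illustration, here is a minimal sketch of that comparison, assuming placeholder layer sizes and that x_train, y_train, x_valid, y_valid already hold your data; the point is to watch the training loss (bias) and the train/validation gap (variance) for each rate:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_model(dropout_rate):
    # Small placeholder classifier, used only to compare dropout settings
    model = Sequential()
    model.add(Dense(256, activation='relu', input_dim=100))
    model.add(Dropout(dropout_rate))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

for rate in [0.0, 0.2, 0.3, 0.5]:
    hist = build_model(rate).fit(x_train, y_train, epochs=20, verbose=0,
                                 validation_data=(x_valid, y_valid))
    # Bias: final training loss; variance: gap between training and validation loss
    print(rate, hist.history['loss'][-1], hist.history['val_loss'][-1])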
In order to fit better on the training set I recommend:
find a better architecture (or change the number of neurons per layer)
try different optimizers
hyperparameter tuning
maybe train the network a bit longer
Hopefully this helps!
Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with a very different network structure and, in turn, making nodes in the network generally more robust to the inputs.
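A minimal NumPy sketch of that masking step (inverted dropout, as most frameworks implement it); the function name and array shapes are only illustrative:

import numpy as np

def dropout_forward(activations, rate=0.5, training=True):
    # Zero out a random subset of activations and rescale the survivors,
    # so the expected value of each unit stays the same
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = np.random.binomial(1, keep_prob, size=activations.shape)
    return activations * mask / keep_prob

a = np.ones((4, 8))
print(dropout_forward(a, rate=0.5))   # about half the entries are 0, the rest are 2.0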
With dropout at a modest rate (below some small threshold), the accuracy will gradually increase and the loss will gradually decrease at first (which is what is happening in your case).
When you increase dropout beyond a certain threshold, the model is no longer able to fit properly. Intuitively, a higher dropout rate introduces more variance into some of the layers, which also degrades training.
What you should always remember is that dropout, like all other forms of regularization, reduces effective model capacity. If you reduce the capacity too much, you are sure to get bad results.
Hope this may help you.
I'm building a model to detect keypoints of body parts, using the COCO dataset (http://cocodataset.org/#download). I'm trying to understand why I'm running into overfitting issues (training loss converges, but I hit a ceiling really early for the test loss). In the model, I've tried adding dropout layers (gradually adding more layers with higher probabilities), but I quickly reach a point where the training loss stops decreasing, which is just as bad. My theory is that the model I use isn't complex enough, but I'd like to know if that's the likely reason or if it could be something else. The models I've found online are all extremely deep (30+ layers).
Data
I'm using 10,000 RGB images, each of which has a single person in it. They each have different sizes, with a maximum of 640 in length and width. As a preprocessing step, I make every image 640x640 by filling any extra area (bottom and right of the image) with (0,0,0), i.e. black.
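A minimal sketch of that padding step, assuming NumPy arrays in height x width x channel layout:

import numpy as np

def pad_to_square(img, size=640):
    # Place the image in the top-left corner of a black size x size canvas
    h, w = img.shape[:2]
    canvas = np.zeros((size, size, 3), dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas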
Targets
The full dataset has many keypoints but I'm only interested in the right shoulder, right elbow, and right wrist. Each body part has 2 keypoints (X coordinate and Y coordinate) so my target is a list of length 6.
Model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

activation_function = 'relu'
batch_size = 16
epoch_count = 40
loss_function = 'mean_squared_error'
opt = 'adam'

inp_shape = (640, 640, 3)   # 640x640 RGB images
num_targets = 6             # 3 keypoints x (x, y)

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=(3, 3), input_shape=inp_shape))
# model.add(Conv2D(filters=16, kernel_size=(3, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=32, kernel_size=(3, 3)))
# model.add(Conv2D(filters=32, kernel_size=(3, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(300, activation=activation_function))
model.add(Dropout(rate=0.1))
model.add(Dense(300, activation=activation_function))
model.add(Dense(num_targets))
model.summary()

model.compile(loss=loss_function, optimizer=opt)
hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epoch_count,
                 verbose=verbose_level,
                 validation_data=(x_valid, y_valid))
Your theory
the model I use isn't complex enough
It's a good theory: the model is pretty simple, and given that we don't know exactly how much overfitting you are suffering from, it seems possible that the overfitting is due to the model's limited complexity.
In the model, I've tried adding layers of dropout
Adding dropout can be a simple but effective measure, but furthermore, I'd increase the dropout rate. You currently have a dropout of 0.1; try 0.5, for example, and compare whether the overfitting decreases.
Anyway, I think the best thing you can try is to increase the complexity of the model, but in the convolutional part, not just by adding Dense layers after the Flatten. If that seems difficult, I suggest looking at pre-built general architectures for convolutional neural networks for image recognition, or even more specific architectures for problems similar to yours.
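For instance, a hedged sketch of a somewhat deeper convolutional part in front of the same dense head; the filter counts and the 0.5 dropout rate are suggestions to experiment with, not a prescription:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=inp_shape))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(300, activation='relu'))
model.add(Dropout(0.5))               # higher than the original 0.1
model.add(Dense(300, activation='relu'))
model.add(Dense(num_targets))         # 6 keypoint coordinates
model.compile(loss='mean_squared_error', optimizer='adam')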
Tell us how it goes!
In addition to what has already been said in the other answers:
You can have several dropout layers with different probabilities, e.g. after the pooling layers. Early layers often get a higher keep probability, since they use fewer filters.
Image data augmentation is another way towards generalization, and in my experience it always improves the result, at least slightly (provided, of course, that the input transformation is not too severe).
Batch normalization (and its successors, weight normalization and layer normalization) is a modern regularization method that reduces the required dropout intensity, sometimes completely, i.e. you can get rid of the dropout layers. In addition, batchnorm improves the statistics of the activations, which often makes the network learn faster. I used it in addition to dropout and it worked pretty well.
A technique called Scaled Exponential Linear Units (SELU) was published very recently; it is said to have implicit self-normalizing properties and is already implemented in Keras.
The good old L2 or L1 regularizer is still in use. If nothing else helps, consider adding it too. But I'm pretty sure that batchnorm, SELU, and a few dropout layers will be enough.
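Putting a few of these together, a hedged Keras sketch (the input shape, filter counts, and regularization strengths are placeholders); keras.preprocessing.image.ImageDataGenerator can supply the augmentation, and swapping activation='relu' for 'selu' gives the SELU variant:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dense, Dropout
from keras.regularizers import l2

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(640, 640, 3)))
model.add(BatchNormalization())
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.2))                          # lighter dropout early on
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(300, activation='relu',
                kernel_regularizer=l2(1e-4)))    # small L2 penalty on the dense weights
model.add(Dropout(0.5))
model.add(Dense(6))
model.compile(loss='mean_squared_error', optimizer='adam')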
In the context of binary classification, I use a neural network with 1 hidden layer using a tanh activation function. The input comes from a word2vec model and is normalized.
The classifier accuracy is between 49%-54%.
I used a confusion matrix to get a better understanding of what's going on. I am studying the impact of the number of input features and of the number of neurons in the hidden layer on the accuracy.
What I can observe from the confusion matrix is that, depending on the parameters, the model sometimes predicts most of the rows as positive and sometimes most of them as negative.
Any suggestions as to why this happens? And which other factors (besides input size and hidden layer size) might impact the accuracy of the classification?
Thanks
It's a bit hard to guess given the information you provide.
Are the labels balanced (50% positives, 50% negatives)? If so, your network is not really training at all, since your performance roughly corresponds to random guessing. Is there maybe a bug in the preprocessing? Or is the task too difficult? What is the training set size?
I don't believe that the number of neurons is the issue, as long as it's reasonable, i.e. hundreds or a few thousand.
Alternatively, you can try another loss function, namely cross entropy, which is standard for multi-class classification and can also be used for binary classification:
https://www.tensorflow.org/api_docs/python/nn/classification#softmax_cross_entropy_with_logits
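If you happen to be in Keras rather than raw TensorFlow, a minimal sketch of a binary classifier trained with a cross-entropy loss might look like this (the layer sizes are placeholders for your word2vec input and hidden layer):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(200, activation='tanh', input_dim=300))   # 300 word2vec features
model.add(Dense(1, activation='sigmoid'))                 # single probability output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])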
Hope this helps.
The data set is well balanced, 50% positive and 50% negative.
The training set shape is (411426, X).
The test set shape is (68572, X).
X is the number of features coming from word2vec, and I try values between [100, 300].
I have 1 hidden layer, and the number of neurons that I test varies between [100, 300].
I also tested with much smaller feature/neuron sizes: 2-20 features and 10 neurons in the hidden layer.
I also use cross entropy as the cost function.
I have a neural network which does image segmentation. I trained it for ~100 epochs. The current situation is that the validation loss is constant (0.2 +/- 0.03) while the training loss is still decreasing (currently 0.07), but very slowly.
The results of the neural network are quite good.
What does this mean? Is it overfitting? Should I stop the training?
I currently use dropout in the first layer (50%). Would it make sense to add dropout to every layer (there are about ~15 layers)? Or should I also add L2 regularization? Does it make sense to use both L2 and dropout?
Thank you very much
It is recommended to use L2 regularization when you use dropout. I think your dropout of 50% is a little too high; people usually use something around 20%, depending on the operations.
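For example, a minimal Keras sketch of one block combining a small L2 penalty with ~20% dropout (the input shape, filter count, and L2 strength are placeholders):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout
from keras.regularizers import l2

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                 kernel_regularizer=l2(1e-4), input_shape=(256, 256, 3)))
model.add(Dropout(0.2))          # ~20% instead of 50%
model.add(MaxPooling2D((2, 2)))
# ... the remaining ~15 layers of the segmentation network would follow here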
Moreover, 100 epochs may not be enough; it depends on the size of your training set and the size of your neural network.
What do you mean by "quite well"? Please quantify it and share an example. The validation loss and accuracy are just "indicators"; their values also depend on the network and the training set, so 0.2 can be either bad or good depending on your problem.
I am using TensorFlow to train a convnet on a set of 15,000 training images with 22 classes. I have 2 conv layers and one fully connected layer. I have trained the network on the 15,000 images and have reached convergence and high accuracy on the training set.
However, my test set accuracy is much lower, so I am assuming the network is overfitting. To combat this I added dropout before the fully connected layer of my network.
However, adding dropout has caused the network to never converge, even after many iterations. I was wondering why this may be. I have even used a high keep probability (0.9, i.e. only mild dropout) and have experienced the same results.
By setting the keep probability to 0.9, each neuron connection has a 10% chance of being switched off in each iteration, so for dropout too there is an optimum value.
Dropout also rescales the surviving neurons. With a keep probability of p, the kept activations are scaled by 1/p during training (inverted dropout), so with p = 0.9 they come out roughly 11% larger, and a different keep probability gives a different scaling.
Just from this you can get an idea of how dropout can affect training: with some probability the rescaling can saturate your nodes, which can cause the non-convergence issue.
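A small sketch of that scaling, assuming the TF1-style API implied by the question's keep probability:

import tensorflow as tf

x = tf.ones([1, 10])
dropped = tf.nn.dropout(x, keep_prob=0.9)   # each entry is kept with probability 0.9
with tf.Session() as sess:
    print(sess.run(dropped))
# Kept entries are rescaled to 1/0.9 = ~1.11, dropped entries become 0.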
You can add dropout to your dense layers after the convolutional layers and remove dropout from the convolutional layers. If you want many more training examples, you can put some white noise on each picture (e.g. 5% random pixels) and keep both the original P and the noisy variant P' of each picture; this can improve your results.
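A minimal NumPy sketch of that noise augmentation, assuming 8-bit images; the function name and the white value of 255 are only illustrative:

import numpy as np

def add_white_noise(img, fraction=0.05):
    # Return a copy of the picture with roughly 5% of its pixels set to white
    noisy = img.copy()
    h, w = img.shape[:2]
    n = int(fraction * h * w)
    ys = np.random.randint(0, h, n)
    xs = np.random.randint(0, w, n)
    noisy[ys, xs] = 255
    return noisy

# For every picture P, keep both P and its noisy variant P' in the training set.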
You shouldn't use 0.9 for dropout; with that setting you are losing too many features during the training phase. As far as I've seen, most dropout values are between 0.2 and 0.5. Using too much dropout can cause problems in the training phase and a longer time to converge, or even, in some rare cases, cause the network to learn something wrong.
You need to be careful with the use of dropout: as you can see in the image below, dropout prevents features from reaching the next layer, so using too many dropout layers or a very high dropout rate can kill the learning.
[Image: DropoutImage]