Why is dropout preventing convergence in Convolutional Neural Network? - neural-network

I am using TensorFlow to train a convnet on a set of 15000 training images with 22 classes. I have 2 conv layers and one fully connected layer. I have trained the network on the 15000 images and observed convergence and high accuracy on the training set.
However, my test set accuracy is much lower, so I am assuming the network is overfitting. To combat this I added dropout before the fully connected layer of my network.
However, adding dropout causes the network to never converge, even after many iterations. I was wondering why this may be. I have even used a high keep probability (0.9) and observed the same results.

By making your dropout keep probability 0.9, each neuron connection has a 10% chance of being switched off on every iteration. So for dropout, too, there is an optimum value.
Note that with dropout we are also scaling our activations: a keep probability of 0.5 gives one scaling factor, and 0.9 gives a different one.
So basically, with a keep probability of 0.9 the activations are scaled by 0.9, which means the network ends up seeing activations at test time that are roughly 0.1 larger than during training.
From this alone you can get an idea of how dropout can affect training: at some probabilities it can saturate your nodes, which causes the non-convergence issue.
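For reference, here is a minimal NumPy sketch of the inverted-dropout scheme that tf.nn.dropout uses (the function name is just illustrative). It shows directly how the keep probability controls both the dropping and the scaling:

```python
import numpy as np

def inverted_dropout(activations, keep_prob, training=True):
    # During training, keep each unit with probability `keep_prob` and scale
    # the survivors by 1/keep_prob so the expected activation stays the same.
    # At test time the layer is a no-op.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob).astype(activations.dtype)
    return activations * mask / keep_prob

# keep_prob=0.9: ~10% of the units are zeroed, the rest are scaled by 1/0.9.
x = np.ones((4, 5))
print(inverted_dropout(x, keep_prob=0.9))
```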

You can add dropout to your dense layers after the convolutional layers and remove dropout from the convolutional layers. If you want many more training examples, you can put some white noise (5% random pixels) on each picture and keep both the original P and the noisy P' variant of each picture. This can improve your results.
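As a rough sketch of that noise-based augmentation (the function name and the assumption of pixel values in [0, 1] are mine, not from the answer):

```python
import numpy as np

def add_pixel_noise(image, fraction=0.05, rng=None):
    # Corrupt a random `fraction` of pixels with uniform noise to create the
    # extra P' variant of a training picture. Assumes pixel values in [0, 1].
    rng = rng or np.random.default_rng()
    noisy = image.astype(np.float32)
    mask = rng.random(noisy.shape) < fraction
    noisy[mask] = rng.random(int(mask.sum()))
    return noisy

# Example: build the P / P' pair for one picture.
p = np.random.default_rng(0).random((28, 28))
p_prime = add_pixel_noise(p)
```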

You shouldn't use 0.9 for dropout; by doing this you lose features during your training phase. As far as I've seen, most dropout values are between 0.2 and 0.5. However, using too much dropout can cause problems in the training phase, a longer time to converge, or, in some rare cases, cause the network to learn something wrong.
You need to be careful with dropout: as you can see in the image below, dropout prevents features from getting to the next layer, so using too many dropout layers or a very high dropout value can kill the learning.
[Image: DropoutImage]

Related

Can dropout increase training data performance?

I am training a neural network with dropout. It happens that as I decrease dropout from 0.9 to 0.7, the loss (cross-validation error) also decreases for the training data. I also noticed that accuracy increases as I reduce the dropout parameter.
It seems odd to me. Does it make sense?
Dropout is a regularization technique. You should use it only to reduce variance (the gap between validation and training performance). It is not intended to reduce bias, and you should not use it that way; it is very misleading.
Probably the reason you see this behavior is that you use a very high value for dropout. 0.9 means you neutralize too many neurons. It makes sense that once you put 0.7 there instead, the network has more neurons available while learning on the training set, so performance increases for lower dropout values.
You should usually see the training performance drop a bit while the performance on the validation set increases (if you do not have one, at least on the test set). This is the desired behavior you are looking for when using dropout. The behavior you currently get is because of the very high values for dropout.
Start with 0.2 or 0.3 and compare the bias vs. variance in order to get a good value for dropout.
My clear recommendation: don't use it to improve bias, but to reduce variance (error on validation set).
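If it helps, here is a hedged Keras sketch of that comparison. The architecture and dataset (MNIST) are placeholders, not taken from your setup; the point is just to show how to loop over a few dropout rates and look at the train/validation gap:

```python
import tensorflow as tf

def build_model(rate):
    # Placeholder architecture: swap in your own network here.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(rate),  # Keras takes the DROP rate, not the keep probability
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

for rate in (0.2, 0.3, 0.5):
    model = build_model(rate)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(x_train, y_train, epochs=3,
                     validation_split=0.1, verbose=0)
    gap = hist.history["accuracy"][-1] - hist.history["val_accuracy"][-1]
    print(f"dropout={rate}: train/val accuracy gap = {gap:.3f}")
```

A large gap points at variance (where dropout helps); low accuracy on both sets points at bias (where it does not).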
In order to fit better on the training set I recommend:
find a better architecture (or change the number of neurons per layer)
try different optimizers
hyperparameter tuning
maybe train the network a bit longer
Hopefully this helps!
Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with a very different network structure and, in turn, making nodes in the network generally more robust to the inputs.
With dropout (a dropout rate below some small value), the accuracy will gradually increase and the loss will gradually decrease at first (that is what is happening in your case).
When you increase dropout beyond a certain threshold, it results in the model not being able to fit properly. Intuitively, a higher dropout rate would result in a higher variance to some of the layers, which also degrades training.
What you should always remember is that dropout, like all other forms of regularization, reduces model capacity. If you reduce the capacity too much, you will certainly get bad results.
Hope this helps.

How to fine tune hyper-parameters of momentum optimizer?

There are several optimizers for training neural networks, but Momentum and SGD always seem to do better than adaptive methods.
Now I am writing a program in TensorFlow to reproduce someone else's results. They use momentum for training in pylearn2, but there are several parameters: momentum factor, weight scale, bias scale. They assign the weight scale as the weight of the dropout layers.
When I train my network I use Momentum. However, it seems very hard to train and the loss stays high. The result is not bad when I use Adam, but it is still worse than theirs by 0.00X.
I want to know how to tune the Momentum optimizer, and also why my program doesn't work well.
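For reference, this is roughly how a momentum optimizer is configured in TensorFlow's Keras API (the learning rate and momentum factor below are placeholders, not tuned values; in TF 1.x the equivalent class is tf.train.MomentumOptimizer):

```python
import tensorflow as tf

# Sketch only: these hyperparameters are placeholders, not recommendations.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01,  # usually tuned together with the momentum factor
    momentum=0.9,        # a common starting point; values around 0.8-0.99 are typical
    nesterov=True,       # Nesterov momentum is often worth trying as well
)

# Minimal model just to show the optimizer being used.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=optimizer, loss="mse")
```

The learning rate and momentum interact strongly, so they are usually tuned together (for example with a grid or random search), often alongside a learning-rate decay schedule.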

Why is tanh performing better than relu in a simple neural network?

Here is my scenario
I have used the EMNIST database of capital letters of the English language.
My neural network is as follows
The input layer has 784 neurons, which are the pixel values of a 28x28 greyscale image divided by 255, so the values are in the range [0,1].
The hidden layer has 49 neurons, fully connected to the previous 784.
The output layer has 9 neurons denoting the class of the image.
The loss function is defined as the cross entropy of the softmax of the output layer.
All weights are initialized as random real numbers from [-1,+1].
Now I did training with 500 fixed samples for each class.
Simply, I passed 500x9 images to a train function which uses backpropagation and does 100 iterations, changing weights by learning_rate * derivative_of_loss_wrt_corresponding_weight.
I found that when I use the tanh activation on the neurons, the network learns faster than with relu at a learning rate of 0.0001.
I concluded that because accuracy on a fixed test dataset was higher for tanh than for relu. Also, the loss value after 100 epochs was slightly lower for tanh.
Isn't relu expected to perform better?
Isn't relu expected to perform better?
In general, no. ReLU will perform better on many problems, but not all problems.
Furthermore, if you use an architecture and set of parameters that are optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.
Often you will need to adjust the architecture and parameters like the learning rate to get comparable results. This may mean changing the number of hidden nodes and/or the learning rate in your example.
One final note: in the MNIST example architectures I have seen, hidden layers with ReLU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer and see if that improves your results with ReLU. See the Keras MNIST example here.
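A minimal Keras sketch of that suggestion, reusing the 784-49-9 shape from the question (the 0.2 drop rate is just a starting value to tune):

```python
import tensorflow as tf

# Hedged sketch: same layer sizes as the question, ReLU hidden layer followed by Dropout.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(49, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),  # drop rate is a placeholder to tune
    tf.keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```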

Neural Network: validation accuracy constant, training accuracy decreasing

I have a neural network which does image segmentation. I trained it for ~100 epochs. The current effect is that the validation loss is constant (0.2 +/- 0.03) and the training accuracy is still decreasing (currently 0.07), but very slowly.
The result of the neural network is quite well.
What does this mean? Is it overfitting? Should I stop the training?
I currently use dropout in the first layer (50%). Would it make sense to add dropout to every layer (there are about ~15 layers)? Or should I also add L2 regularization? Does it make sense to use L2 and also dropout?
Thank you very much
It is recommended to use L2 when you use dropout. I think that your dropout at 50% is a little too high. People usually use around 20%, depending on the operations.
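If it helps, a short Keras sketch of combining a lighter dropout rate with L2 weight decay (the layer sizes, input shape, and the 1e-4 factor are placeholders, not taken from your network):

```python
import tensorflow as tf

# Hedged sketch: L2 weight decay on the conv kernels plus a ~20% dropout layer.
l2 = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2,
                           input_shape=(128, 128, 3)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```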
Moreover, 100 epochs may not be enough, it depends on the size of your training set and the size of your neural network.
What do you mean by "quite well"? Please quantify it and share an example. The validation loss and accuracy are just "indicators"; their values also depend on the NN and the training set, so 0.2 can be either bad or good depending on your problem.

How to improve digit recognition prediction in Neural Networks in Matlab?

I've built a digit recognizer (56x56 digits) using Neural Networks, but I'm getting 89.5% accuracy on the test set and 100% on the training set. I know that it's possible to get >95% on the test set using this training set. Is there any way to improve my training so I can get better predictions? Changing iterations from 300 to 1000 gave me +0.12% accuracy. I'm also limited by file size, so increasing the number of nodes may be impossible, but if that's the case maybe I could cut some pixels/nodes from the input layer.
To train I'm using:
input layer: 3136 nodes
hidden layer: 220 nodes
labels: 36
regularized cost function with lambda=0.1
fmincg to calculate weights (1000 iterations)
As mentioned in the comments, the easiest and most promising way is to switch to a Convolutional Neural Network. But with your current model you can:
Add more layers with fewer neurons each, which increases learning capacity and should increase accuracy a bit. The problem is that you might start overfitting; use regularization to counter this.
Use Batch Normalization (BN). While you are already using regularization, BN accelerates training, also acts as regularization, and is a NN-specific method that might work better.
Make an ensemble. Train several NNs on the same dataset but with different initializations. This will produce slightly different classifiers, and you can combine their outputs to get a small increase in accuracy (see the sketch after this list).
Cross-entropy loss. You don't mention what loss function you are using; if it's not cross-entropy, you should start using it. All the high-accuracy classifiers use cross-entropy loss.
Switch to backpropagation with Stochastic Gradient Descent. I do not know the exact effect of changing the optimization algorithm, but it might outperform the one you are currently using, and you could combine it with optimizers such as Adagrad or Adam.
Other small changes that might increase accuracy are changing the activation functions (e.g. to ReLU), shuffling training samples after every epoch, and doing data augmentation.
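To illustrate the ensemble point (shown in Python/NumPy for brevity; the same averaging works in Matlab, and the `netK_probs` names are hypothetical):

```python
import numpy as np

def ensemble_predict(prob_list):
    # Average the class-probability outputs of several independently trained
    # networks and return the predicted class for each sample.
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)  # (n_samples, n_classes)
    return np.argmax(avg, axis=1)

# Usage sketch: netK_probs would be the softmax output of the k-th trained net
# on the test set, each of shape (n_samples, 36).
# predictions = ensemble_predict([net1_probs, net2_probs, net3_probs])
```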