Must accuracy increase after every epoch? - neural-network

When training a neural neural network using batches, should accuracy (training and validation) increase after every epoch (after seeing the whole data an additional time)?
I want to be able to quickly judge if the network settings (learning rate, number of nodes.. etc) is reasonable. It also seemed necessary that the more the whole dataset is seen, the better the performance should be.
So, if the performance decreases at an epoch, should I be worried that something is wrong (high learning rate, high bias)? (Or do I always have to wait several epochs to judge?)

I would say it depends on dataset and architecture. Hence, fluctuations are normal, but in general loss should improve. You can have a look at these practical guides to better interpret loss curves:
http://cs231n.github.io/neural-networks-3/#loss
https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607

Yes, in a perfect world one would expect the test accuracy to increase. If the test accuracy starts to decrease it might be that your network is overfitting. You might want to stop the learning just before you reach that point or take other steps to counter the overfitting problem.
Also, it could be a result of noise in the test dataset, i.e. wrongly labeled examples.

Related

Is running more epochs really a direct cause of overfitting?

I've seen some comments in online articles/tutorials or Stack Overflow questions which suggest that increasing number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between number of epochs and overfitting. So I'm looking for answer which explains if I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example to compare a deep model with a simpler mode:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity in your word, or degree of freedom) to train.
But, if you use a very simple deep model with Layer 1: 10 unit, layer2: 10 units, layer3: 5 units, layer4: 2 units, you will end up with (10*10 + 10*10 + 5*2 = 210) parameters to train.
Therefore, usually when we train a neural net for a long time, we end of with a memorized version of our data set(this gets worse if our data set is small and easy to be memorized).
But as you also mentioned, there is no intrinsic reason why higher number of epochs result in overfitting. Early stopping is usually a very good way for avoiding this. Just set patience equal to 5-10 epochs.
If the amount of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse) then running over the same data multiple times will not be that significant, since you will be learning some features about your problem, rather than just memorizing the training data set. The problem arises when the amount of parameters is comparable to your training data set size (or bigger), it is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, L1 regularizer, constraining certain connections to be 0 or equal such as in CNN).
The problem is that might still be left with too many trainable parameters. A simple way to regularize even further is to have a small learning rate (i.e. don't learn too much from this particular example lest you memorize it) combined with monitoring the epochs (if there is a large gap increase between validation/training accuracy, you are starting to overfit your model). You can then use the gap info to stop your training. This is a version of what is known as early stopping (stop before you reach the minimum in your loss function).

adjust Learning Rate in a deep neural network

Currently I am training a YOLO model to detect object, but I have noted that sometimes the loss in the output is like in a loop, for example "in 20 minute of training my loss was between 0.2 and 0.5 each time that my program decrease to 0.2 it's automatically increase to 0.5 and it loop like that "
My question is: Do I need to change my learning rate if the loss loop?
Learning rate is a possibility (not the only one). Optimizing the learning rate (and also scheduling the decay if needed) might be the most important thing in training procedure.
You need to have a good sense about the loss value (roughly what do you expect to get and what is the value of loss in the beginning of training).
Also, since YOLO is an object detection algorithm (I don't remember the details of paper completely), is classification or regression or both losses are high.
Also take a look at your data. You may need to shuffle your data before using it in your training.
It's a late answer but if you give me feedback about what I've mentioned I might be able to help more.

When to stop CNN learning

In tensorflow, I used to execute cnn learning for fixed number of epochs and save checkpoints in between after specified number of epochs interval. For evaluating the model, the checkpoints are restored and perform prediction on the validation dataset.
I want to automate the learning process, instead of using fixed epochs. Please explain how the loss value over mini batches can be utilised for determining the stopping point? Also please help me towards implementing learning rate decay in tensorflow. Which is better constant decay or exponential and how to determine the decay factor?
First for the number of iterations you can exit the training if your loss stopped improving on the batch i.e. if the difference between two loss values AVERAGED accross batches (to reduce batch fluctuations) is less than a determined threshold.
But you probably realized that the threshold is an hyperparameter too !
In fact there are quite a few attempts to completely automate ML but no matter what you do you still end up with some hyperparameters.
Secondly for the decay factor it is used when you feel the loss has stopped improving and think that you are in a local minima and oscillating in and out of the well without actually going in (this metaphore only works when you have 2 dimensions but I find it usefull still).
Almost every time it is done in the litterature it looks very hand-made: like you train for 200 epochs you see that it reached a plateau so you decrease your lr with a step function (argument staircase=True in TF) and then again.
What is commonly used is to divide the learning rate by 10 (exponential decay) but like before it is very arbitrary !
For details on how to implement learning rate decay in TF you can see dga's answer in this SO question.
It is pretty straightforward !
What can help with the schedule and the values you use is cross-validation but oftentimes you can simply look at your loss and do it by hands.
There is no silver bullet in deep learning it is just trials and errors.

Accuracy in Caffe keeps on 0.1 and does not change

Through all training process, accuracy is 0.1. What am I doing wrong?
Model, solver and part of log here:
https://gist.github.com/yutkin/3a147ebbb9b293697010
Topology in png format:
P.S. I am using the latest version of Caffe and g2.2xlarge instance on AWS.
You're working on CIFAR-10 dataset which has 10 classes. When the training of a network commences, the first guess is usually random due to which your accuracy is 1/N, where N is the number of classes. In your case it is 1/10, i.e., 0.1. If your accuracy stays the same over time it implies that your network isn't learning anything. This may happen due to a large learning rate. The basic idea of training a network is that you calculate the loss and propagate it back. The gradients are multiplied with the learning rate and added to the current weights and biases. If the learning rate is too big you may overshoot the local minima every time. If it is too small, the convergence will be slow. I see that your base_lr here is 0.01. As far as my experience goes, this is somewhat large. You may want to keep it at 0.001 in the beginning and then go on reducing it by a factor of 10 whenever you observe that the accuracy is not improving. But then anything below 0.00001 usually doesn't make much of a difference. The trick is to observe the progress of the training and make parameter changes as and when required.
I know the thread is quite old but maybe my answer helps somebody. I experienced the same problem with an accuracy like a random guess.
What helped was to set the number of outputs of the last layer before the accuracy layer to the number of labels.
In your case that should be the ip2 layer. Open the model definition of your net and set num_outputs to the number of labels.
See Section 4.4 for more information: A Practical Introduction to Deep Learning with Caffe and Python

How to change the default parameters for newfit() in MATLAB?

I am using
net = newfit(in,out,lag(j),{'tansig','tansig'});
to generate a new neural network. The default value of the number of validation checks is 6.
I am training a lot of networks and this is taking a lot of time. I guess it doesn't matter if my results are a bit less accurate if they can be made considerably faster.
How can I train faster?
I believe one of the ways might be to reduce the value of the number of validation checks. How can I do that (in code, not using GUI)
Is there some other way to increase speed.
As I said, the increase in speed may be at a little loss of accuracy.
Just to extend #mtrw answer, according to the documentation, training stops when any of these conditions occurs:
The maximum number of epochs is reached: net.trainParam.epochs
The maximum amount of time is exceeded: net.trainParam.time
Performance is minimized to the goal: net.trainParam.goal
The performance gradient falls below min_grad: net.trainParam.min_grad
mu exceeds mu_max: net.trainParam.mu_max
Validation performance has increased more than max_fail times since
the last time it decreased (when using validation): net.trainParam.max_fail
Epochs and time contraints allows to put an upper bound on the training duration.
Goal constraint stop the training when the performance (error) drops below it, and usually allows you to adjust the level of time/accuracy trade-off: less accurate results for faster execution.
This is similar to min_grad (gradient tells you the strength of the "descent") in that if the magnitude of the gradient is less than mingrad, training stops. It can be understood by the fact that if the error function is not changing by much, then we are reaching a plateau and we should probably stop training since we are not going to improve by much.
mu, mu_dec, and mu_max are used to control the weight updating process (backpropagation).
max_fail is usually used to avoid over-fitting, not so much for speedup.
My advice, set time and epochs to the maximum possible that your application constraints allow (otherwise the results will be poor). And in turn, you can control goal and min_grad to reach the level of speed/accuracy trade-off desired. Keep in mind that max_fails wont make you gain any time, since its mainly used to assure good generalization power.
(Disclaimer: I don't have the neural network toolbox, so I'm only extrapolating from the Mathworks documentation)
It looks from your input parameters like you're using TRAINLM. According to the documentation, you can set the net.trainParam.max_fail parameter to change the validation checks.
You can set the initial mu value, as well as the increment and decrement factors. But this would require some insight into the expected answer and performance of the search.
For a more blunt approach, you can also control the maximum number of iterations by setting the net.trainParam.epochs parameter to something less than its default 100. You might also set the net.trainParam.time parameter to limit the number of seconds.
You should probably set net.trainParam.show to NaN to skip any displays.
Neural nets are treated as objects in MATLAB. To access any parameter before (or after) training, you need to access the network's properties using the . operator.
In addition to mtrw's and Amro's answers, make MATLAB's Neural Network Toolbox documentation your new best friend. It will usually explain things in much better detail.