I am currently training a YOLO model to detect objects, and I have noticed that the loss sometimes seems to go in a loop. For example, during 20 minutes of training my loss stayed between 0.2 and 0.5: each time it decreased to 0.2, it climbed straight back to 0.5, over and over.
My question is: do I need to change my learning rate if the loss loops like this?
The learning rate is one possibility (not the only one). Tuning the learning rate (and scheduling its decay if needed) might be the most important thing in the training procedure.
You need to have a good sense of the loss value (roughly what you expect to reach, and what the loss is at the beginning of training).
Also, since YOLO is an object detection algorithm (I don't remember the details of the paper completely), check whether the classification loss, the regression loss, or both are high.
Also take a look at your data. You may need to shuffle your data before using it in your training.
It's a late answer but if you give me feedback about what I've mentioned I might be able to help more.
For example, I have trained a CNN on my data using a learning rate of 0.0003 and 10 epochs, with a minibatch size of 32. After training it, let's say I get an accuracy of 0.7. Now I want to adjust the learning rate and the minibatch size and train it again with the trainNetwork MATLAB function to see how the accuracy changes. My question is: does it train the model from scratch, or does it continue from the previously calculated weights? I want it to start from scratch every time I adjust the hyperparameters, to prevent overfitting. Sorry if this is obvious and I'm being dumb lol, I just wanna make sure.
It will start from scratch each time.
MATLAB does support transfer learning which can be useful if you want to fine tune a pretrained model, but you have to program it to do so. Here's an article on transfer learning in MATLAB (I guess so you can make sure you're not doing it!)
https://www.mathworks.com/help/deeplearning/ug/train-deep-learning-network-to-classify-new-images.html
I've seen some comments in online articles/tutorials and Stack Overflow questions which suggest that increasing the number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between the number of epochs and overfitting. So I'm looking for an answer which explains whether I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example comparing a deep model with a simpler model:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity in your words, or degrees of freedom) to train.
But if you use a very simple deep model on the same 10 inputs, with layer 1: 10 units, layer 2: 10 units, layer 3: 5 units, layer 4: 2 units, you will end up with 10*10 + 10*10 + 10*5 + 5*2 = 260 weights to train (287 if you also count the biases).
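To double-check the arithmetic for the small network above, here is a minimal sketch that counts the weights of a fully connected network. It assumes the 10 input variables feed layers of 10, 10, 5 and 2 units; the function name count_weights is just made up for this illustration. Note that the 10*5 term between layers 2 and 3 has to be included.

```python
def count_weights(layer_sizes, include_biases=False):
    """Count trainable parameters of a fully connected network.

    layer_sizes lists the input width followed by the width of each layer.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out      # one weight per connection
        if include_biases:
            total += fan_out           # one bias per unit
    return total

# 10 inputs, then layers with 10, 10, 5 and 2 units (assumed layout).
sizes = [10, 10, 10, 5, 2]
weights_only = count_weights(sizes)
with_biases = count_weights(sizes, include_biases=True)
print(weights_only, with_biases)
```

Either way, the count is several times larger than the 50 parameters of the logistic regression, which is the point of the comparison.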
Therefore, when we train a neural net for a long time, we usually end up with a memorized version of our data set (this gets worse if the data set is small and easy to memorize).
But as you also mentioned, there is no intrinsic reason why a higher number of epochs should result in overfitting. Early stopping is usually a very good way of avoiding this. Just set the patience to 5-10 epochs.
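For what "patience" means in practice, here is a rough sketch in plain Python rather than any particular framework's callback. The function name, the min_delta parameter and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would be stopped.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improving at first, then getting worse.
history = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.72, 0.75, 0.80]
stop_epoch = early_stopping_epoch(history, patience=3)
best_epoch = history.index(min(history))
print(stop_epoch, best_epoch)
```

In a real run you would also keep a checkpoint from the best epoch, since the stop happens a few epochs after the validation loss was at its lowest.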
If the number of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse), then running over the same data multiple times will not matter that much, since you will be learning some features of your problem rather than just memorizing the training data set. The problem arises when the number of parameters is comparable to your training data set size (or bigger). It is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, an L1 regularizer, constraining certain connections to be 0 or equal, as in a CNN).
The problem is that you might still be left with too many trainable parameters. A simple way to regularize even further is to use a small learning rate (i.e. don't learn too much from any particular example lest you memorize it) combined with monitoring over the epochs (if the gap between training and validation accuracy grows a lot, you are starting to overfit your model). You can then use the gap information to stop your training. This is a version of what is known as early stopping (stopping before you reach the minimum of your loss function).
In TensorFlow, I used to run CNN training for a fixed number of epochs and save checkpoints at a specified interval of epochs. To evaluate the model, I restore the checkpoints and run prediction on the validation dataset.
I want to automate the learning process instead of using a fixed number of epochs. Please explain how the loss value over mini-batches can be used to determine the stopping point. Also, please help me implement learning rate decay in TensorFlow. Which is better, constant decay or exponential, and how do I determine the decay factor?
First, for the number of iterations: you can exit the training when your loss has stopped improving, i.e. when the difference between two loss values AVERAGED across batches (to reduce batch fluctuations) is less than a chosen threshold.
But you have probably realized that the threshold is a hyperparameter too!
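As a rough illustration of that stopping rule, here is a sketch in plain Python that compares the mean loss over the most recent window of batches with the mean over the window before it. The names should_stop, window and threshold are made up for this example, and both the window size and the threshold are exactly the kind of hyperparameters mentioned above.

```python
def should_stop(batch_losses, window=100, threshold=1e-3):
    """Decide whether training has stopped improving.

    Compares the mean loss of the last `window` batches with the mean of the
    `window` batches before them. Averaging smooths out batch-to-batch noise.
    """
    if len(batch_losses) < 2 * window:
        return False  # not enough history yet
    recent = sum(batch_losses[-window:]) / window
    previous = sum(batch_losses[-2 * window:-window]) / window
    improvement = previous - recent
    # Also stops when the averaged loss got worse (negative improvement).
    return improvement < threshold

# Made-up loss curves for illustration.
improving = [1.0 - 0.01 * i for i in range(20)]
plateau = [0.5] * 20
print(should_stop(improving, window=5), should_stop(plateau, window=5))
```

Inside a training loop you would append each batch loss to the list and break out once should_stop returns True.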
In fact there are quite a few attempts to completely automate ML but no matter what you do you still end up with some hyperparameters.
Secondly, for the decay factor: it is used when you feel the loss has stopped improving and think that you are near a local minimum, oscillating in and out of the well without actually going in (this metaphor only really works in 2 dimensions, but I still find it useful).
Almost every time it is done in the literature it looks very hand-made: you train for, say, 200 epochs, you see that the loss has reached a plateau, so you decrease your learning rate with a step function (argument staircase=True in TF), and then you do it again.
A common choice is to divide the learning rate by 10 at each step (an exponential decay applied in staircase fashion), but as before it is quite arbitrary!
For details on how to implement learning rate decay in TF you can see dga's answer in this SO question.
It is pretty straightforward !
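To show what such a schedule computes, here is a plain Python sketch of exponential decay with an optional staircase. As far as I know it follows the formula documented for tf.train.exponential_decay (initial rate times decay_rate raised to global_step / decay_steps), but treat it as an illustration rather than the TF implementation. The numbers are arbitrary.

```python
import math

def exponential_decay(initial_lr, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Learning rate after `global_step` steps of exponential decay."""
    if staircase:
        # Integer division: the rate only changes every `decay_steps` steps.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return initial_lr * decay_rate ** exponent

# Divide the learning rate by 10 every 1000 steps (arbitrary example values).
for step in (0, 999, 1000, 2500):
    print(step, exponential_decay(0.1, step, 1000, 0.1, staircase=True))

# Without the staircase the rate shrinks smoothly between the steps.
smooth = exponential_decay(0.1, 500, 1000, 0.1)
```

With staircase=True the rate stays at 0.1 up to step 999 and then drops in one go, which matches the "train until a plateau, then cut the rate" recipe.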
What can help with the schedule and the values you use is cross-validation, but oftentimes you can simply look at your loss and do it by hand.
There is no silver bullet in deep learning; it is just trial and error.
I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try that because the Adam optimizer itself decays the learning rate. But that guy insists, and he said he has done it before. So should I do that, and is there any theory behind your suggestion?
It depends. Adam updates every parameter with an individual learning rate. This means that every parameter in the network has its own learning rate associated with it.
But the learning rate for each parameter is computed using lambda (the initial learning rate) as an approximate upper limit. This means that every single learning rate can vary from 0 (no update) to roughly lambda (maximum update).
It's true that the learning rates adapt themselves during the training steps, but if you want to make sure that the update steps stay below a smaller limit, you can then lower lambda using exponential decay or whatever.
It can help to reduce the loss during the last stage of training, when the loss computed with the previous value of lambda has stopped decreasing.
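To make the "lambda as upper limit" point concrete, here is a small sketch of a single Adam update for one scalar parameter, written in plain Python from the update rule in the Adam paper. The function name adam_update is made up, and the defaults are the commonly used ones. On the first step the size of the update comes out at about lr whatever the gradient scale; on later steps it is typically at or below that, though the paper's bound is only approximate.

```python
import math

def adam_update(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter.

    Returns the amount to subtract from the parameter plus the new moments.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return step, m, v

# Very different gradient magnitudes, same step size on the first update.
first_steps = [adam_update(g, 0.0, 0.0, t=1)[0] for g in (100.0, 1.0, 0.01)]
print(first_steps)
```

Since lr multiplies every per-parameter step, decaying lr with a schedule shrinks all of them at once, which is what the decay is adding on top of Adam's own adaptation.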
In my experience it is usually not necessary to do learning rate decay with the Adam optimizer.
The theory is that Adam already handles learning rate optimization (check reference) :
"We propose Adam, a method for efficient stochastic optimization that
only requires first-order gradients with little memory requirement.
The method computes individual adaptive learning rates for different
parameters from estimates of first and second moments of the
gradients; the name Adam is derived from adaptive moment estimation."
As with any deep learning problem YMMV, one size does not fit all, you should try different approaches and see what works for you, etc. etc.
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so that the loss won't begin to diverge after decreasing to a certain point. Here is the code to use Adam with learning rate decay in TensorFlow. Hope it is helpful to someone.
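import tensorflow as tf  # TensorFlow 1.x API

# learning_rate, global_step and adam_epsilon are assumed to be defined earlier.
# Multiply the learning rate by 0.95 once every 10000 steps (staircase=True).
decayed_lr = tf.train.exponential_decay(learning_rate,
                                        global_step, 10000,
                                        0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)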
Adam has a single learning rate, but it is a maximum rate that is adapted per parameter, so I don't think many people use learning rate scheduling with it.
Due to its adaptive nature the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find a good rate beforehand by starting with a very small rate and increasing it until the loss stops decreasing, then looking at the slope of the loss curve and picking the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it's from the Cyclical Learning Rates paper.
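Here is a rough sketch of the selection step of that range test, assuming you have already recorded the loss at a series of increasing learning rates. The function name and the numbers are invented for illustration, and taking the right-hand end of the steepest segment is just one possible convention.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the learning rate where the loss falls fastest.

    The slope is measured per unit of log10(lr), since range tests usually
    sweep the learning rate on a logarithmic scale.
    """
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        d_log_lr = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / d_log_lr
        if slope < best_slope:      # more negative means faster decrease
            best_slope = slope
            best_lr = lrs[i]
    return best_lr

# Invented sweep: the loss drops fastest around 1e-2 and blows up at 1.0.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.25, 1.90, 1.20, 1.00, 3.50]
suggested = steepest_descent_lr(lrs, losses)
lowest_loss_lr = lrs[losses.index(min(losses))]
print(suggested, lowest_loss_lr)
```

In this made-up sweep the suggested rate (1e-2) is smaller than the rate with the lowest recorded loss (1e-1), which is the distinction made above.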
Edit: People have fairly recently started using one-cycle learning rate policies in conjunction with Adam with great results.
Be careful when using weight decay with the vanilla Adam optimizer: weight decay implemented as an L2 penalty added to the gradient does not behave like true weight decay in Adam, as pointed out in the article Decoupled Weight Decay Regularization https://arxiv.org/abs/1711.05101 .
You should probably use the AdamW variant when you want to use Adam with weight decay.
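To see the difference, here is a rough sketch for a single scalar weight that compares the two variants. It is written from the update rules as I understand them from the paper, not copied from any library, and the function names are made up. With a zero data gradient, the L2 version shrinks the weight by about lr regardless of how big the weight or the decay coefficient is, because the penalty gets normalized along with the gradient. The decoupled version shrinks it in proportion to the weight.

```python
import math

def _moments(g, m, v, t, beta1, beta2):
    """Update the moments and return the bias-corrected estimates."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v, m / (1 - beta1 ** t), v / (1 - beta2 ** t)

def adam_l2_step(w, grad, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Vanilla Adam with the decay folded into the gradient (L2 penalty)."""
    m, v, m_hat, v_hat = _moments(grad + wd * w, m, v, t, beta1, beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay: the decay term bypasses the moment estimates."""
    m, v, m_hat, v_hat = _moments(grad, m, v, t, beta1, beta2)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v

# One step from w = 100 with zero data gradient (illustrative values).
w_l2, _, _ = adam_l2_step(100.0, 0.0, 0.0, 0.0, t=1, lr=0.001, wd=0.1)
w_adamw, _, _ = adamw_step(100.0, 0.0, 0.0, 0.0, t=1, lr=0.001, wd=0.1)
print(w_l2, w_adamw)
```

In practice you would use the AdamW optimizer shipped with your framework rather than rolling your own.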
A simple alternative is to increase the batch size. A larger number of samples per update averages out more of the gradient noise, which makes the optimizer's updates more cautious. If GPU memory limits the number of samples that can be tracked per update, you may have to resort to the CPU and conventional RAM for training, which will obviously slow training down further.
From another point of view
All stochastic gradient descent (SGD) optimizers, including Adam, have randomization built in and have no guarantee of reaching a global minimum.
After several rounds of learning rate reduction, a satisfying local extremum will be obtained, so using learning rate decay will not help reach the global minimum as it is supposed to.
Also, if you use it, the learning rate will eventually become very small and the algorithm will become ineffective.
Throughout the whole training process, the accuracy stays at 0.1. What am I doing wrong?
Model, solver and part of log here:
https://gist.github.com/yutkin/3a147ebbb9b293697010
Topology in png format:
P.S. I am using the latest version of Caffe and g2.2xlarge instance on AWS.
You're working on the CIFAR-10 dataset, which has 10 classes. When the training of a network commences, the first guess is usually random, so your accuracy is 1/N, where N is the number of classes. In your case that is 1/10, i.e. 0.1. If your accuracy stays the same over time, it implies that your network isn't learning anything. This may happen due to a large learning rate. The basic idea of training a network is that you calculate the loss and propagate it back. The gradients are multiplied by the learning rate and subtracted from the current weights and biases. If the learning rate is too big, you may overshoot the local minimum every time. If it is too small, convergence will be slow. I see that your base_lr here is 0.01. In my experience this is somewhat large. You may want to keep it at 0.001 in the beginning and then reduce it by a factor of 10 whenever you observe that the accuracy is not improving. Anything below 0.00001 usually doesn't make much of a difference, though. The trick is to observe the progress of the training and change parameters as and when required.
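To illustrate the overshooting, here is a toy sketch of plain gradient descent on f(w) = w**2 with three different learning rates. This is only a one-dimensional toy, not your Caffe network, and the specific rates only make sense for this function.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise f(w) = w**2 (gradient 2*w) with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2 * w
        w = w - lr * grad   # gradient times learning rate, subtracted
    return w

too_large = gradient_descent(lr=1.1)      # overshoots further on every step
reasonable = gradient_descent(lr=0.1)     # shrinks steadily towards 0
too_small = gradient_descent(lr=0.0001)   # barely moves in 20 steps
print(too_large, reasonable, too_small)
```

With the large rate the weight jumps across the minimum and ends up further away each time, so the loss never goes down. With the tiny rate it is still almost where it started.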
I know the thread is quite old but maybe my answer helps somebody. I experienced the same problem with an accuracy like a random guess.
What helped was to set the number of outputs of the last layer before the accuracy layer to the number of labels.
In your case that should be the ip2 layer. Open the model definition of your net and set num_output of that layer to the number of labels.
See Section 4.4 for more information: A Practical Introduction to Deep Learning with Caffe and Python