Should we do learning rate decay for adam optimizer - neural-network

I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try that because the Adam optimizer itself decays the learning rate. But that guy insists, and he said he has done it before. So should I do that, and is there any theory behind your suggestion?

It depends. Adam updates every parameter with an individual learning rate. This means that every parameter in the network has its own learning rate associated with it.
But the individual learning rate for each parameter is computed using lambda (the initial learning rate) as an upper limit. This means that every individual learning rate can vary from 0 (no update) to lambda (maximum update).
It's true that the learning rates adapt themselves during training, but if you want to be sure that no update step exceeds lambda, you can then lower lambda using exponential decay or some other schedule.
This can help to reduce the loss during the later stages of training, when the loss computed with the previous value of lambda has stopped decreasing.
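To make the role of lambda concrete, here is a minimal sketch of a scalar Adam update whose base rate is decayed exponentially. It is plain Python rather than any library's implementation, and the helper names (`decayed_rate`, `adam_steps`) and the decay constants are assumptions chosen only for illustration. With a constant gradient, each step has a magnitude close to the current base rate, so lowering lambda shrinks the updates directly.

```python
import math

def decayed_rate(base_rate, step, decay_steps=100, decay_rate=0.5):
    """Exponentially decayed base rate: base_rate * decay_rate ** (step / decay_steps)."""
    return base_rate * decay_rate ** (step / decay_steps)

def adam_steps(grads, base_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8, decay=True):
    """Return the update magnitudes Adam would take for one scalar parameter."""
    m = v = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        lam = decayed_rate(base_rate, t - 1) if decay else base_rate
        sizes.append(abs(lam * m_hat / (math.sqrt(v_hat) + eps)))
    return sizes

constant_grads = [2.0] * 300                     # hypothetical gradient sequence
plain = adam_steps(constant_grads, decay=False)
decayed = adam_steps(constant_grads, decay=True)
print(plain[0], plain[-1])      # both close to the base rate
print(decayed[0], decayed[-1])  # the last step is much smaller than the first
```

Real gradients are noisy, so the actual step is usually smaller than lambda. The sketch only shows that lambda sets the scale.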

In my experience it is usually not necessary to do learning rate decay with the Adam optimizer.
The theory is that Adam already handles learning rate optimization (see the reference):
"We propose Adam, a method for efficient stochastic optimization that
only requires first-order gradients with little memory requirement.
The method computes individual adaptive learning rates for different
parameters from estimates of first and second moments of the
gradients; the name Adam is derived from adaptive moment estimation."
As with any deep learning problem YMMV, one size does not fit all, you should try different approaches and see what works for you, etc. etc.

Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so that the loss won't begin to diverge after decreasing to a point. Here is the code to use Adam with learning rate decay in TensorFlow. I hope it is helpful to someone.
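import tensorflow as tf  # TensorFlow 1.x API

# learning_rate, global_step and adam_epsilon are defined elsewhere.
# Multiply the learning rate by 0.95 every 10000 steps (staircase decay).
decayed_lr = tf.train.exponential_decay(learning_rate,
                                        global_step, 10000,
                                        0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)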
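For readers who want to see what that schedule computes, here is a plain-Python sketch of the exponential decay formula (rate times decay_rate raised to global_step / decay_steps, with integer division when staircase is on). The function is a stand-in written for illustration, not the TensorFlow implementation, and the starting rate of 0.001 is an assumed value.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """Plain-Python version of the exponential decay formula."""
    exponent = global_step / decay_steps
    if staircase:
        # Integer division: the rate changes in discrete jumps.
        exponent = global_step // decay_steps
    return learning_rate * decay_rate ** exponent

# Assumed starting rate of 0.001 with the constants used in the answer.
for step in (0, 9999, 10000, 25000):
    print(step, exponential_decay(0.001, step, 10000, 0.95, staircase=True))
```

With staircase=True the rate stays constant for 10000 steps and then drops by 5%. Without it the rate shrinks a little on every step.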

Adam has a single learning rate, but it acts as a maximum rate and the actual step is adaptive, so I don't think many people use learning rate scheduling with it.
Due to the adaptive nature, the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find an optimal default rate beforehand by starting with a very small rate and increasing it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate that is associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it's from the Cyclical Learning Rates paper.
Edit: People have fairly recently started using one-cycle learning rate policies in conjunction with Adam with great results.
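The range test described above can be sketched on a toy problem. This is only an illustration under stated assumptions: the "model" is a single weight trained by plain gradient descent on 0.5 * w**2, the function names are made up, and the steepest decrease is measured as the raw per-step drop in loss (practical implementations usually smooth the loss and look at the slope against the log of the rate).

```python
def lr_range_test(lr_min=1e-4, lr_max=10.0, num_steps=100, w0=10.0):
    """Sweep the rate geometrically while training; record (lr, loss before, loss after)."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    w, lr = w0, lr_min
    history = []
    for _ in range(num_steps):
        loss_before = 0.5 * w * w
        w -= lr * w                      # the gradient of 0.5 * w**2 is w
        loss_after = 0.5 * w * w
        history.append((lr, loss_before, loss_after))
        lr *= ratio
    return history

def steepest_lr(history):
    """Pick the rate whose step produced the largest drop in loss."""
    return min(history, key=lambda rec: rec[2] - rec[1])[0]

history = lr_range_test()
best = steepest_lr(history)
lowest = min(history, key=lambda rec: rec[2])[0]
print("steepest decrease at lr =", best)
print("lowest loss reached at lr =", lowest)
```

On this toy problem the rate with the steepest decrease is smaller than the rate at which the loss bottoms out, which is the reason for not simply picking the lowest point of the curve.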

Be careful when using weight decay with the vanilla Adam optimizer. As pointed out in the article Decoupled Weight Decay Regularization ( https://arxiv.org/abs/1711.05101 ), the usual way of adding weight decay to Adam, as an L2 term added to the gradient, is not equivalent to true weight decay.
You should probably use the AdamW variant when you want to use Adam with weight decay.
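A small sketch of the difference, in plain Python with made-up helper names. It takes the first Adam step for one scalar weight with a zero data gradient, so only the decay term acts. The decoupled variant here scales the decay by the learning rate, which is one common convention and an assumption on my part rather than something stated above.

```python
import math

def adam_direction(g, t=1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Normalised Adam direction for step t, starting from zero moment estimates."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)

def adam_l2_step(w, grad, lr, wd):
    """Vanilla Adam with L2 regularisation: the decay term passes through the moments."""
    return w - lr * adam_direction(grad + wd * w)

def adamw_step(w, grad, lr, wd):
    """Decoupled weight decay: the decay term bypasses the moments."""
    return w - lr * adam_direction(grad) - lr * wd * w

w, grad, lr, wd = 1.0, 0.0, 0.1, 0.01
print(adam_l2_step(w, grad, lr, wd))  # about 0.9: the tiny decay term is rescaled to a full step
print(adamw_step(w, grad, lr, wd))    # about 0.999: the weight shrinks by lr * wd * w
```

In the L2 version the decay term is normalised like any other gradient, so its strength no longer matches the coefficient you set.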

A simple alternative is to increase the batch size. A larger number of samples per update gives a less noisy gradient estimate, which makes the updates more cautious. If GPU memory limits the number of samples that can be processed per update, you may have to fall back to the CPU and conventional RAM for training, which will obviously slow training down further.

From another point of view:
All stochastic gradient descent (SGD) optimizers, including Adam, have randomization built in and offer no guarantee of reaching a global minimum.
After the learning rate has been reduced several times, a satisfactory local extremum will be obtained, so using learning rate decay will not help reach a global minimum as it is supposed to.
Also, if you use it, the learning rate will eventually become very small and the algorithm will become ineffective.

Related

adjust Learning Rate in a deep neural network

Currently I am training a YOLO model to detect objects, but I have noticed that sometimes the loss seems to go in a loop. For example, over 20 minutes of training my loss stayed between 0.2 and 0.5: each time it decreased to 0.2, it went straight back up to 0.5, and it kept looping like that.
My question is: do I need to change my learning rate if the loss loops?
The learning rate is one possibility (not the only one). Optimizing the learning rate (and also scheduling its decay if needed) might be the most important thing in the training procedure.
You need to have a good sense of the loss value (roughly what you expect to reach and what the loss is at the beginning of training).
Also, since YOLO is an object detection algorithm (I don't remember the details of the paper completely), check whether the classification loss, the regression loss, or both are high.
Also take a look at your data. You may need to shuffle your data before using it for training.
It's a late answer but if you give me feedback about what I've mentioned I might be able to help more.

When to stop CNN learning

In TensorFlow, I used to run CNN training for a fixed number of epochs and save checkpoints at a specified interval of epochs. To evaluate the model, the checkpoints are restored and used for prediction on the validation dataset.
I want to automate the learning process instead of using a fixed number of epochs. Please explain how the loss value over mini-batches can be used to determine the stopping point. Also, please help me implement learning rate decay in TensorFlow. Which is better, constant decay or exponential decay, and how do I determine the decay factor?
First, for the number of iterations: you can stop training if your loss has stopped improving, i.e. if the difference between two loss values AVERAGED across batches (to reduce batch fluctuations) is less than a chosen threshold.
But you have probably realized that the threshold is a hyperparameter too!
In fact, there have been quite a few attempts to completely automate ML, but no matter what you do, you still end up with some hyperparameters.
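The stopping rule above can be sketched in a few lines of plain Python. The function name, the window size and the threshold are all assumptions for illustration, and the loss curve is a made-up one that decays quickly and then flattens.

```python
def should_stop(batch_losses, window=50, threshold=1e-3):
    """Stop when the mean loss of the last `window` batches improves on the
    mean of the previous `window` batches by less than `threshold`."""
    if len(batch_losses) < 2 * window:
        return False                       # not enough history yet
    previous = sum(batch_losses[-2 * window:-window]) / window
    recent = sum(batch_losses[-window:]) / window
    return previous - recent < threshold

# Toy loss curve standing in for real mini-batch losses.
losses = []
for step in range(2000):
    losses.append(1.0 / (1.0 + 0.05 * step))
    if should_stop(losses):
        print("stopping at step", step)
        break
```

Both `window` and `threshold` are exactly the kind of extra hyperparameters mentioned above.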
Secondly, the decay factor is used when you feel the loss has stopped improving and think that you are near a local minimum, oscillating in and out of the well without actually going in (this metaphor only works when you have 2 dimensions, but I still find it useful).
Almost every time it is done in the literature, it looks very hand-made: you train for, say, 200 epochs, you see that the loss has reached a plateau, so you decrease your learning rate with a step function (argument staircase=True in TF), and then do it again.
A common choice is to divide the learning rate by 10 each time (a stepwise exponential decay), but as before it is very arbitrary!
For details on how to implement learning rate decay in TF, you can see dga's answer in this SO question.
It is pretty straightforward!
Cross-validation can help with the schedule and the values you use, but oftentimes you can simply look at your loss and do it by hand.
There is no silver bullet in deep learning; it is just trial and error.
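The "divide by 10 when the loss plateaus" recipe can also be automated instead of done by hand. Below is a minimal framework-free sketch. The class name, the patience of 3 epochs and the improvement margin are assumptions for illustration, not values from any paper.

```python
class DivideOnPlateau:
    """Divide the learning rate by `factor` when the loss has not improved
    by at least `min_delta` for `patience` consecutive epochs."""

    def __init__(self, lr, factor=10.0, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def update(self, epoch_loss):
        if epoch_loss < self.best - self.min_delta:
            self.best = epoch_loss          # real improvement: reset the counter
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor      # plateau: take one step down
                self.stale_epochs = 0
        return self.lr

schedule = DivideOnPlateau(lr=0.1)
for loss in [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6]:
    print(loss, schedule.update(loss))
```

The rate stays at 0.1 while the loss improves and drops to about 0.01 after three epochs without improvement.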

In what order should we tune hyperparameters in Neural Networks?

I have quite a simple ANN using TensorFlow and AdamOptimizer for a regression problem, and I am now at the point of tuning all the hyperparameters.
For now, I saw many different hyperparameters that I have to tune :
Learning rate : initial learning rate, learning rate decay
The AdamOptimizer needs 4 arguments (learning-rate, beta1, beta2, epsilon) so we need to tune them - at least epsilon
batch-size
nb of iterations
Lambda L2-regularization parameter
Number of neurons, number of layers
what kind of activation function for the hidden layers, for the output layer
dropout parameter
I have 2 questions :
1) Do you see any other hyperparameter I might have forgotten ?
2) For now, my tuning is quite "manual" and I am not sure I am doing everything in a proper way.
Is there a special order to tune the parameters ? E.g learning rate first, then batch size, then ...
I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not independent ? Should we then tune them together ?
Is there any paper or article which talks about properly tuning all the parameters in a special order ?
EDIT :
Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve looks completely weird to me, because its cost decreases much more slowly than the others, but it got stuck at a lower accuracy rate. Is it possible that the model is stuck in a local minimum?
[Figures: accuracy and cost curves]
For the learning rate, I used the decay :
LR(t) = LRI/sqrt(epoch)
Thanks for your help !
Paul
My general order is:
Batch size, as it will largely affect the training time of future experiments.
Architecture of the network:
Number of neurons in the network
Number of layers
Rest (dropout, L2 reg, etc.)
Dependencies:
I'd assume that the optimal values of
learning rate and batch size
learning rate and number of neurons
number of neurons and number of layers
strongly depend on each other. I am not an expert in that field, though.
As for your hyperparameters:
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
You could also tune the initial weight values (see All you need is a good init). Although, the Xavier initializer seems to be a good way to prevent having to tune the weight inits.
I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
Get Tensorboard running. Plot the error there. You'll need to create subdirectories in the path where TB looks for the data to plot. I do that subdir creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
For parameters that are less important you can probably just pick a reasonable value and stick with it.
Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.
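The random-sampling procedure just described might look like this in plain Python. Everything specific here is an assumption for illustration: the ranges, the hyperparameter names, and especially `validation_error`, which is a made-up stand-in for "train a model with this setting and return its validation error".

```python
import math
import random

def sample_config(rng):
    """Draw one setting; scale-like parameters are sampled log-uniformly."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "l2": 10 ** rng.uniform(-6, -2),
        "dropout": rng.uniform(0.0, 0.5),
        "batch_size": rng.choice([32, 64, 128, 256]),
    }

def validation_error(config):
    """Hypothetical objective; replace with real training and evaluation."""
    return (math.log10(config["learning_rate"]) + 3) ** 2 + abs(config["dropout"] - 0.2)

def random_search(num_trials=50, seed=0):
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return min(trials, key=validation_error)

best = random_search()
print(best)
```

Sampling the learning rate and the L2 strength on a log scale gives every order of magnitude the same chance of being tried.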
I don't know of any tool specific to TensorFlow, but the best strategy is to start with basic hyperparameter values such as a learning rate of 0.01 or 0.001 and a weight_decay of 0.005 or 0.0005, and then tune them. Doing it manually will take a lot of time. If you are using Caffe, the following is the best option: it takes the hyperparameters from a set of input values and gives you the best set.
https://github.com/kuz/caffe-with-spearmint
for more information, you can follow this tutorial as well:
http://fastml.com/optimizing-hyperparams-with-hyperopt/
For the number of layers, what I suggest is to first build a smaller network and increase the amount of data; once you have sufficient data, increase the model complexity.
Before you begin:
Set the batch size to the maximum (or the largest power of 2) that works on your hardware. Simply increase it until you get a CUDA error (or system RAM usage > 90%).
Set regularizers to low values.
The architecture and exact numbers of neurons and layers - use known architectures as inspirations and adjust them to your specific performance requirements: more layers and neurons -> possibly a stronger, but slower model.
Then, if you want to do it one by one, I would go like this:
Tune learning rate in a wide range.
Tune other parameters of the optimizer.
Tune regularizers (dropout, L2 etc).
Fine tune learning rate - it's the most important hyper-parameter.

Neural Network learning rate and batch weight update

I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.
I've read that batch updates of the weights will give a more stable gradient search than an online weight update.
As a test I've created a time series function of 100 points, such that x = [0..99] and y = f(x). I've created a Neural Network with one input and one output and 2 hidden layers with 10 neurons for testing. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.
I have 100 input points so when I calculate the weight change dw_{ij} for each node it is actually a sum:
dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}
where p = 100 in this case.
Now the weight updates become really huge and therefore my error E bounces around such that it is hard to find a minimum. The only way I got some proper behaviour was when I set the learning rate y to something like 0.7 / p^2.
Is there some general rule for setting the learning rate, based on the number of samples?
http://francky.me/faqai.php#otherFAQs :
Subject: What learning rate should be used for backprop?
In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate
makes the weights and objective function diverge, so there is no learning at all. If the objective function is
quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and
Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs
with hidden units, the optimal learning rate often changes dramatically during the training process, since
the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a
tedious process requiring much trial and error. For some examples of how the choice of learning rate and
momentum interact with numerical condition in some very simple networks, see
ftp://ftp.sas.com/pub/neural/illcond/illcond.html
With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use
standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist
(see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned
under "What are conjugate gradients, Levenberg-Marquardt, etc.?").
Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as
standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of
the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a
large step size; this happens when you initialize a network with small random weights. In other regions of
the weight space, the gradient is small and you need a small step size; this happens when you are close to a
local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms
try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute
the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The
great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the
magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the
learning rate during training. Various proposals have appeared in the NN literature, but most of them don't
work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who
unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and
Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate.
There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging"
(Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running
average of the weight values. I have no personal experience with these methods; if you have any solid
evidence that these or other methods of automatically setting the learning rate and/or momentum in
incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer
(saswss#unx.sas.com).
References:
Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., eds. Advances in Neural Information Processing Systems 4, San Mateo, CA: Morgan Kaufmann Publishers, pp. 1009-1016.
Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.
LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), "Automatic learning rate maximization by online estimation of the Hessian's eigenvectors," in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 156-163.
Orr, G.B. and Leen, T.K. (1997), "Using curvature information for fast stochastic search," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 606-612.
Credits:
Archive-name: ai-faq/neural-nets/part1
Last-modified: 2002-05-17
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss#unx.sas.com (Warren S. Sarle)
Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC, USA.
A simple solution would be to average the weight changes over the batch instead of summing them. This way you can just use a learning rate of 0.7 (or any other value of your liking), without having to worry about optimizing yet another parameter.
More interesting information about batch updating and learning rates can be found in this article by Wilson (2003).
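To see why averaging helps, here is a small plain-Python sketch on a made-up one-weight regression problem (the data, the function names and the squared-error loss are all assumptions for illustration). Summing over p samples makes the step p times larger than averaging, so a summed update with rate 0.7 / p matches an averaged update with rate 0.7.

```python
def sample_gradients(w, xs, ys):
    """Per-sample gradients of the squared error 0.5 * (w * x - y) ** 2 with respect to w."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def summed_update(w, xs, ys, lr):
    return w - lr * sum(sample_gradients(w, xs, ys))

def averaged_update(w, xs, ys, lr):
    grads = sample_gradients(w, xs, ys)
    return w - lr * sum(grads) / len(grads)

xs = [i / 100 for i in range(100)]      # p = 100 inputs scaled to [0, 1)
ys = [3.0 * x for x in xs]              # toy target y = 3x
p = len(xs)

w = 0.0
print(summed_update(w, xs, ys, lr=0.7))        # step is p times larger
print(averaged_update(w, xs, ys, lr=0.7))
print(summed_update(w, xs, ys, lr=0.7 / p))    # matches the averaged update
```

With averaging, the same learning rate stays usable when the number of samples per batch changes.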

Which multiplication and addition factor to use when doing adaptive learning rate in neural networks?

I am new to neural networks and, to get a grip on the matter, I have implemented a basic feed-forward MLP which I currently train through back-propagation. I am aware that there are more sophisticated and better ways to do that, but in Introduction to Machine Learning they suggest that with one or two tricks, basic gradient descent can be effective for learning from real world data. One of the tricks is adaptive learning rate.
The idea is to increase the learning rate by a constant value a when the error gets smaller, and decrease it by a fraction b of the learning rate when the error gets larger. So basically the learning rate change is determined by:
+(a)
if we're learning in the right direction, and
-(b * <learning rate>)
if we're ruining our learning. However, in the above book there's no advice on how to set these parameters. I wouldn't expect a precise suggestion since parameter tuning is a whole topic on its own, but just a hint at least on their order of magnitude. Any ideas?
Thank you,
Tunnuz
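For reference, the rule described in the question can be written as a short plain-Python sketch. The values a = 0.01 and b = 0.5 are placeholders so that the example runs, not recommendations, and leaving the rate unchanged when the error does not move is my own assumption.

```python
def adapt_learning_rate(lr, previous_error, current_error, a, b):
    """Add `a` when the error fell; subtract the fraction `b` of the rate when it rose."""
    if current_error < previous_error:
        return lr + a
    if current_error > previous_error:
        return lr - b * lr
    return lr                           # unchanged error: keep the rate (assumption)

lr = 0.1
errors = [1.0, 0.9, 0.8, 0.85, 0.7]     # made-up error sequence
for previous, current in zip(errors, errors[1:]):
    lr = adapt_learning_rate(lr, previous, current, a=0.01, b=0.5)
    print(current, lr)
```

The increase is additive and the decrease is multiplicative, so one bad step undoes several good ones.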
I haven't looked at neural networks for the longest time (10 years+) but after I saw your question I thought I would have a quick scout about. I kept seeing the same figures all over the internet in relation to increase(a) and decrease(b) factor (1.2 & 0.5 respectively).
I have managed to track these values down to Martin Riedmiller and Heinrich Braun's RPROP algorithm (1992). Riedmiller and Braun are quite specific about sensible parameters to choose.
See: RPROP: A Fast Adaptive Learning Algorithm
I hope this helps.
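To show where the 1.2 and 0.5 factors act, here is a simplified sketch of an Rprop-style step-size rule for a single weight. It leaves out the weight-backtracking step of the original algorithm, and the initial step size and the bounds (0.1, 1e-6, 50.0) are values I believe match the paper's suggestions, so treat them as assumptions. Note that the factors multiply a per-weight step size rather than a global learning rate.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop_minimise(grad_fn, w, steps=50, eta_plus=1.2, eta_minus=0.5,
                   delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimise a scalar function using only the sign of its gradient."""
    delta, previous_grad = delta0, 0.0
    for _ in range(steps):
        grad = grad_fn(w)
        if grad * previous_grad > 0:        # same direction: grow the step
            delta = min(delta * eta_plus, delta_max)
        elif grad * previous_grad < 0:      # overshoot: shrink the step
            delta = max(delta * eta_minus, delta_min)
        w -= sign(grad) * delta             # step size ignores gradient magnitude
        previous_grad = grad
    return w

# Toy objective (w - 3) ** 2, whose gradient is 2 * (w - 3).
w_final = rprop_minimise(lambda w: 2.0 * (w - 3.0), w=-20.0)
print(w_final)
```

The step grows while the gradient keeps its sign and is halved whenever the sign flips, so the weight settles close to the minimum at 3.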