I am new to the field of neural networks and recently learned that a very low learning rate affects accuracy negatively. I understand that a very low learning rate makes training inefficient, but how is accuracy affected by a low learning rate?
When the learning rate is too large, it can cause the model to converge too quickly to a suboptimal solution; think of it as overshooting. Conversely, when the learning rate is too small, the model can get stuck in a local minimum (because of the tiny steps it takes), which again is not the optimal solution.
Therefore accuracy is affected, and the goal is to find the best learning rate for that model.
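To see both failure modes concretely, here is a minimal sketch (my own illustration, not from the question) using plain gradient descent on the one-dimensional function f(x) = x^2:

def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(lr=1.1))    # too large: |x| grows every step (overshooting, divergence)
print(gradient_descent(lr=0.001))  # too small: after 50 steps x is still close to 5.0
print(gradient_descent(lr=0.1))    # reasonable: x ends up very close to the minimum at 0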
Related
I'm training a network for image localization with the Adam optimizer, and someone suggested I use exponential decay. I don't want to try that because the Adam optimizer already adapts the learning rate on its own, but he insists and says he has done it before. So should I do that, and is there any theory behind this suggestion?
It depends. Adam updates each parameter with an individual learning rate, i.e. every parameter in the network has its own effective learning rate.
However, each of these per-parameter learning rates is computed using lambda (the initial learning rate) as an upper limit. This means that every individual learning rate can vary from 0 (no update) to lambda (the maximum update).
It's true that the learning rates adapt themselves during training, but if you want to be sure that no update step exceeds lambda, you can then lower lambda itself using exponential decay or whatever schedule you like.
This can help reduce the loss during the last stages of training, once the loss has stopped decreasing under the previously chosen lambda.
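To make the "lambda is an upper limit" point concrete, here is a minimal NumPy sketch of a single Adam update (my own illustration, not from the answer above); the per-parameter step is lr * m_hat / (sqrt(v_hat) + eps), and its magnitude stays roughly bounded by lr when the gradients are well behaved:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and optimizer state."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step, |step| roughly <= lr
    return theta - step, m, v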
In my experience it is usually not necessary to use learning rate decay with the Adam optimizer.
The theory is that Adam already handles learning rate adaptation (see the reference below):
"We propose Adam, a method for efficient stochastic optimization that
only requires first-order gradients with little memory requirement.
The method computes individual adaptive learning rates for different
parameters from estimates of first and second moments of the
gradients; the name Adam is derived from adaptive moment estimation."
As with any deep learning problem, YMMV: one size does not fit all, so try different approaches and see what works for you.
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so the loss won't begin to diverge after decreasing to a point. Here, I post the code to use Adam with learning rate decay using TensorFlow. Hope it is helpful to someone.
# Decay the learning rate by a factor of 0.95 every 10000 global steps,
# then pass the decayed rate to the Adam optimizer.
decayed_lr = tf.train.exponential_decay(learning_rate, global_step,
                                        10000, 0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)
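If you are on TensorFlow 2.x, a roughly equivalent sketch (assuming the same learning_rate and adam_epsilon variables as above) uses a Keras schedule instead of tf.train:

import tensorflow as tf

# Decay the learning rate by a factor of 0.95 every 10000 steps, as above.
decayed_lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=learning_rate,
    decay_steps=10000,
    decay_rate=0.95,
    staircase=True)
opt = tf.keras.optimizers.Adam(learning_rate=decayed_lr, epsilon=adam_epsilon)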
Adam has a single base learning rate, but it acts as an adaptive maximum rate, so I don't think many people use learning rate scheduling with it.
Due to its adaptive nature, the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find an optimal default rate beforehand: start with a very small rate and increase it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it's from the Cyclical Learning Rates paper.
Edit: People have fairly recently started using one-cycle learning rate policies in conjunction with Adam with great results.
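A minimal sketch of that range test (my own illustration; the model and dataset names are placeholders): raise the learning rate a little on every batch, record the loss, then pick the rate where the loss was falling fastest.

import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    """Multiply the learning rate each batch and record (lr, loss) pairs."""
    def __init__(self, start_lr=1e-7, factor=1.1):
        super().__init__()
        self.lr, self.factor, self.history = start_lr, factor, []

    def on_train_batch_begin(self, batch, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, self.lr)

    def on_train_batch_end(self, batch, logs=None):
        self.history.append((self.lr, logs["loss"]))
        self.lr *= self.factor  # exponential sweep upward

# Usage sketch: run one epoch, then plot finder.history and choose the learning rate
# on the steepest downward slope of the loss curve (not at its minimum).
# finder = LRFinder(); model.fit(train_ds, epochs=1, callbacks=[finder])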
Be careful when using weight decay with the vanilla Adam optimizer: the way vanilla Adam combines weight decay with its adaptive updates is flawed, as pointed out in Decoupled Weight Decay Regularization (https://arxiv.org/abs/1711.05101).
You should probably use the AdamW variant when you want to use Adam with weight decay.
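For example, in recent TensorFlow/Keras versions (an assumption about your setup; in PyTorch the analogue is torch.optim.AdamW) the decoupled variant is available directly:

import tensorflow as tf

# AdamW applies weight decay directly to the weights instead of folding it into the gradient.
opt = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)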
A simple alternative is to increase the batch size. A larger number of samples per update reduces the noise in the gradient estimate, which effectively makes the updates more cautious. If GPU memory limits the number of samples per update, you may have to fall back to the CPU and conventional RAM for training, which will obviously slow training down further.
From another point of view
All stochastic gradient descent (SGD) optimizers, including Adam, have randomization built in and give no guarantee of reaching a global minimum.
After several rounds of learning rate reduction, a satisfactory local extremum will be obtained.
So using learning rate decay will not help reach the global minimum, which is what it is supposedly meant to help with.
Also, if you use it, the learning rate will eventually become very small and the algorithm will become ineffective.
I am currently training a reinforcement learning agent with a simple neural network with 100 hidden units to play the game 2048. I am using the DQN reinforcement learning algorithm (i.e. Q-learning with replay memory), but with a 2-layer neural network instead of a deep one.
However, I left it training on my laptop overnight (~7 hours, ~1000 games played, >100000 steps) and the score does not seem to increase. I suspect three possible sources of error: a bug in my code, badly tuned parameters, or maybe I just haven't waited long enough.
Is there any method to figure out what is wrong with the code?
And what is the best practice to improve the training results?
I'll talk about all three of your hypotheses.
If you are using a standard DL framework like Caffe or TensorFlow, the chance of the problem being a bug is small.
Try decreasing the learning rate. Maybe you set it too high for the network to converge.
A training run of 100000 steps is not that long. For a simple Pong game you need around 500000 steps to get good performance, so you can try training for longer.
Also, 2048 is a fairly complicated game, so maybe your network is not deep enough to learn how to play it. Two layers is not much for such a complicated game; try increasing the number of hidden layers. Perhaps you can use the network provided here.
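If it helps, a deeper network in Keras could look like the sketch below (purely illustrative: the layer sizes are not tuned, and I'm assuming the 16 board cells are fed in as a flat vector with 4 output Q-values, one per move):

import tensorflow as tf

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(4),  # one Q-value per move: up, down, left, right
])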
I am using MXNet to train an 11-class image classifier and I am observing weird behavior: training accuracy was increasing slowly, went up to 39%, then in the next epoch dropped to 9% and stayed close to 9% for the rest of training.
I restarted training from the saved model (the one with 39% training accuracy), keeping all other parameters the same, and now training accuracy is increasing again. What can be the reason for this? I can't understand it, and it makes the model difficult to train this way because I have to watch the training accuracy constantly.
The learning rate is constant at 0.01.
As you can see, your late-stage accuracy is near random chance (about 9% for 11 classes). There are two common issues in this kind of case:
Your learning rate is too high; try lowering it.
The error (or entropy) you are using is giving you NaN values. If you use losses that involve log functions, you must apply them carefully (for example, never take log of 0).
It is common during training of neural networks for accuracy to improve for a while and then get worse -- in general this is caused by over-fitting. It's also fairly common for the network to "get unlucky" and get knocked into a bad part of parameter space corresponding to a sudden decrease in accuracy -- sometimes it can recover from this quickly, but sometimes not.
In general, lowering your learning rate is a good approach to this kind of problem. Also, setting a learning rate schedule like FactorScheduler can help you achieve more stable convergence by lowering the learning rate every few epochs. In fact, this can sometimes cover up mistakes in picking an initial learning rate that is too high.
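A minimal MXNet sketch of that (assuming the Gluon API and that net is your model): multiply the learning rate by a fixed factor every so many updates.

import mxnet as mx
from mxnet import gluon

# Drop the learning rate by a factor of 0.9 every 1000 updates (values are illustrative).
scheduler = mx.lr_scheduler.FactorScheduler(step=1000, factor=0.9)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'lr_scheduler': scheduler})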
I faced the same problem. I solved it by using a squared-error loss, (y - a)^2, instead of the cross-entropy function (because of log(0)). I hope there is a better solution to this problem.
These problems often come up. I observed that this may happen due to one of the following reasons:
Something returning NaN
The inputs of the network are not as expected - many modern frameworks do not raise errors in some such cases
The model layers get incompatible shapes at some point
It probably happened because 0*log(0) returns NaN.
You might avoid it with:
# Clip the predicted probabilities away from exact 0 so tf.log never receives 0.
cross_entropy = -tf.reduce_sum(labels*tf.log(tf.clip_by_value(logits,1e-10,1.0)))
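A more numerically robust alternative (assuming the labels are one-hot and that logits really are the unnormalized scores rather than probabilities, which the snippet above does not guarantee) is to let TensorFlow fuse the softmax and the log:

# Keeps the same sum reduction as above while avoiding log(0) internally.
cross_entropy = tf.reduce_sum(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))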
I am new to neural networks and, to get a grip on the subject, I have implemented a basic feed-forward MLP which I currently train through back-propagation. I am aware that there are more sophisticated and better ways to do that, but in Introduction to Machine Learning they suggest that, with one or two tricks, basic gradient descent can be effective for learning from real-world data. One of the tricks is an adaptive learning rate.
The idea is to increase the learning rate by a constant value a when the error gets smaller, and decrease it by a fraction b of the learning rate when the error gets larger. So basically the learning rate change is determined by:
+(a)
if we're learning in the right direction, and
-(b * <learning rate>)
if we're ruining our learning. However, the book above gives no advice on how to set these parameters. I wouldn't expect a precise suggestion, since parameter tuning is a whole topic on its own, but at least a hint on their order of magnitude. Any ideas?
Thank you,
Tunnuz
I haven't looked at neural networks in the longest time (10+ years), but after I saw your question I thought I would have a quick scout about. I kept seeing the same figures all over the internet for the increase (a) and decrease (b) factors: 1.2 and 0.5 respectively.
I have managed to track these values down to Martin Riedmiller and Heinrich Braun's RPROP algorithm (1992). Riedmiller and Braun are quite specific about sensible parameters to choose.
See: RPROP: A Fast Adaptive Learning Algorithm
I hope this helps.
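For what it's worth, the rule from the question with those numbers could be sketched like this (my own illustration; note that here both the increase and the decrease are multiplicative factors, as in RPROP, rather than an additive constant a):

def adapt_learning_rate(lr, error, prev_error, increase=1.2, decrease=0.5,
                        lr_min=1e-6, lr_max=1.0):
    """Grow the step while the error keeps falling, shrink it when the error rises."""
    if error < prev_error:
        lr = min(lr * increase, lr_max)   # learning in the right direction
    else:
        lr = max(lr * decrease, lr_min)   # we overshot; back off
    return lr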
Is a genetic algorithm the most efficient way to optimize the number of hidden nodes and the amount of training done on an artificial neural network?
I am coding neural networks using the NNToolbox in Matlab. I am open to any other suggestions of optimization techniques, but I'm most familiar with GA's.
Actually, there are multiple things that you can optimize using GA regarding NN.
You can optimize the structure (number of nodes, layers, activation function etc.).
You can also train using GA, that means setting the weights.
Genetic algorithms will never be the most efficient, but they are usually used when you have little clue as to what numbers to use.
For training, you can use other algorithms, including backpropagation, Nelder-Mead, etc.
You said you wanted to optimize the number of hidden nodes; for this a genetic algorithm may be sufficient, although far from "optimal". The space you are searching is probably too small to need a genetic algorithm, but it can still work and, afaik, it is already implemented in MATLAB, so no biggie.
What do you mean by optimizing the amount of training done? If you mean the number of epochs, that's fine; just remember that training depends somewhat on the initial weights, which are usually random, so the fitness function used for the GA won't really be a (deterministic) function.
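To make that concrete, a compact GA over (hidden_nodes, epochs) could look like the sketch below. Everything here is illustrative; train_and_score is a hypothetical function that trains a network with the given settings and returns validation accuracy, and because of the random initial weights it is a noisy fitness, as noted above.

import random

def evolve(train_and_score, pop_size=8, generations=10):
    """Tiny GA over (hidden_nodes, epochs); fitness is validation accuracy."""
    pop = [(random.randint(5, 200), random.randint(10, 500)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=train_and_score, reverse=True)
        parents = scored[:pop_size // 2]                 # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            h, e = random.choice(parents)
            children.append((max(1, h + random.randint(-10, 10)),   # mutate hidden nodes
                             max(1, e + random.randint(-50, 50))))  # mutate epochs
        pop = parents + children
    return max(pop, key=train_and_score)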
A good example of neural networks and genetic programming is the NEAT architecture (Neuro-Evolution of Augmenting Topologies). This is a genetic algorithm that finds an optimal topology. It's also known to be good at keeping the number of hidden nodes down.
They also made a game using this called NERO. The results are quite unique and tangibly impressive.
Dr. Stanley's homepage:
http://www.cs.ucf.edu/~kstanley/
Here you'll find just about everything NEAT related as he is the one who invented it.
Genetic algorithms can be usefully applied to optimising neural networks, but you have to think a little about what you want to do.
Most "classic" NN training algorithms, such as Back-Propagation, only optimise the weights of the neurons. Genetic algorithms can optimise the weights, but this will typically be inefficient. However, as you were asking, they can optimise the topology of the network and also the parameters for your training algorithm. You'll have to be especially wary of creating networks that are "over-trained" though.
One further technique, using a modified genetic algorithm, can be useful for overcoming a problem with back-propagation: back-propagation usually finds local minima, but it finds them accurately and rapidly. Combining a genetic algorithm with back-propagation, e.g. in a Lamarckian GA, gives the advantages of both. This technique is briefly described in the GAUL tutorial.
It is sometimes useful to use a genetic algorithm to train a neural network when your objective function isn't continuous.
I'm not sure whether you should use a genetic algorithm for this.
I suppose the initial solution population for your genetic algorithm would consist of training sets for your neural network (given a specific training method). Usually the initial solution population consists of random solutions to your problem. However, random training sets would not really train your neural network.
The evaluation function for your genetic algorithm would be a weighted average of the amount of training needed, the quality of the neural network on a specific problem, and the number of hidden nodes.
So, if you ran this, you would get the training set that delivered the best result in terms of neural network quality (= training time, number of hidden nodes, problem-solving capability of the network).
Or are you considering an entirely different approach?
I'm not entirely sure what kind of problem you're working with, but a GA sounds like a little bit of overkill here. Depending on the range of parameters you're working with, an exhaustive (or otherwise unintelligent) search may work. Try plotting your NN's performance with respect to the number of hidden nodes for the first few values, starting small and jumping by larger and larger increments. In my experience, many NNs plateau in performance surprisingly early; you may be able to get a good picture of what range of hidden node numbers makes the most sense.
The same is often true for NNs' training iterations. More training helps networks up to a point, but soon ceases to have much effect.
In the majority of cases, these NN parameters don't affect performance in a very complex way. Generally, increasing them increases performance for a while but then diminishing returns kick in. GA is not really necessary to find a good value on this kind of simple curve; if the number of hidden nodes (or training iterations) really does cause the performance to fluctuate in a complicated way, then metaheuristics like GA may be apt. But give the brute-force approach a try before taking that route.
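A sketch of that brute-force sweep (using scikit-learn's MLPClassifier as a stand-in for the Matlab NNToolbox; X_train, y_train, X_val, y_val are assumed to exist):

from sklearn.neural_network import MLPClassifier

# Coarse sweep over hidden-layer sizes, jumping by larger and larger increments.
for n_hidden in [2, 5, 10, 20, 50, 100, 200]:
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    print(n_hidden, clf.score(X_val, y_val))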
I would tend to say that a genetic algorithm is a good idea, since you can start with a minimal solution and grow the number of neurons. It is very likely that the "quality function" for which you want to find the optimal point is smooth and has only a few bumps.
If you have to find this optimal NN frequently, I would recommend using classical optimization algorithms, in your case quasi-Newton methods as described in Numerical Recipes, which are well suited to problems where the objective function is expensive to evaluate.