Which multiplication and addition factor to use when doing adaptive learning rate in neural networks? - neural-network

I am new to neural networks and, to get a grip on the matter, I have implemented a basic feed-forward MLP which I currently train through back-propagation. I am aware that there are more sophisticated and better ways to do that, but in Introduction to Machine Learning they suggest that, with one or two tricks, basic gradient descent can be effective for learning from real-world data. One of the tricks is an adaptive learning rate.
The idea is to increase the learning rate by a constant value a when the error gets smaller, and decrease it by a fraction b of the learning rate when the error gets larger. So basically the learning rate change is determined by:
+(a)
if we're learning in the right direction, and
-(b * <learning rate>)
if we're ruining our learning. However, the book gives no advice on how to set these parameters. I wouldn't expect a precise suggestion, since parameter tuning is a whole topic on its own, but at least a hint on their order of magnitude. Any ideas?
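In code, the rule I have in mind looks roughly like this (the values of a and b below are placeholders; they are exactly what I'm asking about):

def update_learning_rate(lr, prev_error, curr_error, a=0.01, b=0.5):
    # Increase additively when the error shrinks; decrease by a
    # fraction of the current rate when it grows.
    if curr_error < prev_error:
        return lr + a
    return lr * (1 - b)

lr = update_learning_rate(lr=0.1, prev_error=0.52, curr_error=0.48)  # -> 0.11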

I haven't looked at neural networks for the longest time (10+ years), but after I saw your question I thought I would have a quick scout about. I kept seeing the same figures all over the internet for the increase (a) and decrease (b) factors: 1.2 and 0.5 respectively.
I have managed to track these values down to Martin Riedmiller and Heinrich Braun's RPROP algorithm (1992). Riedmiller and Braun are quite specific about sensible parameters to choose.
See: RPROP: A Fast Adaptive Learning Algorithm
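For reference, here is a simplified sketch of the RPROP per-weight step-size update using those two factors; the clamping bounds (1e-6 and 50) are the commonly quoted defaults, which I'm assuming here rather than quoting from the paper:

import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    # Grow each weight's step size while its gradient keeps the same sign;
    # shrink it when the sign flips (we overshot a minimum).
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, step  # only the gradient's sign is used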
I hope this helps.

Related

Confusion in Backpropagation

I've started working on forward and back propagation of neural networks. I've coded it as well, and it works properly too. But I'm confused about the algorithm itself. I'm new to neural networks.
So forward propagation of a neural network is finding the right label with the given weights?
And back-propagation is using forward propagation to find the most error-free parameters by minimizing the cost function, and then using these parameters to help classify other training examples? And this is called a trained neural network?
I feel like there is a big blunder in my concept; if there is, please let me know where I'm wrong and why.
I will try my best to explain forward and back propagation in a detailed yet simple to understand manner, although it's not an easy topic to do.
Forward Propagation
Forward propagation is the process in a neural network whereby, during the runtime of the network, values are fed into the front of the neural network (the inputs). You can imagine that these values then travel across the weights, which multiply the original input values by themselves. They then arrive at the hidden layer (neurons). Neurons vary quite a lot between different types of networks, but here is one way of explaining it. When the values reach a neuron, every value being fed into the neuron is summed up and then fed into an activation function. This activation function can be very different depending on the use case, but let's take for example a step (threshold) activation function: it takes the summed value being fed into it and outputs a 0 or 1 depending on whether a threshold is crossed. The result is then fed through more weights and spat out at the outputs, which is the last step of the network.
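As a minimal sketch of that flow (the shapes, random weights, and threshold value here are illustrative, not canonical):

import numpy as np

def forward(x, w_hidden, w_out, threshold=0.5):
    hidden_in = x @ w_hidden                            # values travel across the weights
    hidden_out = (hidden_in > threshold).astype(float)  # step activation: 0 or 1
    return hidden_out @ w_out                           # through the output weights

x = np.array([0.2, 0.9])          # two input values
w_hidden = np.random.rand(2, 3)   # 2 inputs -> 3 hidden neurons
w_out = np.random.rand(3, 1)      # 3 hidden neurons -> 1 output
print(forward(x, w_hidden, w_out))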
Back Propagation
Back propagation is just like forward propagation except we work backwards from where we were in forward propagation.
The aim of back propagation is to reduce the error during the training phase (to get the neural network as accurate as possible). The way this is done is by going backwards through the weights and layers. At each weight the error is calculated, and each weight is individually adjusted using an optimization algorithm, which is exactly what it sounds like: it optimizes the weights, adjusting their values to make the neural network more accurate.
Some optimization algorithms include gradient descent and stochastic gradient descent. I will not go through the details in this answer as I have already explained them in some of my other answers (linked below).
The process of calculating the error in the weights and adjusting them accordingly is the back-propagation process and it is usually repeated many times to get the network as accurate as possible. The number of times you do this is called the epoch count. It is good to learn the importance of how you should manage epochs and batch sizes (another topic), as these can severely impact the efficiency and accuracy of your network.
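As a minimal illustration of that repeated adjust-and-repeat loop (plain gradient descent on a single weight, not the full back-propagation math; the data and learning rate are toy values):

import numpy as np

X = np.array([0.5, 1.0, 1.5])
y = np.array([1.0, 2.0, 3.0])   # the true relationship is y = 2 * x
w, lr = 0.0, 0.1

for epoch in range(50):                       # the epoch count: how often we repeat the pass
    pred = w * X                              # forward pass
    grad = np.mean(2 * (pred - y) * X)        # error gradient w.r.t. the weight
    w -= lr * grad                            # adjust the weight to reduce the error
print(w)  # approaches 2.0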
I understand that this answer may be hard to follow, but unfortunately this is the best way I can explain it. It is expected that you might not understand this the first time you read it, but remember this is a complicated topic. I have linked a few more resources down below, including a video (not mine) that explains these processes even better than a simple text explanation can. But I also hope my answer may have resolved your question. Have a good day!
Further resources:
Link 1 - Detailed explanation of back-propagation.
Link 2 - Detailed explanation of stochastic/gradient-descent.
Youtube Video 1 - Detailed explanation of types of propagation.
Credits go to Sebastian Lague

Learning Rate in Neural Networks

I am new to the field of neural networks, and recently I came to know that a very slow learning rate affects accuracy negatively. I understand that a very low learning rate will make training inefficient, but how is accuracy affected by a low learning rate?
When the learning rate is too large, it can cause the model to converge too quickly to a solution that is not optimal; think of it as overshooting. Conversely, when the learning rate is too small, it can make the model get stuck in a local minimum (due to the tiny steps taken to converge), which again is not the optimal solution.
Accuracy is therefore affected, and the goal is to find the best learning rate for that model.
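A tiny demonstration of the overshooting half of this, using gradient descent on f(w) = w**2 (which has no bad local minima, so it only shows the step-size effect):

def descend(lr, steps=20, w=5.0):
    for _ in range(steps):
        w -= lr * 2 * w    # the gradient of w**2 is 2*w
    return w

print(descend(0.01))   # too small: after 20 steps w is still far from 0
print(descend(0.1))    # reasonable: w ends up close to 0
print(descend(1.1))    # too large: |w| grows every step (overshooting)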

Should we do learning rate decay for adam optimizer

I'm training a network for image localization with the Adam optimizer, and someone suggested I use exponential decay. I don't want to try that, because the Adam optimizer itself decays the learning rate. But he insists, and says he has done it before. So should I do it, and is there any theory behind the suggestion?
It depends. Adam updates each parameter with an individual learning rate. This means that every parameter in the network has a specific learning rate associated with it.
But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update).
It's true that the learning rates adapt themselves during training steps, but if you want to be sure that no update step exceeds lambda, you can lower lambda using exponential decay or whatever.
It can help to reduce the loss during the last steps of training, when the loss computed with the previous lambda has stopped decreasing.
In my experience, it is usually not necessary to do learning rate decay with the Adam optimizer.
The theory is that Adam already handles learning rate optimization (check reference) :
"We propose Adam, a method for efficient stochastic optimization that
only requires first-order gradients with little memory requirement.
The method computes individual adaptive learning rates for different
parameters from estimates of first and second moments of the
gradients; the name Adam is derived from adaptive moment estimation."
As with any deep learning problem YMMV, one size does not fit all, you should try different approaches and see what works for you, etc. etc.
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so that the loss won't begin to diverge after decreasing to a point. Here, I post the code to use Adam with learning rate decay using TensorFlow. Hope it is helpful to someone.
import tensorflow as tf  # TensorFlow 1.x API

# learning_rate and adam_epsilon are your chosen hyperparameters;
# decay the base rate by a factor of 0.95 every 10000 steps.
global_step = tf.train.get_or_create_global_step()
decayed_lr = tf.train.exponential_decay(learning_rate, global_step,
                                        10000, 0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)
Adam has a single learning rate, but it is a maximum rate that is adapted per parameter, so I don't think many people use learning rate scheduling with it.
Due to the adaptive nature, the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find an optimal default rate beforehand: start with a very small rate and increase it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it's from the Cyclical Learning Rates paper.
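A rough, framework-agnostic sketch of that range test; train_one_batch is a hypothetical stand-in for a single mini-batch training step in whatever framework you use, and the ramp factor and stopping rule are illustrative:

lrs, losses = [], []
lr = 1e-7
while lr < 10:
    loss = train_one_batch(lr)    # hypothetical: one mini-batch step at rate lr
    lrs.append(lr)
    losses.append(loss)
    if loss > 4 * min(losses):    # stop once the loss blows up
        break
    lr *= 1.3                     # geometric ramp-up
# Plot losses against lrs and pick the rate on the steepest downward slope,
# not the rate where the loss bottomed out.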
Edit: People have fairly recently started using one-cycle learning rate policies in conjunction with Adam with great results.
Be careful when using weight decay with the vanilla Adam optimizer: the vanilla Adam formula handles weight decay incorrectly, as pointed out in the article Decoupled Weight Decay Regularization (https://arxiv.org/abs/1711.05101).
You should probably use the AdamW variant when you want to use Adam with weight decay.
A simple alternative is to increase the batch size. A larger number of samples per update forces the optimizer to be more cautious with its updates. If GPU memory limits the number of samples that can be processed per update, you may have to resort to the CPU and conventional RAM for training, which will obviously further slow down training.
From another point of view:
All stochastic gradient descent (SGD) optimizers, including Adam, have randomization built in and give no guarantee of reaching the global minimum; at best, "after several times of reduction, a satisfying local extremum will be obtained."
So using learning rate decay will not help reach the global minimum the way it is supposed to. Also, if you use it, the learning rate will eventually become very small and the algorithm will become ineffective.

Neural network: Just some regression?

I just started reading about neural networks. I thought they were something magic and extremely intelligent, but at the end of the day it just seems to be a large math function with many "undefined" constants? The learning is just another way of doing some kind of (more or less "stupid") regression? Is this true? To me this seems not very brilliant, so I am a bit surprised why this works so well.
It has been proved that an artificial neural network with just one hidden layer is a universal approximator; that is, under proper parameterisation, it can approximate any continuous function (see universal approximation theorem). More importantly, as the Wikipedia article mentions:
Work by Hava Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with rational valued weights (as opposed to full precision real number-valued weights) has the full power of a Universal Turing Machine using a finite number of neurons and standard linear connections.
This means that, at least in theory, a neural net is as clever as your expensive PC. And this is true without even taking into account all the modern extensions, e.g. Long Short-Term Memory networks. As one of the comments mentions, though, the real problem is learnability, i.e. how to find the right set of parameters for the task under consideration.
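As a toy illustration of the approximation half of this (while conveniently dodging the learnability problem: the hidden weights below are fixed at random and only the output layer is fitted, by least squares):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]
H = np.tanh(x * rng.normal(size=(1, 100)) + rng.normal(size=100))  # 100 tanh hidden units
w_out, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)              # fit the output weights
print(np.max(np.abs(H @ w_out - np.sin(x))))  # worst-case error is typically tiny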

Neural Net Optimize w/ Genetic Algorithm

Is a genetic algorithm the most efficient way to optimize the number of hidden nodes and the amount of training done on an artificial neural network?
I am coding neural networks using the NNToolbox in MATLAB. I am open to other suggestions for optimization techniques, but I'm most familiar with GAs.
Actually, there are multiple things that you can optimize using GA regarding NN.
You can optimize the structure (number of nodes, layers, activation function etc.).
You can also train using GA, that means setting the weights.
Genetic algorithms will never be the most efficient option, but they are usually used when you have little clue as to what numbers to use.
For training, you can use other algorithms, including backpropagation, Nelder-Mead, etc.
You said you wanted to optimize the number of hidden nodes; for this, a genetic algorithm may be sufficient, although far from "optimal". The space you are searching is probably too small to need genetic algorithms, but they can still work, and AFAIK they are already implemented in MATLAB, so no biggie.
What do you mean by optimizing the amount of training done? If you mean the number of epochs, then that's fine; just remember that training somehow depends on the starting weights, and they are usually random, so the fitness function used for the GA won't really be a (deterministic) function.
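To make that noisy-fitness point concrete, here is a toy GA over the hidden node count; evaluate() is a hypothetical stand-in for training a network and returning its validation score (the noise term mimics random weight initialization):

import random

def evaluate(n_hidden):
    # Hypothetical stand-in: train a net with n_hidden nodes, return its score.
    # The noise mimics the randomness of the starting weights.
    return -(n_hidden - 12) ** 2 + random.gauss(0, 2)

pop = [random.randint(1, 50) for _ in range(10)]
for generation in range(20):
    parents = sorted(pop, key=evaluate, reverse=True)[:4]   # selection
    pop = parents + [max(1, p + random.randint(-3, 3))      # mutation
                     for p in random.choices(parents, k=6)]
print(sorted(pop))  # the population drifts toward ~12 hidden nodes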
A good example of combining neural networks and genetic programming is the NEAT architecture (NeuroEvolution of Augmenting Topologies). This is a genetic algorithm that finds an optimal topology, and it's also known to be good at keeping the number of hidden nodes down.
They also made a game using this called NERO. Quite unique, with very impressive tangible results.
Dr. Stanley's homepage:
http://www.cs.ucf.edu/~kstanley/
Here you'll find just about everything NEAT related as he is the one who invented it.
Genetic algorithms can be usefully applied to optimising neural networks, but you have to think a little about what you want to do.
Most "classic" NN training algorithms, such as Back-Propagation, only optimise the weights of the neurons. Genetic algorithms can optimise the weights, but this will typically be inefficient. However, as you were asking, they can optimise the topology of the network and also the parameters for your training algorithm. You'll have to be especially wary of creating networks that are "over-trained" though.
One further technique, a modified genetic algorithm, can be useful for overcoming a problem with Back-Propagation: Back-Propagation usually finds local minima, but it finds them accurately and rapidly. Combining a Genetic Algorithm with Back-Propagation, e.g. in a Lamarckian GA, gives the advantages of both. This technique is briefly described in the GAUL tutorial.
It is sometimes useful to use a genetic algorithm to train a neural network when your objective function isn't continuous.
I'm not sure whether you should use a genetic algorithm for this.
I suppose the initial solution population for your genetic algorithm would consist of training sets for your neural network (given a specific training method). Usually the initial solution population consists of random solutions to your problem; however, random training sets would not really train your neural network.
The evaluation algorithm for your genetic algorithm would be a weighted average of the amount of training needed, the quality of the neural network in solving a specific problem, and the number of hidden nodes.
So, if you ran this, you would get the training set that delivered the best result in terms of neural network quality (= training time, number of hidden nodes, problem-solving capabilities of the network).
Or are you considering an entirely different approach?
I'm not entirely sure what kind of problem you're working with, but GA sounds like a bit of overkill here. Depending on the range of parameters you're working with, an exhaustive (or otherwise unintelligent) search may work. Try plotting your NN's performance with respect to the number of hidden nodes for the first few values, starting small and jumping by larger and larger increments. In my experience, many NNs plateau in performance surprisingly early; you may be able to get a good picture of what range of hidden node counts makes the most sense.
The same is often true for NNs' training iterations. More training helps networks up to a point, but soon ceases to have much effect.
In the majority of cases, these NN parameters don't affect performance in a very complex way. Generally, increasing them increases performance for a while but then diminishing returns kick in. GA is not really necessary to find a good value on this kind of simple curve; if the number of hidden nodes (or training iterations) really does cause the performance to fluctuate in a complicated way, then metaheuristics like GA may be apt. But give the brute-force approach a try before taking that route.
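In code, that brute-force sweep could look like the sketch below, reusing the kind of hypothetical evaluate() helper from the GA example above (train the network, return a validation score):

for n_hidden in [2, 4, 8, 16, 32, 64, 128]:   # start small, jump by larger increments
    score = evaluate(n_hidden)                # hypothetical train-and-score helper
    print(n_hidden, score)                    # plot these to spot the plateau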
I would tend to say that a genetic algorithm is a good idea, since you can start with a minimal solution and grow the number of neurons. It is very likely that the "quality function" whose optimal point you want to find is smooth and has only a few bumps.
If you have to find this optimal NN frequently, I would recommend using optimization algorithms; in your case, quasi-Newton as described in Numerical Recipes, which is well suited to problems where the function is expensive to evaluate.