Neural Networks: A step-by-step breakdown of the Backpropagation phase? - neural-network

I have to design an animated visual representation of a neural network that is functional (i.e. with UI that allows you to tweak values etc). The primary goal with it is to help people visualize how and when the different math operations are performed in a slow-motion, real-time animation. I have the visuals set up along with the UI that allows you to tweak values and change the layout of the neurons, as well as the visualizations for the feed forward stage, but since I don’t actually specialize in neural networks at all, I’m having trouble figuring out the best way to visualize the back propagation phase, mainly because I’ve had trouble figuring out the exact order of operations during this stage.
The visualization starts by firing neurons forward, and then after that chain of fired neurons reaches an output, an animation shows the difference between the actual and predicted values, and from this point I want to visualize the network firing backwards while demonstrating the math that is taking place. But this is where I really am unsure about what exactly is supposed to happen.
So my questions are:
Which weights are actually adjusted in the backpropagation phase? Are all of the weights adjusted throughout the entire neural network, or just the ones that fired during the forward pass?
Are all of the weights in each hidden layer adjusted by the same amount during this phase, or are they adjusted by a value that is offset by their current weight, or some other value? It didn't really make sense to me that they would all be adjusted by the same amount, without being offset by a curve or something of the sort.
I’ve found a lot of great information about the feed forward phase online, but when it comes to the backpropagation phase I’ve had a lot of trouble finding any good visualizations/explanations about what is actually happening during this phase.

Which weights are actually adjusted in the back-propagation phase? Are all of the weights adjusted throughout the entire neural network, or just the ones that fired during the forward pass?
It depends on how you build the neural network. Typically you forward-propagate your network first and then back-propagate. In the back-propagation phase, the weights are adjusted based on the error and the Sigmoid derivative. It is up to you to choose which weights are adjusted, as well as the type of structure that you have. For a simple Perceptron network (based on what I know) every weight would be adjusted.
Are all of the weights in each hidden layer adjusted by the same amount during this phase, or are they adjusted by a value that is offset by their current weight, or some other value? It didn't really make sense to me that they would all be adjusted by the same amount, without being offset by a curve or something of the sort.
Back-propagation slightly depends on the type of structure you are using. You usually use some kind of algorithm, typically gradient descent or stochastic gradient descent, to control how much a weight is adjusted. From what I know, in a Perceptron network every weight is adjusted by its own value.
In conclusion, back-propagation is just a way to adjust the weights so that the output values are closer to the desired result. It might also help you to look into gradient descent, or watch a network being built from scratch (I learned how to build neural networks through breaking them down step-by-step).
Here is my own version of a step-by-step breakdown of back-propagation:
1. The error is calculated based on the difference between the actual outputs and the expected outputs.
2. The adjustments matrix/vector is calculated by finding the dot product of the training inputs with the error matrix/vector multiplied by the Sigmoid derivative of the actual outputs.
3. The adjustments are applied to the weights.
4. Steps 1 - 3 are iterated many times until the actual outputs are close to the expected outputs.
EXT. In a more complicated neural network you might use stochastic gradient descent or gradient descent to find the best adjustments for the weights.
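The steps above can be sketched in plain Python for a single-layer Sigmoid network. This is only a minimal sketch under assumptions that are not in the original answer: the toy data set, the names `train` and `predict`, the random seed, and the fixed iteration count are all made up for illustration. It also shows that each weight receives its own adjustment, because each adjustment is scaled by the input that flows through that weight.

```python
import math
import random

# Hypothetical toy data: the target simply equals the first input column.
TRAINING_INPUTS = [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
TRAINING_TARGETS = [0, 1, 1, 0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(output):
    # Slope of the Sigmoid, written in terms of the Sigmoid's own output.
    return output * (1.0 - output)


def predict(row, weights):
    # Forward pass for one example: weighted sum, then Sigmoid.
    return sigmoid(sum(x * w for x, w in zip(row, weights)))


def train(inputs, targets, iterations=10000, seed=1):
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in inputs[0]]
    for _ in range(iterations):
        outputs = [predict(row, weights) for row in inputs]
        # Step 1: error = expected output - actual output.
        errors = [t - o for t, o in zip(targets, outputs)]
        # Step 2: scale each error by the Sigmoid slope at that output,
        # then take the dot product with the inputs (one value per weight).
        deltas = [e * sigmoid_derivative(o) for e, o in zip(errors, outputs)]
        adjustments = [
            sum(row[j] * d for row, d in zip(inputs, deltas))
            for j in range(len(weights))
        ]
        # Step 3: apply the adjustments to the weights.
        weights = [w + a for w, a in zip(weights, adjustments)]
    return weights  # Step 4 is the loop itself.


if __name__ == "__main__":
    trained = train(TRAINING_INPUTS, TRAINING_TARGETS)
    print("weights:", trained)
    print("prediction for [1, 0, 0]:", predict([1, 0, 0], trained))
```

For an animation, the three commented steps inside the loop are the natural points at which to pause and show the numbers.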
Edit on Gradient Descent:
Gradient descent is a method of finding a good adjustment value to change your weights in back-propagation. It relies on the derivative (slope) of the network's activation function.
Sigmoid Derivative Formula (X is the Sigmoid output): f(X) = X * (1 - X)
Sigmoid Derivative Formula (Programmatic):
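The original code for this step did not survive, so here is a hedged sketch of what the programmatic form could look like in Python (the function names are my own). Note that X in f(X) = X * (1 - X) is a value that has already been passed through the Sigmoid, not the raw weighted sum.

```python
import math


def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(output):
    # Slope of the Sigmoid, given a value that is ALREADY a Sigmoid output.
    return output * (1.0 - output)


if __name__ == "__main__":
    out = sigmoid(0.0)
    print(out, sigmoid_derivative(out))
```

The slope is largest when the output is 0.5 and shrinks toward zero as the output approaches 0 or 1, which is why confident neurons are adjusted less.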
Gradient Descent Explanation:
Gradient descent is a method which involves finding the best adjustment to a weight. It is necessary so that the best weight values can be found. During the back-propagation iteration, the further the actual output is from the expected output, the bigger the change to the weights is. You can imagine it as a valley (an inverted hill): a ball rolling down it moves quickly at first and then slows as it reaches the bottom.
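The "ball in a valley" picture can be sketched with a one-weight example. This is an illustration under my own assumptions, not the answerer's code: the error curve (w - 3)^2, the learning rate of 0.1, and the name `roll_downhill` are all hypothetical.

```python
def roll_downhill(start, learning_rate=0.1, steps=50):
    """Gradient descent on the hypothetical error curve (w - 3) ** 2."""
    w = start
    step_sizes = []
    for _ in range(steps):
        slope = 2.0 * (w - 3.0)        # derivative of (w - 3) ** 2
        step = learning_rate * slope   # far from the bottom => big step
        w -= step                      # move against the slope (downhill)
        step_sizes.append(abs(step))
    return w, step_sizes


if __name__ == "__main__":
    final_w, sizes = roll_downhill(start=0.0)
    print("final weight:", final_w)
    print("first step:", sizes[0], "last step:", sizes[-1])
```

The recorded step sizes shrink on every iteration because the slope flattens near the bottom of the valley, which is the slowing-down effect described above.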
Credit to Clairvoyant.
Stochastic gradient descent is a variant of gradient descent that calculates each adjustment from one randomly chosen training example (or a small batch) instead of the whole training set. This might not be the best explanation, so for a much clearer explanation, refer to this video. For a clear explanation of stochastic gradient descent, refer to this video.

Related

Weight update of one random layer in multilayer neural network using backpropagation?

In training multi-layer neural networks using back-propagation, the weights of all layers are updated in each iteration.
I am wondering what happens if we randomly select one layer and update only the weights of that layer in each iteration of back-propagation.
How is it going to impact training time? Does model performance (the generalization capability of the model) suffer from this type of training?
My intuition is that the generalization capability will be the same and training time will be reduced. Please correct me if I am wrong.
Your intuition is wrong. What you are proposing is block coordinate descent, and while it makes sense to do something like this if the gradients are not correlated, it does not make sense to do so in this context.
The problem in NNs is that, due to the chain rule, you get the gradient of the preceding layers for free while you calculate the gradient for any single layer. Therefore, you are just discarding this information for no good reason.
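A small sketch may make the "for free" point concrete. It assumes a hypothetical two-weight network (one tanh hidden unit, one linear output, squared error); none of these names or choices come from the answer. Computing the gradient of the first weight already requires the error signal at the output, and the gradient of the second weight is then a single extra multiplication.

```python
import math


def forward(x, w1, w2):
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # linear output
    return h, y


def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2


def gradients(x, target, w1, w2):
    h, y = forward(x, w1, w2)
    d_y = y - target                    # error signal at the output
    grad_w2 = d_y * h                   # output-layer gradient: one multiply
    d_h = d_y * w2                      # error pushed back through w2
    grad_w1 = d_h * (1.0 - h * h) * x   # first-layer gradient reuses d_y, d_h
    return grad_w1, grad_w2


if __name__ == "__main__":
    print(gradients(x=0.5, target=1.0, w1=0.8, w2=-0.4))
```

Updating only the first layer would still pay for `d_y` and `d_h`, and would then throw away `grad_w2` even though it costs almost nothing.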

Neural Networks Regression : scaling the outputs or using a linear layer?

I am currently trying to use Neural Network to make regression predictions.
However, I don't know the best way to handle this, as I read that there are 2 different ways to do regression predictions with a NN.
1) Some websites/articles suggest to add a final layer which is linear.
http://deeplearning4j.org/linear-regression.html
My final layers would look like this, I think:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get a prediction which is the mean of the batch prediction. And this is the case when I use tanh or sigmoid as a penultimate layer.
2) Some other websites/articles suggest to scale the output to a [-1,1] or [0,1] range and to use tanh or sigmoid as a final layer.
Are these 2 solutions acceptable? Which one should one prefer?
Thanks,
Paul
I would prefer the second case, in which we use normalization and sigmoid function as the output activation and then scale back the normalized output values to their actual values. This is because, in the first case, to output the large values (since actual values are large in most cases), the weights mapping from penultimate layer to the output layer would have to be large. Thus, for faster convergence, the learning rate has to be made larger. But this may also cause learning of the earlier layers to diverge since we are using a larger learning rate. Hence, it is advised to work with normalized target values, so that the weights are small and they learn quickly.
In short, the first method learns slowly or may diverge if a larger learning rate is used, whereas the second method is comparatively safer to use and learns quickly.
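The second approach can be sketched as a pair of helper functions that map targets into [0, 1] before training and map predictions back afterwards. This is a minimal sketch, assuming simple min-max scaling; the helper names and the sample values are hypothetical and not taken from the answer.

```python
def fit_range(targets):
    """Remember the smallest and largest training target."""
    lo, hi = min(targets), max(targets)
    if hi == lo:
        raise ValueError("targets must contain at least two distinct values")
    return lo, hi


def scale(value, lo, hi):
    # Map an actual target into [0, 1] so a Sigmoid output can reach it.
    return (value - lo) / (hi - lo)


def unscale(value, lo, hi):
    # Map a network output in [0, 1] back to the original units.
    return value * (hi - lo) + lo


if __name__ == "__main__":
    prices = [120000.0, 250000.0, 310000.0]  # made-up regression targets
    lo, hi = fit_range(prices)
    scaled = [scale(p, lo, hi) for p in prices]
    print(scaled)
    print([unscale(s, lo, hi) for s in scaled])
```

One design caveat: a Sigmoid output can never exceed the largest target seen in training, so values outside the fitted range cannot be predicted with this scheme.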

Backpropagation neural network, too many neurons in layer causing output to be too high

Having a neural network with a lot of inputs causes my network problems like:
The neural network gets stuck and the feed-forward calculation always gives an output of 1.0 because the output sum is too big, and during backpropagation the sum of gradients is too high, which makes the learning speed too dramatic.
The neural network uses tanh as the activation function in all layers.
After giving it a lot of thought, I came up with the following solutions:
Initializing smaller random weight values ( WeightRandom / PreviousLayerNeuronCount )
or
After calculating the sum of either outputs or gradients, dividing the sum by the number of neurons in the previous layer (for the output sum) or by the number of neurons in the next layer (for the gradient sum), and then passing the sum into the activation/derivative function.
I don't feel comfortable with the solutions I came up with.
Solution 1 does not solve the problem entirely. The possibility of the gradient or output sum getting too high is still there. Solution 2 seems to solve the problem, but I fear that it completely changes the network's behavior in a way that it might not solve some problems anymore.
What would you suggest in this situation, keeping in mind that reducing the neuron count in layers is not an option?
Thanks in advance!
General factors that affect backpropagation output include the choice of initial weights and biases, the number of hidden units, the number of training patterns, and the number of iterations. For selecting the initial weights and biases there are several algorithms that can be used, one of which is the Nguyen-Widrow algorithm. You can use it to initialize the weights and biases; I've tried it and it gives good results.
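For reference, here is a rough sketch of Nguyen-Widrow initialization for one hidden layer, as the method is commonly described: draw small random weights, then rescale each hidden neuron's weight vector to length beta = 0.7 * H^(1/n), where H is the number of hidden units and n the number of inputs. The function name, the seed, and the layer sizes are my own assumptions, and this is not code from the answer.

```python
import math
import random


def nguyen_widrow_init(n_inputs, n_hidden, seed=0):
    rng = random.Random(seed)
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)
    weights = []
    for _ in range(n_hidden):
        # Start from small random weights for this hidden neuron...
        row = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
        norm = math.sqrt(sum(w * w for w in row))
        # ...then rescale so the neuron's weight vector has length beta.
        weights.append([beta * w / norm for w in row])
    # Biases are spread uniformly over [-beta, beta].
    biases = [rng.uniform(-beta, beta) for _ in range(n_hidden)]
    return weights, biases, beta


if __name__ == "__main__":
    w, b, beta = nguyen_widrow_init(n_inputs=100, n_hidden=20)
    print("beta:", beta)
    print("length of first weight vector:", math.sqrt(sum(x * x for x in w[0])))
```

Because the length of each weight vector is fixed rather than growing with the number of inputs, the weighted sums stay in a range where tanh is not permanently saturated, at least for inputs that are themselves scaled to a small range.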

Why do we take the derivative of the transfer function in calculating back propagation algorithm?

What is the concept behind taking the derivative? It's interesting that to teach a system, we have to adjust its weights. But why are we doing this using the derivative of the transfer function? What is it about the derivative that helps us? I know the derivative is the slope of a continuous function at a given point, but what does it have to do with the problem?
You must already know that the cost function is a function with the weights as the variables.
For now consider it as f(W).
Our main motive here is to find a W for which we get the minimum value for f(W).
One way of doing this is to plot f on one axis and W on another, but remember that here W is not just a single variable but a collection of variables.
So what can be the other way?
It can be as simple as changing the values of W and seeing whether we get a lower value of f(W) than with the previous W.
But taking random values for all the variables in W can be a tedious task.
So what we do is first take random values for W, then look at the output of f(W) and the slope with respect to each variable (we get this by partially differentiating the function with respect to the i'th variable and substituting the current value of the i'th variable).
Once we know the slope at that point in space, we move a little toward the lower side of the slope (the size of this step is controlled by a factor termed alpha in gradient descent), and this goes on until the slope changes sign, indicating that we have already reached the lowest point in the graph (a graph with n dimensions, function vs. W, W being a collection of n variables).
The reason is that we are trying to minimize the loss. Specifically, we do this by a gradient descent method. It basically means that from our current point in the parameter space (determined by the complete set of current weights), we want to go in a direction which will decrease the loss function. Visualize standing on a hillside and walking down the direction where the slope is steepest.
Mathematically, the direction that gives you the steepest descent from your current point in parameter space is the negative gradient. And the gradient is nothing but the vector made up of all the derivatives of the loss function with respect to each single parameter.
Backpropagation is an application of the Chain Rule to neural networks. If the forward pass involves applying a transfer function, the gradient of the loss function with respect to the weights will include the derivative of the transfer function, since the derivative of f(g(x)) is f’(g(x))g’(x).
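To make the chain-rule argument concrete, here is a short worked derivation for the simplest possible case. The setup is an assumption for illustration only: a single neuron with input x, weight w, transfer function f, target t, and squared error E.

```latex
\begin{aligned}
z &= w\,x, \qquad y = f(z), \qquad E = \tfrac{1}{2}\,(y - t)^2 \\
\frac{\partial E}{\partial w}
  &= \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial z}\cdot\frac{\partial z}{\partial w}
   = (y - t)\, f'(z)\, x
\end{aligned}
```

The middle factor f'(z) is the derivative of the transfer function. It appears simply because y depends on w only through f.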
Your question is a really good one! Why should I move the weight more in one direction when the slope of the error with respect to the weight is high? Does that really make sense? In fact it does make sense if the error function with respect to the weight is a parabola. However, it is a wild guess to assume it is a parabola. As rcpinto says, assuming the error function is a parabola makes the derivation of the updates simple with the Chain Rule.
However, there are some other parameter update rules that actually address this non-intuitive assumption. You can make an update rule that moves the weight a fixed-size step in the down-slope direction, and then maybe later in the training decreases the step size logarithmically as you train. (I'm not sure if this method has a formal name.)
There are also some alternative error functions that can be used. Look up Cross Entropy in your neural network textbook. This is an adjustment to the error function such that the derivative (of the transfer function) factor in the update rule cancels out. Just remember to pick the right cross entropy function based on your output transfer function.
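The cancellation mentioned for Cross Entropy can be checked numerically. The sketch below assumes a single Sigmoid output unit with a binary target; the function names are hypothetical. With squared error the error signal carries the factor y * (1 - y), while with the matching cross entropy it reduces to y - t.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error_delta(z, target):
    # d/dz of 0.5 * (y - t)^2: the Sigmoid slope y * (1 - y) remains.
    y = sigmoid(z)
    return (y - target) * y * (1.0 - y)


def cross_entropy_loss(z, target):
    y = sigmoid(z)
    return -(target * math.log(y) + (1.0 - target) * math.log(1.0 - y))


def cross_entropy_delta(z, target):
    # d/dz of the cross entropy: the Sigmoid slope has cancelled out.
    return sigmoid(z) - target


if __name__ == "__main__":
    z, target = -6.0, 1.0  # a confidently wrong output
    print("squared error delta:", squared_error_delta(z, target))
    print("cross entropy delta:", cross_entropy_delta(z, target))
```

For a confidently wrong output the squared-error signal is tiny because the Sigmoid is saturated, whereas the cross-entropy signal stays close to -1, so learning does not stall.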
When I first started getting into Neural Nets, I had this question too.
The other answers here have explained the math which makes it pretty clear that a derivative term will appear in your calculations while you are trying to update the weights. But all of those calculations are being done in order to implement Back-propagation, which is just one of the ways of updating weights! Now read on...
You are correct in assuming that at the end of the day, all a neural network tries to do is update its weights to fit the data you feed into it. Within this statement lies your answer too. What you are getting confused with here is the idea of the Back-propagation algorithm. Many textbooks use backprop to update neural nets by default but do not mention that there are other ways to update weights too. This leads to the confusion that neural nets and backprop are the same thing and are inherently connected. This also leads to the false belief that neural nets need backprop to train.
Please remember that Back-propagation is just ONE of the ways out there to train your neural network (although it is the most famous one). Now, you must have seen the math involved in backprop, and hence you can see where the derivative term comes from (some other answers have also explained that). It is possible that other training methods won't need the derivatives, although most of them do. Read on to find out why....
Think about this intuitively: we are talking about CHANGING weights, and the direct mathematical operation related to change is a derivative, so it makes sense that you should need to evaluate derivatives to change weights.
Do let me know if you are still confused and I'll try to modify my answer to make it better. Just as a parting piece of information, another common misconception is that gradient descent is a part of backprop, just like it is assumed that backprop is a part of neural nets. Gradient descent is just one way to minimize your cost function; there are plenty of others you can use. One of the answers above makes this wrong assumption too when it says "Specifically Gradient Descent". This is factually incorrect. :)
Training a neural network means minimizing an associated "error" function with respect to the network's weights. Now there are optimization methods that use only function values (Simplex method of Nelder and Mead, Hooke and Jeeves, etc.), methods that in addition use first derivatives (steepest descent, quasi-Newton, conjugate gradient) and Newton methods using second derivatives as well. So if you want to use a derivative method, you have to calculate the derivatives of the error function, which in turn involves the derivatives of the transfer or activation function.
Back propagation is just a nice algorithm to calculate the derivatives, and nothing more.
Yes, the question is a really good one; it also came to my mind while I was trying to understand backpropagation. After doing forward propagation on the neural network, we do backpropagation to minimize the total error. There are also many other ways to minimize the error. Your question is why we take the derivative in backpropagation. The reason is that, as we all know, the derivative gives the slope of a function, or in other words the change of one quantity with respect to another. So here we take the derivative of the total error with respect to the corresponding weights of the network.
By taking the derivative of the total error with respect to a weight, we find its slope, or in other words how the total error changes for a small change of the weight, so that we can update the weight to minimize the error with the help of the gradient descent formula: Weight = Weight - Alpha * (del(Total error)/del(Weight)). In other words, New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters.
Here Alpha is the learning rate, which controls the size of the weight update. If the derivative is negative, the minus sign in the formula makes the update positive, so the weight increases; if the derivative is positive, the weight decreases. Either way the weight moves in the direction that reduces the total error. Also, because the derivative is multiplied by Alpha, the step size shrinks as the weight converges to its optimal value (minimum error), where the derivative approaches zero. That is why we take the derivative to minimize the error.

Updating weights in backpropagation algorithm

I think I've understood each step of backpropagation algorithm but the most important one. How do weights get updated? Like at the end of this tutorial? http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
The weight updates are done via the equations written at the last part of the page (Backpropagation) you provided.
Let me elaborate a little bit:
New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters
For a given weight, calculate the partial derivative of the loss function with respect to it (which can be done easily by back-propagating the error), which is nothing but the steepest direction of the function, and subtract a scaled version of it, the scale factor being the step size, i.e. how large a step you want to make in that direction.
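The update rule itself is a one-liner once the partial derivatives are known. The sketch below assumes the gradients have already been produced by back-propagation; the function name and the example numbers are made up for illustration.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    """New weight = old weight - learning rate * partial derivative."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


if __name__ == "__main__":
    old_weights = [0.5, -0.3]
    partial_derivatives = [0.2, -0.1]  # assumed output of back-propagation
    print(gradient_descent_step(old_weights, partial_derivatives, 0.1))
```

A positive derivative lowers the weight and a negative derivative raises it, so each weight moves in the direction that reduces the loss.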
Just a little clarification which I felt you might need after looking at the way you asked the question ...
What exactly is Back-propagation?
Backpropagation is just a trick to quickly evaluate the partial derivatives of the loss function w.r.t. all weights. It has nothing to do with weight updating. Updating the weights is a part of the gradient descent algorithm.