I'm a bit confused by activation functions and by blog posts that continually say that neurons are "not activated" or "not fired".
But mathematically speaking, if whatever activation function (whether it's sigmoid, tanh, relu) calculates an output of 0, isn't that value still given to all connected neurons in the next layer?
And if so, doesn't that mean that this neuron is still firing/activating?
Or am I simply wrong and the neuron is really not firing and it really doesn't provide any value at all to any connected neurons in the next layer? And how does this work mathematically?
Please help me clear up my confusion :)
Expressions such as not activated and not fired, as well as the term neuron itself, are just metaphorical depictions, and they should not be taken at face value. They are used just to verbally describe the (very) loose analogy between the (artificial) neural networks used in machine learning and the actual neuronal networks of living beings, but that's all.
As you correctly suspect, in such cases an output value of 0 is indeed produced by the "neuron" and propagated through the net, because in reality there aren't any neurons there, just variables in a computer program, which must have a value at all times, for both mathematical and computational reasons.
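If it helps, here is a minimal sketch (in Python, with made-up weights) showing that a ReLU "neuron" that is "not firing" still produces a concrete value, 0, which is propagated like any other number:

```python
# A "dead" ReLU neuron still outputs a value (namely 0), and that 0 is
# still multiplied by the next layer's weights like any other activation.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([1.0, 2.0])           # hypothetical inputs
w_hidden = np.array([-0.5, -0.3])  # weights chosen so the weighted sum is negative
z = w_hidden @ x                   # pre-activation: -0.5*1.0 + -0.3*2.0 = -1.1
a = relu(z)                        # relu(-1.1) = 0.0 -- the neuron is "not firing"
w_next = 0.8                       # weight to a neuron in the next layer
print(a, w_next * a)               # 0.0 0.0 -- the zero is still passed along
```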
I've started working on forward and back propagation in neural networks. I've coded it as well, and it works properly, but I'm confused about the algorithm itself. I'm new to neural networks.
So forward propagation in a neural network means computing a predicted label from the given weights?
And back-propagation uses forward propagation to find the parameters with the least error by minimizing a cost function, and those parameters are then used to classify other training examples? And this is what is called a trained neural network?
I feel like there is a big blunder in my understanding; if there is, please let me know where I'm wrong and why.
I will try my best to explain forward and back propagation in a detailed yet simple-to-understand manner, although it's not an easy topic to explain simply.
Forward Propagation
Forward propagation is the process whereby, at runtime, values are fed into the front of the neural network (the inputs). You can imagine these values travelling along the weights, each weight multiplying the value coming from the input. They then arrive at the hidden layer (the neurons). Neurons vary quite a lot between different types of networks, but here is one way of explaining it: when the values reach a neuron, every value being fed into it is summed up, and the sum is fed into an activation function. This activation function can be very different depending on the use-case, but take, for example, a simple step (threshold) activation function: it takes the summed value and rounds it to a 0 or a 1. The result is then fed through more weights and finally emitted at the outputs, which is the last step of the network.
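Here is a minimal sketch of one such forward pass, with made-up weights and a step activation; treat it as an illustration of the description above rather than a production implementation:

```python
# One forward pass: inputs -> weights -> hidden neurons (sum + activation)
# -> weights -> output. All values here are invented for illustration.
import numpy as np

def step(z):
    return (z >= 0).astype(float)      # rounds each summed value to 0 or 1

x = np.array([0.5, -1.0])              # input layer (2 values)
W1 = np.array([[0.1, 0.4],             # weights input -> hidden (3 neurons)
               [-0.2, 0.3],
               [0.7, -0.6]])
hidden = step(W1 @ x)                  # each hidden neuron sums its weighted
                                       # inputs, then applies the activation
W2 = np.array([[0.5, -0.1, 0.2]])      # weights hidden -> output (1 neuron)
output = W2 @ hidden                   # value "spat out" at the output
print(hidden, output)
```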
Back Propagation
Back propagation is just like forward propagation, except that we work backwards from the output instead of forwards from the input.
The aim of back propagation is to reduce the error during the training phase (to make the neural network as accurate as possible). This is done by going backwards through the layers and weights: for each weight, its contribution to the error is calculated, and the weight is individually adjusted by an optimization algorithm. An optimization algorithm is exactly what it sounds like: it optimizes the weights, adjusting their values to make the neural network more accurate.
Some optimization algorithms include gradient descent and stochastic gradient descent. I will not go through the details in this answer as I have already explained them in some of my other answers (linked below).
The process of calculating the error in the weights and adjusting them accordingly is the back-propagation process, and it is usually repeated many times to get the network as accurate as possible. The number of times you do this is called the epoch count. It is worth learning how to manage epochs and batch sizes (another topic), as these can severely impact the efficiency and accuracy of your network.
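To make the repeated-adjustment idea concrete, here is a toy sketch of a gradient-descent update loop over several epochs; it fits a single made-up weight rather than a full network:

```python
# A toy illustration (not a full network) of the repeated update that
# back-propagation drives: each epoch, compute the error gradient and
# nudge the weight against it. All numbers here are made up.
w, lr = 0.0, 0.1                      # initial weight and learning rate
data = [(1.0, 2.0), (2.0, 4.0)]       # inputs x with targets y = 2x

for epoch in range(50):               # the epoch count
    for x, y in data:
        error = w * x - y             # prediction error for this example
        grad = 2 * error * x          # gradient of squared error w.r.t. w
        w -= lr * grad                # gradient-descent update
print(w)                              # approaches 2.0
```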
I understand that this answer may be hard to follow, but unfortunately this is the best explanation I can give. Don't worry if you don't understand it the first time you read it; this is a complicated topic. I have linked a few more resources below, including a video (not mine) that explains these processes even better than a simple text explanation can. I hope my answer has resolved your question. Have a good day!
Further resources:
Link 1 - Detailed explanation of back-propagation.
Link 2 - Detailed explanation of stochastic/gradient-descent.
YouTube Video 1 - Detailed explanation of types of propagation.
Credits go to Sebastian Lague
I read in a book that the bias b_k is used to apply an affine transformation to the output u_k (the sum of the weighted input signals).
The author also mentioned that this bias, by contributing a constant value, say 'k', can make the neuron behave as if it were not connected to the previous layer.
I am in a confused state. Can someone please tell me what the above two points mean, and whether there are any other uses of a bias in the network?
Thanks in advance!
If the neuron's activation is z(a) = wa + b, then b is the bias. It's called a bias because the larger it is (in magnitude), the more biased the neuron is; in other words, the less it cares about what was passed to it (a) from the last layer. I'm assuming the second point refers to the fact that if the bias is large enough (positive or negative), the neuron effectively no longer cares what is passed to it: it will always pass roughly the same thing to the next layer. I would need to see it in context to be certain about what the author is saying, but overall you just need to understand that it is a constant that can add bias (indifference to what the last layer provides).

Don't fret too much about its implications, though, because the learning (optimization) process adjusts the biases automatically, so you won't have to choose proper bias values for the network yourself. As you become more familiar with the concepts, it will start to make more sense.
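To see this numerically, here is a quick sketch (made-up numbers, and assuming a sigmoid activation applied on top of z, which the question doesn't specify) showing that a large bias makes the output nearly constant regardless of the input:

```python
# With a large bias, this neuron's output barely depends on what the
# previous layer sends it: the sigmoid saturates near 1 for all inputs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.5, 10.0                      # large positive bias (made-up values)
for a in [-1.0, 0.0, 1.0]:            # very different inputs from the last layer
    print(a, sigmoid(w * a + b))      # output is ~1.0 every time
```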
In a simple single-layer network, it is easy to calculate the target outputs of neurons, as they are identical to the target outputs of the network itself. However, in a multiple-layer network, I am not quite sure how to calculate the targets for each individual neuron in the hidden layers, because they do not necessarily have a direct connection to the final output and are most likely not given in the training data. How would one find these values?
I would not be surprised if I am missing something and am going about this incorrectly, but I would like to know nonetheless. Thanks in advance for any and all input.
Taken from this great guide on pg. 18:
Calculate the errors for the hidden layer neurons. Unlike the output layer, we can't calculate these directly (because we don't have a target), so we back-propagate them from the output layer (hence the name of the algorithm). This is done by taking the errors from the output neurons and running them back through the weights to get the hidden layer errors.
Or in other words, you don't. You propagate the activations from the input to the output, calculate the error of the output, then backpropagate the error from the output back to the input (thus the name of the algorithm).
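Here is a minimal sketch of that step, with made-up numbers and assuming sigmoid hidden units, showing the output error being run back through the weights:

```python
# Hidden-layer errors are not given as targets; they are computed by
# running the output errors back through the weights. Values are made up.
import numpy as np

hidden_act = np.array([0.6, 0.9])     # hidden activations from the forward pass
W2 = np.array([[0.5, -0.3]])          # weights hidden(2) -> output(1)
output_error = np.array([0.2])        # output - target, known directly

# Run the output error back through the weights, scaled by the derivative
# of the sigmoid activation, a * (1 - a), to get the hidden "errors".
hidden_error = (W2.T @ output_error) * hidden_act * (1 - hidden_act)
print(hidden_error)
```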
In the unfortunate case that the link I posted goes down, it can be found by Googling "backpropagation algorithm 3".
I have read the answers to the question "Why use softmax only in the output layer and not in hidden layers?". My exact question pertains to the accepted answer:
Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, then you will keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.
What complications arise from forgoing variable independence in the hidden layers? Please provide at least one example. I know hidden-variable independence helps a lot in writing out the backpropagation, but backpropagation can be written out for softmax as well. (Please verify whether I am correct in this claim; as far as I can tell I have the equations right, hence the claim.)
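For reference, here is the minimal sketch my claim is based on: the softmax Jacobian is perfectly well defined, so the backward pass can be written down (inputs here are made up):

```python
# The softmax Jacobian d s_i / d z_j = s_i * (delta_ij - s_j) exists and
# is easy to compute, so backprop through a softmax layer is codifiable.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shifted for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
# Note every entry couples each output to every other output: this
# coupling is exactly the "linear dependence" the answer refers to.
J = np.diag(s) - np.outer(s, s)
print(J)
```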
Training issue: try to imagine that, to make your network work better, you have to make part of the activations from your hidden layer a little bit lower. Then, automatically, you are making the rest of them have a higher mean activation, which might in fact increase the error and harm your training phase.
I don't understand how you achieve that kind of flexibility even with sigmoid hidden neurons, where you could fine-tune the activation of a particular neuron; that is precisely gradient descent's job. So why are we even worried about this issue? If you can implement backprop, the rest will be taken care of by gradient descent. Fine-tuning the weights so as to make the activations "proper" is not something you would want to do, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
Mathematical issue: by creating constraints on the activations of your model, you decrease its expressive power without any logical explanation. Striving to have all activations the same is not worth it, in my opinion.
Kindly explain what is being said here
Batch normalization: I understand this, No issues here
1/2. I don't think you have quite grasped what the author is trying to say. Imagine a layer with 3 nodes. Two of these nodes have an error responsibility of 0 with respect to the output error, so there is one node that should be adjusted. But with softmax, if you want to improve the output of node 0, then you immediately affect nodes 1 and 2 in that layer, possibly making the output even more wrong.
Fine-tuning the weights so as to make the activations "proper" is not something you would want to do, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.
3. You're basically saying to every neuron "hey, your output cannot be higher than x, because some other neuron in this layer already has value y". Because all neurons in a softmax layer must have a total activation of 1, no neuron can exceed a certain value. For small layers this is a small problem, but for big layers it's a big problem. Imagine a layer with 100 neurons whose total output must be 1: the average value of those neurons will be 0.01, which means the activations will stay very low on average, whereas other activation functions output (or take as input) values spanning the range (0, 1) or (-1, 1).
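A quick sketch (made-up inputs) of the constraint in question: 100 softmax outputs must sum to 1, so their average is pinned at 0.01:

```python
# 100 softmax outputs always sum to 1, so raising one output necessarily
# lowers the others, and the mean activation is fixed at 1/100.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shifted for numerical stability
    return e / e.sum()

layer = softmax(np.random.randn(100)) # a hypothetical 100-neuron softmax layer
print(layer.sum(), layer.mean())      # 1.0  0.01
```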
I have a neural network with 2 input variables, 1 hidden layer with 2 neurons, and an output layer with one output neuron. When I start with some randomly generated weights (from 0 to 1), the network learns the XOR function fast and well, but in other cases the network NEVER learns the XOR function! Do you know why this happens and how I can overcome this problem? Could some chaotic behaviour be involved? Thanks!
This is quite a normal situation, because the error function of a multilayer NN is not convex, and optimization can converge to a local minimum.
You can just keep the initial weights that resulted in successful optimization, or run the optimizer multiple times starting from different weights and keep the best solution (see the sketch below). The optimization algorithm and learning rate also play a certain role; for example, backpropagation with momentum and/or stochastic gradient descent sometimes works better. Also, adding more neurons, beyond the minimum needed to learn XOR, helps.
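Here is a minimal sketch of that restart strategy; `train_xor` is a hypothetical stand-in for your own training routine, and its body is just a placeholder, not a real trainer:

```python
# Random-restart strategy: train from several random initializations
# and keep the run with the lowest final error.
import random

def train_xor(seed):
    """Hypothetical stand-in: plug in your own init + backprop loop here."""
    random.seed(seed)
    weights = [random.uniform(0, 1) for _ in range(9)]  # placeholder init
    final_error = random.random()                       # placeholder error
    return weights, final_error

best = min((train_xor(seed) for seed in range(10)), key=lambda r: r[1])
print("best final error:", best[1])
```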
There exist methods designed to find the global minimum, such as simulated annealing, but in practice they are not commonly used for NN optimization, except in some specific cases.