Why do fully connected NNs exist? - neural-network

I have implemented my own little NN with the back-propagation algorithm. What I do not understand at the moment is, if your hidden layer is fully connected with the input layer and fully connected with the output layer, aren't the weights for the nodes in the hidden layer updated equally for each hidden node?

Looking at the back-propagation algorithm:
Back Propagation Algorithm Wikipedia
You can see that the weight update formula contains the current weight values and also the outputs of the nodes, so the weight update differs for each connection between nodes.
And NEVER set all the weights to 0, or to any other single value: if every weight starts out identical, the hidden nodes all compute the same output and receive the same update. Always initialize them randomly.
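To make the point about initialization concrete, here is a minimal sketch. It assumes a tiny network of my own choosing (2 inputs, 2 sigmoid hidden nodes, 1 sigmoid output, no biases, squared error); the function name `hidden_weight_gradients` and all the numbers are hypothetical, picked only for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_weight_gradients(w_hidden, w_out, x, target):
    """Gradients of 0.5*(out - target)^2 w.r.t. each hidden node's input weights."""
    # Forward pass: one hidden layer, one sigmoid output node, no biases.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))
    # Backward pass: output delta first, then one delta per hidden node.
    delta_out = (out - target) * out * (1.0 - out)
    grads = []
    for j, hj in enumerate(h):
        delta_h = delta_out * w_out[j] * hj * (1.0 - hj)
        grads.append([delta_h * xi for xi in x])
    return grads

x, target = [0.5, -1.0], 1.0

# Identical initial weights: both hidden nodes get exactly the same gradient.
same = hidden_weight_gradients([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3], x, target)

# Random initial weights: the gradients differ from node to node.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in x] for _ in range(2)]
w_out = [rng.uniform(-1, 1) for _ in range(2)]
rand = hidden_weight_gradients(w_hidden, w_out, x, target)

print(same[0] == same[1])   # identical start: the hidden nodes stay clones
print(rand[0] == rand[1])   # random start: each node gets its own update
```

So a fully connected layer only updates its hidden nodes "equally" when they start out equal; random initialization breaks that symmetry.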
Also note that this type of question is better asked on the Data Science Stack Exchange site or the Cross Validated Stack Exchange site.

Related

Confusion in Backpropagation

I've started working on forward and back propagation of neural networks. I've coded it as well, and it works properly too. But I'm confused about the algorithm itself. I'm new to neural networks.
So forward propagation in a neural network is finding the right label with the given weights?
And back-propagation is using forward propagation to find the most error-free parameters by minimizing a cost function, and using these parameters to help classify other training examples? And this is called a trained neural network?
I feel like there is a big blunder in my understanding. If there is, please let me know where I'm wrong and why.
I will try my best to explain forward and back propagation in a detailed yet simple-to-understand manner, although it's not an easy topic.
Forward Propagation
Forward propagation is the process where, while the network is running, values are fed into the front of the neural network (the inputs). You can imagine that these values then travel along the weights, each of which multiplies the input value passing through it. They then arrive at the hidden layer (neurons). Neurons vary quite a lot between different types of networks, but here is one way of explaining it. When the values reach a neuron, every value being fed into that neuron is summed up, and the sum is then fed into an activation function. The activation function can be very different depending on the use case, but let's take for example a step (threshold) activation function. It takes the value being fed into it and outputs either a 0 or a 1, depending on whether the value reaches a threshold. The result is then fed through more weights and finally spat out at the outputs, which is the last step of the network.
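As a rough sketch of the forward pass just described, assuming a step (threshold) activation that outputs 0 or 1 and a made-up network with 2 inputs, 2 hidden neurons and 1 output (the weights and the helper names `layer_forward` and `forward` are hypothetical):

```python
def step(z):
    # Threshold activation: 1 if the summed input is non-negative, else 0.
    return 1.0 if z >= 0.0 else 0.0

def layer_forward(inputs, weights, activation):
    # weights[j][i] is the weight on the connection from input i to neuron j.
    # Each neuron sums its weighted inputs, then applies the activation.
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, w_hidden, w_out):
    hidden = layer_forward(inputs, w_hidden, step)   # inputs -> hidden layer
    return layer_forward(hidden, w_out, step)        # hidden layer -> outputs

w_hidden = [[1.0, 1.0], [-1.0, -1.0]]   # illustrative weights, not trained
w_out = [[1.0, -1.0]]

print(forward([0.5, 0.2], w_hidden, w_out))    # [1.0]
print(forward([-0.5, -0.2], w_hidden, w_out))  # [0.0]
```

Real networks usually add a bias to each neuron and use a smooth activation such as the sigmoid, since a hard threshold has no useful derivative for training.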
You can imagine this network with this diagram.
Back Propagation
Back propagation covers the same network as forward propagation, except that we work backwards from where forward propagation ended: from the outputs towards the inputs.
The aim of back propagation is to reduce the error in the training phase (trying to get the neural network as accurate as possible). The way this is done is by going backwards through the weights and layers. For each weight, its contribution to the error (the gradient of the error with respect to that weight) is calculated, and each weight is individually adjusted using an optimization algorithm. An optimization algorithm is exactly what it sounds like: it adjusts the values of the weights to make the neural network more accurate.
Some optimization algorithms include gradient descent and stochastic gradient descent. I will not go through the details in this answer as I have already explained them in some of my other answers (linked below).
The process of calculating the error contribution of each weight and adjusting the weights accordingly is the back-propagation process, and it is usually repeated many times to get the network as accurate as possible. One full pass over the training data is called an epoch, and the number of such passes is the epoch count. It is good to learn how to manage epochs and batch sizes (another topic), as these can severely impact the efficiency and accuracy of your network.
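To show what "repeated many times" looks like in practice, here is a minimal sketch of a training loop. To keep it short it assumes a single sigmoid neuron rather than a multi-layer network, a squared-error loss, and stochastic gradient descent on the logical OR function; the starting weights, learning rate and epoch count are arbitrary choices of mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_squared_error(weights, bias, data):
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def train(data, epochs, learning_rate):
    weights, bias = [0.1, -0.1], 0.0
    for _ in range(epochs):      # one epoch = one pass over all the data
        for x, y in data:        # stochastic gradient descent: update per example
            out = predict(weights, bias, x)
            # dE/dz for E = 0.5*(out - y)^2 with a sigmoid output.
            delta = (out - y) * out * (1.0 - out)
            weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR

before = mean_squared_error([0.1, -0.1], 0.0, data)
weights, bias = train(data, epochs=2000, learning_rate=0.5)
after = mean_squared_error(weights, bias, data)
print(before, after)   # the error after training is lower than before
```

With hidden layers the inner update becomes the full backward pass, but the epoch loop around it stays the same.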
I understand that this answer may be hard to follow, but unfortunately this is the best way I can explain this. It is expected that you might not understand this the first time you read it, but remember this is a complicated topic. I have linked a few more resources down below, including a video (not mine) that explains these processes even better than a simple text explanation can. I hope my answer has resolved your question. Have a good day!
Further resources:
Link 1 - Detailed explanation of back-propagation.
Link 2 - Detailed explanation of stochastic/gradient-descent.
Youtube Video 1 - Detailed explanation of types of propagation.
Credits go to Sebastian Lague

Neural Networks: exact high level training algorithm

I am trying to make my very first Neural Network work. I designed it so that I can choose the number of layers and the number of nodes per layer freely. I had a hard time trying to implement back propagation but I think I have done it recursively even if it is not as performant as it can be. I am using the sigmoid as an activation for all nodes (even the input nodes and the output node).
My network has a single output node in the output layer that should predict a variable (zero or one).
My question is: how exactly should I train my network? I noticed that when I use the following algorithm:
for i in [1:100000]
    feed the same record to my neural network
    perform a forward pass
    compute the error using the square of the difference as a loss function for this record with the current weights
    update the weights using back propagation
it converges to the correct result (the output node value converges to zero when the record is labeled with zero, and to one when the record is labeled as one). But when I feed a different record to the network at each iteration of this algorithm, the network completely diverges.
Suppose that I would like to work with a mini batch of N records. This means that I have to make N forward passes, each time giving one of the N records as input to the network, compute the error, and take the average over the N records. But then, when I would like to use the average error in the back propagation algorithm, what input record should I use? Because, as far as I know, the input layer is also used to compute the weights between it and the first hidden layer. Should I then use the last one of the N records as input? Or the first one? Does it even matter? I am a bit confused here and I found nothing on the internet to answer this particular question.
Best regards.
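On the mini-batch question above, the usual approach is not to average the errors and then pick one input record. Instead you compute a gradient for each record using that record's own input, and average those gradients before a single weight update. Below is a minimal sketch of that idea, assuming a single sigmoid neuron without bias and a squared-error loss; the names `record_gradient` and `minibatch_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def record_gradient(weights, x, y):
    # Gradient of 0.5*(out - y)^2 for ONE record, using that record's own input x.
    out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    delta = (out - y) * out * (1.0 - out)
    return [delta * xi for xi in x]

def minibatch_step(weights, batch, learning_rate):
    # Average the per-record gradients, then apply a single update.
    grads = [record_gradient(weights, x, y) for x, y in batch]
    mean_grad = [sum(g[i] for g in grads) / len(batch) for i in range(len(weights))]
    return [w - learning_rate * g for w, g in zip(weights, mean_grad)]

batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
weights = minibatch_step([0.0, 0.0], batch, learning_rate=1.0)
print(weights)   # [0.0625, -0.0625]
```

Each record pushes the weights in its own direction, and the update is the average of those pushes, so no single record has to stand in for the whole batch.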

How does a Neural Network "remember" what it's learned?

I'm trying to wrap my head around neural networks, and from everything I've seen, I understand that they are made up of layers of nodes. These nodes are attached to each other with "weighted" connections, and when values are passed in through the input layer, they travel through the nodes, changing depending on the "weight" of the connections (right?). Eventually they reach the output layer with a value. I understand the process, but I don't see how this leads to the network being trained. Does the network remember a pattern between weighted connections? How does it remember that pattern?
Each weight and bias on each node is like a stored variable. As new training data causes the weights and biases to change, these variables change. Eventually training is done and the weights and biases don't need to change anymore. You can then store the information about all the nodes, weights, biases and connections however you like. This information is your model. So the "remembering" is just the values of the weights and biases.
A neural network remembers what it has learned through its weights and biases. Let's explain it with a binary classification example. During forward propagation, the value computed is the probability (say p), and the actual value is y. Now, the loss is calculated using the formula:
-(y*log(p) + (1-y)*log(1-p))
Once the loss is calculated, this information is propagated backwards, and the derivatives of the loss with respect to the weights and biases are calculated. The weights and biases are then adjusted according to these derivatives. In one epoch, all the examples present are propagated and the weights and biases are adjusted. Then the same examples are propagated forward and backward again, and in each step the weights and biases are adjusted. Finally, after minimizing the loss to a good extent, or achieving a high accuracy (make sure not to overfit), we can store the values of the weights and biases, and this is what the neural network has learned.
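For reference, the loss formula above can be written out as a short sketch. The derivative with respect to p is included because it is the first quantity that gets propagated backwards; the function names are my own.

```python
import math

def binary_cross_entropy(y, p):
    # y is the actual label (0 or 1); p is the predicted probability, 0 < p < 1.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_gradient_wrt_p(y, p):
    # Derivative of the loss above with respect to the prediction p.
    return -(y / p) + (1 - y) / (1 - p)

confident_right = binary_cross_entropy(1, 0.9)   # small loss
confident_wrong = binary_cross_entropy(1, 0.1)   # large loss
print(confident_right, confident_wrong)

# A negative gradient means that raising p would lower the loss.
print(loss_gradient_wrt_p(1, 0.9))
```

The loss is small when the predicted probability agrees with the label and grows quickly when it does not, which is what drives the weight adjustments.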

How does one calculate the target outputs of neurons in hidden layers of a neural network?

In a simple single-layer network, it is easy to calculate the target outputs of neurons, as they are identical to the target outputs of the network itself. However, in a multiple-layer network, I am not quite sure how to calculate the targets for each individual neuron in the hidden layers, because they do not necessarily have a direct connection to the final output and are most likely not given in the training data. How would one find these values?
I would not be surprised if I am missing something and am going about this incorrectly, but I would like to know nonetheless. Thanks in advance for any and all input.
Taken from this great guide on pg. 18:
Calculate the Errors for the hidden layer neurons. Unlike the output layer we can’t
calculate these directly (because we don’t have a Target), so we Back Propagate
them from the output layer (hence the name of the algorithm). This is done by
taking the Errors from the output neurons and running them back through the
weights to get the hidden layer errors.
Or in other words, you don't. You propagate the activations from the input to the output, calculate the error of the output, then backpropagate the error from the output back to the input (thus the name of the algorithm).
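A minimal sketch of "running the errors back through the weights", assuming two output neurons and three hidden neurons with made-up numbers. The name `hidden_errors` is hypothetical, and the multiplication by the derivative of each hidden neuron's activation, which a full backpropagation step also applies, is left out to keep the idea visible.

```python
def hidden_errors(output_errors, w_out):
    # w_out[k][j] is the weight from hidden neuron j to output neuron k.
    # Each hidden neuron's error is the sum of the output errors, each
    # weighted by the connection that carried its activation forward.
    n_hidden = len(w_out[0])
    return [
        sum(output_errors[k] * w_out[k][j] for k in range(len(output_errors)))
        for j in range(n_hidden)
    ]

output_errors = [0.5, -0.2]      # illustrative errors of the two output neurons
w_out = [[1.0, 2.0, 0.0],
         [0.5, -1.0, 3.0]]

errors = hidden_errors(output_errors, w_out)
print(errors)   # approximately [0.4, 1.2, -0.6]
```

So the hidden neurons never need targets of their own: their errors are derived entirely from the output errors and the weights.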
In the unfortunate case that the link I posted goes down, it can be found by Googling "backpropagation algorithm 3".

Why is softmax not used in hidden layers [duplicate]

This question already has answers here:
Why use softmax only in the output layer and not in hidden layers?
I have read the answer given here. My exact question pertains to the accepted answer:
Variables independence : a lot of regularization and effort is put to keep your variables independent, uncorrelated and quite sparse. If you use softmax layer as a hidden layer - then you will keep all your nodes (hidden variables) linearly dependent which may result in many problems and poor generalization.
What complications arise from forgoing variable independence in hidden layers? Please provide at least one example. I know hidden variable independence helps a lot in codifying backpropagation, but backpropagation can be codified for softmax as well. (Please verify whether I am correct in this claim. I seem to have gotten the equations right, hence the claim.)
Training issue: try to imagine that to make your network work better you have to make a part of the activations from your hidden layer a little bit lower. Then you are automatically making the rest of them have a mean activation on a higher level, which might in fact increase the error and harm your training phase.
I don't understand how you achieve that kind of flexibility even with a sigmoid hidden neuron, where you can fine tune the activation of a particular neuron, which is precisely gradient descent's job. So why are we even worried about this issue? If you can implement the backprop, the rest will be taken care of by gradient descent. Fine tuning the weights so as to make the activations proper is not something you, even if you could do, which you can't, would want to do. (Kindly correct me if my understanding is wrong here)
Mathematical issue: by creating constraints on the activations of your model you decrease the expressive power of your model without any logical explanation. The strive for having all activations the same is not worth it in my opinion.
Kindly explain what is being said here
Batch normalization: I understand this, No issues here
1/2. I don't think you have a clue of what the author is trying to say. Imagine a layer with 3 nodes. 2 of these nodes have an error responsibility of 0 with respect to the output error, so there is one node that should be adjusted. So if you want to improve the output of node 0, then you immediately affect nodes 1 and 2 in that layer, possibly making the output even more wrong.
Fine tuning the weights so as to make the activations proper is not something you, even if you could do, which you can't, would want to do. (Kindly correct me if my understanding is wrong here)
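The coupling between the three nodes can be seen in a small sketch. It assumes a plain softmax over three hypothetical pre-activation values, where only node 0's input is changed:

```python
import math

def softmax(z):
    m = max(z)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

before = softmax([1.0, 1.0, 1.0])
after = softmax([2.0, 1.0, 1.0])         # only node 0's pre-activation changed

print(before)   # all three activations are equal
print(after)    # node 0 went up, nodes 1 and 2 went down with it
```

With a sigmoid layer the same change would leave nodes 1 and 2 untouched, because each sigmoid only looks at its own input.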
That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.
3. You're basically saying to every neuron 'hey, your output cannot be higher than x, because some other neuron in this layer already has value y'. Because all neurons in a softmax layer must have a total activation of 1, no neuron can go above a specific value. For small layers this is a small problem, but for big layers it is a big problem. Imagine a layer with 100 neurons. Now imagine their total output should be 1. The average value of those neurons will be 0.01, which means the network has to rely on its connection weights to compensate (because the activations stay very low on average), whereas other activation functions output (or take as input) values in the range (0, 1) or (-1, 1).
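The 100-neuron case is easy to check numerically. This sketch assumes random pre-activations drawn from a standard normal distribution, which is an arbitrary choice of mine; the mean comes out as 1/100 whatever the inputs are.

```python
import math
import random

def softmax(z):
    m = max(z)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(100)]
activations = softmax(pre_activations)

total = sum(activations)
mean = total / len(activations)
print(total)              # the activations always sum to 1 (up to rounding)
print(mean)               # so the mean is always 1/100 = 0.01
print(max(activations))   # even the largest activation stays well below 1
```

Because the sum is fixed at 1, widening the layer pushes the average activation down as 1/n, which is the constraint described above.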