Understanding dropout in DNNs - neural-network

My understanding of dropout regularization in DNNs is as follows:
First, we randomly delete neurons from the DNN, leaving only the input and output layers unchanged. Then we perform forward and backward propagation on a mini-batch, compute the gradient for this mini-batch, and update the weights and biases. I denote these updated weights and biases as Updated_Set_1.
Then we restore the DNN to its default state and again randomly delete neurons. We perform forward and backward propagation again and obtain a new set of weights and biases called Updated_Set_2. This process continues until Updated_Set_N, where N is the number of mini-batches.
Lastly, we average all the weights and biases over Updated_Set_1 through Updated_Set_N. These averaged weights and biases are then used to make predictions on new inputs.
I just want to confirm whether my understanding is correct. If it is wrong, please share your thoughts and teach me. Thank you in advance.

Well, actually there is no averaging: there is a single set of weights and biases, and it is updated in place after every mini-batch. During training, for every forward/backward pass, we randomly "mute" (deactivate) some neurons, so that their outputs and the related weights are not considered when computing the output, nor during backpropagation.
That means we are forcing the remaining active neurons to give a good prediction without the help of the deactivated neurons.
This makes each neuron less dependent on the other neurons (features) and, in the same way, improves the model's generalization.
Other than this, the forward and backward propagation phases are the same as without dropout.
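To make the "muting" concrete, here is a minimal NumPy sketch of one dropout step on a hidden layer. The function names `dropout_forward` and `dropout_backward` and the toy shapes are made up for illustration; the sketch shows plain masking only and leaves out the rescaling that many implementations add.

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng):
    """Zero each activation independently with probability drop_prob."""
    mask = rng.random(activations.shape) >= drop_prob  # True means "keep"
    return activations * mask, mask

def dropout_backward(grad_output, mask):
    """Muted neurons receive no gradient, so their incoming weights get no update."""
    return grad_output * mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 6))  # toy activations: 4 examples, 6 hidden units

# A fresh mask is drawn for every mini-batch; the weights themselves are shared.
dropped, mask = dropout_forward(hidden, drop_prob=0.5, rng=rng)
grad_hidden = dropout_backward(np.ones_like(hidden), mask)
```

Only the mask changes from one mini-batch to the next. The weights are never copied into separate sets, which is why there is nothing to average at the end.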


How does a Neural Network "remember" what it's learned?

I'm trying to wrap my head around neural networks. From everything I've seen, I understand that they are made up of layers of nodes. These nodes are attached to each other with "weighted" connections, and when values are passed in through the input layer, they travel through the nodes, changing depending on the "weight" of the connections (right?). Eventually they reach the output layer with a value. I understand the process, but I don't see how this leads to the network being trained. Does the network remember a pattern between weighted connections? How does it remember that pattern?
Each weight and bias on each node is like a stored variable. As new data causes the weights and biases to change, these variables change. Eventually training is done and the weights and biases don't need to change anymore. You can then store the information about all the nodes, weights, biases and connections however you like. This information is your model. So the "remembering" is just the values of the weights and biases.
A neural network remembers what it has learned through its weights and biases. Let's explain it with a binary classification example. During forward propagation, the value computed is the probability (say p), and the actual value is y. The loss is then calculated using the formula:
-(y*log(p) + (1-y)*log(1-p))
Once the loss is calculated, it is propagated backwards, and the derivatives of the loss with respect to the weights and biases are calculated. The weights and biases are then adjusted according to these derivatives. In one epoch, all the training examples are propagated and the weights and biases are adjusted. Then the same examples are propagated forward and backward again, and at each step the weights and biases are adjusted. Finally, after minimizing the loss sufficiently or achieving high accuracy (while making sure not to overfit), we store the values of the weights and biases, and this is what the neural network has learned.
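The loop described above can be sketched with a single sigmoid neuron, which is the smallest possible "network". The toy data, learning rate and epoch count below are assumptions chosen for illustration, not values from the answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, p, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (made up): the label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)  # weights: this is the network's "memory"
b = 0.0          # bias
lr = 0.5

initial_loss = bce_loss(y, sigmoid(X @ w + b))
for epoch in range(100):
    p = sigmoid(X @ w + b)         # forward propagation
    grad_z = (p - y) / len(y)      # d(loss)/d(pre-activation) for sigmoid + BCE
    w -= lr * (X.T @ grad_z)       # adjust weights along the negative gradient
    b -= lr * grad_z.sum()         # adjust bias
final_loss = bce_loss(y, sigmoid(X @ w + b))

# Everything the model has "learned" is in these numbers.
model = {"w": w, "b": b}
```

Saving `model` to disk and loading it later reproduces the trained behaviour exactly, because nothing else is learned.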

Cross Entropy Loss for One Hot Encoding

CE-loss sums up the loss over all output nodes:
Sum_i[ - target_i*log(output_i) ].
The derivative of the CE-loss with respect to output_i is: - target_i/output_i.
Since for target=0 the loss and its derivative are zero regardless of the actual output, it seems like only the node with target=1 receives feedback on how to adjust weights.
I also noticed the singularity in the derivative for output=0. How is this processed during backpropagation?
I do not see how the weights are adjusted to match the target=0. Maybe you know better :)
You can use the formula you mentioned if your final layer forms a probability distribution. That way all nodes receive feedback: when one final-layer neuron's output increases, the others have to decrease, because the outputs must add up to 1. You can make the final layer form a probability distribution by applying the softmax activation function to it. You can read more about it here.
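A small numerical check may help here. With a softmax final layer and a one-hot target, the gradient of the CE-loss with respect to the pre-softmax values (logits) works out to output - target, so the target=0 nodes do get a nonzero gradient and the 1/output singularity cancels. The logits and target below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(target, output):
    return -np.sum(target * np.log(output))

logits = np.array([2.0, 1.0, 0.1])  # arbitrary pre-softmax values
target = np.array([0.0, 1.0, 0.0])  # one-hot target
output = softmax(logits)

# Analytic gradient of CE(softmax(logits)) with respect to the logits.
grad_logits = output - target

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    numeric[i] = (cross_entropy(target, softmax(bumped))
                  - cross_entropy(target, output)) / eps
```

The entries of `grad_logits` for the two target=0 nodes are positive, so gradient descent pushes those logits down, while the entry for the target=1 node is negative, so that logit is pushed up.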

Why disable dropout during validation and testing?

I've seen in multiple places that you should disable dropout during validation and testing stages and only keep it during the training phase. Is there a reason why that should happen? I haven't been able to find a good reason for that and was just wondering.
One reason I'm asking is because I trained a model with dropout, and the results turned out well - about 80% accuracy. Then, I went on to validate the model but forgot to set the prob to 1 and the model's accuracy went down to about 70%. Is it supposed to be that drastic? And is it as simple as setting the prob to 1 in each dropout layer?
Thanks in advance!
Dropout is a random process of disabling neurons in a layer with chance p. This will make certain neurons feel they are 'wrong' in each iteration - basically, you are making neurons feel 'wrong' about their output so that they rely less on the outputs of the nodes in the previous layer. This is a method of regularization and reduces overfitting.
However, there are two main reasons you should not use dropout on test data:
Dropout makes neurons output 'wrong' values on purpose.
Because you disable neurons randomly, your network will produce different outputs each time it is run on the same input. This undermines consistency.
However, you might want to read some more on what validation/testing exactly is:
Training set: a set of examples used for learning: to fit the parameters of the classifier. In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule.
Validation set: a set of examples used to tune the parameters of a classifier. In the MLP case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm.
Test set: a set of examples used only to assess the performance of a fully-trained classifier. In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further!
Why separate test and validation sets? The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model. After assessing the final model on the test set, YOU MUST NOT tune the model any further!
source: Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University (answer)
So even for validation, how would you determine which nodes to remove if each node has a random probability of being deactivated?
Dropout is a method of making bagging practical for ensembles of very many large neural networks.
Along the same lines, one might be tempted by the following false explanation:
For the new data, we can predict their classes by taking the average of the results from all the learners:
Since N is a constant we can just ignore it and the result remains the same, so we should disable dropout during validation and testing.
The true reason is much more complex. It is because of the weight scaling inference rule:
We can approximate p_{ensemble} by evaluating p(y|x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well.
When we train the model using dropout (for example, for one layer), we zero out the outputs of some neurons and scale the others up by 1/keep_prob to keep the expectation of the layer almost the same as before. In the prediction process we could keep dropout on, but then we get a different prediction each time because values are dropped at random, so we would need to run the prediction many times to get the expected output. Such a process is time-consuming, so we remove the dropout instead, and the expectation of the layer remains the same.
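The claim about the expectation can be checked numerically for a single layer. The sketch below assumes the 1/keep_prob scaling described above (often called inverted dropout); the activation values and the number of samples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
activations = np.array([0.5, 1.0, 2.0, 4.0])  # arbitrary layer outputs

def inverted_dropout(a, keep_prob, rng):
    """Drop each unit with probability 1 - keep_prob, scale survivors by 1/keep_prob."""
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

# Average many stochastic passes, i.e. the slow way of getting the expected output.
samples = np.stack([inverted_dropout(activations, keep_prob, rng)
                    for _ in range(20000)])
mean_with_dropout = samples.mean(axis=0)
```

`mean_with_dropout` comes out close to `activations`, which is what a single pass without dropout returns directly. The equality is exact in expectation only for one layer's outputs; through deeper nonlinear layers it is an approximation, as the quoted passage notes.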
Difference between Bagging and Boosting?
7.12 of Deep Learning
The simplest reason is that during prediction (test, validation, or after production deployment) you want to use the capability of each and every learned neuron, and you don't want to skip some of them at random.
That's the only reason we set the probability to 1 during testing.
There is a Bayesian technique called Monte Carlo dropout in which dropout is not disabled during testing. The model is run several times with the same dropout rate (or in one go as a batch), and the mean and variance of the results are calculated to determine the uncertainty.
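As a rough illustration of the idea, and not code taken from the answer or from any particular library, the sketch below runs a tiny network with dropout left on and summarizes the repeated predictions. The weights `W1` and `w2` are random stand-ins for a trained model, and `stochastic_predict` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 16))  # stand-in for trained hidden-layer weights
w2 = rng.normal(size=16)       # stand-in for trained output weights
keep_prob = 0.9

def stochastic_predict(x, rng):
    """One forward pass with dropout still active on the hidden layer."""
    h = np.tanh(x @ W1)
    mask = rng.random(h.shape) < keep_prob
    return (h * mask / keep_prob) @ w2

x = np.array([0.2, -1.0, 0.5])  # a single made-up input
preds = np.array([stochastic_predict(x, rng) for _ in range(100)])

mc_mean = preds.mean()      # used as the prediction
mc_variance = preds.var()   # used as an uncertainty estimate
```

A large `mc_variance` relative to the scale of the outputs suggests the model is unsure about this input.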
Here is Uber's application to quantify uncertainty:
Short answer:
Dropout is used to reduce overfitting to the training data; it is a regularization technique. So if you have high variance (i.e. a large gap between training and validation accuracy), use dropout during training. Applying dropout to test and validation data is not a good idea, because you cannot know which neurons will be shut off, so randomly chosen neurons that may be important end up being ignored.

Backpropagation neural network, too many neurons in layer causing output to be too high

Having a neural network with a lot of inputs causes my network problems like:
The neural network gets stuck, and the feed-forward calculation always gives an output of 1.0 because the output sum is too big.
During backpropagation, the sum of gradients is too high, which makes the learning steps too drastic.
The neural network uses tanh as the activation function in all layers.
After giving it a lot of thought, I came up with the following solutions:
Initializing smaller random weight values ( WeightRandom / PreviousLayerNeuronCount )
After calculating the sum of either outputs or gradients, dividing it by the number of neurons in the previous layer (for the output sum) or the number of neurons in the next layer (for the gradient sum), and then passing the result into the activation/derivative function.
I don't feel comfortable with solutions I came up with.
Solution 1 does not solve the problem entirely. The possibility of the gradient or output sum getting too high is still there. Solution 2 seems to solve the problem, but I fear that it completely changes the network's behavior in a way that it might not solve some problems anymore.
What would you suggest in this situation, keeping in mind that reducing the neuron count in layers is not an option?
Thanks in advance!
General things that affect the output of backpropagation include the initial choice of weights and biases, the number of hidden units, the number of training patterns, and the number of iterations. For selecting the initial weights and biases there are several algorithms that can be used, one of which is the Nguyen-Widrow algorithm. You can use it to initialize the weights and biases; I've tried it and it gives good results.
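For reference, here is a sketch of Nguyen-Widrow initialization in the form it is commonly described: random weights are rescaled so that each hidden neuron's weight vector has norm beta = 0.7 * H^(1/N), with N inputs and H hidden units. Details vary between implementations, so treat the constants and the function name `nguyen_widrow_init` as assumptions rather than as the answerer's code.

```python
import numpy as np

def nguyen_widrow_init(n_inputs, n_hidden, rng):
    """Return (weights, biases, beta) for one hidden layer."""
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)
    weights = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
    # Rescale each neuron's incoming weight vector to have norm beta.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    weights = beta * weights / norms
    biases = rng.uniform(-beta, beta, size=n_hidden)
    return weights, biases, beta

rng = np.random.default_rng(0)
weights, biases, beta = nguyen_widrow_init(n_inputs=100, n_hidden=20, rng=rng)
```

Because every neuron's weight vector is normalized, the total weight mass feeding a neuron does not grow with the number of inputs, which is the situation the question describes.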

Issues with neural network

I am having some issues with using a neural network. I am using a non-linear activation function for the hidden layer and a linear function for the output layer. Adding more neurons in the hidden layer should have increased the capacity of the NN and made it fit the training data better, i.e. have less error on the training data.
However, I am seeing a different phenomenon. Adding more neurons is decreasing the accuracy of the neural network even on the training set.
Here is the graph of the mean absolute error with increasing number of neurons. The accuracy on the training data is decreasing. What could be the cause of this?
Is it that the MATLAB nntool I am using splits the data randomly into training, test and validation sets for checking generalization instead of using cross-validation?
Also, I see lots of negative output values when adding neurons, while my targets are supposed to be positive. Could that be another issue?
I am not able to explain the behavior of NN here. Any suggestions? Here is the link to my data consisting of the covariates and targets
I am unfamiliar with nntool but I would suspect that your problem is related to the selection of your initial weights. Poor initial weight selection can lead to very slow convergence or failure to converge at all.
For instance, notice that as the number of neurons in the hidden layer increases, the number of inputs to each neuron in the visible (output) layer also increases (one for each hidden unit). Say you are using a logistic sigmoid in your hidden layer (always positive) and pick your initial weights from a uniform random distribution over a fixed interval. Then as the number of hidden units increases, the summed input to each neuron in the visible layer will also grow because there are more incoming connections. With a very large number of hidden units, your initial outputs may become very large and result in poor convergence.
Of course, how this all behaves depends on your activation functions, the distribution of the data, and how it is normalized. I would recommend looking at Efficient Backprop by Yann LeCun for some excellent advice on normalizing your data and selecting initial weights and activation functions.
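The growth of the summed input can be seen in a quick simulation. The sketch below uses uniform values in (0, 1) as a stand-in for sigmoid hidden outputs and compares weights drawn from a fixed interval with the same weights divided by the square root of the fan-in; the helper name `output_preactivation_std` and all the numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_preactivation_std(n_hidden, scale_by_fan_in, n_trials=2000):
    """Spread of the weighted sum feeding one output neuron at initialization."""
    hidden = rng.uniform(0.0, 1.0, size=(n_trials, n_hidden))    # always-positive activations
    weights = rng.uniform(-1.0, 1.0, size=(n_trials, n_hidden))  # fixed-interval init
    if scale_by_fan_in:
        weights = weights / np.sqrt(n_hidden)
    return np.std(np.sum(hidden * weights, axis=1))

fixed_small = output_preactivation_std(10, scale_by_fan_in=False)
fixed_large = output_preactivation_std(1000, scale_by_fan_in=False)
scaled_small = output_preactivation_std(10, scale_by_fan_in=True)
scaled_large = output_preactivation_std(1000, scale_by_fan_in=True)
```

With the fixed interval, the spread at 1000 hidden units is roughly ten times the spread at 10 units, while the fan-in-scaled version stays about the same size regardless of layer width.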