I have searched a lot for how the gradients for a mini-batch are computed in Keras when using a multilayer perceptron, but can't seem to find the answer. I'm wondering whether the average of the gradients over each mini-batch is used to update the weights and biases, or the sum of the gradients.
I would appreciate it if someone who knows the answer could help and, if possible, tell me where I can find this information.
I think it makes sense to talk about either the average of the loss or the sum of the loss, but not the average/sum of the gradients.
And I think it's always safe to use the average of the loss to calculate the gradients.
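As a sanity check, here is a minimal sketch (assuming TensorFlow 2.x / tf.keras; the tiny Dense model and random data are made up for illustration). By linearity, the gradient of the mean loss over a mini-batch equals the mean of the per-sample gradients, so the two descriptions coincide. As far as I know, Keras's built-in losses default to averaging over the batch (see tf.keras.losses.Reduction in the documentation).
# Sketch, not the exact Keras internals: compare the gradient of the averaged
# loss with the average of the per-sample gradients on a tiny made-up model.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = tf.constant(np.random.rand(4, 3), dtype=tf.float32)  # mini-batch of 4 samples
y = tf.constant(np.random.rand(4, 1), dtype=tf.float32)

with tf.GradientTape() as tape:
    mean_loss = tf.reduce_mean(tf.square(model(x) - y))   # loss averaged over the batch
grads_from_mean = tape.gradient(mean_loss, model.trainable_variables)

per_sample = []                                            # per-sample gradients, averaged afterwards
for i in range(4):
    with tf.GradientTape() as tape:
        loss_i = tf.reduce_mean(tf.square(model(x[i:i + 1]) - y[i:i + 1]))
    per_sample.append(tape.gradient(loss_i, model.trainable_variables))
avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*per_sample)]  # equals grads_from_mean up to float error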
Related
Consider the training process of a deep feed-forward neural network using mini-batch gradient descent. As far as I understand, at each epoch of training we have a different random split into mini-batches. Iterating over all mini-batches and computing the gradients of the network parameters, we get random gradients at each iteration and therefore random directions for the model parameters to move in while minimizing the cost function. Let's imagine we fixed the hyperparameters of the training algorithm and started the training process again and again; we would then end up with models that differ completely from each other, because in those trainings the changes of the model parameters were different.
1) Is it always the case when we use such random based training algorithms?
2) If it is so, where is the guarantee that training the NN one more time with the best hyperparameters found during the previous trainings and validations will yield the best model again?
3) Is it possible to find such hyperparameters, which will always yield the best models?
A neural network is solving an optimization problem. As long as it computes gradients that point in roughly the right direction, the fact that they are random does not hurt its objective of generalizing over the data. It can get stuck in some local optimum, but there are many good methods, such as Adam, RMSProp and momentum-based optimizers, that help it accomplish its objective.
Another point: with a mini-batch there are at least several samples to generalize over; the error rate may fluctuate, but at least it gives us a local solution.
Moreover, at each random sampling the mini-batches contain different samples, which helps the model generalize well over the complete distribution.
For hyperparameter selection, you need to tune them and validate the results on unseen data; there is no straightforward method for choosing them.
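To see how the run-to-run randomness plays out, here is a minimal sketch (scikit-learn's MLPClassifier on synthetic data, purely for illustration): two runs with different random seeds end up with different weights, yet their validation accuracy is usually comparable, so the randomness does not prevent generalization.
# Train the same small network twice with different seeds and compare
# validation accuracy; the weights differ, the accuracies are typically close.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for seed in (1, 2):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                        random_state=seed).fit(X_tr, y_tr)
    print("seed", seed, "validation accuracy:", clf.score(X_val, y_val))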
I'm learning how an LSTM works by practicing with time series training data (the input is a list of features and the output is a scalar).
There is a problem I couldn't understand when calculating the loss for an RNN/LSTM:
How is the loss calculated? Is it computed each time I give the network a new input, or accumulated through all the given inputs and then backpropagated?
The existing answer is correct. However, in an LSTM, or any RNN architecture, the loss for each instance is added up across all time steps. In other words, you'll have (L_0 at t_0, L_1 at t_1, ..., L_T at t_T) for each sample in your input batch. Add those losses up separately for each instance in the batch. Finally, average the losses of the input instances to get the average loss for the current batch.
For more information please visit: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
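In plain array terms, the bookkeeping looks like this (a NumPy sketch with made-up per-step losses, just to show the sum-then-average order):
# Sum the loss over time steps for each sequence, then average over the batch.
import numpy as np

batch_size, time_steps = 4, 10
step_losses = np.random.rand(batch_size, time_steps)  # L_t for every sample and time step

per_sequence_loss = step_losses.sum(axis=1)   # sum over time steps, one value per sample
batch_loss = per_sequence_loss.mean()         # average over the mini-batch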
The answer does not depend on the neural network model.
It depends on your choice of optimization method.
If you are using batch gradient descent, the loss is averaged over the whole training set. This is often impractical for neural networks, because the training set is too big to fit into RAM, and each optimization step takes a lot of time.
In stochastic gradient descent, the loss is calculated for each new input. The problem with this method is that it is noisy.
In mini-batch gradient descent, the loss is averaged over each new minibatch - a subsample of inputs of some small fixed size. Some variation of this method is typically used in practice.
So, the answer to your question depends on the minibatch size you choose.
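A minimal sketch (plain NumPy, a linear model with squared error, made-up data; names like batch_size are mine) of how the three variants differ only in how many samples the averaged loss, and hence the gradient, is computed over at each step:
# Mini-batch gradient descent on a linear least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))                 # new random mini-batches each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)  # gradient of the loss averaged over the mini-batch
        w -= lr * grad
# batch_size = len(X) gives batch gradient descent; batch_size = 1 gives SGD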
I am looking at (two-layer) feed-forward Neural Networks in Matlab. I am investigating parameters that can minimise the classification error.
A google search reveals that these are some of them:
Number of neurons in the hidden layer
Learning Rate
Momentum
Training type
Epoch
Minimum Error
Any other suggestions?
I've varied the number of hidden neurons in Matlab, varying it from 1 to 10. I found that the classification error is close to 0% with 1 hidden neuron and then grows very slightly as the number of neurons increases. My question is: shouldn't a larger number of hidden neurons guarantee an equal or better answer, i.e. why might the classification error go up with more hidden neurons?
Also, how might I vary the Learning Rate, Momentum, Training type, Epoch and Minimum Error in Matlab?
Many thanks
Since you are considering a simple two-layer feed-forward network and have already pointed out six different things to consider for reducing the classification error, I just want to add one more: the amount of training data. If you train a neural network with more data, it will generally work better. Training with a large amount of data is key to getting good results from neural networks, especially from deep neural networks.
Why does the classification error go up with more hidden neurons?
The answer is simple: your model has over-fitted the training data, which results in poor test performance. If you increase the number of neurons in the hidden layer, the training error tends to decrease while the test error increases.
In the following figure, see what happens with increased hidden layer size!
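If the figure is not available, you can probe the same effect with a small experiment (a scikit-learn sketch on synthetic data rather than the MATLAB toolbox; exact numbers will vary):
# Fit networks of increasing hidden-layer size and print training/test error;
# training error keeps shrinking while test error may start to rise.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for hidden in (1, 5, 20, 100):
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print(hidden, 1 - clf.score(X_tr, y_tr), 1 - clf.score(X_te, y_te))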
How may I vary the Learning Rate, Momentum, Training type, Epoch and Minimum Error in Matlab?
I assume you have already seen the feed-forward neural network functions in MATLAB. You just need to set the second parameter of feedforwardnet(hiddenSizes,trainFcn), which is trainFcn, the training function.
For example, if you want to use gradient descent with momentum and adaptive learning rate backpropagation, then use traingdx as the training function. You can also use traingda if you want to use gradient descent with adaptive learning rate backpropagation.
You can change all the required parameters of the function as you want. For example, if you want to use traingda, you just need to follow these two steps.
Set net.trainFcn to traingda. This sets net.trainParam to traingda's default parameters.
Set net.trainParam properties to desired values.
Example
net = feedforwardnet(3,'traingda');
net.trainParam.lr = 0.05;     % set the learning rate
net.trainParam.epochs = 2000; % set the maximum number of epochs
Please see the documentation for gradient descent with adaptive learning rate backpropagation (traingda) and gradient descent with momentum and adaptive learning rate backpropagation (traingdx).
I am learning the Neural Network Toolbox in MATLAB. I have tested it on data from the Machine Learning Repository and drawn a graph by calculating the ROC for the data. I calculated ROCs for all parts: train, validation, and test. I understand what ROC curves represent, what a particular cut-off point corresponds to, and why we call it a trade-off between TPR and FPR.
This ROC shows good classification, but my goal is to influence the decision so that the classifier provides more TPs (a higher TPR) even if I have to accept a higher FPR. I can read from the graph that if I want 90% TPR, I would have to accept slightly over 40% FPR as well. How can this knowledge help me tune the current classification performance?
Is there any way, such as a property in the structure that the NN returns (e.g. from patternnet), that would let me tune the classifier according to what the ROC shows and make it give a higher TPR (and FPR)? I am not sure if this is possible.
Thanks
If you want to increase both TPR and FPR, you actually want to increase your recall at the expense of your precision.
You could modify your cost function and use recall instead. Here you have more info about how to use a custom performance function in the MATLAB Neural Network Toolbox.
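Independently of the cost function, you can also simply move the decision threshold that the ROC curve sweeps over. Below is a sketch (scikit-learn with made-up labels and scores, for illustration only; with patternnet you would apply the same idea to the network's output scores): pick the first threshold that reaches a target TPR of 90% and accept whatever FPR comes with it.
# Choose a decision threshold from the ROC curve that achieves a target TPR.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                    # made-up labels
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=200), 0, 1)   # made-up classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
idx = np.argmax(tpr >= 0.9)                 # first index where TPR reaches 90%
threshold = thresholds[idx]
y_pred = (scores >= threshold).astype(int)  # more positives, hence higher TPR and FPR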
I am currently working on an indoor navigation system using a Zigbee WSN in star topology.
I currently have signal strength data for 60 positions in an area of approximately 15 m by 10 m. I want to use an ANN to help predict the coordinates for other positions. After going through a number of threads, I realized that normalizing the data would give me better results.
I tried that and re-trained my network a few times. I managed to get the goal parameter in MATLAB's nntool down to 0.000745, but still, when I give a training sample as a test input and then scale the output back, the value is way off.
A value of 0.000745 means that my data has been fit very closely, right? If so, why this anomaly? I am dividing by the maximum value to normalize and multiplying by it to scale the value back.
Can someone please explain where I might be going wrong? Am I using the wrong training parameters? (I am using TRAINRP, 4 layers with 15 neurons in each layer, a goal of 1e-8, a minimum gradient of 1e-6, and 100000 epochs.)
Should I consider methods other than ANN for this purpose?
Please help.
For spatial data you can always use Gaussian process regression. With a proper kernel you can predict pretty well, and GP regression is a pretty simple thing to do (just a matrix inversion and a matrix-vector multiplication). You don't have much data, so exact GP regression can easily be done. For a nice source on GP regression, check this.
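For concreteness, here is a minimal sketch of exact GP regression (plain NumPy, an RBF kernel, 1-D made-up data standing in for your signal-strength features): the predictive mean really is just a linear solve plus a matrix-vector product, which is cheap for roughly 60 points.
# Exact GP regression: predictive mean at test points from training data.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=60)              # e.g. 60 measured positions
y_train = np.sin(X_train) + 0.1 * rng.normal(size=60)
X_test = np.linspace(0, 10, 100)

noise = 1e-2
K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
K_star = rbf_kernel(X_test, X_train)

alpha = np.linalg.solve(K, y_train)                # the "matrix inversion" step
y_pred = K_star @ alpha                            # the matrix-vector product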
What did you scale, the inputs or the outputs? Did you scale both input and output for your training set but only the output while testing?
What kind of error measure do you use? I assume your "goal parameter" is an error measure. Is it SSE (sum of squared errors) or MSE (mean squared error)? 0.000745 seems very small, so you should usually have almost no error on your training data.
Your ANN architecture might be too deep with too few hidden units for an initial test. Try different architectures like 40-20 hidden units, 60 HU, 30-20-10 HU, ...
You should generate a test set to verify your ANN's generalization. Otherwise overfitting might be a problem.
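On the scaling question specifically, a common pitfall is re-fitting the scaling factors on the test data, or inverting the predictions with the wrong factors. A minimal sketch of doing it consistently (scikit-learn's MinMaxScaler with made-up coordinates; dividing by the training maximum, as you do, is the same idea):
# Fit the scalers on the training data once and reuse them everywhere.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train, y_train = rng.uniform(0, 100, (60, 4)), rng.uniform(0, 15, (60, 2))
X_test = rng.uniform(0, 100, (5, 4))

x_scaler = MinMaxScaler().fit(X_train)        # fitted on training inputs only
y_scaler = MinMaxScaler().fit(y_train)        # fitted on training outputs only

Xn_train, yn_train = x_scaler.transform(X_train), y_scaler.transform(y_train)
Xn_test = x_scaler.transform(X_test)          # same scaler reused for test inputs
# ... train the network on (Xn_train, yn_train), predict on Xn_test ...
yn_pred = rng.uniform(0, 1, (5, 2))           # stand-in for the network's scaled predictions
y_pred = y_scaler.inverse_transform(yn_pred)  # scale the predictions back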