I'm training an XOR neural network via back-propagation using stochastic gradient descent. The weights of the neural network are initialized to random values between -0.5 and 0.5. The neural network successfully trains itself around 80% of the time. However, sometimes it gets "stuck" while backpropagating. By "stuck", I mean that I start seeing a decreasing rate of error correction. For example, during a successful training run, the total error decreases rather quickly as the network learns, like so:
...
...
Total error for this training set: 0.0010008071327708653
Total error for this training set: 0.001000750550254843
Total error for this training set: 0.001000693973929822
Total error for this training set: 0.0010006374037948094
Total error for this training set: 0.0010005808398488103
Total error for this training set: 0.0010005242820908169
Total error for this training set: 0.0010004677305198344
Total error for this training set: 0.0010004111851348654
Total error for this training set: 0.0010003546459349181
Total error for this training set: 0.0010002981129189812
Total error for this training set: 0.0010002415860860656
Total error for this training set: 0.0010001850654351723
Total error for this training set: 0.001000128550965301
Total error for this training set: 0.0010000720426754587
Total error for this training set: 0.0010000155405646494
Total error for this training set: 9.99959044631871E-4
Testing trained XOR neural network
0 XOR 0: 0.023956746649767453
0 XOR 1: 0.9736079194769579
1 XOR 0: 0.9735670067093437
1 XOR 1: 0.045068688874314006
However, when it gets stuck, the total error is still decreasing, but at an ever-decreasing rate:
...
...
Total error for this training set: 0.12325486644721295
Total error for this training set: 0.12325486642503929
Total error for this training set: 0.12325486640286581
Total error for this training set: 0.12325486638069229
Total error for this training set: 0.12325486635851894
Total error for this training set: 0.12325486633634561
Total error for this training set: 0.1232548663141723
Total error for this training set: 0.12325486629199914
Total error for this training set: 0.12325486626982587
Total error for this training set: 0.1232548662476525
Total error for this training set: 0.12325486622547954
Total error for this training set: 0.12325486620330656
Total error for this training set: 0.12325486618113349
Total error for this training set: 0.12325486615896045
Total error for this training set: 0.12325486613678775
Total error for this training set: 0.12325486611461482
Total error for this training set: 0.1232548660924418
Total error for this training set: 0.12325486607026936
Total error for this training set: 0.12325486604809655
Total error for this training set: 0.12325486602592373
Total error for this training set: 0.12325486600375107
Total error for this training set: 0.12325486598157878
Total error for this training set: 0.12325486595940628
Total error for this training set: 0.1232548659372337
Total error for this training set: 0.12325486591506139
Total error for this training set: 0.12325486589288918
Total error for this training set: 0.12325486587071677
Total error for this training set: 0.12325486584854453
While I was reading up on neural networks I came across a discussion on local minima and global minima and how neural networks don't really "know" which minimum they are supposed to be heading towards.
Is my network getting stuck in a local minimum instead of the global minimum?
Yes, neural networks can get stuck in local minima, depending on the error surface. However, this abstract suggests that there are no local minima in the error surface of the XOR problem. I cannot get to the full text, though, so I cannot verify what the authors did to prove this or how it applies to your problem.
There might also be other factors leading to this problem. For example, if you descend very fast into a steep valley using plain first-order gradient descent, you might overshoot to the opposite slope and bounce back and forth indefinitely. You could also try printing the average change over all weights at each iteration, to test whether you really have a "stuck" network or rather one that has just run into a limit cycle.
You should first try fiddling with your parameters (learning rate, momentum if you implemented it, etc.). If you can make the problem go away by changing parameters, your algorithm is probably OK.
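The diagnostic suggested above (logging the average weight change next to the total error) could be sketched as follows. This is only an illustration, not the asker's code: it assumes a 2-2-1 sigmoid network with squared error, and the names `train_xor`, `XOR_DATA`, `lr`, and `seed` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The four XOR patterns: ((x1, x2), target)
XOR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def train_xor(seed, epochs=2000, lr=0.5):
    """Train a 2-2-1 net with per-pattern SGD (assumed setup).

    Returns one (total_error, mean_abs_weight_change) pair per epoch.
    """
    rng = random.Random(seed)
    # Each row: weights for input 1, input 2, and the bias.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    n_weights = 9
    history = []
    for _ in range(epochs):
        total_error = 0.0
        total_change = 0.0
        for (x1, x2), target in XOR_DATA:
            inputs = (x1, x2, 1.0)
            h = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_h]
            h_in = (h[0], h[1], 1.0)
            y = sigmoid(sum(w * v for w, v in zip(w_o, h_in)))
            total_error += 0.5 * (target - y) ** 2
            # Backpropagate; hidden deltas use the output weights before the update.
            delta_o = (y - target) * y * (1.0 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(2)]
            for j in range(3):
                step = lr * delta_o * h_in[j]
                w_o[j] -= step
                total_change += abs(step)
            for j in range(2):
                for i in range(3):
                    step = lr * delta_h[j] * inputs[i]
                    w_h[j][i] -= step
                    total_change += abs(step)
        history.append((total_error, total_change / (n_weights * len(XOR_DATA))))
    return history

if __name__ == "__main__":
    for epoch, (err, change) in enumerate(train_xor(seed=1)):
        if epoch % 200 == 0:
            print(f"epoch {epoch}: error={err:.6f} mean |dw|={change:.2e}")
```

A flat error with a mean weight change near zero points to a plateau. A flat error with a mean weight change that stays large points to oscillation, in which case a smaller learning rate is worth trying.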
Poor gradient descent with excessively large steps, as described by LiKao, is one possible problem. Another is that the XOR error landscape has very flat regions, which means that convergence takes a very long time; in fact, the gradient may be so weak that the descent algorithm doesn't pull you in the right direction.
These two papers look at 2-1-1 and 2-2-1 XOR landscapes. One uses a "cross entropy" error function, which I don't know. In the first they declare there are no local minima, but in the second they say there are local minima at infinity, basically when weights run off to very large values. So for the second case, their results suggest that if you don't start off "near enough" to a true minimum, you may get trapped at the infinite points. They also say that other analyses of 2-2-1 XOR networks that show no local minima are not contradicted by their results because of particular definitions.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4770
http://www.ncbi.nlm.nih.gov/pubmed/12662806
I encountered the same issue and found that using the activation function 1.7159*tanh(2/3*x) described in LeCun's "Efficient Backprop" paper helps. This is presumably because that function does not saturate around the target values {-1, 1}, whereas regular tanh does.
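As a rough numeric check of the saturation argument, the sketch below compares the slope of the scaled tanh at the point where it outputs the target 1 with the slope of regular tanh at an output of 0.99. The helper names are hypothetical, and 0.99 is an arbitrary stand-in for "close to the target", since regular tanh only reaches 1 asymptotically.

```python
import math

def scaled_tanh(x):
    # Activation from LeCun's "Efficient Backprop": 1.7159 * tanh(2/3 * x)
    return 1.7159 * math.tanh(2.0 / 3.0 * x)

def scaled_tanh_grad(x):
    t = math.tanh(2.0 / 3.0 * x)
    return 1.7159 * (2.0 / 3.0) * (1.0 - t * t)

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

if __name__ == "__main__":
    # The scaled function reaches the target 1 at a finite input (x = 1) ...
    print("scaled_tanh(1)      =", scaled_tanh(1.0))
    # ... where its slope is still far from zero.
    print("scaled_tanh_grad(1) =", scaled_tanh_grad(1.0))
    # Regular tanh needs a large input just to output 0.99, and its slope is tiny there.
    x_99 = math.atanh(0.99)
    print("tanh_grad(atanh(0.99)) =", tanh_grad(x_99))
```

With the scaled function the targets ±1 sit inside the region where the gradient is still useful, so the weights do not have to grow without bound to hit them.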
The paper by Hamey cited in @LiKao's answer proves there are no strict "regional local minima" for XOR in a 2-2-1 neural network. However, it admits "asymptotic minima" wherein the error surface flattens out as one or more weights approach infinity.
In practice, the weights don't even need to be so large for this to happen and it is quite common for a 2-2-1 net to get stuck in this flat asymptotic region. The reason for this is saturation: the gradient of sigmoid activation approaches 0 as the weights get large, so the network is unable to keep learning.
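To put numbers on the saturation effect described above, here is a small sketch of the sigmoid derivative at a few pre-activation values. The function names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)), maximal (0.25) at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

if __name__ == "__main__":
    for z in (0.0, 2.0, 5.0, 10.0):
        print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  gradient={sigmoid_grad(z):.2e}")
```

Every backpropagated delta is multiplied by this factor, so once the weights are large enough to push the pre-activations to around 10 in magnitude, the updates shrink by several orders of magnitude compared with an unsaturated unit.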
See my notebook experiment - typically around 2 or 3 out of 10 networks end up stuck, even after 10,000 epochs. Results differ slightly if you change the learning rate, batch size, activation or loss functions, initial weights, whether inputs are created randomly or in a fixed order, etc. but usually a network gets stuck now and then.
Related
Just some general questions about training. I used a convolutional neural network for binary classification of text on a dataset of about 10,000 samples. The dataset was pretty unbalanced with about 80% of the samples in class 1. The very last image shows a model on a balanced dataset of about a few million samples doing a 14-way classification task. All models use nn.ClassNLLCriterion, momentum of 0.9, dropout, and weight decay of 0.00001:
Here's the code for more details
For loss, I got values over 1 for validation. How bad is that? Is a loss over one big or is it reasonable?
Is the error y-axis unit usually in percent? So here, for the error, would it range from 0% to 0.16% or is it 0% to 16%?
For the graphs below, the loss and error graphs look about the same shape. In general, should the loss and error graphs always have the same shape?
Are the error and loss usually on different scales?
I have a dataset containing 1100 samples, of which I have used 75% for training, 15% for testing and 15% for validation. The problem is that every time I train the network on the same training set I get very different results. Is there any standard rule for choosing the best result, or for deciding at which stage I should stop training to get the minimum error?
Normally, if you are using a neural network, you should not get very different results between different runs on the same training set. So, first of all, check that your algorithm is working correctly using some standard benchmark problems (like Iris/Wisconsin from the UCI repository).
Regarding when to stop the training, there are two options:
1. When the training set error falls below a threshold
2. When the validation set error starts increasing
Case (1) is clear, as the training error generally keeps decreasing. For case (2), however, there is no absolute criterion, as the validation error might fluctuate during training. So just plot it to see how it behaves, and then set a threshold depending on your observations (for example, stop when its value becomes 10% larger than the minimum value it reached during training).
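The example rule for case (2), stopping once the validation error is 10% above its running minimum, could be sketched like this. The function name and the sample error values are made up for illustration.

```python
def early_stop_epoch(val_errors, tolerance=0.10):
    """Return the first epoch whose validation error exceeds the best error
    seen so far by more than `tolerance` (as a fraction), or None if the
    rule never triggers."""
    best = float("inf")
    for epoch, err in enumerate(val_errors):
        best = min(best, err)
        if err > best * (1.0 + tolerance):
            return epoch
    return None

if __name__ == "__main__":
    # Hypothetical validation errors, one per epoch.
    errors = [0.50, 0.40, 0.30, 0.31, 0.34, 0.40]
    print(early_stop_epoch(errors))
```

In this made-up sequence the minimum is 0.30, so the rule triggers at the first value above 0.33. The tolerance is a judgment call: a noisy validation curve needs a larger value to avoid stopping too early.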
I am training a neural network with 1 sigmoid hidden layer and a linear output layer. The network simply approximates a cosine function. The weights are initialized according to Nguyen-Widrow initialization and the biases are initialized to 1. I am using MATLAB as a platform.
Running the network a number of times without changing any parameters, I am getting results (mean squared error) which range from 0.5 to 0.5*10^-6. I cannot understand how the results can vary that much; I'd expect at least a narrower and more consistent window of errors.
What could be causing such a big variance?
I am training the neural network with an input matrix of 85*650 and a target matrix of 26*650. Here is the list of parameters that I have used:
net.trainParam.max_fail = 6;
net.trainParam.min_grad=1e-5;
net.trainParam.show=10;
net.trainParam.lr=0.9;
net.trainParam.epochs=13500;
net.trainParam.goal=0.001;
Number of hidden nodes=76
As you can see, I have set the number of epochs to 13500. Is it OK to set the number of epochs to such a large number? The performance goal is not reached if the number of epochs is decreased, and I get bad classification results while testing.
Try not to focus on the number of epochs. Instead, you should have at least two sets of data: one for training and another for testing. Use the testing set to get a feel for how well your ANN is performing and how many epochs are needed to get a decent ANN.
For example, you want to stop training when performance on your testing set has levelled off or has begun to get worse. This would be evidence of over-learning, which is the reason why more epochs is not always better.
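One way to make "levelled off or begun to get worse" concrete is a patience counter: stop once the held-out error has failed to improve for a fixed number of consecutive epochs. The sketch below is a generic illustration with hypothetical names and made-up error values, not MATLAB toolbox code.

```python
def stop_with_patience(held_out_errors, patience=6):
    """Scan per-epoch held-out errors and return (best_epoch, stop_epoch).

    stop_epoch is the epoch at which the error has failed to improve for
    `patience` consecutive epochs, or None if that never happens.
    """
    best_err = float("inf")
    best_epoch = None
    fails = 0
    for epoch, err in enumerate(held_out_errors):
        if err < best_err:
            best_err, best_epoch, fails = err, epoch, 0
        else:
            fails += 1
            if fails >= patience:
                return best_epoch, epoch
    return best_epoch, None

if __name__ == "__main__":
    # Hypothetical held-out errors, one per epoch.
    errors = [0.9, 0.5, 0.4, 0.41, 0.42, 0.45]
    print(stop_with_patience(errors, patience=3))
```

With a rule like this, the epoch limit becomes a safety cap rather than the thing that decides when training ends, and the weights worth keeping are those from `best_epoch`.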
Here is the problem: I've trained an ANN using the MSE stop function up to a "desired error" of 10^-5 (5MB of training data, 15000 input items, long training period of about a day). I got 0 bit fail during training. I saved the ANN to a file.
Then I loaded the net from the file and checked it on the same training data. Sometimes I get a bit fail of up to 5 (and not so seldom, BTW!).
What is this? Has anybody encountered such a phenomenon?
I suspect this is a rounding artefact: many thousands of weights are saved to the file in text format and loaded back...
Solved.
The MSE after fann_reset_MSE() and fann_test_data() has no relation to the error returned by fann_train(). If the ANN is trained to a very low MSE, then fann_get_MSE() and fann_get_bit_fail() are more or less in agreement with the values returned by these functions after fann_reset_MSE() and fann_test_data(). If not (the ANN is not trained well), these values might differ by orders of magnitude.