Neural Network XOR not converging

I have tried to implement a neural network in Java by myself to act as an XOR gate, and it has kind of worked. About ~20% of the time when I try to train it, the weights converge to produce a good enough output (RMS < 0.05), but the other 80% of the time they don't.
The neural network (can be seen here) is composed of 2 inputs (+ 1 bias), 2 hidden (+ 1 bias) and 1 output unit. The activation function I used was the sigmoid function
1 / (1 + e^(-x))
which maps the input values to between 0 and 1. The learning algorithm used is stochastic gradient descent with RMS as the cost function. The bias neurons have a constant output of 1. I have tried changing the learning rate between 0.1 and 0.01, and that doesn't seem to fix the problem.
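For reference, here is a minimal numpy sketch of the setup just described (2-2-1 sigmoid network with bias units, squared-error cost, per-example SGD); this is only an illustration of the architecture, not my actual Java code, and the learning rate and initial weight range are assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

W1 = rng.uniform(-1.0, 1.0, (2, 3))   # hidden layer: 2 units, 2 inputs + bias
W2 = rng.uniform(-1.0, 1.0, (1, 3))   # output layer: 1 unit, 2 hidden + bias
lr = 0.1

for epoch in range(20000):
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                      # bias input fixed at 1
        h = sigmoid(W1 @ xb)
        hb = np.append(h, 1.0)                      # bias for the hidden layer
        y = sigmoid(W2 @ hb)[0]

        delta_out = (y - t) * y * (1.0 - y)         # squared-error gradient at the output
        delta_hid = delta_out * W2[0, :2] * h * (1.0 - h)
        W2 -= lr * delta_out * hb
        W1 -= lr * np.outer(delta_hid, xb)

outputs = [sigmoid(W2 @ np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0))[0] for x in X]
rms = np.sqrt(np.mean((np.array(outputs) - T) ** 2))
print(outputs, rms)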
I have the network track the weights and RMS and plotted them on a graph. There are basically three different behaviours the weights can show. I can only post one of the three:
Two (or more) weights diverge in different directions.
Of the other two behaviours, one is the weights simply converging to a good value, and the other is a random wiggle of one weight.
I don't know if this is just something that happens or if there is some way to fix it, so please tell me if you know anything.

Related

Can't approximate simple multiplication function in neural network with 1 hidden layer

I just wanted to test how well a neural network can approximate the multiplication function (a regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer
(I have tested 5/30/100 neurons per hidden layer), no normalization, and default parameters:
learning rate 0.005, number of learning iterations 200, initial learning weights 0.1,
momentum 0 [description]. I got extremely bad accuracy, close to 0.
At the same time, boosted decision forest regression shows a very good approximation.
What am I doing wrong? This task should be very easy for a NN.
The large gradient of the multiplication function probably forces the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before learning and multiply back afterwards.
2) Use log-normalization. It turns multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
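A minimal sketch of the log-normalization idea in Python/Keras (layer sizes, optimizer and value ranges are just illustrative assumptions): train on ln(x) and ln(y) against ln(x*y), then exponentiate the prediction.

import numpy as np
from keras import layers
from keras import models

# inputs must be positive for the log trick: m = x*y  =>  ln(m) = ln(x) + ln(y)
data = np.random.uniform(0.1, 10.0, (10000, 2))
targets = data[:, 0] * data[:, 1]

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1))                                   # linear output for regression
model.compile(optimizer='adam', loss='mse')

# train in log space, then undo the log on the predictions
model.fit(np.log(data), np.log(targets), epochs=10, batch_size=32, verbose=0)
pred = np.exp(model.predict(np.log(np.array([[3.0, 4.0]]))))
print(pred)                                                  # should be close to 12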
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models

# 2 inputs -> 150 ReLU hidden units -> 1 ReLU output
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))

# 10000 random pairs in [0, 1) and their products as targets
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])

model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict(np.array([[0.8, 0.5]]))
It works.
"Two approaches: divide by constant, or make log normalization"
I've tried both approaches. Certainly, log normalization works since, as you rightly point out, it forces an implementation of addition. Dividing by a constant -- or similarly normalizing across any range -- does not seem to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently the sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find an approximation to any function. But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely up to a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0
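The gist itself is not reproduced here, but a sketch along the same lines (one sigmoid hidden layer, linear output, inputs restricted to [0, 1]; all of these choices are illustrative assumptions) could look like this:

import numpy as np
from keras import layers
from keras import models

# inputs restricted to [0, 1]; sigmoid hidden layer, linear output
x = np.random.random((20000, 2))
y = x[:, 0] * x[:, 1]

model = models.Sequential()
model.add(layers.Dense(50, activation='sigmoid', input_shape=(2,)))
model.add(layers.Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=20, batch_size=64, verbose=0)

print(model.predict(np.array([[0.8, 0.5]])))   # should be close to 0.4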

MNIST - Training stuck

I'm reading Neural Networks and Deep Learning (first two chapters), and I'm trying to follow along and build my own ANN to classify digits from the MNIST data set.
I've been scratching my head for several days now, since my implementation peaks out at ~57% accuracy at classifying digits from the test set (some 5734/10000) after 10 epochs (accuracy for the training set stagnates after the tenth epoch, and accuracy for the test set deteriorates presumably because of over-fitting).
I'm using nearly the same configuration as in the book: 2-layer feedforward ANN (784-30-10) with all layers fully connected; standard sigmoid activation functions; quadratic cost function; weights are initialized the same way (taken from a Gaussian distribution with mean 0 and standard deviation 1)
The only differences are that I'm using online training instead of batch/mini-batch training, and a learning rate of 1.0 instead of 3.0 (I have tried mini-batch training with a learning rate of 3.0, though).
And yet, my implementation doesn't get past 60% accuracy after a bunch of epochs, whereas in the book the ANN goes above 90% just after the first epoch with pretty much the exact same configuration.
At first I messed up implementing the backpropagation algorithm, but after reimplementing backpropagation differently three times, with exactly the same results each time, I'm stumped...
An example of the results the backpropagation algorithm is producing:
With a simpler feedforward network with the same configuration mentioned above (online training + learning rate of 1.0): 3 input neurons, 2 hidden neurons and 1 output neuron.
The initial weights are initialized as follows:
Layer #0 (3 neurons)
Layer #1 (2 neurons)
- Neuron #1: weights=[0.1, 0.15, 0.2] bias=0.25
- Neuron #2: weights=[0.3, 0.35, 0.4] bias=0.45
Layer #2 (1 neuron)
- Neuron #1: weights=[0.5, 0.55] bias=0.6
Given an input of [0.0, 0.5, 1.0], the output is 0.78900331.
Backpropagating for the same input and with the desired output of 1.0 gives the following partial derivatives (dw = derivative wrt weight, db = derivative wrt bias):
Layer #0 (3 neurons)
Layer #1 (2 neurons)
- Neuron #1: dw=[0, 0.0066968054, 0.013393611] db=0.013393611
- Neuron #2: dw=[0, 0.0061298212, 0.012259642] db=0.012259642
Layer #2 (1 neuron)
- Neuron #1: dw=[0.072069918, 0.084415339] db=0.11470326
Updating the network with those partial derivatives yields a corrected output value of 0.74862305.
If anyone would be kind enough to confirm the above results, it would help me tremendously as I've pretty much ruled out backpropagation being faulty as the reason for the problem.
Did anyone tackling the MNIST problem ever come across this problem?
Even suggestions for things I should check would help since I'm really lost here.
Doh..
Turns out nothing was wrong with my backpropagation implementation...
The problem was that I read the images into a signed char (in C++) array, and the pixel values overflowed, so that when I divided by 255.0 to normalize the input vectors into the range of 0.0-1.0, I actually got negative values... ;-;
So basically I spent some four days debugging and reimplementing the same thing when the problem was somewhere else entirely.
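For anyone curious, the same overflow is easy to reproduce in Python/numpy (int8 standing in for C++'s signed char; this is just an illustration, not the actual C++ code):

import numpy as np

pixels = np.array([0, 100, 200, 255], dtype=np.uint8)   # raw MNIST pixel values
signed = pixels.astype(np.int8)                         # like storing them in a C++ signed char
print(signed)            # [  0  100  -56   -1] -- values above 127 wrap around
print(signed / 255.0)    # normalization now produces negative "intensities"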

Why does my neural network trained on MNIST data set not predict 7 and 9 correctly?

I'm using Matlab ( github code repository ). The details of the network are:
Hidden units: 100 ( variable )
Epochs : 500
Batch size: 100
The weights are being updated using Back propagation algorithm.
I've been able to recognize 0, 1, 2, 3, 4, 5, 6 and 8, which I have drawn in Photoshop.
However, 7 and 9 are not recognized, even though on the test set I get only 749/10000 wrong and it correctly classifies 9251/10000.
Any idea what might be wrong? It is learning, and based on the test set results it's learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);
outputWeights = rand(outputVectorSize,hiddenUnits);
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
outputWeights = outputWeights./size(outputWeights, 2);
This will make your weights very small I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
which will generate random numbers in [-1, 1]. You can then try scaling this interval down to get smaller weights (try dividing by the square root of the layer size).
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is done if you use the square loss function. For log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big of a difference, but you might want to try.
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer, except the output layer, should have a neuron that only returns the value 1 to the next layer. You can implement this by adding another feature to your input data that is always 1.
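For illustration, adding the bias as an always-one input feature might look like this (Python/numpy here just for illustration; the array shapes are made up):

import numpy as np

X = np.random.random((100, 784))                    # 100 examples, 784 pixel features
X_bias = np.hstack([X, np.ones((X.shape[0], 1))])   # extra feature that is always 1
print(X_bias.shape)                                 # (100, 785)
# the weight attached to this extra feature then acts as the neuron's bias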
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic and small, with no regularization, no bias neurons, and no improvements over classic gradient descent. It's not surprising that it's not working too well: you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but this picture of KNN-nearest digits shows that some MNIST digits are simply hard to recognize:

Conceptual issues on training neural network wih particle swarm optimization

I have a 4-input, 3-output neural network trained by particle swarm optimization (PSO) with mean squared error (MSE) as the fitness function, using the IRIS database provided by MATLAB. The fitness function is evaluated 50 times. The experiment is to classify features. I have a few doubts:
(1) Does the PSO iterations/generations = number of times the fitness function is evaluated?
(2) In many papers I have seen the training curve of MSE vs. generations being plotted. In the picture, graph (a) on the left is a model similar to a NN: a 4 input-0 hidden layer-3 output cognitive map. Graph (b) is a NN trained by the same PSO. The purpose of that paper was to show the effectiveness of the new model in (a) over a NN.
But they mention that the experiment is conducted with, say, Cycles = 100 and Generations = 300. In that case, shouldn't the training curves for (a) and (b) be MSE vs. cycles rather than MSE vs. PSO generations? For example, Cycle 1: PSO iterations 1-50 --> Result (Weights_1, Bias_1, MSE_1, Classification Rate_1). Cycle 2: PSO iterations 1-50 --> Result (Weights_2, Bias_2, MSE_2, Classification Rate_2), and so on for 100 cycles. Why is the X axis in (a) and (b) different, and what does it mean?
(3) Lastly, for every independent run of the program (running the m-file several times independently from the console), I never get the same classification rate (CR) or the same set of weights. Concretely, when I first run the program I get weights W and CR = 100%. When I run the MATLAB code again, I may get CR = 50% and another set of weights, as shown below for an example:
%Run1 (PSO generations 1-50)
>>PSO_NN.m
Correlation =
0
Classification rate = 25
FinalWeightsBias =
-0.1156 0.2487 2.2868 0.4460 0.3013 2.5761
%Run2 (PSO generations 1-50)
>>PSO_NN.m
Correlation =
1
Classification rate = 100
%Run3 (PSO generations 1-50)
>>PSO_NN.m
Correlation =
-0.1260
Classification rate = 37.5
FinalWeightsBias =
-0.1726 0.3468 0.6298 -0.0373 0.2954 -0.3254
What is the correct method here? Which weight set should I finally take, and how do I say that the network has been trained? I am aware that evolutionary algorithms, due to their randomness, will never give the same answer, but then how do I ensure that the network has been trained?
I shall be obliged for clarification.
As in most machine learning methods, the number of iterations in PSO is the number of times the solution is updated. In the case of PSO, this is the number of update rounds over all particles. The cost function is evaluated after every particle update, so it is called more often than the number of iterations: approximately, (# cost function calls) = (# iterations) * (# particles).
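A bare-bones sketch of that bookkeeping (plain Python with a placeholder cost function; none of the numbers or names below are from your actual setup):

import numpy as np

def cost(w):                          # placeholder fitness, e.g. the MSE of a network with weights w
    return np.sum(w ** 2)

n_particles, n_iterations, dim = 30, 50, 6
rng = np.random.default_rng(0)
pos = rng.uniform(-1, 1, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[np.argmin(pbest_cost)]

evaluations = n_particles             # the initial sweep above
for _ in range(n_iterations):
    for i in range(n_particles):
        r1, r2 = rng.random(dim), rng.random(dim)
        vel[i] = 0.7 * vel[i] + 1.5 * r1 * (pbest[i] - pos[i]) + 1.5 * r2 * (gbest - pos[i])
        pos[i] += vel[i]
        c = cost(pos[i])              # one cost evaluation per particle per iteration
        evaluations += 1
        if c < pbest_cost[i]:
            pbest[i], pbest_cost[i] = pos[i].copy(), c
    gbest = pbest[np.argmin(pbest_cost)]

print(evaluations)                    # roughly n_iterations * n_particles (+ initial sweep)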
The graphs here are comparing different classifiers, fuzzy cognitive maps for graph (a) and a neural network for graph (b). So the X-axis displays the relevant measures of learning iterations for each.
Every time you run your NN you initialize it with different random values, so the results are never the same. The fact that the results vary greatly from one run to the next means you have a convergence problem. The first thing to do in this case is to try running more iterations. In general convergence is a rather complicated issue and the solutions vary greatly with applications (and read carefully through the answer and comments Isaac gave you on your other question). If your problem persists after increasing the number of iterations, you can post it as a new question, providing a sample of your data and the actual code you use to construct and train the network.

Neural Network with softmax activation

edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain about softmax (or whether I can use gradient descent at all). After a bit of research, I couldn't find the answer and decided to compute the derivative myself, obtaining softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit; the derivative of softmax provided by the toolkit returns a square matrix of size n x n, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret that matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is done incorrectly, but I have no idea how to fix it.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
E = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn})
where log is the natural logarithm, N denotes the number of training examples and K the number of classes (and thus units in the output layer). t_{kn} denotes the binary coding (0 or 1) of the k-th class in the n-th training example, and y_{kn} the corresponding network output.
Showing that the gradient is correct might be a good exercise, I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2)
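As an illustration (plain numpy, a toy example rather than your network), you can check numerically that the gradient of this error with respect to the softmax pre-activations is indeed output - target:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8, 0.1, 2.0])     # pre-activations of the 5 output units
t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])      # one-hot target

analytic = softmax(z) - t                    # claimed gradient: output - target

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, t) - cross_entropy(zm, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # should be tiny (~1e-9 or smaller)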
Please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, the /mydesire/neural folder has several softmax classifiers, some with a softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced Dynamic-System Simulation, Wiley 2007
GAK
look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
the softmax derivative is: dy_i/dz_i = y_i * (1.0 - y_i);
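Note that this expression is only the diagonal of the softmax Jacobian; the full n x n matrix returned by the MATLAB toolkit also contains the off-diagonal terms -y_i * y_j. A small numpy illustration (my own sketch, not the toolkit's code):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.3, -1.2, 0.8, 0.1, 2.0])
y = softmax(z)

# J[i, j] = dy_i/dz_j = y_i * (delta_ij - y_j)
J = np.diag(y) - np.outer(y, y)

print(np.allclose(np.diag(J), y * (1.0 - y)))   # True: the diagonal matches y_i * (1 - y_i)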