backpropagation neural network with discrete output - neural-network

I am working through the XOR example with a three-layer backpropagation network. When the output layer has a sigmoid activation, an input of (1,0) might give 0.99 for a desired output of 1, and an input of (1,1) might give 0.01 for a desired output of 0.
But what if I want the output to be discrete, either 0 or 1? Do I simply set a threshold in between, at 0.5? Would this threshold need to be trained like any other weight?

Well, you can of course put a threshold after the output neuron that maps values above 0.5 to 1 and, vice versa, values below 0.5 to 0. But I suggest not hiding the continuous output behind a discretization threshold: an output of 0.4 is less "zero" than a value of 0.001, and this difference can give you useful information about your data.

Do the training without a threshold, i.e., compute the error on an example using what the neural network actually outputs, without thresholding it.
Another small detail: are you using a transfer function such as the sigmoid? The sigmoid returns values in [0, 1], but 0 and 1 are asymptotes, i.e., the function can come arbitrarily close to those values but never reach them. A consequence is that your neural network can never output exactly 0 or 1! Multiplying the sigmoid by a factor a little above 1 can correct this. This and some other practical aspects of backpropagation are discussed here: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
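A minimal sketch of both suggestions (the function names and the scaling factor are illustrative, not from the answer above): train on the raw sigmoid output, threshold only when a discrete prediction is needed, and scale the sigmoid around its midpoint so the targets 0 and 1 stop being unreachable asymptotes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scaled_sigmoid(x, a=1.05):
    # Scale around the midpoint 0.5 so both targets, 0 and 1, lie strictly
    # inside the reachable output range; any factor a little above 1 works.
    return a * (sigmoid(x) - 0.5) + 0.5

def predict_discrete(raw_output, threshold=0.5):
    # The threshold is applied only at prediction time, never during training.
    return 1 if raw_output >= threshold else 0

raw = scaled_sigmoid(2.3)              # continuous output, roughly 0.93
print(raw, predict_discrete(raw))      # keep the raw value: it carries confidence
```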


Thresholding for neural networks

I implemented a neural network for an AND gate with 2 input units, 2 hidden units, and 1 output unit.
I trained the neural network using 40 inputs for 200 epochs with a learning rate of 0.03.
When I test the trained neural network on the AND inputs, it gives me outputs like:
0,0 = 0.295 (0 expected)
0,1 = 0.355 (0 expected)
1,0 = 0.329 (0 expected)
1,1 = 0.379 (1 expected)
This is not the expected output from the network. But if I set a threshold of 0.36, mapping all values above 0.36 to 1 and the rest to 0, the network's output is exactly as expected every time.
My question is: is applying a threshold to the output of the network necessary in order to generate the expected outputs, as in my case?
A threshold is not necessary, but it can help produce a cleaner classification.
In your case, you could set a threshold of 0.1 for 0 and 0.9 for 1: when the output is below 0.1, consider it a 0, and when it is above 0.9, consider it a 1 (see the sketch just below).
Setting a threshold of 0.36 just because it happens to work on this test example, however, is a really bad idea: 0.36 is far from the desired output of 1, and it may (in fact, will) not work on all your test data.
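Here is a small sketch of that rule; the 0.1 and 0.9 cut-offs are the ones suggested above, and the `None` return for the middle range is just one way to flag uncertainty:

```python
def classify(output, low=0.1, high=0.9):
    # Below `low` counts as class 0, above `high` as class 1; anything in
    # between is left undecided instead of being forced into a class.
    if output < low:
        return 0
    if output > high:
        return 1
    return None  # uncertain: the network has not separated this input well

for y in (0.02, 0.355, 0.95):
    print(y, "->", classify(y))
```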
Instead, you should suspect a problem with your code. This is not the initial question, but here are some ideas:
1. Look at your training accuracy at each epoch. If the network learns slowly, increase your learning rate, and maybe decrease it again after a few epochs.
2. If the accuracy doesn't change at all, check your backpropagation algorithm.
3. Look at your dataset and make sure the inputs and outputs are correct.
4. Make sure your weights are initialized randomly.
5. An AND gate can be solved by a linear NN with no hidden layer. Maybe try removing your hidden layer? (A minimal example is sketched after this list.)
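To illustrate point 5, here is a minimal sketch (assuming nothing about your actual code) of a single linear threshold unit trained with the plain perceptron rule. It learns AND easily, so if this works while your multi-layer version does not, the bug is likely in the backpropagation code:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])               # AND targets

rng = np.random.default_rng(0)
w = rng.normal(size=2)                   # random initialization (point 4)
b = 0.0
lr = 0.1

for epoch in range(50):
    for x, target in zip(X, t):
        y = 1 if x @ w + b > 0 else 0    # hard-threshold unit, no hidden layer
        w += lr * (target - y) * x       # perceptron learning rule
        b += lr * (target - y)

print([1 if x @ w + b > 0 else 0 for x in X])   # expected: [0, 0, 0, 1]
```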

How can a well trained ANN have a single set of weights that can represent multiple classes?

In multinomial classification, I'm using the softmax activation function for all non-linear units, and the ANN has k output nodes for k classes. Each of the k output nodes in the output layer is connected to all the weights in the preceding layer, somewhat like the one shown below.
So, if the first output node intends to pull the weights in its favor, it will change all the weights that precede this layer, and the other output nodes will also pull, usually in directions that contradict the first one. It seems like a tug of war over a single set of weights. So, do we need a separate set of weights (including weights for every node of every layer) for each of the output classes, or is there a different architecture for this? Please correct me if I'm wrong.
Each node has its own set of weights. Implementations and formulas usually use matrix multiplications, which can obscure the fact that, conceptually, each node has its own set of weights, but it does.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
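As a quick illustration (shapes chosen arbitrarily): in the usual matrix formulation, each row of the weight matrix is one neuron's private weight vector, so a single matrix really is k independent sets of weights.

```python
import numpy as np

num_inputs, num_neurons = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(num_neurons, num_inputs))  # row i = weights of neuron i
x = rng.normal(size=num_inputs)

layer_out = W @ x                # the whole layer in one matrix product
neuron_0  = W[0] @ x             # the same computation, one neuron at a time
assert np.isclose(layer_out[0], neuron_0)
```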
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: the weights all increase or decrease so that each neuron's output approaches its correct value (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.
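As a sketch of the output-layer step described above (assuming squared error, a linear output layer, and made-up hidden activations), note how row i of the gradient touches only output neuron i's weights:

```python
import numpy as np

y      = np.array([1.0, 0.0, 0.0, 0.0])      # target from the example
y_hat  = np.array([0.88, 0.12, 0.04, 0.5])   # actual output
hidden = np.array([0.3, 0.7, 0.5])           # hypothetical hidden activations
lr     = 0.1

delta = y_hat - y                            # [-0.12, 0.12, 0.04, 0.5]

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))                  # output weights, one row per neuron
W -= lr * np.outer(delta, hidden)            # row i moves only with delta[i]
```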

Why MATLAB neural network classification returns decimal values

I have an input dataset (a 25x1575 matrix) which is normalized to values between 0 and 1.
I also have a binary-formatted output matrix (9x1575), like 0 0 0 0 0 0 0 0 1, 1 0 0 1 1 1 0 0 1, ...
I imported both files into MATLAB's nntool and it automatically created a network with 25 input and 9 output nodes, as I wanted.
After I trained this network using feed-forward backpropagation, I tested the model on its training data, and each output node returns a decimal value like (-0.1978 0.45913 0.12748 0.25072 0.45199 0.59368 0.38359 0.31435 1.0604).
Why doesn't it return discrete values like 1 0 0 1 1 1 0 0 1?
Is there anything I must set in nntool to get such values?
Depending on the nature of the neurons, the output can be anything. The most popular neuron types are linear, sigmoid (range [0, 1]), and hyperbolic tangent (range [-1, 1]). The first can output any value. The latter two can approximate a step function (i.e., binary behavior), but it is up to the end user (you) to define the cut-off value for that translation.
You didn't say which neurons you use, but you should definitely read more about how neural networks are implemented and how they work. You may start with this video and then read Artificial Neural Networks for Beginners by C. Gershenson.
UPDATE: You say that you use tanh-sigmoid neurons and wonder why you don't get values very close to -1 or to 1.
The output of a tanh neuron is the hyperbolic tangent of the sum of all its inputs, so every value between -1 and 1 is possible. What determines the "steepness" of the output (in other words, the proportion of intermediate values) is the outputs of the preceding neurons and their weights, which in turn depend on the outputs of their preceding neurons and their weights, and so on. It is up to the learning algorithm to find the set of weights that minimizes a predefined scoring function for a given input. In a typical setup, the scoring function compares the neural network's output to a set of desired results and returns a single number indicating how different the actual and desired outputs are.
Before using a NN you have to do some homework. At a minimum, you have to decide what your goal is, how you interpret the NN output, how you measure NN performance, and how you update the weights.
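For instance, a minimal scoring function of the kind described above is the mean squared error; it compares the network's actual output to the desired output and reduces the difference to a single number (the values here are taken from the question):

```python
import numpy as np

def mse(actual, desired):
    # One number summarizing how far the actual outputs are from the targets.
    return np.mean((np.asarray(actual) - np.asarray(desired)) ** 2)

desired = [1, 0, 0, 1, 1, 1, 0, 0, 1]
actual  = [-0.1978, 0.45913, 0.12748, 0.25072, 0.45199,
           0.59368, 0.38359, 0.31435, 1.0604]
print(mse(actual, desired))
```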

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network with 4 input neurons, 1 hidden layer of 20 neurons, and a 7-neuron output layer.
I'm trying to train it on a BCD-to-7-segment problem. My data is normalized: 0 is -1 and 1 is 1.
When the output error is evaluated, the neuron saturates at the wrong extreme. If the desired output is 1 and the real output is -1, the error is 1 - (-1) = 2.
When I multiply it by the derivative of the activation function, error * (1 - output) * (1 + output), the error becomes almost 0, because 2 * (1 - (-1)) * (1 + (-1)) = 0.
How can I avoid this saturation error?
Saturation at the asymptotes of the activation function is a common problem with neural networks. If you look at a graph of the function, it isn't surprising: the curves are almost flat there, meaning the first derivative is (almost) 0, so the network cannot learn any more.
A simple solution is to scale the activation function to avoid this problem. For example, with the tanh() activation function (my favorite), it is recommended to use the following when the desired outputs are in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1 - tanh(2/3 * x)^2)
This will keep the gradients in the most non-linear value range and speed up learning. For all the details I recommend reading Yann LeCun's great paper Efficient BackProp.
In the case of this tanh() activation function, the error term is calculated as
error = 2/3 * (1.7159 - output^2 / 1.7159) * (teacher - output)
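A small sketch of these formulas in code (the input value is arbitrary); note that the derivative stays well away from zero even when the output is near a target of -1 or 1, which is exactly what avoids the saturation from the question:

```python
import numpy as np

def f(x):
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def f_prime(x):
    return 1.14393 * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)

def output_error(y, t):
    # Error term for an output y = f(x) with teacher value t.
    return 2.0 / 3.0 * (1.7159 - y ** 2 / 1.7159) * (t - y)

x = 1.0
print(f(x), f_prime(x), output_error(f(x), 1.0))
```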
This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of the two extremes. It's been a while since I worked with artificial neural networks, but if I remember correctly, this (among many other things) is one of the limitations of the simple backpropagation algorithm.
You could add a momentum factor to make sure there is some correction based on previous updates, even when the derivative is zero (a sketch appears at the end of this answer).
You could also train by epoch, accumulating the delta values for the weights before doing the actual update (as opposed to updating on every iteration). This also mitigates conditions where the delta values oscillate between two values.
There may be more advanced methods, such as second-order methods for backpropagation, that mitigate this particular problem.
However, keep in mind that tanh only reaches -1 or +1 in the limit, so in that sense the problem is purely theoretical.
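As a sketch of the momentum factor suggested above (the learning rate and momentum coefficient are illustrative), each weight keeps a running velocity, so an update still happens even on a step where the gradient is exactly zero:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    # Blend the previous update direction with the current gradient.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w, v = np.zeros(3), np.zeros(3)
grads = [np.array([1.0, -2.0, 0.5]), np.zeros(3)]   # second step: zero gradient
for g in grads:
    w, v = momentum_step(w, g, v)
print(w)   # the weights still moved on the zero-gradient step
```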
I'm not totally sure I'm reading the question correctly, but if so, you should scale your inputs and targets to between -0.9 and 0.9, which would keep your derivatives in a saner range.

NN activation function

I need a reliable substitute for the sigmoid activation function. The problem with the sigmoid is that its output is between 0 and 1; I need an activation function with output between 0 and 255. I'm training the NN with the backpropagation algorithm. If I use some other function, do I need to tweak the learning algorithm?
The simplest solution to your problem is to scale your data. Divide the outputs of your training set by 255 during training, and when you use the trained model, multiply the output of your neural network by 255. This way you don't have to change the gradient calculation.
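A sketch of that rescaling (the function names are illustrative): divide targets by 255 before training, multiply predictions by 255 afterwards, and leave the network itself untouched:

```python
def to_training_scale(target_0_255):
    return target_0_255 / 255.0      # targets now live in [0, 1] for the sigmoid

def to_output_scale(net_output_0_1):
    return net_output_0_1 * 255.0    # map the network's output back to 0-255

print(to_training_scale(200))        # 0.784..., what the loss sees
print(to_output_scale(0.784))        # ~200, what the user sees
```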
You can easily achieve that by multiplying the output by 255, which moves the data from the 0-to-1 scale to the 0-to-255 scale.
If you change the activation function itself, you'll definitely need to change the calculations as well. The backpropagation algorithm uses gradient descent, so you'll need to incorporate the derivative of the new activation function accordingly.