Weights in feed-forward backpropogation ANN not changing - neural-network

I am designing a Feed-Forward BackPropogation ANN with 22 inputs and 1 output (either a 1 or 0). The NN has 3 layers and is using 10 hidden neurons. When I run the NN it only changes the weights a tiny bit and the total error for the output is about 40%. Intially, I thought it was over/under fitting but after I changed the number of hidden neurons, nothing changed.
N is the number of inputs (22)
M is the number of hidden neurons (10)
This is the code that I am using to backpropagate
oin is the output calculated before putting into sigmoid function
oout is the output after going through sigmoid function
double odelta = sigmoidDerivative(oin) * (TARGET_VALUE1[i] - oout);
double dobias = 0.0;
double doweight[] = new double[m];
for(int j = 0; j < m; j++)
{
doweight[j] = (ALPHA * odelta * hout[j]) + (MU * (oweight[j] - oweight2[j]));
oweight2[j] = oweight[j];
oweight[j] += doweight[j];
} // j
dobias = (ALPHA * odelta) + (MU * (obias - obias2));
obias2 = obias;
obias += dobias;
updateHidden(N, m, odelta);
This is the code I am using to change the hidden neurons.
for(int j = 0; j < m; j++)
{
hdelta = (d * oweight[j]) * sigmoidDerivative(hin[j]);
for(int i = 0; i < n; i++)
{
dhweight[i][j] = (ALPHA * hdelta * inputNeuron[i]) + (MU * (hweight[i][j] - hweight2[i][j]));
hweight2[i][j] = hweight[i][j];
hweight[i][j] += dhweight[i][j];
}
dhbias[j] = (ALPHA * hdelta) + (MU * (hbias[j] - hbias2[j]));
hbias2[j] = hbias[j];
hbias[j] += dhbias[j];
} `

You are learning your network to output on one node two classes. the weights connected to this network are adapting to predict a single class then another. so most of the time your weights are adapted to the dominate class in your data. to avoid having this problem add another node to have two nodes on your output each one refer to one class.

Related

Matlab neural network handwritten digit recognition, output going to indifference

Using Matlab I am trying to construct a neural network that can classify handwritten digits that are 30x30 pixels. I use backpropagation to find the correct weights and biases. The network starts with 900 inputs, then has 2 hidden layers with 16 neurons and it ends with 10 outputs. Each output neuron has a value between 0 and 1 that represents the belief that the input should be classified as a certain digit. The problem is that after training, the output becomes almost indifferent to the input and it goes towards a uniform belief of 0.1 for each output.
My approach is to take each image with 30x30 pixels and reshape it to be vectors of 900x1 (note that 'Images_vector' is already in the vector format when it is loaded). The weights and biases are initiated with random values between 0 and 1. I am using stochastic gradiënt descent to update the weights and biases with 10 randomly selected samples per batch. The equations are as described by Nielsen.
The script is as follows.
%% Inputs
numberofbatches = 1000;
batchsize = 10;
alpha = 1;
cutoff = 8000;
layers = [900 16 16 10];
%% Initialization
rng(0);
load('Images_vector')
Images_vector = reshape(Images_vector', 1, 10000);
labels = [ones(1,1000) 2*ones(1,1000) 3*ones(1,1000) 4*ones(1,1000) 5*ones(1,1000) 6*ones(1,1000) 7*ones(1,1000) 8*ones(1,1000) 9*ones(1,1000) 10*ones(1,1000)];
newOrder = randperm(10000);
Images_vector = Images_vector(newOrder);
labels = labels(newOrder);
images_training = Images_vector(1:cutoff);
images_testing = Images_vector(cutoff + 1:10000);
w = cell(1,length(layers) - 1);
b = cell(1,length(layers));
dCdw = cell(1,length(layers) - 1);
dCdb = cell(1,length(layers));
for i = 1:length(layers) - 1
w{i} = rand(layers(i+1),layers(i));
b{i+1} = rand(layers(i+1),1);
end
%% Learning process
batches = randi([1 cutoff - batchsize],1,numberofbatches);
cost = zeros(numberofbatches,1);
c = 1;
for batch = batches
for i = 1:length(layers) - 1
dCdw{i} = zeros(layers(i+1),layers(i));
dCdb{i+1} = zeros(layers(i+1),1);
end
for n = batch:batch+batchsize
y = zeros(10,1);
disp(labels(n))
y(labels(n)) = 1;
% Network
a{1} = images_training{n};
z{2} = w{1} * a{1} + b{2};
a{2} = sigmoid(0, z{2});
z{3} = w{2} * a{2} + b{3};
a{3} = sigmoid(0, z{3});
z{4} = w{3} * a{3} + b{4};
a{4} = sigmoid(0, z{4});
% Cost
cost(c) = sum((a{4} - y).^2) / 2;
% Gradient
d{4} = (a{4} - y) .* sigmoid(1, z{4});
d{3} = (w{3}' * d{4}) .* sigmoid(1, z{3});
d{2} = (w{2}' * d{3}) .* sigmoid(1, z{2});
dCdb{4} = dCdb{4} + d{4} / 10;
dCdb{3} = dCdb{3} + d{3} / 10;
dCdb{2} = dCdb{2} + d{2} / 10;
dCdw{3} = dCdw{3} + (a{3} * d{4}')' / 10;
dCdw{2} = dCdw{2} + (a{2} * d{3}')' / 10;
dCdw{1} = dCdw{1} + (a{1} * d{2}')' / 10;
c = c + 1;
end
% Adjustment
b{4} = b{4} - dCdb{4} * alpha;
b{3} = b{3} - dCdb{3} * alpha;
b{2} = b{2} - dCdb{2} * alpha;
w{3} = w{3} - dCdw{3} * alpha;
w{2} = w{2} - dCdw{2} * alpha;
w{1} = w{1} - dCdw{1} * alpha;
end
figure
plot(cost)
ylabel 'Cost'
xlabel 'Batches trained on'
With the sigmoid function being the following.
function y = sigmoid(derivative, x)
if derivative == 0
y = 1 ./ (1 + exp(-x));
else
y = sigmoid(0, x) .* (1 - sigmoid(0, x));
end
end
Other than this I have also tried to have 1 of each digit in each batch, but this gave the same result. Also I have tried varying the batch size, the number of batches and alpha, but with no success.
Does anyone know what I am doing wrong?
Correct me if I'm wrong: You have 10000 samples in you're data, which you divide into 1000 batches of 10 samples. Your training process consists of running over these 10000 samples once.
This might be too little, normally your training process consists of several epochs (one epoch = iterating over every sample once). You can try going over your batches multiple times.
Also for 900 inputs your network seems small. Try it with more neurons in the second layer. Hope it helps!

Building a non-layered LSTM network from scratch, how to do forward pass and backward pass?

I'm building a LSTM network from scratch, from my own understanding of how LSTM cells work.
There are no layers, so I'm trying to implement non-vectorized forms of the equations I see in the tutorials. I'm also using peepholes from the cell state.
So far, I understand that it looks like this: LSTM network
With that I've made these equations for each of the gates for forward pass:
i_t = sigmoid( i_w * (x_t + c_t) + i_b )
f_t = sigmoid( f_w * (x_t + c_t) + f_b )
cell_gate = tanh( c_w * x_t + c_b )
c_t = (f_t * c_t) + (i_t * cell_gate)
o_t = sigmoid( o_w * (x_t + c_t) + o_b )
h_t = o_t * tanh(c_t)
Where _w's mean weights for that respective gate and _b for biases. Also, I've named that first sigmoid on the far left the "cell_gate".
Back pass is where things get fuzzy for me, I'm not sure how to derive these equations correctly.
I know generally to calculate error, the equation is: error = f'(x_t) * (received_error). Where f'(x_t) is the first derivative of the activation function and received_error could be either (target - output) for output neurons or ∑(o_e * w_io) for hidden neurons.
Where o_e is the error of one of the cells the current cell outputs to and w_io is the weight connecting them.
I not sure if the LSTM cell as a whole is considered a neuron, so I treated each of the gates as neurons and tried to calculate error signals for each. Then used the error signal from the cell gate alone to pass back up the network...:
o_e = sigmoid'(o_w * (x_t + c_t) + o_b) * (received_error)
o_w += o_l * x_t * o_e
o_b += o_l * sigmoid(o_b) * o_e
...The rest of the gates follow the same format...
Then the error for the entire LSTM cell is equal to o_e.
Then for a LSTM cell above the current cell, the error it receives is equal to to:
tanh'(x_t) * ∑(o_e * w_io)
Is this all correct? Am I doing anything completely wrong?
I to am taking on this task, I believe your approach is correct:
https://github.com/evolvingstuff/LongShortTermMemory/blob/master/src/com/evolvingstuff/LSTM.java
Some nice work from: Thomas Lahore
//////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////
//BACKPROP
//////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////
//scale partials
for (int c = 0; c < cell_blocks; c++) {
for (int i = 0; i < full_input_dimension; i++) {
this.dSdwWeightsInputGate[c][i] *= ForgetGateAct[c];
this.dSdwWeightsForgetGate[c][i] *= ForgetGateAct[c];
this.dSdwWeightsNetInput[c][i] *= ForgetGateAct[c];
dSdwWeightsInputGate[c][i] += full_input[i] * neuronInputGate.Derivative(InputGateSum[c]) * NetInputAct[c];
dSdwWeightsForgetGate[c][i] += full_input[i] * neuronForgetGate.Derivative(ForgetGateSum[c]) * CEC1[c];
dSdwWeightsNetInput[c][i] += full_input[i] * neuronNetInput.Derivative(NetInputSum[c]) * InputGateAct[c];
}
}
if (target_output != null) {
double[] deltaGlobalOutputPre = new double[output_dimension];
for (int k = 0; k < output_dimension; k++) {
deltaGlobalOutputPre[k] = target_output[k] - output[k];
}
//output to hidden
double[] deltaNetOutput = new double[cell_blocks];
for (int k = 0; k < output_dimension; k++) {
//links
for (int c = 0; c < cell_blocks; c++) {
deltaNetOutput[c] += deltaGlobalOutputPre[k] * weightsGlobalOutput[k][c];
weightsGlobalOutput[k][c] += deltaGlobalOutputPre[k] * NetOutputAct[c] * learningRate;
}
//bias
weightsGlobalOutput[k][cell_blocks] += deltaGlobalOutputPre[k] * 1.0 * learningRate;
}
for (int c = 0; c < cell_blocks; c++) {
//update output gates
double deltaOutputGatePost = deltaNetOutput[c] * CECSquashAct[c];
double deltaOutputGatePre = neuronOutputGate.Derivative(OutputGateSum[c]) * deltaOutputGatePost;
for (int i = 0; i < full_input_dimension; i++) {
weightsOutputGate[c][i] += full_input[i] * deltaOutputGatePre * learningRate;
}
peepOutputGate[c] += CEC3[c] * deltaOutputGatePre * learningRate;
//before outgate
double deltaCEC3 = deltaNetOutput[c] * OutputGateAct[c] * neuronCECSquash.Derivative(CEC3[c]);
//update input gates
double deltaInputGatePost = deltaCEC3 * NetInputAct[c];
double deltaInputGatePre = neuronInputGate.Derivative(InputGateSum[c]) * deltaInputGatePost;
for (int i = 0; i < full_input_dimension; i++) {
weightsInputGate[c][i] += dSdwWeightsInputGate[c][i] * deltaCEC3 * learningRate;
}
peepInputGate[c] += CEC2[c] * deltaInputGatePre * learningRate;
//before ingate
double deltaCEC2 = deltaCEC3;
//update forget gates
double deltaForgetGatePost = deltaCEC2 * CEC1[c];
double deltaForgetGatePre = neuronForgetGate.Derivative(ForgetGateSum[c]) * deltaForgetGatePost;
for (int i = 0; i < full_input_dimension; i++) {
weightsForgetGate[c][i] += dSdwWeightsForgetGate[c][i] * deltaCEC2 * learningRate;
}
peepForgetGate[c] += CEC1[c] * deltaForgetGatePre * learningRate;
//update cell inputs
for (int i = 0; i < full_input_dimension; i++) {
weightsNetInput[c][i] += dSdwWeightsNetInput[c][i] * deltaCEC3 * learningRate;
}
//no peeps for cell inputs
}
}
//////////////////////////////////////////////////////////////
//roll-over context to next time step
for (int j = 0; j < cell_blocks; j++) {
context[j] = NetOutputAct[j];
CEC[j] = CEC3[j];
}
Also, and perhaps even more interesting is the Lecture and Lecture Notes from Andrej Karpathy:
https://youtu.be/cO0a0QYmFm8?t=45m36s
http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf

How to implement the Softmax derivative independently from any loss function?

For a neural networks library I implemented some activation functions and loss functions and their derivatives. They can be combined arbitrarily and the derivative at the output layers just becomes the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the Softmax activation function independently from any loss function. Due to the normalization i.e. the denominator in the equation, changing a single input activation changes all output activations and not just one.
Here is my Softmax implementation where the derivative fails the gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?
import numpy as np
class Softmax:
def compute(self, incoming):
exps = np.exp(incoming)
return exps / exps.sum()
def delta(self, incoming, outgoing):
exps = np.exp(incoming)
others = exps.sum() - exps
return 1 / (2 + exps / others + others / exps)
activation = Softmax()
cost = SquaredError()
outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)
Mathematically, the derivative of Softmax σ(j) with respect to the logit Zi (for example, Wi*X) is
where the red delta is a Kronecker delta.
If you implement iteratively:
def softmax_grad(s):
# input s is softmax value of the original input x. Its shape is (1,n)
# i.e. s = np.array([0.3,0.7]), x = np.array([0,1])
# make the matrix whose size is n^2.
jacobian_m = np.diag(s)
for i in range(len(jacobian_m)):
for j in range(len(jacobian_m)):
if i == j:
jacobian_m[i][j] = s[i] * (1 - s[i])
else:
jacobian_m[i][j] = -s[i] * s[j]
return jacobian_m
Test:
In [95]: x
Out[95]: array([1, 2])
In [96]: softmax(x)
Out[96]: array([ 0.26894142, 0.73105858])
In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
[-0.19661193, 0.19661193]])
If you implement in a vectorized version:
soft_max = softmax(x)
# reshape softmax to 2d so np.dot gives matrix multiplication
def softmax_grad(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
softmax_grad(soft_max)
#array([[ 0.19661193, -0.19661193],
# [-0.19661193, 0.19661193]])
It should be like this: (x is the input to the softmax layer and dy is the delta coming from the loss above it)
dx = y * dy
s = dx.sum(axis=dx.ndim - 1, keepdims=True)
dx -= y * s
return dx
But the way you compute the error should be:
yact = activation.compute(x)
ycost = cost.compute(yact)
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
Explanation: Because the delta function is a part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code, outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what does this Jacobian look like for softmax [1], and then multiply it from the left by a vector dy, after a bit of algebra you'll find out that you get something that corresponds to my Python code.
[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
The other answers are great, here to share a simple implementation of forward/backward, regardless of loss functions.
In the image below, it is a brief derivation of the backward for softmax. The 2nd equation is loss function dependent, not part of our implementation.
backward verified by manual grad checking.
import numpy as np
class Softmax:
def forward(self, x):
mx = np.max(x, axis=1, keepdims=True)
x = x - mx # log-sum-exp trick
e = np.exp(x)
probs = e / np.sum(np.exp(x), axis=1, keepdims=True)
return probs
def backward(self, x, probs, bp_err):
dim = x.shape[1]
output = np.empty(x.shape)
for j in range(dim):
d_prob_over_xj = - (probs * probs[:,[j]]) # i.e. prob_k * prob_j, no matter k==j or not
d_prob_over_xj[:,j] += probs[:,j] # i.e. when k==j, +prob_j
output[:,j] = np.sum(bp_err * d_prob_over_xj, axis=1)
return output
def compute_manual_grads(x, pred_fn):
eps = 1e-3
batch_size, dim = x.shape
grads = np.empty(x.shape)
for i in range(batch_size):
for j in range(dim):
x[i,j] += eps
y1 = pred_fn(x)
x[i,j] -= 2*eps
y2 = pred_fn(x)
grads[i,j] = (y1 - y2) / (2*eps)
x[i,j] += eps
return grads
def loss_fn(probs, ys, loss_type):
batch_size = probs.shape[0]
# dummy mse
if loss_type=="mse":
loss = np.sum((np.take_along_axis(probs, ys.reshape(-1,1), axis=1) - 1)**2) / batch_size
values = 2 * (np.take_along_axis(probs, ys.reshape(-1,1), axis=1) - 1) / batch_size
# cross ent
if loss_type=="xent":
loss = - np.sum( np.take_along_axis(np.log(probs), ys.reshape(-1,1), axis=1) ) / batch_size
values = -1 / np.take_along_axis(probs, ys.reshape(-1,1), axis=1) / batch_size
err = np.zeros(probs.shape)
np.put_along_axis(err, ys.reshape(-1,1), values, axis=1)
return loss, err
if __name__ == "__main__":
batch_size = 10
dim = 5
x = np.random.rand(batch_size, dim)
ys = np.random.randint(0, dim, batch_size)
for loss_type in ["mse", "xent"]:
S = Softmax()
probs = S.forward(x)
loss, bp_err = loss_fn(probs, ys, loss_type)
grads = S.backward(x, probs, bp_err)
def pred_fn(x, ys):
pred = S.forward(x)
loss, err = loss_fn(pred, ys, loss_type)
return loss
manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))
# compare both grads
print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads)**2) / batch_size}")
Just in case you are processing in batches, here is an implementation in NumPy (tested vs TensorFlow). However, I will suggest avoiding the associated tensor operations, by mixing the jacobian with the cross-entropy, which leads to a very simple and efficient expression.
def softmax(z):
exps = np.exp(z - np.max(z))
return exps / np.sum(exps, axis=1, keepdims=True)
def softmax_jacob(s):
return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
- np.einsum('ij,ik->ijk', s, s)
def np_softmax_test(z):
return softmax_jacob(softmax(z))
def tf_softmax_test(z):
z = tf.constant(z, dtype=tf.float32)
with tf.GradientTape() as g:
g.watch(z)
a = tf.nn.softmax(z)
jacob = g.batch_jacobian(a, z)
return jacob.numpy()
z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
Here is a c++ vectorized version, using intrinsics ( 22 times (!) faster than the non-SSE version):
// How many floats fit into __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 16-byte aligned, and
// operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8
//check to quickly see if your rows are divisible by m256.
//you can 'undefine' to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
#define assert_is_m256_multiple(x) assert( (x%F_MULTIPLE_OF_M256) == 0)
#else
#define assert_is_m256_multiple (q)
#endif
// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
const float *sumStart = reinterpret_cast<const float*>(&x);
float sum = 0.0f;
for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
sum += sumStart[i];
}
return sum;
}
f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size,
const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
assert_is_m256_multiple(size);
//allocate vector, where to store output:
f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.
const __m256* end = (const __m256*)(softmaxResult + size);
for(size_t i=0; i<size; ++i){// <--for every row
//go through this i'th row:
__m256 sum = _mm256_set1_ps(0.0f);
const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] );
const __m256 *s = (const __m256*)softmaxResult;
const __m256 *gAbove = (__m256*)gradFromAbove;
for (s; s<end; ){
__m256 mul = _mm256_mul_ps(*s, neg_sft_i); // sftmaxResult_j * (-sftmaxResult_i)
mul = _mm256_mul_ps( mul, *gAbove );
sum = _mm256_add_ps( sum, mul );//adding to the total sum of this row.
++s;
++gAbove;
}
grad_v[i] = slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
}//end for every row
//reset back to start and subtract a vector, to account for Kronecker delta:
__m256 *g = (__m256*)grad_v._contents;
__m256 *s = (__m256*)softmaxResult;
__m256 *gAbove = (__m256*)gradFromAbove;
for(s; s<end; ){
__m256 mul = _mm256_mul_ps(*s, *gAbove);
*g = _mm256_add_ps( *g, mul );
++s;
++g;
}
return grad_v;
}
If for some reason somebody wants a simple (non-SSE) version, here it is:
inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,
const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer
float *gradOutput,
size_t count ){
// every pre-softmax element in a layer contributed to the softmax of every other element
// (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
for(size_t i=0; i<count; ++i){
//go through this i'th row:
float sum = 0.0f;
const float neg_sft_i = -softmaxResult[i];
for(size_t j=0; j<count; ++j){
float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
sum += mul;//adding to the total sum of this row.
}
//NOTICE: equals, overwriting any old values:
gradOutput[i] = sum;
}//end for every row
for(size_t i=0; i<count; ++i){
gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
}
}

Teaching a Neural Net: Bipolar XOR

I'm trying to to teach a neural net of 2 inputs, 4 hidden nodes (all in same layer) and 1 output node. The binary representation works fine, but I have problems with the Bipolar. I can't figure out why, but the total error will sometimes converge to the same number around 2.xx. My sigmoid is 2/(1+ exp(-x)) - 1. Perhaps I'm sigmoiding in the wrong place. For example to calculate the output error should I be comparing the sigmoided output with the expected value or with the sigmoided expected value?
I was following this website here: http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html , but they use different functions then I was instructed to use. Even when I did try to implement their functions I still ran into the same problem. Either way I get stuck about half the time at the same number (a different number for different implementations). Please tell me if I have made a mistake in my code somewhere or if this is normal (I don't see how it could be). Momentum is set to 0. Is this a common 0 momentum problem? The error functions we are supposed to be using are:
if ui is an output unit
Error(i) = (Ci - ui ) * f'(Si )
if ui is a hidden unit
Error(i) = Error(Output) * weight(i to output) * f'(Si)
public double sigmoid( double x ) {
double fBipolar, fBinary, temp;
temp = (1 + Math.exp(-x));
fBipolar = (2 / temp) - 1;
fBinary = 1 / temp;
if(bipolar){
return fBipolar;
}else{
return fBinary;
}
}
// Initialize the weights to random values.
private void initializeWeights(double neg, double pos) {
for(int i = 0; i < numInputs + 1; i++){
for(int j = 0; j < numHiddenNeurons; j++){
inputWeights[i][j] = Math.random() - pos;
if(inputWeights[i][j] < neg || inputWeights[i][j] > pos){
print("ERROR ");
print(inputWeights[i][j]);
}
}
}
for(int i = 0; i < numHiddenNeurons + 1; i++){
hiddenWeights[i] = Math.random() - pos;
if(hiddenWeights[i] < neg || hiddenWeights[i] > pos){
print("ERROR ");
print(hiddenWeights[i]);
}
}
}
// Computes output of the NN without training. I.e. a forward pass
public double outputFor ( double[] argInputVector ) {
for(int i = 0; i < numInputs; i++){
inputs[i] = argInputVector[i];
}
double weightedSum = 0;
for(int i = 0; i < numHiddenNeurons; i++){
weightedSum = 0;
for(int j = 0; j < numInputs + 1; j++){
weightedSum += inputWeights[j][i] * inputs[j];
}
hiddenActivation[i] = sigmoid(weightedSum);
}
weightedSum = 0;
for(int j = 0; j < numHiddenNeurons + 1; j++){
weightedSum += (hiddenActivation[j] * hiddenWeights[j]);
}
return sigmoid(weightedSum);
}
//Computes the derivative of f
public static double fPrime(double u){
double fBipolar, fBinary;
fBipolar = 0.5 * (1 - Math.pow(u,2));
fBinary = u * (1 - u);
if(bipolar){
return fBipolar;
}else{
return fBinary;
}
}
// This method is used to update the weights of the neural net.
public double train ( double [] argInputVector, double argTargetOutput ){
double output = outputFor(argInputVector);
double lastDelta;
double outputError = (argTargetOutput - output) * fPrime(output);
if(outputError != 0){
for(int i = 0; i < numHiddenNeurons + 1; i++){
hiddenError[i] = hiddenWeights[i] * outputError * fPrime(hiddenActivation[i]);
deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
hiddenWeights[i] += deltaHiddenWeights[i];
}
for(int in = 0; in < numInputs + 1; in++){
for(int hid = 0; hid < numHiddenNeurons; hid++){
lastDelta = deltaInputWeights[in][hid];
deltaInputWeights[in][hid] = learningRate * hiddenError[hid] * inputs[in] + (momentum * lastDelta);
inputWeights[in][hid] += deltaInputWeights[in][hid];
}
}
}
return 0.5 * (argTargetOutput - output) * (argTargetOutput - output);
}
General coding comments:
initializeWeights(-1.0, 1.0);
may not actually get the initial values you were expecting.
initializeWeights should probably have:
inputWeights[i][j] = Math.random() * (pos - neg) + neg;
// ...
hiddenWeights[i] = (Math.random() * (pos - neg)) + neg;
instead of:
Math.random() - pos;
so that this works:
initializeWeights(0.0, 1.0);
and gives you initial values between 0.0 and 1.0 rather than between -1.0 and 0.0.
lastDelta is used before it is declared:
deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
I'm not sure if the + 1 on numInputs + 1 and numHiddenNeurons + 1 are necessary.
Remember to watch out for rounding of ints: 5/2 = 2, not 2.5!
Use 5.0/2.0 instead. In general, add the .0 in your code when the output should be a double.
Most importantly, have you trained the NeuralNet long enough?
Try running it with numInputs = 2, numHiddenNeurons = 4, learningRate = 0.9, and train for 1,000 or 10,000 times.
Using numHiddenNeurons = 2 it sometimes get "stuck" when trying to solve the XOR problem.
See also XOR problem - simulation

Resilient backpropagation neural network - question about gradient

First I want to say that I'm really new to neural networks and I don't understand it very good ;)
I've made my first C# implementation of the backpropagation neural network. I've tested it using XOR and it looks it work.
Now I would like change my implementation to use resilient backpropagation (Rprop - http://en.wikipedia.org/wiki/Rprop).
The definition says: "Rprop takes into account only the sign of the partial derivative over all patterns (not the magnitude), and acts independently on each "weight".
Could somebody tell me what partial derivative over all patterns is? And how should I compute this partial derivative for a neuron in hidden layer.
Thanks a lot
UPDATE:
My implementation base on this Java code: www_.dia.fi.upm.es/~jamartin/downloads/bpnn.java
My backPropagate method looks like this:
public double backPropagate(double[] targets)
{
double error, change;
// calculate error terms for output
double[] output_deltas = new double[outputsNumber];
for (int k = 0; k < outputsNumber; k++)
{
error = targets[k] - activationsOutputs[k];
output_deltas[k] = Dsigmoid(activationsOutputs[k]) * error;
}
// calculate error terms for hidden
double[] hidden_deltas = new double[hiddenNumber];
for (int j = 0; j < hiddenNumber; j++)
{
error = 0.0;
for (int k = 0; k < outputsNumber; k++)
{
error = error + output_deltas[k] * weightsOutputs[j, k];
}
hidden_deltas[j] = Dsigmoid(activationsHidden[j]) * error;
}
//update output weights
for (int j = 0; j < hiddenNumber; j++)
{
for (int k = 0; k < outputsNumber; k++)
{
change = output_deltas[k] * activationsHidden[j];
weightsOutputs[j, k] = weightsOutputs[j, k] + learningRate * change + momentumFactor * lastChangeWeightsForMomentumOutpus[j, k];
lastChangeWeightsForMomentumOutpus[j, k] = change;
}
}
// update input weights
for (int i = 0; i < inputsNumber; i++)
{
for (int j = 0; j < hiddenNumber; j++)
{
change = hidden_deltas[j] * activationsInputs[i];
weightsInputs[i, j] = weightsInputs[i, j] + learningRate * change + momentumFactor * lastChangeWeightsForMomentumInputs[i, j];
lastChangeWeightsForMomentumInputs[i, j] = change;
}
}
// calculate error
error = 0.0;
for (int k = 0; k < outputsNumber; k++)
{
error = error + 0.5 * (targets[k] - activationsOutputs[k]) * (targets[k] - activationsOutputs[k]);
}
return error;
}
So can I use change = hidden_deltas[j] * activationsInputs[i] variable as a gradient (partial derivative) for checking the sing?
I think the "over all patterns" simply means "in every iteration"... take a look at the RPROP paper
For the paritial derivative: you've already implemented the normal back-propagation algorithm. This is a method for efficiently calculate the gradient... there you calculate the δ values for the single neurons, which are in fact the negative ∂E/∂w values, i.e. the parital derivative of the global error as function of the weights.
so instead of multiplying the weights with these values, you take one of two constants (η+ or η-), depending on whether the sign has changed
The following is an example of a part of an implementation of the RPROP training technique in the Encog Artificial Intelligence Library. It should give you an idea of how to proceed. I would recommend downloading the entire library, because it will be easier to go through the source code in an IDE rather than through the online svn interface.
http://code.google.com/p/encog-cs/source/browse/#svn/trunk/encog-core/encog-core-cs/Neural/Networks/Training/Propagation/Resilient
http://code.google.com/p/encog-cs/source/browse/#svn/trunk
Note the code is in C#, but shouldn't be difficult to translate into another language.