Heterogeneous class recognition with ANN / MLP

I have put together a 3-layer classifying artificial neural network that appears to work on other datasets. However, playing around with some artificial datasets that I made, I was unable to correctly discriminate between two classes when one class is positive in either one feature or another feature.
Clearly class 1 can be identified by asking whether either feature 1 or feature 2 is equal to 1, but I can't get the algorithm to predict the dataset correctly (there are 20 examples following this pattern in the dataset).
Can ANN/MLPs recognize this type of pattern? If so, what am I missing? If not, are there other methods that can predict this type of pattern (maybe SVM)?
I used Octave, as that is what was used in the online course offered on Coursera. I have listed most of the code here, although it is structured slightly differently when I run it. As you can see, I do use bias units on the first and second layers, and I have also varied the number of hidden units in the second layer from 1 to 5 with no improvement over random guessing.
% Load dataset
y = [1; 1; 2; 2]
X = [1, 0; 0, 1; 0, 0; 0, 0]
m = size(X, 1);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
% Randomly initialize weight parameters
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
% Add bias units to layers and feedforward
Xbias = [ones(m,1), X];
L2bias = [ones(m,1), sigmoid(Xbias*Theta1')];
L3 = sigmoid(L2bias * Theta2');
% Create class matrix Y
Y = zeros(m, num_labels);
for r = 1:m;
Y(r, y(r)) = 1;
end
% Set cost function
J = (sum(sum(Y.*log(L3) + (1-Y).*log(1-L3))))/-m + lambda*(sum(sum((Theta1(:,2:columns(Theta1))).^2)) + sum(sum((Theta2(:,2:columns(Theta2))).^2)))/2/m;
% Initialize weight gradient matrices
D2 = zeros(rows(Theta2),columns(Theta2));
D1 = zeros(rows(Theta1),columns(Theta1));
% Calculate gradient with backpropagation
for t = 1:m;
a1 = [1 X(t,:)]';
z2 = Theta1*a1;
a2 = [1; sigmoid(z2)];
z3 = Theta2*a2;
a3 = sigmoid(z3);
d3 = a3 - Y(t,:)';
d2 = (Theta2'*d3)(2:end).*sigmoidGradient(z2);
D2 = D2 + d3*a2';
D1 = D1 + d2*a1';
end
Theta2_grad = D2/m;
Theta1_grad = D1/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda*Theta2(:,2:end)/m;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda*Theta1(:,2:end)/m;
% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
% Compute cost (Feed forward)
[J,grad] = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Create "short hand" for the cost function to be minimized using fmincg
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Train the neural network using fmincg
options = optimset('MaxIter', 1000);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
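For reference, a minimal sketch (not part of my original listing) of how predictions can be read off the trained network, assuming the reshaped Theta1 and Theta2 above are in scope:
% Forward-propagate the training examples and take the most probable class per row.
h1 = sigmoid([ones(m,1), X] * Theta1');
h2 = sigmoid([ones(m,1), h1] * Theta2');
[dummy, pred] = max(h2, [], 2);
fprintf('Training set accuracy: %f\n', mean(double(pred == y)) * 100);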

NNs can recognize any such pattern; the Universal Approximation Theorem (as well as many other results) proves that.
The most obvious reason I can think of for the failure is the lack of a bias neuron, although for more valuable answers you have to include your code.
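As a quick illustration (a hand-made sketch, not the asker's network), a single sigmoid unit with a bias already represents the "feature 1 OR feature 2" pattern, so a small MLP has more than enough capacity for it:
% One sigmoid unit computing x1 OR x2 with hand-picked weights:
% z = -10 + 20*x1 + 20*x2, so sigmoid(z) is ~1 if either feature is 1 and ~0 otherwise.
X = [1 0; 0 1; 0 0; 0 0];
w = [-10; 20; 20];                             % [bias; x1; x2]
p = 1 ./ (1 + exp(-[ones(rows(X),1) X] * w));
disp(round(p))                                 % prints 1 1 0 0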

Related

Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

I have a (real) 2D gpuArray, which I am using as part of a larger code, and I am now also trying to integrate the array using the composite Simpson's rule inside my main loop (several 10000 iterations at least). An MWE looks like the following:
%%%%%%%%%%%%%%%%%% MAIN CODE %%%%%%%%%%%%%%%%%%
Ny = 501; % Dimensions of matrix M
Nx = 503; %
dx = 0.1; % Grid spacings
dy = 0.2; %
M = rand(Ny, Nx, 'gpuArray'); % Initialise a matrix
for k = 1:10000
% M = function1(M) % Apply some other functions to M
% ... etc ...
I = simpsons_integration_2D(M, dx, dy, Nx, Ny); % Now integrate M
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%% Integrator %%%%%%%%%%%%%%%%%
function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.
% Integrate along x direction (vertically) --> IX is a vector afterwards
sX = sum( F(:,1:2:Nx-2) + 4*F(:,2:2:(Nx-1)) + F(:,3:2:Nx) , 2);
IX = dx/3 * sX;
% Integrate along y direction --> I is a scalar afterwards
sY = sum( IX(1:2:Ny-2) + 4*IX(2:2:(Ny-1)) + IX(3:2:Ny) , 1);
I = dy/3 * sY;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The integration alone takes around 850 µs, which is currently a significant part of my code's run time. This was measured using
f = @() simpsons_integration_2D(M, dx, dy, Nx, Ny);
t = gputimeit(f)
Is there a way to reduce the execution time for integrating the gpuArray matrix?
(The graphics card is the Nvidia Quadro P4000)
Many thanks
Assuming that the matrix has odd dimensions, here is a way to optimize the function:
function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
sX = 2 * sum(F,2) + 2 * sum (F(:,2:2:(Nx-1)),2) - F(:,1) - F(:,Nx);
sY = dx/3 * (2 * sum(sX) + 2 * sum (sX(2:2:(Ny-1))) - sX(1) - sX(Ny));
I = dy/3 * sY;
end
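A quick way to sanity-check this against the question's routine (a sketch; it assumes the original is kept under a different name, e.g. simpsons_integration_2D_orig, since both versions share a name here):
% Compare the optimized version with the original on a random odd-sized matrix.
Ny = 501; Nx = 503; dx = 0.1; dy = 0.2;
M = rand(Ny, Nx, 'gpuArray');
I_orig = simpsons_integration_2D_orig(M, dx, dy, Nx, Ny);  % original (renamed)
I_opt  = simpsons_integration_2D(M, dx, dy, Nx, Ny);       % optimized version above
abs(I_orig - I_opt)                                        % should be ~0 up to round-off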
EDIT
A more optimized solution using matrix multiplication:
function I = simpsons_integration_2D2(F, dx, dy, Nx, Ny)
mx = repmat (2, Nx, 1);
mx(2:2:(Nx-1)) = 4;
mx(1) = 1;
mx(Nx) = 1;
my = repmat (2, 1, Ny);
my(2:2:(Ny-1)) = 4;
my(1) = 1;
my(Ny) = 1;
I = (dx*dy/9) * (my * (F * mx));
end
If Nx and Ny are the same, you only need to compute one of mx and my:
function I = simpsons_integration_2D2(F, dx, dy, Nx, Ny)
mx = repmat (2, Nx, 1);
mx(2:2:(Nx-1)) = 4;
mx(1) = 1;
mx(Nx) = 1;
I = (dx*dy/9) * (mx.' * (F * mx));
end
If Nx and Ny are constant you can precompute mx outside the function and pass it as a function argument:
function I = simpsons_integration_2D2(F, dx, dy, mx)
I = (dx*dy/9) * (mx.' * (F * mx));
end
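For example (a sketch assuming Nx == Ny and odd, since this variant takes a single weight vector), the precomputation then looks like:
% Build the Simpson weight vector once, keep it on the GPU, and reuse it in the loop.
Nx = 501;
mx = repmat(2, Nx, 1);
mx(2:2:(Nx-1)) = 4;
mx(1) = 1;  mx(Nx) = 1;
mx = gpuArray(mx);
% ... inside the main loop:
% I = simpsons_integration_2D2(M, dx, dy, mx);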
EDIT:
If both mx and my can be precomputed the problem is reduced to a dot product:
m = reshape (my.' .* mx.', 1, []);
function I = simpsons_integration_2D3(F, dx, dy, m)
I = (dx*dy/9) * (m * F(:));
end
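And a corresponding usage sketch for the non-square case, with the combined weight vector built once outside the main loop (dimensions as in the question's MWE):
% Precompute the combined Simpson weights on the GPU; the per-iteration cost is
% then a single dot product.
Nx = 503; Ny = 501; dx = 0.1; dy = 0.2;
mx = repmat(2, Nx, 1);  mx(2:2:(Nx-1)) = 4;  mx(1) = 1;  mx(Nx) = 1;
my = repmat(2, 1, Ny);  my(2:2:(Ny-1)) = 4;  my(1) = 1;  my(Ny) = 1;
m  = gpuArray(reshape(my.' .* mx.', 1, []));
% ... inside the main loop:
% I = simpsons_integration_2D3(M, dx, dy, m);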
Well, I cannot test this for you, but there are a few things that may help.
Summing first along axis 1 and then along axis 2 may make some difference in terms of the locality of the accessed terms (I don't know whether for better or worse).
function I = variation1(F, dx, dy, Nx, Ny)
% Sum each term separately, prevents the creation of a big intermediate matrix
% Multiply outside the summation does only Ny multiplications by 4 instead of Ny*Nx/2
sX = sum(F(:,1:2:Nx-2), 2) + 4*sum(F(:,2:2:(Nx-1)), 2) + sum(F(:,3:2:Nx), 2);
IX = dx/3 * sX;
sY = sum(IX(1:2:Ny-2), 1) + 4*sum(IX(2:2:(Ny-1)), 1) + sum(IX(3:2:Ny) , 1);
I = dy/3 * sY;
end
function I = variation2(F, dx, dy, Nx, Ny)
% a.
% Sum each term separately, prevents the creation of a big intermediate matrix
% Multiply outside the summation does only Ny multiplications by 4 instead of Ny*Nx/2
% b.
% Notice that the terms 3:2:Nx-2 appear in two summations
% Saves Nx*Ny/2 additions at the expense of Ny multiplications by 2
sX = 2*sum(F(:,3:2:Nx-2), 2) + 4*sum(F(:,2:2:(Nx-1)), 2) + F(:,1) + F(:,Nx);
% saves Ny multiplications by moving the constant factor after the next sum
sY = 2*sum(sX(3:2:Ny-2), 1) + 4*sum(sX(2:2:(Ny-1)), 1) + sX(1) + sX(Ny);
I = (dx*dy/9) * sY;
end
function I = alternate_simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.
% Notice that sum(F(:,1:2:Nx-2) + F(:,3:2:Nx)) has all but the end points repeated.
IX = 4*sum(F(:,2:2:Nx-1), 2) + 2 * sum(F(:,3:2:Nx-2) , 2) + F(:,1) + F(:,Nx);
disp(size(IX))
% Integrate along y direction --> I is a scalar afterwards
sY = 4*sum(IX(2:2:Ny-1)) + 2*sum(IX(3:2:Ny-2)) + IX(1) + IX(Ny);
I = dx*dy/9 * sY;
end
If you think it is better to make a single summation, you can do it using the formula 2*sum(2*F(2:2:end-1) + F(1:2:end-2)) + F(end) - F(1), which gives the same result but has Nx*Ny/2 fewer additions on the first integration. But these options have to be tested in your environment.
Transposed implementation
function I = transposed_simpsons_integration_2D(F, dx, dy, Nx, Ny)
sY = 2*sum(2*F(2:2:end-1, :) + F(1:2:end-2, :), 1) + F(end, :) - F(1, :);
sX = 2*sum(2*sY(2:2:end-1) + sY(1:2:end-2)) + sY(end) - sY(1);
I = dx*dy/9 * sX;
end
Using Octave (usually slower than MATLAB) I get a run time of ~400 µs per iteration. This is not the type of workload that will be interesting to run on the GPU. For comparison, randn is about 10 times slower than this function.
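For what it's worth, the variants can be timed on the GPU the same way as in the question (a sketch; function names as defined above, and the timings will depend on the card):
% Time the variants with gputimeit (times returned in seconds).
Ny = 501; Nx = 503; dx = 0.1; dy = 0.2;
M  = rand(Ny, Nx, 'gpuArray');
t1 = gputimeit(@() variation1(M, dx, dy, Nx, Ny));
t2 = gputimeit(@() variation2(M, dx, dy, Nx, Ny));
t3 = gputimeit(@() transposed_simpsons_integration_2D(M, dx, dy, Nx, Ny));
fprintf('variation1: %.1f us, variation2: %.1f us, transposed: %.1f us\n', 1e6*[t1 t2 t3]);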

My matlab neural network backpropagation algorithm seems buggy

Here is my code. I think it is wrong because the difference between this computed gradient and my numerical estimate is too significant. It doesn't seem to be due to wrongly inverting matrices, etc.
For context, Y is the output layer, X is the input layer, and there is only one hidden layer. Theta1 holds the weights for the input layer and Theta2 holds the weights for the hidden layer.
for t = 1:m
% do fw prop again...
a1 = [1 X(i,:)];
a2 = [1 sigmoid(a1 * Theta1')];
a3 = sigmoid(a2 * Theta2');
delta_3 = a3' - Y(:, t);
delta_2 = Theta2' * delta_3 .* a2' .* (1 - a2)';
delta_2 = delta_2(2:end,:);
Theta1_grad = Theta1_grad + delta_2 * [1 X(i, :)];
Theta2_grad = Theta2_grad + delta_3 * [1 sigmoid([1 X(i,:)] * Theta1')];
end
grad = [Theta1_grad(:) ; Theta2_grad(:)];
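One way to localize the discrepancy is to compare every analytic gradient entry against a centered finite difference (a generic sketch, assuming a cost-function handle costFn(params) that returns the scalar cost J for an unrolled parameter vector):
% Generic numerical gradient check using centered differences.
function numgrad = computeNumericalGradient(costFn, params)
  numgrad = zeros(size(params));
  perturb = zeros(size(params));
  e = 1e-4;
  for p = 1:numel(params)
    perturb(p) = e;
    loss1 = costFn(params - perturb);
    loss2 = costFn(params + perturb);
    numgrad(p) = (loss2 - loss1) / (2*e);
    perturb(p) = 0;
  end
end
% For a correct implementation, norm(numgrad - grad) / norm(numgrad + grad)
% is typically on the order of 1e-9.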

Low accuracy in emotion voice recognition

I am training a neural network to do voice emotion recognition with:
100 input layer size.
25 hidden layer size.
6 labels (output layer).
I have divided the data set into a train set and a test set, then extracted features from the voice using melfcc (Mel Frequency Cepstral Coefficients), which returns matrices of different sizes, so I used 100 of the features each time.
The accuracy on the training set is 100%, but when it comes to the test set it is about 30-40%.
I still don't know a lot about this domain, but it looks like overfitting (maybe not, but that is what I have learned). I have made some adjustments to avoid this problem:
Increasing lambda, decreasing the number of features, and adding an extra hidden layer. The accuracy got better, but never higher than 40%.
What would be the problem?
Here is the implementation of MLFCC:
function [cepstra,aspectrum,pspectrum] = melfcc(samples, sr, varargin)
if nargin < 2; sr = 16000; end
% Parse out the optional arguments
[wintime, hoptime, numcep, lifterexp, sumpower, preemph, dither, ...
minfreq, maxfreq, nbands, bwidth, dcttype, fbtype, usecmp, modelorder, ...
broaden, useenergy] = ...
process_options(varargin, 'wintime', 0.025, 'hoptime', 0.010, ...
'numcep', 13, 'lifterexp', 0.6, 'sumpower', 1, 'preemph', 0.97, ...
'dither', 0, 'minfreq', 0, 'maxfreq', 4000, ...
'nbands', 40, 'bwidth', 1.0, 'dcttype', 2, ...
'fbtype', 'mel', 'usecmp', 0, 'modelorder', 0, ...
'broaden', 0, 'useenergy', 0);
if preemph ~= 0
samples = filter([1 -preemph], 1, samples);
end
% Compute FFT power spectrum
[pspectrum,logE] = powspec(samples, sr, wintime, hoptime, dither);
aspectrum = audspec(pspectrum, sr, nbands, fbtype, minfreq, maxfreq, sumpower, bwidth);
if (usecmp)
% PLP-like weighting/compression
aspectrum = postaud(aspectrum, maxfreq, fbtype, broaden);
end
if modelorder > 0
if (dcttype ~= 1)
disp(['warning: plp cepstra are implicitly dcttype 1 (not ', num2str(dcttype), ')']);
end
% LPC analysis
lpcas = dolpc(aspectrum, modelorder);
% convert lpc to cepstra
cepstra = lpc2cep(lpcas, numcep);
% Return the auditory spectrum corresponding to the cepstra?
% aspectrum = lpc2spec(lpcas, nbands);
% else return the aspectrum that the cepstra are based on, prior to PLP
else
% Convert to cepstra via DCT
cepstra = spec2cep(aspectrum, numcep, dcttype);
end
cepstra = lifter(cepstra, lifterexp);
if useenergy
cepstra(1,:) = logE;
end
And here is my implementation:
clear ; close all; clc
[input,output]=gettingPatterns;
input_layer_size = 70;
hidden_layer_size = 100;
hidden2_layer_size = 25;
num_labels = 6;
fu = [input output];size(fu)
fu=fu(randperm(size(fu,1)),:);
input = fu(:,1:70);
output = fu (:,71:76);
crossIn = input(201:240,:);
crossOut=output(201:240,:);
trainIn = input(1:200,:);
trainOut=output(1:200,:);
Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
Theta2 = randInitializeWeights(hidden_layer_size,hidden2_layer_size);
Theta3 =randInitializeWeights(hidden2_layer_size,num_labels);
initial_nn_params = [Theta1(:) ; Theta2(:);Theta3(:)];
size(initial_nn_params)
options = optimset('MaxIter',1000);
% You should also try different values of lambda
lambda=1;
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
hidden2_layer_size,num_labels, trainIn, trainOut, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):( (hidden_layer_size * (input_layer_size + 1)))+(hidden2_layer_size*(hidden_layer_size+1))), ...
hidden2_layer_size, (hidden_layer_size + 1));
Theta3 = reshape(nn_params(((1 + (hidden_layer_size * (input_layer_size + 1)))+(hidden2_layer_size*(hidden_layer_size+1))):end), ...
num_labels, (hidden2_layer_size + 1));
%[error_train, error_val] = learningCurve(trainIn, trainOut, crossIn, crossOut, lambda,input_layer_size,hidden_layer_size,num_labels);
pred = predict(Theta1, Theta2,Theta3,trainIn);
[dummy, p] = max(trainOut, [], 2);
[pred trainOut]
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == p)) * 100);
pred = predict(Theta1, Theta2,Theta3,crossIn);
[pred crossOut]
[dummy, p] = max(crossOut, [], 2);
fprintf('\nCross Validation Set Accuracy: %f\n', mean(double(pred == p)) * 100);
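One way to probe the suspected overfitting is to sweep lambda and compare train vs. cross-validation accuracy (a sketch reusing the variables and functions defined above; the lambda values and iteration count are arbitrary):
% Sweep the regularization strength and report train / cross-validation accuracy.
for lambda = [0 0.1 0.3 1 3 10]
  cf = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
            hidden2_layer_size, num_labels, trainIn, trainOut, lambda);
  [params, cost] = fmincg(cf, initial_nn_params, optimset('MaxIter', 200));
  n1 = hidden_layer_size * (input_layer_size + 1);
  n2 = hidden2_layer_size * (hidden_layer_size + 1);
  T1 = reshape(params(1:n1), hidden_layer_size, input_layer_size + 1);
  T2 = reshape(params(n1+1:n1+n2), hidden2_layer_size, hidden_layer_size + 1);
  T3 = reshape(params(n1+n2+1:end), num_labels, hidden2_layer_size + 1);
  [dummy, ptr] = max(trainOut, [], 2);
  [dummy, pcv] = max(crossOut, [], 2);
  fprintf('lambda = %5.2f  train acc = %6.2f  cv acc = %6.2f\n', lambda, ...
          mean(double(predict(T1, T2, T3, trainIn) == ptr)) * 100, ...
          mean(double(predict(T1, T2, T3, crossIn) == pcv)) * 100);
end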
And here is the code for getting the patterns:
function [ input,output ] = gettingPatterns()
myFolder='C:\Users\ahmed\Documents\MATLAB\New Folder (3)\homeWork\speech';
filePattern=fullfile(myFolder,'*.wav');
wavFiles=dir(filePattern);
output=[];
input=[];
for k = 1:length(wavFiles)
sampleOutput=zeros(1,6);
baseFileName = wavFiles(k).name;
if baseFileName(3:5)=='ang', sampleOutput(1)=1; end
if baseFileName(3:5)=='fea', sampleOutput(2)=1; end
if baseFileName(3:5)=='bor', sampleOutput(3)=1; end
if baseFileName(3:5)=='sad', sampleOutput(4)=1; end
if baseFileName(3:5)=='joy', sampleOutput(5)=1; end
if baseFileName(3:5)=='neu', sampleOutput(6)=1; end
output(k,:)=sampleOutput;
fullFileName = fullfile(myFolder, baseFileName);
wavArray = wavread(fullFileName);
[cepstra,xxx]=melfcc(wavArray);
[m,n]=size(cepstra);
reshapedArray=reshape(cepstra,m*n,1);
smalledArray=small(reshapedArray,70);
%Normalized Features
%x(i)=(x(i)-mean)/std;
normalizedFeatures=[];
for i=1:length(smalledArray)
normalizedFeatures(i)=(smalledArray(i)-mean(smalledArray))/std(smalledArray);
end
input(k,:)=normalizedFeatures;
end
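The per-file normalization loop above can also be written in vectorized form (a sketch computing the same result):
% Vectorized equivalent of the normalization loop.
normalizedFeatures = (smalledArray(:).' - mean(smalledArray)) / std(smalledArray);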
Please note the following:
I have done this test with the NN toolbox and got the same result. The only reason for implementing it myself is to be able to add an extra hidden layer.
The implementations of the cost function, forward propagation and backward propagation are 100% correct, so I did not include them in this question.

Neural Networks: Sigmoid Activation Function for continuous output variable

Okay, so I am in the middle of Andrew Ng's machine learning course on Coursera and would like to adapt the neural network that I completed as part of assignment 4.
In particular, the neural network which I had completed correctly as part of the assignment was as follows:
Sigmoid activation function: g(z) = 1/(1+e^(-z))
10 output units, each which could take 0 or 1
1 hidden layer
Back-propagation method used to minimize cost function
Cost function:
J(Theta) = -(1/m) * sum_{i=1..m} sum_{k=1..K} [ y_k^(i) * log(h(x^(i))_k) + (1 - y_k^(i)) * log(1 - h(x^(i))_k) ] + (lambda/(2m)) * sum_{l=1..L-1} sum_{i=1..s_l} sum_{j=1..s_(l+1)} (Theta_ji^(l))^2
where L = number of layers, s_l = number of units in layer l, m = number of training examples, K = number of output units
Now I want to adjust the exercise so that there is one continuous output unit that takes any value in [0, 1], and I am trying to work out what needs to change. So far I have:
Replaced the data with my own, i.e., such that the output is a continuous variable between 0 and 1
Updated references to the number of output units
Updated the cost function in the back-propagation algorithm to:
J = (1/(2m)) * sum_{i=1..m} (a_3^(i) - y^(i))^2 (plus the same regularization term),
where a_3 is the value of the output unit determined from forward propagation.
I am certain that something else must change, as the gradient checking method shows that the gradient determined by back-propagation and the one from the numerical approximation no longer match up. I did not change the sigmoid gradient; it is left at f(z)*(1-f(z)), where f(z) is the sigmoid function 1/(1+e^(-z)). Nor did I update the numerical approximation of the derivative formula, which is simply (J(theta+e) - J(theta-e))/(2e).
Can anyone advise of what other steps would be required?
Coded in Matlab as follows:
% FORWARD PROPAGATION
% input layer
a1 = [ones(m,1),X];
% hidden layer
z2 = a1*Theta1';
a2 = sigmoid(z2);
a2 = [ones(m,1),a2];
% output layer
z3 = a2*Theta2';
a3 = sigmoid(z3);
% BACKWARD PROPAGATION
delta3 = a3 - y;
delta2 = delta3*Theta2(:,2:end).*sigmoidGradient(z2);
Theta1_grad = (delta2'*a1)/m;
Theta2_grad = (delta3'*a2)/m;
% COST FUNCTION
J = 1/(2 * m) * sum( (a3-y).^2 );
% Implement regularization with the cost function and gradients.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + Theta1(:,2:end)*lambda/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + Theta2(:,2:end)*lambda/m;
J = J + lambda/(2*m)*( sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
I have since realised that this question is similar to one asked by Mikhail Erofeev on Stack Overflow; however, in this case I want the continuous variable to be between 0 and 1 and therefore use a sigmoid function.
First, your cost function should be:
J = 1/m * sum( (a3-y).^2 );
I think your Theta2_grad = (delta3'*a2)/m; is expected to match the numerical approximation after delta3 is changed to delta3 = 1/2 * (a3 - y);.
Check this slide for more details.
EDIT:
In case there is some minor discrepancy between our code, I have pasted my code below for your reference. The code has already been compared with the numerical approximation function checkNNGradients(lambda); the relative difference is less than 1e-4 (though it does not meet the 1e-11 requirement by Dr. Andrew Ng).
function [J grad] = nnCostFunctionRegression(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
m = size(X, 1);
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
X = [ones(m, 1) X];
z1 = sigmoid(X * Theta1');
zs = z1;
z1 = [ones(m, 1) z1];
z2 = z1 * Theta2';
ht = sigmoid(z2);
y_recode = zeros(length(y),num_labels);
for i=1:length(y)
y_recode(i,y(i))=1;
end
y = y_recode;
regularization=lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
J=1/(m)*sum(sum((ht - y).^2))+regularization;
delta_3 = 1/2*(ht - y);
delta_2 = delta_3 * Theta2(:,2:end) .* sigmoidGradient(X * Theta1');
delta_cap2 = delta_3' * z1;
delta_cap1 = delta_2' * X;
Theta1_grad = ((1/m) * delta_cap1)+ ((lambda/m) * (Theta1));
Theta2_grad = ((1/m) * delta_cap2)+ ((lambda/m) * (Theta2));
Theta1_grad(:,1) = Theta1_grad(:,1)-((lambda/m) * (Theta1(:,1)));
Theta2_grad(:,1) = Theta2_grad(:,1)-((lambda/m) * (Theta2(:,1)));
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
If you want to have a continuous output, try not to use the sigmoid activation when computing the target value.
a1 = [ones(m, 1) X];
a2 = sigmoid(a1 * Theta1');
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
ht = a3;
Normalize the input before using it in nnCostFunction. Everything else remains the same.
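A minimal sketch of the corresponding backprop lines for a linear output unit with a squared-error cost, using the same variable names as in the question's code:
% Backprop when the output unit is linear (a3 = z3) and the cost is squared error.
delta3 = a3 - y;                                        % no sigmoidGradient at the output
delta2 = delta3 * Theta2(:,2:end) .* sigmoidGradient(z2);
Theta1_grad = (delta2' * a1) / m;
Theta2_grad = (delta3' * a2) / m;
J = 1/(2*m) * sum((a3 - y).^2);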

Octave backpropagation implementation issues

I wrote code to implement steepest descent backpropagation, and I am having issues with it. I am using the Machine CPU dataset and have scaled the inputs and outputs into the range [0, 1].
The code in MATLAB/Octave is as follows:
steepest descent backpropagation
%SGD = Steepest Gradient Descent
function weights = nnSGDTrain (X, y, nhid_units, gamma, max_epoch, X_test, y_test)
iput_units = columns (X);
oput_units = columns (y);
n = rows (X);
W2 = rand (nhid_units + 1, oput_units);
W1 = rand (iput_units + 1, nhid_units);
train_rmse = zeros (1, max_epoch);
test_rmse = zeros (1, max_epoch);
for (epoch = 1:max_epoch)
delW2 = zeros (nhid_units + 1, oput_units)';
delW1 = zeros (iput_units + 1, nhid_units)';
for (i = 1:rows(X))
o1 = sigmoid ([X(i,:), 1] * W1); %1xn+1 * n+1xk = 1xk
o2 = sigmoid ([o1, 1] * W2); %1xk+1 * k+1xm = 1xm
D2 = o2 .* (1 - o2);
D1 = o1 .* (1 - o1);
e = (y_test(i,:) - o2)';
delta2 = diag (D2) * e; %mxm * mx1 = mx1
delta1 = diag (D1) * W2(1:(end-1),:) * delta2; %kxm * mx1 = kx1
delW2 = delW2 + (delta2 * [o1 1]); %mx1 * 1xk+1 = mxk+1 %already transposed
delW1 = delW1 + (delta1 * [X(i, :) 1]); %kx1 * 1xn+1 = k*n+1 %already transposed
end
delW2 = gamma .* delW2 ./ n;
delW1 = gamma .* delW1 ./ n;
W2 = W2 + delW2';
W1 = W1 + delW1';
[dummy train_rmse(epoch)] = nnPredict (X, y, nhid_units, [W1(:);W2(:)]);
[dummy test_rmse(epoch)] = nnPredict (X_test, y_test, nhid_units, [W1(:);W2(:)]);
printf ('Epoch: %d\tTrain Error: %f\tTest Error: %f\n', epoch, train_rmse(epoch), test_rmse(epoch));
fflush (stdout);
end
weights = [W1(:);W2(:)];
% plot (1:max_epoch, test_rmse, 1);
% hold on;
plot (1:max_epoch, train_rmse(1:end), 2);
% hold off;
end
predict
%Now SFNN Only
function [o1 rmse] = nnPredict (X, y, nhid_units, weights)
iput_units = columns (X);
oput_units = columns (y);
n = rows (X);
W1 = reshape (weights(1:((iput_units + 1) * nhid_units),1), iput_units + 1, nhid_units);
W2 = reshape (weights((((iput_units + 1) * nhid_units) + 1):end,1), nhid_units + 1, oput_units);
o1 = sigmoid ([X ones(n,1)] * W1); %nxiput_units+1 * iput_units+1xnhid_units = nxnhid_units
o2 = sigmoid ([o1 ones(n,1)] * W2); %nxnhid_units+1 * nhid_units+1xoput_units = nxoput_units
rmse = RMSE (y, o2);
end
RMSE function
function rmse = RMSE (a1, a2)
rmse = sqrt (sum (sum ((a1 - a2).^2))/rows(a1));
end
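For reference, a usage sketch for the two functions above (it assumes X, y, X_test, y_test are already scaled to [0, 1]; the hyperparameter values are arbitrary):
% Train and evaluate (nnSGDTrain and nnPredict as defined above).
nhid_units = 5;      % number of hidden units (arbitrary)
gamma      = 0.1;    % learning rate (arbitrary)
max_epoch  = 1000;
weights = nnSGDTrain(X, y, nhid_units, gamma, max_epoch, X_test, y_test);
[o_hidden, test_rmse] = nnPredict(X_test, y_test, nhid_units, weights);
printf('Final test RMSE: %f\n', test_rmse);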
I have also trained the same dataset using the R RSNNS package mlp, and the RMSE for the train set (first 100 examples) is around 0.03. But in my implementation I cannot achieve an RMSE lower than 0.14. Sometimes the errors grow for higher learning rates, and no learning rate gets me an RMSE below 0.14. A paper I referred to also reports a train-set RMSE of around 0.03.
I wanted to know where the problem in the code is. I have followed Raul Rojas' book and confirmed that things are okay.
In the backpropagation code, the line
e = (y_test(i,:) - o2)';
is not correct, because o2 is the output for the current training example while I am taking the difference against an example from the test set, y_test. The line should have been:
e = (y(i,:) - o2)';
which correctly finds the difference between the output predicted by the current model and the target output of the corresponding training example.
This took me 3 days to find; I am fortunate to have found this freaking bug, which was stopping me from going on to further modifications.