Octave implementation of simple neural network with one float output

Some of you may be familiar with the simple handwritten-digit classification NN that is part of Andrew Ng's ML course on Coursera. To improve my understanding of the theory, I'm trying to modify the implementation so that it outputs a single float instead of 10 classification labels.
The network has only one hidden layer. The code below is my attempt, but the backprop produces wrong gradients: they don't match the numerically estimated gradients at all when I do gradient checking. Because the network outputs a single number, the activation of the output node is the identity f(x) = x. Because of this, I thought the error at the output unit is simply distributed backwards to the hidden layer in proportion to the weights. Perhaps someone can spot what I'm missing here and why the gradients are wrong.
Here m is the number of training examples, y is the vector of target outputs, and X holds the training data, one example per row.
%forward pass and average cost over all training data
X = [ones(m, 1) X];           %insert bias column
a2 = tanh(X * Theta1');       %hidden layer activation
a2 = [ones(m, 1) a2];         %insert bias column for hidden layer
a3 = a2 * Theta2';            %linear combination for final output
Cost = sum((y - a3).^2) / m;  %mean squared error
Backprop, iterating over the m training examples to accumulate the gradients before averaging them:
for i = 1:m
  a1 = [1 X(i, :)]';          %get current training row
  z2 = Theta1 * a1;
  a2 = tanh(z2);
  a2 = [1; a2];
  z3 = Theta2 * a2;
  a3 = z3;                    %no activation function for final output
  delta3 = (a3 - y(i))^2;     %cost of current training row
  delta2 = Theta2' * delta3;  %proportionally distribute error backwards, because no activation function was used?
  delta2 = delta2(2:end);     %cut out bias element
  Theta2_grad = Theta2_grad + delta3 * a2'; %accumulate gradients
  Theta1_grad = Theta1_grad + delta2 * a1';
endfor
Theta1_grad = (1/m) * Theta1_grad; %average gradients
Theta2_grad = (1/m) * Theta2_grad;
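For reference, the numerical side of such a gradient check is typically done with central differences. A minimal sketch, assuming a hypothetical helper costAt(params) that reshapes the unrolled parameter vector back into Theta1 and Theta2 and returns the Cost defined above:

eps_ = 1e-4;                       %perturbation size
params = [Theta1(:); Theta2(:)];   %unroll parameters into one vector
numgrad = zeros(size(params));
for p = 1:length(params)
  e = zeros(size(params));
  e(p) = eps_;
  %central-difference approximation of the p-th partial derivative
  %costAt is a hypothetical helper, not part of the code above
  numgrad(p) = (costAt(params + e) - costAt(params - e)) / (2 * eps_);
end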

I think you are wrong: delta3 should be a3 - y(i), not (a3 - y(i))^2. The delta is the derivative of the cost with respect to z3, not the cost itself.
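For what it's worth, a corrected version of the inner loop might look like the sketch below (variable names as in the question). Two things change: delta3 becomes the derivative of the squared error (the factor 2 comes from differentiating (a3 - y)^2 as written; it disappears with the conventional half-MSE cost), and the hidden-layer delta is multiplied by the tanh derivative, since a2 = tanh(z2) is not linear even though the output unit is:

for i = 1:m
  a1 = [1 X(i, :)]';
  z2 = Theta1 * a1;
  a2 = [1; tanh(z2)];
  a3 = Theta2 * a2;                      %linear output unit
  delta3 = 2 * (a3 - y(i));              %derivative of the squared error, not the error itself
  delta2 = Theta2' * delta3;
  delta2 = delta2(2:end);                %cut out bias element
  delta2 = delta2 .* (1 - tanh(z2).^2);  %tanh derivative of the hidden layer
  Theta2_grad = Theta2_grad + delta3 * a2';
  Theta1_grad = Theta1_grad + delta2 * a1';
endfor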


Cost function computation for neural network

I am in week 5 of Andrew Ng's Machine Learning Course on Coursera. I am working through the programming assignment in MATLAB for this week, and I chose to use a for-loop implementation to compute the cost J. Here is my function.
function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, ...
%   num_labels, X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% add bias to X to create 5000x401 matrix
X = [ones(m, 1) X];
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
% initialize summing terms used in cost expression
sum_i = 0.0;
% loop through each sample to calculate the cost
for i = 1:m
    % logical vector output for 1 example
    y_i = zeros(num_labels, 1);
    class = y(m);
    y_i(class) = 1;
    % first layer just equals features in one example 1x401
    a1 = X(i, :);
    % compute z2, a 25x1 vector
    z2 = Theta1*a1';
    % compute activation of z2
    a2 = sigmoid(z2);
    % add bias to a2 to create a 26x1 vector
    a2 = [1; a2];
    % compute z3, a 10x1 vector
    z3 = Theta2*a2;
    % compute activation of z3. returns output vector of size 10x1
    a3 = sigmoid(z3);
    h = a3;
    % loop through each class k to sum cost over each class
    for k = 1:num_labels
        % sum_i returns cost summed over each class
        sum_i = sum_i + ((-1*y_i(k) * log(h(k))) - ((1 - y_i(k)) * log(1 - h(k))));
    end
end
J = sum_i/m;
I understand that a vectorized implementation of this would be easier, but I do not understand why this implementation is wrong. When num_labels = 10, this function outputs J = 8.47, but the expected cost is 0.287629. I computed J from the cost formula given in the course. Am I misunderstanding the computation? My understanding is that each training example's cost is computed for each of the 10 classes, and then the costs over all 10 classes for every example are summed together. Is that incorrect, or did I not implement it properly in my code? Thanks in advance.
The problem is in the formula you are implementing.
The expression ((-1*y_i(k) * log(h(k))) - ((1 - y_i(k)) * log(1 - h(k)))) represents the loss for binary classification, where there are only two classes, so either:
y_i is 0, so (1 - y_i) = 1, or
y_i is 1, so (1 - y_i) = 0.
Either way, you basically take into account only the target class probability.
However, with 10 labels, as you mention, it is not necessarily the case that one of (y_i) and (1 - y_i) is 0 and the other is 1.
You should correct the loss function implementation so that you only take into account the probability of the target class, not all the other classes.
My problem was with indexing. Rather than class = y(m) it should be class = y(i), since i is the loop index and m is 5000, the number of rows in the training data.
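For reference, the vectorized implementation mentioned above is quite compact. A minimal sketch under the same conventions (X already carries the bias column, and sigmoid is the course's helper function):

I = eye(num_labels);
Y_mat = I(y, :);                          %m x num_labels one-hot matrix built from y
A2 = [ones(m, 1) sigmoid(X * Theta1')];   %hidden layer activations with bias column
H = sigmoid(A2 * Theta2');                %network outputs, m x num_labels
J = sum(sum(-Y_mat .* log(H) - (1 - Y_mat) .* log(1 - H))) / m;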

Why do I get a predictable output out of a random neural network?

Neural networks have been a topic of interest for me lately, even though I have not had any formal training on them, so everything I know I learned from the internet.
It is my understanding that a neural network, after sufficient training, can be made to output an approximation of any given function. The output of a neural network with randomly generated weights and biases should then look very random.
To challenge this supposition of mine, I programmed a very rudimentary network in MATLAB, with two inputs, one output, and three fully random eight-neuron hidden layers.
The number of inputs and outputs was chosen so that I could 3D-plot the result as a surface, where (x, y) is the input and z the output.
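The script below uses ran (the half-width of the plotted range) and plotn (the number of points per axis) without defining them; the definitions here are assumed values added so the snippet runs as shown:

% Assumed plot parameters (values not given in the original post)
ran = 5;
plotn = 100;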
% Hidden layers
L1 = 8;
L2 = 8;
L3 = 8;
% Weight matrices
A = randi([ -500, 500], [L1, 2]) / 10000;
B = randi([ -500, 500], [L2,L1]) / 10000;
C = randi([ -500, 500], [L3,L2]) / 10000;
D = randi([ -500, 500], [01,L3]) / 10000;
% Biases
a = randi([-2000, 2000], [L1, 1]) / 10000;
b = randi([-2000, 2000], [L2, 1]) / 10000;
c = randi([-2000, 2000], [L3, 1]) / 10000;
d = randi([-2000, 2000], [ 1, 1]) / 10000;
x = -ran:ran/plotn:ran;
y = -ran:ran/plotn:ran;
z = zeros(length(x),length(y));
for i = 1:length(x)
    for j = 1:length(y)
        z(i,j) = NN([x(i); y(j)],A,B,C,D,a,b,c,d);
    end
end
function y = NN(x,A,B,C,D,a,b,c,d)
    sig = @(x) 2*exp(x) ./ (exp(x) + 1) - 1;  % sigmoid rescaled to (-1, 1); equals tanh(x/2)
    x = A*x + a;
    x = sig(x/1);
    x = B*x + b;
    x = sig(x/1);
    x = C*x + c;
    x = sig(x/1);
    x = D*x + d;
    x = sig(x/1);
    y = x;
end
By plotting z against x and y, to my surprise, I found the output to be much more regular and predictable than I would have thought, as seen in the following images.
Even when increasing the number of neurons in the hidden layers to absurd amounts, the same behavior appears: the surface consists of horizontal planes at different z-values, with those weird folds all crossing at (0,0).
This makes me wonder how, even with sufficient training, the output could follow any given function (say 1/(1+x^2+y^2)), not to speak of any other pattern for which the analytical expression is unclear (which is where you'd apply a neural network).
I'd be thankful if anyone could explain to me where I'm confused about neural networks.

MATLAB: vectorised backpropagation (no loop over training examples)

In MATLAB/Octave, how do I implement backpropagation without any loops over the training examples?
This answer talks about the theory of parallelism, but how would this be implemented in actual Octave code?
For me, the final piece of the puzzle came from realising that the per-example sum of outer products delta_i * a_i' collapses into a single matrix product.
Here is what I came up with:
% X is a {# of training examples} x {# of features} matrix
% Y is a {# of training examples} x {# of output neurons} matrix
% Theta is a cell array containing Theta{1}...Theta{n}

% Number of training examples
m = size(X, 1);

% Get h(X) and z (non-activated output of all neurons in network)
[hX, z, activation] = predict(Theta, X);

% Get error of output layer
layers = 1 + length(Theta);
d{layers} = hX - Y;

% Propagate errors backwards through hidden layers
for layer = layers-1 : -1 : 2
    d{layer} = d{layer+1} * Theta{layer};
    d{layer} = d{layer}(:, 2:end); % Remove "error" for constant bias term
    d{layer} .*= sigmoidGradient(z{layer});
end

% Calculate Theta gradients
for l = 1:layers-1
    Theta_grad{l} = zeros(size(Theta{l}));
    % Sum of outer products
    Theta_grad{l} += d{l+1}' * [ones(m,1) activation{l}];
    % Add regularisation term
    Theta_grad{l}(:, 2:end) += lambda * Theta{l}(:, 2:end);
    Theta_grad{l} /= m;
end
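The snippet relies on a predict helper that returns, for every layer, both the pre-activations z and the activations. The original post does not show it, so the sketch below is an assumption about its conventions (sigmoid as in the Coursera exercises; activation{l} stored without the bias column, matching the [ones(m,1) activation{l}] above):

function [hX, z, activation] = predict(Theta, X)
    m = size(X, 1);
    activation{1} = X;           % input layer activations, no bias column
    a = X;
    for l = 1:length(Theta)
        a = [ones(m, 1) a];      % prepend bias column
        z{l+1} = a * Theta{l}';  % pre-activations of layer l+1
        a = sigmoid(z{l+1});     % assumed sigmoid helper
        activation{l+1} = a;
    end
    hX = a;                      % network output, m x {# of output neurons}
end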

[Octave] Using fminunc is not always giving a consistent solution

I am trying to find the coefficients in an equation to model the step response of a motor, which is of the form 1 - e^x. The equation I'm using for the model is of the form
a(1)*t^2 + a(2)*t^3 + a(3)*t^4 + ...
(It is derived in a research paper used to solve for motor parameters.)
Sometimes using fminunc to find the coefficients works out okay and I get a good result that matches the training data fairly well. Other times the returned coefficients are horrible (the modeled output goes extremely far above what it should be, orders of magnitude off). This especially happens once I start using higher-order terms: any model that uses x^8 or higher (x^9, x^10, x^11, etc.) always produces bad results.
Since it works sometimes, I can't see why my implementation would be wrong. I have tried fminunc both with and without providing the gradients, and there is no difference. I've looked into using other functions to solve for the coefficients, such as polyfit, but polyfit fits all powers from 0 up to the highest-order term, while the model I'm using has its lowest power at 2.
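(As an aside: since the model is linear in the coefficients a(k), a direct least-squares solve on the same restricted design matrix avoids iterative minimization entirely. A one-line sketch, using the X that the main code below builds:

theta_ls = X \ Y;   %solves min ||X*theta - Y||^2 directly via QR factorization
)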
Here is the main code:
clear;
%Overall Constants
max_power = 7;

%Loads in data
%data = load('TestData.txt');
load testdata.mat

%Sets data into variables
indep_x = data(:,1); Y = data(:,2);

%number of data points
m = length(Y);

%X is a matrix with powers of the independent variable
exps = [2:max_power];
X_prime = repmat(indep_x, 1, max_power-1); %Repeats columns of the indep var
X = bsxfun(@power, X_prime, exps);

%Initializes theta to rand vals
init_theta = rand(max_power-1,1);

%Sets up options for fminunc
options = optimset('MaxIter', 400, 'Algorithm', 'quasi-newton');

%fminunc minimizes the output of the cost function by changing the theta parameters
[theta, cost] = fminunc(@(t)(costFunction(t, X, Y)), init_theta, options)

Y_line = X * theta;
figure;
hold on; plot(indep_x, Y, 'or');
hold on; plot(indep_x, Y_line, 'bx');
And here is costFunction:
function [J, Grad] = costFunction (theta, X, Y)
    %# of training examples
    m = length(Y);
    %Initialize Cost and Grad-Vector
    J = 0;
    Grad = zeros(size(theta));
    %Produces an output based off the current values of theta
    model_output = X * theta;
    %Computes the squared error for each example then adds them to get the total error
    squared_error = (model_output - Y).^2;
    J = (1/(2*m)) * sum(squared_error);
    %Computes the gradients for each theta t
    for t = 1:size(theta, 1)
        Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t));
    end
endfunction
Any help or advice would be appreciated.
Try adding regularization to your costFunction:
function [J, Grad] = costFunction (theta, X, Y, lambda)
    m = length(Y);
    %Initialize Cost and Grad-Vector
    J = 0;
    Grad = zeros(size(theta));
    %Produces an output based off the current values of theta
    model_output = X * theta;
    %Computes the squared error for each example then adds them to get the total error
    squared_error = (model_output - Y).^2;
    J = (1/(2*m)) * sum(squared_error);
    %Regularization
    J = J + lambda*sum(theta(2:end).^2)/(2*m);
    %Computes the gradients for each theta t
    regularizator = lambda*theta/m;
    %overwrite 1st element, i.e. the one corresponding to theta zero
    regularizator(1) = 0;
    for t = 1:size(theta, 1)
        Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t)) + regularizator(t);
    end
endfunction
The regularization parameter lambda controls how strongly large coefficients are penalized (it is not a learning rate). Start with lambda = 1. The greater the value of lambda, the more the coefficients are shrunk towards zero and the more slowly the fit will chase the data. Increase lambda if the behavior you describe persists. You may need to increase the number of iterations if lambda gets high.
You may also consider normalizing your data, and some heuristic for initializing theta; setting all theta to 0.1 may be better than random. If nothing else, it will give better reproducibility from training to training.
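On the normalization point: with powers up to t^7 the columns of X differ by many orders of magnitude, which is a classic cause of the erratic fits described above. A minimal sketch of column-wise feature scaling before calling fminunc (remember to apply the same mu and sigma when evaluating the fitted model):

%Column-wise feature scaling (z-score) for the design matrix X
mu = mean(X);
sigma = std(X);
X_norm = (X - mu) ./ sigma;   %implicit broadcasting in Octave / MATLAB R2016b+; use bsxfun on older versions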

Gradient descent with multiple variables without matrix operations

I'm new to MATLAB and machine learning, and I tried to write a gradient descent function without using matrix operations.
m is the number of examples in my training set
n is the number of features for each example
The function gradientDescentMulti takes 5 arguments:
X : an m x n matrix
y : an m-dimensional vector
theta : an n-dimensional vector
alpha : a real number (the learning rate)
num_iters : the number of iterations
I already have a solution using matrix multiplication:
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples (missing in the original; required inside function scope)
    for iter = 1:num_iters
        gradJ = 1/m * (X'*X*theta - X'*y);
        theta = theta - alpha * gradJ;
    end
end
The result after iterations:
theta =
1.0e+05 *
3.3430
1.0009
0.0367
But now I tried to do the same without matrix multiplication; this is the function:
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples
    n = size(X, 2); % number of features
    for iter = 1:num_iters
        new_theta = zeros(1, n);
        %// for each feature, find the new theta
        for t = 1:n
            S = 0;
            for example = 1:m
                h = 0;
                for example_feature = 1:n
                    h = h + (theta(example_feature) * X(example, example_feature));
                end
                S = S + ((h - y(example)) * X(example, n)); %// Sum each feature for this example
            end
            new_theta(t) = theta(t) - alpha * (1/m) * S; %// Calculate new theta for this feature
        end
        %// only at the end of the iteration, update all theta simultaneously
        theta = new_theta'; %// Transpose new_theta (horizontal vector) to theta (vertical vector)
    end
end
The result: all the thetas come out the same :/
theta =
1.0e+04 *
3.5374
3.5374
3.5374
If you look at the gradient update rule, it may be more efficient to compute the hypothesis of all of your training examples first, then subtract the ground-truth value of each training example and store these differences in an array or vector. Once you do this, you can compute the update rule very easily. It doesn't appear that you're doing this in your code. (Incidentally, the immediate bug in your posted code is the line S = S + ((h - y(example)) * X(example, n)); it indexes the feature with n instead of t, so every theta is updated with the gradient of the last feature, which is exactly why they all come out identical.)
As such, I rewrote the code, with a separate array that stores the difference between the hypothesis of each training example and its ground-truth value. Once I have this, I compute the update rule for each feature separately:
for iter = 1 : num_iters
    %// Compute hypothesis differences with ground truth first
    h = zeros(1, m);
    for t = 1 : m
        %// Compute hypothesis
        for tt = 1 : n
            h(t) = h(t) + theta(tt)*X(t,tt);
        end
        %// Compute difference between hypothesis and ground truth
        h(t) = h(t) - y(t);
    end
    %// Now update parameters
    new_theta = zeros(1, n);
    %// for each feature, find the new theta
    for tt = 1 : n
        S = 0;
        %// For each sample, compute products of hypothesis difference
        %// and the right feature of the sample and accumulate
        for t = 1 : m
            S = S + h(t)*X(t,tt);
        end
        %// Compute gradient descent step
        new_theta(tt) = theta(tt) - (alpha/m)*S;
    end
    theta = new_theta'; %// Transpose new_theta (horizontal vector) to theta (vertical vector)
end
When I do this, I get the same answers as using the matrix formulation.