Okay, so I am in the middle of Andrew Ng's machine learning course on coursera and would like to adapt the neural network which was completed as part of assignment 4.
In particular, the neural network which I had completed correctly as part of the assignment was as follows:
Sigmoid activation function: g(z) = 1/(1+e^(-z))
10 output units, each which could take 0 or 1
1 hidden layer
Back-propagation method used to minimize cost function
Cost function:
where L=number of layers, s_l = number of units in layer l, m = number of training examples, K = number of output units
Now I want to adjust the exercise so that there is one continuous output unit that takes any value between [0,1] and I am trying to work out what needs to change, so far I have
Replaced the data with my own, i.e.,such that the output is continuous variable between 0 and 1
Updated references to the number of output units
Updated the cost function in the back-propagation algorithm to:
where a_3 is the value of the output unit determined from forward propagation.
I am certain that something else must change as the gradient checking method shows the gradient determined by back-propagation and that by the numerical approximation no longer match up. I did not change the sigmoid gradient; it is left at f(z)*(1-f(z)) where f(z) is the sigmoid function 1/(1+e^(-z))) nor did I update the numerical approximation of the derivative formula; simply (J(theta+e) - J(theta-e))/(2e).
Can anyone advise of what other steps would be required?
Coded in Matlab as follows:
% FORWARD PROPAGATION
% input layer
a1 = [ones(m,1),X];
% hidden layer
z2 = a1*Theta1';
a2 = sigmoid(z2);
a2 = [ones(m,1),a2];
% output layer
z3 = a2*Theta2';
a3 = sigmoid(z3);
% BACKWARD PROPAGATION
delta3 = a3 - y;
delta2 = delta3*Theta2(:,2:end).*sigmoidGradient(z2);
Theta1_grad = (delta2'*a1)/m;
Theta2_grad = (delta3'*a2)/m;
% COST FUNCTION
J = 1/(2 * m) * sum( (a3-y).^2 );
% Implement regularization with the cost function and gradients.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + Theta1(:,2:end)*lambda/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + Theta2(:,2:end)*lambda/m;
J = J + lambda/(2*m)*( sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
I have since realised that this question is similar to that asked by #Mikhail Erofeev on StackOverflow, however in this case I wish the continuous variable to be between 0 and 1 and therefore use a sigmoid function.
First, your cost function should be:
J = 1/m * sum( (a3-y).^2 );
I think your Theta2_grad = (delta3'*a2)/m;is expected to match the numerical approximation after changed to delta3 = 1/2 * (a3 - y);).
Check this slide for more details.
EDIT:
In case there is some minor discrepancy between our codes, I pasted my code below for your reference. The code has already been compared with numerical approximation function checkNNGradients(lambda);, the Relative Difference is less than 1e-4 (not meets the 1e-11 requirement by Dr.Andrew Ng though)
function [J grad] = nnCostFunctionRegression(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
m = size(X, 1);
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
X = [ones(m, 1) X];
z1 = sigmoid(X * Theta1');
zs = z1;
z1 = [ones(m, 1) z1];
z2 = z1 * Theta2';
ht = sigmoid(z2);
y_recode = zeros(length(y),num_labels);
for i=1:length(y)
y_recode(i,y(i))=1;
end
y = y_recode;
regularization=lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
J=1/(m)*sum(sum((ht - y).^2))+regularization;
delta_3 = 1/2*(ht - y);
delta_2 = delta_3 * Theta2(:,2:end) .* sigmoidGradient(X * Theta1');
delta_cap2 = delta_3' * z1;
delta_cap1 = delta_2' * X;
Theta1_grad = ((1/m) * delta_cap1)+ ((lambda/m) * (Theta1));
Theta2_grad = ((1/m) * delta_cap2)+ ((lambda/m) * (Theta2));
Theta1_grad(:,1) = Theta1_grad(:,1)-((lambda/m) * (Theta1(:,1)));
Theta2_grad(:,1) = Theta2_grad(:,1)-((lambda/m) * (Theta2(:,1)));
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
If you want to have continuous output try not to use sigmoid activation when computing target value.
a1 = [ones(m, 1) X];
a2 = sigmoid(X * Theta1');
a2 = [ones(m, 1) z1];
a3 = z1 * Theta2';
ht = a3;
Normalize input before using it in nnCostFunction. Everything else remains same.
Related
Here is my code. I think it is wrong because the difference between this computed gradient and my numerical estimate is too significant. It doesn't seem to be due to wrongly inverting matrices, etc.
For context, Y is the output layer, X is the input layer, and there is only 1 hidden layer. Theta1 is the weights for the first input layer and Theta2 is the weights for the hidden layer.
for t = 1:m
% do fw prop again...
a1 = [1 X(i,:)];
a2 = [1 sigmoid(a1 * Theta1')];
a3 = sigmoid(a2 * Theta2');
delta_3 = a3' - Y(:, t);
delta_2 = Theta2' * delta_3 .* a2' .* (1 - a2)';
delta_2 = delta_2(2:end,:);
Theta1_grad = Theta1_grad + delta_2 * [1 X(i, :)];
Theta2_grad = Theta2_grad + delta_3 * [1 sigmoid([1 X(i,:)] * Theta1')];
end
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Assume we have three equations:
eq1 = x1 + (x1 - x2) * t - X == 0;
eq2 = z1 + (z1 - z2) * t - Z == 0;
eq3 = ((X-x1)/a)^2 + ((Z-z1)/b)^2 - 1 == 0;
while six of known variables are:
a = 42 ;
b = 12 ;
x1 = 316190;
z1 = 234070;
x2 = 316190;
z2 = 234070;
So we are looking for three unknown variables that are:
X , Z and t
I wrote two method to solve it. But, since I need to run these code for 5.7 million data, it become really slow.
Method one (using "solve"):
tic
S = solve( eq1 , eq2 , eq3 , X , Z , t ,...
'ReturnConditions', true, 'Real', true);
toc
X = double(S.X(1))
Z = double(S.Z(1))
t = double(S.t(1))
results of method one:
X = 316190;
Z = 234060;
t = -2.9280;
Elapsed time is 0.770429 seconds.
Method two (using "fsolve"):
coeffs = [a,b,x1,x2,z1,z2]; % Known parameters
x0 = [ x2 ; z2 ; 1 ].'; % Initial values for iterations
f_d = #(x0) myfunc(x0,coeffs); % f_d considers x0 as variables
options = optimoptions('fsolve','Display','none');
tic
M = fsolve(f_d,x0,options);
toc
results of method two:
X = 316190; % X = M(1)
Z = 234060; % Z = M(2)
t = -2.9280; % t = M(3)
Elapsed time is 0.014 seconds.
Although, the second method is faster, but it still needs to be improved. Please let me know if you have a better solution for that. Thanks
* extra information:
if you are interested to know what those 3 equations are, the first two are equations of a line in 2D and the third equation is an ellipse equation. I need to find the intersection of the line with the ellipse. Obviously, we have two points as result. But, let's forget about the second answer for simplicity.
My suggestion it's to use the second approce,which it's the recommended by matlab for nonlinear equation system.
Declare a M-function
function Y=mysistem(X)
%X(1) = X
%X(2) = t
%X(3) = Z
a = 42 ;
b = 12 ;
x1 = 316190;
z1 = 234070;
x2 = 316190;
z2 = 234070;
Y(1,1) = x1 + (x1 - x2) * X(2) - X(1);
Y(2,1) = z1 + (z1 - z2) * X(2) - X(3);
Y(3,1) = ((X-x1)/a)^2 + ((Z-z1)/b)^2 - 1;
end
Then for solving use
x0 = [ x2 , z2 , 1 ];
M = fsolve(#mysistem,x0,options);
If you may want to reduce the default precision by changing StepTolerance (default 1e-6).
Also for more increare you may want to use the jacobian matrix for greater efficencies.
For more reference take a look in official documentation:
fsolve Nonlinear Equations with Analytic Jacobian
Basically giving the solver the Jacobian matrix of the system(and special options) you can increase method efficency.
I have put together a classifying 3 layer artificial neural network that appears to work on other datasets. Playing around some artificial datasets that I made, I was unable to correctly predict between two classes when one class was positive in one feature or another feature.
Clearly class1 is can be identified by asking if either feature 1 or feature 2 is equal to 1 but I can't get the algorithm to predict the dataset correctly (there are 20 examples following this pattern in the dataset).
Can ANN/MLPs recognize this type of pattern? If so, what am I missing? If not, are there other methods that can predict this type of pattern (maybe SVM)?
I used Octave as that was what was used in the online course offered from coursera. I have listed most of the code here although it is structured slightly differently when I run it. As you can see I do use bias units on the first and second layers and I have also varied the number of hidden units in the second layer from 1-5 with no improvement over random guessing.
% Load dataset
y = [1; 1; 2; 2]
X = [1, 0; 0, 1; 0, 0; 0, 0]
m = size(X, 1);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
% Randomly initialize weight parameters
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
% Add bias units to layers and feedforward
Xbias = [ones(m,1), X];
L2bias = [ones(m,1), sigmoid(Xbias*Theta1')];
L3 = sigmoid(L2bias * Theta2');
% Create class matrix Y
Y = zeros(m, num_labels);
for r = 1:m;
Y(r, y(r)) = 1;
end
% Set cost function
J = (sum(sum(Y.*log(L3) + (1-Y).*log(1-L3))))/-m + lambda*(sum(sum((Theta1(:,2:columns(Theta1))).^2)) + sum(sum((Theta2(:,2:columns(Theta2))).^2)))/2/m;
% Initialize weight gradient matrices
D2 = zeros(rows(Theta2),columns(Theta2));
D1 = zeros(rows(Theta1),columns(Theta1));
% Calculate gradient with backpropagation
for t = 1:m;
a1 = [1 X(t,:)]';
z2 = Theta1*a1;
a2 = [1; sigmoid(z2)];
z3 = Theta2*a2;
a3 = sigmoid(z3);
d3 = a3 - Y(t,:)';
d2 = (Theta2'*d3)(2:end).*sigmoidGradient(z2);
D2 = D2 + d3*a2';
D1 = D1 + d2*a1';
end
Theta2_grad = D2/m;
Theta1_grad = D1/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda*Theta2(:,2:end)/m;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda*Theta1(:,2:end)/m;
% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
% Compute cost (Feed forward)
[J,grad] = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Create "short hand" for the cost function to be minimized using fmincg
costFunction = #(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
% Train the neural network using fmincg
options = optimset('MaxIter', 1000);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));
NN can recognize any pattern. Universal Approximation Theorem proves that (as well as many others).
The most obvious reason I can think of is lack of bias neuron. Althouh for more valuable answers you have to include your code.
I wrote a code to implement steepest descent backpropagation with which I am having issues. I am using the Machine CPU dataset and have scaled the inputs and outputs into range [0 1]
The codes in matlab/octave is as follows:
steepest descent backpropagation
%SGD = Steepest Gradient Decent
function weights = nnSGDTrain (X, y, nhid_units, gamma, max_epoch, X_test, y_test)
iput_units = columns (X);
oput_units = columns (y);
n = rows (X);
W2 = rand (nhid_units + 1, oput_units);
W1 = rand (iput_units + 1, nhid_units);
train_rmse = zeros (1, max_epoch);
test_rmse = zeros (1, max_epoch);
for (epoch = 1:max_epoch)
delW2 = zeros (nhid_units + 1, oput_units)';
delW1 = zeros (iput_units + 1, nhid_units)';
for (i = 1:rows(X))
o1 = sigmoid ([X(i,:), 1] * W1); %1xn+1 * n+1xk = 1xk
o2 = sigmoid ([o1, 1] * W2); %1xk+1 * k+1xm = 1xm
D2 = o2 .* (1 - o2);
D1 = o1 .* (1 - o1);
e = (y_test(i,:) - o2)';
delta2 = diag (D2) * e; %mxm * mx1 = mx1
delta1 = diag (D1) * W2(1:(end-1),:) * delta2; %kxm * mx1 = kx1
delW2 = delW2 + (delta2 * [o1 1]); %mx1 * 1xk+1 = mxk+1 %already transposed
delW1 = delW1 + (delta1 * [X(i, :) 1]); %kx1 * 1xn+1 = k*n+1 %already transposed
end
delW2 = gamma .* delW2 ./ n;
delW1 = gamma .* delW1 ./ n;
W2 = W2 + delW2';
W1 = W1 + delW1';
[dummy train_rmse(epoch)] = nnPredict (X, y, nhid_units, [W1(:);W2(:)]);
[dummy test_rmse(epoch)] = nnPredict (X_test, y_test, nhid_units, [W1(:);W2(:)]);
printf ('Epoch: %d\tTrain Error: %f\tTest Error: %f\n', epoch, train_rmse(epoch), test_rmse(epoch));
fflush (stdout);
end
weights = [W1(:);W2(:)];
% plot (1:max_epoch, test_rmse, 1);
% hold on;
plot (1:max_epoch, train_rmse(1:end), 2);
% hold off;
end
predict
%Now SFNN Only
function [o1 rmse] = nnPredict (X, y, nhid_units, weights)
iput_units = columns (X);
oput_units = columns (y);
n = rows (X);
W1 = reshape (weights(1:((iput_units + 1) * nhid_units),1), iput_units + 1, nhid_units);
W2 = reshape (weights((((iput_units + 1) * nhid_units) + 1):end,1), nhid_units + 1, oput_units);
o1 = sigmoid ([X ones(n,1)] * W1); %nxiput_units+1 * iput_units+1xnhid_units = nxnhid_units
o2 = sigmoid ([o1 ones(n,1)] * W2); %nxnhid_units+1 * nhid_units+1xoput_units = nxoput_units
rmse = RMSE (y, o2);
end
RMSE function
function rmse = RMSE (a1, a2)
rmse = sqrt (sum (sum ((a1 - a2).^2))/rows(a1));
end
I have also trained the same dataset using the R RSNNS package mlp and the RMSE for train set (first 100 examples) are around 0.03 . But in my implementation I cannot achieve lower RMSE than 0.14 . And sometimes the errors grow for some higher learning rates, and no learning rate gets me lower RMSE than 0.14. Also a paper i referred report the RMSE in for the train set is around 0.03
I wanted to know where is the problem i the code. I have followed Raul Rojas book and confirmed that things are okay.
In backprobagation code the line
e = (y_test(i,:) - o2)';
is not correct, because the o2 is the output from the train set and i am finding the difference from one example from the test set y_test. The line should have been as below:
e = (y(i,:) - o2)';
which correctly finds the difference between the predicted output by the current model and the target output of the corresponding example.
This took me 3 days to find this one, I am fortunate enough to find this freaking bug which stopped me from going into further modifications.
I've written some code to implement an algorithm that takes as input a vector q of real numbers, and returns as an output a complex matrix R. The Matlab code below produces a plot showing the input vector q and the output matrix R.
Given only the complex matrix output R, I would like to obtain the input vector q. Can I do this using least-squares optimization? Since there is a recursive running sum in the code (rs_r and rs_i), the calculation for a column of the output matrix is dependent on the calculation of the previous column.
Perhaps a non-linear optimization can be set up to recompose the input vector q from the output matrix R?
Looking at this in another way, I've used an algorithm to compute a matrix R. I want to run the algorithm "in reverse," to get the input vector q from the output matrix R.
If there is no way to recompose the starting values from the output, thereby treating the problem as a "black box," then perhaps the mathematics of the model itself can be used in the optimization? The program evaluates the following equation:
The Utilde(tau, omega) is the output matrix R. The tau (time) variable comprises the columns of the response matrix R, whereas the omega (frequency) variable comprises the rows of the response matrix R. The integration is performed as a recursive running sum from tau = 0 up to the current tau timestep.
Here are the plots created by the program posted below:
Here is the full program code:
N = 1001;
q = zeros(N, 1); % here is the input
q(1:200) = 55;
q(201:300) = 120;
q(301:400) = 70;
q(401:600) = 40;
q(601:800) = 100;
q(801:1001) = 70;
dt = 0.0042;
fs = 1 / dt;
wSize = 101;
Glim = 20;
ginv = 0;
R = get_response(N, q, dt, wSize, Glim, ginv); % R is output matrix
rows = wSize;
cols = N;
figure; plot(q); title('q value input as vector');
ylim([0 200]); xlim([0 1001])
figure; imagesc(abs(R)); title('Matrix output of algorithm')
colorbar
Here is the function that performs the calculation:
function response = get_response(N, Q, dt, wSize, Glim, ginv)
fs = 1 / dt;
Npad = wSize - 1;
N1 = wSize + Npad;
N2 = floor(N1 / 2 + 1);
f = (fs/2)*linspace(0,1,N2);
omega = 2 * pi .* f';
omegah = 2 * pi * f(end);
sigma2 = exp(-(0.23*Glim + 1.63));
sign = 1;
if(ginv == 1)
sign = -1;
end
ratio = omega ./ omegah;
rs_r = zeros(N2, 1);
rs_i = zeros(N2, 1);
termr = zeros(N2, 1);
termi = zeros(N2, 1);
termr_sub1 = zeros(N2, 1);
termi_sub1 = zeros(N2, 1);
response = zeros(N2, N);
% cycle over cols of matrix
for ti = 1:N
term0 = omega ./ (2 .* Q(ti));
gamma = 1 / (pi * Q(ti));
% calculate for the real part
if(ti == 1)
Lambda = ones(N2, 1);
termr_sub1(1) = 0;
termr_sub1(2:end) = term0(2:end) .* (ratio(2:end).^-gamma);
else
termr(1) = 0;
termr(2:end) = term0(2:end) .* (ratio(2:end).^-gamma);
rs_r = rs_r - dt.*(termr + termr_sub1);
termr_sub1 = termr;
Beta = exp( -1 .* -0.5 .* rs_r );
Lambda = (Beta + sigma2) ./ (Beta.^2 + sigma2); % vector
end
% calculate for the complex part
if(ginv == 1)
termi(1) = 0;
termi(2:end) = (ratio(2:end).^(sign .* gamma) - 1) .* omega(2:end);
else
termi = (ratio.^(sign .* gamma) - 1) .* omega;
end
rs_i = rs_i - dt.*(termi + termi_sub1);
termi_sub1 = termi;
integrand = exp( 1i .* -0.5 .* rs_i );
if(ginv == 1)
response(:,ti) = Lambda .* integrand;
else
response(:,ti) = (1 ./ Lambda) .* integrand;
end
end % ti loop
No, you cannot do so unless you know the "model" itself for this process. If you intend to treat the process as a complete black box, then it is impossible in general, although in any specific instance, anything can happen.
Even if you DO know the underlying process, then it may still not work, as any least squares estimator is dependent on the starting values, so if you do not have a good guess there, it may converge to the wrong set of parameters.
It turns out that by using the mathematics of the model, the input can be estimated. This is not true in general, but for my problem it seems to work. The cumulative integral is eliminated by a partial derivative.
N = 1001;
q = zeros(N, 1);
q(1:200) = 55;
q(201:300) = 120;
q(301:400) = 70;
q(401:600) = 40;
q(601:800) = 100;
q(801:1001) = 70;
dt = 0.0042;
fs = 1 / dt;
wSize = 101;
Glim = 20;
ginv = 0;
R = get_response(N, q, dt, wSize, Glim, ginv);
rows = wSize;
cols = N;
cut_val = 200;
imagLogR = imag(log(R));
Mderiv = zeros(rows, cols-2);
for k = 1:rows
val = deriv_3pt(imagLogR(k,:), dt);
val(val > cut_val) = 0;
Mderiv(k,:) = val(1:end-1);
end
disp('Running iteration');
q0 = 10;
q1 = 500;
NN = cols - 2;
qout = zeros(NN, 1);
for k = 1:NN
data = Mderiv(:,k);
qout(k) = fminbnd(#(q) curve_fit_to_get_q(q, dt, rows, data),q0,q1);
end
figure; plot(q); title('q value input as vector');
ylim([0 200]); xlim([0 1001])
figure;
plot(qout); title('Reconstructed q')
ylim([0 200]); xlim([0 1001])
Here are the supporting functions:
function output = deriv_3pt(x, dt)
% Function to compute dx/dt using the 3pt symmetrical rule
% dt is the timestep
N = length(x);
N0 = N - 1;
output = zeros(N0, 1);
denom = 2 * dt;
for k = 2:N0
output(k - 1) = (x(k+1) - x(k-1)) / denom;
end
function sse = curve_fit_to_get_q(q, dt, rows, data)
fs = 1 / dt;
N2 = rows;
f = (fs/2)*linspace(0,1,N2); % vector for frequency along cols
omega = 2 * pi .* f';
omegah = 2 * pi * f(end);
ratio = omega ./ omegah;
gamma = 1 / (pi * q);
termi = ((ratio.^(gamma)) - 1) .* omega;
Error_Vector = termi - data;
sse = sum(Error_Vector.^2);