Neural Network convergence speed (Levenberg-Marquardt) (MATLAB)

I was trying to approximate a function (single input and single output) with an ANN. Using the MATLAB toolbox I could see that with 5 or more neurons in the hidden layer I can achieve a very nice result, so I am trying to do it manually.
Calculations:
As the network has only one input and one output, the partial derivative of the error (e = d - o, where 'd' is the desired output and 'o' is the actual output) with respect to a weight connecting a hidden neuron j to the output neuron will be -hj (where hj is the output of hidden neuron j);
The partial derivative of the error with respect to the output bias will be -1;
The partial derivative of the error with respect to a weight connecting the input to a hidden neuron j will be -woj*f'*i, where woj is the output weight of hidden neuron j, f' is the tanh() derivative and 'i' is the input value;
Finally, the partial derivative of the error with respect to a hidden-layer bias will be the same as above (with respect to an input weight), except that here we don't have the input term:
-woj*f'
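For reference, those four derivatives can be written compactly as follows (same notation as above: hj is the hidden neuron output, woj the hidden-to-output weight, f' the tanh derivative, i the input), listed in the order in which the code below stacks them into one row of the Jacobian:

\[
\frac{\partial e}{\partial w_{hj}} = -\,w_{oj}\,f'\,i,\qquad
\frac{\partial e}{\partial w_{oj}} = -\,h_j,\qquad
\frac{\partial e}{\partial b_{hj}} = -\,w_{oj}\,f',\qquad
\frac{\partial e}{\partial b_{o}} = -1
\]

which is exactly the row J(i,:) = [-x*wo'.*fhd  -ij  -wo'.*fhd  -1] built inside the training loop.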
The problem is:
the MATLAB algorithm always converges faster and better. I can achieve the same curve as MATLAB does, but my algorithm requires many more epochs.
I've tried removing the pre- and post-processing functions from the MATLAB algorithm. It still converges faster.
I've also tried creating and configuring the network, and extracting the weight/bias values before training so I could copy them into my algorithm to see if it converges faster, but nothing changed (is the weight/bias initialization inside the create/configure functions or the train function?).
Does the MATLAB algorithm have some kind of optimization inside the code?
Or could this difference be only in the organization of the training set and in the weight/bias initialization?
In case one wants to look at my code, here is the main loop which does the training:
Err2 = N;
epochs = 0;
%compare MSE of error2
while ((Err2/N > 0.0003) && (u < 10000000) && (epochs < 100))
    epochs = epochs+1;
    Err = 0;
    %input->hidden weight vector
    wh = w(1:hidden_layer_len);
    %hidden->output weight vector
    wo = w((hidden_layer_len+1):(2*hidden_layer_len));
    %hidden bias
    bi = w((2*hidden_layer_len+1):(3*hidden_layer_len));
    %output bias
    bo = w(length(w));
    %start forward propagation
    for i=1:N
        %take next input value
        x = t(i);
        %propagate to hidden layer
        neth = x*wh + bi;
        %propagate through neurons
        ij = tanh(neth)';
        %propagate to output layer
        neto = ij*wo + bo;
        %propagate to output (purelin)
        output(i) = neto;
        %calculate difference from target (error)
        error(i) = yp(i) - output(i);
        %Backpropagation:
        %tanh derivative
        fhd = 1 - tanh(neth').*tanh(neth');
        %jacobian matrix
        J(i,:) = [-x*wo'.*fhd -ij -wo'.*fhd -1];
        %SSE (sum square error)
        Err = Err + 0.5*error(i)*error(i);
    end
    %calculate next error with updated weights and compare with old error
    %start error2 from error1 + 1 to enter while loop
    Err2 = Err+1;
    %while error2 is > than old error and Mu (u) is not too large
    while ((Err2 > Err) && (u < 10000000))
        %Weight update
        w2 = w - (((J'*J + u*eye(3*hidden_layer_len+1))^-1)*J')*error';
        %New Error calculation
        %New weights to propagate
        wh = w2(1:hidden_layer_len);
        wo = w2((hidden_layer_len+1):(2*hidden_layer_len));
        %new bias to propagate
        bi = w2((2*hidden_layer_len+1):(3*hidden_layer_len));
        bo = w2(length(w));
        %calculate error2
        Err2 = 0;
        for i=1:N
            %forward propagation again
            x = t(i);
            neth = x*wh + bi;
            ij = tanh(neth)';
            neto = ij*wo + bo;
            output(i) = neto;
            error2(i) = yp(i) - output(i);
            %Error2 (SSE)
            Err2 = Err2 + 0.5*error2(i)*error2(i);
        end
        %compare MSE from error2 with a minimum
        %if greater, keep running
        if (Err2/N > 0.0003)
            %compare with old error
            if (Err2 <= Err)
                %if less, update weights and decrease Mu (u)
                w = w2;
                u = u/10;
            else
                %if greater, increase Mu (u)
                u = u*10;
            end
        end
    end
end

It's not easy to know the exact implementation of the Levenberg-Marquardt algorithm in Matlab. You may try to run the algorithm one iteration at a time and check whether it matches your algorithm. You can also try other implementations, such as http://www.mathworks.com/matlabcentral/fileexchange/16063-lmfsolve-m--levenberg-marquardt-fletcher-algorithm-for-nonlinear-least-squares-problems, to see whether the performance can be improved. For simple learning problems, convergence speed may be a matter of learning rate; you might simply increase the learning rate to get faster convergence.
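If you want to try the one-iteration-at-a-time comparison, a minimal sketch along these lines may help. It assumes the Neural Network Toolbox functions feedforwardnet, configure, getwb, sim and train are available, and t and yp stand for the input and target vectors from the question; property names may differ slightly between toolbox versions:
net = feedforwardnet(5, 'trainlm');   % 5 hidden neurons, Levenberg-Marquardt
net = configure(net, t, yp);          % sets the sizes and initializes weights/biases (init)
net.divideFcn = 'dividetrain';        % no validation/test split
net.trainParam.epochs = 1;            % one LM iteration per call to train
net.trainParam.showWindow = false;
w0 = getwb(net);                      % initial weights/biases, to copy into the manual code
for k = 1:100
    net = train(net, t, yp);          % run one more epoch from the current weights
    fprintf('epoch %d, mse = %g\n', k, perform(net, yp, sim(net, t)));
end
% Note: trainlm resets mu to net.trainParam.mu on each call to train,
% so this epoch-by-epoch run only approximates a single continuous run.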

Related

Something's wrong with my Logistic Regression?

I'm trying to verify whether my implementation of logistic regression in Matlab is good. I'm doing so by comparing the results I get via my implementation with the results given by the built-in function mnrfit.
The dataset D, Y that I have is such that each row of D is an observation in R^2 and the labels in Y are either 0 or 1. Thus, D is a matrix of size (n,2), and Y is a vector of size (n,1).
Here's how I do my implementation:
I first normalize my data and augment it to include the offset:
d = 2; %dimension of data
M = mean(D) ;
centered = D-repmat(M,n,1) ;
devs = sqrt(sum(centered.^2)) ;
normalized = centered./repmat(devs,n,1) ;
X = [normalized,ones(n,1)];
I will be doing my calculations on X.
Second, I define the gradient and hessian of the likelihood of Y|X:
function grad = gradient(w)
grad = zeros(1,d+1) ;
for i=1:n
grad = grad + (Y(i)-sigma(w'*X(i,:)'))*X(i,:) ;
end
end
function hess = hessian(w)
hess = zeros(d+1,d+1) ;
for i=1:n
hess = hess - sigma(w'*X(i,:)')*sigma(-w'*X(i,:)')*X(i,:)'*X(i,:) ;
end
end
with sigma being a Matlab function encoding the sigmoid function z-->1/(1+exp(-z)).
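For reference, these two loops implement the usual log-likelihood derivatives for logistic regression with sigma(z) = 1/(1+exp(-z)):

\[
\nabla_w \ell(w) = \sum_{i=1}^{n} \bigl(y_i - \sigma(w^\top x_i)\bigr)\,x_i,
\qquad
\nabla_w^2 \ell(w) = -\sum_{i=1}^{n} \sigma(w^\top x_i)\,\sigma(-w^\top x_i)\,x_i x_i^\top
\]

using the identity sigma(z)*(1 - sigma(z)) = sigma(z)*sigma(-z), which is how the Hessian code writes the weighting factor.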
Third, I run the Newton algorithm on gradient to find the roots of the gradient of the likelihood. I implemented it myself, and it behaves as expected: the norm of the difference between the iterates goes to 0. I wrote it based on this script.
I verified that the gradient at the wOPT returned by my Newton implementation is essentially zero:
gradient(wOPT)
ans =
1.0e-15 *
0.0139 -0.0021 0.2290
and that the hessian has strictly negative eigenvalues
eig(hessian(wOPT))
ans =
-7.5459
-0.0027
-0.0194
Here's the wOPT I get with my implementation:
wOPT =
-110.8873
28.9114
1.3706
the offset being the last element. In order to plot the decision line, I should convert the slope wOPT(1:2) using M and devs. So I set :
my_offset = wOPT(end);
my_slope = wOPT(1:d)'.*devs + M ;
and I get:
my_slope =
1.0e+03 *
-7.2109 0.8166
my_offset =
1.3706
Now, when I run B=mnrfit(D,Y+1), I get
B =
-1.3496
1.7052
-1.0238
The offset is stored in B(1).
I get very different values. I would like to know what I am doing wrong. I have some doubts about the normalization and 'un-normalization' process, but I'm not sure; maybe I'm doing something else wrong.
Additional Info
When I type:
B=mnrfit(normalized,Y+1)
I get
-1.3706
110.8873
-28.9114
which is a rearranged version of the negative of my wOPT. It contains exactly the same elements.
It seems likely that my scaling back of the learned parameters is wrong. Otherwise, it would have given the same result as B=mnrfit(D,Y+1).
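For what it's worth, with the normalization used above (normalized = (D - repmat(M,n,1))./repmat(devs,n,1)), the algebra for mapping the learned weights back to the original coordinates is

\[
w^\top \frac{x - M}{\mathrm{devs}} + b
= \sum_i \frac{w_i}{\mathrm{devs}_i}\, x_i + \Bigl(b - \sum_i \frac{w_i\, M_i}{\mathrm{devs}_i}\Bigr)
\]

with the division taken component-wise, i.e. the slope in the original space would be w./devs rather than w.*devs + M, and the offset picks up a correction term as well; that is one place where the conversion above can go wrong.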

Matlab : Help in entropy estimation of a discretized time series

This question is a continuation of a previous one: Matlab : Plot of entropy vs digitized code length.
I want to calculate the entropy of a random variable that is a discretized version (0/1) of a continuous random variable x. The random variable denotes the state of a nonlinear dynamical system called the Tent Map. Iterating the Tent Map yields a time series of length N.
The code should exit as soon as the entropy of the discretized time series becomes equal to the entropy of the dynamical system. It is known theoretically that the entropy of the system is log_2(2). The code exits, but the first 3 values of the entropy array are erroneous - entropy(1) = 1, entropy(2) = NaN and entropy(3) = NaN. I am scratching my head as to why this is happening and how I can get rid of it. Please help in correcting the code. Thank you.
clear all
H = log(2)
threshold = 0.5;
x(1) = rand;
lambda(1) = 1;
entropy(1,1) = 1;
j=2;
tol=0.01;
while(~(abs(lambda-H)<tol))
if x(j - 1) < 0.5
x(j) = 2 * x(j - 1);
else
x(j) = 2 * (1 - x(j - 1));
end
s = (x>=threshold);
p_1 = sum(s==1)/length(s);
p_0 = sum(s==0)/length(s);
entropy(:,j) = -p_1*log2(p_1)-(1-p_1)*log2(1-p_1);
lambda = entropy(:,j);
j = j+1;
end
plot( entropy )
It looks like one of your probabilities is zero. In that case, you'd be trying to calculate 0*log(0) = 0*-Inf = NaN. The entropy should be zero in this case, so you can just check for this condition explicitly.
A couple of side notes: it looks like you're declaring H = log(2), but your post says the entropy is log_2(2). p_0 is always 1 - p_1, so you don't have to count everything up again. Growing the arrays dynamically is inefficient because MATLAB has to re-copy the entire contents at each step; you can speed things up by pre-allocating them (only worth it if you're going to be running for many timesteps).
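As an illustration of that explicit check (a small sketch, not a rewrite of the whole loop), the entropy line can be replaced with a NaN-safe version:
% binary entropy that returns 0 instead of NaN when p_1 is 0 or 1
safe_plogp = @(p) -p .* log2(p + (p == 0));   % treats 0*log2(0) as 0
entropy(:,j) = safe_plogp(p_1) + safe_plogp(1 - p_1);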

Error Backpropagation - Neural network

I am trying to write code for error back-propagation for a neural network, but my code is taking a really long time to execute. I know that training a neural network takes a long time, but it is taking a long time for a single iteration as well.
Multi-class classification problem!
Total number of training set = 19978
Number of inputs = 513
Number of hidden units = 345
Number of classes = 10
Below is my entire code:
X=horzcat(ones(19978,1),inputMatrix); %Adding bias
M=floor(0.66*(513+10)); %Taking two-thirds of input+output
Wji=rand(513,M);
aj=X*Wji;
zj=tanh(aj); %Hidden Layer output
Wkj=rand(M,10);
ak=zj*Wkj;
akTranspose = ak';
ykTranspose=softmax(akTranspose); %For multi-class classification
yk=ykTranspose'; %Final output
error=0;
%Initializing target variables
t = zeros(19978,10);
t(1:2000,1)=1;
t(2001:4000,2)=1;
t(4001:6000,3)=1;
t(6001:8000,4)=1;
t(8001:10000,5)=1;
t(10001:12000,6)=1;
t(12001:14000,7)=1;
t(14001:16000,8)=1;
t(16001:18000,9)=1;
t(18001:19778,10)=1;
errorArray=zeros(100000,1); %Storing error values to keep track of error over iterations
errorDiff=zeros(100000,1);
for nIterations=1:5
errorOld=error;
aj=X*Wji; %Forward propagating in each iteration
zj=tanh(aj);
ak=zj*Wkj;
akTranspose = ak';
ykTranspose=softmax(akTranspose);
yk=ykTranspose';
error=0;
%Calculating error
for n=1:19978 %for 19978 training samples
for k=1:10 %for 10 classes
error = error + t(n,k)*log(yk(n,k)); %using cross entropy function
end
end
error=-error;
Ediff = error-errorOld;
errorArray(nIterations,1)=error;
errorDiff(nIterations,1)=Ediff;
%Calculating dervative of error wrt weights wji
derEWji=zeros(513,345);
derEWkj=zeros(345,10);
for i=1:513
for j=1:M;
derErrorTemp=0;
for k=1:10
for n=1:19978
derErrorTemp=derErrorTemp+Wkj(j,k)*(yk(n,k)-t(n,k));
%Calculating derivative of E wrt Wkj
derEWkj(j,k) = derEWkj(j,k)+(yk(n,k)-t(n,k))*zj(n,j);
end
end
for n=1:19978
%Calculating derivative of E wrt Wji
derEWji(i,j) = derEWji(i,j)+(1-(zj(n,j)*zj(n,j)))*derErrorTemp;
end
end
end
eta = 0.0001; %learning rate
Wji = Wji - eta.*derEWji; %updating weights
Wkj = Wkj - eta.*derEWkj;
end
for-loops are very time-consuming in Matlab, even with the help of the JIT. Try to vectorize your code rather than organizing it in 3 or even 4 nested loops. For example,
for n=1:19978 %for 19978 training samples
for k=1:10 %for 10 classes
error = error + t(n,k)*log(yk(n,k)); %using cross entropy function
end
end
can be changed to:
error = sum(sum(t.*log(yk))); % t and yk are both n*k arrays that you construct
You may try to do similar things for the rest of your code: use matrix products or element-wise operations on the arrays for the different cases.
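The gradient loops can be vectorized in the same spirit. The sketch below follows the standard softmax/cross-entropy backpropagation formulas for this architecture; it is not a line-by-line translation of the nested loops in the question (which appear to mix the two derivative terms), so treat it as a reference for the gradients rather than a drop-in replacement:
% vectorized gradients; X is N x 513, zj is N x M, yk and t are N x 10, Wkj is M x 10
delta_k = yk - t;                           % output-layer error, N x 10
derEWkj = zj' * delta_k;                    % dE/dWkj, M x 10
delta_j = (delta_k * Wkj') .* (1 - zj.^2);  % error back-propagated through tanh, N x M
derEWji = X' * delta_j;                     % dE/dWji, 513 x M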

Issues in fitting data to linear model

Assuming a noiseless AR(1) process y(t) = a*y(t-1), I have the following conceptual questions and would be glad for clarification.
Q1 - Discrepancy between mathematical formulation and implementation - The mathematical formulation of an AR model is of the form y(t) = -sum_{i=1}^{p} a_i*y(t-i) + eta(t), where p is the model order and eta(t) is white Gaussian noise. But when estimating the coefficients with any method like arburg() or least squares, we simply call that function; I do not know if white Gaussian noise is implicitly added. Then, when we resolve the AR equation with the estimated coefficients, I have seen that the negative sign is not considered, nor is the noise term added.
What is the correct representation of the AR model, and how do I find the average coefficients over k trials when I have only a single sample of 1000 data points?
Q2 - Coding problem on how to simulate fitted_data for k trials and then find the residuals - I fitted data "data" generated from an unknown system and obtained the coefficient by
load('data.txt');
for trials = 1:10
model = ar(data,1,'ls');
original_data=data;
fitted_data(i)=coeff1*data(i-1); % **OR**
data(i)=coeff1*data(i-1);
fitted_data=data;
residual= original_data - fitted_data;
plot(original_data,'r'); hold on; plot(fitted_data);
end
When calculating the residual, is fitted_data obtained as above by resolving the AR equation with the obtained coefficients? Matlab has a function for doing this, but I wanted to make my own. So, after finding the coefficients from the original data, how do I resolve the model? The code above is incorrect. Attached is the plot of the original data and the fitted_data.
If your model is simply y(n) = a*y(n-1) with scalar a, then here is the solution.
y = randn(10, 1);
a = y(1 : end - 1) \ y(2 : end);
y_estim = a * y(1 : end - 1);   % one-step-ahead prediction of y(2:end)
residual = y(2 : end) - y_estim;
Of course, you should separate the data into train and test sets, and apply a to the test data. You can generalize this approach to y(n) = a*y(n-1) + b*y(n-2), etc.
Note that \ is the mldivide() function: mldivide
Edit:
% model: y[n] = c + a*y(n-1) + b*y(n-2) +...+z*y(n-n_order)
n_order = 3;
allow_offset = true; % allows c in the model
% train
y_train = randn(20,1); % from your data
[y_in, y_out] = shifted_input(y_train, n_order, allow_offset);
a = y_in \ y_out;
% now test
y_test = randn(20,1); % from your data
[y_in, y_out] = shifted_input(y_test, n_order, allow_offset);
y_estim = y_in * a; % same a
residual = y_out - y_estim;
here is shifted_input():
function [y_in, y_out] = shifted_input(y, n_order, allow_offset)
y_out = y(n_order + 1 : end);
n_rows = size(y, 1) - n_order;
y_in = nan(n_rows, n_order);
for k = 1 : n_order
y_in(:, k) = y(1 : n_rows);
y = circshift(y, -1);
end
if allow_offset
y_in = [y_in, ones(n_rows, 1)];
end
return
AR-type models can serve a number of purposes, including linear prediction, linear predictive coding, and filtering noise. The eta(t) are not something we are interested in retaining; rather, part of the point of the algorithms is to remove their influence to any extent possible by looking for persistent patterns in the data.
I have textbooks that, in the context of linear prediction, do not include the negative sign that appears in your expression before the sum. On the other hand, Matlab's function lpc does:
Xp(n) = -A(2)*X(n-1) - A(3)*X(n-2) - ... - A(N+1)*X(n-N)
I recommend you look at the function lpc if you haven't already, and at examples from the documentation such as the following:
randn('state',0);
noise = randn(50000,1); % Normalized white Gaussian noise
x = filter(1,[1 1/2 1/3 1/4],noise);
x = x(45904:50000);
% Compute the predictor coefficients, estimated signal, prediction error, and autocorrelation sequence of the prediction error:
p = lpc(x,3);
est_x = filter([0 -p(2:end)],1,x); % Estimated signal
e = x - est_x; % Prediction error
[acs,lags] = xcorr(e,'coeff'); % ACS of prediction error
The estimated x is computed as est_x. Note how the example uses filter. Quoting the Matlab doc again, filter(b,a,x) is a "Direct Form II Transposed" implementation of the standard difference equation:
a(1)*y(n) = b(1)*x(n) + b(2)*x(n-1) + ... + b(nb+1)*x(n-nb)
- a(2)*y(n-1) - ... - a(na+1)*y(n-na)
which means that in the prior example est_x(n) is computed as
est_x(n) = -p(2)*x(n-1) -p(3)*x(n-2) -p(4)*x(n-3)
which is what you expect!
Edit:
As regards the function ar, the Matlab documentation explains that the output coefficients have the same meaning as in the linear prediction (lpc) scenario discussed above.
The right way to evaluate the output of the AR model is to compute
data_armod(i)= -coeff(2)*data(i-1) -coeff(3)*data(i-2) -coeff(4)*data(i-3)
where coeff is the coefficient matrix returned with
model = ar(data,3,'ls');
coeff = model.a;
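If you want to evaluate the whole series at once instead of writing that loop, the same filter trick from the lpc example above applies (a sketch, assuming data is a column vector):
model = ar(data, 3, 'ls');
coeff = model.a;                                 % [1 a2 a3 a4], same sign convention as lpc
data_armod = filter([0 -coeff(2:end)], 1, data); % one-step-ahead predictions
residual = data - data_armod;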

creating and training a perceptron in MATLAB for gender classification

I am coding a perceptron to learn to categorize gender in pictures of faces. I am very very new to MATLAB, so I need a lot of help. I have a few questions:
I am trying to code for a function:
function [y] = testset(x,w)
%y = sign(sigma(x*w-threshold))
where y is the predicted result, x is the training/testing set put in as a very large matrix, and w is the weight in the equation. The part after the % is what I am trying to write, but I do not know how to express this in MATLAB code. Any ideas out there?
I am trying to code a second function:
function [err] = testerror(x,w,y)
%err = sigma(max(0,-w*x*y))
w, x, and y have the same values as stated above, and err is my function of error, which I am trying to minimize through the steps of the perceptron.
I am trying to create a step in my perceptron to lower the percent of error by using gradient descent on my original equation. Does anyone know how I can increment w using gradient descent in order to minimize the error function using an if then statement?
I can put up the code I have up till now if that would help you answer these questions.
Thank you!
edit--------------------------
OK, so I am still working on the code for this, and would like to put it up when I have something more complete. My biggest question right now is:
I have the following function:
function [y] = testset(x,w)
y = sign(sum(x*w-threshold))
Now I know that I am supposed to put a threshold in, but cannot figure out what I am supposed to put in as the threshold! any ideas out there?
edit----------------------------
this is what I have so far. Changes still need to be made to it, but I would appreciate input, especially regarding structure, and advice for making the changes that need to be made!
function [y] = Perceptron_Aviva(X,w)
y = sign(sum(X*w-1));
end
function [err] = testerror(X,w,y)
err = sum(max(0,-w*X*y));
end
%function [w] = perceptron(X,Y,w_init)
%w = w_init;
%end
%------------------------------
% input samples
X = X_train;
% output class [-1,+1];
Y = y_train;
% init weight vector
w_init = zeros(size(X,1));
w = w_init;
%---------------------------------------------
loopcounter = 0
while abs(err) > 0.1 && loopcounter < 100
for j=1:size(X,1)
approx_y(j) = Perceptron_Aviva(X(j),w(j))
err = testerror(X(j),w(j),approx_y(j))
if err > 0 %wrong (structure is correct, test is wrong)
w(j) = w(j) - 0.1 %wrong
elseif err < 0 %wrong
w(j) = w(j) + 0.1 %wrong
end
% -----------
% if sign(w'*X(:,j)) ~= Y(j) %wrong decision?
% w = w + X(:,j) * Y(j); %then add (or subtract) this point to w
end
You can read this question I asked some time ago.
It uses Matlab code and a function perceptron
function [w] = perceptron(X,Y,w_init)
w = w_init;
for iteration = 1 : 100 %<- in practice, use some stopping criterion!
for ii = 1 : size(X,2) %cycle through training set
if sign(w'*X(:,ii)) ~= Y(ii) %wrong decision?
w = w + X(:,ii) * Y(ii); %then add (or subtract) this point to w
end
end
sum(sign(w'*X)~=Y)/size(X,2) %show misclassification rate
end
and it is called from code (by @Itamar Katz) like this (random data):
% input samples
X1=[rand(1,100);rand(1,100);ones(1,100)]; % class '+1'
X2=[rand(1,100);1+rand(1,100);ones(1,100)]; % class '-1'
X=[X1,X2];
% output class [-1,+1];
Y=[-ones(1,100),ones(1,100)];
% init weight vector
w=[.5 .5 .5]';
% call perceptron
wtag=perceptron(X,Y,w);
% predict
ytag=wtag'*X;
% plot prediction over original data
figure;hold on
plot(X1(1,:),X1(2,:),'b.')
plot(X2(1,:),X2(2,:),'r.')
plot(X(1,ytag<0),X(2,ytag<0),'bo')
plot(X(1,ytag>0),X(2,ytag>0),'ro')
legend('class -1','class +1','pred -1','pred +1')
I guess this can give you an idea of how to make the functions you described.
For the error, compare the expected result with the real result (class).
Assume your dataset is X, the data points, and Y, the labels of the classes.
f=newp(X,Y)
creates a perceptron.
If you want to create an MLP then:
f=newff(X,Y,NN)
where NN is the network architecture, i.e. an array that designates the number of neurons at each hidden layer. For example
NN=[5 3 2]
will correspond to a network with 5 neurons at the first layer, 3 at the second, and 2 at the third hidden layer.
Well, what you call the threshold is the bias in machine-learning nomenclature. This should be left as an input for the user because it is used during training.
Also, I wonder why you are not using the built-in Matlab functions, i.e. newp or newff. e.g.
ff=newp(X,Y)
Then you can set the properties of the object ff to do your job for selecting gradient descent and so on.
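For example, a small sketch of what setting those properties can look like with the (older) newff interface; exact property names can vary between toolbox versions:
net = newff(X, Y, [5]);        % one hidden layer with 5 neurons
net.trainFcn = 'traingd';      % plain gradient descent
net.trainParam.lr = 0.1;       % learning rate
net.trainParam.epochs = 200;   % number of passes over the data
net = train(net, X, Y);        % train
y_pred = sim(net, X);          % evaluate on the data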