I have asked a few questions about neural networks on this website in the past and have gotten great answers, but I am still struggling to implement one for myself. This is quite a long question, but I am hoping that it will serve as a guide for other people creating their own basic neural networks in MATLAB, so it should be worth it.
What I have done so far could be completely wrong. I am following the online Stanford machine learning course taught by Professor Andrew Ng and have tried to implement what he teaches to the best of my ability.
Can you please tell me if the feed forward and cost function parts of my code are correct, and where I am going wrong in the minimization (optimization) part?
I have a 2-layer feedforward neural network.
The MATLAB code for the feedforward part is:
function [ Y ] = feedforward2( X,W1,W2 )
%This takes a row vector of inputs into the neural net with weight matrices W1 and W2 and returns a row vector of the outputs from the neural net
%Remember X, Y, and A can be vectors, and W1 and W2 matrices
X = transpose(X); %X needs to be a column vector
A = sigmf(W1*X,[1 0]); %Values of the first hidden layer
Y = sigmf(W2*A,[1 0]); %Output values of the network
Y = transpose(Y); %Y needs to be a row vector again
end
So for example a two layer neural net with two inputs and two outputs would look a bit like this:
       a1
x1 o--o--o y1 (all weights equal 1)
    \/ \/
    /\ /\
x2 o--o--o y2
       a2
if we put in:
X=[2,3];
W1=ones(2,2);
W2=ones(2,2);
Y = feedforward2(X,W1,W2)
we get the output:
Y = [0.8794,0.8794]
This represents the y1 and y2 values shown in the drawing of the neural net
The MATLAB code for the squared error cost function is:
function [ C ] = cost( W1,W2,Xtrain,Ytrain )
%This gives a value showing how close W1 and W2 are to giving a network that represents the Xtrain and Ytrain data
%It uses the squared error cost function
%The closer the cost is to zero, the better these particular weights are at representing the training data
%If the cost is zero, the weights give a network that outputs the Ytrain data when the Xtrain data is put in
M = size(Xtrain,1); %Number of training examples
H = feedforward2(Xtrain,W1,W2); %Network outputs for all training examples at once
Sum = 0;
for i = 1:M
    Sum = Sum + ( H(i) - Ytrain(i) )^2;
end
C = Sum/(2*M); %Careful: (1/2*M)*Sum would evaluate as (M/2)*Sum
end
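As an aside, the loop can be collapsed entirely with element-wise operations; a minimal sketch of the same cost (a new name, costvec, is used here only to avoid clashing with the function above):
function [ C ] = costvec( W1,W2,Xtrain,Ytrain )
%Vectorized squared error cost: one forward pass, no loop
M = size(Xtrain,1);
H = feedforward2(Xtrain,W1,W2); %M-by-1 vector of network outputs
C = sum((H - Ytrain).^2)/(2*M);
end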
Example
So for example if the training data is:
Xtrain = [0,0;
          1,2;
          4,1;
          5,2;
          3,4;
          5,3;
          1,5;
          6,2;
          2,1;
          5,5];
Ytrain = [0/57;
          3/57;
          5/57;
          7/57;
          7/57;
          8/57;
          6/57;
          8/57;
          3/57;
          10/57];
This will be for a two input, one output network:
       a1
x1 o--o
    \/ \_o y1
    /\ /
x2 o--o
       a2
We start with initial random weights:
W1 = [2,3;
      4,1];
W2 = [3,2];
If we put in:
Y= feedforward2([6,2],W1,W2)
We get
Y = 0.9933
Which is far from what the training data says it should be (8/57 = 0.1404). So the initial random weights W1 and W2 were a bad guess.
To measure exactly how good or bad a guess the random weights are, we use the cost function:
C= cost(W1,W2,Xtrain,Ytrain)
This gives the value:
C = 0.3936
Minimizing the cost function
If we minimize the cost function over all possible weight matrices W1 and W2 and pick the minimizer, we get the network that best approximates the training data.
But when I use the code:
[W1,W2] = fminsearch(cost(W1,W2,Xtrain,Ytrain),[W1,W2])
it gives an error message: "Error using horzcat. CAT arguments dimensions are not consistent." Why am I getting this error and what can I do to fix it?
Thank you!!!
Your neural network seems alright, although the kind of training you're trying to do is quite inefficient when training against labeled data as you are. In that case I would suggest looking into backpropagation.
About your error when training: your error message hints at the problem: dimensions are not consistent.
As the argument x0 to fminsearch, which is the initial guess for the optimizer, you send [W1, W2], but from what I can see these matrices don't have the same number of rows, and therefore you can't concatenate them like that. I would suggest modifying your cost function to take a single vector as its argument and then forming the weight matrices for the different layers from that one vector.
You are also not supplying the cost function correctly to fminsearch, as you are just evaluating cost with W1, W2, Xtrain and Ytrain in place.
According to the documentation (it's been years since I used MATLAB), you pass a function handle to the cost function:
fminsearch(@cost, [W1; W2])
EDIT: You could express your weights and modify your code as follows:
global Xtrain
global Ytrain
W = [W1; W2];
fminsearch(@cost, W)
The cost function must be modified so that it doesn't take Xtrain and Ytrain as inputs, because fminsearch would then try to optimize those too. Modify your cost function like this:
function [ C ] = cost( W )
W1 = W(1:2,:);
W2 = W(3,:);
global Xtrain
global Ytrain
...
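As an alternative sketch that avoids globals, an anonymous function can capture the training data so that fminsearch only sees the weight vector. The reshape indices below assume the 2x2 W1 and 1x2 W2 from the example above, and cost is the original four-argument version:
%Pack the weights into one vector and unpack them inside the cost call
W0 = [W1(:); W2(:)]; %6x1 initial guess
costW = @(W) cost(reshape(W(1:4),2,2), reshape(W(5:6),1,2), Xtrain, Ytrain);
Wopt = fminsearch(costW, W0);
W1opt = reshape(Wopt(1:4),2,2);
W2opt = reshape(Wopt(5:6),1,2);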
I am following this tutorial for making a neural network:
https://www.kaggle.com/antmarakis/another-neural-network-from-scratch
I do not understand the train part of this code where 1 is appended to the input feature vector.
def Train(X, Y, lr, weights):
    layers = len(weights)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        x = np.matrix(np.append(1, x)) # Augment feature vector
        activations = ForwardPropagation(x, weights, layers)
        weights = BackPropagation(y, activations, weights, layers)
    return weights
Any help in understanding this would be appreciated.
Forward propagation involves multiplying by the weights and adding a bias term. The equation is
y = X*W + b. This can be written in a more vectorised form as y = [X, 1] * [W; b], where * stands for matrix multiplication: appending a 1 to the input row lets the bias ride along as an extra row of the weight matrix.
In the code, the weights and biases seem to have been combined into a single weight matrix W, and x is turned into an augmented vector by appending a one to it.
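A quick numeric check of that identity, sketched here in MATLAB (like the rest of this page) with made-up values:
X = [2 3]; %1x2 input row
W = [0.1; 0.4]; %2x1 weights
b = 0.5; %bias
y1 = X*W + b %explicit bias: 1.9
y2 = [X, 1] * [W; b] %augmented form gives the same 1.9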
Data x is input to an autoregressive (AR) model. The output of the AR model is corrupted with additive white Gaussian noise at SNR = 30 dB. The observations are denoted by noisy_y.
Let there be close estimates h_hat of the AR model coefficients (these are obtained from least squares estimation). I want to see how close the input obtained by deconvolving the measurements with h_hat is to the known x.
My confusion is which variable to use for deconvolution -- the clean y or the noisy y?
Upon deconvolution, I should get x_hat. I am not sure whether the correct way to perform deconvolution is to use noisy_y or the y before adding noise. I have used the following code.
Can somebody please help with the correct method to plot x and x_hat?
Below is the plot of x vs x_hat. As can be seen, they do not match. Where is my understanding wrong? Please help.
The code is:
clear all
N = 200; %number of data points
a1=0.1650;
b1=-0.850;
h = [1 a1 b1]; %true coefficients
x = rand(1,N);
%%AR model
y = filter(1,h,x); %transmitted signal through AR channel
noisy_y = awgn(y,30,'measured');
hat_h= [1 0.133 0.653];
x_hat = filter(hat_h,1,noisy_y); %deconvolution
plot(1:50,x(1:50),'b');
hold on;
plot(1:50,x_hat(1:50),'-.rd');
A first issue is that the coefficients h of your AR model correspond to an unstable system since one of its poles is located outside the unit circle:
>> abs(roots(h))
ans =
1.00814
0.84314
Parameter estimation techniques are then quite likely to fail to converge given a diverging input sequence. Indeed, looking at the stated hat_h = [1 0.133 0.653] it is pretty clear that the parameter estimation did not converge anywhere near the actual coefficients. In your specific case you did not provide the code illustrating how you obtained hat_h (other than specifying that it was "obtained from Least Squares estimation"), so it isn't possible to further comment on what went wrong with your estimation.
That said, the standard formulation of Least Mean Squares (LMS) filters is given for an MA model. A common method for AR parameter estimation is to solve the Yule-Walker equations:
hat_h = aryule(noisy_y - mean(noisy_y), length(h)-1);
If we were to use this estimation method with the stable system defined by:
h = [1 -a1 -b1];
x = rand(1,N);
%%AR model
y = filter(1,h,x); %transmitted signal through AR channel
noisy_y = awgn(y,30,'measured');
hat_h = aryule(noisy_y - mean(noisy_y), length(h)-1);
x_hat = filter(hat_h,1,noisy_y); %deconvolution
The plot of x and x_hat would then show the two closely matching (plot image omitted).
I can't get my mind around the concept of how to calculate bias and variance from a random set.
I have created the code to generate a random normal set of numbers.
% Generate random w, x, and noise from standard Gaussian
w = randn(10,1);
x = randn(600,10);
noise = randn(600,1);
and then extract the y values
y = x*w + noise;
After that I split my data into a training (100) and test (500) set
% Split data set into a training (100) and a test set (500)
x_train = x([ 1:100],:);
x_test = x([101:600],:);
y_train = y([ 1:100],:);
y_test = y([101:600],:);
train_l = length(y_train);
test_l = length(y_test);
Then I calculated the w for a specific value of lambda (1.2)
lambda = 1.2;
% Calculate the optimal w
A = x_train'*x_train+lambda*train_l*eye(10,10);
B = x_train'*y_train;
w_train = A\B;
Finally, I compute the mean squared error:
% Compute the mean squared error on both the training and the
% test set
sum_train = sum((x_train*w_train - y_train).^2);
MSE_train = sum_train/train_l;
sum_test = sum((x_test*w_train - y_test).^2);
MSE_test = sum_test/test_l;
I know that if I create a vector of lambdas (I have already done that) and repeat this over some iterations, I can plot the average MSE_train and MSE_test as a function of lambda, and then verify that a large difference between MSE_test and MSE_train indicates high variance, and thus overfitting.
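For reference, a minimal sketch of that sweep, reusing the variables defined above (the lambda grid is an arbitrary choice):
lambdas = logspace(-3,1,10); %arbitrary grid of lambda values
MSE_train_v = zeros(size(lambdas));
MSE_test_v = zeros(size(lambdas));
for j = 1:numel(lambdas)
    A = x_train'*x_train + lambdas(j)*train_l*eye(10);
    w_j = A\(x_train'*y_train);
    MSE_train_v(j) = mean((x_train*w_j - y_train).^2);
    MSE_test_v(j) = mean((x_test*w_j - y_test).^2);
end
semilogx(lambdas,MSE_train_v,lambdas,MSE_test_v); legend('train','test');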
But, what I want to do extra, is to calculate the variance and the bias^2.
Page 7 of the Ridge Regression Notes guides us in how to calculate the bias and the variance.
My question is: should I follow its steps on the whole random dataset (600 points) or only on the training set? I think the bias^2 and the variance should be calculated on the training set. Also, in Theorem 2 (page 7 again) the bias is calculated as the negative product of lambda, W, and beta; is that beta my original w (w = randn(10,1))?
Sorry for the long post, but I really want to understand how the concept works in practice.
UPDATE 1:
OK, so following the previous paper didn't generate any good results. So I took the standard bias-variance form for ridge regression, Err = Bias^2 + Variance + noise, where Bias = E[y_hat] - f(x) and Variance = E[(y_hat - E[y_hat])^2] (formula image omitted).
Based on that, I created the following (using the test set):
% Bias and Variance
sum_bias=sum((y_test - mean(x_test*w_train)).^2);
Bias = sum_bias/test_l;
sum_var=sum((mean(x_test*w_train)- x_test*w_train).^2);
Variance = sum_var/test_l;
But after 200 iterations and for 10 different lambdas, the result is not what I expected (results plot omitted).
In fact, I was hoping for something like the textbook bias-variance tradeoff curves (expected plot omitted).
sum_bias = sum((y_test - mean(x_test*w_train)).^2); Bias = sum_bias/test_l
Why have you squared the difference between y_test and y_predicted = x_test*w_train?
I don't believe your formula for the bias is correct. The 'bias term' shown in blue in your question is the bias^2; however, your formula is neither the bias nor the bias^2, since you have only squared the residuals, not the entire bias.
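For what it's worth, a minimal sketch of one common way to estimate bias^2 and variance empirically is to redraw the training set many times, refit, and compare the average prediction with the noiseless target. This assumes the true w used to generate the data is known, as in the question's setup; R and the variable names are illustrative:
R = 200; %number of repeated training draws
preds = zeros(test_l,R);
for r = 1:R
    x_tr = randn(train_l,10); %fresh training set each run
    y_tr = x_tr*w + randn(train_l,1);
    A = x_tr'*x_tr + lambda*train_l*eye(10);
    w_hat = A\(x_tr'*y_tr);
    preds(:,r) = x_test*w_hat; %predictions on a fixed test set
end
f_true = x_test*w; %noiseless target values
avg_pred = mean(preds,2); %average prediction per test point
Bias2 = mean((avg_pred - f_true).^2);
Variance = mean(var(preds,0,2));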
I have an unknown non-linear system and I want to model it using another system with some adaptable parameters (for instance, a neural network). So, I want to fix an online learning structure of the unknown system without knowing its dynamics; I can only interact with it through inputs and outputs. My problem is that I cannot make it work in MATLAB using the ODE solvers. Let's say that we have this real system (my actual system is more complicated, but I will give a simple example in order to be understood):
function dx = realsystem(t, x)
u = 2;
dx = -3*x+6*u;
end
and we solve the equations like this:
[t,x_real] = ode15s(@(t,x) realsystem(t,x), [0 1], 0)
We suppose that this is an unknown system and that we do not know the coefficients 3 and 6, so we take an adaptive system with the two adaptive laws:
dx(t) = -p1(t)*x(t) + p2(t)*u(t)
dp1(t) = -e(t)*x(t)
dp2(t) = e(t)*u(t)
with e(t) the error e(t) = x(t) - x_real(t).
The thing is that I cannot find a way to feed the real values for each t to the ODE solver in order to have online learning.
I tried something like this, but it didn't work:
function dx = adaptivesystem(t, x, x_real)
dx = zeros(3,1);
e = x_real - x;
u = 2;
dx(1) = -x(2)*x(1)+x(3)*u;
dx(2) = -e*x(1); %dx(2) = dp1(t)
dx(3) = e*u; %dx(3) = dp2(t)
end
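For reference, one way to make this attempt integrable is to augment the state so that the real system and the adaptive system are simulated together in a single ODE; the following is only a sketch of that idea (the name coupled is made up for illustration), not necessarily the intended design:
function dz = coupled(t, z)
%z = [x_real; x; p1; p2] -- real system and adaptive model in one state
u = 2;
e = z(2) - z(1); %e(t) = x(t) - x_real(t)
dz = zeros(4,1);
dz(1) = -3*z(1) + 6*u; %real (black-box) system
dz(2) = -z(3)*z(2) + z(4)*u; %adaptive model
dz(3) = -e*z(2); %dp1(t)
dz(4) = e*u; %dp2(t)
end
It could then be called with, for example, [t,z] = ode15s(@coupled, [0 1], zeros(4,1)).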
You should be aware that your problem is ill-posed as it stands. Given any trajectory x(t) obtained via sampling and smoothing/interpolating, you can choose p1(t) at will and set
p2(t) = ( x'(t) + p1(t)*x(t) ) / u.
So you have to formulate restrictions. One obvious restriction is that the functions p1 and p2 should be valid for all trajectories of the black-box system. Do you have different trajectories available?
Another variant is to demand that p1 and p2 are constants. Actually, in this case and if you have equally spaced samples available, it would be easier to first find a good difference equation for the data. With the samples x[n] for time t[n]=t0+n*dt form a matrix X with rows
[ -u, x[n], x[n+1], ... ,x[n+k] ] for n=0, ... , N-k
and apply a QR decomposition or SVD to X to determine the right-hand kernel vectors. QR may fail to show a usable rank deficiency, so take the SVD of the top square part of R instead: with R = U*S*V^T, S diagonal and ordered as usual, U and V square and orthogonal, use the last column of V, with coefficients
[ b, a[0], ..., a[k] ],
corresponding to the smallest singular value, to form the difference equation
a[0]*x[n]+a[1]*x[n-1]+...+a[k]*x[n-k]=b*u.
If the effective rank of R resp. S is not (k-1), then reduce k to be the effective rank plus one and start again.
If in the end k=1 is found, then you can make a differential equation out of it. Reformulate the difference equation as
a[0]*(x[n]-x[n-1])/dt = -(a[0]+a[1])/dt * x[n-1] + b/dt * u
and read off the differential equation
x'(t) = -(a[0]+a[1])/(a[0]*dt) * x(t) + b/(a[0]*dt) * u
One may reject this equation if the coefficients become uncomfortably large.
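A minimal sketch of this fit, assuming equally spaced samples xs of the black-box output under a constant input u (all names here are illustrative):
k = 2; %trial order
rows = numel(xs) - k; %one row per window
X = zeros(rows, k+2);
for n = 1:rows
    X(n,:) = [-u, xs(n:n+k)]; %row [ -u, x[n], ..., x[n+k] ]
end
[~,~,V] = svd(X,0);
v = V(:,end); %kernel vector for the smallest singular value
b = v(1); a = v(2:end); %difference equation coefficients
%For k = 1 this gives the ODE coefficients described above:
%x'(t) = -(a(1)+a(2))/(a(1)*dt) * x(t) + b/(a(1)*dt) * u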
I have a set of data with independent variables x and y. Now I'm trying to build a two-dimensional regression model that has a regression surface cutting through my data points. However, I couldn't find a way to achieve this. Can anyone give me some assistance?
You could use my favorite, polyfitn, for linear or polynomial models. If you would like a different model, please edit your question or add a comment. HTH!
EDIT
Also, take a look here under Multiple Regression; it can likely help you as well.
EDIT AGAIN
Sorry, I'm having too much fun with this; here's an example of multivariate regression using least squares with stock MATLAB:
t = (1:10)';
x = t;
y = exp(-t);
A = [ y x ];
z = 10*y + 0.5*x;
A\z
ans =
10.0000
0.5000
If you are performing linear regression, the best tool is the regress function. Note that if you are fitting a model of the form y(x1,x2) = b1*f(x1) + b2*g(x2) + b3, this is still a linear regression, as long as you know the functions f and g.
Nsamp = 100; %number of samples
X1 = randn(Nsamp,1); %regressor 1 (could also be some computed f(x1) )
X2 = randn(Nsamp,1); %regressor 2 (could also be some computed g(x2) )
Y = X1 + X2 + randn(Nsamp,1); %generate some data to be regressed
%now run the regression
[b,bint,r,rint,stats] = regress(Y,[X1 X2 ones(Nsamp,1)]);
% 'b' contains the coefficients b1, b2, b3 of the fit (can be used to plot the regression surface)
% 'r' contains residuals of the fit
% 'stats' contains the overall regression R^2, F stat, p-value and error variance