Bias derivative in backpropagation - neural-network

I am building a 4-5-2 neural network with m data points, and each data point has 4 independent variables. So for forward propagation my network shapes are:
L1: mx4 * 4x5 + 1x5 = Z1 -> A1 = ReLU(Z1)
L2: mx5 * 5x2 + 1x2 = Z2 -> A2 = softmax(Z2)
From this I get A1 = mx5 and A2 = mx2. When computing the weight gradients I get matrices of size 4x5 and 5x2 for dweight1 and dweight2, which match the weight shapes, so I can perform the matrix subtraction for gradient descent.
However, for the biases I am stuck: dL/db2 = dL/dZ2 * dZ2/db2 = (A2 - Y) * 1, which gives me a 2xm matrix instead of the 1x2 bias shape, and I'm stuck. Are my matrix shapes wrong somewhere?
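For reference, a minimal sketch of how the bias gradients keep the 1x5 and 1x2 shapes: the per-sample rows of dL/dZ are summed (or averaged) over the m samples. The names below (dZ1, dZ2, b1, b2, lr) are illustrative and follow the shapes described above; W2 is the 5x2 weight matrix and Z1 the hidden pre-activation.
dZ2 = A2 - Y;                    % m x 2
db2 = sum(dZ2, 1) / m;           % 1 x 2, matches the bias shape
dZ1 = (dZ2 * W2.') .* (Z1 > 0);  % m x 5, ReLU derivative
db1 = sum(dZ1, 1) / m;           % 1 x 5
b2 = b2 - lr * db2;
b1 = b1 - lr * db1;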

Related

MATLAB Backpropagation Algorithm not functioning as expected

I am attempting to write a Multi-Layer Perceptron Network inside MATLAB to help me better understand the calculus required for backpropagation.
The aim is to provide the network with XOR data (where upper-right and lower-left quadrant data is class 1 and the remaining quadrants class 0), train the network on this data, and then test it on new data.
My problem is that my loss curve looks very strange:
It appears to bounce between very low and very high error, and converges in the middle to a fairly poor error.
I was wondering if someone could check that I have correctly implemented the chain rule in MATLAB syntax.
The MLP network is structured as follows: Input-layer has 2 neurons, 1 hidden-layer with 2 neurons, and 1 output neuron.
Here is the MATLAB code:
%Create XOR Dataset
x1pos = rand(500,1);
x1neg = -rand(500,1);
x1 = [x1pos; x1neg];
p = randperm(length(x1));
x1 = x1(p);
x2pos = rand(500,1);
x2neg = -rand(500,1);
x2 = [x2pos; x2neg];
p = randperm(length(x2));
x2 = x2(p);
Data = [x1 x2];
TrainingData = Data(1:800,:);
TestData = Data(801:length(Data),:);
T = gt((Data(:,1).*Data(:,2)),0); %Create class label for data and assign to matrix T
%Neural Net
%Training
W1 = rand(2,2); %Initialize random weights
W2 = rand(1,2); %Initialize random weights
B1 = rand(2,1); %Initialize random biases
B2 = rand(1,1); %Initialize random biases
n = 0.05; %Set Learning Rate
for i = 1:800
%Fwd Pass
x1 = Data(i,1);
x2 = Data(i,2);
X = [x1; x2];
A1 = W1*X + B1;
H1 = sigmoid(A1);
A2 = W2*H1 + B2;
Y = sigmoid(A2);
%Loss
Loss = (Y-T(i))*(Y-T(i));
scatter(i, Loss)
hold on;
%Backpropagation
dEdY = 2*(Y-T(i)); %The partial derivative of the loss with respect to the output
dYdA2 = Y*(1-Y); %The partial derivative of the output with respect to the hidden layer output
dA2dH1 = W2.'; %The partial derivative of the hidden layer output with respect to the first layer activations
dH1dA1 = H1.*(1-H1); %The partial derivative of the first layer activations with respect to the first layer output
%Chain Rule
dEdW2 = dEdY.*dYdA2.*W2.';
dEdW1 = dEdY.*dYdA2.*dA2dH1.*dH1dA1.*W1.';
dEdB2 = dEdY.*dYdA2;
dEdB1 = dEdY.*dYdA2.*dA2dH1.*dH1dA1;
%Update Weights
W2 = (W2.' - n.*dEdW2).';
W1 = (W1.' - n.*dEdW1).';
%Update Biases
B2 = B2 - n.*dEdB2;
B1 = B1 - n.*dEdB1;
%Next training loop
end
%Testing
for i = 801:1000
x1 = Data(i,1);
x2 = Data(i,2);
X = [x1; x2];
A1 = W1*X + B1;
H1 = sigmoid(A1);
A2 = W2*H1 + B2;
Y = sigmoid(A2);
end
function o = sigmoid(input)
o = 1 ./ (1 + exp(-input)); % element-wise logistic function
end
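For comparison, here is a minimal sketch of the usual chain-rule gradients for this 2-2-1 network: each weight gradient is the layer's error signal times the transpose of that layer's input (H1 for W2, X for W1). Variable names follow the code above; treat this as a reference sketch, not as the original code.
delta2 = 2*(Y - T(i)) * Y*(1 - Y);           % scalar error signal at the output
dEdW2 = delta2 * H1.';                        % 1x2 gradient for W2
dEdB2 = delta2;
delta1 = (W2.' * delta2) .* H1 .* (1 - H1);   % 2x1 error signal at the hidden layer
dEdW1 = delta1 * X.';                         % 2x2 gradient for W1
dEdB1 = delta1;
W2 = W2 - n*dEdW2;  B2 = B2 - n*dEdB2;
W1 = W1 - n*dEdW1;  B1 = B1 - n*dEdB1;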

Why do I get a big MSE when I try to verify the convolution theorem in MATLAB?

I want to verify the convolution theorem in matlab.
Firstly, I do a 2D discrete convolution of a 2D Gaussian with
an image graymap(x, y).
Secondly, I compute the Fourier transform of the same 2D Gaussian and of the original image, perform an element-wise multiplication of these two Fourier transforms, and then take an inverse Fourier transform of the result.
Finally, I calculate the MSE between the two results. However, the error I get is 800+.
This is my code:
[row, col] = size(graymap);
[row_2, col_2] = size(z);
result = zeros(row, col);
for i = 1: col
for j = 1:row
accumulation_value = 0;
weighted_sum = 0; % running sum of kernel weights (not used later)
for k = -4:4
for h = -4:4
if ((i+k > 0 && i+k < col + 1) && (j+h > 0 && j+h < row + 1))
value_image = double(graymap(i+k, j+h));
else
value_image = 0;
end
accumulation_value = accumulation_value + value_image * double(z(5 + k, 5 + h));
weighted_sum = weighted_sum + z(5 + k, 5 + h);
end
end
result(i,j) = (accumulation_value);
end
end
result_blur_1 = uint8(255*mat2gray(result));
M = size(graymap,1);
N = size(graymap,2);
resIFFT = ifft2(fft2(double(graymap), M, N) .* fft2(double(z), M, N));
result_blur_2 = uint8(255*mat2gray(resIFFT));
err = immse(result_blur_1, result_blur_2);
z is the 9x9 Gaussian kernel. I don't flip it because it is symmetric.
I think my implementation of the convolution is correct because the result is the same as conv2(graymap, z, 'same').
Therefore, I believe there is something wrong with the second part. In fact, I am confused about how the padding works; maybe that is the cause of the big MSE.
There are indeed problems with your implementation of the second part. The most important rule to remember when implementing convolution via FFT is that you are actually calculating a circular convolution, not a linear convolution. Fortunately, there is a condition under which the two become equivalent: the two arrays should be zero-padded to a size equal to the sum of the sizes of each minus 1 (in all dimensions). So if you are working with an image X of size MxN and a mask Z of size PxQ, you should zero-pad both arrays so they have at least dimensions (M+P-1)x(N+Q-1). Any additional zeros won't hurt, so it's convenient to use an 'FFT-friendly' size if possible (using nextpow2, for example); you then just keep the first (M+P-1)x(N+Q-1) values.
Now, that would be straightforward if you just wanted the full result of the convolution. But because you want the central part of the convolution (the option 'same'), you need to select the correct indices. The first index is ceil(([P Q] - 1)/2) + 1, and then you take as many consecutive indices as the image size.
Here is an example putting all together:
M = randperm(1024,1);
N = randperm(1024,1);
X = rand(M,N);
P = randperm(64,1);
Q = randperm(64,1);
Z = rand(P,Q);
% 'standard' convolution with option 'same'
C1 = conv2(X,Z,'same');
R = 2^nextpow2(M+P-1);
S = 2^nextpow2(N+Q-1);
% convolution with fft. Notice the zero-padding to R,S
C2 = real(ifft2(fft2(X,R,S) .* fft2(Z,R,S)));
n = ceil(([P Q] - 1)/2);
ind{1} = n(1) + (1:M);
ind{2} = n(2) + (1:N);
C2 = C2(ind{:});
err = immse(C1,C2)
I get errors of the order of 1e-26

Getting NaN values in neural network weight matrices

I am trying to develop a feedforward NN in MATLAB. I have a dataset with 12 inputs and 1 output, with 46998 samples. There are some NaN values in the last rows of the matrix, because some inputs are accelerations and velocities, which have 1 and 2 fewer time steps respectively than the displacements.
With this current data set I am getting w1_grad and w2_grad as NaN matrices. I tried to remove the NaNs using `Heave_dataset(isnan(Heave_dataset))=[];`, but that collapses my dataset into a single 1x610964 vector.
Can anyone help me with this?
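One way to drop the NaN samples without flattening the matrix is to remove whole rows rather than individual elements; a minimal sketch, assuming each row of Heave_dataset is one sample and the last column is the target:
Heave_dataset = Heave_dataset(~any(isnan(Heave_dataset), 2), :); % keep only complete rows, 13-column layout preserved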
%% Clear Variables, Close Current Figures, and Create Results Directory
clc;
clear all;
close all;
mkdir('Results//'); %Directory for Storing Results
%% Configurations/Parameters
load 'Heave_dataset'
% Heave_dataset(isnan(Heave_dataset))=[];
nbrOfNeuronsInEachHiddenLayer = 24;
nbrOfOutUnits = 1;
unipolarBipolarSelector = -1; %0 for Unipolar, -1 for Bipolar
learningRate = 0.08;
nbrOfEpochs_max = 50000;
%% Read Data
Input = Heave_dataset(:, 1:length(Heave_dataset(1,:))-1);
TargetClasses = Heave_dataset(:, length(Heave_dataset(1,:)));
%% Calculate Number of Input and Output Nodes
nbrOfInputNodes = length(Input(1,:)); %=Dimension of Any Input Sample
nbrOfLayers = 2 + length(nbrOfNeuronsInEachHiddenLayer);
nbrOfNodesPerLayer = [nbrOfInputNodes nbrOfNeuronsInEachHiddenLayer nbrOfOutUnits];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Forward Pass %%%%%%%%%%%
%% Adding the Bias to Input layer
Input = [ones(length(Input(:,1)),1) Input];
%% Weights leading from input layer to hidden layer is w1
w1 = rand(nbrOfNeuronsInEachHiddenLayer,(nbrOfInputNodes+1));
%% Input & output of hidden layer
hiddenlayer_input = Input*w1';
hiddenlayer_output = -1 + 2./(1 + exp(-(hiddenlayer_input)));
%% Adding the Bias to hidden layer
hiddenlayer_output = [ones(length(hiddenlayer_output(:,1)),1) hiddenlayer_output];
%% Weights leading from hidden layer to output layer is w2
w2 = rand(nbrOfOutUnits,(nbrOfNeuronsInEachHiddenLayer+1));
%% Input & output of output layer
outerlayer_input = hiddenlayer_output*w2';
outerlayer_output = outerlayer_input;
%% Error Calculation
TotalError = 0.5*(TargetClasses-outerlayer_output).^2;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Backward Pass %%%%%%%%%%%
d3 = outerlayer_output - TargetClasses;
d2 = (d3*w2).*hiddenlayer_output.*(1-hiddenlayer_output);
d2 = d2(:,2:end);
D1 = d2' * Input;
D2 = d3' * hiddenlayer_output;
w1_grad = D1/46998 + learningRate*[zeros(size(w1,1),1) w1(:,2:end)]/46998;
w2_grad = D2/46998 + learningRate*[zeros(size(w2,1),1) w2(:,2:end)]/46998;
You should try to vectorize your algorithm. First arrange your data in a 46998x12 matrix X. Add a bias column to X like X = [ones(46998,1) X]. Then the weights leading from the input layer to the first hidden layer must be arranged in a matrix W1 with dimensions numberOfNeuronsInFirstHiddenLayer (24) x (inputs + 1). Then X*W1' is what you feed into your neuron function (whether it is a sigmoid or anything else). The result, sigmoid(X*W1'), is the output of the neurons at hidden layer 1. You add a bias again and multiply by the weight matrix W2 (the weights that lead from hidden layer 1 to hidden layer 2), and so on. Hope this helps to get you started vectorizing your code, at least for the feedforward part. The back-propagation part is a little trickier but luckily involves the same matrices.
I will briefly restate the feedforward process so that we use the same notation when talking about backpropagation.
There is the data called X (dimensions 46998x12).
A1 = [ones(46998,1) X] is the input including the bias. (46998x13)
Z2 = A1*W1' (W1 is the weight matrix that leads from input to hidden layer 1)
A2 = sigmoid(Z2);
A2 = [ones(m,1) A2]; adding bias again
Z3 = A2 * W2';
A3 = sigmoid(Z3);
Supposing you only have one hidden layer feedforward stops here. I'll start backwards now and you can generalize as appropriate.
d3 = A3 - Y; (Y is part of your data: the actual target values with which you train your NN)
d2 = (d3 * W2) .* A2 .* (1-A2); (The sigmoid function has the nice property that d(sigmoid(z))/dz = sigmoid(z)*(1-sigmoid(z)).)
d2 = d2(:,2:end); (You don't need the first column, which corresponds to the bias.)
D1 = d2' * A1;
D2 = d3' * A2;
W1_grad = D1/m + lambda*[zeros(size(W1,1),1) W1(:,2:end)]/m; (lambda plays the role of a regularization parameter here, and m is 46998)
W2_grad = D2/m + lambda*[zeros(size(W2,1),1) W2(:,2:end)]/m;
Everything should be in place now except for the vectorized cost function, which has to be minimized. Hope this helps a bit...
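Putting the steps above together, a minimal vectorized sketch in MATLAB might look like the following, assuming X (m x 12 inputs) and Y (m x 1 targets) already exist. The 24-unit hidden layer and the lambda value are illustrative, and the activation follows the plain sigmoid used in the steps above, so adapt it if you keep the bipolar activation and linear output of the original code.
m = size(X, 1);                          % number of samples
sigmoid = @(z) 1 ./ (1 + exp(-z));
lambda = 0.08;                           % regularization strength (value is illustrative)
W1 = rand(24, size(X, 2) + 1);           % hidden weights, bias column included
W2 = rand(1, 24 + 1);                    % output weights, bias column included
% Forward pass
A1 = [ones(m, 1) X];
Z2 = A1 * W1';
A2 = [ones(m, 1) sigmoid(Z2)];
Z3 = A2 * W2';
A3 = sigmoid(Z3);
% Backward pass
d3 = A3 - Y;
d2 = (d3 * W2) .* A2 .* (1 - A2);
d2 = d2(:, 2:end);                       % drop the bias column
W1_grad = (d2' * A1)/m + lambda*[zeros(size(W1,1),1) W1(:,2:end)]/m;
W2_grad = (d3' * A2)/m + lambda*[zeros(size(W2,1),1) W2(:,2:end)]/m;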

How to implement a soft-margin SVM model using Matlab's quadprog?

Suppose we are given a training dataset {yᵢ, xᵢ}, for i = 1, ..., n, where yᵢ can either be -1 or 1 and xᵢ can be e.g. a 2D or 3D point.
In general, when the input points are linearly separable, the SVM model can be defined as follows
min 1/2*||w||²
w,b
subject to the constraints (for i = 1, ..., n)
yᵢ*(w*xᵢ - b) >= 1
This is often called the hard-margin SVM model, which is thus a constrained minimization problem, where the unknowns are w and b. We can also omit 1/2 in the function to be minimized, given it's just a constant.
Now, the documentation about Matlab's quadprog states
x = quadprog(H, f, A, b) minimizes 1/2*x'*H*x + f'*x subject to the restrictions A*x ≤ b. A is a matrix of doubles, and b is a vector of doubles.
We can implement the hard-margin SVM model using quadprog function, to get the weight vector w, as follows
H becomes an identity matrix.
f becomes a vector of zeros.
A is the left-hand side of the constraints
b is equal to -1 because the original constraint had >= 1; multiplying both sides by -1 turns it into <= -1 (see the sketch below).
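As a concrete illustration of that hard-margin mapping, here is a minimal sketch. It assumes X holds the points row-wise and y the +/-1 labels, and takes the variable vector as z = [w; b], so the diagonal entry of H for b is zero rather than a full identity.
d = size(X, 2);                 % dimension of the points
H = diag([ones(1, d), 0]);      % only w enters the quadratic term
f = zeros(d + 1, 1);            % no linear term
Aineq = -[diag(y) * X, -y];     % encodes -y_i*(w*x_i - b) <= -1
bineq = -ones(size(X, 1), 1);
z = quadprog(H, f, Aineq, bineq);
w = z(1:d);
b = z(end);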
Now, I am trying to implement a soft-margin SVM model. The minimization equation here is
min (1/2)*||w||² + C*(∑ ζᵢ)
w,b
subject to the constraints (for i = 1, ..., n)
yᵢ*(w*xᵢ - b) >= 1 - ζᵢ
such that ζᵢ >= 0, where ∑ is the summation symbol, ζᵢ = max(0, 1 - yᵢ*(w*xᵢ - b)) and C is a hyper-parameter.
How can this optimization problem be solved using the Matlab's quadprog function? It's not clear to me how the equation should be mapped to the parameters of the quadprog function.
The "primal" form of the soft-margin SVM model (i.e. the definition above) can be converted to a "dual" form. I did that, and I am able to get the Lagrange variable values (in the dual form). However, I would like to know if I can use quadprog to solve directly the primal form without needing to convert it to the dual form.
I don't see how it can be a problem. Let z be our vector of (2n + 1) variables:
z = (w, eps, b)
Then H becomes a diagonal matrix with the first n values on the diagonal equal to 1 and the last n + 1 set to zero:
H = diag([ones(1, n), zeros(1, n + 1)])
Vector f can be expressed as:
f = [zeros(1, n), C * ones(1, n), 0]'
The first set of constraints becomes:
Aineq = [A1, eye(n), zeros(n, 1)]
bineq = ones(n, 1)
where A1 is the same matrix as in the primal form.
The second set of constraints becomes lower bounds (only eps is bounded below; w and b are free):
lb = [-inf(n, 1); zeros(n, 1); -inf]
Then you can call MATLAB:
z = quadprog(H, f, Aineq, bineq, [], [], lb);
P.S. I may be mistaken in some small details, but the general idea is right.
I wanted to clarify vharavy's answer because you could get lost while trying to deduce what 'n' means in his code. Here is my version, based on his answer and the SVM Wikipedia article. I assume we have a file named "test.dat" which holds the coordinates of the test points and their class membership in the last column.
Example content of "test.dat" with 3D points:
-3,-3,-2,-1
-1,3,2,1
5,4,1,1
1,1,1,1
-2,5,4,1
6,0,1,1
-5,-5,-3,-1
0,-6,1,-1
-7,-2,-2,-1
Here is the code:
data = readtable("test.dat");
tableSize = size(data);
numOfPoints = tableSize(1);
dimension = tableSize(2) - 1;
PointsCoords = data(:, 1:dimension);
PointsSide = data.(dimension+1);
C = 0.5; %can be changed
n = dimension;
m = numOfPoints; %can also be interpreted as the number of constraints
%z = [w, eps, b]; number of variables in 'z' is equal to n + m + 1
H = diag([ones(1, n), zeros(1, m + 1)]);
f = [zeros(1, n), C * ones(1, m), 0];
Aineq = [-diag(PointsSide)*table2array(PointsCoords), -eye(m), PointsSide];
bineq = -ones(m, 1);
lb = [-inf(1, n), zeros(1, m), -inf];
z = quadprog(H, f, Aineq, bineq, [], [], lb);
Let z = (w; w0; eps) be the long column vector with n+1+m elements (m is the number of points).
Then,
H = diag([ones(1,n), zeros(1,m+1)])
f = [zeros(1, n+1), C*ones(1, m)]
The inequality constraints can be specified as:
A = -diag(y)*[X', ones(m,1), zeros(m,m)] - [zeros(m,n+1), eye(m)]
where X is the n x m input matrix from the primal form. Of the two parts of A, the first covers the w and w0 columns and the second puts -eye(m) in the eps columns.
b = -ones(m,1)
The equality constraints :
Aeq = zeros(1, n+1+m)
beq = 0
Bounds:
lb = [-inf*ones(n+1,1); zeros(m,1)]
ub = [inf*ones(n+1+m,1)]
Now, z=quadprog(H,f,A,b,Aeq,beq,lb,ub)
Complete code. The idea is the same as above; note that here n is the number of points and m is the input dimension (the opposite of the notation above).
n = size(X,1); % number of data points (rows of X)
m = size(X,2); % input dimension
H = diag([ones(1, m), zeros(1, n + 1)]);
f = [zeros(1,m+1) c*ones(1,n)]';
p = diag(Y) * X;
A = -[p Y eye(n)];
B = -ones(n,1);
lb = [-inf * ones(m+1,1); zeros(n,1)];
z = quadprog(H,f,A,B,[],[],lb);
w = z(1:m,:);
b = z(m+1:m+1,:);
eps = z(m+2:m+n+1,:);

constant term in Matlab principal component regression (pcr) analysis

I am trying to learn principal component regression (PCR) with MATLAB, using this guide: http://www.mathworks.fr/help/stats/examples/partial-least-squares-regression-and-principal-components-regression.html
It's really good, but I just cannot understand one step.
We do the PCA and the regression, nice and clear:
[PCALoadings,PCAScores,PCAVar] = princomp(X);
betaPCR = regress(y-mean(y), PCAScores(:,1:2));
And then we adjust the first coefficient:
betaPCR = PCALoadings(:,1:2)*betaPCR;
betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR];
yfitPCR = [ones(n,1) X]*betaPCR;
How come the coefficient for the constant term needs to be 'mean(y) - mean(X)*betaPCR'? Can you explain that to me?
Thanks in advance!
This is really a math question, not a coding question. Your PCA extracts a set of features and puts them in a matrix, which gives you PCALoadings and PCAScores. Pull out the first two principal components and their loadings, and put them in their own matrices:
W = PCALoadings(:, 1:2)
Z = PCAScores(:, 1:2)
The relationship between X and Z is that X can be approximated by:
Z = (X - mean(X)) * W <=> X ~ mean(X) + Z * W' (1)
The intuition is that Z captures most of the "important information" in X, and the matrix W tells you how to transform between the two representations.
Now you can do a regression of y on Z. First you have to subtract the mean from y, so that both the left and right hand sides have mean zero:
y - mean(y) = Z * beta + errors (2)
Now you want to use that regression to make predictions for y from X. Substituting from equation (1) into equation (2) gives you
y - mean(y) = (X - mean(X)) * W * beta
= (X - mean(X)) * beta1
where we have defined beta1 = W * beta (you do this in your third line of code). Rearranging:
y = mean(y) - mean(X) * beta1 + X * beta1
= [ones(n,1) X] * [mean(y) - mean(X) * beta1; beta1]
= [ones(n,1) X] * betaPCR
which works out if we define
betaPCR = [mean(y) - mean(X) * beta1; beta1]
as in your fourth line of code.
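If it helps to see the algebra hold numerically, here is a small check on random data. It uses pca (the current name for princomp, returning the loadings and the centred scores) and regress from the Statistics and Machine Learning Toolbox; the data sizes and coefficients are arbitrary.
rng(0);
n = 50;                                   % observations (arbitrary)
X = randn(n, 4);                          % predictors
y = X*[1; -2; 0.5; 0] + 0.1*randn(n, 1);  % response
[W, Z] = pca(X);                          % Z = (X - mean(X)) * W
beta  = regress(y - mean(y), Z(:, 1:2));  % regress on the first two PCs
beta1 = W(:, 1:2) * beta;                 % map back to the original X space
betaPCR = [mean(y) - mean(X)*beta1; beta1];
% The two forms of the fitted values agree up to round-off:
yfit1 = [ones(n,1) X] * betaPCR;
yfit2 = mean(y) + (X - mean(X)) * beta1;
max(abs(yfit1 - yfit2))                   % on the order of 1e-15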