My two-layer neural network model doesn't converge - neural-network

I am training a two-layer neural network. I waited for 15000 epochs, and the model still doesn't converge.
import random
import numpy as np
import pandas as pd

ans = []
for i in range(1000):
    x1, y1 = random.uniform(-3, 3), random.uniform(-3, 3)
    if x1 * x1 + y1 * y1 < 1:
        ans.append([x1, y1, 0])    # inside the unit circle -> class 0
    elif x1 * x1 + y1 * y1 >= 2 and x1 * x1 + y1 * y1 <= 8:
        ans.append([x1, y1, 1])    # in the surrounding ring -> class 1
data = pd.DataFrame(ans)
print(data.shape)
X = np.array(data[[0, 1]])
y = np.array(data[2])
I am generating the data from random points; it looks something like this.
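For reference, the data can be visualized with a quick matplotlib sketch (this is just for illustration, not part of the training code):
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y, s=8)  # class 0 = inner disc, class 1 = surrounding ring
plt.gca().set_aspect('equal')
plt.show()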
def sigmoid(z):
    # standard logistic function (definition not shown in the original post)
    return 1 / (1 + np.exp(-z))

learnrate = 0.01  # assumed value; the learning rate is not shown in the post

weights_layer1 = np.random.normal(scale=1 / 10**.5, size=(2, 20))
bias1 = np.zeros((1, 20))
bias2 = np.zeros((1, 1))
weights_layer2 = np.random.normal(scale=1 / 10**.5, size=(20, 1))
for e in range(15000):
    for x, y1 in zip(X, y):
        x = x.reshape(1, 2)
        # forward pass
        layer1 = sigmoid(np.dot(x, weights_layer1) + bias1)
        layer2 = sigmoid(np.dot(layer1, weights_layer2) + bias2)
        # output error term
        dk = (y1 - layer2) * layer2 * (1 - layer2)
        dw2 = learnrate * dk * layer1.T
        dw2 = dw2.reshape(weights_layer2.shape)
        # print(dw2.shape)
        weights_layer2 += dw2
        # bias2 += dk * learnrate
        # hidden-layer error term
        dj = weights_layer2.T * layer1 * (1 - layer1) * dk
        dw1 = learnrate * np.dot(x.T, dj)
I am calculating loss in this manner.
loss = 0
for x, y1 in zip(X, y):
    layer1 = sigmoid(np.dot(x, weights_layer1))
    layer2 = sigmoid(np.dot(layer1, weights_layer2))
    loss += (layer2 - y1)**2
print(loss)
I can't find what is going wrong; can you see anything? Thanks. I trained the same network with PyTorch and it converges fine.
The final model looks like this on the training data, but on test data it is worse.

After a few hours of trying things out, I found the problem: this network doesn't converge without biases. With biases it converged in 5000 epochs.
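For anyone running into the same thing, here is a minimal sketch of the per-sample update with the bias gradients included (it mirrors the loop above; the learning rate is whatever you were already using):
# inside the per-sample loop, after the forward pass
dk = (y1 - layer2) * layer2 * (1 - layer2)           # output error term
dj = weights_layer2.T * layer1 * (1 - layer1) * dk   # hidden error term
weights_layer2 += learnrate * np.dot(layer1.T, dk)
bias2 += learnrate * dk
weights_layer1 += learnrate * np.dot(x.T, dj)
bias1 += learnrate * dj
The loss computation should then also add bias1 and bias2 in its forward pass.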

Related

Matlab neural network handwritten digit recognition, output going to indifference

Using Matlab I am trying to construct a neural network that can classify handwritten digits that are 30x30 pixels. I use backpropagation to find the correct weights and biases. The network starts with 900 inputs, then has 2 hidden layers with 16 neurons and it ends with 10 outputs. Each output neuron has a value between 0 and 1 that represents the belief that the input should be classified as a certain digit. The problem is that after training, the output becomes almost indifferent to the input and it goes towards a uniform belief of 0.1 for each output.
My approach is to take each image with 30x30 pixels and reshape it to a 900x1 vector (note that 'Images_vector' is already in the vector format when it is loaded). The weights and biases are initialized with random values between 0 and 1. I am using stochastic gradient descent to update the weights and biases with 10 randomly selected samples per batch. The equations are as described by Nielsen.
The script is as follows.
%% Inputs
numberofbatches = 1000;
batchsize = 10;
alpha = 1;
cutoff = 8000;
layers = [900 16 16 10];
%% Initialization
rng(0);
load('Images_vector')
Images_vector = reshape(Images_vector', 1, 10000);
labels = [ones(1,1000) 2*ones(1,1000) 3*ones(1,1000) 4*ones(1,1000) 5*ones(1,1000) 6*ones(1,1000) 7*ones(1,1000) 8*ones(1,1000) 9*ones(1,1000) 10*ones(1,1000)];
newOrder = randperm(10000);
Images_vector = Images_vector(newOrder);
labels = labels(newOrder);
images_training = Images_vector(1:cutoff);
images_testing = Images_vector(cutoff + 1:10000);
w = cell(1,length(layers) - 1);
b = cell(1,length(layers));
dCdw = cell(1,length(layers) - 1);
dCdb = cell(1,length(layers));
for i = 1:length(layers) - 1
    w{i} = rand(layers(i+1),layers(i));
    b{i+1} = rand(layers(i+1),1);
end
%% Learning process
batches = randi([1 cutoff - batchsize],1,numberofbatches);
cost = zeros(numberofbatches,1);
c = 1;
for batch = batches
    for i = 1:length(layers) - 1
        dCdw{i} = zeros(layers(i+1),layers(i));
        dCdb{i+1} = zeros(layers(i+1),1);
    end
    for n = batch:batch+batchsize
        y = zeros(10,1);
        disp(labels(n))
        y(labels(n)) = 1;
        % Network
        a{1} = images_training{n};
        z{2} = w{1} * a{1} + b{2};
        a{2} = sigmoid(0, z{2});
        z{3} = w{2} * a{2} + b{3};
        a{3} = sigmoid(0, z{3});
        z{4} = w{3} * a{3} + b{4};
        a{4} = sigmoid(0, z{4});
        % Cost
        cost(c) = sum((a{4} - y).^2) / 2;
        % Gradient
        d{4} = (a{4} - y) .* sigmoid(1, z{4});
        d{3} = (w{3}' * d{4}) .* sigmoid(1, z{3});
        d{2} = (w{2}' * d{3}) .* sigmoid(1, z{2});
        dCdb{4} = dCdb{4} + d{4} / 10;
        dCdb{3} = dCdb{3} + d{3} / 10;
        dCdb{2} = dCdb{2} + d{2} / 10;
        dCdw{3} = dCdw{3} + (a{3} * d{4}')' / 10;
        dCdw{2} = dCdw{2} + (a{2} * d{3}')' / 10;
        dCdw{1} = dCdw{1} + (a{1} * d{2}')' / 10;
        c = c + 1;
    end
    % Adjustment
    b{4} = b{4} - dCdb{4} * alpha;
    b{3} = b{3} - dCdb{3} * alpha;
    b{2} = b{2} - dCdb{2} * alpha;
    w{3} = w{3} - dCdw{3} * alpha;
    w{2} = w{2} - dCdw{2} * alpha;
    w{1} = w{1} - dCdw{1} * alpha;
end
figure
plot(cost)
ylabel 'Cost'
xlabel 'Batches trained on'
With the sigmoid function being the following.
function y = sigmoid(derivative, x)
    if derivative == 0
        y = 1 ./ (1 + exp(-x));
    else
        y = sigmoid(0, x) .* (1 - sigmoid(0, x));
    end
end
Other than this, I have also tried having one of each digit in each batch, but this gave the same result. I have also tried varying the batch size, the number of batches, and alpha, but with no success.
Does anyone know what I am doing wrong?
Correct me if I'm wrong: you have 10000 samples in your data, which you divide into 1000 batches of 10 samples. Your training process consists of running over these 10000 samples once.
This might be too little; normally a training process consists of several epochs (one epoch = iterating over every sample once). You can try going over your batches multiple times.
Also, for 900 inputs your network seems small. Try more neurons in the second layer. Hope it helps!
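For illustration, the multi-epoch, mini-batch structure being suggested looks roughly like this. This is a self-contained NumPy sketch with a logistic-regression stand-in, not the poster's MATLAB network; all sizes and values here are made up:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 900))              # 10000 samples, 900 inputs (made up)
y = (X[:, 0] > 0).astype(float)                # arbitrary binary target
w, b = np.zeros(900), 0.0
alpha, batch_size, num_epochs = 0.1, 10, 30    # several passes over the data, not just one

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(num_epochs):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        p = sigmoid(X[idx] @ w + b)
        grad_w = X[idx].T @ (p - y[idx]) / len(idx)   # gradient averaged over the batch
        grad_b = np.mean(p - y[idx])
        w -= alpha * grad_w                    # same update rule, applied batch by batch
        b -= alpha * grad_b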

Learning XOR with deep neural network

I am a novice at deep learning, so I am beginning with the simplest test case: XOR learning.
In the new edition of Digital Image Processing by G & W, the authors give an example of XOR learning with a deep net of 3 layers: input, hidden and output (each layer has 2 neurons), with a sigmoid as the network activation function.
For network initialization they say: "We used alpha = 1.0, an initial set of Gaussian random weights of zero mean and standard deviation of 0.02" (alpha is the gradient descent learning rate).
Training was done with 4 labeled examples:
X = [1 -1 -1 1;1 -1 1 -1];%MATLAB syntax
R = [1 1 0 0;0 0 1 1];%Labels
I have written the following MATLAB code to implement the network learning process:
function output = neuralNet4e(input,specs)
    NumPat = size(input.X,2);       %Number of patterns
    NumLayers = length(specs.W);
    for kEpoch = 1:specs.NumEpochs
        % forward pass
        A = cell(NumLayers,1);      %Output of each neuron in each layer
        derZ = cell(NumLayers,1);   %Activation function derivative on each neuron dot product
        A{1} = input.X;
        for kLayer = 2:NumLayers
            B = repmat(specs.b{kLayer},1,NumPat);
            Z = specs.W{kLayer} * A{kLayer - 1} + B;
            derZ{kLayer} = specs.activationFuncDerive(Z);
            A{kLayer} = specs.activationFunc(Z);
        end
        % backprop
        D = cell(NumLayers,1);
        D{NumLayers} = (A{NumLayers} - input.R).* derZ{NumLayers};
        for kLayer = (NumLayers-1):-1:2
            D{kLayer} = (specs.W{kLayer + 1}' * D{kLayer + 1}).*derZ{kLayer};
        end
        %Update weights and biases
        for kLayer = 2:NumLayers
            specs.W{kLayer} = specs.W{kLayer} - specs.alpha * D{kLayer} * A{kLayer - 1}' ;
            specs.b{kLayer} = specs.b{kLayer} - specs.alpha * sum(D{kLayer},2);
        end
    end
    output.A = A;
end
Now, when I use their setup (i.e., weight initialization with std = 0.02)
clearvars
s = 0.02;
input.X = [1 -1 -1 1;1 -1 1 -1];
input.R = [1 1 0 0;0 0 1 1];
specs.W = {[];s * randn(2,2);s * randn(2,2)};
specs.b = {[];s * randn(2,1);s * randn(2,1)};
specs.activationFunc = @(x) 1./(1 + exp(-x));
specs.activationFuncDerive = @(x) exp(-x)./(1 + exp(-x)).^2;
specs.NumEpochs = 1e4;
specs.alpha = 1;
output = neuralNet4e(input,specs);
I'm getting (after 10000 epochs) that the final output of the net is
output.A{3} = [0.5 0.5 0.5 0.5;0.5 0.5 0.5 0.5]
but when I changed s = 0.02; to s = 1; I got output.A{3} = [0.989 0.987 0.010 0.010;0.010 0.012 0.98 0.98] as it should.
Is it possible to get these results with s = 0.02; and I am doing something wrong in my code? Or is a standard deviation of 0.02 just a typo?
Based on your code, I don't see any errors. To my knowledge, the result that you got,
[0.5 0.5 0.5 0.5;0.5 0.5 0.5 0.5]
is a typical result of overfitting. There are many reasons for this to happen, such as too many epochs, too large a learning rate, too small a sample size, and others.
In your example, s = 0.02 limits the values of the randomized weights and biases. Changing that to s = 1 leaves the randomized values unchanged/unscaled.
To make the s = 0.02 case work, you can try reducing the number of epochs or lowering alpha.
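For what it's worth, the scaling is literal: s just multiplies standard-normal draws, so with s = 0.02 the weights start roughly 50 times closer to zero. A tiny NumPy check (not part of the original code):
import numpy as np

rng = np.random.default_rng(0)
print(np.std(0.02 * rng.standard_normal(10000)))  # ~0.02: weights start very close to zero
print(np.std(1.00 * rng.standard_normal(10000)))  # ~1.0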
Hope this helps.

Weird results in approximation of a function with neural networks

I am trying to approximate a function (the right-hand side of a differential equation) of the form ddx = F(x,dx,u) (where x, dx, u are scalars and u is constant) with an RBF neural network. I have the function F as a black box (I can feed it initial x, dx and u and get back x and dx for a timespan I want). During training (using sigma-modification) I get the following response when plotting the real dx against the approximated dx.
Then I save the parameters of the NN (the centers and the stds of the Gaussians, and the final weights) and perform a simulation using the same initial x, dx and u as before, this time of course keeping the weights fixed. But I get the following plot.
Is that logical? Am I missing something?
The training code is as follows:
%load the results I got from the real function
load sim_data t p pd dp %p is x,dp is dx and pd is u
real_states = [p,dp];
%down and upper limits of the variables
p_dl = 0;
p_ul = 2;
v_dl = -1;
v_ul = 4;
pd_dl = 0;%pd is constant each time,but the function should work for different pds
pd_ul = 2;
%number of gaussians
nc = 15;
x = p_dl:(p_ul-p_dl)/(nc-1):p_ul;
dx = v_dl:(v_ul-v_dl)/(nc-1):v_ul;
pdx = pd_dl:(pd_ul-pd_dl)/(nc-1):pd_ul;
%centers of gaussians
Cx = combvec(x,dx,pdx);
%stds of the gaussians
B = ones(1,3)./[2.5*(p_ul-p_dl)/(nc-1),2.5*(v_ul-v_dl)/(nc-1),2.5*(pd_ul-pd_dl)/(nc-1)];
nw = size(Cx,2);
wdx = zeros(nw,1);
state = real_states(1,[1,4]);%there are also y,dy,dz and z in real_states (ignored here)
states = zeros(length(t),2);
timestep = 0.005;
for step=1:length(t)
    states(step,:) = state;
    %compute the values of the Gaussians (RBF activations)
    Sx = exp(-1/2 * sum(((([real_states(step,1);real_states(step,4);pd(1)]*ones(1,nw))'-Cx').*(ones(nw,1)*B)).^2,2));
    ddx = -530*state(2) + wdx'*Sx;
    edx = state(2) - real_states(step,4);
    dwdx = -1200*edx * Sx - 4 * wdx;
    wdx = wdx + dwdx*timestep;
    state = [state(1)+state(2)*timestep,state(2)+ddx*timestep];
end
save weights wdx Cx B
figure
plot(t,[dp(:,1),states(:,2)])
legend('x_d_o_t','x_d_o_t_h_a_t')
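For readers less used to the vectorized MATLAB expression, the Sx line above evaluates one Gaussian per center: Sx_j = exp(-1/2 * sum_k (B_k * (q_k - Cx_k,j))^2) for the current input q = [x; dx; pd]. A rough NumPy equivalent (the function name and array shapes are mine, not from the original code):
import numpy as np

def rbf_activations(q, Cx, B):
    # q:  (3,)    current input [x, dx, pd]
    # Cx: (3, nw) Gaussian centers, one column per center
    # B:  (3,)    inverse widths (1/std) per dimension
    diff = (q[:, None] - Cx) * B[:, None]          # scale each dimension by its inverse width
    return np.exp(-0.5 * np.sum(diff**2, axis=0))  # one activation per center, shape (nw,)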
The code used to verify the approximation is the following:
load sim_data t p pd dp
real_states = [p,dp];
load weights wdx Cx B
nw = size(Cx,2);
state = real_states(1,[1,4]);
states = zeros(length(t),2);
timestep = 0.005;
for step=1:length(t)
    states(step,:) = state;
    Sx = exp(-1/2 * sum(((([real_states(step,1);real_states(step,4);pd(1)]*ones(1,nw))'-Cx').*(ones(nw,1)*B)).^2,2));
    ddx = -530*state(2) + wdx'*Sx;
    state = [state(1)+state(2)*timestep,state(2)+ddx*timestep];
end
figure
plot(t,[dp(:,1),states(:,2)])
legend('x_d_o_t','x_d_o_t_h_a_t')

MATLAB Perceptron

I've been struggling with this for quite some time now. I can't seem to figure out why I have a percentage error in the thousands. I'm trying to train a perceptron to separate X1 and X2, which are Gaussian-distributed data sets with distinct means and identical covariances. My code:
N=200;
X = [X1; X2];
X = [X ones(N,1)]; %bias
y = [-1*ones(N/2,1); ones(N/2,1)]; %classification
%Split data into training and test
ii = randperm(N);
Xtr = X(ii(1:N/2),:);
ytr = X(ii(1:N/2),:);
Xts = X(ii(N/2+1:N),:);
yts = y(ii(N/2+1:N),:);
w = randn(3,1);
eta = 0.001;
%learn from training set
for iter=1:500
    j = ceil(rand*N/2);
    if( ytr(j)*Xtr(j,:)*w < 0)
        w = w + eta*Xtr(j,:)';
    end
end
%apply what you have learnt to test set
yhts = Xts * w;
disp([yts yhts])
PercentageError = 100*sum(find(yts .*yhts < 0))/Nts;
Any help would be appreciated. Thank you
You have a bug in your error calculation.
On this line:
PercentageError = 100*sum(find(yts .*yhts < 0))/Nts;
The find is returning indices of the matching items. For your accuracy measure you don't want those, you just want the count:
PercentageError = 100*sum( yts .*yhts < 0 )/Nts;
If I generate X1 = randn(100,2); X2 = randn(100,2); and assume Nts=100, I get 2808% for your code, and expected 50% error (no better than guessing because my test data cannot be separated) for the corrected version.
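The same pitfall is easy to reproduce in any language; a tiny NumPy illustration with made-up data (not the asker's):
import numpy as np

rng = np.random.default_rng(0)
yts = np.sign(rng.standard_normal(100))          # true labels in {-1, +1}
yhts = rng.standard_normal(100)                  # random "predictions"
mis = yts * yhts < 0                             # boolean mask of misclassified samples
print(100 * np.sum(np.nonzero(mis)[0]) / 100)    # sums the *indices* -> absurd "percentage"
print(100 * np.sum(mis) / 100)                   # counts the mismatches -> about 50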
Update - the perceptron model had a more subtle bug too, see: https://datascience.stackexchange.com/questions/2353/matlab-perceptron

Neural Networks: Sigmoid Activation Function for continuous output variable

Okay, so I am in the middle of Andrew Ng's machine learning course on Coursera and would like to adapt the neural network that I completed as part of assignment 4.
In particular, the neural network which I had completed correctly as part of the assignment was as follows:
Sigmoid activation function: g(z) = 1/(1+e^(-z))
10 output units, each of which can take the value 0 or 1
1 hidden layer
Back-propagation method used to minimize cost function
Cost function (the regularized cross-entropy cost from the assignment):
J(Theta) = -(1/m) * sum_{i=1..m} sum_{k=1..K} [ y_k^(i) * log(h(x^(i))_k) + (1 - y_k^(i)) * log(1 - h(x^(i))_k) ]
           + (lambda/(2m)) * sum_{l=1..L-1} sum_{i=1..s_l} sum_{j=1..s_(l+1)} (Theta_ji^(l))^2
where L = number of layers, s_l = number of units in layer l, m = number of training examples, K = number of output units
Now I want to adjust the exercise so that there is one continuous output unit that takes any value in [0,1], and I am trying to work out what needs to change. So far I have:
Replaced the data with my own, i.e.,such that the output is continuous variable between 0 and 1
Updated references to the number of output units
Updated the cost function in the back-propagation algorithm to:
J = 1/(2m) * sum_i (a_3^(i) - y^(i))^2
where a_3 is the value of the output unit determined from forward propagation.
I am certain that something else must change, as the gradient checking method shows that the gradient determined by back-propagation and the one from the numerical approximation no longer match up. I did not change the sigmoid gradient; it is left at f(z)*(1-f(z)), where f(z) is the sigmoid function 1/(1+e^(-z)). Nor did I update the numerical approximation of the derivative formula; it is simply (J(theta+e) - J(theta-e))/(2e).
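For concreteness, the numerical approximation I mean is the centered difference applied to each parameter in turn; a minimal NumPy sketch of the idea (illustrative only, with a made-up quadratic cost):
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # (J(theta + e) - J(theta - e)) / (2e), perturbing one parameter at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps
        grad.flat[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

J = lambda t: 0.5 * np.sum(t**2)       # made-up cost; its true gradient is t itself
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(J, theta))    # close to [1, -2, 3]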
Can anyone advise of what other steps would be required?
Coded in Matlab as follows:
% FORWARD PROPAGATION
% input layer
a1 = [ones(m,1),X];
% hidden layer
z2 = a1*Theta1';
a2 = sigmoid(z2);
a2 = [ones(m,1),a2];
% output layer
z3 = a2*Theta2';
a3 = sigmoid(z3);
% BACKWARD PROPAGATION
delta3 = a3 - y;
delta2 = delta3*Theta2(:,2:end).*sigmoidGradient(z2);
Theta1_grad = (delta2'*a1)/m;
Theta2_grad = (delta3'*a2)/m;
% COST FUNCTION
J = 1/(2 * m) * sum( (a3-y).^2 );
% Implement regularization with the cost function and gradients.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + Theta1(:,2:end)*lambda/m;
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + Theta2(:,2:end)*lambda/m;
J = J + lambda/(2*m)*( sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
I have since realised that this question is similar to one asked by Mikhail Erofeev on Stack Overflow; however, in this case I want the continuous variable to be between 0 and 1 and therefore use a sigmoid function.
First, your cost function should be:
J = 1/m * sum( (a3-y).^2 );
I think your Theta2_grad = (delta3'*a2)/m; is expected to match the numerical approximation once delta3 is changed to delta3 = 1/2 * (a3 - y);.
Check this slide for more details.
EDIT:
In case there is some minor discrepancy between our codes, I pasted my code below for your reference. The code has already been compared with the numerical approximation function checkNNGradients(lambda); the relative difference is less than 1e-4 (it does not meet the 1e-11 requirement from Dr. Andrew Ng, though).
function [J grad] = nnCostFunctionRegression(nn_params, ...
                                             input_layer_size, ...
                                             hidden_layer_size, ...
                                             num_labels, ...
                                             X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
m = size(X, 1);
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
X = [ones(m, 1) X];
z1 = sigmoid(X * Theta1');
zs = z1;
z1 = [ones(m, 1) z1];
z2 = z1 * Theta2';
ht = sigmoid(z2);
y_recode = zeros(length(y),num_labels);
for i=1:length(y)
    y_recode(i,y(i))=1;
end
y = y_recode;
regularization=lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
J=1/(m)*sum(sum((ht - y).^2))+regularization;
delta_3 = 1/2*(ht - y);
delta_2 = delta_3 * Theta2(:,2:end) .* sigmoidGradient(X * Theta1');
delta_cap2 = delta_3' * z1;
delta_cap1 = delta_2' * X;
Theta1_grad = ((1/m) * delta_cap1)+ ((lambda/m) * (Theta1));
Theta2_grad = ((1/m) * delta_cap2)+ ((lambda/m) * (Theta2));
Theta1_grad(:,1) = Theta1_grad(:,1)-((lambda/m) * (Theta1(:,1)));
Theta2_grad(:,1) = Theta2_grad(:,1)-((lambda/m) * (Theta2(:,1)));
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
If you want a continuous output, try not applying the sigmoid activation when computing the output value.
a1 = [ones(m, 1) X];
a2 = sigmoid(a1 * Theta1');
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
ht = a3;
Normalize the input before using it in nnCostFunction. Everything else remains the same.
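In case it is unclear, normalizing the input here just means standardizing each feature column before calling nnCostFunction; a small NumPy sketch of the idea (not the original MATLAB):
import numpy as np

def normalize_features(X):
    # subtract the per-column mean and divide by the per-column standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant columns
    return (X - mu) / sigma, mu, sigma

# reuse the same mu and sigma for any test data:
# X_test_norm = (X_test - mu) / sigma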