Neural Network not fitting XOR - matlab

I created an Octave script for training a neural network with 1 hidden layer using backpropagation but it can not seem to fit an XOR function.
x Input 4x2 matrix [0 0; 0 1; 1 0; 1 1]
y Output 4x1 matrix [0; 1; 1; 0]
theta Hidden / output layer weights
z Weighted sums
a Activation function applied to weighted sums
m Sample count (4 here)
My weights are initialized as follows
epsilon_init = 0.12;
theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init * epsilon_init;
theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init * epsilon_init;
Feed forward
a1 = x;
a1_with_bias = [ones(m, 1) a1];
z2 = a1_with_bias * theta1';
a2 = sigmoid(z2);
a2_with_bias = [ones(size(a2, 1), 1) a2];
z3 = a2_with_bias * theta2';
a3 = sigmoid(z3);
Then I compute the logistic cost function
j = -sum((y .* log(a3) + (1 - y) .* log(1 - a3))(:)) / m;
Back propagation
delta2 = (a3 - y);
gradient2 = delta2' * a2_with_bias / m;
delta1 = (delta2 * theta2(:, 2:end)) .* sigmoidGradient(z2);
gradient1 = delta1' * a1_with_bias / m;
The gradients were verified to be correct using gradient checking.
I then use these gradients to find the optimal values for theta using gradient descent, though using Octave's fminunc function yields the same results. The cost function converges to ln(2) (or 0.5 for a squared errors cost function) because the network outputs 0.5 for all four inputs no matter how many hidden units I use.
Does anyone know where my mistake is?

Start with a larger range when initialising weights, including negative values. It is difficult for your code to "cross-over" between positive and negative weights, and you probably meant to put * 2 * epsilon_init - epsilon_init; when instead you put * 2 * epsilon_init * epsilon_init;. Fixing that may well fix your code.
As a rule of thumb, I would do something like this:
theta1 = ( 0.5 * sqrt ( 6 / ( inputCount + hiddenCount) ) *
randn( hiddenCount, inputCount + 1 ) );
theta2 = ( 0.5 * sqrt ( 6 / ( hiddenCount + outputCount ) ) *
randn( outputCount, hiddenCount + 1 ) );
The multiplier is just some advice I picked up on a course, I think that it is backed by a research paper that compared a few different approaches.
In addition, you may need a lot of iterations to learn XOR if you run basic gradient descent. I suggest running for at least 10000 before declaring that learning isn't working. The fminunc function should do better than that.
I ran your code with 2 hidden neurons, basic gradient descent and the above initialisations, and it learned XOR correctly. I also tried adding momentum terms, and the learning was faster and more reliable, so I suggest you take a look at that next.

You need at least 3 neurons in the hidden layer and correct the initialization as the first answer suggest.
If the sigmoidGradient(z2) means a2.*(1-a2) all the rest of the code seems ok to me.
Best reggards,

Related

Regularized logistic regresion with vectorization

I'm trying to implement a vectorized version of the regularised logistic regression. I have found a post that explains the regularised version but I don't understand it.
To make it easy I will copy the code below:
hx = sigmoid(X * theta);
m = length(X);
J = (sum(-y' * log(hx) - (1 - y') * log(1 - hx)) / m) + lambda * sum(theta(2:end).^2) / (2*m);
grad =((hx - y)' * X / m)' + lambda .* theta .* [0; ones(length(theta)-1, 1)] ./ m ;
I understand the first part of the Cost equation, If I'm correct it could be represented as:
J = ((-y' * log(hx)) - ((1-y)' * log(1-hx)))/m;
The problem it's the regularization term. Let's take more detail:
Dimensions:
X = (m x (n+1))
theta = ((n+1) x 1)
I don't understand why he let the first term of theta (theta_0) outside of the equation, when in theory the regularized term it's:
and it has to take into account all the thetas
For the gradient descent, I think that this equation it's equivalent:
L = eye(length(theta));
L(1,1) = 0;
grad = (1/m * X'* (hx - y)+ (lambda*(L*theta)/m).
In Matlab indexes begin from 1, and in mathematic indexes begin from 0 (the indexes on the formula which you mentioned are also beginning from 0).
So, in theory, the first term of theta also needs to be let outside of the equation.
And as for your second question, you right! It is an equivalent clean equation!

ODE45 with very large numbers as constraints

2nd ODE to solve in MATLAB:
( (a + f(t))·d²x/dt² + (b/2 + k(t))·dx/dt ) · dx/dt - g(t) = 0
Boundary condition:
dx/dt(0) = v0
where
t is the time,
x is the position
dx/dt is the velocity
d2x/dt2 is the acceleration
a, b, v0 are constants
f(t), k(t) and h(t) are KNOWN functions dependent on t
(I do not write them because they are quite big)
As an example, using symbolic variables:
syms t y
%% --- Initial conditions ---
phi = 12.5e-3;
v0 = 300;
e = 3e-3;
ro = 1580;
E = 43e9;
e_r = 0.01466;
B = 0.28e-3;
%% --- Intermediate calculations ---
v_T = sqrt(((1 + e_r) * 620e6) /E) - sqrt(E/ro) * e_r;
R_T = v_T * t;
m_acc = pi * e * ro *(R_T^2);
v_L = sqrt (E/ro);
R_L = v_L * t;
z = 2 * R_L;
E_4 = B * ((e_r^2)* B * (0.9^(z/B)-1)) /(log(0.9));
E_1 = E * e * pi * e_r^2 * (-phi* (phi - 2*v_T*t)) /16;
E_2 = pi * R_T^2 * 10e9;
E_3 = pi * R_T^2 * 1e6 * e;
%% Resolution of the problem
g_t = -diff(E_1 + E_2 + E_3, t);
f(t,y)=(g_t - (pi*v_T*e*ro/2 + E_4) * y^2 /(y* (8.33e-3 + m_acc))];
fun=matlabFunction(f);
[T,Y]=ode45(fun,[0 1], v0]);
How can I rewrite this to get x as y=dx/dt? I'm new to Matlab and any help is very welcome !
First, you chould use subs to evaluate a symbolic function. Another approach is to use matlabFunction to convert all symbolic expressions to anonymous functions, as suggested by Horchler.
Second, you're integrating the ODE as if it is 1st order in dx/dt. If you're interested in x(t) as well as dx/dt(t), then you'll have to modify the function like so:
fun = #(t,y) [y(2);
( subs(g) - (b/2 + subs(k))*y(2)*y(2) ) / ( y(2) * (a + subs(f))) ];
and of course, provide an initial value for x0 = x(0) as well as v0 = dx/dt(0).
Third, the absolute value of the parameters is hardly ever a real concern. IEEE754 double-precision floating point format can effortlessly represent numbers between 2.225073858507201e-308 and 1.797693134862316e+308 (realmin and realmax, respectively). So for the coefficients you gave (O(1014)), this is absolutely not a problem. You might lose a few digits of precision if you don't take precautions (rescale to [-1 +1], reformulate the problem in different units, ...), but the relative error due to this is more than likely to be tiny and insignificant compared to the algorithmic error made by ode45.
<RANDOM_OPINIONATED_RANT>
Fourth, WHY do you use symbolic math for this purpose?! You are doing a numerical integration, meaning, there is no analytic solution anyway. Why bother with symbolics then? Doing the integration with symbolics (through vpa even) is going to be dozens, hundreds, yes, often even thousands of times slower than keeping (or re-implementing) everything numerical (which some would argue is already slow in MATLAB compared to a bare-metal approach).
Yes, of course, for this specific, individual, isolated use case it may not matter much, but for the future I'd strongly advise you to learn to:
use symbolics for derivations, proving theorems, simplifying expressions, ...
use numerics to implement any algorithm or function from which actual numbers are expected.
In other words, symbolics for drafting, numerics for crunching. And exactly zero symbolics should appear in any good implementation of any algorithm.
Although it's possible to mix them to some extent, that does not mean it is a good idea to do so. In fact, that's almost never. And the few isolated cases where it is the only viable option are not a vindication of the approach.
They are rare, isolated cases after all, far from the abundant norm.
For me it bears resemblance with the evil eval, with similar reasons for why it Should. Be. Avoided.
</RANDOM_OPINIONATED_RANT>
With the full code, it's easy to come up with a complete solution:
% Initial conditions
phi = 12.5e-3;
v0 = 300;
x0 = 0; % (my assumption)
e = 3e-3;
ro = 1580;
E = 43e9;
e_r = 0.01466;
B = 0.28e-3;
% Intermediate calculations
v_T = sqrt(((1 + e_r) * 620e6) /E) - sqrt(E/ro) * e_r;
R_T = #(t) v_T * t;
m_acc = #(t) pi * e * ro *(R_T(t)^2);
v_L = sqrt (E/ro);
R_L = #(t) v_L * t;
z = #(t) 2 * R_L(t);
E_4 = #(t) B * ((e_r^2)* B * (0.9^(z(t)/B)-1)) /(log(0.9));
% UNUSED
%{
E_1 = #(t) -phi * E * e * pi * e_r^2 * (phi - 2*v_T*t) /16;
E_2 = #(t) pi * R_T(t)^2 * 10e9;
E_3 = #(t) pi * R_T(t)^2 * 1e6 * e;
%}
% Resolution of the problem
g_t = #(t) -( phi * E * e * pi * e_r^2 * v_T / 8 + ... % dE_1/dt
pi * 10e9 * 2 * R_T(t) * v_T + ... % dE_2/dt
pi * 1e6 * e * 2 * R_T(t) * v_T ); % dE_3/dt
% The derivative of Z = [x(t); x'(t)] equals Z' = [x'(t); x''(t)]
f = #(t,y)[y(2);
(g_t(t) - (0.5*pi*v_T*e*ro + E_4(t)) * y(2)^2) /(y(2) * (8.33e-3 + m_acc(t)))];
% Which is readily integrated
[T,Y] = ode45(f, [0 1], [x0 v0]);
% Plot solutions
figure(1)
plot(T, Y(:,1))
xlabel('t [s]'), ylabel('position [m]')
figure(2)
plot(T, Y(:,2))
xlabel('t [s]'), ylabel('velocity [m/s]')
Results:
Note that I've not used symbolics anywhere, except to double-check my hand-derived derivatives.

how to implement newton-raphson to calculate the k(i) coefficients of a implicit runge kutta?

I'm trying to implement a RK implicit 2-order to convection-diffusion equation (1D) with fdm_2nd and gauss butcher coefficients: 'u_t = -uu_x + nu .u_xx' .
My goal is to compare the explit versus implcit scheme. The explicit rk which works well with a little number of viscosity. The curve of explicit schem show us a very nice shock wave.
I need your help to implement correctly the solver of the k(i) coefficient. I don't see how implement the newton method for all k(i).
do I need to implement it for all time-space steps ? or just in time ? The jacobian is maybe wrong but i don't see where. Or maybe i use the jacobian in wrong direction...
Actualy, my code works, but i think it's was wrong somewhere ... also the implicit curve does not move from the initial values.
here my function :
function [t,u] = burgers(t0,U,N,dx)
nu=0.01; %coefficient de viscosité
A=(diag(zeros(1,N))-diag(ones(1,N-1),1)+diag(ones(1,N-1),-1)) / (2*dx);
B=(-2*diag(ones(1,N))+diag(ones(1,N-1),1)+diag(ones(1,N-1),-1)) / (dx).^2;
t=t0;
u = - A * U.^2 + nu .* B * U;
the jacobian :
function Jb = burJK(U,dx,i)
%Opérateurs
a(1,1) = 1/4;
a(1,2) = 1/4 - (3).^(1/2) / 6;
a(2,1) = 1/4 + (3).^(1/2) / 6;
a(2,2) = 1/4;
Jb(1,1) = a(1,1) .* (U(i+1,1) - U(i-1,1))/ (2*dx) - 1;
Jb(1,2) = a(1,2) .* (U(i+1,1) - U(i-1,1))/ (2*dx);
Jb(2,1) = a(2,1) .* (U(i+1,2) - U(i-1,2))/ (2*dx);
Jb(2,2) = a(2,2) .* (U(i+1,2) - U(i-1,2))/ (2*dx) - 1;
Here my newton-code:
iter = 1;
iter_max = 100;
k=zeros(2,N);
k(:,1)=[0.4;0.6];
[w_1,f1] =burgers(n + c(1) * dt,uu + dt * (a(1,:) * k(:,iter)),iter,dx);
[w_2,f2] =burgers(n + c(2) * dt,uu + dt * (a(2,:) * k(:,iter)),iter,dx);
f1 = -k(1,iter) + f1;
f2 = -k(1,iter) + f2;
f(:,1)=f1;
f(:,2)=f2;
df = burJK(f,dx,iter+1);
while iter<iter_max-1 % K_newton
delta = df\f(iter,:)';
k(:,iter+1) = k(:,iter) - delta;
iter = iter+1;
[w_1,f1] =burgers(n + c(1) * dt,uu + dt * (a(1,:) * k(:,iter+1)),N,dx);
[w_2,f2] =burgers(n + c(2) * dt,uu + dt * (a(2,:) * k(:,iter+1)),N,dx);
f1 = -k(1,iter+1) + f1;
f2 = -k(1,iter+1) + f2;
f(:,1)=f1;
f(:,2)=f2;
df = burJK(f,dx,iter);
if iter>iter_max
disp('#');
else
disp('ok');
end
end
I'm a little rusty on exactly how to implement this in matlab, but I can walk your through the general steps and hopefully that will help. First we can consider the equation you are solving to fit the general class of problems that can be posed as
du/dt = F(u), where F is a linear or nonlinear function
For a Runge Kutta scheme you typically recast the problem something like this
k(i) = F(u+dt*a(i,i)*k(i)+ a(i,j)*k(j))
for a given stage. Now comes the tricky part, you you need to make 1-D vector constructed by stacking k(1) onto k(2). So the first half of the elements of the vector are k(1) and the second half are k(2). With this new combined vector you can then change F So that it operates on the two k's separately. This results in
K = FF(u+dt*a*K) where FF is F for the new double k vector, K
Ok, now we can implement the Newton's method. You will do this for each time step and until you have converged on the right answer and use it across all spatial points at the same time. What you do is you guess a K and compute the jacobian of G(K,U) = K-FF(FF(u+dt*a*K). G(K,U) should be only valued at zero when K is at the right solution. So in other words, do you Newton's method on K and when looking for convergence you need to see that it is converging at all spots. I would run the newton's method until max(abs(G(K,U)))< SolverTolerance.
Sorry I can't be more help on the matlab implementation, but hopefully I helped with explaining how to implement the newton's method.

Matlab: return complete solution of inverse cosine (acos)

I have some Matlab code of the following form:
syms theta x
theta = acos(x)
This returns a single solution for theta. However, I want to return the complete solution (between some limits).
For example,
x = cos(theta)
would give x=0.5 for theta = 60 degrees, 120 degrees, 420 degrees, etc. Therefore, in my code above, I want theta to return all these possible values.
Does anyone know how to do this? I have been searching google for hours but I can't find how to do this!
Many thanks!
Here's a numerical solution; like Benoit_11 I don't see the point of doing it symbolically in this context.
There are two solutions within the interval [-pi, pi], the larger one being returned by acos:
% solution within [0, pi]
theta1 = acos(x);
% solution within [-pi, 0]
theta2 = -acos(x);
These solutions repeat at steps of 2 pi. The number of possible steps downwards and upwards can be determined by the integer part of the distance between the basic solution and the respective limit (lower and upper), in units of 2 pi. For theta1:
% repetitions in 2 pi intervals within limits
sd = floor((theta1 - lower) / (2 * pi));
su = floor((upper - theta1) / (2 * pi));
theta1 = (-sd : su) * 2 * pi + theta1
And the same for theta2:
% repetitions in 2 pi intervals within limits
sd = floor((theta2 - lower) / (2 * pi));
su = floor((upper - theta2) / (2 * pi));
theta2 = (-sd : su) * 2 * pi + theta2
If you'd like one combined list of solutions, excluding possible duplicates:
theta = unique([theta1, theta2])
and in degrees:
theta = theta / pi * 180;
Example:
x = 0.5;
lower = -10;
upper = 30;
gives
theta =
-7.3304 -5.2360 -1.0472 1.0472 5.2360 7.3304 11.5192 13.6136 17.8024 19.8968 24.0855 26.1799
You just have to use a loop:
for x = limit_inf:step:limit_sup
theta(x) = acos(x);
end
And also define your limits and the step between them.

Evolution strategy with individual stepsizes

I'm trying to find a good solution with an evolution strategy for a 30 dimensional minimization problem. Now I have developed with success a simple (1,1) ES and also a self-adaptive (1,lambda) ES with one step size.
The next step is to create a (1,lambda) ES with individual stepsizes per dimension. The problem is that my MATLAB code doesn't work yet. I'm testing on the sphere objective function:
function f = sphere(x)
f = sum(x.^2);
end
The plotted results of the ES with one step size vs. the one with individual stepsizes:
The blue line is the performance of the ES with individual step sizes and the red one is for the ES with one step size.
The code for the (1,lambda) ES with multiple stepsizes:
% Strategy parameters
tau = 1 / sqrt(2 * sqrt(N));
tau_prime = 1 / sqrt(2 * N);
lambda = 10;
% Initialize
xp = (ub - lb) .* rand(N, 1) + lb;
sigmap = (ub - lb) / (3 * sqrt(N));
fp = feval(fitnessfct, xp');
evalcount = 1;
% Evolution cycle
while evalcount <= stopeval
% Generate offsprings and evaluate
for i = 1 : lambda
rand_scalar = randn();
for j = 1 : N
Osigma(j,i) = sigmap(j) .* exp(tau_prime * rand_scalar + tau * randn());
end
O(:,i) = xp + Osigma(:,i) .* rand(N,1);
fo(i) = feval(fitnessfct, O(:,i)');
end
evalcount = evalcount + lambda;
% Select best
[~, sortindex] = sort(fo);
xp = O(:,sortindex(1));
fp = fo(sortindex(1));
sigmap = Osigma(:,sortindex(1));
end
Does anybody see the problem?
Your mutations have a bias: they can only ever increase the parameters, never decrease them. sigmap is a vector of (scaled) upper minus lower bounds: all positive. exp(...) is always positive. Therefore the elements of Osigma are always positive. Then your change is Osigma .* rand(N,1), and rand(N,1) is also always positive.
Did you perhaps mean to use randn(N,1) instead of rand(N,1)? With that single-character change, I find that your code optimizes rather than pessimizing :-).