Piecewise linear function reduces to threshold function? - neural-network

I was reading a book where piecewise function is defined as
where varphi is the peicewise function
It says the special forms of piecewise-linear function is :
A linear combiner arises if the linear region of operation is maintained without running into saturation
The piecewise-linear function reduces to a threshold function if the amplification factor of the linear region is made infinitely large
I tried to understand no 1 as the function value will rise if the adder output value of NN sticks to this area: ( -.5 to .5 ) :
not in <-.5 and >.5 region. I don't know if that's correct.
As for no 2, i have no idea what is being implied. A threshold function is presented in the book as :
What is "amplification factor" and how it behave like threshold function when made infinitively large ?

Related

Double numerical integration in Matlab - Singularity

I have discrete data of a 2D function defined as
theta = linspace(0,pi,nTheta);
phi = linspace(0,2*pi,nPhi);
p=zeros(nPhi,nTheta);%only to show the dimension of my matrix
[np,nt]=ndgrid(phi,theta);
f1 = griddedInterpolant(np,nt,p,'spline');
f2= #(np,nt) f1(np,nt);
integral2(f2,0,2*pi,0,pi)
Note that p is calculated from a complex physical problem, but i showed above how it is initialized.
Also, I can increase nTheta and nPhi, which leads to more accurate calculation of p.
My calculated function (with nPhi=400,nTheta=200) is something like:
I tried 3 ways :
using Trapz function
using the code above but with linear interpolation for gridded interpolant
using the code above with spline interpolation
Although the spline is better than others, i still need to increase nPhi and nTheta, which makes it impossible for me to do the simulation due to its cost.
Is there any suggestion except these 3 methods or any general suggestion how i can do this computation more efficient? (I also took advantage of the symmetry in both directions)
Note that the shape of my function varies in each time step, so a local mesh refinement might be challenging because i don't know the detail of my function in advance.

Equation system with derivatives

I have to solve the following system:
X'(t) = -D(t)x(t)+μ(s(t), p(t))x(t);
S'(t) = D(t)(s(t)^in - s(t)) - Yxsμ(s(t), p(t))x(t)
p'(t) = -D(t)p(t)+(aμ(s(t), p(t))+b)x(t)
where
μ(s(t), p(t)) = Μmax ((1 - (p(t)/pm))s(t)) / (km+s(t)+(s(t) ^ 2)/ki)
where Yxs, a,b, Mmax, Pm, km, ki are constant variables, then I have to linearizate the system and find the balance Points of thiw system. any suggestion how to do it with Matlab or Mathematica??
Matlab can help with some steps, but there might be few where you do have to write down some equations yourself.
To start with a simple side note: Matlabs ODE45 function allows to simulate any function of the form dx/dt = f(x,u), regardless of how non-linear or time variant they might become.
to linearize such a system, you need to derive a jacobian matrix and substitute the linearization point in this matrix. This linearization point is any point where all state derivatives equal 0, it does not need to be a balance point. However, it is desired to have it a balance point as this means the linearization point is a stable equilibrium.
So in MATLAB:
create symbolic variables for all states and inputs (so x(t), s(t) and p(t))
create symbolic state equations dx/dt = f(x,u) and output equation y = g(x,u)
derive symbolic State space matrices A,B,C,D using the "jacobian" function
substitute linearization point in these symbolic state space matrices using "subs"
use eval(symbolic matrix) to retrieve a numeric matrix.
Depending on the non-linear complexity and the chosen linearization point, the linearized system might only within acceptable bounds from the actual system for a very tight region, so be aware of that.

Optimizing huge amounts of calls of fsolve in Matlab

I'm solving a pair of non-linear equations for each voxel in a dataset of a ~billion voxels using fsolve() in MATLAB 2016b.
I have done all the 'easy' optimizations that I'm aware of. Memory localization is OK, I'm using parfor, the equations are in fairly numerically simple form. All discontinuities of the integral are fed to integral(). I'm using the Levenberg-Marquardt algorithm with good starting values and a suitable starting damping constant, it converges on average with 6 iterations.
I'm now at ~6ms per voxel, which is good, but not good enough. I'd need a order of magnitude reduction to make the technique viable. There's only a few things that I can think of improving before starting to hammer away at accuracy:
The splines in the equation are for quick sampling of complex equations. There are two for each equation, one is inside the 'complicated nonlinear equation'. They represent two equations, one which is has a large amount of terms but is smooth and has no discontinuities and one which approximates a histogram drawn from a spectrum. I'm using griddedInterpolant() as the editor suggested.
Is there a faster way to sample points from pre-calculated distributions?
parfor i=1:numel(I1)
sols = fsolve(#(x) equationPair(x, input1, input2, ...
6 static inputs, fsolve options)
output1(i) = sols(1); output2(i) = sols(2)
end
When calling fsolve, I'm using the 'parametrization' suggested by Mathworks to input the variables. I have a nagging feeling that defining a anonymous function for each voxel is taking a large slice of the time at this point. Is this true, is there a relatively large overhead for defining the anonymous function again and again? Do I have any way to vectorize the call to fsolve?
There are two input variables which keep changing, all of the other input variables stay static. I need to solve one equation pair for each input pair so I can't make it a huge system and solve it at once. Do I have any other options than fsolve for solving pairs of nonlinear equations?
If not, some of the static inputs are the fairly large. Is there a way to keep the inputs as persistent variables using MATLAB's persistent, would that improve performance? I only saw examples of how to load persistent variables, how could I make it so that they would be input only once and future function calls would be spared from the assumedly largish overhead of the large inputs?
EDIT:
The original equations in full form look like:
Where:
and:
Everything else is known, solving for x_1 and x_2. f_KN was approximated by a spline. S_low (E) and S_high(E) are splines, the histograms they are from look like:
So, there's a few things I thought of:
Lookup table
Because the integrals in your function do not depend on any of the parameters other than x, you could make a simple 2D-lookup table from them:
% assuming simple (square) range here, adjust as needed
[x1,x2] = meshgrid( linspace(0, xmax, N) );
LUT_high = zeros(size(x1));
LUT_low = zeros(size(x1));
for ii = 1:N
LUT_high(:,ii) = integral(#(E) Fhi(E, x1(1,ii), x2(:,ii)), ...
0, E_high, ...
'ArrayValued', true);
LUT_low(:,ii) = integral(#(E) Flo(E, x1(1,ii), x2(:,ii)), ...
0, E_low, ...
'ArrayValued', true);
end
where Fhi and Flo are helper functions to compute those integrals, vectorized with scalar x1 and vector x2 in this example. Set N as high as memory will allow.
Those lookup tables you then pass as parameters to equationPair() (which allows parfor to distribute the data). Then just use interp2 in equationPair():
F(1) = I_high - interp2(x1,x2,LUT_high, x(1), x(2));
F(2) = I_low - interp2(x1,x2,LUT_low , x(1), x(2));
So, instead of recomputing the whole integral every time, you evaluate it once for the expected range of x, and reuse the outcomes.
You can specify the interpolation method used, which is linear by default. Specify cubic if you're really concerned about accuracy.
Coarse/Fine
Should the lookup table method not be possible for some reason (memory limitations, in case the possible range of x is too big), here's another thing you could do: split up the whole procedure in 2 parts, which I'll call coarse and fine.
The intent of the coarse method is to improve your initial estimates really quickly, but perhaps not so accurately. The quickest way to approximate that integral by far is via the rectangle method:
do not approximate S with a spline, just use the original tabulated data (so S_high/low = [S_high/low#E0, S_high/low#E1, ..., S_high/low#E_high/low]
At the same values for E as used by the S data (E0, E1, ...), evaluate the exponential at x:
Elo = linspace(0, E_low, numel(S_low)).';
integrand_exp_low = exp(x(1)./Elo.^3 + x(2)*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high)).';
integrand_exp_high = exp(x(1)./Ehi.^3 + x(2)*fKN(Ehi));
then use the rectangle method:
F(1) = I_low - (S_low * Elo) * (Elo(2) - Elo(1));
F(2) = I_high - (S_high * Ehi) * (Ehi(2) - Ehi(1));
Running fsolve like this for all I_low and I_high will then have improved your initial estimates x0 probably to a point close to "actual" convergence.
Alternatively, instead of the rectangle method, you use trapz (trapezoidal method). A tad slower, but possibly a bit more accurate.
Note that if (Elo(2) - Elo(1)) == (Ehi(2) - Ehi(1)) (step sizes are equal), you can further reduce the number of computations. In that case, the first N_low elements of the two integrands are identical, so the values of the exponentials will only differ in the N_low + 1 : N_high elements. So then just compute integrand_exp_high, and set integrand_exp_low equal to the first N_low elements of integrand_exp_high.
The fine method then uses your original implementation (with the actual integral()s), but then starting at the updated initial estimates from the coarse step.
The whole objective here is to try and bring the total number of iterations needed down from about 6 to less than 2. Perhaps you'll even find that the trapz method already provides enough accuracy, rendering the whole fine step unnecessary.
Vectorization
The rectangle method in the coarse step outlined above is easy to vectorize:
% (uses R2016b implicit expansion rules)
Elo = linspace(0, E_low, numel(S_low));
integrand_exp_low = exp(x(:,1)./Elo.^3 + x(:,2).*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high));
integrand_exp_high = exp(x(:,1)./Ehi.^3 + x(:,2).*fKN(Ehi));
F = [I_high_vector - (S_high * integrand_exp_high) * (Ehi(2) - Ehi(1))
I_low_vector - (S_low * integrand_exp_low ) * (Elo(2) - Elo(1))];
trapz also works on matrices; it will integrate over each column in the matrix.
You'd call equationPair() then using x0 = [x01; x02; ...; x0N], and fsolve will then converge to [x1; x2; ...; xN], where N is the number of voxels, and each x0 is 1×2 ([x(1) x(2)]), so x0 is N×2.
parfor should be able to slice all of this fairly easily over all the workers in your pool.
Similarly, vectorization of the fine method should also be possible; just use the 'ArrayValued' option to integral() as shown above:
F = [I_high_vector - integral(#(E) S_high(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_high,...
'ArrayValued', true);
I_low_vector - integral(#(E) S_low(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_low,...
'ArrayValued', true);
];
Jacobian
Taking derivatives of your function is quite easy. Here is the derivative w.r.t. x_1, and here w.r.t. x_2. Your Jacobian will then have to be a 2×2 matrix
J = [dF(1)/dx(1) dF(1)/dx(2)
dF(2)/dx(1) dF(2)/dx(2)];
Don't forget the leading minus sign (F = I_hi/lo - g(x) → dF/dx = -dg/dx)
Using one or both of the methods outlined above, you can implement a function to compute the Jacobian matrix and pass this on to fsolve via the 'SpecifyObjectiveGradient' option (via optimoptions). The 'CheckGradients' option will come in handy there.
Because fsolve usually spends the vast majority of its time computing the Jacobian via finite differences, manually computing a value for it manually will normally speed the algorithm up tremendously.
It will be faster, because
fsolve doesn't have to do extra function evaluations to do the finite differences
the convergence rate will increase due to the improved precision of the Jacobian
Especially if you use the rectangle method or trapz like above, you can reuse many of the computations you've already done for the function values themselves, meaning, even more speed-up.
Rody's answer was the correct one. Supplying the Jacobian was the single largest factor. Especially with the vectorized version, there were 3 orders of magnitude of difference in speed with the Jacobian supplied and not.
I had trouble finding information about this subject online so I'll spell it out here for future reference: It is possible to vectorize independant parallel equations with fsolve() with great gains.
I also did some work with inlining fsolve(). After supplying the Jacobian and being smarter about the equations, the serial version of my code was mostly overhead at ~1*10^-3 s per voxel. At that point most of the time inside the function was spent passing around a options -struct and creating error-messages which are never sent + lots of unused stuff assumedly for the other optimization functions inside the optimisation function (levenberg-marquardt for me). I succesfully butchered the function fsolve and some of the functions it calls, dropping the time to ~1*10^-4s per voxel on my machine. So if you are stuck with a serial implementation e.g. because of having to rely on the previous results it's quite possible to inline fsolve() with good results.
The vectorized version provided the best results in my case, with ~5*10^-5 s per voxel.

Backpropagation for rectified linear unit activation with cross entropy error

I'm trying to implement gradient calculation for neural networks using backpropagation.
I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.
I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. Cross entropy (CE) error with sigmoid activation gradient is computed correctly. However, when I change activation to ReLU - it fails. (I'm skipping tanh for CE as it retuls values in (-1,1) range.)
Is it because of the behavior of log function at values close to 0 (which is returned by ReLUs approx. 50% of the time for normalized inputs)?
I tried to mitiage that problem with:
log(max(y,eps))
but it only helped to bring error and gradients back to real numbers - they are still different from numerical gradient.
I verify the results using numerical gradient:
num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)
The following matlab code presents a simplified and condensed backpropagation implementation used in my experiments:
function [f, df] = backprop(W, X, Y)
% W - weights
% X - input values
% Y - target values
act_type='relu'; % possible values: sigmoid / tanh / relu
error_type = 'CE'; % possible values: SE / CE
N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);
% feedforward
X=[X ones(N,1)];
z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
z3=a2*w2'; y=act(z3,act_type);
if strcmp(error_type, 'CE') % cross entropy error - logistic cost function
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
else % squared error
f=0.5*sum(sum((y-Y).^2));
end
% backprop
if strcmp(error_type, 'CE') % cross entropy error
d3=y-Y;
else % squared error
d3=(y-Y).*dact(z3,act_type);
end
df2=d3'*a2;
d2=d3*w2(:,1:end-1).*dact(z2,act_type);
df1=d2'*X;
df=[df1(:);df2(:)];
end
function f=act(z,type) % activation function
switch type
case 'sigmoid'
f=1./(1+exp(-z));
case 'tanh'
f=tanh(z);
case 'relu'
f=max(0,z);
end
end
function df=dact(z,type) % derivative of activation function
switch type
case 'sigmoid'
df=act(z,type).*(1-act(z,type));
case 'tanh'
df=1-act(z,type).^2;
case 'relu'
df=double(z>0);
end
end
Edit
After another round of experiments, I found out that using a softmax for the last layer:
y=bsxfun(#rdivide, exp(z3), sum(exp(z3),2));
and softmax cost function:
f=-sum(sum(Y.*log(y)));
make the implementaion working for all activation functions including ReLU.
This leads me to conclusion that it is the logistic cost function (binary clasifier) that does not work with ReLU:
f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
However, I still cannot figure out where the problem lies.
Every squashing function sigmoid, tanh and softmax (in the output layer)
means different cost functions.
Then makes sense that a RLU (in the output layer) does not match with the cross entropy cost function.
I will try a simple square error cost function to test a RLU output layer.
The true power of RLU is in the hidden layers of a deep net since it not suffer from gradient vanishing error.
If you use gradient descendent you need to derive the activation function to be used later in the back-propagation approach. Are you sure about the 'df=double(z>0)'?. For the logistic and tanh seems to be right.
Further, are you sure about this 'd3=y-Y' ? I would say this is true when you use the logistic function but not for the ReLu (the derivative is not the same and therefore will not lead to that simple equation).
You could use the softplus function that is a smooth version of the ReLU, which the derivative is well known (logistic function).
I think the flaw lies in comapring with the numerically computed derivatives. In your derivativeActivation function , you define the derivative of ReLu at 0 to be 0. Where as numerically computing the derivative at x=0 shows it to be
(ReLU(x+epsilon)-ReLU(x-epsilon)/(2*epsilon)) at x =0 which is 0.5. Therefore, defining the derivative of ReLU at x=0 to be 0.5 will solve the problem
I thought I'd share my experience I had with similar problem. I too have designed my multi classifier ANN in a way that all hidden layers use RELU as non-linear activation function and the output layer uses softmax function.
My problem was related to some degree to numerical precision of the programming language/platform I was using. In my case I noticed that if I used "plain" RELU not only does it kill the gradient but the programming language I used produced the following softmax output vectors (this is just a example sample):
⎡1.5068230536681645e-35⎤
⎢ 2.520367499064734e-18⎥
⎢3.2572859518007807e-22⎥
⎢ 1⎥
⎢ 5.020155103452967e-32⎥
⎢1.7620297760773188e-18⎥
⎢ 5.216008990667109e-18⎥
⎢ 1.320937038894421e-20⎥
⎢2.7854159049317976e-17⎥
⎣1.8091246170996508e-35⎦
Notice the values of most of the elements are close to 0, but most importantly notice the 1 value in the output.
I used a different cross-entropy error function than the one you used. Instead of calculating log(max(1-y, eps)) I stuck to the basic log(1-y). So given the output vector above, when I calculated log(1-y) I got the -Inf as a result of cross-entropy, which obviously killed the algorithm.
I imagine if your eps is not reasonably high enough so that log(max(1-y, eps)) -> log(max(0, eps)) doesn't yield way too small log output you might be in a similar pickle like myself.
My solution to this problem was to use Leaky RELU. Once I've started using it, I could carry on using the multi classifier cross-entropy as oppose to softmax-cost function you decided to try.

Run a function in between each iteration of fsolve in MATLAB

I am using fsolve to minimise an energy function in MATLAB. The algorithm I am using fits a grid to noisy lattice data, with costs for the distances of the grid from each data point.
The objective function is formulated with squared error terms, to allow for the Gauss–Newton algorithm to be used. However, the program reverts to Levenberg-Marquardt:
Warning: Trust-region-dogleg algorithm of FSOLVE cannot handle non-square systems;
using Levenberg-Marquardt algorithm instead.
I realised this is probably due to the fact that while the costs have squared errors, there is a stage in the objective (cost) function that chooses the nearest grid centre to each data point, thus making the algorithm non-square.
What I would like to do is to perform this assignment update of nearest grid centres separately to the evaluation of the Jacobian of the cost function. I believe this would then allow Gauss-Newton to be used, and significantly improve the speed of the algorithm.
Currently, I believe there is something like this going on:
while i < options.MaxIter && threshold has not been met
Compute Jacobian of cost function (which includes assignment routine)
Move down the slope in the direction of highest gradient
end
What i would like to happen instead:
while i < options.MaxIter && threshold has not been met
Perform assignment routine
Compute Jacobian of cost function (which is now square, as no assignment occurs)
Move down the slope
end
Is there a way to insert a function like this into the iterations without picking apart the whole of the fsolve algorithm? Even if I manually edited fsolve, would the nature of the Gauss-Newton algorithm allow me to add in this extra step?
Thanks
Since you're working with squared errors, anyway, you could use LSQNONLIN instead of fsovle. This allows you to compute the Jacobian (as well as all necessary preparations) in your objective function. The Jacobian is then returned as second output argument.