I am using fsolve to minimise an energy function in MATLAB. The algorithm I am using fits a grid to noisy lattice data, with costs for the distances of the grid from each data point.
The objective function is formulated with squared error terms, to allow for the Gauss–Newton algorithm to be used. However, the program reverts to Levenberg-Marquardt:
Warning: Trust-region-dogleg algorithm of FSOLVE cannot handle non-square systems;
using Levenberg-Marquardt algorithm instead.
I realised this is probably due to the fact that while the costs have squared errors, there is a stage in the objective (cost) function that chooses the nearest grid centre to each data point, thus making the algorithm non-square.
What I would like to do is to perform this assignment update of nearest grid centres separately to the evaluation of the Jacobian of the cost function. I believe this would then allow Gauss-Newton to be used, and significantly improve the speed of the algorithm.
Currently, I believe there is something like this going on:
while i < options.MaxIter && threshold has not been met
Compute Jacobian of cost function (which includes assignment routine)
Move down the slope in the direction of highest gradient
end
What i would like to happen instead:
while i < options.MaxIter && threshold has not been met
Perform assignment routine
Compute Jacobian of cost function (which is now square, as no assignment occurs)
Move down the slope
end
Is there a way to insert a function like this into the iterations without picking apart the whole of the fsolve algorithm? Even if I manually edited fsolve, would the nature of the Gauss-Newton algorithm allow me to add in this extra step?
Thanks
Since you're working with squared errors, anyway, you could use LSQNONLIN instead of fsovle. This allows you to compute the Jacobian (as well as all necessary preparations) in your objective function. The Jacobian is then returned as second output argument.
Related
I have discrete data of a 2D function defined as
theta = linspace(0,pi,nTheta);
phi = linspace(0,2*pi,nPhi);
p=zeros(nPhi,nTheta);%only to show the dimension of my matrix
[np,nt]=ndgrid(phi,theta);
f1 = griddedInterpolant(np,nt,p,'spline');
f2= #(np,nt) f1(np,nt);
integral2(f2,0,2*pi,0,pi)
Note that p is calculated from a complex physical problem, but i showed above how it is initialized.
Also, I can increase nTheta and nPhi, which leads to more accurate calculation of p.
My calculated function (with nPhi=400,nTheta=200) is something like:
I tried 3 ways :
using Trapz function
using the code above but with linear interpolation for gridded interpolant
using the code above with spline interpolation
Although the spline is better than others, i still need to increase nPhi and nTheta, which makes it impossible for me to do the simulation due to its cost.
Is there any suggestion except these 3 methods or any general suggestion how i can do this computation more efficient? (I also took advantage of the symmetry in both directions)
Note that the shape of my function varies in each time step, so a local mesh refinement might be challenging because i don't know the detail of my function in advance.
I am processing a set of data using ridge regression. I found a very interesting phenomenon when apply the learned function to data. Namely, when the ridge parameter increases from zero, the test error keeps increasing. But if we penalize small coefficients(set the parameter <0), the test error can even be smaller.
This is my matlab code:
for i = 1:100
beta = ridgePolyRegression(ty_train,tX_train,lambda(i));
sqridge_train_cost(i) = computePolyCostMSE(ty_train,tX_train,beta);
sqridge_test_cost(i) = computePolyCostMSE(ty_valid,tX_valid,beta);
end
plot(lambda,sqridge_test_cost,'color','b');
lambda is the ridge parameter. ty_train is the output of the training data, tX_train is the input of training data. Also, we use a quadratic function regression here.
function [ beta ] = ridgePolyRegression( y,tX,lambda )
X = tX(:,2:size(tX,2));
tX2 = [tX,X.^2];
beta = (tX2'*tX2 + lambda * eye(size(tX2,2))) \ (tX2'*y);
end
The plotted picture is:
Why the error is minimal when lambda is negative? Is it a sign of under-fitting?
You should not use negative lambdas.
From (probabilistic) theoretic point of view, lambda relates to the inverse of variance of parameter prior distribution, and variance can't be negative.
From computational point of view, it can (given it's less that the smallest eigenvalue of the covariance matrix) turn your positive-definite form into an indefinite form, which means you'll have not a maximum, but a saddle point. It also means there are points where your target function is as small (or as big) as you want, so you can reduce loss indefinitely and no minimum / maximum exists at all.
Your optimization algorithm gives you just a stationary point, which will be a global maximum if and only if the form is positive definite.
Short Answer: When lambda is negative, you're actually overfitting your data. Hence, it's reasonable to get much less error.
Long Answer:
The regularization term (or the penalty term as described by many statisticians) aims to penalize the weights (or the betas as written in the coming Eq.) for going too high (overfitting) and going too low (underfitting). Giving you the power to control how your model behaves, and you usually aim the "right fitting" model.
For mathematical intuition, you can check the following Eq. (P. S. Equation is screenshotted from Elements of Statistical Learning by Trevor Hastie et. al)
When you decide to make your lambda negative, the penalty term is indeed turned into a utility term that helps to increase the weights (i.e., overfitting).
Overfitting is, simply, understanding your data along with the features more than you should, because you do not have the whole population yet; therefore, what you understood so far is possibly wrong on a different dataset.
So, you should never be using negative values of lambdas.
I need to numerically integrate the following system of ODEs:
dA/dR = f(R,A,B)
dB/dR = g(R,A,B)
I'm solving the ODEs for a Initial-value stability problem. In this problem, the system is initially stable but goes unstable at some radius. However, whilst stable, I don't want the amplitude to decay away from the starting value (to O(10^-5) for example) as this is non-physical since the system's stability is limited to the background noise amplitude. The amplitude should remain at the starting value of 1 until the system destabilises. Hence, I want to overwrite the derivative estimate to zero whenever it is negative.
I've written some 4th order Runge-Kutta code that achieves this, but I'd much prefer to simply pass ODE45 (or any of the built in solvers) a parameter to make it overwrite the derivative whenever it is negative. Is this possible?
A simple, fast, efficient way to implement this is via the max function. For example, if you want to make sure all of your derivatives remain non-negative, in your integration function:
function ydot = f(x,y)
ydot(1) = ...
ydot(2) = ...
...
ydot = max(ydot,0);
Note that this is not the same thing as the output states returned by ode45 remaining non-negative. The above should ensure that your state variables never decay.
Note, however, that that this effectively makes your integration function stiff. You might consider using a solver like ode15s instead, or at least confirming that the results are consistent with those from ode45. Alternatively, you could use a continuous sigmoid function, instead of the discontinuous, step-like max. This is partly a modeling decision.
I've been trying to write a MATLAB-function which calculates the a root of a function simply by using Newton-Raphson. The problem with this algorithm is it diverges near torsion points and roots with oscillations (e.g for x^2+2 after 10 iteration with initial guess -1 the method diverges). Is there any satisfying condition to identify when we get oscillations and torsions which doesn't count iterations in a really inefficient way?
You may be interested in the Matlab File Exchange entry called "Newton Raphson Solver with adaptive Step Size". It implements the Newton-Raphson method to extract the roots of a polynomial.
In particular, this function has a while staement on line 147. Simply replace
while( err > ConvCrit && n < maxIter)
with
while( err > ConvCrit) %removing the maximum iteration criterion
I'd argue that your initial estimate of -1 is poor, hence the estimation error is large and this is what probably causes the algorithm to overshoot and oscillate (and eventually diverge).
You could consider doing successive over-relaxation by multiplying the quotient f(xn)/f'(xn) by a positive factor. I recommended you to look up methods for adaptive successive over-relaxation (which I won't elaborate here), that (adaptively) set the relaxation parameter iteratively based on the observed behavior of the converging process.
I think that you are trying to solve a function which does not have any real roots... so the newton raphson is not going to provide you any results for any initial guess..
This is my first post to stackoverflow, so if this isn't the correct area I apologize. I am working on minimizing a L1-Regularized System.
This weekend is my first dive into optimization, I have a basic linear system Y = X*B, X is an n-by-p matrix, B is a p-by-1 vector of model coefficients and Y is a n-by-1 output vector.
I am trying to find the model coefficients, I have implemented both gradient descent and coordinate descent algorithms to minimize the L1 Regularized system. To find my step size I am using the backtracking algorithm, I terminate the algorithm by looking at the norm-2 of the gradient and terminating if it is 'close enough' to zero(for now I'm using 0.001).
The function I am trying to minimize is the following (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: By norm(Y,2) I mean the norm-2 value of the vector Y) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero I should converge on the least squares solution, I can verify that both my algorithms do this pretty well and fairly quickly.
If I start to increase lambda my model coefficients all tend towards zero, this is what I expect, my algorithms never terminate though because the norm-2 of the gradient is always positive number. For example, a lambda of 1000 will give me coefficients in the 10^(-19) range but the norm2 of my gradient is ~1.5, this is after several thousand iterations, While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (10^(-37) range). If I let the algorithm run for longer the situation does not improve, it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) number for the termination condition. They also work quite well with lambda of 0. If I use a very small lambda(say 0.001) I get convergence, a lambda of 0.1 looks like it would converge if I ran it for an hour or two, a lambda any greater and the convergence rate is so small it's useless.
I had a few questions that I think might relate to the problem?
In calculating the gradient I am using a finite difference method (f(x+h) - f(x-h))/(2h)) with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence, if the convergence rate is extremely slow then terminate. Is this a common termination method?
The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose, wisely, a tolerance eps and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add the number in [-lambda, lambda] to v[i] that minimises v[i].
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.