How does MATLAB's fit() function differentiate arbitrary MATLAB expressions for Levenberg-Marquardt to be applicable? - matlab

As far as I understand the LM algorithm, it is an improvement over the Newton's method, so very roughly speaking, an algorithm which tries to build a path in the parameter space, leading to the point where the error function is minimal, which follows the direction of the biggest gradient of error function (differentiated with respect to the parameters).
I have written a Newton's method optimizer for a neural network once, as an exercise, and the critical part of the algorithm was that we could apply the chain rule (error backpropagation) to compute the gradient. And it was me who used the chain rule to the write out a formula for the gradients. (Essentially by symbolic differentiating on paper once and coding the resulting formula.)
In MATLAB (Curve Fitting Toolbox), there is a standard fit() function, which claims to use Levenberg-Marquardt's method to fit basically any parametric MATLAB expression as well as a set of prepared models.
Well, I suspect that the prepared models could be pre-differentiated by Mathworks' engineers to generate the code for the gradients. But what about the 'arbitrary' fits?
Is MATLAB trying to do symbolic differentiation implicitly? I highly doubt that anyone can write rules for differentiation of all the complex MATLAB constructions, i.e. classes and enumerations.
Or, maybe MATLAB is just differentiating the function by evaluating it in ξ and ξ+Δξ and dividing by Δξ? But that would require finding the best shift and require n+1 function evaluations, where n is the number of parameters optimized.
And in any case, even this strategy would fail if the function is not differentiable, which I suspect to be the case for almost any general MATLAB expression.
Could anyone give a plausible hypothesis of what is actually happening inside?
(Well, knowing what actually happens inside would be even better, but even an informal insight would be great.)

Related

integral or trapz, which one is the more appropriate in MATLAB?

I'm computing multiple integrals using MATLAB.
I'm using the integral function to compute the integral but I was wondering is it faster to use trapz instead of using integral?
I know that trapz introduces a bit of error in the computation, but despite that, with is the best function to compute integrals in MATLAB?
Short and sweet:
Use trapz for discrete data or for selected functional data if you don't care about (potentially extremely) low accuracy of the integral value
Use integral for integrands that have a functional form, adjusting tolerances as needed for speed.
As mentioned by the MATLAB documentation, trapz is intended "to perform numerical integrations on discrete data sets" and leverages the trapezoidal rule for the integrations. The error between the true integral and the trapz approximation is almost entirely dependent on the input x vector (sometimes called the abscissa in integration parlance) with no automatic adaptability. The good part is that if the underlying function is "nice" (i.e., continuous, smooth, no sharp peaks or excessive oscillations, etc.), trapz will likely be the fastest function to approximate the integral since it
Doesn't have to call a function for values (they're input)
Doesn't automatically adapt (which takes time and can be complex to
implement).
However, for general integrals, trapz may also be the most inaccurate and may require a denser x vector to calculate a low-error value.
For discrete data, this is a short-coming that must be lived with, but if the integrand has a functional form, integral and its family is highly recommended.
The black-box numeric integrators in MATLAB have evolved over the years, and MathWorks co-founder Clever Moler has a nice blog post going over some of the evolutions. The post discusses the quad, quadl, and quadgk functions and how quadgk became the core for integral and its ilk. The basic breakdown of the three functions is
quad uses a three-point and five-point Simpson's Rule
quadl uses a four-seven-thirteen point1 Lobatto-Kronrod2 rule
quadgk a uses seven-fifteen point Gauss-Kronrod2 rule
to acquire both an approximation of the integral and an error approximation for adaptive quadrature. The summary of the history lesson and test problems is that quadgk was written with vectorization incorporated3, uses a higher-order rule which excludes end-points, and gives extremely accurate answers faster than its competitors. As a result, quadgk is the core of the new and highly-recommended integral family.
1 Adaptive quadrature usually lists the number of points used to form its approximation of the value and the error. Typically, there are two numbers that indicate the number of points to form the low-order and high-order approximations. quadl is interesting in that it uses a four-point Gauss-Lobatto rule and seven-point and thirteen-point Kronrod extensions for its error handling.
2 Gaussian Quadrature, which is an integration technique that chooses it abscissa to exactly integrate a family of polynomials over a given interval instead of prescribing them as in Newton-Cotes, has a lot of names associated with it to indicate a lot of "stuff" that's going on without being explicit about it (which can be very annoying to newcomers). "Gauss" refers to the aforementioned method of choosing abscissa and associated weights for the integration. "Lobatto" indicates an extension to Gauss-Legendre integration methods that incorporates end-points (others may not like my link between these two, but I find the parallels pleasing). "Kronrod" indicates an extension to any particular Gauss rule that creates a high-order rule using a given set of abscissa and adding to it; this creates a "nesting" (the low-order points are part of the high-order point set) that results in fewer function evaluations overall.
3 Since vectorization is written into integral, integrands or limits that are vector-valed must use the 'ArrayValued' flag to tell the program to make functional evaluations differently so as not to create a size-mismatch error. It might be possible to program around this to a certain extent, but the MathWorks decided not to.

Who knows the computational complexity of the function quadprog in MATLAB?

The QP problem is convex. For Wiki, the problem can be solved in polynomial time.
But what exactly is the order?
That is an interesting question with (in my opinion) no clear answer. I am going to assume your problem is convex and you are interested in run-time complexity (as opposed to Iteration complexity).
As you may know, QuadProg is not one algorithm but rather, a generic name for something that solves Quadratic problems. It uses a set of algorithms underneath viz. Interior Point (Default), Trust-Region and Active-Set. Source.
Depending upon what you choose, each of these algorithms will have its own complexity analysis. For Trust-Region and Active-Set methods, the complexity analysis is extremely hard. In fact, Active-Set methods are not polynomial to begin with. Counterexamples exist where Active-Set methods take exponential "time" to converge (This is true also for the Simplex Method for Linear Programs). Source.
Now, assuming that you choose Interior Point methods, the answer is still not straightforward because there are various flavours of these methods. When Karmarkar first proposed this method, it was the first known polynomial algorithm for solving Linear Programs and it had a complexity of O(n^3.5). Source. These bounds were improved quite a lot later. However, this is for Linear Programs.
Finally, to answer your question, Ye and Tse proved in 1989 that we can have an Interior Point method with complexity O(n^3). However, whether MATLAB uses this exact flavor of Interior Point method is a little tricky to know but O(n^3) would be my best guess.
Of course, my answer is rather theoretical; if you want to empirically test it out, you can do so by gradually increasing the number of variables and plotting the CPU time required to get an estimate.

How calculating hessian works for Neural Network learning

Can anyone explain to me in a easy and less mathematical way what is a Hessian and how does it work in practice when optimizing the learning process for a neural network ?
To understand the Hessian you first need to understand Jacobian, and to understand a Jacobian you need to understand the derivative
Derivative is the measure of how fast function value changes withe the change of the argument. So if you have the function f(x)=x^2 you can compute its derivative and obtain a knowledge how fast f(x+t) changes with small enough t. This gives you knowledge about basic dynamics of the function
Gradient shows you in multidimensional functions the direction of the biggest value change (which is based on the directional derivatives) so given a function ie. g(x,y)=-x+y^2 you will know, that it is better to minimize the value of x, while strongly maximize the vlaue of y. This is a base of gradient based methods, like steepest descent technique (used in the traditional backpropagation methods).
Jacobian is yet another generalization, as your function might have many values, like g(x,y)=(x+1, xy, x-z), thus you now have 23 partial derivatives, one gradient per each output value (each of 2 values) thus forming together a matrix of 2*3=6 values.
Now, derivative shows you the dynamics of the function itself. But you can go one step further, if you can use this dynamics to find the optimum of the function, maybe you can do even better if you find out the dynamics of this dynamics, and so - compute derivatives of second order? This is exactly what Hessian is, it is a matrix of second order derivatives of your function. It captures the dynamics of the derivatives, so how fast (in what direction) does the change change. It may seem a bit complex at the first sight, but if you think about it for a while it becomes quite clear. You want to go in the direction of the gradient, but you do not know "how far" (what is the correct step size). And so you define new, smaller optimization problem, where you are asking "ok, I have this gradient, how can I tell where to go?" and solve it analogously, using derivatives (and derivatives of the derivatives form the Hessian).
You may also look at this in the geometrical way - gradient based optimization approximates your function with the line. You simply try to find a line which is closest to your function in a current point, and so it defines a direction of change. Now, lines are quite primitive, maybe we could use some more complex shapes like.... parabolas? Second derivative, hessian methods are just trying to fit the parabola (quadratic function, f(x)=ax^2+bx+c) to your current position. And based on this approximation - chose the valid step.

Scipy/Python indirect spline interpolation

I need to fit data in quite an indirect way. The original data to be recovered in the fit is some linear function with small oscillations and drifts on it, that I would like to identify. Let's call this f(t). We can not record this parameter in the experiment directly, but only indirectly, let's say as g(f) = sin(a f(t)). (The real transfer funcion is more complex, but it should not play a role in here)
So if f(t) changes direction towards the turning points of the sin function, it is difficult to identify and I tried an alternative approach to recover f(t) than just the inverse function of g and some data continuing guesses:
I create a model function fm(t) which undergoes the same and known transfer function g() and fit g(fm(t)) to the data. As the dataset is huge, I do this piecewise for successive chunks of data guaranteeing the continuity of fm across the whole set.
A first try was to use linear functions using the optimize.leastsq, where the error estimate is derived from g(fm). It is not completely satisfactory, and I think it would be far better to fit a spline to the data to get fspline(t) as a model for f(t), guaranteeing the continuity of the data and of its derivative.
The problem with it is, that spline fitting from the interpolate package works on the data directly, so I can not wrap the spline using g(fspline) and do the spline interpolation on this. Is there a way this can be done in scipy?
Any other ideas?
I tried quadratic functions and fixing the offset and slope such to match the ones of the preceeding fitted chunk of data, so there is only one fitting parameter, the curvature, which very quickly starts to deviate
Thanks
What you would need is a matrix of spline basis functions, b(t), so you can approximate f(t) as a linear combination of spline basis function
f(t) = np.dot(b(t), coefs)
and then estimate the coefficients, coefs, by optimize.leastsq.
However, spline basis functions are not readily available in python, as far as I know (unless you borrow experimental scripts or search through the code of some packages).
Instead you could also use polynomials, for example
b(t) = np.polynomial.chebvander(t, order)
and use a polynomial approximation instead of the splines.
The structure of this problem is very similar to generalized linear models where g is your known link function and similar to index problems in econometrics.
It would be possible to use the scipy splines in an indirect way if you create artificial data
y_i = f(t_i)
where f(t_i) are scipy.interpolate splines, and the y_i are the parameters to be estimated in the least squares optimization. (Loosely based on a script that I saw some time ago that used this for creating a different kind of smoothing splines than the scipy version. I don't remember where I saw this.)
Thank you for these comments. I tried out the polynomial basis suggested above, but polynomials are no option for my needs, ads they tend to create ringing, which is difficult to condition.
The solution on using splines I now found is quite simple and straightforward, and I think it is what you meant by "using the splines in an indirect way".
The fitting function f(t) is obtained by the interpolate.splev(x, (t,c,k)) function, but providing the spline coefficients c by the omptimize.leastsq function. In this way, f(t) is no direct spline fit (as one would usually obtain with the splrep(x, y) function) but indirectly optimized in the fit, and therefore it is possible to use the link function g on it. The initial guess for c might be obtained by one evaluation of splrep(xinit, yinit, t=knots) on model data.
One trick is to restrict the number of knots for the spline to below the number of datapoints by explicitly specifying them during the function call of splrep() and giving this reduced set during the evaluation using splev().

Looking for ODE integrator/solver with a relaxed attitude to derivative precision

I have a system of (first order) ODEs with fairly expensive to compute derivatives.
However, the derivatives can be computed considerably cheaper to within given error bounds, either because the derivatives are computed from a convergent series and bounds can be placed on the maximum contribution from dropped terms, or through use of precomputed range information stored in kd-tree/octree lookup tables.
Unfortunately, I haven't been able to find any general ODE solvers which can benefit from this; they all seem to just give you coordinates and want an exact result back. (Mind you, I'm no expert on ODEs; I'm familiar with Runge-Kutta, the material in the Numerical Recipies book, LSODE and the Gnu Scientific Library's solver).
ie for all the solvers I've seen, you provide a derivs callback function accepting a t and an array of x, and returning an array of dx/dt back; but ideally I'm looking for one which gives the callback t, xs, and an array of acceptable errors, and receives dx/dt_min and dx/dt_max arrays back, with the derivative range guaranteed to be within the required precision. (There are probably numerous equally useful variations possible).
Any pointers to solvers which are designed with this sort of thing in mind, or alternative approaches to the problem (I can't believe I'm the first person wanting something like this) would be greatly appreciated.
Roughly speaking, if you know f' up to absolute error eps, and integrate from x0 to x1, the error of the integral coming from the error in the derivative is going to be <= eps*(x1 - x0). There is also discretization error, coming from your ODE solver. Consider how big eps*(x1 - x0) can be for you and feed the ODE solver with f' values computed with error <= eps.
I'm not sure this is a well-posed question.
In many algorithms, e.g, nonlinear equation solving, f(x) = 0, an estimate of a derivative f'(x) is all that's required for use in something like Newton's method since you only need to go in the "general direction" of the answer.
However, in this case, the derivative is a primary part of the (ODE) equation you're solving - get the derivative wrong, and you'll just get the wrong answer; it's like trying to solve f(x) = 0 with only an approximation for f(x).
As another answer has suggested, if you set up your ODE as applied f(x) + g(x) where g(x) is an error term, you should be able to relate errors in your derivatives to errors in your inputs.
Having thought about this some more, it occurred to me that interval arithmetic is probably key. My derivs function basically returns intervals. An integrator using interval arithmetic would maintain x's as intervals. All I'm interested in is obtaining a sufficiently small error bound on the xs at a final t. An obvious approach would be to iteratively re-integrate, improving the quality of the sample introducing the most error each iteration until we finally get a result with acceptable bounds (although that sounds like it could be a "cure worse than the disease" with regards to overall efficiency). I suspect adaptive step size control could fit in nicely in such a scheme, with step size chosen to keep the "implicit" discretization error comparable with the "explicit error" ie the interval range).
Anyway, googling "ode solver interval arithmetic" or just "interval ode" turns up a load of interesting new and relevant stuff (VNODE and its references in particular).
If you have a stiff system, you will be using some form of implicit method in which case the derivatives are only used within the Newton iteration. Using an approximate Jacobian will cost you strict quadratic convergence on the Newton iterations, but that is often acceptable. Alternatively (mostly if the system is large) you can use a Jacobian-free Newton-Krylov method to solve the stages, in which case your approximate Jacobian becomes merely a preconditioner and you retain quadratic convergence in the Newton iteration.
Have you looked into using odeset? It allows you to set options for an ODE solver, then you pass the options structure as the fourth argument to whichever solver you call. The error control properties (RelTol, AbsTol, NormControl) may be of most interest to you. Not sure if this is exactly the sort of help you need, but it's the best suggestion I could come up with, having last used the MATLAB ODE functions years ago.
In addition: For the user-defined derivative function, could you just hard-code tolerances into the computation of the derivatives, or do you really need error limits to be passed from the solver?
Not sure I'm contributing much, but in the pharma modeling world, we use LSODE, DVERK, and DGPADM. DVERK is a nice fast simple order 5/6 Runge-Kutta solver. DGPADM is a good matrix-exponent solver. If your ODEs are linear, matrix exponent is best by far. But your problem is a little different.
BTW, the T argument is only in there for generality. I've never seen an actual system that depended on T.
You may be breaking into new theoretical territory. Good luck!
Added: If you're doing orbital simulations, seems to me I heard of special methods used for that, based on conic-section curves.
Check into a finite element method with linear basis functions and midpoint quadrature. Solving the following ODE requires only one evaluation each of f(x), k(x), and b(x) per element:
-k(x)u''(x) + b(x)u'(x) = f(x)
The answer will have pointwise error proportional to the error in your evaluations.
If you need smoother results, you can use quadratic basis functions with 2 evaluation of each of the above functions per element.