How to specify non linear regression model in python - scipy

I am taking an Econometrics course, and have been trying to use Python rather than the proprietary STATA and EVIEWS that the assignments are set in.
In one of the questions, I have consumption data over time. I am asked to compute it in two ways.
The first way is fitting a model of the form consumption = A exp(B t), and the second way is to log both sides and run OLS on log(consumption) = alpha + B t.
I know how to do the second way. However, when I try the first way it goes wrong. Using statsmodels, I can exponentiate the time data (after normalising), but this fits a regression of the form consumption = A exp(t) + B, which is not what I want (I want to specify where the parameters go). In sklearn I could find polynomial regression, but not exponential.
Then I found scipy.optimize.curve_fit.
However this seems to have two problems:
(1) It seems to rely on initial guesses for the parameters, which means my output may end up being different from the proprietary software (whereas output for things like OLS is the same). I assume initial guesses mean some iterative solution is used, which is helpful for very weird and wonderful functions, but I would expect fairly standard results to hold for exponential regression.
(2) every time I try to implement it, it just returns the guess parameters.
Here is my code
`consumption_data = pd.read_csv(......\consumption.csv")
def func(x,a,b):
    return a * np.exp(b*x)
xdata = consumption_data.YEAR
ydata = consumption_data.CONSUMPTION
ydata = (ydata - 1948)/100
popt, pcov = curve_fit(func, xdata, ydata, (1,1))
print(popt)
plt.plot(xdata, func(xdata, *popt), 'g--',)
`
The scipy.optimize code is basically just copy-pasted from their tutorial
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

short answer: use statsmodels GLM
statsmodels does not have nonlinear least squares. The best python library for that is lmfit https://pypi.org/project/lmfit/
curve_fit, lmfit and nonlinear least squares algorithms in general find an iterative solution to the optimization problem. Even though we have to provide starting values, the solution is in many cases the same across packages up to the convergence tolerance, e.g. 1e-5 or 1e-6.
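As a rough illustration of that point, here is a minimal sketch on synthetic data (the numbers and variable names are made up, not the asker's dataset). It fits consumption = A exp(B t) with curve_fit from two different starting points, one of them taken from the log-linear OLS fit, and the two runs can then be compared. Time is measured from the first observation; with raw calendar years, np.exp(b*x) can overflow at the initial guess, which is one plausible reason for an optimizer not moving away from it.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(t, A, B):
    # Target model: consumption = A * exp(B * t)
    return A * np.exp(B * t)

# Synthetic stand-in data (assumption: not the asker's consumption series).
rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)          # years since the first observation
y = 2.0 * np.exp(0.05 * t) + rng.normal(scale=0.05, size=t.size)

# Starting values from the log-linear OLS fit: log(y) = alpha + B * t.
B0, alpha0 = np.polyfit(t, np.log(y), 1)
p0 = (np.exp(alpha0), B0)

# Same model, two different starting points.
popt_ols_start, _ = curve_fit(func, t, y, p0=p0)
popt_other_start, _ = curve_fit(func, t, y, p0=(1.0, 0.1))

print(popt_ols_start)
print(popt_other_start)
```

On well-behaved data like this, both runs should agree up to the convergence tolerance, so the dependence on starting values is mostly a matter of getting the iteration started in a sensible region.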
Many standard models in statistics and econometrics have a single global maximum with well behaved data. However, in other cases like mixture models, there might be many local optima and the estimation might converge to one of them.
To the specific case:
consumption = A exp(B t)
can be rewritten as
consumption = exp(a + B t)
So this is just a single index model or a generalized linear model with an exponential mean function.
The general version has the expectation of the dependent variable as a nonlinear function of a linear combination of the explanatory variables:
E(y | x) = g(x b)
This can be estimated with statsmodels with GLM with family Gaussian and the log-link.
Aside: In econometrics, there is a literature to use Poisson quasi-likelihood as an estimator for exp models instead of taking the log of the dependent variable.
Poisson usually uses the log-link function as in the above.
However, using GLM allows us to use log-link, i.e. exponential mean function, with any of the supported distribution families. The main difference is in the underlying variance assumption. Gaussian assumes constant variance, Poisson assumes that the variance is proportional to the mean and Gamma assumes that the variance is quadratic in the mean.
If we use a robust sandwich covariance estimator for parameter inference, then standard errors and inference are correct even if the variance function is misspecified.


Double numerical integration in Matlab - Singularity

I have discrete data of a 2D function defined as
theta = linspace(0,pi,nTheta);
phi = linspace(0,2*pi,nPhi);
p=zeros(nPhi,nTheta);%only to show the dimension of my matrix
[np,nt]=ndgrid(phi,theta);
f1 = griddedInterpolant(np,nt,p,'spline');
f2 = @(np,nt) f1(np,nt);
integral2(f2,0,2*pi,0,pi)
Note that p is calculated from a complex physical problem; the code above only shows how it is initialized.
Also, I can increase nTheta and nPhi, which leads to more accurate calculation of p.
My calculated function (with nPhi=400,nTheta=200) is something like:
I tried 3 ways :
using Trapz function
using the code above but with linear interpolation for gridded interpolant
using the code above with spline interpolation
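For readers who want to reproduce the comparison outside MATLAB, here is a minimal Python sketch of the first and third approaches on a hypothetical test function with a known integral (the function, grid sizes, and names are assumptions, not the asker's p). trapezoid plays the role of trapz, and RectBivariateSpline stands in for a spline griddedInterpolant followed by integral2.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import RectBivariateSpline

n_phi, n_theta = 40, 20
phi = np.linspace(0.0, 2.0 * np.pi, n_phi)
theta = np.linspace(0.0, np.pi, n_theta)
PHI, THETA = np.meshgrid(phi, theta, indexing="ij")

# Hypothetical smooth test function, shape (n_phi, n_theta).
p = (1.0 + np.cos(PHI) ** 2) * np.sin(THETA) ** 2
exact = 1.5 * np.pi ** 2                       # analytic value of the double integral

# Way 1: nested trapezoidal rule on the grid.
I_trapz = trapezoid(trapezoid(p, theta, axis=1), phi)

# Way 3: bicubic spline through the grid values, integrated analytically.
spline = RectBivariateSpline(phi, theta, p)
I_spline = spline.integral(0.0, 2.0 * np.pi, 0.0, np.pi)

print(abs(I_trapz - exact), abs(I_spline - exact))
```

This test function is smooth and periodic, which is a favourable case for the trapezoidal rule, so the ranking of the methods here says little about a function with sharp local features.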
Although the spline is better than the others, I still need to increase nPhi and nTheta, which makes the simulation too expensive to run.
Is there any alternative to these 3 methods, or any general suggestion for doing this computation more efficiently? (I have already taken advantage of the symmetry in both directions.)
Note that the shape of my function varies in each time step, so local mesh refinement might be challenging because I don't know the details of my function in advance.

Optimizing huge amounts of calls of fsolve in Matlab

I'm solving a pair of non-linear equations for each voxel in a dataset of a ~billion voxels using fsolve() in MATLAB 2016b.
I have done all the 'easy' optimizations that I'm aware of. Memory localization is OK, I'm using parfor, the equations are in fairly numerically simple form. All discontinuities of the integral are fed to integral(). I'm using the Levenberg-Marquardt algorithm with good starting values and a suitable starting damping constant, it converges on average with 6 iterations.
I'm now at ~6 ms per voxel, which is good, but not good enough. I'd need an order of magnitude reduction to make the technique viable. There are only a few things that I can think of improving before starting to hammer away at accuracy:
The splines in the equation are for quick sampling of complex equations. There are two for each equation; one is inside the 'complicated nonlinear equation'. They represent two equations: one has a large number of terms but is smooth and has no discontinuities, and one approximates a histogram drawn from a spectrum. I'm using griddedInterpolant() as the editor suggested.
Is there a faster way to sample points from pre-calculated distributions?
parfor i=1:numel(I1)
    sols = fsolve(@(x) equationPair(x, input1, input2, ...
        6 static inputs, fsolve options)
    output1(i) = sols(1); output2(i) = sols(2)
end
When calling fsolve, I'm using the 'parametrization' suggested by Mathworks to input the variables. I have a nagging feeling that defining an anonymous function for each voxel is taking a large slice of the time at this point. Is this true: is there a relatively large overhead for defining the anonymous function again and again? Do I have any way to vectorize the call to fsolve?
There are two input variables which keep changing, all of the other input variables stay static. I need to solve one equation pair for each input pair so I can't make it a huge system and solve it at once. Do I have any other options than fsolve for solving pairs of nonlinear equations?
If not, some of the static inputs are fairly large. Is there a way to keep the inputs as persistent variables using MATLAB's persistent, and would that improve performance? I only saw examples of how to load persistent variables; how could I make it so that they are passed in only once and future function calls are spared the presumably largish overhead of the large inputs?
EDIT:
The original equations in full form look like:
Where:
and:
Everything else is known, solving for x_1 and x_2. f_KN was approximated by a spline. S_low (E) and S_high(E) are splines, the histograms they are from look like:
So, there's a few things I thought of:
Lookup table
Because the integrals in your function do not depend on any of the parameters other than x, you could make a simple 2D-lookup table from them:
% assuming simple (square) range here, adjust as needed
[x1,x2] = meshgrid( linspace(0, xmax, N) );
LUT_high = zeros(size(x1));
LUT_low = zeros(size(x1));
for ii = 1:N
    LUT_high(:,ii) = integral(@(E) Fhi(E, x1(1,ii), x2(:,ii)), ...
                              0, E_high, ...
                              'ArrayValued', true);
    LUT_low(:,ii)  = integral(@(E) Flo(E, x1(1,ii), x2(:,ii)), ...
                              0, E_low, ...
                              'ArrayValued', true);
end
where Fhi and Flo are helper functions to compute those integrals, vectorized with scalar x1 and vector x2 in this example. Set N as high as memory will allow.
Those lookup tables you then pass as parameters to equationPair() (which allows parfor to distribute the data). Then just use interp2 in equationPair():
F(1) = I_high - interp2(x1,x2,LUT_high, x(1), x(2));
F(2) = I_low - interp2(x1,x2,LUT_low , x(1), x(2));
So, instead of recomputing the whole integral every time, you evaluate it once for the expected range of x, and reuse the outcomes.
You can specify the interpolation method used, which is linear by default. Specify cubic if you're really concerned about accuracy.
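The same lookup-table idea can be sketched in Python. Everything in this example is a toy stand-in: the spectrum S, the function f_kn, the energy range, and the ranges of x are assumptions chosen only so that the script runs. RegularGridInterpolator plays the role of interp2 (linear by default).

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import RegularGridInterpolator

# Hypothetical tabulated spectrum and f_KN (not the asker's functions).
E = np.linspace(20.0, 80.0, 241)
S = np.exp(-0.5 * ((E - 50.0) / 12.0) ** 2)
f_kn = 1.0 / (1.0 + E / 100.0)

def g(x1, x2):
    # Integral of S(E) * exp(x1 / E^3 + x2 * f_KN(E)) over the tabulated range.
    return trapezoid(S * np.exp(x1 / E**3 + x2 * f_kn), E)

# Evaluate the integral once on a grid covering the expected range of x.
x1_grid = np.linspace(-2.0e4, 0.0, 81)
x2_grid = np.linspace(-1.0, 0.0, 81)
lut = np.array([[g(a, b) for b in x2_grid] for a in x1_grid])

# Afterwards every evaluation is a cheap interpolation instead of an integral.
lookup = RegularGridInterpolator((x1_grid, x2_grid), lut)

point = (-7.3e3, -0.42)
exact = g(*point)
approx = lookup([point])[0]
print(exact, approx)
```

How fine the grid has to be depends on how strongly the integral curves over the range of x, so the interpolation error is worth checking at a few off-grid points, as done here.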
Coarse/Fine
Should the lookup table method not be possible for some reason (memory limitations, in case the possible range of x is too big), here's another thing you could do: split up the whole procedure in 2 parts, which I'll call coarse and fine.
The intent of the coarse method is to improve your initial estimates really quickly, but perhaps not so accurately. The quickest way to approximate that integral by far is via the rectangle method:
do not approximate S with a spline, just use the original tabulated data (so S_high/low = [S_high/low@E0, S_high/low@E1, ..., S_high/low@E_high/low])
At the same values for E as used by the S data (E0, E1, ...), evaluate the exponential at x:
Elo = linspace(0, E_low, numel(S_low)).';
integrand_exp_low = exp(x(1)./Elo.^3 + x(2)*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high)).';
integrand_exp_high = exp(x(1)./Ehi.^3 + x(2)*fKN(Ehi));
then use the rectangle method:
F(1) = I_low  - (S_low  * integrand_exp_low ) * (Elo(2) - Elo(1));
F(2) = I_high - (S_high * integrand_exp_high) * (Ehi(2) - Ehi(1));
Running fsolve like this for all I_low and I_high will then have improved your initial estimates x0 probably to a point close to "actual" convergence.
Alternatively, instead of the rectangle method, you could use trapz (trapezoidal method). A tad slower, but possibly a bit more accurate.
Note that if (Elo(2) - Elo(1)) == (Ehi(2) - Ehi(1)) (step sizes are equal), you can further reduce the number of computations. In that case, the first N_low elements of the two integrands are identical, so the values of the exponentials will only differ in the N_low + 1 : N_high elements. So then just compute integrand_exp_high, and set integrand_exp_low equal to the first N_low elements of integrand_exp_high.
The fine method then uses your original implementation (with the actual integral()s), but then starting at the updated initial estimates from the coarse step.
The whole objective here is to try and bring the total number of iterations needed down from about 6 to less than 2. Perhaps you'll even find that the trapz method already provides enough accuracy, rendering the whole fine step unnecessary.
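A minimal Python sketch of the coarse/fine idea, using scipy.optimize.fsolve in place of MATLAB's fsolve. The spectra, f_kn, the energy ranges, and the scaling energy E0 are all hypothetical; in particular the first term is written as x[0] * (E0 / E)**3 rather than x(1)/E^3 purely to keep both unknowns of order one in this toy problem.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import fsolve

E0 = 30.0  # hypothetical scaling energy (assumption, see lead-in)

def f_kn(E):
    return 1.0 / (1.0 + E / 100.0)            # toy stand-in for f_KN

def make_table(e_max, n, centre, width):
    # Tabulated toy spectrum S on n equally spaced energies.
    E = np.linspace(20.0, e_max, n)
    return E, np.exp(-0.5 * ((E - centre) / width) ** 2)

def g(x, E, S, rule):
    integrand = S * np.exp(x[0] * (E0 / E) ** 3 + x[1] * f_kn(E))
    if rule == "rectangle":
        return np.sum(integrand) * (E[1] - E[0])
    return trapezoid(integrand, E)

def residuals(x, I_low, I_high, low, high, rule):
    return [I_low - g(x, low[0], low[1], rule),
            I_high - g(x, high[0], high[1], rule)]

fine_low, fine_high = make_table(80.0, 2001, 50.0, 10.0), make_table(140.0, 2001, 85.0, 18.0)
coarse_low, coarse_high = make_table(80.0, 31, 50.0, 10.0), make_table(140.0, 31, 85.0, 18.0)

# Synthetic "measurements" for one voxel, generated from known x.
x_true = np.array([-1.2, -0.8])
I_low = g(x_true, fine_low[0], fine_low[1], "trapezoid")
I_high = g(x_true, fine_high[0], fine_high[1], "trapezoid")

# Coarse step: cheap rectangle rule on few samples, starting from a poor guess.
x_coarse = fsolve(residuals, np.zeros(2),
                  args=(I_low, I_high, coarse_low, coarse_high, "rectangle"))

# Fine step: the expensive residual, started from the coarse solution.
x_fine, info, ier, _ = fsolve(residuals, x_coarse,
                              args=(I_low, I_high, fine_low, fine_high, "trapezoid"),
                              full_output=True)
print(x_coarse, x_fine, info["nfev"])
```

info["nfev"] reports how many expensive residual evaluations the fine step needed, which is the quantity the coarse step is meant to reduce.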
Vectorization
The rectangle method in the coarse step outlined above is easy to vectorize:
% (uses R2016b implicit expansion rules)
Elo = linspace(0, E_low, numel(S_low));
integrand_exp_low = exp(x(:,1)./Elo.^3 + x(:,2).*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high));
integrand_exp_high = exp(x(:,1)./Ehi.^3 + x(:,2).*fKN(Ehi));
F = [I_high_vector - (S_high * integrand_exp_high) * (Ehi(2) - Ehi(1))
I_low_vector - (S_low * integrand_exp_low ) * (Elo(2) - Elo(1))];
trapz also works on matrices; it will integrate over each column in the matrix.
You'd call equationPair() then using x0 = [x01; x02; ...; x0N], and fsolve will then converge to [x1; x2; ...; xN], where N is the number of voxels, and each x0 is 1×2 ([x(1) x(2)]), so x0 is N×2.
parfor should be able to slice all of this fairly easily over all the workers in your pool.
Similarly, vectorization of the fine method should also be possible; just use the 'ArrayValued' option to integral() as shown above:
F = [I_high_vector - integral(@(E) S_high(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
                              0, E_high,...
                              'ArrayValued', true);
     I_low_vector  - integral(@(E) S_low(E)  .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
                              0, E_low,...
                              'ArrayValued', true);
    ];
Jacobian
Taking derivatives of your function is quite easy. Here is the derivative w.r.t. x_1, and here w.r.t. x_2. Your Jacobian will then have to be a 2×2 matrix
J = [dF(1)/dx(1) dF(1)/dx(2)
dF(2)/dx(1) dF(2)/dx(2)];
Don't forget the leading minus sign (F = I_hi/lo - g(x) → dF/dx = -dg/dx)
Using one or both of the methods outlined above, you can implement a function to compute the Jacobian matrix and pass this on to fsolve via the 'SpecifyObjectiveGradient' option (via optimoptions). The 'CheckGradients' option will come in handy there.
Because fsolve usually spends the vast majority of its time computing the Jacobian via finite differences, computing it manually will normally speed the algorithm up tremendously.
It will be faster, because
fsolve doesn't have to do extra function evaluations to do the finite differences
the convergence rate will increase due to the improved precision of the Jacobian
Especially if you use the rectangle method or trapz like above, you can reuse many of the computations you've already done for the function values themselves, meaning, even more speed-up.
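To make the combination of vectorization and an analytic Jacobian concrete, here is a small NumPy sketch that solves many independent 2×2 systems at once with a plain Newton iteration. It is a stand-in for, not a port of, fsolve: the spectra, the basis functions, the scaling energy E0, and the starting values are all hypothetical. The derivatives reuse the weights already computed for the function values, and the leading minus sign from F = I - g(x) appears in J.

```python
import numpy as np

E0 = 30.0  # hypothetical scaling energy so both unknowns stay O(1)

def basis(E):
    # Rows: the known energy dependences multiplying x_1 and x_2 (toy choices).
    return np.stack([(E0 / E) ** 3, 1.0 / (1.0 + E / 100.0)])

def spectrum(E, centre, width):
    return np.exp(-0.5 * ((E - centre) / width) ** 2)

E_low, E_high = np.linspace(20.0, 80.0, 61), np.linspace(20.0, 140.0, 121)
S_low, S_high = spectrum(E_low, 50.0, 10.0), spectrum(E_high, 85.0, 18.0)

def g_and_grad(x, E, S):
    # x has shape (N, 2): one row per voxel. Rectangle rule over E.
    A = basis(E)                               # (2, nE)
    w = S * np.exp(x @ A)                      # (N, nE), shared by g and dg/dx
    h = E[1] - E[0]
    return w.sum(axis=1) * h, (w @ A.T) * h    # g: (N,), dg/dx: (N, 2)

def solve_all(I_low, I_high, x0, max_iter=30, tol=1e-12):
    x = x0.copy()
    for _ in range(max_iter):
        g_lo, dg_lo = g_and_grad(x, E_low, S_low)
        g_hi, dg_hi = g_and_grad(x, E_high, S_high)
        F = np.stack([I_low - g_lo, I_high - g_hi], axis=1)   # (N, 2)
        J = -np.stack([dg_lo, dg_hi], axis=1)                 # (N, 2, 2), dF/dx = -dg/dx
        step = np.linalg.solve(J, F[..., None])[..., 0]       # batched 2x2 solves
        x = x - step
        if np.max(np.abs(step)) < tol:
            break
    return x

# Synthetic voxels with known solutions and reasonably good starting values.
rng = np.random.default_rng(0)
x_true = rng.uniform(-1.5, -0.2, size=(1000, 2))
I_low, _ = g_and_grad(x_true, E_low, S_low)
I_high, _ = g_and_grad(x_true, E_high, S_high)
x0 = x_true + rng.uniform(-0.2, 0.2, size=x_true.shape)

x_est = solve_all(I_low, I_high, x0)
print(np.max(np.abs(x_est - x_true)))
```

An undamped Newton iteration like this relies on good starting values; a library solver adds safeguards (damping, trust regions) that this sketch leaves out.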
Rody's answer was the correct one. Supplying the Jacobian was the single largest factor. Especially with the vectorized version, there were 3 orders of magnitude of difference in speed with the Jacobian supplied and not.
I had trouble finding information about this subject online so I'll spell it out here for future reference: It is possible to vectorize independent parallel equations with fsolve() with great gains.
I also did some work with inlining fsolve(). After supplying the Jacobian and being smarter about the equations, the serial version of my code was mostly overhead at ~1*10^-3 s per voxel. At that point most of the time inside the function was spent passing around an options struct and creating error messages which are never sent, plus lots of unused stuff, presumably for the other optimization algorithms inside the optimisation function (Levenberg-Marquardt for me). I successfully butchered the function fsolve and some of the functions it calls, dropping the time to ~1*10^-4 s per voxel on my machine. So if you are stuck with a serial implementation, e.g. because of having to rely on the previous results, it's quite possible to inline fsolve() with good results.
The vectorized version provided the best results in my case, with ~5*10^-5 s per voxel.

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of the peaks. I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimized Gaussian parameters? I would like to know if there is any built-in function which will make my task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using lsqcurvefit() in MATLAB. The MSE is low. However, I have an additional hard constraint that the value of the approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting my current working code here. I would like to have a solution which obeys the hard constraint at the peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata)myFun(x,xdata,pks,locs); % pks, locs are the peak amplitudes and locations already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma resnorm] = lsqcurvefit(fun,x0,xdata,ydata); % xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f=myFun(sigma,xdata,a,c)
    % a holds the (constant) amplitudes, c the means of the individual gaussians
    f=zeros(size(xdata));
    for i = 1:6 % use 6 gaussians to approximate function
        f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
    end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the Mathworks optimization toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The curve fitting toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
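A minimal Python sketch of this recipe, using scipy.optimize.least_squares with a lower bound to keep every width positive. The curve, peak amplitudes, and peak locations are synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def g(sigma, x, amps, centres):
    # Sum of Gaussians with fixed amplitudes and centres; only widths vary.
    z = (x - centres[:, None]) / sigma[:, None]
    return np.sum(amps[:, None] * np.exp(-0.5 * z ** 2), axis=0)

# Synthetic curve with known peak amplitudes and locations (assumptions).
amps = np.array([1.0, 0.6, 0.8])
centres = np.array([2.0, 5.0, 8.0])
true_sigma = np.array([0.4, 0.7, 0.5])
x = np.linspace(0.0, 10.0, 201)
rng = np.random.default_rng(0)
y = g(true_sigma, x, amps, centres) + rng.normal(scale=0.005, size=x.size)

# least_squares minimises sum(residual**2), i.e. the squared-error loss L(sigma).
fit = least_squares(
    lambda s: g(s, x, amps, centres) - y,
    x0=np.full(3, 1.0),                 # initial guess for the widths
    bounds=(1e-6, np.inf),              # sigma > 0
)
print(fit.x)
```

If the result looks sensitive to x0, rerunning from several initial guesses and keeping the fit with the smallest fit.cost is the simple multi-start strategy described above.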
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
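In Python, one possible counterpart to fmincon() is scipy.optimize.minimize with method="SLSQP", which accepts equality constraints. The sketch below treats both the weights and the widths as parameters and requires the fitted curve to match the known values at the peak locations. The data are synthetic, with deliberately overlapping Gaussians; all numbers are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

centres = np.array([2.0, 4.5, 7.0])     # known peak locations (assumption)

def model(params, x):
    # params = [w_1, w_2, w_3, sigma_1, sigma_2, sigma_3]
    w, sigma = params[:3], params[3:]
    z = (x - centres[:, None]) / sigma[:, None]
    return np.sum(w[:, None] * np.exp(-0.5 * z ** 2), axis=0)

# Synthetic curve built from overlapping Gaussians.
x = np.linspace(0.0, 9.0, 181)
true_params = np.array([1.0, 0.7, 1.2, 0.9, 1.0, 0.8])
y = model(true_params, x)
pks = model(true_params, centres)       # curve values at the peak locations

def loss(params):
    return np.sum((model(params, x) - y) ** 2)

# Hard constraint: the approximation equals the curve at every peak location.
peak_constraint = {"type": "eq", "fun": lambda p: model(p, centres) - pks}
bounds = [(0.0, None)] * 3 + [(1e-3, None)] * 3

start = np.concatenate([pks, np.full(3, 0.6)])
res = minimize(loss, start, method="SLSQP", bounds=bounds,
               constraints=[peak_constraint], options={"maxiter": 500})

violation = np.max(np.abs(model(res.x, centres) - pks))
print(res.x, violation)
```

Because the overlapping tails add to each peak, the fitted weights end up below the peak heights, which is why the weights need to be free parameters for the constraint to be satisfiable.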
You can use the Expectation–Maximization algorithm to fit a mixture of Gaussians to your data. It does not care about the data dimension.
In the MATLAB documentation you can look up gmdistribution.fit or fitgmdist.

Why ridge regression minimizes test cost when lambda is negative

I am processing a set of data using ridge regression. I found a very interesting phenomenon when applying the learned function to the data. Namely, when the ridge parameter increases from zero, the test error keeps increasing. But if we penalize small coefficients (set the parameter < 0), the test error can be even smaller.
This is my matlab code:
for i = 1:100
    beta = ridgePolyRegression(ty_train,tX_train,lambda(i));
    sqridge_train_cost(i) = computePolyCostMSE(ty_train,tX_train,beta);
    sqridge_test_cost(i) = computePolyCostMSE(ty_valid,tX_valid,beta);
end
plot(lambda,sqridge_test_cost,'color','b');
lambda is the ridge parameter. ty_train is the output of the training data, tX_train is the input of training data. Also, we use a quadratic function regression here.
function [ beta ] = ridgePolyRegression( y,tX,lambda )
    X = tX(:,2:size(tX,2));
    tX2 = [tX,X.^2];
    beta = (tX2'*tX2 + lambda * eye(size(tX2,2))) \ (tX2'*y);
end
The plotted picture is:
Why is the error minimal when lambda is negative? Is it a sign of under-fitting?
You should not use negative lambdas.
From a (probabilistic) theoretical point of view, lambda relates to the inverse of the variance of the prior distribution on the parameters, and a variance can't be negative.
From a computational point of view, a negative lambda (if it is below the negative of the smallest eigenvalue of the covariance matrix) can turn your positive-definite form into an indefinite one, which means you no longer have a minimum but a saddle point. It also means there are directions along which your objective function can be made as small as you want, so you can reduce the loss indefinitely and no minimum exists at all.
Your optimization algorithm gives you just a stationary point, which will be a global minimum if and only if the form is positive definite.
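The saddle-point argument can be checked numerically. The sketch below uses synthetic data and a hypothetical helper named ridge; it picks lambda just below the negative of the smallest eigenvalue of X'X, solves the same linear system as the closed-form ridge estimator, and then moves away from the returned beta along the eigenvector with the negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    # Stationary point of ||y - X b||^2 + lam * ||b||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def objective(b, lam):
    return np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)

gram_eigs = np.linalg.eigvalsh(X.T @ X)            # ascending order
lam_neg = -(gram_eigs[0] + 1.0)                    # below -(smallest eigenvalue)

hess_eigs, hess_vecs = np.linalg.eigh(X.T @ X + lam_neg * np.eye(3))
beta = ridge(X, y, lam_neg)

# Gradient vanishes at beta, yet the objective decreases along one direction
# and increases along another: beta is a saddle point, not a minimum.
grad = 2.0 * (X.T @ (X @ beta - y) + lam_neg * beta)
down = objective(beta + hess_vecs[:, 0], lam_neg) - objective(beta, lam_neg)
up = objective(beta + hess_vecs[:, -1], lam_neg) - objective(beta, lam_neg)
print(hess_eigs, down, up)
```

So for such a lambda the formula still returns a number, but that number no longer minimises anything.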
Short Answer: When lambda is negative, you're actually overfitting your data. Hence, it's reasonable to get much less error.
Long Answer:
The regularization term (or the penalty term, as described by many statisticians) aims to penalize the weights (or the betas, as written in the coming Eq.) for going too high (overfitting) and going too low (underfitting). This gives you the power to control how your model behaves, and you usually aim for the "right fitting" model.
For mathematical intuition, you can check the following Eq. (P. S. Equation is screenshotted from Elements of Statistical Learning by Trevor Hastie et. al)
When you decide to make your lambda negative, the penalty term is indeed turned into a utility term that helps to increase the weights (i.e., overfitting).
Overfitting is, simply, understanding your data along with the features more than you should, because you do not have the whole population yet; therefore, what you understood so far is possibly wrong on a different dataset.
So, you should never use negative values of lambda.

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. Someone suggested that I do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct.
I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is x_n = x_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(x_n) = ln(x_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n.
Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which MATLAB will return in the stats output if you use the regress function to perform OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea.
Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares, where you fix the parameter ln(x_0) at its known value.
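The log-linear suggestion can be sketched in a few lines of Python with NumPy. The data are synthetic, with multiplicative noise so that the logarithm is always defined; the decay rate and initial value are assumptions.

```python
import numpy as np

# Synthetic exponential decay with multiplicative noise (keeps x_n > 0).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
x = 3.0 * np.exp(-0.8 * t) * np.exp(rng.normal(scale=0.05, size=t.size))

# OLS on ln(x_n) = ln(x_0) - b * t_n + e_n.
design = np.column_stack([np.ones_like(t), t])
coef, *_ = np.linalg.lstsq(design, np.log(x), rcond=None)
ln_x0, slope = coef
b = -slope
x0_hat = np.exp(ln_x0)

# R^2 of the log-linear regression.
resid = np.log(x) - design @ coef
total = np.log(x) - np.log(x).mean()
r2 = 1.0 - (resid @ resid) / (total @ total)
print(x0_hat, b, r2)
```

Note that this R^2 measures the fit on the log scale, which weights small late-time values more heavily than a least-squares fit on the original scale would.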