Why ridge regression minimizes test cost when lambda is negative - matlab

I am processing a set of data using ridge regression. I found a very interesting phenomenon when apply the learned function to data. Namely, when the ridge parameter increases from zero, the test error keeps increasing. But if we penalize small coefficients(set the parameter <0), the test error can even be smaller.
This is my matlab code:
for i = 1:100
beta = ridgePolyRegression(ty_train,tX_train,lambda(i));
sqridge_train_cost(i) = computePolyCostMSE(ty_train,tX_train,beta);
sqridge_test_cost(i) = computePolyCostMSE(ty_valid,tX_valid,beta);
end
plot(lambda,sqridge_test_cost,'color','b');
lambda is the ridge parameter. ty_train is the output of the training data, tX_train is the input of training data. Also, we use a quadratic function regression here.
function [ beta ] = ridgePolyRegression( y,tX,lambda )
X = tX(:,2:size(tX,2));
tX2 = [tX,X.^2];
beta = (tX2'*tX2 + lambda * eye(size(tX2,2))) \ (tX2'*y);
end
The plotted picture is:
Why the error is minimal when lambda is negative? Is it a sign of under-fitting?

You should not use negative lambdas.
From (probabilistic) theoretic point of view, lambda relates to the inverse of variance of parameter prior distribution, and variance can't be negative.
From computational point of view, it can (given it's less that the smallest eigenvalue of the covariance matrix) turn your positive-definite form into an indefinite form, which means you'll have not a maximum, but a saddle point. It also means there are points where your target function is as small (or as big) as you want, so you can reduce loss indefinitely and no minimum / maximum exists at all.
Your optimization algorithm gives you just a stationary point, which will be a global maximum if and only if the form is positive definite.

Short Answer: When lambda is negative, you're actually overfitting your data. Hence, it's reasonable to get much less error.
Long Answer:
The regularization term (or the penalty term as described by many statisticians) aims to penalize the weights (or the betas as written in the coming Eq.) for going too high (overfitting) and going too low (underfitting). Giving you the power to control how your model behaves, and you usually aim the "right fitting" model.
For mathematical intuition, you can check the following Eq. (P. S. Equation is screenshotted from Elements of Statistical Learning by Trevor Hastie et. al)
When you decide to make your lambda negative, the penalty term is indeed turned into a utility term that helps to increase the weights (i.e., overfitting).
Overfitting is, simply, understanding your data along with the features more than you should, because you do not have the whole population yet; therefore, what you understood so far is possibly wrong on a different dataset.
So, you should never be using negative values of lambdas.

Related

How to specify non linear regression model in python

I am taking an Econometrics course, and have been trying to use Python rather than the propreitry STATA and EVIEWS they set the assignments in.
In one of the questions, I have consumption data over time. I am asked to compute it in two ways.
The first way is calculating a model of the form consumption = Aexp(Bt), and the second way is to log both sides and do ordinary OLS on log(consumption) = alpha + Bt
I know how to do the second way. Howver, when I try to do the first way it goes wrong. Using statsmodels, I can exponentiate the time data (after normalising), but this calculates a regression in the form consumption = Aexp(t) + B, which is not what I want. (I want to specify where the parameters go). In sklearn I could find a polynomial regression, but not exponential.
Then I found scipy.curve_fit
However this seems to have two problems:
(1) It seems to rely on initial guesses for parameters, which means my output will end up being different from proprietry software (whereas output for things like OLS are the same) [as I assume initial guesses means some iterative solution is done which is helpful for very weird and wonderful functions, but I assume fairly standard results hold for exponential regression]
(2) every time I try to implement it, it just returns the guess parameters.
Here is my code
`consumption_data = pd.read_csv(......\consumption.csv")
def func(x,a,b):
return a * np.exp(b*x)
xdata = consumption_data.YEAR
ydata = consumption_data.CONSUMPTION
ydata = (ydata - 1948)/100
popt, pcov = curve_fit(func, xdata, ydata, (1,1))
print(popt)
plt.plot(xdata, func(xdata, *popt), 'g--',)
`
The scipy.optimize code is basically just copy-pasted from their tutorial
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
short answer: use statsmodels GLM
statsmodels does not have nonlinear least squares. The best python library for that is lmfit https://pypi.org/project/lmfit/
curve_fit, lmfit and nonlinear least squares algorithm in general find an iterative solution to the optimization problem. Even when we have to provide starting values, the solution is in many cases the same across packages up to convergence tolerance, e.g. 1e-5 or 1e-6.
Many standard models in statistics and econometrics have a single global maximum with well behaved data. However, in other cases like mixture models, there might be many local optima and the estimation might converge to one of them.
To the specific case:
consumption = A exp(B t)
can be rewritten as
consumption = exp(a + B t)
So this is just a single index model or a generalized linear model with an exponential mean function.
The general version has the expectation of the dependent variable as a nonlinear function of a linear combination of the explanatory variables:
E(y | x) = g(x b)
This can be estimated with statsmodels with GLM with family Gaussian and the log-link.
Aside: In econometrics, there is a literature to use Poisson quasi-likelihood as an estimator for exp models instead of taking the log of the dependent variable.
Poisson usually uses the log-link function as in the above.
However, using GLM allows us to use log-link, i.e. exponential mean function, with any of the supported distribution families. The main difference is in the underlying variance assumption. Gaussian assumes constant variance, Poisson assumes that the variance is proportional to the mean and Gamma assumes that the variance is quadratic in the mean.
If we use a robust sandwich covariance estimator for parameter inference, then standard errors and inference are correct even if the variance function is misspecified.

Choosing a suitable plaintext_modulus

In choosing parameters such as plaintext_modulus, is there any good strategy? (aside from guess-and-check until the output looks correct)
In particular, I'm experimenting with IntegerEncoder with BFV. My (potentially-wrong) understanding is that the plaintext_modulus is not the modulus for the integer being encoded, but the modulus for each coefficient in the polynomial representation.
With B=2, it looks like these coefficients will just be 0 or 1. However, after operations like add and multiply are applied, this clearly is no longer the case. Is there a good way to determine a good bound for the coefficients, in order to pick plaintext_modulus?
My (potentially-wrong) understanding is that the plaintext_modulus is not the modulus for the integer being encoded, but the modulus for each coefficient in the polynomial representation.
This is the correct way of thinking when using IntegerEncoder. Note, however, that when using BatchEncoder (PolyCRTBuilder in SEAL 2.*) the situation is exactly the opposite: each slot in the plaintext vector is an integer modulo poly_modulus.
With B=2, it looks like these coefficients will just be 0 or 1. However, after operations like add and multiply are applied, this clearly is no longer the case. Is there a good way to determine a good bound for the coefficients, in order to pick plaintext_modulus?
The whole point of IntegerEncoder is that fresh encodings have as small coefficients as possible, delaying plain_modulus overflow and allowing you to use smaller plain_modulus (implies smaller noise growth). SEAL 2.* had an automatic parameter selection tool that performed heuristic upper bound estimates on noise growth and plaintext coefficient growth, and basically did exactly what you want. Unfortunately these estimates were performed on a per-operation basis, causing overestimates in the earlier operations to blow up in later stages of the computation. As a result, the estimates were not very tight for more than the simplest computations and in many cases the parameters this tool provided were oversized.
To estimate the plaintext coefficient growth in multiplications, let's consider two polynomials p(x) and q(x). Obviously the product will have degree exactly equal to deg(p)+deg(q)---that part is easy. If |P| denotes the infinity norm of a polynomial P (absolute value of largest coefficient), then:
|p*q| <= min{deg(p)+1, deg(q)+1} * |p||q|.
Actually, SEAL 2.* is a little bit more precise here. Instead of using the degrees, it uses the number of non-zero coefficients in these polynomials. This makes a big difference when the polynomials are sparse, in which case the contribution from cross-terms is much smaller and a better bound is:
|p*q| <= min{#(non_zero_coeffs(p)), #(non_zero_coeffs(q))} * |p||q|.
A deeper analysis of coefficient growth in IntegerEncoder-like encoders is done in https://eprint.iacr.org/2016/250 by Costache et al., which you may want to look at.

Minimization of L1-Regularized system, converging on non-minimum location?

This is my first post to stackoverflow, so if this isn't the correct area I apologize. I am working on minimizing a L1-Regularized System.
This weekend is my first dive into optimization, I have a basic linear system Y = X*B, X is an n-by-p matrix, B is a p-by-1 vector of model coefficients and Y is a n-by-1 output vector.
I am trying to find the model coefficients, I have implemented both gradient descent and coordinate descent algorithms to minimize the L1 Regularized system. To find my step size I am using the backtracking algorithm, I terminate the algorithm by looking at the norm-2 of the gradient and terminating if it is 'close enough' to zero(for now I'm using 0.001).
The function I am trying to minimize is the following (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: By norm(Y,2) I mean the norm-2 value of the vector Y) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero I should converge on the least squares solution, I can verify that both my algorithms do this pretty well and fairly quickly.
If I start to increase lambda my model coefficients all tend towards zero, this is what I expect, my algorithms never terminate though because the norm-2 of the gradient is always positive number. For example, a lambda of 1000 will give me coefficients in the 10^(-19) range but the norm2 of my gradient is ~1.5, this is after several thousand iterations, While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (10^(-37) range). If I let the algorithm run for longer the situation does not improve, it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) number for the termination condition. They also work quite well with lambda of 0. If I use a very small lambda(say 0.001) I get convergence, a lambda of 0.1 looks like it would converge if I ran it for an hour or two, a lambda any greater and the convergence rate is so small it's useless.
I had a few questions that I think might relate to the problem?
In calculating the gradient I am using a finite difference method (f(x+h) - f(x-h))/(2h)) with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence, if the convergence rate is extremely slow then terminate. Is this a common termination method?
The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose, wisely, a tolerance eps and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add the number in [-lambda, lambda] to v[i] that minimises v[i].
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. I have been suggested by someone to do a chi-square test.
I would like to use the MATLAB function chi2gof but I am not sure how I would tell it that the data is being fitted to an exponential curve
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct. I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(y_n) = ln(y_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which matlab will return in the stats structure if you use the regress function to perform OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (ie x_0), then you may want to look into constrained least squares where you bind the parameter ln(x_0) to its known value.

How to overcome singularities in numerical integration (in Matlab or Mathematica)

I want to numerically integrate the following:
where
and a, b and β are constants which for simplicity, can all be set to 1.
Neither Matlab using dblquad, nor Mathematica using NIntegrate can deal with the singularity created by the denominator. Since it's a double integral, I can't specify where the singularity is in Mathematica.
I'm sure that it is not infinite since this integral is based in perturbation theory and without the
has been found before (just not by me so I don't know how it's done).
Any ideas?
(1) It would be helpful if you provide the explicit code you use. That way others (read: me) need not code it up separately.
(2) If the integral exists, it has to be zero. This is because you negate the n(y)-n(x) factor when you swap x and y but keep the rest the same. Yet the integration range symmetry means that amounts to just renaming your variables, hence it must stay the same.
(3) Here is some code that shows it will be zero, at least if we zero out the singular part and a small band around it.
a = 1;
b = 1;
beta = 1;
eps[x_] := 2*(a-b*Cos[x])
n[x_] := 1/(1+Exp[beta*eps[x]])
delta = .001;
pw[x_,y_] := Piecewise[{{1,Abs[Abs[x]-Abs[y]]>delta}}, 0]
We add 1 to the integrand just to avoid accuracy issues with results that are near zero.
NIntegrate[1+Cos[(x+y)/2]^2*(n[x]-n[y])/(eps[x]-eps[y])^2*pw[Cos[x],Cos[y]],
{x,-Pi,Pi}, {y,-Pi,Pi}] / (4*Pi^2)
I get the result below.
NIntegrate::slwcon:
Numerical integration converging too slowly; suspect one of the following:
singularity, value of the integration is 0, highly oscillatory integrand,
or WorkingPrecision too small.
NIntegrate::eincr:
The global error of the strategy GlobalAdaptive has increased more than
2000 times. The global error is expected to decrease monotonically after a
number of integrand evaluations. Suspect one of the following: the
working precision is insufficient for the specified precision goal; the
integrand is highly oscillatory or it is not a (piecewise) smooth
function; or the true value of the integral is 0. Increasing the value of
the GlobalAdaptive option MaxErrorIncreases might lead to a convergent
numerical integration. NIntegrate obtained 39.4791 and 0.459541
for the integral and error estimates.
Out[24]= 1.00002
This is a good indication that the unadulterated result will be zero.
(4) Substituting cx for cos(x) and cy for cos(y), and removing extraneous factors for purposes of convergence assessment, gives the expression below.
((1 + E^(2*(1 - cx)))^(-1) - (1 + E^(2*(1 - cy)))^(-1))/
(2*(1 - cx) - 2*(1 - cy))^2
A series expansion in cy, centered at cx, indicates a pole of order 1. So it does appear to be a singular integral.
Daniel Lichtblau
The integral looks like a Cauchy Principal Value type integral (i.e. it has a strong singularity). That's why you can't apply standard quadrature techniques.
Have you tried PrincipalValue->True in Mathematica's Integrate?
In addition to Daniel's observation about integrating an odd integrand over a symmetric range (so that symmetry indicates the result should be zero), you can also do this to understand its convergence better (I'll use latex, writing this out with pen and paper should make it easier to read; it took a lot longer to write than to do, it's not that complicated):
First, epsilon(x)-\epsilon(y)\propto\cos(y)-\cos(x)=2\sin(\xi_+)\sin(\xi_-) where I have defined \xi_\pm=(x\pm y)/2 (so I've rotated the axes by pi/4). The region of integration then is \xi_+ between \pi/\sqrt{2} and -\pi/\sqrt{2} and \xi_- between \pm(\pi/\sqrt{2}-\xi_-). Then the integrand takes the form \frac{1}{\sin^2(\xi_-)\sin^2(\xi_+)} times terms with no divergences. So, evidently, there are second-order poles, and this isn't convergent as presented.
Perhaps you could email the persons who obtained an answer with the cos term and ask what precisely it is they did. Perhaps there's a physical regularisation procedure being employed. Or you could have given more information on the physical origin of this (some sort of second order perturbation theory for some sort of bosonic system?), had that not been off-topic here...
May be I am missing something here, but the integrand
f[x,y]=Cos^2[(x+y)/2]*(n[x]-n[y])/(eps[x]-eps[y]) with n[x]=1/(1+Exp[Beta*eps[x]]) and eps[x]=2(a-b*Cos[x]) is indeed a symmetric function in x and y: f[x,-y]= f[-x,y]=f[x,y].
Therefore its integral over any domain [-u,u]x[-v,v] is zero. No numerical integration seems to be needed here. The result is just zero.