How to decompose linear regression error in statsmodels - linear-regression

I want to decompose the error sum of squares into a lack-of-fit component and a pure-error component. I am using the statsmodels library.
import statsmodels.api as sm

model = sm.OLS(y, X)
res = model.fit()
I know how to decompose the total sum of squares (res.centered_tss) into the regression sum of squares (res.ess) and the residual sum of squares (res.ssr).
But I want to decompose the residual further into pure error and lack-of-fit error. My data has multiple y values for each x value, so it is well suited to this type of analysis. How can I do this in statsmodels?
The formula of what I am looking for is the standard decomposition SS_E = SS_LF + SS_PE, where SS_PE = sum_i sum_j (y_ij - ybar_i)^2 is the pure error of the replicate y values around their mean at each distinct x_i, and SS_LF = sum_i n_i*(ybar_i - yhat_i)^2 is the lack-of-fit error of the group means around the fitted values.
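I am not aware of a statsmodels results attribute that returns these two pieces directly, but they can be computed from the fitted values with a group-by over the distinct x values. Here is a minimal sketch; the DataFrame, column names, and example numbers are placeholders, not the original data:

import pandas as pd
import statsmodels.api as sm

# Hypothetical replicate data: several y observations at each distinct x.
df = pd.DataFrame({"x": [1, 1, 2, 2, 3, 3, 4, 4],
                   "y": [2.1, 1.9, 3.2, 2.8, 3.9, 4.3, 5.2, 4.8]})

X = sm.add_constant(df["x"])
res = sm.OLS(df["y"], X).fit()
df["fitted"] = res.fittedvalues

groups = df.groupby("x")
# Pure error: spread of the replicates around their own group mean.
ss_pe = ((df["y"] - groups["y"].transform("mean")) ** 2).sum()
# Lack of fit: distance of the group means from the fitted regression values.
ss_lf = (groups.size() * (groups["y"].mean() - groups["fitted"].mean()) ** 2).sum()

print(ss_pe, ss_lf, ss_pe + ss_lf, res.ssr)   # ss_pe + ss_lf should equal res.ssr

For a formal lack-of-fit F test in the simple-regression case, compare ss_lf / (m - 2) with ss_pe / (n - m), where m is the number of distinct x values and n the total number of observations.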

Related

Getting covariance matrix in Spark Linear Regression

I have been looking into Spark's documentation but still couldn't find how to get the covariance matrix after doing linear regression.
Given input training data, I did a very simple linear regression similar to this:
val lr = new LinearRegression()
val fit = lr.fit(training)
Getting the regression parameters is as easy as fit.coefficients, but there seems to be no information on how to get the covariance matrix.
And just to clarify, I am looking for a function similar to vcov in R. With it, I should be able to do something like vcov(fit) to get the covariance matrix. Any other methods that help achieve this are fine too.
EDIT
How to get the covariance matrix from a linear regression is discussed in detail here. The residual variance is easy to get, as it is provided by fit.summary.meanSquaredError. However, the factor (X'X)^-1 is hard to get. It would be interesting to see if these can somehow be combined to calculate the covariance matrix.
Although the whole covariance matrix is collected on the driver, there is no public way to obtain it without writing your own solver. You can do that by copying WLS and adding extra getters.
The closest you can get without digging into the code is lrModel.summary.coefficientStandardErrors, which is based on the diagonal of the inverted matrix A^T * W * A, which in turn is based on the upper-triangular (covariance) matrix.
I don't think that is enough, sorry about that.
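For completeness, the formula from the edit above can be evaluated locally once the pieces are available on the driver. This is only a numpy sketch of the math, not a Spark API; X_local and y_local are assumed placeholders for the collected design matrix (with an intercept column) and response:

import numpy as np

# Assumed placeholders: a small design matrix with an intercept column and the
# response vector, both collected locally from the training data.
X_local = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y_local = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

n, p = X_local.shape
XtX_inv = np.linalg.inv(X_local.T @ X_local)    # (X'X)^-1
beta = XtX_inv @ X_local.T @ y_local            # OLS coefficients
resid = y_local - X_local @ beta
sigma2 = resid @ resid / (n - p)                # residual variance estimate
vcov = sigma2 * XtX_inv                         # covariance matrix of the coefficients

print(np.sqrt(np.diag(vcov)))                   # coefficient standard errors

For weighted least squares the same formula applies with X'WX in place of X'X, which is the matrix the answer above refers to as A^T * W * A.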

Identify parameters of ARIMA model

I am trying to build an ARIMA model. I have 144 terms in my standardized time series, which represent the residuals from the original time series. These residuals, on which I would like to build the ARIMA model, were obtained by subtracting the linear trend and the periodic component from the original series, so they are the stochastic component.
Because of that subtraction I model the residuals as a stationary series (d = 0), so the model is ARIMA(p,d,q) = ARIMA(?,0,?).
The ACF and PACF of my residuals are not as clear-cut as the textbook cases for identifying ARIMA models, and when I choose p and q as the last lags outside the confidence interval, I get p = 109 and q = 97. Matlab gives me an error for this case:
Error using arima/estimate (line 386)
Input response series has an insufficient number of observations.
On the other hand, when I look at only the first N/4 of the time series to identify the p and q parameters, I get p = 36 and q = 34. Matlab gives me this error:
Warning: Nonlinear inequality constraints are active; standard errors may be inaccurate.
In arima.estimate at 1113
Error using arima/validateModel (line 1306)
The non-seasonal autoregressive polynomial is unstable.
Error in arima/setLagOp (line 391)
Mdl = validateModel(Mdl);
Error in arima/estimate (line 1181)
Mdl = setLagOp(Mdl, 'AR' , LagOp([1 -coefficients(iAR)' ], 'Lags', [0 LagsAR ]));
How do I correctly identify the p and q parameters, and what is wrong here? And what does it mean in this partial autocorrelation diagram; why are the last values so big?
This guide contains a lot of useful information about the correct estimation of ARIMA p and q parameters.
As far as I can remember from my studies, since the ACF tails off after lag q - p and the PACF tails off after lag p - q, the correct identification of the p and q orders is not always straightforward, and even the best practices provided by the above guide may not be enough to point you in the right direction.
Usually, a fail-safe approach is to apply an information criterion (like AIC, BIC or FPE) to several models with different p and q orders. The model with the smallest value of the criterion is the preferred one. Say your maximum desired p and q order is 6 and that k is the number of observations; you could proceed as follows:
ll = zeros(6);                      % log-likelihood for each (p,q) combination
pq = zeros(6);                      % number of AR plus MA coefficients
for p = 1:6
    for q = 1:6
        mod = arima(p,0,q);
        [fit,~,fit_ll] = estimate(mod,Y,'print',false);
        ll(p,q) = fit_ll;
        pq(p,q) = p + q;
    end
end
ll = reshape(ll,36,1);
pq = reshape(pq,36,1);
[~,bic] = aicbic(ll,pq+1,k);        % BIC for each fitted model
bic = reshape(bic,6,6);
Once this is done, use the indices returned by the min function to find the optimal p and q orders.
On a side note, regarding your errors: the first one is pretty straightforward and self-explanatory. The second one basically means that a correct estimation of that model is not possible.

Kalman Filter prediction error estimation: why two constants and transposed matrices?

Hi everybody!
I have found a very informative tutorial for understanding the Kalman filter. Eventually I would like to understand the Extended Kalman Filter in the second half of the tutorial, but first I want to clear up every mystery.
Kalman Filter tutorial Part 6.
I think we use the constant a for the prediction error because the new value at a given time step k can be different from the previous one. But why do we multiply by the constant twice? It says:
we multiply twice by a because the prediction error pk is itself a squared error; hence, it is scaled by the square of the coefficient associated with the state value xk.
I can't see the meaning of this sentence.
And later, in the EKF part (Part 12), he multiplies by a matrix and also by its transpose. Why the transposed one?
Thanks a lot.
The Kalman filter maintains error estimates as variances, which are squared standard deviations. When you multiply a Gaussian random variable N(x, p) by a constant a, you scale its standard deviation by a, which means its variance scales by a^2. He writes this as a*p*a to maintain a parallel structure with the matrix case: if you have an error covariance matrix P representing the state x, then the error covariance of Ax is A*P*A^T, as he shows in Part 12. It's a convenient shorthand for doing that calculation. You can expand the matrix multiplication by hand to see that the coefficients all end up in the right places.
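If it helps to see this numerically, here is a quick standalone sanity check (a numpy sketch, not part of the tutorial) that Var(a*x) equals a*p*a and that the covariance of A*x equals A*P*A^T:

import numpy as np

rng = np.random.default_rng(0)

# Scalar case: x ~ N(0, p).  Var(a*x) should equal a*p*a.
p, a = 2.0, 3.0
x = rng.normal(0.0, np.sqrt(p), size=1_000_000)
print(np.var(a * x), a * p * a)                 # both close to 18

# Matrix case: x ~ N(0, P).  Cov(A x) should equal A P A^T.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), P, size=1_000_000)
print(np.cov((samples @ A.T).T))                # empirical covariance of A x
print(A @ P @ A.T)                              # analytic covariance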
If any of this is fuzzy to you, I strongly recommend reading a tutorial on Gaussian random variables. Between x and P in a Kalman filter, your success depends much more on your understanding of P than of x, even though most people start out interested in improving x.

Normalize in Adaboost without numerical error - Matlab

I'm implementing AdaBoost on Matlab. This algorithm requires that in every iteration the weights of each data point in the training set sum up to one.
If I simply use the normalization v = v / sum(v), I get a vector whose 1-norm is 1 up to some numerical error, which later causes the algorithm to fail.
Is there a Matlab function for normalizing a vector so that its 1-norm is EXACTLY 1?
Assuming you want identical values to be normalised by the same factor, this is not possible in general. A simple counterexample:
v = ones(21,1);
v = v / sum(v);
sum(v) - 1      % not exactly zero: 1/21 cannot be represented exactly in binary floating point
One common way to deal with it is to enforce sum(v) >= 1 or sum(v) <= 1, if your algorithm can tolerate a deviation to one side:
if sum(v) > 1
    v = v - eps(v);
end
Alternatively you can try using vpa, but this will drastically increase your computation time.

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to take the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work where glmfit did not in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class lie on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
y = 1 / (1 + exp(-(w0 + w1*x)))
This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator goes to negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
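To see why the weights blow up, here is a small illustration with made-up, perfectly separable 1-D data (a numpy sketch, not the poster's data): the log-likelihood keeps improving as the weights are scaled up, so there is no finite maximum-likelihood solution.

import numpy as np

# Made-up, perfectly separable 1-D data: class 0 below 0.5, class 1 above it.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def log_likelihood(w0, w1):
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))    # logistic function
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Scale up a decision boundary at x = 0.5; the log-likelihood keeps improving.
for scale in [1, 5, 20, 100]:
    print(scale, log_likelihood(-0.5 * scale, 1.0 * scale))

The log-likelihood is negative and approaches its supremum of 0 only as the weights go to infinity, which is why glmfit reports huge coefficients and huge standard errors.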
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.