Getting the covariance matrix in Spark Linear Regression - Scala

I have been looking through Spark's documentation but still couldn't find how to get the covariance matrix after fitting a linear regression.
Given input training data, I did a very simple linear regression similar to this:
val lr = new LinearRegression()
val fit = lr.fit(training)
Getting the regression coefficients is as easy as fit.coefficients, but there seems to be no information on how to get the covariance matrix.
And just to clarify, I am looking for a function similar to vcov in R, so that I can do something like vcov(fit) to get the covariance matrix. Any other method that achieves this is fine too.
EDIT
How to get the covariance matrix from a linear regression is discussed in detail here: it is sigma^2 * (X'X)^-1. The error variance sigma^2 is easy to estimate, since fit.summary.meanSquaredError is provided (up to a degrees-of-freedom correction). However, the (X'X)^-1 part is hard to get. It would be interesting to see if this can somehow be used to calculate the covariance matrix.

Although the whole covariance matrix is collected on the driver, there is no way to obtain it without writing your own solver. You can do that by copying WLS and adding extra "getters".
The closest you can get without digging into the code is lrModel.summary.coefficientStandardErrors, which is based on the diagonal of the inverse of (A^T * W * A), which in turn is derived from its upper-triangular factor (essentially the covariance matrix).
I don't think that is enough, sorry about that.
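For reference, the relationship described in the EDIT can be checked numerically. Below is a small MATLAB-style sketch (not Spark code, and not part of the original answer) showing that the coefficient covariance matrix is the residual variance times (X'X)^-1, and that the square roots of its diagonal match the coefficient standard errors:
% Numeric illustration only: vcov(betaHat) = sigma2 * inv(X'*X), where
% sigma2 is the residual variance (what fit.summary.meanSquaredError
% approximates, up to the degrees-of-freedom correction).
n = 200; X = [ones(n,1) randn(n,2)];        % design matrix with intercept
y = X*[1; 2; -0.5] + 0.3*randn(n,1);        % synthetic response
betaHat = X \ y;                            % least-squares fit
resid   = y - X*betaHat;
sigma2  = (resid'*resid) / (n - size(X,2)); % unbiased residual variance
vcov    = sigma2 * inv(X'*X);               % covariance matrix of the coefficients
stdErr  = sqrt(diag(vcov));                 % matches the coefficient standard errors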

Related

Numerical instability of a Kalman Filter in MATLAB

I am trying to run a standard Kalman Filter algorithm to calculate likelihoods, but I keep getting a problem of a non-positive-definite variance matrix when calculating the normal densities.
I've researched a little and seen that there may in fact be some numerical instability; I tried some numerical ways to avoid a non-positive-definite matrix, using both the Cholesky decomposition and its variant, the LDL' decomposition.
I am using MATLAB.
Does anyone suggest anything?
Thanks.
I have encountered what might be the same problem before, when I needed to run a Kalman filter over long periods: over time my covariance matrix would degenerate. It might just be a problem of losing symmetry due to numerical error. One simple way to force your covariance matrix (let's call it P) to remain symmetric is to do:
P = (P + P')/2;  % where P' is transpose(P)
right after estimating P.
post your code.
As a rule of thumb, if the model is not accurate and the regularization (i.e. the process/model noise matrix Q) is not sufficiently "large", underfitting will occur and the covariance matrix of the estimator will be ill-conditioned. Try fine-tuning your Q matrix.
Kalman Filters implemented in the plain covariance form are known to be numerically fragile (even the Joseph-form update can struggle in single precision), as any old timer who once worked with a single-precision implementation of the filter can tell you. This problem was discovered long ago and prompted a lot of research into implementing the filter in a numerically stable manner. Probably the best-known implementation is the UD filter, where the covariance matrix is factorized as UDU' and the two factors are updated and propagated using special formulas (see Thornton and Bierman). U is a unit upper-triangular matrix (ones on its diagonal) and D is a diagonal matrix.
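As a middle ground before going all the way to a UD implementation, here is a minimal MATLAB sketch of one measurement update written in the Joseph form combined with the re-symmetrization trick from the earlier answer; all values are toy placeholders, not from the question:
% Toy 2-state filter with one scalar measurement (assumed values).
P = [1.0 0.1; 0.1 2.0];               % prior covariance
x = [0; 0];                           % prior state
H = [1 0];                            % measurement matrix
R = 0.5;                              % measurement noise variance
z = 0.3;                              % measurement

S = H*P*H' + R;                       % innovation covariance
K = (P*H') / S;                       % Kalman gain (solve, no explicit inv)
x = x + K*(z - H*x);                  % state update
I = eye(size(P,1));
P = (I - K*H)*P*(I - K*H)' + K*R*K';  % Joseph-form covariance update
P = (P + P')/2;                       % guard against round-off asymmetry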

Matlab: Determinant of the variance-covariance matrix

When solving the log-likelihood expression for autoregressive models, I came across the variance-covariance matrix Tau given on slide 9 of the Parameter estimation of time series tutorial. Now, in order to use
fminsearch
to maximize the likelihood function, I need to write out the likelihood expression in which the variance-covariance matrix appears. Can somebody please show, with an example, how I can implement (determinant of Gamma)^(-1/2)? Any example other than an autoregressive model will also do.
How about sqrt(det(Gamma)) for the square root of the determinant (so det(Gamma)^(-1/2) is just 1/sqrt(det(Gamma))), and inv(Gamma) for the inverse? (See the sketch at the end of this answer.)
But if you do not want to implement it yourself, you can look at the Yule-Walker AR estimator (aryule in MATLAB).
UPD: for estimating the autocovariance sequence, use xcov.
Also, this topic is explained in a bit more detail here.
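Here is a minimal sketch of how the det(Gamma)^(-1/2) term typically enters a (zero-mean) Gaussian log-likelihood that fminsearch can work with; how Gamma is built from your AR parameters is assumed and only stubbed out here:
% Negative log-likelihood of a zero-mean Gaussian: the det(Gamma)^(-1/2)
% factor becomes -0.5*log(det(Gamma)) on the log scale.
negloglik = @(x, Gamma) 0.5*( log(det(Gamma)) ...
                            + x' * (Gamma \ x) ...
                            + numel(x)*log(2*pi) );

% Toy usage with fminsearch: fit a single variance parameter s for
% Gamma = s*eye(n).  Replace this with your AR(p) parameterisation.
x0 = [0.4; -0.2; 0.7];                       % observed vector (toy data)
f  = @(s) negloglik(x0, exp(s)*eye(3));      % exp(s) keeps the variance positive
sHat = exp(fminsearch(f, 0));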

Scaling when sampling from a multivariate Gaussian

I have a data matrix A (with dependencies between columns) from which I estimate the covariance matrix S. I now want to use this covariance matrix to simulate a new matrix A_sim. Since I assume that the underlying data generator of A was Gaussian, I can simply sample from a Gaussian specified by S. I do that in MATLAB as follows:
A_sim = randn(size(A))*chol(S);
However, the values in A_sim are way larger than in A. If I scale down S by a factor of 100, A_sim looks much better. I am now looking for a way to determine this scaling factor in a principled way. Can anyone give advice or suggest literature that might be helpful?
MATLAB has the function mvnrnd, which generates multivariate normal random samples for you.
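If S really is cov(A), then randn(size(A))*chol(S) should already be on the right scale and only the column means would be missing; a blow-up on the order of 100x often suggests S was built from an unnormalised scatter matrix such as A'*A instead. Here is a sketch (with a toy stand-in for A, since the original data is not available) showing both mvnrnd and the manual chol route:
A  = randn(500, 3) * [2 0.5 0; 0 1 0.3; 0 0 3];              % toy stand-in for the data matrix
S  = cov(A);                                                 % unbiased covariance estimate
mu = mean(A, 1);                                             % column means
n  = size(A, 1);

A_sim1 = mvnrnd(mu, S, n);                                   % Statistics Toolbox route
A_sim2 = randn(n, size(A,2)) * chol(S) + repmat(mu, n, 1);   % manual route: chol(S)'*chol(S) == S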

Implementing Naïve Bayes algorithm in MATLAB - Need some guidance

I have a binary classification problem that I need to solve in MATLAB. There are two classes, and the training and testing samples are 2D coordinates drawn from Gaussian distributions.
The samples are something like these (1000 samples for class A and 1000 samples for class B); I am just posting some of them here:
5.867766 3.843014
5.019520 2.874257
1.787476 4.483156
4.494783 3.551501
1.212243 5.949315
2.216728 4.126151
2.864502 3.139245
1.532942 6.669650
6.569531 5.032038
2.552391 5.753817
2.610070 4.251235
1.943493 4.326230
1.617939 4.948345
If a new test point comes in, how should I classify it?
P(Class|TestPoint) is proportional to P(TestPoint|Class) * P(Class).
I am not sure how to compute P(Sample|Class) for the given 2D coordinates. Right now, I am using the formula
P(Coordinates|Class) = (Coordinates - mean for that class) / (standard deviation of points in that class).
However, I am not getting very good test results with this. Am I doing anything wrong?
That is the right approach; however, the formula is not correct. Look at the multivariate Gaussian distribution article on Wikipedia:
P(TestPoint|Class) = (2*pi)^(-k/2) * det(Sigma)^(-1/2) * exp(-1/2 * (x - mu) * inv(Sigma) * (x - mu)'),
where det(Sigma) is the determinant of the class covariance matrix Sigma, mu is the class mean, x is the test point, and k is the dimension (here k = 2).
% classPoint is N-by-2 (one sample per row), testPoint is 1-by-2
Sigma = cov(classPoint);               % class covariance (centred and normalised)
mu    = mean(classPoint, 1);           % class mean, 1-by-2
d     = testPoint - mu;
proba = 1/((2*pi)^(2/2)*sqrt(det(Sigma))) * ...
        exp(-0.5 * d * (Sigma \ d'));  % multivariate Gaussian density
In your case, since there are as many points in both classes, P(Class) = 1/2.
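A hypothetical end-to-end usage of the density above (toy data, and the names classA/classB are placeholders, not from the original post); with equal priors P(A) = P(B) = 1/2, comparing the two class-conditional densities directly is enough:
classA = randn(1000, 2) + repmat([2 4], 1000, 1);   % toy samples for class A
classB = randn(1000, 2) + repmat([5 3], 1000, 1);   % toy samples for class B
testPoint = [4.5 3.2];

% (2*pi)^(k/2) with k = 2 dimensions is just 2*pi.
gaussDensity = @(x, pts) 1/(2*pi*sqrt(det(cov(pts)))) * ...
    exp(-0.5*(x - mean(pts,1)) * (cov(pts) \ (x - mean(pts,1))'));

if gaussDensity(testPoint, classA) > gaussDensity(testPoint, classB)
    predictedClass = 'A';
else
    predictedClass = 'B';
end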
Assuming your formula is correctly applied, another issue could be the derivation of features from your data points. Your problem might not be suited for a linear classifier.

Matlab - bug with linear discriminant analysis

I run
Y_testing_obtained = classify(X_testing, X_training, Y_training);
and the error I get is
Error using ==> classify at 246
The pooled covariance matrix of TRAINING must be positive definite.
X_training is a 1550 x 5 matrix. Can you please tell me what this error means, i.e. why it is appearing, and how to work around it?
Thanks
Explanation: When you run the function classify without specifying the type of discriminant function (as you did), MATLAB uses Linear Discriminant Analysis (LDA). Without going into too much detail on LDA, the algorithm needs to calculate the pooled covariance matrix of X_training in order to solve an optimisation problem, and this matrix has to be positive definite (see Wikipedia: Positive-definite matrix). The underlying assumption is that your data is represented by a multivariate probability distribution, which always has a positive definite covariance matrix unless one or more variables are exact linear combinations of the others.
To solve your problem: it is possible that one of your variables is a linear combination of the others. You can try selecting a sensible subset of your variables, or perform Principal Component Analysis (PCA) on the training data and then classify using the first few principal components (a sketch of this is shown after this answer). Or, you could specify the type of discriminant function and choose one of the two naive Bayes classifiers, for example:
Y_testing_obtained = classify(X_testing, X_training, Y_training, 'diaglinear');
As a side note, you also need to have more observations (rows) than variables (columns), but in your case this is not the problem as you seem to have 1550 observations and 5 variables.
Finally, you can also have a look at the answers posted to a similar question on the MATLAB forum.
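Here is a minimal sketch of the PCA route mentioned above, using the question's variable names; pca is from the Statistics Toolbox, and keeping 3 components is just an assumption:
[coeff, score] = pca(X_training);                 % principal components of the training data
k = 3;                                            % keep the first few components (assumed)
Xtr_pca  = score(:, 1:k);
Xtst_pca = bsxfun(@minus, X_testing, mean(X_training, 1)) * coeff(:, 1:k);
Y_testing_obtained = classify(Xtst_pca, Xtr_pca, Y_training);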
Try regularizing the data using the cvshrink function in MATLAB.