The pooled covariance matrix of TRAINING must be positive definite. (lda classifier) - matlab

I have a problem with classification (LDA classifier).
I have 80 samples of training data (80x100) and 15 samples of testing data (15x100). The classify function returns the error: The covariance matrix of each group in TRAINING must be positive definite.

Without knowing what your data looks like, all I can do is suggest a few solutions that may solve your problem. A non-positive-definite covariance matrix can be produced by several different factors:
linear dependence between two or more columns (drop as many of the columns that produce the linear dependence as possible)
non-stationary data (in this case, you can use differences instead of levels, since differencing usually restores stationarity)
columns with badly mismatched magnitudes, for example one column with very large values and another with very small values (rescale your columns so that they all have roughly the same magnitude; see the sketch below).
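For instance, a minimal MATLAB sketch of how you could diagnose linear dependence and rescale the columns. The variable name X for the 80x100 training matrix is my assumption, not something from your question:
r = rank(X);                          % assumed: X is the 80x100 training matrix
fprintf('rank = %d out of %d columns\n', r, size(X, 2));
% r < 100 means some columns are linearly dependent. Also note that with only
% 80 rows the rank can never exceed 80, so a covariance estimate over all 100
% columns is singular no matter what; dropping columns helps here.
Xs = zscore(X);                       % rescale every column to zero mean / unit variance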

Related

Rank of matrix contradicts the number of independent columns

I have a 50x49 matrix A that has 49 linearly independent columns. However, my software (Octave) tells me its rank is 44.
Is this due to some computational error? If so, how can I prevent such errors?
If the software was able to correctly calculate rref(A), then why did it fail with rank(A)? Does it mean that calculating rank(A) is more error prone than calculating rref(A), or vice versa? I mean rref(A) actually tells you the rank, but here's a contradiction.
P.S. I've checked; Python makes the same error.
EDIT 1: Here is the matrix A itself. The first 9 columns were given. The rest was obtained with polynomial features.
EDIT 2: I was able to find a similar issue. Here is a 10x10 matrix B of rank 10 (and Octave calculates its rank correctly). However, Octave says that rank(B * B) = 9, which is impossible.
The distinction between an invertible matrix (i.e. full rank) and a non-invertible one is clear-cut in theory, but not so in practice. A matrix B with a large condition number (as in your example) can be inverted, but computing the inverse is numerically unstable. It roughly corresponds to B having a determinant that is "small" (using an appropriate, relative measure of "small"), so the matrix is almost singular. As a result, the inverse matrix will be computed with poor accuracy. In your example B, the condition number (computed with cond) is 2.069e9.
Another way to look at this: when the condition number is large, it could well be that B is "really" singular, but small numerical errors from previous computations make it look barely non-singular. So you can't be sure.
The rank and rref functions use different algorithms (singular-value decomposition for rank, Gauss-Jordan elimination with partial pivoting for rref). For well-behaved matrices the numerical errors are small in both cases, and the results are consistent. But for an ill-conditioned matrix the numerical errors can be large and different in each case, giving inconsistent results.
This is a well-known issue in numerical linear algebra. In general, avoid inverting matrices with a large condition number.
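Here is a short MATLAB/Octave illustration of the effect. The Hilbert matrix is just a convenient stand-in for an ill-conditioned matrix, not your matrix A or B:
B = hilb(12);                        % Hilbert matrix: theoretically full rank (12)
cond(B)                              % huge condition number, on the order of 1e16
rank(B)                              % SVD-based numerical rank comes out below 12
s = svd(B);
tol = max(size(B)) * eps(norm(B));   % the default tolerance used by rank()
sum(s > tol)                         % same count: singular values below tol are treated as zero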

How can I reduce/extract features from a set of matrices and vectors to be used in machine learning in MATLAB

I have a task where I need to train a machine learning model to predict a set of outputs from multiple inputs. My inputs are 1000 iterations of a set of 3x1 vectors, a set of 3x3 covariance matrices and a set of scalars, while my output is just a set of scalars. I cannot use the Regression Learner app because these inputs need to have the same dimensions. Any idea on how to unify them?
One possible way to solve this is to flatten the covariance matrix into a vector. Once you have done that, you can construct a 1000xN matrix, where 1000 is the number of samples in your dataset and N is the number of features. For example, if your features consist of a 3x1 vector, a 3x3 covariance matrix and, let's say, 5 other scalars, then N = 3 + 3*3 + 5 = 17. You then use this matrix to train an arbitrary model, such as a linear regressor or more advanced models like a tree.
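A rough sketch of that flattening step in MATLAB. The variable names vecs, covs and scalars are assumptions about how your data might be stored, not something from your question:
n = 1000;                          % number of samples
X = zeros(n, 3 + 9 + 5);           % 3x1 vector + flattened 3x3 matrix + 5 scalars
for i = 1:n
    v = vecs(:, i);                % 3x1 vector for sample i
    C = covs(:, :, i);             % 3x3 covariance matrix for sample i
    s = scalars(i, :);             % 1x5 row of extra scalars for sample i
    X(i, :) = [v', C(:)', s];      % one 1x17 feature row
end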
When training machine learning models, it is important to understand your data and exploit its structure to help the learning algorithm. For example, you could use the fact that a covariance matrix is symmetric and positive semi-definite, and thus lives in a closed convex cone. Symmetry implies that it lives in a subspace of the set of all 3x3 matrices: the space of 3x3 symmetric matrices has dimension only 6. You can use that knowledge to reduce redundancy in your data.
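For instance, a small variant of the sketch above that keeps only the 6 unique entries of each symmetric 3x3 covariance matrix (again with the assumed names covs and n):
mask = triu(true(3));              % logical mask selecting the upper triangle
Xsym = zeros(n, 6);
for i = 1:n
    C = covs(:, :, i);
    Xsym(i, :) = C(mask)';         % 6 unique entries instead of 9
end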

How to select the top 100 features (a subset) which are most relevant after PCA?

I performed PCA on a 63*2308 matrix and obtained a score and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308 in dimensions.
How do I extract the column names for the top 100 features which are most important, so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308), often referred to as lambda. You might need to use a different PCA function in Matlab to get them.
The eigenvalues indicate how much of your data each eigenvector explains. A simple method for selecting features would be to select the 100 features with the highest eigenvalues. This gives you a set of features which explains most of the variance in the data.
If you need to justify your approach for a write-up, you can actually calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained.
Bear in mind that selecting based solely on eigenvalue might not correspond to the set of features most important to your regression, so if you don't get the performance you expect you might want to try a different feature selection method, such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and see what methods they use.
A quick Matlab example of taking the top 100 principal components using PCA.
[eigenvectors, projected_data, eigenvalues] = princomp(X);        % princomp is the older name; newer releases use pca
[foo, feature_idx] = sort(eigenvalues, 'descend');                % princomp already returns them sorted, but this makes it explicit
selected_projected_data = projected_data(:, feature_idx(1:100)); % the 100 components with the largest eigenvalues
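And if you prefer the 95%-variance cut-off mentioned above, a small follow-up sketch using the same (assumed) variable names:
explained = cumsum(eigenvalues) / sum(eigenvalues);   % cumulative fraction of variance explained
n_keep = find(explained >= 0.95, 1);                  % smallest number of components reaching 95%
selected_projected_data = projected_data(:, 1:n_keep);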
Have you tried with
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be meaningless because the data is underspecified. As a rule of thumb, you should have at least three times as many observations as dimensions.
With 63 observations, you can at most define a 62 dimensional hyperspace!

Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represents a distinct feature.
In Matlab, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a matrix, i.e. so that the resulting sets have roughly the same distribution of each feature as in the original corpus?
By using a matrix A rather than grouped data, you are making the assumption that a random partition of your data will return a test and train set with the same column distributions.
In general, the assumption you are making in your question is that there is a partition of A such that each of the marginal distributions of A (one per column) is preserved in both the training and test sets. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on column 1 and use the row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);   % stratify on the first column only
test_mat = A(cv.test, :);                  % rows assigned to the held-out set
If they are correlated, you may need to go back and reconsider what you are trying to do.
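A quick way to check the column correlations before relying on that (corrcoef is a built-in; A is the matrix from the question):
R = corrcoef(A);   % pairwise correlation between the columns
disp(R)            % off-diagonal entries near zero suggest the columns are roughly independent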

Matlab - bug with linear discriminant analysis

I run
Y_testing_obtained = classify(X_testing, X_training, Y_training);
and the error I get is
Error using ==> classify at 246
The pooled covariance matrix of TRAINING must be positive definite.
X_training is a 1550 x 5 matrix. Can you please tell me what this error means, i.e. why is it appearing, and how to work around it?
Thanks
Explanation: When you run the function classify without specifying the type of discriminant function (as you did), Matlab uses Linear Discriminant Analysis (LDA). Without going into too much detail on LDA, the algorithm needs to calculate the pooled covariance matrix of X_training in order to solve an optimisation problem, and this matrix has to be positive definite (see Wikipedia: Positive-definite matrix). The underlying assumption is that your data is represented by a multivariate probability distribution, which always has a positive definite covariance matrix unless one or more variables are exact linear combinations of the others.
To solve your problem: It is possible that one of your variables is a linear combination of the others. You can try selecting a sensible subset of your variables, or perform Principal Component Analysis (PCA) on the training data and then classify using the first few principal components. Or, you could specify the type of discriminant function and choose one of the two naive Bayes classifiers, for example:
Y_testing_obtained = classify(X_testing, X_training, Y_training, 'diaglinear');
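A rough sketch of the PCA route mentioned above. The number of components k is an arbitrary choice, pca is the newer replacement for princomp, and on releases without implicit expansion you would need bsxfun for the subtraction:
[coeff, score] = pca(X_training);                        % principal components of the training data
k = 3;                                                   % how many components to keep (assumption)
Xtr = score(:, 1:k);                                     % training data in the reduced space
Xte = (X_testing - mean(X_training)) * coeff(:, 1:k);    % project the test data with the same transform
Y_testing_obtained = classify(Xte, Xtr, Y_training);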
As a side note, you also need to have more observations (rows) than variables (columns), but in your case this is not the problem as you seem to have 1550 observations and 5 variables.
Finally, you can also have a look at the answers posted to a similar question on the Matlab forum.
Try regularizing the classifier using the cvshrink function in Matlab.
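A hedged sketch of that route using the newer fitcdiscr interface; double-check the name-value pairs against the documentation for your release:
mdl = fitcdiscr(X_training, Y_training);         % linear discriminant model object
[err, gamma] = cvshrink(mdl, 'NumGamma', 20);    % cross-validated error over a grid of shrinkage values
[~, best] = min(err);
mdl.Gamma = gamma(best);                         % apply the best amount of regularization
Y_testing_obtained = predict(mdl, X_testing);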