Compute the training error and test error in libsvm + MATLAB - matlab

I would like to draw learning curves for a given SVM classifier. To do this, I need to compute the training, cross-validation and test errors, and then plot them while varying some parameter (e.g., the number of instances m).
How can I compute the training, cross-validation and test errors with libsvm when it is used from MATLAB?
I have seen other answers (see example) that suggest solutions for other languages.
Isn't there a compact way of doing it?

Given a set of instances described by:
a set of feature vectors featureVectors;
their corresponding labels (e.g., either 0 or 1),
if a model was previously trained with libsvm, the mean squared error (MSE) can be computed as follows:
[predictedLabels, accuracy, ~] = svmpredict(labels, featureVectors, model,'-q');
MSE = accuracy(2);
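Notice that predictedLabels contains the labels predicted by the classifier for the given instances. For a classification model, accuracy(1) holds the classification accuracy in percent, so the classification error is 100 - accuracy(1).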
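As a rough sketch of how the learning curve itself could be built on top of this, the loop below trains on growing subsets and records training, cross-validation and test error. It assumes the libsvm MATLAB interface (svmtrain and svmpredict) is on the path ahead of any toolbox function with the same name. The synthetic data, the variable names (Xtrain, ytrain, Xtest, ytest, sizes) and the options '-t 2 -c 1' are illustrative assumptions, not part of the original question.

```matlab
% Synthetic two-class data (assumption: replace with your own features/labels).
rng(0);
X = [randn(300,2) + 1; randn(300,2) - 1];
y = [ones(300,1); zeros(300,1)];
perm = randperm(600);                 % shuffle so every subset contains both classes
X = X(perm,:);  y = y(perm);
Xtrain = X(1:400,:);   ytrain = y(1:400);
Xtest  = X(401:end,:); ytest  = y(401:end);

sizes    = 40:40:400;                 % training-set sizes m to evaluate
trainErr = zeros(size(sizes));
cvErr    = zeros(size(sizes));
testErr  = zeros(size(sizes));

for k = 1:numel(sizes)
    m  = sizes(k);
    Xm = Xtrain(1:m,:);
    ym = ytrain(1:m);

    % Train an RBF-kernel SVM on the first m instances.
    model = svmtrain(ym, Xm, '-t 2 -c 1 -q');

    % accuracy(1) is the classification accuracy in percent.
    [~, accTrain, ~] = svmpredict(ym, Xm, model, '-q');
    [~, accTest,  ~] = svmpredict(ytest, Xtest, model, '-q');

    % With '-v 5', svmtrain returns the 5-fold cross-validation accuracy (a scalar).
    accCV = svmtrain(ym, Xm, '-t 2 -c 1 -v 5 -q');

    trainErr(k) = 100 - accTrain(1);
    cvErr(k)    = 100 - accCV;
    testErr(k)  = 100 - accTest(1);
end

plot(sizes, trainErr, '-o', sizes, cvErr, '-x', sizes, testErr, '-s');
xlabel('number of training instances m');
ylabel('classification error (%)');
legend('training', 'cross-validation', 'test');
```

The errors are expressed in percent because that is the unit libsvm reports for accuracy.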

Related

How to find the training accuracy using fitcsvm?

I would like to find the predicted labels of the data point feature vectors while training the classifier. I am using Mdl = fitcsvm(train_data,train_labels) in MATLAB. Mdl is composed of properties, but none of them corresponds to the training accuracy. Is there any way to find it?
You can apply cross-validation, which estimates the generalization error rather than the training error:
xval = crossval(Mdl,'KFold',10);
kfoldLoss(xval)
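If the training (resubstitution) accuracy itself is what is needed, a minimal sketch using resubLoss and resubPredict might look like the following. The synthetic train_data and train_labels are placeholders added here only so the snippet runs on its own.

```matlab
% Placeholder data (assumption): two Gaussian blobs with labels 0 and 1.
rng(0);
train_data   = [randn(50,2) + 1; randn(50,2) - 1];
train_labels = [ones(50,1); zeros(50,1)];

Mdl = fitcsvm(train_data, train_labels);

% Misclassification rate on the data the model was trained on.
trainError    = resubLoss(Mdl);
trainAccuracy = 1 - trainError;

% Labels predicted for the training points themselves.
predictedTrainLabels = resubPredict(Mdl);
```

The training accuracy is usually optimistic, so it is best read alongside the cross-validated loss.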

In Matlab, what does it mean to use GMM as a posterior distribution to make a supervised classifier inspired by GMM? Suggested by podludek and lejlot

I understand that GMM is not a classifier itself, but I am trying to follow the instructions of some users in this stack exchange post below to create a GMM-inspired classifier.
lejlot: Multiclass classification using Gaussian Mixture Models with scikit learn
"construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier"
What is meant by "assigned probability" for GMM Matlab objects in the above quote and how can we input a new point to get our desired assigned probability? For a new point that we are trying to classify, my understanding is that we need to get the posterior probabilities that the new point belongs to either Gaussian and then compare these two probabilities.
It looks from the documentation https://www.mathworks.com/help/stats/gmdistribution.html
like we only have access to cluster center mu's and covariance matrices (sigma) but not an actual probability distribution that would take in a point and spit out a probability
podludek: Multiclass classification using Gaussian Mixture Models with scikit learn
"GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem.....You should use GMM as a posterior distribution, one GMM per each class." -
In the Matlab documentation for posterior(gm,X), the tutorial shows us inputting X, which is already the data we used to create ("train") our GMM. But how can we get the posterior probability of being in a cluster for a new point?
https://www.mathworks.com/help/stats/gmdistribution.posterior.html
"P = posterior(gm,X) returns the posterior probability of each Gaussian mixture component in gm given each observation in X"
--> But the X used in the link above is the 'training' data used to create the GMM itself, not a new point. Also we have two gm objects, not one. How can we grab the probability a point belongs to a Gaussian?
The pseudocode below is how I envisioned a GMM-inspired classifier would go for a two-class example: I would fit GMMs to the individual classes as described by podludek. Then, I would compute the posterior probabilities of a point being in each cluster and pick the bigger probability.
I'm aware there are issues with this conceptually (such as the two GMM objects having conflicting covariance matrices) but I've been assured by my mentor that there is a way to make a supervised version of GMM, and he wants me to make one, so here we go:
Pseudocode:
X % The training data matrix
% each new row is a new data point
% each column is new feature
% Ex: if you had 10,000 data points and 100 features for each, your matrix
% would be 10000 by 100
% Let's say we had 200 points of each class in our training data
% Grab subsets of X that corresponds to classes 1 and 2
X_only_class_2 = X(1:200,:)
X_only_class_1 = X(201:end,:)
gmfit_class_1 = fitgmdist(X_only_class_1,1,'RegularizationValue',0.1);
cov_matrix_1=gmfit_class_1.Sigma;
gmfit_class_2 = fitgmdist(X_only_class_2,1,'RegularizationValue',0.1);
cov_matrix_2=gmfit_class_2.Sigma;
% Now do some tests on data we already know the classification of to check if this is working as we would expect:
a = posterior(gmfit_class_1,X_only_class_1)
b = posterior(gmfit_class_1,X_only_class_2)
c = posterior(gmfit_class_2,X_only_class_1)
d = posterior(gmfit_class_2,X_only_class_2)
But unfortunately, computing these posteriors a, b, c, and d just results in column vectors of 1's. I'm aware these are degenerate cases (and pointless for actual classification since we already know the classifications of our training data) but I still wanted to test them to make sure the posterior method is working as I would expect.
Expected:
a = posterior(gmfit_class_1,X_only_class_1)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
b = posterior(gmfit_class_1,X_only_class_2)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
c = posterior(gmfit_class_2,X_only_class_1)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
d = posterior(gmfit_class_2,X_only_class_2)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
I have to think that somehow Matlab is being confused by how in both gmm fit models, there is only one cluster in each. Either that or I am not interpreting the posterior method correctly.
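One possible reading of the behaviour above: posterior(gm,X) returns the probability of each component within a single mixture, so for a one-component model it is 1 for every point, including new ones. A sketch of the "one GMM per class plus Bayes' theorem" idea is to evaluate each class model's density with pdf(gm,X) and normalise across classes. The synthetic data, the equal priors and the names x_new, lik1, lik2 and post1 are assumptions made for illustration.

```matlab
% Synthetic, well-separated classes (assumption: stand-ins for the real data).
rng(1);
X_only_class_1 = randn(200,2) + 2;
X_only_class_2 = randn(200,2) - 2;

gmfit_class_1 = fitgmdist(X_only_class_1, 1, 'RegularizationValue', 0.1);
gmfit_class_2 = fitgmdist(X_only_class_2, 1, 'RegularizationValue', 0.1);

% Class priors (assumption: equal; could instead use class frequencies).
prior1 = 0.5;
prior2 = 0.5;

% New points to classify, one per row.
x_new = [1.5 2.5; -2 -1];

% Class-conditional densities p(x | class).
lik1 = pdf(gmfit_class_1, x_new);
lik2 = pdf(gmfit_class_2, x_new);

% Bayes' theorem: p(class 1 | x) = p(x | 1) p(1) / sum over classes.
post1 = (prior1 * lik1) ./ (prior1 * lik1 + prior2 * lik2);
post2 = 1 - post1;

% Assign each point to the class with the larger posterior.
predictedClass = 1 + (post2 > post1);
```

For points far from both classes the densities can underflow, in which case comparing log-densities would be a safer variant.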

Matlab predict function not working

I am trying to train a linear SVM on data with 100 dimensions. I have 80 instances for training. I train the SVM using the fitcsvm function in MATLAB and check it by calling predict on the training data. When I classify the training data with the SVM, all the data points are classified into only one class.
SVM = fitcsvm(votes,b,'ClassNames',unique(b)');
predict(SVM,votes);
This outputs all 0's, which corresponds to class 0. b contains 1's and 0's indicating the class to which each data point belongs.
The data used, i.e. the matrix votes and the vector b, are given at the following link
Make sure you use a non-linear kernel, such as a Gaussian kernel, and that the parameters of the kernel are tuned. Just as a starting point:
SVM = fitcsvm(votes,b,'KernelFunction','RBF', 'KernelScale','auto');
bp = predict(SVM,votes);
That said, you should split your data into a training set and a testing set; otherwise you cannot tell whether the model is overfitting.
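A hedged sketch of such a split, using cvpartition for a stratified hold-out, could look like this. The synthetic votes and b below only mimic the shapes described in the question (80 instances, 100 dimensions), and the 25% hold-out fraction is an arbitrary choice.

```matlab
% Placeholder data (assumption): 80 instances, 100 features, labels 0/1.
rng(0);
votes = randn(80, 100);
b     = double(sum(votes(:,1:5), 2) > 0);

% Hold out 25% of the instances for testing, stratified by class.
cv       = cvpartition(b, 'HoldOut', 0.25);
trainIdx = training(cv);
testIdx  = test(cv);

SVM = fitcsvm(votes(trainIdx,:), b(trainIdx), ...
    'KernelFunction', 'RBF', 'KernelScale', 'auto');

% Misclassification rates on the training and held-out parts.
trainErr = resubLoss(SVM);
testErr  = loss(SVM, votes(testIdx,:), b(testIdx));
```

A test error much larger than the training error is the usual sign of overfitting.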

leave-one-out regression using lasso in Matlab

I have 300 data samples, each with around 4000 features. Each input has a 5-dimensional output whose values are in the range of -2 to 2. I am trying to fit a lasso model to it. I went through a few posts which talk about cross-validation strategies like this one: Leave one out cross validation algorithm in matlab
But I saw that lasso does not support leaveout in Matlab! http://www.mathworks.com/help/stats/lasso.html
How can I train a model using leave one out cross validation and fit a model using lasso on my dataset? I am trying to do this in matlab. I would like to get a set of weights which I will be able to use for future predictions on other data.
I tried using glmnet: http://www.stanford.edu/~hastie/glmnet_matlab/intro.html but I couldn't compile it on my machine due to lack of proper mex compiler.
Any solutions to my problem? Thanks :)
EDIT
I am also trying to use lasso function in-built with MATLAB. It has an option to perform cross validation. It outputs B and Fit Statistics, where B is Fitted coefficients, a p-by-L matrix, where p is the number of predictors (columns) in X, and L is the number of Lambda values.
Now given a new test sample, how can I calculate the output using this model?
You can use a leave-one-out approach regardless of your training method. As explained here, you can use crossvalind to split the data into training and test sets; set M = 1 for leave-one-out.
[Train, Test] = crossvalind('LeaveMOut', N, M)
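For the follow-up question about predicting on new samples, one way it could be sketched is a manual leave-one-out loop around lasso, followed by a final fit whose coefficients B and intercept are applied to new data. The sketch uses plain logical indexing instead of crossvalind, a single response column (each of the 5 outputs would need its own fit), and a fixed lambda; the synthetic data, the value of lambda and the helper predictNew are illustrative assumptions.

```matlab
% Small synthetic regression problem (assumption: stand-in for the real data).
rng(0);
n = 30;  p = 10;
X = randn(n, p);
y = 2*X(:,1) - X(:,3) + 0.1*randn(n, 1);

lambda = 0.1;                 % hypothetical regularization strength
sqErr  = zeros(n, 1);

for i = 1:n
    trainIdx    = true(n, 1);
    trainIdx(i) = false;      % leave sample i out

    [B, FitInfo] = lasso(X(trainIdx,:), y(trainIdx), 'Lambda', lambda);

    % Prediction = linear combination of predictors plus the intercept.
    yhat     = X(i,:) * B + FitInfo.Intercept;
    sqErr(i) = (y(i) - yhat)^2;
end
looMSE = mean(sqErr);         % leave-one-out estimate of the prediction error

% Final model on all data; reuse B and the intercept for future samples.
[B, FitInfo] = lasso(X, y, 'Lambda', lambda);
predictNew   = @(Xnew) Xnew * B + FitInfo.Intercept;
yNew         = predictNew(randn(3, p));
```

Repeating the loop for several candidate values of lambda and keeping the one with the lowest looMSE would give a leave-one-out model selection.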

Matlab - bug with linear discriminant analysis

I run
Y_testing_obtained = classify(X_testing, X_training, Y_training);
and the error I get is
Error using ==> classify at 246
The pooled covariance matrix of TRAINING must be positive definite.
X_training is a 1550 x 5 matrix. Can you please tell me what this error means, i.e. why it is appearing, and how to work around it?
Thanks
Explanation: When you run the function classify without specifying the type of discriminant function (as you did), Matlab uses Linear Discriminant Analysis (LDA). Without going into too much detail on LDA, the algorithm needs to calculate the pooled covariance matrix of X_training in order to solve an optimisation problem, and this matrix has to be positive definite (see Wikipedia: Positive-definite matrix). The underlying assumption is that your data is represented by a multivariate probability distribution, which always has a positive definite covariance matrix unless one or more variables are exact linear combinations of the others.
To solve your problem: It is possible that one of your variables is a linear combination of the others. You can try selecting a sensible subset of your variables, or perform Principal Component Analysis (PCA) on the training data and then classify using the first few principal components. Or, you could specify the type of discriminant function and choose one of the two naive Bayes classifiers, for example:
Y_testing_obtained = classify(X_testing, X_training, Y_training, 'diaglinear');
As a side note, you also need to have more observations (rows) than variables (columns), but in your case this is not the problem as you seem to have 1550 observations and 5 variables.
Finally, you can also have a look at the answers posted to a similar question on the Matlab forum.
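To check whether linear dependence is the cause, a minimal sketch is to build the pooled within-class covariance by hand and inspect its rank. The synthetic X_training below deliberately contains a dependent fifth column so the check fires; the data, labels and the name isRankDeficient are assumptions for illustration, and the computation is a plain reconstruction rather than the internal code of classify.

```matlab
% Synthetic training data (assumption) with a deliberately dependent column.
rng(0);
X_training      = randn(100, 4);
X_training(:,5) = X_training(:,1) + X_training(:,2);
Y_training      = [ones(50,1); 2*ones(50,1)];

classes = unique(Y_training);
n       = size(X_training, 1);
S       = zeros(size(X_training, 2));

% Pooled covariance: within-class scatter summed over classes,
% divided by (number of observations - number of classes).
for k = 1:numel(classes)
    Xk = X_training(Y_training == classes(k), :);
    S  = S + (size(Xk, 1) - 1) * cov(Xk);
end
S = S / (n - numel(classes));

% A rank below the number of variables means S is not positive definite.
isRankDeficient = rank(S) < size(S, 1);
```

If the check is true, dropping or combining the dependent variables is one way to proceed.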
Try regularizing the discriminant model; the cvshrink function in Matlab can help choose the regularization parameters.