measuring uncertainty in matlabs svmclassify - matlab

I'm doing contextual object recognition and I need a prior for my observations. e.g. this space was labeled "dog", what's the probability that it was labeled correctly? Do you know if matlabs svmclassify has an argument to return this level of certainty with it's classification?
If not, matlabs svm has the following structures in it:
SVM =
SupportVectors: [11x124 single]
Alpha: [11x1 double]
Bias: 0.0915
KernelFunction: #linear_kernel
KernelFunctionArgs: {}
GroupNames: {11x1 cell}
SupportVectorIndices: [11x1 double]
ScaleData: [1x1 struct]
FigureHandles: []
Can you think of any ways to compute a good measure of uncertainty from these? (Which support vector to use?) Papers/articles explaining uncertainty in SVMs welcome. More in depth explanations of matlabs SVM are also welcome.
If you can't do it this way, can you think of any other libraries with SVMs that have this measure of uncertainty?

Salam,
Hint: I would suggest that you modify the svmclassify.m function to pass the f values of the svmdecision.m to the user. This is in addition to the outputclass.
To access svmclassify.m, just type the following in the Matlab command line:
open svmclassify
I found that svmdecision.m (https://code.google.com/p/auc-recognition/source/browse/trunk/ALLMatlab/AucLib/DigitsOfflineNew/svmdecision.m?spec=svn260&r=260) can pass the f value; so the following may be substitude when calling the svmdecision.m in svmclassify.m:
[classified, f] = svmdecision(sample,svmStruct);
By further passing this f value to the user, the multi-class implementations such as one-vs-all may be designed with the binary classifier already built in Matlab.
f values are what you are looking for in terms of comparison of different inputs being related to their output class.
I hope this helps you to write your code and understand it! though you've probably solved your problem by now.

I know this was posted a long time ago, but I think the author is looking for a measure of uncertainty in the output of the trained SVM, whether the output is an estimated label or an estimated probability, which are both point estimates. One measure would be the variance of the output for the same input x. The problem is that regular SVMs are not stochastic models, such as probabilistic/Bayesian neural nets for example. The inference with an SVM is always the same, given the same x.
I would need to check, but perhaps it would be possible to train an SVM with stochastic regularization. Maybe some form of stochastic margin maximization, whereby during the steps of the optimization routine, input training vectors can undergo small stochastic perturbations or perhaps randomly chosen features could be dropped out. Similarly, during testing, one could drop or modify different randomly chosen features, producing different point estimates each time. Then, you could take the variance of the estimates, producing the measure of uncertainty.
It could be argued that, if the input x presents patterns that the model is not familiar with, the point estimates will be wildly unstable and variable.
One simple illustration can be the following: Imagine a 2d toy example, where the two classes are well separated and occupy a dense range of feature values. While a model trained on this set will generalize well to points that fall into the distribution/range of the values seen in the train data, it would be quite unstable given a test observation that sits far away from both classes, but at the same latitude, so to speak, as the separating hyperplane. Such an observations presents a challenge to the model since a tiny perturbation of one of the support vectors could cause a tiny rotation in the separating hyperplane, not significantly changing training or validation error, but potentially changing the estimate of the far-away test observation by a lot.

Are you supplying the data and doing the training yourself? If so, the best thing to do is to partition your data into training and test sets. Matlab has a function for this called cvpartition. You can use the classification results on the test data to estimate the rate of false positive and the miss rate. For a binary classification task, those two numbers will quantify the uncertainty. For a classification test with multiple hypotheses the best thing to do, probably, is to compile your results in a confusion matrix.
edit. found some older code I'd used that might help a little
P=cvpartition(Y,'holdout',0.20);
rbfsigma=1.41;
svmStruct=svmtrain(X(P.training,:),Y(P.training),'kernel_function','rbf','rbf_sigma',rbfsigma,'boxconstraint',0.7,'showplot','true');
C=svmclassify(svmStruct,X(P.test,:));
errRate=sum(Y(P.test)~=C)/P.TestSize
conMat=confusionmat(Y(P.test),C)

LIBSVM, which also has a Matlab interface, has an option -b that makes the classification function return probability estimates. They seem to be computed following the general approach of Platt (2000), which is to perform a one-dimensional logistic regression applied to the decision values.
Platt, J. C. (2000). Probabilities for SV machines. In Smola, A. J. et al. (eds.) Advances in large margin classifiers. pp. 61–74. Cambridge: The MIT Press.

Related

quadprog in MATLAB taking lot of time

My goal is to classify an image using multi class linear SVM (with out kernel). I would like to write my own SVM classifier
I am using MATLAB and have trained linear SVM using image sets provided.
I have around 20 classes, 5 images in each class (total of 100 images) and I am using one-versus-all strategy.
Each image is a (112,92) matrix. That means 112*92=10304 values.
I am using quadprog(H,f,A,C) to solve the quadratic equation (y=w'x+b) in the SVM. One call to quadprog returns w vector of size 10304 for one image. That means I have to call quadprog for 100 times.
The problem is one quadprog call takes 35 secs to get executed. That means for 100 images it will take 3500 secs. This might be due to large size of vectors and matrices involved.
I want to reduce the execution time of quadprog. Is there any way to do it?
First of all, when you do classification using SVM, you usually extract a feature (like HOG) of an image, so that the dimensionality of space on which SVM has to operate gets reduced. You are using raw pixel values, which generates a 10304-D vector. That is not good. Use some standard feature.
Secondly, you do not call quadprog 100 times. You call only once. The concept behind the optimization is, you want to find a weight vector w and a bias b such that it satisfies w'x_i+b=y_i for all images (i.e. all x_i). Note that i will go from 1 to the number of examples in your training set, but w and b stay the same. If you wanted a (w,b) that will satisfy only one x, you do not need any fancy optimization. So stack your x in a big matrix of dimension NxD, y will be a vector of Nx1, and then call quadprog. It will take a longer time than 35 seconds, but you do it only once. This is called training an SVM. While testing for a new image, you just extract its feature, and do w*x+b to get its class.
Third, unless you are doing this as an exercise to understand SVMs, quadprog is not the best way to solve this problem. You need to use some other techniques which work well with large data, for example, Sequential Minimal Optimization.
How can one linear hyperplane classify more than 2 classes: For multi-class classification, SVMs usually use two popular approaches:
One-vs-one SVM: For a C class problem, you train C(C-1)/2 classifiers and at test time, you test that many classifiers and choose the class which receives most votes.
One-vs-all SVM: As name suggests, you train one classifier per class with positive samples from that class and negative samples from all other classes.
From LIBSVM FAQs:
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines , IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
However, also note that a naive implementation of one-vs-one may not be practical for large-scale problems. LIBSVM website also lists this shortcoming and provides an extension.
LIBLINEAR does not support one-versus-one multi-classification, so we provide an extension here. If k is the number of classes, we generate k(k-1)/2 models, each of which involves only two classes of training data. According to Yuan et al. (2012), one-versus-one is not practical for large-scale linear classification because of the huge space needed to store k(k-1)/2 models. However, this approach may still be viable if model vectors (i.e., weight vectors) are very sparse. Our implementation stores models in a sparse form and can effectively handle some large-scale data.

Matlab fitcsvm Feature Coefficients

I'm running a series of SVM classifiers for a binary classification problem, and am getting very nice results as far as classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?
Intuitively, one can think of the absolute value of a feature weight as a measure of it's importance. However, this is not true in the general case because the weights symbolize how much a marginal change in the feature value would affect the output, which means that it is dependent on the feature's scale. For instance, if we have a feature for "age" that is measured in years, but than we change it to months, the corresponding coefficient will be divided by 12, but clearly,it doesn't mean that the age is less important now!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled your intuition is correct and in fact, there is a feature selection method that does just that: choosing the features with the highest absolute weight. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this is correct only to linear SVM.

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). But why I ask is because in other questions regarding one class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): It is proved that nu is an upper bound on the fraction of training errors and
a lower bound of the fraction of support vectors. In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace can by should.
So when you say that the resulting model does yield performance somewhat in line with what I would expect ... ofcourse it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as an outlier.

Interpretation of MATLAB's NaiveBayses 'posterior' function

After we created a Naive Bayes classifier object nb (say, with multivariate multinomial (mvmn) distribution), we can call posterior function on testing data using nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and the meaning of that, also cpre is the predicted class, based on the maximum over posterior probabilities for each class.
The question is about logp. It is clear how it is computed (logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how good a model is fitting. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (gaussian mixture, latent dirichlet allocation, etc).

Good results with NN, not with SVM; cause for concern?

I have painstakingly gathered data for a proof-of-concept study I am performing. The data consists of 40 different subjects, each with 12 parameters measured at 60 time intervals and 1 output parameter being 0 or 1. So I am building a binary classifier.
I knew beforehand that there is a non-linear relation between the input-parameters and the output so a simple perceptron of Bayes classifier would be unable to classify the sample. This assumption proved correct after initial tests.
Therefore I went to neural networks and as I hoped the results were pretty good. An error of about 1-5% is generally the result. The training is done by using 70% as training and 30% as evaluation. Running the complete dataset again (100%) through the model I was very happy with the results. The following is a typical confusion matrix (P = positive, N = negative):
P N
P 13 2
N 3 42
So I am happy and with the notion that I used a 30% for evaluation I am confident that I am not fitting noise.
Therefore I resolved to SVM for a double check and the SVM was unable to converge to a good solution. Most of the time the solutions are terrible (say 90% error...). Maybe I am not fully aware of SVM's or the implementations are not correct, but it troubles me because I thought that when NN provide a good solution, SVM's are most of the time better in seperating the data due to their maximum-margin hyperplane.
What does this say of my result? Am I fitting noise? And how do I know if this is a correct result?
I am using Encog for the calculations but the NN results are comparable to home-grown NN models I made.
If it is your first time to use SVM, I strongly recommend you to take a look at A Practical Guide to Support Vector Classication, by authors of a famous SVM package libsvm. It gives a list of suggestions to train your SVM classifier.
Transform data to the format of an SVM package
Conduct simple scaling on the data
Consider the RBF kernel
Use cross-validation to nd the best parameter C and γ
Use the best parameter C and γ
to train the whole training set
Test
In short, try scaling your data and carefully choosing the kernal plus the parameters.