Matlab fitcsvm Feature Coefficients - matlab

I'm running a series of SVM classifiers for a binary classification problem, and am getting very nice results as far as classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?

Intuitively, one can think of the absolute value of a feature weight as a measure of it's importance. However, this is not true in the general case because the weights symbolize how much a marginal change in the feature value would affect the output, which means that it is dependent on the feature's scale. For instance, if we have a feature for "age" that is measured in years, but than we change it to months, the corresponding coefficient will be divided by 12, but clearly,it doesn't mean that the age is less important now!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled your intuition is correct and in fact, there is a feature selection method that does just that: choosing the features with the highest absolute weight. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this is correct only to linear SVM.

Related

In Bayesian simulation, why use fixed value for the parameters, which has a prior, in the model?

When simulating a Bayesian model, are we supposed to treat the parameters as a random variable (a prior), but not use a fixed value?
For example, we have a Bayesian linear model y=x\beta+\epsilon. When simulating it, literature usually: 1. set regression coefficients at fixed values, e.g. (0,3,-2,1,0,...); 2. simulate the predictors many times; 3. simulate the error term, usually standard normal; 4. generate the response.
If the regression coefficients have a prior (assume they have exchangeable priors), and thus we have posterior distributions, why would we simulate only one set of regression coefficients values? This sounds like the posterior has a distribution, meaning that we don't believe in any fixed value, while the truth indeed is fixed value. Even the posterior mean is supposed to converge to the OLS estimate under good setups, but this still feels difficult to understand.

dfittool results interpretation

Does anyone know how to tell the difference between distributions (ie their goodness of fit) using the dfittool in Matlab? In a class I took forever ago, we learned about the log likelihood parameter and how to compare a pdf fitted to Gaussian vs gamma, etc. But right now, all the matlab help files online are like "it means something." Any assistance would be appreciated. Basically, I need to interpret the "results" in "edit fit" of the dfittool. I want to be able to compare my dfits to each other from the results, so I can pick the best fit for my analysis. I don't know what the difference is between a log likelihood of -111 vs -105.
Example below:
Distribution: Normal
Log likelihood: -110.954
Domain: -Inf < y < Inf
Mean: 101.443
Variance: 436.332
Parameter Estimate Std. Err.
mu 101.443 4.17771
sigma 20.8886 3.04691
Estimated covariance of parameter estimates:
mu sigma
mu 17.4533 6.59643e-15
sigma 6.59643e-15 9.28366
Thank you!
(Log) likelihood is a measure of the fit of a distribution to data, so the simple answer is: the distribution with the largest likelihood is the one that fits best. However, what you get here as an output is the maximized likelihood, i.e. the likelihood at those parameter values where it is maximal. Different families of distributions might be differently "flexible", so that it is easier to get a larger likelihood with one of them in general, so this limits comparability. This holds especially if you compare families with different numbers of parameters. A fix for this is to use formal model comparison, e.g. using the Bayes factor, which however is considerably more complex mathematically, or its approximation, the Bayesian information criterion.
More generally speaking however, it is seldomly a good idea to just randomly pick distributions and see how well they fit. It would be better to have some at least partially theoretically motivated idea why a distribution is a candidate. On the most basic level this means considering its definition range: the normal distribution is defined on the whole real line, the gamma distribution only for nonnegative real numbers. This way it should be possible to rule one of them out based on basic properties of your data.

SVM in Matlab: Meaning of Parameter 'box constraint' in function fitcsvm

I'm new to SVMs in Matlab and need a little bit of help with it.
I want to train a support vector machine using the build in function fitcsvm of the Statistics Toolbox.
Of course there are many parameter choices which control how the SVM will be trained.
The Matlab help is a litte bit wage about how the parameters archive a better training result. Especially the parameter 'Box Contraint' seems to have an important influence on the number of chosen support vectors and generalization quality.
The Help (http://de.mathworks.com/help/stats/fitcsvm.html#bt8v_z4-1) says
A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).
If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.
How exactly is this parameter used?
Is it the same or something like the soft margin factor C in the Wikipedia reference?
Or something completely different?
Thanks for your help.
You were definitely on the right path. While description in the documentation of fitcsvm (as you posted in the question) is very short, you should have a look at the Understanding Support Vector Machines site in the MATLAB documentation.
In the non-separable case (often called Soft-Margin SVM), one allows misclassifications, at the cost of a penalty factor C. The mathematical formulation of the SVM then becomes:
with the slack variables s_i which cause a penalty term which is weighted by C.
Making C large increases the weight of misclassifications, which leads to a stricter separation.
This factor C is called box constraint.
The reason for this name is, that in the formulation of the dual optimization problem, the Langrange multipliers are bounded to be within the range [0,C].
C thus poses a box constraint on the Lagrange multipliers.
tl;dr your guess was right, it is the C in the soft margin SVM.

What's the best way to calculate a numerical derivative in MATLAB?

(Note: This is intended to be a community Wiki.)
Suppose I have a set of points xi = {x0,x1,x2,...xn} and corresponding function values fi = f(xi) = {f0,f1,f2,...,fn}, where f(x) is, in general, an unknown function. (In some situations, we might know f(x) ahead of time, but we want to do this generally, since we often don't know f(x) in advance.) What's a good way to approximate the derivative of f(x) at each point xi? That is, how can I estimate values of dfi == d/dx fi == df(xi)/dx at each of the points xi?
Unfortunately, MATLAB doesn't have a very good general-purpose, numerical differentiation routine. Part of the reason for this is probably because choosing a good routine can be difficult!
So what kinds of methods are there? What routines exist? How can we choose a good routine for a particular problem?
There are several considerations when choosing how to differentiate in MATLAB:
Do you have a symbolic function or a set of points?
Is your grid evenly or unevenly spaced?
Is your domain periodic? Can you assume periodic boundary conditions?
What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
Does it matter to you that your derivative is evaluated on the same points as your function is defined?
Do you need to calculate multiple orders of derivatives?
What's the best way to proceed?
These are just some quick-and-dirty suggestions. Hopefully somebody will find them helpful!
1. Do you have a symbolic function or a set of points?
If you have a symbolic function, you may be able to calculate the derivative analytically. (Chances are, you would have done this if it were that easy, and you would not be here looking for alternatives.)
If you have a symbolic function and cannot calculate the derivative analytically, you can always evaluate the function on a set of points, and use some other method listed on this page to evaluate the derivative.
In most cases, you have a set of points (xi,fi), and will have to use one of the following methods....
2. Is your grid evenly or unevenly spaced?
If your grid is evenly spaced, you probably will want to use a finite difference scheme (see either of the Wikipedia articles here or here), unless you are using periodic boundary conditions (see below). Here is a decent introduction to finite difference methods in the context of solving ordinary differential equations on a grid (see especially slides 9-14). These methods are generally computationally efficient, simple to implement, and the error of the method can be simply estimated as the truncation error of the Taylor expansions used to derive it.
If your grid is unevenly spaced, you can still use a finite difference scheme, but the expressions are more difficult and the accuracy varies very strongly with how uniform your grid is. If your grid is very non-uniform, you will probably need to use large stencil sizes (more neighboring points) to calculate the derivative at a given point. People often construct an interpolating polynomial (often the Lagrange polynomial) and differentiate that polynomial to compute the derivative. See for instance, this StackExchange question. It is often difficult to estimate the error using these methods (although some have attempted to do so: here and here). Fornberg's method is often very useful in these cases....
Care must be taken at the boundaries of your domain because the stencil often involves points that are outside the domain. Some people introduce "ghost points" or combine boundary conditions with derivatives of different orders to eliminate these "ghost points" and simplify the stencil. Another approach is to use right- or left-sided finite difference methods.
Here's an excellent "cheat sheet" of finite difference methods, including centered, right- and left-sided schemes of low orders. I keep a printout of this near my workstation because I find it so useful.
3. Is your domain periodic? Can you assume periodic boundary conditions?
If your domain is periodic, you can compute derivatives to a very high order accuracy using Fourier spectral methods. This technique sacrifices performance somewhat to gain high accuracy. In fact, if you are using N points, your estimate of the derivative is approximately N^th order accurate. For more information, see (for example) this WikiBook.
Fourier methods often use the Fast Fourier Transform (FFT) algorithm to achieve roughly O(N log(N)) performance, rather than the O(N^2) algorithm that a naively-implemented discrete Fourier transform (DFT) might employ.
If your function and domain are not periodic, you should not use the Fourier spectral method. If you attempt to use it with a function that is not periodic, you will get large errors and undesirable "ringing" phenomena.
Computing derivatives of any order requires 1) a transform from grid-space to spectral space (O(N log(N))), 2) multiplication of the Fourier coefficients by their spectral wavenumbers (O(N)), and 2) an inverse transform from spectral space to grid space (again O(N log(N))).
Care must be taken when multiplying the Fourier coefficients by their spectral wavenumbers. Every implementation of the FFT algorithm seems to have its own ordering of the spectral modes and normalization parameters. See, for instance, the answer to this question on the Math StackExchange, for notes about doing this in MATLAB.
4. What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
For many purposes, a 1st or 2nd order finite difference scheme may be sufficient. For higher precision, you can use higher order Taylor expansions, dropping higher-order terms.
If you need to compute the derivatives within a given tolerance, you may want to look around for a high-order scheme that has the error you need.
Often, the best way to reduce error is reducing the grid spacing in a finite difference scheme, but this is not always possible.
Be aware that higher-order finite difference schemes almost always require larger stencil sizes (more neighboring points). This can cause issues at the boundaries. (See the discussion above about ghost points.)
5. Does it matter to you that your derivative is evaluated on the same points as your function is defined?
MATLAB provides the diff function to compute differences between adjacent array elements. This can be used to calculate approximate derivatives via a first-order forward-differencing (or forward finite difference) scheme, but the estimates are low-order estimates. As described in MATLAB's documentation of diff (link), if you input an array of length N, it will return an array of length N-1. When you estimate derivatives using this method on N points, you will only have estimates of the derivative at N-1 points. (Note that this can be used on uneven grids, if they are sorted in ascending order.)
In most cases, we want the derivative evaluated at all points, which means we want to use something besides the diff method.
6. Do you need to calculate multiple orders of derivatives?
One can set up a system of equations in which the grid point function values and the 1st and 2nd order derivatives at these points all depend on each other. This can be found by combining Taylor expansions at neighboring points as usual, but keeping the derivative terms rather than cancelling them out, and linking them together with those of neighboring points. These equations can be solved via linear algebra to give not just the first derivative, but the second as well (or higher orders, if set up properly). I believe these are called combined finite difference schemes, and they are often used in conjunction with compact finite difference schemes, which will be discussed next.
Compact finite difference schemes (link). In these schemes, one sets up a design matrix and calculates the derivatives at all points simultaneously via a matrix solve. They are called "compact" because they are usually designed to require fewer stencil points than ordinary finite difference schemes of comparable accuracy. Because they involve a matrix equation that links all points together, certain compact finite difference schemes are said to have "spectral-like resolution" (e.g. Lele's 1992 paper--excellent!), meaning that they mimic spectral schemes by depending on all nodal values and, because of this, they maintain accuracy at all length scales. In contrast, typical finite difference methods are only locally accurate (the derivative at point #13, for example, ordinarily doesn't depend on the function value at point #200).
A current area of research is how best to solve for multiple derivatives in a compact stencil. The results of such research, combined, compact finite difference methods, are powerful and widely applicable, though many researchers tend to tune them for particular needs (performance, accuracy, stability, or a particular field of research such as fluid dynamics).
Ready-to-Go Routines
As described above, one can use the diff function (link to documentation) to compute rough derivatives between adjacent array elements.
MATLAB's gradient routine (link to documentation) is a great option for many purposes. It implements a second-order, central difference scheme. It has the advantages of computing derivatives in multiple dimensions and supporting arbitrary grid spacing. (Thanks to #thewaywewalk for pointing out this glaring omission!)
I used Fornberg's method (see above) to develop a small routine (nderiv_fornberg) to calculate finite differences in one dimension for arbitrary grid spacings. I find it easy to use. It uses sided stencils of 6 points at the boundaries and a centered, 5-point stencil in the interior. It is available at the MATLAB File Exchange here.
Conclusion
The field of numerical differentiation is very diverse. For each method listed above, there are many variants with their own set of advantages and disadvantages. This post is hardly a complete treatment of numerical differentiation.
Every application is different. Hopefully this post gives the interested reader an organized list of considerations and resources for choosing a method that suits their own needs.
This community wiki could be improved with code snippets and examples particular to MATLAB.
I believe there is more in to these particular questions. So I have elaborated on the subject further as follows:
(4) Q: What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
A: The accuracy of numerical differentiation is subjective to the application of interest. Usually the way it works is, if you are using the ND in forward problem to approximate the derivatives to estimate features from signal of interest, then you should be aware of noise perturbations. Usually such artifacts contain high frequency components and by the definition of the differentiator, the noise effect will be amplified in the magnitude order of $i\omega^n$. So, increasing the accuracy of differentiator (increasing the polynomial accuracy) will no help at all. In this case you should be able to cancelt the effect of noise for differentiation. This can be done in casecade order: first smooth the signal, and then differentiate. But a better way of doing this is to use "Lowpass Differentiator". A good example of MATLAB library can be found here.
However, if this is not the case and you're using ND in inverse problems, such as solvign PDEs, then the global accuracy of differentiator is very important. Depending on what kind of bounady condition (BC) suits your problem, the design will be adapted accordingly. The rule of thump is to increase the numerical accuracy known is the fullband differentiator. You need to design a derivative matrix that takes care of suitable BC. You can find comprehensive solutions to such designs using the above link.
(5) Does it matter to you that your derivative is evaluated on the same points as your function is defined?
A: Yes absolutely. The evaluation of the ND on the same grid points is called "centralized" and off the points "staggered" schemes. Note that using odd order of derivatives, centralized ND will deviate the accuracy of frequency response of the differentiator. Therefore, if you're using such design in inverse problems, this will perturb your approximation. Also, the opposite applies to the case of even order of differentiation utilized by staggered schemes. You can find comprehensive explanation on this subject using the link above.
(6) Do you need to calculate multiple orders of derivatives?
This totally depends on your application at hand. You can refer to the same link I have provided and take care of multiple derivative designs.

measuring uncertainty in matlabs svmclassify

I'm doing contextual object recognition and I need a prior for my observations. e.g. this space was labeled "dog", what's the probability that it was labeled correctly? Do you know if matlabs svmclassify has an argument to return this level of certainty with it's classification?
If not, matlabs svm has the following structures in it:
SVM =
SupportVectors: [11x124 single]
Alpha: [11x1 double]
Bias: 0.0915
KernelFunction: #linear_kernel
KernelFunctionArgs: {}
GroupNames: {11x1 cell}
SupportVectorIndices: [11x1 double]
ScaleData: [1x1 struct]
FigureHandles: []
Can you think of any ways to compute a good measure of uncertainty from these? (Which support vector to use?) Papers/articles explaining uncertainty in SVMs welcome. More in depth explanations of matlabs SVM are also welcome.
If you can't do it this way, can you think of any other libraries with SVMs that have this measure of uncertainty?
Salam,
Hint: I would suggest that you modify the svmclassify.m function to pass the f values of the svmdecision.m to the user. This is in addition to the outputclass.
To access svmclassify.m, just type the following in the Matlab command line:
open svmclassify
I found that svmdecision.m (https://code.google.com/p/auc-recognition/source/browse/trunk/ALLMatlab/AucLib/DigitsOfflineNew/svmdecision.m?spec=svn260&r=260) can pass the f value; so the following may be substitude when calling the svmdecision.m in svmclassify.m:
[classified, f] = svmdecision(sample,svmStruct);
By further passing this f value to the user, the multi-class implementations such as one-vs-all may be designed with the binary classifier already built in Matlab.
f values are what you are looking for in terms of comparison of different inputs being related to their output class.
I hope this helps you to write your code and understand it! though you've probably solved your problem by now.
I know this was posted a long time ago, but I think the author is looking for a measure of uncertainty in the output of the trained SVM, whether the output is an estimated label or an estimated probability, which are both point estimates. One measure would be the variance of the output for the same input x. The problem is that regular SVMs are not stochastic models, such as probabilistic/Bayesian neural nets for example. The inference with an SVM is always the same, given the same x.
I would need to check, but perhaps it would be possible to train an SVM with stochastic regularization. Maybe some form of stochastic margin maximization, whereby during the steps of the optimization routine, input training vectors can undergo small stochastic perturbations or perhaps randomly chosen features could be dropped out. Similarly, during testing, one could drop or modify different randomly chosen features, producing different point estimates each time. Then, you could take the variance of the estimates, producing the measure of uncertainty.
It could be argued that, if the input x presents patterns that the model is not familiar with, the point estimates will be wildly unstable and variable.
One simple illustration can be the following: Imagine a 2d toy example, where the two classes are well separated and occupy a dense range of feature values. While a model trained on this set will generalize well to points that fall into the distribution/range of the values seen in the train data, it would be quite unstable given a test observation that sits far away from both classes, but at the same latitude, so to speak, as the separating hyperplane. Such an observations presents a challenge to the model since a tiny perturbation of one of the support vectors could cause a tiny rotation in the separating hyperplane, not significantly changing training or validation error, but potentially changing the estimate of the far-away test observation by a lot.
Are you supplying the data and doing the training yourself? If so, the best thing to do is to partition your data into training and test sets. Matlab has a function for this called cvpartition. You can use the classification results on the test data to estimate the rate of false positive and the miss rate. For a binary classification task, those two numbers will quantify the uncertainty. For a classification test with multiple hypotheses the best thing to do, probably, is to compile your results in a confusion matrix.
edit. found some older code I'd used that might help a little
P=cvpartition(Y,'holdout',0.20);
rbfsigma=1.41;
svmStruct=svmtrain(X(P.training,:),Y(P.training),'kernel_function','rbf','rbf_sigma',rbfsigma,'boxconstraint',0.7,'showplot','true');
C=svmclassify(svmStruct,X(P.test,:));
errRate=sum(Y(P.test)~=C)/P.TestSize
conMat=confusionmat(Y(P.test),C)
LIBSVM, which also has a Matlab interface, has an option -b that makes the classification function return probability estimates. They seem to be computed following the general approach of Platt (2000), which is to perform a one-dimensional logistic regression applied to the decision values.
Platt, J. C. (2000). Probabilities for SV machines. In Smola, A. J. et al. (eds.) Advances in large margin classifiers. pp. 61–74. Cambridge: The MIT Press.