How do I identify which features are being selected with LDA? - matlab

I have run LDA with MATLAB using the fitcdiscr function and predict.
I suspect there may be bugs in my code, so as a sanity check I would like to identify which features are weighted most heavily in the classification.
Can this be done?

The fitted object has a Coeffs property containing all the relevant information: http://uk.mathworks.com/help/stats/classificationdiscriminant-class.html
In particular, if you fit a linear discriminant, each Coeffs(i,j) entry has a Linear field holding the coefficients of the linear boundary between classes i and j. Bear in mind, however, that the coefficient values of a linear model are not feature importances. A weight can be large simply because the feature takes small values, or because its values are heavily skewed. If you need feature selection, use a feature selection method (such as an L1-regularized model); otherwise you can easily draw the wrong conclusions from your data.
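As a rough sanity check only, the coefficients can be inspected after putting all features on a common scale. The sketch below assumes the Statistics and Machine Learning Toolbox and uses the bundled fisheriris data as a stand-in for the asker's data; the variable names are illustrative.

```matlab
% Sketch: inspect LDA boundary coefficients on standardized features.
load fisheriris                        % example data: meas (150x4), species (150x1 cell)
Z = zscore(meas);                      % rescale so coefficient sizes are comparable
Mdl = fitcdiscr(Z, species);           % discriminant type is 'linear' by default
w = Mdl.Coeffs(1, 2).Linear;           % boundary between class 1 and class 2 (4x1)
[~, order] = sort(abs(w), 'descend');  % features ordered by absolute weight
disp(table(order, w(order), 'VariableNames', {'Feature', 'Weight'}));
```

Even on standardized data this ordering is a debugging aid rather than a feature importance measure, for the reasons given above.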

Related

Parameter selection of SVM

I have a dataset which I use for classification with libSVM in Matlab. The dataset consists of 4 classes.
For parameter selection of the SVM I can do nested cross-validation. The problem is that I also need the values of the best parameters in the end.
After having done the nested cross-validation and obtained the final accuracy, I want the values of the best parameters. Then I will train an SVM for each class (one-vs-all) with the best parameters in order to select the most important features (according to highest weight), i.e. a feature importance map.
How can I do this? Should I skip nested cross-validation and just loop over all parameters, doing cross-validation for each?
Second, if I use a linear SVM then using the weight vector w to assign importance to features works, but does it also work for a non-linear SVM (e.g. RBF kernel)?
To find the "best" parameters for your kernel of choice, you have to loop over all candidate parameter values, a so-called "grid search". The LIBSVM MATLAB interface has no built-in grid search mechanism.
Regarding your second question, I would suggest performing feature selection (e.g. Information Gain, Mutual Information, ...) as a pre-processing step before the actual work with the SVM, and in a second step taking the weight vectors into consideration (but I am not sure whether this will work with RBF or Gaussian kernels...).
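A grid search of this kind might look like the sketch below. It swaps LIBSVM for the Statistics and Machine Learning Toolbox functions templateSVM and fitcecoc so that it is self-contained, and it uses fisheriris (3 classes) in place of the 4-class dataset; the parameter grids are arbitrary assumptions.

```matlab
% Sketch: cross-validated grid search over C and the RBF kernel scale.
load fisheriris
rng(0);                                      % reproducible folds
cvp = cvpartition(species, 'KFold', 5);      % same folds for every grid point
Cs = 10.^(-1:2);                             % candidate box constraints (assumed grid)
sigmas = 10.^(-1:1);                         % candidate kernel scales (assumed grid)
cvErr = zeros(numel(Cs), numel(sigmas));
for i = 1:numel(Cs)
    for j = 1:numel(sigmas)
        t = templateSVM('KernelFunction', 'rbf', ...
            'BoxConstraint', Cs(i), 'KernelScale', sigmas(j));
        cvMdl = fitcecoc(meas, species, 'Learners', t, ...
            'Coding', 'onevsall', 'CVPartition', cvp);
        cvErr(i, j) = kfoldLoss(cvMdl);      % misclassification rate
    end
end
[bestErr, idx] = min(cvErr(:));
[bi, bj] = ind2sub(size(cvErr), idx);
bestC = Cs(bi);
bestSigma = sigmas(bj);
```

In a nested setup this loop would be the inner loop; running it once more on the full dataset is one way to obtain a single final pair of parameter values.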

How to see which Attribute (Feature) contributes most to the performance of the classification with PCA in Matlab?

I would like to perform classification on a small data set (65x9) using some of the Machine Learning classification methods (SVM, Decision Trees or any other).
So, before starting with the classification, I would like to do attribute analysis with PCA in Matlab or Weka (Matlab preferred). I would like to find out which attributes contribute most to the performance of the classifier, so that I can maybe drop some attributes and/or include more in the future. Can anyone point me to a PCA example for this in Matlab or Weka?
Thanks
PCA is an unsupervised feature extraction method.
If your question is about selecting attributes to feed into PCA, I don't know what your purpose is, but it is unnecessary to do that to improve classification performance. Just use all of the attributes. PCA will return components, i.e. linear combinations of the attributes, in order of decreasing variance.
If your question is about selecting components after PCA, you can choose a threshold (for example 0.95) and count how many components, starting from the first, are needed to reach it. Use the eigenvalues of the covariance matrix to compute the cumulative fraction of variance explained.
After running PCA, the first component is the one that explains the most variance, the second explains the most of what remains, and so on. This ordering ignores the class labels, so it does not tell you directly which original attribute helps the classifier most.
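The threshold rule above can be sketched as follows, assuming the pca function from the Statistics and Machine Learning Toolbox and using fisheriris as a stand-in for the 65x9 dataset. Note that pca reports explained variance in percent, so the 0.95 threshold becomes 95.

```matlab
% Sketch: keep the smallest number of components explaining >= 95% of variance.
load fisheriris
[coeff, score, latent, ~, explained] = pca(zscore(meas));
cumExplained = cumsum(explained);          % cumulative variance, in percent
k = find(cumExplained >= 95, 1);           % first component count reaching the threshold
reduced = score(:, 1:k);                   % data expressed in the first k components
loadings = abs(coeff(:, 1));               % weight of each original attribute in component 1
```

The loadings show how the original attributes combine into a component; they describe variance in the data, not usefulness for classification.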

Simple Sequential feature selection in Matlab

I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in Matlab. The Matlab example is complicated and I can't follow it. Even a few examples on SO didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
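A minimal sketch of that modification is given below. It assumes a recent Statistics and Machine Learning Toolbox and uses fitctree (the newer replacement for classregtree); the synthetic 40x30 data, the cap of 3 features, and the variable names are all illustrative assumptions.

```matlab
% Sketch: forward sequential feature selection with a decision tree.
rng(0);                                     % reproducible data and folds
X = randn(40, 30);                          % stand-in for the 40x3249 data
y = double(X(:, 3) - X(:, 7) > 0);          % labels driven by features 3 and 7
% Criterion: number of misclassified test samples for a tree trained on (XT, yT).
critfun = @(XT, yT, Xt, yt) sum(yt ~= predict(fitctree(XT, yT), Xt));
cvp = cvpartition(y, 'KFold', 5);
opts = statset('Display', 'off');
[inmodel, history] = sequentialfs(critfun, X, y, ...
    'cv', cvp, 'nfeatures', 3, 'options', opts);
selected = find(inmodel);                   % column indices of the chosen features
```

With so few samples the selection is unstable, so the chosen columns should be read with the caveats given above.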
This error comes from the use of the classify function in that question, which performs LDA. The error occurs when the data is rank deficient (in other words, some features are almost exactly correlated, or there are fewer samples than features). To overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in the Statistics Toolbox of Matlab.
[basis, ~, latent] = pca(X); % X has one observation per row; basis holds the components as columns, latent their variances
indices = find(latent > eps(2*max(latent))); % keep components whose variance exceeds machine precision of the biggest one, with a little extra tolerance (2x)
new_basis = basis(:, indices); % the relevant components, stored as column vectors
X_new = bsxfun(@minus, X, mean(X, 1)) * new_basis; % centre X as pca does, then take inner products with the retained basis vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.

Genetic Algorithm After SVM

I have already applied SVM using LIBSVM. Now I would like to implement a Genetic Algorithm for feature selection. I tried to google for some information:
1) Saw this website : http://www.scribd.com/doc/31235552/Genetic-Algorithm-Implementation-Using-Matlab
2) GA Examples in MATLAB : http://www.mathworks.com/help/toolbox/gads/f6691.html
I have a few questions on them:
Q1) [x fval] = ga(@fitnessfun, nvars, options). This is the call to the GA solver. What should fitnessfun be? In most GA examples it is a polynomial function, but what should it be in the case of an SVM?
Q2) Are there any concrete examples of GA after SVM?
Like to hear some feedback.
Thanks in advance.
If you want to do feature selection, I think you have it backwards. You should run the GA for feature selection before training your SVM. Your fitness function could be the performance of a newly trained SVM on the selected features, but it depends on what you want to accomplish, which was not very clear from your question.
To answer your second comment:
There are many parts. I don't know the ga function you are using, but its documentation should tell you what parameters fitnessfun is expected to take. I'm guessing the individual whose fitness you want to evaluate is the main parameter. If you evolve a selection of features, this individual would be an array of Boolean variables where true indicates a feature that is selected and false indicates a feature that is not selected. The fitness function needs to return an indicator of how well this selection of features fares. Prediction accuracy (the number of correct predictions divided by the total number of samples) is a natural choice when higher means better; MATLAB's ga minimizes its fitness function, so there you would return the error rate (one minus the accuracy) instead.
I'm going to assume you know how to calculate the prediction accuracy of an SVM model given a dataset and its labels. Since you have a pre-trained SVM it might be a bit tricky to use it only for selected features, and it depends strongly upon the implementation of your SVM. If it is a linear SVM, you could just set the values of the non-selected features to zero in the data matrix. However, if it is an RBF SVM that won't work. You will need to understand the inner mechanisms of the SVM implementation you are relying on. I suggest making a simple example where you train an SVM on 3d data and then adapt it to work on 2d data.
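One way to set this up is sketched below. It retrains an SVM for every candidate feature mask instead of reusing a pre-trained model, and it assumes the Global Optimization Toolbox (ga, optimoptions with the R2016a+ option names) plus fitcsvm from the Statistics and Machine Learning Toolbox in place of LIBSVM. The synthetic data and the names pick and fitness are hypothetical.

```matlab
% Sketch: GA feature selection with cross-validated SVM error as fitness.
rng(0);
n = 100; p = 8;
X = randn(n, p);
y = double(X(:, 1) + X(:, 2) > 0);          % only features 1 and 2 matter
cvp = cvpartition(y, 'KFold', 5);           % fixed folds keep the fitness deterministic
% An all-zero mask would leave no columns, so fall back to all features.
pick = @(mask) logical(mask) | ~any(mask);
% ga minimizes, so the fitness is the cross-validated error rate.
fitness = @(mask) kfoldLoss(fitcsvm(X(:, pick(mask)), y, 'CVPartition', cvp));
opts = optimoptions('ga', 'PopulationType', 'bitstring', ...
    'PopulationSize', 20, 'MaxGenerations', 10, 'Display', 'off');
[bestMask, bestErr] = ga(fitness, p, [], [], [], [], [], [], [], opts);
selectedFeatures = find(bestMask);
```

Each individual is a row vector of zeros and ones, matching the Boolean array described above.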

Feature Selection in MATLAB

I have a dataset for text classification ready to be used in MATLAB. Each document is a vector in this dataset, and the dimensionality of this vector is extremely high. In these cases people usually do some feature selection on the vectors, like the methods you find in the WEKA toolkit. Is there anything like that in MATLAB? If not, can you suggest an algorithm for me to do it?
thanks
MATLAB (and its toolboxes) include a number of functions that deal with feature selection:
RANDFEATURES (Bioinformatics Toolbox): Generate randomized subset of features directed by a classifier
RANKFEATURES (Bioinformatics Toolbox): Rank features by class separability criteria
SEQUENTIALFS (Statistics Toolbox): Sequential feature selection
RELIEFF (Statistics Toolbox): Relief-F algorithm
TREEBAGGER.OOBPermutedVarDeltaError, predictorImportance (Statistics Toolbox): Using ensemble methods (bagged decision trees)
You can also find examples that demonstrate their usage on real datasets:
Identifying Significant Features and Classifying Protein Profiles
Genetic Algorithm Search for Features in Mass Spectrometry Data
In addition, there exist third-party toolboxes:
Matlab Toolbox for Dimensionality Reduction
LIBGS: A MATLAB Package for Gene Selection
Otherwise you can always call your favorite functions from WEKA directly from MATLAB, since it includes a JVM...
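As an illustration of one of the built-in functions listed above, here is a minimal relieff sketch. It uses fisheriris rather than text data, and the choice of 10 nearest neighbors is an arbitrary assumption.

```matlab
% Sketch: rank features with the Relief-F algorithm.
load fisheriris                                 % meas (150x4), species (cell array of labels)
[ranked, weights] = relieff(meas, species, 10); % 10 nearest neighbors per class
topTwo = ranked(1:2);                           % indices of the two highest-weighted features
reduced = meas(:, topTwo);                      % keep only those columns
```

For a very high-dimensional text dataset the same call applies, although the run time grows with both the number of documents and the number of features.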
Feature selection depends on the specific task you want to do on the text data.
One of the simplest and crudest methods is to use principal component analysis (PCA) to reduce the dimensionality of the data. The reduced-dimensional data can be used directly as features for classification.
See the tutorial on using PCA here:
http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html
Here is the link to Matlab PCA command help:
http://www.mathworks.com/help/toolbox/stats/princomp.html
Using the obtained features, the well known Support Vector Machines (SVM) can be used for classification.
http://www.mathworks.com/help/toolbox/bioinfo/ref/svmclassify.html
http://www.autonlab.org/tutorials/svm.html
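The PCA-then-SVM pipeline described here could be sketched as follows. It uses the newer pca and fitcsvm functions instead of the princomp and svmclassify functions linked above, and two classes of fisheriris stand in for the text data; the 70/30 split and the choice of 2 components are assumptions.

```matlab
% Sketch: fit PCA on the training set, project the test set, classify with an SVM.
load fisheriris
keep = ~strcmp(species, 'setosa');              % fitcsvm is binary, so use two classes
X = meas(keep, :);
y = species(keep);
rng(1);
cvp = cvpartition(y, 'HoldOut', 0.3);           % 70% train, 30% test
Xtr = X(training(cvp), :);  ytr = y(training(cvp));
Xte = X(test(cvp), :);      yte = y(test(cvp));
[coeff, scoreTr, ~, ~, ~, mu] = pca(Xtr, 'NumComponents', 2);
scoreTe = bsxfun(@minus, Xte, mu) * coeff;      % centre with the training mean, then project
Mdl = fitcsvm(scoreTr, ytr);                    % linear kernel by default
acc = mean(strcmp(predict(Mdl, scoreTe), yte)); % held-out accuracy
```

Fitting PCA on the training rows only keeps information from the test set out of the projection.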
You might consider using the independent features technique of Weiss and Kulikowski to quickly eliminate variables which are obviously uninformative:
http://matlabdatamining.blogspot.com/2006/12/feature-selection-phase-1-eliminate.html