Linear Discriminant Analysis (LDA) - MATLAB

I have a large dataset of multidimensional data (240 dimensions).
I am a beginner at data mining and I want to apply Linear Discriminant Analysis using MATLAB. I have seen that a lot of functions are explained on the web, but I do not understand how they should be applied.
Basically, I want to apply LDA.
After this step I want to be able to do a reconstruction for my data.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like: size(x) = [2000 240]. So basically I have 240 features (dimensions) and 2000 data points. And I want to perform LDA on this data set.

The function classify from Statistics Toolbox does Linear (and, if you set some options, Quadratic) Discriminant Analysis. There are a couple of worked examples in the documentation that explain how it should be used: type doc classify or showdemo classdemo to see them.
240 features is quite a lot given that you only have 2000 observations, even if you have only two classes. You might want to apply a dimension reduction method before LDA, such as PCA (see doc princomp) or use a feature selection method (see doc sequentialfs for one such method).
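As a rough sketch of the PCA-then-LDA suggestion above: the data here is random and merely shaped like the question's (2000 x 240), and the two-class labels, the choice of k = 20 components, and the 1500/500 split are all assumptions made for illustration. The last line also shows how the reduced data can be mapped back to the original 240 dimensions, since the question asks about reconstruction.

```matlab
rng(0);                                        % reproducible random data
n = 2000; p = 240;
labels = [ones(n/2, 1); 2 * ones(n/2, 1)];     % assumed: two classes
x = randn(n, p);
x(labels == 2, 1:5) = x(labels == 2, 1:5) + 2; % make class 2 differ in 5 features

idx = randperm(n);
trainIdx = idx(1:1500);
testIdx = idx(1501:end);

% Fit PCA on the training rows only; mu is the training mean of each column
[coeff, scoreTrain, ~, ~, ~, mu] = pca(x(trainIdx, :));
k = 20;                                        % assumed number of components to keep
scoreTest = bsxfun(@minus, x(testIdx, :), mu) * coeff;

% Linear discriminant analysis on the reduced data (classify defaults to 'linear')
predicted = classify(scoreTest(:, 1:k), scoreTrain(:, 1:k), labels(trainIdx));
accuracy = mean(predicted == labels(testIdx));

% Approximate reconstruction of the test rows from the k retained components
xRec = bsxfun(@plus, scoreTest(:, 1:k) * coeff(:, 1:k)', mu);
```

In practice k would be chosen from the explained variance or by cross-validation rather than fixed up front.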

You can use fitcdiscr for classification with LDA in MATLAB 2014 and later.
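A minimal sketch of what that looks like, using the fisheriris sample data that ships with the Statistics Toolbox rather than the asker's data; the 5-fold setting is an arbitrary choice.

```matlab
load fisheriris                          % meas: 150x4 features, species: 150x1 labels
Mdl = fitcdiscr(meas, species);          % linear discriminant is the default type
predicted = predict(Mdl, meas(1:5, :));  % class labels for the first five rows

% Estimate the misclassification rate with 5-fold cross-validation
rng(0);
cvMdl = crossval(Mdl, 'KFold', 5);
cvError = kfoldLoss(cvMdl);
```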

Related

MATLAB: PCA for Dimensionality Reduction

I have computed colour descriptors of a dataset of images and generated a 152×320 matrix (152 samples and 320 features). I would like to use PCA to reduce the dimensionality of my image descriptor space. I know that I could do this with MATLAB's built-in PCA function, but as I have just started learning about this concept I would like to write the MATLAB code without the built-in function so I can clearly understand how it works. I tried to find out how to do that online, but all I could find was either the general concept of PCA or an implementation using the built-in functions, without a clear explanation of how it works. Could anyone help me with step-by-step instructions, or a link that explains a simple way to implement PCA for dimensionality reduction? The reason I'm so confused is that there are so many uses for PCA and so many ways to implement it, and the more I read about it the more confused I get.
PCA basically finds the dominant eigenvectors of the data's covariance matrix and projects the data onto them.
What you can do is use the SVD (Singular Value Decomposition).
To imitate MATLAB's pca() function, here is what you should do:
Center all features (each column of your data should have zero mean).
Apply the svd() function to your centered data.
Use the columns of the V matrix as the vectors to project your data onto. Choose the number of columns according to the dimension you'd like the data to have.
The projected data is your new, dimensionality-reduced data.
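The steps above can be sketched as follows. The matrix here is random and only shaped like the one in the question (152 x 320), and k = 10 is an arbitrary target dimension chosen for illustration.

```matlab
rng(0);
X = randn(152, 320);               % stand-in data: rows are samples, columns are features
k = 10;                            % assumed number of dimensions to keep

% Step 1: center every feature (column) to zero mean
mu = mean(X, 1);
Xc = bsxfun(@minus, X, mu);

% Step 2: economy-size SVD of the centered data, Xc = U*S*V'
[U, S, V] = svd(Xc, 'econ');

% Step 3: the first k columns of V are the principal directions
W = V(:, 1:k);

% Step 4: project the centered data onto those directions
Z = Xc * W;                        % 152 x k reduced representation

% Fraction of total variance captured by each component
explained = diag(S).^2 / sum(diag(S).^2);
```

The singular values come out sorted in decreasing order, so the first k columns of V are the k highest-variance directions.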

How do I identify which features are being selected with LDA?

I have run LDA with MATLAB using the fitcdiscr function and predict.
However, I have a feeling there may be some bugs in my code, and as a sanity check I would like to identify which features are weighted most heavily in the classification.
Can this be done?
There is a Coeffs field in your fitted object containing all the relevant information: http://uk.mathworks.com/help/stats/classificationdiscriminant-class.html
In particular, if you fit a linear discriminant, there will be a Linear field holding the coefficients of the linear boundary between each pair of classes. However, bear in mind that the coefficient values of a linear model are not feature importances; there is much more to consider. A weight can be large because the feature has small values or because the distribution of its values is highly skewed. If you need feature selection, use a feature selection method (such as an L1-regularized model); otherwise you can easily draw the wrong conclusions from your data.
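For a quick sanity check, something along these lines reads the coefficients out of the fitted model. It uses the fisheriris sample data as a stand-in, and standardizing with zscore first is my own assumption, made so that the weights are at least on a comparable scale. Even then, treat the ordering as a rough hint and not as a feature importance.

```matlab
load fisheriris                       % meas: 150x4 features, species: 150x1 labels
Z = zscore(meas);                     % put all features on a common scale
Mdl = fitcdiscr(Z, species);          % linear discriminant by default

% Coeffs(i,j) describes the boundary between class i and class j:
% Const + x * Linear = 0
w = Mdl.Coeffs(1, 2).Linear;          % one weight per feature
c = Mdl.Coeffs(1, 2).Const;

% Features ordered by absolute weight for this one class pair
[~, order] = sort(abs(w), 'descend');
disp(order');
```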

Simple Sequential feature selection in Matlab

I have a noisy 40x3249 dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in MATLAB. The MATLAB example is complicated and I can't follow it. Even a few examples on SO didn't help. I want to use a decision tree as the classifier for the feature selection. Can someone please explain in simple terms?
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab, and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively, you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
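Here is a minimal sketch of sequential feature selection wrapped around a decision tree. Note that it uses fitctree instead of the classregtree call mentioned above (my substitution), and that the data is random, with only 50 features to keep it quick; the labels are built from features 3 and 7 purely for illustration.

```matlab
rng(1);
X = randn(40, 50);                        % stand-in data: 40 observations, 50 features
y = double(X(:, 3) + X(:, 7) > 0);        % assumed labels that depend on two features

% Criterion: number of misclassified test observations for a tree
% trained on the candidate feature subset
critfun = @(XT, yT, Xt, yt) sum(yt ~= predict(fitctree(XT, yT), Xt));

cvp = cvpartition(y, 'KFold', 5);         % stratified 5-fold partition
opts = statset('Display', 'off');
[selected, history] = sequentialfs(critfun, X, y, 'cv', cvp, 'options', opts);

chosen = find(selected);                  % indices of the selected features
```

With only 40 observations the selected set can change considerably from one partition to the next, which is the overfitting risk described above.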
This error comes from the use of the classify function in that question, which performs LDA. The error occurs when the data is rank deficient (in other words, some features are almost exactly linear combinations of others, or there are more features than samples). To overcome this, you should project the data down to a lower-dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in the Statistics Toolbox of MATLAB.
[basis, ~, latent] = pca(X); % basis: principal directions as column vectors; latent: variance of each component; rows of X are observations
indices = find(latent > eps(2*max(latent))); % keep components whose variance exceeds machine precision relative to the biggest component, with a little extra tolerance (2x)
new_basis = basis(:, indices); % the relevant components, taken from the columns of "basis"
X_new = X*new_basis; % inner products between the original feature vectors and the new basis vectors, which span a subspace of the original space
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.
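On the extra note about keeping the original features: one way to get a discriminant that tolerates a singular covariance estimate is the regularization option of fitcdiscr. This is a sketch and not the penalised discriminant analysis the answer has in mind; the data is random and the Gamma value of 0.5 is an arbitrary assumption.

```matlab
rng(2);
X = randn(40, 100);                    % more features than samples: singular covariance
y = [ones(20, 1); 2 * ones(20, 1)];
X(y == 2, 1) = X(y == 2, 1) + 3;       % make the classes differ in feature 1

% Gamma > 0 shrinks the pooled covariance towards its diagonal,
% which makes it invertible
Mdl = fitcdiscr(X, y, 'DiscrimType', 'linear', 'Gamma', 0.5);
yhat = predict(Mdl, X);
trainError = mean(yhat ~= y);
```

The training error is optimistic with this many features, so use cross-validation for an honest estimate.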

Matlab Question - Principal Component Analysis

I have a set of 100 observations where each observation has 45 characteristics. Each one of those observations has a label attached, which I want to predict based on those 45 characteristics. So it's an input matrix with the dimension 45 x 100 and a target matrix with the dimension 1 x 100.
The thing is that I want to know how many of those 45 characteristics are relevant in my set of data, basically principal component analysis, and I understand that I can do this with the MATLAB function processpca.
Could you please tell me how I can do this? Suppose that the input matrix is x with 45 rows and 100 columns and y is a vector with 100 elements.
Assuming that you want to construct a model of the 1x100 vector, based on the 45x100 matrix, I am not convinced that PCA will do what you think. PCA can be used to select variables for model estimation, but this is a somewhat indirect way to gather a set of model features. Anyway, I suggest reading both:
Principal Components Analysis
and...
Putting PCA to Work
...both of which provide code in MATLAB not requiring any Toolboxes.
Have you tried COEFF = princomp(x)?
COEFF = princomp(X) performs principal components analysis (PCA) on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.
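Since the matrix in the question is 45 x 100, with variables in rows, it has to be transposed before being passed in. A small sketch, using the newer pca function in place of princomp and random stand-in data; the 95% threshold is an arbitrary assumption.

```matlab
rng(0);
x = randn(45, 100);                   % stand-in: 45 characteristics x 100 observations

% pca expects observations in rows, so transpose the 45x100 matrix
[coeff, score, latent] = pca(x');

% Percentage of total variance explained by each component
explained = 100 * latent / sum(latent);

% Smallest number of components reaching 95% of the variance
nComp = find(cumsum(explained) >= 95, 1);
```

Keep in mind that nComp counts components, each a combination of all 45 characteristics, and not a number of original characteristics.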
From your question I gather that you don't need to do this in MATLAB; you just want to analyze your dataset. In my opinion, the key is visualizing the dependencies.
If you're not forced to do the analysis in MATLAB, I'd suggest you try more specialized software such as WEKA (www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (rapid-i.com). Both tools provide PCA and other dimension reduction algorithms, and they include nice visualization tools.
Your use case sounds like a combination of Classification and Feature Selection.
Statistics Toolbox offers a lot of good capabilities in this area. The toolbox provides access to a number of classification algorithms including
Naive Bayes classifiers
Bagged decision trees (aka Random Forests)
Binomial and multinomial logistic regression
Linear discriminant analysis
You also have a variety of options available for feature selection, including
sequentialfs (forwards and backwards feature selection)
relieff
TreeBagger also supports options for feature selection and estimating variable importance.
Alternatively, you can use some of Optimization Toolbox's capabilities to write your own custom equations to estimate variable importance.
A couple of months back, I did a webinar for The MathWorks titled "Computational Statistics: Getting Started with Classification using MATLAB". You can watch the webinar at
http://www.mathworks.com/company/events/webinars/wbnr51468.html?id=51468&p1=772996255&p2=772996273
The code and the data set for the examples are available at MATLAB Central
http://www.mathworks.com/matlabcentral/fileexchange/28770
With all this said and done, many people use Principal Component Analysis as a pre-processing step before applying classification algorithms. PCA gets used a lot
When you need to extract features from images
When you're worried about multicollinearity
You should find the correlation matrix. In the following example, MATLAB computes the correlation matrix with the corr function:
http://www.mathworks.com/help/stats/feature-transformation.html#f75476
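As a small illustration of the correlation-matrix idea: the data is random, the fifth feature is built as a near copy of the first, and the 0.95 cut-off is an arbitrary assumption.

```matlab
rng(0);
X = randn(100, 5);                          % stand-in: observations in rows
X(:, 5) = X(:, 1) + 0.01 * randn(100, 1);   % feature 5 nearly duplicates feature 1

R = corr(X);                                % 5x5 matrix of pairwise correlations

% List each highly correlated pair once (upper triangle, diagonal excluded)
[i, j] = find(triu(abs(R) > 0.95, 1));
pairs = [i j];
```

Each row of pairs names two features that carry almost the same information, so one of them is a candidate for removal.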

Feature Selection in MATLAB

I have a dataset for text classification ready to be used in MATLAB. Each document is a vector in this dataset, and the dimensionality of this vector is extremely high. In these cases people usually do some feature selection on the vectors, like the methods you find in the WEKA toolkit. Is there anything like that in MATLAB? If not, can you suggest an algorithm for me to do it?
Thanks
MATLAB (and its toolboxes) include a number of functions that deal with feature selection:
RANDFEATURES (Bioinformatics Toolbox): Generate randomized subset of features directed by a classifier
RANKFEATURES (Bioinformatics Toolbox): Rank features by class separability criteria
SEQUENTIALFS (Statistics Toolbox): Sequential feature selection
RELIEFF (Statistics Toolbox): Relief-F algorithm
TREEBAGGER.OOBPermutedVarDeltaError, predictorImportance (Statistics Toolbox): Using ensemble methods (bagged decision trees)
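To give a feel for one of these, here is a short relieff sketch on the fisheriris sample data that ships with the Statistics Toolbox; the choice of 10 nearest neighbours is an arbitrary assumption.

```matlab
load fisheriris                        % meas: 150x4 features, species: 150x1 labels

% Rank the features with the ReliefF algorithm using 10 nearest neighbours
[ranks, weights] = relieff(meas, species, 10);

% ranks lists feature indices from most to least important;
% weights(f) is the score of feature f
bestFeature = ranks(1);
```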
You can also find examples that demonstrate usage on real datasets:
Identifying Significant Features and Classifying Protein Profiles
Genetic Algorithm Search for Features in Mass Spectrometry Data
In addition, there exist third-party toolboxes:
Matlab Toolbox for Dimensionality Reduction
LIBGS: A MATLAB Package for Gene Selection
Otherwise you can always call your favorite functions from WEKA directly from MATLAB, since it includes a JVM...
Feature selection depends on the specific task you want to do on the text data.
One of the simplest and crudest methods is to use Principal Component Analysis (PCA) to reduce the dimensions of the data. The reduced-dimensional data can be used directly as features for classification.
See the tutorial on using PCA here:
http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html
Here is the link to Matlab PCA command help:
http://www.mathworks.com/help/toolbox/stats/princomp.html
Using the obtained features, the well known Support Vector Machines (SVM) can be used for classification.
http://www.mathworks.com/help/toolbox/bioinfo/ref/svmclassify.html
http://www.autonlab.org/tutorials/svm.html
You might consider using the independent features technique of Weiss and Kulikowski to quickly eliminate variables which are obviously uninformative:
http://matlabdatamining.blogspot.com/2006/12/feature-selection-phase-1-eliminate.html