I have computed colour descriptors of a dataset of images and generated a 152×320 matrix (152 samples and 320 features). I would like to use PCA to reduce the dimensionality of my image descriptor space. I know that I could do this with Matlab's built-in PCA function, but as I have just started learning about this concept I would like to implement the Matlab code without the built-in function so that I can clearly understand how it works. I searched online, but all I could find was either the general concept of PCA or implementations that use the built-in functions without clearly explaining how they work. Could anyone help me with step-by-step instructions, or a link that explains a simple way to implement PCA for dimensionality reduction? The reason I'm so confused is that there are so many uses for PCA and so many methods to implement it, and the more I read about it the more confused I get.
PCA basically projects the data onto its dominant eigenvectors, or more precisely onto the dominant eigenvectors of the data's covariance matrix.
What you can do is use the SVD (Singular Value Decomposition).
To imitate MATLAB's pca() function, here is what you should do:
Center all features (each column of your data should have zero mean).
Apply the svd() function to your centered data.
Use the columns of the V matrix as the vectors to project your data onto. Choose the number of columns according to the dimension you'd like the data to have.
The projected data is your new, dimensionality-reduced data.
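The steps above can be sketched as follows. This is a minimal illustration rather than a drop-in replacement for pca(): the random matrix X stands in for the 152×320 descriptor matrix, and the target dimension k = 10 is an arbitrary assumption.

```matlab
% Hypothetical stand-in for the 152-by-320 descriptor matrix (rows = samples).
X = randn(152, 320);
k = 10;                              % assumed target dimensionality

% Step 1: center every feature (column) to zero mean.
mu = mean(X, 1);
Xc = bsxfun(@minus, X, mu);

% Step 2: economy-size SVD of the centered data, Xc = U*S*V'.
[U, S, V] = svd(Xc, 'econ');

% Step 3: keep the first k columns of V as the projection axes.
Vk = V(:, 1:k);

% Step 4: project the centered data onto those axes.
Xreduced = Xc * Vk;                  % 152-by-k

% Variance captured along each principal axis (singular values are sorted
% in descending order, so these are descending too).
latent = diag(S).^2 / (size(X, 1) - 1);
```

A new sample would be reduced the same way: subtract mu from it, then multiply by Vk.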
Related
I understand the concept of PCA, and what it's doing, but trying to apply the concept to my application is proving difficult.
I have a 1 by X matrix of a physiological signal (it's not EMG, but very similar, so think of it as EMG if it helps) which contains various noise and artefacts. What I've noticed of the noise is that some of it is very large and I would assume after PCA this would be the largest principal component, thus my idea of using PCA for some dimensional reduction.
My problem is that with a 1 by X matrix there is no covariance matrix, only the variance, and thus eigenvectors and all of PCA falls through.
I know I need to rearrange my data into a matrix more than 1D, but this is where I need some suggestions. Do I split my data into windows of equal length to create a large dimensional matrix which I can apply PCA to? Do I perform several trials of the same action so I have lots of data sets (this would be impractical for my application)?
Any suggestions or examples would be helpful. I'm using MATLAB to perform this task.
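One of the options raised above, splitting the signal into windows of equal length, could be sketched like this. It only shows how the data matrix would be built; whether it suits this particular signal is an open question. The signal x and the window length winLen are hypothetical placeholders.

```matlab
x = randn(1, 1000);                  % hypothetical 1-by-X signal
winLen = 50;                         % assumed window length (samples)

% Drop the incomplete tail, then put one window in each row.
nWin = floor(numel(x) / winLen);
W = reshape(x(1:nWin*winLen), winLen, nWin)';   % nWin-by-winLen

% W now has several observations (windows), so a covariance matrix exists.
C = cov(W);                          % winLen-by-winLen
```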
I have a 40X3249 noisy dataset and a 40X1 result set. I want to perform simple sequential feature selection on it in Matlab. The Matlab example is complicated and I can't follow it. Even a few examples on SoF didn't help. I want to use a decision tree as the classifier for feature selection. Can someone please explain in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab, and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
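The overfitting risk described above can be illustrated with a minimal sketch. It assumes purely random features and random labels with the same shape as the dataset in the question (40 by 3249), and it uses a least-squares linear classifier rather than a decision tree, simply because that needs no toolbox. All variable names are hypothetical.

```matlab
nSamples = 40;
nFeatures = 3249;

X = randn(nSamples, nFeatures);          % features are pure noise
y = 2 * (rand(nSamples, 1) > 0.5) - 1;   % random labels in {-1, +1}

% Minimum-norm least-squares weights. With far more features than samples
% the system X*w = y can be solved exactly.
w = pinv(X) * y;

% Training accuracy on labels that carry no information at all.
trainAccuracy = mean(sign(X * w) == y);
```

The training accuracy comes out perfect even though there is nothing to learn, which is why accuracy measured on the training data says little at this ratio of variables to samples.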
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively, you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which performs LDA. The error occurs when the data is rank deficient (in other words, when some features are almost exactly correlated). To overcome this, you should project the data down to a lower-dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in Matlab's Statistics Toolbox.
[basis, ~, latent] = pca(X); % Find the basis vectors (columns) and the variance along each one; rows of X are observations
indices = find(latent > eps(2*max(latent))); % Keep components whose variance exceeds machine precision relative to the largest one, with a little extra tolerance (2x)
new_basis = basis(:, indices); % The relevant components, stored as column vectors
X_new = X*new_basis; % Inner products between the retained basis vectors, which span a subspace of the original space, and the original feature vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.
I am trying to develop a system for image classification. I am following the article:
INDEPENDENT COMPONENT ANALYSIS (ICA) FOR TEXTURE CLASSIFICATION by Dr. Dia Abu Al Nadi and Ayman M. Mansour
In a paragraph it says:
Given the above texture images, the Independent Components are learned by the method outlined above.
The (8 x 8) ICA basis function for the above textures are shown in Figure 2. respectively. The dimension is reduced by PCA, resulting in a total of 40 functions. Note that independent components from different windows size are different.
The "method outlined above" is FastICA; the textures are taken from the Brodatz album, and each texture image has 640x640 pixels. My question is:
What do the authors mean by "The dimension is reduced by PCA, resulting in a total of 40 functions.", and how can I get those functions using Matlab?
PCA (Principal Component Analysis) is a method for finding an orthogonal basis (think of a coordinate system) for a high-dimensional (data) space. The "axes" of the PCA basis are sorted by variance, i.e. along the first PCA "axis" your data has the largest variance, along the second "axis" the second largest variance, etc.
This is exploited for dimension reduction: say you have 1000-dimensional data. You do a PCA, transform your data into the PCA basis, and throw away all but the first 20 dimensions (just an example). If your data follows a certain statistical distribution, then chances are that the 20 PCA dimensions describe your data almost as well as the 1000 original dimensions did. There are methods for finding the number of dimensions to use, but that is beyond the scope here.
Computationally, PCA amounts to finding the Eigen-decomposition of your data's covariance matrix, in Matlab: [V,D] = eig(cov(MyData)).
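As a hedged sketch of how that one-liner turns into an actual reduction: eig does not promise to return the eigenvalues largest first, so the example sorts them explicitly before keeping the leading axes. MyData here is random placeholder data, and k = 20 is an arbitrary choice.

```matlab
MyData = randn(200, 64);             % hypothetical data, rows = observations
k = 20;                              % assumed number of dimensions to keep

C = cov(MyData);
C = (C + C') / 2;                    % guard against round-off asymmetry
[V, D] = eig(C);                     % columns of V are eigenvectors

% Sort the axes by variance (eigenvalue), largest first.
[latent, order] = sort(diag(D), 'descend');
V = V(:, order);

% Transform the centered data into the PCA basis and keep the first k axes.
centered = bsxfun(@minus, MyData, mean(MyData, 1));
reduced = centered * V(:, 1:k);      % 200-by-k
```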
Note that if you want to work with these concepts you should do some serious reading. A classic article on what you can do with PCA on image data is Turk and Pentland's Eigenfaces. It also gives some background in an understandable way.
PCA reduces the dimension of the data; ICA extracts components from the data, and the number of components must be <= the data dimension.
I have a large dataset of multidimensional data (240 dimensions).
I am a beginner at data mining and I want to apply Linear Discriminant Analysis using MATLAB. I have seen that a lot of functions are explained on the web, but I do not understand how they should be applied.
Basically, I want to apply LDA.
After this step I want to be able to do a reconstruction for my data.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like: size(x) = [2000 240]. So basically I have 240 features (dimensions) and 2000 data points. And I want to perform LDA on this data set.
The function classify from Statistics Toolbox does Linear (and, if you set some options, Quadratic) Discriminant Analysis. There are a couple of worked examples in the documentation that explain how it should be used: type doc classify or showdemo classdemo to see them.
240 features is quite a lot given that you only have 2000 observations, even if you have only two classes. You might want to apply a dimension reduction method before LDA, such as PCA (see doc princomp) or use a feature selection method (see doc sequentialfs for one such method).
You can use fitcdiscr for classification using LDA in MATLAB 2014.
I have a set of 100 observations where each observation has 45 characteristics. Each of those observations has a label attached, which I want to predict based on those 45 characteristics. So it's an input matrix with the dimension 45 x 100 and a target matrix with the dimension 1 x 100.
The thing is that I want to know how many of those 45 characteristics are relevant in my set of data, basically a principal component analysis, and I understand that I can do this with the Matlab function processpca.
Could you please tell me how I can do this? Suppose that the input matrix is x with 45 rows and 100 columns and y is a vector with 100 elements.
Assuming that you want to construct a model of the 1x100 vector, based on the 45x100 matrix, I am not convinced that PCA will do what you think. PCA can be used to select variables for model estimation, but this is a somewhat indirect way to gather a set of model features. Anyway, I suggest reading both:
Principal Components Analysis
and...
Putting PCA to Work
...both of which provide code in MATLAB not requiring any Toolboxes.
Have you tried COEFF = princomp(x)?
COEFF = princomp(X) performs principal components analysis (PCA) on the n-by-p data matrix X, and returns the principal component coefficients, also known as loadings. Rows of X correspond to observations, columns to variables. COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.
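Note that the matrix in the question is 45 x 100 with observations in columns, so it has to be transposed before it matches the "rows are observations" convention quoted above. A toolbox-free sketch of one way to judge how many components matter follows. It uses random placeholder data and an assumed 95% variance threshold, and it counts principal components rather than original characteristics.

```matlab
x = randn(45, 100);                  % hypothetical data: 45 characteristics x 100 observations
X = x';                              % rows = observations, columns = variables

% Center each variable, then take the singular values (sorted descending).
Xc = bsxfun(@minus, X, mean(X, 1));
s = svd(Xc);

% Fraction of the total variance carried by each principal component.
explained = s.^2 / sum(s.^2);

% Smallest number of components reaching the assumed 95% threshold.
nKeep = find(cumsum(explained) >= 0.95, 1);
```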
From your question I gather that you don't need to do it in MATLAB; you just want to analyze your dataset. In my opinion, the key is visualizing the dependencies.
If you're not forced to do the analysis in MATLAB, I'd suggest you try more specialized software such as WEKA (www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (rapid-i.com). Both tools provide PCA and other dimension reduction algorithms, and they include nice visualization tools.
Your use case sounds like a combination of Classification and Feature Selection.
Statistics Toolbox offers a lot of good capabilities in this area. The toolbox provides access to a number of classification algorithms including
Naive Bayes Classifiers
Bagged Decision Trees (aka Random Forests)
Binomial and Multinomial logistic regression
Linear Discriminant analysis
You also have a variety of options available for feature selection, including
sequentialfs (forwards and backwards feature selection)
relieff
TreeBagger also supports options for feature selection and estimating variable importance.
Alternatively, you can use some of Optimization Toolbox's capabilities to write your own custom equations to estimate variable importance.
A couple of months back, I did a webinar for The MathWorks titled "Computational Statistics: Getting Started with Classification using MATLAB". You can watch the webinar at
http://www.mathworks.com/company/events/webinars/wbnr51468.html?id=51468&p1=772996255&p2=772996273
The code and the data set for the examples are available at MATLAB Central
http://www.mathworks.com/matlabcentral/fileexchange/28770
With all this said and done, many people use Principal Component Analysis as a pre-processing step before applying classification algorithms. PCA gets used a lot:
When you need to extract features from images
When you're worried about multicollinearity
You should find the correlation matrix. In the following example, MATLAB computes the correlation matrix with the 'corr' function:
http://www.mathworks.com/help/stats/feature-transformation.html#f75476