I'm trying to select a subset of features from a dataset that contains 2000 features for 63 samples. Now I know how to do PCA in MATLAB. I used 'pcacov', and it returns the eigenvectors and the eigenvalues too. However, I don't know how to select the features I want. I mean, if the features aren't labeled, how can I select my features? Or will they be returned in the same order?
PCA does not tell you which features are the most significant, but which combinations of features keep the most variance.
What PCA does is rotate your dataset in such a way that it has the most variance along the first dimension, second most along second, and so on. So, what you do when you multiply your feature vectors by the first N eigenvectors is rotate the set and keep the first N dimensions to transform your vectors into a lower-dimensional representation that keeps most of the variance.
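For concreteness, a minimal sketch of that projection using pcacov's outputs (the data matrix X and the number of kept dimensions N are placeholder names of mine):
X0 = bsxfun(@minus, X, mean(X, 1)); % center the samples-by-features data
[pc, variances, explained] = pcacov(cov(X)); % eigenvectors and eigenvalues of the covariance
N = 10; % number of dimensions to keep
Xreduced = X0 * pc(:, 1:N); % samples-by-N lower-dimensional representation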
how can I select my features?
If you call it like
[pc,variances,explained] = pcacov(covx)
then the principal components are the vectors in the first return argument with variances as in the second return argument. They are in correspondence and sorted from most significant to least significant.
or will they be returned in the same order?
You can assume this if the function's help says so; otherwise it's not safe to assume, and you can do something like:
[varsorted, varsortedinds] = sort(variances, 'descend'); % sort the variances, largest first
pcsorted = pc(:, varsortedinds); % reorder the components to match
And varsorted and pcsorted will be in order from most to least significant.
Edit 7 years later: I realized in re-reading the question that my answer doesn't actually answer it. I thought what was being asked was whether the principal components are sorted. Don Reba's answer is an answer to the actual question asked. I can't delete a selected answer, though.
This question of mine is not tightly related to Matlab, but is relevant to it:
I'm trying to figure out how to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN results in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to derive a link between a theoretical property and experimentation, in this case using Matlab. And, as you seem to have noticed, unless you are looking at specific degenerate cases, your experimentation will fail. That is because, although the true correlation parameter rho in the exercise might be zero, a sample of random draws will always have some level of correlation. Here is an illustration; as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1, despite their average being zero (as it should be, since both generators are pseudo-uncorrelated):
n = 1e4;
experiment = nan(n,1);
for i = 1:n
    r = corrcoef(rand(4,1), rand(4,1)); % correlation of two tiny independent samples
    experiment(i) = r(2); % off-diagonal entry: the sample correlation
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of Pearson correlation on Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)]=0. Though this might be the case asymptotically, you will be hard-pressed to find a non-degenerate case where it happens in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you could make either the left or the right factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
we create two vectors of 4 variables, and alternate which draw will be equal to the average.
let's say we want X to average 1, and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
one such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the corresponding component of Y is equal to its average of 2.
another example, where the average of X is 3 and that of Y is 4 is: X=[3,-5,3,11], Y=[1008,4,-1000,4]. etc.
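You can verify both constructions directly:
corrcoef([0 1 2 1], [2 0 2 4]) % returns the identity matrix
corrcoef([3 -5 3 11], [1008 4 -1000 4]) % identity matrix as well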
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
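For reference, a minimal sketch of the Box-Muller transform itself, which turns two independent uniform draws into two independent (and hence uncorrelated) standard normal draws:
u1 = rand(1e4,1); % independent uniforms on (0,1)
u2 = rand(1e4,1);
z1 = sqrt(-2*log(u1)) .* cos(2*pi*u2); % standard normal
z2 = sqrt(-2*log(u1)) .* sin(2*pi*u2); % a second, independent standard normal
corrcoef(z1, z2) % off-diagonal entries near, but not exactly, zero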
Happy randomizing!
I have a training set with size(X_Training) = 122 x 125937:
122 is the number of features
and 125937 is the sample size.
From my little understanding, PCA is useful when you want to reduce the dimension of the features. Meaning, I should reduce 122 to a smaller number.
But when I run the following in MATLAB:
X_new = pca(X_Training)
I get a matrix of size 125937x121. I am really confused, because this changes not only the number of features but also the sample size. This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.
Any help? Did I badly interpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first.
Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores as well. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space.
In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_training.'); % transpose so that rows are samples
X_reduce = score(:, 1:n_components); % keep the first n_components scores
X_reduce will be the dimensionality-reduced feature set, with the total number of columns being the number of reduced features. Also notice that the number of training examples does not change, as we expect. If you want the features along the rows instead of the columns after reducing, transpose this output matrix as well before you proceed.
Finally, if you want to automatically determine the number of features to reduce to, one method is to calculate the variance explained by each principal component, then accumulate the values from the first component up to the point where we exceed some threshold. Usually 95% is used.
Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~,n_components] = max(cumsum(explained) >= 95); % max on a logical vector returns the index of the first true value
Finally, if you want to perform a reconstruction and see how well it approximates the original feature space from the reduced features, you need to reproject into the original space:
X_reconstruct = bsxfun(@plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu contains the means of each feature as a row vector. You therefore need to add this vector across all examples, so broadcasting is required, and that's why bsxfun is used. If you're using MATLAB R2016b or later, implicit expansion does this automatically with the addition operator:
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;
I performed PCA on a 63*2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308 in dimensions.
How do I extract the column names for the top 100 features which are most important, so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308, often referred to as lambda). You might need to use a different PCA function in MATLAB to get them.
The eigenvalues indicate how much of your data each eigenvector explains. A simple method for selecting features would be to select the 100 components with the highest eigenvalues. This gives you a set of features which explains most of the variance in the data.
If you need to justify your approach for a write-up, you can actually calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained.
Bear in mind that selecting based solely on eigenvalue might not correspond to the set of features most important to your regression, so if you don't get the performance you expect you might want to try a different feature selection method, such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and see what methods they use.
A quick MATLAB example of taking the top 100 principal components using PCA.
[eigenvectors, projected_data, eigenvalues] = princomp(X);
[~, feature_idx] = sort(eigenvalues, 'descend'); % princomp already returns these sorted, but be explicit
selected_projected_data = projected_data(:, feature_idx(1:100));
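And if you prefer the 95%-variance cutoff mentioned above, a minimal sketch building on the same outputs (n_keep is a name of mine):
explained = 100 * eigenvalues / sum(eigenvalues); % percent variance per component
n_keep = find(cumsum(explained) >= 95, 1); % first component count crossing 95%
selected_projected_data = projected_data(:, 1:n_keep);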
Have you tried with
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be meaningless because the data is underspecified. You should have (as a rule of thumb) at least dimensions*3 observations.
With 63 observations, you can at most define a 62 dimensional hyperspace!
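A quick way to see this limit (randn stands in for your data):
X = randn(63, 2308); % 63 observations, 2308 variables
rank(bsxfun(@minus, X, mean(X, 1))) % almost surely 62: centering removes one dimension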
Consider having a matrix. From this matrix I select the same number of elements from every row. Let us say that the matrix is nxn and from each row I take m elements (m<n).
I will build an mxm matrix with these selected elements. In every row I put the elements taken from the original matrix (same row index, of course).
What is the best way to achieve this?
Thank you
One way to achieve this is illustrated here. Define an array a to play around with ...
a = randi(6,6); % 6x6 array of random integers between 1 and 6
b = a([1 3 5],[2 4 6]) % rows 1, 3, 5 and columns 2, 4, 6 of a
This demonstrates the use of index vectors for selecting rows and columns from one matrix into another. It depends on being able to specify the vectors you want to use as indices. You could also write:
c = a(1:2:end,2:2:end) % the same selection written with colon ranges
Now, if you tell us what you mean by 'the best way' we may be able to tell you that too!
EDIT
So I read the question again; it seems by 'best' you mean 'fastest'. I've never been concerned to measure the speed of this sort of operation; I await with interest one of the real Matlab experts who lurk hereabouts providing a much cleverer answer than this.
Of course, the fastest way is to not build a submatrix at all, but to operate on the elements of the original matrix. Whether your algorithm can be adapted to avoid building a submatrix is unknown to me.
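For instance, a minimal sketch assuming all you ultimately need is a reduction over the selected elements:
s = sum(sum(a(1:2:end, 2:2:end))); % operate on the selection without naming a submatrix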
If I am given a set of vectors (they can be provided as the column vectors of a matrix), and I want to get the maximally independent vectors, what is the best way to go about it?
I could add one vector to the result set at a time and see whether the rank of the newly formed matrix increases or not, but I feel it is not very efficient. Of course, I could go back to Gaussian elimination to work this out. But I am just wondering if there is a better (efficient, numerically stable, and robust) approach to this problem.
Thanks.
Edit
I feel that adding vectors while watching for the rank to increase is probably not valid. We can do deletion by watching whether the rank decreases, though.
This code will do the trick. It's a little bit dirty because it grows rInd on the fly, which isn't the most efficient, but the idea is more important. It uses the QR decomposition, which is basically Gram-Schmidt orthogonalization. From this, it goes through the columns of r until it finds the next vector in A that adds something linearly independent to the currently known basis.
tol = 1e-10; % tolerance: comparing floats to exactly 0 is numerically fragile
iUnderConsideration = 1;
[q,r] = qr(A);
rInd = [];
for j = 1:size(r,2)
    if abs(r(iUnderConsideration,j)) > tol % column j adds a new independent direction
        rInd = [rInd r(:,j)]; % grows on the fly, as noted above
        iUnderConsideration = iUnderConsideration + 1;
    end
    if iUnderConsideration > size(r,1)
        break;
    end
end
q*rInd % here's your answer: the selected, unchanged columns of A
As a side note, this code will choose the vectors of your matrix A without changing them. svd wouldn't give you these directly.
[U,S,V] = svd(vectors);
U(1:size(vectors,1),1:size(vectors,2)) = vectors; % overwrite the leading block with the originals
U now contains the original vectors plus an optimally orthogonal set.
Doing RREF and looking for the columns with leading ones is your best bet:
matr(:,logical(sum(rref(matr)==1))) % keep the columns whose RREF column carries a 1 (the pivot columns)
This will give you a basis for the column space of the matrix.
SVD is your answer.
The MATLAB reference for SVD.
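As a minimal sketch of how SVD applies here (assuming your vectors are the columns of A): the number of singular values above a tolerance is the numerical rank, i.e. the size of a maximal independent subset.
s = svd(A); % singular values, largest first
tol = max(size(A)) * eps(max(s)); % the kind of tolerance rank() uses
numerical_rank = sum(s > tol)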