Matlab Principal Component Analysis (eigenvalues order) - matlab

I want to use the "princomp" function of Matlab but this function gives the eigenvalues in a sorted array. This way I can't find out to which column corresponds which eigenvalue.
For Matlab,
m = [1,2,3;4,5,6;7,8,9];
[pc,score,latent] = princomp(m);
is the same as
m = [2,1,3;5,4,6;8,7,9];
[pc,score,latent] = princomp(m);
That is, swapping the first two columns does not change anything. The result (eigenvalues) in latent will be: (27,0,0)
The information (which eigenvalue corresponds to which original (input) column) is lost.
Is there a way to tell matlab to not to sort the eigenvalues?

With PCA, each principle component returned will be a linear combination of the original columns/dimensions. Perhaps an example might clear up any misunderstanding you have.
Lets consider the Fisher-Iris dataset comprising of 150 instances and 4 dimensions, and apply PCA on the data. To make things easier to understand, I am first zero-centering the data before calling PCA function:
load fisheriris
X = bsxfun(#minus, meas, mean(meas)); %# so that mean(X) is the zero vector
[PC score latent] = princomp(X);
Lets look at the first returned principal component (1st column of PC matrix):
>> PC(:,1)
0.36139
-0.084523
0.85667
0.35829
This is expressed as a linear combination of the original dimensions, i.e.:
PC1 = 0.36139*dim1 + -0.084523*dim2 + 0.85667*dim3 + 0.35829*dim4
Therefore to express the same data in the new coordinates system formed by the principal components, the new first dimension should be a linear combination of the original ones according to the above formula.
We can compute this simply as X*PC which is the exactly what is returned in the second output of PRINCOMP (score), to confirm this try:
>> all(all( abs(X*PC - score) < 1e-10 ))
1
Finally the importance of each principal component can be determined by how much variance of the data it explains. This is returned by the third output of PRINCOMP (latent).
We can compute the PCA of the data ourselves without using PRINCOMP:
[V E] = eig( cov(X) );
[E order] = sort(diag(E), 'descend');
V = V(:,order);
the eigenvectors of the covariance matrix V are the principal components (same as PC above, although the sign can be inverted), and the corresponding eigenvalues E represent the amount of variance explained (same as latent). Note that it is customary to sort the principal component by their eigenvalues. And as before, to express the data in the new coordinates, we simply compute X*V (should be the same as score above, if you make sure to match the signs)

"The information (which eigenvalue corresponds to which original (input) column) is lost."
Since each principal component is a linear function of all input variables, each principal component (eigenvector, eigenvalue), corresponds to all of the original input columns. Ignoring possible changes in sign, which are arbitrary in PCA, re-ordering the input variables about will not change the PCA results.
"Is there a way to tell matlab to not to sort the eigenvalues?"
I doubt it: PCA (and eigen analysis in general) conventionally sorts the results by variance, though I'd note that princomp() sorts from greatest to least variance, while eig() sorts in the opposite direction.
For more explanation of PCA using MATLAB illustrations, with or without princomp(), see:
Principal Components Analysis

Related

Computing the SVD of a rectangular matrix

I have a matrix like M = K x N ,where k is 49152 and is the dimension of the problem and N is 52 and is the number of observations.
I have tried to use [U,S,V]=SVD(M) but doing this I get less memory space.
I found another code which uses [U,S,V]=SVD(COV(M)) and it works well. My questions are what is the meaning of using the COV(M) command inside the SVD and what is the meaning of the resultant [U,S,V]?
Finding the SVD of the covariance matrix is a method to perform Principal Components Analysis or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you like a more formal treatise on the subject, you can read up on my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. However, simply put dimensionality reduction projects your data stored in the matrix M onto a lower dimensional surface with the least amount of projection error. In this matrix, we are assuming that each column is a feature or a dimension and each row is a data point. I suspect the reason why you are getting more memory occupied by applying the SVD on the actual data matrix M itself rather than the covariance matrix is because you have a significant amount of data points with a small amount of features. The covariance matrix finds the covariance between pairs of features. If M is a m x n matrix where m is the total number of data points and n is the total number of features, doing cov(M) would actually give you a n x n matrix, so you are applying SVD on a small amount of memory in comparison to M.
As for the meaning of U, S and V, for dimensionality reduction specifically, the columns of V are what are known as the principal components. The ordering of V is in such a way where the first column is the first axis of your data that describes the greatest amount of variability possible. As you start going to the second columns up to the nth column, you start to introduce more axes in your data and the variability starts to decrease. Eventually when you hit the nth column, you are essentially describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained which respect the same ordering as V. As you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
To perform the dimensionality reduction, you can either take U and multiply by S or take your data that is mean subtracted and multiply by V. In other words, supposing X is the matrix M where each column has its mean computed and the is subtracted from each column of M, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns where k is the total amount of dimensions you want to retain. The value of k depends on your application, but many people choose k to be the total number of principal components that explains a certain percentage of your variability in your data.
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the “economy-class” SVD, where U will be 52 by 52, S is 52 by 52, and V is also 52 by 52.
cov(M) will remove each dimension’s mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function [C] = mycov(M)
M = bsxfun(#minus, M, mean(M, 1)); % subtract each dimension’s mean over all observations
C = M' * M / size(M, 1);
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There’s a lot of magical linear algebraic properties and relationships between the SVD and EVD (i.e., singular value vs eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian matrix, it’s left- and right-singular vectors are the same, and in fact also cov(M)’s eigenvectors. Furthermore, cov(M)’s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As #rayryeng explains at length, usually people look at svd(M, 'econ') because they want eig(cov(M)) without needing to evaluate cov(M), because you never want to compute cov(M): it’s numerically unstable. I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the 0-mean version of M, used in the practical application of color-to-grayscale mapping, which might help you get more context.

Difference in Matlab results when using PCA() and PCACOV()

Closest match I can get is to run:
data=rand(100,10); % data set
[W,pc] = pca(cov(data));
then don't demean
data2=data
[W2, EvalueMatrix2] = eig(cov(data2));
[W3, EvalueMatrix3] = svd(cov(data2));
In this case W2 and W3 agree and W is the transpose of them?
Still not clear why W should be the transpose of the other two?
As an extra check I use pcacov:
[W4, EvalueMatrix4] = pcacov(cov(data2));
Again it agrees with WE and W3 but is the transpose of W?
The results are different because you're subtracting the mean of each row of the data matrix. Based on the way you're computing things, rows of the data matrix correspond to data points and columns correspond to dimensions (this is how the pca() function works too). With this setup, you should subtract the mean from each column, not row. This corresponds to 'centering' the data; the mean along each dimension is set to zero. Once you do this, results should be equivalent to pca(), up to sign flips.
Edit to address edited question:
Centering issue looks ok now. When you run the eigenvalue decomposition on the covariance matrix, remember to sort the eigenvectors in order of descending eigenvalues. This should match the output of pcacov(). When calling pca(), you have to pass it the data matrix, not the covariance matrix.

What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?

Suppose there is a matrix B, where its size is a 500*1000 double(Here, 500 represents the number of observations and 1000 represents the number of features).
sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are the eigenvalues of sigma. Assume A is the eigenvectors of the covariance matrix sigma.
I have the following questions:
I need to select the first k = 800 eigenvectors corresponding to the eigenvalues with the largest magnitude to rank the selected features. The final matrix named Aq. How can I do this in MATLAB?
What is the meaning of these selected eigenvectors?
It seems the size of the final matrix Aq is 1000*800 double once I calculate Aq. The time points/observation information of 500 has disappeared. For the final matrix Aq, what does the value 1000 in matrix Aq represent now? Also, what does the value 800 in matrix Aq represent now?
I'm assuming you determined the eigenvectors from the eig function. What I would recommend to you in the future is to use the eigs function. This not only computes the eigenvalues and eigenvectors for you, but it will compute the k largest eigenvalues with their associated eigenvectors for you. This may save computational overhead where you don't have to compute all of the eigenvalues and associated eigenvectors of your matrix as you only want a subset. You simply supply the covariance matrix of your data to eigs and it returns the k largest eigenvalues and eigenvectors for you.
Now, back to your problem, what you are describing is ultimately Principal Component Analysis. The mechanics behind this would be to compute the covariance matrix of your data and find the eigenvalues and eigenvectors of the computed result. It has been known that doing it this way is not recommended due to numerical instability with computing the eigenvalues and eigenvectors for large matrices. The most canonical way to do this now is via Singular Value Decomposition. Concretely, the columns of the V matrix give you the eigenvectors of the covariance matrix, or the principal components, and the associated eigenvalues are the square root of the singular values produced in the diagonals of the matrix S.
See this informative post on Cross Validated as to why this is preferred:
https://stats.stackexchange.com/questions/79043/why-pca-of-data-by-means-of-svd-of-the-data
I'll throw in another link as well that talks about the theory behind why the Singular Value Decomposition is used in Principal Component Analysis:
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
Now let's answer your question one at a time.
Question #1
MATLAB generates the eigenvalues and the corresponding ordering of the eigenvectors in such a way where they are unsorted. If you wish to select out the largest k eigenvalues and associated eigenvectors given the output of eig (800 in your example), you'll need to sort the eigenvalues in descending order, then rearrange the columns of the eigenvector matrix produced from eig then select out the first k values.
I should also note that using eigs will not guarantee sorted order, so you will have to explicitly sort these too when it comes down to it.
In MATLAB, doing what we described above would look something like this:
sigma = cov(B);
[A,D] = eig(sigma);
vals = diag(D);
[~,ind] = sort(abs(vals), 'descend');
Asort = A(:,ind);
It's a good thing to note that you do the sorting on the absolute value of the eigenvalues because scaled eigenvalues are also eigenvalues themselves. These scales also include negatives. This means that if we had a component whose eigenvalue was, say -10000, this is a very good indication that this component has some significant meaning to your data, and if we sorted purely on the numbers themselves, this gets placed near the lower ranks.
The first line of code finds the covariance matrix of B, even though you said it's already stored in sigma, but let's make this reproducible. Next, we find the eigenvalues of your covariance matrix and the associated eigenvectors. Take note that each column of the eigenvector matrix A represents one eigenvector. Specifically, the ith column / eigenvector of A corresponds to the ith eigenvalue seen in D.
However, the eigenvalues are in a diagonal matrix, so we extract out the diagonals with the diag command, sort them and figure out their ordering, then rearrange A to respect this ordering. I use the second output of sort because it tells you the position of where each value in the unsorted result would appear in the sorted result. This is the ordering we need to rearrange the columns of the eigenvector matrix A. It's imperative that you choose 'descend' as the flag so that the largest eigenvalue and associated eigenvector appear first, just like we talked about before.
You can then pluck out the first k largest vectors and values via:
k = 800;
Aq = Asort(:,1:k);
Question #2
It's a well known fact that the eigenvectors of the covariance matrix are equal to the principal components. Concretely, the first principal component (i.e. the largest eigenvector and associated largest eigenvalue) gives you the direction of the maximum variability in your data. Each principal component after that gives you variability of a decreasing nature. It's also good to note that each principal component is orthogonal to each other.
Here's a good example from Wikipedia for two dimensional data:
I pulled the above image from the Wikipedia article on Principal Component Analysis, which I linked you to above. This is a scatter plot of samples that are distributed according to a bivariate Gaussian distribution centred at (1,3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction. The component with a standard deviation of 3 is the first principal component while the one that is orthogonal is the second component. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
Now let's get back to your question. The reason why we take a look at the k largest eigenvalues is a way of performing dimensionality reduction. Essentially, you would be performing a data compression where you would take your higher dimensional data and project them onto a lower dimensional space. The more principal components you include in your projection, the more it will resemble the original data. It actually begins to taper off at a certain point, but the first few principal components allow you to faithfully reconstruct your data for the most part.
A great visual example of performing PCA (or SVD rather) and data reconstruction is found by this great Quora post I stumbled upon in the past.
http://qr.ae/RAEU8a
Question #3
You would use this matrix to reproject your higher dimensional data onto a lower dimensional space. The number of rows being 1000 is still there, which means that there were originally 1000 features in your dataset. The 800 is what the reduced dimensionality of your data would be. Consider this matrix as a transformation from the original dimensionality of a feature (1000) down to its reduced dimensionality (800).
You would then use this matrix in conjunction with reconstructing what the original data was. Concretely, this would give you an approximation of what the original data looked like with the least amount of error. In this case, you don't need to use all of the principal components (i.e. just the k largest vectors) and you can create an approximation of your data with less information than what you had before.
How you reconstruct your data is very simple. Let's talk about the forward and reverse operations first with the full data. The forward operation is to take your original data and reproject it but instead of the lower dimensionality, we will use all of the components. You first need to have your original data but mean subtracted:
Bm = bsxfun(#minus, B, mean(B,1));
Bm will produce a matrix where each feature of every sample is mean subtracted. bsxfun allows the subtraction of two matrices in unequal dimension provided that you can broadcast the dimensions so that they can both match up. Specifically, what will happen in this case is that the mean of each column / feature of B will be computed and a temporary replicated matrix will be produced that is as large as B. When you subtract your original data with this replicated matrix, the effect will subtract every data point with their respective feature means, thus decentralizing your data so that the mean of each feature is 0.
Once you do this, the operation to project is simply:
Bproject = Bm*Asort;
The above operation is quite simple. What you are doing is expressing each sample's feature as a linear combination of principal components. For example, given the first sample or first row of the decentralized data, the first sample's feature in the projected domain is a dot product of the row vector that pertains to the entire sample and the first principal component which is a column vector.. The first sample's second feature in the projected domain is a weighted sum of the entire sample and the second component. You would repeat this for all samples and all principal components. In effect, you are reprojecting the data so that it is with respect to the principal components - which are orthogonal basis vectors that transform your data from one representation to another.
A better description of what I just talked about can be found here. Look at Amro's answer:
Matlab Principal Component Analysis (eigenvalues order)
Now to go backwards, you simply do the inverse operation, but a special property with the eigenvector matrix is that if you transpose this, you get the inverse. To get the original data back, you undo the operation above and add the means back to the problem:
out = bsxfun(#plus, Bproject*Asort.', mean(B, 1));
You want to get the original data back, so you're solving for Bm with respect to the previous operation that I did. However, the inverse of Asort is just the transpose here. What's happening after you perform this operation is that you are getting the original data back, but the data is still decentralized. To get the original data back, you must add the means of each feature back into the data matrix to get the final result. That's why we're using another bsxfun call here so that you can do this for each sample's feature values.
You should be able to go back and forth from the original domain and projected domain with the above two lines of code. Now where the dimensionality reduction (or the approximation of the original data) comes into play is the reverse operation. What you need to do first is project the data onto the bases of the principal components (i.e. the forward operation), but now to go back to the original domain where we are trying to reconstruct the data with a reduced number of principal components, you simply replace Asort in the above code with Aq and also reduce the amount of features you're using in Bproject. Concretely:
out = bsxfun(#plus, Bproject(:,1:k)*Aq.', mean(B, 1));
Doing Bproject(:,1:k) selects out the k features in the projected domain of your data, corresponding to the k largest eigenvectors. Interestingly, if you just want the representation of the data with regards to a reduced dimensionality, you can just use Bproject(:,1:k) and that'll be enough. However, if you want to go forward and compute an approximation of the original data, we need to compute the reverse step. The above code is simply what we had before with the full dimensionality of your data, but we use Aq as well as selecting out the k features in Bproject. This will give you the original data that is represented by the k largest eigenvectors / eigenvalues in your matrix.
If you'd like to see an awesome example, I'll mimic the Quora post that I linked to you but using another image. Consider doing this with a grayscale image where each row is a "sample" and each column is a feature. Let's take the cameraman image that's part of the image processing toolbox:
im = imread('camerman.tif');
imshow(im); %// Using the image processing toolbox
We get this image:
This is a 256 x 256 image, which means that we have 256 data points and each point has 256 features. What I'm going to do is convert the image to double for precision in computing the covariance matrix. Now what I'm going to do is repeat the above code, but incrementally increasing k at each go from 3, 11, 15, 25, 45, 65 and 125. Therefore, for each k, we are introducing more principal components and we should slowly start to get a reconstruction of our data.
Here's some runnable code that illustrates my point:
%%%%%%%// Pre-processing stage
clear all;
close all;
%// Read in image - make sure we cast to double
B = double(imread('cameraman.tif'));
%// Calculate covariance matrix
sigma = cov(B);
%// Find eigenvalues and eigenvectors of the covariance matrix
[A,D] = eig(sigma);
vals = diag(D);
%// Sort their eigenvalues
[~,ind] = sort(abs(vals), 'descend');
%// Rearrange eigenvectors
Asort = A(:,ind);
%// Find mean subtracted data
Bm = bsxfun(#minus, B, mean(B,1));
%// Reproject data onto principal components
Bproject = Bm*Asort;
%%%%%%%// Begin reconstruction logic
figure;
counter = 1;
for k = [3 11 15 25 45 65 125 155]
%// Extract out highest k eigenvectors
Aq = Asort(:,1:k);
%// Project back onto original domain
out = bsxfun(#plus, Bproject(:,1:k)*Aq.', mean(B, 1));
%// Place projection onto right slot and show the image
subplot(4, 2, counter);
counter = counter + 1;
imshow(out,[]);
title(['k = ' num2str(k)]);
end
As you can see, the majority of the code is the same from what we have seen. What's different is that I loop over all values of k, project back onto the original space (i.e. computing the approximation) with the k highest eigenvectors, then show the image.
We get this nice figure:
As you can see, starting with k=3 doesn't really do us any favours... we can see some general structure, but it wouldn't hurt to add more in. As we start increasing the number of components, we start to get a clearer picture of what the original data looks like. At k=25, we actually can see what the cameraman looks like perfectly, and we don't need components 26 and beyond to see what's happening. This is what I was talking about with regards to data compression where you don't need to work on all of the principal components to get a clear picture of what's going on.
I'd like to end this note by referring you to Chris Taylor's wonderful exposition on the topic of Principal Components Analysis, with code, graphs and a great explanation to boot! This is where I got started on PCA, but the Quora post is what solidified my knowledge.
Matlab - PCA analysis and reconstruction of multi dimensional data

Matlab PCA order of principal components

So I read the documentation on pca and it stated that the columns are organized in descending order of their variance. However, whenever I take the PCA of an example and I take the variance of the PCA matrix I get no specific order. A simple example of this is example:
pc = pca(x)
Which returns
pc =
0.0036 -0.0004
0.0474 -0.0155
0.3149 0.3803
0.3969 -0.1930
0.3794 0.3280
0.5816 -0.2482
0.3188 0.1690
-0.1343 0.7835
0.3719 0.0785
0.0310 -0.0110
Meaning column one should be PC1 and column two should be PC2 meaning var(PC1) > var(PC2), but when I get the variance this is clearly not the case.
var(pc)
ans =
0.0518 0.0932
Can anyone shed light into why the variance of PC1 is not the largest?
The docs state that calling
COEFF = pca(x)
will return a p-by-p matrix, so your result is rather surprising (EDIT: this is because your x data set has so few rows compared to columns (i.e. similar to having 10 unknowns and only 3 equations)). Either way when they talk about variance They don't mean the variance of the coefficients of each component but rather the variance of the x data columns after being projected on to each principal component. The docs state that the output score holds these projections and so to see the descending variance you should be doing:
[COEFF, score, latent] = pca(x)
var(score)
and you will see that var(score) equals the third output latent and is indeed in descending order.
Your misunderstanding is that you are trying to calculate the variance of the coefficients of the principal component vectors. These are just unit vectors describing the direction of the hyperplane on which to project your data such that the resulting projected data has maximum variance. These vectors ARE arranged in an order such that your original data projected onto the hyperplane that each describes will be in descending order of variance, but variance of the projected data (score) and NOT of the coefficients of the principal component vectors (COEFF or in your code pc).

Principal Components calculated using different functions in Matlab

I am trying to understand principal component analysis in Matlab,
There seems to be at least 3 different functions that do it.
I have some questions re the code below:
Am I creating approximate x values using only one eigenvector (the one corresponding to the largest eigenvalue) correctly? I think so??
Why are PC and V which are both meant to be the loadings for (x'x) presented differently? The column order is reversed because eig does not order the eigenvalues with the largest value first but why are they the negative of each other?
Why are the eig values not in ordered with the eigenvector corresponding to the largest eigenvalue in the first column?
Using the code below I get back to the input matrix x when using svd and eig, but the results from princomp seem to be totally different? What so I have to do to make princomp match the other two functions?
Code:
x=[1 2;3 4;5 6;7 8 ]
econFlag=0;
[U,sigma,V] = svd(x,econFlag);%[U,sigma,coeff] = svd(z,econFlag);
U1=U(:,1);
V1=V(:,1);
sigma_partial=sigma(1,1);
score1=U*sigma;
test1=score1*V';
score_partial=U1*sigma_partial;
test1_partial=score_partial*V1';
[PC, D] = eig(x'*x)
score2=x*PC;
test2=score2*PC';
PC1=PC(:,2);
score2_partial=x*PC1;
test2_partial=score2_partial*PC1';
[o1 o2 o3]=princomp(x);
Yes. According to the documentation of svd, diagonal elements of the output S are in decreasing order. There is no such guarantee for the the output D of eig though.
Eigenvectors and singular vectors have no defined sign. If a is an eigenvector, so is -a.
I've often wondered the same. Laziness on the part of TMW? Optimization, because sorting would be an additional step and not everybody needs 'em sorted?
princomp centers the input data before computing the principal components. This makes sense as normally the PCA is computed with respect to the covariance matrix, and the eigenvectors of x' * x are only identical to those of the covariance matrix if x is mean-free.
I would compute the PCA by transforming to the basis of the eigenvectors of the covariance matrix (centered data), but apply this transform to the original (uncentered) data. This allows to capture a maximum of variance with as few principal components as possible, but still to recover the orginal data from all of them:
[V, D] = eig(cov(x));
score = x * V;
test = score * V';
test is identical to x, up to numerical error.
In order to easily pick the components with the most variance, let's fix that lack of sorting ourselves:
[V, D] = eig(cov(x));
[D, ind] = sort(diag(D), 'descend');
V = V(:, ind);
score = x * V;
test = score * V';
Reconstruct the signal using the strongest principal component only:
test_partial = score(:, 1) * V(:, 1)';
In response to Amro's comments: It is of course also possible to first remove the means from the input data, and transform these "centered" data. In that case, for perfect reconstruction of the original data it would be necessary to add the means again. The way to compute the PCA given above is the one described by Neil H. Timm, Applied Multivariate Analysis, Springer 2002, page 446:
Given an observation vector Y with mean mu and covariance matrix Sigma of full rank p, the goal of PCA is to create a new set of variables called principal components (PCs) or principal variates. The principal components are linear combinations of the variables of the vector Y that are uncorrelated such that the variance of the jth component is maximal.
Timm later defines "standardized components" as those which have been computed from centered data and are then divided by the square root of the eigenvalues (i.e. variances), i.e. "standardized principal components" have mean 0 and variance 1.