Matlab - how to compute PCA on a huge data set [duplicate] - matlab

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
MATLAB is running out of memory but it should not be
I want to perform PCA analysis on a huge data set of points. To be more specific, I have size(dataPoints) = [329150 132] where 328150 is the number of data points and 132 are the number of features.
I want to extract the eigenvectors and their corresponding eigenvalues so that I can perform PCA reconstruction.
However, when I am using the princomp function (i.e. [eigenVectors projectedData eigenValues] = princomp(dataPoints); I obtain the following error :
>> [eigenVectors projectedData eigenValues] = princomp(pointsData);
Error using svd
Out of memory. Type HELP MEMORY for your options.
Error in princomp (line 86)
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
However, if I am using a smaller data set, I have no problem.
How can I perform PCA on my whole dataset in Matlab? Have someone encountered this problem?
Edit:
I have modified the princomp function and tried to use svds instead of svd, but however, I am obtaining pretty much the same error. I have dropped the error bellow :
Error using horzcat
Out of memory. Type HELP MEMORY for your options.
Error in svds (line 65)
B = [sparse(m,m) A; A' sparse(n,n)];
Error in princomp (line 86)
[U,sigma,coeff] = svds(x0,econFlag); % put in 1/sqrt(n-1) later

Solution based on Eigen Decomposition
You can first compute PCA on X'X as #david said. Specifically, see the script below:
sz = [329150 132];
X = rand(sz);
[V D] = eig(X.' * X);
Actually, V holds the right singular vectors, and it holds the principal vectors if you put your data vectors in rows. The eigenvalues, D, are the variances among each direction. The singular vectors, which are the standard deviations, are computed as the square root of the variances:
S = sqrt(D);
Then, the left singular vectors, U, are computed using the formula X = USV'. Note that U refers to the principal components if your data vectors are in columns.
U = X*V*S^(-1);
Let us reconstruct the original data matrix and see the L2 reconstruction error:
X2 = U*S*V';
L2ReconstructionError = norm(X(:)-X2(:))
It is almost zero:
L2ReconstructionError =
6.5143e-012
If your data vectors are in columns and you want to convert your data into eigenspace coefficients, you should do U.'*X.
This code snippet takes around 3 seconds in my moderate 64-bit desktop.
Solution based on Randomized PCA
Alternatively, you can use a faster approximate method which is based on randomized PCA. Please see my answer in Cross Validated. You can directly compute fsvd and get U and V instead of using eig.
You may employ randomized PCA if the data size is too big. But, I think the previous way is sufficient for the size you gave.

My guess is that you have a huge data set. You don't need all of the svd coefficients. In this case, use svds instead of svd :
Taken directly from Matlab help:
s = svds(A,k) computes the k largest singular values and associated singular vectors of matrix A.
From your question, I understand that you don't call svd directly. But you might as well take a look at princomp (It is editable!) and alter the line that calls it.

You probably needed to calculate an n by n matrix in your computation somehow that is to say:
329150 * 329150 * 8btyes ~ 866GB`
of space which explains why you're getting a memory error. There seems to be an efficient way to calculate pca using princomp(X, 'econ') which I suggest you give it a try.
More on this in stackoverflow and mathworks..

Manually compute X'X (132x132) and svd on it. Or find NIPALS script.

Related

PCA for dimensionality reduction MATLAB

I have a feature vector of size [4096 x 180], where 180 is the number of samples and 4096 is the feature vector length of each sample.
I want to reduce the dimensionality of the data using PCA.
I tried using the built in pca function of MATLAB [V U]=pca(X) and reconstructed the data by X_rec= U(:, 1:n)*V(:, 1:n)', n being the dimension I chose. This returns a matrix of 4096 x 180.
Now I have 3 questions:
How to obtain the reduced dimension?
When I put n as 200, it gave an error as matrix dimension increased, which gave me the assumption that we cannot reduce dimension lesser than the sample size. Is this true?
How to find the right number of reduced dimensions?
I have to use the reduced dimension feature set for further classification.
If anyone can provide a detailed step by step explanation of the pca code for this I would be grateful. I have looked at many places but my confusion still persists.
You may want to refer to Matlab example to analyse city data.
Here is some simplified code:
load cities;
[~, pca_scores, ~, ~, var_explained] = pca(ratings);
Here, pca_scores are the pca components with respective variances of each component in var_explained. You do not need to do any explicit multiplication after running pca. Matlab will give you the components directly.
In your case, consider that data X is a 4096-by-180 matrix, i.e. you have 4096 samples and 180 features. Your goal is to reduce dimensionality such that you have p features, where p < 180. In Matlab, you can simply run the following,
p = 100;
[~, pca_scores, ~, ~, var_explained] = pca(X, 'NumComponents', p);
pca_scores will be a 4096-by-p matrix and var_explained will be a vector of length p.
To answer your questions:
How to obtain the reduced dimension? I above example, pca_scores is your reduced dimension data.
When I put n as 200, it gave an error as matrix dimension increased, which gave me the assumption that we cannot reduce dimension lesser than the sample size. Is this true? You can't use 200, since the reduced dimensions have to be less than 180.
How to find the right number of reduced dimensions? You can make this decision by checking the var_explained vector. Typically you want to retain about 99% variance of the features. You can read more about this here.

How to multiply matrix of nxm with matrix nxmxp different dimensions in matlab

In my current analysis, I am trying to multiply a matrix (flm), of dimension nxm, with the inverse of a matrix nxmxp, and then use this result to multiply it by the inverse of the matrix (flm).
I was trying using the following code:
flm = repmat(Data.fm.flm(chan,:),[1 1 morder]); %chan -> is a vector 1by3
A = (flm(:,:,:)/A_inv(:,:,:))/flm(:,:,:);
However. due to the problem of dimensions, I am getting the following error message:
Error using ==> mrdivide
Inputs must be 2-D, or at least one
input must be scalar.
To compute elementwise RDIVIDE, use
RDIVIDE (./) instead.
I have no idea on how to proceed without using a for loop, so anyone as any suggestion?
I think you are looking for a way to conveniently multiply matrices when one is of higher dimensionality than the other. In that case you can use bxsfun to automatically 'expand' the smaller matrix.
x = rand(3,4);
y = rand(3,4,5);
bsxfun(#times,x,y)
It is quite simple, and very efficient.
Make sure to check out doc bsxfun for more examples.

PCA for feature extraction MATLAB

I have a data matrix A of size N-by-M.
I wanted use PCA for dimensionality reduction. I want to set the dimensions to 'k'.
I understand that after feature extraction, I should get a Nxk matrix.
I have tried pcares as follows,
[residuals,reconstructed] = pcares(A,k)
But this does not help me.
I am also trying to use the dr toolbox (here)
This returns me a k-by-k matrix. How do I proceede further?
Any help would be appreciated.
Thank You
pcares gives you the residual, which is the error when subtracting the input with the reconstructed input. You can use the pca command. It returns a MxM matrix whose columns are the principle components. You can use the first k of them to construct the feature, just do the following
X = bsxfun(#minus, A, mean(A)) * coeff(:, 1:k);, where coeff is what is returned from the pca command. The function call with bsxfun subtracts the mean (centers the data, as this is what pca did when calculating the output coeff).

Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate the simulated multivariate gaussian distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
unit matrix by 10 times is a symmetric positive semi-definite matrix. And the gaussian distribution will be more round than other type of sigma.
then you can use it as the above example of mvnrnd function and see the plot.

Matlab Principal Component Analysis (eigenvalues order)

I want to use the "princomp" function of Matlab but this function gives the eigenvalues in a sorted array. This way I can't find out to which column corresponds which eigenvalue.
For Matlab,
m = [1,2,3;4,5,6;7,8,9];
[pc,score,latent] = princomp(m);
is the same as
m = [2,1,3;5,4,6;8,7,9];
[pc,score,latent] = princomp(m);
That is, swapping the first two columns does not change anything. The result (eigenvalues) in latent will be: (27,0,0)
The information (which eigenvalue corresponds to which original (input) column) is lost.
Is there a way to tell matlab to not to sort the eigenvalues?
With PCA, each principle component returned will be a linear combination of the original columns/dimensions. Perhaps an example might clear up any misunderstanding you have.
Lets consider the Fisher-Iris dataset comprising of 150 instances and 4 dimensions, and apply PCA on the data. To make things easier to understand, I am first zero-centering the data before calling PCA function:
load fisheriris
X = bsxfun(#minus, meas, mean(meas)); %# so that mean(X) is the zero vector
[PC score latent] = princomp(X);
Lets look at the first returned principal component (1st column of PC matrix):
>> PC(:,1)
0.36139
-0.084523
0.85667
0.35829
This is expressed as a linear combination of the original dimensions, i.e.:
PC1 = 0.36139*dim1 + -0.084523*dim2 + 0.85667*dim3 + 0.35829*dim4
Therefore to express the same data in the new coordinates system formed by the principal components, the new first dimension should be a linear combination of the original ones according to the above formula.
We can compute this simply as X*PC which is the exactly what is returned in the second output of PRINCOMP (score), to confirm this try:
>> all(all( abs(X*PC - score) < 1e-10 ))
1
Finally the importance of each principal component can be determined by how much variance of the data it explains. This is returned by the third output of PRINCOMP (latent).
We can compute the PCA of the data ourselves without using PRINCOMP:
[V E] = eig( cov(X) );
[E order] = sort(diag(E), 'descend');
V = V(:,order);
the eigenvectors of the covariance matrix V are the principal components (same as PC above, although the sign can be inverted), and the corresponding eigenvalues E represent the amount of variance explained (same as latent). Note that it is customary to sort the principal component by their eigenvalues. And as before, to express the data in the new coordinates, we simply compute X*V (should be the same as score above, if you make sure to match the signs)
"The information (which eigenvalue corresponds to which original (input) column) is lost."
Since each principal component is a linear function of all input variables, each principal component (eigenvector, eigenvalue), corresponds to all of the original input columns. Ignoring possible changes in sign, which are arbitrary in PCA, re-ordering the input variables about will not change the PCA results.
"Is there a way to tell matlab to not to sort the eigenvalues?"
I doubt it: PCA (and eigen analysis in general) conventionally sorts the results by variance, though I'd note that princomp() sorts from greatest to least variance, while eig() sorts in the opposite direction.
For more explanation of PCA using MATLAB illustrations, with or without princomp(), see:
Principal Components Analysis