I'm trying to apply PCA to my data, which has been standardized, using princomp(x).
The data is <16 x 1036800 double>. This runs out of memory, which would be expected except that this is a new computer with 24GB of RAM intended for data mining. MATLAB even lists the 24GB as available on a memory check.
Is MATLAB actually running out of memory while performing the PCA, or is MATLAB not using the RAM to its full potential? Any information or ideas would be helpful. (I may need to increase the virtual memory, but I assumed the 24GB would have sufficed.)
For a data matrix of size n-by-p, PRINCOMP will return a coefficient matrix of size p-by-p where each column is a principal component expressed using the original dimensions, so in your case you will create an output matrix of size:
1036800*1036800*8 bytes ~ 7.8 TB
Consider using PRINCOMP(X,'econ') to return only the PCs with non-zero variance.
Alternatively, consider performing PCA by SVD: in your case n << p, and the p-by-p covariance matrix is impossible to compute. Instead of decomposing the p-by-p matrix X'X, it is sufficient to decompose the much smaller n-by-n matrix XX'. Refer to this paper for reference.
EDIT:
Here's my implementation, the outputs of this function match those of PRINCOMP (the first three anyway):
function [PC,Y,varPC] = pca_by_svd(X)
% PCA_BY_SVD
% X data matrix of size n-by-p where n<<p
% PC columns are first n principal components
% Y data projected on those PCs
% varPC variance along the PCs
%
X0 = bsxfun(@minus, X, mean(X,1)); % shift data to zero-mean
[U,S,PC] = svd(X0,'econ'); % SVD decomposition
Y = X0*PC; % project X on PC
varPC = diag(S'*S)' / (size(X,1)-1); % variance explained
end
I just tried it on my 4GB machine, and it ran just fine:
» x = rand(16,1036800);
» [PC, Y, varPC] = pca_by_svd(x);
» whos
Name Size Bytes Class Attributes
PC 1036800x16 132710400 double
Y 16x16 2048 double
varPC 1x16 128 double
x 16x1036800 132710400 double
Update:
The princomp function has since been deprecated in favor of pca, introduced in R2012b, which includes many more options.
Matlab has hardcoded limits on matrix sizes. See this link. If you think you're not exceeding those limits, then you probably have a bug in your code and actually are.
MathWorks engineer Stuart McGarrity recorded a nice webinar surveying diagnosis techniques and common solutions. If your data is indeed within the allowed limits, the issue might be memory fragmentation, which is easily solvable.
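If you want to check for fragmentation yourself, here is a minimal sketch (Windows only, since the memory function is Windows-specific) that reports the largest contiguous array MATLAB can currently allocate versus the total memory available to it:
userview = memory;                    % Windows-only memory report
fprintf('Largest possible array: %.1f GB\n', userview.MaxPossibleArrayBytes/2^30);
fprintf('Total available memory: %.1f GB\n', userview.MemAvailableAllArrays/2^30);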
Related
I have a feature vector of size [4096 x 180], where 180 is the number of samples and 4096 is the feature vector length of each sample.
I want to reduce the dimensionality of the data using PCA.
I tried using the built-in pca function of MATLAB, [V U] = pca(X), and reconstructed the data with X_rec = U(:, 1:n)*V(:, 1:n)', n being the dimension I chose. This returns a matrix of 4096 x 180.
Now I have 3 questions:
How to obtain the reduced dimension?
When I set n to 200, it gave an error because the matrix dimensions exceeded the data, which led me to assume that we cannot choose a reduced dimension larger than the sample size. Is this true?
How to find the right number of reduced dimensions?
I have to use the reduced dimension feature set for further classification.
If anyone can provide a detailed step-by-step explanation of the pca code for this, I would be grateful. I have looked in many places but my confusion still persists.
You may want to refer to the Matlab example that analyses city data.
Here is some simplified code:
load cities;
[~, pca_scores, ~, ~, var_explained] = pca(ratings);
Here, pca_scores are the pca components with respective variances of each component in var_explained. You do not need to do any explicit multiplication after running pca. Matlab will give you the components directly.
In your case, consider that data X is a 4096-by-180 matrix, i.e. you have 4096 samples and 180 features. Your goal is to reduce dimensionality such that you have p features, where p < 180. In Matlab, you can simply run the following,
p = 100;
[~, pca_scores, ~, ~, var_explained] = pca(X, 'NumComponents', p);
pca_scores will be a 4096-by-p matrix and var_explained will be a vector of length p.
To answer your questions:
How to obtain the reduced dimension? In the above example, pca_scores is your reduced-dimension data.
When I set n to 200, it gave an error because the matrix dimensions exceeded the data, which led me to assume that we cannot choose a reduced dimension larger than the sample size. Is this true? Correct: you can't use 200, since the number of reduced dimensions has to be less than 180.
How to find the right number of reduced dimensions? You can make this decision by checking the var_explained vector. Typically you want to retain about 99% variance of the features. You can read more about this here.
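A minimal sketch of that check (the 99% threshold is just the rule of thumb mentioned above): pick the smallest p whose cumulative explained variance reaches 99%.
[~, pca_scores, ~, ~, var_explained] = pca(X);
cum_var = cumsum(var_explained);         % cumulative percentage of variance explained
p = find(cum_var >= 99, 1, 'first');     % smallest p retaining ~99% of the variance
reduced = pca_scores(:, 1:p);            % reduced-dimension data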
I'm looking at the transfer functions (transfer matrix) of a multiple-input single-output (MISO) system. The system has 32 dynamic states, four inputs, and one output. The system A, B, C, and D matrices are calculated in a Matlab code, and the state-space model is created as sys=ss(A,B,C,D).
My question is: why are the transfer functions obtained by applying the "tf" function to "sys" (a 1-by-4 transfer function array) different from those obtained by applying "tf" to the individual subsystems "sys(1)", "sys(2)", "sys(3)", and "sys(4)", even though the system matrices obtained from "sys(1)" to "sys(4)" match exactly the corresponding matrices and matrix columns of "sys"?
I tried the same thing for a simple 4th-order system, and they match completely. I also tried it for a 32-state system (of the same dimensions as my original system), where all system matrices are generated by the randn function. I then extracted the transfer function coefficients using cell2mat(T.den) and cell2mat(T.num) for sys and for sys(1) to sys(4). All the denominator coefficients match. Also, except for one of the transfer functions, the numerator coefficients match as well.
It should be mentioned that in the original example the matrix A is singular, but in the second, synthetic example (of dimension 32) the condition number of the system matrix is around 120.
You can find the code below.
Your help is highly appreciated.
clear all;
clc;
%% Building the system matrices
A=randn(32,32);
B=randn(32,4);
C=randn(1,32);
D=randn(1,4);
sys=ss(A,B,C,D); % creating the state space model
TFF=tf(sys); % calculating the transfer matrix
%% extracting the numerator and denominator coefficients of 4 transfer
%functions
for i=1:4
Ti=TFF(i);
Tin(i,:)=cell2mat(Ti.num); % numerator coefficients
Tid(i,:)=cell2mat(Ti.den); % denominator coefficients
clear Ti
end
%% calculating the numerator and denominator coefficients based on individual
% transfer functions
TF1=tf(sys(1));
T1n=cell2mat(TF1.num);
T1d=cell2mat(TF1.den);
TF2=tf(sys(2));
T2n=cell2mat(TF2.num);
T2d=cell2mat(TF2.den);
TF3=tf(sys(3));
T3n=cell2mat(TF3.num);
T3d=cell2mat(TF3.den);
TF4=tf(sys(4));
T4n=cell2mat(TF4.num);
T4d=cell2mat(TF4.den);
num2str([T1n.'-Tin(1,:).']) % the error between the numerator coefficients
% of TF1 by the 2 approaches
num2str([T2n.'-Tin(2,:).'])
num2str([T3n.'-Tin(3,:).'])
num2str([T4n.'-Tin(4,:).'])
num2str([T1d.'-Tid(1,:).'])
num2str([T2d.'-Tid(2,:).'])
num2str([T3d.'-Tid(3,:).'])
num2str([T4d.'-Tid(4,:).'])
It is a mixture of a few things. Because the resulting model is not guaranteed to be minimal after a model conversion, you can get some discrepancies; moreover, a subsystem of the MIMO system is not guaranteed to agree with the corresponding SISO part, since extracting a single channel can remove poles of modes that the selected input does not act on.
However, beyond order 5 or 6, transfer matrices are numerically terrible to perform operations with, so try to avoid them. Compare other properties of the subsystems instead, such as Bode plots.
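A minimal sketch of such a comparison (it assumes the sys and TFF variables defined in the code above): compare the frequency responses of the first channel obtained both ways rather than the raw polynomial coefficients.
bode(sys(1), tf(sys(1)), TFF(1));        % state-space vs. the two tf extractions
legend('ss subsystem', 'tf of subsystem', 'channel 1 of tf(sys)');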
I am performing logistic regression in MATLAB with L2 regularization on text data. My program works well for small datasets. For larger sets, it keeps running infinitely.
I have seen the potentially duplicate question (matlab fminunc not quitting (running indefinitely)). In that question, the cost for initial theta was NaN and there was an error printed in the console. For my implementation, I am getting a real valued cost and there is no error even with verbose parameters being passed to fminunc(). Hence I believe this question might not be a duplicate.
I need help in scaling it to larger sets. The size of the training data I am currently working on is roughly 10k*12k (10k text files cumulatively containing 12k words). Thus, I have m=10k training examples and n=12k features.
My cost function is defined as follows:
function [J gradient] = costFunction(X, y, lambda, theta)
[m n] = size(X);
g = inline('1.0 ./ (1.0 + exp(-z))');
h = g(X*theta);
J =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h))+ (lambda/(2*m))*norm(theta(2:end))^2;
gradient(1) = (1/m)*sum((h-y) .* X(:,1));
for i = 2:n
gradient(i) = (1/m)*sum((h-y) .* X(:,i)) - (lambda/m)*theta(i);
end
end
I am performing optimization using MATLAB's fminunc() function. The parameters I pass to fminunc() are:
options = optimset('LargeScale', 'on', 'GradObj', 'on', 'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);
I am running this code on a machine with these specifications:
Macbook Pro i7 2.8GHz / 8GB RAM / MATLAB R2011b
The cost function seems to behave correctly. For initial theta, I get acceptable values of J and gradient.
K>> theta0 = zeros(n, 1);
K>> [j g] = costFunction(X, y, lambda, theta0);
K>> j
j =
0.6931
K>> max(g)
ans =
0.4082
K>> min(g)
ans =
-2.7021e-05
The program takes incredibly long to run. I started profiling with MAX_ITR = 1 for fminunc(). Even with that single iteration, the program did not complete execution after a couple of hours had elapsed. My questions are:
Am I doing something wrong mathematically?
Should I use any other optimizer instead of fminunc()? With LargeScale=on, fminunc() uses trust-region algorithms.
Is this problem cluster-scale and should not be run on a single machine?
Any other general tips will be appreciated. Thanks!
This helped solve the problem: I was able to get this working by setting the LargeScale flag to 'off' in fminunc(). From what I gather, LargeScale = 'on' uses trust-region algorithms, while keeping it 'off' uses quasi-Newton methods. Using quasi-Newton methods and passing the gradient worked a lot faster for this particular problem and gave very nice results.
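A minimal sketch of that change, using the variable names from the question: switch LargeScale off so fminunc takes the quasi-Newton path, and keep supplying the gradient.
options = optimset('LargeScale', 'off', 'GradObj', 'on', 'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);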
Here is my advice:
First, set the Matlab option to show iteration output during the run (or simply print the cost inside your cost function), so you can monitor the iteration count and the error.
And second, which is very important:
Your problem is ill-posed, or rather underdetermined. You have a 12k-dimensional feature space but provide only 10k examples, which means that for an unconstrained optimization the answer can go to -Inf. As a quick example of why this is, your problem is like:
Minimize x + y + z given that x + y - z = 2: a 3-dimensional feature space pinned down by only a single constraint. I suggest using PCA or CCA to reduce the dimensionality of the text features while retaining about 99% of their variation. This will probably give you a feature space of roughly 100-200 dimensions.
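A minimal sketch of that reduction (the 99% threshold and the intercept handling are assumptions; pca requires a newer Statistics Toolbox release, with princomp being the older equivalent):
[coeff, score, ~, ~, explained] = pca(X);         % X here is the raw m-by-n term matrix (no intercept column)
p = find(cumsum(explained) >= 99, 1, 'first');    % keep ~99% of the variance
Xred = [ones(size(score,1),1), score(:, 1:p)];    % re-add an intercept column, as in the cost function
theta0 = zeros(p+1, 1);
[optTheta, fval] = fminunc(@(t) costFunction(Xred, y, lambda, t), theta0, options);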
PS: Just to point out that this problem is very far from cluster scale, which usually means millions (1kk+) of data points, and that fminunc is not at all overkill. LIBSVM has nothing to do with it, because fminunc is just a numerical optimizer while LIBSVM is a classifier; to be clear, LIBSVM internally uses something similar to fminunc, just with a different objective function.
Here's what I suspect to be the issue, based on my experience with this type of problem. You're using a dense representation for X instead of a sparse one. You're also seeing the typical effect in text classification that the number of terms increases roughly linearly with the number of samples. Effectively, the cost of the matrix multiplication X*theta goes up quadratically with the number of samples.
By contrast, a good sparse matrix representation only iterates over the non-zero elements to do a matrix multiplication, which tends to be roughly constant per document if they're of appropriately constant length, causing linear instead of quadratic slowdown in the number of samples.
I'm not a Matlab guru, but I know it has a sparse matrix package, so try to use that.
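A minimal sketch of that idea (the sizes and density are hypothetical stand-ins for real term counts): with a sparse X, the product X*theta only touches the non-zero entries.
m = 10000; n = 12000;                    % documents x vocabulary size (hypothetical)
Xs = sprand(m, n, 0.001);                % ~0.1% non-zeros, stand-in for real term counts
theta = zeros(n, 1);
h = 1 ./ (1 + exp(-full(Xs*theta)));     % the multiply cost scales with nnz(Xs), not m*n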
I have an array of 15000 doubles, M.
I need to convert this array to a diagonal matrix with the command diag(M),
but I get the famous "out of memory" error.
I run Matlab with the -nojvm option to gain memory space
and with the 3GB switch turned on in Windows.
I also tried converting my array to double precision,
but the problem persists.
Any other ideas?
There are much better ways to do whatever you're probably trying to do than generating the full diagonal matrix (which will be extremely sparse).
Multiplying that matrix, which has 225 million elements, by other matrices will also take a very long time.
I suggest you restructure your algorithm to take advantage of the fact that:
diag(M)(a, b) = M(a) if a == b, and 0 otherwise.
You'll save a huge amount of time and memory and whoever is paying you will be happier.
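A minimal sketch of that restructuring (the matrix A and its width are hypothetical): multiplying by diag(M) is just a row-wise scaling, so the 15000-by-15000 matrix never needs to be formed.
M = rand(15000, 1);                % your diagonal values
A = rand(15000, 10);               % hypothetical matrix you would multiply by diag(M)
y = bsxfun(@times, M, A);          % same result as diag(M)*A
% y = diag(M)*A;                   % would allocate 15000^2*8 bytes ~ 1.8 GB just for diag(M)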
A diagonal matrix has every entry equal to zero except those along the diagonal (the ones where the row index equals the column index). Relating this to your values: if A = diag(M), then A(n, n) = M(n).
Use a sparse matrix:
M = spdiags( M(:), 0, numel(M), numel(M) );
For more info see matlab doc on spdiags and on sparse matrices in general.
If you have an n-by-n square matrix, M, you can directly extract the diagonal elements into a row vector via
n = size(M,1); % Or length(M), but this is more general
D = M(1:n+1:end); % 1-by-n vector containing diagonal elements of M
If you have an older version of Matlab, the above may even be faster than using diag (if I recall, diag wasn't always a compiled function). Then, if you need to save memory and only need the diagonal of M and can get rid of the rest, you can do this:
M(:) = 0; % Zero out M
M(1:n+1:end) = D; % Insert diagonal elements back into M
clear D; % Clear D from memory
This should not allocate much more than about (n^2+n)*8 = n*(n+1)*8 bytes at any one time for double precision values (some will needed for indexing operations). There are other ways to do the above that might save a bit more if you need a (full, non-sparse) n-by-n diagonal matrix, but there's no way to get around that you'll need n^2*8 bytes at a minimum just to store the matrix of doubles.
However, you're still likely to run into problems. I'd investigate sparse datatypes as @user2379182 suggests. Or rework your algorithms. Or better yet, look into obtaining 64-bit Matlab and/or a 64-bit OS!
Possible Duplicate:
MATLAB is running out of memory but it should not be
I want to perform PCA analysis on a huge data set of points. To be more specific, I have size(dataPoints) = [329150 132], where 329150 is the number of data points and 132 is the number of features.
I want to extract the eigenvectors and their corresponding eigenvalues so that I can perform PCA reconstruction.
However, when I am using the princomp function (i.e. [eigenVectors projectedData eigenValues] = princomp(dataPoints)), I obtain the following error:
>> [eigenVectors projectedData eigenValues] = princomp(pointsData);
Error using svd
Out of memory. Type HELP MEMORY for your options.
Error in princomp (line 86)
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
However, if I am using a smaller data set, I have no problem.
How can I perform PCA on my whole dataset in Matlab? Has anyone else encountered this problem?
Edit:
I have modified the princomp function to use svds instead of svd; however, I am getting pretty much the same error. I have pasted the error below:
Error using horzcat
Out of memory. Type HELP MEMORY for your options.
Error in svds (line 65)
B = [sparse(m,m) A; A' sparse(n,n)];
Error in princomp (line 86)
[U,sigma,coeff] = svds(x0,econFlag); % put in 1/sqrt(n-1) later
Solution based on Eigen Decomposition
You can first compute PCA on X'X as @david said. Specifically, see the script below:
sz = [329150 132];
X = rand(sz);
[V D] = eig(X.' * X);
Actually, V holds the right singular vectors, and these are the principal directions if you put your data vectors in rows. The eigenvalues in D are (up to a normalization factor) the variances along each direction. The singular values, which play the role of standard deviations, are computed as the square root of the eigenvalues:
S = sqrt(D);
Then, the left singular vectors, U, are computed using the formula X = USV'. Note that U refers to the principal components if your data vectors are in columns.
U = X*V*S^(-1);
Let us reconstruct the original data matrix and see the L2 reconstruction error:
X2 = U*S*V';
L2ReconstructionError = norm(X(:)-X2(:))
It is almost zero:
L2ReconstructionError =
6.5143e-012
If your data vectors are in columns and you want to convert your data into eigenspace coefficients, you should do U.'*X.
This code snippet takes around 3 seconds in my moderate 64-bit desktop.
Solution based on Randomized PCA
Alternatively, you can use a faster approximate method which is based on randomized PCA. Please see my answer in Cross Validated. You can directly compute fsvd and get U and V instead of using eig.
You may employ randomized PCA if the data size is too big. But, I think the previous way is sufficient for the size you gave.
My guess is that you have a huge data set. You don't need all of the svd coefficients. In this case, use svds instead of svd :
Taken directly from Matlab help:
s = svds(A,k) computes the k largest singular values and associated singular vectors of matrix A.
From your question, I understand that you don't call svd directly. But you might as well take a look at princomp (It is editable!) and alter the line that calls it.
You probably ended up computing an n-by-n matrix somewhere in the computation, that is to say:
329150 * 329150 * 8 bytes ~ 866 GB
of space, which explains why you're getting a memory error. There is a more economical way to compute the PCA using princomp(X, 'econ'), which I suggest you try (sketched below).
More on this on Stack Overflow and at MathWorks.
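A minimal sketch of that call (output names taken from your question): the 'econ' flag keeps the SVD inside princomp at economy size, so the huge square U matrix is never formed.
[eigenVectors, projectedData, eigenValues] = princomp(dataPoints, 'econ');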
Manually compute X'X (132x132) and run svd on it. Or find a NIPALS script.
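A minimal sketch of that suggestion (variable names follow the question; centering and the 1/(n-1) normalization are added so the eigenvalues come out as variances). Since the 132-by-132 matrix is symmetric, its SVD gives the principal directions directly:
X0 = bsxfun(@minus, dataPoints, mean(dataPoints,1));   % center the data
C = (X0.'*X0) / (size(X0,1)-1);          % 132-by-132 covariance matrix
[eigenVectors, S, ~] = svd(C);           % columns are the principal directions
eigenValues = diag(S);                   % variances, sorted in descending order
projectedData = X0*eigenVectors;         % scores, 329150-by-132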