I am using the [h,p,ksstat,cv] = kstest(x,'cdf',y); function in Matlab to find the ksstat and p-value. My x is x(1,1:10) = [0.16;1.21;4.41;0.09;0.64;0.36;0.04;6.76;0.04;0.49]and my y = chi2cdf(x,9); which is the cdf I am specifying or testing. Although I get this error:
Error using kstest (line 160)
Hypothesized CDF matrix must have 2 columns.
Normally I would have [h,p,ksstat,cv] = kstest(x,'cdf',y); where
y = makedist('ChiSquared'); but as you may know the distribution Chi-squared does not exist so I am not sure how to get around this issue. Any suggestions are greatly appreciated.
I think you should write:
[h,p,ksstat,cv] = kstest(x,'cdf',[x y]);
as the documentation says:
When CDF is a matrix, column 1 contains a set of possible x values, and column 2 contains the corresponding hypothesized cumulative distribution function values G(x).
Related
I'm trying to build a Gaussian Mixture Model using random initializations and compare the results with one using Kmeans initializations. However, I have difficulty creating the initial covariance matrix. I randomly selected 10 data points from my data set of 2500 data points (each "point" is actually an image), and used them as the means. Then I'm trying to create the covariance matrix from each of these random points.
Here's what I have.
% Randomly initialize GMM parameters
rng(1);
rand_index = randperm(2500);
Mu = data(:,rand_index(1:10));
for i = 1 : 10
Sigma(:,:,i) = cov(Mu);
Pxi(:,i) = mvnpdf(data', Mu(:,i)', Sigma(:,:,i));
end
data is a 50x2500 matrix. I keep getting an error because my Sigma is of the wrong size, or it's not positive definite, etc.
For example, the code above gave the error
Error using mvnpdf (line 116)
SIGMA must be a square matrix with size equal to the number of columns in X, or a row vector with length equal to the number of
columns in X.
If I use
Sigma(:,:,i) = cov([Mu(:,i) Mu(:,i)]');
I get the error
Error using mvnpdf (line 129)
SIGMA must be a square, symmetric, positive definite matrix.
How should I create this covariance matrix?
I assume what you are experiencing is not happening at every run. This is a numerical instability that you can avoid using a simple technique:
%Add a tiny variance to avoid numerical instability
Sigma(:,:,i) = cov([Mu(:,i) Mu(:,i)]');
D = size(Sigma,1);
Sigma(:,:,i) = Sigma(:,:,i) + 1E-5.*diag(ones(D,1));
Looking over some MATLAB code related to multivariate Gaussian distributions and I come across this line:
params.means(k, :) = mean(X(Y == y, :));
Looking at the MATLAB documentation http://www.mathworks.com/help/matlab/ref/mean.html, my assumption is that it calculates the mean over the matrix X in the first dimension (the column). What I don't see is the parentheses that comes after. Is this a conditional probability (where Y = y)? Can someone point me to some documentation where this is explained?
Unpacked, this single line might look like:
row_indices = find(Y==y);
new_X = X(row_indices,:);
params.means(k,:) = mean(new_X);
So, as you can see, the Y==y is simply being used to find a subset of X over which the mean is taken.
Given that you said that this was for computing multivariate Gaussian distributions, I bet that X and Y are paired sets of data. I bet that the code is looping (using the variable k) over different values y. So, it finds all of the Y equal to y and then calculates the mean of the X values that correspond to those Y values.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
MATLAB is running out of memory but it should not be
I want to perform PCA analysis on a huge data set of points. To be more specific, I have size(dataPoints) = [329150 132] where 328150 is the number of data points and 132 are the number of features.
I want to extract the eigenvectors and their corresponding eigenvalues so that I can perform PCA reconstruction.
However, when I am using the princomp function (i.e. [eigenVectors projectedData eigenValues] = princomp(dataPoints); I obtain the following error :
>> [eigenVectors projectedData eigenValues] = princomp(pointsData);
Error using svd
Out of memory. Type HELP MEMORY for your options.
Error in princomp (line 86)
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
However, if I am using a smaller data set, I have no problem.
How can I perform PCA on my whole dataset in Matlab? Have someone encountered this problem?
Edit:
I have modified the princomp function and tried to use svds instead of svd, but however, I am obtaining pretty much the same error. I have dropped the error bellow :
Error using horzcat
Out of memory. Type HELP MEMORY for your options.
Error in svds (line 65)
B = [sparse(m,m) A; A' sparse(n,n)];
Error in princomp (line 86)
[U,sigma,coeff] = svds(x0,econFlag); % put in 1/sqrt(n-1) later
Solution based on Eigen Decomposition
You can first compute PCA on X'X as #david said. Specifically, see the script below:
sz = [329150 132];
X = rand(sz);
[V D] = eig(X.' * X);
Actually, V holds the right singular vectors, and it holds the principal vectors if you put your data vectors in rows. The eigenvalues, D, are the variances among each direction. The singular vectors, which are the standard deviations, are computed as the square root of the variances:
S = sqrt(D);
Then, the left singular vectors, U, are computed using the formula X = USV'. Note that U refers to the principal components if your data vectors are in columns.
U = X*V*S^(-1);
Let us reconstruct the original data matrix and see the L2 reconstruction error:
X2 = U*S*V';
L2ReconstructionError = norm(X(:)-X2(:))
It is almost zero:
L2ReconstructionError =
6.5143e-012
If your data vectors are in columns and you want to convert your data into eigenspace coefficients, you should do U.'*X.
This code snippet takes around 3 seconds in my moderate 64-bit desktop.
Solution based on Randomized PCA
Alternatively, you can use a faster approximate method which is based on randomized PCA. Please see my answer in Cross Validated. You can directly compute fsvd and get U and V instead of using eig.
You may employ randomized PCA if the data size is too big. But, I think the previous way is sufficient for the size you gave.
My guess is that you have a huge data set. You don't need all of the svd coefficients. In this case, use svds instead of svd :
Taken directly from Matlab help:
s = svds(A,k) computes the k largest singular values and associated singular vectors of matrix A.
From your question, I understand that you don't call svd directly. But you might as well take a look at princomp (It is editable!) and alter the line that calls it.
You probably needed to calculate an n by n matrix in your computation somehow that is to say:
329150 * 329150 * 8btyes ~ 866GB`
of space which explains why you're getting a memory error. There seems to be an efficient way to calculate pca using princomp(X, 'econ') which I suggest you give it a try.
More on this in stackoverflow and mathworks..
Manually compute X'X (132x132) and svd on it. Or find NIPALS script.
Is it possible to compute the numerical hessian matrix for this function with respect to W_i,C, epsilon_i easily Matlab? I have computed a hessian by manually take a derivative, but I want to verify if my result is correct.
W = Nx1;
X = NxM;
X_i = Nx1;
y = 1xM;
C = 1x1;
DERIVEST on the file exchange has a function for doing this. There are also tips for doing this eg in Section 18 of this tutorial, or many other places.
I have a matrix in MATLAB and I need to find the 99% value for each column. In other words, the value such that 99% of the population has a larger value than it. Is there a function in MATLAB for this?
Use QUANTILE function.
Y = quantile(X,P);
where X is a matrix and P is scalar or vector of probabilities. For example, if P=0.01, the Y will be vector of values for each columns, so that 99% of column values are larger.
The simplest solution is to use the function QUANTILE as yuk suggested.
Y = quantile(X,0.01);
However, you will need the Statistics Toolbox to use the function QUANTILE. A solution that is not dependent on toolboxes can be found by noting that QUANTILE calls the function PRCTILE, which itself calls the built-in function INTERP1Q to do the primary computation. For the general case of a 2-D matrix that contains no NaN values you can compute the quantiles of each column using the following code:
P = 0.01; %# Your probability
S = sort(X); %# Sort the columns of your data X
N = size(X,1); %# The number of rows of X
Y = interp1q([0 (0.5:(N-0.5))./N 1]',S([1 1:N N],:),P); %'# Get the quantiles
This should give you the same results as calling QUANTILE, without needing any toolboxes.
If you do not have the Statistics Toolbox, there is always
y=sort(x);
y(floor(length(y)*0.99))
or
y(floor(length(y)*0.01))
depending on what you meant.