So I'm trying to implement an EM-Algorithm to train a Gaussian Class Conditional model for classifying data. I'm stuck in the M-step at the moment because I can't figure out how to calculate the covariance matrix.
The problem is I have a big data set and using a for loop to go through each point would be way to slow. I also can't use the covariance function cov(), because I need to use a mean which I calculated using this formula(mu symbol one)
Is there a way to adjust cov() to use the mean I want? Or is there another way I could do this without for loops?
Edit: Forgot to explain what the data matrix is like. Its an nx3 where each row is a data point.
It technically needs to work for the general case nxm but n is usually really big(1000 or more) while m is relatively small.
You can calculate your covariance matrix manually. Let data be the matrix containing all your variables (for example, [x y]) and mu your custom mean, proceed as follows:
n = size(data,1);
data_dem = data - (ones(n,1) * mu);
cov_mat = (data_dem.' * data_dem) ./ (n - 1);
Notice that I used the Bessel's Correction (n-1 instead of n) because the Matlab cov function uses it, unless you specify the third argument as 1:
cov_mat = cov(x,y,1);
C = cov(___,w) specifies the normalization weight for any of the
previous syntaxes. When w = 0 (default), C is normalized by the number
of observations-1. When w = 1, it is normalized by the number of
observations.
Related
I want to solve, in MatLab, a linear system (corresponding to a PDE system of two equations written in finite difference scheme). The action of the system matrix (corresponding to one of the diffusive terms of the PDE system) reads, symbolically (u is one of the unknown fields, n is the time step, j is the grid point):
and fully:
The above matrix has to be intended as A, where A*U^n+1 = B is the system. U contains the 'u' and the 'v' (second unknown field of the PDE system) alternatively: U = [u_1,v_1,u_2,v_2,...,u_J,v_J].
So far I have been filling this matrix using spdiags and diag in the following expensive way:
E=zeros(2*J,1);
E(1:2:2*J) = 1;
E(2:2:2*J) = 0;
Dvec=zeros(2*J,1);
for i=3:2:2*J-3
Dvec(i)=D_11((i+1)/2);
end
for i=4:2:2*J-2
Dvec(i)=D_21(i/2);
end
A = diag(Dvec)*spdiags([-E,-E,2*E,2*E,-E,-E],[-3,-2,-1,0,1,2],2*J,2*J)/(dx^2);`
and for the solution
[L,U]=lu(A);
y = L\B;
U(:) =U\y;
where B is the right hand side vector.
This is obviously unreasonably expensive because it needs to build a JxJ matrix, do a JxJ matrix multiplication, etc.
Then comes my question: is there a way to solve the system without passing MatLab a matrix, e.g., by passing the vector Dvec or alternatively directly D_11 and D_22?
This would spare me a lot of memory and processing time!
Matlab doesn't store sparse matrices as JxJ arrays but as lists of size O(J). See
http://au.mathworks.com/help/matlab/math/constructing-sparse-matrices.html
Since you are using the spdiags function to construct A, Matlab should already recognize A as sparse and you should indeed see such a list if you display A in console view.
For a tridiagonal matrix like yours, the L and U matrices should already be sparse.
So you just need to ensure that the \ operator uses the appropriate sparse algorithm according to the rules in http://au.mathworks.com/help/matlab/ref/mldivide.html. It's not clear whether the vector B will already be considered sparse, but you could recast it as a diagonal matrix which should certainly be considered sparse.
I am trying to understand principal component analysis in Matlab,
There seems to be at least 3 different functions that do it.
I have some questions re the code below:
Am I creating approximate x values using only one eigenvector (the one corresponding to the largest eigenvalue) correctly? I think so??
Why are PC and V which are both meant to be the loadings for (x'x) presented differently? The column order is reversed because eig does not order the eigenvalues with the largest value first but why are they the negative of each other?
Why are the eig values not in ordered with the eigenvector corresponding to the largest eigenvalue in the first column?
Using the code below I get back to the input matrix x when using svd and eig, but the results from princomp seem to be totally different? What so I have to do to make princomp match the other two functions?
Code:
x=[1 2;3 4;5 6;7 8 ]
econFlag=0;
[U,sigma,V] = svd(x,econFlag);%[U,sigma,coeff] = svd(z,econFlag);
U1=U(:,1);
V1=V(:,1);
sigma_partial=sigma(1,1);
score1=U*sigma;
test1=score1*V';
score_partial=U1*sigma_partial;
test1_partial=score_partial*V1';
[PC, D] = eig(x'*x)
score2=x*PC;
test2=score2*PC';
PC1=PC(:,2);
score2_partial=x*PC1;
test2_partial=score2_partial*PC1';
[o1 o2 o3]=princomp(x);
Yes. According to the documentation of svd, diagonal elements of the output S are in decreasing order. There is no such guarantee for the the output D of eig though.
Eigenvectors and singular vectors have no defined sign. If a is an eigenvector, so is -a.
I've often wondered the same. Laziness on the part of TMW? Optimization, because sorting would be an additional step and not everybody needs 'em sorted?
princomp centers the input data before computing the principal components. This makes sense as normally the PCA is computed with respect to the covariance matrix, and the eigenvectors of x' * x are only identical to those of the covariance matrix if x is mean-free.
I would compute the PCA by transforming to the basis of the eigenvectors of the covariance matrix (centered data), but apply this transform to the original (uncentered) data. This allows to capture a maximum of variance with as few principal components as possible, but still to recover the orginal data from all of them:
[V, D] = eig(cov(x));
score = x * V;
test = score * V';
test is identical to x, up to numerical error.
In order to easily pick the components with the most variance, let's fix that lack of sorting ourselves:
[V, D] = eig(cov(x));
[D, ind] = sort(diag(D), 'descend');
V = V(:, ind);
score = x * V;
test = score * V';
Reconstruct the signal using the strongest principal component only:
test_partial = score(:, 1) * V(:, 1)';
In response to Amro's comments: It is of course also possible to first remove the means from the input data, and transform these "centered" data. In that case, for perfect reconstruction of the original data it would be necessary to add the means again. The way to compute the PCA given above is the one described by Neil H. Timm, Applied Multivariate Analysis, Springer 2002, page 446:
Given an observation vector Y with mean mu and covariance matrix Sigma of full rank p, the goal of PCA is to create a new set of variables called principal components (PCs) or principal variates. The principal components are linear combinations of the variables of the vector Y that are uncorrelated such that the variance of the jth component is maximal.
Timm later defines "standardized components" as those which have been computed from centered data and are then divided by the square root of the eigenvalues (i.e. variances), i.e. "standardized principal components" have mean 0 and variance 1.
I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate the simulated multivariate gaussian distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
unit matrix by 10 times is a symmetric positive semi-definite matrix. And the gaussian distribution will be more round than other type of sigma.
then you can use it as the above example of mvnrnd function and see the plot.
I'm looking to duplication a 784x784 matrix in matlab along a 3rd axis. The following code seems to work:
mat = reshape(repmat(mat, 1,10000),784,784,10000);
Unfortunately, it takes so long to run it's worthless (changing the 10,000s to 1000 makes it take a few minutes, and using 10,000 makes my whole machine freeze up practically). is there a faster way to do this?
For reference, I'm looking to use mvnpdf on 10,000 vectors each of length 784, using the same covariance matrix for each. So my final call looks like
mvnpdf(X,mu,mat)
%size(X) = (10000,784), size(mu) = (10000,784), size(mat) = 784,784,10000
If there's a way to do this that's not repeating the covariance matrix 10,000 times, that'd be helpful too. Thanks!
For replication in more than 2 dimensions, you need to supply the replication counts as an array:
out = repmat(mat,[1,1,10000])
Creating a 784x784 matrix 10,000 times isn't going to take advantage of the vectorization in MATLAB, which is going to be more useful for small arrays. Avoiding a for loop also won't help too much, given the following:
The main speedup you can gain here is by computing the inverse of the covariance matrix once, and then computing the pdf yourself. The inverse of sigma takes O(n^3), and you are needlessly doing that 10,000 times. (Also, the square root determinant can be precomputed.) For reference, the PDF of the multivariate normal distribution is computed as follows:
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Properties
Better to just compute the inverse once, and then compute z = x - mu for each value, then doing z'Sz for each pdf value, and applying a simple function and a constant. But wait! You can vectorize that, too.
I don't have MATLAB in front of me, but this is basically what you need to do, and it'll run in an instant.
s = inv(sigma);
c = -0.5*log(det(s)) - (k/2)*log(2*pi);
z = x - mu; % 10000 x 784 matrix
ans = exp( c - 0.5 .* dot(z*s, z, 2) ); % 10000 x 1 vector
I have a matrix in MATLAB and I need to find the 99% value for each column. In other words, the value such that 99% of the population has a larger value than it. Is there a function in MATLAB for this?
Use QUANTILE function.
Y = quantile(X,P);
where X is a matrix and P is scalar or vector of probabilities. For example, if P=0.01, the Y will be vector of values for each columns, so that 99% of column values are larger.
The simplest solution is to use the function QUANTILE as yuk suggested.
Y = quantile(X,0.01);
However, you will need the Statistics Toolbox to use the function QUANTILE. A solution that is not dependent on toolboxes can be found by noting that QUANTILE calls the function PRCTILE, which itself calls the built-in function INTERP1Q to do the primary computation. For the general case of a 2-D matrix that contains no NaN values you can compute the quantiles of each column using the following code:
P = 0.01; %# Your probability
S = sort(X); %# Sort the columns of your data X
N = size(X,1); %# The number of rows of X
Y = interp1q([0 (0.5:(N-0.5))./N 1]',S([1 1:N N],:),P); %'# Get the quantiles
This should give you the same results as calling QUANTILE, without needing any toolboxes.
If you do not have the Statistics Toolbox, there is always
y=sort(x);
y(floor(length(y)*0.99))
or
y(floor(length(y)*0.01))
depending on what you meant.