Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function, and I understand that 'mu' must be a vector of my means. As I need 4 distinct classes within my data, these are going to be (1, 1), (-1, 1), (1, -1) and (-1, -1). I assume I will have to call the function four times with a different mean vector each time and then combine the results to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array', i.e. a covariance matrix. I don't understand how I can create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
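The piece the question is missing: SIGMA is not computed from the numbers you are about to generate; it encodes the spread and correlation you want those numbers to have. A minimal sketch, assuming two variables with chosen standard deviations s1, s2 and a chosen correlation rho (all assumed values, not from the question):
s1  = 1;     % desired standard deviation of variable 1 (assumed)
s2  = 1;     % desired standard deviation of variable 2 (assumed)
rho = 0.8;   % desired correlation between the two variables (assumed)
SIGMA = [s1^2       rho*s1*s2;
         rho*s1*s2  s2^2];      % covariance matrix implied by s1, s2 and rho
mu = [1 1];                     % one of the four class means from the question
r  = mvnrnd(mu, SIGMA, 250);    % 250 correlated samples for this class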

Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1), normrnd(0,1,5000,1)]; % your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') % in case you want to plot the results
I hope this helps.

I think your aim is to generate simulated multivariate Gaussian data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
The identity matrix scaled by 10 is symmetric positive semi-definite, and with it the Gaussian distribution will be round (isotropic) rather than elongated, as it would be with other choices of sigma.
You can then feed mu and sigma into mvnrnd as in the example above and look at the plot.
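Tying the two answers back to the original four-class setup, here is one way the full data set could be assembled. The class means come from the question; SIGMA and the per-class sample count are assumed values, and gscatter (Statistics Toolbox) is just one way to inspect the result:
SIGMA = 0.25 * eye(2);             % assumed covariance, shared by all four classes
mus   = [1 1; -1 1; 1 -1; -1 -1];  % the four class means from the question
n     = 100;                       % assumed number of samples per class
data   = [];
labels = [];
for c = 1:size(mus, 1)
    data   = [data;   mvnrnd(mus(c,:), SIGMA, n)]; %#ok<AGROW>
    labels = [labels; c * ones(n, 1)];             %#ok<AGROW>
end
gscatter(data(:,1), data(:,2), labels) % four clusters, one per class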

Related

Calculating covariance in Matlab for large dataset and different mean

So I'm trying to implement an EM-Algorithm to train a Gaussian Class Conditional model for classifying data. I'm stuck in the M-step at the moment because I can't figure out how to calculate the covariance matrix.
The problem is that I have a big data set, and using a for loop to go through each point would be way too slow. I also can't use the covariance function cov(), because I need to use a mean which I calculated with a formula of my own.
Is there a way to adjust cov() to use the mean I want? Or is there another way I could do this without for loops?
Edit: Forgot to explain what the data matrix is like. It's an n-by-3 matrix where each row is a data point.
It technically needs to work for the general n-by-m case, but n is usually really big (1000 or more) while m is relatively small.
You can calculate your covariance matrix manually. Let data be the matrix containing all your variables (for example, [x y]) and mu your custom mean; then proceed as follows:
n = size(data,1);                              % number of observations
data_dem = data - (ones(n,1) * mu);            % subtract the custom mean from every row
cov_mat = (data_dem.' * data_dem) ./ (n - 1);  % sum of outer products, normalized by n-1
Notice that I used Bessel's correction (n-1 instead of n) because Matlab's cov function uses it by default, unless you pass 1 as the normalization argument:
cov_mat = cov(x,y,1);
C = cov(___,w) specifies the normalization weight for any of the previous syntaxes. When w = 0 (default), C is normalized by the number of observations - 1. When w = 1, it is normalized by the number of observations.
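As a quick sanity check of the manual formula (a sketch with placeholder data): when the custom mu happens to equal the sample mean, the manual result should agree with cov to machine precision.
data = randn(1000, 3);                         % placeholder n-by-m data
mu   = mean(data, 1);                          % use the sample mean for the check
n    = size(data, 1);
data_dem = data - (ones(n, 1) * mu);           % demean with the custom mu
cov_mat  = (data_dem.' * data_dem) ./ (n - 1); % manual covariance
C = cov(data);                                 % built-in covariance
max(abs(cov_mat(:) - C(:)))                    % should be around 1e-16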

Vectorization in PCA

I am doing Principal Component Analysis, and want to know whether the summation over i = 1 to m of X(i)*X(i)^T can be expressed in terms of the data matrix, i.e. as a direct multiplication of two matrices.
Can this be done, or do I need to use a for loop?
Currently I have tried
S = zeros(n,n);               % use S rather than shadowing the built-in "sum"
for i = 1:m
    S = S + X(i,:)' * X(i,:); % outer product; ' is Matlab's transpose, not ^T
end
My goal is to find the principal eigenvalues of the resulting matrix.
Thanks in advance
Say the shape of the data matrix X is (Dim, Num), i.e. samples are stored as columns; then you can compute the sum of all the sample outer products with just:
S = X*X'
For implementing PCA, also don't forget to divide the matrix by the number of samples.
Sigma = (1/N) * (X*X')
If your data has zero mean, this is also the covariance matrix.
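In the question's layout (rows are samples, so X is m-by-n), the same idea reads X.' * X, and eig then yields the principal eigenvalues the asker is after. A sketch with placeholder data:
m = 1000; n = 5;                % assumed sizes: m samples, n features
X = randn(m, n);                % placeholder data, rows are samples
S = X.' * X;                    % equals the loop's sum of outer products
Sigma = S / m;                  % divide by the number of samples for PCA
[V, D] = eig(Sigma);                          % eigen-decomposition
[eigvals, order] = sort(diag(D), 'descend');  % principal eigenvalues first
V = V(:, order);                              % matching principal directions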

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab; that means that for every pair (x,y) the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, the first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
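A hypothetical way to exercise the snippet (A here is a made-up, strictly positive distribution, which also keeps the CDF strictly increasing as interp1 requires):
A = rand(100, 100);   % made-up positive weights (hypothetical input)
A = A / sum(A(:));    % normalize so that sum(A(:)) == 1, as the snippet assumes
% ... run the snippet above on this A ...
% Afterwards A, row_vals and col_vals are all column vectors, so the
% empirical mean of the samples should approach the exact mean of the pdf:
mean(xy_values, 1)                       % empirical mean of the 1000 samples
[sum(A .* row_vals), sum(A .* col_vals)] % exact mean of the discrete pdf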
I hope that this helps!
I don't believe Matlab has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated by inverting the cumulative distribution function, that inverse-CDF method does not carry over directly to multivariate distributions, so generating such numbers is much messier (the main problem is the fact that 2 or more variables can be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
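For example, the central second moments follow the same pattern (a sketch reusing the variables defined above):
var_x  = trapz2(xv, yv, A, (x - mean_x).^2);              % variance along x
var_y  = trapz2(xv, yv, A, (y - mean_y).^2);              % variance along y
cov_xy = trapz2(xv, yv, A, (x - mean_x) .* (y - mean_y)); % covariance of x and y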

svds not working for some matrices - wrong result

Here is my testing function:
function diff = svdtester()
y = rand(500,20);
[U,S,V] = svd(y);
%{
y = sprand(500,20,.1);
[U,S,V] = svds(y);
%}
diff_mat = y - U*S*V';
diff = mean(abs(diff_mat(:)));
end
There are two very similar parts: one finds the SVD of a random matrix, the other finds the SVD of a random sparse matrix. Regardless of which one you choose to comment (right now the second one is commented-out), we compute the difference between the original matrix and the product of its SVD components and return that average absolute difference.
When using rand/svd, the typical return (mean error) value is around 8.8e-16, basically zero. When using sprand/svds, the typical return value is around 0.07, which is fairly terrible considering the sparse matrix is 90% zeros to start with.
Am I misunderstanding how SVD should work for sparse matrices, or is something wrong with these functions?
Yes, the behavior of svds is a little bit different from svd. According to MATLAB's documentation:
[U,S,V] = svds(A,...) returns three output arguments, and if A is m-by-n:
U is m-by-k with orthonormal columns
S is k-by-k diagonal
V is n-by-k with orthonormal columns
U*S*V' is the closest rank k approximation to A
In fact, k defaults to 6, so you will get a rather crude approximation. To get an exact reconstruction, specify k to be min(size(y)):
[U, S, V] = svds(y, min(size(y)))
and you will get an error of the same order of magnitude as in the case of svd.
P.S. Also, MATLAB's documentation says:
Note svds is best used to find a few singular values of a large, sparse matrix. To find all the singular values of such a matrix, svd(full(A)) will usually perform better than svds(A,min(size(A))).
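Putting that together, a sketch of the corrected sparse branch, using the same error metric as the question's function:
y = sprand(500, 20, 0.1);           % random sparse matrix, 10% density
[U, S, V] = svds(y, min(size(y)));  % request all 20 singular values, not the default 6
diff_mat = y - U*S*V';              % reconstruction error
mean(abs(diff_mat(:)))              % should now be near machine precision, as with svd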

Matlab correlation between two matrices

Basically I have two matrices, something like this:
Matrix A (100 rows x 2 features):
Height - Weight
1.48 75
1.55 65
1.60 70
etc...
And Matrix B (same dimensions as Matrix A, but with different values of course).
I would like to understand if there is some correlation between Matrix A and Matrix B. What strategy do you suggest?
The concept you are looking for is known as canonical correlation. It is a well developed bit of theory in the field of multivariate analysis. Essentially, the idea is to find a linear combination of the columns in your first matrix and a linear combination of the columns in your second matrix, such that the correlation between the two linear combinations is maximized.
This can be done manually using eigenvectors and eigenvalues, but if you have the Statistics Toolbox, then Matlab has already got it packaged and ready to go for you. The function is called canoncorr; see its documentation for details.
A brief example of the usage of this function follows:
%# Set up some example data
CovMat = randi(5, 4, 4) + 20 * eye(4); %# Build a random covariance matrix
CovMat = (1/2) * (CovMat + CovMat'); %# Ensure the random covariance matrix is symmetric
X = mvnrnd(zeros(500, 4), CovMat); %# Simulate data using multivariate Normal
%# Partition the data into two matrices
X1 = X(:, 1:2);
X2 = X(:, 3:4);
%# Find the canonical correlations of the two matrices
[A, B, r] = canoncorr(X1, X2);
The first canonical correlation is the first element of r, and the second canonical correlation is the second element of r.
The canoncorr function also has a lot of other outputs. I'm not sure I'm clever enough to provide a satisfactory yet concise explanation of them here so instead I'm going to be lame and recommend you read up on it in a multivariate analysis textbook - most multivariate analysis textbooks will have a full chapter dedicated to canonical correlations.
Finally, if you don't have the Statistics Toolbox, then a quick google turned up a File Exchange (FEX) submission that claims to provide canonical correlation analysis - note, I haven't tested it myself.
Ok, let's have a short try:
A = [1:20; rand(1,20)]'; % Generate some data...
The best way to examine a 2-dimensional relationship is by looking at the data plots:
plot(A(:,1), A(:,2), 'o') % with random data you should not see any pattern...
If we really want to compute some correlation coefficients, we can do this with corrcoef, as you mentioned:
B = corrcoef(A)
B =
1.0000 -0.1350
-0.1350 1.0000
Here, B(1,1) is the correlation of column 1 with itself (hence 1.0000), and B(2,1) is the correlation between column 1 and column 2 (and vice versa; thus B is symmetric).
One may argue about the usefulness of such a measure in a two-dimensional context - in my opinion you usually gain more insights by looking at the plots.
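If, instead of canonical correlations, you just want plain column-wise correlations between the two matrices, corr accepts two inputs (Statistics Toolbox; the data below are placeholders for the matrices from the question):
A = randn(100, 2);   % placeholder for the 100-by-2 Height/Weight matrix A
B = randn(100, 2);   % placeholder for the 100-by-2 matrix B
R = corr(A, B)       % R(i,j) = correlation of A's column i with B's column j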