Drawing from correlated marginal uniform in Matlab: Alternatives to copularnd? - matlab

Consider two random variables X and Y both uniformly distributed in [0,1] and correlated with correlation rho.
I want to draw P realisations from their joint distribution in Matlab.
Is
A = copularnd('gaussian',rho,P);
the only way to do it?

Copulas provide a very convenient way of modelling the joint distribution. Just saying that you want X~U[0,1], Y~U[0,1] and corr(X,Y)=rho is not enough to define the relationship between these two random variables. Simply by substituting different copulas you can build different models (ultimately choosing the one that suits your use case best) that all satisfy this condition.
Apart from understanding the basics of what copulas are, you need to understand that there are different types of correlations, such as linear (Pearson) and rank (Spearman / Kendall), and how they relate to each other. In particular, rank correlations are preserved under monotonic transformations, unlike linear correlation. This is the key property that allows us to easily translate the desired linear correlation of the uniform marginals into the linear correlation of the bivariate normal (or whatever other distribution you use in the copula), which is the input correlation to copularnd.
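For instance, here is a quick sketch of that invariance (my illustration, not part of the original answer):
x = randn(1e5,1);
y = x + randn(1e5,1);
corr(x, y, 'type', 'Spearman')               % rank correlation of the raw data
corr(normcdf(x), exp(y), 'type', 'Spearman') % unchanged after monotone transforms
corr(x, y)                                   % Pearson correlation of the raw data
corr(normcdf(x), exp(y))                     % changes under the same transforms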
There is a very good answer on Cross Validated, describing exactly how to convert the correlation in your case. You should also read this Matlab guide on Using Rank Correlation Coefficients and copulas.
In a nutshell, to model the desired linear correlation of the marginals, you need to translate it into a rank correlation (Spearman is convenient, since for the uniform distribution it equals the Pearson correlation), which is the same for your normals (because the transformation is monotonic). All you need then is to convert that Spearman correlation for your normals into a linear Pearson correlation, which is the input to copularnd.
Hopefully the code makes this easy to understand. Obviously you don't need all these temporary variables, but they should make the logic clear:
rho_uni_Pearson = 0.7;                                 % target Pearson correlation of the uniforms
rho_uni_Spearman = rho_uni_Pearson;                    % Pearson = Spearman for uniform marginals
rho_normal_Spearman = rho_uni_Spearman;                % Spearman is preserved by the monotone normal transform
rho_normal_Pearson = 2*sin(rho_normal_Spearman*pi/6);  % Spearman-to-Pearson for the bivariate normal
X = copularnd('gaussian', rho_normal_Pearson, 1e7);
And the resulting linear correlation is exactly what we wanted (since we've generated a very large sample):
corr(X(:,1), X(:,2))
ans =
0.7000
Note that for a bivariate normal copula, the relationship between linear and rank correlations is easy enough, but it could be more complex if you use a different copula. Matlab has a function copulaparam that allows you to translate from rank to linear correlation. So instead of writing out explicit formula as above, we could just use:
rho_normal_Pearson = copulaparam('Gaussian', rho_normal_Spearman, 'type', 'spearman')
Now that we have learned the basics, let's go ahead and use a t copula with 5 degrees of freedom instead of a Gaussian copula:
nu = 5; % number of degrees of freedom of t distribution
rho_t_Pearson = copulaparam('t', 0.7, nu, 'type', 'spearman');
X = copularnd('t', rho_t_Pearson, nu, 1e7);
Ensuring the resulting linear correlation is what we wanted:
corr(X(:,1), X(:,2))
ans =
0.6996
It's easy to observe that the resulting distributions may be strikingly different depending on which copula you choose, even though they all give you the same linear correlation. It is up to you, the researcher, to determine which model is best suited to your particular data and problem. For example:
N = 500;
rho_Pearson = copulaparam('gaussian', 0.1, 'type', 'spearman');
X1 = copularnd('gaussian', rho_Pearson, N);
figure(); scatterhist(X1(:,1),X1(:,2)); title('gaussian');
nu = 1; % number of degrees of freedom of t distribution
rho_Pearson = copulaparam('t', 0.1, nu, 'type', 'spearman');
X2 = copularnd('t', rho_Pearson, nu, N);
figure(); scatterhist(X2(:,1),X2(:,2)); title('t, nu=1');
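As an optional sanity check (my addition, not part of the original answer), the marginals should be close to U[0,1] whichever copula you pick; for example, for the t-copula sample X2:
figure(); ecdf(X2(:,1)); hold on;
plot([0 1], [0 1], 'r--'); % 45-degree reference line
title('marginal ecdf of X2(:,1) vs U[0,1] cdf');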

Related

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it

% Exponential service time with rate 1
mean = 1;
dt = -mean * log(1 - rand());
This is the source link, but MATLAB is needed to open the example.
I was also wondering whether exprnd(1) would give the same result, i.e. generate numbers from the exponential distribution with a mean of 1?
You are right!
First, note that MATLAB parameterizes the Exponential distribution by the mean, not the rate, so exprnd(5) would have a rate lambda = 1/5.
This line of code is another way to do the same thing:
-mean * log(1 - rand());
This is the inverse transform for the Exponential distribution.
If X follows an Exponential distribution with mean mu, its CDF is
F(x) = P(X <= x) = 1 - exp(-x/mu), for x >= 0.
Setting F(X) = U with U ~ Uniform(0,1) and solving for X gives the inverse transform:
X = -mu * log(1 - U), which equals -mu * log(U) in distribution.
Note the last equality is because 1-U and U are equal in distribution. In other words, 1-U ~ Uniform(0,1) and U ~ Uniform(0,1).
You can test this yourself with the following example code, which shows multiple approaches.
% MATLAB R2018b
rate = 1; % mean = 1/rate = 1
NumSamples = 1000;
% Approach 1
X1 = (-1/rate)*log(1-rand(NumSamples,1)); % inverse transform
% Approach 2
X2 = exprnd(1/rate,NumSamples,1); % built-in generator, parameterized by the mean
% Approach 3
pd = makedist('Exponential','mu',1/rate); % create probability distribution object
X3 = random(pd,NumSamples,1);
EDIT: The OP asked if there was a reason to generate from the CDF rather than from the probability density function (PDF). This is my attempt to answer that.
The inverse transform method uses the CDF to take advantage of the fact that the CDF is itself a probability and so must be on the interval [0, 1]. Then it is very easy to generate very good (pseudo) random numbers which will be on that interval. The CDF is sufficient to uniquely define the distribution, and inverting the CDF means that its unique "shape" will properly map the uniformly distributed numbers on [0, 1] to a non-uniform shape in the domain which will follow the probability density function (PDF).
Plotting the CDF makes this nonlinear mapping easy to see.
One use of the PDF would be Acceptance-Rejection methods, which can be useful for some distributions including custom PDFs (thanks to @pjs for jogging my memory).
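For completeness, here is a minimal acceptance-rejection sketch (my illustration, not from the original answer), sampling from the hypothetical target f(x) = 1.5*x^2 on [-1,1] with a uniform proposal g(x) = 1/2 and envelope constant c = max(f/g) = 3:
f = @(x) 1.5*x.^2;  % target PDF on [-1,1]
c = 3;              % c*g(x) >= f(x) everywhere, with g(x) = 1/2
N = 1e5;
samples = zeros(N,1);
k = 0;
while k < N
    x = 2*rand() - 1;         % proposal draw, x ~ U[-1,1]
    if rand() <= f(x)/(c*0.5) % accept with probability f(x)/(c*g(x))
        k = k + 1;
        samples(k) = x;
    end
end
histogram(samples,'Normalization','pdf'); % should match f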

Transforming draws in Matlab from Gaussian mixture to uniform

Consider the following draws for a 2x1 vector in Matlab with a probability distribution that is a mixture of two Gaussian components.
P=10^3; %number draws
v=1;
%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];
%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];
%Combine
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws
RV_temp = random(obj,P);%Px2
% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf');
Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing
ecdf(RV1)
ecdf(RV2)
we can see that RV1 is uniformly distributed on [0,1] (the empirical cdf is close to the 45 degree line) while RV2 is not.
I don't understand why. It seems that the more distant mu_a(2) and mu_b(2) are, the worse the job done by ksdensity for a reasonable number of draws. Why?
When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than if the expectations were closer together, like the N(0,v) and N(0,v) you have in the other dimension. You then ask ksdensity to approximate a function using P points spread over this larger range.
As in standard linear interpolation, the denser the points, the better the approximation of the function (inside the range); the same applies here. Thus in the N(0.5,v) and N(8.2,v) dimension, where the points are sparser, the approximation is worse than in the N(0,v) and N(0,v) dimension, where the points are denser.
As a small side note, is there any reason you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment that 5e2 points are also good. As a final comment, 1e3 is typically preferred over 10^3.
I think this is simply about the number of samples you're using. For the first component, the means of the two Gaussians are relatively close, hence a thousand samples are enough to obtain a cdf really close to the U[0,1] cdf. For the second component, though, the means are much farther apart, and you need more samples. With 100,000 samples the empirical cdf I obtained was very close to uniform, while with 1,000 samples it was clearly farther from the uniform cdf. Try increasing the number of samples to a million and check whether the result gets closer still.
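A sketch of that experiment (reusing obj from the question's code, so it assumes that code has already been run):
for P = [1e3 1e5]
    RV_temp = random(obj, P); % P-by-2 draws from the mixture
    RV2 = ksdensity(RV_temp(:,2), RV_temp(:,2), 'function', 'cdf');
    figure(); ecdf(RV2); title(sprintf('P = %g', P)); % compare to the 45-degree line
end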

MATLAB - Meaning of Gaussian distribution data (in Neural Network)

I'm a newbie to MATLAB and I'm trying to create 2-D Gaussian-distributed data to train my neural network. I just found this code in the official documentation.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
I know "mu" is average of the data. Sigma is something related to
Standard deviation. But I just don't get what is the idea of mesgrid and the interval(x1,x2). And the Geometric meaning of these code.
Also, can someone explain me why is guassian distribution so important in machine learning and data science? Cause all the course keep saying and saying this term.
meshgrid is a basic Matlab function that is in no way specifically related to neural networks or a Gaussian distribution; it simply builds the grid of (x1,x2) points at which the PDF is evaluated. Check the Matlab documentation to find out more about it.
The Gaussian distribution (also known as the normal distribution) is important for data science because it comes with several nice statistical properties. Unfortunately it is hard to describe them all in a compact way, and this would also not be a question about programming, but more about statistics.
I think the code you provided seems confusing because you expect it to generate samples, whereas it merely returns values of the Gaussian PDF (probability density function) for given pairs (x1,x2).
For example, F = mvnpdf([a b], mu, Sigma) returns the probability density at x1=a and x2=b, given that (x1,x2) follows a multivariate Gaussian distribution with mean mu and covariance matrix Sigma.
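For instance, evaluating the density at the single point (x1,x2) = (1, 0.5), reusing the question's mu and Sigma:
F = mvnpdf([1 0.5], mu, Sigma); % a density value, not a probability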
Since this is Stack Overflow, I am focusing on the Matlab aspect of your question: to generate 100 samples of a 2-D Gaussian you can use something like the following (taken from the Matlab help of the randn function):
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);                       % upper-triangular factor with R'*R = Sigma
z = repmat(mu,100,1) + randn(100,2)*R; % shift and correlate standard normal draws
The array z = [x1,x2] contains the x1 and x2 vectors that you are looking for.
A statistics textbook or Wikipedia can convince you why the above code indeed generates such samples. The last line relies on one of the nice properties of the Gaussian distribution (shared by any elliptical distribution): an affine transformation of a Gaussian vector is again Gaussian, with its mean and covariance transformed accordingly.
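If you have the Statistics and Machine Learning Toolbox, mvnrnd wraps the same idea in a single call:
z = mvnrnd(mu, Sigma, 100); % 100 samples from N(mu, Sigma)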

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, the first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into the CDF to see which index the rand val corresponds to
%(note: interp1 needs strictly increasing sample points, so this assumes
%every entry of A is positive)
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
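If you have the Statistics Toolbox, a more compact alternative (my suggestion, not part of the original answer) is to sample the linear index directly with weights A, which also sidesteps the strictly-increasing-CDF issue noted above:
ind = randsample(numel(A), N_vals, true, A); % weighted sampling with replacement
xy_values = [row_vals(ind) col_vals(ind)];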
I hope that this helps!
Chip
I don't believe Matlab has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated based on the cumulative distribution function, there is no single invertible CDF for multivariate distributions, so generating such numbers is much messier (the main problem being that 2 or more variables are correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res = trapz2(xv,yv,A,arg)
    if ~isscalar(arg) && any(size(arg)~=size(A))
        error('Size of A and arg must be the same!')
    end
    res = trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
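Higher moments follow the same pattern (using mean_x and mean_y computed above):
var_x = trapz2(xv,yv,A,(x-mean_x).^2);           % variance of x
var_y = trapz2(xv,yv,A,(y-mean_y).^2);           % variance of y
cov_xy = trapz2(xv,yv,A,(x-mean_x).*(y-mean_y)); % covariance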
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

determining "how good" a correlation is in matlab?

I'm working with a set of data and I've obtained certain correlations (using Pearson's correlation coefficient). I've been asked to determine the "quality of the correlation," and by that my supervisor means he wants to see what the correlations would be if I permuted all the y values of my ordered pairs and compared the obtained correlation coefficients. Does anyone know a nice way of doing this? Is there a Matlab function that would determine how good a correlation is when compared to correlations between random permutations of the data?
First, you have to check whether the correlation coefficient you got is significantly different from zero. The corr function can do this (see pval).
Second, if it's significantly different from zero, then you would like to decide whether this difference is also significant from a practical point of view. In practice, the square of the correlation coefficient (the coefficient of determination) is often considered practically significant if it's larger than 0.5, which means that the variation of one of the correlated parameters "explains" at least 50% of the variation of the other.
Third, there are cases when the coefficient of determination is close to one, but this is not enough to determine the "goodness of correlation". For example, if you measure the same variable using two different methods, you will usually get very similar values, so the correlation coefficient will be almost 1. In such cases you should apply the Bland-Altman analysis, which is very easy to implement in Matlab, and has its own "goodness" parameters (the bias and the so-called limits of agreement).
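For reference, a minimal Bland-Altman sketch (my illustration; m1 and m2 are hypothetical paired measurements of the same quantity by two methods):
m1 = 50 + 5*randn(100,1);         % method 1 (dummy data)
m2 = m1 + 0.2 + 0.5*randn(100,1); % method 2 = method 1 + bias + noise
d = m2 - m1;                      % paired differences
bias = mean(d);                   % systematic bias
loa = bias + 1.96*std(d)*[-1 1];  % 95% limits of agreement
figure(); plot((m1+m2)/2, d, 'o'); hold on;
ax = xlim;
plot(ax, [bias bias], 'k-');      % bias line
plot(ax, [loa(1) loa(1)], 'k--'); % lower limit of agreement
plot(ax, [loa(2) loa(2)], 'k--'); % upper limit of agreement
xlabel('mean of the two methods'); ylabel('difference');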
You can permute one vector's labels N times and calculate the correlation coefficient (cc) for each iteration. Then you can compare the distribution of those values with the real correlation.
Something like this:
%# random data
n = 20;
x = (1:n)';
y = x + randn(n,1)*3;
%# real correlation
cc = corr(x,y);
%# do permutations
n_iter = 100; %# number of permutations
cc_iter = zeros(n_iter,1); %# preallocate the vector
for k = 1:n_iter
ind = randperm(n); %# vector of random permutations
cc_iter(k) = corr(x,y(ind));
end
%# calculate statistics
cc_mean = mean(cc_iter);
cc_std = std(cc_iter);
zval = (cc - cc_mean) / cc_std;
%# probability that the real cc belongs to the same distribution as the cc's from permuted data
pv = 2 * normcdf(-abs(zval)); %# two-tailed; zval is already standardized
%# plot
hist(cc_iter,20)
line([cc cc],ylim,'color','r') %# real value
In addition, if you compute the correlation with [cc pv] = corr(x,y), you get the p-value for your correlation being different from zero. This p-value is calculated under the assumption that your vectors are normally distributed. However, if you calculate not Pearson but Spearman or Kendall correlation (non-parametric), those p-values are based on the permutation distribution instead:
[cc pv] = corr(x,y,'type','Spearman')