Determining "how good" a correlation is in MATLAB?

I'm working with a set of data and I've obtained certain correlations (using Pearson's correlation coefficient). I've been asked to determine the "quality of the correlation," and by that my supervisor means he wants to see what the correlations would be if I permuted all the y values of my ordered pairs and compared the resulting correlation coefficients. Does anyone know a nice way of doing this? Is there a MATLAB function that would determine how good a correlation is when compared to correlations between random permutations of the data?

First, you have to check whether the correlation coefficient you got is significantly different from zero. The corr function can do this (see pval).
Second, if it's significantly different from zero, then you would like to decide whether this difference is also significant from a practical point of view. In practice, the square of the correlation coefficient (the coefficient of determination) is considered practically significant if it's larger than 0.5, which means that the variation of one of the correlated parameters "explains" at least 50% of the variation of the other.
Third, there are cases when the coefficient of determination is close to one, but this alone is not enough to establish the "goodness" of the correlation. For example, if you measure the same variable using two different methods, you will usually get very similar values, so the correlation coefficient will be almost 1. In such cases you should apply Bland-Altman analysis, which is very easy to implement in MATLAB and has its own "goodness" parameters (the bias and the so-called limits of agreement).
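A minimal sketch of the first two checks, with made-up data (corr is in the Statistics Toolbox):
x = (1:50)';
y = 2*x + randn(50,1)*10; % made-up correlated data
[r, pval] = corr(x, y); % Pearson correlation and its p-value
if pval < 0.05 && r^2 > 0.5 % statistically significant and practically meaningful
    fprintf('r = %.2f explains %.0f%% of the variance\n', r, 100*r^2)
end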

You can permute one vector's labels N times and calculate the correlation coefficient (cc) for each iteration. Then you can compare the distribution of those values with the real correlation.
Something like this:
%# random data
n = 20;
x = (1:n)';
y = x + randn(n,1)*3;
%# real correlation
cc = corr(x,y);
%# do permutations
n_iter = 100; %# number of permutations
cc_iter = zeros(n_iter,1); %# preallocate the vector
for k = 1:n_iter
    ind = randperm(n); %# random permutation of the indices
    cc_iter(k) = corr(x,y(ind));
end
%# calculate statistics
cc_mean = mean(cc_iter);
cc_std = std(cc_iter);
zval = (cc - cc_mean) / cc_std;
%# probability that the real cc belongs to the same distribution as the cc's from permuted data
pv = 2 * normcdf(-abs(zval)); %# two-sided p-value from the standard normal
%# plot
hist(cc_iter,20)
line([cc cc],ylim,'color','r') %# real value
In addition, if you compute the correlation with [cc pv] = corr(x,y), you get a p-value for how different your correlation is from no correlation. This p-value is calculated under the assumption that your data are normally distributed. However, if you calculate not the Pearson but the Spearman or Kendall correlation (non-parametric), those p-values are based on the permutation distribution of the data:
[cc pv] = corr(x,y,'type','Spearman')

Related

MATLAB: How to compute the similarity of two signals and get the correct consistency or coherence metric

I was wondering about the consistency metric. Generally, it allows us to deduce the parity or similarity between two signals, right? If so, if the value is higher (from 0.5 to 1), does it mean that there is a strong similarity between the signals? If it falls in a lower range (0.1-0.43), does this indicate poor coherence between the signals (i.e. poor similarity, or a high probability that the signals are different)? And if the metric is < 0, does this confirm the signals are totally different? I'm asking because I'm getting negative numbers. Is this hypothesis possible?
Could someone give me a clear understanding of the consistency metric of a signal? Here is my small code and figure. Thanks in advance.
s1 = signal3;
s2 = signal4;
if ~isequal(s1, s2)
    C1 = xcorr(s1);
    C2 = xcorr(s2);
    signal_mix = C1.*C2; %mixing vector
    signal_mix1 = signal_mix;
else
    signal_mix = s2;
    signal_mix1 = signal_mix;
end
for i = 1:length(signal_mix1)
    signal_mix1(i) = min(C1(i),C2(i))/max(C1(i),C2(i)); % consistency score
end
signal_mix2 = sum(signal_mix1); % overall consistency score
Depending on your use case you might want to consider the dynamic time warping (DTW) distance (MATLAB has a built-in function for it) as a similarity metric. One problem with using correlation as a metric is that it always compares the same timestep of the two signals. So two identical signals, where one is time-delayed, could yield a low correlation. The DTW distance addresses this by also comparing values at adjacent timesteps.
The downside of the DTW distance is that the distance itself can't be interpreted on its own, only relative to other distances. So you can tell that two signals A & B with a distance of 150 are more similar than A & C with a distance of 250, but the distance of 150 on its own doesn't tell you a lot.
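A minimal sketch of the idea, with made-up signals (dtw requires the Signal Processing Toolbox):
t = 0:0.01:2*pi;
A = sin(t);
B = sin(t - 0.5); % a time-delayed copy of A
C = cos(3*t); % an unrelated signal
dtw(A, B) % relatively small distance despite the delay
dtw(A, C) % larger distance: the signals are less similar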
First of all, you could use the xcorr function to calculate the cross-correlation between two signals.
from Matlab help:
r = xcorr(x,y) returns the cross-correlation of two discrete-time
sequences. Cross-correlation measures the similarity between a vector
x and shifted (lagged) copies of a vector y as a function of the lag.
If x and y have different lengths, the function appends zeros to the
end of the shorter vector so it has the same length as the other.
Additionally, you could use xcov:
xcov computes the mean of its inputs, subtracts the mean, and then
calls xcorr.
The result of xcov can be interpreted as an estimate of the covariance
between two random sequences or as the deterministic covariance
between two deterministic signals.
In the case of your example you are using xcorr with one signal, so it computes the auto-correlation between the signal and lagged copies of itself.
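As a hedged sketch of the two-signal use (made-up signals): the peak of the normalized cross-correlation tells you how similar the signals are, and its lag tells you by how much one is shifted:
t = (0:999)';
s1 = sin(2*pi*t/100);
s2 = [zeros(20,1); s1(1:end-20)]; % s1 delayed by 20 samples
[r, lags] = xcorr(s1, s2, 'coeff'); % normalized cross-correlation
[peak, idx] = max(r);
fprintf('peak correlation %.2f at lag %d\n', peak, lags(idx))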
update:
Based on the comment, it seems you need linear correlation, which can be calculated by the corr function:
p=corr(x,y)
The value of p is 1 when x and y behave exactly like each other, and is -1 when x and y behave exactly opposite to each other.
When p is 0 it means there is no linear correlation between the two signals.
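For instance (made-up vectors):
x = (1:100)';
corr(x, 2*x + 3) % 1: y is an increasing linear function of x
corr(x, -x) % -1: y decreases exactly as x increases
corr(x, randn(100,1)) % near 0: no linear relationship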

Drawing from correlated marginal uniform in Matlab: Alternatives to copularnd?

Consider two random variables X and Y both uniformly distributed in [0,1] and correlated with correlation rho.
I want to draw P realisations from their joint distribution in Matlab.
Is
A = copularnd('gaussian',rho,P);
the only way to do it?
Copulas provide a very convenient way of modelling the joint distribution. Just saying that you want X~U[0,1], Y~U[0,1] and corr(X,Y)=rho is not enough to define the relationship between these two random variables. Simply by substituting different copulas you can build different models (eventually choosing the one that suits your use case best) that all satisfy this condition.
Apart from understanding the basics of what copulas are, you need to understand that there are different types of correlations, such as linear (Pearson) and rank (Spearman / Kendall), and how they relate to each other. In particular, rank correlations are preserved when a monotonic transformation is applied, unlike linear correlation. This is the key property that allows us to easily translate the desired linear correlation of the uniform marginal distributions to the linear correlation of the bivariate normal (or whichever distribution you use in the copula), which is the input correlation to copularnd.
There is a very good answer on Cross Validated, describing exactly how to convert the correlation in your case. You should also read this Matlab guide on Using Rank Correlation Coefficients and copulas.
In a nutshell, to model the desired linear correlation of the marginals, you need to translate it into some rank correlation (using Spearman is convenient, since for the uniform distribution it equals the Pearson correlation), which will be the same for your normals (because it's a monotonic transformation). All you need then is to convert that Spearman correlation for your normals into a linear Pearson correlation, which is the input to copularnd.
Hopefully the code makes it easy to understand. Obviously you don't need those temporary variables, but they should make the logic clear:
rho_uni_Pearson = 0.7;
rho_uni_Spearman = rho_uni_Pearson;
rho_normal_Spearman = rho_uni_Spearman;
rho_normal_Pearson = 2*sin(rho_normal_Spearman*pi/6);
X = copularnd('gaussian', rho_normal_Pearson, 1e7);
And the resulting linear correlation is exactly what we wanted (since we've generated a very large sample):
corr(X(:,1), X(:,2))
ans =
0.7000
Note that for a bivariate normal copula, the relationship between linear and rank correlations is easy enough, but it could be more complex if you use a different copula. MATLAB has a function copulaparam that allows you to translate from rank to linear correlation. So instead of writing out the explicit formula as above, we could just use:
rho_normal_Pearson = copulaparam('Gaussian', rho_normal_Spearman, 'type', 'spearman')
Now that we have learned the basics, let's go ahead and use a t copula with 5 degrees of freedom instead of a Gaussian copula:
nu = 5; % number of degrees of freedom of t distribution
rho_t_Pearson = copulaparam('t', 0.7, nu, 'type', 'spearman');
X = copularnd('t', rho_t_Pearson, nu, 1e7);
Checking that the resulting linear correlation is what we wanted:
corr(X(:,1), X(:,2))
ans =
0.6996
It's easy to observe that the resulting distributions may be strikingly different, depending on which copula you choose, even though they all give you the same linear correlation. It is up to you, the researcher, to determine which model is best suited to your particular data and problem. For example:
N = 500;
rho_Pearson = copulaparam('gaussian', 0.1, 'type', 'spearman');
X1 = copularnd('gaussian', rho_Pearson, N);
figure(); scatterhist(X1(:,1),X1(:,2)); title('gaussian');
nu = 1; % number of degrees of freedom of t distribution
rho_Pearson = copulaparam('t', 0.1, nu, 'type', 'spearman');
X2 = copularnd('t', rho_Pearson, nu, N);
figure(); scatterhist(X2(:,1),X2(:,2)); title('t, nu=1');

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF; assumes sum(A(:))==1
CDF = cumsum(A); %remember, the first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
%(note: interp1 requires the CDF to be strictly increasing, i.e. all entries of A nonzero)
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
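If the Statistics Toolbox is available, a shorter alternative sketch is to draw weighted linear indices directly with randsample (this also sidesteps interp1's requirement of a strictly increasing CDF):
ind = randsample(numel(A), N_vals, true, A(:)); %weighted draw of linear indices
xy_values = [row_vals(ind) col_vals(ind)]; %same lookup as above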
I hope that this helps!
Chip
I don't believe MATLAB has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated based on the cumulative distribution function, the inverse-CDF method does not carry over directly to multivariate distributions, so generating such numbers is much messier (the main problem being that 2 or more variables can be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things that can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
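Higher moments follow the same pattern; for example, using the means computed above:
var_x=trapz2(xv,yv,A,(x-mean_x).^2); %variance of x
cov_xy=trapz2(xv,yv,A,(x-mean_x).*(y-mean_y)); %covariance of x and y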
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI, and did a t-test and a Kolmogorov-Smirnov test to assess whether the difference between the distributions was significant. This seemed very intuitive to me; however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between the two ways of calculating Mahalanobis distance is exactly.
Comment that is too long, @3lectrologos:
You mean this: d(I) = (Y(I,:)-mu) * inv(SIGMA) * (Y(I,:)-mu)'? This is just the formula for calculating the Mahalanobis distance, so it should be the same for the pdist2() and mahal() functions. I think mu is the mean vector and SIGMA the covariance matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives a 60*1 vector with a distance for each observation in Y (the second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs a 50*60 matrix whose ij-th element is the pairwise distance between X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just
compared to the "mahalanobis-centroid" based on X, but to each individual
element of X. So the purpose of this script is to see whether the average
over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','averaged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
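A quick sketch with made-up data to verify this equivalence numerically:
X = randn(50,3); Y = randn(60,3); % made-up data of the sizes in the question
d1 = mahal(X,Y); % squared distances from rows of X to Y's centroid
d2 = pdist2(X, mean(Y), 'mahalanobis', cov(Y)).^2;
max(abs(d1 - d2)) % ~0 up to floating-point error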
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.
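As a hedged numerical experiment (made-up clusters) to probe whether the two methods rank dissimilarity the same way:
X = randn(50,3); % reference distribution
S = cov(X);
for shift = [0 1 2 4] % clusters increasingly far from X
    Y = randn(60,3) + shift;
    m1 = mean(mahal(Y,X)); % method 1 (note: mahal returns squared distances)
    m2 = mean(mean(pdist2(X,Y,'mahalanobis',S))); % method 2: mean pairwise distance
    fprintf('shift %d: mahal %.2f, pairwise %.2f\n', shift, m1, m2)
end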

Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function, and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data, these are going to be (1, 1), (-1, 1), (1, -1) and (-1, -1). I assume I will have to call the function 4 times with a different mean vector each time and then combine the results to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate simulated multivariate Gaussian-distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
The identity matrix scaled by 10 is a symmetric positive semi-definite matrix, and with such a diagonal covariance the Gaussian distribution will be more round (spherical) than with other types of sigma.
You can then use it with mvnrnd as in the example above and look at the plot.
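To connect this back to the four-class setup in the question, a minimal sketch (the covariance here is an assumed scaled identity, which keeps each cluster round; tune it to control the spread):
means = [1 1; -1 1; 1 -1; -1 -1]; % the four class centres from the question
SIGMA = 0.2*eye(2); % assumed covariance, same for every class
n = 250; % points per class
data = zeros(4*n, 2);
for c = 1:4
    data((c-1)*n+1 : c*n, :) = mvnrnd(means(c,:), SIGMA, n);
end
plot(data(:,1), data(:,2), '+') % four clusters around the four means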