Weighted correlation for matrices - MATLAB

I have a question about how to calculate weighted correlations for matrices. From Wikipedia I have created the three following pieces of code:
1. weighted mean
function y = weighted_mean(x, w)
% Weighted mean of vector x with weight vector w.
% Assume that the weight vector and the input vector have the same length.
n = length(x);
total = 0.0;          % renamed from "sum" to avoid shadowing the builtin
sum_weight = 0.0;
for i = 1:n
    total = total + x(i)*w(i);
    sum_weight = sum_weight + w(i);
end
y = total/sum_weight;
end
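For reference, the loop above can be replaced by a one-line vectorized equivalent (same result, no loop):
y = sum(x(:).*w(:)) / sum(w(:));   % vectorized weighted mean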
2. weighted covariance
function result = cov_weighted(x, y, w)
% Weighted covariance of vectors x and y with weight vector w.
n = length(x);
mx = weighted_mean(x, w);   % compute the weighted means once, outside the loop
my = weighted_mean(y, w);
sum_covar = 0.0;
sum_weight = 0.0;
for i = 1:n
    sum_covar = sum_covar + w(i)*(x(i) - mx)*(y(i) - my);
    sum_weight = sum_weight + w(i);
end
result = sum_covar/sum_weight;
end
and finally:
3. weighted correlation
function corr_weight = weighted_correlation(x, y, w)
% Weighted correlation coefficient of vectors x and y with weights w.
corr_weight = cov_weighted(x, y, w)/sqrt(cov_weighted(x, x, w)*cov_weighted(y, y, w));
end
Now I want to apply the weighted correlation method to matrices, as in this link:
http://www.mathworks.com/matlabcentral/fileexchange/20846-weighted-correlation-matrix/content/weightedcorrs.m
I did not understand how to apply it from that submission, which is why I created my own functions, but I need them for the case where the inputs are matrices. Thanks very much.

@dato-datuashvili Maybe I am providing too much information...
1) I would like to stress that evaluating weighted correlation matrices is very uncommon. This is because you have to provide the weights beforehand, and unless you have a clear reason to choose particular weights, there is no obvious way to pick them.
How can you tell that one measurement of your sample is more or less important than another measurement?
Having said that, the weights are up to you: you have to choose them!
So people usually consider just the plain correlation matrix (no weights, or equivalently all weights the same, e.g. w_i = 1).
If you have a clear way to choose good weights, just disregard this part.
2) I understand that you want to test your code. In order to do that, you need correlated random variables. How do you generate them?
Multivariate normal distributions are the simplest case. See the Wikipedia page about them: Multivariate Normal Distribution (see the item "Drawing values from the distribution"; Wikipedia shows you how to generate random numbers from this distribution using the Cholesky decomposition). The 2-variate case is much simpler; see for instance Generate Correlated Normal Random Variables.
The good news is that if you are using Matlab there is a function for you. See Matlab: Random numbers from the multivariate normal distribution.
In order to use this function you have to provide the desired means and covariances. Note that you are playing the role of nature here: you are generating the data! In real life you would apply your function to real data, so this step is only useful for tests. Furthermore, pay attention to the fact that in the Matlab function you provide the variances and evaluate the correlations (covariances normalized by standard deviations). In the 2-dimensional case (which is the case of your function) it is possible to provide the correlation directly; see the Math.StackExchange page I linked above.
3) Finally, you can apply them to your function. Generate X and Y from a multivariate normal distribution, provide the vector of weights w to your function weighted_correlation, and you are done!
I hope this provides what you need!
Daniel
Update:
% From the Matlab page
mu = [2 3];
SIGMA = [1 1.5; 1.5 3];
n = 100;
r = mvnrnd(mu, SIGMA, n);   % mvnrnd returns a single n-by-2 matrix
x = r(:,1);
y = r(:,2);
% Using your code
w = ones(n,1);
corr_weight = weighted_correlation(x, y, w);
% Remember that SIGMA is a covariance matrix and corr_weight is a correlation.
% In order to calculate the same thing, just use cov_weighted instead.
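For the matrix case the question asks about, here is a minimal vectorized sketch of my own (not the File Exchange submission linked above). It assumes each column of X is a variable, each row an observation, and w holds one non-negative weight per observation:
function C = weighted_corr_matrix(X, w)
% Weighted correlation matrix: columns of X are variables,
% w is a vector of non-negative weights, one per row of X.
w = w(:) / sum(w);            % normalize the weights
mu = w' * X;                  % 1-by-p row of weighted column means
Xc = X - mu;                  % center the columns (R2016b+; use bsxfun on older versions)
S = Xc' * (Xc .* w);          % p-by-p weighted covariance matrix
d = sqrt(diag(S));            % weighted standard deviations
C = S ./ (d * d');            % normalize covariances to correlations
end
With w = ones(n,1) this should reduce to the ordinary correlation matrix, so corrcoef(X) makes an easy sanity check.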

Related

Inverse cumulative distribution function in MATLAB given an empirical PDF

Is it possible to generate random numbers with a distribution that depends on empirical probability data? I imagine this is possible by taking the inverse cumulative distribution function. I have seen some examples where this is done in MATLAB (the software that I'm using), but all of those examples have an underlying analytic form for the probability; here I have only the PDF. For instance, I have data of probabilities for a particular event. Most of the probabilities are zero, and hence not unique, but not all.
My goal is to generate the random numbers and then figure out what the distribution is. I'd really appreciate if people can help clear up my thinking here.
EDIT:
I think I want something like:
cdf=cumsum(pdf); % calculate cdf from the empirical pdf
M=length(cdf);
xq=linspace(0,1,M);
invcdf=interp1(cdf,xq,xq); % calculate inverse cdf, i.e., x
but how do I take into account that a lot of the values of the pdf are zero and not unique? Is this even the right approach?
I am basing my answer on Inverse empirical cumulative distribution function from the MathWorks File Exchange. See that link for other suggestions to solving your problem.
% y - input: data set
% q - input: desired quantile (can be a scalar or a vector)
% xq - output: ICDF at specified quantile
[f, x] = ecdf(y);
xq = zeros(size(q));
for ii = 1:length(q)
    xq(ii) = min(x(q(ii) <= f));
end
I'd eliminate the for loop if you're only using scalars. Also, there may be a more efficient way to vectorize the for loop, but this should at least get you started.
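To address the non-uniqueness the question worries about: interp1 rejects duplicate sample points, and flat stretches of the CDF (where the pdf is zero) produce exactly those duplicates. A minimal sketch of my own (with made-up pdf values) that drops the duplicates before inverting:
pdf = [0 0 .2 .5 .3 0];                % hypothetical tabulated pdf (note the zeros)
xgrid = linspace(0, 1, numel(pdf));    % support of the pdf
cdf = cumsum(pdf) / sum(pdf);          % normalized empirical cdf
[cdfu, iu] = unique(cdf);              % drop repeats caused by zero-pdf bins
samples = interp1(cdfu, xgrid(iu), rand(1000,1));  % invert the cdf by interpolation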

What are the differences between different gaussian functions in Matlab?

y = gauss(x,s,m)
Y = normpdf(X,mu,sigma)
R = normrnd(mu,sigma)
What are the basic differences between these three functions?
Y = normpdf(X,mu,sigma) is the probability density function for a normal distribution with mean mu and stdev sigma. Use this if you want to know the relative likelihood at a point X.
R = normrnd(mu,sigma) takes random samples from the same distribution as above. So use this function if you want to simulate something based on the normal distribution.
y = gauss(x,s,m) at first glance looks like the exact same function as normpdf(). But there is a slight difference: Its calculation is
Y = EXP(-(X-M).^2./S.^2)./(sqrt(2*pi).*S)
while normpdf() uses
Y = EXP(-(X-M).^2./(2*S.^2))./(sqrt(2*pi).*S)
This means that the integral of gauss() from -inf to inf is 1/sqrt(2). Therefore it isn't a legit PDF and I have no clue where one could use something like this.
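If you want to verify the 1/sqrt(2) claim numerically, a quick check with integral():
% numerically confirm that gauss()'s formula integrates to 1/sqrt(2)
s = 1; m = 0;
g = @(x) exp(-(x-m).^2./s.^2) ./ (sqrt(2*pi).*s);
integral(g, -Inf, Inf)   % ans is approximately 0.7071 = 1/sqrt(2), not 1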
For completeness we also have to mention p = normcdf(x,mu,sigma). This is the normal cumulative distribution function. It gives the probability that a value is between -inf and x.
A few more insights to add to Leander's good answer:
When comparing functions it is good to look at their source or toolbox. gauss is not a function written by MathWorks, so it may duplicate functionality that already comes with Matlab.
Also, both normpdf and normrnd are part of the Statistics and Machine Learning Toolbox, so users without it cannot use them. However, generating random numbers from a normal distribution is quite a common task, so it should be accessible to users who have only core Matlab. Hence core Matlab provides randn, which makes normrnd partly redundant.
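For example, a draw from N(mu, sigma^2) without the toolbox is just a shift-and-scale of randn:
mu = 2; sigma = 3;
R_toolbox = normrnd(mu, sigma, [5 1]);   % needs the Statistics Toolbox
R_core    = mu + sigma .* randn(5, 1);   % same distribution, core Matlab only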

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of the peaks. I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimized Gaussian parameters? I would like to know if there is any built-in function which will make my task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using
lsqcurvefit() in Matlab. The MSE is small. However, I have an additional hard constraint: the value of the approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting my current working code here. I would like a solution which obeys the hard constraint at the peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata) myFun(x,xdata,pks,locs); % pks and locs are the peak amplitudes and locations, already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma resnorm] = lsqcurvefit(fun,x0,xdata,ydata); %xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f = myFun(sigma, xdata, a, c)
% a holds the (fixed) amplitudes, c the (fixed) means of the individual Gaussians
f = zeros(size(xdata));
for i = 1:6   % use 6 Gaussians to approximate the function
    f = f + a(i) * exp(-(xdata - c(i)).^2 ./ (2*sigma(i)^2));
end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the Mathworks optimization toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The curve fitting toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
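Here is a minimal sketch of both points with lsqcurvefit, reusing fun, xdata and ydata from the question's code (the bound value and the number of restarts are illustrative assumptions):
lb = 1e-6 * ones(1, 6);      % keep every sigma strictly positive
ub = [];                     % no upper bound
best = inf;
for k = 1:5                  % a few random restarts to guard against local minima
    guess = rand(1, 6) * 2;  % hypothetical initial widths
    [s, resnorm] = lsqcurvefit(fun, guess, xdata, ydata, lb, ub);
    if resnorm < best
        best = resnorm;
        sigma = s;
    end
end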
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
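A sketch of that constrained formulation (untested; it assumes pks and locs are 1-by-6 vectors and reuses myFun and x0 from the question):
% loss: squared error over the whole curve
L = @(sigma) sum((ydata - myFun(sigma, xdata, pks, locs)).^2);
% nonlinear equality constraint: the curve must hit each peak height exactly
nonlcon = @(sigma) deal([], myFun(sigma, locs, pks, locs) - pks);
lb = 1e-6 * ones(1, 6);      % sigma > 0
sigma = fmincon(L, x0, [], [], [], [], lb, [], nonlcon);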
You can use the Expectation-Maximization algorithm for fitting a mixture of Gaussians to your data; it doesn't care about the data dimension.
In the MATLAB documentation, look up gmdistribution.fit or fitgmdist.
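Note, though, that fitgmdist expects raw samples (one observation per row), not curve values y = f(x). A minimal sketch with synthetic data:
X = [randn(500,1); 3 + 0.5*randn(500,1)];   % synthetic 1-D samples from two clusters
gm = fitgmdist(X, 2);                       % fit a 2-component mixture via EM
disp(gm.mu)                                 % estimated component means
disp(squeeze(gm.Sigma))                     % estimated component variances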

How to generate a 2D random vector in MATLAB?

I have a non-negative function f defined on the unit square S = [0,1] x [0,1] such that its integral over S is 1 (i.e., f is a probability density on S).
My question is, how can I use MATLAB to generate a 2D random vector from S according to the probability density function f?
Rejection Sampling
The suggestion Luis Mendo made is very good because it applies to nearly all distribution functions. Based on this answer I wrote the code below.
An important point when using rejection sampling this way is that you must know the maximum of your pdf within the range. If you over-estimate the maximum, your code will only run slower. If you under-estimate it, it will produce wrong numbers!
The idea is that you sample many uniformly distributed points and accept each one with probability proportional to the probability density at that point.
pdf = @(x) .5.*x(:,1) + 3./2.*x(:,2);
maximum = 2;   % correct maximum for THIS EXAMPLE
% If you are unable to determine the maximum of your
% function within the [0,1]x[0,1] range, please give an example.
result = [];
n = 10;
while (size(result,1) < n)
    % 1. sample a random point:
    val = rand(1,2);
    % 2. accept with probability pdf(val)/maximum
    if rand < pdf(val)/maximum
        % append to the solution
        result(end+1,:) = val;
    end
end
I know that this solution is not a fast implementation, but I wanted to start with an implementation as simple as possible to make sure that the concept of rejection sampling becomes clear.
ICDF
Besides rejection sampling, there is a different approach to this problem on a more mathematical level, but you need to sit down and do some math first to end up with a better solution. For 1-dimensional distributions you typically sample using the ICDF (inverse cumulative distribution function), simply calling ICDF(rand(n,1)) to get random samples.
If you manage to do the math, you could instead define two functions for your PDF: ICDF1 (the ICDF for the first dimension) and ICDF2 (the ICDF for the second dimension) in Matlab.
The first, ICDF1, maps uniformly distributed random samples to sample values for the first dimension of your random distribution.
The second, ICDF2, maps the output of ICDF1 together with uniformly distributed samples to your intended solution.
Here is some Matlab code assuming you have already defined ICDF1 and ICDF2:
samples=ICDF1(rand(n,1));
samples(:,2)=ICDF2(samples,rand(n,1));
The great advantage of this solution is that it does not reject any samples, so it is potentially much faster.
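For the example pdf used above, pdf(x,y) = 0.5*x + 1.5*y, the math works out in closed form. A sketch of the derivation (worth double-checking): the marginal CDF of x is F1(x) = 0.25*x^2 + 0.75*x, and the conditional CDF of y given x is F2(y|x) = (0.5*x*y + 0.75*y^2)/(0.5*x + 0.75); inverting each with the quadratic formula gives:
ICDF1 = @(u) -1.5 + sqrt(2.25 + 4*u);                                   % inverts F1
ICDF2 = @(x,u) (-0.5*x + sqrt(0.25*x.^2 + 3*u.*(0.5*x + 0.75))) / 1.5;  % inverts F2 given x
n = 10000;
samples = ICDF1(rand(n,1));                     % first coordinate
samples(:,2) = ICDF2(samples(:,1), rand(n,1));  % second coordinate, conditional on the first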

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness of fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. Someone suggested that I do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes, so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct.
I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get ln(y_n) = ln(y_0) - b * t_n. This suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which Matlab will return in the stats structure if you use the regress function to perform OLS.
Hope this helps. Again I emphasize that I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares, where you bind the parameter ln(x_0) to its known value.
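A minimal sketch of that log-linear fit with regress (the data here is synthetic, just to show the mechanics; stats(1) is the R^2 value):
t = (0:0.1:5)';                                   % hypothetical timestamps
x = 4 * exp(-1.3*t) .* exp(0.05*randn(size(t)));  % synthetic decay data, x > 0
X = [ones(size(t)) t];                            % design matrix: intercept + slope
[b, ~, ~, ~, stats] = regress(log(x), X);         % OLS on the log-linear model
x0_hat = exp(b(1));                               % estimated initial value
b_hat = -b(2);                                    % estimated decay rate
R2 = stats(1);                                    % goodness-of-fit (R^2)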