How to generate a 2D random vector in MATLAB? - matlab

I have a non-negative function f defined on a unit square S = [0,1] x [0,1] such that f integrates to 1 over S.
My question is, how can I use MATLAB to generate a 2D random vector from S according to the probability density function f?

Rejection Sampling
The suggestion Luis Mendo made is very good because it applies to nearly any density function. Based on this answer I wrote the code below.
An important point when using rejection sampling this way is that you must know the maximum of your pdf within the range. If you overestimate the maximum, your code will only run slower. If you underestimate it, the samples will follow the wrong distribution!
The idea is to sample many uniformly distributed points and accept each one with a probability proportional to the probability density at that point.
pdf=@(x).5.*x(:,1)+3./2.*x(:,2);
maximum=2; %Right maximum for THIS EXAMPLE.
%If you are unable to determine the maximum of your
%function within the [0,1]x[0,1] range, please give an example.
result=[];
n=10;
while (size(result,1)<n)
    %1. sample random point:
    val=rand(1,2);
    %2. Accept with probability pdf(val)/maximum
    if rand<pdf(val)/maximum
        %append to solution
        result(end+1,:)=val;
    end
end
I know that this solution is not a fast implementation, but I wanted to start with an implementation as simple as possible to make sure that the concept of rejection sampling becomes clear.
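For larger n, the same idea can be vectorized by proposing a whole batch of points at once. The following is only a sketch for the example pdf above; the batch size factor 1.2 and the fixed seed are my own choices, not part of the original answer.

```matlab
rng(0);                                   % fixed seed, only for reproducibility
pdf = @(x) 0.5.*x(:,1) + 1.5.*x(:,2);     % example pdf from above
maximum = 2;                              % maximum of the pdf on [0,1]x[0,1]
n = 1e4;                                  % number of samples wanted
result = zeros(0,2);
while size(result,1) < n
    % The pdf integrates to 1 on a unit-area square, so the expected
    % acceptance rate is 1/maximum; propose slightly more than needed.
    m = ceil(1.2*maximum*(n - size(result,1)));
    cand = rand(m,2);                     % uniform candidate points
    keep = rand(m,1) < pdf(cand)/maximum; % accept with probability pdf/maximum
    result = [result; cand(keep,:)];
end
result = result(1:n,:);                   % drop any surplus samples
```

The sample means should come out close to the theoretical values E[x] = 13/24 and E[y] = 5/8 for this pdf.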

ICDF
Besides rejection sampling, there is a different approach that addresses the problem on a more mathematical level, but you need to sit down and do some math first to end up with a better solution. For one-dimensional distributions you typically sample using the ICDF (inverse cumulative distribution function), simply calling ICDF(rand(n,1)) to get random samples.
If you manage to do the math, you could instead define two functions for your PDF in MATLAB: ICDF1 (the ICDF of the marginal distribution of the first dimension) and ICDF2 (the ICDF of the second dimension, conditional on the first).
ICDF1 maps uniformly distributed random samples to sample values for the first dimension of your distribution.
ICDF2 maps the output of ICDF1, together with a second set of uniformly distributed samples, to values for the second dimension.
Here is some MATLAB code, assuming you have already defined ICDF1 and ICDF2:
samples=ICDF1(rand(n,1));
samples(:,2)=ICDF2(samples,rand(n,1));
The great advantage of this solution is that it does not reject any samples, so it is potentially much faster.
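To make this concrete, here is a worked sketch for the example pdf f(x,y) = 0.5*x + 1.5*y used in the rejection sampling section. The two inverse CDFs are derived by hand for this specific pdf (my own derivation, so please double-check it against your function): the marginal CDF of x is F1(x) = 0.25*x^2 + 0.75*x, and the conditional CDF of y given x is F2(y|x) = (0.5*x*y + 0.75*y^2) / (0.5*x + 0.75). Solving each quadratic for the root in [0,1] gives:

```matlab
rng(0);                                   % fixed seed, only for reproducibility
% Inverse of F1(x) = 0.25*x^2 + 0.75*x
ICDF1 = @(u) (-3 + sqrt(9 + 16*u)) / 2;
% Inverse of F2(y|x) = (0.5*x*y + 0.75*y^2) / (0.5*x + 0.75), solved for y
ICDF2 = @(x, v) (-x + sqrt(x.^2 + 3*v.*(2*x + 3))) / 3;

n = 1e5;
samples = ICDF1(rand(n,1));                     % first coordinate
samples(:,2) = ICDF2(samples(:,1), rand(n,1));  % second coordinate, given the first
```

Every uniform draw is turned into a sample, so nothing is rejected. For a different pdf you would have to redo both derivations, which is not always possible in closed form.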

Related

Is there a way to get the probability from the probability density in multivariate kernel estimation?

I have a question about multivariate kernel density estimation in MATLAB, which I am using for the first time.
I have 3-dimensional sample data (x, y, z axes) and want to find the probability of being in a certain volume using kernel density estimation. So I used the mvksdensity function in MATLAB and got the probability density (estimated function values) at the points I chose.
What I originally wanted to do was to compute (if I could find the function) the triple integral of the multivariate density over a given volume. But the mvksdensity function only returns the density estimates and does not return the function. I thought there would be an easy way to compute the probability from the density, but I'm stuck. Does anyone have any useful information for this? Thanks in advance.
I thought about using the fitdist function to find the distribution, but it only works for univariate kernel distributions.
I also tried to use mvncdf, which returns the cdf of the multivariate normal distribution for each row of the sample data once the mean and the standard deviation are set. But then I have to calculate the probability of the given volume for the normal distribution at every data point and add them up, which will be inefficient for a large amount of data, and I don't know if it's a correct way.
I can suggest the following Monte-Carlo approach. You find a master volume that contains the entire mass of the estimated probability density. This should be as small as possible for the sake of efficiency. Then you generate a large number of test points in the master volume, either on a grid or randomly according to a uniform distribution. The probability content of a specific volume V can be estimated by the sum of the density values of the test points in V over the sum of the density values of all test points. I am afraid, however, that in 3D you would need at least 1E6 test points, probably more. If you give me access to your sample, I would be pleased to try out my suggestion. It should also be fairly easy to work out an estimate of the standard error of the estimated probability content of V.
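A minimal sketch of this Monte-Carlo idea, assuming the Statistics and Machine Learning Toolbox is available. The placeholder sample, the box-shaped target volume V, the padding of the master volume and the Silverman rule-of-thumb bandwidth are all my own assumptions for illustration; replace them with your data and your volume.

```matlab
rng(0);                                   % fixed seed, only for reproducibility
data = randn(500,3);                      % placeholder for your n-by-3 sample (assumption)
[n,d] = size(data);
bw = std(data) * (4/((d+2)*n))^(1/(d+4)); % Silverman's rule of thumb, one bandwidth per axis

% Master volume: bounding box of the data, padded by a few bandwidths
lo = min(data) - 4*bw;
hi = max(data) + 4*bw;

% Uniformly distributed test points in the master volume (needs R2016b+ for expansion)
m = 1e5;                                  % increase for better accuracy
pts = lo + rand(m,d) .* (hi - lo);
dens = mvksdensity(data, pts, 'Bandwidth', bw);

% Target volume V (hypothetical): the box [-1,1]^3
Vlo = [-1 -1 -1];
Vhi = [ 1  1  1];
inV = all(pts >= Vlo & pts <= Vhi, 2);

% Probability content of V: density mass in V over total density mass
p = sum(dens(inV)) / sum(dens);
```

Any volume for which you can write a membership test like inV can be handled the same way; only that one line changes.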

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of peaks. I want to approximate the curve by fitting a gaussian at each peak. How should I go about finding the optimized gaussian parameters ? I would like to know if there is any inbuilt function which will make my task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using lsqcurvefit() in MATLAB. The MSE is low. However, I have an additional hard constraint that the value of the approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting my current working code here. I would like to have a solution which obeys the hard constraint at the peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata)myFun(x,xdata,pks,locs); %pks,locs are the peak amplitudes and locations already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma, resnorm] = lsqcurvefit(fun,x0,xdata,ydata); %xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f=myFun(sigma,xdata,a,c)
    % a holds the fixed amplitudes, c the means of the individual gaussians
    f=zeros(size(xdata));
    for i = 1:6 %use 6 gaussians to approximate function
        f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
    end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the MathWorks Optimization Toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The Curve Fitting Toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
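As a rough sketch of the steps above (assuming the Optimization Toolbox and R2016b or later), the following fits only the widths with lsqnonlin(), with a lower bound to keep them positive and a few different initial guesses. The synthetic curve, the two peaks and the particular starting values are made up for illustration.

```matlab
% Synthetic curve (assumption): two Gaussians with known locations and amplitudes
x = linspace(0,10,201)';
locs = [3 7];                 % known peak locations (1-by-k)
pks  = [1.0 0.6];             % known peak amplitudes (1-by-k)
sigmaTrue = [0.5 0.9];        % widths we pretend not to know

% g(sigma): sum-of-Gaussians approximation evaluated at every point in x
g = @(sigma) sum(pks .* exp(-(x - locs).^2 ./ (2*sigma.^2)), 2);
y = g(sigmaTrue);

resid = @(sigma) y - g(sigma);   % lsqnonlin squares and sums the residuals itself
lb = 1e-6*ones(1,2);             % constraint: all widths greater than zero
opts = optimoptions('lsqnonlin','Display','off');

best = inf;
for s0 = [0.2 1 3]               % try several initial guesses, keep the best fit
    [s, resnorm] = lsqnonlin(resid, s0*ones(1,2), lb, [], opts);
    if resnorm < best
        best = resnorm;
        sigmaHat = s;
    end
end
```

With real data you would replace y by your measured curve and pick the initial guesses from the apparent width around each peak.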
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
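One way this could look with fmincon() is sketched below, again assuming the Optimization Toolbox and R2016b or later. Both the widths and the weights are optimized, and a nonlinear equality constraint forces the model to match the curve at the peak locations. The synthetic two-peak curve, the starting point and the bounds are assumptions for illustration only.

```matlab
% Synthetic curve (assumption): two overlapping Gaussians
xdata = linspace(0,10,201)';
locs  = [3; 6];                                  % known peak locations
ytrue = @(x) 1.0*exp(-(x-3).^2/(2*0.8^2)) + 0.7*exp(-(x-6).^2/(2*1.2^2));
ydata = ytrue(xdata);
pks   = ytrue(locs);                             % curve values at the peak locations

k = numel(locs);
% Parameter vector p = [sigma(1..k); weight(1..k)] (column vector)
model = @(p,x) exp(-(x - locs').^2 ./ (2*p(1:k)'.^2)) * p(k+1:2*k);

obj     = @(p) sum((ydata - model(p,xdata)).^2); % squared-error loss
nonlcon = @(p) deal([], model(p,locs) - pks);    % no inequalities; equality at the peaks

p0 = [ones(k,1); pks];                           % initial guess
lb = [1e-3*ones(k,1); zeros(k,1)];               % widths > 0, weights >= 0
opts = optimoptions('fmincon','Display','off');
p = fmincon(obj, p0, [], [], [], [], lb, [], nonlcon, opts);

sigmaHat  = p(1:k);
weightHat = p(k+1:2*k);
peakError = max(abs(model(p,locs) - pks));       % should be within the constraint tolerance
```

Because the constraint is an equality, the fit quality away from the peaks may get slightly worse in exchange for matching the peak values.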
You can use the Expectation–Maximization algorithm to fit a mixture of Gaussians to your data. It does not depend on the dimension of the data.
In the MATLAB documentation you can look up gmdistribution.fit or fitgmdist.

weighted correlation for case of matrix

I have a question about how to calculate weighted correlations for matrices. Following Wikipedia, I have written the following three functions.
1. Weighted mean calculation
function y = weighted_mean(x,w)
    %assume that weight vector and input vector have same length
    n=length(x);
    total=0.0;        %named total to avoid shadowing the built-in sum
    sum_weight=0.0;
    for i=1:n
        total=total+x(i)*w(i);
        sum_weight=sum_weight+w(i);
    end
    y=total/sum_weight;
end
2. Weighted covariance
function result=cov_weighted(x,y,w)
    n=length(x);
    mx=weighted_mean(x,w); %compute the weighted means once, outside the loop
    my=weighted_mean(y,w);
    sum_covar=0.0;
    sum_weight=0;
    for i=1:n
        sum_covar=sum_covar+w(i)*(x(i)-mx)*(y(i)-my);
        sum_weight=sum_weight+w(i);
    end
    result=sum_covar/sum_weight;
end
3. Finally, weighted correlation
function corr_weight=weighted_correlation(x,y,w)
    corr_weight=cov_weighted(x,y,w)/sqrt(cov_weighted(x,x,w)*cov_weighted(y,y,w));
end
Now I want to apply the weighted correlation method to matrices, as in this link:
http://www.mathworks.com/matlabcentral/fileexchange/20846-weighted-correlation-matrix/content/weightedcorrs.m
I did not understand how to apply it, which is why I wrote my own, but I need it for the case where the inputs are matrices. Thanks very much.
@dato-datuashvili Maybe I am providing too much information...
1) I would like to stress that weighted correlation matrices are rarely evaluated. This is because you have to provide the weights beforehand. Unless you have a clear reason to choose particular weights, there is no obvious way to provide them.
How can you tell that one measurement in your sample is more or less important than another?
Having said that, the weights are up to you! You have to choose them!
So people usually consider just the plain correlation matrix (no weights, or all weights equal, e.g. w_i=1).
If you have a clear way to choose good weights, just disregard this part.
2) I understand that you want to test your code. In order to do that, you need correlated random variables. How do you generate them?
Multivariate normal distributions are the simplest case. See the Wikipedia page about them: Multivariate Normal Distribution (see the item "Drawing values from the distribution"; Wikipedia shows how to generate random numbers from this distribution using the Cholesky decomposition). The 2-variate case is much simpler. See for instance Generate Correlated Normal Random Variables.
The good news is that if you are using MATLAB there is a function for you. See Matlab: Random numbers from the multivariate normal distribution.
In order to use this function you have to provide the desired means and covariances. [Note that you are playing the role of nature here. You are generating the data! In real life, you are going to apply your function to real data. What I am trying to say is that this step is only useful for tests. Furthermore, pay attention to the fact that in the MATLAB function you provide the covariances, whereas your function evaluates correlations (covariances normalized by standard deviations). In the 2-dimensional case (which is the case of your function) it is possible to provide the correlation directly. See the Math.StackExchange page I linked above.]
3) Finally, you can apply them to your function. Generate X and Y from a multivariate normal distribution, provide the vector of weights w to your function weighted_correlation, and you are done!
I hope this provides what you need!
Daniel
Update:
% From the matlab page
mu = [2 3];
SIGMA = [1 1.5; 1.5 3];
n=100;
r = mvnrnd(mu,SIGMA,n); % n-by-2 matrix, one column per variable
x = r(:,1);
y = r(:,2);
% Using your code
w=ones(n,1);
corr_weight=weighted_correlation(x,y,w);
% Remember that SIGMA is a covariance and corr_weight is a correlation.
% To compare like with like, use cov_weighted(x,y,w) instead.
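For the matrix case asked about in the question, the same formulas can be written in vectorized form so that all pairwise weighted correlations between the columns of a matrix are computed at once. This is only a sketch that follows the definitions in the question's own functions (it is not the File Exchange code); the test data and weights below are made up, and implicit expansion needs R2016b or later.

```matlab
rng(1);                          % fixed seed, only for reproducibility
X = randn(200,3);                % observations in rows, variables in columns (made-up data)
X(:,2) = X(:,1) + 0.5*X(:,2);    % make columns 1 and 2 correlated
w = rand(200,1);                 % one weight per observation (made-up weights)

wn = w(:) / sum(w);              % normalized weights, sum to 1
mu = wn' * X;                    % 1-by-p weighted means of the columns
Xc = X - mu;                     % center each column by its weighted mean
C  = Xc' * (wn .* Xc);           % p-by-p weighted covariance matrix
s  = sqrt(diag(C));              % weighted standard deviations
R  = C ./ (s * s');              % p-by-p weighted correlation matrix
```

Entry R(i,j) should agree with weighted_correlation(X(:,i),X(:,j),w) from the question, and with all weights equal R reduces to the ordinary correlation matrix.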

Matlab 'entropy()' on Normal RVs

If I estimate the entropy of a vector of standard normal random variables using the Matlab entropy() function, I get an answer somewhere in the region of 4, whereas the actual entropy should be 0.5 * log(2*pi*e*sigma^2) which is approximately equal to 1.4.
Does anyone know where the discrepancy is coming from?
Note: To save time, here is the MATLAB code:
X = randn(1000,1);
disp('The entropy of X is')
disp(entropy(X))
Please read the help (help entropy) or the documentation for entropy. You'll see that it's designed for images and uses a histogram technique rather than calculating the entropy analytically. You'll need to create your own function if you want the formula from Wikipedia, but as the formula is so simple, that should be no problem.
I believe that the reason you're getting such divergent answers is that entropy scales the bins of the histogram by the number of elements. If you want to use such an estimation technique, you'll want to use hist and scale the bins by area. See this StackOverflow question.
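A small sketch of what such a histogram estimate could look like, using hist and dividing the bin probabilities by the bin width so that they approximate a density. The number of bins and the sample size are arbitrary choices of mine, and the result is in nats (natural logarithm), matching the formula in the question.

```matlab
rng(0);                                  % fixed seed, only for reproducibility
X = randn(1e5,1);                        % standard normal sample
[counts, centers] = hist(X, 100);        % 100 equally spaced bins
width = centers(2) - centers(1);         % bin width
p = counts / sum(counts);                % probability mass per bin
nz = p > 0;                              % skip empty bins to avoid log(0)

% Differential entropy estimate: -sum p_i * log(p_i / width)
H = -sum(p(nz) .* log(p(nz) / width));

Htheory = 0.5*log(2*pi*exp(1));          % exact value for sigma = 1
```

With these settings H should land close to Htheory; the estimate depends somewhat on the number of bins and on the sample size.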

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness of fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. Someone suggested that I do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct.
I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is x_n = x_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(x_n) = ln(x_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n, where e_n is the error term. Nice! Because now we can test goodness of fit using the standard R^2 measure, which MATLAB will return in the stats output if you use the regress function to perform OLS.
Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares where you fix the parameter ln(x_0) to its known value.
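A minimal sketch of the log-linear regression described above, assuming the Statistics and Machine Learning Toolbox for regress. The synthetic data (x_0 = 2, b = 0.7, small multiplicative noise) is invented purely to have something to fit; note that the logarithm requires all x_n to be positive.

```matlab
rng(0);                                          % fixed seed, only for reproducibility
t = (0:0.1:5)';                                  % timestamps t_n
x = 2*exp(-0.7*t) .* exp(0.05*randn(size(t)));   % synthetic decay, noise keeps x > 0

% OLS on ln(x_n) = ln(x_0) - b*t_n + e_n; the column of ones is the intercept
[coef, ~, ~, ~, stats] = regress(log(x), [ones(size(t)) t]);

x0_hat = exp(coef(1));     % estimate of x_0
b_hat  = -coef(2);         % estimate of the decay rate b
R2     = stats(1);         % first element of stats is R^2
```

Keep in mind that this R^2 measures the fit on the log scale, which weights the small late-time values more heavily than a fit on the original scale would.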