Fitting a 2D Gaussian to 2D Data Matlab - matlab

I have a vector of x and y coordinates drawn from two separate unknown Gaussian distributions. I would like to fit these points to a three dimensional Gauss function and evaluate this function at any x and y.
So far the only manner I've found of doing this is using a Gaussian Mixture model with a maximum of 1 component (see code below) and going into the handle of ezcontour to take the X, Y, and Z data out.
The problems with this method is firstly that its a very ugly roundabout manner of getting this done and secondly the ezcontour command only gives me a grid of 60x60 but I need a much higher resolution.
Does anyone know a more elegant and useful method that will allow me to find the underlying Gauss function and extract its value at any x and y?
Code:
GaussDistribution = fitgmdist([varX varY],1); %Not exactly the intention of fitgmdist, but it gets the job done.
h = ezcontour(#(x,y)pdf(GaussDistributions,[x y]),[-500 -400], [-40 40]);

Gaussian Distribution in general form is like this:
I am not allowed to upload picture but the Formula of gaussian is:
1/((2*pi)^(D/2)*sqrt(det(Sigma)))*exp(-1/2*(x-Mu)*Sigma^-1*(x-Mu)');
where D is the data dimension (for you is 2);
Sigma is covariance matrix;
and Mu is mean of each data vector.
here is an example. In this example a guassian is fitted into two vectors of randomly generated samples from normal distributions with parameters N1(4,7) and N2(-2,4):
Data = [random('norm',4,7,30,1),random('norm',-2,4,30,1)];
X = -25:.2:25;
Y = -25:.2:25;
D = length(Data(1,:));
Mu = mean(Data);
Sigma = cov(Data);
P_Gaussian = zeros(length(X),length(Y));
for i=1:length(X)
for j=1:length(Y)
x = [X(i),Y(j)];
P_Gaussian(i,j) = 1/((2*pi)^(D/2)*sqrt(det(Sigma)))...
*exp(-1/2*(x-Mu)*Sigma^-1*(x-Mu)');
end
end
mesh(P_Gaussian)
run the code in matlab. For the sake of clarity I wrote the code like this it can be written more more efficient from programming point of view.

Related

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

3d plot with ksdensity in matlab

I have a problem in matlab.
I used a ksdensity function on a vector of deltaX, which was my computed X minus actual X.
And I did the same on deltaY.
Then I used plot on that data. This gave me two 2d plots.
As I have two plots showing how (in)accurate was my system in computing X and Y (something like gaussian bell it was). Now I would like to have one plot but in 3d.
The code was just like that:
[f,xi] = ksdensity(deltaX);
figure;
plot(xi,f)
Ok what I'm about to show is probably not the correct way to visualize your problem, mostly because I'm not quite sure I understand what you're up to. But this will show you an example of how to make the Z matrix as discussed in the comments to your question.
Here's the code:
x = wgn(1000,1,5);%create x and y variables, just noise
y = wgn(1000,1,10);
[f,xi] = ksdensity(x);%compute the ksdensity (no idea if this makes real-world sense)
[f2,xi2] = ksdensity(y);
%create the Z matrix by adding together the densities at each x,y pair
%I doubt this makes real-world sense
for z=1:length(xi)
for zz = 1:length(xi2)
Z(z,zz) = f(z)+f2(zz);
end
end
figure(1)
mesh(xi,xi2,Z)
Here's the result:
I leave it up to you to determine the correct way to visualize your density functions in 3D, this is just how you could make the Z matrix. In short, the Z matrix contains the plot elevation at each x,y coordinate. Hope this helps a little.

matlab: cdfplot of relative error

The figure shown above is the plot of cumulative distribution function (cdf) plot for relative error (attached together the code used to generate the plot). The relative error is defined as abs(measured-predicted)/(measured). May I know the possible error/interpretation as the plot is supposed to be a smooth curve.
X = load('measured.txt');
Xhat = load('predicted.txt');
idx = find(X>0);
x = X(idx);
xhat = Xhat(idx);
relativeError = abs(x-xhat)./(x);
cdfplot(relativeError);
The input data file is a 4x4 matrix with zeros on the diagonal and some unmeasured entries (represent with 0). Appreciate for your kind help. Thanks!
The plot should be a discontinuous one because you are using discrete data. You are not plotting an analytic function which has an explicit (or implicit) function that maps, say, x to y. Instead, all you have is at most 16 points that relates x and y.
The CDF only "grows" when new samples are counted; otherwise its value remains steady, just because there isn't any satisfying sample that could increase the "frequency".
You can check the example in Mathworks' `cdfplot1 documentation to understand the concept of "empirical cdf". Again, only when you observe a sample can you increase the cdf.
If you really want to "get" a smooth curve, either 1) add more points so that the discontinuous line looks smoother, or 2) find any statistical model of whatever you are working on, and plot the analytic function instead.

Discrete surface integral with cumsum

I have a matrix z(x,y)
This is an NxN abitary pdf constructed from a unique Kernel density estimation (i.e. not a usual pdf and it doesn't have a function). It is multivariate and can't be separated and is discrete data.
I wan't to construct a NxN matrix (F(x,y)) that is the cumulative distribution function in 2 dimensions of this pdf so that I can then randomly sample the F(x,y) = P(x < X ,y < Y);
Analytically I think the CDF of a multivariate function is the surface integral of the pdf.
What I have tried is using the cumsum function in order to calculate the surface integral and tested this with a multivariate normal against the analytical solution and there seems to be some discrepancy between the two:
% multivariate parameters
delta = 100;
mu = [1 1];
Sigma = [0.25 .3; .3 1];
x1 = linspace(-2,4,delta); x2 = linspace(-2,4,delta);
[X1,X2] = meshgrid(x1,x2);
% Calculate Normal multivariate pdf
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
% My attempt at a numerical surface integral
FN = cumsum(cumsum(F,1),2);
% Normalise the CDF
FN = FN./max(max(FN));
X = [X1(:) X2(:)];
% Analytic solution to a multivariate normal pdf
p = mvncdf(X,mu,Sigma);
p = reshape(p,delta,delta);
% Highlight the difference
dif = p - FN;
error = max(max(sqrt(dif.^2)));
% %% Plot
figure(1)
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');
figure(2)
surf(X1,X2,FN);
xlabel('x1'); ylabel('x2');
figure(3);
surf(X1,X2,p);
xlabel('x1'); ylabel('x2');
figure(5)
surf(X1,X2,dif)
xlabel('x1'); ylabel('x2');
Particularly the error seems to be in the transition region which is the most important.
Does anyone have any better solution to this problem or see what I'm doing wrong??
Any help would be much appreciated!
EDIT: This is the desired outcome of the cumulative integration, The reason this function is of value to me is that when you randomly generate samples from this function on the closed interval [0,1] the higher weighted (i.e. the more likely) values appear more often in this way the samples converge on the expected value(s) (in the case of multiple peaks) this is desired outcome for algorithms such as particle filters, neural networks etc.
Think of the 1-dimensional case first. You have a function represented by a vector F and want to numerically integrate. cumsum(F) will do that, but it uses a poor form of numerical integration. Namely, it treats F as a step function. You could instead do a more accurate numerical integration using the Trapezoidal rule or Simpson's rule.
The 2-dimensional case is no different. Your use of cumsum(cumsum(F,1),2) is again treating F as a step function, and the numerical errors resulting from that assumption only get worse as the number of dimensions of integration increases. There exist 2-dimensional analogues of the Trapezoidal rule and Simpson's rule. Since there's a bit too much math to repeat here, take a look here:
http://onestopgate.com/gate-study-material/mathematics/numerical-analysis/numerical-integration/2d-trapezoidal.asp.
You DO NOT need to compute the 2-dimensional integral of the probability density function in order to sample from the distribution. If you are computing the 2-d integral, you are going about the problem incorrectly.
Here are two ways to approach the sampling problem.
(1) You write that you have a kernel density estimate. A kernel density estimate is a special case of a mixture density. Any mixture density can be sampled by first selecting one kernel (perhaps differently or equally weighted, same procedure applies), and then sampling from that kernel. (That applies in any number of dimensions.) Typically the kernels are some relatively simple distribution such as a Gaussian distribution so that it is easy to sample from it.
(2) Any joint density P(X, Y) is equal to P(X | Y) P(Y) (and equivalently P(Y | X) P(X)). Therefore you can sample from P(Y) (or P(X)) and then from P(X | Y). In order to sample from P(X | Y), you will need to integrate P(X, Y) along a line Y = y (where y is the sampled value of Y), but (this is crucial) you only need to integrate along that line; you don't need to integrate over all values of X and Y.
If you tell us more about your problem, I can help with the details.

How to create 3D joint density plot MATLAB?

I 'm having a problem with creating a joint density function from data. What I have is queue sizes from a stock as two vectors saved as:
X = [askQueueSize bidQueueSize];
I then use the hist3-function to create a 3D histogram. This is what I get:
http://dl.dropbox.com/u/709705/hist-plot.png
What I want is to have the Z-axis normalized so that it goes from [0 1].
How do I do that? Or do someone have a great joint density matlab function on stock?
This is similar (How to draw probability density function in MatLab?) but in 2D.
What I want is 3D with x:ask queue, y:bid queue, z:probability.
Would greatly appreciate if someone could help me with this, because I've hit a wall over here.
I couldn't see a simple way of doing this. You can get the histogram counts back from hist3 using
[N C] = hist3(X);
and the idea would be to normalise them with:
N = N / sum(N(:));
but I can't find a nice way to plot them back to a histogram afterwards (You can use bar3(N), but I think the axes labels will need to be set manually).
The solution I ended up with involves modifying the code of hist3. If you have access to this (edit hist3) then this may work for you, but I'm not really sure what the legal situation is (you need a licence for the statistics toolbox, if you copy hist3 and modify it yourself, this is probably not legal).
Anyway, I found the place where the data is being prepared for a surf plot. There are 3 matrices corresponding to x, y, and z. Just before the contents of the z matrix were calculated (line 256), I inserted:
n = n / sum(n(:));
which normalises the count matrix.
Finally once the histogram is plotted, you can set the axis limits with:
xlim([0, 1]);
if necessary.
With help from a guy at mathworks forum, this is the great solution I ended up with:
(data_x and data_y are values, which you want to calculate at hist3)
x = min_x:step:max_x; % axis x, which you want to see
y = min_y:step:max_y; % axis y, which you want to see
[X,Y] = meshgrid(x,y); *%important for "surf" - makes defined grid*
pdf = hist3([data_x , data_y],{x y}); %standard hist3 (calculated for yours axis)
pdf_normalize = (pdf'./length(data_x)); %normalization means devide it by length of
%data_x (or data_y)
figure()
surf(X,Y,pdf_normalize) % plot distribution
This gave me the joint density plot in 3D. Which can be checked by calculating the integral over the surface with:
integralOverDensityPlot = sum(trapz(pdf_normalize));
When the variable step goes to zero the variable integralOverDensityPlot goes to 1.0
Hope this help someone!
There is a fast way how to do this with hist3 function:
[bins centers] = hist3(X); % X should be matrix with two columns
c_1 = centers{1};
c_2 = centers{2};
pdf = bins / (sum(sum(bins))*(c_1(2)-c_1(1)) * (c_2(2)-c_2(1)));
If you "integrate" this you will get 1.
sum(sum(pdf * (c_1(2)-c_1(1)) * (c_2(2)-c_2(1))))