MATLAB - Meaning of Gaussian distribution data (in Neural Network) - matlab

I'm a newbie to MATLAB and I'm trying to create 2-D Gaussian-distributed data to train my neural network. I found this code in the official documentation.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
I know "mu" is the mean of the data and Sigma is related to the standard deviation. But I don't understand the purpose of meshgrid and the intervals (x1, x2), or the geometric meaning of this code.
Also, can someone explain why the Gaussian distribution is so important in machine learning and data science? All the courses keep mentioning this term.

Meshgrid is a basic matlab function, that is in no way specifically related to neural networks or a gaussian distribution. Check the documentation of Matlab to find out more about it.
The Gaussian distribution (also known as the normal distribution) is important for data science because it comes with several nice statistical properties. Unfortunately it is hard to describe them all in a compact way, and that would be a question about statistics rather than programming.

I think the code you provide seems confusing to you because you expect it to generate samples whereas it merely returns values of the Gaussian PDF (probability density function) for some given pairs of (x1,x2).
For example, F = mvnpdf([a b], mu, Sigma) returns the probability density at the point x1=a, x2=b, given that they follow a multivariate Gaussian distribution with mean mu and covariance matrix Sigma. Note that this is a density value, not a probability.
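To make the role of meshgrid concrete, here is a minimal sketch built on the code from the question. The variable pts and the reshape/surf steps are my additions for illustration; mvnpdf needs the Statistics Toolbox.

```matlab
% Sketch: what meshgrid does and how F maps back onto the grid.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3;  x2 = -3:.2:3;             % 31 values along each axis
[X1, X2] = meshgrid(x1, x2);             % two 31x31 matrices of coordinates
pts = [X1(:) X2(:)];                     % 961x2 list: one (x1,x2) pair per row
F = mvnpdf(pts, mu, Sigma);              % density value at every pair (961x1)
F = reshape(F, length(x2), length(x1));  % back to the grid shape for plotting
surf(x1, x2, F);                         % bell-shaped surface over the plane
xlabel('x1'); ylabel('x2'); zlabel('Probability density');
```

Geometrically, x1 and x2 only define where the surface is evaluated; mu moves the peak and Sigma stretches and tilts the bell.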
Being in Stack Overflow, I am focusing on the Matlab aspect of your question: for generating 100 samples of a 2-D Gaussian you can use something like the following (taken from the Matlab help of randn function):
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);
z = repmat(mu,100,1) + randn(100,2)*R;
The array z = [x1,x2] contains the x1 and x2 vectors that you are looking for.
A statistics textbook or Wikipedia can convince you why the above code indeed generates such samples. The last line of code relies on one of the nice properties of a Gaussian distribution (or any other elliptical distribution): a linear transformation of a Gaussian vector is again Gaussian.
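If you would rather check this numerically than look it up, a rough sketch follows. The sample size N and the fixed seed are my own choices, not part of the randn help example. The idea is that R'*R equals Sigma, so a row of independent standard normals multiplied by R has covariance Sigma.

```matlab
% Sketch: check that the Cholesky-based samples have the requested moments.
rng(0);                                  % fixed seed, only for repeatability
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);                         % upper triangular factor, R'*R = Sigma
N = 100000;                              % large N so the estimates are tight
z = repmat(mu, N, 1) + randn(N, 2)*R;    % each row is one 2-D sample
disp(norm(R'*R - Sigma));                % zero up to rounding error
disp(mean(z));                           % sample mean, close to mu
disp(cov(z));                            % sample covariance, close to Sigma
```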

Related

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it.
% Exponential service time with rate 1
mean = 1;
dt = -mean * log(1 - rand());
This is the source link, but MATLAB is needed to open the example.
I was also wondering whether exprnd(1) gives the same result, i.e. generates numbers from the exponential distribution with a mean of 1?
You are right!
First, note that MATLAB parameterizes the Exponential distribution by the mean, not the rate, so exprnd(5) would have a rate lambda = 1/5.
This line of code is another way to do the same thing:
-mean * log(1 - rand());
This is the inverse transform for the Exponential distribution.
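If X follows an Exponential distribution with rate $\lambda$ (mean $1/\lambda$), then its cumulative distribution function (CDF) is
$$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0,$$
and setting $U = F(X)$ with U ~ Uniform(0,1) and solving for X, we can derive the inverse transform:
$$X = F^{-1}(U) = -\frac{1}{\lambda}\ln(1 - U) \overset{d}{=} -\frac{1}{\lambda}\ln(U).$$
Note the last equality holds because 1-U and U are equal in distribution. In other words, 1-U ~ Uniform(0,1) and U ~ Uniform(0,1).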
You can test this yourself with this example code with multiple approaches.
% MATLAB R2018b
rate = 1; % mean = 1 % mean = 1/rate
NumSamples = 1000;
% Approach 1
X1 = (-1/rate)*log(1-rand(NumSamples,1)); % inverse transform
% Approach 2
X2 = exprnd(1/rate,NumSamples,1);
% Approach 3
pd = makedist('Exponential',1/rate) % create probability distribution object
X3 = random(pd,NumSamples,1);
EDIT: The OP asked if there was a reason to generate from the CDF rather than from the probability density function (PDF). This is my attempt to answer that.
The inverse transform method uses the CDF to take advantage of the fact that the CDF is itself a probability and so must be on the interval [0, 1]. Then it is very easy to generate very good (pseudo) random numbers which will be on that interval. The CDF is sufficient to uniquely define the distribution, and inverting the CDF means that its unique "shape" will properly map the uniformly distributed numbers on [0, 1] to a non-uniform shape in the domain which will follow the probability density function (PDF).
You can see the CDF performing this nonlinear mapping in this figure.
One use of the PDF would be Acceptance-Rejection methods, which can be useful for some distributions including custom PDFs (thanks to @pjs for jogging my memory).
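For completeness, a minimal Acceptance-Rejection sketch. The target PDF f(x) = 6x(1-x) on [0,1], the Uniform(0,1) proposal, and the bound M are all illustrative choices of mine, not something from the question.

```matlab
% Sketch: acceptance-rejection sampling from a custom PDF on [0, 1].
rng(1);                          % fixed seed, only for repeatability
f = @(x) 6*x.*(1 - x);           % target PDF, maximum value 1.5 at x = 0.5
M = 1.5;                         % bound with f(x) <= M * 1 for a Uniform(0,1) proposal
N = 100000;                      % number of proposals
xProp = rand(N, 1);              % candidates drawn from the proposal
u = rand(N, 1);                  % uniform numbers for the accept test
X = xProp(u <= f(xProp) / M);    % keep each candidate with probability f(x)/M
fprintf('accepted %.1f%% of proposals, sample mean %.3f\n', ...
        100*numel(X)/N, mean(X));
```

Only the PDF is evaluated here, never the CDF. The price is that a fraction of the proposals (about 1 - 1/M) is thrown away.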

Fast scaling of Gaussian Kernel by the Covariance of the Inputs

I am currently experimenting with multivariate kernel density estimation for estimating the probability density functions (PDFs) of hydrological data sets using Matlab. I am most familiar with kernel density estimation using Gaussian kernels as outlined in Sharma (2000 and 2014), where the kernel bandwidths are set using the Gaussian Reference Rule (GRR). The GRR is written as follows (Sharma, 2000):
$$\lambda_{ref} = \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)}$$
where lambda_ref is the GRR bandwidth of the kernel, n is the sample size, and d is the dimension of the data set we are using for density estimation. To estimate the multivariate density of our data set X we use the following formula (Sharma, 2000):
$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{(2\pi)^{d/2}\,\lambda^{d}\,\det(S)^{1/2}}\exp\left(-\frac{(x - x_i)^{T} S^{-1} (x - x_i)}{2\lambda^{2}}\right)$$
where lambda is the same as lambda_ref above, x_i is the i-th sample, S is the sample covariance of X, and det() stands for determinant.
My question is: I understand that there are many "fast" methods for calculating the Gaussian kernel function represented by the term exp() such as the method proposed here (using Matlab): http://mrmartin.net/?p=218. Since I will be working with data sets that are quite large in sample size (1000-10,000) I am looking for a fast code. Is anyone aware how I can write a fast code for the second equation that takes into account the inverse of the sample covariance matrix (S^-1)?
I greatly appreciate any help that can be provided on this issue. Thank you!
Note(s):
I understand that there is a Matlab code for calculating the second equation, found as a sub-function in: http://www.mathworks.com/matlabcentral/fileexchange/29039-mutual-information-2-variablle/content/MutualInfo.m. However this code has a bottleneck in how it calculates the kernel matrix.
References:
1 A. Sharma, Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 3 — A nonparametric probabilistic forecast model, Journal of Hydrology, Volume 239, Issues 1–4, 20 December 2000, Pages 249-258, ISSN 0022-1694, http://dx.doi.org/10.1016/S0022-1694(00)00348-6.
2 Sharma, A., and R. Mehrotra (2014), An information theoretic alternative to model a natural system using observational information alone, Water Resour. Res., 50, 650–660, doi:10.1002/2013WR013845.
I have found a code that I am able to modify for my purposes. The original code is listed at the following link: http://www.kernel-methods.net/matlab/kernels/rbf.m.
Code
function K = rbf(coord,sig)
%function K = rbf(coord,sig)
%
% Computes an rbf kernel matrix from the input coordinates
%
%INPUTS
% coord = a matrix containing all samples as rows
% sig = sigma, the kernel width; squared distances are divided by
% squared sig in the exponent
%
%OUTPUTS
% K = the rbf kernel matrix, K(i,j) = exp(-norm(coord(i,:)-coord(j,:))^2/(2*sig^2))
%
%
% For more info, see www.kernel-methods.net
%
%Author: Tijl De Bie, february 2003. Adapted: october 2004 (for speedup).
n=size(coord,1);
K=coord*coord'/sig^2;
d=diag(K);
K=K-ones(n,1)*d'/2;
K=K-d*ones(1,n)/2;
K=exp(K);
Modified Code incorporating sample covariance scaling:
xcov = cov(x.'); % sample covariance of the data
invxc = pinv(xcov); % inversion of data sample covariance
coord = x.';
sig = sigma; % kernel bandwidth
n = size(coord,1);
K = coord*invxc*coord'/sig^2;
d = diag(K);
K = K-ones(n,1)*d'/2;
K = K-d*ones(1,n)/2;
K = exp(K); % kernel matrix
I hope this helps someone else looking into the same problem.
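As a sanity check on the modified code, here is a rough sketch that compares the vectorized kernel matrix with a plain double loop over the definition. The random data x, the bandwidth value, and the use of backslash instead of pinv are my own assumptions for illustration. The last line applies the usual normalizing constant of a Gaussian kernel with covariance sigma^2*S to get the density at the sample points.

```matlab
% Sketch: covariance-scaled Gaussian kernel matrix, checked against loops.
rng(2);                               % fixed seed, only for repeatability
x = randn(3, 200);                    % hypothetical data: d = 3 rows, n = 200 samples
[d, n] = size(x);
sigma = 0.5;                          % illustrative bandwidth, not a recommendation
S = cov(x.');                         % d-by-d sample covariance
coord = x.';                          % n-by-d, samples as rows

% Vectorized: -(xi-xj)'*inv(S)*(xi-xj)/(2*sigma^2) expanded into three terms
G = coord * (S \ coord.') / sigma^2;  % Gram matrix in the inv(S) metric
g = diag(G);
K = exp(G - bsxfun(@plus, g, g.')/2); % n-by-n kernel matrix

% Reference: direct double loop over the definition
Kloop = zeros(n);
for i = 1:n
    for j = 1:n
        dx = x(:, i) - x(:, j);
        Kloop(i, j) = exp(-(dx.' * (S \ dx)) / (2*sigma^2));
    end
end
maxDiff = max(abs(K(:) - Kloop(:)));  % should be at rounding level

% Density estimate at each sample point
f = sum(K, 2) / (n * (2*pi)^(d/2) * sigma^d * sqrt(det(S)));
```

The vectorized form replaces n^2 small solves with one matrix product, which is where the speedup comes from for n in the thousands.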

Fitting a 2D Gaussian to 2D Data Matlab

I have a vector of x and y coordinates drawn from two separate unknown Gaussian distributions. I would like to fit these points to a three dimensional Gauss function and evaluate this function at any x and y.
So far the only manner I've found of doing this is using a Gaussian Mixture model with a maximum of 1 component (see code below) and going into the handle of ezcontour to take the X, Y, and Z data out.
The problems with this method are, firstly, that it's a very ugly, roundabout way of getting this done and, secondly, that the ezcontour command only gives me a 60x60 grid, while I need a much higher resolution.
Does anyone know a more elegant and useful method that will allow me to find the underlying Gauss function and extract its value at any x and y?
Code:
GaussDistribution = fitgmdist([varX varY],1); %Not exactly the intention of fitgmdist, but it gets the job done.
h = ezcontour(@(x,y)pdf(GaussDistribution,[x y]),[-500 -400], [-40 40]);
The Gaussian distribution in its general form is as follows (I am not allowed to upload a picture, so here is the formula written as code):
1/((2*pi)^(D/2)*sqrt(det(Sigma)))*exp(-1/2*(x-Mu)*Sigma^-1*(x-Mu)');
where D is the data dimension (2 in your case);
Sigma is the covariance matrix;
and Mu is the mean of each data vector.
Here is an example. In this example a Gaussian is fitted to two vectors of samples randomly generated from normal distributions with parameters N1(4,7) and N2(-2,4):
Data = [random('norm',4,7,30,1),random('norm',-2,4,30,1)];
X = -25:.2:25;
Y = -25:.2:25;
D = length(Data(1,:));
Mu = mean(Data);
Sigma = cov(Data);
P_Gaussian = zeros(length(X),length(Y));
for i=1:length(X)
    for j=1:length(Y)
        x = [X(i),Y(j)];
        P_Gaussian(i,j) = 1/((2*pi)^(D/2)*sqrt(det(Sigma)))...
            *exp(-1/2*(x-Mu)*Sigma^-1*(x-Mu)');
    end
end
mesh(P_Gaussian)
Run the code in MATLAB. For the sake of clarity I wrote the code this way; it can be written more efficiently from a programming point of view.
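One possible vectorized version is sketched below, assuming the Statistics Toolbox (mvnpdf) is available. The handle name gaussFit and the query point q are hypothetical names I introduced. The handle can be evaluated at any (x, y), which is what the question asks for, and at any grid resolution.

```matlab
% Sketch: fit by sample mean/covariance, then evaluate anywhere with mvnpdf.
rng(3);                                           % fixed seed, only for repeatability
Data = [4 + 7*randn(30,1), -2 + 4*randn(30,1)];   % same setup as above, via randn
Mu = mean(Data);
Sigma = cov(Data);
gaussFit = @(xy) mvnpdf(xy, Mu, Sigma);           % xy: m-by-2 list of query points

% Evaluate on a grid without loops
[Xg, Yg] = meshgrid(-25:.2:25, -25:.2:25);
P = reshape(gaussFit([Xg(:) Yg(:)]), size(Xg));   % rows follow Y, columns follow X
mesh(Xg, Yg, P);

% Evaluate at a single arbitrary point and compare with the explicit formula
q = [1 -3];
pFormula = 1/((2*pi)*sqrt(det(Sigma))) * exp(-0.5*(q-Mu)/Sigma*(q-Mu)');
fprintf('mvnpdf: %.6g, formula: %.6g\n', gaussFit(q), pFormula);
```

Note that P here is the transpose of P_Gaussian from the loop version, because meshgrid puts Y along the rows.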

Discrete surface integral with cumsum

I have a matrix z(x,y)
This is an NxN arbitrary pdf constructed from a custom kernel density estimation (i.e. it is not a standard pdf and has no closed-form function). It is multivariate, cannot be separated, and is discrete data.
I want to construct an NxN matrix F(x,y) that is the two-dimensional cumulative distribution function of this pdf, F(x,y) = P(X < x, Y < y), so that I can then randomly sample from it.
Analytically I think the CDF of a multivariate function is the surface integral of the pdf.
What I have tried is using the cumsum function to calculate the surface integral. I tested this with a multivariate normal against the analytical solution, and there seems to be some discrepancy between the two:
% multivariate parameters
delta = 100;
mu = [1 1];
Sigma = [0.25 .3; .3 1];
x1 = linspace(-2,4,delta); x2 = linspace(-2,4,delta);
[X1,X2] = meshgrid(x1,x2);
% Calculate Normal multivariate pdf
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
% My attempt at a numerical surface integral
FN = cumsum(cumsum(F,1),2);
% Normalise the CDF
FN = FN./max(max(FN));
X = [X1(:) X2(:)];
% Reference solution: the multivariate normal CDF
p = mvncdf(X,mu,Sigma);
p = reshape(p,delta,delta);
% Highlight the difference
dif = p - FN;
maxErr = max(abs(dif(:))); % renamed so it does not shadow the built-in error function
% %% Plot
figure(1)
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');
figure(2)
surf(X1,X2,FN);
xlabel('x1'); ylabel('x2');
figure(3);
surf(X1,X2,p);
xlabel('x1'); ylabel('x2');
figure(5)
surf(X1,X2,dif)
xlabel('x1'); ylabel('x2');
In particular, the error seems to be in the transition region, which is the most important.
Does anyone have a better solution to this problem, or see what I'm doing wrong?
Any help would be much appreciated!
EDIT: This is the desired outcome of the cumulative integration. The reason this function is of value to me is that when you randomly generate samples from it on the closed interval [0,1], the higher weighted (i.e. more likely) values appear more often. In this way the samples converge on the expected value(s) (in the case of multiple peaks). This is the desired outcome for algorithms such as particle filters, neural networks, etc.
Think of the 1-dimensional case first. You have a function represented by a vector F and want to numerically integrate. cumsum(F) will do that, but it uses a poor form of numerical integration. Namely, it treats F as a step function. You could instead do a more accurate numerical integration using the Trapezoidal rule or Simpson's rule.
The 2-dimensional case is no different. Your use of cumsum(cumsum(F,1),2) is again treating F as a step function, and the numerical errors resulting from that assumption only get worse as the number of dimensions of integration increases. There exist 2-dimensional analogues of the Trapezoidal rule and Simpson's rule. Since there's a bit too much math to repeat here, take a look here:
http://onestopgate.com/gate-study-material/mathematics/numerical-analysis/numerical-integration/2d-trapezoidal.asp.
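In MATLAB the trapezoidal version is available directly through cumtrapz, applied once along each dimension. The sketch below reuses the grid from the question; the variable names FNsum, FNtrap, errSum and errTrap are mine. Keep in mind the grid starts at -2, so any probability mass below that is missing from both numerical versions.

```matlab
% Sketch: cumsum (step function) versus cumtrapz (trapezoidal rule) in 2-D.
mu = [1 1];
Sigma = [0.25 .3; .3 1];
x1 = linspace(-2, 4, 100);  x2 = linspace(-2, 4, 100);
[X1, X2] = meshgrid(x1, x2);
F = reshape(mvnpdf([X1(:) X2(:)], mu, Sigma), length(x2), length(x1));

% Original approach: double cumsum, normalised by its maximum
FNsum = cumsum(cumsum(F, 1), 2);
FNsum = FNsum / max(FNsum(:));

% Trapezoidal rule: integrate along x1 (columns), then along x2 (rows)
FNtrap = cumtrapz(x2, cumtrapz(x1, F, 2), 1);

% Reference values from mvncdf
p = reshape(mvncdf([X1(:) X2(:)], mu, Sigma), size(F));
errSum  = max(abs(p(:) - FNsum(:)));
errTrap = max(abs(p(:) - FNtrap(:)));
fprintf('max error cumsum: %.4g, cumtrapz: %.4g\n', errSum, errTrap);
```

No renormalisation is needed for the cumtrapz result because the grid spacing is already taken into account.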
You DO NOT need to compute the 2-dimensional integral of the probability density function in order to sample from the distribution. If you are computing the 2-d integral, you are going about the problem incorrectly.
Here are two ways to approach the sampling problem.
(1) You write that you have a kernel density estimate. A kernel density estimate is a special case of a mixture density. Any mixture density can be sampled by first selecting one kernel (perhaps differently or equally weighted, same procedure applies), and then sampling from that kernel. (That applies in any number of dimensions.) Typically the kernels are some relatively simple distribution such as a Gaussian distribution so that it is easy to sample from it.
(2) Any joint density P(X, Y) is equal to P(X | Y) P(Y) (and equivalently P(Y | X) P(X)). Therefore you can sample from P(Y) (or P(X)) and then from P(X | Y). In order to sample from P(X | Y), you will need to integrate P(X, Y) along a line Y = y (where y is the sampled value of Y), but (this is crucial) you only need to integrate along that line; you don't need to integrate over all values of X and Y.
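Approach (1) can be sketched in a few lines. I am assuming equally weighted isotropic Gaussian kernels with a single bandwidth h; the data set and the value of h below are made up for illustration, so adapt them to whatever kernel your estimate actually uses.

```matlab
% Sketch: sample from a Gaussian kernel density estimate as a mixture.
rng(4);                                      % fixed seed, only for repeatability
data = [randn(500, 2); ...
        bsxfun(@plus, 0.5*randn(500, 2), [4 4])];  % hypothetical bimodal data
[n, d] = size(data);
h = 0.3;                                     % illustrative kernel bandwidth
N = 5000;                                    % number of samples wanted

idx = randi(n, N, 1);                        % step 1: pick a kernel (data point) at random
samples = data(idx, :) + h*randn(N, d);      % step 2: draw from the kernel centred there

plot(samples(:, 1), samples(:, 2), '.');     % both modes show up in proportion
```

No grid, no pdf matrix and no integration are involved, and the same two steps work in any number of dimensions.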
If you tell us more about your problem, I can help with the details.

manipulate data to better fit a Gaussian Distribution

I have a question concerning the normal distribution (with mu = 0 and sigma = 1).
Let's say that I first call randn or normrnd this way
x = normrnd(0,1,[4096,1]); % x = randn(4096,1)
Now, to assess how well the x values fit the normal distribution, I call
[a,b] = normfit(x);
and, for graphical support,
histfit(x)
Now to the core of the question: if I am not satisfied with how well x fits the given normal distribution, how can I optimize x so that it better fits the expected normal distribution with 0 mean and 1 standard deviation? Sometimes, because of the small number of values (i.e. 4096 in this case), x fits the expected Gaussian really poorly, so I want to manipulate x (linearly or not, it does not really matter at this stage) in order to get a better fit.
I should note that I have access to the Statistics Toolbox.
EDIT
I made the example with normrnd and randn because my data are supposed and expected to have a normal distribution. But, within the question, those functions only serve to illustrate my concern.
Would it be possible to apply a least-squares fitting?
Generally the distribution I get is similar to the following:
You could try normalizing your input data to have mean 0 and standard deviation 1, like this:
y=(x-mean(x))/std(x);
If you are searching for a nonlinear transformation that would make your distribution look normal, you can first estimate the cumulative distribution, then take the function composition with the inverse of standard normal CDF. This way you can transform almost any distribution to a normal through invertible transformation. Take a look at the example code below.
x = randn(1000, 1) + 4 * (rand(1000, 1) < 0.5); % some funky bimodal distribution
xr = linspace(-5, 9, 2000);
cdf = cumsum(ksdensity(x, xr, 'width', 0.5)); cdf = cdf / cdf(end); % you may want to use a better smoother
c = interp1(xr, cdf, x); % function composition step 1
y = norminv(c); % function composition step 2
% take a look at the result
figure;
subplot(2,1,1); hist(x, 100);
subplot(2,1,2); hist(y, 100);