manipulate data to better fit a Gaussian Distribution - matlab

I have a question about the normal distribution (with mu = 0 and sigma = 1).
Let's say that I first call randn or normrnd this way:
x = normrnd(0,1,[4096,1]); % x = randn(4096,1)
Now, to assess how well the x values fit the normal distribution, I call
[a,b] = normfit(x);
and, for a graphical check,
histfit(x)
Now to the core of the question: if I am not satisfied with how well x fits the given normal distribution, how can I optimize x so that it better fits the expected normal distribution with mean 0 and standard deviation 1? Sometimes, because of the small number of samples (4096 in this case), x fits the expected Gaussian really poorly, so I want to manipulate x (linearly or not, it does not really matter at this stage) in order to get a better fit.
I should note that I have access to the Statistics Toolbox.
EDIT
I built the example with normrnd and randn because my data are expected to be normally distributed. Within the question, those functions only serve to illustrate my concern.
Would it be possible to apply a least-squares fit?
Generally, the distribution I get is similar to the following (image not preserved).

You can try standardizing your input data so that it has mean 0 and standard deviation 1, like this:
y=(x-mean(x))/std(x);
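To see what this standardization does and does not do, you can re-run normfit on the result. This is a minimal sketch, assuming x is a 4096-sample draw like the one in the question; the fixed seed is only there to make the run repeatable.

```matlab
rng(0);                          % fixed seed (assumption, for repeatability only)
x = randn(4096,1);               % stand-in for the data from the question
y = (x - mean(x)) / std(x);      % standardize: subtract mean, divide by std
[muHat, sigmaHat] = normfit(y);  % re-estimate the parameters on y
fprintf('mean = %.3g, std = %.3g\n', muHat, sigmaHat);
```

The estimated mean and standard deviation of y match 0 and 1 up to rounding, but the shape of the histogram is unchanged, since this is only a linear rescaling.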

If you are searching for a nonlinear transformation that would make your distribution look normal, you can first estimate the cumulative distribution, then take the function composition with the inverse of standard normal CDF. This way you can transform almost any distribution to a normal through invertible transformation. Take a look at the example code below.
x = randn(1000, 1) + 4 * (rand(1000, 1) < 0.5); % some funky bimodal distribution
xr = linspace(-5, 9, 2000);
cdf = cumsum(ksdensity(x, xr, 'width', 0.5)); cdf = cdf / cdf(end); % you may want to use a better smoother
c = interp1(xr, cdf, x); % function composition step 1
y = norminv(c); % function composition step 2
% take a look at the result
figure;
subplot(2,1,1); hist(x, 100);
subplot(2,1,2); hist(y, 100);
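A variant of the same idea replaces the smoothed CDF with the empirical ranks of the data. This is a different technique from the kernel-based code above (a rank-based inverse normal transform), sketched here under the assumption that ties are rare; the offset of 0.5 is one common convention for keeping the probabilities strictly inside (0,1).

```matlab
rng(1);                                  % fixed seed (assumption, for repeatability)
n = 1000;
x = randn(n,1) + 4*(rand(n,1) < 0.5);    % same kind of bimodal data as above
r = tiedrank(x);                         % ranks 1..n (ties get average ranks)
u = (r - 0.5) / n;                       % empirical CDF values, strictly in (0,1)
y = norminv(u);                          % map through the inverse standard normal CDF
```

Because u never reaches 0 or 1, y contains no infinite values, and hist(y, 100) can be used to inspect the result as before.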

Related

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it.
% Exponential service time with rate 1
mean = 1;
dt = -mean * log(1 - rand());
This is the source link, but MATLAB is needed to open the example.
I was also wondering whether exprnd(1) would give the same result, that is, generate numbers from the exponential distribution with a mean of 1.
You are right!
First, note that MATLAB parameterizes the Exponential distribution by the mean, not the rate, so exprnd(5) would have a rate lambda = 1/5.
This line of code is another way to do the same thing:
-mean * log(1 - rand());
This is the inverse transform for the Exponential distribution.
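If X follows an Exponential distribution with rate \lambda (mean 1/\lambda), then its CDF is
$$F(x) = P(X \le x) = 1 - e^{-\lambda x}, \quad x \ge 0,$$
and rewriting the cumulative distribution function (CDF) and letting U ~ Uniform(0,1), we can derive the inverse transform:
$$U = 1 - e^{-\lambda X} \;\Rightarrow\; X = -\frac{1}{\lambda}\ln(1 - U) \overset{d}{=} -\frac{1}{\lambda}\ln(U).$$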
Note the last equality is because 1-U and U are equal in distribution. In other words, 1-U ~ Uniform(0,1) and U ~ Uniform(0,1).
You can test this yourself with this example code with multiple approaches.
% MATLAB R2018b
rate = 1; % mean = 1/rate = 1
NumSamples = 1000;
% Approach 1
X1 = (-1/rate)*log(1-rand(NumSamples,1)); % inverse transform
% Approach 2
X2 = exprnd(1/rate,NumSamples,1);
% Approach 3
pd = makedist('Exponential','mu',1/rate); % create probability distribution object (documented name-value form)
X3 = random(pd,NumSamples,1);
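One way to convince yourself that the inverse transform and exprnd agree is to compare sample moments against the theoretical values. This is only a rough sketch: the sample size and seed are arbitrary assumptions, and matching moments is a sanity check, not a proof of equal distributions.

```matlab
rng(2);                                   % fixed seed (assumption, for repeatability)
rate = 1;                                 % rate lambda, so the mean is 1/rate
N = 1e5;                                  % sample size (arbitrary choice)
X1 = (-1/rate) * log(1 - rand(N,1));      % inverse transform
X2 = exprnd(1/rate, N, 1);                % built-in generator, parameterized by the mean
sampleMeans = [mean(X1), mean(X2)];       % both should be close to 1/rate
sampleVars  = [var(X1),  var(X2)];        % both should be close to 1/rate^2
disp([sampleMeans; sampleVars]);
```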
EDIT: The OP asked if there was a reason to generate from the CDF rather than from the probability density function (PDF). This is my attempt to answer that.
The inverse transform method uses the CDF to take advantage of the fact that the CDF is itself a probability and so must be on the interval [0, 1]. Then it is very easy to generate very good (pseudo) random numbers which will be on that interval. The CDF is sufficient to uniquely define the distribution, and inverting the CDF means that its unique "shape" will properly map the uniformly distributed numbers on [0, 1] to a non-uniform shape in the domain which will follow the probability density function (PDF).
You can see the CDF performing this nonlinear mapping in this figure.
One use of the PDF would be Acceptance-Rejection methods, which can be useful for some distributions including custom PDFs (thanks to @pjs for jogging my memory).
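For illustration, here is a minimal acceptance-rejection sketch. The target PDF f(x) = 2x on [0,1] is a made-up example, not something from the question; the proposal is Uniform(0,1) with density g(x) = 1, and c = 2 is a bound chosen so that f(x) <= c*g(x) everywhere.

```matlab
rng(3);                        % fixed seed (assumption, for repeatability)
f = @(x) 2*x;                  % hypothetical target PDF on [0,1]
c = 2;                         % envelope constant: f(x) <= c * 1 on [0,1]
N = 1e5;                       % number of proposals
xProp = rand(N,1);             % candidates from the Uniform(0,1) proposal
u = rand(N,1);                 % independent uniforms for the accept test
X = xProp(u <= f(xProp) / c);  % keep a candidate with probability f(x)/(c*g(x))
acceptRate = numel(X) / N;     % theoretical acceptance rate is 1/c
```

The accepted samples X follow the target density, whose mean is 2/3; the price is that roughly half of the proposals are thrown away.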

What are the differences between different gaussian functions in Matlab?

y = gauss(x,s,m)
Y = normpdf(X,mu,sigma)
R = normrnd(mu,sigma)
What are the basic differences between these three functions?
Y = normpdf(X,mu,sigma) is the probability density function for a normal distribution with mean mu and stdev sigma. Use this if you want to know the relative likelihood at a point X.
R = normrnd(mu,sigma) takes random samples from the same distribution as above. So use this function if you want to simulate something based on the normal distribution.
y = gauss(x,s,m) at first glance looks like the exact same function as normpdf(). But there is a slight difference: Its calculation is
Y = EXP(-(X-M).^2./S.^2)./(sqrt(2*pi).*S)
while normpdf() uses
Y = EXP(-(X-M).^2./(2*S.^2))./(sqrt(2*pi).*S)
This means that the integral of gauss() from -inf to inf is 1/sqrt(2). Therefore it isn't a legit PDF and I have no clue where one could use something like this.
For completeness we also have to mention p = normcdf(x,mu,sigma). This is the normal cumulative distribution function. It gives the probability that a value is between -inf and x.
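The normalization claim can be checked numerically. The sketch below does not call gauss itself (it is not a MathWorks function and may not be on your path); instead it re-types the two formulas quoted above as anonymous functions, with m = 0 and s = 1 chosen arbitrarily.

```matlab
m = 0; s = 1;                                                  % arbitrary parameters
gaussFn = @(x) exp(-(x-m).^2 ./ s.^2)     ./ (sqrt(2*pi).*s);  % formula quoted for gauss()
normFn  = @(x) exp(-(x-m).^2 ./ (2*s.^2)) ./ (sqrt(2*pi).*s);  % formula quoted for normpdf()
I1 = integral(gaussFn, -Inf, Inf);                             % expected: 1/sqrt(2)
I2 = integral(normFn,  -Inf, Inf);                             % expected: 1
fprintf('gauss: %.4f, normpdf: %.4f\n', I1, I2);
```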
A few more insights to add to Leander's good answer:
When comparing between functions it is good to look at their source or toolbox. gauss is not a function written by Mathworks, so it may be redundant to a function that comes with Matlab.
Also, both normpdf and normrnd are part of the Statistics and Machine Learning Toolbox, so users without it cannot use them. However, generating random numbers from a normal distribution is quite a common task, so it should be accessible to users who have only core Matlab. Hence core Matlab provides randn, which overlaps with normrnd.
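As a small sketch of that overlap: scaling and shifting the output of randn gives samples from a normal distribution with any mean and standard deviation, without the toolbox. The values of mu and sigma here are arbitrary.

```matlab
rng(4);                          % fixed seed (assumption, for repeatability)
mu = 3; sigma = 2;               % arbitrary example parameters
n = 1e5;
R = mu + sigma * randn(n,1);     % core-Matlab counterpart of normrnd(mu, sigma, n, 1)
fprintf('mean = %.3f, std = %.3f\n', mean(R), std(R));
```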

MATLAB - Meaning of Gaussian distribution data. (in Neural Network)

I'm a newbie to MATLAB and I'm trying to create 2-D Gaussian-distributed data to train my neural network. I found the following code in the official documentation.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
I know "mu" is the mean of the data, and Sigma is something related to the standard deviation. But I just don't get the idea of meshgrid and the intervals (x1, x2), or the geometric meaning of this code.
Also, can someone explain to me why the Gaussian distribution is so important in machine learning and data science? All the courses keep mentioning this term.
Meshgrid is a basic matlab function, that is in no way specifically related to neural networks or a gaussian distribution. Check the documentation of Matlab to find out more about it.
The gaussian distribution (also known as normal distribution) is important for datascience because it comes with several nice statistical properties. Unfortunately it is hard to describe them all in a compact way, and this would also not be a question about programming, but more about statistics.
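To make the role of meshgrid concrete, here is a short sketch built on the code from the question. meshgrid turns the two axis vectors into matrices holding every (x1, x2) combination, mvnpdf evaluates the density at each of those points, and reshape puts the values back on the grid.

```matlab
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;                  % 31 points along each axis
[X1, X2] = meshgrid(x1, x2);                 % 31x31 matrices of grid coordinates
F = mvnpdf([X1(:) X2(:)], mu, Sigma);        % density at all 961 grid points (column vector)
F = reshape(F, length(x2), length(x1));      % back to a 31x31 grid of heights
peak = max(F(:));                            % highest density, attained at the mean
```

Geometrically, F is the height of the bell-shaped surface above each grid point; calling surf(x1, x2, F) would draw it.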
I think the code you provide seems confusing to you because you expect it to generate samples whereas it merely returns values of the Gaussian PDF (probability density function) for some given pairs of (x1,x2).
For example, F = mvnpdf([a b], mu, Sigma) returns the probability density at x1=a and x2=b, given that they follow a multivariate Gaussian distribution with mean mu and covariance matrix Sigma. Note that this is a density value, not a probability.
Being in Stack Overflow, I am focusing on the Matlab aspect of your question: for generating 100 samples of a 2-D Gaussian you can use something like the following (taken from the Matlab help of randn function):
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);
z = repmat(mu,100,1) + randn(100,2)*R;
The array z = [x1,x2] contains the x1 and x2 vectors that you are looking for.
Some statistics textbook or wikipedia could convince you on why the above code indeed generates such samples. The last line of code is related to one of the nice properties of a Gaussian distribution (or any other elliptical distribution).
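If you prefer an empirical check over a textbook argument, the sketch below draws a larger sample with the same recipe and compares the sample mean and covariance with mu and Sigma. The sample size of 1e5 is an arbitrary choice; the key fact is that chol returns an upper-triangular R with R'*R equal to Sigma.

```matlab
rng(5);                                   % fixed seed (assumption, for repeatability)
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);                          % upper triangular, R'*R == Sigma
n = 1e5;                                  % larger sample so the estimates are stable
z = repmat(mu, n, 1) + randn(n, 2) * R;   % same construction as above
muHat = mean(z);                          % should be close to mu
SigmaHat = cov(z);                        % should be close to Sigma
disp(muHat); disp(SigmaHat);
```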

Fitting a 2D Gaussian to 2D Data Matlab

I have a vector of x and y coordinates drawn from two separate unknown Gaussian distributions. I would like to fit these points to a three dimensional Gauss function and evaluate this function at any x and y.
So far the only manner I've found of doing this is using a Gaussian Mixture model with a maximum of 1 component (see code below) and going into the handle of ezcontour to take the X, Y, and Z data out.
The problems with this method are, firstly, that it's a very ugly, roundabout way of getting this done and, secondly, that the ezcontour command only gives me a 60x60 grid, whereas I need a much higher resolution.
Does anyone know a more elegant and useful method that will allow me to find the underlying Gauss function and extract its value at any x and y?
Code:
GaussDistribution = fitgmdist([varX varY],1); %Not exactly the intention of fitgmdist, but it gets the job done.
h = ezcontour(@(x,y)pdf(GaussDistribution,[x y]),[-500 -400], [-40 40]);
I am not allowed to upload a picture, but the general formula of the Gaussian density is:
1/((2*pi)^(D/2)*sqrt(det(Sigma)))*exp(-1/2*(x-Mu)*Sigma^-1*(x-Mu)');
where D is the data dimension (2 in your case);
Sigma is the covariance matrix;
and Mu is the mean vector of the data.
Here is an example. In this example, a Gaussian is fitted to two vectors of randomly generated samples from normal distributions with (mean 4, standard deviation 7) and (mean -2, standard deviation 4):
Data = [random('norm',4,7,30,1),random('norm',-2,4,30,1)];
X = -25:.2:25;
Y = -25:.2:25;
D = length(Data(1,:));
Mu = mean(Data);
Sigma = cov(Data);
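P_Gaussian = zeros(length(X),length(Y));
normConst = 1/((2*pi)^(D/2)*sqrt(det(Sigma))); % normalization constant
for i=1:length(X)
    for j=1:length(Y)
        x = [X(i),Y(j)];
        % density at (X(i),Y(j)); (x-Mu)/Sigma equals (x-Mu)*inv(Sigma)
        P_Gaussian(i,j) = normConst*exp(-1/2*((x-Mu)/Sigma)*(x-Mu)');
    end
end
% rows of P_Gaussian index X, while mesh expects rows to index Y, so transpose
mesh(X,Y,P_Gaussian')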
Run the code in MATLAB. For the sake of clarity I wrote the code with explicit loops; it can be written more efficiently from a programming point of view.
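A more compact sketch of the same fit, assuming the Statistics Toolbox is available (the question already uses fitgmdist): estimate the mean and covariance, then wrap mvnpdf in a function handle that can be evaluated at any x and y, on a grid of any resolution. The handle name gaussFit, the grid step, and the randn-based test data are choices made here for illustration.

```matlab
rng(6);                                              % fixed seed (assumption, for repeatability)
Data = [4 + 7*randn(30,1), -2 + 4*randn(30,1)];      % same kind of data as above
Mu = mean(Data);                                     % fitted mean vector
Sigma = cov(Data);                                   % fitted covariance matrix
gaussFit = @(x,y) mvnpdf([x(:) y(:)], Mu, Sigma);    % density at arbitrary points
[Xg, Yg] = meshgrid(-25:.1:25, -25:.1:25);           % 501x501 grid, any resolution works
P = reshape(gaussFit(Xg, Yg), size(Xg));             % density on the grid
p0 = gaussFit(Mu(1), Mu(2));                         % density at the fitted mean
```

Here P can be passed to mesh(Xg, Yg, P) or contour(Xg, Yg, P), and gaussFit can be called directly for single points.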

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining whether my data is normally distributed? I can currently tell just by looking at the histogram or the pdf from ksdensity, but is there a way to measure it quantitatively?
Thanks!
So there are a couple of questions there. Here are some suggestions:
You are right that the mean of 1000 samples should be approximately normally distributed, by the central limit theorem (unless your data is "heavy tailed", which I'm assuming is not the case). To get a (1-alpha) confidence interval for the mean (in your case alpha = 0.05), you can use the 'norminv' function. For example, say we want a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is consistent with a normal distribution. If the result is clearly not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
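For a quantitative counterpart to the QQ plot, a hypothesis test can be used. The sketch below uses lillietest (the Lilliefors test from the Statistics Toolbox), which is one option among several and is not part of the suggestion above; it returns h = 1 when normality is rejected at the 5% level. MATLAB may print a warning when the p-value falls outside its tabulated range.

```matlab
rng(7);                         % fixed seed (assumption, for repeatability)
Xexp  = exprnd(3, 1000, 1);     % clearly non-normal sample
Xnorm = randn(1000, 1);         % normal sample
hExp  = lillietest(Xexp);       % 1 means "reject normality" at the 5% level
hNorm = lillietest(Xnorm);      % usually 0 for normal data (false alarms occur about 5% of the time)
fprintf('exponential: h = %d, normal: h = %d\n', hExp, hNorm);
```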
You might consider, also, using bootstrapping, with the bootci function.
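A minimal bootstrapping sketch, assuming the statistic of interest is the mean and using 2000 resamples (both are choices made here, not requirements):

```matlab
rng(8);                           % fixed seed (assumption, for repeatability)
X = exprnd(3, 1000, 1);           % non-normal example data
ci = bootci(2000, @mean, X);      % 95% bootstrap CI by default; ci(1) lower, ci(2) upper
fprintf('95%% CI for the mean: [%.3f, %.3f]\n', ci(1), ci(2));
```

Replacing @mean with @median or another function handle gives an interval for that statistic instead.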
You may use the method proposed in [1]:
median +/- 1.7 * (1.25 * R) / (1.35 * sqrt(N))
where R is the interquartile range and N is the sample size.
This works out to approximately median +/- 1.57 * R / sqrt(N).
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
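The formula above can be computed in a few lines. This is a sketch on made-up exponential data; iqr is from the Statistics Toolbox.

```matlab
rng(9);                                        % fixed seed (assumption, for repeatability)
X = exprnd(3, 1000, 1);                        % non-normal example data
N = numel(X);
med = median(X);
R = iqr(X);                                    % interquartile range
halfWidth = 1.7 * (1.25 * R) / (1.35 * sqrt(N));
CI = [med - halfWidth, med + halfWidth];       % approximate 95% interval for the median
fprintf('median = %.3f, CI = [%.3f, %.3f]\n', med, CI(1), CI(2));
```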
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used Matlab, but from my understanding of statistics, if your distribution cannot be assumed to be normal, then you have to treat it as a Student's t distribution and calculate the confidence interval and accuracy from that.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm