Test if a data distribution follows a Gaussian distribution in MATLAB - matlab

I have some data points and their mean point. I need to find whether those data points (with that mean) follows a Gaussian distribution. Is there a function in MATLAB which can do that kind of a test? Or do I need to write a test of my own?
I tried looking at different statistical functions provided by MATLAB. I am very new to MATLAB so I might have overlooked the right function.
cheers

Check this documentation page on all available hypothesis tests.
From those, for your purpose you can use:
Chi-square goodness-of-fit test
Lilliefors test
z-test
t-test
Kolmogorov-Smirnov test
... among others
You can also use some visual tests like:
hist
normplot
cdfplot

I like Spiegelhalter's test (D. J. Spiegelhalter, 'Diagnostic tests of distributional shape,' Biometrika, 1983):
function pval = spiegel_test(x)
% compute pvalue under null of x normally distributed;
% x should be a vector;
xm = mean(x);
xs = std(x);
xz = (x - xm) ./ xs;
xz2 = xz.^2;
N = sum(xz2 .* log(xz2));
n = numel(x);
ts = (N - 0.73 * n) / (0.8969 * sqrt(n)); %under the null, ts ~ N(0,1)
pval = 1 - abs(erf(ts / sqrt(2))); %2-sided test.
whenever hacking statistical tests, alway test them under the null! here's a simple example:
pvals = nan(10000,1);
for j=1:numel(pvals);
pvals(j) = spiegel_test(randn(300,1));
end
nnz(pvals < 0.05) ./ numel(pvals)
I get the results:
ans =
0.0505
Similarly
nnz(pvals > 0.95) ./ numel(pvals)
I get
ans =
0.0475

For testing in general, look up the Kolmogorov-Smirnov Test, also in the Stats Toolbox, as kstest and the two-sample version: kstest2 . You feed it your empirical data, (and the data from a possible function, like the gaussian, etc...) then it tests the likelihood that your sample was pulled from the normal distribution (or the one you supplied for the two-sample version)... The nicety is that it'll work for any possible distributions...

Related

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The logLikelihood given by fitnlm in Matlab is in fact the negative LogLikelihood. (Note that this impairs consequently the BIC and AIC computation by Matlab)
Reasonning: in [1], the same problem is solved through two different approaches. ML-approach/ By defining the negative LogLikelihood and making an optimization with fminsearch. GLS-approach/ By using fitnlm.
The negative LogLikelihood after the ML-approach is:380
The negative LogLikelihood after the GLS-approach is:-406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = #(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = #(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = #(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = #(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also not that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
this is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.

Matlab simulation error

I am completely new to Matlab. I am trying to simulate a Wiener and Poisson combined process.
Why do I get Subscripted assignment dimension mismatch?
I am trying to simulate
Z(t)=lambda*W^2(t)-N(t)
Where W is a wiener process and N is a poisson process.
The code I am using is below:
T=500
dt=1
K=T/dt
W(1)=0
lambda=3
t=0:dt:T
for k=1:K
r=randn
W(k+1)=W(k)+sqrt(dt)*r
N=poissrnd(lambda*dt,1,k)
Z(k)=lambda*W.^2-N
end
plot(t,Z)
It is true that some indexing is missing, but I think you would benefit from rewriting your code in a more 'Matlab way'. The following code is using the fact that Matlab basic variables are matrices, and compute the results in a vectorized way. Try to understand this kind of writing, as this is the way to exploit Matlab more efficiently, along with writing shorter and readable code:
T = 500;
dt = 1;
K = T/dt;
lambda = 3;
t = 1:dt:T;
sqdtr = sqrt(dt)*randn(K-1,1); % define sqrt(dt)*r as a vector
N = poissrnd(lambda*dt,K,1); % define N as a vector
W = cumsum([0; sqdtr],1); % cumulative sum instead of the loop
Z = lambda*W.^2-N; % summing the processes element-wiesly
plot(t,Z)
Example for a result:
you forget index
Z(k)=lambda*W.^2-N
it must be
Z(k)=lambda*W(k).^2-N(k)

Find the optimum combination

I am trying to optimise this: function [ LPS, LCE ] = runProject( Nw, Np, Nb) which calls some other functions I have written before. The idea is to find the optimum combination of Nw, Np, Nb AND keep the LPS=0, while LCE is minimum. Nw, Np, Nb should be positive integers. LCE will also be positive.
function [ LPS, LCE ] = runProject( Nw, Np, Nb)
%
% Detailed explanation goes here
[Pg, Pw, Pp] = Pgener();
[Pb, LPS] = Bat( Pg );
[LCE] = Constr(Pw, Pp, Nb)
end
However, I tried the gamultiobj solver from the Global Optimization Toolbox of matlab2015 (trial version) for a different approach with pareto front, but I got the error:
"Optimization running.
Error running optimization.
Not enough input arguments."
You should write your objective function like the following example:
function scores = rastriginsfcn(pop)
%RASTRIGINSFCN Compute the "Rastrigin" function.
% pop = max(-5.12,min(5.12,pop));
scores = 10.0 * size(pop,2) + sum(pop .^2 - 10.0 * cos(2 * pi .* pop),2);
As you can see, the function accepts all the inputs as a single vector pop.
With such representation I can evaluate the function as follows:
rastriginsfcn([2 3])
>> ans
13
Still for running the optimization from the toolbox you have to mention the number of variables, for instance, in my example it is equal to 2:
[x fval exitflag] = ga(#rastriginsfcn, 2)
It is the same for the multi-objective optimization. Check the following image from MATHWORKS:

MATLAB code for calculating MFCC

I have a question if that's ok. I was recently looking for algorithm to calculate MFCCs. I found a good tutorial rather than code so I tried to code it by myself. I still feel like I am missing one thing. In the code below I took FFT of a signal, calculated normalized power, filter a signal using triangular shapes and eventually sum energies corresponding to each bank to obtain MFCCs.
function output = mfcc(x,M,fbegin,fs)
MF = #(f) 2595.*log10(1 + f./700);
invMF = #(m) 700.*(10.^(m/2595)-1);
M = M+2; % number of triangular filers
mm = linspace(MF(fbegin),MF(fs/2),M); % equal space in mel-frequency
ff = invMF(mm); % convert mel-frequencies into frequency
X = fft(x);
N = length(X); % length of a short time window
N2 = max([floor(N+1)/2 floor(N/2)+1]); %
P = abs(X(1:N2,:)).^2./N; % NoFr no. of periodograms
mfccShapes = triangularFilterShape(ff,N,fs); %
output = log(mfccShapes'*P);
end
function [out,k] = triangularFilterShape(f,N,fs)
N2 = max([floor(N+1)/2 floor(N/2)+1]);
M = length(f);
k = linspace(0,fs/2,N2);
out = zeros(N2,M-2);
for m=2:M-1
I = k >= f(m-1) & k <= f(m);
J = k >= f(m) & k <= f(m+1);
out(I,m-1) = (k(I) - f(m-1))./(f(m) - f(m-1));
out(J,m-1) = (f(m+1) - k(J))./(f(m+1) - f(m));
end
end
Could someone please confirm that this is all right or direct me if I made mistake> I tested it on a simple pure tone and it gives me, in my opinion, reasonable answers.
Any help greatly appreciated :)
PS. I am working on how to apply vectorized Cosinus Transform. It looks like I would need a matrix of MxM of transform coefficients but I did not find any source that would explain how to do it.
You can test it yourself by comparing your results against other implementations like this one here
you will find a fully configurable matlab toolbox incl. MFCCs and even a function to reverse MFCC back to a time signal, which is quite handy for testing purposes:
melfcc.m - main function for calculating PLP and MFCCs from sound waveforms, supports many options.
invmelfcc.m - main function for inverting back from cepstral coefficients to spectrograms and (noise-excited) waveforms, options exactly match melfcc (to invert that processing).
the page itself has a lot of information on the usage of the package.

Calculating Covariance Matrix in Matlab

I am implementing a PCA algorithm in MATLAB. I see two different approaches to calculating the covariance matrix:
C = sampleMat.' * sampleMat ./ nSamples;
and
C = cov(data);
What is the difference between these two methods?
PS 1: When I use cov(data) is that unnecessary:
meanSample = mean(data,1);
data = data - repmat(data, nSamples, 1);
PS 2:
At first approach should I use nSamples or nSamples - 1?
In short: cov mainly just adds convenience to the bare formula.
If you type
edit cov
You'll see a lot of stuff, with these lines all the way at the bottom:
xc = bsxfun(#minus,x,sum(x,1)/m); % Remove mean
if flag
xy = (xc' * xc) / m;
else
xy = (xc' * xc) / (m-1); % DEFAULT
end
which is essentially the same as your first line, save for the subtraction of the column-means.
Read the wiki on sample covariances to see why there is a minus-one in the default path.
Note however that your first line uses normal transpose (.'), whereas the cov-version uses conjugate-transpose ('). This will make the output of cov different in the context of complex-valued data.
Also note that cov is a function call to a non-built in function. That means that there will be a (possibly severe) performance penalty when using cov in a loop; Matlab's JIT compiler cannot accelerate non-built in functions.