Calculating confidence intervals for a non-normal distribution - matlab

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptible for a non-uniform distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine if it was a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!

So there are a couple of questions there. Here are some suggestions
You are right that a mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). to get a 1-alpha-confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example say we wanted a 95% CI for the mean a sample of data X, then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % sample standard deviation of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing if a data sample is normally distribution can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is normal. If the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))

You might consider, also, using bootstrapping, with the bootci function.

You may use the method proposed in [1]:
MEDIAN +/- 1.7(1.25R / 1.35SQN)
Where R = Interquartile Range,
SQN = Square Root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.

Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prcntile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')

I have not used Matlab but from my understanding of statistics, if your distribution cannot be assumed to be normal distribution, then you have to take it as Student t distribution and calculate confidence Interval and accuracy.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

Related

How do I align two signals in MATLAB [duplicate]

I want to get the offset in samples between two datasets in Matlab (getting them synced in time), a quite common issue. Therefore I use the cross correlation function xcorr or the cross covariance function xcov (both provide similar results in most cases for this purpose). With artificial data it works fine, but I struggle with "real" data, even though it should be pretty much the same. Matlab always says the offset would be zero. I'm using this simple piece of code:
[crossCorr] = xcov(b, c);
[~, peakIndex] = max(crossCorr())
offset = peakIndex - length(b)
I've posted a fully runable example m-file with a downsampled data excerpt on pastebin:
Code with data on pastebin
EDIT: The downsampled excerpt seems to be not fully suitable for evaluating the effect. Here's a much larger sample with the original frequency, pease use this one instead. Unfortunately it was too big for pastebin.
As the plot shows it should be no problem at all to get the offset via cross covariance. I also tried to scale the data nicer in order to avoid numerical problems, but that didn't change anything at all.
Would be great if someone could tell me my mistake.
There's nothing wrong with your method in principle, I used exactly the same approach successfully for temporally aligning different audio recordings of the same signal.
However, it appears that for your time series, correlation (or covariance) is simply not the right measure to compare shifted versions – possibly because they contain components of a time scale comparable to the total length. An alternative is to use residual variance, i.e. the variance of the difference between shifted versions. Here is a (not particularly elegant) implementation of this idea:
lags = -1000 : 1000;
v = nan(size(lags));
for i = 1 : numel(lags)
lag = lags(i);
if lag >= 0
v(i) = var(b(1 + lag : end) - c(1 : end - lag));
else
v(i) = var(b(1 : end + lag) - c(1 - lag : end));
end
end
[~, ind] = min(v);
minlag = lags(ind);
For your (longer) data set, this results in minlag = 169. Plotting residual variance over lags gives:
Your data has a minor peak around 5 and a major peak around 101.
If I knew something about my data then I could might window around an acceptable range of offsets as shown below.
Code for initial exploration:
figure; clc;
subplot(2,1,1)
plot(1:numel(b), b);
hold on
plot(1:numel(c), c, 'r');
legend('b','c')
subplot(2,1,2)
plot(crossCorr,'.b-')
hold on
plot(peakIndex,crossCorr(peakIndex),'or')
legend('crossCorr','peak')
Initial Image:
If you zoom into the first peak you can see that it is not only high around 5, but it is polynomial "enough" to allow sub-element offsets. That is convenient.
Image showing :
Here is what the curve-fitting tool gives as the analytic for a cubic:
Linear model Poly3:
f(x) = p1*x^3 + p2*x^2 + p3*x + p4
Coefficients (with 95% confidence bounds):
p1 = 8.515e-013 (8.214e-013, 8.816e-013)
p2 = -3.319e-011 (-3.369e-011, -3.269e-011)
p3 = 2.253e-010 (2.229e-010, 2.277e-010)
p4 = -4.226e-012 (-7.47e-012, -9.82e-013)
Goodness of fit:
SSE: 2.799e-024
R-square: 1
Adjusted R-square: 1
RMSE: 6.831e-013
You can note that the SSE fits to roundoff.
If you compute the root (near n=4) you use the following matlab code:
% Coefficients
p1 = 8.515e-013
p2 = -3.319e-011
p3 = 2.253e-010
p4 = -4.226e-012
% Linear model Poly3:
syms('x')
f = p1*x^3 + p2*x^2 + p3*x + p4
xz1=fzero(#(y) subs(diff(f),'x',y), 4)
and you get the analytic root at 4.01420240431444.
EDIT:
Hmmm. How about fitting a gaussian mixture model to the convolution? You sweep through a good range of component count, you do between 10 and 30 repeats, and you find which component count has the best/lowest BIC. So you fit a gmdistribution to the lower subplot of the first figure, then test the covariance at the means of the components in decreasing order.
I would try the offset at the means, and just look at sum squared error. I would then pick the offset that has the lowest error.
Procedure:
compute cross correlation
fit cross correlation to Gaussian Mixture model
sweep a reasonable range of components (start with 1-10)
use a reasonable number of repeats (10 to 30 depending on run-to-run variation)
compute Bayes Information Criterion (BIC) for each level, pick the lowest because it indicates a reasonable balance of error and parameter count
each component is going to have a mean, evaluate that mean as a candidate offset and compute sum-squared error (sse) when you offset like that.
pick the offset of the component that gives best SSE
Let me know how well that works.
If the two signals misalign by non-integer number of samples, e.g. 3.7 samples, then the xcorr method may find the max value at 4 samples, it won't be able to find the accurate time shift. In this case, you should try a method called "unified change detection". The web-link for the paper is:
[http://www.phmsociety.org/node/1404/]
Good Luck.

Time delay estimation using crosscorrelation

I have two sensors seperated by some distance which receive a signal from a source. The signal in its pure form is a sine wave at a frequency of 17kHz. I want to estimate the TDOA between the two sensors. I am using crosscorrelation and below is my code
x1; % signal as recieved by sensor1
x2; % signal as recieved by sensor2
len = length(x1);
nfft = 2^nextpow2(2*len-1);
X1 = fft(x1);
X2 = fft(x2);
X = X1.*conj(X2);
m = ifft(X);
r = [m(end-len+1) m(1:len)];
[a,i] = max(r);
td = i - length(r)/2;
I am filtering my signals x1 and x2 by removing all frequencies below 17kHz.
I am having two problems with the above code:
1. With the sensors and source at the same place, I am getting different values of 'td' at each time. I am not sure what is wrong. Is it because of the noise? If so can anyone please provide a solution? I have read many papers and went through other questions on stackoverflow so please answer with code along with theory instead of just stating the theory.
2. The value of 'td' is sometimes not matching with the delay as calculated using xcorr. What am i doing wrong? Below is my code for td using xcorr
[xc,lags] = xcorr(x1,x2);
[m,i] = max(xc);
td = lags(i);
One problem you might have is the fact that you only use a single frequency. At f = 17 kHz, and an estimated speed-of-sound v = 340 m/s (I assume you use ultra-sound), the wavelength is lambda = v / f = 2 cm. This means that your length measurement has an unambiguity range of 2 cm (sorry, cannot find a good link, google yourself). This means that you already need to know your distance to better than 2 cm, before you can use the result of your measurement to refine the distance.
Think of it in another way: when taking the cross-correlation between two perfect sines, the result should be a 'comb' of peaks with spacing equal to the wavelength. If they overlap perfectly, and you displace one signal by one wavelength, they still overlap perfectly. This means that you first have to know which of these peaks is the right one, otherwise a different peak can be the highest every time purely by random noise. Did you make a plot of the calculated cross-correlation before trying to blindly find the maximum?
This problem is the same as in interferometry, where it is easy to measure small distance variations with a resolution smaller than a wavelength by measuring phase differences, but you have no idea about the absolute distance, since you do not know the absolute phase.
The solution to this is actually easy: let your source generate more frequencies. Even using (band-limited) white-noise should work without problems when calculating cross-correlations, and it removes the ambiguity problem. You should see the white noise as a collection of sines. The cross-correlation of each of them will generate a comb, but with different spacing. When adding all those combs together, they will add up significantly only in a single point, at the delay you are looking for!
White Noise, Maximum Length Sequency or other non-periodic signals should be used as the test signal for time delay measurement using cross correleation. This is because non-periodic signals have only one cross correlation peak and there will be no ambiguity to determine the time delay. It is possible to use the burst type of periodic signals to do the job, but with degraded SNR. If you have to use a continuous periodic signal as the test signal, then you can only measure a time delay within one period of the periodic test signal. This should explain why, in your case, using lower frequency sine wave as the test signal works while using higher frequency sine wave does not. This is demonstrated in these videos: https://youtu.be/L6YJqhbsuFY, https://youtu.be/7u1nSD0RlwY .

Bootstrap and asymmetric CI

I'm trying to create confidence interval for a set of data not randomly distributed and very skewed at right. Surfing, I discovered a pretty rude method that consists in using the 97.5% percentile (of my data) for the upperbound CL and 2.5% percentile for your lower CL.
Unfortunately, I need a more sophisticated way!
Then I discovered the bootstrap, precisley the MATLAB bootci function, but I'm having hard time to undestand how to used it properly.
Let's say that M is my matrix containing my data (19x100), and let's say that:
Mean = mean(M,2);
StdDev = sqrt(var(M'))';
How can I compute the asymmetrical CI for every row of the Mean vector using bootci?
Note: earlier, I was computing the CI in this very wrong way: Mean +/- 2 * StdDev, shame on me!
Let's say you have a 100x19 data set. Each column has a different distribution. We'll choose the log normal distribution, so that they skew to the right.
means = repmat(log(1:19), 100, 1);
stdevs = ones(100, 19);
X = lognrnd(means, stdevs);
Notice that each column is from the same distribution, and the rows are separate observations. Most functions in MATLAB operate on the rows by default, so it's always preferable to keep your data this way around.
You can compute bootstrap confidence intervals for the mean using the bootci function.
ci = bootci(1000, #mean, X);
This does 1000 resamplings of your data, calculates the mean for each resampling and then takes the 2.5% and 97.5% quantiles. To show that it's an asymmetric confidence interval about the mean, we can plot the mean and the confidence intervals for each column
plot(mean(X), 'r')
hold on
plot(ci')

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. I have been suggested by someone to do a chi-square test.
I would like to use the MATLAB function chi2gof but I am not sure how I would tell it that the data is being fitted to an exponential curve
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct. I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(y_n) = ln(y_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which matlab will return in the stats structure if you use the regress function to perform OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (ie x_0), then you may want to look into constrained least squares where you bind the parameter ln(x_0) to its known value.

Am I using the Fourier transformation the right way?

I am wondering if I am using Fourier Transformation in MATLAB the right way. I want to have all the average amplitudes for frequencies in a song. For testing purposes I am using a free mp3 download of Beethovens "For Elise" which I converted to a 8 kHz mono wave file using Audacity.
My MATLAB code is as follows:
clear all % be careful
% load file
% Für Elise Recording by Valentina Lisitsa
% from http://www.forelise.com/recordings/valentina_lisitsa
% Converted to 8 kHz mono using Audacity
allSamples = wavread('fur_elise_valentina_lisitsa_8khz_mono.wav');
% apply windowing function
w = hanning(length(allSamples));
allSamples = allSamples.*w;
% FFT needs input of length 2^x
NFFT = 2^nextpow2(length(allSamples))
% Apply FFT
fftBuckets=fft(allSamples, NFFT);
fftBuckets=fftBuckets(1:(NFFT/2+1)); % because of symetric/mirrored values
% calculate single side amplitude spectrum,
% normalize by dividing by NFFT to get the
% popular way of displaying amplitudes
% in a range of 0 to 1
fftBuckets = (2*abs(fftBuckets))/NFFT;
% plot it: max possible frequency is 4000, because sampling rate of input
% is 8000 Hz
x = linspace(1,4000,length(fftBuckets));
bar(x,fftBuckets);
The output then looks like this:
Can somebody please tell me if my code is correct? I am especially wondering about the peaks around 0.
For normalizing, do I have to divide by NFFT or length(allSamples)?
For me this doesn't really look like a bar chart, but I guess this is due to the many values I am plotting?
Thanks for any hints!
Depends on your definition of "correct". This is doing what you intended, I think, but it's probably not very useful. I would suggest using a 2D spectrogram instead, as you'll get time-localized information on frequency content.
There is no one correct way of normalising FFT output; there are various different conventions (see e.g. the discussion here). The comment in your code says that you want a range of 0 to 1; if your input values are in the range -1 to 1, then dividing by number of bins will achieve that.
Well, exactly!
I would also recommend plotting the y-axis on a logarithmic scale (in decibels), as that's roughly how the human ear interprets loudness.
Two things that jump out at me:
I'm not sure why you are including the DC (index = 1) component in your plot. Not a big deal, but of course that bin contains no frequency data
I think that dividing by length(allSamples) is more likely to be correct than dividing by NFFT. The reason is that if you want the DC component to be equal to the mean of the input data, dividing by length(allSamples) is the right thing to do.
However, like Oli said, you can't really say what the "correct" normalization is until you know exactly what you are trying to calculate. I tend to use FFTs to estimate power spectra, so I want units like "DAC / rt-Hz", which would lead to a different normalization than if you wanted something like "DAC / Hz".
Ultimately there's no substitute for thinking about exacty what you want to get out of the FFT (including units), and working out for yourself what the correct normalization should be (starting from the definition of the FFT if necessary).
You should also be aware that MATLAB's fft has no requirement to use an array length that is a power of 2 (though doing so will presumably lead to the FFT running faster). Because zero-padding will introduce some ringing, you need to think about whether it is the right thing to do for your application.
Finally, if a periodogram / power spectrum is really what you want, MATLAB provides functions like periodogram, pwelch and others that may be helpful.