How do I calculate the hazard rate accurately in Matlab?

I need to calculate the hazard-rate, PDF/(1-CDF), of a Rayleigh function over x.
x = 0:0.001:2.5;
HR = pdf('rayl',x,sqrt(1/18))./(1-cdf('rayl',x,sqrt(1/18)));
plot(x,HR)
Here the plot becomes erratic at approximately x = 2. How can I improve the accuracy of the HR?

You're running into numerical issues, specifically catastrophic cancellation. Try plotting just the denominator of your hazard rate function, the complementary CDF, on a semilogy plot:
x = 0:0.001:2.5;
semilogy(x,1-cdf('rayl',x,sqrt(1/18)))
As you can see, there are issues a bit before x = 2 when the denominator is approximately machine epsilon, eps. This is when cdf('rayl',x,sqrt(1/18)) is about 1.
Luckily, Matlab provides a way around this with an option that calculates the complementary CDF, or tail probability of the CDF, directly via the 'upper' option for cdf:
x = 0:0.001:2.5;
HR = pdf('rayl',x,sqrt(1/18))./cdf('rayl',x,sqrt(1/18),'upper');
plot(x,HR)
which now returns a straight line with a slope of 18, as expected. The raylcdf function also supports this option.
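The slope of 18 follows from the closed forms: for a Rayleigh distribution with scale b, the PDF is (x/b^2)*exp(-x^2/(2*b^2)) and the complementary CDF is exp(-x^2/(2*b^2)), so their ratio is x/b^2, which is 18*x for b = sqrt(1/18). A minimal sketch that checks this with hand-written formulas rather than the toolbox functions (the variable names are illustrative, and the "naive" line only imitates the rounding of 1-cdf(...)):

```matlab
% Rayleigh scale parameter used in the question (b^2 = 1/18)
b = sqrt(1/18);
x = 0:0.001:2.5;

% Closed forms written out by hand (no toolbox calls)
f = (x ./ b^2) .* exp(-x.^2 ./ (2*b^2));   % PDF
S = exp(-x.^2 ./ (2*b^2));                 % complementary CDF, computed directly

HR_stable = f ./ S;                        % uses the tail probability itself
HR_naive  = f ./ (1 - (1 - S));            % imitates 1 - cdf(...): cancels once the CDF rounds to 1

maxErr = max(abs(HR_stable - x ./ b^2));   % deviation from the exact line x/b^2
numBad = sum(~isfinite(HR_naive));         % points where the naive denominator rounded to 0
```

Here maxErr should stay at rounding level, while numBad counts the points beyond roughly x = 2 where the naive denominator has collapsed to zero.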

Related

Computing the DFT of an arbitrary signal

As part of a course in signal processing at university, we have been asked to write an algorithm in Matlab to calculate the single-sided spectrum of our signal using the DFT, without using the built-in fft() function. This isn't an assessed part of the course; I'm just interested in getting this "right" for myself. I am currently using the 2018b version of Matlab, should anyone find this useful.
I have built a signal from a 1 kHz and a 2 kHz sinusoid, phase shifted by 135 degrees (3*pi/4 rad).
Then, using the equations in 9.1 of Discrete-Time Signal Processing (Alan V. Oppenheim) and Euler's formula to simplify the exponent, I produced this code:
%%DFT(currently buggy)
n=0;m=0;
for m=1:DFT_N-1 %DFT_Fmin;DFT_Fmax; %scrolls through DFT m values (K in text.)
for n=1:DFT_N-1;%;(DFT_N-1);%<<redundant code? from Oppenheim eqn. 9.1 % eulers identity, K=m and n=n
X(m)=x(n)*(cos((2*pi*n*m)/DFT_N)-j*sin((2*pi*n*m)/DFT_N));
n=n+1;
end
%m=m+1; %redundant code?
end
This takes x as the input, in this case the signal mentioned earlier, as well as the resolution of the transform, as represented by the DFT_N, which has been initialized to 100. The output of this function, X, should be something in the frequency domain, but plotting X yields a circular plot slightly larger than the unit circle, and with a gap on the left hand edge.
I am struggling to see how I am supposed to convert this to the stem() plots as given by the in-built DFT algorithm.
Many thanks, J.
This is your bug:
replace X(m)=x(n)*(cos.. with X(m)=X(m)+x(n)*(cos..
For a given m, the loop does not sum over the variable n; each iteration overwrites X(m), so only the last term, for n = DFT_N-1, survives. Initialize X to zeros before the loops so that the accumulation starts from zero.
Notice that summing over n=1:DFT_N-1 omits one term of the sum, the one whose exponential is exp(-j*2*pi*m) = 1. Replace
n=1:DFT_N-1 with n=1:DFT_N to include it. I would also replace m=1:DFT_N-1 with m=1:DFT_N for plotting purposes.
Also replace every 2*pi*n*m with 2*pi*(n-1)*(m-1) to get the phase right, since the first entry of X should correspond to zero frequency, yielding sum_n x(n) * (cos(0) - j*sin(0)) = sum_n x(n). If your signal x is real-valued, then the zero-frequency component X(1) should be real-valued too, i.e., imag(X(1)) = 0.
One last remark: don't forget to shift the zero-frequency component to the center of the spectrum for better visibility, X = circshift(X,floor(size(X)/2));
If you are interested in the single-sided spectrum only, then you can just calculate X(m) for m=1:DFT_N/2, since for real-valued x the DFT is conjugate symmetric around the frequency index DFT_N/2 (counting from zero), i.e., X(DFT_N/2+m) = X(DFT_N/2-m)', due to exp(-j*(pi*n+2*pi/DFT_N*m*n)) = exp(-j*(pi*n-2*pi/DFT_N*m*n))'.
As a side note, for a given m this program calculates an inner product between the array x and another array of complex exponentials, i.e., exp(-j*2*pi/DFT_N*m*n), for n = 0,1,...,DFT_N-1 (with m counted from zero). MATLAB syntax is very convenient for such calculations, and you can avoid the inner loop with the following command
exp(-j*2*pi/DFT_N*m*(0:DFT_N-1)) * x
where x is a column vector. Similarly, you can avoid the first loop too by expanding your complex exponential vector row-wise for every m, i.e., build the matrix exp(-j*2*pi/DFT_N*(0:DFT_N-1)'*(0:DFT_N-1)). Then your DFT is simply
X = exp(-j*2*pi/DFT_N*(0:DFT_N-1)'*(0:DFT_N-1)) * x
For single-sided spectrum, instead use
X = exp(-j*2*pi/DFT_N*(0:floor((DFT_N-1)/2))'*(0:DFT_N-1)) * x
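Putting these corrections together, a minimal sketch might look like the following. The sampling rate Fs and the test signal are assumptions made for illustration (the question does not state them), and the names X_loop, X_mat, and A are hypothetical.

```matlab
% Hypothetical test signal: 1 kHz and 2 kHz sinusoids sampled at 10 kHz
Fs = 10e3;                          % assumed sampling rate in Hz
DFT_N = 100;                        % transform length, as in the question
t = (0:DFT_N-1)' / Fs;              % column vector of sample times
x = sin(2*pi*1e3*t) + sin(2*pi*2e3*t + 135*pi/180);

% Loop form with the corrections applied
X_loop = zeros(DFT_N, 1);           % start every accumulator at zero
for m = 1:DFT_N
    for n = 1:DFT_N
        ang = 2*pi*(n-1)*(m-1)/DFT_N;
        X_loop(m) = X_loop(m) + x(n)*(cos(ang) - 1j*sin(ang));
    end
end

% Matrix form: one row of complex exponentials per frequency index
k = 0:DFT_N-1;
X_mat = exp(-1j*2*pi/DFT_N*(k'*k)) * x;

% Single-sided amplitude spectrum for a stem plot
half = 1:DFT_N/2;                   % indices for 0 Hz up to just below Fs/2
f = (half-1)*Fs/DFT_N;              % frequency axis in Hz
A = abs(X_loop(half))/DFT_N;
A(2:end) = 2*A(2:end);              % fold the negative frequencies in
maxDiff = max(abs(X_loop - X_mat)); % the two forms should agree to rounding
% stem(f, A)                        % uncomment to plot
```

With these assumed values the bin spacing is Fs/DFT_N = 100 Hz, so both sinusoids fall exactly on a bin and the stem plot should show two peaks of unit amplitude.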

Trying to plot the fft of a sinc function

I am trying to plot the fft of a set of data I have. These data form a nearly perfect sinc function. Here is the data of which I am trying to plot the fft:
I know the fft of a sinc function should look like kind of a step function. However, the results I get are nowhere near that. Finding the fft in itself is super easy, but I think my mistake is when I try to compute the frequency axis. I have found several methods online, but so far I have not been able to make it work. Here is my code:
sampleRate = (max(xdata) - min(xdata))/length(xdata);
sampleN = length(xdata);
y = fft(ydata, sampleN);
Y = y.*conj(y)/sampleN;
freq = (0:1:length(Y)-1)*sampleRate/sampleN;
plot(freq, Y)
I have found pretty much all of that online and I understand pretty much none of it (which might be why it's not working...)
Zoom on what I get using that code:
It now seems to be working! This is what I get when I subtract the mean:
What you see here is the zero frequency being much, much larger than everything else. Plot with plot(freq,Y,'o-') to prove that the shape you see is just the linear interpolation between two samples.
The zero frequency is the sum of all samples. Because the mean of your signal is quite a bit larger than the amplitude, the zero frequency dwarfs everything else. And because you are plotting the power (absolute square of the DFT), this difference is enhanced even more.
There are two simple solutions:
Plot using logarithmic y-axis:
plot(freq, Y)
set(gca,'yscale','log')
Subtract the mean from your signal, remove the zero frequency, or scale the y-axis (these are all more or less equivalent):
y = fft(ydata-mean(ydata), sampleN);
or
y(1) = 0;
or
plot(freq, Y)
set(gca,'ylim',[0,max(Y(2:end))]);
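On the frequency axis itself: the quantity (max(xdata) - min(xdata))/length(xdata) in the question is approximately the spacing between samples, not a rate, so the axis is probably mis-scaled as well. A minimal sketch of one way to build it, assuming evenly spaced xdata; the synthetic sinc data and the names dx, Fs, and half are illustrative stand-ins, not the asker's data:

```matlab
% Hypothetical stand-in for the data: a sinc-shaped signal on a nonzero offset
xdata = linspace(-10, 10, 1001);        % assumed evenly spaced sample positions
u = 2*pi*xdata;                         % sinc argument: band edge at 1 cycle per unit
ydata = ones(size(u));                  % value of sin(u)/u at u = 0
nz = (u ~= 0);
ydata(nz) = sin(u(nz))./u(nz);
ydata = ydata + 5;                      % large mean, as in the question

sampleN = length(xdata);
dx = (max(xdata) - min(xdata))/(sampleN - 1);  % spacing between samples
Fs = 1/dx;                              % sampling rate is the reciprocal of the spacing

y = fft(ydata - mean(ydata), sampleN);  % remove the mean so bin 0 does not dominate
Y = y.*conj(y)/sampleN;                 % power spectrum
freq = (0:sampleN-1)*Fs/sampleN;        % frequency of each bin, in cycles per unit of xdata

half = 1:floor(sampleN/2);              % keep the non-negative frequencies
% plot(freq(half), Y(half), 'o-')       % uncomment to plot
```

For this synthetic signal the plot should be roughly flat up to 1 cycle per unit and close to zero beyond it, which is the step-like shape expected from a sinc.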

Fit sine wave with a distorted time-base

I want to know the best way to fit a sine-wave with a distorted time base, in Matlab.
The distortion in time is given by a n-th order polynomial (n~10), of the form t_distort = P(t).
For example, consider the distortion t_distort = 8 + 12t + 6t^2 + t^3 (which is just the expansion of (t+2)^3).
This will distort a sine-wave as follows:
I want to be able to find the distortion given this distorted sine-wave. (i.e. I want to find the function t = G(t_distort), but t_distort = P(t) is unknown.)
If your resolution is high enough, then this is basically an angle-demodulation problem. The standard way to demodulate an angle-modulated signal is to take the derivative, followed by an envelope detector, followed by an integrator.
Since I don't know the exact numbers you're using, I'll show an example with my own numbers. Suppose my original timebase has 10 million points from 0 to 100:
t = 0:0.00001:100;
I then get the distorted timebase and calculate the distorted sine wave:
td = 0.02*(t+2).^3;
yd = sin(td);
Now I can demodulate it. Take the "derivative" using approximate differences divided by the step size from before:
ydot = diff(yd)/0.00001;
The envelope can be easily detected:
envelope = abs(hilbert(ydot));
This gives an approximation for the derivative of P(t). The last step is an integrator, which I can approximate using a cumulative sum (we have to scale it again by the step size):
tdguess = cumsum(envelope)*0.00001;
This gives a curve that's almost identical to the original distorted timebase (so, it gives a good approximation of P(t)):
You won't be able to get the constant term of the polynomial since we made our approximation from its derivative, which of course eliminates the constant term. You wouldn't even be able to find a unique constant term mathematically from just yd, since infinitely many values will yield the same distorted sine wave. You can get the other three coefficients of P(t) using polyfit if you know the degree of P(t) (ignore the last number, it's the constant term):
>> polyfit(t(1:10000000), tdguess, 3)
ans =
0.0200 0.1201 0.2358 0.4915
This is pretty close to the original, aside from the constant term: 0.02*(t+2)^3 = 0.02t^3 + 0.12t^2 + 0.24t + 0.16.
You wanted the inverse mapping t = G(t_distort). Can you do that knowing a close approximation for P(t) as found so far?
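One possible way to get the inverse numerically, assuming P(t) is strictly increasing over the interval of interest, is to swap the two axes and interpolate. The sketch below uses the exact P(t) on a coarse grid as a stand-in for the estimate (in practice the estimate is only known up to its constant term, so the inverse inherits that ambiguity); the names G, tTrue, and tFound are hypothetical.

```matlab
% Hypothetical sampled forward map; in practice td would be the estimate of P(t)
t  = linspace(0, 10, 1001);
td = 0.02*(t + 2).^3;                   % strictly increasing on this interval

% Inverse map G: swap the roles of the two axes and interpolate
G = @(q) interp1(td, t, q);             % linear interpolation, NaN outside the range of td

% Check on a point that is not on the grid
tTrue  = 3.3337;
tFound = G(0.02*(tTrue + 2)^3);
invErr = abs(tFound - tTrue);
```

If P(t) is not monotonic the inverse is not single-valued and this approach does not apply as written.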
Here's an analytically driven route that takes the asin of the signal with proper unwrapping of the angle. You can then fit a polynomial to the angle using polyfit or other fitting methods (search for fit and see). Last, take the sin of the fitted function and compare the signal to the fitted one. See this pedagogical example:
% generate data
t=linspace(0,10,1e2);
x=0.02*(t+2).^3;
y=sin(x);
% take asin^2 to locate the points of "discontinuity", where asin reaches +-pi/2 (sin hits +-1)
da=(asin(y).^2);
[val locs]=findpeaks(da); % this can be done in many other ways too...
% construct the asin according to the proper phase unwrapping
an=NaN(size(y));
an(1:locs(1)-1)=asin(y(1:locs(1)-1));
for n=2:numel(locs)
an(locs(n-1)+1:locs(n)-1)=(n-1)*pi+(-1)^(n-1)*asin(y(locs(n-1)+1:locs(n)-1));
end
an(locs(n)+1:end)=n*pi+(-1)^(n)*asin(y(locs(n)+1:end));
r=~isnan(an);
p=polyfit(t(r),an(r),3);
figure;
subplot(2,1,1); plot(t,y,'.',t,sin(polyval(p,t)),'r-');
subplot(2,1,2); plot(t,x,'.',t,(polyval(p,t)),'r-');
title(['mean error ' num2str(mean(abs(x-polyval(p,t))))]);
p =
0.0200 0.1200 0.2400 0.1600
The reason I preallocate with NaN and avoid taking the asin at points of discontinuity (locs) is to reduce the error of the fit later. As you can see, for 100 points between 0 and 10 the average error is of the order of floating-point accuracy, and the polynomial coefficients are as exact as you can have them.
Not taking a derivative (as in the very elegant Hilbert transform solution) allows the fit to be numerically exact. For the same conditions the Hilbert transform solution will have a much bigger average error (order of unity vs 1e-15).
The only limitation of this method is that you need enough points in the regime where the asin flips direction, and that the function inside the sin is well behaved. If there's a sampling issue you can truncate the data and keep only a smaller range closer to zero, as long as it is enough to characterize the function inside the sin. After all, you don't need millions of points to fit a cubic polynomial.

Discrete surface integral with cumsum

I have a matrix z(x,y)
This is an NxN arbitrary pdf constructed from a unique kernel density estimation (i.e. not a usual pdf, and it doesn't have a function). It is multivariate, can't be separated, and is discrete data.
I want to construct an NxN matrix F(x,y) that is the two-dimensional cumulative distribution function of this pdf, F(x,y) = P(X < x, Y < y), so that I can then randomly sample from it.
Analytically I think the CDF of a multivariate function is the surface integral of the pdf.
What I have tried is using the cumsum function in order to calculate the surface integral and tested this with a multivariate normal against the analytical solution and there seems to be some discrepancy between the two:
% multivariate parameters
delta = 100;
mu = [1 1];
Sigma = [0.25 .3; .3 1];
x1 = linspace(-2,4,delta); x2 = linspace(-2,4,delta);
[X1,X2] = meshgrid(x1,x2);
% Calculate Normal multivariate pdf
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
% My attempt at a numerical surface integral
FN = cumsum(cumsum(F,1),2);
% Normalise the CDF
FN = FN./max(max(FN));
X = [X1(:) X2(:)];
% Analytic solution to a multivariate normal pdf
p = mvncdf(X,mu,Sigma);
p = reshape(p,delta,delta);
% Highlight the difference
dif = p - FN;
error = max(max(sqrt(dif.^2)));
% %% Plot
figure(1)
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');
figure(2)
surf(X1,X2,FN);
xlabel('x1'); ylabel('x2');
figure(3);
surf(X1,X2,p);
xlabel('x1'); ylabel('x2');
figure(5)
surf(X1,X2,dif)
xlabel('x1'); ylabel('x2');
Particularly the error seems to be in the transition region which is the most important.
Does anyone have any better solution to this problem or see what I'm doing wrong??
Any help would be much appreciated!
EDIT: This is the desired outcome of the cumulative integration. The reason this function is of value to me is that when you randomly generate samples from it on the closed interval [0,1], the higher-weighted (i.e. more likely) values appear more often. In this way the samples converge on the expected value(s) (in the case of multiple peaks), which is the desired outcome for algorithms such as particle filters, neural networks, etc.
Think of the 1-dimensional case first. You have a function represented by a vector F and want to numerically integrate. cumsum(F) will do that, but it uses a poor form of numerical integration. Namely, it treats F as a step function. You could instead do a more accurate numerical integration using the Trapezoidal rule or Simpson's rule.
The 2-dimensional case is no different. Your use of cumsum(cumsum(F,1),2) is again treating F as a step function, and the numerical errors resulting from that assumption only get worse as the number of dimensions of integration increases. There exist 2-dimensional analogues of the Trapezoidal rule and Simpson's rule. Since there's a bit too much math to repeat here, take a look here:
http://onestopgate.com/gate-study-material/mathematics/numerical-analysis/numerical-integration/2d-trapezoidal.asp.
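As a rough illustration of the difference, MATLAB's cumtrapz can be applied along each dimension in turn. The sketch below is not the asker's density: it assumes a pair of independent standard normals, chosen only because the exact CDF can be written with erf, and it scales the cumsum result by the cell area instead of normalizing by its maximum.

```matlab
% Hypothetical test density: independent standard normals on a grid,
% chosen because its CDF has a closed form using erf
x1 = linspace(-5, 5, 201);
x2 = linspace(-5, 5, 201);
[X1, X2] = meshgrid(x1, x2);            % rows vary with x2, columns with x1
F = exp(-(X1.^2 + X2.^2)/2) / (2*pi);   % joint pdf values

% Trapezoidal rule along x1 (dimension 2), then along x2 (dimension 1)
FN = cumtrapz(x2, cumtrapz(x1, F, 2), 1);

% Step-function integration, as in the question, scaled by the cell area
dA = (x1(2) - x1(1)) * (x2(2) - x2(1));
FS = cumsum(cumsum(F, 1), 2) * dA;

% Reference CDF: product of the two one-dimensional normal CDFs
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));
P = Phi(X1) .* Phi(X2);

errTrap = max(abs(FN(:) - P(:)));       % worst-case error of the trapezoidal version
errStep = max(abs(FS(:) - P(:)));       % worst-case error of the cumsum version
```

On this grid the trapezoidal error should be noticeably smaller than the step-function error, with the largest step-function errors in the transition region.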
You DO NOT need to compute the 2-dimensional integral of the probability density function in order to sample from the distribution. If you are computing the 2-d integral, you are going about the problem incorrectly.
Here are two ways to approach the sampling problem.
(1) You write that you have a kernel density estimate. A kernel density estimate is a special case of a mixture density. Any mixture density can be sampled by first selecting one kernel (perhaps differently or equally weighted, same procedure applies), and then sampling from that kernel. (That applies in any number of dimensions.) Typically the kernels are some relatively simple distribution such as a Gaussian distribution so that it is easy to sample from it.
(2) Any joint density P(X, Y) is equal to P(X | Y) P(Y) (and equivalently P(Y | X) P(X)). Therefore you can sample from P(Y) (or P(X)) and then from P(X | Y). In order to sample from P(X | Y), you will need to integrate P(X, Y) along a line Y = y (where y is the sampled value of Y), but (this is crucial) you only need to integrate along that line; you don't need to integrate over all values of X and Y.
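Approach (1) can be sketched in a few lines, assuming equally weighted Gaussian kernels centred on the original data points. The data set, the bandwidth vector h, and the sample count M below are illustrative assumptions, not values from the question.

```matlab
% Hypothetical data set underlying the kernel density estimate (N-by-2)
N = 500;
data = [1 + 0.5*randn(N,1), 1 + randn(N,1)];

% Assumed Gaussian kernel with per-dimension bandwidths (illustrative values)
h = [0.15 0.30];

% Step 1: pick a kernel (a data point) uniformly at random, with replacement
M = 20000;                               % number of samples to draw
idx = randi(N, M, 1);

% Step 2: draw from the Gaussian kernel centred on the chosen point
samples = data(idx, :) + randn(M, 2) * diag(h);
```

No grid, CDF, or numerical integration is involved, and the same two steps carry over to any number of dimensions. For unequal kernel weights, step 1 would draw idx according to those weights instead of uniformly.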
If you tell us more about your problem, I can help with the details.

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that the mean of 1000 samples should be approximately normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example, say we wanted a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (approximately normally distributed)
sig = std(X)/sqrt(N); % standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is consistent with a normal distribution. If the result is clearly not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
You might consider, also, using bootstrapping, with the bootci function.
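To make the bootstrap idea concrete, here is a minimal hand-rolled sketch of a percentile bootstrap interval for the mean, using only base functions. It illustrates the kind of resampling that bootci automates and is not a description of how bootci computes its intervals; the sample, nboot, and the variable names are assumptions for illustration.

```matlab
% Hypothetical non-normal sample: exponential with mean 3, via inverse transform
N = 1000;
X = -3*log(rand(N, 1));

% Percentile bootstrap of the mean
nboot = 2000;                           % number of resamples (illustrative choice)
idx = randi(N, N, nboot);               % each column indexes one resample
bootMeans = sort(mean(X(idx), 1));      % mean of each resample, sorted ascending

alpha = 0.05;
lo = max(1, round(nboot*alpha/2));      % position of the 2.5% point
hi = round(nboot*(1 - alpha/2));        % position of the 97.5% point
CI = [bootMeans(lo), bootMeans(hi)];    % approximate 95% interval for the mean
```

The same resampling loop works for other statistics, such as the median, by replacing mean with the statistic of interest.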
You may use the method proposed in [1]:
MEDIAN +/- 1.7 * (1.25*R / (1.35*SQN))
where R = interquartile range,
SQN = square root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used Matlab, but from my understanding of statistics, if your distribution cannot be assumed to be normal, then you have to treat it as a Student's t-distribution and calculate the confidence interval and accuracy from that.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm