Finding Probability of Gaussian Distribution Using Matlab - matlab

The original question was to model lightbulbs that are used 24/7, where one bulb usually lasts 25 days. A box contains 12 bulbs. What is the probability that the box will last longer than a year?
I had to use MATLAB to model a Gaussian curve based on exponential variables.
The code below generates a Gaussian model with mean = 300 and std = sqrt(12)*25.
I used so many different variables and added them up because I was supposed to be demonstrating the central limit theorem. The Gaussian curve represents the probability of a box of bulbs lasting a given number of days, where 300 is the average number of days a box will last.
I am having trouble using the Gaussian I generated to find the probability for days > 365. The statement 1-normcdf(365,300, sqrt(12)*25) was an attempt to figure out the expected value of that probability, which I got as 0.2265. Any tips on how to find the probability for days > 365 based on the Gaussian I generated would be greatly appreciated.
Thank you!!!
clear all
samp_num=10000000;        % number of simulated boxes
param=1/25;               % exponential rate: mean bulb life is 25 days
n_bulbs=12;               % bulbs per box
% Sum 12 independent exponential lifetimes per box (inverse-transform sampling)
x=zeros(1,samp_num);
for n=1:n_bulbs
    x=x-log(rand(1,samp_num))/param;
end
mean_x=mean(x);
std_x=std(x);
bin_sizex=.01*10/param;
binsx=0:bin_sizex:800;
u=hist(x,binsx);
u1=u/samp_num;            % relative frequency per bin
1-normcdf(365,300, sqrt(12)*25)   % normal (CLT) approximation of P(x > 365)
bar(binsx,u1)
legend(['mean=',num2str(mean_x),' std=',num2str(std_x)]);

[f, y]=ecdf(x) will create an empirical cdf for the data in x. Find the value of f at the point where y first reaches 365; that is the estimated probability of lasting at most 365 days, so one minus that value is your answer.

Generate N replicates of x, where N should be several thousand or tens of thousands. Then p-hat = count(x > 365) / N, and has a standard error of sqrt[p-hat * (1 - p-hat) / N]. The larger the number of replications is, the smaller the margin of error will be for the estimate.
When I did this in JMP with N=10,000 I ended up with [0.2039, 0.2199] as a 95% CI for the true proportion of the time that a box of bulbs lasts more than a year. The discrepancy with your value of 0.2265, along with a histogram of the 10,000 outcomes, indicates that the actual distribution is still somewhat skewed. In other words, using a CLT approximation for the sum of 12 exponentials is going to give answers that are slightly off.
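The estimate and its standard error described above can be sketched in base MATLAB as follows. This is only an illustration: the variable names and N = 10000 are assumptions. As a cross-check, the sum of 12 independent exponential lifetimes with mean 25 follows a gamma distribution with shape 12 and scale 25, so the exact tail probability is available from gammainc; it comes out at about 0.213, inside the interval quoted above and below the CLT value of 0.2265.

```matlab
N = 10000;          % number of simulated boxes (assumed; larger N narrows the interval)
n_bulbs = 12;       % bulbs per box
mean_life = 25;     % mean bulb lifetime in days

% Each column is one box: sum of 12 exponential lifetimes (inverse-transform sampling)
x = sum(-mean_life * log(rand(n_bulbs, N)), 1);

p_hat = mean(x > 365);                       % fraction of boxes lasting more than a year
se = sqrt(p_hat * (1 - p_hat) / N);          % standard error of the proportion
ci = [p_hat - 1.96*se, p_hat + 1.96*se];     % approximate 95% confidence interval

% Exact value from the gamma (Erlang) distribution; gammainc is the
% regularized lower incomplete gamma function, i.e. the gamma CDF here
p_exact = 1 - gammainc(365/mean_life, n_bulbs);
```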

Related

How to find relationship between two distribution curves

I have some floating-point data (represented by the blue curve). When I apply lossy compression, I obtain the yellow curve (mean, standard deviation).
My aim is to minimize the losses after the compression process. Hence, I would like to find an equation/curve/filter such that:
the yellow curve times "function" is nearly equal to the blue Gaussian curve.
or
blue curve = Function(green curve)
Thanks for your help!
A good way to compare the two distributions is the Kolmogorov–Smirnov test. It looks at the maximum difference between the cumulative distributions of the two input vectors.
You can start to play with this test using the MATLAB implementation, [h, p, k] = kstest2(dist1, dist2). The value to look at is k, the test statistic, which is the maximum difference between the two empirical cumulative distributions. If you want to visualise how this difference is calculated, run
cdfplot(dist1)
hold on
cdfplot(dist2)
hold off
and you will see the two cumulative distributions in the same plot. The maximum vertical difference between them is k. The more similar the two distributions are, the smaller the gap, so k tends towards 0; for highly different distributions k moves towards 1.
Hope it helps.
If you have found any more interesting methods, kindly let me know.
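To make the meaning of k concrete, here is a minimal sketch that computes the same statistic by hand in base MATLAB, without kstest2. The two samples are hypothetical stand-ins for the original and compressed data.

```matlab
% Hypothetical example data: two normal samples, the second shifted by 0.5
dist1 = randn(500, 1);
dist2 = randn(500, 1) + 0.5;

% Evaluate both empirical CDFs at every observed value
pts = sort([dist1; dist2]);
F1 = arrayfun(@(v) mean(dist1 <= v), pts);
F2 = arrayfun(@(v) mean(dist2 <= v), pts);

% Two-sample K-S statistic: largest vertical gap between the two CDFs
k_manual = max(abs(F1 - F2));
```

A value near 0 means the two empirical distributions nearly coincide; a value near 1 means they barely overlap.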

How to generate a positive square wave in matlab, whose cumulative sum should be equal to a given value?

I have a cumulative rainfall of R = 100 mm over t = 10 days. I want to distribute R over t as a square wave so that the cumulative sum of the square wave equals R. In fact, I want to generate different scenarios by changing the frequency and duration of the rainfall; in all cases, the cumulative R should be the same.
Please suggest.
Regards,
Imran
Do it on paper first - this is arguably a maths question really.
The trick is to be clear about all your definitions (this is also why you're getting downvotes - those same definitions are required for people to give you detailed help).
Start with the definition of a square wave that fits your requirements (i.e. the two levels it occupies). Define your cumulative and determine how to compute it. You should then be able to write an expression which relates the duty cycle of the square wave (choose a letter for this parameter) and the total time period (your t) to the cumulative (your R). Then you can invert the relationship such that, for a given cumulative R and time t you can recover the required duty cycle.
When you've done this you'll probably find it a great deal easier to implement in matlab.
If you have trouble with any of these stages then ask a new question about it, remembering to include what you've tried so far.
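As one possible way to carry this out, the sketch below assumes the wave switches between 0 and a single rain intensity A, and that the period and duty cycle are the scenario inputs. It then solves for A so that the cumulative total equals R. The time step, period, and duty cycle values are all assumptions chosen for illustration.

```matlab
R = 100;          % total rainfall in mm
t_total = 10;     % total duration in days
dt = 0.01;        % time step in days (assumed)
period = 2;       % length of one on/off cycle in days (assumed scenario input)
duty = 0.25;      % fraction of each cycle with rain (assumed scenario input)

t = 0:dt:t_total-dt;                  % time axis
on = mod(t, period) < duty*period;    % true while it is raining

A = R / (sum(on) * dt);               % intensity in mm/day that makes the total equal R
rain = A * on;                        % positive square wave
cumR = cumsum(rain) * dt;             % cumulative rainfall over time
```

Changing period or duty gives a different scenario, and A rescales automatically so that cumR(end) stays equal to R.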

How do I find the slope (rate) in MATLAB?

For example, say I have a scatter plot:
Year = [2001 2002 2003 2004 2005];
Distance = [1.5 1.8 1.9 2.2 2.5];
scatter(Year, Distance)
hold on
pf = polyfit(Year,Distance,1);
f = polyval(pf,Year);
plot(Year,f)
And I can find R by:
[r,p] = corrcoef(Year,Distance)
I want to find the rate at which the distance increases per year, which I think is equivalent to the slope?
You are correct in your interpretation of the slope in this case. If you use polyfit in that fashion, you are finding the slope and intercept of the regression line that best fits that distribution. In this case, the slope would be the rate at which distance increases per year. Without going into much detail, polyfit will determine the line of best fit that will minimize the sum of squared errors between the best fit line and your data points. Therefore, this slope will give you the best rate at which distance is increasing per year, given your point distribution.
You can follow Chris A's approach in that you can find point-wise pairs of neighbouring points and compute a slope for each, then do an average, but doing polyfit will find the least squares regression line and in my opinion that's the way to go.
You can obtain the least squares, or best fit slope by extracting the first value of pf as you have already observed. The second value will contain the intercept term of the regression line.
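Using the data from the question, extracting the two coefficients looks like this. The variable names slope and intercept are just illustrative.

```matlab
Year = [2001 2002 2003 2004 2005];
Distance = [1.5 1.8 1.9 2.2 2.5];

pf = polyfit(Year, Distance, 1);   % coefficients, highest power first
slope = pf(1);                     % rate of change of distance per year
intercept = pf(2);                 % value of the fitted line at Year = 0
```

For this data the slope works out to 0.24 distance units per year.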
Good choice on using corrcoef to determine how good the fit is. However, take the correlation coefficient with a grain of salt. Some data sets report a good correlation coefficient even though the best fit line describes them poorly. A classic example is Anscombe's quartet: all four data sets have a correlation coefficient of 0.816 and the same regression line, yet the point distributions are completely different.

Bootstrap and asymmetric CI

I'm trying to create a confidence interval for a set of data that is not normally distributed and is very skewed to the right. Searching online, I discovered a pretty crude method that consists of using the 97.5th percentile of my data as the upper confidence limit and the 2.5th percentile as the lower one.
Unfortunately, I need a more sophisticated way!
Then I discovered the bootstrap, specifically the MATLAB bootci function, but I'm having a hard time understanding how to use it properly.
Let's say that M is my matrix containing my data (19x100), and let's say that:
Mean = mean(M,2);
StdDev = sqrt(var(M'))';
How can I compute the asymmetrical CI for every row of the Mean vector using bootci?
Note: earlier, I was computing the CI in this very wrong way: Mean +/- 2 * StdDev, shame on me!
Let's say you have a 100x19 data set. Each column has a different distribution. We'll choose the log normal distribution, so that they skew to the right.
means = repmat(log(1:19), 100, 1);
stdevs = ones(100, 19);
X = lognrnd(means, stdevs);
Notice that all values within a column come from the same distribution, and the rows are separate observations. Most functions in MATLAB operate down the columns by default, treating each column as one variable, so it's always preferable to keep your data this way around.
You can compute bootstrap confidence intervals for the mean using the bootci function.
ci = bootci(1000, @mean, X);
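This does 1000 resamplings of your data, calculates the mean for each resampling, and derives a 95% confidence interval from the resulting distribution of means. By default bootci uses the bias-corrected and accelerated (BCa) percentile method; the plain 2.5% and 97.5% quantiles correspond to its 'per' type. To show that it's an asymmetric confidence interval about the mean, we can plot the mean and the confidence intervals for each column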
plot(mean(X), 'r')
hold on
plot(ci')
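To see what the resampling involves, here is a minimal sketch of a plain percentile bootstrap written in base MATLAB. It is not what bootci does internally by default, and the data is a hypothetical right-skewed stand-in for the real matrix (transpose a 19x100 matrix first so that rows are observations).

```matlab
X = exp(randn(100, 19));          % hypothetical right-skewed data, 100 observations x 19 variables
nboot = 1000;                     % number of bootstrap resamples
n = size(X, 1);

boot_means = zeros(nboot, size(X, 2));
for b = 1:nboot
    idx = randi(n, n, 1);                  % draw n row indices with replacement
    boot_means(b, :) = mean(X(idx, :), 1); % column means of the resampled data
end

% Percentile interval: 2.5% and 97.5% points of the sorted bootstrap means
sorted_means = sort(boot_means, 1);
ci_per = [sorted_means(round(0.025*nboot), :); ...
          sorted_means(round(0.975*nboot), :)];
```

Row 1 of ci_per holds the lower limits and row 2 the upper limits, one column per variable, matching the layout returned by bootci.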

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that the mean of 1000 samples should be approximately normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example, say we wanted a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is consistent with a normal distribution. If the result is clearly not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
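For a quantitative check rather than a visual one, the Statistics Toolbox offers tests such as jbtest and lillietest. As a rough illustration of the idea behind the first of these, the sketch below computes the Jarque-Bera statistic by hand from the sample skewness and kurtosis. The data is a hypothetical exponential sample, and the threshold 5.991 is the 5% critical value of a chi-square distribution with 2 degrees of freedom, which is only an asymptotic approximation.

```matlab
X = -3 * log(rand(1000, 1));       % hypothetical non-normal sample (exponential, mean 3)
N = numel(X);

z = (X - mean(X)) / std(X, 1);     % standardize using the population (1/N) standard deviation
S = mean(z.^3);                    % sample skewness (0 for a normal distribution)
K = mean(z.^4);                    % sample kurtosis (3 for a normal distribution)

JB = N/6 * (S^2 + (K - 3)^2 / 4);  % Jarque-Bera statistic
reject_normal = JB > 5.991;        % true suggests the data is not normal at the 5% level
```

For strongly skewed data like this, JB is far above the threshold; for a sample from randn it usually stays below it.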
You might consider, also, using bootstrapping, with the bootci function.
You may use the method proposed in [1]:
MEDIAN +/- 1.7 * (1.25 * R / (1.35 * SQN))
where R = interquartile range,
SQN = square root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
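A minimal sketch of this interval in base MATLAB is shown below. The sample is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is an assumption; prctile or another quartile convention will give slightly different values for R.

```matlab
X = -3 * log(rand(1000, 1));         % hypothetical skewed sample (exponential, mean 3)
N = numel(X);

Xs = sort(X);
half = floor(N/2);
q1 = median(Xs(1:half));             % lower quartile (median of the lower half)
q3 = median(Xs(end-half+1:end));     % upper quartile (median of the upper half)
R = q3 - q1;                         % interquartile range

med = median(X);
halfwidth = 1.7 * (1.25 * R / (1.35 * sqrt(N)));
CI_median = [med - halfwidth, med + halfwidth];
```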
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used MATLAB, but from my understanding of statistics, when the population standard deviation is unknown you can base the confidence interval for the mean on the Student t distribution. With a sample as large as 1000 it is nearly identical to the normal-based interval.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm