Finding 15th and 85th percentile in MATLAB

I have come up with MATLAB code to plot a probability density and a cumulative graph. I have also used MATLAB to compute the standard deviation and the mean.
My next task is to find the 15th and 85th percentiles of the cumulative graph. I tried 'prctile(prob, 15)' to calculate the 15th percentile, but the result is not the same as the value I read off the graph.
Are there any other ways to find the 15th and 85th percentiles?

This should give you the 15th and 85th percentile values as you see them in your cumulative graph:
% MATLAB variable names cannot start with a digit, hence p15 and p85
p15 = prob(find(prob<prctile(prob,15),1));
p85 = prob(find(prob>prctile(prob,85),1,'last'));
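
A possible reason for the mismatch in the question: prctile works on raw samples, so if prob holds the cumulative curve itself, prctile(prob,15) returns a percentile of the probability values rather than the point where the curve reaches 0.15. A minimal sketch of reading the percentiles off the curve by inverting it with interp1 is below. The names xVals and cdfVals and the uniform example data are assumptions for illustration, not taken from the question.

```matlab
% Hypothetical inputs (names and values are assumptions, not from the question):
% xVals are the points on the horizontal axis, cdfVals the cumulative curve.
xVals   = linspace(0, 10, 101);
cdfVals = xVals / 10;                 % CDF of a uniform distribution on [0, 10]

% interp1 needs distinct sample points, so drop repeated CDF values first.
[cdfUnique, keep] = unique(cdfVals);

% Invert the curve: find the x at which the CDF reaches 0.15 and 0.85.
x15 = interp1(cdfUnique, xVals(keep), 0.15);
x85 = interp1(cdfUnique, xVals(keep), 0.85);
```

For this uniform example the two values land at roughly 1.5 and 8.5, which is what you would read off the graph.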

There are several ways to calculate a percentile (see http://en.wikipedia.org/wiki/Percentile).
The gotcha here is that MATLAB and Excel don't agree (Excel uses the definition employed by the National Institute of Standards and Technology in the US, which is also the default for R). That is worth considering if you swap data and analysis between MATLAB and Excel.

Use the prctile function if you have the Statistics Toolbox (type help prctile).
http://www.mathworks.com/help/stats/prctile.html
Alternatively, write it yourself! A percentile is simply the value in the sorted data closest to the position you want. For example, if you have 1000 values, your 15th percentile will be the (15/100)*1000 = 150th value. Make sure you sort the data from smallest to largest.
Positions that fall in between samples need special handling, and the handling depends on the definition you use. Some definitions take the nearest sample, others average the two neighbouring samples, and others interpolate linearly according to how close the position is to each sample.
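
As a rough sketch of the do-it-yourself route, the snippet below computes a percentile two ways without any toolbox: by nearest rank, and by linear interpolation between neighbouring samples. The data vector is made up for illustration. The interpolation places the k-th sorted value at the 100*(k-0.5)/n percentile, which is, as far as I know, the convention the prctile documentation describes; other definitions will give slightly different numbers.

```matlab
data = 1000:-1:1;                  % hypothetical sample, deliberately unsorted
p    = 15;                         % percentile to look up

sorted = sort(data);               % smallest to largest
n      = numel(sorted);

% Nearest-rank definition: take the ceil(p*n/100)-th sorted value.
rankIdx  = max(1, ceil(p * n / 100));
pNearest = sorted(rankIdx);

% Linear interpolation: the k-th sorted value sits at the 100*(k-0.5)/n
% percentile, so percentile p corresponds to fractional position p*n/100 + 0.5.
pos = min(max(p * n / 100 + 0.5, 1), n);   % clamp to the valid index range
lo  = floor(pos);
hi  = ceil(pos);
pLinear = sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo));
```

With the 1000 values above, the nearest-rank result is the 150th value, matching the arithmetic in the answer.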

Related

How to interpolate values to fill percentile ranges?

I have values for 10th-25th percentile range which is 0.49, 25th-50th percentile is 1.36 (this is peak), 50th-75th percentile is 0.18, >90th percentile is 0.15.
I want to interpolate the values for the ranges >5, 5th-10th, 75th-90th percentile. How to do that in MATLAB?
If I assume a normal distribution, whose peak is 1.36 (25th-50th percentile) (as shown in figure attached), how to interpolate the values of unknown percentile ranges?
Actually, interpolating in order to find percentile values does not look like a good approach to me. If you are dealing with a normal distribution and its parameters (mu and sigma) are known, what you are looking for is the norminv function (official documentation: https://mathworks.com/help/stats/norminv.html).
X = norminv(P,mu,sigma) computes the inverse of the normal CDF using
the corresponding mean mu and standard deviation sigma at the
corresponding probabilities in P. The parameters in sigma must be
positive, and the values in P must lie in the interval [0 1].
For example, this is how you find the interval that contains 95% of the values of a standard normal distribution:
norminv([0.025 0.975],0,1)
This is how you find the 99th percentile of a normal distribution with mu=10 and sigma=3.5:
norminv(0.99,10,3.5)
If you don't know those parameters, you can estimate them from the data you actually have. The parameters of the normal family are the mean and the standard deviation; once they are known, the underlying distribution is fully described. In particular:
The mean of a normal distribution is halfway between the 25th and the 75th percentile. Average these two values to approximate it.
In a normal distribution, the difference between the 25th and the 75th percentile is about 1.35 times its standard deviation. So take the difference between the aforementioned values and divide it by 1.35 in order to obtain an approximation of the standard deviation.
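
The two rules of thumb above can be sketched as follows. The quartile values q25 and q75 are made up for illustration (they are not the numbers from the question), and the inverse CDF is written with erfinv so the snippet runs without the Statistics Toolbox; with the toolbox, norminv(p, mu, sigma) should give the same values.

```matlab
% Hypothetical quartiles (values are assumptions for illustration only).
q25 = 7.64;
q75 = 12.36;

mu    = (q25 + q75) / 2;           % mean: midpoint of the two quartiles
sigma = (q75 - q25) / 1.35;        % interquartile range is about 1.35 sigma

% Inverse normal CDF via erfinv (base MATLAB), equivalent to norminv.
invNormal = @(p) mu + sigma * sqrt(2) * erfinv(2 * p - 1);

% Percentiles at the boundaries of the unknown ranges.
p  = [0.05 0.10 0.75 0.90];
xq = invNormal(p);
```

As a sanity check, the third entry of xq should land very close to q75, since that quartile was used to build the estimate.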
If you want to go with a linear interpolation, have a look at interp1 (https://mathworks.com/help/matlab/ref/interp1.html).

Quantizing timeline data for averaging and histograms

I have some raw spreadsheet data that's in a format, like:
12/7/2016 3:07:00, 88.05,
12/7/2016 3:08:00, 89.10,
12/7/2016 3:13:00, 87.00,
etc
These data points are not sampled at a regular interval, but are randomly collected throughout the day.
Using Google Sheets I'm able to graph this easily onto a Timeline chart. This puts the values at the correct position on the timeline and takes the uneven sampling intervals into account.
I would like to generate a histogram of the timeline data while taking into account the timestamps and calculate an average value over a timeframe. I believe if I simply run this through the built-in histogram chart or select my data values and run it through an averaging function, it will be skewed by the uneven sampling intervals.
What's the easiest way to quantize the sampling intervals (ideally within Google Sheets) for generating my histogram and averaging?
or
Is there a built-in method to generate histograms/averaging of values while taking timestamp data into account, eliminating the need for quantized data?
You can calculate the appropriate average as follows (assuming your data is in the range A2:B50)
=sum(arrayformula((A3:A50-A2:A49)*(B3:B50+B2:B49)/2))/(A50-A2)
This formula implements the Trapezoidal rule: the value assigned to each time interval is the average of observed values at the ends of that interval.
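
For comparison, the same time-weighted average can be sketched in MATLAB with trapz. The three samples are the ones listed in the question, with the timestamps converted to minutes after 3:07 (that conversion is my assumption about how the times would be loaded).

```matlab
t = [0 1 6];                       % minutes after 3:07 for the three samples
v = [88.05 89.10 87.00];           % observed values

% Trapezoidal rule: area under the piecewise-linear curve divided by the
% total time span gives the time-weighted average.
timeWeightedAvg = trapz(t, v) / (t(end) - t(1));

plainAvg = mean(v);                % ignores the uneven spacing, for comparison
```

The long 5-minute gap gets five times the weight of the 1-minute gap here, which is exactly what a plain mean misses.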
There isn't a built-in "weighted histogram" tool, so it appears the data needs to be resampled to create a representative histogram. Here is one way to resample. Let's say you want 20 samples; then in C2 enter
=arrayformula(A2+(row(1:20)-1)*(A50-A2)/19)
to get 20 uniformly distributed time values. (Division by 19 because of the fence-post distinction.) Then in D2,
=arrayformula(vlookup(C2:C21, A2:B50, 2))
will look up a value for each sample time. Then you can build a histogram from column D.
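
The same resampling idea, sketched in MATLAB rather than Sheets: interp1 with the 'previous' method holds the most recent observation at or before each uniform sample time, which is my reading of what vlookup does on sorted data. The times and values below are hypothetical.

```matlab
t = [0 1 6 10 11 20];                        % hypothetical irregular times (minutes)
v = [88.05 89.10 87.00 90.20 91.00 86.50];   % hypothetical observations

nSamples = 20;
tUniform = linspace(t(1), t(end), nSamples); % 20 evenly spaced sample times

% Hold the last observed value at or before each uniform sample time.
vHeld = interp1(t, v, tUniform, 'previous');

% Histogram counts of the resampled values (5 bins chosen arbitrarily).
[counts, centers] = hist(vHeld, 5);
```

Values that persisted for a long stretch now appear several times in vHeld, so they carry proportionally more weight in the histogram.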

Finding Probability of Gaussian Distribution Using Matlab

The original question was to model lightbulbs that are used 24/7, where one bulb usually lasts 25 days. A box contains 12 bulbs. What is the probability that the box will last longer than a year?
I had to use MATLAB to model a Gaussian curve based on an exponential variable.
The code below generates a Gaussian model with mean = 300 and std = sqrt(12)*25.
I had to use so many different variables and add them up because I was supposed to be demonstrating the central limit theorem. The Gaussian curve represents the probability of a box of bulbs lasting a given number of days, where 300 is the average number of days a box will last.
I am having trouble using the Gaussian I generated to find the probability for days > 365. The statement 1-normcdf(365,300, sqrt(12)*25) was an attempt to work out the expected value of that probability, which I got as .2265. Any tips on how to find the probability for days > 365 based on the Gaussian I generated would be greatly appreciated.
Thank you!!!
clear all
samp_num=10000000;                % number of simulated boxes
param=1/25;                       % exponential rate: mean bulb life is 25 days
% Twelve independent exponential lifetimes per box (inverse-transform sampling)
a=-log(rand(1,samp_num))/param;
b=-log(rand(1,samp_num))/param;
c=-log(rand(1,samp_num))/param;
d=-log(rand(1,samp_num))/param;
e=-log(rand(1,samp_num))/param;
f=-log(rand(1,samp_num))/param;
g=-log(rand(1,samp_num))/param;
h=-log(rand(1,samp_num))/param;
i=-log(rand(1,samp_num))/param;
j=-log(rand(1,samp_num))/param;
k=-log(rand(1,samp_num))/param;
l=-log(rand(1,samp_num))/param;
x=a+b+c+d+e+f+g+h+i+j+k+l;        % total life of a 12-bulb box
mean_x=mean(x);
std_x=std(x);
bin_sizex=.01*10/param;           % bin width of 2.5 days
binsx=[0:bin_sizex:800];
u=hist(x,binsx);
u1=u/samp_num;                    % fraction of boxes in each bin
1-normcdf(365,300, sqrt(12)*25)   % normal (CLT) approximation of P(x > 365)
bar(binsx,u1)
legend(['mean=',num2str(mean_x),', std=',num2str(std_x)]);
[f, y]=ecdf(x) will create an empirical CDF for the data in x. You can then read off the value of f where y first crosses 365; one minus that value is the probability of lasting longer than 365 days.
Generate N replicates of x, where N should be several thousand or tens of thousands. Then p-hat = count(x > 365) / N, and has a standard error of sqrt[p-hat * (1 - p-hat) / N]. The larger the number of replications is, the smaller the margin of error will be for the estimate.
When I did this in JMP with N=10,000 I ended up with [0.2039, 0.2199] as a 95% CI for the true proportion of the time that a box of bulbs lasts more than a year. The discrepancy with your value of 0.2265, along with a histogram of the 10,000 outcomes, indicates that the actual distribution is still somewhat skewed. In other words, using a CLT approximation for the sum of 12 exponentials is going to give answers that are slightly off.
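
The replication approach described above can be sketched in MATLAB roughly like this. The variable names are mine, and the 1.96 multiplier assumes the usual normal approximation for a 95% interval on a proportion.

```matlab
N        = 10000;                  % number of simulated boxes (replicates)
meanLife = 25;                     % mean bulb life in days, from the question
bulbs    = 12;                     % bulbs per box

% Each column is one box: 12 exponential lifetimes drawn by
% inverse-transform sampling, then summed to get the life of the box.
life    = -meanLife * log(rand(bulbs, N));
boxLife = sum(life, 1);

pHat = mean(boxLife > 365);               % estimated P(box lasts > 365 days)
se   = sqrt(pHat * (1 - pHat) / N);       % standard error of the proportion
ci95 = pHat + [-1.96 1.96] * se;          % approximate 95% confidence interval
```

Because the draws are random, the numbers change from run to run, but the interval should sit noticeably below the 0.2265 obtained from the normal approximation.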

How to plot a probability density distribution graph in MATLAB?

I have about 10000 floating-point values and have read them into a single-row matrix.
Now I would like to plot them and show their distribution. Are there some simple functions to do that?
plot() actually plots each value against its index in the data, which is not what I want.
bar() is similar to what I want, but I would like to lower the sample rate and merge neighbouring bars that are close enough (e.g. one bar for 0.50-0.55, one bar for 0.55-0.60, etc.) instead of having one bar for every single data sample.
Is there a function that calculates this distribution by dividing the range into small steps and outputting the probability density in each step?
Thank you!
hist() would be best. It plots a histogram and has a lot of options, which you can see with doc hist or on the MATLAB website. Options include a specified number of bins, or a range of bins. The following plots a histogram of 1000 normally distributed random points with 50 bins:
hist(randn(1000,1),50)
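
hist() on its own shows counts, not a probability density. A minimal sketch of turning the counts into a density estimate is below: divide by the number of samples times the bin width, so the bars integrate to 1. The randn data is a stand-in for the 10000 values in the question.

```matlab
data  = randn(1, 10000);           % stand-in for the 10000 values
nBins = 50;

% With output arguments, hist returns the counts and bin centers
% instead of drawing the plot.
[counts, centers] = hist(data, nBins);

binWidth = centers(2) - centers(1);
density  = counts / (sum(counts) * binWidth);   % bars now integrate to 1
```

To draw the result, call bar(centers, density). Passing a vector of bin centers to hist instead of nBins gives bins of a chosen width, such as 0.05.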

Determine the position and value of peak

I have a graph with five major peaks. I'd like to find the position and value of the first peak (the one furthest to the right). I have more than 100 different plots of this, and the peak grows and shrinks in size across the plots, so I will need to use a for loop. I'm just stuck on determining the x and y values to a large number of significant figures using MATLAB code.
Here's one of the many plots:
If you know for sure you're always going to have 5 peaks, I think the File Exchange function extrema will be very helpful, see here.
This will return the maxima (and minima if needed) in descending order, so the first elements of the outputs zmax and imax are respectively the maximal value and its index, their second elements are the second-largest maximum and its index, and so on.
If the peak you need is always the smallest of the five, you'll just need zmax(5) and imax(5), which give the 5th biggest maximum and its index.
If you have access to Signal Processing Toolbox, findpeaks is the function you are looking for. It can be invoked using different options including number of peaks, which can be helpful when that information is available.
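
If neither extrema nor findpeaks is available, a bare-bones local-maximum search in base MATLAB might be enough for clean data. The sketch below is not how either of those functions works internally. The five-peak test signal and the minHeight threshold are assumptions made up for illustration, and noisy data would likely need smoothing first.

```matlab
% Hypothetical signal: five Gaussian peaks at x = 1, 3, 5, 7, 9.
x = linspace(0, 10, 1001);
centers = [1 3 5 7 9];
heights = [1.0 0.8 0.6 0.9 0.5];
y = zeros(size(x));
for k = 1:numel(centers)
    y = y + heights(k) * exp(-(x - centers(k)).^2 / (2 * 0.2^2));
end

% Interior points that are higher than their left neighbour and at least
% as high as their right neighbour count as local maxima.
inner   = y(2:end-1);
isPeak  = inner > y(1:end-2) & inner >= y(3:end);
peakIdx = find(isPeak) + 1;

% Discard small bumps below a threshold (hypothetical choice: 10% of max).
minHeight = 0.1 * max(y);
peakIdx   = peakIdx(y(peakIdx) >= minHeight);

% The peak furthest to the right is the last one found.
rightIdx = peakIdx(end);
xRight   = x(rightIdx);
yRight   = y(rightIdx);
fprintf('%.15g  %.15g\n', xRight, yRight);   % print with many significant figures
```

Wrapping the search in a for loop over the 100+ data sets and storing xRight and yRight per iteration would cover the batch case. The precision of the result is limited by the spacing of the x samples, not by the number of digits printed.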