Bootstrap and asymmetric CI - MATLAB

I'm trying to build a confidence interval for a set of data that is not normally distributed and is heavily right-skewed. Searching around, I discovered a pretty crude method that consists of using the 97.5th percentile of my data as the upper bound of the CI and the 2.5th percentile as the lower bound.
Unfortunately, I need a more sophisticated approach!
Then I discovered the bootstrap, specifically the MATLAB bootci function, but I'm having a hard time understanding how to use it properly.
Let's say that M is my matrix containing my data (19x100), and let's say that:
Mean = mean(M,2);
StdDev = sqrt(var(M'))';
How can I compute the asymmetrical CI for every row of the Mean vector using bootci?
Note: earlier, I was computing the CI in this very wrong way: Mean +/- 2 * StdDev, shame on me!

Let's say you have a 100x19 data set. Each column has a different distribution. We'll choose the log normal distribution, so that they skew to the right.
means = repmat(log(1:19), 100, 1);
stdevs = ones(100, 19);
X = lognrnd(means, stdevs);
Notice that each column is from the same distribution, and the rows are separate observations. Most statistics functions in MATLAB operate down the columns by default, treating each row as an observation, so it's always preferable to keep your data this way around.
You can compute bootstrap confidence intervals for the mean using the bootci function.
ci = bootci(1000, @mean, X);
This does 1000 resamplings of your data, calculates the mean for each resampling and then takes the 2.5% and 97.5% quantiles. To show that it's an asymmetric confidence interval about the mean, we can plot the mean and the confidence intervals for each column:
plot(mean(X), 'r')
hold on
plot(ci')
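Applying this back to the original 19x100 matrix M (one variable per row, one observation per column, which is my reading of the question), a minimal sketch would be to transpose first so that bootci resamples the observations:
Mt = M';                        % 100x19: rows are observations, columns are the 19 variables
ci = bootci(1000, @mean, Mt);   % 2x19: first row = lower bounds, second row = upper bounds
ci = ci'                        % 19x2: one asymmetric CI per row of the original M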

Related

Matlab: What are the ways to determine the distribution of the data

I have a data set of n = 1000 realizations of a univariate random variable X -- X = {x1, x2, ..., xn}. The data are generated by varying a parameter on which the random variable depends. For example, let the r.v. be the area of a circle. By varying the radius (keeping the dimension fixed, say a 2-dimensional circle) I generate n areas for radii in the range r = 5 to n.
By using the fitdist command I can fit distributions to the data set, choosing distributions like Normal, Kernel, Binomial, etc. Thus the data set is fitted to k distributions. How do I select the best-fitting distribution, and hence the pdf?
Also, do I always need to normalize (post-process) the data to the range [0,1] before fitting?
If I understand correctly, you are asking how to decide which distribution to choose once you have a few fits.
There are three major metrics (IMO) for measuring "goodness-of-fit":
Chi-Squared
Kolmogorov-Smirnov
Anderson-Darling
Which to choose depends on a large number of factors; you can randomly pick one or read the Wiki pages to figure out which suits your need. These tests are also a part of MATLAB.
For instance, you can use kstest for the Kolmogorov-Smirnov test. You can provide the data and the hypothesized distribution to the function and evaluate the different options based on the KS test.
Alternately, you can use Anderson-Darling through adtest or Chi-Squared through chi2gof.
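For instance, a minimal sketch comparing two candidate fitdist fits with kstest (the gamma sample and the candidate families below are purely illustrative assumptions):
x = gamrnd(2, 3, 1000, 1);              % illustrative right-skewed sample
pdNormal = fitdist(x, 'Normal');        % candidate fit 1
pdGamma  = fitdist(x, 'Gamma');         % candidate fit 2
[hN, pN] = kstest(x, 'CDF', pdNormal);  % h = 1 means the KS test rejects the fit
[hG, pG] = kstest(x, 'CDF', pdGamma);
fprintf('Normal: h=%d, p=%.3g; Gamma: h=%d, p=%.3g\n', hN, pN, hG, pG)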

Finding Probability of Gaussian Distribution Using Matlab

The original question was to model lightbulbs, which are used 24/7; one bulb usually lasts 25 days. A box contains 12 bulbs. What is the probability that the box will last longer than a year?
I had to use MATLAB to model a Gaussian curve based on an exponential variable.
The code below generates a Gaussian model with mean = 300 and std = sqrt(12)*25.
The reason I had to use so many different variables and add them up was that I was supposed to be demonstrating the central limit theorem. The Gaussian curve represents the probability of a box of bulbs lasting a given number of days, where 300 is the average number of days a box lasts.
I am having trouble using the Gaussian I generated to find the probability for days > 365. The statement 1-normcdf(365, 300, sqrt(12)*25) was an attempt to compute that probability, which came out to .2265. Any tips on how to find the probability of days > 365 based on the Gaussian I generated would be greatly appreciated.
Thank you!!!
clear all
samp_num=10000000;                       % number of simulated boxes
param=1/25;                              % exponential rate (mean bulb lifetime = 25 days)
a=-log(rand(1,samp_num))/param;          % inverse-CDF sample of one exponential bulb lifetime
b=-log(rand(1,samp_num))/param;
c=-log(rand(1,samp_num))/param;
d=-log(rand(1,samp_num))/param;
e=-log(rand(1,samp_num))/param;
f=-log(rand(1,samp_num))/param;
g=-log(rand(1,samp_num))/param;
h=-log(rand(1,samp_num))/param;
i=-log(rand(1,samp_num))/param;
j=-log(rand(1,samp_num))/param;
k=-log(rand(1,samp_num))/param;
l=-log(rand(1,samp_num))/param;
x=a+b+c+d+e+f+g+h+i+j+k+l;               % lifetime of a 12-bulb box (sum of 12 exponentials)
mean_x=mean(x);
std_x=std(x);
bin_sizex=.01*10/param;                  % histogram bin width
binsx=[0:bin_sizex:800];
u=hist(x,binsx);
u1=u/samp_num;                           % normalize counts to relative frequencies
1-normcdf(365,300, sqrt(12)*25)          % CLT approximation of P(box lasts > 365 days)
bar(binsx,u1)
legend(['mean=',num2str(mean_x),' std=',num2str(std_x)]);
[f, y]=ecdf(x) will create an empirical cdf for the data in x. You can then find the probability where it first crosses 365 to get your answer.
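A minimal sketch of that ecdf lookup, reusing the simulated box lifetimes in x from the question's code:
[fe, ye] = ecdf(x);                     % empirical CDF of the simulated box lifetimes
p_gt_365 = 1 - fe(find(ye >= 365, 1))   % estimated P(box lasts more than a year)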
Generate N replicates of x, where N should be several thousand or tens of thousands. Then p-hat = count(x > 365) / N, which has a standard error of sqrt[p-hat * (1 - p-hat) / N]. The larger the number of replications, the smaller the margin of error of the estimate.
When I did this in JMP with N=10,000 I ended up with [0.2039, 0.2199] as a 95% CI for the true proportion of the time that a box of bulbs lasts more than a year. The discrepancy with your value of 0.2265, along with a histogram of the 10,000 outcomes, indicates that actual distribution is still somewhat skewed. In other words, using a CLT approximation for the sum of 12 exponentials is going to give answers that are slightly off.
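For reference, a rough MATLAB version of that replication experiment (the exprnd call and the 1.96 normal critical value are my assumptions, not part of the original answer):
N = 10000;                                 % number of replicate boxes
lifetimes = sum(exprnd(25, N, 12), 2);     % each box: sum of 12 exponential(mean 25 days) bulb lifetimes
p_hat = mean(lifetimes > 365);             % proportion of boxes lasting more than a year
se = sqrt(p_hat * (1 - p_hat) / N);        % standard error of the proportion
CI95 = [p_hat - 1.96*se, p_hat + 1.96*se]  % approximate 95% CI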

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness of fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function, and someone suggested that I do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes, so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct.
I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get ln(y_n) = ln(y_0) - b * t_n. This suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which MATLAB will return in the stats structure if you use the regress function to perform the OLS.
Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares, where you bind the parameter ln(x_0) to its known value.
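A minimal sketch of that suggestion (the synthetic decay data below are an assumption, just to make the example self-contained):
t = (0:0.5:10)';                                            % timestamps t_n
x = 5 * exp(-0.7*t) .* exp(0.05*randn(size(t)));            % synthetic decay data x_n with noise
[b, ~, ~, ~, stats] = regress(log(x), [ones(size(t)), t]);  % OLS on ln(x_n) = ln(x_0) - b*t_n + e_n
x0_hat = exp(b(1))                                          % estimated initial value x_0
b_hat = -b(2)                                               % estimated decay rate b
R2 = stats(1)                                               % R^2 goodness-of-fit measure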

How do I use scipy.optimize to minimize a set of functions?

[EDIT: The fmin() method is a good choice for my problem. However, my problem was that one of the axes was a sum of the other axes. I wasn't recalculating the y axis after applying the multiplier. Thus, the value returned from my optimize function was always the same. This gave fmin no direction, so its chosen multipliers were very close together. Once the calculations in my optimize function were corrected, fmin chose a larger range.]
I have two datasets that I want to apply multipliers to, to see what values could 'improve' their correlation coefficients.
For example, say data set 1 has a correlation coefficient of -.6 and data set 2 has .5.
I can apply different multipliers to each of these data sets that might improve the coefficient. I would like to find a set of multipliers for these two data sets that optimizes the correlation coefficient of each set.
I have written an objective function that takes a list of multipliers, applies them to the data sets, calculates the correlation coefficient (scipy.stats.spearmanr()), and sums these coefficients. So I need to use something from scipy.optimize to pass a set of multipliers to this function and find the set that optimizes this sum.
I have tried using optimize.fmin and several others. However, I want the optimization technique to use a much larger range of multipliers. For example, my data sets might have values in the millions, but fmin will only choose multipliers around 1.0, 1.05, etc. This isn't a big enough value to modify these correlation coefficients in any meaningful way.
Here is some sample code of my objective function:
def objective_func(multipliers):
    coeffs = []
    for multiplier in multipliers:
        for data_set in data_sets:
            x_vals = getDataSetXValues()
            y_vals = getDataSetYValues()
            x_vals *= multiplier
            coeffs.append(scipy.stats.spearmanr(x_vals, y_vals))
    return -1 * sum(coeffs)
I'm using -1 because I actually want the biggest value, but fmin is for minimization.
Here is a sample of how I'm trying to use fmin:
print optimize.fmin(objective_func)
The multipliers start at 1.0 and just range between 1.05, 1.0625, etc. I can see in the actual fmin code where these values are chosen. I ultimately need another method to call to give the minimization a range of values to check for, not all so closely related.
Multiplying the x data by some factor won't really change the Spearman rank-order correlation coefficient, though.
>>> x = numpy.random.uniform(-10,10,size=(20))
>>> y = numpy.random.uniform(-10,10,size=(20))
>>> scipy.stats.spearmanr(x,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*10,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e6,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e-16,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2),y)
(0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2e6),y)
(0.24661654135338346, 0.29455199407204263)
(The second term in the tuple is the p value.)
You can change its sign if you flip the signs of the terms, but the whole point of Spearman correlation is that it tells you the degree to which any monotonic relationship would capture the association. That probably explains why fmin isn't changing the multiplier much: it's not getting any feedback on direction, because the returned value is constant.
So I don't see how what you're trying to do can work.
I'm also not sure why you've chosen the sum of all the Spearman coefficients and the p values as what you're trying to maximize: the Spearman coefficients can be negative, so you probably want to square them, and you haven't mentioned the p values, so I'm not sure why you're throwing them in.
[It's possible I guess that we're working with different scipy versions and our spearmanr functions return different things. I've got 0.9.0.]
You probably don't want to minimize the sum of coefficients but the sum of squares. Also, if the multipliers can be chosen independently, why are you trying to optimize them all at the same time? Can you post your current code and some sample data?

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +/- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining whether my data is normally distributed? I mean, I can currently tell just by looking at the histogram or the pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that the mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example, say we wanted a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is normal. If the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
You might also consider bootstrapping, with the bootci function.
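A minimal sketch (the exprnd sample is just an illustrative stand-in for your data):
x = exprnd(3, 1000, 1);       % illustrative non-normal sample
ci = bootci(2000, @mean, x)   % 95% bootstrap CI for the mean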
You may use the method proposed in [1]:
MEDIAN +/- 1.7 * (1.25 * R) / (1.35 * sqrt(N))
where R is the interquartile range and N is the sample size.
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
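A minimal sketch of that notch-style interval in MATLAB (the exprnd sample is again just an illustrative assumption):
x = exprnd(3, 1000, 1);                                   % illustrative skewed sample
N = numel(x);
R = iqr(x);                                               % interquartile range
halfwidth = 1.7 * (1.25 * R) / (1.35 * sqrt(N));
notchCI = [median(x) - halfwidth, median(x) + halfwidth]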
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used MATLAB, but from my understanding of statistics, if your distribution cannot be assumed to be a normal distribution, then you have to treat it as a Student's t distribution and calculate the confidence interval and accuracy accordingly.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
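For what it's worth, a minimal sketch of a t-based 95% interval for the mean along those lines (the exprnd sample is an illustrative assumption):
x = exprnd(3, 1000, 1);                         % illustrative sample
N = numel(x);
se = std(x) / sqrt(N);                          % standard error of the mean
tcrit = tinv(0.975, N - 1);                     % two-sided 95% t critical value
CI = [mean(x) - tcrit*se, mean(x) + tcrit*se]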