I'm new to Matlab and I would appreciate if someone could help.
The problem:
IQ coefficients are Normally distributed with a mean of 100 and a standard deviation of 15. Calculate the probability that a randomly drawn person from this population has an IQ greater than 110 but smaller than 130. You can achieve this using one line of matlab code.
What does this look like?
I tried like this:
>> max(normpdf(linspace(110,130,100),100,15))
ans =
0.0213
But not sure if it is correct..
I would be thankful for any help!
This is most efficiently handled using the normal cumulative distribution function (CDF): the probability of falling between 110 and 130 is the CDF at 130 minus the CDF at 110.
normcdf(130,100,15) - normcdf(110,100,15)
Or if you prefer to manually convert these to "Z" scores then you can use the single argument version of the cdf.
normcdf(30/15) - normcdf(10/15)
In either case the answer is 0.2297, so about 23%.
Let's check with a simulation:
N=1e7; %Number of "experimental" samples
iq = randn(1,N)*15 + 100; %Create a set of IQ values
p = sum(iq>=110 & iq<=130)/N %Determine how many are in range of interest.
This returns a number around 23%.
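It may also help to see why max(normpdf(...)) is not a probability: the probability is the area under the density between 110 and 130, not the height of the curve. A minimal sketch, assuming normpdf from the Statistics Toolbox is available (the variable names are my own):

```matlab
% Area under the N(100, 15^2) density between 110 and 130
p_area = integral(@(x) normpdf(x, 100, 15), 110, 130);
% Largest density value on that interval, which is what max(normpdf(...)) returns
peak = normpdf(110, 100, 15);
fprintf('area = %.4f, peak density = %.4f\n', p_area, peak);
```

The area agrees with the normcdf difference above, while the peak density reproduces the 0.0213 from the question.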
The original question was to model lightbulbs that are used 24/7, where one bulb lasts 25 days on average. A box contains 12 bulbs. What is the probability that the box will last longer than a year?
I had to use MATLAB to model a Gaussian curve based on an exponential variable.
The code below generates a Gaussian model with mean = 300 and std= sqrt(12)*25.
The reason I had to use so many different variables and add them up was because I was supposed to be demonstrating the central limit theorem. The Gaussian curve represents the probability of a box of bulbs lasting for a # of days, where 300 is the average number of days a box will last.
I am having trouble using the gaussian I generated and finding the probability for days >365. The statement 1-normcdf(365,300, sqrt(12)*25) was an attempt to figure out the expected value for the probability, which I got as .2265. Any tips on how to find the probability for days>365 based on the Gaussian I generated would be greatly appreciated.
Thank you!!!
clear all
samp_num=10000000;
param=1/25;
x=zeros(1,samp_num);
for n=1:12
x=x-log(rand(1,samp_num))/param; % add one exponential bulb lifetime per pass
end
mean_x=mean(x);
std_x=std(x);
bin_sizex=.01*10/param;
binsx=[0:bin_sizex:800];
u=hist(x,binsx);
u1=u/samp_num;
1-normcdf(365,300, sqrt(12)*25)
bar(binsx,u1)
legend(['mean=',num2str(mean_x),', std=',num2str(std_x)]);
[f, y]=ecdf(x) will create an empirical cdf for the data in x. The probability you want is 1 minus the value of f at the first point where y reaches 365.
Generate N replicates of x, where N should be several thousand or tens of thousands. Then p-hat = count(x > 365) / N, and has a standard error of sqrt[p-hat * (1 - p-hat) / N]. The larger the number of replications is, the smaller the margin of error will be for the estimate.
When I did this in JMP with N=10,000 I ended up with [0.2039, 0.2199] as a 95% CI for the true proportion of the time that a box of bulbs lasts more than a year. The discrepancy with your value of 0.2265, along with a histogram of the 10,000 outcomes, indicates that the actual distribution is still somewhat skewed. In other words, using a CLT approximation for the sum of 12 exponentials is going to give answers that are slightly off.
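In MATLAB the same estimate could be sketched as below. The sample size N and all variable names are my own choices, and the exact comparison relies on the fact that a sum of 12 independent exponentials with mean 25 follows a Gamma distribution with shape 12 and scale 25:

```matlab
N = 1e5;                               % number of simulated boxes (assumed value)
x = sum(-25 * log(rand(12, N)), 1);    % box lifetime: sum of 12 exponentials with mean 25
p_hat = mean(x > 365);                 % fraction of boxes lasting more than a year
se = sqrt(p_hat * (1 - p_hat) / N);    % standard error of p_hat
ci = p_hat + [-1.96, 1.96] * se;       % approximate 95% confidence interval
% Exact tail probability of the Gamma(12, 25) distribution via the
% regularized upper incomplete gamma function (base MATLAB).
p_exact = gammainc(365 / 25, 12, 'upper');
% CLT approximation, same as 1 - normcdf(365, 300, sqrt(12)*25)
p_clt = 0.5 * erfc((365 - 300) / (sqrt(12) * 25) / sqrt(2));
```

With the Statistics Toolbox, 1 - gamcdf(365, 12, 25) should give the same p_exact. Comparing p_exact with p_clt shows how far the Gaussian approximation is off.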
If I estimate the entropy of a vector of standard normal random variables using the Matlab entropy() function, I get an answer somewhere in the region of 4, whereas the actual entropy should be 0.5 * log(2*pi*e*sigma^2) which is approximately equal to 1.4.
Does anyone know where the discrepancy is coming from?
Note: To save time here is the Matlab code
X = randn(1, 1000); % 1000 standard normal samples
disp('The entropy of X is')
entropy(X)
Please read the help (help entropy) or documentation for entropy. You'll see that it's designed for images and uses a histogram technique rather than calculating it analytically. You'll need to create your own function if you want the formula from Wikipedia, but as the formula is so simple, that should be no problem.
I believe that the reason that you're getting such divergent answers is that entropy scales the bins of the histogram by the number of elements. If you want to use such an estimation technique you'll want to use hist and scale the bins by area. See this StackOverflow question.
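A minimal sketch of such an area-scaled histogram estimate is below. It uses histcounts with 'Normalization','pdf' instead of hist, which is my substitution, and the sample size is an assumed value larger than in the question so the estimate is less noisy:

```matlab
N = 1e5;                                       % sample size (assumed)
X = randn(1, N);                               % standard normal samples
[p, edges] = histcounts(X, 'Normalization', 'pdf');  % bar heights integrate to 1
w = diff(edges);                               % bin widths
nz = p > 0;                                    % skip empty bins to avoid log(0)
H_est = -sum(p(nz) .* log(p(nz)) .* w(nz));    % differential entropy estimate in nats
H_true = 0.5 * log(2 * pi * exp(1));           % analytic value for sigma = 1
```

H_est should land close to H_true. Note the natural log here; a base-2 logarithm would give the answer in bits instead of nats.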
I did calculation and got the following numbers
0.739128438976901 0.739128438976900
I want MATLAB to treat them as equal, but MATLAB decided that the first one was greater than the second. How can I make MATLAB consider them equal?
Thanks
x = 42;
y = 42.00001;
tolerance = 1e-4; % example value
if abs(x-y) < tolerance
% do something
end
The setting for tolerance is up to you.
I don't know a whole lot about Matlab (I'm more of a Mathematica guy myself), but it seems there is a roundn(x,n) function which rounds an element x to the nearest multiple of 10^n. Perhaps this could be used here.
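Applied to the two numbers from the question, the tolerance and rounding ideas might look like the sketch below. The tolerance of 1e-12 and the variable names are assumptions, and base-MATLAB round is used rather than roundn:

```matlab
a = 0.739128438976901;
b = 0.739128438976900;
tol = 1e-12;                                          % assumed tolerance
abs_equal = abs(a - b) < tol;                         % absolute tolerance
rel_equal = abs(a - b) <= tol * max(abs(a), abs(b));  % relative tolerance
rounded_equal = round(a * 1e12) == round(b * 1e12);   % compare after rounding to 12 decimals
```

The tolerance tests are usually the safer choice: two nearly identical values can still sit on opposite sides of a rounding boundary and round to different results.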
The radii r are drawn from a cut-off log-normal distribution, which has the following probability density function:
pdf=((sqrt(2).*exp(-0.5*((log(r/rch)).^2)))./((sqrt(pi.*(sigma_nd.^2))...
.*r).*(erf((log(rmax/rch))./sqrt(2.*(sigma_nd.^2)))-erf((log(rmin/rch))./sqrt(2.*(sigma_nd.^2))))));
rch, sigma_nd, rmax, and rmin are all constants.
I found the explanation from the net, but it seems difficult to find its integral and then take inverse in Matlab.
I haven't checked carefully, but my first instinct is that log(r/rch) follows a truncated normal distribution with limits of log(rmin/rch) and log(rmax/rch). So you can generate the appropriate truncated normal random variable, say y, then r = rch * exp(y).
You can generate truncated normal random variables by generating the untruncated values and redrawing those that fall outside the limits. Alternatively, you can do it using the CDF, as described by @PengOne, which you can find on the Wikipedia page.
I'm (still) not sure that your p.d.f. is completely correct, but what's most important here is the distribution.
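A sketch of the CDF route, assuming that log(r/rch) is normal with mean 0 and standard deviation sigma_nd before truncation (which may not match the posted p.d.f. exactly). The numeric constants are hypothetical, and erf/erfinv are used so that no toolbox is needed:

```matlab
% Hypothetical constants, chosen only for illustration
rch = 1.0; sigma_nd = 0.5; rmin = 0.2; rmax = 5.0;
n = 1e4;                                        % number of radii to draw
Phi = @(z) 0.5 * (1 + erf(z / sqrt(2)));        % standard normal CDF
PhiInv = @(u) sqrt(2) * erfinv(2 * u - 1);      % inverse of the standard normal CDF
a = log(rmin / rch) / sigma_nd;                 % standardized lower limit
b = log(rmax / rch) / sigma_nd;                 % standardized upper limit
u = Phi(a) + (Phi(b) - Phi(a)) * rand(n, 1);    % uniform on [Phi(a), Phi(b)]
y = sigma_nd * PhiInv(u);                       % truncated normal values
r = rch * exp(y);                               % truncated log-normal radii
```

Every draw is used, so unlike the redraw approach there is no wasted sampling even when the limits cut off most of the distribution.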
If your PDF is continuous, then you can integrate to get a CDF, then find the inverse of the CDF and evaluate that at the random value.
If your PDF is only available as sampled values, then you can get a discrete CDF using cumsum, and use that as your initial X value in interp1(), with the initial Y value being the points the PDF was sampled at, and asking to interpolate at your array of rand() numbers.
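The interpolation idea could be sketched as follows. The grid and the example density are hypothetical, cumtrapz stands in for cumsum, and the call to unique is there because interp1 needs distinct sample points:

```matlab
% Hypothetical tabulated PDF on a grid; any nonnegative values would do
r = linspace(0.2, 5, 2000);
pdf = exp(-0.5 * (log(r) / 0.5).^2) ./ r;   % unnormalized example density
cdf = cumtrapz(r, pdf);                     % numerical integral of the PDF
cdf = cdf / cdf(end);                       % normalize so the CDF ends at 1
[cdf_u, idx] = unique(cdf);                 % drop repeated CDF values
u = rand(1e4, 1);                           % uniform random numbers
samples = interp1(cdf_u, r(idx), u);        % inverse CDF by linear interpolation
```

The CDF values play the role of the x data and the grid points the role of the y data, which is what inverts the CDF.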
It's not clear what your variable is, but I'm assuming it's r.
The simplest way to do this is, as Chris noted, to first get the cdf (note that if r starts at 0, pdf(1) is NaN... change it to 0):
cdf = cumtrapz(pdf);
cdf = cdf / cdf(end);
then spawn a uniform distribution (size_dist indicating the number of elements):
y = rand (size_dist,1);
followed by a method that maps each uniform value to a position along the cdf. Any technique will work, but here is the simplest (albeit inelegant):
x = zeros(size_dist,1);
for i = 1:size_dist
x(i) = find( y(i)<= cdf,1);
end
and finally, returning to the original pdf. Use matlab numerical indexing to reverse course. Note: use r and not pdf:
pdfHist = r(x);
hist (pdfHist,50)
Probably overkill for your distribution, but you can always write a Metropolis sampler.
On the other hand, the implementation is straightforward, so you'd have your sampler very quickly.
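For reference, a random-walk Metropolis sampler can be sketched in a few lines. The target below is a hypothetical unnormalized cut-off log-normal density, and the constants, chain length, proposal width and burn-in are all assumed values:

```matlab
% Hypothetical unnormalized target density, zero outside [rmin, rmax]
rmin = 0.2; rmax = 5; sigma_nd = 0.5; rch = 1;
target = @(r) (r >= rmin & r <= rmax) .* ...
    exp(-0.5 * (log(max(r, realmin) / rch) / sigma_nd).^2) ./ max(r, realmin);
n = 2e4; step = 0.5;                          % chain length and proposal width (assumed)
r = zeros(n, 1); r(1) = rch;                  % start inside the support
for k = 2:n
    prop = r(k-1) + step * randn;             % symmetric random-walk proposal
    if rand < target(prop) / target(r(k-1))   % accept with probability min(1, ratio)
        r(k) = prop;
    else
        r(k) = r(k-1);                        % reject: repeat the current value
    end
end
r = r(1001:end);                              % discard burn-in samples
```

Only ratios of the target are needed, so the normalizing erf terms can be dropped. The price is that consecutive samples are correlated.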
First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed? I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that the mean of 1000 samples should be approximately normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example, say we wanted a 95% CI for the mean of a sample of data X, then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing if a data sample is normally distributed can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is consistent with a normal distribution. If the result is clearly not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
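For a quantitative check rather than a visual one, a formal normality test is one option. A minimal sketch, assuming the Statistics Toolbox function lillietest is available (jbtest would be used the same way); the variable names are my own:

```matlab
X_exp = exprnd(3, 1000, 1);              % non-normal sample
X_norm = randn(1000, 1);                 % normal sample
[h_exp, p_exp] = lillietest(X_exp);      % h = 1: normality rejected at the 5% level
[h_norm, p_norm] = lillietest(X_norm);   % h = 0: no evidence against normality
```

For the exponential sample h_exp should come out as 1. For the normal sample h_norm will usually be 0, although any 5% test rejects a truly normal sample about one time in twenty.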
You might consider, also, using bootstrapping, with the bootci function.
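A minimal sketch of what that could look like, assuming the Statistics Toolbox; the 2000 resamples and the choice of statistics are my own:

```matlab
X = exprnd(3, 1000, 1);              % same kind of non-normal sample as above
ci_mean = bootci(2000, @mean, X);    % 95% bootstrap CI for the mean (default alpha = 0.05)
ci_med = bootci(2000, @median, X);   % the same call works for other statistics
```

The bootstrap makes no normality assumption, which is why it suits statistics such as the median where no simple formula is at hand.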
You may use the method proposed in [1]:
MEDIAN +/- 1.7 * (1.25 * R / (1.35 * SQN))
where R = interquartile range,
SQN = square root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
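In MATLAB the notch interval could be computed as in the sketch below. The example data and variable names are assumptions; only the formula comes from the text above:

```matlab
X = exprnd(3, 1000, 1);                       % example non-normal data (assumed)
N = numel(X);                                 % sample size
q = prctile(X, [25 75]);                      % lower and upper quartiles
R = q(2) - q(1);                              % interquartile range
half = 1.7 * (1.25 * R / (1.35 * sqrt(N)));   % notch half-width
notch = median(X) + [-1, 1] * half;           % approximate 95% interval for the median
```

The constants combine to about 1.57 * R / sqrt(N), so the interval narrows with the square root of the sample size.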
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
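I have not used Matlab, but from my understanding of statistics, when the population standard deviation is unknown and has to be estimated from the sample, the confidence interval for the mean is based on the Student t distribution rather than the normal distribution. This still assumes that the sample mean is approximately normal, and with 1000 samples the t and normal intervals are nearly identical.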
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm