How do I compute a PMF and CDF for a binomial distribution in MATLAB? - matlab

I need to calculate the probability mass function, and cumulative distribution function, of the binomial distribution. I would like to use MATLAB to do this (raw MATLAB, no toolboxes). I can calculate these myself, but was hoping to use a predefined function and can't find any. Is there something out there?
function x = homebrew_binomial_pmf(N,p)
x = [1];
for i = 1:N
x = [0 x]*p + [x 0]*(1-p);
end

You can use the function NCHOOSEK to compute the binomial coefficient. With that, you can create a function that computes the value of the probability mass function for a set of k values for a given N and p:
function pmf = binom_dist(N,p,k)
nValues = numel(k);
pmf = zeros(1,nValues);
for i = 1:nValues
pmf(i) = nchoosek(N,k(i))*p^k(i)*(1-p)^(N-k(i));
end
end
To plot the probability mass function, you would do the following:
k = 0:40;
pmf = binom_dist(40,0.5,k);
plot(k,pmf,'r.');
and the cumulative distribution function can be found from the probability mass function using CUMSUM:
cummDist = cumsum(pmf);
plot(k,cummDist,'r.');
NOTE: When the binomial coefficient returned from NCHOOSEK is large you can end up losing precision. A very nice alternative is to use the submission Variable Precision Integer Arithmetic from John D'Errico on the MathWorks File Exchange. By converting your numbers to his vpi type, you can avoid the precision loss.
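Another toolbox-free way to sidestep the precision loss is to stay in log space with gammaln, which is part of base MATLAB. The following is only a minimal sketch, assuming 0 < p < 1 (at p = 0 or p = 1 the 0*log(0) terms would produce NaN); the variable names are illustrative:

```matlab
% Binomial PMF and CDF computed in log space, no loop and no nchoosek.
% Assumes 0 < p < 1.
N = 40;
p = 0.5;
k = 0:N;

% log(nchoosek(N,k)) expressed through the log-gamma function
logCoeff = gammaln(N+1) - gammaln(k+1) - gammaln(N-k+1);

% log of the PMF, exponentiated only at the very end
pmf = exp(logCoeff + k.*log(p) + (N-k).*log(1-p));

% CDF as the running sum of the PMF over k = 0..N
cdf = cumsum(pmf);
```

Because the large coefficient is never formed explicitly, this should not trigger the nchoosek precision warning, and it is vectorized over k.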

Octave provides a good collection of distribution functions (pdf, cdf, quantile). They have to be translated from Octave, but this is relatively trivial (convert endif to end, != to ~=, etc.). See, e.g., Octave's binocdf for the binomial CDF function.

It looks like, for the CDF of the binomial distribution, my best bet is the incomplete beta function, betainc.
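For reference, the identity behind this is P(X <= k) = I_{1-p}(N-k, k+1) for 0 <= k < N, where I is the regularized incomplete beta function that betainc computes. A rough sketch, with illustrative variable names and the k = N case handled separately (the CDF is 1 there):

```matlab
% Binomial CDF through the regularized incomplete beta function.
N = 40;
p = 0.5;
k = 0:N;

cdf = ones(size(k));          % P(X <= N) = 1
inside = k < N;               % the identity is applied only for k < N
cdf(inside) = betainc(1-p, N-k(inside), k(inside)+1);
```

This evaluates each CDF value directly rather than summing PMF terms, which may help when only a few values of k are needed.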

For the PMF (note that binopdf and binocdf require the Statistics and Machine Learning Toolbox):
x=1:15
p=.45
c=binopdf(x,15,p)
plot(x,c)
Similarly CDF
D=binocdf(x,15,p)
plot(x,D)

Related

Plot log(n over k)

I've never used Matlab before and I really don't know how to fix the code. I need to plot log(1000 over k) with k going from 1 to 1000.
y = @(x) log(nchoosek(1000,x));
fplot(y,[1 1000]);
Error:
Warning: Function behaves unexpectedly on array inputs. To improve performance, properly
vectorize your function to return an output with the same size and shape as the input
arguments.
In matlab.graphics.function.FunctionLine>getFunction
In matlab.graphics.function.FunctionLine/updateFunction
In matlab.graphics.function.FunctionLine/set.Function_I
In matlab.graphics.function.FunctionLine/set.Function
In matlab.graphics.function.FunctionLine
In fplot>singleFplot (line 241)
In fplot>@(f)singleFplot(cax,{f},limits,extraOpts,args) (line 196)
In fplot>vectorizeFplot (line 196)
In fplot (line 166)
In P1 (line 5)
There are several problems with the code:
nchoosek does not vectorize on the second input, that is, it does not accept an array as input. fplot works faster for vectorized functions. Otherwise it can be used, but it issues a warning.
The result of nchoosek is close to overflowing for such large values of the first input. For example, nchoosek(1000,500) gives 2.702882409454366e+299, and issues a warning.
nchoosek expects integer inputs. fplot uses in general non-integer values within the specified limits, and so nchoosek issues an error.
You can solve these three issues exploiting the relationship between the factorial and the gamma function and the fact that Matlab has gammaln, which directly computes the logarithm of the gamma function:
n = 1000;
y = @(x) gammaln(n+1)-gammaln(x+1)-gammaln(n-x+1);
fplot(y,[1 1000]);
Note that you get a plot with y values for all x in the specified range, but actually the binomial coefficient is only defined for non-negative integers.
OK, since you've gotten spoilers for your homework exercise anyway now, I'll post an answer that I think is easier to understand.
The multiplicative formula for the binomial coefficient says that
(n over k) = product_{i=1 to k} ( (n+1-i)/i )
(sorry, no way to write proper formulas on SO, see the Wikipedia link if that was not clear).
To compute the logarithm of a product, we can compute the sum of the logarithms:
log(product(x_i)) = sum(log(x_i))
Thus, we can compute the values of (n+1-i)/i for all i, take the logarithm, and then sum up the first k values to get the result for a given k.
This code accomplishes that using cumsum, the cumulative sum. Its output at array element k is the sum over all input array elements from 1 to k.
n = 1000;
i = 1:1000;
f = (n+1-i)./i;
f = cumsum(log(f));
plot(i,f)
Note also ./, the element-wise division. / performs a matrix division in MATLAB, and is not what you need here.
Using a symbolic variable (syms, which requires the Symbolic Math Toolbox) reproduces exactly what you want:
syms x
y = log(nchoosek(1000,x));
fplot(y,[1 1000]);
This solution uses arrayfun to deal with the fact that nchoosek(n,k) requires k to be a scalar. This approach requires no toolboxes.
Also, this uses plot instead of fplot, since this clever answer already addresses how to do it with fplot.
% MATLAB R2017a
n = 1000;
fh = @(k) log(nchoosek(n,k));
K = 1:1000;
V = arrayfun(fh,K); % calls fh on each element of K and return all results in vector V
plot(K,V)
Note that for some values of k greater than or equal to 500, you will receive the warning
Warning: Result may not be exact. Coefficient is greater than 9.007199e+15 and is only accurate to 15 digits
because nchoosek(1000,500) = 2.7029e+299. As pointed out by @Luis Mendo, this is due to realmax = 1.7977e+308, which is the largest floating-point number supported. See here for more info.

How to simulate the Hypergeometric distribution in Matlab

I want to simulate, in MATLAB, the hypergeometric distribution with the probability mass function and parameters described here: https://en.wikipedia.org/wiki/Hypergeometric_distribution
How can I code that while producing random numbers from a uniform distribution?
Most sensible would be to use the builtin hypergeometric generator.
If you have to do this for an assignment or some other arbitrary reason, the generic solution when an inverse CDF exists is to do inversion—use the uniform generator to create a p-value (a value between 0 and 1), and plug that into the inverse CDF. Since Matlab provides an inverse CDF function, this should be straightforward.
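If no toolbox is available, the inversion idea can still be sketched in base MATLAB by building the CDF from the PMF and searching it with one uniform draw. This is only an illustration: the parameter names N, K, n follow the Wikipedia notation, and the helper logC is a hypothetical name introduced here.

```matlab
% Inversion sampling for the hypergeometric distribution, base MATLAB only.
N = 10;   % population size
K = 3;    % number of success states in the population
n = 5;    % number of draws

% Support of the distribution
k = max(0, n-(N-K)) : min(n, K);

% log of the binomial coefficient via gammaln (hypothetical helper)
logC = @(a,b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);

% PMF: C(K,k) * C(N-K,n-k) / C(N,n), then CDF by cumulative sum
pmf = exp(logC(K,k) + logC(N-K,n-k) - logC(N,n));
cdf = cumsum(pmf);

% One uniform draw, mapped to the first k whose CDF reaches it
u = rand;
idx = find(cdf >= u, 1, 'first');
if isempty(idx)
    idx = numel(k);   % guard against rounding leaving cdf(end) slightly below 1
end
sample = k(idx);
```

Repeating the last block (or drawing a vector of uniforms) would give as many samples as needed.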
This is easily done with the randperm function, which generates a sample without replacement.
Let the distribution parameters be defined as follows:
N = 10; % population size
K = 3; % number of success states in the population
n = 5; % number of draws
Then the variable k obtained as
k = sum(randperm(N,n)<=K);
has a hypergeometric distribution with parameters N,K,n.
If you really need to use a uniform random number generator (rand function):
[~, x] = sort(rand(1,N));
x = x(1:n); % this gives the same result as randperm(N,n)
k = sum(x<=K);
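As a quick sanity check, one could repeat the randperm draw many times and compare the sample mean with the theoretical mean n*K/N. A small sketch, where numTrials is an arbitrary choice:

```matlab
% Empirical check of the randperm-based hypergeometric sampler.
N = 10;   % population size
K = 3;    % number of success states in the population
n = 5;    % number of draws
numTrials = 1e4;

k = zeros(numTrials,1);
for t = 1:numTrials
    % count how many of the n drawn items are among the first K (the successes)
    k(t) = sum(randperm(N,n) <= K);
end

empiricalMean   = mean(k);
theoreticalMean = n*K/N;   % mean of the hypergeometric distribution
```

With this many trials the two means should agree to roughly two decimal places.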

Generating a random number based off normal distribution in matlab

I am trying to generate a random number based off of normal distribution traits that I have (mean and standard deviation). I do NOT have the Statistics and Machine Learning toolbox.
I know one way to do it would be to randomly generate a random number r from 0 to 1 and find the value that gives a probability of that random number. I can do this by entering the standard normal function
f = @(y) (1/(1*2.50663))*exp(-((y).^2)/(2*1^2))
and solving for
r=integral(f,-Inf,z)
and then extrapolating from that z-value to the final answer X with the equation
z=(X-mew)/sigma
But as far as I know, there is no matlab command that allows you to solve for x where x is the limit of an integral. Is there a way to do this, or is there a better way to randomly generate this number?
You can use the built-in randn function which yields random numbers pulled from a standard normal distribution with a zero mean and a standard deviation of 1. To alter this distribution, you can multiply the output of randn by your desired standard deviation and then add your desired mean.
% Define the distribution that you'd like to get
mu = 2.5;
sigma = 2.0;
% You can generate a matrix of values of any size
sz = [10000 1];
value = (randn(sz) * sigma) + mu;
% mean(value)
% 2.4696
%
% std(value)
% 1.9939
If you just want a single number from the distribution, you can use the no-input version of randn to yield a scalar
value = (randn * sigma) + mu;
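The inversion approach described in the question is also possible without solving the integral numerically, because the inverse of the standard normal CDF can be written with erfinv, which is in base MATLAB. A minimal sketch, reusing the mu and sigma values from above as example inputs:

```matlab
% Inverse-transform sampling of a normal distribution from uniform draws.
mu = 2.5;
sigma = 2.0;

r = rand(10000,1);                 % uniform numbers on (0,1)
z = sqrt(2) * erfinv(2*r - 1);     % inverse standard normal CDF applied to r
x = mu + sigma*z;                  % rescale to the desired mean and std
```

In practice randn is simpler and faster; this is mainly useful when the sample has to be driven by a given uniform stream.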
Just for the fun of it, you can generate a Gaussian random variable using a uniform random generator:
The negative logarithm of a uniform random variable on (0,1) has an exponential distribution
The square root of that has a Rayleigh distribution
Multiply by the cosine (or sine) of a uniform random variable on (0,2*pi) and the result is Gaussian. You need to multiply by sqrt(2) to normalize.
The obtained Gaussian variable is normalized (zero mean, unit standard deviation). If you need specific mean and standard deviation, multiply by the latter and then add the former.
Example (normalized Gaussian):
m = 1; n = 1e5; % desired output size
x = sqrt(-2*log(rand(m,n))).*cos(2*pi*rand(m,n));
Check:
>> mean(x)
ans =
-0.001194631660594
>> std(x)
ans =
0.999770464360453
>> histogram(x,41)

How to evaluate empirical cdf at given points in Matlab?

Suppose I have a sequence of scalar points drawn from an unknown distribution.
From the sequence of points, we can get the empirical cdf.
I was wondering if there is some way in Matlab to evaluate this empirical cdf at any point? For example, evaluate it at the same sequence of points that are used to build the empirical cdf?
I have looked up the function ecdf at http://www.mathworks.com/help/stats/ecdf.html. Its usage is [f,x] = ecdf(y), where the empirical cdf from data y is evaluated at x, but x doesn't seem to be specifiable.
Thanks and regards!
Assuming you have the output of the function, two vectors f and x, and you want to find the empirical cdf at point x_of_interest, this is what you can do:
max(f(x<=x_of_interest))
Or maybe you want to use min and >=, but I think the above formula is correct.
It seems like x are unique points in y with their CDF.
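If the toolbox function is not needed at all, the usual definition cdf(q) = Prob(X <= q) can be evaluated directly at arbitrary query points. A small sketch, where y and xq are made-up example data:

```matlab
% Empirical CDF evaluated at arbitrary points, base MATLAB only.
y  = [3 2 4 2 1];            % observed data
xq = [0 1 2 2.5 4 5];        % points at which to evaluate the empirical CDF

% Fraction of observations less than or equal to each query point
F = arrayfun(@(q) mean(y <= q), xq);
% F = [0 0.2 0.6 0.6 1 1]
```

Passing xq = y evaluates the empirical CDF at the data points themselves, in their original order.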
I'm not sure that's what you meant, but I also needed to convert a vector of data values to the corresponding vector of empirical CDFs, with the same ordering.
Actually, instead of the usual definition cdf(x) = Prob(X <= x) I prefer the more symmetric definition cdf(x) = Prob(X < x) + 1/2 * Prob(X == x) , which is more suitable to cases with ties. Now the computation is a one-liner, but with the help of the tiedrank() function from the statistical toolbox:
cdf = (tiedrank(data) - 1/2) / length(data) ;
For example,
data = [3 2 4 2 1] ;
yields
cdf = [0.7 0.4 0.9 0.4 0.1] ;
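The same symmetric definition can be computed without tiedrank by counting directly. A rough toolbox-free sketch (quadratic in the data length, so mainly suited to small vectors; the name cdfSym is illustrative):

```matlab
% Symmetric empirical CDF: Prob(X < x) + 1/2 * Prob(X == x), without tiedrank.
data = [3 2 4 2 1];

cdfSym = arrayfun(@(v) (sum(data < v) + 0.5*sum(data == v)) / numel(data), data);
% cdfSym = [0.7 0.4 0.9 0.4 0.1]
```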
A good approach might be to use interpolation to find the closest "x" for each point you want to evaluate, and the relative "f" value. You can find how to use interp1 for this purpose here:
determining-the-value-of-ecdf-at-a-point-using-matlab

pdf of a particular distribution

I am new to Matlab. I would like to check the so-called "logarithmic law" for the determinant of random matrices with Matlab, but I still do not know how.
Logarithmic law:
Let A be a random Bernoulli matrix (entries are iid, taking the values +1 and -1 with probability 1/2 each) of size n by n. We want to compare the probability density function of (log(det(A^2))-log(factorial(n-1)))/sqrt(2*log(n)) with the pdf of the standard Gaussian distribution. The logarithmic law says that the former approaches the latter as n approaches infinity.
My Matlab task is very simple: check the comparison for, say n=100. Anyone knows how to do so?
Thanks.
Consider the following experiment:
n = 100; %# matrix size
num = 1000; %# number of matrices to generate
detA2ln = zeros(num,1);
for i=1:num
A = randi([0 1],[n n])*2 - 1; %# -1,+1
detA2ln(i) = log(det(A^2));
end
%# `gammaln(n)` is more accurate than `log(factorial(n-1))`
myPDF = ( detA2ln - gammaln(n) ) ./ sqrt(2*log(n));
normplot(myPDF)
Note that for large matrices, the determinant of A*A will be too large to represent as a double and will return Inf. However, we only require the log of the determinant, and there exist other approaches to find this result that keep the computation in log scale.
In the comments, @yoda suggested using the eigenvalues, detA2(i) = real(sum(log(eig(A^2))));. I also found a submission on FEX that has a similar implementation (using LU or Cholesky decomposition).
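One way to keep the computation in log scale is through an LU factorization: since |det(A)| is the product of the absolute diagonal entries of U, log(det(A^2)) equals twice the sum of their logs. This is only a sketch of that idea, not the FEX submission itself, and the variable names are illustrative:

```matlab
% log(det(A^2)) computed in log scale via LU, avoiding overflow in det.
n = 100;
A = randi([0 1],[n n])*2 - 1;     % random matrix with entries -1 or +1

[~, U] = lu(A);                    % A = L*U with L a permuted lower-triangular matrix
logDetA2 = 2 * sum(log(abs(diag(U))));   % log(det(A)^2) = 2*log|det(A)|
```

If A happens to be singular, U has a zero on its diagonal and the result is -Inf, which matches log(0).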