I have a probability distribution that defines the probability of occurrence of n possible states.
I would like to calculate the value of Shannon's entropy, in bits, of the given probability distribution.
Can I use wentropy(x,'shannon') to get the value and if so where can I define the number of possible states a system has?
Since you already have the probability distribution, call it p, you can compute the Shannon entropy directly from its definition instead of using wentropy:
H = sum(-(p(p>0).*(log2(p(p>0)))));
This gives the entropy H in bits.
p must sum to 1.
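As a quick check of the formula, here is a minimal sketch using a made-up four-state distribution (the vector p below is an illustrative assumption, not data from the question). The number of states is simply the length of p, so it does not need to be specified separately; zero-probability states are skipped because they contribute nothing to the sum.

```matlab
% Hypothetical distribution over n = 4 states (illustrative values only)
p = [0.5 0.25 0.25 0];
n = numel(p);                        % number of possible states
assert(abs(sum(p) - 1) < 1e-12);     % p must sum to 1
nz = p(p > 0);                       % drop zero-probability states
H = -sum(nz .* log2(nz));            % entropy in bits; here 1.5
Hmax = log2(n);                      % upper bound, reached by a uniform p; here 2
```

The entropy of any distribution over n states lies between 0 and log2(n) bits.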
I have a binary vector V, in which each entry describes success (1) or failure (0) in the relevant trial out of a whole session.
(the length of the vector denotes the number of trials in the session).
I can easily calculate the success rate of the session (by taking the mean of the vector i.e. (sum(V)/length(V))).
However I also need to know the variance or std of each session.
In order to calculate that, is it OK to use the Matlab std function (i.e. to take std(V)/length(V))?
Or, should I use something which is specifically suited for the binomial distribution?
Is there a Matlab std (or variance) function which is specific for a "success/failure" distribution?
Thanks
If you satisfy the assumptions of the Binomial distribution,
a fixed number of n independent Bernoulli trials,
each with constant success probability p,
then I'm not sure that is necessary, since the parameters n and p are available from your data.
Note that we model number of successes (in n trials) as a random variable distributed with the Binomial(n,p) distribution.
n = length(V);
p = mean(V); % equivalently, sum(V)/length(V)
% the mean is the maximum likelihood estimator (MLE) for p
% note: a large n or replication is needed to estimate the true p accurately
Then the standard deviation of the number of successes in n independent Bernoulli trials with constant success probability p is just sqrt(n*p*(1-p)).
Of course you can assess this from your data if you have multiple samples. Note this is different from std(V). In your data formatting, it would require having multiple vectors, V1, V2, V3, etc. (replication); the sample standard deviation of the number of successes would then be obtained from the following.
% Given V1, V2, V3 sets of Bernoulli trials
std([sum(V1) sum(V2) sum(V3)])
If you already know your parameters n and p, you can obtain it easily enough.
n = 10;
p = 0.65;
pd = makedist('Binomial','N',n,'p',p) % parameters are passed as name-value pairs
std(pd) % 1.5083
or
sqrt(n*p*(1-p)) % 1.5083
as discussed earlier.
Does the standard deviation increase with n ?
The OP has asked:
Something is bothering me: if std = sqrt(n*p*(1-p)), then it increases with n. Shouldn't the std decrease when n increases?
Confirmation & Derivation:
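Definitions: let X_1, ..., X_n be independent Bernoulli(p) trials and let Y = X_1 + ... + X_n be the number of successes, so that Y ~ Binomial(n,p).
Then we know that E[X_i] = p and Var(X_i) = p*(1-p), and, by independence, Var(Y) = Var(X_1) + ... + Var(X_n) = n*p*(1-p).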
Then just from definitions of expectation and variance we can show the variance (similarly for standard deviation if you add the square root) increases with n.
Since the square root is a non-decreasing function, we know the same relationship holds for the standard deviation.
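A small numerical sketch may help separate the two quantities that are easy to confuse here: the standard deviation of the number of successes, which grows with n, and the standard deviation of the success rate (the count divided by n), which is sqrt(p*(1-p)/n) and shrinks with n. The values of p and n below are arbitrary illustrative choices.

```matlab
p = 0.65;                               % assumed success probability
n = [10 100 1000];                      % assumed session lengths

sd_count = sqrt(n .* p .* (1 - p));     % std of the number of successes
sd_rate  = sd_count ./ n;               % std of the success rate, sqrt(p*(1-p)/n)

% sd_count increases with n, sd_rate decreases with n
disp([n; sd_count; sd_rate]);
```

So the intuition that uncertainty should shrink with more trials applies to the estimated success rate, not to the raw count.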
Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10th, 20th, ..., 90th percentiles of D. Find the median of the points falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the midpoints of the bins
supp = (edges(1:end-1) + edges(2:end)).' / 2;
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable as the running sum of the bin probabilities
CDF_discrete = cumsum(pmass(:));
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after, but maybe it can help you. For a continuous probability distribution, P(X = x) = 0 for any single point x: the pointwise probability of X taking the value x is infinitesimally small and is therefore regarded as 0.
What you could do instead, in order to approximate it to a discrete probability space, is to define some points (x_1, x_2, ..., x_n), and let their discrete probabilities be the integral of some range of the PDF (from your continuous probability distribution), that is
P(x_1) = P(X \in (-infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +infty))
:-)
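The idea above can be sketched for a standard normal distribution as follows. This is only one possible discretisation under stated assumptions: the interior bin boundaries (cuts) are chosen as equally spaced points on [-2, 2], and each support point is taken to be the conditional median of its bin. The helper names Phi and PhiInv are introduced here for illustration; they build the normal CDF and its inverse from erfc and erfcinv so that no toolbox is needed.

```matlab
r = 10;                                    % number of support points

% Standard normal CDF and inverse CDF from base functions
Phi    = @(x) 0.5 * erfc(-x / sqrt(2));
PhiInv = @(u) -sqrt(2) * erfcinv(2 * u);

cuts  = linspace(-2, 2, r - 1);            % assumed interior bin boundaries
F     = [0, Phi(cuts), 1];                 % CDF at -Inf, at each cut, and at +Inf
pmass = diff(F);                           % P(X in bin j), j = 1..r

% Support point of bin j: the value that splits the bin's probability in half
a = PhiInv((F(1:end-1) + F(2:end)) / 2);
```

Because the masses come from the exact CDF rather than from simulated draws, they sum to 1 up to round-off, and the two outer bins absorb the tails.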
I’d like to be able to generate in MatLab a sequence of N pseudo-random numbers with a Poisson distribution having mean M. The sum of the N numbers should be T. N, M, and T are always positive or zero and would be user specified parameters to any function.
Obviously, if T is small relative to N it is likely that there will be problems achieving a total of T. In that case the function could just return the values T and then N-1 zeros or an error code. However, it is highly likely that in most cases T>>N.
I have been trying variations based on the method of generating random numbers with a given distribution provided at http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution and trying various normalizations at each step but have not been successful.
You could try to approximate what you want by using the multinomial distribution.
In Wikipedia's notation, k=N, n=T and pi=M/T (the probability of category i). The Poisson distribution has the distinctive property that its mean equals its variance; if your parameters are such that pi is small, then the multinomial mean n*pi is quite close to the variance n*pi*(1-pi). The sum is automatically equal to T, by the definition of the multinomial distribution.
Multinomial sampling in Matlab is done using the mnrnd function.
UPDATE
With regard to the comment, let's consider N sampled values vi and write their sum
Sum(i=1...N) vi = T
Let's compute the mean value of the left and right sides of this equation.
Sum(i=1...N) E(vi) = E(T) = T
On the right side, mean value of constant is constant itself. On the left side we have
Sum(i=1...N) E(vi) = Sum(i=1...N) M = N*M = T
Therefore, M=T/N and pi=M/T=1/N.
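A minimal sketch of this approach without any toolbox, assuming equal category probabilities 1/N as derived above: each of the T units is assigned to one of N slots uniformly at random, and the slot counts form one multinomial draw. The values of N and T are hypothetical.

```matlab
N = 5;       % hypothetical number of values to generate
T = 1000;    % hypothetical required total

idx = randi(N, T, 1);              % slot chosen for each of the T units
v   = accumarray(idx, 1, [N 1]);   % counts per slot: one multinomial(T, 1/N) draw

% With the Statistics and Machine Learning Toolbox, mnrnd(T, ones(1,N)/N)
% draws from the same distribution.
```

The counts always sum to T, and their sample mean is exactly T/N. They are only approximately Poisson, and the approximation improves as N grows (so that 1/N is small).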
I'm building on my previous question because there is a further issue.
I have fitted in Matlab a normal distribution to my data vector: PD = fitdist(data,'normal'). Now I have a new data point coming in (e.g. x = 0.5) and I would like to calculate its probability.
Using cdf(PD,x) will not work because it gives the probability that the point is smaller than or equal to x (not exactly x). Using pdf(PD,x) gives just the density, not the probability, and so it can be greater than one.
How can I calculate the probability?
If the distribution is continuous then the probability of any point x is 0, almost by definition of continuous distribution. If the distribution is discrete and, furthermore, the support of the distribution is a subset of the set of integers, then for any integer x its probability is
cdf(PD,x) - cdf(PD,x-1)
More generally, for any random variable X which takes on integer values, the probability mass function f(x) and the cumulative distribution F(x) are related by
f(x) = F(x) - F(x-1)
The right hand side can be interpreted as a discrete derivative, so this is a direct analog of the fact that in the continuous case the pdf is the derivative of the cdf.
I'm not sure if matlab has a more direct way to get at the probability mass function in your situation than going through the cdf like that.
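The relation f(x) = F(x) - F(x-1) can be checked numerically. The sketch below assumes a Binomial(10, 0.65) variable purely as an example of an integer-valued distribution, builds its probability mass function from the formula, and recovers it again by differencing the cumulative sums.

```matlab
n = 10; p = 0.65;                  % assumed example parameters
k = 0:n;                           % support of the distribution

% Probability mass function from the binomial formula
pmf = arrayfun(@(j) nchoosek(n, j) * p^j * (1 - p)^(n - j), k);

F = cumsum(pmf);                   % cumulative distribution at k = 0..n
f = diff([0, F]);                  % f(x) = F(x) - F(x-1), taking F(-1) = 0

max_err = max(abs(f - pmf));       % agreement up to round-off
```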
In the continuous case, your question doesn't make a lot of sense since, as I said above, the probability is 0. Non-zero probability in this case is something that attaches to intervals rather than individual points. You still might want to ask for the probability of getting a value near x -- but then you have to decide on what you mean by "near". For example, if x is an integer then you might want to know the probability of getting a value that rounds to x. That would be:
cdf(PD, x + 0.5) - cdf(PD, x - 0.5)
Let's say you have a random variable X that follows the normal distribution with mean mu and standard deviation s.
Let F be the cumulative distribution function for the normal distribution with mean mu and standard deviation s. The probability that the random variable X falls between a and b is P(a < X <= b) = F(b) - F(a).
In Matlab code:
P_a_b = normcdf(b, mu, s) - normcdf(a, mu, s);
Note: observe that the probability that X is exactly equal to 0.5 (or any specific value) is zero! A range of outcomes will have positive probability, but any finite or countable collection of individual outcomes still has probability zero.
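To illustrate the point numerically, here is a small sketch that avoids normcdf by writing the normal CDF with the base function erfc. The parameters mu and s, the interval endpoints, and the half-width d are all illustrative assumptions.

```matlab
mu = 0; s = 1;                                      % assumed parameters
F = @(x) 0.5 * erfc(-(x - mu) / (s * sqrt(2)));     % normal CDF via erfc

P_a_b = F(1) - F(-1);            % P(-1 < X <= 1), about 0.6827

x = 0.5; d = 1e-3;               % a narrow window around x
P_near = F(x + d) - F(x - d);    % positive, but shrinks to 0 as d -> 0
```

For small d, the window probability is roughly the density at x times the window width 2*d, which is how the density and the probability are connected.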
I know for a random variable x that P(x=i) for each i=1,2,...,100. Then how may I sample x by a multinomial distribution, based on the given P(x=i) in Matlab?
I am allowed to use the Matlab built-in commands rand and randi, but not mnrnd.
In general, you can sample numbers from any one-dimensional probability distribution X using a uniform random number generator and the inverse cumulative distribution function of X. This is known as inverse transform sampling.
random_x = xcdf_inverse(rand())
How does this apply here? If the vector p holds the probabilities defining your multinomial distribution, F = cumsum(p) gives a vector that defines the CDF. You can then generate a uniform random number on [0,1] using temp = rand() and find the index of the first entry of F that is greater than temp; that index is your draw. This is basically using the inverse CDF of the multinomial distribution.
Be aware, though, that for some distributions (e.g. the gamma distribution) this turns out to be an inefficient way to generate random draws because evaluating the inverse CDF is so slow (if the CDF cannot be expressed analytically, slower numerical methods must be used).
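A minimal sketch of the cumsum approach described above, using only rand. The three-state probability vector and the number of draws are assumptions for illustration; the same code works for a vector of 100 probabilities.

```matlab
p = [0.2 0.5 0.3];           % hypothetical P(x = i), i = 1..3
F = cumsum(p);
F(end) = 1;                  % guard against round-off in the last entry

numDraws = 1000;             % hypothetical sample size
x = zeros(numDraws, 1);
for k = 1:numDraws
    u = rand();                      % uniform draw on (0,1)
    x(k) = find(u <= F, 1, 'first'); % first index whose CDF value covers u
end

freq = accumarray(x, 1, [numel(p) 1]) / numDraws;   % should be close to p
```

Each draw is an integer between 1 and numel(p), and the empirical frequencies approach p as the number of draws grows.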