I'm building up on my preivous question because there is a further issue.
I have fitted in Matlab a normal distribution to my data vector: PD = fitdist(data,'normal'). Now I have a new data point coming in (e.g. x = 0.5) and I would like to calculate its probability.
Using cdf(PD,x) will not work because it gives the probability that the point is smaller or equal to x (but not exactly x). Using pdf(PD,x) gives just the densitiy but not the probability and so it can be greater than one.
How can I calculate the probability?
If the distribution is continuous then the probability of any point x is 0, almost by definition of continuous distribution. If the distribution is discrete and, furthermore, the support of the distribution is a subset of the set of integers, then for any integer x its probability is
cdf(PD,x) - cdf(PD,x-1)
More generally, for any random variable X which takes on integer values, the probability mass function f(x) and the cumulative distribution F(x) are related by
f(x) = F(x) - F(x-1)
The right hand side can be interpreted as a discrete derivative, so this is a direct analog of the fact that in the continuous case the pdf is the derivative of the cdf.
I'm not sure if matlab has a more direct way to get at the probability mass function in your situation than going through the cdf like that.
In the continuous case, your question doesn't make a lot of sense since, as I said above, the probability is 0. Non-zero probability in this case is something that attaches to intervals rather than individual points. You still might want to ask for the probability of getting a value near x -- but then you have to decide on what you mean by "near". For example, if x is an integer then you might want to know the probability of getting a value that rounds to x. That would be:
cdf(PD, x + 0.5) - cdf(PD, x - 0.5)
Let's say you have a random variable X that follows the normal distribution with mean mu and standard deviation s.
Let F be the cumulative distribution function for the normal distribution with mean mu and standard deviation s. The probability the random variableX falls between a and b, that is P(a < X <= b) = F(b) - F(a).
In Matlab code:
P_a_b = normcdf(b, mu, s) - normcdf(a, mu, s);
Note: observe that the probability X is exactly equal to 0.5 (or any specific value) is zero! A range of outcomes will have positive probability, but an insufficient sum of individual outcomes will have probability zero.
Related
I have a binary vector V, in which each entry describes success (1) or failure (0) in the relevant trial out of a whole session.
(the length of the vector denotes the number of trials in the session).
I can easily calculate the success rate of the session (by taking the mean of the vector i.e. (sum(V)/length(V))).
However I also need to know the variance or std of each session.
In order to calculate that, is it OK to use the Matlab std function (i.e. to take std(V)/length(V))?
Or, should I use something which is specifically suited for the binomial distribution?
Is there a Matlab std (or variance) function which is specific for a "success/failure" distribution?
Thanks
If you satisfy the assumptions of the Binomial distribution,
a fixed number of n independent Bernoulli trials,
each with constant success probability p,
then I'm not sure that is necessary, since the parameters n and p are available from your data.
Note that we model number of successes (in n trials) as a random variable distributed with the Binomial(n,p) distribution.
n = length(V);
p = mean(V); % equivalently, sum(V)/length(V)
% the mean is the maximum likelihood estimator (MLE) for p
% note: need large n or replication to get true p
Then the standard deviation of the number of successes in n independent Bernoulli trials with constant success probability p is just sqrt(n*p*(1-p)).
Of course you can assess this from your data if you have multiple samples. Note this is different from std(V). In your data formatting, it would require having multiple vectors, V1, V2, V2, etc. (replication), then the sample standard deviation of the number of successes would obtained from the following.
% Given V1, V2, V3 sets of Bernoulli trials
std([sum(V1) sum(V2) sum(V3)])
If you already know your parameters: n, p
You can obtain it easily enough.
n = 10;
p = 0.65;
pd = makedist('Binomial',n, p)
std(pd) % 1.5083
or
sqrt(n*p*(1-p)) % 1.5083
as discussed earlier.
Does the standard deviation increase with n ?
The OP has asked:
Something is bothering me.. if std = sqrt(n*p*(1-p)), then it increases with n. Shoudn't the std decrease when n increases?
Confirmation & Derivation:
Definitions:
Then we know that
Then just from definitions of expectation and variance we can show the variance (similarly for standard deviation if you add the square root) increases with n.
Since the square root is a non-decreasing function, we know the same relationship holds for the standard deviation.
Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10-th, 20-th,..., 90-th quantiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the mean values of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after but maybe it can help you. You know, P(X = x) = 0 for any point in a continuous probability distribution, that is the pointwise probability of X mapping to x is infinitesimal small, and thus regarded as 0.
What you could do instead, in order to approximate it to a discrete probability space, is to define some points (x_1, x_2, ..., x_n), and let their discrete probabilities be the integral of some range of the PDF (from your continuous probability distribution), that is
P(x_1) = P(X \in (-infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +infty))
:-)
I have a set of points describing a closed curve in the complex plane, call it Z = [z_1, ..., z_N]. I'd like to interpolate this curve, and since it's periodic, trigonometric interpolation seemed a natural choice (especially because of its increased accuracy). By performing the FFT, we obtain the Fourier coefficients:
F = fft(Z);
At this point, we could get Z back by the formula (where 1i is the imaginary unit, and we use (k-1)*(n-1) because MATLAB indexing starts at 1)
N
Z(n) = (1/N) sum F(k)*exp( 1i*2*pi*(k-1)*(n-1)/N), 1 <= n <= N.
k=1
My question
Is there any reason why n must be an integer? Presumably, if we treat n as any real number between 1 and N, we will just get more points on the interpolated curve. Is this true? For example, if we wanted to double the number of points, could we not set
N
Z_new(n) = (1/N) sum F(k)*exp( 1i*2*pi*(k-1)*(n-1)/N), with n = 1, 1.5, 2, 2.5, ..., N-1, N-0.5, N
k=1
?
The new points are of course just subject to some interpolation error, but they'll be fairly accurate, right? The reason I'm asking this question is because this method is not working for me. When I try to do this, I get a garbled mess of points that makes no sense.
(By the way, I know that I could use the interpft() command, but I'd like to add points only in certain areas of the curve, for example between z_a and z_b)
The point is when n is integer, you have some primary functions which are orthogonal and can be as a basis for the space. When, n is not integer, The exponential functions in the formula, are not orthogonal. Hence, the expression of a function based on these non-orthogonal basis is not meaningful as much as you expected.
For orthogonality case you can see the following as an example (from here). As you can check, you can find two n_1 and n_2 which are not integer, the following integrals are not zero any more, and they are not orthogonal.
I'd like to make an array of random samples from a Gaussian distrubution.
Mean value is 0 and variance is 1.
If I take enough samples, I would think my maximum value of a sample can be 0+1=1.
However, I find that I get values like 4.2891 ...
My code:
x = 0+sqrt(1)*randn(100000,1);
mean(x)
var(x)
max(x)
This would give me a mean like 0, a var of 0.9937 but my maximum value is 4.2891?
Can anyone help me out why it does this?
As others have mentioned, there is no bound on the possible values that x can take on in a gaussian distribution. However, the farther x is from the mean, the less likely it is to be drawn.
To give you some intuition for what the variance actually means (for any distribution, not just the gaussian case), you can look at the 68-95-99.7 rule. The rule says:
about 68% of the population will be within one sigma of the mean
about 95% of the population will be within two sigma's of the mean
about 99.7% of the population will be within three sigma's of the mean
Here sigma = sqrt(var) is the standard deviation of the distribution.
So while in theory it is possible to draw any x from a gaussian distribution, in practice it is unlikely to draw anything past 5 or 6 standard deviations away for a population of 100000.
This will yield N random numbers using the gaussian normal distribution.
N = 100;
mu = 0;
sigma = 1;
Xs = normrnd(mu, sigma, N);
EDIT:
I just realized that your code is in fact equivalent to what I've written.
As others already pointed out: variance is not the maximum distance a sample can deviate from the mean! (It is just the average of the squares of those distances)
I am a beginner in MATLAB trying to use it to help me understand random signals, I was doing some Normal probability density function calculations until i came across this problem :
Write a MATLAB program to calculate the probability Pr(x1 ≤ X ≤ x2)
if X is a Gaussian random variable for an arbitrary x1 and x2.
Note that you will have to specify the mean and variance of the Gaussian random
variable.
I usually use the built in normpdf function of course selecting my mean and variance, but for this one since it has a range am not sure what I can do to find the answer
Y = normpdf(X,mu,sigma)
If you recall from probability theory, you know that the Cumulative Distribution Function sums up probabilities from -infinity up until a certain point x. Specifically, the CDF F(x) for a probability distribution P with random variable X evaluated at a certain point x is defined as:
Note that I am assuming that we are dealing with the discrete case. Also, let's call this a left-tail sum, because it sums all of the probabilities from the left of the distribution up until the point x. Consequently, this defines the area underneath the curve up until point x.
Now, what your question is asking is to find the probability between a certain range x1 <= x <= x2, not just with using a left-tail sum (<= x). Now, if x1 <= x2, this means that total area where the end point is x2, or the probability of all events up to and including x2, is also part of the area defined by the end point defined at x1. Because you want the probability between a certain range, you need to accumulate all events that happen between x1 and x2, and so you want the area under the PDF curve that is in between this range. Also, you want to have the area that is greater than x1 and less than x2.
Here's a pictorial example:
Source: ReliaWiki
The top figure is the PDF of a Gaussian distribution function, while the bottom figure denotes the CDF of a Gaussian distribution. You see that if x1 <= x2, the area defined by the point at x1 is also captured by the point at x2. Here's a better graph:
Source: Introduction to Statistics
Here the CDF is continuous instead of discrete, but the result is still the same. If you want the area in between two intervals and ultimately the probability in between the two ranges, you need to take the CDF value at x2 and subtract the CDF value at x1. You want the remaining area, and so you just need to subtract the CDF values and ultimately the left-tail areas, and so:
As such, to calculate the CDF of the Gaussian distribution use normcdf and specify the mean and standard deviation of your Gaussian distribution. Therefore, you just need to do this:
y = normcdf(x2, mu, sigma) - normcdf(x1, mu, sigma);
x1 and x2 are the values of the interval that you want to calculate the sum of probabilities under.
You can use erf,
mu = 5;
sigma = 3;
x1 = 3;
x2 = 8;
p = .5*(erf((x2-mu)/sigma/2^.5) - erf((x1-mu)/sigma/2^.5));
error function is defined like this in MATLAB,