Dividing a normal distribution into regions of equal probability in Matlab

Consider a Normal distribution with mean 0 and standard deviation 1. I would like to divide this distribution into 9 regions of equal probability and take a random sample from each region.

It sounds like you want to find the values that divide the area under the probability density function into segments of equal probability. This can be done in MATLAB with the norminv function.
In your particular case:
segmentBounds = norminv(linspace(0,1,10),0,1)
Any two adjacent values of segmentBounds now describe the boundaries of a segment of the normal probability density function, such that each segment contains one ninth of the total probability. The first and last values are -Inf and Inf.
I'm not sure exactly what you mean by taking a random sample from each region. One approach is rejection sampling: for each region bounded by x0 and x1, draw a sample y = normrnd(0,1). If x0 < y < x1, keep it; otherwise discard it and repeat.
It's also possible that you intend to sample from these regions uniformly. To do this you can try rand(1)*(x1-x0) + x0. This will fail for the two extreme regions, however, since they extend to +/- infinity.
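A third option, not mentioned above, is inverse-CDF sampling: draw a uniform number inside each probability slice and map it through norminv. The following is only a sketch; it assumes norminv is available (Statistics and Machine Learning Toolbox), and the names nRegions, pEdges, u and samples are illustrative.

```matlab
% Sketch: one draw per equal-probability region via the inverse CDF.
nRegions = 9;
pEdges = linspace(0, 1, nRegions + 1);       % probability edges 0, 1/9, ..., 1
segmentBounds = norminv(pEdges, 0, 1);       % region boundaries, from -Inf to Inf

% Uniform draw inside each probability slice (rand lies strictly in (0,1))
u = pEdges(1:end-1) + rand(1, nRegions) / nRegions;

% Map the probabilities back to the normal scale
samples = norminv(u, 0, 1);
```

Each entry of samples is distributed like a standard normal variable conditioned on its region, and no draws are rejected, so the two unbounded tail regions need no special treatment.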

Related

How to generate random uniformly distributed vectors of Euclidean length one?

I am trying to randomly generate uniformly distributed vectors of Euclidean length 1. By uniformly distributed I mean that each entry (coordinate) of the vectors is uniformly distributed.
More specifically, I would like to create a set of, say, 1000 vectors (let's call them V_i, with i=1,…,1000), where each of these random vectors has unit Euclidean length and the same dimension, V_i=(v_1i,…,v_ni)' (let's say n = 5, but the algorithm should work with any dimension). If we then look at the distribution of, e.g., v_1i, the first element of each V_i, I would like it to be uniform.
In the attached MATLAB example you can see that you cannot simply draw random vectors from a uniform distribution and then normalize them to Euclidean length 1, as the distribution of the elements across the vectors is then no longer uniform.
Is there a way to generate this set of vectors such that the distribution of the individual elements across the vector set is uniform?
Thank you for any ideas.
PS: MATLAB is our language of choice, but solutions in any language are, of course, welcome.
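clear
rng('default')
nvar = 5;        % dimension of each vector
sample = 1000;   % number of vectors
x = zeros(nvar, sample);
for ii = 1:sample
    y = rand(nvar, 1);         % entries uniform on (0,1)
    x(:, ii) = y ./ norm(y);   % normalize to unit Euclidean length
end
% One histogram per coordinate, each in its own figure
for k = 1:nvar
    figure
    hist(x(k, :))
end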

What you want cannot be accomplished.
Vectors with a length of 1 sit on a circle (or sphere or hypersphere, depending on the number of dimensions). Let's focus on the 2D case: if it cannot be done there, it will be clear that it cannot be done with more dimensions either.
Because the points are on a circle, their x and y coordinates are dependent: one can be computed from the other (up to sign). Thus, the distributions of the x and y coordinates cannot be defined independently. We can define the distribution of one, generate random values for it, but the other coordinate must be computed from the first.
Let's make points on a half circle with a uniform x coordinate (can be extended to a full circle by adding a random sign to the y coordinate):
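N = 1000;
x = 2 * rand(N,1) - 1;   % x uniform on (-1, 1)
y = sqrt(1 - x.^2);      % y is determined by x (upper half circle)
plot(x,y,'.')
axis equal
figure                   % new figure so the histogram does not replace the scatter plot
histogram(y)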
The histogram shows a clearly non-uniform distribution, with many more samples generated near y=1 than near y=0. If we add a random sign to the y-coordinate, we'd have more samples near y=1 and y=-1 than near y=0.
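To put rough numbers on this, the sketch below extends the half circle to a full circle with a random sign and compares how many points land near |y| = 1 versus near y = 0. The thresholds 0.9 and 0.1, the seed, and the names s, fracNearOne and fracNearZero are arbitrary choices for illustration.

```matlab
rng(0);                               % fixed seed, only for reproducibility
N = 1e5;
x = 2 * rand(N, 1) - 1;               % x uniform on (-1, 1)
s = 2 * (rand(N, 1) < 0.5) - 1;       % random sign, +1 or -1
y = s .* sqrt(1 - x.^2);              % point (x, y) lies on the unit circle

fracNearOne  = mean(abs(y) > 0.9);    % share of points with |y| close to 1
fracNearZero = mean(abs(y) < 0.1);    % share of points with y close to 0
fprintf('|y| > 0.9: %.3f, |y| < 0.1: %.3f\n', fracNearOne, fracNearZero);
```

If y were uniform on (-1, 1), both shares would be about 0.1; with a uniform x they differ by well over an order of magnitude.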

Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by also providing an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10-th, 20-th,..., 90-th quantiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers from a standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the midpoints of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)

I don't know if this is what you're after, but maybe it can help you. As you know, P(X = x) = 0 for any point in a continuous probability distribution; that is, the pointwise probability of X mapping to x is infinitesimally small, and thus regarded as 0.
What you could do instead, in order to approximate it by a discrete probability space, is to define some points (x_1, x_2, ..., x_n) and let their discrete probabilities be the integral of the PDF (from your continuous probability distribution) over some range, that is
P(x_1) = P(X ∈ (-∞, x_1_end)), P(x_2) = P(X ∈ (x_1_end, x_2_end)), ..., P(x_n) = P(X ∈ (x_(n-1)_end, +∞))
:-)
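As a rough sketch of this idea for the standard normal, one could pick equal-probability bins, take the mass of each bin as a difference of CDF values, and use the median of each bin as its support point. The names r, pEdges, edges, a and pmass are illustrative, the choice of bin medians is just one option, and norminv/normcdf are assumed to be available (Statistics and Machine Learning Toolbox).

```matlab
r = 10;                                    % number of support points
pEdges = linspace(0, 1, r + 1);            % cumulative probabilities at the bin edges
edges = norminv(pEdges);                   % bin edges, from -Inf to Inf

% Probability mass of each bin = integral of the PDF over the bin
pmass = diff(normcdf(edges));              % every entry is 1/r by construction

% Support points: the median of the distribution restricted to each bin
a = norminv(pEdges(1:end-1) + 1 / (2 * r));

stem(a, pmass)                             % the resulting probability mass function
```

With bins chosen differently (for example equally spaced edges), the same diff(normcdf(edges)) line would give unequal masses.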

How to handle the logarithm of zero when computing with prior information

I am working on image classification. I use a prior probability (as in Bayes' rule), which lies in the range [0,1], and I need to take its logarithm. However, as you know, the logarithm of zero is -Inf.
For example, consider a pixel x in an image I (of size 3 by 3) with a cost function such as
Cost(x)=30+log(prior(x))
where prior is a 3-by-3 matrix
prior=[ 0 0 0.5;
1 1 0.2;
0.4 0 0]
I =[ 1 2 3;
4 5 6;
7 8 9]
I want to compute the cost of x=1; then
cost(x=1)=30+log(0)
Now, log(0) is -Inf, so the resulting cost(x=1) is also -Inf. My assumption is that prior=0 means the given pixel belongs to the background, and prior=1 means the given pixel belongs to the foreground.
My question is how to compute log(prior) so that it satisfies my assumption.
I am using MATLAB. I think that log(0) should become a large negative value, so I just set it to -9 in my code:
%% Handle log(0)
prior(prior==0.0) = NaN;
%% Compute log
log_prior=log(prior);
%% Assume that e^-9 is very near 0.
log_prior(isnan(log_prior)) = -9;
UPDATE: To make clear what I am doing, consider Bayes' rule. My task is to assign a given pixel x to the background (BG) or the foreground (FG). This depends on the probability
P(x∈BG|x)=P(x|x∈BG)P(x∈BG)/P(x)
in which P(x|x∈BG) is the likelihood function, assumed to be approximated by a Gaussian distribution, P(x∈BG) is the prior term, and P(x) can be ignored because it is constant.
Using maximum a posteriori (MAP) estimation, we can map the above equation into log space (to resolve the exponential in the Gaussian function):
Cost(x)=log(P(x∈BG|x))=log(P(x|x∈BG))+log(P(x∈BG))
To keep it simple, let us assume log(P(x|x∈BG))=30 and that log(P(x∈BG)) is log(prior); then my cost function can be rewritten as
Cost(x)=30+log(prior(x))
Now the problem is that prior lies within [0,1], so its logarithm can be -Inf. As chepner said, we can add an eps value:
log(prior+eps)
However, log(eps) is a negative number of very large magnitude. It will affect my cost function (which also becomes a very large negative number), so the first term in my cost function (30) no longer matters. Based on my assumption, prior(x)=0 means the pixel x is BG and prior(x)=1 means it is FG. How should I handle log(prior) when I compute my cost function?

The correct thing to do, before fiddling with MATLAB, is to try to understand your problem. Ask yourself "what does it mean for the prior probability to vanish?". The answer is given by Bayes' theorem, one form of which is:
posterior = likelihood * prior / normalization
So places where the prior is nil are, by definition, places where you are certain that your events (the things whose probabilities you are computing) cannot happen, regardless of their apparent likelihood (i.e. "cost"). So they are not interesting for you. You just recognize that and skip them.
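One way to "skip" zero-prior pixels in MATLAB is a logical mask, sketched here with the prior matrix from the question. The constant log-likelihood of 30 is the question's own simplifying assumption, and marking the excluded pixels with -Inf is just one possible convention; the names logLik, valid and cost are illustrative.

```matlab
prior = [0   0   0.5;
         1   1   0.2;
         0.4 0   0  ];
logLik = 30 * ones(size(prior));   % placeholder log-likelihood from the question

valid = prior > 0;                 % pixels where the event is possible at all

cost = -Inf(size(prior));          % impossible pixels stay at -Inf (excluded)
cost(valid) = logLik(valid) + log(prior(valid));

disp(cost)
```

Any later step (for example a maximum over classes) can then either ignore the -Inf entries or restrict itself to cost(valid).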

Estimating skewness of histogram in MATLAB

What test can I do in MATLAB to test the spread of a histogram? For example, in the given set of histograms, I am only interested in 1,2,3,5 and 7 (going from left to right, top to bottom) because they are less spread out. How can I obtain a value that will tell me if a histogram is positively skewed?
It may be possible using Chi-Squared tests but I am not sure what the MATLAB code will be for that.

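You can use the standard definition of skewness. In other words, you can use:
skewness = ( (1/n) * sum_i (x_i - mu)^3 ) / ( (1/(n-1)) * sum_i (x_i - mu)^2 )^(3/2)
where n is the number of samples and mu is their mean. You compute the mean of your data and use the above equation to calculate the skewness. Positive and negative skewness look like this:
[Figure: negatively and positively skewed distributions. Source: Wikipedia]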
As such, the larger the value, the more positively skewed it is. The more negative the value, the more negatively skewed it is.
Now to compute the mean of your histogram data, it's quite simple. You take the sum of the bin centres weighted by the histogram counts and divide by the total number of entries. Given that your histogram is stored in h and the bin centres are stored in x, you would do the following. I assume here that h is a row vector and that you have bins from 0 up to N-1, where N is the total number of bins in the histogram, judging from your picture:
x = 0:numel(h)-1; % Bin centres, judging from your pictures
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
% Each bin's deviation from the mean is weighted by its count h
skew = ((1/num_entries)*(sum(h.*(x - mu).^3))) / ...
((1/(num_entries-1))*(sum(h.*(x - mu).^2)))^(3/2);
skew would contain the numerical measure of skewness for a histogram that follows that formula. Therefore, with your problem statement, you will want to look for skewness numbers that are positive and large. I can't really comment on what threshold you should look at, but look for positive numbers that are much larger than most of the histograms that you have.
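As a sanity check on the count-weighted formula, the sketch below builds a histogram from synthetic right-skewed integer data and compares the histogram-based skewness with the same formula applied to the raw samples. The data generator, the seed and the names skewHist and skewRaw are assumptions made only for this illustration.

```matlab
rng(0);                                      % fixed seed, only for reproducibility
data = floor(-3 * log(rand(10000, 1)));      % right-skewed non-negative integers
h = accumarray(data + 1, 1).';               % counts for bins 0, 1, ..., max(data)
x = 0:numel(h)-1;                            % bin centres
n = sum(h);

% Skewness from the histogram (deviations weighted by counts)
mu = sum(h .* x) / n;
skewHist = (sum(h .* (x - mu).^3) / n) / (sum(h .* (x - mu).^2) / (n - 1))^(3/2);

% Same formula applied directly to the raw samples
muRaw = mean(data);
skewRaw = (sum((data - muRaw).^3) / n) / (sum((data - muRaw).^2) / (n - 1))^(3/2);

fprintf('histogram: %.4f, raw data: %.4f\n', skewHist, skewRaw);
```

The two values should agree up to floating-point rounding, and both should be positive for this right-skewed data.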

Matlab create array random samples Gaussian distribution

I'd like to make an array of random samples from a Gaussian distribution.
Mean value is 0 and variance is 1.
If I take enough samples, I would think my maximum value of a sample can be 0+1=1.
However, I find that I get values like 4.2891 ...
My code:
x = 0+sqrt(1)*randn(100000,1);
mean(x)
var(x)
max(x)
This would give me a mean like 0, a var of 0.9937 but my maximum value is 4.2891?
Can anyone help me out why it does this?

As others have mentioned, there is no bound on the possible values that x can take on in a gaussian distribution. However, the farther x is from the mean, the less likely it is to be drawn.
To give you some intuition for what the variance actually means, you can look at the 68-95-99.7 rule. For a Gaussian distribution, the rule says:
about 68% of the population will be within one sigma of the mean
about 95% of the population will be within two sigmas of the mean
about 99.7% of the population will be within three sigmas of the mean
Here sigma = sqrt(var) is the standard deviation of the distribution.
So while in theory it is possible to draw any x from a gaussian distribution, in practice it is unlikely to draw anything past 5 or 6 standard deviations away for a population of 100000.
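The rule is easy to check empirically with the same sample size as in the question. This is only a sketch; the seed and the names within1, within2, within3 and maxAbs are illustrative.

```matlab
rng(0);                              % fixed seed, only for reproducibility
x = randn(100000, 1);                % mean 0, standard deviation 1

within1 = mean(abs(x) < 1);          % fraction within one sigma
within2 = mean(abs(x) < 2);          % fraction within two sigmas
within3 = mean(abs(x) < 3);          % fraction within three sigmas
maxAbs  = max(abs(x));               % largest deviation actually observed

fprintf('%.4f %.4f %.4f, max |x| = %.4f\n', within1, within2, within3, maxAbs);
```

The three fractions should come out close to 0.68, 0.95 and 0.997, while the largest deviation is typically a bit above 4, in line with the value observed in the question.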

This will yield N random numbers drawn from a Gaussian (normal) distribution.
N = 100;
mu = 0;
sigma = 1;
Xs = normrnd(mu, sigma, N, 1); % N-by-1 vector; normrnd(mu, sigma, N) would return an N-by-N matrix
EDIT:
I just realized that your code is in fact equivalent to what I've written.
As others already pointed out: variance is not the maximum distance a sample can deviate from the mean! (It is just the average of the squares of those distances)