Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10th, 20th, ..., 90th percentiles of D. These split the sample into 10 bins; find the median of the points falling in each bin. Call these medians a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the midpoints of the bins
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)

I don't know if this is what you're after, but maybe it can help. Recall that P(X = x) = 0 for any point x under a continuous probability distribution: the probability of X taking exactly the value x is infinitesimally small and is therefore regarded as 0.
What you could do instead, in order to approximate it to a discrete probability space, is to define some points (x_1, x_2, ..., x_n), and let their discrete probabilities be the integral of some range of the PDF (from your continuous probability distribution), that is
P(x_1) = P(X \in (-infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +infty))
:-)
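As a minimal sketch of this idea for the one-dimensional normal case from the question, assuming the Statistics and Machine Learning Toolbox functions norminv and normcdf are available: the bin edges are placed at equally spaced quantiles, each mass is the CDF difference across a bin, and each support point is taken as the median of its bin. The values of r, mu and sigma are illustrative choices only.

```matlab
% Illustrative parameters (assumptions, not from the original question)
r = 10; mu = 0; sigma = 1;

% Bin edges at the 0, 1/r, ..., 1 quantiles; the outer edges are -Inf and +Inf
q = linspace(0, 1, r+1);
edges = norminv(q, mu, sigma);

% Probability mass of each bin = integral of the PDF over the bin
pmass = diff(normcdf(edges, mu, sigma));

% Support points: the median of each bin (the mid-quantile of the bin)
supp = norminv((q(1:end-1) + q(2:end))/2, mu, sigma);
```

With equal-probability bins every mass is 1/r by construction. Other edge choices work the same way, since pmass is always computed from the CDF differences.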

Related

How to calculate probability of a point using a probability distribution object?

I'm building on my previous question because there is a further issue.
I have fitted in Matlab a normal distribution to my data vector: PD = fitdist(data,'normal'). Now I have a new data point coming in (e.g. x = 0.5) and I would like to calculate its probability.
Using cdf(PD,x) will not work because it gives the probability that the point is smaller than or equal to x (but not exactly x). Using pdf(PD,x) gives just the density, not the probability, and so it can be greater than one.
How can I calculate the probability?
If the distribution is continuous then the probability of any point x is 0, almost by definition of continuous distribution. If the distribution is discrete and, furthermore, the support of the distribution is a subset of the set of integers, then for any integer x its probability is
cdf(PD,x) - cdf(PD,x-1)
More generally, for any random variable X which takes on integer values, the probability mass function f(x) and the cumulative distribution F(x) are related by
f(x) = F(x) - F(x-1)
The right hand side can be interpreted as a discrete derivative, so this is a direct analog of the fact that in the continuous case the pdf is the derivative of the cdf.
I'm not sure if matlab has a more direct way to get at the probability mass function in your situation than going through the cdf like that.
In the continuous case, your question doesn't make a lot of sense since, as I said above, the probability is 0. Non-zero probability in this case is something that attaches to intervals rather than individual points. You still might want to ask for the probability of getting a value near x -- but then you have to decide on what you mean by "near". For example, if x is an integer then you might want to know the probability of getting a value that rounds to x. That would be:
cdf(PD, x + 0.5) - cdf(PD, x - 0.5)
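To illustrate the integer-valued case, here is a small sketch that checks f(x) = F(x) - F(x-1) for a Poisson distribution. The choice of a Poisson distribution and of lambda is an assumption made for the example, and poisscdf and poisspdf require the Statistics and Machine Learning Toolbox.

```matlab
% Illustrative discrete distribution (assumption): Poisson with rate lambda
lambda = 3;
x = 0:10;

% Probability mass obtained as a difference of the CDF at consecutive integers
pmf_from_cdf = poisscdf(x, lambda) - poisscdf(x - 1, lambda);

% Probability mass function evaluated directly, for comparison
pmf_direct = poisspdf(x, lambda);

max_abs_diff = max(abs(pmf_from_cdf - pmf_direct));
```

For discrete distributions the pdf functions return the probability mass, so max_abs_diff should be at round-off level.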
Let's say you have a random variable X that follows the normal distribution with mean mu and standard deviation s.
Let F be the cumulative distribution function for the normal distribution with mean mu and standard deviation s. The probability that the random variable X falls between a and b is P(a < X <= b) = F(b) - F(a).
In Matlab code:
P_a_b = normcdf(b, mu, s) - normcdf(a, mu, s);
Note: observe that the probability that X is exactly equal to 0.5 (or any specific value) is zero! A range of outcomes can have positive probability, but any single outcome, or any finite collection of single outcomes, has probability zero.
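A quick numerical sketch of this point, assuming a standard normal and an arbitrary set of interval widths chosen for illustration: the probability of an interval centred on x = 0.5 shrinks towards zero with the width, and for small widths it is approximately the density times the width.

```matlab
% Illustrative values (assumptions): standard normal, point of interest x = 0.5
mu = 0; s = 1; x = 0.5;
widths = [1 0.1 0.01 0.001];

% Probability of the interval (x - w/2, x + w/2] for each width w
p = normcdf(x + widths/2, mu, s) - normcdf(x - widths/2, mu, s);

% First-order approximation: density at x times the interval width
approx = normpdf(x, mu, s) .* widths;
```

This is why pdf(PD,x) is useful for comparing how likely values near different points are, even though it is not itself a probability.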

Sample multinomial distribution in Matlab without using mnrnd

I know for a random variable x that P(x=i) for each i=1,2,...,100. Then how may I sample x by a multinomial distribution, based on the given P(x=i) in Matlab?
I am allowed to use the Matlab built-in commands rand and randi, but not mnrnd.
In general, you can sample numbers from any 1 dimensional probability distribution X using a uniform random number generator and the inverse cumulative distribution function of X. This is known as inverse transform sampling.
random_x = xcdf_inverse(rand())
How does this apply here? If you have your vector p of probabilities defining your multinomial distribution, F = cumsum(p) gives you a vector that defines the CDF. You can then generate a uniform random number on [0,1] using temp = rand() and then find the first row in F greater than temp. This is basically using the inverse CDF of the multinomial distribution.
Be aware though that for some distributions (e.g. the gamma distribution), this turns out to be an inefficient way to generate random draws because evaluating the inverse CDF is so slow (if the CDF cannot be expressed analytically, slower numerical methods must be used).
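The cumsum-and-compare recipe above can be sketched as follows, using only rand. The probability vector p and the sample count n are made-up values for illustration, and the comparison u > F relies on implicit expansion (R2016b or later; bsxfun(@gt, u, F) would be the older equivalent).

```matlab
% Illustrative inputs (assumptions): category probabilities and sample count
p = [0.2 0.5 0.3];
n = 1e5;

F = cumsum(p);        % discrete CDF as a row vector
F(end) = 1;           % guard against round-off in the last entry

u = rand(n, 1);       % uniform draws, one per sample

% For each draw, count the CDF entries strictly below it; adding 1 gives
% the index of the first entry of F that is >= the draw.
x = sum(u > F, 2) + 1;

% Empirical frequencies, to compare against p
freq = accumarray(x, 1, [numel(p) 1])' / n;
```

For 100 categories this builds an n-by-100 logical array, which is fine for moderate n; for very large n a loop with find, or sorting the draws first, would use less memory.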

How to calculate the spectral density of a matrix of data using Matlab

I am not doing signal processing. But in my area, I will use the spectral density of a matrix of data. I get quite confused at a very detailed level.
%matrix H is given.
corr=xcorr2(H); %get the correlation
spec=fft2(corr); % Wiener-Khinchin Theorem
In matlab, xcorr2 will calculate the correlation function of this matrix. The lag will range from -N+1 to N-1. So if size of matrix H is N by N, then size of corr will be 2N-1 by 2N-1. For discretized data, I should use corr or half of corr?
Another problem is that I think the Wiener-Khinchin theorem is basically for continuous functions. I have always thought that the discretized FT is an approximation to the continuous FT, or you could say it is a tool to calculate the continuous FT. If you use the Matlab built-in function 'fft', you should divide the final result by \delta x.
Any kind soul who knows this area well there to share some matlab code with me?
Basically, approximating a continuous FT by a Discretized FT is the same as approximating an integral by a finite sum.
We will first discuss the 1D case, then we'll discuss the 2D case.
Let's look at the Wiener-Khinchin theorem (for example here).
It states that:
"For the discrete-time case, the power spectral density of the function with discrete values x[n] is:
S(f) = \sum_{k=-\infty}^{+\infty} r_{xx}[k] e^{-j 2\pi k f}
where
r_{xx}[k] = E[ x[n] x^*[n-k] ]
is the autocorrelation function of x[n]."
1) You can see already that the sum is taken from -infty to +infty in the calculation of S(f)
2) Now considering the Matlab fft - You can see (command 'edit fft' in Matlab), that it is defined as :
X(k) = sum_{n=1}^N x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), 1 <= k <= N.
which has the same form as the sum above, truncated to N samples and evaluated at the discrete frequencies f = (k-1)/N.
Note that, for continuous functions, S(f) will be a continuous function. For a discretized function, S(f) will be discrete.
Now that we know all that, it can easily be extended to the 2D case. Indeed, the structure of fft2 matches the structure of the right-hand side of the Wiener-Khinchin theorem for the 2D case.
Though, it will be necessary to divide your result by NxM, where N is the number of sample points in x and M is the number of sample points in y.
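As a rough sketch of the 2D recipe, the snippet below compares the FFT of the full (2N-1)-by-(2M-1) autocorrelation with the squared magnitude of the zero-padded FFT of H, both divided by N*M as suggested above. The random test matrix is an assumption for illustration. conv2 with a flipped, conjugated copy is used in place of xcorr2(H) so that only base Matlab is needed; the two give the same array.

```matlab
% Illustrative data (assumption): a small random real matrix
rng(0);
H = randn(8, 6);
[N, M] = size(H);

% Full 2D autocorrelation, lags -N+1..N-1 and -M+1..M-1 (same as xcorr2(H))
C = conv2(H, rot90(conj(H), 2));

% Move the zero lag from the centre to index (1,1) before transforming,
% otherwise fft2 picks up a linear phase factor
S_corr = real(fft2(ifftshift(C))) / (N*M);

% Direct route: squared magnitude of the zero-padded transform of H
S_direct = abs(fft2(H, 2*N-1, 2*M-1)).^2 / (N*M);

err = max(abs(S_corr(:) - S_direct(:)));
```

This suggests using the whole of corr rather than half of it: the full set of lags is what makes the transform real and non-negative.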

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
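A variant of the same flattening idea, written as a sketch under stated assumptions: it looks up each uniform draw in the cumulative sum with find instead of interp1, which also tolerates cells of zero probability (repeated CDF values). The random matrix A below is only a stand-in for the real 100x100 matrix, and the moments are computed on the index grid 1..100.

```matlab
% Hypothetical example PMF on a 100x100 grid (stand-in for the real A)
A = rand(100, 100);
A = A / sum(A(:));

n = 1000;                           % number of samples to draw
F = cumsum(A(:));                   % CDF over the flattened (column-major) cells
F(end) = 1;                         % guard against round-off in the last entry
u = rand(n, 1);

idx = zeros(n, 1);
for k = 1:n
    idx(k) = find(F >= u(k), 1, 'first');   % first cell whose CDF reaches u(k)
end

% Convert linear indices back to (x,y) pairs: row index = x, column index = y
[xi, yi] = ind2sub(size(A), idx);
xy_values = [xi yi];

% Exact means of the discrete distribution, for comparison with the samples
[X, Y] = ndgrid(1:size(A,1), 1:size(A,2));
mean_x = sum(A(:) .* X(:));
mean_y = sum(A(:) .* Y(:));
```

The sample means mean(xi) and mean(yi) should approach mean_x and mean_y as n grows.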
I don't believe Matlab has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated by inverting the cumulative distribution function, the joint CDF of a multivariate distribution cannot be inverted in the same way, so generating such numbers is much messier (the main problem is that the 2 or more variables may be dependent). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get by without any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && ~isequal(size(arg),size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
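The same double-trapz pattern extends to second moments. Below is a self-contained sketch on a deliberately non-square mesh; the single Gaussian with standard deviations 2 and 3 is an assumption chosen so that the expected answers (variances 4 and 9, zero covariance) are known in advance.

```matlab
% Illustrative mesh and density (assumptions): independent Gaussian,
% sigma_x = 2, sigma_y = 3, sampled on a non-square grid
xv = linspace(-20, 20, 201);
yv = linspace(-30, 30, 301);
[x, y] = meshgrid(xv, yv);               % size numel(yv)-by-numel(xv)
A = exp(-x.^2/(2*2^2) - y.^2/(2*3^2));

% Normalize so that the double integral of A is 1
A = A / trapz(xv, trapz(yv, A));

% First moments (inner trapz integrates over y, outer over x)
mean_x = trapz(xv, trapz(yv, A.*x));
mean_y = trapz(xv, trapz(yv, A.*y));

% Central second moments
var_x  = trapz(xv, trapz(yv, A.*(x - mean_x).^2));
var_y  = trapz(xv, trapz(yv, A.*(y - mean_y).^2));
cov_xy = trapz(xv, trapz(yv, A.*(x - mean_x).*(y - mean_y)));
```

Higher moments follow by replacing the integrand with the corresponding power of (x - mean_x) or (y - mean_y).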

Dividing a normal distribution into regions of equal probability in Matlab

Consider a Normal distribution with mean 0 and standard deviation 1. I would like to divide this distribution into 9 regions of equal probability and take a random sample from each region.
It sounds like you want to find the values that divide the area under the probability distribution function into segments of equal probability. This can be done in matlab by applying the norminv function.
In your particular case:
segmentBounds = norminv(linspace(0,1,10),0,1)
Any two adjacent values of segmentBounds now describe the boundaries of segments of the Normal probability distribution function such that each segment contains one ninth of the total probability.
I'm not sure exactly what you mean by taking a random sample from each region. One approach is rejection sampling. In short, for each region bounded by x0 and x1, draw a sample y = normrnd(0,1). If x0 < y < x1, keep it. Otherwise discard it and repeat.
It's also possible that you intend to sample from these regions uniformly. To do this you can try rand(1)*(x1-x0) + x0. This will produce problems for the extreme quantiles, however, since the regions extend to +/- infinity.
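One possible alternative to both approaches, sketched here as a suggestion rather than as what the question necessarily intends: draw a uniform number inside each region's probability range and map it through norminv. Each sample then follows the normal distribution restricted to its region, with no rejections and no trouble from the infinite outer bounds. The variable names are illustrative.

```matlab
K = 9;                                   % number of equal-probability regions
pEdges = linspace(0, 1, K+1);            % region boundaries on the probability scale
segmentBounds = norminv(pEdges, 0, 1);   % the same boundaries on the x scale

% One uniform draw inside each probability interval [pEdges(k), pEdges(k+1)]
u = pEdges(1:K) + rand(1, K) .* diff(pEdges);

% Map back through the inverse CDF: one sample per region
samples = norminv(u, 0, 1);
```

Replacing rand(1, K) with rand(m, K) would give m samples per region, one region per column.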