How ksdensity calculates each data point xi? - matlab

When I use [f,xi] = ksdensity(x) in Matlab, I get the probability density estimate, f, and xi evaluation points at which ksdensity calculates f.
My question is: How is each xi point calculated/determined? Is there a formula?
The documentation center says: Default is 100 equally spaced points that cover the range of data in x. So, they cover the range, but this does not explain how are calculated.
Thank you very much!
Juan

The standard method of achieving equally spaced points in MATLAB is using the linspace command. linspace(a,b,n) generates n linearly spaced points between and including a and b.
So it most probably is equivalent to xi = linspace(min(x),max(x)) (default number of points is 100 inlinspace).

Related

Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10-th, 20-th,..., 90-th quantiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the mean values of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after but maybe it can help you. You know, P(X = x) = 0 for any point in a continuous probability distribution, that is the pointwise probability of X mapping to x is infinitesimal small, and thus regarded as 0.
What you could do instead, in order to approximate it to a discrete probability space, is to define some points (x_1, x_2, ..., x_n), and let their discrete probabilities be the integral of some range of the PDF (from your continuous probability distribution), that is
P(x_1) = P(X \in (-infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +infty))
:-)

Matlab ksdensity for bivariate data with more evaluation points

I'm running Matlab code for kernel density, i.e., [f,xi] = ksdensity(x), where x is a two column bivariate data. The resulting output f is the density vector, while xi is the meshgrid of evaluation points that is 30x30 in dimension. See the documentation here: Link.
I'm trying to increase number of evaluation points that I receive from this code. There is an option mentioned in the documentation called 'NumPoints' that is only applicable for univariate data. Is there an option or ways that I can increase the meshgrid points of evaluation points of bivariate data to, say, 100x100?
You need to use the optional second input argument pts to specify the range and number of the output points in your grid. See this example in the documentation. Depending on your input data, you could specify something like this:
pts = [linspace(min(x(:,1)),max(x(:,1)),1000).' linspace(min(x(:,2)),max(x(:,2)),1000).'];
The NumPoints is npoints in the ksdensity(). e.g., [f,xi] = ksdensity(x, 'npoints', 1000) will return 1000 points of xi and f.

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

Dividing equally a vector matlab

in MFCCs i have specified f_low and f_high which are my frequency min and max bands, and i am about to compute N equally distanced mel values between these two frequency values. So i wrote
f_low=1000;
f_high=fs/2;
filt_num=26; % number of filters
stp=round(f_high/filt_num); % step
f=f_low:stp:f_high; % my frequency vector
but i can't divide equally my f vector, maybe there is a function in matlab that does it , or am i missing something? Please help and thanks in advance.
A bit of digging around leads me to believe you want a linearly spaced vector with filt_num entries, starting at f_low and ending at f_high. You should use linspace for that as follows:
f = linspace(f_low,f_high,filt_num);
This is essentially the same as your last two lines of code. Keep in mind your code only works when f_high is larger than f_low. linspace does not have this issue, as it also supports descending vectors.

Mahalanobis distance in Matlab

I am trying to find the Mahalanobis distance of some points from the origin.The MATLAB command for that is mahal(Y,X)
But if I use this I get NaN as the matrix X =0 as the distance needs to be found from the origin.Can someone please help me with this.How should it be done
I think you are a bit confused about what mahal() is doing. First, computation of the Mahalanobis distance requires a population of points, from which the covariance will be calculated.
In the Matlab docs for this function it makes it clear that the distance being computed is:
d(I) = (Y(I,:)-mu)*inv(SIGMA)*(Y(I,:)-mu)'
where mu is the population average of X and SIGMA is the population covariance matrix of X. Since your population consists of a single point (the origin), it has no covariance, and so the SIGMA matrix is not invertible, hence the error where you get NaN/Inf values in the distances.
If you know the covariance structure that you want to use for the Mahalanobis distance, then you can just use the formula above to compute it for yourself. Let's say that the covariance you care about is stored in a matrix S. You want the distance w.r.t. the origin, so you don't need to subtract anything from the values in Y, all you need to compute is:
for ii = 1:size(Y,1)
d(ii) = Y(ii,:)*inv(S)*Y(ii,:)'; % Where Y(ii,:) is assumed to be a row vector.'
end