Matlab ksdensity point range - matlab

I am using this form of the ksdensity function in MATLAB.
[f,xi] = ksdensity(x)
The documentation says that "f is the vector of density values evaluated at the points in xi ... The density is evaluated at 100 equally spaced points that cover the range of the data in x."
Now, my xi values cover a much larger range than the data in x. Why is this?
For my data,
>> min(x)
ans =
-2.2588
>> min(xi)
ans =
-6.8010
>> max(x)
ans =
6.5326
>> max(xi)
ans =
11.0748
I know I can specify an xi range myself, but why is it not equally spaced between min and max of x by default?
It makes it hard to compare histogram estimators and kernel estimators when the bins in the histogram only cover the range of x, whereas the test points given from ksdensity exceed this range.

ksdensity smooths the data with a kernel (Gaussian by default), so the estimate reaches beyond the data on both sides. As an illustration, check the output for a single data point (the impulse response):
[f,xi] = ksdensity(0);
plot(xi,f)
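For this single point the width of the Gaussian is 1, which means that about 3 is added on both sides of the data (which can make it appear as if there were negative values in strictly positive data). In general the default width is estimated from the data, so the amount added differs per data set; in your example xi extends 4.5422 past both min(x) and max(x). Thus, if your data span only a small range, you need to shrink the kernel width in ksdensity, and if you want the estimate only over the range of x, pass your own evaluation points.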
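A minimal sketch of evaluating the kernel estimate only between min(x) and max(x), so that it lines up with a histogram of the same data. The sample data, the bin count of 20, and the variable names are assumptions for illustration; ksdensity requires the Statistics Toolbox.

```matlab
% Hypothetical sample data; replace with your own vector x.
rng(1);
x = 2 + 1.5*randn(500, 1);

% Evaluate the kernel estimate only between min(x) and max(x).
xi = linspace(min(x), max(x), 100);
f = ksdensity(x, xi);

% Histogram scaled to a density over the same range, for comparison.
[counts, centers] = hist(x, 20);
binWidth = centers(2) - centers(1);
bar(centers, counts / (sum(counts) * binWidth), 1);
hold on
plot(xi, f, 'r', 'LineWidth', 1.5);
hold off
```

Restricting xi only changes where the estimate is evaluated; the kernel still places some probability mass outside [min(x), max(x)], so the plotted curve does not integrate to exactly 1 over that range.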

Related

Kullback-Leibler Divergence between 2 Histograms from an image (MATLAB)

I pulled histograms from images in MATLAB, and now I want to compare the histograms using the KL divergence.
I found this script but I do not understand how I could apply it to my case.
So here is how I pull my histograms (pretty simple!):
[N,X]=hist(I,n);
[N1,X1]=hist(I1,n);
KLDiv(N,N1)
% ans=inf
N is the histogram of my image I, and N1 is the histogram of I1.
As you can see, my result is inf.
Can you please tell me how to use the script in my case?
Thanks
You probably want to calculate the histogram of an image using imhist, instead of the columnwise calculation of the histogram:
I1 = rand(10);
I2 = rand(10);
[N1, X1] = imhist(I1, 10); % limit the number of bins to avoid zero values
[N2, X2] = imhist(I2, 10);
KLDiv(N1.', N2.') % convert to row vectors to correspond with the requested format
KLDiv(N1.', N1.') % the divergence of a histogram with itself is indeed zero
Note that I limited the number of bins to be sure that each bin has at least one point, because the Kullback-Leibler divergence is not defined if Q(i) is zero and P(i) is not:
The Kullback–Leibler divergence is defined only if Q(i)=0 implies
P(i)=0, for all i (absolute continuity).
Notes
Range of Kullback–Leibler divergence?
Any positive number, zero if (and only if) they are equal: KLD >= 0.
To which base should I take the logarithm? Natural logarithm log or base 2 logarithm log2?
Note that it is just a matter of scaling your results. So in fact, it doesn't matter, but be sure to use the same logarithm if you want to compare your results. Wikipedia suggests the following:
logarithms in these formulae are taken to base 2 if information is
measured in units of bits, or to base e if information is measured in
nats.
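To make the definition concrete, here is a minimal sketch of the computation on two count vectors. It is not the KLDiv script from the question; the counts and variable names are made up for illustration, and terms with P(i)=0 are taken to contribute zero.

```matlab
% Hypothetical bin counts of two histograms with the same bins.
N1 = [4 6 10 5 5];
N2 = [5 5 10 6 4];

% Normalize the counts to probabilities.
P = N1 / sum(N1);
Q = N2 / sum(N2);

if any(Q == 0 & P > 0)
    % Absolute continuity is violated: the divergence is infinite.
    kl_nats = Inf;
else
    nz = P > 0;                                   % bins with P(i)=0 contribute nothing
    kl_nats = sum(P(nz) .* log(P(nz) ./ Q(nz)));  % natural log: result in nats
end
kl_bits = kl_nats / log(2);                       % same quantity measured in bits
```

The two results differ only by the constant factor log(2), which is why the choice of base does not matter as long as it is used consistently.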

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
% define the possible values for the (x,y) pair
row_vals = (1:size(A,1))' * ones(1,size(A,2)); % all x values
col_vals = ones(size(A,1),1) * (1:size(A,2));  % all y values
% convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
% calculate your fake 1D CDF; assumes sum(A(:))==1
CDF = cumsum(A); % remember, the first term out of cumsum is not zero
% because of the operation we're doing below (interp1 followed by ceil)
% we need the CDF to start at zero
CDF = [0; CDF(:)];
% generate random values
N_vals = 1000; % give me 1000 values
rand_vals = rand(N_vals,1); % spans zero to one
% look into the CDF to see which index each random value corresponds to
% (interp1 needs distinct CDF values, so this assumes A has no zero entries)
out_val = interp1(CDF, linspace(0,1,length(CDF)), rand_vals); % spans zero to one
ind = ceil(out_val*length(A));
% using the indices, you can look up each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
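If A contains zero entries, the cumulative sum has repeated values and the interp1 lookup cannot be used directly. A hedged alternative, swapping the interpolation for a direct comparison against the CDF, might look like this. The matrix A here is made-up test data and the variable names are illustrative only.

```matlab
% Hypothetical 100x100 probability matrix with some zero cells.
rng(2);
A = rand(100);
A(A < 0.3) = 0;
A = A / sum(A(:));

% Cumulative sum over the flattened (column-major) matrix.
CDF = cumsum(A(:));
CDF(end) = 1;   % guard against round-off in the last entry

% For each uniform draw, find the first index whose CDF value is >= the draw
% by counting how many CDF values lie strictly below it.
N_vals = 1000;
rand_vals = rand(N_vals, 1);
ind = sum(bsxfun(@gt, rand_vals, CDF.'), 2) + 1;

% Convert the linear indices back to (row, column) pairs.
[row_ind, col_ind] = ind2sub(size(A), ind);
xy_values = [row_ind, col_ind];
```

The comparison builds an N_vals-by-numel(A) logical matrix, which is fine for a 100x100 grid but would need a loop or a sorted lookup for much larger grids.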
I don't believe MATLAB has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated by inverting the cumulative distribution function, that inversion does not carry over directly to multivariate distributions, so generating such numbers is much messier (the main problem is that two or more variables may be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using MATLAB:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
% integrate A.*arg over the mesh; arg is a scalar or an array the size of A
if ~isscalar(arg) && ~isequal(size(arg),size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
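The same double-trapz idea extends to second moments. The sketch below uses a single Gaussian hump as a made-up test density (not the two-hump example above) so the results can be checked against known values; the helper name int2 is an assumption for illustration.

```matlab
% Mesh as in the example above: 101 points in x, 100 points in y.
xv = linspace(-50, 50, 101);
yv = linspace(-30, 30, 100);
[x, y] = meshgrid(xv, yv);

% Hypothetical test density: one Gaussian hump centered at (10, 0).
A = exp(-((x - 10).^2 + y.^2) / 100);

% Double integral over the mesh: inner call integrates over y, outer over x.
int2 = @(F) trapz(xv, trapz(yv, F));
A = A / int2(A);                 % normalize so that the density integrates to 1

% First moments.
mean_x = int2(A .* x);
mean_y = int2(A .* y);

% Central second moments.
var_x  = int2(A .* (x - mean_x).^2);
var_y  = int2(A .* (y - mean_y).^2);
cov_xy = int2(A .* (x - mean_x) .* (y - mean_y));
```

For this test density the untruncated analytic values are a mean of (10, 0) and a variance of 50 in each direction, so the numerical results should land close to those, up to truncation of the tails at the mesh boundary.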

Finding probability of Gaussian random variable with range

I am a beginner in MATLAB trying to use it to help me understand random signals. I was doing some normal probability density function calculations until I came across this problem:
Write a MATLAB program to calculate the probability Pr(x1 ≤ X ≤ x2)
if X is a Gaussian random variable for an arbitrary x1 and x2.
Note that you will have to specify the mean and variance of the Gaussian random
variable.
I usually use the built-in normpdf function, selecting my mean and variance of course, but since this one involves a range I am not sure what I can do to find the answer.
Y = normpdf(X,mu,sigma)
If you recall from probability theory, you know that the Cumulative Distribution Function sums up probabilities from -infinity up until a certain point x. Specifically, the CDF F(x) for a probability distribution P with random variable X evaluated at a certain point x is defined as:
F(x) = Pr(X ≤ x) = Σ_{t ≤ x} Pr(X = t)
Note that I am assuming that we are dealing with the discrete case. Also, let's call this a left-tail sum, because it sums all of the probabilities from the left of the distribution up until the point x. Consequently, this defines the area underneath the curve up until point x.
Now, what your question is asking is to find the probability over a certain range x1 <= x <= x2, not just a left-tail sum (<= x). If x1 <= x2, the area up to the end point x1, that is, the probability of all events up to and including x1, is also part of the area up to the end point x2. Because you want the probability over the range, you need to accumulate all events that happen between x1 and x2, so you want the area under the PDF curve that lies between them: the part that is greater than x1 and less than x2.
Here's a pictorial example:
[Figure: Gaussian PDF (top) and CDF (bottom). Source: ReliaWiki]
The top figure is the PDF of a Gaussian distribution function, while the bottom figure denotes the CDF of a Gaussian distribution. You see that if x1 <= x2, the area defined by the point at x1 is also captured by the point at x2. Here's a better graph:
[Figure. Source: Introduction to Statistics]
Here the CDF is continuous instead of discrete, but the result is still the same. If you want the area between two points, and ultimately the probability of falling between them, you need to take the CDF value at x2 and subtract the CDF value at x1. You want the remaining area, so you just subtract the two left-tail areas:
Pr(x1 ≤ X ≤ x2) = F(x2) − F(x1)
As such, to calculate the CDF of the Gaussian distribution use normcdf and specify the mean and standard deviation of your Gaussian distribution. Therefore, you just need to do this:
y = normcdf(x2, mu, sigma) - normcdf(x1, mu, sigma);
x1 and x2 are the values of the interval that you want to calculate the sum of probabilities under.
You can use erf,
mu = 5;
sigma = 3;
x1 = 3;
x2 = 8;
p = .5*(erf((x2-mu)/sigma/2^.5) - erf((x1-mu)/sigma/2^.5));
The error function is defined like this in MATLAB:
erf(x) = 2/sqrt(pi) * (integral from 0 to x of exp(-t^2) dt)
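As a quick sanity check that needs no toolbox, the erf expression can be compared with a direct numerical integration of the Gaussian PDF. The helper names gauss_prob and gauss_pdf are made up for this sketch, and sigma is taken to be the standard deviation, not the variance.

```matlab
% Hypothetical parameters; sigma is the standard deviation.
mu = 5; sigma = 3; x1 = 3; x2 = 8;

% Closed form via the error function.
gauss_prob = @(a, b, m, s) 0.5 * (erf((b - m) / (s * sqrt(2))) ...
                                - erf((a - m) / (s * sqrt(2))));
p_erf = gauss_prob(x1, x2, mu, sigma);

% Numerical check: integrate the Gaussian PDF from x1 to x2.
gauss_pdf = @(x) exp(-(x - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));
p_num = integral(gauss_pdf, x1, x2);

fprintf('erf: %.6f, integral: %.6f\n', p_erf, p_num);
```

If the problem statement gives the variance instead, pass sqrt(variance) as the last argument.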

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI, and did a t-test and a Kolmogorov-Smirnov test to assess whether the difference between the distributions was significant. This seemed very intuitive to me; however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what exactly the difference between the two ways of calculating the Mahalanobis distance is.
Reply to @3lectrologos (too long for a comment):
You mean this: d(I) = (Y(I,:)-mu)*inv(SIGMA)*(Y(I,:)-mu)'? This is just the formula for calculating the Mahalanobis distance, so it should be the same for the pdist2() and mahal() functions. I think mu is the mean vector and SIGMA the covariance matrix, both based on the reference distribution as a whole, in both pdist2() and mahal(). Only, in mahal() you are comparing each point of your sample set to the points of the reference distribution, while in pdist2() you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives a 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs a 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','averaged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
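The difference between the centroid-based and the pairwise computation can be reproduced with base MATLAB operations only. This is a sketch under assumptions: the data are random stand-ins for the omitted matrices, and the formulas follow the definition quoted in the thread rather than calling mahal() or pdist2() themselves.

```matlab
% Hypothetical data standing in for the 50x3 and 60x3 matrices.
rng(0);
X = randn(50, 3);            % reference sample
Y = randn(60, 3) + 0.5;      % sample to compare against X

mu = mean(X, 1);             % 1x3 mean of the reference
S = cov(X);                  % 3x3 covariance of the reference

% Squared distance of each row of Y to the centroid of X (60x1).
Yc = bsxfun(@minus, Y, mu);
d2 = sum((Yc / S) .* Yc, 2);

% Pairwise (unsquared) distances between rows of X and rows of Y (50x60).
E = zeros(size(X, 1), size(Y, 1));
for j = 1:size(Y, 1)
    D = bsxfun(@minus, X, Y(j, :));
    E(:, j) = sqrt(sum((D / S) .* D, 2));
end

% Average pairwise distance per point of Y versus distance to the centroid.
F = mean(E, 1).';
gap = F - sqrt(d2);          % nonnegative for every point of Y
```

Two things keep the averaged pairwise values from matching the centroid values: the centroid version is squared, and the average of distances to many points is never smaller than the distance to their mean, so gap is nonnegative even after taking the square root.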
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each data point from matrix Y to each data point of matrix X, with X as the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.

Can I adjust spectrogram frequency axes?

The MATLAB documentation examples for the spectrogram function have the frequency axis set to [0 500]. Can I change this to something like [0 100]? Obviously running the axis command will do this for me, but that adjusts the end result and "blows up" the resultant plot, making it pixelated. I am basically looking to build a spectrogram that only looks at frequencies between 0 and 100, not to rescale after building the spectrogram.
Here's an example from that documentation:
T = 0:0.001:2;
X = chirp(T,0,1,150);
spectrogram(X,256,250,256,1E3,'yaxis');
This produces the following:
Everything above 350 Hz is unneeded. Is there a way to not include everything between 350 and 500 when building the spectrogram, rather than adjusting the axes after the fact?
From the documentation:
[S,F,T] = spectrogram(x,window,noverlap,F) uses a vector F of frequencies in Hz. F must be a vector with at least two elements. This case computes the spectrogram at the frequencies in F using the Goertzel algorithm. The specified frequencies are rounded to the nearest DFT bin commensurate with the signal's resolution. In all other syntax cases where nfft or a default for nfft is used, the short-time Fourier transform is used. The F vector returned is a vector of the rounded frequencies. T is a vector of times at which the spectrogram is computed. The length of F is equal to the number of rows of S. The length of T is equal to k, as defined above and each value corresponds to the center of each segment.
Does that help you?
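Applied to the example from the question, the frequency-vector syntax might be used as follows. This is a sketch: the 0 to 100 Hz grid with 1 Hz spacing and the plotting choices are assumptions, and chirp and spectrogram require the Signal Processing Toolbox.

```matlab
% Example signal from the question: a chirp sampled at 1 kHz.
Fs = 1e3;
T = 0:1/Fs:2;
X = chirp(T, 0, 1, 150);

% Evaluate the spectrogram only at 0, 1, ..., 100 Hz.
freqs = 0:100;
[S, F, Tseg] = spectrogram(X, 256, 250, freqs, Fs);

% Plot the magnitude in dB; rows of S correspond to the entries of F.
imagesc(Tseg, F, 20*log10(abs(S) + eps));
axis xy
xlabel('Time (s)');
ylabel('Frequency (Hz)');
```

Only the requested band is computed, so nothing has to be cropped afterwards.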
The FFT is so fast that it is better to increase the resolution and then just discard the unwanted data. If you need better spectral resolution (more frequency bins), increase the FFT size. To get a smoother-looking spectrum in the time dimension, increase the noverlap value to reduce the increment between consecutive FFTs. In this case you would not specify F. If the FFT size is 1024 then you get 1024/2+1 frequency bins.
% y is the signal and Fs its sampling rate in Hz (both assumed to be defined)
FFTN = 512;
WIN_SIZE = FFTN;
overlap = floor(FFTN*0.8);
[~,F,T,P] = spectrogram(y, WIN_SIZE, overlap, FFTN, Fs); % F is in Hz because Fs is passed
keep = F >= 350; % only care about freq bins above this value; change the test to keep another band
P = P(keep,:);
f = F(keep);
imagesc(T,f,10*log10(P),[-70 20]);