Estimating skewness of histogram in MATLAB

What test can I do in MATLAB to assess the spread of a histogram? For example, in the given set of histograms, I am only interested in 1, 2, 3, 5 and 7 (going from left to right, top to bottom) because they are less spread out. How can I obtain a value that will tell me whether a histogram is positively skewed?
It may be possible using chi-squared tests, but I am not sure what the MATLAB code for that would be.

You can use the standard definition of skewness. In other words, you can use:

skew = ( (1/n) * sum( (x_i - mu)^3 ) ) / ( (1/(n-1)) * sum( (x_i - mu)^2 ) )^(3/2)

You compute the mean of your data and use the above equation to calculate skewness. A positively skewed distribution has a longer tail on the right, and a negatively skewed one has a longer tail on the left (see the illustration in the Wikipedia article on skewness).
As such, the larger the value, the more positively skewed it is. The more negative the value, the more negatively skewed it is.
Now, computing the mean of your histogram data is quite simple: you do a weighted sum of the bin centres, weighted by the counts, and divide by the total number of entries. Given that your histogram counts are stored in h and the bin centres in x, you would do the following. I will assume that you have bins from 0 up to N-1, where N is the total number of bins, judging from your picture:
x = 0:numel(h)-1; %// Judging from your pictures
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
skew = ((1/num_entries)*(sum(h.*(x - mu).^3))) / ...
((1/(num_entries-1))*(sum(h.*(x - mu).^2)))^(3/2);
skew will contain the numerical skewness measure for the histogram, following the above formula. Therefore, given your problem statement, you will want to look for skewness values that are positive and large. I can't really comment on what threshold to use, but look for values that are much larger than those of most of your histograms.
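As a quick sanity check, here is the computation on synthetic counts of my own choosing (not the questioner's data):

h = [2 10 40 25 12 6 3 1];           % counts piled up on the left: long right tail
x = 0:numel(h)-1;                     % bin centres, as above
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
skew = ((1/num_entries)*(sum(h.*(x - mu).^3))) / ...
((1/(num_entries-1))*(sum(h.*(x - mu).^2)))^(3/2);   % positive for this shape

Recomputing with fliplr(h) should flip the sign of skew, which is a quick way to convince yourself the formula is wired up correctly.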

Related

Transforming draws in Matlab from Gaussian mixture to uniform

Consider the following draws for a 2x1 vector in Matlab with a probability distribution that is a mixture of two Gaussian components.
P=10^3; %number draws
v=1;
%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];
%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];
%Combine
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws
RV_temp = random(obj,P);%Px2
% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf');
Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing
ecdf(RV1)
ecdf(RV2)
we can see that RV1 is uniformly distributed on [0,1] (the empirical cdf is close to the 45 degree line) while RV2 is not.
I don't understand why. It seems that the farther apart mu_a(2) and mu_b(2) are, the worse the job done by ksdensity with a reasonable number of draws. Why?
When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than if the expectations were closer together, as with the N(0,v) and N(0,v) mixture you have in the other dimension. You then ask ksdensity to approximate a function using P points spread over this larger range.
As with standard linear interpolation, the denser the points, the better the approximation of the function (inside the range), and the same holds here. Thus in the N(0.5,v) and N(8.2,v) case, where the points are sparser, the approximation is worse than in the N(0,v) and N(0,v) case, where the points are denser.
As a small side note, is there any reason you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment that 5e2 points are also good. As a final comment, 1e3 is typically preferred over 10^3.
I think this is simply about the number of samples you're using. For the first component, the means of the two Gaussians are relatively close, hence a thousand samples are enough to obtain a cdf really close to the U[0,1] cdf. For the second component the means are farther apart, and you need more samples: with 100000 samples the empirical cdf of RV2 I obtained was close to the uniform cdf, while with 1000 it was clearly farther from it. Try increasing the number of samples to a million and check whether the result keeps getting closer.
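To reproduce that comparison, a minimal sketch (same model as in the question, with only P changed) would be:

P = 1e5;                                  % try 1e3 vs 1e5 and compare
obj = gmdistribution([0 0.5; 0 8.2], cat(3, eye(2), eye(2)), [0.5 0.5]);
RV_temp = random(obj, P);                 % P-by-2 draws from the mixture
RV2 = ksdensity(RV_temp(:,2), RV_temp(:,2), 'function', 'cdf');
ecdf(RV2)                                 % closer to the 45-degree line as P grows

With P = 1e3 the stair plot visibly departs from the diagonal for the second component; with P = 1e5 it should be much closer.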

Kullback-Leibler Divergence between 2 Histograms from an image (MATLAB)

I pulled histograms from images in MATLAB, and now I want to compare the histograms using KL-divergence.
I found this script but I do not understand how I could apply it to my case.
So here is how I pull my histogram (pretty simple!!):
[N,X]=hist(I,n);
[N1,X1]=hist(I1,n);
KLDiv(N,N1)
% ans=inf
N is the histogram of my image I.
As you can see, my result is inf...
Please can you tell me in my case how to use the script?
Thanks
You probably want to calculate the histogram of an image using imhist, instead of the column-wise histograms that hist computes on a matrix:
I1 = rand(10);
I2 = rand(10);
[N1, X1] = imhist(I1, 10); % limit the number of bins to avoid zero values
[N2, X2] = imhist(I2, 10);
KLDiv(N1.', N2.') % convert to row vectors to correspond with the requested format
KLDiv(N1.', N1.') % the divergence of a histogram with itself is indeed zero
Note that I limited the number of bins to be sure that each bin has at least one point, because the Kullback-Leibler divergence is not defined if Q(i) is zero and P(i) is not:
The Kullback–Leibler divergence is defined only if Q(i)=0 implies
P(i)=0, for all i (absolute continuity).
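If you prefer not to rely on the linked script, a minimal direct computation, assuming both histograms share the same bins and are first normalized to probabilities, could look like this (variable names are mine):

P = N1 / sum(N1);                % normalize counts to probabilities
Q = N2 / sum(N2);
idx = P > 0;                     % terms with P(i) = 0 contribute 0 by convention
kl = sum(P(idx) .* log(P(idx) ./ Q(idx)));  % natural log, so the result is in nats

Note that kl is Inf whenever some Q(i) = 0 where P(i) > 0, which is exactly what produced ans = inf in the question.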
Notes
Range of Kullback–Leibler divergence?
Any non-negative number; it is zero if (and only if) the two distributions are equal: KLD >= 0.
To which base should I take the logarithm? Natural logarithm log or base 2 logarithm log2?
Note that it is just a matter of scaling your results, so in fact it doesn't matter; just be sure to use the same logarithm if you want to compare your results. Wikipedia suggests the following:
logarithms in these formulae are taken to base 2 if information is
measured in units of bits, or to base e if information is measured in
nats.
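Concretely, converting between the two bases is a single scaling:

kl_bits = kl_nats / log(2);   % since log2(x) = log(x)/log(2)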

Dividing a normal distribution into regions of equal probability in Matlab

Consider a Normal distribution with mean 0 and standard deviation 1. I would like to divide this distribution into 9 regions of equal probability and take a random sample from each region.
It sounds like you want to find the values that divide the area under the probability density function into segments of equal probability. This can be done in MATLAB with the norminv function.
In your particular case:
segmentBounds = norminv(linspace(0,1,10),0,1)
Any two adjacent values of segmentBounds now describe the boundaries of segments of the Normal probability distribution function such that each segment contains one ninth of the total probability.
I'm not sure exactly what you mean by taking random numbers from each region. One approach is to sample from each region by rejection sampling: for each region bounded by x0 and x1, draw y = normrnd(0,1); if x0 < y < x1, keep it, else discard it and repeat. A sketch follows below.
It's also possible that you intend to sample uniformly from these regions. To do this you can try rand(1)*(x1-x0) + x0. This will cause problems for the extreme quantiles, however, since those regions extend to +/- infinity.
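A minimal rejection-sampling sketch along those lines (one sample per region; the loop structure and variable names are mine):

segmentBounds = norminv(linspace(0,1,10), 0, 1);   % 10 boundaries -> 9 regions
samples = zeros(1, 9);
for k = 1:9
    x0 = segmentBounds(k);
    x1 = segmentBounds(k+1);
    y = normrnd(0, 1);
    while ~(y > x0 && y < x1)      % reject draws that fall outside the region
        y = normrnd(0, 1);
    end
    samples(k) = y;
end

Since each region carries probability 1/9, rejection needs about nine draws per accepted sample on average, which is cheap here but would not scale to many narrow regions.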

Finding defined peaks with Clusters in MATLAB

This is my problem: I have the following data "A" (plot omitted; it shows a noisy signal with several peaks, circled in red).
As you can see, I have drawn red circles around the apparent peaks; the most defined are 2 and 7. I say that they are defined because their standard deviation is low in comparison with the other peaks (especially for the second one).
What I need is a way (anyway) to get the values and the standard deviation of n peaks in a numeric array.
I have tried with "clusters", but I got no good results:
First of all, I used the kmeans MATLAB function, and I realized that this algorithm doesn't group the peaks as I need: the cluster I circled in red contains at least 3 or 4 peaks. Also, kmeans requires you to set the number of clusters, whereas I need that number to be identified automatically.
I hope someone can give me some ideas, or a way to get better results. Thanks.
PS: I leave the data "A" at the following link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance, but variance in X is something entirely different from variance in Y, isn't it? Furthermore, such a method will split your data in both X and Y, whereas I assume you want the data to be partitioned along the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and centre X, then find the minima and maxima of the sum of these Gaussians. See the Wikipedia article on kernel density estimation for details, except that you want to use the Y values as weights. You could perhaps also use 1/Y as the standard deviation, if you don't want to use weights.
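Although the answer above stays away from MATLAB, the idea can be sketched with ksdensity, which in recent MATLAB versions accepts a 'Weights' argument. This sketch assumes A is a vector of nonnegative heights sampled at positions 1:numel(A); the grid size and variable names are mine:

pos = (1:numel(A)).';                        % X: sample positions
grid = linspace(1, numel(A), 1000);
d = ksdensity(pos, grid, 'Weights', A(:));   % Y-weighted density over X (weights must be nonnegative)
d = d(:).';                                  % force a row vector
isMin = [false, d(2:end-1) < d(1:end-2) & d(2:end-1) < d(3:end), false];
boundaries = grid(isMin);                    % low-density cuts between clusters

Points falling between consecutive boundaries form one cluster, and the peak value and standard deviation can then be computed per cluster.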

Why does the kernel smoothing function, ksdensity, in MATLAB result in values greater than one?

I have a set of samples, S, and I want to find its PDF. The problem is that when I use ksdensity I get values greater than one!
[f,xi] = ksdensity(S)
In the array f, most of the values are greater than one! Could you please tell me what the problem might be? Thanks for your help.
For example:
S=normrnd(0.3035, 0.0314,1,1000);
ksdensity(S)
ksdensity, as the name says, estimates a probability density function over a continuous variable. Probability densities can be larger than 1, they can actually have arbitrary values from zero upwards. The constraint on probabilities is that their sum over an exhaustive range of possibilities has to be 1. For probability densities, the constraint is that the integral over the whole range of values is 1.
A crude approximation of the integral of the pdf estimated by ksdensity can be obtained in MATLAB like this:
sum(f) * min(diff(xi))
assuming that the values in xi are equally spaced. The value of this expression should be approximately 1.
If in your application you believe this approximation is not close enough to 1, you might want to specify the grid of estimation points (second parameter pts) such that the spacing is finer or the range is wider than the one automatically generated by ksdensity.
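For the example in the question, the check might look like this (trapz is an alternative that does not assume equal spacing):

S = normrnd(0.3035, 0.0314, 1, 1000);
[f, xi] = ksdensity(S);
max(f)            % well above 1: the true density peaks near 1/(0.0314*sqrt(2*pi)), roughly 12.7
trapz(xi, f)      % the integral over the estimation grid, approximately 1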