How to find relationship between two distribution curves - matlab

I have some floating data (represented by blue curve), when I do some loss compression, the yellow curve can be obtained (mean,standard deviation).
My aim is to minimize this losses after compression process, Hence, I would like to find an equation/curve/filter that:
the yellow curve times "function" nearly equal to blue Gaussian curve.
or
blue curve = Function(green curve)
Thanks for your help!

The best way is to do Kolmogorov–Smirnov test. It compares the maximum difference between the cumulative distributions of the two input vectors.
You can start to play with this test using the implementation in Matlab called [h p k]=kstest2(dist1, dist2) You should be looking at the k value which is the test statistic, it denotes the maximum difference between the 2 empirical cumulative distributions. If you want to visualise how is this difference calculated,
cdfplot(dist1)
hold on
cdfplot(dist2)
hold off
you will see the two cumulative distributions in the same plot. The maximum differene between them is k. If the relationship between the 2 distributions are high the lower the gap is and k value tends to be 1 and in case of highly different distributions the value moves towards 0 and away from 1.
Hope it helps.
If you have found any more interesting methods, kindly let me know.

Related

Dividing a normal distribution into regions of equal probability in Matlab

Consider a Normal distribution with mean 0 and standard deviation 1. I would like to divide this distribution into 9 regions of equal probability and take a random sample from each region.
It sounds like you want to find the values that divide the area under the probability distribution function into segments of equal probability. This can be done in matlab by applying the norminv function.
In your particular case:
segmentBounds = norminv(linspace(0,1,10),0,1)
Any two adjacent values of segmentBounds now describe the boundaries of segments of the Normal probability distribution function such that each segment contains one ninth of the total probability.
I'm not sure exactly what you mean by taking random numbers from each sample. One approach is to sample from each region by performing rejection sampling. In short, for each region bounded by x0 and x1, draw a sample from y = normrnd(0,1). If x0 < y < x1, keep it. Else discard it and repeat.
It's also possible that you intend to sample from these regions uniformly. To do this you can try rand(1)*(x1-x0) + x0. This will produce problems for the extreme quantiles, however, since the regions extend to +/- infinity.

Estimating skewness of histogram in MATLAB

What test can I do in MATLAB to test the spread of a histogram? For example, in the given set of histograms, I am only interested in 1,2,3,5 and 7 (going from left to right, top to bottom) because they are less spread out. How can I obtain a value that will tell me if a histogram is positively skewed?
It may be possible using Chi-Squared tests but I am not sure what the MATLAB code will be for that.
You can use the standard definition of skewness. In other words, you can use:
You compute the mean of your data and you use the above equation to calculate skewness. Positive and negative skewness are like so:
Source: Wikipedia
As such, the larger the value, the more positively skewed it is. The more negative the value, the more negatively skewed it is.
Now to compute the mean of your histogram data, it's quite simple. You simply do a weighted sum of the histogram entries and divide by the total number of entries. Given that your histogram is stored in h, the bin centres of your histogram are stored in x, you would do the following. What I will do here is assume that you have bins from 0 up to N-1 where N is the total number of bins in the histogram... judging from your picture:
x = 0:numel(h)-1; %// Judging from your pictures
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
skew = ((1/num_entries)*(sum((h.*x - mu).^3))) / ...
((1/(num_entries-1))*(sum((h.*x - mu).^2)))^(3/2);
skew would contain the numerical measure of skewness for a histogram that follows that formula. Therefore, with your problem statement, you will want to look for skewness numbers that are positive and large. I can't really comment on what threshold you should look at, but look for positive numbers that are much larger than most of the histograms that you have.

why is perfcurve() matlab function giving me straight lines and not a normal curve as expected?

I am trying to build a receiver operating characteristic (ROC) curves to evaluate the discriminating ability of my classifier to correctly classify diseased and non-diseased subjects.
I understand that the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. My experiments gave me quite desirable value of area under curve (auc), i.e. 0.86458. However, the ROC curve (in which I included the cut-off points for tracing purposes) seems quite strange as it gave me straight lines as below:
... and not a curve I expected and as I normally see from any references like this:
Does it hav something to do with the number of observations used? (in this case I only have 50 samples). Or is this just fine as long as the the auc value is high and that the 'curve' comes above the 45-degree diagonal of the ROC space? I would be glad if someone can share their thoughts about it. Thank you!
By the way, I used the perfcurve() function in matlab:
% ROC comparison between the proposed approach and the baseline
[X1,Y1,T1,auc1,OPTROCPT1,SUBY5,SUBYNAMES1] = perfcurve(testLabel,predlabel_prop,1);
[X2,Y2,T2,auc2,OPTROCPT2,SUBY2,SUBYNAMES2] = perfcurve(testLabel,predLabel_base,1);
figure;
plot(X1,Y1,'-r*',X2,Y2,'--ko');
legend('proposed approach','baseline','Location','east');
xlabel('False positive rate'); ylabel('True positive rate')
title('ROC comparison of the proposed approach and the baseline')
text(0.6,0.3,{'* - proposed method',strcat('Area Under Curve = ',...
num2str(auc1))},'EdgeColor','r');
text(0.6,0.15,{'o - baseline',strcat('Area Under Curve = ',num2str(auc2))},'EdgeColor','k');
You probably have too litte data.
You curve indicates your data set has 13 negative and 5 positive examples (in your test set?)
Furthermore, all but 4 have exactly the same score (maybe 0)? Or is that your cutoff?
Given this small sample size, I would not accept the hypothesis that your proposed method is better than the baseline, but accept the alternative - the methods perform as good as the other: the difference of 0.04 is much too small for this tiny sample size, the results are virtually identical. Any variation within the cut-off area (the diagonal part) can be much larger than this 0.04... On a different run, a different test set, the results may be the other way around.
Shape of your curve is just a result of high explanatory power of your model and limited number of observations (e.g. take a look at the example here http://nl.mathworks.com/help/stats/perfcurve.html).

Finding defined peaks with Clusters in MATLAB

this is my problem:
I have the next data "A", which looks like:
As you can see, I have drawn with red circles the apparently peaks, the most defined are 2 and 7, I say that they are defined because its standard deviation is low in comparison with the other peaks (especially the second one).
What I need is a way (anyway) to get the values and the standard deviation of n peaks in a numeric array.
I have tried with "clusters", but I got no good results:
First of all, I used "kmeans" MATLAB function, and I realize that this algorithm doesn't group peaks as I need. As you can see in the picture above, in the red circle, that cluster has at less 3 or 4 peaks. And kmeans need that you set the number of clusters, and I need to identify it automatically.
I hope that anyone can give me some ideas, or a way to get better results, thanks.
Pd: I leave the data "A" in the next link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance. But variance in X is something entirely different than variance in Y, isn't it? Furthermore, each of these methods will split your data in both X and Y, whereas I assume you want the data to be partitioned on the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and center X. Then find minima and maxima on the sum of these Gaussians. See Wikipedia, Kernel Density Estimation for details; except that you want to use the Y axis as weights. You could maybe also use 1/Y as standard deviation, if you don't want to use weights.

How to calculate residuals for two curves (matrixes) of different size?

I've got a theoretical curve which was calculated numerically and an experimental curve (better to say a massive of experimental points). I need to calculate the residuals between these two curves to check the accuracy of modeling with the least squares sum method. These matrixes (curves) are of different size. Is there any function in MATLAB providing the calculation of residuals for two matrixes of different size?
I thought I'll just elaborate a bit on what Aabaz said in case there are others who might find this useful (Although Aabaz's explanation is probably clear enough for people who have an understanding of the necessary math etc.).
First, I'm assuming you have a 2D plot but it shouldn't be difficult to generalize to ND case.
Basically, for each point in your experimental data (xi, yi), use your "theoretical curve" to estimate yi' for the value xi. This is probably what Aabaz is referring to by making grid step size the same so that you interpolate the points exactly at the x coordinate values xi of your experimental data using the formula for your curve(s).
Next, to measure whether the fitting is good, you could for e.g. measure the sum of square differences using:
error = sum( (yi' - yi)^2 ){where i range over all points in your exp. data}
Of course other error metrics other than least square could be used to estimate how well the data fit your model (i.e. your curve) but by far for most applications, least square is the most common.
Hope this helps.