In Matlab, How to divide multivariate Gaussian distributions to separate Gaussians? - matlab

I have an image with multivariate Gaussian distribution in histogram. I want to segment the image to two regions so that they both can follow the normal distribution like the red and blue curves shows in histogram. I know Gaussian mixture model potentially works for that. I tried to use fitgmdist function and then clustering the two parts but still not work well. Any suggestion will be appreciated.
Below is the Matlab code for my appraoch.
% Read Image
I = imread('demo.png');
I = rgb2gray(I);
data = I(:);
% Fit a gaussian mixture model
obj = fitgmdist(data,2);
idx = cluster(obj,data);
cluster1 = data(idx == 1,:);
cluster2 = data(idx == 2,:);
% Display Histogram
histogram(cluster1)
histogram(cluster2)

Your solution is correct
The way you are displaying your histogram poorly represents the detected distributions.
Normalize the bin sizes because histogram is a frequency count
Make the axes limits consistent (or plot on same axis)
These two small changes show that you're actually getting a pretty good distribution fit.
histogram(cluster1,0:.01:1); hold on;
histogram(cluster2,0:.01:1);
Re-fit a gaussian-curve to each cluster
Once you have your clusters if you treat them as independent distributions, you can smooth the tails where the two distributions merge.
gcluster1 = fitdist(cluster1,'Normal');
gcluster2 = fitdist(cluster2,'Normal');
x_values = 0:.01:1;
y1 = pdf(gcluster1,x_values);
y2 = pdf(gcluster2,x_values);
plot(x_values,y1);hold on;
plot(x_values,y2);

How are you trying to use this 'model'? If the data is constant, then why dont you measure, the mean/variances for the two gaussians seperately?
And if you are trying to generate new values from this mixed distribution, then you can look into a mixture model with weights given to each of the above distributions.

Related

How do I fit distributions to data sets in matlab?

I'm trying to find a fit to my data using matlab but I'm having a lot of trouble, here's what ive done so far:
A = load('homicide_crime.txt') % A is a two column array the first column is the year and the second column is crime in that year
norm_crime = (A(:,2)-mean(A(:,2)))/std(A(:,2));
[f,x]=hist(norm_crime,20);
plot(x,f/trapz(x,f))
y=normpdf(x,0,1);
hold on
plot(x,y)
This is the resulting plot
.
So i tried afterwards using the distribution fitter which gave me this.
Neither of these things look right since the peak aren't aligned and the fit is too small.
Here is the data set for anyone intrested
https://pastebin.com/CyddrN1R.
Any help is much appreciated.
Actually, I think you are confusing data transformation with distribution fitting.
DATA TRANSFORMATION
In this approach, data is manipulated through a non-linear transformation in order to achieve a perfect fit. This means that it forces your data to follow the chosen distribution rule. To accomplish this with a normal distribution, all you have to do is applying the following code:
A = load('homicide_crime.txt');
years = A(:,1);
crimes = A(:,2);
figure(),histfit(crimes);
rank = tiedrank(crimes);
p = rank ./ (numel(rank) + 1);
crimes_normal = norminv(p,0,1);
figure(),histfit(crimes_normal);
Using the following manipulation:
crimes_normal = (crimes - mean(crimes)) ./ std(crimes);
that can also be written as:
crimes_normal = zscore(crimes);
you modify your observations so that they have mu=0 and sigma=1, but this is far from making them perfectly fit a normal distribution.
DISTRIBUTION FITTING
In this approach, the parameters of the chosen distribution are calculated over the given dataset, and then random observations are drawn. On one side you have your empirical observations, and on the other side you have your fitted data. A goodness-of-fit test can finally tell you how well empirical observations fit the given distribution comparing them to theoretical observations.
Since your are working with a normal distribution, you know that it is fully described by two parameters: mu and sigma. Hence:
A = load('homicide_crime.txt');
years = A(:,1);
crimes_emp = A(:,2);
[mu,sigma] = normfit(crimes_emp);
% you can also use
% mu = mean(crimes);
% sigma = std(crimes);
% to achieve the same result
[f,x] = hist(crimes_emp);
crimes_the = normpdf(x,mu,sigma) .* max(f);
figure();
bar(x, (f ./ sum(f)));
hold on;
plot(x,crimes_the,'-r','LineWidth',2);
hold off;
And this returns something very close to the problem you originally noticed. As you can clearly see, without even running a Kolmogorov-Smirnov or an Anderson-Darling... your data doesn't fit a normal distribution well.
You can try a non-parametric density estimation method. I used kernel density estimation (KDE) with the default normal distribution as the kernel, to obtain the result as shown below. The Matlab command for the same is ksdensity() and documentation can be found here.
A = load('homicide_crime.txt') % load data
years = A(:,1);
values = A(:,2);
[f0,x0] = hist(values,100); % plot the actual histogram
[f1,x1,b1] = ksdensity(values); % KDE with automatically assigned bandwidth
[f2,x2,b2] = ksdensity(values,'Bandwidth',b1*0.6); % 60% of initial bandwidth (b1)
[f3,x3,b3] = ksdensity(values,'Bandwidth',b2*0.6); % 60% of previous bandwidth (b2) = 36% of initial bandwidth (b1)
[f4,x4,b4] = ksdensity(values,'Bandwidth',b3*0.6); % 60% of previous bandwidth (b3) = 21.6% of initial bandwidth (b1)
figure; hold on;
bar(years, f0/(sum(f0)*10) ); % scaled for visualization
plot(years, f1, 'y')
plot(years, f2, 'c')
plot(years, f3, 'g')
plot(years, f4, 'r','linewidth',3) % Final fit
In the code above, I first plot the histogram and then calculate the kde without any user specified bandwidth. This leads to an oversmooth fitting (yellow curve). With a few trials by reducing the bandwidth as only 60% of the previous iteration, I finally was able to get the closest fit (red curve). You can play around the bandwidth to get a still better desirable fit.

How to use distance to extract features and compare images: : matlab

I was trying to code for feature extraction from the two images, which are actually similar. I tried to extract the intersection points from both of the image and calculated the distance from one intersection point to all other points. This procedure was iterated for all points and in both images.
Then I compared the distance between points in both images But I found that even for dissimilar images am getting same kind of distance and am not able to distinguish them.
Is there any way in this method which will improve the code or is there any other way to find the similarity.
I = bwmorph(I,'skel',Inf);
II = bwmorph(II,'skel',Inf);
[i,j] = ind2sub(size(I),find(bwmorph(bwmorph(I,'thin',Inf),'branchpoint') == 1));
[i1,j1] = ind2sub(size(II),find(bwmorph(bwmorph(II,'thin',Inf),'branchpoint') == 1));
figure,imshow(I); hold on; plot(j,i,'rx');
figure,imshow(II); hold on; plot(j1,i1,'rx')
m=size(i,1);
n=size(j,1);
m1=size(i1,1);
n1=size(j1,1);
for x=1:m
for y=1:n
d1(y,x)=round(sqrt((i(y,1)-i(x,1)).^2+(j(y,1)-j(x,1)).^2));
end
end
for x1=1:m1
for y1=1:n1
dd1(y1,x1)=round(sqrt((i1(y1,1)-i1(x1,1)).^2+(j1(y1,1)-j1(x1,1)).^2));
end
end
size(d1);
k1=reshape(d1,1,m*n);
k=sort(k1);
k=unique(k);
size(dd1);
k2=reshape(dd1,1,m1*n1);
k2=sort(k2);
k2=unique(k2);
z = intersect(k,k2)
length(z);
if length(z)>20
disp('similar images');
else
disp('dissimilar images');
end
This is a part of my code where I tried to extract features.
input1
input2
skel 1
skel2
I think your code is not the problem. Instead, it seems that either your feature descriptor is not powerful enough or your comparison method is not powerful enough, or a combination of the two. This gives us several options for how to explore solutions to the problem.
Feature Descriptor
You are constructing an image feature consisting of the distances between skeleton intersection points. This is an unusual approach and a very interesting one. It reminds me of peak constellations, a feature used by Shazam to audio-fingerprint songs. If you are interested in exploring, that more sophisticated technique, take a look at "An Industrial Strength Audio Search Algorithm" by Avery Li-Chun Wang. I believe you could adapt their feature descriptor to your application.
However, if you want a simpler solution there are some other options as well. Your current descriptor uses unique to find a set of unique distances between the skeleton intersection points. Take a look at the following images of a line and an equilateral triangle both with 5 unit line lengths. If we use the unique distances between vertices to make the feature, the two images have identical features, but we can also count the number of lines of each length in a histogram.
The histogram preserves more of the image structure as part of the feature. Using a histogram might help distinguish better between your similar and dissimilar cases.
Here's some demo code for histogram features using the Matlab demo images pears.png and peppers.png. I had difficulty extracting the skeleton from your provided images, but you should be able to adapt this code easily to your application.
I1 = = im2bw(imread('peppers.png'));
I2 = = im2bw(imread('pears.png'));
I1_skel = bwmorph(I1,'skel',Inf);
I2_skel = bwmorph(I2,'skel',Inf);
[i1,j1] = ind2sub(size(I1_skel),find(bwmorph(bwmorph(I1_skel,'thin',Inf),'branchpoint') == 1));
[i2,j2] = ind2sub(size(I2_skel),find(bwmorph(bwmorph(I2_skel,'thin',Inf),'branchpoint') == 1));
%You used a for loop to find the distance between each pair of
%intersections. There is a function for this.
d1 = round(pdist2([i1, j1], [i1, j1]));
d2 = round(pdist2([i2, j2], [i2, j2]));
%Choose a number of bins for the histogram.
%This will be the length of the feature.
%More bins will preserve more structure.
%Fewer bins will help generalize between similar but not identical images.
num_bins = 100;
%Instead of using `unique` to remove repetitions use `histcounts` in R2014b
%feature1 = histcounts(d1(:), num_bins);
%feature2 = histcounts(d2(:), num_bins);
%Use `hist` for pre R2014b Matlab versions
feature1 = hist(d1(:), num_bins);
feature2 = hist(d2(:), num_bins);
%Normalize the features
feature1 = feature1 ./ norm(feature1);
feature2 = feature2 ./ norm(feature2);
figure; bar([feature1; feature2]');
title('Features'); legend({'Feature 1', 'Feature 2'});
xlim([0, num_bins]);
Here are what the detected intersection points are in each image
Here are the resulting features. You can see the clear differences between images.
Feature Comparison
The second part to consider is how you compare your features. Currently, you are simply looking for >20 similar distances. With the 'peppers.png' and 'pears.png' test images distributed with Matlab, I find more than 2000 intersection points in one image and 260 in the other. With so many points, it is trivial to have an overlap of >20 similar distances. In your images, the number of intersection points is much smaller. You could carefully adjust the threshold of similar distances, but I think this metric is probably to simplistic.
In Machine Learning, a simple way to compare two feature vectors is vector similarity or distance. There are multiple distance metrics you could explore. Common ones include
Cosine Distance
score_cosine = feature1 * feature2'; %Cosine distance between vectors
%Set a threshold for cosine similarity [0, 1] where 1 is identical and 0 is perpendicular
cosine_threshold = .9;
disp('Cosine Compare')
disp(score_cosine)
if score_cosine > cosine_threshold
disp('similar images');
else
disp('dissimilar images');
end
Euclidean Distance
score_euclidean = pdist2(feature1, feature2);
%Set a threshold for euclidean similarity where smaller is more similar
euclidean_threshold = 0.1;
disp('Euclidean Compare')
disp(score_euclidean)
if score_euclidean < euclidean_threshold
disp('similar images');
else
disp('dissimilar images');
end
If these don't work, you may need to train a classifier to find a more complicated function to distinguish between similar and dissimilar images.

Matlab: plotting frequency distribution with a curve

I have to plot 10 frequency distributions on one graph. In order to keep things tidy, I would like to avoid making a histogram with bins and would prefer having lines that follow the contour of each histogram plot.
I tried the following
[counts, bins] = hist(data);
plot(bins, counts)
But this gives me a very inexact and jagged line.
I read about ksdensity, which gives me a nice curve, but it changes the scaling of my y-axis and I need to be able to read the frequencies from the y-axis.
Can you recommend anything else?
You're using the default number of bins for your histogram and, I will assume, for your kernel density estimation calculations.
Depending on how many data points you have, that will certainly not be optimal, as you've discovered. The first thing to try is to calculate the optimum bin width to give the smoothest curve while simultaneously preserving the underlying PDF as best as possible. (see also here, here, and here);
If you still don't like how smooth the resulting plot is, you could try using the bins output from hist as a further input to ksdensity. Perhaps something like this:
[kcounts,kbins] = ksdensity(data,bins,'npoints',length(bins));
I don't have your data, so you may have to play with the parameters a bit to get exactly what you want.
Alternatively, you could try fitting a spline through the points that you get from hist and plotting that instead.
Some code:
data = randn(1,1e4);
optN = sshist(data);
figure(1)
[N,Center] = hist(data);
[Nopt,CenterOpt] = hist(data,optN);
[f,xi] = ksdensity(data,CenterOpt);
dN = mode(diff(Center));
dNopt = mode(diff(CenterOpt));
plot(Center,N/dN,'.-',CenterOpt,Nopt/dNopt,'.-',xi,f*length(data),'.-')
legend('Default','Optimum','ksdensity')
The result:
Note that the "optimum" bin width preserves some of the fine structure of the distribution (I had to run this a couple times to get the spikes) while the ksdensity gives a smooth curve. Depending on what you're looking for in your data, that may be either good or bad.
How about interpolating with splines?
nbins = 10; %// number of bins for original histogram
n_interp = 500; %// number of values for interpolation
[counts, bins] = hist(data, nbins);
bins_interp = linspace(bins(1), bins(end), n_interp);
counts_interp = interp1(bins, counts, bins_interp, 'spline');
plot(bins, counts) %// original histogram
figure
plot(bins_interp, counts_interp) %// interpolated histogram
Example: let
data = randn(1,1e4);
Original histogram:
Interpolated:
Following your code, the y axis in the above figures gives the count, not the probability density. To get probability density you need to normalize:
normalization = 1/(bins(2)-bins(1))/sum(counts);
plot(bins, counts*normalization) %// original histogram
plot(bins_interp, counts_interp*normalization) %// interpolated histogram
Check: total area should be approximately 1:
>> trapz(bins_interp, counts_interp*normalization)
ans =
1.0009

Matlab cdfplot: how to control the spacing of the marker spacing

I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use the markers because the become very dense in the plot.
If i want to make the samples sparse I have to drop some samples from the cdfplot which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get XData/YData properties from your curves follow solution (1) from #ephsmith and set it back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end));
set(h,'YData',ydata(1:5:end));
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n=20 ; % number of total data markers in the curve graph
M_n = round(linspace(1,numel(y),n)) ; % indices of markers
% plot the whole line, and markers for selected data points
plot(x,y,'b-',y(M_n),y(M_n),'rs')
verry simple.....
try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allow you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.

Smoothing of histogram with a low-pass filter in MATLAB

I have an image and my aim is to binarize the image. I have filtered the image with a low pass Gaussian filter and have computed the intensity histogram of the image.
I now want to perform smoothing of the histogram so that I can obtain the threshold for binarization. I used a low pass filter but it did not work. This is the filter I used.
h = fspecial('gaussian', [8 8],2);
Can anyone help me with this? What is the process with respect to smoothing of a histogram?
imhist(Ig);
Thanks a lot for all your help.
I've been working on a very similar problem recently, trying to compute a threshold in order to exclude noisy background pixels from MRI data prior to performing other computations on the images. What I did was fit a spline to the histogram to smooth it while maintaining an accurate fit of the shape. I used the splinefit package from the file exchange to perform the fitting. I computed a histogram for a stack of images treated together, but it should work similarly for an individual image. I also happened to use a logarithmic transformation of my histogram data, but that may or may not be a useful step for your application.
[my_histogram, xvals] = hist(reshape(image_volume), 1, []), number_of_bins);
my_log_hist = log(my_histogram);
my_log_hist(~isfinite(my_log_hist)) = 0; % Get rid of NaN values that arise from empty bins (log of zero = NaN)
figure(1), plot(xvals, my_log_hist, 'b');
hold on
breaks = linspace(0, max_pixel_intensity, numberofbreaks);
xx = linspace(0, max_pixel_intensity, max_pixel_intensity+1);
pp = splinefit(xvals, my_log_hist, breaks, 'r');
plot(xx, ppval(pp, xx), 'r');
Note that the spline is differentiable and you can use ppdiff to get the derivative, which is useful for finding maxima and minima to help pick an appropriate threshold. The numberofbreaks is set to a relatively low number so that the spline will smooth the histogram. I used linspace in the example to pick the breaks, but if you know that some portion of the histogram exhibits much greater curvature than elsewhere, you'd want to have more breaks in that region and less elsewhere in order to accurately capture the shape of the histogram.
To smooth the histogram you need to use a 1-D filter. This is easily done using the filter function. Here is an example:
I = imread('pout.tif');
h = imhist(I);
smooth_h = filter(normpdf(-4:4, 0,1),1,h);
Of course you can use any smoothing function you choose. The mean would simply be ones(1,8).
Since your goal here is just to find the threshold to binarize an image you could just use the graythresh function which uses Otsu's method.