Clustering an image using Gaussian mixture models - matlab

I want to use GMM (Gaussian mixture models) to cluster a binary image, and I also want to plot the cluster centroids on the binary image itself.
I am using this as my reference:
http://in.mathworks.com/help/stats/gaussian-mixture-models.html
This is my initial code
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
P=normpdf(K, mu, sigma);
Z = norminv(P,mu,sigma);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
scatter(X(:,1),X(:,2),10,'ko');
options = statset('Display','final');
gm = fitgmdist(X,2,'Options',options);
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
But the problem is that I am unable to plot the cluster centroids obtained from the GMM on the original binary image. How do I do this?
**Now suppose I have 10 such images in a sequence, and I want to store their mean positions in two cell arrays. How do I do that? This is my code for the new question:**
images=load('gait2go.mat');%load the matrix file
for i=1:10
I{i}=images.result{i};
I{i}=im2double(I{i});
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I{i});
whites=true(Idims(1),Idims(2));
df=I{i};
%we add up the various color channels
for colori=1:size(df,3)
whites=whites & df(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
xc{i}=meanx(i); %cell array containing the position of the mean for the 10 images
xb{i}=meany(i);
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'+'); %same funky coordinates
end
I am able to get the 10 images segmented, but not the values of the mean stored in the cell arrays xc and xb. They are only storing [] in place of the values of the means.

I decided to post an answer to your question (where your question was determined by a maximum-likelihood guess:P), but I wrote an extensive introduction. Please read carefully, as I think you have difficulties understanding the methods you want to use, and you have difficulties understanding why others can't help you with your usual approach of asking questions. There are several problems with your question, both code-related and conceptual. Let's start with the latter.
The problem with the problem
You say that you want to cluster your image with Gaussian mixture modelling. While I'm generally not familiar with clustering, after a look through your reference and the wonderful SO answer you cited elsewhere (and a quick 101 from @rayryeng) I think you are on the wrong track altogether.
Gaussian mixture modelling, as its name suggests, models your data set with a mixture of Gaussian (i.e. normal) distributions. The reason for the popularity of this method is that when you do measurements of all sorts of quantities, in many cases you will find that your data is mostly distributed like a normal distribution (which is actually the reason why it's called normal). The reason behind this is the central limit theorem, which implies that the sum of reasonably independent random variables tends to be normal in many cases.
Now, clustering, on the other hand, simply means separating your data set into disjoint smaller bunches based on some criteria. The main criterion is usually (some kind of) distance, so you want to find "close lumps of data" in your larger data set. You usually need to cluster your data before performing a GMM, because it's already hard enough to find the Gaussians underlying your data without having to guess the clusters too. I'm not familiar enough with the procedures involved to tell how well GMM algorithms can work if you just let them work on your raw data (but I expect that many implementations start with a clustering step anyway).
To get closer to your question: I guess you want to do some kind of image recognition. Looking at the picture, you want to get more strongly correlated lumps. This is clustering. If you look at a picture of a zoo, you'll see, say, an elephant and a snake. Both have their distinct shapes, and they are well separated from one another. If you cluster your image (and the snake is not riding the elephant, neither did it eat it), you'll find two lumps: one lump elephant-shaped, and one lump snake-shaped. Now, it wouldn't make sense to use GMM on these data sets: elephants, and especially snakes, are not shaped like multivariate Gaussian distributions. But you don't need this in the first place, if you just want to know where the distinct animals are located in your picture.
Still staying with the example, you should make sure that you cluster your data into an appropriate number of subsets. If you try to cluster your zoo picture into 3 clusters, you might get a second, spurious snake: the nose of the elephant. With an increasing number of clusters your partitioning might make less and less sense.
Your approach
Your code doesn't give you anything reasonable, and there's a very good reason for that: it doesn't make sense from the start. Look at the beginning:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
You read your binary image, convert it to double, then stretch it out into a vector and compute the mean and deviation of that vector. You basically smear your entire image into 2 values: an average intensity and a deviation. And THEN you generate 111*10 normal points with these parameters, and try to do GMM on the first two sets of 111, which are both independently normal with the same parameters. So you probably get two overlapping Gaussians around the same mean with the same deviation.
I think the examples you found online confused you. When you do GMM, you already have your data, so no pseudo-normal numbers should be involved. But when people post examples, they also try to provide reproducible inputs (well, some of them do, nudge nudge wink wink). A simple method for this is to generate a union of simple Gaussians, which can then be fed into GMM.
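For illustration only, here is a minimal sketch of what such a reproducible example typically looks like: two synthetic Gaussian clouds are drawn with mvnrnd and then fed to fitgmdist (the sizes, means and covariances below are arbitrary):
rng(1);                                  % reproducible random numbers
mu1 = [1 2];   sigma1 = [2 0; 0 0.5];    % first Gaussian component
mu2 = [-3 -5]; sigma2 = [1 0; 0 1];      % second Gaussian component
X = [mvnrnd(mu1, sigma1, 200); mvnrnd(mu2, sigma2, 100)];
gm  = fitgmdist(X, 2);                   % fit a 2-component GMM to the data
idx = cluster(gm, X);                    % assign each point to a component
gscatter(X(:,1), X(:,2), idx);           % show the recovered clusters
hold on;
plot(gm.mu(:,1), gm.mu(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
hold off;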
So, my point is, that you don't have to generate random numbers, but have to use the image data itself as input to your procedure. And you probably just want to cluster your image, instead of actually using GMM to draw potatoes over your cluster, since you want to cluster body parts in an image about a human. Most body parts are not shaped like multivariate Gaussians (with a few distinct exceptions for men and women).
What I think you should do
If you really want to cluster your image, like in the figure you added to your question, then you should use a method like k-means. But then again, you already have a program that does that, don't you? So I don't really think I can answer the question saying "How can I cluster my image with GMM?". Instead, here's an answer to "How can I cluster my image?" with k-means, but at least there will be a piece of code here.
%set infile to what your image file will be
infile='sil10001.pbm';
%read file
I=im2double(imread(infile));
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I);
whites=true(Idims(1),Idims(2));
%we add up the various color channels
for colori=1:size(I,3) %size(I,3) is 1 for an [M N] image, so this also works for 2D inputs
whites=whites & I(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'ko'); %same funky coordinates
Here's what this does. It first reads your image as double, like your code did. Then it tries to determine "white" pixels by checking that each color channel (of which there can be 1, 3 or 4) is brighter than 0.5. The input data points for the clustering will then be the x and y "coordinates" (i.e. indices) of your white pixels.
Next it does the clustering via kmeans. This part of the code is loosely based on the already cited answer of Amro. I had to set a large maximal number of iterations, as the problem is ill-posed in the sense that there aren't 10 clear clusters in the picture. Then we compute the mean for each cluster, and plot the clusters with gscatter, and the means with scatter. Note that in order to have the picture facing in the right directions in a scatter plot you have to shift around the input coordinates. Alternatively you could define datax and datay correspondingly at the beginning.
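(As an aside: if you also want to see those means drawn on top of the original binary image, which is what your first question literally asks, a minimal sketch using the meanx/meany computed above would be the following. Remember that find returned rows in datax and columns in datay, so on the image the means are plotted as (x,y) = (meany, meanx).)
figure;
imshow(I);                                         % the original binary image
hold on;
plot(meany, meanx, 'r+', 'MarkerSize', 12, 'LineWidth', 2);  % cluster means
hold off;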
And here's my output, run with the already processed figure you provided in your question:

I believe you must have made a naive mistake in the plot, and that's why you see just a straight line: you are plotting only the x values.
In my opinion, the second argument in the scatter command should be X(cluster1,2) or X(cluster2,2) depending on which scatter command is being used in the code.

The code can be made more simple:
%read file
I=im2double(imread('sil10340.pbm'));
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(I);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
[cInd, c] = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(c(:,2),-c(:,1),20,'ko'); %same funky coordinates
I don't think there is any need for the looping, as c itself returns a 10x2 double array which contains the positions of the means.
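To also cover the cell-array part of the updated question, here is a sketch (assuming, as in your code, that gait2go.mat contains a cell array result with the 10 binary images). It stores the full K-by-1 mean vectors per image and avoids reusing i as both the image index and the cluster index:
images = load('gait2go.mat');              % assumed to contain images.result{1..10}
K  = 10;                                   % clusters per image
xc = cell(1,10);                           % row means, one vector per image
xb = cell(1,10);                           % column means, one vector per image
for imgIdx = 1:10
    I = im2double(images.result{imgIdx});
    [datax, datay] = find(I > 0.5);        % coordinates of the white pixels
    [~, c] = kmeans([datax datay], K, 'EmptyAction','singleton', ...
                    'maxiter',1000, 'start','cluster');
    xc{imgIdx} = c(:,1);                   % all K means for this image
    xb{imgIdx} = c(:,2);
end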

Related

Looking for a tool that extracts data from a plot figure (here, 2D contours from a covariance matrix or Markov chains) and reproduces the original figure

I am looking for an application or tool which is able, for example, to extract data from a 2D contour plot like the one below:
I have seen the https://dash-gallery.plotly.host/Portal/ tool, https://plotly.com/dash/ and https://automeris.io/, but I have tested them and it is difficult to extract the data (here the data are actually covariance matrices drawn as ellipses, but I would like to extend this, if possible, to Markov chains).
If someone knows of more efficient tools, especially for this kind of 2D plot, please let me know.
I am also open to commercial applications. I am on macOS 11.3.
If I am not on the right forum, please let me know.
UPDATE 1:
I tried to apply the method in Matlab with the script below, from this previous post:
%// Import the data:
imdata = importdata('Omega_L_Omega_m.png');
Gray = rgb2gray(imdata.cdata);
colorLim = [-1 1]; %// this should be set manually
%// Get the area of the data:
f = figure('Position',get(0,'ScreenSize'));
imshow(imdata.cdata,'Parent',axes('Parent',f),'InitialMagnification','fit');
%// Get the area of the data:
title('Click with the cross on the most top left area of the *data*')
da_tp_lft = round(getPosition(impoint));
title('Click with the cross on the most bottom right area of the *data*')
da_btm_rgt = round(getPosition(impoint));
dat_area = double(Gray(da_tp_lft(2):da_btm_rgt(2),da_tp_lft(1):da_btm_rgt(1)));
%// Get the area of the colorbar:
title('Click with the cross within the upper most color of the *colorbar*')
ca_tp_lft = round(getPosition(impoint));
title('Click with the cross within the bottom most color of the *colorbar*')
ca_btm_rgt = round(getPosition(impoint));
cmap_area = double(Gray(ca_tp_lft(2):ca_btm_rgt(2),ca_tp_lft(1):ca_btm_rgt(1)));
close(f)
%// Convert the colormap to data:
data = dat_area./max(cmap_area(:)).*range(colorLim)-abs(min(colorLim));
It seems that I get data in the data array, but I don't know how to exploit it to reproduce the original figure from these data.
Could anyone tell me how to produce this kind of plot in Matlab from the data I have supposedly extracted (I am not sure the Matlab script has generated all the data for the green, orange and blue contours, at each confidence level, that is to say 68%, 95% and 99.7%)?
UPDATE 2: I have received first elements of an answer at the following link:
partial answer but not fully complete
I quote elements of the approach:
clc
clear all;
imdata = imread('https://www.mathworks.com/matlabcentral/answers/uploaded_files/642495/image.png');
close all;
Gray = rgb2gray(imdata);
yax=sum(conv2(single(Gray),[-1 -1 -1;0 0 0; 1 1 1],'valid'),2);
xax=sum(conv2(single(Gray),[-1 -1 -1;0 0 0; 1 1 1]','valid'),1);
figure(1),subplot(211),plot(xax),subplot(212),plot(yax)
ROIy = find(abs(yax)>1e5);
ROIyinner = find(diff(ROIy)>5);
ROIybounds = ROIy([ROIyinner ROIyinner+1]);
ROIx = find(abs(xax)>1e5);
ROIxinner = find(diff(ROIx)>5);
ROIxbounds = ROIx([ROIxinner ROIxinner+1]);
PLTregion = Gray(ROIybounds(1):ROIybounds(2),ROIxbounds(1):ROIxbounds(2));
PLTregion(PLTregion==255)=nan;
figure(2),imagesc(PLTregion)
[N X]=hist(single(PLTregion(:)),0:255);
figure(3),plot(X,N),set(gca,'yscale','log')
PLTitems = find(N>2000) %limit "color" of interest to items with >1000 pixels
% output:
% PLTitems = 1×10
%     1    67    90   101   129   132   144   167   180   194
PLTvalues = X(PLTitems);
PLTvalues(1)=[]; %ignore black?
%test out region 1
for ind = 1:numel(PLTvalues)
temp = zeros(size(PLTregion));
temp(PLTregion==PLTvalues(ind) | (PLTregion<=50 & PLTregion>10))=255;
% figure(100), imagesc(temp)
temp = bwareaopen(temp,1000);
temp = imfill(temp,'holes');
figure(100), subplot(3,3,ind),imagesc(temp)
figure(101), subplot(3,3,ind),imagesc(single(PLTregion).*temp,[0 255])
end
If someone knows how to improve on these first interesting results, it would be great if they could mention it.
Restating the problem - My understanding given the different comments and your updates is the following:
someone other than you is in possession of data, which as it happens is 2D data, i.e. an Nx2 matrix;
using the covariance matrix, they are effectively saying something about the joint distribution of these two dimensions, specifically about the variance;
if they assume a Gaussian distribution, as is implied by your comment regarding 68%, 95% and 99.7% for 1sigma, 2sigma and 3sigma, they can draw ellipses which represent the 2D-normal distribution: these are in fact some of the contour lines associated with the 3D "bell" surface;
you have obtained the contour lines in a graph and are trying to obtain the covariance matrix (not the original data...);
you are concerned about the complexity of having to extract the information from each ellipse.
Partial answer:
It is impossible to recover the original data, I hope you are already aware of that, but in case you are not let's just note that the covariance matrix is a summary statistic of the data, much like the average, and although it says something about the data many different datasets could happen to have the same summary statistic (the same way many different sets of numbers can give you an average of 10).
It is possible to somewhat recover the covariance matrix, i.e. the 3 numbers a, b and c in the matrix [a,b;b,c], though the error in doing so will likely be large because of how imprecise the pixel representation is. Essentially, you will be looking for the dimensions of the two axes, for the variances, as well as the angle of one of the axes, for the covariance.
Unless I am mistaken, under the Gaussian assumption above, you only need to measure this for one of the three ellipses, and then factor by whatever number of sigmas that contour represents. Here you might want to either use the best-defined ellipse, or attempt to use the largest one, which will provide the maximum precision for your measurements (cf. pixelization).
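To make that "factor by the number of sigmas" step concrete: for a 2D Gaussian, the k-sigma contour (in the Mahalanobis sense used here) is an ellipse whose semi-axes have length k*sqrt(lambda_i) along the eigenvectors of the covariance matrix. So once you have measured the two semi-axis lengths and the orientation angle of one contour, a sketch of the reconstruction is (the numbers are placeholders):
ax1 = 3.0; ax2 = 1.2;    % measured semi-axis lengths of the chosen contour
theta = pi/6;            % measured angle of the first axis, in radians
k = 2;                   % how many sigmas that contour represents
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];  % rotation by theta
Sigma = R * diag([(ax1/k)^2, (ax2/k)^2]) * R';        % the matrix [a,b;b,c]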
Also, the problem of finding the axes and angle for the ellipse need not be as complex as it seems from your first trials: instead of trying to find the contour of the ellipses, find the bounding rectangle.
In order to further simplify this process, if your images are color-coded the way you show, then a filter on blue pixels might be enough in terms of image processing. Then simply take the minimum and maximum (x,y) coordinates in order to obtain the bounding rectangle.
Once the bounding rectangle is obtained, find the equation to your ellipse (that's a question for a math group, but you could start here for example).
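As a rough sketch of the blue-pixel filter plus bounding rectangle (assuming an RGB image in which the ellipse of interest is drawn predominantly in blue; the thresholds are arbitrary and will need tuning):
rgb = im2double(imread('Omega_L_Omega_m.png'));
% crude "blue" mask: strong blue channel, weak red and green
blueMask = rgb(:,:,3) > 0.5 & rgb(:,:,1) < 0.4 & rgb(:,:,2) < 0.4;
[rows, cols] = find(blueMask);
xMin = min(cols); xMax = max(cols);   % bounding rectangle, in pixel coordinates
yMin = min(rows); yMax = max(rows);
figure; imshow(rgb); hold on;
rectangle('Position',[xMin yMin xMax-xMin yMax-yMin], 'EdgeColor','r');
hold off;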
Happy filtering!

k-means clustering using function 'kmeans' in MATLAB

I have this matrix:
x = [2+2*i 2-2*i -2+2*i -2-2*i];
I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:
A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);
At the receiver, I received the vector below, where the components are ordered based on what I sent originally (i.e., the components of x).
X = [A A A C C A C C];
Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:
I am not sure if I can represent the complex numbers as shown in X above.
I can't plot the result of the kmeans to show the clusters.
I could not understand the clusters centroid results.
How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?
If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.
How k-means works is that for some data that you have, you want to group it into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then you determine how close the rest of the data are to each of these points, and you assign each point to the group (1,2,...,k) whose centroid it is closest to. Afterwards, you update the centroids, which are defined as the representative points of each group: for each group, you compute the average of all of the points assigned to it. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids, and so on. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
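If it helps to see that description as code, here is a deliberately naive sketch of that loop (MATLAB's kmeans does much more than this, e.g. smarter initialisation, empty-cluster handling and multiple replicates):
function [idx, centroids] = naive_kmeans(X, k, maxIter)
% X is n-by-d, k is the number of clusters; purely illustrative
n = size(X, 1);
centroids = X(randperm(n, k), :);            % k random data points as initial centroids
for iter = 1:maxIter
    D = pdist2(X, centroids);                % n-by-k point-to-centroid distances
    [~, idx] = min(D, [], 2);                % assign each point to its nearest centroid
    newCentroids = centroids;
    for j = 1:k
        members = (idx == j);
        if any(members)
            newCentroids(j,:) = mean(X(members,:), 1);   % mean of the assigned points
        end
    end
    if max(abs(newCentroids(:) - centroids(:))) < 1e-10
        centroids = newCentroids;
        break;                               % centroids stopped moving
    end
    centroids = newCentroids;
end
end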
Now, let's answer your questions one-by-one.
1. Complex number representation
k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.
See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306
2. Plot the results
You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point, and each column is a variable. Therefore, you need to concatenate your A and C row-wise. Therefore:
X = [A; A; A; C; C; A; C; C];
Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.
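For instance, something along these lines (a sketch; the sequence length, symbol order and noise level are placeholders):
sym = [2+2i; 2-2i; -2+2i; -2-2i];                  % the four constellation points
tx  = sym(randi(4, 1200, 1));                      % random transmitted sequence
rx  = tx + 0.5*(randn(1200,1) + 1i*randn(1200,1)); % add complex Gaussian noise
X   = [real(rx) imag(rx)];                         % one row per received point, two columns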
Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:
[idx, centroids] = kmeans(X, 4);
idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row tells you the mean of each cluster found in the data. If you want to plot the data, as well as the clusters, you simply need to do following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to colour in where the mean of each cluster is located:
x = X(:,1);
y = X(:,2);
figure;
hold on;
colors = 'rgbk';
for num = 1 : 4
plot(x(idx == num), y(idx == num), [colors(num) '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
grid;
The above code goes through each cluster, plots them in a different colour, then plots the centroids in cyan with a slightly larger thickness so you can see what the graph looks like.
This is what I get:
3. Understanding centroid results
This is probably because you didn't construct X properly. This is what I get for my centroids:
centroids =
-1.9176 -2.0759
1.5980 2.8071
2.7486 1.6147
0.8202 0.8025
This is pretty self-explanatory and I talked about how this is structured earlier.
4. Best representation of the signal
What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:
[idx, centroids] = kmeans(X, 4, 'Replicates', 5);
This will run kmeans 5 times and give you the best centroids of these 5 times.
Now, if you want to determine what the best sequence that was transmitted, you'd have to split up your X into 150 rows each (as your random sequence was 150 elements), then run a separate kmeans on each subset. You can try to find the best representation of each part of the sequence by using the Replicates flag each time.... so you can do something like:
for num = 1 : 8
%// Look at 150 points at a time
[idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
%// Do your analysis
%//...
%//...
end
idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.
If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. All you do is given a set of points that are defined within the domain of your problem, you just have to determine which cluster each point belongs to. Given that our data spans between -5 <= (x,y) <= 5, let's go through each point in the grid and determine which cluster each point belongs to. We'd then colour the appropriate point according to which cluster it belongs to.
Something like:
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure;
hold on;
for idx = 1 : numel(X)
[~,ind] = min(sum(bsxfun(@minus, [X(idx) Y(idx)], centroids).^2, 2));
plot(X(idx), Y(idx), [colors(ind), '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.
Here's what the decision regions look like:
Hope this helps! Good luck!

Convert grayscale image to point cloud (similar to dither)

I'm currently trying to implement a method to generate TSP art, and for that I need a list of points (x,y), the local density of which is proportional to the gray scale pixel value of a given image.
My first thought was: well that works pretty much like Inverse Transform Sampling for statistics (you want to draw a sample that matches a given probability density function but you can only create a sample that is uniformly distributed).
I implemented this and it works fairly well, as evident by executing this code:
%% Load image, adjust it for our needs
im=imread('http://goo.gl/DDwV3t'); %load random headshot from google
im=imadjust(im,stretchlim(im,[.01,.65]),[]);
im=im2double(rgb2gray(im));
im=im(10:end-5,50:end-5);
figure;imshow(im);title('original');
im=1-im; %we want black dots on white background
im=flipud(im); %and we want it the right way up
%% process per row
imrow = cumsum(im,2);
imrow=imrow*size(imrow,1)./repmat(max(imrow,[],2),1,size(imrow,2));
y=1:size(imrow,2);
ximrow_i = zeros(size(imrow));
for i = 1:size(imrow,1)
mask =logical([diff(imrow(i,:))>=0.01,0]); %needed for interp
ximrow_i(i,:) = interp1(imrow(i,mask),y(mask),y);
end
y=1:size(ximrow_i,1);
y=repmat(y',1,size(ximrow_i,2));
y1=y(1:5:end,1:5:end); %downscale a bit
ximcol_i1=ximrow_i(1:5:end,1:5:end); %downscale a bit
figure('Color','w');plot(ximcol_i1(:),y1(:),'k.');title('Inverse Transform Sampling on rows');
axis equal;axis off;
%% process per column
imcol=cumsum(im,1);
imcol=imcol*size(imcol,2)./repmat(max(imcol,[],1),size(imcol,1),1);
y=1:size(imcol,1);
yimcol_i=zeros(size(imcol));
for i = 1:size(imcol,2)
mask =logical([diff(imcol(:,i))>=0.01;0]);
yimcol_i(:,i) = interp1(imcol(mask,i),y(mask),y);
end
y=1:size(imcol,2);
y=repmat(y,size(imcol,1),1);
y1=y(1:5:end,1:5:end);
yimcol_i1=yimcol_i(1:5:end,1:5:end);
figure('Color','w');plot(y1(:),yimcol_i1(:),'k.');title('Inverse Transform Sampling on cols');
axis equal;axis off;
It has the shortcoming that I can only use this per-row or per-column, but not both. The Inverse Transform Sampling method does not work for multivariate PDFs in general, and I'm fairly sure I won't be able to get it to work in this case.
Is there a simple method to achieve my goal that I haven't seen yet?
I am aware that an algorithm called Voronoi Stippler has been used to create the desired result and I will investigate that, but for the moment I liked the simplicity of Inverse Transform Sampling and would like to know if I can extend that method to match my needs.
It turns out this is fairly simple and can be done by Rejection Sampling.
For the special case where the instrumental distribution is U(0,1) it works like this (if I understood it correctly):
im=imread('http://goo.gl/DDwV3t'); %load random headshot from google
im=imadjust(im,stretchlim(im,[.01,.65]),[]);
im=im2double(rgb2gray(im));
im=im(10:end-5,50:end-5);
im=1-flipud(im);
d = im > .9*rand(size(im));
d=d&(rand(size(d))>.95); %randomly sieve out some more points
[i,j]=ind2sub(size(d),find(d));
figure('Color','w');plot(j,i,'k.');title('Rejection Sampling');
axis equal;axis off;
The sampling is done in one line:
d = im > .9*rand(size(im));
Since I ended up with too many points, I randomly sieved the result, reducing the number of points by approximately a factor of 20.
This is pretty much the result I originally desired.

Irregular plot of k-means clustering, outlier removal

Hi, I'm trying to cluster network data from the 1999 DARPA data set. Unfortunately I'm not really getting clustered data, at least not compared to some of the literature, despite using the same techniques and methods.
My data comes out like this:
As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:
%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
In the first run, it removed 48 rows from the 1000 normalized random rows which are selected from the full dataset.
This is the full script I used on the data:
%% load data
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data_10_percent_corrected','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
[C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
fulldata = cell2mat(C);
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick random columns
indY = randperm( size(U,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));
% output matching data
dataSample = fulldata(indX, :)
%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
This is two distinct clusters from the output:
As you can see the data looks cleaner and more clustered than the original. However I still think a better method can be used.
For instance observing the overall clustering, I still have a lot of noise (outliers) from the dataset. As can be seen here:
I need the outlier rows put into a separate dataset for later classification (only removed from the clustering)
Here is a link to the DARPA dataset; please note that the 10% data set has had a significant reduction in columns: the majority of columns, which have 0s or 1s running throughout, have been removed (42 columns down to 6 columns):
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
EDIT
Columns kept in the dataset are:
src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
RE-EDIT:
Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering using K-Medoids http://en.wikipedia.org/wiki/K-medoids. I'm hoping that there isn't much of a change needed in the code that I currently have, but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So providing that someone can show me a working example, this will be accepted as an answer.
Note that using this dataset is discouraged:
That dataset has errors: KDD Cup '99 dataset (Network Intrusion) considered harmful
Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. K-means needs to be able to compute sensible means, and for a binary vector "0.5" is not a sensible mean; it should be either 0 or 1.
Plus, k-means doesn't like outliers too much.
When plotting, make sure to scale the axes equally, or the result will look incorrect. Your X-axis has a length of around 0.9, your Y-axis only 0.2 - no wonder they look squashed.
Overall, maybe the data set just doesn't have k-means-style clusters? You definitely should try a density-based method (because these can deal with outliers), such as DBSCAN. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They probably can be captured with a number of thresholds in some dimensions.
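If you want to try that in MATLAB: recent versions of the Statistics and Machine Learning Toolbox ship a dbscan function (R2019a and later), and noise points get the label -1, which also gives you the separate outlier set you asked for. A minimal sketch, assuming data is your normalized matrix and with epsilon/minpts values you would have to tune:
epsilon = 0.05;                    % neighbourhood radius (tune for your data)
minpts  = 10;                      % minimum neighbours for a core point (tune)
labels  = dbscan(data, epsilon, minpts);
outlierRows   = data(labels == -1, :);   % noise points, kept in a separate set
clusteredRows = data(labels > 0, :);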
Here is a visualization of the data set after z-normalization, visualized in parallel coordinates, with 5000 samples. Bright green is normal.
You can clearly see special properties of the data set. All of the attacks are clearly different in attributes 3 and 4 (count and srv_count) and also most very concentrated in dst_host_count and dst_host_srv_count.
I've run OPTICS on this data set, too. It found a number of clusters, most of them in the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.
You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (maybe I should have set minpts lower; it is quite spread out), but it even discovered structure within the wine-colored attack, breaking it into a number of separate events.
To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.
Also note that this is an unrealistic scenario.
There are well-known patterns involved in attacks, in particular port scans. These are best detected with a specialized port scan detector, not with learning.
The simulated data has a lot of completely pointless "attacks" simulated. For example, the Smurf attack from the 90s is >50% of the data set, and Syn flood is another 20%, while normal traffic is <20%!
For these kind of attacks, there are well-known signatures.
Many modern attacks (SQL injection, for example) flow with usual HTTP traffic, and will not show up as anomalous in the raw traffic pattern.
Just don't use this data for classification or outlier detection. Just don't.
Quoting the KDNuggets link above:
As a result, we strongly recommend that
(1) all researchers stop using the KDD Cup '99 dataset,
(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and
(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
This is neither real nor realistic data. Go get something else.
First things first: you're asking for a lot here. For future reference: try to break up your problem in smaller chunks, and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).
Luckily for you, I understand your predicament, and just love this sort of question!
Apart from this dataset's possible issues with k-means, this question is still generic enough to apply also to other datasets (and thus Googlers ending up here looking for a similar thing), so let's go ahead and get this solved.
My suggestion is we edit this answer until you get reasonably satisfactory results.
Number of clusters
Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).
Visual inspection
It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.
Pros:
quick
simple
works well on well-separated clusters in relatively small datasets
Cons:
dirty
requires user interaction
it's easy to miss smaller clusters
data with less-well separated clusters, or very many of them, are hard to do by this method
it is all rather subjective -- the next person might select a different amount than you did.
silhouettes plot
As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data.
Pros:
relatively simple
reduces subjectivity by using statistical measures
intuitive way to represent quality of the choice
Cons:
requires user interaction
In the limit, if you take as many clusters as there are datapoints, a silhouettes plot will tell you that that is the best choice
it is still rather subjective, not based on statistical means
can be computationally expensive
elbow method
As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by this kmeans run. There will be a number of clusters where the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). The elbow is then, statistically speaking, the best choice for the number of clusters (a small sketch of computing this curve follows the pros and cons below).
Pros:
no user interaction required -- the elbow can be selected automatically
statistically more sound than any of the aforementioned methods
Cons:
somewhat complicated
still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
can be computationally expensive
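A minimal sketch of computing the elbow curve (assuming data is your n-by-d matrix; the range of k and the number of replicates are arbitrary):
kRange = 1:15;
withinSS = zeros(size(kRange));
totalSS = sum(sum(bsxfun(@minus, data, mean(data,1)).^2)); % total sum of squares
for k = kRange
    [~, ~, sumd] = kmeans(data, k, 'Replicates',5, 'EmptyAction','singleton');
    withinSS(k) = sum(sumd);             % total within-cluster sum of squares
end
explainedVar = 1 - withinSS/totalSS;     % fraction of variance explained by the clusters
figure; plot(kRange, explainedVar, 'o-');
xlabel('number of clusters k'); ylabel('variance explained');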
Outliers
Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.
I would start by a two-step-iterative approach, using the elbow method. In pseudo-Matlab:
data = your initial dataset
dataMod = your initial dataset
MAX = the number of clusters chosen by visual inspection
while (forever)
    for N = MAX-5 : MAX+5
        if (N < 1), continue, end
        perform k-means with N clusters on dataMod
        if (variance explained shows a jump)
            break
    end
    if (you are satisfied)
        break
    end
    for i = 1:N
        extract all points from cluster i
        find the centroid (let k-means do that)
        calculate the standard deviation of distances to the centroid
        mark points further than 3 sigma as possible outliers
    end
    dataMod = data with marked points removed
end
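In real MATLAB, the outlier-marking step inside that loop could look roughly like this (a sketch, assuming dataMod is an n-by-d matrix and N is the current number of clusters):
[cInd, cents] = kmeans(dataMod, N, 'EmptyAction','singleton');
keep = true(size(dataMod,1), 1);
for i = 1:N
    members = find(cInd == i);
    d = sqrt(sum(bsxfun(@minus, dataMod(members,:), cents(i,:)).^2, 2)); % distances to centroid
    keep(members(d > 3*std(d))) = false;   % mark points further than 3 sigma as outliers
end
outlierRows = dataMod(~keep, :);           % keep them in a separate set, as you wanted
dataMod     = dataMod(keep, :);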
The tough part is obviously determining whether you are satisfied. This is the key to the algorithm's effectiveness. The rough structure of this part
if (you are satisfied)
break
end
would be something like this
if (situation has improved)
    data = dataMod
elseif (situation is same or worse)
    dataMod = data
    break
end
The situation has improved when there are fewer outliers, or the variance explained for ALL choices of N is better than during the previous loop in the while. This is also something to fiddle with.
Anyway, much more than a first attempt I wouldn't call this.
If anyone sees incompleteness, flaws or loopholes here, please comment or edit.

How to generate this shape in Matlab?

In MATLAB, how can I generate two clusters of random points like in the following graph? Can you show me the script/code?
If you want to generate such data points, you will need to have their probability distribution to be able to generate the points.
For your case, I do not have the real distributions, so I can only give an approximation. From your figure I see that both clusters lie approximately on a circle, with a random radius and a limited span for the angle. I assume those angles and radii are uniformly distributed over certain ranges, which seems like a pretty good starting point.
Therefore it also makes sense to generate the random data in polar coordinates (i.e. angle and radius) instead of the cartesian ones (i.e. horizontal and vertical), and transform them to allow plotting.
C1 = [0 0]; % center of the circle
C2 = [-5 7.5];
R1 = [8 10]; % range of radii
R2 = [8 10];
A1 = [1 3]*pi/2; % [rad] range of allowed angles
A2 = [-1 1]*pi/2;
nPoints = 500;
urand = @(nPoints,limits)(limits(1) + rand(nPoints,1)*diff(limits));
randomCircle = @(n,r,a)(pol2cart(urand(n,a),urand(n,r)));
[P1x,P1y] = randomCircle(nPoints,R1,A1);
P1x = P1x + C1(1);
P1y = P1y + C1(2);
[P2x,P2y] = randomCircle(nPoints,R2,A2);
P2x = P2x + C2(1);
P2y = P2y + C2(2);
figure
plot(P1x,P1y,'or'); hold on;
plot(P2x,P2y,'sb'); hold on;
axis square
This yields:
This method works relatively well when you deal with distributions that you can transform easily and when you can easily describe the possible locations of the points. If you cannot, there are other methods, such as the inverse transform sampling method, which offer algorithms to generate the data instead of the manual variable transformations I did here.
K-means is not going to give you what you want.
For K-means, vectors are classified based on their nearest cluster center. I can only think of two ways you could get the non-convex assignment shown in the picture:
Your input data is actually higher-dimensional, and your sample image is just a 2-d projection.
You're using a distance metric with different scaling across the dimensions.
To achieve your aim:
Use a non-linear clustering algorithm.
Apply a non-linear transform to your input data. (Probably not feasible).
You can find a list of non-linear clustering algorithms here. Specifically, look at this reference on the MST clustering page. Your exact shape appears on the fourth page of the PDF, together with a comparison of what happens with K-Means.
For existing MATLAB code, you could try this Kernel K-Means implementation. Also, check out the Clustering Toolbox.
I assume that you really want to do the clustering operation on existing data, as opposed to generating the data itself. Since you have a plot of some data, it seems logical that you already know how to do the latter! If I am wrong in this assumption, then you should word your questions more carefully in the future.
The human brain is quite good at seeing patterns in things like this; writing code to do the same on a computer will often take some serious effort.
As has been said already, traditional clustering tools such as k-means will fail. Luckily, the image processing toolbox has good tools for these purposes already written. I might suggest converting the plot into an image, using filled in dots to plot the points. Make sure the dots are large enough that they touch each other within a cluster, with some overlap. Then use dilation/erosion tools if necessary to make sure that any small cracks are filled in, but don't go so far as to cause the clusters to merge. Finally, use region segmentation tools to pick out the clusters. Once done, transform back from pixel units in the image into your spatial units, and you have accomplished your task.
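In case it helps, a very rough sketch of that pipeline (it assumes the Image Processing Toolbox, column vectors x and y with your point coordinates, and arbitrary choices for the grid size and dilation radius):
nGrid = 500;
xi = round((x - min(x)) ./ (max(x) - min(x)) * (nGrid - 1)) + 1;  % map x to pixel columns
yi = round((y - min(y)) ./ (max(y) - min(y)) * (nGrid - 1)) + 1;  % map y to pixel rows
bw = false(nGrid, nGrid);
bw(sub2ind(size(bw), yi, xi)) = true;          % rasterize: one pixel per data point
bw = imdilate(bw, strel('disk', 5));           % grow the dots until neighbours touch
bw = imfill(bw, 'holes');                      % fill in any small cracks
lbl = bwlabel(bw);                             % region segmentation: one label per blob
clusterOfPoint = lbl(sub2ind(size(lbl), yi, xi));  % cluster index of each original point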
For the image processing approach to work, you will need sufficient separation between the clusters compared to the coarseness within a cluster. But that seems obvious for any method to succeed.