MATLAB k-means clustering for non-linearly separable data

I have non-linearly separable data that I want to cluster using the k-means implementation in MATLAB. I want to get the cluster label for each data point, so I can use the labels in another classification problem.
The problem is that k-means is not giving the results I expected. I'm attaching the cluster plot I obtained.
Since the data looks like concentric circles, I expected k-means to return the circles as clusters, but the output was arcs. I don't understand why this is happening.
Can you suggest another clustering method that would achieve my goal?

Before using an algorithm, you should try to understand it: what is its goal, and how does it achieve it? For k-means, Wikipedia tells us the following:
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Three concentric circles would have the exact same mean, so k-means is not suitable to separate them. The result is really what you should expect from k-means here.
Now, if you know that your clusters will always be concentric circles, you can simply convert your Cartesian (x-y) coordinates to polar coordinates and use only the radius rho for clustering, since you know that the angle theta doesn't matter:
% Create random data
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
% Transform to polar
[theta,rho] = cart2pol(X(:,1),X(:,2));
% k-means clustering
idx = kmeans(rho,3);
% Plot results
hold on
plot(X(idx==1,1), X(idx==1,2), 'r.')
plot(X(idx==2,1), X(idx==2,2), 'g.')
plot(X(idx==3,1), X(idx==3,2), 'b.')
Or more generally: use a suitable kernel for k-means clustering, or use another algorithm.
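As a concrete illustration of the second option (this sketch is my addition: it assumes the dbscan function from the Statistics and Machine Learning Toolbox, available since R2019a, and the epsilon/minpts values are guesses tuned to the synthetic data above), a density-based algorithm can recover the rings directly in Cartesian coordinates:
% Density-based clustering of the same three concentric rings
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
idx = dbscan(X,0.5,10); % epsilon = 0.5, minpts = 10 (assumed values); idx == -1 marks noise
gscatter(X(:,1),X(:,2),idx)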

Related

Evaluation of K-means clustering (accuracy)

I created 2-dimensional random datasets (each composed of a set of points and a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by its X and Y coordinates and each label gives the point's cluster; see the example in the figure below.
I applied the k-means clustering algorithm to these point datasets. I need help with the following:
What function can I use to evaluate the accuracy of the k-means algorithm? In more detail: my aim is to score the k-means result by how many labels it assigns correctly, comparing them with the labels MATLAB assigned when generating the data. For example, I want to verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), and so on.
If I understand your question correctly, you are looking for the adjusted Rand index. This will score the similarity between your MATLAB labels and your k-means labels.
Alternatively, you can create a confusion matrix to visualise the mapping between the two label sets.
I would use the squared error: you are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
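For example (a sketch assuming functions from the Statistics and Machine Learning Toolbox; trueLabels, X and k stand in for your own variables): confusionmat compares two label vectors, and the third output of kmeans returns the within-cluster sums of squared point-to-centroid distances, i.e. the squared-error objective:
% trueLabels: the labels you generated; X: the points; k: number of clusters
[idx,C,sumd] = kmeans(X,k);        % sumd(j) = within-cluster sum of squared distances for cluster j
totalSquaredError = sum(sumd);     % the overall k-means objective
CM = confusionmat(trueLabels,idx); % rows: true labels, columns: k-means labels
Note that k-means numbers its clusters arbitrarily, so you need to match columns of CM to rows (or use an adjusted Rand index implementation) before reading off an accuracy.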

Clustering of 1 dimensional data

I am trying to learn the k-means clustering algorithm in MATLAB without using the built-in kmeans function. Say I have data of size 1x100 and I want to group it into two clusters. How can I do this? I also want to visualize the two centroids together with the data on a single plot in MATLAB.
Note: when I plot in MATLAB, I am able to see only the data, but not the data and the two centroids simultaneously.
Any help in this regard is highly appreciated.
A minimal k-means clustering algorithm in MATLAB could be:
p = rand(100,2); % rand(number_of_points,number_of_dimensions)
c = p(1:3,:);    % use 3 points of the dataset as the initial centroids
% We run this minimal k-means loop:
for ii = 1:10
    % Which centroid is the closest to each point? min(Euclidean distance):
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    % We calculate the new centroids (the center of mass of the corresponding points)
    c = splitapply(@mean,p,idx(:));
end
And we can plot the result if needed:
hold on
scatter(p(:,1),p(:,2),[],idx(:))
scatter(c(:,1),c(:,2),[],'red')
And we obtain a plot with our 3 centroids in red and each cluster in a distinct color.
Notice that in this example the data are of dimension 2, but the same code works for any other dimension.
The 3 initial centroids are taken from the dataset itself (here the data are random, so this amounts to a random selection); this ensures that every centroid is the closest centroid for at least 1 point.
This example runs a fixed 10 iterations, but it is certainly better to define a tolerance and stop iterating once the centroids have converged, as sketched below.
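For instance (a sketch of that stopping rule; the tolerance and iteration cap are arbitrary choices):
tol = 1e-8;     % stop when no centroid moves more than this (arbitrary value)
maxIter = 100;  % safety cap on the number of iterations
for ii = 1:maxIter
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    cNew = splitapply(@mean,p,idx(:));
    if max(abs(cNew(:)-c(:))) < tol % centroids have converged
        c = cNew;
        break
    end
    c = cNew;
end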

How to get the threshold value of k-means algorithm that is used to binarize the images?

I applied the k-means algorithm to segment images, using the built-in kmeans function. It works properly, but I want to know the threshold value that converts the image to binary in the k-means method. For example, we can get the threshold value with a built-in function in MATLAB:
threshold = graythresh(grayscaledImage);
bw = im2bw(grayscaledImage,threshold);
% Applying k-means...
imdata = reshape(grayscaledImage,[],1);
imdata = double(imdata);
[imdx,mn] = kmeans(imdata,2);
imIdx = reshape(imdx,size(grayscaledImage));
imshow(imIdx,[]);
Actually, k-means and the well-known Otsu method for binarizing intensity images with a global threshold have an interesting relationship:
http://www-cs.engr.ccny.cuny.edu/~wolberg/cs470/doc/Otsu-KMeansHIS09.pdf
It can be shown that k-means is a locally optimal, iterative solution to the same objective function as Otsu, where Otsu is a globally optimal, non-iterative solution.
Given greyscale intensity data, one can compute a threshold based on Otsu's method, which is available in MATLAB as graythresh or otsuthresh, depending on which interface you prefer.
A = imread('cameraman.tif');
A = im2double(A);
totsu = otsuthresh(histcounts(A,10000))
[~,c] = kmeans(A(:),2,'Replicates',10);
tkmeans = mean(c)
You can obtain a grayscale threshold from kmeans by just finding the midpoint of the two centroids, which should make sense geometrically since on either side of that midpoint, you are closer to one of the centroids or the other, and should therefore lie in that respective cluster.
totsu =
    0.3308
tkmeans =
    0.3472
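If the goal is the binary image itself, you can apply the k-means-derived threshold the same way as the Otsu one (a short usage sketch; imbinarize has been the recommended replacement for im2bw since R2016a):
BW = imbinarize(A,tkmeans); % equivalent to A > tkmeans for a double image
imshow(BW)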
You can't get the threshold, because there is no threshold in the k-means algorithm.
K-means is a clustering algorithm; it returns clusters which, in many cases, cannot be obtained with simple thresholding.
See this link to learn more about how k-means works.

How to generate this shape in Matlab?

In MATLAB, how can I generate two clusters of random points like those in the following graph? Can you show me the script/code?
If you want to generate such data points, you will need to have their probability distribution to be able to generate the points.
For your points, I do not have the real distributions, so I can only give an approximation. From your figure I see that both clusters lie approximately on a circle, with a random radius and a limited span for the angle. I assume those angles and radii are uniformly distributed over certain ranges, which seems like a pretty good starting point.
Therefore it also makes sense to generate the random data in polar coordinates (i.e. angle and radius) instead of the cartesian ones (i.e. horizontal and vertical), and transform them to allow plotting.
C1 = [0 0]; % center of the circle
C2 = [-5 7.5];
R1 = [8 10]; % range of radii
R2 = [8 10];
A1 = [1 3]*pi/2; % [rad] range of allowed angles
A2 = [-1 1]*pi/2;
nPoints = 500;
urand = @(nPoints,limits)(limits(1) + rand(nPoints,1)*diff(limits));
randomCircle = @(n,r,a)(pol2cart(urand(n,a),urand(n,r)));
[P1x,P1y] = randomCircle(nPoints,R1,A1);
P1x = P1x + C1(1);
P1y = P1y + C1(2);
[P2x,P2y] = randomCircle(nPoints,R2,A2);
P2x = P2x + C2(1);
P2y = P2y + C2(2);
figure
plot(P1x,P1y,'or'); hold on;
plot(P2x,P2y,'sb');
axis square
This yields:
This method works relatively well when you deal with distributions that you can transform easily and when you can describe the possible locations of the points. If you cannot, there are other methods such as inverse transform sampling, which offers algorithms to generate the data instead of the manual variable transformations I did here.
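As a quick illustration of that idea (my addition, not part of the original answer): to sample points uniformly by area over an annulus with radii r1 < r2, the radius CDF F(r) = (r^2 - r1^2)/(r2^2 - r1^2) can be inverted and applied to uniform samples:
r1 = 8; r2 = 10;                  % inner and outer radii (example values)
n = 500;
u = rand(n,1);                    % uniform samples on [0,1]
r = sqrt(r1^2 + u*(r2^2 - r1^2)); % inverse CDF of the radius
th = 2*pi*rand(n,1);              % uniform angle
[x,y] = pol2cart(th,r);
plot(x,y,'.'); axis equal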
K-means is not going to give you what you want.
For K-means, vectors are classified based on their nearest cluster center. I can only think of two ways you could get the non-convex assignment shown in the picture:
Your input data is actually higher-dimensional, and your sample image is just a 2-d projection.
You're using a distance metric with different scaling across the dimensions.
To achieve your aim:
Use a non-linear clustering algorithm.
Apply a non-linear transform to your input data. (Probably not feasible).
You can find a list of non-linear clustering algorithms here. Specifically, look at this reference on the MST clustering page. Your exact shape appears on the fourth page of the PDF, together with a comparison of what happens with k-means.
For existing MATLAB code, you could try this Kernel K-Means implementation. Also, check out the Clustering Toolbox.
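As a concrete example of the first option (my addition, not from the linked references): recent MATLAB releases (R2019b and later) ship spectralcluster in the Statistics and Machine Learning Toolbox, which handles non-convex shapes like the one in your picture:
% X: your n-by-2 data matrix (assumed variable name)
idx = spectralcluster(X,2); % 2 clusters via a similarity-graph embedding
gscatter(X(:,1),X(:,2),idx)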
I am assuming that you really want to do the clustering operation on existing data, as opposed to generating the data itself. Since you have a plot of some data, it seems logical that you already know how to generate it! If I am wrong in this assumption, then you should word your questions more carefully in the future.
The human brain is quite good at seeing patterns in things like this, but writing code that does the same on a computer will often take some serious effort.
As has been said already, traditional clustering tools such as k-means will fail. Luckily, the image processing toolbox has good tools for these purposes already written. I might suggest converting the plot into an image, using filled in dots to plot the points. Make sure the dots are large enough that they touch each other within a cluster, with some overlap. Then use dilation/erosion tools if necessary to make sure that any small cracks are filled in, but don't go so far as to cause the clusters to merge. Finally, use region segmentation tools to pick out the clusters. Once done, transform back from pixel units in the image into your spatial units, and you have accomplished your task.
For the image processing approach to work, you will need sufficient separation between the clusters compared to the coarseness within a cluster. But that seems obvious for any method to succeed.
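A sketch of that pipeline (the function names are from the Image Processing Toolbox; the grid size and dot radius are assumptions that depend on how well-separated your clusters are):
% X: n-by-2 point coordinates (assumed variable name)
g = 200;                                         % raster grid size (assumption)
cmin = min(X); cmax = max(X);
px = round(1 + (X - cmin)./(cmax - cmin)*(g-1)); % map points to pixel coordinates
BW = false(g,g);
BW(sub2ind([g g],px(:,2),px(:,1))) = true;       % rasterize the points
BW = imdilate(BW,strel('disk',3));               % grow the dots until each cluster becomes one blob
CC = bwconncomp(BW);                             % region segmentation
L = labelmatrix(CC);
idx = L(sub2ind([g g],px(:,2),px(:,1)));         % cluster label for each original point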

Getting the index of closest data point to the centriods in Kmeans clustering in MATLAB

I am doing some clustering using K-means in MATLAB. As you might know the usage is as below:
[IDX,C] = kmeans(X,k)
where IDX gives the cluster number for each data point in X, and C gives the centroids of each cluster. I need to get the index (row number in the actual data set X) of the data point closest to each centroid. Does anyone know how I can do that?
Thanks
The "brute-force approach", as mentioned by #Dima would go as follows
% Loop through all clusters
closestIdx = zeros(max(IDX),1); % preallocate the result
for iCluster = 1:max(IDX)
    % Find the points that are part of the current cluster
    currentPointIdx = find(IDX==iCluster);
    % Find the index (among points in the cluster) of the point that has
    % the smallest Euclidean distance from the centroid:
    % bsxfun subtracts coordinates, then we sum the squares of the
    % difference vectors and take the minimum
    [~,minIdx] = min(sum(bsxfun(@minus,X(currentPointIdx,:),C(iCluster,:)).^2,2));
    % Store the index into X (among all the points)
    closestIdx(iCluster) = currentPointIdx(minIdx);
end
To get the coordinates of the point that is closest to the centroid of cluster k, use
X(closestIdx(k),:)
The brute-force approach would be to run k-means, then compare each data point in a cluster to the centroid and find the one closest to it. This is easy to do in MATLAB.
On the other hand, you may want to try the k-medoids clustering algorithm, which gives you an actual data point as the "center" of each cluster. Here is a MATLAB implementation.
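A note I'll add (not part of the original answer): if I remember correctly, the Statistics and Machine Learning Toolbox has shipped kmedoids since R2014b, and its fifth output is exactly the index into X of each cluster's medoid:
[idx,C,sumd,D,midx] = kmedoids(X,k); % C equals X(midx,:)
% midx(i) is the row in X of the medoid (most central point) of cluster i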
Actually, kmeans already gives you the answer, if I understand you right:
[IDX,C,~,D] = kmeans(X,k); % D is the n-by-k matrix of distances from each data point to each centroid
[minD,indMinD] = min(D);   % indMinD(i) is the index (in X) of the point closest to the i-th centroid