Evaluation of K-means clustering (accuracy) - MATLAB

I created 2-dimensional random datasets (each composed of a set of points and a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by a vector of X and Y (the point coordinates) and each label gives the data point's cluster.
I applied the k-means clustering algorithm to these point datasets. I need help with the following:
What function can I use to evaluate the accuracy of the k-means algorithm? In more detail: my aim is to score the k-means output based on how many labels it assigns correctly, by comparing against the labels assigned by MATLAB. For example, I want to verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), etc.

If I understand your question correctly, you are looking for the adjusted Rand index. It will score the similarity between your MATLAB labels and your k-means labels.
Alternatively, you can create a confusion matrix to visualise the mapping between your two label sets.
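As far as I know there is no built-in adjusted Rand index in MATLAB, but it is short to compute from the contingency table. A minimal sketch, assuming trueLabels and kmeansLabels are equal-length vectors of cluster assignments (both names are placeholders):
% Adjusted Rand index from two label vectors.
C   = confusionmat(trueLabels, kmeansLabels);  % contingency table
n   = sum(C(:));
ch2 = @(v) sum(v .* (v - 1) / 2);              % sum of nchoosek(v,2) terms
sumij = ch2(C(:));
sumi  = ch2(sum(C, 2));                        % row sums
sumj  = ch2(sum(C, 1));                        % column sums
expected = sumi * sumj / ch2(n);
ari = (sumij - expected) / ((sumi + sumj) / 2 - expected)
An ARI of 1 means the two labelings agree perfectly (up to renaming of clusters); values near 0 indicate chance-level agreement.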

I would use the squared error.
You are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
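MATLAB's kmeans already returns the within-cluster sums of point-to-centroid distances, so this score falls out for free. A sketch, assuming X is your n-by-2 data matrix and k the number of clusters:
% sumd(j) is the within-cluster sum of distances for cluster j
% (squared Euclidean by default):
[idx, C, sumd] = kmeans(X, k);
totalSquaredError = sum(sumd)   % lower is better for a fixed k
Note this measures cluster compactness, not agreement with your ground-truth labels; for the latter, use the adjusted Rand index above.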

Related

How to get the threshold value of the k-means algorithm that is used to binarize images?

I applied the k-means algorithm to segment images, using the built-in kmeans function. It works properly, but I want to know the threshold value that converts the result to a binary image in the k-means method. For comparison, we can get a threshold value using built-in MATLAB functions:
% Otsu threshold, then binarize:
threshold = graythresh(grayscaledImage);
bw = im2bw(grayscaledImage, threshold);
% Applying k-means with two clusters instead:
imdata = reshape(grayscaledImage, [], 1);    % one intensity value per row
imdata = double(imdata);
[imIdx, mn] = kmeans(imdata, 2);             % mn holds the two cluster centroids
imIdx = reshape(imIdx, size(grayscaledImage));
imshow(imIdx, []);
Actually, k-means and the well-known Otsu threshold for binarizing intensity images based on a global threshold have an interesting relationship:
http://www-cs.engr.ccny.cuny.edu/~wolberg/cs470/doc/Otsu-KMeansHIS09.pdf
It can be shown that k-means is a locally optimal, iterative solution to the same objective function as Otsu, where Otsu is a globally optimal, non-iterative solution.
Given greyscale intensity data, one can compute a threshold based on Otsu's method, expressed in MATLAB as either graythresh or otsuthresh, depending on which interface you prefer.
A = imread('cameraman.tif');
A = im2double(A);
% Otsu threshold from a fine-grained histogram:
totsu = otsuthresh(histcounts(A, 10000))
% Two-cluster k-means on the raw intensities:
[~, c] = kmeans(A(:), 2, 'Replicates', 10);
tkmeans = mean(c)
You can obtain a grayscale threshold from kmeans by simply taking the midpoint of the two centroids. This makes sense geometrically: on either side of that midpoint you are closer to one centroid or the other, and should therefore lie in that respective cluster.
totsu =
0.3308
tkmeans =
0.3472
You can't get the threshold, because there is no threshold in the k-means algorithm.
k-means is a clustering algorithm; it returns clusters which in many cases cannot be obtained with simple thresholding.
See this link to learn more about how k-means works.

MATLAB kmeans clustering for non-linearly separable data

I have non-linearly separable data at hand. I want to cluster it using the k-means implementation in MATLAB, and I want the cluster label of each data point so I can use the labels in another classification problem.
The problem is that k-means is not giving the results I expected. I'm attaching the cluster plot I obtained.
I expected k-means to give clusters shaped like concentric circles, as the data looks, but the output was arcs. I don't understand why this is happening.
Can you suggest any other clustering method to achieve my goal?
Before using an algorithm, you should try to understand it: what is the goal of an algorithm, and how does it achieve it. For k-means, Wikipedia tells us the following:
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Three concentric circles would have the exact same mean, so k-means is not suitable to separate them. The result is really what you should expect from k-means here.
Now, if you know that your clusters will always be concentric circles, you can simply convert your Cartesian (x-y) coordinates to polar coordinates, and use only the radius rho for clustering, since you know that the angle theta doesn't matter:
% Create random data
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
% Transform to polar
[theta,rho] = cart2pol(X(:,1),X(:,2));
% k-means clustering
idx = kmeans(rho,3);
% Plot results
hold on
plot(X(idx==1,1), X(idx==1,2), 'r.')
plot(X(idx==2,1), X(idx==2,2), 'g.')
plot(X(idx==3,1), X(idx==3,2), 'b.')
Or more generally: use a suitable kernel for k-means clustering, or use another algorithm.
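As one concrete alternative (my suggestion, not part of the original answer): newer MATLAB releases ship spectralcluster (R2019b and later), which handles ring-shaped clusters without a manual coordinate transform. A sketch on the same X as above:
% Graph-based clustering into 3 groups; no polar transform needed:
idx = spectralcluster(X, 3);
gscatter(X(:,1), X(:,2), idx);   % color the points by cluster label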

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis: for instance, an epsilon of 10 m in the x-y plane, and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered across large surfaces, but within narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
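A minimal sketch of that scaling, assuming P is an n-by-3 matrix of [x y z] coordinates and using MATLAB's dbscan (available since R2019a); the minpts value here is a placeholder:
epsXY = 10;     % desired epsilon in the x-y plane (meters)
epsZ  = 0.2;    % desired epsilon along the z axis (meters)
Pscaled = P .* [1, 1, epsXY/epsZ];   % stretch z by a factor of 50
idx = dbscan(Pscaled, epsXY, 5);     % one Euclidean epsilon now fits all axes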
Solution 2:
Use a weighted distance function.
For example, WeightedEuclideanDistanceFunction in ELKI; choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to compute points within an epsilon. So all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0     0
  0  100     0
  0    0  0.04
In the pseudo code provided at the Wikipedia entry for DBSCAN just use one of the distance metrics suggested above in the regionQuery function.
Update
Note: scaling the data is equivalent to using an appropriate metric.
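To make the Mahalanobis route concrete: MATLAB's pdist accepts a covariance matrix for its 'mahalanobis' metric, and dbscan (R2019a and later) accepts a precomputed distance matrix. A sketch, again assuming P is an n-by-3 coordinate matrix:
S = diag([100, 100, 0.04]);                 % the covariance matrix above
D = squareform(pdist(P, 'mahalanobis', S)); % pairwise Mahalanobis distances
idx = dbscan(D, 1, 5, 'Distance', 'precomputed');  % epsilon = 1, minpts = 5
With this covariance, an epsilon of 1 corresponds to 10 m in the x-y plane and 0.2 m along z, matching the question.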

K-means Clustering, major understanding issue

Suppose that we have a 64-dim matrix to cluster; let's say the dataset matrix is dt = 64x150.
Using the vl_kmeans function from the vl_feat library, I will cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to the manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in the assignments matrix mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent, anyway?
In k-means, the problem you are trying to solve is clustering your 150 points into 20 clusters. Each point is 64-dimensional and is thus represented by a vector of size 64. So in your case dt is the set of points; each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers holds the 20 positions of the cluster centers in 64-dim space, in case you want to visualize them, measure distances between points and clusters, and so on. assignments, on the other hand, contains the actual assignment of each 64-dim point in dt. So if assignments(7) is 15, it indicates that the 7th vector in dt belongs to the 15th cluster.
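A tiny sketch of that indexing relationship, using the names from the question (the random data is just for illustration):
dt = rand(64, 150);                        % 150 points, one per column
[centers, assignments] = vl_kmeans(dt, 20);
k7 = assignments(7);                       % cluster index of the 7th point
c7 = centers(:, k7);                       % the 64-dim centroid it belongs to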
For example, the linked image below shows a clustering of many 2D points, say 1000, into 3 clusters. In that case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV, it is the number of the cluster that each of the input points belongs to.

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this would be to use the Mahalanobis distance to combine the metrics. I obviously cannot compute a covariance matrix between the coordinates, but I can compute one over the distance metrics. Does this make sense? That is, if I compute the distance between two objects i and j as:
D(i,j) = sqrt( d^T S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, d^T is its transpose, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact this is essentially the Mahalanobis distance with a diagonal covariance matrix), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS followed by clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
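For what it's worth, a sketch of the distance the question proposes, assuming the three pairwise matrices are stacked into an n-by-n-by-3 array dists (a name invented here for illustration):
d3  = reshape(dists, [], 3);      % each row is the 3-vector d for one pair
S   = cov(d3);                    % covariance across the three metrics
Dsq = sum((d3 / S) .* d3, 2);     % d^T * inv(S) * d for every pair at once
D   = sqrt(reshape(Dsq, size(dists,1), size(dists,2)));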
Consider using k-medoids (PAM) instead of a hacked k-means; it can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem: it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
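Both suggestions are readily available in MATLAB. A minimal sketch of the linkage route, assuming D is the n-by-n distance matrix from the question and k a chosen cluster count:
% kmedoids(X, k, 'Distance', ...) covers the PAM suggestion when you have
% raw data; with only a distance matrix, hierarchical linkage is direct:
dvec = squareform(D);              % condensed pairwise-distance vector
Z    = linkage(dvec, 'average');   % average-linkage hierarchy
idx  = cluster(Z, 'maxclust', k);  % cut the dendrogram into k clusters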