How to cluster new data using cluster centers generated by kmeans - matlab

I have obtained some cluster centers using kmeans in MATLAB.
Now I have some new data points, and I don't want to use a for loop to compute the distance between every data point and every cluster center, because it is too slow.
How should I do this?

Try MATLAB's knnsearch(X,Y), where X is the matrix of kmeans centers and Y is the set of new data points.
See http://www.mathworks.com/help/stats/knnsearch.html
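As a minimal sketch of that call (assuming the Statistics and Machine Learning Toolbox is installed; the values of C and Y below are made up for illustration), each row of the result is the row index of the nearest center:

```matlab
% Hypothetical cluster centers, one per row (k-by-p)
C = [0 0; 5 5; 10 0];
% Hypothetical new points, one per row (m-by-p)
Y = [0.5 -0.2; 4.8 5.1; 9 1];
% idx(i) is the row of C that is nearest to Y(i,:) (Euclidean by default)
idx = knnsearch(C, Y);
```

With these values, each new point lies next to a different center, so idx is [1; 2; 3].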

I'd do something like this.
If your new points are stored in coord (rows are points, columns are coordinates), your cluster means are stored the same way in means, and you want to know which cluster each point is closest to (in Euclidean distance), then:
% compute squared Euclidean distances; d2 is size(coord,1)-by-size(means,1)
d2 = repmat(sum(coord.^2,2),1,size(means,1)) + repmat(sum(means.^2,2)',size(coord,1),1) - 2*coord*means';
% Assign to nearest cluster
[~, assigns] = min(d2,[],2);
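The expression for d2 relies on the expansion ||x - m||^2 = ||x||^2 + ||m||^2 - 2*x*m'. Here is a sketch of the same computation, assuming MATLAB R2016b or later so that implicit expansion can replace repmat. The toy values are made up, and the max(d2,0) clamp is an added safeguard against tiny negative values from round-off, not part of the answer above:

```matlab
coord = [0 0; 3 4; 6 8];   % hypothetical new points, one per row
means = [0 0; 6 9];        % hypothetical cluster centers, one per row
% Squared Euclidean distances from every point to every center (3-by-2 here)
d2 = sum(coord.^2,2) + sum(means.^2,2)' - 2*coord*means';
d2 = max(d2, 0);           % guard against small negative round-off
% Index of the nearest center for each point
[~, assigns] = min(d2, [], 2);
```

For these values, d2 is [0 117; 25 34; 100 1], so the first two points go to center 1 and the third to center 2.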

Related

Clustering of 1 dimensional data

I am trying to learn the k-means clustering algorithm in MATLAB without using the built-in kmeans function. Say I have data of size 1x100 and I want to group it into two clusters. How can I do this? I want to visualize the two centroids and the data together on one plot in MATLAB.
Note: When I plot in MATLAB, I can see only the data, not the data and the two centroids simultaneously.
Any help in this regard is highly appreciated.
A minimal k-means clustering algorithm in MATLAB could be:
p = rand(100,2); % rand(number_of_points,number_of_dimensions)
c = p(1:3,:); % We create 3 centroids from the first 3 points
% We run this minimal k-means algorithm:
for ii = 1:10
    % Which centroid is the closest to each point? min(squared Euclidean distance):
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    % We calculate the new centroids (the center of mass of the corresponding points);
    % mean(x,1) keeps a row vector even if a cluster holds a single point
    c = splitapply(@(x) mean(x,1),p,idx(:));
end
And we can plot the result if needed:
hold on
scatter(p(:,1),p(:,2),[],idx(:))
scatter(c(:,1),c(:,2),[],'red')
The resulting plot shows the 3 centroids in red and each cluster in a distinct color.
Note that in this example the data are 2-dimensional, but the code also works with any other dimension.
The 3 initial centroids are 3 points of the (random) dataset, which ensures that every centroid is the closest centroid for at least 1 point.
This example runs 10 iterations, but it is certainly better to define a tolerance and stop iterating once the centroids have converged.
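The last two remarks can be combined into a sketch for the 1x100 case from the question. It stops on a tolerance and plots data and centroids on the same axes. The toy data, tol and maxIter are assumptions chosen for illustration, and implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
% Toy 1-D data: two well-separated groups of 50 values each (1x100)
x = [linspace(0,1,50), linspace(5,6,50)];
c = [x(1); x(end)];   % initial centroids: two points taken from the data
tol = 1e-6;           % hypothetical convergence tolerance
maxIter = 100;        % hypothetical safety cap on the number of iterations
for iter = 1:maxIter
    % 100-by-2 matrix of distances from every value to every centroid
    d = abs(x(:) - c');
    [~, idx] = min(d, [], 2);
    cNew = c;
    for j = 1:numel(c)
        if any(idx == j)                 % keep the old centroid if a cluster is empty
            cNew(j) = mean(x(idx == j));
        end
    end
    shift = max(abs(cNew - c));          % largest centroid movement this iteration
    c = cNew;
    if shift < tol
        break
    end
end
% Plot data and centroids together; "hold on" keeps the first plot on the axes
figure;
plot(x, zeros(size(x)), 'b.');
hold on
plot(c, zeros(size(c)), 'rx', 'MarkerSize', 12, 'LineWidth', 2);
hold off
```

Without hold on, the second plot call replaces the first, which is the likely reason only one of the two shows up.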

Cluster data based on some threshold points

How can I cluster data based on some threshold values?
I have some points like 2, 1, and 0.5. I need to find the elements nearest to each of these 3 points and cluster the data into 3 groups. How can I do that in MATLAB? Please help me.
The idea is to use these points like centroids for the clustering.
For example, treating 2 as a centroid, I would find the elements near that point and cluster the dense region around 2.
(Calculate the distances from all the centroids, then
classify the data into the nearest cluster.)
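The two steps in parentheses can be sketched as follows, treating the three values as fixed centroids. The data vector is made up for illustration, and implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
seeds = [2 1 0.5];                    % the three reference points from the question
data  = [1.9 0.4 1.1 0.6 2.3 0.9];    % hypothetical 1-D data
% Step 1: distance from every element to every seed (6-by-3 matrix)
d = abs(data(:) - seeds);
% Step 2: group(i) is the index of the seed nearest to data(i)
[~, group] = min(d, [], 2);
% Elements assigned to the seed with value 2
nearTwo = data(group == 1);
```

For these values, group is [1; 3; 2; 3; 1; 2], so nearTwo contains 1.9 and 2.3.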

Query regarding k-means clustering in MATLAB

I have a very large amount of data in the form of a matrix. I have already clustered it using k-means clustering in MATLAB R2013a. I want the exact coordinates of the centroid of each cluster. Is this possible using a formula or anything else?
I want to find the centroid of each cluster so that whenever new data arrives in the matrix, I can compute its distance from each centroid and find the cluster to which the new data belongs.
My data is heterogeneous in nature, so it is difficult to compute the average of the data in each cluster by hand. I am therefore trying to write some code that prints the centroid locations automatically.
In MATLAB, use
[idx,C] = kmeans(..)
instead of
idx = kmeans(..)
As per the documentation:
[idx,C] = kmeans(..) returns the k cluster centroid locations in the k-by-p matrix C.
The centroid is simply evaluated as the average value of all the points' coordinates that are assigned to that cluster.
If you have the assignments {point;cluster} you can easily evaluate the centroid: let's say you have a given cluster with n points assigned to it and these points are a1,a2,...,an. You can evaluate the centroid for such cluster by using:
centroid=(a1+a2+...+an)/n
Obviously you can run this process in a loop, depending on how your data structure (i.e. the assignment point/centroid) is organized.
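A short sketch of that loop, followed by the assignment of a new point to its nearest centroid. X, idx and xNew are made-up illustrative values (idx would normally come from kmeans), and implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
X   = [1 2; 3 4; 10 10; 12 14];   % hypothetical data, one point per row
idx = [1; 1; 2; 2];               % hypothetical cluster assignment of each row
k = max(idx);
C = zeros(k, size(X,2));
for j = 1:k
    % Centroid j is the average of all points assigned to cluster j
    C(j,:) = mean(X(idx == j, :), 1);
end
% Assign a new point to the nearest centroid (squared Euclidean distance)
xNew = [11 11];
[~, cluster] = min(sum((C - xNew).^2, 2));
```

Here C is [2 3; 11 12], and the new point falls into cluster 2.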

K-means Clustering, major understanding issue

Suppose that we have 64-dimensional data to cluster; say the dataset matrix dt is 64x150.
Using the kmeans function from the VLFeat library, I cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in the assignments matrix mean. I don't get it at all. Would anyone mind helping me a bit here? An example would be great. What do these values represent?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers holds the positions of the 20 cluster centers in the 64-dim space, in case you want to visualize them, measure distances between points and clusters, etc. assignments, on the other hand, contains the actual assignment of each 64-dim point in dt. So if assignments(7) is 15, the 7th vector (column) in dt belongs to the 15th cluster.
For example, here you can see a clustering of many 2-D points, say 1000, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, if you are using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV, it is the number of the cluster that each input point belongs to.
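To make the meaning of the assignment vector concrete, here is a small sketch with made-up values that follow the same layout as the vl_kmeans outputs (one point per column, one label per point). No VLFeat call is made; dt and assignments are hypothetical:

```matlab
dt = [1 9 2 5 8 1; ...
      1 9 1 5 9 2];               % hypothetical 2x6 data, one point per column
assignments = [1 2 1 3 2 1];      % hypothetical label of each column (1..3)
% All points (columns) that belong to cluster 1
members1 = dt(:, assignments == 1);
% Number of points in each cluster
counts = accumarray(assignments(:), 1)';
```

Here members1 holds columns 1, 3 and 6 of dt, and counts is [3 2 1].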

Getting the index of the closest data point to the centroids in K-means clustering in MATLAB

I am doing some clustering using K-means in MATLAB. As you might know the usage is as below:
[IDX,C] = kmeans(X,k)
where IDX gives the cluster number for each data point in X, and C gives the centroid of each cluster. I need to get the index (row number in the actual data set X) of the data point closest to each centroid. Does anyone know how I can do that?
Thanks
The "brute-force approach", as mentioned by @Dima, would go as follows:
% preallocate one index per cluster
closestIdx = zeros(max(IDX),1);
% loop through all clusters
for iCluster = 1:max(IDX)
    % find the points that are part of the current cluster
    currentPointIdx = find(IDX==iCluster);
    % find the index (among points in the cluster)
    % of the point that has the smallest Euclidean distance from the centroid:
    % bsxfun subtracts coordinates, then you sum the squares of
    % the difference vectors, then you take the minimum
    [~,minIdx] = min(sum(bsxfun(@minus,X(currentPointIdx,:),C(iCluster,:)).^2,2));
    % store the index into X (among all the points)
    closestIdx(iCluster) = currentPointIdx(minIdx);
end
To get the coordinates of the point that is closest to the cluster center k, use
X(closestIdx(k),:)
The brute-force approach would be to run k-means, then compare each data point in a cluster to its centroid and find the one closest to it. This is easy to do in MATLAB.
On the other hand, you may want to try the k-medoids clustering algorithm, which gives you a data point as the "center" of each cluster. Here is a MATLAB implementation.
Actually, kmeans already gives you the answer, if I understand you right:
[IDX,C,~,D] = kmeans(X,k); % D(i,j) is the distance from data point i to centroid j (n-by-k)
[minD, indMinD] = min(D); % indMinD(j) is the index (row in X) of the point closest to the j-th centroid
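The same column-wise minimum can be sketched without the toolbox by building the n-by-k distance matrix by hand. The data and labels below are made up, the centroids are recomputed as cluster means, and the masking step is an added assumption that restricts each search to the cluster's own members. Implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
X   = [0 0; 2 0; 1 0.1; 10 10; 12 10; 11 10.2];   % hypothetical data
IDX = [1; 1; 1; 2; 2; 2];                         % hypothetical cluster labels
k = max(IDX);
C = zeros(k, size(X,2));
for j = 1:k
    C(j,:) = mean(X(IDX == j, :), 1);             % centroid of cluster j
end
% n-by-k squared distances: D(i,j) is from point i to centroid j
D = sum(X.^2,2) + sum(C.^2,2)' - 2*X*C';
% Ignore points that are not members of cluster j when searching column j
D(IDX ~= (1:k)) = Inf;
% closestIdx(j) is the row of X closest to centroid j
[~, closestIdx] = min(D, [], 1);
```

For these values closestIdx is [3 6]: the third and sixth rows of X lie nearest to their respective centroids.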