I have a very large amount of data in the form of a matrix. I have already clustered it using k-means clustering in MATLAB R2013a. I want the exact coordinates of the centroid of each cluster formed. Is it possible using a formula or anything else?
I want to find the centroid of each cluster so that whenever new data arrives in the matrix, I can compute its distance from each centroid and determine which cluster the new data belongs to.
My data is heterogeneous in nature, so it is difficult to compute the average of the data in each cluster by hand. That is why I am trying to write some code that prints the centroid locations automatically.
In MATLAB, use
[idx,C] = kmeans(..)
instead of
idx = kmeans(..)
As per the documentation:
[idx,C] = kmeans(..) returns the k cluster centroid locations in the k-by-p matrix C.
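For instance, assuming your data sits in a matrix X with one observation per row and you ask for k clusters (X and k are placeholder names here), a minimal sketch would be:
k = 4;                       % assumed number of clusters
[idx, C] = kmeans(X, k);     % C is k-by-p, one centroid per row
C(2,:)                       % coordinates of the centroid of cluster 2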
The centroid is simply the average of the coordinates of all the points assigned to that cluster.
If you have the point-to-cluster assignments, you can easily evaluate the centroid: say a given cluster has n points a1, a2, ..., an assigned to it. The centroid of that cluster is then:
centroid=(a1+a2+...+an)/n
Obviously you can run this computation in a loop over the clusters, depending on how your data structure (i.e. the point-to-cluster assignment) is organized.
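As a sketch of that loop, assuming the data is in an n-by-p matrix X and the cluster assignments are in idx (as returned by kmeans); X and idx are assumed names, substitute whatever your own assignment structure uses:
k = max(idx);                                   % number of clusters
centroids = zeros(k, size(X,2));
for i = 1:k
    centroids(i,:) = mean(X(idx == i, :), 1);   % average of all points assigned to cluster i
end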
I created a 2-dimensional random dataset (a set of points plus a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by its X and Y coordinates and each label gives the point's cluster; see the example in the figure below.
I applied the k-means clustering algorithm to these point datasets. I need help with the following:
What function can I use to evaluate the accuracy of the k-means algorithm? In more detail: my aim is to score the k-means algorithm by how many labels it assigns correctly, compared with the labels assigned by MATLAB. For example, I want to verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), etc.
If I understand your question correctly, you are looking for the adjusted Rand index. This will score the similarity between your MATLAB labels and your k-means labels.
Alternatively, you can create a confusion matrix to visualise the mapping between your two label sets.
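If you have the Statistics Toolbox, a minimal confusion-matrix sketch could look like the following, where trueLabels and kmeansLabels are assumed names for your generated labels and the kmeans output:
C = confusionmat(trueLabels, kmeansLabels);   % rows: true labels, columns: k-means labels
disp(C)
Keep in mind that k-means numbers its clusters arbitrarily, so a good result shows up as one dominant count per row rather than a clean diagonal.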
I would use the squared error.
You are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
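In MATLAB, kmeans can return this quantity directly as its third output; a sketch, assuming your data is in X, you ask for k clusters, and you use the default squared-Euclidean distance:
[idx, C, sumd] = kmeans(X, k);   % sumd(j) = sum of squared distances from cluster j's points to its centroid
totalSSE = sum(sumd);            % total within-cluster squared error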
I have obtained some cluster centers using kmeans in MATLAB.
Now there are some new data points, and I don't want to use a for loop to compare the distance between every data point and every cluster center, because that is too slow.
So how should I do it?
Try MATLAB's knnsearch(X,Y), where X is the matrix holding the k-means centers and Y is the set of new data points.
See http://www.mathworks.com/help/stats/knnsearch.html
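A minimal sketch, assuming centers holds the k-means centroids (one per row) and newPoints the new observations (one per row); both are assumed names for your own variables:
nearest = knnsearch(centers, newPoints);   % nearest(i) = index of the centroid closest to newPoints(i,:)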
I'd do something like this.
If your new points are stored in coord with rows as points and columns as coordinates, and your cluster means are similarly stored in means, and you want to know which cluster each point is closest to (in Euclidean distance), then:
% squared Euclidean distance from every point to every cluster mean
d2 = repmat(sum(coord.^2,2),1,size(means,1)) + repmat(sum(means.^2,2)',size(coord,1),1) - 2*coord*means';
% Assign to nearest cluster
[~, assigns] = min(d2,[],2);
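If you have the Statistics Toolbox, pdist2 gives the same assignment with less code; a sketch using the same coord and means variables:
d = pdist2(coord, means);       % Euclidean distance from every point to every cluster mean
[~, assigns] = min(d, [], 2);   % nearest cluster per point (same result, since the argmin is unaffected by squaring)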
idx4 = kmeans(A,4);
silhouette(A,idx4,'Euclidean')
I have a matrix A of dimensions 492 x 5148. I did k-means clustering in MATLAB using the commands above and plotted the result using the silhouette function. It shows 4 clusters beautifully. But now I want to know which row of matrix A is assigned to which cluster. How can I find that out?
From the documentation, http://www.mathworks.es/es/help/stats/kmeans.html , you will see that idx4 contains the cluster index for each row of A.
That is, idx4(1) is the index of the cluster that row A(1,:) belongs to.
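For example, to pull out all rows of A that ended up in, say, cluster 2 (a small sketch using the variables above):
rowsInCluster2 = A(idx4 == 2, :);   % every row of A that k-means assigned to cluster 2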
Suppose that we have a 64-dimensional dataset to cluster; say the data matrix is dt = 64x150.
Using the kmeans function from the vl_feat library, I cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to the manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what those numbers in the assignments vector mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent?
In k-means, the problem you are trying to solve is clustering your 150 points into 20 clusters. Each point is 64-dimensional and is thus represented by a vector of size 64. So in your case dt is the set of points, where each column is a 64-dimensional vector.
After running the algorithm you get centers and assignments. centers holds the 20 positions of the cluster centers in 64-dimensional space, in case you want to visualize them, measure distances between points and clusters, etc. assignments, on the other hand, contains the actual assignment of each 64-dimensional point in dt. So if assignments(7) is 15, the 7th vector (column) of dt belongs to the 15th cluster.
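As a small sketch, to collect every point assigned to cluster 15 (selecting columns, since vl_kmeans stores one point per column):
pointsIn15 = dt(:, assignments == 15);   % 64-by-m block of all points in cluster 15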
For example, here you can see a clustering of many 2D points, say 1000 of them, into 3 clusters. In that case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2 if you are using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV it is the number of the cluster that each of the input points belongs to.
I would like to know whether there are commands to get the cluster that a data point belongs to while generating a dendrogram.
For example, if the data points 32, 46, 26, 15, 33, 54, 17, 19, 27 are grouped into one cluster, how can I obtain this information while plotting the dendrogram?
I computed the linkage function and plotted the dendrogram using the command:
[H,T,perm]=dendrogram(Z,0) (since I have more than 30 data points)
Any suggestions on how to extract the cluster information for the example mentioned above will be helpful.
I would like to use the cluster information for visualization purpose.
Thank you.
Function dendrogram generates the dendrogram plot and (as the documentation explains) "returns T, a vector of size M that contains the leaf node number for each object in the original dataset."
If you want to find all elements belonging to cluster iclust, you can try something similar to the following:
iclust=2; % find all elements in cluster # 2 for example
ifound = find(T==iclust);
edit
By the way, if you want to colorize the dendrogram you can try
[H, T] = dendrogram(Z,'colorthreshold',thresh);
where thresh is a threshold below which branches should be colored.
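If the goal is to recover, as numeric labels, the same groups that the colored dendrogram shows, one option (a sketch, assuming the Statistics Toolbox cluster function and the same thresh value) is:
labels = cluster(Z, 'cutoff', thresh, 'criterion', 'distance');   % labels(i) = cluster of data point i
find(labels == labels(32))                                        % e.g. all points grouped with point 32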