Cluster data based on some threshold points - MATLAB

How can I cluster data based on some threshold values?
I have some points like 2, 1, 0.5. I need to find the elements nearest to each of these 3 points and cluster the data into 3 groups. How can I do that in MATLAB? Please help me.
I want to use these points as centroids for clustering:
assuming 2 is a centroid, find the elements near that point and cluster the dense region around the point 2.
(Calculate the distances from all the centroids,
then classify each data point into the nearest cluster.)
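The steps above (compute the distance to each of the given points, assign each element to the nearest one) can be sketched in NumPy; the data values here are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D data and the three given threshold points
data = np.array([1.9, 0.6, 1.1, 0.4, 2.3, 0.9])
centroids = np.array([2.0, 1.0, 0.5])

# Distance from every data point to every centroid (broadcasting)
dist = np.abs(data[:, None] - centroids[None, :])

# Index of the nearest centroid for each data point
labels = dist.argmin(axis=1)
print(labels)  # one cluster label (0, 1, or 2) per data point
```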

Related

How to cluster new data using cluster centers generated by kmeans

I have got some cluster centers using kmeans in MATLAB.
Now there are some new data points. I don't want to use a for loop to compare the distance between every data point and the cluster centers, because it's too slow.
How should I do this?
Try MATLAB's knnsearch(X,Y), where X is the matrix of k-means centers and Y is the set of new data points.
See http://www.mathworks.com/help/stats/knnsearch.html
I'd do something like this.
Suppose your new points are stored in coord with rows as points and columns as coordinates, your cluster means are similarly stored in means, and you want to know which cluster each point is closest to (in Euclidean distance).
Then:
% compute distance
d2 = repmat(sum(coord.^2,2),1,size(means,1)) + repmat(sum(means.^2,2)',size(coord,1),1) - 2*coord*means';
% Assign to nearest cluster
[~, assigns] = min(d2,[],2);
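The same squared-distance expansion (||a − b||² = ||a||² + ||b||² − 2a·b) translates directly to NumPy, shown here with made-up toy data:

```python
import numpy as np

coord = np.array([[0.0, 0.0], [5.0, 5.0], [0.5, 0.2]])  # new points, rows
means = np.array([[0.0, 0.1], [5.0, 4.9]])              # cluster means, rows

# Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
d2 = (coord**2).sum(1)[:, None] + (means**2).sum(1)[None, :] - 2 * coord @ means.T

# Assign each point to the nearest cluster (0-based, unlike MATLAB's 1-based indexing)
assigns = d2.argmin(axis=1)
```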

Which k-means cluster should I assign a record to, when the euclidean distance between the record and both centroids are the same?

I am doing k-means clustering on 6 records. I am given the centroids and K=3, and I have only 2 features, so I am treating the records as (x,y) points and have plotted them.
Having mapped the points on the x and y axes and computed the Euclidean distances, I found that, say, (8,6) belongs to my first cluster. However, for every other record, the Euclidean distances to its 2 nearest centroids are the same. So should the point (2,6) belong to the centroid (2,4) or (2,8)? Or does (5,4) belong to (2,4) or (8,4)?
Thanks for replying
The objective of k-means is to minimize variance.
Therefore, you should assign the point to that cluster, where variance increases the least. Even when cluster centers are at the same distance, the increase in variance by assigning the point can vary, because the cluster center will move due to this change. This is one of the ideas of the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
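To make the variance-based tie-break concrete: when a point x joins a cluster of n points with centroid c, the within-cluster sum of squared errors grows by n/(n+1)·||x − c||², because the centroid shifts toward x. A sketch with hypothetical cluster sizes (the question does not state them):

```python
import numpy as np

def sse_increase(x, centroid, n):
    """Increase in within-cluster SSE when point x joins a cluster
    of n points whose centroid is `centroid` (accounts for the centroid shift)."""
    return n / (n + 1) * np.sum((np.asarray(x) - np.asarray(centroid)) ** 2)

# Point (2,6) is equidistant from centroids (2,4) and (2,8).
# Suppose (hypothetically) the first cluster has 1 point and the second has 3:
inc_a = sse_increase([2, 6], [2, 4], n=1)  # 1/2 * 4 = 2.0
inc_b = sse_increase([2, 6], [2, 8], n=3)  # 3/4 * 4 = 3.0
# Hartigan-Wong style reasoning puts the point where the increase is smaller.
```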

How can a clustering algorithm in R end up with negative silhouette values?

We know that clustering methods in R assign observations to the closest medoids; hence each observation is supposedly in the closest cluster it can have. So I wonder how it is possible to get negative silhouette values, when we supposedly assign each observation to the closest cluster and the formula in the silhouette method seemingly cannot go negative.
Two errors:
most clustering algorithms do not use the medoid, only PAM does.
the silhouette does not use the distance to the medoid, but the average distance to all cluster members. If the closest cluster is very wide, the average distance can be larger than the distance to the medoid. Consider a cluster with one point in the center, and all others on a sphere around it.
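The silhouette of a point is s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster members and b is the mean distance to the members of the nearest other cluster. It goes negative as soon as a > b, which the wide-cluster example above can produce; a minimal numeric sketch:

```python
def silhouette(a, b):
    # a: mean distance to own-cluster members
    # b: mean distance to members of the nearest other cluster
    return (b - a) / max(a, b)

# A point can be closest to its own medoid yet far from the other
# members of a wide cluster: a > b yields a negative silhouette.
s = silhouette(a=3.0, b=2.0)  # -> -1/3
```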

Query regarding k-means clustering in MATLAB

I have a very large amount of data in the form of a matrix. I have already clustered it using k-means clustering in MATLAB R2013a. I want the exact coordinates of the centroid of each cluster formed. Is this possible using some formula or anything else?
I want to find the centroid of each cluster so that whenever some new data arrives in the matrix, I can compute its distance from each centroid to find the cluster the new data belongs to.
My data is heterogeneous in nature, so it's difficult to compute the average of the data in each cluster by hand. That's why I am trying to write some code that prints the centroid locations automatically.
In MATLAB, use
[idx,C] = kmeans(..)
instead of
idx = kmeans(..)
As per the documentation:
[idx,C] = kmeans(..) returns the k cluster centroid locations in the k-by-p matrix C.
The centroid is simply evaluated as the average value of all the points' coordinates that are assigned to that cluster.
If you have the assignments {point;cluster} you can easily evaluate the centroid: let's say you have a given cluster with n points assigned to it and these points are a1,a2,...,an. You can evaluate the centroid for such cluster by using:
centroid=(a1+a2+...+an)/n
Obviously you can run this process in a loop, depending on how your data structure (i.e. the assignment point/centroid) is organized.
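The averaging step can be sketched in NumPy (made-up data; X holds the points as rows and idx holds each point's cluster index, as kmeans would return):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]])  # points as rows
idx = np.array([0, 0, 1, 1])  # cluster assignment for each point

# Centroid of each cluster = mean of the points assigned to it
centroids = np.vstack([X[idx == k].mean(axis=0) for k in np.unique(idx)])
```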

K-means Clustering, major understanding issue

Suppose that we have a 64-dimensional dataset to cluster; let's say the matrix is dt = 64x150.
Using the vl_kmeans function from the VLFeat library, I cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in the assignments matrix mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent anyway?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers are the 20 positions of the cluster's center in a 64-dim space, in case you want to visualize it, measure distances between points and clusters, etc. 'assignments' on the other hand contains the actual assignments of each 64-dim point in dt. So if assignments[7] is 15 it indicates that the 7th vector in dt belongs to the 15th cluster.
For example, in the linked image you can see the clustering of lots of 2-D points, say 1000 of them, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV, it is the index of the cluster that each of the input points belongs to.
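A toy NumPy analogue of the VLFeat layout (columns are points, not rows), using 2-D points so the shapes are easy to follow; the numbers are made up:

```python
import numpy as np

dt = np.array([[0.0, 0.1, 5.0, 5.1, 0.2],
               [0.0, 0.0, 5.0, 5.0, 0.1]])  # 2x5: five 2-D points as columns
centers = np.array([[0.1, 5.05],
                    [0.0, 5.00]])           # 2x2: two centers as columns

# Squared distance from every point (column of dt) to every center
d2 = ((dt[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)

# 1-based assignments, as VLFeat/MATLAB report them
assignments = d2.argmin(axis=1) + 1
# assignments[i] names the cluster that the i-th column of dt belongs to
```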