sklearn python affinity propagation - is there a method to calculate error in clusters? - cluster-analysis

In looking at the docs for sklearn.cluster and Affinity Propagation I don't see anything that would calculate error in a cluster. Does this exist or is this something I have to write on my own?
Update: Let me propose a possible idea:
With Affinity Propagation we have a similarity matrix (in sklearn this defaults to the negative squared euclidean distance, so each entry measures how similar one sample is to another; larger, i.e. less negative, means more similar). When AP is finished I have the label assignments mapping each sample to its cluster. What if I took the similarity measurements from that matrix? For example, say in a 10x10 matrix point 3 is an exemplar and sample 4 is assigned to it. The similarity between the exemplar and the sample is, say, -5. Say two more samples are assigned to this exemplar with similarities of -3 and -8 respectively. Then I could call the cluster's average error -16/3. If another cluster has similarity measurements of -2, -3, -2, -3, -2, -3, its average is -15/6. This seems to provide a potential error measurement.
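A minimal sketch of this idea with sklearn (toy data, purely illustrative; sklearn's default affinity is the negative squared euclidean distance, so the per-cluster mean of member-to-exemplar similarities is the proposed "error"):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

X = np.random.rand(30, 2)                          # toy data for illustration

ap = AffinityPropagation(random_state=0).fit(X)
S = -pairwise_distances(X, metric="sqeuclidean")   # AP's default similarities

# Mean similarity between each cluster's members and its exemplar.
for k, ex in enumerate(ap.cluster_centers_indices_):
    members = np.where(ap.labels_ == k)[0]
    members = members[members != ex]               # leave out the exemplar itself
    if members.size:
        print(f"cluster {k}: mean similarity to exemplar = {S[members, ex].mean():.3f}")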

I don't think there is a commonly accepted definition of "error" that would make sense in the context of affinity propagation, which is a similarity-based method.
Errors work well with coordinate-based methods such as k-means, but with AP we may not have coordinates.

Related

Cluster data based on some threshold points

How can I cluster data based on some threshold values?
I have some points like 2, 1, 0.5. I need to find the nearest elements to each of these 3 points and cluster the data into 3 groups. How can I do that in MATLAB? Please help me.
I want to use some points (like centroids) for clustering:
Assuming 2 is a centroid, find the elements near that point and cluster the dense region around the point 2.
(Calculate the distances from all the centroids and classify each data point into the nearest cluster; a sketch of this follows.)
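A minimal sketch of that two-step procedure (in Python rather than MATLAB; the data values are hypothetical):

import numpy as np

points = np.array([0.4, 0.6, 0.9, 1.1, 1.8, 2.2])   # hypothetical data
centroids = np.array([2.0, 1.0, 0.5])               # the three given points

# Step 1: distance from every point to every centroid.
dist = np.abs(points[:, None] - centroids[None, :])

# Step 2: assign each point to its nearest centroid.
labels = dist.argmin(axis=1)
print(labels)   # [2 2 1 1 0 0] -> group index per point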

Updating value of K in K-Means Clustering

What is the best way to cluster a dataset with no labels and no idea of the number of clusters required?
For example, using the Iris dataset with no labels or knowledge of the number of label classes.
My idea:
* Compute the mean square distance from each of the existing clusters for a sample.
* If the mean square distance > some threshold by a factor that depends on (penalizes) k, then add the sample as a "new" candidate cluster.
* If a new cluster was added, find the new "best" k+1 cluster centers.
* If no new cluster was added, go to the next row.
What you can do is plot the elbow curve at different values of k, as described here.
Specifically,
1) The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the sum of squared errors (SSE).
2) Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the "elbow" on the arm is the value of k that is best.
3) So our goal is to choose a small value of k that still has a low SSE; the elbow usually represents where we start to have diminishing returns by increasing k (see the sketch after this list).
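A minimal sketch of the elbow curve with scikit-learn, using Iris as in the question (KMeans' inertia_ attribute is exactly the within-cluster SSE):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                      # Iris features, labels ignored

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]                       # inertia_ = within-cluster SSE

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()                                # look for the "elbow" in this curve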
Dozens of methods have been proposed for choosing k.
Some variants such as x-means can adjust k dynamically; you only need to give the maximum, and choose a quality criterion such as AIC or BIC.

Which k-means cluster should I assign a record to when the euclidean distances between the record and both centroids are the same?

I am using k-means to cluster 6 records. I am given the centroids and K=3. I have only 2 features, so I am treating the records as (x, y) points and have plotted them.
Having the points mapped on an x and y axis and computing the euclidean distances, I found that, say, (8,6) belongs to my first cluster. However, for all the other records, the euclidean distances between the record and its 2 nearest centroids are the same. So should the point (2,6) belong to the centroid (2,4) or (2,8)? Or does (5,4) belong to (2,4) or (8,4)?
Thanks for replying.
The objective of k-means is to minimize variance.
Therefore, you should assign the point to that cluster, where variance increases the least. Even when cluster centers are at the same distance, the increase in variance by assigning the point can vary, because the cluster center will move due to this change. This is one of the ideas of the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
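A minimal sketch of this variance-based tie breaking (the cluster memberships below are hypothetical, since the question only gives the centroids; the formula n/(n+1) * ||x - c||^2 is the exact SSE increase from adding point x to a cluster of size n with centroid c):

import numpy as np

def sse_increase(point, members):
    # Exact growth of within-cluster SSE if `point` joins the cluster,
    # accounting for the centroid shifting: n / (n + 1) * ||x - c||^2.
    members = np.asarray(members, dtype=float)
    n, c = len(members), members.mean(axis=0)
    return n / (n + 1) * np.sum((np.asarray(point, dtype=float) - c) ** 2)

point = [2, 6]                       # equidistant from centroids (2,4) and (2,8)
cluster_a = [[1, 4], [3, 4]]         # hypothetical members, centroid (2, 4)
cluster_b = [[2, 8]]                 # hypothetical member, centroid (2, 8)

# Assign to whichever cluster's variance grows the least.
print(sse_increase(point, cluster_a))   # ~2.67
print(sse_increase(point, cluster_b))   # 2.0 -> assign (2,6) to (2,8)'s cluster

Both centroids sit at distance 2 from the point, yet the SSE increases differ because the clusters have different sizes, which breaks the tie.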

How can a clustering algorithm in R end up with negative silhouette values?

We know that clustering methods in R assign observations to the closest medoids. Hence, each observation is supposedly assigned to the closest cluster it can have. So I wonder how it is possible to have negative silhouette values while we supposedly assign each observation to the closest cluster; how can the silhouette formula become negative?
Two errors:
most clustering algorithms do not use medoids; only PAM does.
the silhouette does not use the distance to the medoid, but the average distance to all cluster members. If the closest cluster is very wide, that average distance can be larger than the distance to the medoid. Consider a cluster with one point in the center and all others on a sphere around it.
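A small numeric sketch (in Python with scikit-learn rather than R; the made-up 1-D data mirror the sphere example): the last point is assigned to its closest medoid, yet its silhouette is negative because the average distance to its own, very wide cluster exceeds the average distance to the tight neighboring cluster.

import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [-4.0], [4.0],   # cluster 0: medoid 0.0, wide "sphere"
              [3.4], [3.6],           # cluster 1: medoid 3.4, very tight
              [1.5]])                 # nearer to medoid 0.0 (1.5 < 1.9)
labels = np.array([0, 0, 0, 1, 1, 0])

print(silhouette_samples(X, labels))  # last value is negative (~ -0.37)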

K-means Clustering, major understanding issue

Suppose that we have a 64-dim matrix to cluster; let's say the dataset matrix is dt = 64x150.
Using the kmeans function from the vl_feat library, I cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to the manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what those numbers in the matrix assignments mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent anyway?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers are the 20 positions of the cluster's center in a 64-dim space, in case you want to visualize it, measure distances between points and clusters, etc. 'assignments' on the other hand contains the actual assignments of each 64-dim point in dt. So if assignments[7] is 15 it indicates that the 7th vector in dt belongs to the 15th cluster.
For example, here you can see clustering of lots of 2d points, say 1000, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
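For comparison, a minimal Python analogue with scikit-learn (not vl_feat; note that sklearn expects samples in rows, so the matrix is transposed, and its cluster ids are 0-based where MATLAB's are 1-based):

import numpy as np
from sklearn.cluster import KMeans

dt = np.random.rand(64, 150)              # 150 points, one 64-dim column each

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(dt.T)

print(km.cluster_centers_.shape)          # (20, 64) -- like centers, transposed
print(km.labels_.shape)                   # (150,)   -- like assignments
print(km.labels_[7])                      # cluster id of the 8th point (0-based)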
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV it is the number of the cluster that each of the input points belongs to.