How can a clustering algorithm in R end up with negative silhouette values?

We know that clustering methods in R assign observations to the closest medoids. Hence, each observation is supposedly assigned to the closest cluster it can have. So I wonder how it is possible to get negative silhouette values, while we supposedly assign each observation to the closest cluster and the silhouette formula can seemingly never become negative?
Behnam.

Two errors:
Most clustering algorithms do not use medoids at all; only PAM (k-medoids) does.
The silhouette does not use the distance to the medoid, but the average distance to all members of a cluster. A point gets a negative silhouette when its average distance to its own cluster is larger than its average distance to the nearest other cluster, and this can happen even though its own medoid is the closest medoid: if the assigned cluster is very wide, the average distance to its members can be much larger than the distance to the medoid. Consider a cluster with one point in the center and all the others on a sphere around it.
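To make this concrete, here is a minimal sketch of exactly that situation (in Python with scikit-learn rather than R's cluster package, on a made-up toy data set): every point is assigned to its nearest medoid, yet one point still gets a negative silhouette because its own cluster is wide and the neighboring cluster is tight.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: cluster 0 is wide (medoid at the origin, members on a ring),
# cluster 1 is tight (medoid at (7, 0)).
X = np.array([[0, 0], [0, 5], [0, -5], [-5, 0], [3, 0],       # wide cluster
              [7, 0], [7, 1], [7, -1], [8, 0]], dtype=float)  # tight cluster
medoids = np.array([[0, 0], [7, 0]], dtype=float)

# Assign every point to its nearest medoid, as a medoid-based method would.
labels = cdist(X, medoids).argmin(axis=1)

# (3, 0) is nearer to medoid 0 (distance 3) than to medoid 1 (distance 4), yet its
# average distance to the rest of cluster 0 (about 5.7) exceeds its average distance
# to cluster 1 (about 4.3), so its silhouette comes out negative.
for point, s in zip(X, silhouette_samples(X, labels)):
    print(point, round(s, 2))
```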

Related

Evaluation of K-means clustering (accuracy)

I created a 2-dimensional random dataset (composed of a set of points and a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by a vector of X and Y (the point coordinates) and each label represents the data point's cluster; see the example in the figure below.
I applied the k-means clustering algorithm to this point dataset. I need help with the following:
What function can I use to evaluate the accuracy of the k-means algorithm? In more detail: my aim is to score the k-means algorithm based on how many labels it identifies correctly, by comparing with the labels assigned by MATLAB. For example, I want to verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), etc.
If I understand your question correctly, you are looking for the adjusted Rand index. This will score the similarity between your MATLAB labels and your k-means labels.
Alternatively, you can create a confusion matrix to visualise the mapping between your two label sets.
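As a minimal sketch of both suggestions (in Python with scikit-learn, on made-up labels, since the original data lives in MATLAB):

```python
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical ground-truth labels (from the generated dataset) and k-means labels.
true_labels   = [0, 0, 1, 1, 2, 2, 2]
kmeans_labels = [1, 1, 0, 0, 2, 2, 0]

# The adjusted Rand index compares the two partitions while ignoring how the
# clusters are numbered: 1.0 means identical partitions, around 0.0 is what a
# random labeling would score.
print(adjusted_rand_score(true_labels, kmeans_labels))

# A confusion matrix visualises the mapping between the two label sets.
print(confusion_matrix(true_labels, kmeans_labels))
```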
I would use the squared error.
You are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
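For instance, a minimal sketch of that score (in Python, with hypothetical points and labels):

```python
import numpy as np

# Hypothetical points and k-means labels; the score is the total squared distance
# from each point to the mean of its assigned cluster (the SSE, which is the
# quantity k-means tries to minimize).
X = np.array([[7.2, 11.7], [6.9, 11.3], [1.0, 2.0], [1.5, 2.5]])
labels = np.array([0, 0, 1, 1])

sse = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
          for k in np.unique(labels))
print(sse)
```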

Which k-means cluster should I assign a record to, when the Euclidean distances between the record and both centroids are the same?

I am dealing with k-means clustering of 6 records. I am given the centroids and K=3. I have only 2 features, and the given centroids are known, so I treat the records as (x, y) points and have plotted them.
Having mapped the points on the x and y axes and computed the Euclidean distances, I found that, say, (8,6) belongs to my first cluster. However, for all the other records, the Euclidean distances to their two nearest centroids are the same. So should the point (2,6) belong to the centroid (2,4) or (2,8)? And does (5,4) belong to (2,4) or (8,4)?
Thanks for replying
The objective of k-means is to minimize variance.
Therefore, you should assign the point to the cluster whose variance increases the least. Even when the cluster centers are at the same distance, the increase in variance from assigning the point can differ, because the cluster center will move due to this change. This is one of the ideas of the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
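A minimal sketch of such a tie-break (in Python; the cluster sizes and the helper name best_cluster are hypothetical). It uses the fact that adding a point x to a cluster with n members and mean mu increases that cluster's sum of squares by n/(n+1) * ||x - mu||^2, because the mean shifts toward x:

```python
import numpy as np

def best_cluster(x, centers, sizes):
    """Assign x to the cluster whose within-cluster sum of squares grows least.

    Adding x to a cluster with mean mu and n members increases its SSE by
    n/(n+1) * ||x - mu||^2, so smaller clusters are "cheaper" at equal distance.
    """
    x = np.asarray(x, float)
    centers = np.asarray(centers, float)
    sizes = np.asarray(sizes, float)
    increase = sizes / (sizes + 1) * ((centers - x) ** 2).sum(axis=1)
    return int(increase.argmin())

# Hypothetical example: (2, 6) is equidistant from the centroids (2, 4) and (2, 8);
# if the (2, 8) cluster currently has fewer members, adding the point there
# increases the total variance less, so it wins the tie-break.
print(best_cluster([2, 6], centers=[[2, 4], [2, 8], [8, 4]], sizes=[3, 1, 2]))
```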

Inter-Cluster and Intra-Cluster distances

I have found the following formulas for inter-cluster and intra-cluster distances, and I am not sure I understand how they work.
Inter-Cluster Distance
Shouldn't there be a square root in the formulas above?
Inter-Cluster and Intra-Cluster:
Why does the j index start from N+1, and not run from 1 to N2?
Which one is the correct one? Or are they equivalent in some way? Or should I just use the distance between centroids for the inter-cluster distance? That seems rather simple. And what about the intra-cluster distance?
I find the Wikipedia formulas (http://en.wikipedia.org/wiki/Cluster_analysis#Internal_evaluation) even harder to understand.
I need to compute these distances in order to group colors properly and create a reduced color palette, so I'm thinking the more accurate these distances are, the more accurate the grouping (hence the formula rather than the distance between centroids for the inter-cluster distance). The vectors are 3-dimensional (the RGB components).
A lot of algorithms don't really use "distance".
k-means, for example, minimizes variance, which is the sum of squares you are seeing here. Now, the sum of squares is the squared Euclidean distance, so one can argue that this algorithm also tries to minimize Euclidean distances; but the "natural" formulation of the algorithm uses the sum of squares, not Euclidean distances. If I'm not mistaken, the same also holds for Ward clustering: you should compute it using variance, not Euclidean distance.
Note that if you minimize z^2, and z cannot be negative, then you have also minimized z.
See also: https://stats.stackexchange.com/questions/95793/is-there-an-advantage-to-squaring-dissimilarities-when-using-ward-clustering
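As a quick illustration of "k-means minimizes the sum of squares" (a minimal sketch in Python with scikit-learn, on made-up RGB-like data), the algorithm's reported objective is exactly the within-cluster sum of squared Euclidean distances to the cluster means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical RGB-like data: 200 random 3-D points in [0, 255].
rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(200, 3))

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# k-means' objective (inertia_) is the within-cluster sum of squares, i.e. the
# squared Euclidean distance of each point to its cluster mean, not the
# (unsquared) Euclidean distance.
wcss = sum(((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2).sum()
           for k in range(8))
print(km.inertia_, wcss)  # the two numbers agree up to floating-point rounding
```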

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( dt S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, dt is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact, this seems to be essentially the Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and I have not ruled them out (though MDS followed by clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means: it can work with arbitrary distance functions, whereas k-means is designed to minimize variance, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
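A minimal sketch of that last suggestion (in Python with SciPy, on hypothetical distance matrices; rescaling each metric by its own spread is one simple way to keep the largest-valued metric from dominating, not necessarily the best one):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical input: three pairwise-distance matrices (one per metric) for the
# same n objects, with very different magnitudes, as described in the question.
rng = np.random.default_rng(1)
n = 30
metrics = [squareform(rng.uniform(0, scale, size=n * (n - 1) // 2))
           for scale in (1.0, 50.0, 1000.0)]

# Rescale each metric by its own standard deviation before combining, so no
# single metric overpowers the others, then sum them into one dissimilarity
# (in condensed, upper-triangle form).
condensed = sum(squareform(D) / squareform(D).std() for D in metrics)

# Average-linkage hierarchical clustering needs only this distance matrix.
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```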

K-means clustering: find the k farthest points in Java

I'm trying to implement k-means clustering.
I have a set of points with coordinates (x, y), and I am using the Euclidean distance. I have computed the distances between all pairs of points in a matrix:
dist[i][j] - distance between points i and j
Say I choose point 3 because dist[1][3] is the largest distance from point 1. When I then search for the point farthest from point 3, I may get some point j for which dist[3][j] is large, but dist[1][j] may be small (point j is far from point 3 but near point 1).
So how do I choose the k farthest points using the distance matrix?
Note that the k farthest points do not necessarily yield the best result: they clearly are not the best cluster-center estimates.
Plus, since the k-means heuristic may get stuck in a local minimum, you will want a randomized initialization that allows you to restart the process multiple times and get potentially different results.
You may want to look at k-means++, which is a well-known, good heuristic for k-means initialization.
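As a rough sketch of that initialization working directly from a distance matrix (written in Python rather than Java for brevity; the function name kmeanspp_seeds and the toy points are made up):

```python
import numpy as np

def kmeanspp_seeds(dist, k, seed=None):
    """k-means++-style seeding from a precomputed distance matrix dist[i][j]:
    pick the first seed uniformly at random, then each further seed with
    probability proportional to its squared distance to the nearest seed chosen
    so far. Because it is randomized, restarts can give different results."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    seeds = [int(rng.integers(n))]
    d2 = dist[seeds[0]] ** 2          # squared distance to the nearest seed
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        seeds.append(int(rng.choice(n, p=probs)))
        d2 = np.minimum(d2, dist[seeds[-1]] ** 2)
    return seeds

# Tiny hypothetical example: 5 points on a line, k = 3 seeds.
pts = np.array([0.0, 1.0, 2.0, 9.0, 10.0])
dist = np.abs(pts[:, None] - pts[None, :])
print(kmeanspp_seeds(dist, k=3, seed=42))
```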