I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference in scale between the metrics is distorting the result (e.g. the largest one is overpowering the other two).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( d^T S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, d^T is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact this is essentially the Mahalanobis distance, at least when the metrics are uncorrelated), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS followed by clustering might be a bit excessive). As a sidenote, any packages able to do all of this would be greatly appreciated. Thanks!
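For concreteness, here is a rough sketch of the rescale-and-combine idea (Python, with made-up stand-in matrices rather than my actual data); dividing each metric by its spread instead of fully z-scoring it keeps the diagonal at zero, and with a diagonal covariance it amounts to the Mahalanobis formula above:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Stand-in data: three pairwise distance metrics with very different scales.
    rng = np.random.default_rng(0)
    features = rng.random((60, 4))
    d1 = squareform(pdist(features * 1.0))     # metric 1, scale ~1
    d2 = squareform(pdist(features * 100.0))   # metric 2, scale ~100
    d3 = squareform(pdist(features * 0.01))    # metric 3, scale ~0.01

    def rescale(d):
        # Divide by the spread of the off-diagonal entries so that no
        # single metric dominates the combination.
        off = d[~np.eye(len(d), dtype=bool)]
        return d / off.std()

    combined = np.sqrt(rescale(d1) ** 2 + rescale(d2) ** 2 + rescale(d3) ** 2)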
Consider using k-medoids (PAM) instead of a hacked k-means: PAM can work with arbitrary distance functions, whereas k-means is designed to minimize variance, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
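For example, with SciPy (a minimal sketch using a stand-in distance matrix), average linkage needs nothing but the distances; k-medoids with a precomputed distance matrix works the same way:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist, squareform

    # Stand-in for the combined pairwise distance matrix from the question.
    rng = np.random.default_rng(0)
    dist_matrix = squareform(pdist(rng.random((60, 3))))

    condensed = squareform(dist_matrix, checks=False)  # SciPy wants the condensed form
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=4, criterion="maxclust")    # cut the dendrogram into 4 clusters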
I am dealing with k-means clustering of 6 records. I am given the centroids and K=3, and I have only 2 features. The given centroids are known. Since I have only 2 features, I treat the records as (x, y) points and have plotted them.
With the points mapped onto x and y axes and using the Euclidean distance, I found that, say, (8,6) belongs to my first cluster. However, for all the other records, the Euclidean distances to the two nearest centroids are equal. So should the point (2,6) belong to the centroid (2,4) or to (2,8)? And does (5,4) belong to (2,4) or to (8,4)?
Thanks for replying
The objective of k-means is to minimize variance.
Therefore, you should assign the point to that cluster, where variance increases the least. Even when cluster centers are at the same distance, the increase in variance by assigning the point can vary, because the cluster center will move due to this change. This is one of the ideas of the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
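A rough sketch of that tie-breaking rule (the clusters below are hypothetical, chosen only so their centroids match the question's (2, 4) and (2, 8)): adding a point x to a cluster of n points with centroid c increases the within-cluster sum of squares by n/(n+1) * ||x - c||^2, so equal centroid distances can still give different increases.

    import numpy as np

    def sse_increase(cluster_points, x):
        # Increase in within-cluster sum of squares when x joins the cluster:
        # n / (n + 1) * ||x - c||^2 for a cluster of n points with centroid c.
        pts = np.asarray(cluster_points, dtype=float)
        c = pts.mean(axis=0)
        n = len(pts)
        return n / (n + 1) * np.sum((np.asarray(x, dtype=float) - c) ** 2)

    cluster_a = [[2, 3], [2, 5], [1, 4], [3, 4]]   # hypothetical: 4 points, centroid (2, 4)
    cluster_b = [[2, 8]]                           # hypothetical: 1 point, centroid (2, 8)
    print(sse_increase(cluster_a, [2, 6]))  # 3.2
    print(sse_increase(cluster_b, [2, 6]))  # 2.0 -> assign (2, 6) here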
We know that clustering methods in R assign observations to the closest medoids; hence, each observation is supposedly in the closest cluster it can have. So I wonder how it is possible to get negative silhouette values when we supposedly assign each observation to the closest cluster; shouldn't the silhouette formula then never be negative?
Behnam.
Two errors:
most clustering algorithms do not use the medoid, only PAM does.
the silhouette does not use the distance to the medoid, but the average distance to all cluster members. If the closest cluster is very wide, the average distance can be larger than the distance to the medoid. Consider a cluster with one point in the center, and all others on a sphere around it.
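A toy version of that example (made-up numbers): the medoid sits at the origin and the other members lie on a circle of radius 5, so a point only 3 away from the medoid still has a much larger average distance to the cluster's members.

    import numpy as np

    rng = np.random.default_rng(0)
    angles = rng.uniform(0.0, 2.0 * np.pi, 30)
    members = np.vstack([[0.0, 0.0],                            # the medoid at the center
                         np.c_[5 * np.cos(angles), 5 * np.sin(angles)]])
    x = np.array([3.0, 0.0])

    print(np.linalg.norm(x - members[0]))              # 3.0: distance to the medoid
    print(np.linalg.norm(members - x, axis=1).mean())  # well above 3: the average distance the silhouette uses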
I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis: so, for instance, an epsilon of 10 m in the x-y plane and an epsilon of 0.2 m along the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but within narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
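If you are not tied to ELKI, the same trick in Python/scikit-learn looks roughly like this (the coordinates below are made up):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Made-up coordinates: wide in x-y, thin in z.
    rng = np.random.default_rng(0)
    coords = np.c_[rng.uniform(0, 1000, 500),   # x in metres
                   rng.uniform(0, 1000, 500),   # y in metres
                   rng.uniform(0, 2, 500)]      # z in metres

    # Stretching z by 50 makes a plain Euclidean eps of 10 mean
    # "10 m in the x-y plane, but only 0.2 m along z".
    scaled = coords * np.array([1.0, 1.0, 50.0])
    labels = DBSCAN(eps=10.0, min_samples=5).fit_predict(scaled)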
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to find the points within an epsilon, so all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0    0
  0  100    0
  0    0    0.04
In the pseudocode provided at the Wikipedia entry for DBSCAN, just use one of the distance metrics suggested above in the regionQuery function.
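A sketch of the same idea using SciPy and scikit-learn instead of a hand-written regionQuery (made-up coordinates; with the covariance above, a Mahalanobis eps of 1 corresponds to 10 units in x-y and 0.2 along z):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import DBSCAN

    # Made-up coordinates: wide in x-y, thin in z.
    rng = np.random.default_rng(0)
    coords = rng.random((300, 3)) * np.array([100.0, 100.0, 1.0])

    # Inverse of the diagonal covariance matrix suggested above.
    VI = np.linalg.inv(np.diag([100.0, 100.0, 0.04]))
    D = squareform(pdist(coords, metric="mahalanobis", VI=VI))
    labels = DBSCAN(eps=1.0, min_samples=5, metric="precomputed").fit_predict(D)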
Update
Note: scaling the data is equivalent to using an appropriate metric.
I have found the following formulas for Inter-Cluster and Intra-Cluster distances and I am not sure I understand how they work.
Inter-Cluster Distance
Shouldn't there be a square root in the formulas above?
Inter-Cluster and Intra-Cluster:
Why does the j index start from N+1 rather than running from 1 to N2?
Which one is the correct one? Or are there equivalences between them? Or should I just use the distance between centroids for the inter-cluster distance? That seems rather simple. And what about the intra-cluster distance?
I find the Wikipedia formulas http://en.wikipedia.org/wiki/Cluster_analysis#Internal_evaluation even harder to understand.
I need to compute these distances in order to properly group colors and create a reduced color palette, so I'm thinking the more accurate these distances are, the more accurate the grouping (hence a proper formula rather than just the distance between centroids for the inter-cluster distance). The vectors are 3-dimensional (RGB components).
A lot of algorithms don't really use "distance".
k-means, for example, minimizes variance, which is the sum-of-squares you are seeing here. Now, sum-of-squares is squared Euclidean distance, so one can argue that this algorithm also tries to minimize Euclidean distances; but the "natural" formulation of the algorithm uses sum-of-squares, not Euclidean distances. If I'm not mistaken, the same also holds for Ward clustering: you should compute it using variance, not Euclidean distance.
Note that if you minimize z^2, and z cannot be negative, then you also minimized z.
See also: https://stats.stackexchange.com/questions/95793/is-there-an-advantage-to-squaring-dissimilarities-when-using-ward-clustering
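A quick check of that point in Python/scikit-learn (made-up data): the k-means objective reported as inertia_ is exactly the sum of squared Euclidean distances to the assigned centroids, not a sum of distances.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((200, 3))

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    # Recompute the sum of squared distances to each point's centroid by hand.
    sse = sum(np.sum((X[km.labels_ == k] - c) ** 2)
              for k, c in enumerate(km.cluster_centers_))
    print(np.isclose(sse, km.inertia_))  # True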
I have a 300x5000 matrix I am working with, and I wanted to test which distance calculation parameter is the most effective. I got the following results:
'Sqeuclidean' = 17 iterations, total sum of distances = 25175.4
'Correlation' = 9 iterations, total sum of distances = 32.7
'Cityblock' = 34 iterations, total sum of distances = 105175.3
'Cosine' = 11 iterations, total sum of distances = 11.9
I am having trouble understanding why the results vary so much and how to choose the most effective distance parameter. Any advice?
EDIT:
I have 300 features with 5000 instances of each feature.
The function call looks like this:
[idx, ctrs, sumd, d] = kmeans(matrix, 25, 'Distance', 'cityblock', 'Replicates', 20)
with the distance parameter interchanged each time. The features were already normalized.
Thanks!
As slayton commented, you really need to define what 'best' means for your particular problem.
The only thing that matters is how well the distance function clusters the data; in general, clustering is highly dependent on the distance function. The two metrics that you've selected (number of iterations, sum of distances) say very little about how well the clustering works.
You need to know what you're trying to achieve with clustering, and you need some metric for how well you've achieved that goal. If there's an objective metric to determine how good your clusters are, then use that. Often the metric is fuzzier: does this look right when I visualize the data? Look at your data, and look at how each distance function clusters it. Select the distance function that seems to generate the best clusters. Do this for several subsets of your data to make sure that your intuition is correct. You should also try to understand the result that each distance function gives you.
Lastly, some problems lend themselves to a particular distance function. If your problem has spatial features, then a Euclidean (geometric) distance is often a natural choice. Other distance functions will perform better for different problems.
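As one possible objective metric (sketched in Python/scikit-learn rather than your MATLAB setup, and with made-up data): compare silhouette scores, which are bounded in [-1, 1], instead of raw sums of distances; cosine k-means is approximated here by running Euclidean k-means on length-normalised rows.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import normalize

    rng = np.random.default_rng(0)
    X = rng.random((300, 50))  # stand-in for the real feature matrix

    km_euc = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    km_cos = KMeans(n_clusters=10, n_init=10, random_state=0).fit(normalize(X))

    # Silhouette computed with the metric that matches each clustering.
    print(silhouette_score(X, km_euc.labels_, metric="euclidean"))
    print(silhouette_score(X, km_cos.labels_, metric="cosine"))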
Distance values from different distance functions, data sets, or normalizations are generally not comparable. Simple example from reality: measure distances in meters or in inches, and you get very different results. The result in meters is not better just because it is measured on a different scale. So you must not compare the variances of different results.
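A tiny illustration in Python/scikit-learn (made-up data): the same data expressed in metres and in inches yields essentially the same partition, yet the total sum of squared distances differs by a factor of about 39.37^2, so that number says nothing about which result is better.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    X_metres = rng.random((200, 2))
    X_inches = X_metres * 39.37

    km_m = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_metres)
    km_i = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_inches)

    print(km_m.inertia_, km_i.inertia_)                     # differ by roughly 39.37**2
    print(adjusted_rand_score(km_m.labels_, km_i.labels_))  # ~1.0: same partition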
Notice that k-means is meant to be used with Euclidean distance only, and may not converge with other distance functions. IMHO, L_p norms should be fine, and on TF-IDF maybe also cosine, but I do not know a proof of that.
Oh, and k-means works really badly with high-dimensional data. It is meant for low dimensionality.