For MATLAB clustering, can centroid linkage work for distances other than Euclidean? - matlab

I am doing hierarchical clustering. If I use centroid linkage with a distance function other than Euclidean, for example Minkowski distance with an exponent of 3 instead of 2, will that necessarily not work? What would MATLAB do if it were given centroid linkage and a Minkowski distance with an exponent of 3, so that it was not simply Euclidean distance?

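You can use the linkage function to define other distances. Examples are included in the link. Note, though, that the MATLAB documentation for linkage describes the 'centroid', 'median', and 'ward' methods as appropriate for Euclidean distances only.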
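To see why centroid linkage is tied to Euclidean distance, here is a small sketch in Python (not MATLAB, and not a claim about MATLAB's internals). It compares the standard Lance-Williams update for centroid linkage against the distance measured directly to the merged centroid. The three points and the helper names are made up for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance with exponent p (p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Lance-Williams update for centroid linkage, applied to squared distances."""
    n = n_i + n_j
    sq = ((n_i / n) * d_ki ** 2
          + (n_j / n) * d_kj ** 2
          - (n_i * n_j / n ** 2) * d_ij ** 2)
    return sq ** 0.5


def compare(p):
    """Return (distance via update formula, distance measured to the true centroid)."""
    i, j, k = (0.0, 0.0), (2.0, 0.0), (1.0, 3.0)  # three singleton clusters
    merged_centroid = tuple((a + b) / 2 for a, b in zip(i, j))
    via_update = centroid_update(
        minkowski(k, i, p), minkowski(k, j, p), minkowski(i, j, p), 1, 1
    )
    direct = minkowski(k, merged_centroid, p)
    return via_update, direct


euclid_update, euclid_direct = compare(2)
mink_update, mink_direct = compare(3)
print("p=2:", euclid_update, euclid_direct)
print("p=3:", mink_update, mink_direct)
```

For p=2 the two numbers agree (both are 3, up to rounding). For p=3 the update formula gives about 2.87 while the direct distance to the centroid is 3, so the "centroid distance" no longer means what its name suggests.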

Related

Which k-means cluster should I assign a record to, when the euclidean distance between the record and both centroids are the same?

I am applying k-means clustering to 6 records. I am given the centroids and K=3. I have only 2 features, so I am treating the records as (x, y) points and have plotted them.
With the points mapped on the x and y axes, I computed the Euclidean distances and found that, say, (8,6) belongs to my first cluster. However, for all other records, the Euclidean distances between the record and its 2 nearest centroids are the same. So should the point (2,6) belong to the centroid (2,4) or (2,8)? Or does (5,4) belong to (2,4) or (8,4)?
Thanks for replying
The objective of k-means is to minimize variance.
Therefore, you should assign the point to the cluster where variance increases the least. Even when the cluster centers are at the same distance, the increase in variance from assigning the point can differ, because the cluster center moves as a result of the assignment. This is one of the ideas behind the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
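As a rough sketch of that tie-breaking rule: adding a point x to a cluster that currently has n members and mean c increases the within-cluster sum of squares by n/(n+1) * ||x - c||^2. The cluster sizes below are hypothetical (the question does not give them), and the function names are my own.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse_increase(point, centroid, size):
    """Increase in within-cluster sum of squares if `point` joins a cluster
    that currently has `size` members and mean `centroid`."""
    return size / (size + 1) * sq_dist(point, centroid)


point = (2.0, 6.0)
# Both centroids are at distance 2 from the point; the sizes are assumed.
candidates = {
    "A": {"centroid": (2.0, 4.0), "size": 3},
    "B": {"centroid": (2.0, 8.0), "size": 1},
}
costs = {
    name: sse_increase(point, c["centroid"], c["size"])
    for name, c in candidates.items()
}
best = min(costs, key=costs.get)
print(costs, best)
```

With these assumed sizes the cost is 3.0 for A and 2.0 for B, so the tie is broken in favor of the smaller cluster B.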

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. So for instance an epsilon of 10 m in the x-y plane, and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces, but in narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to compute points within an epsilon. So all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0    0
  0  100    0
  0    0    0.04
In the pseudo code provided at the Wikipedia entry for DBSCAN just use one of the distance metrics suggested above in the regionQuery function.
Update
Note: scaling the data is equivalent to using an appropriate metric.
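A minimal sketch of the scaling idea in plain Python, assuming the epsilons from the question (10 m horizontally, 0.2 m vertically). The function names are illustrative only; in practice you would scale z once and then hand the data to whatever DBSCAN implementation you use, with epsilon = 10.

```python
import math


def scale_point(p, z_factor):
    """Stretch the z axis so that one epsilon fits all three axes."""
    x, y, z = p
    return (x, y, z * z_factor)


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def is_neighbor(a, b, eps_xy=10.0, eps_z=0.2):
    """Epsilon-neighborhood test after scaling z by eps_xy / eps_z (50 here)."""
    z_factor = eps_xy / eps_z
    return euclidean(scale_point(a, z_factor), scale_point(b, z_factor)) <= eps_xy


origin = (0.0, 0.0, 0.0)
print(is_neighbor(origin, (9.0, 0.0, 0.0)))   # 9 m apart horizontally
print(is_neighbor(origin, (0.0, 0.0, 0.15)))  # 0.15 m apart vertically
print(is_neighbor(origin, (0.0, 0.0, 0.3)))   # 0.3 m apart vertically
```

The resulting neighborhood is a flat ellipsoid: 10 m wide in x and y, 0.2 m in z, which is also what the diagonal covariance matrix above describes.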

Spectral clustering distance/similarity

All papers about spectral clustering use a similarity matrix as the input to the spectral clustering algorithm.
Is it also possible to use a pairwise distance matrix? I haven't seen any version of spectral clustering code that uses pairwise distances.
I am implementing spectral clustering in MATLAB, which has the function pdist; the output of this function is a pairwise distance matrix.
A similarity (or affinity) matrix gives an idea of the closeness of the data points with respect to each other. Distance, on the other hand, measures their dissimilarity. The easiest and most frequently used way of turning pairwise distances into a similarity matrix is to apply a Gaussian kernel to get the affinity measure.
For points a and b, let D be their pairwise distance (one entry of the pdist output). Then the similarity for your matrix can be obtained as sim_ab = exp(-D^2/f), where f is a scaling factor (often written as 2*sigma^2).
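A small sketch of that conversion in Python (the question is about MATLAB, but the arithmetic is the same). The distance matrix and the choice sigma = 2 are made up; here 2*sigma^2 plays the role of the scaling factor, and picking it well is up to you.

```python
import math


def gaussian_affinity(dist, sigma):
    """Turn a square pairwise distance matrix into a Gaussian-kernel affinity matrix."""
    n = len(dist)
    return [
        [math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric distance matrix for three points.
D = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
W = gaussian_affinity(D, sigma=2.0)
for row in W:
    print(row)
```

Zero distance maps to similarity 1, and similarity decays towards 0 as distance grows, so W can be fed to a spectral clustering routine that expects an affinity matrix.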

Inter-Cluster and Intra-Cluster distances

I have found the following formulas for Inter-Cluster and Intra-Cluster distances and I am not sure I understand how they work.
Inter-Cluster Distance
Shouldn't there be a square root in formulas above?
Inter-Cluster and Intra-Cluster:
Why is there the j index starting from N+1? And not from 1 to N2?
Which one is the correct one? Or are there any equivalencies? Or should I go for the distance between centroids for the inter cluster distance? Seems rather simple. What about the intra cluster distance?
I find the wikipedia formulas http://en.wikipedia.org/wiki/Cluster_analysis#Internal_evaluation even harder to understand.
I need to compute these distances in order to properly group colors and create a reduced color palette, so I'm thinking that the more accurate these distances are, the more accurate the grouping will be (using the formula instead of the distance between centroids for the inter-cluster distance). The vectors are 3-dimensional (RGB components).
A lot of algorithms don't really use "distance".
k-means, for example, minimizes variance, which is the sum-of-squares you are seeing here. Sum-of-squares is squared Euclidean distance, so one can argue that the algorithm also tries to minimize Euclidean distances; but the "natural" formulation of the algorithm uses sum-of-squares, not Euclidean distances. If I'm not mistaken, the same holds for Ward clustering: you should compute it using variance, not Euclidean distance.
Note that if you minimize z^2, and z cannot be negative, then you also minimized z.
See also: https://stats.stackexchange.com/questions/95793/is-there-an-advantage-to-squaring-dissimilarities-when-using-ward-clustering
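For RGB vectors, a minimal sketch of the two simplest quantities: the within-cluster sum of squares (what k-means minimizes, with no square root) and the Euclidean distance between centroids. The color values and function names are made up, and this is only one of several possible definitions of intra- and inter-cluster distance, not necessarily the formulas you found.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_ss(points):
    """Intra-cluster sum of squared distances to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def centroid_distance(cluster_a, cluster_b):
    """Inter-cluster distance measured between the two centroids."""
    return sq_dist(centroid(cluster_a), centroid(cluster_b)) ** 0.5


# Hypothetical RGB clusters.
reds = [(250, 10, 10), (240, 20, 20), (245, 15, 15)]
blues = [(10, 10, 250), (20, 20, 240)]
print(within_ss(reds), within_ss(blues), centroid_distance(reds, blues))
```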

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( dt S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, dt is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing out the variance) and then simply staying with the euclidean distance (in fact it would seem that this essentially is Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS then clustering might be a bit excessive). As a sidenote, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM), which can work with arbitrary distance functions, instead of a hacked k-means; k-means is designed to minimize variance, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
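To illustrate clustering from a distance matrix alone, here is a rough sketch of a simple alternating k-medoids loop in Python. Note that this is the simpler "assign, then re-pick medoids" variant, not the full PAM swap procedure, and the data and initial medoids are invented for the example.

```python
def assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternating k-medoids on a precomputed square distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster went empty
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][o] for o in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical objects on a line; only their pairwise distances are used.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, labels)
```

The centers are always actual objects, so no coordinates or means are needed; any combined or normalized distance matrix can be plugged in as `dist`.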