Representative instance of a cluster - hierarchical clustering - matlab

I'm using the agglomerative hierarchical clustering method to cluster a set of data, where the dataset consists of trajectories.
I use a custom distance function to estimate the distance between the trajectories.
The MATLAB code is as follows: Z = linkage(ID,'single','#my_distfun');
After clustering the data, I would like to find the representative instance (or trajectory).
How can I find the representative instance (trajectory) of each cluster?

Hierarchical clustering does not have a concept of representative instances.
You will have to decide upon a definition yourself.
For example, you could use the element with the smallest average distance to all others. Or the one with the smallest average squared distance, or ... many other options.
"Representative" is a subjective term.


Density Based Clustering with Representatives

I'm looking for a method to perform density-based clustering. The resulting clusters should have a representative, unlike DBSCAN.
Mean-Shift seems to fit those needs, but it doesn't scale well enough for my data. I have looked into some subspace clustering algorithms and only found CLIQUE using representatives, but this part is not implemented in Elki.
As I noted in the comments on the previous iteration of your question,
https://stackoverflow.com/questions/34720959/dbscan-java-library-with-corepoints
Density-based clustering does not assume there is a center or representative.
Consider the following example image from Wikipedia user Chire (CC BY-SA 3.0):
Which object should be the representative of the red cluster?
Density-based clustering is about finding "arbitrarily shaped" clusters. These do not have a meaningful single representative object. They are not meant to "compress" your data - this is not a vector quantization method, but structure discovery. But it is the nature of such complex structure that it cannot be reduced to a single representative. The proper representation of such a cluster is the set of all points in the cluster. For geometric understanding in 2D, you can also compute convex hulls, for example, to get an area as in that picture.
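If you want that kind of 2D area view, a tiny MATLAB sketch (x and y are assumed names for the coordinates of one cluster's members):

    idx = convhull(x, y);                    % indices of the points on the convex hull
    plot(x, y, '.', x(idx), y(idx), '-');    % cluster points and their hull outline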
Choosing representative objects is a different task. This is not needed for discovering this kind of structure, and thus these algorithms do not compute representative objects - it would waste CPU.
You could choose the object with the highest density as representative of the cluster.
It is a fairly easy modification to DBSCAN to store the neighbor count of every object.
But as Anony-Mousse mentioned, the object may nevertheless be a rather bad choice. Density-based clustering is not designed to yield representative objects.
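If you still want that heuristic, a rough MATLAB sketch (X, epsilon and minpts are assumed names; it uses the built-in dbscan function rather than modifying the algorithm itself):

    idx = dbscan(X, epsilon, minpts);       % cluster labels; -1 marks noise points
    nbrs = rangesearch(X, X, epsilon);      % epsilon-neighborhood of every point
    counts = cellfun(@numel, nbrs);         % neighbor count as a density proxy
    labels = unique(idx(idx > 0));
    rep = zeros(numel(labels), 1);
    for i = 1:numel(labels)
        members = find(idx == labels(i));
        [~, best] = max(counts(members));   % densest point of the cluster
        rep(i) = members(best);
    end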
You could try AffinityPropagation, but it will also not scale very well.

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assume I have a dataset with 2 features (feat_A and feat_B), and that I use a clustering algorithm to divide the data into an optimal number of clusters... say 3.
My goal is to assign to each input point [feat_Ai, feat_Bi] a probability (or something similar) that the point belongs to cluster 1, 2, or 3.
a. First approach with clustering:
I cluster the data into the 3 clusters and assign to each point a probability of belonging to a cluster based on its distance from that cluster's center (a rough sketch of both approaches follows below).
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
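A rough MATLAB sketch of both approaches (X is the n-by-2 data matrix [feat_A feat_B] used for fitting and Xnew holds the new observations; both names are assumptions):

    k = 3;

    % a. k-means: turn distances to the centroids into pseudo-probabilities
    [~, C] = kmeans(X, k);                 % cluster the subsample, C holds the 3 centroids
    Dnew = pdist2(Xnew, C);                % distances of new points to each centroid
    W = 1 ./ max(Dnew, eps);               % inverse-distance weights (eps avoids division by zero)
    P_kmeans = W ./ sum(W, 2);             % each row sums to 1

    % b. Gaussian mixture: posterior probabilities from the fitted model
    gm = fitgmdist(X, k);                  % fit a 3-component mixture by EM
    P_gmm = posterior(gm, Xnew);           % n-by-3 posterior responsibilities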
In my problem I find the cluster centers (or fit the model, if approach b. is used) on a subsample of the data. Then I have to assign probabilities to a lot of other data, and I would like to know which approach still gives meaningful assignments on new data.
I would go for a clustering method, for example k-means, because:
If the new data come from a distribution different from the one used to create the mixture model, the assignments could be incorrect.
With new data, the posterior probabilities change.
The clustering method minimizes the variance of the clusters in order to find a kind of optimal separation border, whereas the mixture model takes the variance of the data into account to build the model (so I am not sure that the resulting clusters are separated in an optimal way).
More info about the data:
The features should not be assumed to be dependent.
Feat_A represents the duration of a physical activity and Feat_B the step count. In principle we could say that a longer activity duration means a higher step count, but it is not always true.
Please help me think this through, and if you have any other points please let me know.

Hierarchical agglomerative clustering

Can we use Hierarchical agglomerative clustering for clustering data in this format?
"beirut,proff,email1"
"beirut,proff,email2"
"swiss,aproff,email1"
"france,instrc,email2"
"swiss,instrc,email2"
"beirut,proff,email1"
"swiss,instrc,email2"
"france,aproff,email2"
If not, what is a suitable clustering algorithm for clustering data with string values?
Thank you for your help!
Any type of clustering requires a distance metric. If all you're willing to do with your strings is treat them as equal to each other or not, the best you can really do is the field-wise Hamming distance: the distance between "abc,def,ghi" and "uvw,xyz,ghi" is 2, and the distance between "abc,def,ghi" and "abw,dez,ghi" is also 2. If you want to cluster similar strings within a particular field -- say clustering "Slovakia" and "Slovenia" because of the name similarity, or "Poland" and "Ukraine" because they border each other -- you'll need more complex metrics. Given a distance metric, hierarchical agglomerative clustering should work fine.
All this assumes, however, that clustering is what you actually want to do. Your dataset seems like sort of an odd use-case for clustering.
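If clustering is what you want, a small MATLAB sketch of the field-wise Hamming idea (the variable names are assumptions):

    records = ["beirut,proff,email1"; "beirut,proff,email2"; "swiss,aproff,email1"; ...
               "france,instrc,email2"; "swiss,instrc,email2"; "beirut,proff,email1"; ...
               "swiss,instrc,email2"; "france,aproff,email2"];
    fields = split(records, ",");               % n-by-3 string array of field values
    n = size(fields, 1);
    D = zeros(n);                               % field-wise Hamming distance matrix
    for i = 1:n
        for j = i+1:n
            D(i, j) = sum(fields(i, :) ~= fields(j, :));
            D(j, i) = D(i, j);
        end
    end
    Z = linkage(squareform(D), 'average');      % agglomerative clustering on those distances
    T = cluster(Z, 'maxclust', 3);              % e.g. cut the dendrogram into 3 clusters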
Hierarchical clustering is a rather flexible clustering algorithm. Except for some linkages (Ward?), it does not place any requirements on the "distance": it could just as well be a similarity, negative values will usually work, you don't need the triangle inequality, etc.
Other algorithms - such as k-means - are much more limited. K-means minimizes variance, so it can only handle (squared) Euclidean distance; and it needs to be able to compute means, so the data must live in a continuous, fixed-dimensionality vector space; and sparsity may be an issue.
One algorithm that is probably even more flexible is Generalized DBSCAN. Essentially, it only needs a binary decision "x is a neighbor of y" (e.g. distance less than epsilon) and a predicate for deciding whether a point is a "core point" (e.g. density). You can come up with arbitrarily complex predicates of this kind, which may no longer correspond to a single "distance" at all.
Either way: if you can measure the similarity of these records, hierarchical clustering should work. The question is whether you can get enough similarity out of that data, and not just 3 bits: "has the same email", "has the same name", "has the same location" -- 3 bits will not produce a very interesting hierarchy.

How do I choose k when using k-means clustering with Silhouette function?

I've been studying k-means clustering, and one big thing that is not clear to me is what the silhouette function really tells me.
I know it is used to determine what an appropriate k should be, but I can't understand what the mean of the silhouette function really says.
I read somewhere that if the mean silhouette is less than 0.5, your clustering is not valid.
Thanks in advance for your answers.
From the definition of silhouette:
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
The silhouette value for the ith point, Si, is defined as
Si = (bi - ai) / max(ai, bi)
where ai is the average distance from the ith point to the other points in the same cluster as i, and bi is the minimum average distance from the ith point to points in a different cluster, minimized over clusters.
This method just compares the within-cluster similarity of each point to its similarity to the closest other cluster. If a point's average distance to the other members of its own cluster is higher than its average distance to the members of some other cluster, its silhouette value is negative and the clustering is not successful for that point. On the other hand, silhouette values close to 1 indicate a successful clustering operation. 0.5 is not an exact threshold for clustering quality.
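As a minimal MATLAB sketch of using this to pick k (X is an assumed name for the data matrix): run k-means for a range of k and keep the k with the highest mean silhouette value.

    ks = 2:8;
    meanSil = zeros(size(ks));
    for i = 1:numel(ks)
        idx = kmeans(X, ks(i), 'Replicates', 5);   % restarts reduce the risk of bad local minima
        s = silhouette(X, idx);                    % per-point silhouette values in [-1, 1]
        meanSil(i) = mean(s);
    end
    [~, best] = max(meanSil);
    k = ks(best);

evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:8) does essentially the same thing in a single call.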
#fatihk gave a good citation;
additionally, you may think about the silhouette value as a degree of how clusters overlap with each other, i.e. -1: overlap perfectly, +1: clusters are perfectly separable;
BUT low silhouette values for a particular algorithm do NOT mean that there are no clusters; rather, it means that the algorithm used cannot separate the clusters, and you may consider tuning your algorithm or using a different one (think about k-means on concentric circles, vs DBSCAN).
There is also an explicit formula associated with the elbow method for automatically determining the number of clusters: it quantifies the strength of the elbow(s) detected when using the elbow method (the "Enhanced Elbow rule").

Hierarchical Cluster Analysis in Cluster 3.0

I'm new to this site as well as new to cluster analysis, so I apologize if I violate conventions.
I've been using Cluster 3.0 to perform hierarchical cluster analysis with Euclidean distance and average linkage. Cluster 3.0 outputs a .gtr file in which each node joins two items together with their similarity score. I've noticed that the first line in the .gtr file always links one gene with another gene, followed by the similarity score. But how do I reproduce this similarity score?
In my data set, I have 8 genes and create a distance matrix where d_{ij} contains the Euclidean distance between gene i and gene j. Then I normalize the matrix by dividing each element by the maximum value in the matrix. To get the similarity matrix, I subtract all the elements from 1. However, my calculation does not use the linkage type, and the result differs from the similarity score in the output.
I am mainly confused about how the linkage affects the similarity of the first node (the joining of the two closest genes) and how to compute the similarity score.
Thank you!
The algorithm compares clusters using some linkage method, not data points. However, in the first iteration of the algorithm each data point forms its own cluster; this means that your linkage method is actually reduced to the metric you use to measure the distance between data points (in your case, Euclidean distance). For subsequent iterations, the distance between clusters will be measured according to your linkage method, which in your case is average linkage. For two clusters A and B, this is calculated as follows:
d(A, B) = (1 / (|A| * |B|)) * sum over all a in A and b in B of d(a, b)
where d(a,b) is the Euclidean distance between two data points. Convince yourself that when A and B each contain just one data point (as in the first iteration), this equation reduces to d(a,b). I hope this makes things a bit clearer. If not, please provide more details of what exactly you want to do.
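As a quick sketch of the average-link computation in MATLAB (X is an assumed name for the 8-by-p gene matrix, and A and B are assumed index vectors for the two clusters being compared):

    D = squareform(pdist(X));               % pairwise Euclidean distances between genes
    dAB = mean(reshape(D(A, B), [], 1));    % average-linkage distance: mean over all cross-pairs

For the very first merge, A and B each contain a single index, so dAB is simply the Euclidean distance between those two genes.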