Using precision recall metric on a hierarchy of recovered clusters - cluster-analysis

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variety of linkage methods and other tweaks to the algorithm that we want to try out. We will run the algorithm on popular GitHub repositories and compare the resulting clusters with the original namespaces. Our work will closely follow this paper. In the paper, the authors mention using the “precision recall metric” to measure the accuracy of the clustering algorithm. However, looking more closely at the metric and its origin, it seems to be intended for flat (non-hierarchical) clusters.
Question:
Is there a way to use the precision recall metric to measure the accuracy of a hierarchy of recovered clusters? If not, what other options exist?

Related

Density Based Clustering with Representatives

I'm looking for a method to perform density-based clustering. Unlike with DBSCAN, the resulting clusters should have a representative.
Mean-Shift seems to fit those needs but doesn't scale well enough for my data. I have looked into some subspace clustering algorithms and only found CLIQUE using representatives, but this part is not implemented in ELKI.
As I noted in the comments on the previous iteration of your question (https://stackoverflow.com/questions/34720959/dbscan-java-library-with-corepoints), density-based clustering does not assume there is a center or representative.
Consider the following example image from Wikipedia user Chire (CC BY-SA 3.0):
Which object should be the representative of the red cluster?
Density-based clustering is about finding "arbitrarily shaped" clusters. These do not have a meaningful single representative object. They are not meant to "compress" your data - this is not a vector quantization method, but structure discovery. But it is the nature of such complex structure that it cannot be reduced to a single representative. The proper representation of such a cluster is the set of all points in the cluster. For geometric understanding in 2D, you can also compute convex hulls, for example, to get an area as in that picture.
Choosing representative objects is a different task. This is not needed for discovering this kind of structure, and thus these algorithms do not compute representative objects - it would waste CPU.
You could choose the object with the highest density as representative of the cluster.
It is a fairly easy modification to DBSCAN to store the neighbor count of every object.
But as Anony-Mousse mentioned, the object may nevertheless be a rather bad choice. Density-based clustering is not designed to yield representative objects.
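The "highest density" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: it is not part of DBSCAN or ELKI, the function names `neighbor_counts` and `densest_representatives` are made up here, noise is assumed to be labelled -1, and the brute-force neighbor search is quadratic, so it would need an index structure to scale.

```python
import math

def neighbor_counts(points, eps):
    """Count, for every point, how many points (itself included) lie within eps."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]

def densest_representatives(points, labels, eps, noise_label=-1):
    """Map each cluster label to the index of its point with the most eps-neighbors.

    Ties keep the first such point; noise points are skipped.
    """
    counts = neighbor_counts(points, eps)
    best = {}
    for idx, label in enumerate(labels):
        if label == noise_label:
            continue
        if label not in best or counts[idx] > counts[best[label]]:
            best[label] = idx
    return best

# Hypothetical toy data: two small clusters and one noise point.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
labels = [0, 0, 0, 1, 1, -1]
representatives = densest_representatives(points, labels, eps=0.15)
```

As noted above, the chosen point can still be a poor summary of an arbitrarily shaped cluster; the sketch only shows that picking it is cheap once neighbor counts are stored.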
You could try AffinityPropagation, but it will also not scale very well.

Cluster assignment remapping

I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping off the labels and using the data to benchmark a few clustering algorithms, and then I am planning to use external validation methods. I will run each algorithm with different initial configurations, say 50 times, and then take the mean value. Across the 50 runs, the algorithm labels the data points of a single cluster with different numbers. Because the cluster labels can change in each run, and because each run might produce slightly different cluster assignments, how can I remap the clusters to one uniform numbering?
My primary idea is to remap each cluster to the class label with which it shares the most points, but this can produce incorrect remappings: when the classes have roughly equal numbers of points, it will not work.
Another idea is to keep the labels while clustering but make the clustering algorithm ignore them. This way all the clustered data would carry the label tags. This is doable, but I already have benchmarked cluster assignment data to process, so I am trying to avoid modifying my implementation of the cluster analysis algorithms to include the label tag in the vectors and then ignore it, and re-running the benchmarks (which would take quite some time and CPU).
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
In the domain I am studying (metaheuristic clustering algorithms), I could not find a paper comparing these indexes. The one paper that does compare them seems to report incorrect values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty; for details, see Wikipedia.
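To illustrate why no remapping is needed, here is a minimal sketch of one family of such measures, the pair-counting ones (Rand index, pairwise precision and recall). They only ask whether two points are grouped together, so the label numbers themselves never matter. The function names and the toy labels are assumptions made for this example, and the inputs are assumed to contain at least two points.

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Classify every pair of points by whether truth and pred group it together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1      # together in both
        elif same_pred:
            fp += 1      # together only in the clustering
        elif same_truth:
            fn += 1      # together only in the reference classes
        else:
            tn += 1      # separated in both
    return tp, fp, fn, tn

def rand_index(truth, pred):
    """Fraction of pairs on which truth and pred agree."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    return (tp + tn) / (tp + fp + fn + tn)

def pairwise_precision_recall(truth, pred):
    """Pair-counting precision and recall (assumes at least one pair of each kind)."""
    tp, fp, fn, _ = pair_counts(truth, pred)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical example: the same partition with different label numbers per run.
truth = [0, 0, 0, 1, 1, 1]
run_a = [0, 0, 1, 1, 1, 1]
run_b = [7, 7, 3, 3, 3, 3]
scores = [rand_index(truth, run) for run in (run_a, run_b)]
mean_score = sum(scores) / len(scores)
```

Because the score is computed per run against the class labels, averaging over the 50 runs works directly on the cluster assignments already stored, even when the number of clusters differs from the number of classes.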

ELKI implementation of OPTICS clustering algorithm detects only one cluster

I'm having an issue using the OPTICS implementation in the ELKI environment. I used the same data with the DBSCAN implementation and it worked like a charm. I'm probably missing something with the parameters, but I can't figure it out; everything seems to be right.
The data is a simple 300×2 matrix consisting of 3 clusters with 100 points each.
DBSCAN result (MinPts = 10, Eps = 1):
[Figure: clustering result of DBSCAN]
OPTICS result (MinPts = 10):
[Figure: clustering result of OPTICS]
You apparently already found the solution yourself, but here is the long story:
The OPTICS class in ELKI only computes the cluster order / reachability diagram.
In order to extract clusters, you have different choices, one of which (the one from the original OPTICS publication) is available in ELKI.
So in order to extract clusters in ELKI, you need to use the OPTICSXi algorithm, which will in turn use either OPTICS or the index based DeLiClu to compute the cluster order.
The reason why this is split into two parts in ELKI probably is so that you can on one hand implement another logic for extracting the clusters, and on the other hand implement different methods like DeLiClu for computing the cluster order. That would align well with the modular architecture of ELKI.
IIRC there is at least one more method (apparently not yet in ELKI) that extracts clusters by looking for local maxima, then extending them horizontally until they hit the end of the valley. And there was a different one that used "inflexion points" of the plot.
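To make the split between "compute the cluster order" and "extract clusters" concrete, here is a deliberately simplified Python sketch of an extraction step. It is not the Xi method and not ELKI code: it just cuts the reachability plot at a fixed height, and the function name `cut_reachability`, the `threshold` and `min_size` parameters and the sample values are all assumptions for illustration.

```python
def cut_reachability(reachability, threshold, min_size=2, noise=-1):
    """Label points (given in cluster order) by a horizontal cut of the plot.

    A reachability value above the threshold ends the current valley and
    starts a new one; valleys shorter than min_size are treated as noise.
    """
    segments, current = [], []
    for pos, value in enumerate(reachability):
        if value > threshold and current:
            segments.append(current)
            current = []
        current.append(pos)
    if current:
        segments.append(current)

    labels = [noise] * len(reachability)
    next_label = 0
    for segment in segments:
        if len(segment) >= min_size:
            for pos in segment:
                labels[pos] = next_label
            next_label += 1
    return labels

# Hypothetical reachability values in cluster order (first point is undefined).
reach = [float("inf"), 0.2, 0.3, 0.2, 5.0, 0.1, 0.2, 6.0, 7.0, 0.3]
labels = cut_reachability(reach, threshold=1.0, min_size=3)
```

Without any extraction step of this kind, the cluster order is just one long sequence, which would be consistent with seeing a single cluster in the output.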
@AnonyMousse pretty much put it right. I just can't upvote or comment yet.
We hope to have some students contribute the other cluster extraction methods as small student projects over time. They are not essential for our research, but they are good tasks for students that want to learn about ELKI to get started.
ELKI is a fast-moving project, and it lives on community contributions. We would be happy to see you contribute some code to it. We know that the codebase is not easy to get started with: it is fairly large, and the generality of the implementation and the support for index structures make it a bit hard to find your way in. We try to add tutorials to help you get started. And once you are used to it, you will actually benefit from the architecture: your algorithms get the benefits of indexing and arbitrary distance functions, whereas if you implemented them from scratch, you would likely only support Euclidean distance and no index acceleration.
Seeing that you struggled with OPTICS, I will try to write an OPTICS tutorial in the new year. In particular, OPTICS can benefit a lot from using an appropriate index structure.

What kind of analysis to use in SPSS for finding out groups/grouping?

My research question is about elderly people, and I have to find underlying groups. The data comes from a questionnaire. I have thought about cluster analysis, but what I would really like to study is perceived health and which factors affect it, e.g. what kinds of groups of elderly people rank their health as bad.
I have some 30 questions I would like to check with the analysis, to see if for example widows have better or worse health than the average. I also have weights in my data so I need to use complex samples.
How can I use an already existing function, or what analysis should I use?
The key challenge you have to solve first is to specify a similarity measure. Once you can measure similarity, various clustering algorithms become available.
But questionnaire data doesn't make a very good vector space, so you can't just use Euclidean distance.
If you want to generate clusters using SPSS, standard options include: k-means, hierarchical cluster analysis, or 2-step. I have some general notes on cluster analysis in SPSS here. See from slide 34.
If you want to see if widows differ in their health, then you need to form a measure of health and compare means on that measure between widows and non-widows (presumably using a between groups t-test). If you have 30 questions related to health, then you may want to do a factor analysis to see how the items group together.
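For readers who want to see what such a comparison of means computes, here is a small sketch in Python rather than SPSS. It uses Welch's variant of the between-groups t-test (which does not assume equal variances) and ignores survey weights entirely, so it does not replace a complex-samples analysis. The function name `welch_t` and the scores are invented for illustration.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard errors of the two group means.
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical perceived-health scores (higher means better health).
widows = [1, 2, 3, 4, 5]
non_widows = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(widows, non_widows)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom; in SPSS the corresponding test output reports this directly.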
If you are trying to develop a general model of what predicts perceived health, then there is a wide range of modelling options available. Multiple regression would be an obvious starting point. If you have many potential predictors, then you have a lot of choices regarding whether you are going to be testing particular models or taking a more data-driven model-building approach.
More generally, it sounds like you need to clarify the aims of your analyses and the particular hypotheses that you want to test.

Expectation Maximization Issue - How to find the optimum number of gaussians within the data

Is there any algorithm or trick for determining the number of Gaussians that should be identified within a set of data before applying the expectation maximization algorithm?
For example, in the plot of 2-dimensional data illustrated above, when I apply the Expectation Maximization algorithm, I try to fit 4 Gaussians to the data and obtain the following result.
But what if I didn't know the number of Gaussians within the data? Is there any algorithm or trick I could apply to find this out?
This might be a bit of a retread, since others already linked the wiki article of the actual cluster number determination, but I found that article a lil overly dense, so I thought I'd provide a brief, intuitive answer:
Basically, there isn't a universally 'correct' answer for the number of clusters in a data set -- the fewer clusters, the smaller the description length but the higher the variance, and in all non-trivial datasets the variance won't completely go away unless you have a Gaussian for each point, which renders the clustering useless (this is a case of the more general phenomenon known as the 'futility of bias-free learning': a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances).
So you basically have to pick some feature of your dataset to maximize via the number of clusters (see the wiki article on inductive bias for some example features)
In other sad news, in all such cases finding the number of clusters is known to be NP-hard, so the best you can expect is a good heuristic approach.
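One widely used heuristic of this kind is the Bayesian information criterion (BIC): fit the mixture for several candidate numbers of Gaussians and prefer the one with the lowest score, which trades likelihood against the number of parameters. The sketch below is a minimal 1-D illustration under assumptions, not the method from the answers here: the EM loop is bare-bones, the quantile initialisation, variance floor and all function names are choices made for this example, and real data (such as the 2-D plot in the question) would need full covariance matrices.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k, iters=200, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture with EM; return the log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [start_var] * k
    weights = [1.0 / k] * k

    def component_densities(x):
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = component_densities(x)
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)
    return sum(math.log(max(sum(component_densities(x)), 1e-300)) for x in xs)

def bic(log_likelihood, k, n):
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return n_params * math.log(n) - 2 * log_likelihood

def bic_by_k(xs, candidates):
    """Return {k: BIC score} for each candidate number of components."""
    return {k: bic(fit_gmm_1d(xs, k), k, len(xs)) for k in candidates}

# Hypothetical data: two well-separated groups of 100 points each.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(10, 1) for _ in range(100)]
scores = bic_by_k(data, [1, 2, 3, 4])
suggested_k = min(scores, key=scores.get)
```

The lowest score is only a suggestion: consistent with the point above, the criterion encodes one particular bias (a penalty per parameter), and a different criterion can prefer a different number of components.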
Wikipedia has an article on this subject. I am not too familiar with the subject, but I've been told that clustering algorithms that don't require specifying the number of clusters instead need some density information about the clusters or some minimum distance between clusters.
Nonparametric Bayesian clustering is now getting a lot of attention. You don't need to specify the number of clusters.
Autoclass is an algorithm that automatically identifies the number of clusters in a mixture.