clustering evaluation, taking into account the number of cluster - cluster-analysis

I know how to calculate the Recall, Precision and F_measure for clusters as explained in this course https://www.coursera.org/learn/cluster-analysis/lecture/BcYhV/6-4-external-measures-1-matching-based-measures
However, what if the number of clusters generated by my system is more than the number of clusters in the ground-truth, how can we calculate these measures?
It seems that there is no penalty for systems generating more clusters since we just matching each cluster in the ground-truth to the best cluster generated from my system. Am i missing something here?

Don't compute them as in classification!!!
Either you need to work with pairs of points - that is the most common approach, used by the very popular ARI measure.
Or you need to find the cluster with the maximum overlap, this then sometimes called "matching". I am not convinced of this approach.
Last but not least, you could use the Hungarian algorithm to find the best partial 1:1 correspondence, and consider unmatched clusters to be all false.

Related

ELKI ClusteringVectorDumper - which ones are the outliers?

I am using ELKI for DBSCAN clustering and its ClusteringVectorDumper to output the cluster ids into a text file.
Which id do outliers get?
I assumed it was '0' but that does not seem to be true.
You can find the source code of ClusteringVectorDumper online.
There is no special treatment of noise clusters, but they will be processed as returned by Clustering.getAllClusters(). To provide some stability, this method currently sorts by name. DBSCAN does not provide further cluster names (some algorithms do assign e.g. interval names, or subspace names), so if I recall correctly, all clusters will be named either "Cluster" or "Noise". Because "Noise" sorts after "Cluster", the largest index should be the noise cluster.
Feel free to send a pull request to improve either the naming, or the output. I had been considering to use negative numbers for noise clusters, but it would increase the code complexity; and people would likely not expect to see a cluster -1 either.
It does not work well to abuse DBSCAN for outlier detection. It will miss outliers because they are reachable, and it will assign low-density clusters as outliers that other methods will easily recognize correctly. It also does not provide a ranking, so you have little control over how many outliers you get. If you would modify DBSCAN to provide such a ranking, you would likely reinvent one of the oldest outlier detection method, kNN outlier detection. (DB-Outlier is also very closely related to DBSCAN).

Cluster assignment remapping

I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping of the labels and using the data to benchmark a few clustering algorithm and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, for say, 50 times and then take the mean value. For 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, also because each iteration might have slightly different cluster assignments, how to somehow remap each of the clusters to one uniform numbering.
Primary idea is to remap by checking how many of the points in the class labels intersect the maximum in the actual labels and then making a remap based on that, but this can get incorrect remappings because when the classes will have more or less equal number of points, this will not work.
Another idea is to keep the labels while clustering, but make the clustering algorithm ignore it. This way all the cluster data will have the label tags. This is doable but I have already have a benchmarked cluster assignment data to be processed therefore I am trying to avoid modifying and re-benchmarking my implementation (which will take quite some time and cpu) of the cluster analysis algorithms and include the label tag to the vectors and then ignore it.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
The domain in which I am studying (metaheuristic clustering algorithms) I could not find a paper comparing these indexes. The paper which compares seems to be incorrect in their values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty, for details see Wikipedia.

Affinity propagation vs basic k-means algorithm

I have a dataset consists of (700 data points x 400 dimensions) which belong to 10 classes. I did cluster this data to see how data points will fit into clusters similar to their class. I performed two clustering experiments, one using basic k-means (euclidean) and another using Affinity Propagation. I noticed that the results using k-means are better and faster!! than the Affinity Propagation.
I could not understand the reason behind this. Can any of you help in giving explanation why I got such results (I thought Affinity Propagation is better than k-means)?
It could be a matter of granularity - the APC result could be close to a subclustering or superclustering of the class labels. There is a parameter that affects APC granularity (check yourself).
Another consideration is how you prepare the network that you give to APC (or any other network clustering algorithm). Ideally it should not be too dense. As a rough guideline, make sure that the distribution of { number of neighbours per node | all nodes } does not stray far outside [0.5 * sqrt(N) - 2.0 * sqrt(N)]. Especially try to avoid hubs, that is, nodes that have many more neighbours than that upper bound.
As a sanity check, are the values that you give to APC similarities? They should similarities be of course, not distances. You have a choice how the similarity is computed. The standard way to restrain the number of neighbours is to use a cut-off. Experiment with the combination of these. Finally you may also want to try MCL, an algorithm that precedes APC and uses conceptually similar principles but is a bit cleaner in its formulation (alternation of simple matrix operations). It is probably faster.

interpreting the results of OPTICSxi Clustering

I am interested in detecting clusters in areas with varying-density, such as user-generated data in cities, and for that I adopted the OPTICS algorithm.
Unlike DBSCAN, the OPTICS algorithm does not produce a strict cluster partition, but an augmented ordering of the database. To produce the cluster partition, I use OPTICSxi, which is another algorithm that produces a classification based on the output of OPTICS. There are few libraries capable of extracting a cluster partition from the output of OPTICS, and ELKI’s OPTICSxi implementation is one of them.
It is very clear to me, how-to interpret the results of DBSCAN (although it is not that easy, to set “meaningful” global parameters); DBSCAN detects a “prototype” of a cluster, characterized by a density, expressed as a number of points per area (minpts/epsilon). The results of OPTICSxi seem a bit more difficult to interpret.
There are two phenomena that I sometimes detect in the outputs of OPTICSxi, and that I am not able to explain. One is the appearance of “spike” clusters, that link parts of the map. I cannot explain them, because they seem to be made of very few points and I don’t understand how the algorithm decides to group them in the same cluster. Do they really represent a “corridor” of density variation? looking at the underlying data, it does not look like that. You can see these “spikes” in the image bellow.
The other phenomenon that I cannot explain is the fact that sometimes there are "overlapping" clusters of the same hierarchical level. OPTICSxi is based on the OPTICS ordering of the database (e.g. dendrogram) and there are no repeated points in that diagram.
Since this is a hierarchical clustering, we consider that clusters of a lower level contain clusters of a higher level, and that idea is enforced when building the convex hulls. However, I don’t see any justification for having clusters that intersect other clusters on the same hierarchical level, which in practice would mean that some points would have a double cluster “membership”. On the image bellow, we can see some intersecting clusters with the same hierarchical level (0).
Finally the most important thought/question that I want to leave you with, is: what do we expect to see in an OPTICSxi clustering classification? This question is closely linked to the task of parametrizing OPTICSxi.
Since I see hardly any studies with runs of OPTICSxi for a particular cluster problem, I struggle to find what is an optimal clustering classification would be; i.e.: one that can provide some meaningful/useful results, and add some value to the DBSCAN clustering. To help me answering that question, I performed many runs of OPTICSxi, with different combinations of parameters, and I selected three that I will discuss bellow.
On this run I used a large value of epsilon (2Km); the meaning of that value is that we accept large clusters (up to 2Km); since the algorithm “merges” clusters, we will end up with some very large clusters, that will have almost certainly a low density. I like this output, because it exposes the hierarchical structure of the classification, and it actually reminds me of several runs DBSCAN with a different combination of parameters (for different densities), which is the advertised “strength” of OPTICS. As it was mentioned before, smaller clusters correspond to higher levels in the hierarchical scale, and higher densities.
On this run we see a large number of clusters, even if the “contrast” parameter is the same from the previous run. That is mostly because I chosen a low number of minpts, which established that we accept clusters with a low number of points. Since the epsilon in this case is shorter, we don’t see these large clusters occupying a large part of the map. I find this output less interesting than the previous one, mostly because, even if we have an hierarchical structure there are many clusters at the same level, and many of them intersect. In terms of interpretation, I can see an overall “shape” that is similar to the previous one, but it is actually discretized in lots of small clusters that are easily overlooked as “noise”.
This run has a parameter choice that is similar to the previous one, except that the minpts is larger; the consequences is that not only we find less clusters and they overlap less, but also that they are mostly at the same level.
In a perspective of adding value to DBSCAN, I would opt for the first combination of parameters, since it provides a hierarchical picture of the data, exposing clearly which areas are more dense. IMHO the last combination of parameters, fails to provide an idea of the global distribution of density, since it is finding similar clusters all over the study area. I am interested to read other opinions.
The problem with extracting clusters from the OPTICS plot is the first and last elements of a clsuter. Just from the plot, you cannot (to my understanding) decide whether the last element should belong to the previous cluster or not.
Consider a plot like this
*
* *
* *
* **
**************
A B C D EF G H
This can be a cluster where A is right in the middle, B-E nearby, and F is the nearest element in a completely different cluster. For example, the data set might look like this:
* D *
B A E F G
* C H *
Or, A is at the rim of the first cluster, B-D are part of the cluster, whereas E is an outlier element bridging the gap to the cluster F-H.
A data set that causes such an effect could look like this:
D * *
* C B A E F G
E * H *
OpticsXi operates visually. F is the "steeper" point to split, so E will in each case be part of the first cluster. It is literally the best guess OpticsXi can do without looking at the data points.
This is likely the effect causing the spikes you have been observing.
I see four options:
improve OpticsXi yourself. If you are interested, we can discuss some heuristics possible to distinguish these two cases above.
implement one of the other extraction methods, such as inflexion points (but they may suffer from the same effects, als they are in the plot AFAICT)
use HDBSCAN (sorry, not yet included in ELKI, although we have a version that appears to be working) - probably in 0.7.0
Apply post-processing to the clusters. In particular, test the first and last few points by cluster order, if you want to include them in the cluster, move them to the next, or move them to the parent cluster. Maybe simply by average distance from the cluster...

Choosing Clustering Method based on results

I'm using WEKA for my thesis and have over 1000 lines of data. The database includes demographical information (Age, Location, status etc.) followed by name of products (valued 1 or 0). The end results is a recommender system.
I used two methods of clustering, K-Means and DBScan.
When using K-means I tried 3 different number of cluster, while using DBscan I chose 3 different epsilons (Epsilon 3 = 48 clusters with ignored 17% of data, Epsilone 2.5 = 19 clusters while cluster 0 holds 229 items with ignored 6%.) Meaning i have 6 different clustering results for same data.
How do I choose what's best suits my data ?
What is "best"?
As some smart people noticed:
the validity of a clustering is often in the eye of the beholder
There is no objectively "better" for clustering, or you are not doing cluster analysis.
Even when a result actually is "better" on some mathematical measure such as separation, silhouette or even when using a supervised evaluation using labels - its still only better at optimizing towards some mathematical goal, not to your use case.
K-means finds a local optimal sum-of-squares assignment for a given k. (And if you increase k, there exists a better assignment!) DBSCAN (it's actually correctly spelled all uppercase) always finds the optimal density-connected components for the given MinPts/Epsilon combination. Yet, both just optimize with respect to some mathematical criterion. Unless this critertion aligns with your requirements, it is worthless. So there is no best, until you know what you need. But if you know what you need, you would not need to do cluster analysis.
So what to do?
Try different algorithms and different parameters and analyze the output with your domain knowledge, if they help you with the problem you are trying to solve. If they help you solving your problem, then they are good. If they do not help, try again.
Over time, you will collect some experience. For example, if the sum-of-squares is meaningless for your domain, don't use k-means. If your data does not have meaningful density, don't use density based clustering such as DBSCAN. It's not that these algorithms fail. They just don't solve your problem, they solve a different problem that you are not interested in. And they might be really good at solving this other problem...