Cluster quality measures - Matlab

Does Matlab provide any facility for evaluating clustering methods (cluster compactness, cluster separation, and so on)?
Or is there a toolbox for it?

Matlab provides the Silhouette index, and there is a toolbox for Matlab, CVAP (Cluster Validity Analysis Platform), which includes the following validity indices:
Davies-Bouldin
Calinski-Harabasz
Dunn index
R-squared index
Hubert-Levin (C-index)
Krzanowski-Lai index
Hartigan index
Root-mean-square standard deviation (RMSSTD) index
Semi-partial R-squared (SPR) index
Distance between two clusters (CD) index
Weighted inter-intra index
Homogeneity index
Separation index
Note that you might need precompiled LIBRA binaries for your platform.

Not in Matlab, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation.

You can try the silhouette plot from the Statistics Toolbox.
For an example, see this documentation.

Be aware that the Matlab silhouette has some strange behavior for singleton clusters: it assigns them a score of 1, whereas a more reasonable approach, in my view, would be to give them 0. As a consequence of the Matlab implementation, if you set the number of clusters equal to the number of objects, silhouette will report a score of 1.
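For a quick check, here is a minimal sketch of the usual workflow (the toy data and the choice k = 2 are made up for illustration):

    % Two well-separated 2-D Gaussian blobs (toy data)
    rng(1);
    X = [randn(100,2); randn(100,2) + 5];

    % Cluster with k-means, then plot per-point silhouette values
    idx = kmeans(X, 2);
    [s, h] = silhouette(X, idx);   % s holds each point's silhouette value
    fprintf('Mean silhouette: %.3f\n', mean(s));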

Related

Spatial features and pattern analysis of a plane?

I am working on instances from the TSPLIB, which are simply coordinates of nodes in the plane. I'm looking to analyze spatial characteristics and features of a set of instances (e.g. clustered, not clustered, dispersed), and I would like to implement some Matlab code that computes specific features.
For example, so far I have used nearest-neighbor analysis to identify clusters, as well as quadrat analysis. Can anyone suggest other spatial features and patterns that could be computed with relatively simple code? Perhaps someone who is an expert in the Traveling Salesman Problem. Thank you so much!
K-means is a very useful clustering tool for this; see the sketch below.
https://www.mathworks.com/help/stats/kmeans.html
Nearest Neighbor is a classification method. If you want to do classification, you can use k-Nearest Neighbors, SVMs, or the pattern-recognition tools in the Neural Network Toolbox; these are all already in Matlab.
Also, check out Matlab Apps; there are some very cool clustering tools available there as well, with examples.
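As a minimal k-means sketch on made-up node coordinates (the data and the choice k = 2 are assumptions, not TSPLIB input):

    % Toy 2-D node coordinates standing in for a TSPLIB instance
    rng(0);
    coords = [randn(50,2); randn(50,2) + 4];

    % Partition the nodes into k clusters and visualize the result
    k = 2;
    [idx, C] = kmeans(coords, k);
    gscatter(coords(:,1), coords(:,2), idx);
    hold on;
    plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);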

How to calculate Davies-Bouldin for clustering methods in RapidMiner?

I want to cluster data without k-means; for example, I would prefer to cluster with DBSCAN or support vector clustering.
I therefore need to evaluate clustering performance with the Davies-Bouldin metric, but I don't know how to calculate Davies-Bouldin in RapidMiner for DBSCAN or support vector clustering.
Please help me.
Thank you.
The Cluster Distance Performance operator allows the Davies-Bouldin validity measure to be calculated. It requires a cluster model containing the cluster centroids to be passed to it, which means approaches like DBSCAN and support vector clustering cannot be used with it, because they do not produce cluster centroids.
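If you can export the clustered examples, one workaround is to compute Davies-Bouldin yourself, deriving centroids from the cluster assignments after the fact. A sketch outside RapidMiner, here in Matlab (X is an n-by-d data matrix, idx a vector of labels 1..k; the function is my own, not a library routine):

    function db = daviesbouldin(X, idx)
    % Davies-Bouldin index: average over clusters of the worst-case
    % ratio (S_i + S_j) / d(c_i, c_j); lower is better.
    k = max(idx);
    C = zeros(k, size(X,2));   % centroids derived from the labels
    S = zeros(k, 1);           % mean distance of members to their centroid
    for i = 1:k
        Xi = X(idx == i, :);
        C(i,:) = mean(Xi, 1);
        S(i) = mean(sqrt(sum((Xi - C(i,:)).^2, 2)));
    end
    R = zeros(k);
    for i = 1:k
        for j = i+1:k
            R(i,j) = (S(i) + S(j)) / norm(C(i,:) - C(j,:));
            R(j,i) = R(i,j);
        end
    end
    db = mean(max(R, [], 2));
    end

Bear in mind that averaging positions is only meaningful for reasonably compact clusters; for the arbitrarily shaped clusters DBSCAN finds, a centroid-based index like Davies-Bouldin can be misleading.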

Clustering based on Pearson correlation

I have a use case where I have traffic data for every 15 minutes over one month.
This data is collected for various resources in a network.
Now I need to group resources that are similar (based on the traffic usage pattern from 00:00 to 23:45).
One way to check whether two resources have similar traffic behavior is to compute the Pearson correlation coefficient for all pairs of resources and create an N*N matrix.
My question is: which method should I apply to cluster the similar resources?
Existing K-means clustering methods are based on Euclidean distance. Which algorithm can I use to cluster based on similarity of pattern?
Any thoughts or links to possible solutions are welcome. I want to implement this in Java.
Pearson correlation is not compatible with the mean. Thus, k-means must not be used: it is proper for least squares, but not for correlation.
Instead, just use hierarchical agglomerative clustering (HAC), which works with Pearson correlation matrices just fine. Or DBSCAN, which also works with arbitrary distance functions; you can set a threshold, e.g. an absolute correlation of +0.75 may be a desirable value of epsilon. But to get a feeling for your distance function, the dendrograms used by HAC are probably easier.
Beware that Pearson correlation is not defined for constant patterns. If you have a resource with zero usage, its distance to everything will be undefined.
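You mentioned Java, but the recipe is quickest to show compactly; a Matlab sketch using correlation distance (1 minus Pearson correlation between rows) with average linkage, on a made-up traffic matrix T (one row per resource, 96 quarter-hour bins):

    % Toy traffic: two groups of resources with different daily shapes
    rng(2);
    t = linspace(0, 2*pi, 96);
    T = [repmat(sin(t), 5, 1); repmat(cos(t), 5, 1)] + 0.1*randn(10, 96);

    D = pdist(T, 'correlation');   % 1 - Pearson correlation per row pair
    Z = linkage(D, 'average');
    dendrogram(Z);                 % inspect before choosing a cut

    % Cut the tree at a distance threshold, e.g. 1 - 0.75 = 0.25
    labels = cluster(Z, 'cutoff', 0.25, 'criterion', 'distance');

The same structure carries over to Java: ELKI, for example, ships hierarchical clustering and DBSCAN with pluggable distance functions, so you can feed in your own correlation-based distance.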

ELKI - Clustering Statistics

When a data set is analyzed by a clustering algorithm in ELKI 0.5, the program produces a number of statistics: the Jaccard index, F1-Measures, etc. In order to calculate these statistics, there have to be 2 clusterings to compare. What is the clustering created by the algorithm compared to?
The automatic evaluation (note that you can configure the evaluation manually!) is based on labels in your data set. At least in the current version (why are you using 0.5 and not 0.6.0?), it should only evaluate automatically if it finds labels in the data set; the clustering produced by the algorithm is then compared against the grouping implied by those labels.
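For intuition, these are pair-counting measures: each clustering is viewed as the set of object pairs it puts into the same cluster, and two clusterings are compared through those pair sets. A rough Matlab sketch of the pair-counting Jaccard index between a clustering idx and ground-truth labels y (the function name is my own; it enumerates all O(n^2) pairs, so it is for small data only):

    function j = pairJaccard(idx, y)
    % a = pairs grouped together in both clusterings,
    % b, c = pairs grouped together in exactly one of them.
    n = numel(idx);
    [I, J] = find(triu(ones(n), 1));   % all unordered index pairs
    sameA = idx(I) == idx(J);
    sameB = y(I) == y(J);
    a = sum(sameA & sameB);
    b = sum(sameA & ~sameB);
    c = sum(~sameA & sameB);
    j = a / (a + b + c);
    end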
We have not yet published internal measures. There are some implementations, such as evaluation/clustering/internal/EvaluateSilhouette.java, some of which will be in the next release.
In my experiments, internal evaluation measures were badly misleading. For example, with the Silhouette coefficient, the labeled "solution" would often score a negative silhouette coefficient (i.e. worse than not clustering at all).
Also, these measures do not scale. The Silhouette coefficient takes O(n^2) time to compute, which usually makes this evaluation more expensive than the actual clustering!
We do appreciate contributions! You are more than welcome to contribute your favorite evaluation measure to ELKI and share it with others.

Clustering with varying dimensions

In my clustering problem, not only can points come and go, but features can also be removed or added. Is there any clustering algorithm for this problem?
Specifically, I am looking for an agglomerative hierarchical clustering version of such an algorithm.
You can use hierarchical clustering (except that it scales really badly) or any other distance-based clustering. K-means alone is tricky, because how do you compute the mean when a value is not present?
You only need to define an appropriate distance function first (see the sketch below).
Clustering is usually done based on similarity, so first find out what "similar" means for you. This is very data-set- and use-case-specific, although many people can use some kind of distance function. There is no "one size fits all" solution.
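As a minimal sketch, assuming absent features are encoded as NaN: a custom distance that compares only the features both objects share (rescaled by the fraction of features compared, a common missing-data heuristic), plugged into pdist/linkage:

    function d = nanDist(xi, Xj)
    % Distance over shared (non-NaN) features only.
    % xi: 1-by-d row; Xj: m-by-d rows (pdist's custom-function contract).
    d = zeros(size(Xj,1), 1);
    for r = 1:size(Xj,1)
        ok = ~isnan(xi) & ~isnan(Xj(r,:));
        if any(ok)
            % Rescale so distances over few shared features stay comparable
            d(r) = sqrt(sum((xi(ok) - Xj(r,ok)).^2) * numel(xi) / sum(ok));
        else
            d(r) = Inf;   % nothing in common: maximally dissimilar
        end
    end
    end

Agglomerative hierarchical clustering then works unchanged:

    D = pdist(X, @nanDist);        % X: n-by-d with NaN for absent features
    Z = linkage(D, 'average');
    labels = cluster(Z, 'maxclust', 3);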