How to calculate Davies-Bouldin from clustering methods in RapidMiner?

I want to cluster data without k-means; for example, I would prefer to cluster with DBSCAN or support vector clustering.
I therefore need to evaluate the clustering performance with the Davies-Bouldin metric, but I don't know how to calculate Davies-Bouldin in RapidMiner for DBSCAN or support vector clustering.
Please help me.
Thank you.

The Cluster Distance Performance operator allows the Davies-Bouldin validity measure to be calculated. It requires a cluster model containing the cluster centroids to be passed to it, which means approaches like DBSCAN and support vector clustering cannot be used with it, because they do not produce cluster centroids.
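If you can compute the index outside RapidMiner, here is a minimal sketch in Python, assuming scikit-learn 0.20 or later (which provides sklearn.metrics.davies_bouldin_score); the data and DBSCAN parameters are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Illustrative data; replace with the example set exported from RapidMiner.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Davies-Bouldin needs at least two clusters; DBSCAN noise points (-1) are dropped first.
mask = labels != -1
score = davies_bouldin_score(X[mask], labels[mask])
print("Davies-Bouldin index:", score)  # lower is better

Note that davies_bouldin_score computes the centroids itself from the labelled points, so it works for centroid-free methods such as DBSCAN.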

Related

How to identify found clusters in Lumer-Faieta ant clustering

I have been experimenting with Lumer-Faieta clustering and I am getting promising results.
However, once the clusters have formed, how do I identify the final clusters? Do I run another clustering algorithm on the result (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster number as the data points that dominate its neighborhood. The problem with this is that if a cluster is broken up, the fragments still share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step, which works as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both). Once the clusters have been identified, they can be projected back to the n-dimensional space.
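As a minimal sketch of this post-processing step (grid_positions is a hypothetical stand-in for the (n, 2) grid coordinates where the ants dropped the items):

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (n, 2) array of grid coordinates produced by the ant clustering.
grid_positions = np.array([[0, 0], [0, 1], [1, 0],
                           [10, 10], [10, 11], [11, 10]])

# eps = 3 and min_samples = 3, the values quoted above.
labels = DBSCAN(eps=3, min_samples=3).fit_predict(grid_positions)
print(labels)  # one cluster label per dropped item; -1 would mark noise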
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.

How to decide the number of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I do not care about the number of resulting clusters or their sizes, only that the cluster centroids do not exceed a certain distance from each other.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
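Once you have a version that includes the parameter, usage should look something like this sketch (distance_threshold requires n_clusters=None; the data and threshold value are illustrative):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(300, 5)

# n_clusters must be None when distance_threshold is given; merging stops
# once the linkage distance between clusters exceeds the threshold.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5, linkage='ward')
labels = model.fit_predict(X)
print(model.n_clusters_)  # number of clusters found for this threshold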
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three-step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit/predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
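A minimal sketch of this route, matching the single-linkage, distance-threshold setup mentioned in the question (the data and threshold are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(300, 5)

# Step 1: compute the dendrogram (single linkage on condensed distances).
Z = linkage(pdist(X), method='single')
# Step 2 would be scipy.cluster.hierarchy.dendrogram(Z) for inspection.

# Step 3: extract branches, stopping merges at cophenetic distance > 0.5.
labels = fcluster(Z, t=0.5, criterion='distance')
print(len(np.unique(labels)), "clusters")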

Using Mahout for clustering one point

I know that Mahout is used for batch processing, but I am interested in whether I can use its KMeans, and how, to cluster individual points.
Let's say that we have the following situation:
Global clustering, which performs batch processing on all data and gives centroids as a result
One-point clustering, which uses the centroids from global clustering to assign that point to a cluster - it does not require cluster centroid re-computation, just assigning the point to an existing cluster
Can I do this using Mahout, or do I have to implement it myself? I thought of setting the number of iterations to 1 and assigning the point that way, but the thing is, KMeans recomputes cluster centroids, and if the new point is an outlier, it makes a new cluster from it. I don't want that; I actually want the distance to the closest centroid.
For now, it seems that it is not very appropriate to use KMeans for this and that it should be implemented separately... Is that correct?
Thanks
You don't need to use Mahout for this.
K-means assigns points to the nearest center.
So just get all the centers (which should fit easily into RAM) and compute the squared Euclidean distance to each one.
It's just a few CPU cycles; there is absolutely no benefit in trying to do this with Mahout. The overhead would be much too large for just some k distance computations.
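A minimal sketch of that assignment step, assuming the k centroids exported from the batch run are in a NumPy array (all values here are illustrative):

import numpy as np

centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])  # k centroids from the batch run
point = np.array([4.2, 5.1])                               # the single point to assign

d2 = ((centers - point) ** 2).sum(axis=1)  # squared Euclidean distance to every centroid
cluster = int(np.argmin(d2))               # index of the nearest centroid
distance = float(np.sqrt(d2[cluster]))     # the distance to the closest centroid
print(cluster, distance)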

K-means clustering Matlab

My problem is that it is difficult to get the optimal cluster number by using k-means, so I thought of using a hierarchical algorithm to find it. After defining my ideal classification, I want to use this classification to find the centroids with k-means, without iterations.
data = rand(300,5);            % 300 points in 5 dimensions
D = pdist(data);               % condensed pairwise-distance vector
Z = linkage(D,'ward');         % Ward-linkage dendrogram
T = cluster(Z,'maxclust',6);   % cut the tree into 6 clusters
Now I want to feed the clusters defined in vector T, and their positions, into the k-means algorithm without iterations. Can anyone give a tip on how to do this?
Thank you.
If you are looking for the centroids, given that you have already clustered the points in T, then you only need to compute the mean of data grouped according to T.
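For illustration, here is that computation sketched in Python/NumPy (in Matlab the same thing is the mean of data(T==k,:) for each cluster k):

import numpy as np

data = np.random.rand(300, 5)          # stand-in for the Matlab data matrix
T = np.random.randint(1, 7, size=300)  # stand-in for the labels from the hierarchical step

# One centroid per cluster: the mean of the rows assigned to it.
centroids = np.vstack([data[T == k].mean(axis=0) for k in np.unique(T)])
print(centroids.shape)  # (number of clusters, 5)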

Cluster quality measures

Does Matlab provide any facility for evaluating clustering methods (cluster compactness, cluster separation, ...)?
Or is there any toolbox for it?
Matlab provides the Silhouette index, and there is a toolbox, CVAP: Cluster Validity Analysis Platform for Matlab, which includes the following validity indexes:
Davies-Bouldin
Calinski-Harabasz
Dunn index
R-squared index
Hubert-Levin (C-index)
Krzanowski-Lai index
Hartigan index
Root-mean-square standard deviation (RMSSTD) index
Semi-partial R-squared (SPR) index
Distance between two clusters (CD) index
Weighted inter-intra index
Homogeneity index
Separation index
Note that you might need precompiled LIBRA binaries for your platform.
Not in Matlab, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation.
You can try the Silhouette plot from the Statistics Toolbox.
For an example, see this documentation.
Be aware that the Silhouette implementation in Matlab has some strange behavior for singleton clusters. It assigns a score of 1 to singletons, when, in my view, a more reasonable approach would be to give 0 to these clusters. In the Matlab implementation, if you set the number of clusters equal to the number of objects, Silhouette will give you a score of 1.
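For comparison, scikit-learn's silhouette_samples follows Rousseeuw's convention and, as far as I know, scores singleton clusters as 0, which you can check with a small sketch (the data here is illustrative):

import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [0.1], [0.2], [5.0]])
labels = np.array([0, 0, 0, 1])  # the last point is a singleton cluster

scores = silhouette_samples(X, labels)
print(scores)  # the singleton's score is expected to be 0 here, not 1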