How to decide the numbers of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn? - cluster-analysis

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I also do not care about the number of resulting clusters nor the size of the clusters but only that the cluster centroids do not exceed a certain distance.
How can I achieve this?

This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.

Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit, predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

Related

How to identify found clusters in Lumer Faieta Ant clustering

I have been experimenting with Lumer-Faieta clustering and I am getting
promising results:
However, as clusters formed I was wondering how to identify the final clusters? Do I run another clustering algorithm to identify the clusters (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, its gets the same cluster as the data points that dominates its neighborhood. The problem with this is that if clusters are broken up, they share share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post processing step. The effect as follows:
Given that we have a projection of a high dimensional problem on a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours are trivial (I used 3 for both values). Once the clusters have been identified, it can be projected back to the n-dimension space.
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.

Result of overlapping clustering

I'm using function fcm from Matlab for overlapping clustering. The output of this function is a matrix of size kxn with k being the number of clusters and n being the number of examples.
Now my problem is that how do I choose clusters for an example? For each example, I have scores for all clusters so I can easily find the best matched cluster, but what about other clusters?
Many thanks.
It depends on the clustering algorithm, but you can probably interpret those soft clustering values as probabilities. This gives two well-founded options for extracting a hard clustering:
Sample each point's cluster from its cluster distribution (a column in your kxn matrix).
Assign each point to its most probable cluster. This corresponds to the MAP (max a posteriori) solution to the clustering problem.
Option 2 is probably the way to go - a single sample may not be a great representation of what's going on; with MAP, you're at least guaranteed to get something probable.

Cluster assignment remapping

I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping of the labels and using the data to benchmark a few clustering algorithm and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, for say, 50 times and then take the mean value. For 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, also because each iteration might have slightly different cluster assignments, how to somehow remap each of the clusters to one uniform numbering.
Primary idea is to remap by checking how many of the points in the class labels intersect the maximum in the actual labels and then making a remap based on that, but this can get incorrect remappings because when the classes will have more or less equal number of points, this will not work.
Another idea is to keep the labels while clustering, but make the clustering algorithm ignore it. This way all the cluster data will have the label tags. This is doable but I have already have a benchmarked cluster assignment data to be processed therefore I am trying to avoid modifying and re-benchmarking my implementation (which will take quite some time and cpu) of the cluster analysis algorithms and include the label tag to the vectors and then ignore it.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
The domain in which I am studying (metaheuristic clustering algorithms) I could not find a paper comparing these indexes. The paper which compares seems to be incorrect in their values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty, for details see Wikipedia.

Decision on number of clusters in Data Mining

When ever we want to cluster some data then It is required to give the number of cluster by user. Like K-Means algorithm we need to specify that how cluster are required.
My question is it possible that the algorithm decides itself that how cluster are feasible for particular data set.
There are several clustering algorithms that do not require a desired number of clusters as an input to the algorithm. An example of such an algorithm is the mean-shift clustering algorithm. However, you will need to specify a kernel as an input to the algorithm. This kernel selection (e.g., the size and shape of the kernel) will impact the number of clusters that you get as an output.
Some more information:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html
I'm not expert with that, but to answer to your question, yes there are methods to determine automatically the number of cluster for a kmeans for example.
It's quite complicated but given a dataset and a cluster method you can compute what is called gap statistic in order to estime the number of clusters.
If you are a R user, try to check clusGap and maxSE functions.

Find connected components in a graph in MATLAB

I have many 3D data points, and I wish to find 'connected components' in this graph. This is where clusters are formed that exhibit the following properties:
Each cluster contains points all of which are at most distance from another point in the cluster.
All points in two distinct clusters are at least distance from each other.
This problem is described in the question and answer here.
Is there a MATLAB implementation of such an algorithm built-in or available on the FEX? Simple searches have not thrown up anything useful.
Perhaps a density-based clustering algorithm can be applied in this case. See this related question for a description of the DBscan algorithm.
I do not think that it is possible to satisfy both conditions in all cases.
If you decide to concentrate on the first condition, you can use Complete-Linkage hierarichical clustering, in which points or groups of points are merged based on the maximum distance between any two points. In Matlab, this is implemented in CLUSTERDATA (see help for the individual function steps).
To calculateyour cluster indices, you'd run
clusterIndex = clusterdata(coordiantes,maxDistance,'criterion','distance','linkage','complete','distance','euclidean')
In case you then want to simply eliminate points of different clusters that are less than minDistance apart, you can run pdist between clusters to clean up your connected components.
k-means or k-medoid algorithm may be useful in this case.