Result of overlapping clustering - matlab

I'm using the fcm function from MATLAB for overlapping (fuzzy) clustering. The output of this function is a matrix of size k×n, with k being the number of clusters and n being the number of examples.
My problem now is: how do I choose the clusters for an example? For each example I have membership scores for all clusters, so I can easily find the best-matched cluster, but what about the other clusters?
Many thanks.

It depends on the clustering algorithm, but you can probably interpret those soft clustering values as probabilities. This gives two well-founded options for extracting a hard clustering:
1. Sample each point's cluster from its cluster distribution (a column of your k×n matrix).
2. Assign each point to its most probable cluster. This corresponds to the MAP (maximum a posteriori) solution to the clustering problem.
Option 2 is probably the way to go - a single sample may not be a great representation of what's going on; with MAP, you're at least guaranteed to get something probable.
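A minimal numpy sketch of both options, assuming the k×n membership matrix produced by fcm has been brought over to Python (the matrix below is random, purely for illustration):

import numpy as np

# U is the k-by-n fuzzy membership matrix (rows = clusters, columns = examples);
# here we fabricate one whose columns sum to 1, like fcm's output.
rng = np.random.default_rng(0)
U = rng.random((3, 100))
U /= U.sum(axis=0, keepdims=True)

# Option 1: sample each point's cluster from its membership distribution.
sampled = np.array([rng.choice(U.shape[0], p=U[:, j]) for j in range(U.shape[1])])

# Option 2 (MAP): assign each point to its most probable cluster.
map_labels = U.argmax(axis=0)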

Related

Matlab cluster data into Disconnected Subsets

I am trying to extract a fixed (and known) number of clusters from a set of points using Matlab.
An immediate clustering method that I tried is the k-means algorithm, which seems to tick all the boxes.
Unfortunately, in some cases, the subsets (or clusters) extracted are intertwined, as shown in the image below for the left-most cluster:
[image: the extracted clusters, with the left-most cluster intertwined]
Is there a way to set the k-means algorithm, so that the generated clusters are disconnected?
Is there a way to post-process the cluster indices returned by the k-means algorithm, so as to obtain "disconnected" clusters?
Alternatively, is there another clustering method that might be more suitable?
Thanks!

Clustering algorithm for specifying n points per cluster?

I'm looking for a clustering algorithm where you set a target number of points per cluster, which the algorithm would aim for. For example, if I have 10 total data points and n=5, the algorithm would group them into 2 clusters. If the total were 11 and n=5, it would again form 2 clusters, one with 5 points and one with 6.
I was thinking I could use agglomerative clustering and then stop at a certain number of clusters, but I'm wondering if this is the wrong approach and I shouldn't be doing clustering at all, and should instead use something else to group the items? Thanks.
Just so you know, clustering methods are unsupervised, so you don't train/test anything. You let the algorithm tell you the story based on the data that is fed in; you don't know what will happen in advance. In short, with DBSCAN and also hierarchical clustering (but not k-means), you do not pre-specify the number of clusters; the algorithm determines the number of clusters for you. If you really want to control the number of clusters (min or max), you need an algorithm like k-means, where you specify k directly. Take a look at this link when you have a chance.
https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
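If you do derive the cluster count from the desired cluster size, a minimal sketch with scikit-learn's KMeans might look like the following. Note that neither k-means nor agglomerative clustering enforces exact cluster sizes, so the per-cluster count is only approximate; the data here is random and purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.random((11, 2))        # 11 data points
target_size = 5                # desired points per cluster

# Derive the number of clusters from the desired cluster size;
# with 11 points and n=5 this gives 2 clusters.
k = max(1, round(len(X) / target_size))

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))     # actual cluster sizes, not guaranteed to be 5 and 6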

How to identify found clusters in Lumer Faieta Ant clustering

I have been experimenting with Lumer-Faieta clustering and I am getting promising results.
However, as the clusters form, I was wondering how to identify the final clusters. Do I run another clustering algorithm to identify them (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster as the data points that dominate its neighborhood. The problem with this is that if a cluster is broken up, the separated parts still share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step. The effect is as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both values). Once the clusters have been identified, they can be projected back to the n-dimensional space.
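A minimal sketch of that post-processing step with scikit-learn's DBSCAN, assuming the ant algorithm's 2D grid positions are available as an (n, 2) array; the coordinates below are made up:

import numpy as np
from sklearn.cluster import DBSCAN

# Grid cells where the ants dropped each data item (two obvious clumps).
grid_positions = np.array([[1, 1], [1, 2], [2, 2],
                           [10, 10], [10, 11], [11, 10]])

# eps and min_samples of 3, as above; items further than eps from any clump
# are labelled -1 (noise) and can be handled separately.
labels = DBSCAN(eps=3, min_samples=3).fit_predict(grid_positions)
print(labels)   # e.g. [0 0 0 1 1 1]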
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.

How to decide the number of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I do not care about the number of resulting clusters or their sizes, only that the cluster centroids do not exceed a certain distance from each other.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
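For reference, once that parameter landed (scikit-learn 0.22 and later), usage looks roughly like the sketch below; the data, linkage method, and threshold value are made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.random((50, 3))

# n_clusters must be None when distance_threshold is set; merging stops once
# the linkage distance would exceed the threshold, so the number of clusters
# depends on the data.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                                linkage='single')
labels = model.fit_predict(X)
print(model.n_clusters_, np.bincount(labels))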
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three-step process:
1. Compute the dendrogram
2. Visualize and analyze
3. Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit, predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
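A small sketch of that scipy route, cutting wherever the merge distance would exceed a chosen threshold instead of asking for a fixed number of clusters; the data, linkage method, and threshold are made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((50, 3))

Z = linkage(X, method='single')                    # 1. compute the dendrogram
# (2. visualize with scipy.cluster.hierarchy.dendrogram if desired)
labels = fcluster(Z, t=0.5, criterion='distance')  # 3. extract branches below t
print(len(np.unique(labels)), 'clusters')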

With SciPy, how do I get the clustering for a given k when doing hierarchical clustering?

So I am using fastcluster with SciPy to do agglomerative clustering. I can call dendrogram to get the dendrogram for the clustering, and I can call fcluster(Z, sqrt(D.max()), 'distance') to get a pretty good clustering for my data. What if I want to manually inspect a region of the dendrogram where, say, k=3 (clusters), and then inspect k=6 (clusters)? How do I get the clustering at a specific level of the dendrogram?
I see all these functions with tolerances, but I don't understand how to convert from tolerance to number of clusters. I can manually build the clustering using a simple data set by going through the linkage (Z) and piecing the clusters together step by step, but this is not practical for large data sets.
If you want to cut the tree at a specific level, then use:
fl = fcluster(cl, numclust, criterion='maxclust')
where cl is the output of your linkage method and numclust is the number of clusters you want to get.
Hierarchical clustering allows you to zoom in and out to get fine- or coarse-grained views of the clustering, so it might not be clear in advance which level of the dendrogram to cut. A simple solution is to get the cluster membership at every level; it is also possible to select the desired number of clusters. For example:
import numpy as np
from scipy import cluster

np.random.seed(23)
X = np.random.randn(20, 4)

# Ward linkage on the raw observations.
Z = cluster.hierarchy.ward(X)

# Cluster memberships at every level of the dendrogram (one column per cut).
cutree_all = cluster.hierarchy.cut_tree(Z)

# Memberships for exactly 5 and 10 clusters.
cutree1 = cluster.hierarchy.cut_tree(Z, n_clusters=[5, 10])

print("membership at all levels \n", cutree_all)
print("membership for 5 and 10 clusters \n", cutree1)
OK, so let me propose one way. I don't think it is the right or best way, but at least it is a start.
1. Choose the k we are interested in.
2. Note that the linkage matrix Z has N-1 rows, where N is the number of data points. The row at index m (0-based) corresponds to the merge that leaves N-m-1 clusters, so for k clusters grab the row at index m = N-k-1.
3. Grab the distance value, which is the third column of that row.
4. Call fcluster with that particular distance as the tolerance (or perhaps the distance plus some really small delta).
The only problem with this is that there can be ties, but that is not really a problem if you can detect that a tie has taken place.
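A rough sketch of those steps, reusing a scipy linkage matrix like the one above; the synthetic data and the tie check are only illustrative:

import numpy as np
from scipy import cluster

np.random.seed(23)
X = np.random.randn(20, 4)
Z = cluster.hierarchy.ward(X)

k = 6
N = len(X)
row = N - k - 1                      # 0-based row whose merge leaves exactly k clusters
t = Z[row, 2]                        # distance of that merge (third column)

# Detect ties: if the next merge happens at the same distance,
# cutting at t will not give exactly k clusters.
if row + 1 < len(Z) and np.isclose(Z[row + 1, 2], t):
    print('tie detected; exactly k clusters may not be achievable at this threshold')

labels = cluster.hierarchy.fcluster(Z, t=t, criterion='distance')
print(len(np.unique(labels)), 'clusters')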