With SciPy how do I get clustering for k=? with doing hierarchical clustering - scipy

So I am using fastcluster with SciPy to do agglomerative clustering. I can do dendrogram to get the dendrogram for the clustering. I can do fcluster(Z, sqrt(D.max()), 'distance') to get a pretty good clustering for my data. What if I want to manually inspect a region in the dendrogram where say k=3 (clusters) and then I want to inspect k=6 (clusters)? How do I get the clustering at a specific level of the dendrogram?
I see all these functions with tolerances, but I don't understand how to convert from tolerance to number of clusters. I can manually build the clustering using a simple data set by going through the linkage (Z) and piecing the clusters together step by step, but this is not practical for large data sets.

If you want to cut the tree at a specific level, then use:
fl = fcluster(cl,numclust,criterion='maxclust')
where cl is the output of your linkage method and numclust is the number of clusters you want to get.

Hierarchical clustering allows you to zoom in and out to get fine or coarse grained views of the clustering. So, it might not be clear in advance which level of the dendrogram to cut. A simple solution is to get the cluster membership at every level. It is also possible to select the desired number of clusters.
import numpy as np
from scipy import cluster
np.random.seed(23)
X = np.random.randn(20, 4)
Z = cluster.hierarchy.ward(X)
cutree_all = cluster.hierarchy.cut_tree(Z)
cutree1 = cluster.hierarchy.cut_tree(Z, n_clusters=[5, 10])
print("membership at all levels \n", cutree_all)
print("membership for 5 and 10 clusters \n", cutree1)

Ok so let me propose one way. I don't think it is the right or best way, but at least it is a start.
Choose k we are interested in
Note that linkage Z has N-1 lists where N is the number of data points. The mth list entry will produce N-m clusters. Therefore grab the list in Z with index where k = N-m-1.
Grab the distance value which is the 3rd column of that list
Call fcluster with that particular distance as the tolerance (or perhaps the distance plus some really small delta).
The only problem with this is that there are ties, but really this is not a problem if you can detect that a tie has taken place.

Related

How to decide the numbers of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I also do not care about the number of resulting clusters nor the size of the clusters but only that the cluster centroids do not exceed a certain distance.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit, predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

Visualizing clusters using TSNE

I have a dataset which I need to cluster and display in a way wherein elements in the same cluster should appear closer together. The dataset is based out of a research study, and has around 16 rows(entries) and about 50 features. I do agree that its not an ideal dataset to begin with, but unfortunately thats is the situation on hand.
Following is the approach I took:
I first applied KMeans on the dataset after normalizing it.
In parallel I also tried to use TSNE to map the data into 2 dimensions and plotted them on a scatterplot. From my understanding of TSNE, that technique should already be placing items in same clusters closer to each other. When I look at the scatterplot, however, the clusters are really all over the place.
The result of the scatterplot can be found here: https://imgur.com/ZPhPjHB
Is this because TSNE and KMeans intrinsically work differently? Should I just do TSNE and try to label the clusters (and if so, how?) or should I be using TSNE output to feed into KMeans somehow?
I am really new in this space and advice would be greatly appreciated!
Thanks in advance once again
Edit: The same overlap happens if I first use TSNE to reduce dimensions to 2 and then use those reduced dimensions to cluster using KMeans
There is a difference between TSNE and KMeans. TSNE is used for visualization mostly and it tries to project points on the 2D/3D space (from bigger spaces) in order to keep distances (if in the bigger space 2 points were far away TSNE will try to show it).
So TSNE is not a real clustering. And that's why results you got that strange scatter plot.
For TSNE sometimes you need to apply PCA before but that is needed if your number of features is big. Just to speed-up calculations.
As already advised, try to use hierarchical clustering or simply generate more rows.
Apply tSNE and fit k-means is one of the basic things you can start from.
I would say consider using different f-divergence.
Stochastic Neighbor Embedding under f-divergences https://arxiv.org/pdf/1811.01247.pdf
This paper tries five different f- divergence functions : KL, RKL, JS, CH (Chi-Square), HL (Hellinger).
The paper goes over which divergence emphasize what in terms of precision and recall.

Result of overlapping clustering

I'm using function fcm from Matlab for overlapping clustering. The output of this function is a matrix of size kxn with k being the number of clusters and n being the number of examples.
Now my problem is that how do I choose clusters for an example? For each example, I have scores for all clusters so I can easily find the best matched cluster, but what about other clusters?
Many thanks.
It depends on the clustering algorithm, but you can probably interpret those soft clustering values as probabilities. This gives two well-founded options for extracting a hard clustering:
Sample each point's cluster from its cluster distribution (a column in your kxn matrix).
Assign each point to its most probable cluster. This corresponds to the MAP (max a posteriori) solution to the clustering problem.
Option 2 is probably the way to go - a single sample may not be a great representation of what's going on; with MAP, you're at least guaranteed to get something probable.

Cluster data into good and bad

I need to divide data points into those that are similar to each other("good" points) and everyone else("bad" points).
It looks like some kind of clustering problem and what do I do:
I am assuming that there are at least two "good" points.
Find pairwise distance between all types of points.
Find minimum distance (minDist).
Do hierarchical clustering for all points.
Make a cut at the height of 5*minDist.
Say that all points that are in the same cluster as pair with minDist and under that cut belong to the desired "good" cluster.
And this works pretty well, but if there are two points that are very close to each other. minDist is very small and this 5*minDist cut is also small => only these 2 points are in the desired "good" cluster.
I would think that either I need to change this approach completely and here is question number 1:
[1] "What methods do exist to separate similar points from everyone else?"
Or I need to modify this 5*minDist to some other function of minDist. And question is:
[2] "What may I choose as reasonable alternative to 5*minDist?"
Vladimir
Instead of doing clustering, you want to do outlier detection.
There are dozens of algorithms for this (see ELKI for a large collection). Some very basic methods may solve your problem:
The number of neighbors in radius r. If +i < threshold, the point is an outlier.
Distance to the k nearest neighbor. Choose k>1 to avoid these 2 element clusters you are seeing.
Also, DBSCAN clustering could work for you. Consider all clusters to be good, and only noise to be bad!

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data.
I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back a similar label or not. My X matrix will be N X M, where N is number of time series and M is data length as mentioned above.
Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use different distance metrics besides Euclidean distance.
To better illustrate my doubts, here is the code I modified for time series data:
% Check if second input is centroids
if ~isscalar(k)
c=k;
k=size(c,1);
else
c=X(ceil(rand(k,1)*n),:); % assign centroid randomly at start
end
% allocating variables
g0=ones(n,1);
gIdx=zeros(n,1);
D=zeros(n,k);
% Main loop converge if previous partition is the same as current
while any(g0~=gIdx)
% disp(sum(g0~=gIdx))
g0=gIdx;
% Loop for each centroid
for t=1:k
% d=zeros(n,1);
% Loop for each dimension
for s=1:n
D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2));
end
end
% Partition data to closest centroids
[z,gIdx]=min(D,[],2);
% Update centroids using means of partitions
for t=1:k
% Is this how we calculate new mean of the time series?
c(t,:)=mean(X(gIdx==t,:));
end
end
Time series are usually high-dimensional. And you need specialized distance function to compare them for similarity. Plus, there might be outliers.
k-means is designed for low-dimensional spaces with a (meaningful) euclidean distance. It is not very robust towards outliers, as it puts squared weight on them.
Doesn't sound like a good idea to me to use k-means on time series data. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
It's probably too late for an answer, but:
k-means can be used to cluster longitudinal data
Anony-Mousse is right, DWT distance is the way to go for time series
The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".
I have recently come across the kml R package which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.
Also the Time-series clustering - A decade review paper by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful to you to seek out alternatives. Another nice paper although somewhat dated is Clustering of time series data-a survey by T. Warren Liao.
If you did really want to use clustering, then dependent on your application you could generate a low dimensional feature vector for each time series. For example, use time series mean, standard deviation, dominant frequency from a Fourier transform etc. This would be suitable for use with k-means, but whether it would give you useful results is dependent on your specific application and the content of your time series.
I don't think k-means is the right way for it either. As #Anony-Mousse suggested you can utilize DTW. In fact, I had the same problem for one of my projects and I wrote my own class for that in Python. The logic is;
Create your all cluster combinations. k is for cluster count and n is for number of series. The number of items returned should be n! / k! / (n-k)!. These would be something like potential centers.
For each series, calculate distances for each center in each cluster groups and assign it to the minimum one.
For each cluster groups, calculate total distance within individual clusters.
Choose the minimum.
And, the Python implementation is here if you're interested.