K-Means clustering with low silhouette score but clear cluster pattern, what to believe

Thanks in advance! I'm pretty new to unsupervised learning.
I have a web usage dataset with zero-inflated values: the zeros simply mean a user didn't engage with a web feature, and usage is relatively low for the majority of my variables.
I clustered my data with K-Means, but the silhouette scores don't look very convincing: it looks like I only get one good cluster, and it's really big. The plots show the silhouette score for 2 to 5 clusters.
It looks like k=3 has the highest score, so I went ahead and clustered with k=3. When I draw a polar plot, the clusters look very clear.
I'm really confused: are my clustering results valid or not?
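For reference, a minimal sketch of the sweep described above, assuming scikit-learn and a numeric feature matrix; the Poisson draw is only a stand-in for zero-inflated usage counts:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(1000, 6)).astype(float)  # stand-in for zero-inflated usage data
X = StandardScaler().fit_transform(X)

# Compare the average silhouette score for k = 2..5, as in the question.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))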

Related

Self-organizing map: How to identify clusters from plots?

I've been learning about neural networks and most recently have been trying out different clustering methods. But unlike KNN, GMM, or DBSCAN, there isn't a feature (in MATLAB, that I'm aware of) that identifies the clusters for you. So I've been reading articles on how to interpret these plots, but I'm still confused. For my example, in the weight-positions plot I see one cluster. For the neighbor weight distances, I see one, maybe two clusters (yellow/bright = similar, red/dark = dissimilar). That seems to be confirmed when looking at the densities in the hits plot. There might be more, but honestly I can't tell (I'm new at this) because of the gradient instead of a solid boundary between clusters. How many clusters do you see, and what's your logic? Thank you.
net = selforgmap([5 5]);        % create a 5x5 self-organizing map
[net,tr] = train(net,x);        % train the map on the data x
figure, plotsomnd(net)          % neighbor weight distances
figure, plotsomhits(net,x)      % sample hits per neuron
figure, plotsompos(net,x)       % weight positions
You may construct a new paradigm in terms of what the SOM nodes represent, i.e. they produce a new dataset. The new dataset is independent of the original dataset. Nevertheless, it is arranged so that its underlying structure imitates that of the original dataset. Therefore, people often follow SOM with a clustering algorithm such as k-means, hierarchical clustering, etc. This can be regarded as follows: instead of clustering a huge amount of original data directly, the clustering is performed on a new version of the dataset which is smaller but still inherits the topology of the original. AFAIK, SOM is different from KNN in the sense that SOM is unsupervised whereas KNN is supervised.
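To make the "cluster the SOM nodes" idea concrete, here is a hedged sketch assuming the third-party minisom package and scikit-learn; the 5x5 grid mirrors the selforgmap([5 5]) call above, and the data is an illustrative stand-in:

import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in for the original dataset

som = MiniSom(5, 5, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)                # fit the 5x5 map

# Cluster the 25 codebook (prototype) vectors instead of the raw data.
codebook = som.get_weights().reshape(-1, X.shape[1])
node_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codebook)

# Each sample inherits the cluster of its best-matching unit.
sample_labels = [node_labels[i * 5 + j] for i, j in (som.winner(v) for v in X)]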

How to identify found clusters in Lumer Faieta Ant clustering

I have been experimenting with Lumer-Faieta clustering and I am getting promising results.
However, as the clusters formed, I wondered how to identify the final clusters. Do I run another clustering algorithm on the result (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster as the data points that dominate its neighborhood. The problem with this is that if a cluster is broken up, the fragments still share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step. The effect is as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and fairly uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both). Once the clusters have been identified on the grid, they can be projected back into the n-dimensional space.
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.
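As an illustration of this post-processing step, a minimal sketch assuming scikit-learn; positions is a hypothetical (n, 2) array of grid coordinates where the ants dropped the data points:

import numpy as np
from sklearn.cluster import DBSCAN

positions = np.array([[0, 0], [0, 1], [1, 0],          # one clump on the grid
                      [10, 10], [10, 11], [11, 10]])   # another clump

# eps and min_samples both set to 3, as in the answer above.
labels = DBSCAN(eps=3, min_samples=3).fit_predict(positions)
print(labels)   # -1 would mark noise; other integers are cluster ids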

Is there a maximum number of noise points/outliers in the DBSCAN algorithm?

I clustered spatial datasets using the DBSCAN algorithm and it produces a lot of noise: 193,000 of 250,000 points. Is that a reasonable amount?
Depends on your data and problem.
If I generate random coordinates, 100% noise would be appropriate because the data is random noise.
First, to address the question in your title: by making eps very large, it is easy to get no noise points, with all the points in one big cluster. By making eps very small, you can easily make all points noise points. In general, somewhere in between is what you are looking for. Your job is to find a value that produces a meaningful clustering. That is where the remark of @Anony-Mousse comes into play.
"Depends on your data and problem."
As he suggested, if you have uniform random data, maybe all noise is the best answer. If you have Gaussian random data, maybe one big cluster with a few outliers is good. But the clustering is supposed to help you understand the structure of your data.
What happens as you change eps? From your current clustering with many noise points, what happens as you gradually increase eps? Does it gradually add a few noise points into the existing clusters? Is there some place where two clusters get merged into one? Is there some place where there is a sudden change in the number of clusters?
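A quick way to run this experiment, as a hedged sketch assuming scikit-learn; the two-blob data is a synthetic stand-in for the spatial dataset:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])

# Sweep eps and watch the number of clusters and noise points change.
for eps in [0.1, 0.2, 0.5, 1.0, 2.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")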
Also, can you interpret the clusters in terms of your variables? Perhaps the difference between two clusters is that in one, all the values of some variable are low, while in the other they are high. Considering whatever problem you are trying to solve, do the clusters divide the data into meaningful groups? Try to use the clusterings to find meaning in your data.

CurveRep in Hmisc for clustering longitudinal curves based on 3 time points

I am working on the following project and am exploring the CurveRep() clustering approach provided by Hmisc. (CurveRep clusters individual subjects' longitudinal growth curves according to similar patterns, based on the CLARA clustering algorithm.) As I haven't found any publication using CurveRep() and there is generally very little discussion about it on the internet, I would be grateful if you could let me know your experience with it or what you think about it!
- My project: I have about 200 metabolites measured in n = 500 subjects at three time points (0, 30, 120 min). Individual time courses vary quite a bit, but in spaghetti plots there appear to be groups (e.g. straight and flat curves, peak-shaped curves, valley-shaped curves). I would like to cluster these curves into two or three representative time courses and would then fit a curve-specific regression model for each cluster. CurveRep() seems to be exactly what I am looking for, and it produces acceptable cluster solutions (although the solutions are based more on different y-axis intercepts than on different growth patterns).
Is it any good? Are there alternative clustering algorithms that group according to similar longitudinal change (e.g., cluster 1 = "linear rising", cluster 2 = "valley-shaped")?
Thanks a lot!
Chris
Three time points is too few for time-series methods to work for you. Look at DTW: it is designed for much higher temporal resolution.
Clustering algorithms such as k-means, PAM, and CLARA could work for you. Look at the cluster centers.
It may be necessary to preprocess your data more carefully.
If you are interested in change rather than absolute values, encode your data accordingly. For example,
(x1, x2, x3) -> (x2 - x1, x3 - x2)
or
(x1, x2, x3) -> (x1 - mu, x2 - mu, x3 - mu), with mu = (x1 + x2 + x3) / 3.
This will make the clustering results more likely to match your motivation.
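A small sketch of these two encodings in plain NumPy; X is a hypothetical (n_subjects, 3) array holding the 0/30/120-minute values:

import numpy as np

X = np.array([[1.0, 2.0, 1.5],
              [0.5, 0.6, 0.7]])

changes = np.diff(X, axis=1)                   # (x2 - x1, x3 - x2)
centered = X - X.mean(axis=1, keepdims=True)   # (x1 - mu, x2 - mu, x3 - mu)

# Cluster `changes` or `centered` with k-means/PAM/CLARA instead of X.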

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have a plot of an image's points/pixels in RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques and maybe I'm not doing things correctly. I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: sorry for the low-res images; these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add to it, since there are some issues with the current answers.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two common methods for handling such data, as they can handle non-spherical clusters. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will either put everything into one cluster or make everything its own cluster, since DBSCAN assumes roughly uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to come from one central lump: that will be the area of highest density toward which the points shift, so they will converge into one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well in it. Which clustering algorithm you should use will then likely change in the new feature space, but hopefully it will make the problem easier to handle.
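One possible color-space change, as a hedged sketch assuming scikit-image is available; the image here is random stand-in data:

import numpy as np
from skimage.color import rgb2lab

img = np.random.default_rng(0).random((64, 64, 3))   # stand-in RGB image in [0, 1]
lab = rgb2lab(img)                                   # same shape, CIELAB channels
pixels = lab.reshape(-1, 3)                          # feature vectors for clustering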
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian to each cluster with a non-spherical covariance matrix.
Basically, you will follow the same expectation-maximization (EM) steps as in k-means, with the only exception that you will model and fit the covariance matrix as well.
Here's an outline of the algorithm:
1) Init: assign each point at random to one of the k clusters.
2) For each cluster, estimate the mean and covariance.
3) For each point, estimate its likelihood of belonging to each cluster. Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster as encoded by the covariance matrix.
4) Repeat steps 2 and 3 until convergence or until a predefined number of iterations is exceeded.
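The outline above is essentially a Gaussian mixture model fitted by EM. A minimal sketch, assuming scikit-learn's GaussianMixture with full covariance matrices; the stretched blobs are synthetic stand-ins for elongated color clusters:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
a = rng.normal(0, 1, (300, 3)) * [5, 1, 1]    # blob elongated along the first axis
b = rng.normal(10, 1, (300, 3)) * [1, 5, 1]   # blob elongated along the second axis
pixels = np.vstack([a, b])                    # stand-in for RGB points

# Full covariance lets each component fit an elongated (non-spherical) shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(pixels)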
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.