Hi, I am doing a Python course from DataCamp. I have difficulty understanding dendrograms and clustering.
Intermediate clusterings
Question: Displayed on the right is the dendrogram for the hierarchical clustering of the grain samples that you computed earlier. If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?
How do we have 3 clusters if we stop at a height of 6?
Because three branches of the dendrogram remain below a horizontal line drawn at height 6; each branch the line cuts through corresponds to one cluster.
This apparently is a class question. It may be better to ask in class: others may have the same problem understanding it, and your teacher needs feedback on what you find difficult!
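If it helps to see it in code, here is a minimal sketch with scipy (not the DataCamp grain data; the points and the cut height are made up so that exactly three branches cross the line):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# three well-separated blobs standing in for the grain samples
points = np.vstack([rng.normal(loc, 0.5, size=(10, 2))
                    for loc in ([0, 0], [10, 0], [0, 10])])

Z = linkage(points, method='complete')     # the merge heights

# "stopping at height 6" = drawing a horizontal line at 6 on the dendrogram
labels = fcluster(Z, t=6, criterion='distance')
print(len(np.unique(labels)))              # 3 branches cross the line -> 3 clusters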
Related
I have been experimenting with Lumer-Faieta clustering and I am getting promising results:
However, as clusters formed I was wondering how to identify the final clusters? Do I run another clustering algorithm to identify the clusters (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster as the data points that dominate its neighborhood. The problem with this is that if clusters are broken up, they share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step. The effect is as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both values). Once the clusters have been identified, they can be projected back to the n-dimensional space.
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.
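For reference, a hedged sketch of that post-processing step with scikit-learn's DBSCAN; grid_positions is a stand-in for the (x, y) grid cell of each item after the ant clustering:

import numpy as np
from sklearn.cluster import DBSCAN

# placeholder grid positions: two tight groups of occupied cells
grid_positions = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                           [10, 10], [10, 11], [11, 10], [11, 11]])

# eps = 3 and min_samples = 3, as in the answer above
labels = DBSCAN(eps=3, min_samples=3).fit_predict(grid_positions)
print(labels)   # -1 would mean noise; the other integers are the final cluster ids,
                # which can then be mapped back to the original n-dimensional items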
I have a dataset which I need to cluster and display in a way wherein elements in the same cluster should appear closer together. The dataset comes from a research study, and has around 16 rows (entries) and about 50 features. I do agree that it's not an ideal dataset to begin with, but unfortunately that is the situation at hand.
Following is the approach I took:
I first applied KMeans on the dataset after normalizing it.
In parallel I also tried to use TSNE to map the data into 2 dimensions and plotted the result on a scatterplot. From my understanding of TSNE, that technique should already place items in the same cluster closer to each other. When I look at the scatterplot, however, the clusters are really all over the place.
The result of the scatterplot can be found here: https://imgur.com/ZPhPjHB
Is this because TSNE and KMeans intrinsically work differently? Should I just do TSNE and try to label the clusters (and if so, how?) or should I be using TSNE output to feed into KMeans somehow?
I am really new in this space and advice would be greatly appreciated!
Thanks in advance once again
Edit: The same overlap happens if I first use TSNE to reduce dimensions to 2 and then use those reduced dimensions to cluster using KMeans
There is a difference between TSNE and KMeans. TSNE is mostly used for visualization: it tries to project points from a higher-dimensional space onto a 2D/3D space while preserving relative distances (if two points were far apart in the original space, TSNE will try to show that).
So TSNE is not a real clustering method, and that's why you got that strange scatter plot.
For TSNE you sometimes need to apply PCA first, but that is only needed if your number of features is big, just to speed up the calculations.
As already advised, try to use hierarchical clustering or simply generate more rows.
Applying t-SNE and then fitting k-means on the embedding is one of the basic things you can start from.
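A rough sketch of that with scikit-learn, where X stands in for the normalized 16 x 50 dataset (perplexity has to be small because there are only 16 rows):

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 50))        # placeholder for the normalized dataset

# perplexity must be smaller than the number of rows (only 16 here)
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
print(labels)

Keep in mind that t-SNE distorts global distances, so clusters found on the embedding are a visual aid rather than a definitive grouping, and with only 16 rows any result will be fragile.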
I would also suggest considering a different f-divergence.
Stochastic Neighbor Embedding under f-divergences https://arxiv.org/pdf/1811.01247.pdf
This paper tries five different f-divergence functions: KL, RKL, JS, CH (Chi-square), and HL (Hellinger).
The paper goes over which divergence emphasizes what in terms of precision and recall.
I am working on the following project and am exploring the CurveRep() clustering approach provided by Hmisc. (CurveRep clusters individual subjects' longitudinal growth curves according to similar patterns based on the CLARA clustering algorithm). As I haven't found any publication using CurveRep() and generally very little discussion about it on the internet, I would be grateful if you could let me know your experience with it or what you think about it!
- My project: I have about 200 metabolites measured in n=500 subjects at three time points (0,30,120min). Individual time courses vary quite a bit, but in Spaghetti plots, there appear to be groups (e.g. straight & flat curves, peak-shaped curves, valley-curves). I would like to cluster these curves into two or three representative time courses and would then fit a curve-specific regression model for each cluster. CurveRep() seems exactly what I am looking for and it produces acceptable cluster solutions (although solutions are more based on different y-axis intersections rather than different growth patterns).
Is it any good? Are there alternative clustering algorithms that group according to similar longitudinal change (e.g., cluster 1 = "linear rising", cluster 2 = "valley-shaped")?
Thanks a lot!
Chris
Three time points are too few for the time-series methods to work for you. Look at DTW: it is designed for much higher resolution.
Clustering algorithms such as k-means, PAM and CLARA could work for you. Look at the cluster centers.
It may be necessary to preprocess your data more carefully.
If you are interested in change instead of absolute values, encode your data accordingly. For example,
x1, x2, x3 -> x2-x1, x3-x2
or
x1,x2,x3 -> x1-mu,x2-mu,x3-mu with mu=(x1+x2+x3)/3
This will make the clustering results more likely to match your motivation.
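A small numpy sketch of the two encodings above (the three example curves are invented):

import numpy as np

# each row: the three time points (0, 30, 120 min) for one subject
X = np.array([[1.0, 2.0, 3.0],    # roughly linear rise
              [1.0, 5.0, 1.0],    # peak shape
              [5.0, 1.0, 5.0]])   # valley shape

diffs = np.diff(X, axis=1)                      # x2-x1, x3-x2
centred = X - X.mean(axis=1, keepdims=True)     # x1-mu, x2-mu, x3-mu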
Can I use k-means algorithm for a single attribute?
Is there any relationship between the attributes and the number of clusters?
I have one attribute, performance, and I want to classify the data into 3 clusters: poor, medium, and good.
Is it possible to create 3 clusters with one attribute?
K-means is useful when you have an idea of how many clusters actually exist in your space. Its main benefit is its speed. There is also a relationship between the number of attributes and the number of observations in your dataset.
Sometimes a dataset can suffer from the curse of dimensionality, where the number of variables/attributes is much greater than the number of observations. Basically, in high-dimensional spaces with few observations, it becomes difficult to separate the observations.
You can certainly have three clusters with one attribute. Consider a quantitative attribute with 7 observations:
1
2
100
101
500
499
501
Notice there are three clusters in this sample, centered at 1.5, 100.5, and 500.
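For what it's worth, running scikit-learn's k-means on exactly those seven values recovers those centers (note the reshape, since the library expects a 2D array even for a single attribute):

import numpy as np
from sklearn.cluster import KMeans

x = np.array([1, 2, 100, 101, 500, 499, 501], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(sorted(km.cluster_centers_.ravel()))   # roughly [1.5, 100.5, 500.0]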
If you have one dimensional data, search stackoverflow for better approaches than k-means.
K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.
One-dimensional data is ordered. If you sort your data (or it even is already sorted), it can be processed much more efficiently than with k-means. Complexity of k-means is "just" O(n*k*i), but if your data is sorted and 1-dimensional you can actually improve k-means to O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...
Plus, for 1-dimensional data there are a lot of statistics you can use that are not well researched or tractable in higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks natural breaks optimization.
However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?
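A rough sketch of the kernel density idea mentioned above, using scipy; the scores and the three underlying groups are made up, and the local minima of the estimated density serve as the poor/medium/good thresholds:

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmin

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(20, 3, 50),   # "poor"
                         rng.normal(50, 3, 50),   # "medium"
                         rng.normal(80, 3, 50)])  # "good"

grid = np.linspace(scores.min(), scores.max(), 500)
density = gaussian_kde(scores)(grid)

cuts = grid[argrelmin(density)[0]]    # local density minima = thresholds
labels = np.digitize(scores, cuts)    # 0 = poor, 1 = medium, 2 = good
print(cuts)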
As others have answered already, k-means requires prior information about the count of clusters. This may appear to be not very helpful at the start. But, I will cite the following scenario which I worked with and found to be very helpful.
Color segmentation
Think of a picture with 3 channels of information (red, green, blue). You want to quantize the colors into 20 different bands for the purpose of dimensionality reduction. We call this vector quantization.
Every pixel is a 3 dimensional vector with Red, Green and Blue components. If the image is 100 pixels by 100 pixels then you have 10,000 vectors.
R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
.
.
.
Depending on the type of analysis you intend to perform, you may not need all the R,G,B values. It might be simpler to deal with an ordinal representation.
In the above example, the RGB values might be assigned a flat integral representation
R,G,B
128,100,20 => 1
120,9,30 => 2
255,255,255=> 3
128,100,20 => 1
120,9,30 => 2
You run the k-means algorithm on these 10,000 vectors and specify 20 clusters. Result: you have reduced your image colors to 20 broad buckets. Obviously some information is lost. However, the intuition for this loss being acceptable is that when the human eye gazes over a patch of green meadow, we are unlikely to register all 16 million RGB colours.
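A minimal sketch of that quantization with scikit-learn (this is not the code from the video; img is a placeholder for the actual picture):

import numpy as np
from sklearn.cluster import KMeans

# placeholder for the real picture: an (H, W, 3) uint8 RGB image
img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

pixels = img.reshape(-1, 3).astype(float)    # the 10,000 (R, G, B) vectors

km = KMeans(n_clusters=20, n_init=4, random_state=0).fit(pixels)

# replace every pixel by the centre of its bucket -> a 20-colour image
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)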
YouTube video
https://www.youtube.com/watch?v=yR7k19YBqiw
I have embedded key pictures from this video for your understanding. Attention! I am not the author of this video.
Original image
After segmentation using K means
Yes, it is possible to use clustering with a single attribute.
No, there is no known relation between the number of clusters and the attributes. However, some studies suggest taking the number of clusters as k ≈ sqrt(n/2), where n is the total number of items. This is just one rule of thumb; different studies have suggested different cluster numbers. The best way to determine the number of clusters is to select the one that minimizes intra-cluster distance and maximizes inter-cluster distance. Having background knowledge is also important.
The problem you are describing with the performance attribute is more a classification problem than a clustering problem; see:
Difference between classification and clustering in data mining?
With only one attribute, you don't need to do k-means. First, I would like to know if your attribute is numerical or categorical.
If it's numerical, it would be easier to set up two thresholds. And if it's categorical, things get even easier: just specify which classes belong to poor, medium or good. Then simple data frame operations will do the job.
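For illustration, a small pandas sketch of both cases; the column name, cut points and class mapping are invented:

import pandas as pd

df = pd.DataFrame({"performance": [12, 55, 78, 34, 91, 60]})

# numerical attribute: two thresholds give the three groups directly
df["level"] = pd.cut(df["performance"],
                     bins=[-float("inf"), 40, 70, float("inf")],
                     labels=["poor", "medium", "good"])

# categorical attribute: just map each class to its group, e.g.
# df["level"] = df["grade"].map({"F": "poor", "C": "medium", "A": "good"})
print(df)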
Feel free to send me comments if you are still confused.
Rowen
Let me explain what I'm trying to do.
I have plot of an Image's points/pixels in the RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques and maybe I'm not doing things correctly: I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add a few points, since there are some issues with the current answers.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for such data, as they can handle non-spherical clusters. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will either put everything into one cluster or make everything its own cluster, since DBSCAN assumes uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well under it. Which clustering algorithm you should use will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
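One possible way to try this, assuming scikit-image is available (the image here is a placeholder): convert to CIELAB, whose distances track perceived colour differences better than raw RGB, and cluster in that space instead.

import numpy as np
from skimage import color

img = np.random.rand(100, 100, 3)      # placeholder RGB image, values in [0, 1]
lab = color.rgb2lab(img)               # convert to L*a*b*
features = lab.reshape(-1, 3)          # feed these vectors to the clusterer instead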
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian to each cluster with a non-spherical covariance matrix.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline for the algorithm
1. Init: assign each point at random to one of the k clusters.
2. For each cluster, estimate its mean and covariance.
3. For each point, estimate its likelihood of belonging to each cluster. Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster as encoded by the covariance matrix.
4. Repeat stages 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
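The outline above is essentially a Gaussian mixture model with full covariance matrices. If you can step outside MATLAB, a sketch with scikit-learn's GaussianMixture looks like this (the pixel data is a placeholder):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pixels = rng.normal(size=(10000, 3))             # placeholder for the RGB vectors

gmm = GaussianMixture(n_components=3,            # k clusters
                      covariance_type='full',    # non-spherical covariances
                      max_iter=100,              # pre-defined iteration cap
                      random_state=0).fit(pixels)
labels = gmm.predict(pixels)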
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.
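A hedged sketch of that last suggestion, appending (scaled) pixel coordinates to the colour vectors before a density-based clustering; the image, eps and scaling factors are all made up and would need tuning on real data:

import numpy as np
from sklearn.cluster import DBSCAN

img = np.random.rand(50, 50, 3)                   # placeholder RGB image
h, w, _ = img.shape
ys, xs = np.mgrid[0:h, 0:w]

# 5-d feature per pixel: R, G, B plus scaled row/column, so that colour
# still dominates the distance
feats = np.column_stack([img.reshape(-1, 3),
                         ys.ravel() / h,
                         xs.ravel() / w])

labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(feats)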