Visualizing clusters using TSNE - cluster-analysis

I have a dataset that I need to cluster and display so that elements in the same cluster appear close together. The dataset comes from a research study and has around 16 rows (entries) and about 50 features. I agree that it's not an ideal dataset to begin with, but unfortunately that is the situation at hand.
Following is the approach I took:
I first applied KMeans on the dataset after normalizing it.
In parallel I also tried to use TSNE to map the data into 2 dimensions and plotted them on a scatterplot. From my understanding of TSNE, that technique should already be placing items in same clusters closer to each other. When I look at the scatterplot, however, the clusters are really all over the place.
The result of the scatterplot can be found here: https://imgur.com/ZPhPjHB
Is this because TSNE and KMeans intrinsically work differently? Should I just do TSNE and try to label the clusters (and if so, how?) or should I be using TSNE output to feed into KMeans somehow?
I am really new in this space and advice would be greatly appreciated!
Thanks in advance once again
Edit: The same overlap happens if I first use TSNE to reduce dimensions to 2 and then use those reduced dimensions to cluster using KMeans

TSNE and KMeans do different jobs. TSNE is mostly used for visualization: it projects points from a high-dimensional space into 2D/3D while trying to keep each point's nearest neighbours close together. Distances between far-apart groups in the embedding are much less reliable.
So TSNE is not a clustering algorithm, and that is why your KMeans labels can look scattered on the TSNE plot.
Sometimes PCA is applied before TSNE, but that is only needed when the number of features is large, just to speed up the calculations.
As already advised, try hierarchical clustering, or simply get more rows.
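To make the division of labour concrete, here is a minimal sketch assuming scikit-learn: KMeans runs on the scaled features, and TSNE is used only to get 2-D coordinates for the plot. The random data, `n_clusters=3` and `perplexity=5` are placeholders of mine, not values from the question. With only 16 rows, perplexity should be well below the number of samples, so the default of 30 is not a sensible choice here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the 16 x 50 study dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 50))

X_scaled = StandardScaler().fit_transform(X)

# Cluster in the original (scaled) feature space; 3 clusters is an arbitrary choice.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# t-SNE is used only for plotting coordinates, with a perplexity far below 16.
embedding = TSNE(n_components=2, perplexity=5, init="pca",
                 random_state=0).fit_transform(X_scaled)

# embedding[:, 0] and embedding[:, 1] can now go into a scatter plot
# coloured by `labels`.
```

If the colours still look mixed in the plot, that is not necessarily a bug: the embedding and the clustering optimise different things.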

Applying t-SNE and then fitting k-means on the embedding is one of the basic things you can start with.
I would also consider using a different f-divergence.
Stochastic Neighbor Embedding under f-divergences: https://arxiv.org/pdf/1811.01247.pdf
The paper tries five f-divergences: KL, RKL, JS, CH (Chi-Square), HL (Hellinger).
It also goes over which divergence emphasizes what in terms of precision and recall.

Related

Removing outliers with PCA in a multidimensional (100+) cluster problem

I have two dataframes that I need to cluster, and I am trying to do the following:
1. Apply PCA to remove outliers and use PCA with 3 components to visualize the result. For the outlier removal step I keep a total explained variance of 97.5%.
2. Inverse transform and compute the MSE between the inverse-transformed dataframes and the original ones.
3. Use the IQR upper fence on the calculated MSE to remove the outliers.
4. Apply PCA with 3 components to visualize and determine the number of clusters on the new dataframe.
My main issues are:
Is the IQR on MSE a good criterion for removal?
I have limited it to the upper fence since we are working with absolute values. If not, and I am mixing concepts, what would be a good criterion for this type of transformation?
Or should I drop PCA and use another outlier detection method, and if so, which?
Ultimately, I still see points very far from the clusters in the x,y,z plot. Does this mean they aren't outliers, just a few scattered points that form a small cluster? Or is the outlier detection not effective?
Finally, on the second dataframe the 3D visualization covers only roughly 40% of the explained variance. Is it fair to apply the same decision-making process there?
The pca library provides functionality for visualization, outlier detection, and exploring explained variance. In general, Hotelling's T2 test and SPE/DmodX are the techniques used to detect outliers with PCA. A previous post on outlier detection can be found here: https://stackoverflow.com/a/63043840/13730780
More generally, how to detect outliers depends on the type of data you have (continuous, categorical, one-hot, mixed) and on whether you want or need to include context. If your approach is based on clustering, you can try the clusteval library, which includes methods such as DBSCAN.
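For reference, the reconstruction-error procedure described in the question can be sketched with plain NumPy. This is not the pca library's API and not Hotelling's T2 or SPE as that library implements them. The function name, the synthetic data and the `k=1.5` fence multiplier are my own assumptions.

```python
import numpy as np

def pca_reconstruction_outliers(X, var_kept=0.975, k=1.5):
    """Flag rows whose PCA reconstruction MSE exceeds Q3 + k * IQR."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest number of components reaching the requested explained variance.
    n_comp = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    W = Vt[:n_comp]
    X_hat = Xc @ W.T @ W                      # project, then inverse transform
    mse = np.mean((Xc - X_hat) ** 2, axis=1)  # per-row reconstruction error
    q1, q3 = np.percentile(mse, [25, 75])
    return mse > q3 + k * (q3 - q1), mse, n_comp

# Synthetic example (an assumption): data near a 2-D plane inside 10-D space,
# with the first three rows pushed off that plane.
rng = np.random.default_rng(0)
basis = np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=float)
X = rng.normal(size=(200, 2)) @ basis + rng.normal(scale=0.01, size=(200, 10))
X[:3] += rng.normal(scale=0.5, size=(3, 10))
flags, mse, n_comp = pca_reconstruction_outliers(X)
```

One caveat: the outliers are included when the components are fitted, so a very extreme point can pull a component towards itself and end up with a small reconstruction error. That may be one reason far-away points survive the filter.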

Self-organizing map: How to identify clusters from plots?

I've been learning about neural networks and most recently trying out different clustering methods. But unlike KNN, GMM, or DBSCAN, there isn't a feature (in MATLAB, that I'm aware of) that identifies the clusters for you. So I've been reading articles on how to interpret these plots, but I'm still confused. For my example, in the weight positions plot, I see one cluster. For the neighbor weight differences, I see one, maybe two clusters (yellow/bright: similar, red/dark: dissimilar). That seems to be confirmed by the densities in the hits plot. There might be more, but I honestly can't tell (I'm new at this) because there is a gradient instead of a solid boundary between clusters. How many clusters do you see, and what's your logic? Thank you
net = selforgmap([5 5]);     % 5-by-5 map
[net,tr] = train(net,x);
figure, plotsomnd(net)       % neighbor weight distances
figure, plotsomhits(net,x)   % sample hits per node
figure, plotsompos(net,x)    % weight positions
One way to look at it is that the SOM nodes form a new dataset. This new dataset is separate from the original one, but it is arranged so that its underlying structure imitates that of the original data. That is why people often run a clustering algorithm such as K-means or hierarchical clustering on the SOM afterwards. Instead of clustering the huge original dataset directly, the clustering is performed on a smaller version of it that still inherits the topology of the original. AFAIK, SOM differs from KNN in that SOM is unsupervised whereas KNN is supervised.

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which images should end up together if clustered.
Each image is represented by a 73-dimensional feature vector, and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points to reduce their dimension from 73 to 3 for my work and applied the same kmeans function to them.
I want to compare the results obtained on the two data sets (processed and unprocessed) with the same k-means function, and find out whether the processing that reduced the dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one measure for comparison. However, I am not sure I can directly compare my results (within-cluster sum of distances etc.), as the two cases have different dimensions. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison that I can make?
I can think of three options. I am unaware of any well-developed methodology for doing this specifically with K-means clustering.
1. Look at the confusion matrices between the two approaches.
2. Compare the Mahalanobis distances between the clusters, and between the items in a cluster and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3D distances and 73D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on K-means if you are set on that path. Rank speculation is fun, but standing on the shoulders of giants is better.
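Since the true identity of every image is known, the confusion-matrix option does not depend on the dimension of the feature space at all. A rough sketch with NumPy follows. The function names and the toy labels are made up for illustration, and purity is just one simple summary of the table, one I picked for the example.

```python
import numpy as np

def contingency(true_labels, cluster_labels):
    """Rows: true classes, columns: cluster ids, entries: counts."""
    true_ids, t = np.unique(true_labels, return_inverse=True)
    clus_ids, c = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((len(true_ids), len(clus_ids)), dtype=int)
    np.add.at(table, (t, c), 1)
    return table

def purity(table):
    """Fraction of points that fall in the majority class of their cluster."""
    return table.max(axis=0).sum() / table.sum()

# Toy labels (an assumption): 3 people, 2 images each.
true = [0, 0, 1, 1, 2, 2]
run_a = [2, 2, 0, 0, 1, 1]   # perfect up to a renaming of the clusters
run_b = [0, 0, 0, 1, 1, 1]   # mixes person 1 across two clusters

purity_a = purity(contingency(true, run_a))
purity_b = purity(contingency(true, run_b))
```

Running this once with the cluster labels from the 73-D data and once with those from the 3-D data gives two numbers on the same scale.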

Clustering of 3D points

I have a large dataset of around 20 million points (x,y,z) in a 3-dimensional space. I know these points are organized in dense regions, but that these regions vary in size. I think a standard unsupervised 3D clustering should solve my problem.
Since I can't estimate the number of clusters a priori, I tried using k-means with a wide range for k, but it is slow and also, I would have to estimate how significant each k-partition is.
Basically, my question is: how can I extract the most significant partition of my points into clusters?
k-means is probably not the best algorithm for such data.
DBSCAN should be closer to your intuition of dense regions.
Try on a sample first, then figure out how to scale up.
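A minimal sketch of the "sample first" idea, assuming scikit-learn's DBSCAN. The synthetic blobs stand in for the real points, and `eps` and `min_samples` are guesses that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (an assumption): two dense regions
# of different size plus sparse background points.
blob_a = rng.normal(loc=(0, 0, 0), scale=0.3, size=(5000, 3))
blob_b = rng.normal(loc=(10, 10, 10), scale=1.0, size=(5000, 3))
background = rng.uniform(-20, 30, size=(500, 3))
points = np.vstack([blob_a, blob_b, background])

# Work on a random subset before touching the full dataset.
sample_idx = rng.choice(len(points), size=2000, replace=False)
sample = points[sample_idx]

labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(sample)

# Label -1 marks points DBSCAN considers noise.
n_clusters = len(set(labels.tolist()) - {-1})
```

Note that a single `eps` sets one density threshold for all regions, so regions of very different density may need different settings or a different density-based method.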
It is not clear to me from the above if you're going to use k-means or not, but if you are, you should be following the responses from the post below which shows how to measure variance of the clusters.
Calculating the percentage of variance measure for k-means?
Additionally, you can get a good fit using 'the elbow method' by trying values of k from 2 to 15. See the answer from Amro for the process on this.
One simple idea in this case is to use 3 different clusterings, along each dimension. That might speed things up.
So you find clusters along X axis (project all the points down to X axis) and then continue to form sub clusters along the Y axis and then along the Z axis.
I think 1-D k-means can be solved very efficiently using dynamic programming http://www.sciencedirect.com/science/article/pii/0025556473900072.
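To illustrate the 1-D case, here is a plain dynamic program over the sorted values. It is a simple O(k * n^2) sketch of my own, not necessarily the formulation in the linked paper, and it would need a faster variant for millions of points.

```python
def kmeans_1d(values, k):
    """Optimal 1-D k-means by dynamic programming over sorted values."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i, j):
        """Sum of squared deviations of xs[i:j] from its mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting xs[:j] into c contiguous clusters.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    cut[c][j] = i

    # Walk the cut table backwards to recover the clusters.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return clusters, cost[k][n]

clusters, total_cost = kmeans_1d([1, 2, 3, 10, 11, 12, 50], 3)
```

This works because in one dimension every optimal cluster is a contiguous run of the sorted values, so only the cut positions have to be searched.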

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have a plot of an image's pixels in RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques, so maybe I'm not doing things correctly. I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look like:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add since there are some issues with the current answers.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non-spherical clusters. However, your data appears to have one large central clump that spreads outwards in a few directions.
For DBSCAN, this means it will either put everything into one cluster or make every point its own cluster, because DBSCAN assumes uniform density and requires clusters to be separated by some margin.
MeanShift will likely have difficulty because everything seems to come from one central lump. That will be the area of highest density, so the points will shift toward it and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well in it. The clustering algorithm you should use will then likely change in the new feature space, but hopefully it will make the problem easier to handle.
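As a small illustration of what a change of color space can do, the sketch below converts RGB pixels to HSV with Python's standard colorsys module. HSV is only my example here, since the answer does not name a target space, and the sample pixels are made up. The point is that a bright and a dark shade of the same color, which sit far apart along a streak in RGB, may end up sharing a hue value.

```python
import colorsys

def rgb_pixels_to_hsv(pixels):
    """Convert (r, g, b) tuples in 0..255 to (h, s, v) tuples in 0..1."""
    return [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for r, g, b in pixels]

# Made-up pixels: bright red, dark red, and blue.
pixels = [(255, 0, 0), (128, 0, 0), (0, 0, 255)]
hsv = rgb_pixels_to_hsv(pixels)

# The two reds share the same hue and differ only in value (brightness).
bright_red, dark_red, blue = hsv
```

Keep in mind that hue is circular (0 and 1 are the same color), which plain Euclidean distance does not respect.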
k-means basically assumes that clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian with a non-spherical covariance matrix to each cluster.
Basically, you will follow the same expectation-maximization (EM) steps as in k-means, with the only exception that you will also model and fit the covariance matrix.
Here's an outline of the algorithm:
1. init: assign each point at random to one of k clusters.
2. For each cluster, estimate its mean and covariance.
3. For each point, estimate its likelihood of belonging to each cluster.
   Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster, as encoded by the covariance matrix.
4. Repeat stages 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
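The outline above can be sketched in NumPy roughly as follows. One deliberate change, which is my own: instead of a random assignment, the sketch starts from means that are far apart and from spherical covariances, because that is deterministic and tends to break symmetry faster. The function names and the synthetic streaks are assumptions for illustration only.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic seeding: start far from the data mean, then spread out."""
    centers = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def fit_gmm(X, k, n_iter=50, reg=1e-6):
    """EM for a Gaussian mixture with one full covariance per component."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    means = farthest_point_init(X, k)
    covs = np.array([np.eye(d) * X.var(axis=0).mean() for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log-density of each point under each component
        # (the constant shared by all components is dropped).
        log_p = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(covs[j]), diff)
            _, logdet = np.linalg.slogdet(covs[j])
            log_p[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return resp.argmax(axis=1), means, covs

rng = np.random.default_rng(0)

def streak(center, direction, n=300):
    """Synthetic elongated cluster: long along `direction`, thin across it."""
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    normal = np.array([-direction[1], direction[0]])
    return (np.asarray(center, dtype=float)
            + rng.normal(0, 3.0, (n, 1)) * direction
            + rng.normal(0, 0.3, (n, 1)) * normal)

X = np.vstack([streak((0, 0), (1, 1)), streak((20, 0), (1, -1))])
labels, means, covs = fit_gmm(X, k=2)
```

EM only finds a local optimum, so on real pixel data the result can depend heavily on the starting point, and several restarts may be needed.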
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.
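Adding pixel coordinates could look something like this with NumPy. The function name and the `spatial_weight` parameter are hypothetical; the weight just controls how much position counts relative to color.

```python
import numpy as np

def pixels_with_coords(image, spatial_weight=1.0):
    """Turn an H x W x C image into rows of [color channels..., x, y]."""
    h, w, c = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2) * spatial_weight
    colours = image.reshape(-1, c).astype(float)
    return np.hstack([colours, coords])

# Made-up 4 x 5 black RGB image, only to show the resulting shape.
image = np.zeros((4, 5, 3), dtype=np.uint8)
features = pixels_with_coords(image)
```

The resulting rows can be fed to any of the clustering methods mentioned here, so that nearby pixels of similar color are grouped together.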