Self-organizing map: How to identify clusters from plots? - matlab

I've been learning about neural networks and most recently have been trying out different clustering methods. But unlike KNN, GMM, or DBSCAN, there isn't a feature (in MATLAB that I'm aware of) that identifies clusters for you. So I've been reading articles on how to interpret these plots, but I'm still confused. For my example, in the weight positions plot, I see one cluster. For the neighbor weight differences, I see one, maybe two clusters (yellow/bright - similar, red/dark - dissimilar). That seems to be confirmed when looking at the densities in the hits plot. There might be more, but honestly I can't tell (I'm new at this) because of the gradient instead of a solid boundary between clusters. How many clusters do you see, and what's your logic? Thank you
net = selforgmap([5 5]);        % 5-by-5 SOM grid
[net,tr] = train(net,x);        % x: input data, one column per sample
figure, plotsomnd(net)          % neighbor weight distances
figure, plotsomhits(net,x)      % sample hits per neuron
figure, plotsompos(net,x)       % weight positions

You can think of the SOM nodes as producing a new dataset: the codebook (weight) vectors. This new dataset is distinct from the original dataset; nevertheless, it is arranged so that its underlying structure imitates that of the original data. Therefore, people often follow SOM with a clustering algorithm such as k-means, hierarchical clustering, etc. This can be regarded as: instead of clustering a huge amount of original data directly, the clustering procedure is performed on a new, smaller version of the dataset that still inherits the topology of the original. AFAIK, SOM is different from KNN in the sense that SOM is unsupervised whereas KNN is supervised.
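A minimal sketch of that two-stage approach (the 5-by-5 grid, k = 2, and the kmeans call from the Statistics and Machine Learning Toolbox are assumptions; x is the same data matrix used in the question):

net = selforgmap([5 5]);        % same 5-by-5 SOM grid as above
net = train(net, x);            % x: original data, one column per sample
w = net.IW{1,1};                % codebook vectors, one row per SOM neuron
k = 2;                          % assumed number of clusters
idx = kmeans(w, k);             % cluster the neurons, not the raw data
bmu = vec2ind(net(x));          % best-matching neuron for every sample
labels = idx(bmu);              % cluster label for every original sample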

Related

What is the importance of clustering?

During unsupervised learning we do cluster analysis (like K-Means) to bin the data to a number of clusters.
But what is the use of this clustered data in a practical scenario?
I think during clustering we are losing information about the data.
Are there some practical examples where clustering could be beneficial?
The information loss can be intentional. Here are three examples:
PCM signal quantization (Lloyd's k-means publication). You know that a certain number (say 10) of different signals are transmitted, but with distortion. Quantizing removes the distortion and re-extracts the original 10 different signals. Here, you lose the error and keep the signal.
Color quantization (see Wikipedia). To reduce the number of colors in an image, a quite nice method uses k-means (usually in HSV or Lab space); a sketch follows this list. k is the number of desired output colors. Information loss here is intentional, to better compress the image. k-means attempts to find the least-squared-error approximation of the image with just k colors.
When searching motifs in time series, you can also use quantization such as k-means to transform your data into a symbolic representation. The bag-of-visual-words approach that was the state of the art for image recognition prior to deep learning also used this.
Explorative data mining (clustering - one may argue that the above use cases are not data mining / clustering, but quantization). If you have a data set of a million points, which points are you going to investigate? Clustering methods try to split the data into groups that are supposed to be more homogeneous within and more different from one another. Then you don't have to look at every object, but only at some from each cluster, to hopefully learn something about the whole cluster (and your whole data set). Centroid methods such as k-means can even provide a "prototype" for each cluster, although it is a good idea to also look at other points within the cluster. You may also want to do outlier detection and look at some of the unusual objects. This scenario is somewhere in between sampling representative objects and reducing the data set size to become more manageable. The key difference to the above points is that the result is usually not "operationalized" automatically, but needs to be analyzed manually, because explorative clustering results are too unreliable (and thus require many iterations).
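A minimal sketch of the color quantization example (clustering directly in RGB for brevity; k, the iteration cap, and the demo image are assumptions; kmeans is from the Statistics and Machine Learning Toolbox):

img = im2double(imread('peppers.png'));      % demo image that ships with MATLAB
pixels = reshape(img, [], 3);                % one row per pixel, RGB
k = 16;                                      % desired number of output colors
[idx, centers] = kmeans(pixels, k, 'MaxIter', 200, 'Replicates', 2);
quantized = reshape(centers(idx, :), size(img));   % replace each pixel by its cluster color
figure, imshow(quantized)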

Density Based Clustering with Representatives

I'm looking for a method to perform density-based clustering. The resulting clusters should have a representative, unlike DBSCAN.
Mean-Shift seems to fit those needs but doesn't scale enough for my needs. I have looked into some subspace clustering algorithms and only found CLIQUE using representatives, but this part is not implemented in Elki.
As I noted in the comments on the previous iteration of your question (https://stackoverflow.com/questions/34720959/dbscan-java-library-with-corepoints), density-based clustering does not assume there is a center or representative.
Consider the following example image from Wikipedia user Chire (CC BY-SA 3.0):
Which object should be the representative of the red cluster?
Density-based clustering is about finding "arbitrarily shaped" clusters. These do not have a meaningful single representative object. They are not meant to "compress" your data - this is not a vector quantization method, but structure discovery. But it is the nature of such complex structure that it cannot be reduced to a single representative. The proper representation of such a cluster is the set of all points in the cluster. For geometric understanding in 2D, you can also compute convex hulls, for example, to get an area as in that picture.
Choosing representative objects is a different task. This is not needed for discovering this kind of structure, and thus these algorithms do not compute representative objects - it would waste CPU.
You could choose the object with the highest density as representative of the cluster.
It is a fairly easy modification to DBSCAN to store the neighbor count of every object.
But as Anony-Mousse mentioned, the object may nevertheless be a rather bad choice. Density-based clustering is not designed to yield representative objects.
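A hedged sketch of that modification using MATLAB's dbscan and rangesearch (Statistics and Machine Learning Toolbox, R2019a+); epsilon and minpts are assumed parameters, and the neighbor count stands in for density:

epsilon = 0.5; minpts = 5;                  % assumed parameters
labels = dbscan(X, epsilon, minpts);        % X: one row per observation; -1 marks noise
nbrs = rangesearch(X, X, epsilon);          % epsilon-neighborhood of every point
counts = cellfun(@numel, nbrs);             % neighbor count as a crude density estimate
reps = zeros(max(labels), 1);
for c = 1:max(labels)
    members = find(labels == c);
    [~, i] = max(counts(members));
    reps(c) = members(i);                   % densest point of cluster c as its representative
end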
You could try AffinityPropagation, but it will also not scale very well.

How to give label for cluster from GMM iteration?

I read the concept of GMM from Understanding concept of Gaussian Mixture Models. It was helpful for me. I have also implemented GMM for fisheriris, but I didn't use the fitgmdist function because I don't have it. So I used code from http://chrisjmccormick.wordpress.com/2014/08/04/gaussian-mixture-models-tutorial-and-matlab-code/.
When I read Understanding concept of Gaussian Mixture Models, Amro could plot the result with its labels, i.e. setosa, virginica, and versicolor. How did he do it? After some iterations, I only got mu, Sigma, and weight. There are no labels at all. I want to attach the labels (setosa, virginica, and versicolor) to the mixture components obtained from the GMM iterations.
There are two sets of "labels" in that plot:
one is the "true" labels of the Fisher Iris dataset (the species variable which contains the class of each instance: setosa, versicolor, or virginica). Normally you wouldn't have those in a real dataset (after all, the goal of clustering is to discover those groups within the data, which you don't know beforehand). I just used them here to get an idea of how well the EM clustering performed against the actual truth (the scatter points are color-coded according to the class).
the other set of labels are the clusters we found using GMM. Basically I built a 50x50 grid of 2D points to cover the entire data domain, I then assign a cluster to each of those points by computing the posterior probability and choosing the component with highest likelihood. I showed those clusters in the background color. As a nice consequence, we get to see the discriminant decision boundaries between the clusters.
You can see that the cluster of points on the left got separated quite nicely (and perfectly matched the setosa class), while the points on the right side of the plot got separated into two clusters matching the other two classes, although there were instances "misclassified", if you will (some green points on the wrong side of the boundary).
Typically in a real setting you wouldn't have those actual classes to compare against, so no way to tell how "accurate" your clustering was (there exist other metrics for clustering performance evaluation)...
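A minimal sketch of that approach using fitgmdist (the original answer used a hand-rolled EM implementation; the two chosen features, the grid resolution, and the number of replicates are assumptions):

load fisheriris                              % meas: 150x4 measurements, species: true labels
X = meas(:, 1:2);                            % two features so the plot is 2-D
gm = fitgmdist(X, 3, 'Replicates', 5);       % fit a 3-component GMM with EM
[x1, x2] = meshgrid(linspace(min(X(:,1)), max(X(:,1)), 50), ...
                    linspace(min(X(:,2)), max(X(:,2)), 50));
gridPts = [x1(:) x2(:)];                     % 50x50 grid covering the data domain
idx = cluster(gm, gridPts);                  % most likely component for each grid point
figure
gscatter(gridPts(:,1), gridPts(:,2), idx)    % cluster regions as the background
hold on
gscatter(X(:,1), X(:,2), species)            % true species labels on top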

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have plot of an Image's points/pixels in the RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques and maybe I'm not doing things correctly. I'm trying to cluster using MATLAB's built-in k-means clustering, but it appears that this is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look:
for an image like this:
Can someone tell me where I'm going wrong, and what I can do to improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add a few points, since there are some issues with the current answers.
1) Yes, your clusters are not spherical, which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non-spherical data. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will put everything into one cluster, or everything will be its own cluster, since DBSCAN assumes uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and the assumptions most algorithms make will probably not hold up well in it. Which clustering algorithm you should be using will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
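As a hedged sketch, switching to L*a*b* space before clustering might look like this (rgb2lab is from the Image Processing Toolbox; img and k are assumptions):

lab = rgb2lab(im2double(img));               % img: assumed RGB image
pixels = reshape(lab, [], 3);                % one row per pixel in L*a*b*
idx = kmeans(pixels, 4, 'Replicates', 3);    % assumed k; revisit the algorithm choice in this space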
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fitting a Gaussian to each cluster with a non-spherical covariance matrix.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline of the algorithm (a sketch using fitgmdist follows the list):
1) init: assign each point at random to one of the k clusters.
2) For each cluster, estimate the mean and covariance.
3) For each point, estimate its likelihood of belonging to each cluster. Note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster as encoded by the covariance matrix.
4) Repeat stages 2 and 3 until convergence or until a pre-defined number of iterations is exceeded.
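A hedged sketch of the above using fitgmdist (if available); img, k, and the regularization value are assumptions:

pixels = reshape(im2double(img), [], 3);     % img: assumed RGB image, one row per pixel
k = 4;                                       % assumed number of color clusters
gm = fitgmdist(pixels, k, ...
    'CovarianceType', 'full', ...            % full covariances can model elongated clusters
    'Replicates', 3, ...
    'RegularizationValue', 1e-5);            % guards against ill-conditioned covariances
labels = cluster(gm, pixels);                % hard assignment of every pixel to a component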
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.
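A hedged sketch of appending pixel coordinates to the color vectors before a density-based clustering (img, the spatial weight, epsilon, and minpts are all assumptions; dbscan requires R2019a or later):

[h, w, ~] = size(img);                       % img: assumed RGB image
[cols, rows] = meshgrid(1:w, 1:h);
feats = [reshape(im2double(img), [], 3), ...
         0.01 * [rows(:), cols(:)]];         % color plus (weighted) pixel position
labels = dbscan(feats, 0.1, 20);             % assumed epsilon and minpts
segmented = reshape(labels, h, w);           % back to image shape for inspection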

Matlab: K-means clustering with predefined populations

I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M varies in length (N = column-specific attributes of each run, M = run number). I have looked at PCA and K-means for differentiating the two, but I was curious about the best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
It sounds like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case k-nearest neighbors (as mentioned by #amas) is probably the approach most like k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
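A hedged sketch of the supervised route (A, B, Xnew, and the neighbor count are placeholder names and assumptions; runs go in rows, so the N-by-M matrices are transposed; fitcknn and fitcsvm are in the Statistics and Machine Learning Toolbox):

Xtrain = [A'; B'];                             % A, B: your two known populations, one column per run
y = [ones(size(A,2),1); 2*ones(size(B,2),1)];  % class label per run
mdl = fitcknn(Xtrain, y, 'NumNeighbors', 5);   % or fitcsvm(Xtrain, y) for an SVM
yhat = predict(mdl, Xnew');                    % classify future runs (Xnew: N-by-M like A and B)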
It really depends on the data. But just so you know, K-means does get stuck at local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, like any other spectral method, you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and see how it works; then you can predict labels for new samples with K-NN (I don't know if it is useful for your case).
Check Lazy learners and K-NN for prediction.
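A hedged sketch of that suggestion (X and Xnew are placeholder names with one row per run; nearest-centroid assignment via knnsearch stands in for a full K-NN model):

[idx, C] = kmeans(X, 2, 'Replicates', 10);   % 10 random restarts to avoid bad local minima
newIdx = knnsearch(C, Xnew);                 % assign each new run to the nearest learned centroid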