I roughly know SOM and it is mapping its network to different clusters of trained data.
How to implement to count the number of clusters using SOM? I use KNNL library here to implement SOM. In its demo, it is shown how to train and test only. How can I implement it for counting number of clusters? I know I can also use DBSCAN for cluster counting as well. But first, I like to implement SOM for cluster counting. My input data is 2D data representing points in 2D space like
(132.181,0.683431),
(136.886,0.988517),
(137.316,0.504297),
(133.653,0.602269),
Related
I have some data collected using an online survey. Therefore, there are no classes/labels in the data to evaluate clustering results. I am trying to do the clustering in order to cluster participants in some groups for another task.
In the data, I have 10 attributes like: Age, Gender, etc., and 111 examples or data-points.
It's my first time to perform clustering and it's been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I have tried to cluster the data using all attributes, all types of clustering in Weka (like cobweb, EM .. etc) and using different cluster numbers (1-10). And When I visualise the clusters, they don't make any sense and the data are widely spread between x and y axis.
I have applied PCA and selected different number of attribute combinations according to the ranks obtained in PCA. The best clustering result was obtained using k-means and with only 2 combinations of attributes and the number of clusters selected was 3, and seed was 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster data correct? If not please give me advice/s
Is this considered as a good clustering result?
How can I optimise or enhance my clusters?
What is meant with seed in Weka clustering?
Could someone please provide some information on how to properly combine a self organizing map with a multilayer perceptron?
I recently read some articles about this technique in comparison to regular MLPs and it performed way better in prediction tasks. So, I want to use the SOM as front-end for dimension reduction by clustering the input data and pass the results to an MLP back-end.
My current idea of implementing it is it to train the SOM with a couple of training sets and to determine the clusters. Afterwards, I initialize the MLP with as many input units as SOM clusters. Next step would be to train the MLP using the SOM's output (which value?...weights of BMU?) as in input for the network (SOM's Output for the Cluster matching Input Unit and zeros for any other Input Units?).
There is no single way of doing that. Let me list some possibilities:
The one you describe. But then, your MLP will need to have K*D inputs, where K is the number of clusters and D is the input dimension. There is no dimensionality reduction.
Similar to your idea, but instead of using the weights, just send 1 for the BMU and 0 for the remaining clusters. Then your MLP will need K inputs.
Same as above, but instead of 1 or 0, send the distance from the input vector to each cluster.
Same as above, but instead of the distance, compute a Gaussian activation for each cluster.
Since the SOM preserves topology, send only the 2D coordinates of the BMU (possibly normalized between 0 and 1). Then your MLP will need only 2 inputs and you achieve real extreme dimensionality reduction.
You can read about those ideas and some more here: Principal temporal extensions of SOM: Overview. It is not about feeding the output of a SOM to a MLP, but a SOM to itself. But you'll be able to understand the various possibilities when trying to produce some output from a SOM.
I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping of the labels and using the data to benchmark a few clustering algorithm and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, for say, 50 times and then take the mean value. For 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, also because each iteration might have slightly different cluster assignments, how to somehow remap each of the clusters to one uniform numbering.
Primary idea is to remap by checking how many of the points in the class labels intersect the maximum in the actual labels and then making a remap based on that, but this can get incorrect remappings because when the classes will have more or less equal number of points, this will not work.
Another idea is to keep the labels while clustering, but make the clustering algorithm ignore it. This way all the cluster data will have the label tags. This is doable but I have already have a benchmarked cluster assignment data to be processed therefore I am trying to avoid modifying and re-benchmarking my implementation (which will take quite some time and cpu) of the cluster analysis algorithms and include the label tag to the vectors and then ignore it.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
The domain in which I am studying (metaheuristic clustering algorithms) I could not find a paper comparing these indexes. The paper which compares seems to be incorrect in their values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty, for details see Wikipedia.
Once I have collected and organized data in a SOM how do I identify clusters?
(Items are aggregated and clustered using many traits - upwards of 10)
Specifically I want to find the 'center' of the cluster - therefor giving me the 'center' node(s).
You could use a relative small map and consider each node a cluster, but this is far from optimal. If you want to apply an automated cluster detection method you should definitely read
Clustering of the SelfâOrganizing Map
and search similar bibliography.
You could also use more sophisticated versions of SOM algorithm (multi leveled, self growing, etc).
In any case, keep in mind that the problem of finding the "correct" number of clusters doesn't have a finite solution.
As far as I can tell, SOM is primarily a data-driven dimensionality reduction and data compression method. So it won't cluster the data for you; it may actually tend to spread clusters in the projection (i.e. split them into multiple cells).
However, it may work well for some data sets to either:
Instead of processing the full data set, work only on the SOM nodes (weighted by the number of elements assigned to them), which should be significantly smaller
Instead of working in the original space, work in the lower-dimensional space that the SOM represents
And then run a regular clustering algorithm on the transformed data.
Though an old question I've encountered the same issue and I've had some success implementing Estimating the Number of Clusters in Multivariate Data by Self-Organizing Maps, so I thought I'd share.
The linked algorithm uses the U-matrix to highlight the boundaries of the individual clusters and then uses an image processing algorithm called watershedding to identify the components. For this to work correctly the regions in the u-matrix are required to be concave within the resolution of your quantization (which when converted to a binary image, simply results in using a floodfill to identify the regions).
I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly a testSet is 821x10 matrix.
I want all the rows in input data to be clustered with 88 clusters.
Then I will use these clusters to find which clusters does TestData fall into?
In a previous work, I used K-Means clustering which is very straightforward. We just ask K-Means to cluster the data into 88 clusters. And implement another method that calculates the distance between each row in test data and the centers of each cluster, then pick the smallest values. This is the cluster of the corresponding input data row.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOM's are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by calculating which cluster it belongs to.
Do you have a better solution?
Self-Organizing Map (SOM) is a clustering method considered as an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning techniques to train the network (nodes compete among themselves to display the strongest activation to a given data)
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple; given a vector as input to SOM, we find the node closet to it, then update its weights and the weights of the neighboring nodes so that they approach that of the input vector (hence the name self-organizing). This process is repeated for all input data.
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOM are in a way similar to the K-Means algorithm but different in that we don't impose a fixed number of clusters, instead we specify the number and shape of nodes in the grid that we want it to adapt to our data.
Basically when you have a trained SOM, and you want to classify a new test input vector, you simply assign it to the nearest (distance as a similarity measure) node on the grid (Best Matching Unit BMU), and give as prediction the [majority] class of the vectors belonging to that BMU node.
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox