How to give label for cluster from GMM iteration? - matlab

I read the concept of GMM from Understanding concept of Gaussian Mixture Models. It is helpful for me. I have implemented GMM for fisheriris also but I didn't use fitgmdist function because I didn't have it. So I used code from http://chrisjmccormick.wordpress.com/2014/08/04/gaussian-mixture-models-tutorial-and-matlab-code/.
When I read Understanding concept of Gaussian Mixture Models, Amro could plot the result with its label, i.e. setosa, virginica, and versicolor. How did he do it? After some iterations, I only got mu, Sigma, and weight. There is no label at all. I want to put the label (setosa, virginica, and versicolor) to mixture models from GMM iteration.

There are two sets of "labels" in that plot:
one is the "true" labels of the Fisher Iris dataset (the species variable which contains the class of each instance: setoas, versicolor, or virginica). Normally you wouldn't have those in a real dataset (after all the goal of clustering is to discover those groups within the data, which you don't know beforehand). I just used them here to get an idea of how well the EM clustering performed against the actual truth (the scatter points are color-coded according to the class).
the other set of labels are the clusters we found using GMM. Basically I built a 50x50 grid of 2D points to cover the entire data domain, I then assign a cluster to each of those points by computing the posterior probability and choosing the component with highest likelihood. I showed those clusters in the background color. As a nice consequence, we get to see the discriminant decision boundaries between the clusters.
You can see that the cluster of points on the left got separated quite nicely (and perfectly matched the setosa class). While the points on the right side of the plot got separated in two matching the other two classes, although there were instance "misclassified" if you will (some green points on the wrong side of the boundary).
Typically in a real setting you wouldn't have those actual classes to compare against, so no way to tell how "accurate" your clustering was (there exist other metrics for clustering performance evaluation)...

Related

Self-organizing map: How to identify clusters from plots?

I've been learning about neural networks and most recently been trying out different clustering methods. But unlike KNN, GMM, or DBSCAN, there isn't a feature (in Matlab that I'm aware of) that identifies clusters for you. So I've been reading articles of how to interpret these plots, but I'm still confused. For my example, in the weight positions plot, I see one cluster. For the neighbor weight differences, I see one, maybe two clusters (yellow/bright - similar, red/dark - dissimilar). That seems to be confirmed when looking at the densities in the hits plot. There might be more, but I honestly I can't tell (I'm new at this) because of the gradient instead of a solid boundary between clusters. How many clusters do you see, and what's your logic? Thank you]1[]2[]3
selforgmap([5 5]
[net,tr] = train(net,x)
figure, plotsomnd(net)
figure, plotsomhits(net,x)
figure, plotsompos(net,x)
You may construct a new paradigm in relation with what the SOM nodes represent, i.e. they produce a new dataset. The new dataset is independent from the original dateset. Nevertheles, it is arranged somehow so that the underlying structure imitates that of the original dataset. Therefore, it is often found that people perform SOM with clustering algorithms such as K-means, Hierarchical Clustering, etc subsequently. This can be regarded as: instead of clustering directly from a huge amount of the original data, the clustering procedure is performed on a new version of the original dataset which is smaller but still inherits the topology of the original dataset. AFAIK, SOM is different from KNN in the sense that SOM is unsupervised whereas KNN is supervised.

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assuming that I have a dataset with 2 features (feat_A and feat_B) and let's assume that I use a clustering algorithm to divide the data in an optimal number of clusters...say 3.
My goal is to assign for each of the input data [feat_Ai,feat_Bi] a probability (or something similar) that the point belongs to cluster 1 2 3.
a. First approach with clustering:
I cluster the data in the 3 clusters and I assign to each point the probability of belonging to a cluster depending on the distance from the center of that cluster.
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
In my problem I find the cluster centers (or I fit the model if approach b. is used) with a subsample of data. Then I have to assign a probability to a lot of other data... I would like to know in presence of new data which approach is better to use to still have meaningful assignments.
I would go for a clustering method for example a kmean because:
If the new data come from a distribution different from the one used to create the mixture model, the assignment could be not correct.
With new data the posterior probability changes.
The clustering method minimizes the variance of the clusters in order to find a kind of optimal separation border, the mixture model take into consideration the variance of the data to create the model (not sure that the clusters that will be formed are separated in an optimal way).
More info about the data:
Features shouldn't be assumed dependent.
Feat_A represents the duration of a physical activity Feat_B the step counts In principle we could say that with an higher duration of the activity the step counts increase, but it is not always true.
Please help me to think and if you have any other point please let me know..

Remove outliers from a set of 3d points before clustering Matlab

I have a set of 3d points in Matlab but the problem is that my data found here. And as you can see there are some outliers which are affecting my clustering results. So if anyone could please advise how I can delete these outliers from my data.
Having looked at your data, I don't think any clustering algorithm will do what you want. Instead, you will probably need to train a classifier. This is what the Kinect people did, train a classifier using millions of real and synthetic postures, to have it label limbs, head, etc.
The reason why I don't think density based clustering will work either is because your data is a single, density-connected, body-with-two-boxes-shaped blob. But without knowing what a "body" and a "box" is, segmentation will be rather arbitrary. Or in the case of density based clustering: it will not segment at all, or it will segment e.g. by the rather low resultion of your z axis. Furthermore, your X and Y axes come from a grid based image scan (I assume), so you have a very uniform density on the X and Y axes - but the arms, for example, are not of a lower density than the body or boxes.
You can, however, use DBSCAN with rather broad (and easy to set) parameters to remove the noise.
E.g. in ELKI the following parameters yield reasonable results:
java -jar elki.jar -dbc.in /tmp/XX.csv -algorithm clustering.DBSCAN \
-dbscan.epsilon 0.05 -dbscan.minpts 100
The majority cluster is your data with the outliers removed; even with this blob near the foot removed.
To speed up the clustering process, you can add the parameters
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
-pagefile.pagesize 1000 -spatial.bulkstrategy SortTileRecursiveBulkSplit
which yields a runtime opf 4.5 seconds here. This obviously is not good enough for realtime operation as on a Kinect; but it is not surprising to see a directed classification algorithm to outperform an unsupervised method - this is in fact to be expected.
Here is the result of clustering the data set with the parameters above:

Finding elongated clusters using MATLAB

Let me explain what I'm trying to do.
I have plot of an Image's points/pixels in the RGB space.
What I am trying to do is find elongated clusters in this space. I'm fairly new to clustering techniques and maybe I'm not doing things correctly, I'm trying to cluster using MATLAB's inbuilt k-means clustering but it appears as if that is not the best approach in this case.
What I need to do is find "color clusters".
This is what I get after applying K-means on an image.
This is how it should look like:
for an image like this:
Can someone tell me where I'm going wrong, and what I can to do improve my results?
Note: Sorry for the low-res images, these are the best I have.
Are you trying to replicate the results of this paper? I would say just do what they did.
However, I will add since there are some issues with the current answers.
1) Yes, your clusters are not spherical- which is an assumption k-means makes. DBSCAN and MeanShift are two more common methods for handling such data, as they can handle non spherical data. However, your data appears to have one large central clump that spreads outwards in a few finite directions.
For DBSCAN, this means it will put everything into one cluster, or everything is its own cluster. As DBSCAN has the assumption of uniform density and requires that clusters be separated by some margin.
MeanShift will likely have difficulty because everything seems to be coming from one central lump - so that will be the area of highest density that the points will shift toward, and converge to one large cluster.
My advice would be to change color spaces. RGB has issues, and it the assumptions most algorithms make will probably not hold up well under it. What clustering algorithm you should be using will then likely change in the different feature space, but hopefully it will make the problem easier to handle.
k-means basically assumes clusters are approximately spherical. In your case they are definitely NOT. Try fit a Gaussian to each cluster with non-spherical covariance matrix.
Basically, you will be following the same expectation-maximization (EM) steps as in k-means with the only exception that you will be modeling and fitting the covariance matrix as well.
Here's an outline for the algorithm
init: assign each point at random to one of k clusters.
For each cluster estimate mean and covariance
For each point estimate its likelihood to belong to each cluster
note that this likelihood is based not only on the distance to the center (mean) but also on the shape of the cluster as it is encoded by the covariance matrix
repeat stages 2 and 3 until convergence or until exceeded pre-defined number of iterations
Take a look at density-based clustering algorithms, such as DBSCAN and MeanShift. If you are doing this for segmentation, you might want to add pixel coordinates to your vectors.

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly a testSet is 821x10 matrix.
I want all the rows in input data to be clustered with 88 clusters.
Then I will use these clusters to find which clusters does TestData fall into?
In a previous work, I used K-Means clustering which is very straightforward. We just ask K-Means to cluster the data into 88 clusters. And implement another method that calculates the distance between each row in test data and the centers of each cluster, then pick the smallest values. This is the cluster of the corresponding input data row.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOM's are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by calculating which cluster it belongs to.
Do you have a better solution?
Self-Organizing Map (SOM) is a clustering method considered as an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning techniques to train the network (nodes compete among themselves to display the strongest activation to a given data)
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple; given a vector as input to SOM, we find the node closet to it, then update its weights and the weights of the neighboring nodes so that they approach that of the input vector (hence the name self-organizing). This process is repeated for all input data.
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOM are in a way similar to the K-Means algorithm but different in that we don't impose a fixed number of clusters, instead we specify the number and shape of nodes in the grid that we want it to adapt to our data.
Basically when you have a trained SOM, and you want to classify a new test input vector, you simply assign it to the nearest (distance as a similarity measure) node on the grid (Best Matching Unit BMU), and give as prediction the [majority] class of the vectors belonging to that BMU node.
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox