Single-linkage hierarchical cluster method: cutting the tree - MATLAB

I have a dataset containing 3 categories {c1, c2, c3}. I'm using the single-linkage hierarchical clustering method (from MATLAB) to cluster the dataset, with a distance measure I built myself. The following figure shows the results. Note that the hierarchical clustering groups the data correctly: the points of c1 (yellow) are very close to each other, and similarly for c2 (green) and c3 (blue).
From the figure we can see that the distances between the points in c1 are very small compared to those in c2 and c3. So if, for example, I decide to cut the tree at 8, I get c1 as one cluster, but c2 and c3 are split into 8 clusters, with each point in a different cluster.
How can I overcome this problem? Do I need to change the clustering method, or should I cut the tree at 17 and cluster the resulting clusters again?

There are different ways of extracting clusters from a dendrogram. You are not required to do a single cut (although MATLAB may only offer this choice). Selecting regions like you did is also reasonable, and so is cutting the dendrogram at multiple heights. But not every tool has all of these capabilities.
Notice that c3 was split into two, half of which is not well separated from c2.
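For what it's worth, here is a minimal MATLAB sketch of the second option you mention (cut high, then re-cluster a coarse branch at a lower height); X stands in for your data and myDist for your custom distance function, so both names are placeholders:

% Build the single-linkage tree from the custom distance measure.
% myDist must accept a 1-by-n row XI and an m-by-n matrix XJ and return an
% m-by-1 vector of distances (the signature pdist expects for function handles).
D = pdist(X, @myDist);
Z = linkage(D, 'single');
dendrogram(Z);

idxTop = cluster(Z, 'cutoff', 17, 'criterion', 'distance');  % coarse cut -> roughly c1, c2, c3
idxFix = cluster(Z, 'maxclust', 3);                          % or: ask for exactly 3 clusters

% Refine one branch separately at a lower height, e.g. group 2 of the coarse cut.
inG2  = (idxTop == 2);
Z2    = linkage(pdist(X(inG2, :), @myDist), 'single');
idxG2 = cluster(Z2, 'cutoff', 8, 'criterion', 'distance');

Note that cluster(Z, 'maxclust', k) still corresponds to a single horizontal cut, just chosen automatically; it is the per-branch re-clustering that actually gives you different heights in different parts of the tree.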


Clustering groups based on 3 variables in SPSS and R

I am currently trying to understand cluster analysis (using SPSS and R). Having read up so much about it, I am now even more confused as to which clustering method to use to answer my research question.
My research question investigates whether a) certain participants can be clustered according to their change in variable A (a group that remains stable, a group that worsens, and a third group that improves over 2 assessments), and b) how these groups/clusters differ with regard to two other variables at assessment 1 (B and C). That is, do people with different patterns in B and C have a different change in A?
Question: I have standardised the data and, so far, have tried two-step (hierarchical) and k-means clustering. However, I am unsure whether these are the right methods for answering my question. For the fixed number of clusters I chose 3, because I am interested in seeing clusters of people that improve/worsen/stay stable over time, together with each cluster's individual pattern of B and C. Is this feasible? Am I missing something?
For k-means clustering I used the following syntax:
QUICK CLUSTER z_A_change z_B_mean z_C_mean
/MISSING=LISTWISE
/CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/SAVE CLUSTER DISTANCE
Finally, is there any way to visualise these clusters on a 3-D plot in SPSS? I am not quite as proficient in R's ggplot2 or scatterplot3d as I would like to be.
Thank you in advance.
If you use TWOSTEP CLUSTER or QUICK CLUSTER to fit a three-cluster solution and save the cluster memberships as a new variable, you can create a grouped 3D scatter plot via the Chart Builder. In the menus, go to Graphs > Chart Builder. In the Gallery view, under Choose from: select Scatter/Dot. Among the icons shown underneath the main canvas, the second from the right in the top row should be the grouped 3D scatter. Drag that icon onto the canvas. Select each of the three variables used in the clustering for the X, Y, and Z axes, specify the cluster membership variable as the Set Color variable, then click OK.
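Outside SPSS, and purely as an illustration of the same idea (all names below are placeholders, not your variables), the equivalent grouped 3-D scatter of a three-cluster k-means solution takes only a few lines in, e.g., MATLAB:

% Illustration only: k = 3 clusters on three standardized variables,
% then a 3-D scatter coloured by cluster membership.
rng(0);
Xz  = zscore(randn(100, 3));            % stands in for z_A_change, z_B_mean, z_C_mean
idx = kmeans(Xz, 3);                    % three-cluster solution
scatter3(Xz(:,1), Xz(:,2), Xz(:,3), 36, idx, 'filled');
xlabel('z\_A\_change'); ylabel('z\_B\_mean'); zlabel('z\_C\_mean');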

How to merge clustering results for different clustering approaches?

Problem: It appears to me that a fundamental property of a clustering method c() is whether we can combine the results c(A) and c(B) by some function f() of two clusterings in a way that we do not have to apply the full clustering c(A+B) again but instead do f(c(A),c(B)) and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c() to have this property is that it is deterministic, that is, the order of its internal processing is irrelevant for the result. However, this might not be sufficient.
It would be really nice to have a reference for looking up which clustering methods support this, and what a good f() looks like in each case.
Example: At the moment I am thinking about DBSCAN, which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
One point is reachable from another point if it is in its eps-neighborhood
A core point is a point with at least minPts points reachable from it
An edge goes from every core point to all points reachable from it
Every point with incoming edge from a core point is in the same cluster as the latter
If you miss the noise points, then assume that each core point reaches itself (reflexivity), and afterwards define noise points to be clusters of size one. Border points are non-core points. If we want a partitioning afterwards, we can randomly assign the border points that are in multiple clusters to one of them. I do not consider this relevant for the method itself.
Supposedly the only clustering where this is efficiently possible is single-linkage hierarchical clustering, because edges discarded within A x A and within B x B are never needed for finding the minimum spanning tree of the joined set.
For DBSCAN specifically, you have the problem that the core-point property can change when you add data. So c(A+B) will likely have core points that were not core in either A or B. This can cause clusters to merge. f() therefore needs to re-check all data points, i.e., essentially rerun DBSCAN. While you can exploit the fact that core points of a subset must also be core points of the entire set, you still need to find new neighbors and the missing core points.
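To make the core-point issue concrete, here is a small sketch (MATLAB, toy coordinates, Euclidean distance; epsilon and minPts chosen purely for illustration): no point is core within A or within B alone, yet several become core points once the two sets are joined.

epsilon = 1.0;
minPts  = 4;
A = [0 0; 0.5 0; 0 0.5];          % three points close together
B = [0.5 0.5; 3 3; 3.5 3];        % one point near A's region, two far away

% Count eps-neighbours of each point (a point reaches itself, as assumed above).
isCore = @(X) sum(pdist2(X, X) <= epsilon, 2) >= minPts;

coreA  = isCore(A)        % all false: nobody has 4 neighbours within A alone
coreB  = isCore(B)        % all false within B as well
coreAB = isCore([A; B])   % the four points around the origin are now core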

Determine number of clusters for different datasets

I performed a clustering analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C, and group 2 uses media B, C and D). Then I divided the dataset into these groups, since each user belongs to a specific group (as a consequence, the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to see which different media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups of media usage such as ABC and BCD. You should first evaluate whether this hypothesis is adequate. What you want to do in the next step is assign those labels to subsets of the data. First of all, you should be able to simply carry them over from the previous labels by subsetting. But if it really is different data, you can label the new data, for example using the most similar record (nearest-neighbour classification). That is classification now, though, because your classes are fixed.
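If it really is new data, a minimal nearest-neighbour sketch of that labelling step (MATLAB here, with toy numbers standing in for your media-usage records) would be:

rng(1);
Xold   = [randn(50, 2); randn(50, 2) + 5];   % records that were already clustered
idxOld = [ones(50, 1); 2 * ones(50, 1)];     % their (fixed) cluster labels
Xnew   = [randn(10, 2); randn(10, 2) + 5];   % new records to be labelled

nearest = knnsearch(Xold, Xnew);   % index of the most similar old record
idxNew  = idxOld(nearest);         % each new record inherits that record's cluster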

Clustering data using matlab

I'm trying to cluster my data. This is the example of my data:
genes param1 param2 ...
gene1 0.224 -0.113 ...
gene2 -0.149 -0.934 ...
I have about a thousand genes and a hundred parameters. I wanted to cluster my data by both genes and parameters, and used clustergram for it. As there are so many genes, it's very difficult to understand anything from the picture alone. Now I want the 15-20 biggest clusters of genes in my data as text, i.e. 15-20 lists of genes that belong to different clusters. How can I do this?
Thanks
This is an example of the clustergram I get from my data:
There are vertical and horizontal dendrograms here. As there are a lot of rows, it's impossible to see anything on the vertical dendrogram (the only one I need).
As far as I understand, the dendrogram builds binary clusters from my data, so there are N-1 clusters for N rows of data. Because the clusters are binary, there is one cluster at the top; at the next step it splits into two, then each of those splits again, and so on. Can I get information about which genes are in which clusters at, for example, the 4th step, when there are 16 clusters?
To see interesting parts of the dendrogram and heatmap more clearly, you can use the zoom button on the toolbar to select regions of interest and zoom in on them.
To find out which genes/variables are in a particular cluster, right-click on a point in one of the dendrograms that represents the cluster you're interested in, and select Export to Workspace. You'll get a structure with the following fields:
GroupNames — Cell array of text strings containing the names of the row or column groups.
RowNodeNames — Cell array of text strings containing the names of the row nodes.
ColumnNodeNames — Cell array of text strings containing the names of the column nodes.
ExprValues — An M-by-N matrix of intensity values, where M and N are the number of row nodes and of column nodes respectively.
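If you prefer a programmatic route instead of exporting groups from the clustergram window, a minimal sketch is to rebuild the row (gene) tree with linkage and cut it into a fixed number of clusters. Here data and geneNames are stand-ins for your expression matrix and row labels, and the distance/linkage options should be set to whatever you passed to clustergram (its defaults may differ from the ones shown):

% Stand-ins for your expression matrix (genes x parameters) and gene names.
data      = randn(1000, 100);
geneNames = compose("gene%d", (1:1000)');

% Rebuild the row dendrogram and cut it into 16 flat clusters.
% Use the same distance and linkage options you gave to clustergram.
Z = linkage(pdist(data), 'average');
T = cluster(Z, 'maxclust', 16);

% Print one gene list per cluster.
for k = 1:16
    fprintf('Cluster %d (%d genes):\n', k, sum(T == k));
    disp(geneNames(T == k));
end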

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run k-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10 MB, ~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters the mappers made different choices compared to the single-mapper execution. What I mean is that, after close inspection of the clusterDump for the first iteration in both cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split across multiple machines, there seems to be an error in the way they compute their means. Ouch. This error is amplified if the reducer is applied to more than one input, in particular when the partitions are small. To be more specific, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t/n, where t is the number of splits and n the number of objects.
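A toy reproduction of that offset (MATLAB here, not Mahout code), assuming each of t mappers starts its S1 at the previous center instead of the 0 vector:

prevCenter = [10; 10];             % previous cluster center
X = randn(2, 100) + 3;             % 100 points as columns, true mean near [3; 3]
t = 4;                             % number of splits / mappers
splits = reshape(1:100, [], t);    % four equal partitions of the points

S0 = 0; S1 = zeros(2, 1);
for i = 1:t
    s1 = prevCenter;               % the suspected bug: S1 starts at the old center
    for j = splits(:, i)'
        s1 = s1 + X(:, j);
    end
    S1 = S1 + s1;                  % reducer merges the partial statistics
    S0 = S0 + size(splits, 1);
end

wrongCenter = S1 / S0;
trueCenter  = mean(X, 2);
offset      = wrongCenter - trueCenter   % exactly (t/n) * prevCenter = [0.4; 0.4]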
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic with the true mean rather than S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula, which AFAICT was used in the original "k-means" publication by MacQueen (which is actually an online k-means, while this is Lloyd-style batch iteration). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
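For completeness, here is a minimal sketch (MATLAB, mean only) of the incremental update being referred to; merging two partial means works the same way and never requires the raw sum:

% Incremental mean: keep the running mean itself instead of the sum S1.
X  = randn(1000, 3) + 1e6;     % data far from the origin, the case where raw sums hurt precision
mu = zeros(1, 3);
for n = 1:size(X, 1)
    mu = mu + (X(n, :) - mu) / n;   % mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
end

% Merging two partial results (nA, muA) and (nB, muB) without forming sums:
% muAB = muA + nB / (nA + nB) * (muB - muA);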