Why does Weka XMeans systematically underestimate the number of clusters? - cluster-analysis

I'm having some trouble with Weka's XMeans clusterer. I've talked to a couple of other humans and we all agree that there are six clusters in the screenshot below, or at least a minimum of two if you squint really hard. Either way, XMeans does not seem to agree.
XMeans seems to systematically underestimate the number of clusters, based on whatever I have the minimum cluster count set to. With the maximum number of clusters held at 100, here are the results I get:
-L 1 // 1 cluster
-L 2 // 2 clusters
-L 3 // 3 clusters
-L 4 // 5 clusters
-L 5 // 6 clusters
-L 6 // 6 clusters
Most egregiously, with -L 1 (and -H 100) only a single cluster is found. Only by getting the minimum cluster count to five do I actually see the six clusters. Cranking the improve-structure parameter way up (to 100,000) does not seem to have any effect. (I've also played with other options without seeing any difference.) Here are the options that generated the above screenshot, which found one center:
private static final String[] XMEANS_OPTIONS = {
"-H", "100", // max number of clusters (default 4)
"-L", "1", // min number of clusters (default 2)
"-I", "100", // max overall iterations (default 1)
"-M", "1000000", // max k-means iterations in the improve-parameter step (default 1000)
"-J", "1000000", // max k-means iterations in the improve-structure step (default 1000)
};
Obviously I'm missing something here. How do I make XMeans behave as expected?

I think it is what I was afraid of: a psychological problem. (Searching for Gestalt theory should turn up some explanation of the perception aspect.)
The human eye grasps the shape of the clusters and sees six circles. However, k-means, and therefore x-means, only looks at distances. Hence, the clusters look rather awkward. Even after multiple restarts with 6-means I almost always got clusters like:
This one might be a local minimum, which x-means might resolve,
or for 3-means:
Which is quite interesting, as some points clearly violate expectations.
If you use k-means in R, you can analyse the withinss (within-cluster sum of squares) of the clustering result. It shows that these awkward-looking clusters typically score quite well. Hence, convergence to a different result is not to be expected.
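To make the withinss argument concrete, here is a minimal plain-Java sketch of the quantity R reports per cluster. It is not Weka or R code; the class name WithinSS and the toy data are made up for illustration.

```java
import java.util.Arrays;

final class WithinSS {
    /** Per-cluster sum of squared Euclidean distances from each point to its assigned center. */
    static double[] withinss(double[][] points, int[] assignment, double[][] centers) {
        double[] ss = new double[centers.length];
        for (int i = 0; i < points.length; i++) {
            int c = assignment[i];
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - centers[c][d];
                ss[c] += diff * diff;
            }
        }
        return ss;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: two tight pairs of points and their means.
        double[][] points = {{0, 0}, {2, 0}, {10, 10}, {10, 12}};
        int[] assignment = {0, 0, 1, 1};
        double[][] centers = {{1, 0}, {10, 11}};
        System.out.println(Arrays.toString(withinss(points, assignment, centers))); // [2.0, 2.0]
    }
}
```

Comparing the total of these values between the "awkward" partition and the one a human would draw shows which of the two k-means actually prefers.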
I think this could be resolved by using a different distance measure, for example the squared Euclidean distance, which enforces circle-like shapes,
or by using some kernel-based clustering technique with an RBF kernel.
===============================================
EDIT1:
However, one aspect of the Weka results is still rather awkward. I used RWeka to run a few experiments: 100 clustering runs for each initial cluster count between 2 and 7. My expectation was that for 2 and 6 the cluster count would be rather stable, while for the other initial values I would have expected it to grow.
However, the results are different:
So basically 2 and 6 are fairly stable; however, the cluster count is always expanded by at most 1.
So let's have a look at the BIC:
What we can observe is that the BIC does not increase when the cluster count increases, but depends strongly on the initial cluster count.
Let's drill down a bit further and look at an initial cluster count of 3. Running multiple restarts with different initial sizes reproducibly generates the following two situations:
Nevertheless, the BIC results seem to suggest that there is a bug in the BIC calculation.

Related

Find "complemented" bit vectors clusters

I have a huge list of bit vectors (BV) that I want to group in clusters.
The idea behind these clusters is to be able to later choose BVs from each cluster and combine them to generate a BV with (almost) all ones (the number of ones must be maximized).
For example, imagine that 1 means an app is up and 0 means it is down on node X at a specific moment in time. We want to find the minimal list of nodes that keeps the app up:
App BV for node X in cluster 1: 1 0 0 1 0 0
App BV for node Y in cluster 2: 0 1 1 0 1 0
Combined BV for App (X+Y): 1 1 1 1 1 0
I have been checking the different clustering algorithms, but I did not find one that takes this "complementary" behavior into account, because in this case each column of the BV does not refer to a feature (it only means up or down in a specific timeframe).
As for other algorithms like k-means or hierarchical clustering, it is not clear to me whether I can build this consideration for the later grouping into the clustering algorithm.
Finally, I am using the Hamming distance to determine the intra-cluster and inter-cluster distances, given that it seems to be the most appropriate metric for binary data. However, the results show that the clusters are neither tight nor well separated from each other, so I wonder whether I am applying the most suitable grouping/approximation method, or whether I should filter the input data before grouping.
Any clue or idea regarding a grouping/clustering method or data filtering is welcome.
This does not at all sound like a clustering problem.
None of these algorithms will help you.
Instead, I would rather call this a matchmaking problem. But I'd assume it is at least NP-hard (it resembles set cover) to find the true optimum, so you'll need to come up with a fast approximation, ideally something specific to your use case.
Also, you haven't specified how to combine two 1s (you wrote +, but that likely isn't what you want). Is it xor or or? Nor whether it is possible to combine more than two vectors, and what the cost is when doing so. One strategy would be to find the nearest neighbor of the inverse bit vector for each vector and always combine the best pair.
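As a rough illustration of such an approximation, here is a greedy set-cover style sketch in plain Java. It assumes the vectors are combined with OR and that any number of them may be combined at no extra cost; both are assumptions, since the question leaves them open. The class GreedyCover and its helper parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class GreedyCover {
    /** Greedily picks indices of vectors whose OR covers as many of the `width` positions as possible. */
    static List<Integer> greedyCover(List<BitSet> vectors, int width) {
        BitSet covered = new BitSet(width);
        List<Integer> chosen = new ArrayList<>();
        while (covered.cardinality() < width) {
            int best = -1;
            int bestGain = 0;
            for (int i = 0; i < vectors.size(); i++) {
                BitSet gain = (BitSet) vectors.get(i).clone();
                gain.andNot(covered);            // bits this vector would newly cover
                if (gain.cardinality() > bestGain) {
                    bestGain = gain.cardinality();
                    best = i;
                }
            }
            if (best < 0) {
                break;                           // no remaining vector adds a new 1
            }
            covered.or(vectors.get(best));
            chosen.add(best);
        }
        return chosen;
    }

    /** Parses a string such as "100100" into a BitSet (position 0 is the leftmost character). */
    static BitSet parse(String bits) {
        BitSet result = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                result.set(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Vectors X and Y from the question, plus one made-up extra vector.
        List<BitSet> vectors = List.of(parse("100100"), parse("011010"), parse("000100"));
        System.out.println(greedyCover(vectors, 6)); // [1, 0]
    }
}
```

Greedy selection does not guarantee the smallest set of nodes, but it is the standard cheap heuristic for set-cover-like problems and is easy to adapt if xor or a combination cost turns out to be required.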

ELKI DBSCAN LngLatDistanceFunction producing one cluster

I'm using ELKI's LngLatDistanceFunction to cluster lon/lat points, but it's only returning one cluster (it was returning more clusters when I used Euclidean distance). I tried multiple epsilon values but I'm still getting one cluster.
int minPts=20;
double eps=10;
ListParameterization params = new ListParameterization();
params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, LngLatDistanceFunction.class);
params.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
params.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 600);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);
The distance is in meters. Therefore, you need to choose epsilon such that some, but not all, points have more than minPts neighbors.
You can use the KNNDistancesSampler class to estimate the parameter. It is not an automatic estimation, but you can plot the resulting distances and check for a "knee" in this plot.
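To show what such a k-distance plot is made of, here is a small plain-Java sketch. It does not use the ELKI API: it computes great-circle distances with the haversine formula on a spherical Earth (an assumption of this sketch, and not necessarily the model ELKI uses), and returns each point's distance to its k-th nearest neighbor, sorted in descending order. The class name and the (latitude, longitude) array layout are made up; note that ELKI's LngLatDistanceFunction expects longitude first.

```java
import java.util.Arrays;

final class KDistanceSketch {
    // Mean Earth radius in meters; a spherical approximation chosen for this sketch.
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Great-circle distance in meters between two (latitude, longitude) pairs given in degrees. */
    static double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
        double sinLat = Math.sin(Math.toRadians(lat2 - lat1) / 2);
        double sinLng = Math.sin(Math.toRadians(lng2 - lng1) / 2);
        double a = sinLat * sinLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLng * sinLng;
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Distance of every point to its k-th nearest other point, sorted descending (O(n^2 log n)). */
    static double[] sortedKDistances(double[][] latLng, int k) {
        int n = latLng.length;
        if (k < 1 || k > n - 1) {
            throw new IllegalArgumentException("k must be between 1 and n - 1");
        }
        double[] kDist = new double[n];
        for (int i = 0; i < n; i++) {
            double[] d = new double[n - 1];
            int m = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    d[m++] = haversineMeters(latLng[i][0], latLng[i][1], latLng[j][0], latLng[j][1]);
                }
            }
            Arrays.sort(d);
            kDist[i] = d[k - 1];
        }
        Arrays.sort(kDist);
        for (int lo = 0, hi = n - 1; lo < hi; lo++, hi--) { // reverse into descending order
            double tmp = kDist[lo];
            kDist[lo] = kDist[hi];
            kDist[hi] = tmp;
        }
        return kDist;
    }
}
```

Plotting the returned array against its index gives the curve in which to look for the knee; on larger data sets a random sample of points keeps the quadratic cost manageable.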
Pay attention to the "noise" flag.
If you get a single cluster, and it is "noise", then epsilon is too small.
If you get a single cluster, and it is a "cluster" (not noise), then epsilon is too large.
If you get a single cluster, and it is "noise", then minPts may be too large.
If you get a single cluster, and it is a cluster, then minPts may be too small.
For most applications, it is easier to fix minPts to 4, or 10, or 20, and then adjust the epsilon parameter as desired. For geographic applications like yours, it may be much easier to fix the epsilon parameter and vary the minPts parameter instead. For example, you may know that a distance of less than 10000 meters indicates that objects are "neighbors".
Algorithms such as OPTICS are also helpful to choose the parameter visually. (Use the MiniGUI!)

Clustering in Matlab

Hi, I am trying to cluster using linkage(). Here is the code I am trying:
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting the following error:
The number of elements exceeds the maximum allowed size in MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
The data size is 56710*128. How can I apply the code to small chunks of the data and then merge those clusters optimally? Or is there any other solution to the problem?
Matlab probably cannot cluster this many objects with this algorithm.
Most likely they use distance matrices in their implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2=1,607,983,695 entries, or some 12 GB of RAM at 8 bytes per entry; most likely a working copy of this is needed as well. Chances are that the default Matlab data structures are not prepared to handle this amount of data (and you won't want to wait for the algorithm to finish either; probably that is why they "allow" only a certain amount).
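The arithmetic behind that estimate can be checked with a few lines of Java. This is only a back-of-the-envelope sketch assuming one 8-byte double per pairwise distance; the class name is made up, and it says nothing about how Matlab actually stores the data.

```java
final class DistanceMatrixSize {
    /** Number of unordered pairs among n objects, i.e. entries in a condensed distance matrix. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /** Bytes needed to store that many double-precision values. */
    static long bytesForDoubles(long entries) {
        return entries * Double.BYTES;
    }

    public static void main(String[] args) {
        long entries = pairCount(56710);
        System.out.println(entries);                  // 1607983695
        System.out.println(bytesForDoubles(entries)); // 12863869560
    }
}
```

Because the entry count grows quadratically, doubling the number of objects roughly quadruples the memory, which is why testing on subsets of 1000 and 2000 instances is informative.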
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?

rapidminer: cluster performance operators..what does different value mean?

I have to check the performance of various clustering algorithms using different performance operators in RapidMiner. For that I want to know the following things:
What does the cluster number index value show, which is the output of the cluster count performance operator?
What do small and large values of avg. within cluster distance and avg. within centroid distance mean in terms of good and bad clustering?
I also want to check other index values such as the Dunn index, Jaccard index, and Fowlkes–Mallows index for various clustering algorithms, but RapidMiner doesn't have any operator for this. What can I do about that? I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters. Pointless, you might say, but when used with DBSCAN it can be quite interesting: http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow" that marks the point where the natural progression of the measure dominates the structure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-map execution. What I mean is that after close inspection of the clusterDump for the first iteration in both cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
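The size of that offset can be reproduced with a one-dimensional toy model. The sketch below is not Mahout code; it only simulates the behavior described above, under the assumption that each partial S1 starts as a copy of the previous center. The class and parameter names are hypothetical.

```java
final class MeanOffsetDemo {
    /**
     * Merges per-partition (S0, S1) statistics and returns S1 / S0.
     * If startFromPrevious is true, each partition's S1 starts at the previous center
     * (the suspected behavior); otherwise it starts at 0 (the correct behavior).
     */
    static double mergedCenter(double previousCenter, double[][] partitions, boolean startFromPrevious) {
        double s0 = 0.0;
        double s1 = 0.0;
        for (double[] part : partitions) {
            double partS0 = 0.0;
            double partS1 = startFromPrevious ? previousCenter : 0.0;
            for (double x : part) {
                partS0 += 1.0;   // count of points
                partS1 += x;     // running sum of points
            }
            s0 += partS0;        // merge step: add up counts
            s1 += partS1;        // merge step: add up sums
        }
        return s1 / s0;          // finalization: new center
    }

    public static void main(String[] args) {
        double[][] partitions = {{1, 2}, {3, 4}};   // t = 2 splits, n = 4 points
        System.out.println(mergedCenter(10.0, partitions, false)); // 2.5
        System.out.println(mergedCenter(10.0, partitions, true));  // 7.5
    }
}
```

With t = 2 splits, n = 4 points and a previous center of 10, the result is shifted by 2 * 10 / 4 = 5, matching the t / n factor.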
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
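A minimal sketch of the incremental mean update, together with a count-weighted merge that a reducer could use to combine partial results, might look like this. It is plain Java, not a patch against Mahout; the class RunningMean is hypothetical and covers only the mean, not the S2 statistic.

```java
final class RunningMean {
    private long count;
    private final double[] mean;

    RunningMean(int dimensions) {
        this.mean = new double[dimensions];
    }

    /** Incremental update: mean_n = mean_(n-1) + (x - mean_(n-1)) / n. */
    void add(double[] x) {
        count++;
        for (int d = 0; d < mean.length; d++) {
            mean[d] += (x[d] - mean[d]) / count;
        }
    }

    /** Combines another partial mean into this one, weighted by the number of points each has seen. */
    void merge(RunningMean other) {
        if (other.count == 0) {
            return;
        }
        long total = count + other.count;
        for (int d = 0; d < mean.length; d++) {
            mean[d] += (other.mean[d] - mean[d]) * other.count / total;
        }
        count = total;
    }

    double[] mean() {
        return mean.clone();
    }

    long count() {
        return count;
    }
}
```

For example, adding (1, 10) and (3, 30) to one instance, adding (5, 50) to a second, and merging the second into the first yields a count of 3 and a mean of (3, 30). Since the accumulator never holds a sum that grows with the data set, it keeps its precision when the data is far from the 0 vector.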