In the MATLAB version of the k-means algorithm, there is a very useful flag that indicates the action to take if a cluster loses all of its member observations during the optimization. There are three possibilities in MATLAB (illustrated in the sketch after this list):
treat an empty cluster as an error
remove any clusters that become empty
create a new cluster consisting of the one point furthest from its centroid
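In MATLAB these map onto the 'EmptyAction' name-value pair of kmeans; a minimal sketch, where the toy data and the deliberately large k (chosen to make an empty cluster plausible) are my own assumptions:

```matlab
% Minimal sketch of the three 'EmptyAction' choices in MATLAB's kmeans.
rng(1);                                   % reproducible toy example
X = [randn(20,2); randn(20,2) + 5];
k = 6;                                    % large k to provoke empty clusters

% 'error'     - treat an empty cluster as an error
% 'drop'      - remove empty clusters (their rows in C and D become NaN)
% 'singleton' - seed a new cluster with the point furthest from its centroid
[idx, C] = kmeans(X, k, 'EmptyAction', 'singleton');
```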
Does anyone know what happens in DAAL K-Means in that case? I could not find anything about this in the documentation.
In the K-Means implementation of Intel DAAL, clustering information on the feature vectors is collected automatically during execution. The feature vectors furthest from their assigned centroids are selected as new cluster centers to compensate for any cluster that becomes empty during an iteration.
Note that a good choice of cluster initialization can help avoid empty clusters in the first place.
Refer to https://software.intel.com/en-us/daal-programming-guide-details-5 for further details.
I am trying to extract a fixed (and known) number of clusters from a set of points using Matlab.
An immediate clustering method that I tried is the k-means algorithm, which seems to tick all the boxes.
Unfortunately, in some cases, the subsets (or clusters) extracted are intertwined, as shown in the image below for the left-most cluster:
[image: k-means result in which the left-most cluster is intertwined with a neighbouring cluster]
Is there a way to configure the k-means algorithm so that the generated clusters are disconnected?
Is there a way to post-process the cluster indices returned by the k-means algorithm, so as to obtain "disconnected" clusters?
Alternatively, is there another clustering method that might be more suitable?
Thanks!
I have been experimenting with Lumer-Faieta clustering and I am getting promising results.
However, as the clusters form, how do I identify the final clusters? Do I run another clustering algorithm on the result (which seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster as the data points that dominate its neighborhood. The problem with this is that if a cluster gets broken up into separate regions, the pieces still share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step. The effect is as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both values; see the sketch below). Once the clusters have been identified, the result can be projected back into the n-dimensional space.
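A minimal sketch of that post-processing step, assuming MATLAB's dbscan function (Statistics and Machine Learning Toolbox, R2019a or later) and a hypothetical matrix G holding the 2D grid positions of the data points:

```matlab
% G: hypothetical 2D grid positions produced by the Lumer-Faieta pass,
% one row per data point.
G = [1 1; 1 2; 2 1; 10 10; 10 11; 11 10];

epsilon = 3;                            % neighbourhood radius on the grid
minpts  = 3;                            % minimum neighbours for a core point
labels  = dbscan(G, epsilon, minpts);   % -1 marks noise points

% Each positive label is one final cluster; the row indices map the
% grid points back to the original n-dimensional data.
```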
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.
I am building an online news clustering system using the Lucene and Mahout libraries in Java. I intend to use the vector space model and tf-idf weights for k-means (or fuzzy/streaming k-means).

My plan is: cluster the initial articles, then assign each new article to the cluster whose centroid is closest, subject to a small distance threshold. The leftover documents that aren't associated with any old cluster form new data (new topics); cluster them separately among themselves and add these temporary cluster centroids to the previous centroids. Less frequently, execute the full batch clustering to recluster the entire set of documents.

The problem arises when comparing a new article to a centroid in order to assign it to an old cluster: the centroid's dimension is the number of distinct words in the initial data, but the dimension of the new article is different. I am following the book Mahout in Action. Is there any approach, or some sort of feature extraction, to handle this? The following similar questions also remain unanswered:
https://stats.stackexchange.com/questions/41409/bag-of-words-in-an-online-configuration-for-classification-clustering
https://stats.stackexchange.com/questions/123830/vector-space-model-for-online-news-clustering
Thanks in advance
Increase the dimensionality as desired, using 0 as new values.
From a theoretical point of view, consider the vector space as infinite dimensional.
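A minimal sketch of this idea, using hypothetical names (vocabOld, vocabNew, vNew) for illustration:

```matlab
% Align a new article's tf-idf vector with the old vocabulary by
% treating every unseen dimension as 0.
vocabOld = {'market','election','sports'};   % vocabulary of the centroids
vocabNew = {'election','weather'};           % terms of the new article
vNew     = [0.7 0.3];                        % the article's tf-idf weights

aligned = zeros(1, numel(vocabOld));         % old terms absent here stay 0
[found, pos] = ismember(vocabOld, vocabNew);
aligned(found) = vNew(pos(found));

% Genuinely new terms ('weather') would likewise extend every existing
% centroid with zero entries, so distances remain well defined.
```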
I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping off the labels and using the data to benchmark a few clustering algorithms, after which I plan to use external validation methods. I will run each algorithm with different initial configurations, say 50 times, and then take the mean value. Across those 50 runs, the algorithm labels the data points of one and the same cluster with different numbers. Since the cluster labels can change from run to run, and each run may also produce slightly different cluster assignments, how can I remap the clusters to one uniform numbering?
My primary idea is to remap each cluster to the class label it overlaps the most, but this can produce incorrect remappings: when the classes have roughly equal numbers of points, it will not work.
Another idea is to keep the labels during clustering but make the clustering algorithm ignore them; that way, all the clustered data would carry their label tags. This is doable, but I already have benchmarked cluster assignment data to process, so I am trying to avoid modifying my implementations of the cluster analysis algorithms to carry the label tags through and then re-benchmarking them (which would take quite some time and CPU).
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
In the domain I am studying (metaheuristic clustering algorithms), I could not find a paper comparing these indexes. The one paper that does compare them seems to have incorrect values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty (for example, the Adjusted Rand Index or Normalized Mutual Information); for details, see Wikipedia.
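For instance, the Adjusted Rand Index is computed from the contingency table of the two labelings and needs no cluster-to-class mapping at all. A from-scratch sketch with hypothetical label vectors:

```matlab
% Adjusted Rand Index between two labelings; no label remapping needed.
true_lab = [1 1 1 2 2 2 3 3];            % hypothetical ground truth
pred_lab = [2 2 1 1 3 3 3 3];            % hypothetical clustering result

n  = numel(true_lab);
ct = crosstab(true_lab, pred_lab);       % contingency table

nij = sum(sum(ct .* (ct - 1))) / 2;      % pairs grouped together in both
a   = sum(ct, 2);  b = sum(ct, 1);       % row and column sums
ai  = sum(a .* (a - 1)) / 2;             % pairs together in the truth
bj  = sum(b .* (b - 1)) / 2;             % pairs together in the prediction
expected = ai * bj / (n * (n - 1) / 2);
maxidx   = (ai + bj) / 2;

ARI = (nij - expected) / (maxidx - expected)   % 1 = identical partitions
```

Averaging this over the 50 runs gives a single score per algorithm without ever touching the label numbering.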
I am unable to find details about how MATLAB's k-means handles seeds. Specifically: does MATLAB's k-means recompute the cluster assignment of the seeds Xs, which are a subset of the data set matrix X?
Or are these seeds only used as initial center locations and not reconsidered during the k-means cluster assignment phase?
I want to implement semi-supervised clustering by seeding, as described by Sugato Basu et al.
It might be a naive question, but your answer would help clear up this confusion.
Thanks in advance.
Did you check the documentation (doc kmeans)? There they use the term "initial cluster centroid positions" to refer to seeds.
In particular, look at the parameter named 'Start', which is used to specify the seeds, and the 'Replicates' parameter. Also see the Algorithms section, where they discuss the two phases of the process (batch update and online update). Finally, and perhaps best of all, you can look at the code directly with edit kmeans and step through it with the debugger.
It's not clear to me exactly what your question is but, from the above, I would answer that the seeds are calculated once according to the 'Start' parameter, followed by the batch update and online update. This is repeated according to the 'Replicates' parameter (see the sketch below).
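A minimal sketch of passing explicit seeds (rows of X, as in the question) through 'Start'; the toy data are my own:

```matlab
% Use hand-picked rows of X as seeds via the 'Start' option.
rng(0);
X     = [randn(30,2); randn(30,2) + 4];
seeds = X([1 31], :);                    % k = 2 seed points taken from X

% The seeds only initialize the centroids; like every other row of X,
% they are re-assigned freely during the batch and online updates.
[idx, C] = kmeans(X, 2, 'Start', seeds);
```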
I don't know what "semi-supervised clustering by seeds" is, but I'm pretty sure it's not supported out of the box.