Interpreting the result of StreamingKMeans in mahout 0.8 - streaming

What I want to achieve is simply find out which input points are included in a given cluster!?
I have a personal dataset which contains some documents that are grouped in 12 clusters manually.
I know how to interpret kmenas result in mahout .7 with using namedVector class and one of dumpers (like clusterdumper). after clustering using kmeans driver, a directory named clusteredPoints has created which contains clustering result and using clusterDumper, you can see the created clusters and the points that are in each one. in below link there is a good solution for this :
How to read Mahout clustering output
But, as I mentioned in title I want to have this capability to interpret Streaming Kmeans result which is a new feature in mahout .8.
In this feature, it uses a Centroid class for holding data points and each cluster seeds. The generated result of StreamingKMeans algorithm is only a sequence file which is constructed of centroid vectors + keys and weights of each cluster. And in this output there is no information of input data points to know the distribution of them between clusters. However, it is not possible to me to get a sense of accuracy of clustering.
by the way, How to get this information in clustering output ? It is not implemented or just I failed to find and use prepared soulution? How can I analysis the result of streamingKMeans?
thanks.

Related

Clustering in Weka

I have some data collected using an online survey. Therefore, there are no classes/labels in the data to evaluate clustering results. I am trying to do the clustering in order to cluster participants in some groups for another task.
In the data, I have 10 attributes like: Age, Gender, etc., and 111 examples or data-points.
It's my first time to perform clustering and it's been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I have tried to cluster the data using all attributes, all types of clustering in Weka (like cobweb, EM .. etc) and using different cluster numbers (1-10). And When I visualise the clusters, they don't make any sense and the data are widely spread between x and y axis.
I have applied PCA and selected different number of attribute combinations according to the ranks obtained in PCA. The best clustering result was obtained using k-means and with only 2 combinations of attributes and the number of clusters selected was 3, and seed was 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster data correct? If not please give me advice/s
Is this considered as a good clustering result?
How can I optimise or enhance my clusters?
What is meant with seed in Weka clustering?

How to decide the numbers of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I also do not care about the number of resulting clusters nor the size of the clusters but only that the cluster centroids do not exceed a certain distance.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit, predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

Clustering cancer gene expression data at the same time?

I'm working at gene expression data clustering techniques and I have downloaded 35 datasets from web,
We have 35 datasets that each of them represents a type of cancer. Each dataset has its own features. Some of these datasets are shared in several features, and some of them do not share anything from the viewpoint of features.
My question is, how do we ultimately cluster these data, while many of them do not have the same characteristics?
I think that we do the clustering operation on all 35 datasets at the same time.
Is my idea correct?
any help is appreciated.
I assume that when you say heterogenous it'll be things like different gene expression platforms where different genes are present.
You can use any clustering technique, but you'll need to write your own distance metric that takes into account heterogeneity within your dataset. For instance, you could use the correlation of all the genes that are in common between pairwise samples, create a distance matrix from this, then use something like hierarchical clustering on this distance matrix
I think there is no need to write your own distance metric. There already exists plenty of distance metrics that can work for mixed data types. For instance, the gower distance works well for mixed data type. See this post on the same. But if your data contains only continuous values then you can use k-means. You'll also be better off, if the data is preprocessed first.

Incorporating new articles in tfidf vector for online clustering

I am building an Online news clustering system using Lucene and Mahout libraries in java. I intend to use vector space model and tfidf weights for Kmeans(or fuzzy/streamKmeans). My plan is : Cluster initial articles,assign new article to the cluster whose centroid is closest based on a small distance threshold. The leftover documents that aren’t associated with any old clusters form new data(new topics). Separately cluster them among themselves and add these temporary cluster centroids to the previous centroids. Less frequently, execute the full batch clustering to recluster the entire set of documents. The problem arises in comparing a new article to a centroid to assign it to an old cluster. The centroid dimension is number of distinct words in initial data. But the dimension of new article is different. I am following the book Mahout in Action. Is there any approach or some sort of feature extraction to handle this. The following similar links still remain unanswered:
https://stats.stackexchange.com/questions/41409/bag-of-words-in-an-online-configuration-for-classification-clustering
https://stats.stackexchange.com/questions/123830/vector-space-model-for-online-news-clustering
Thanks in advance
Increase the dimensionality as desired, using 0 as new values.
From a theoretical point of view, consider the vector space as infinite dimensional.

MATLAB: K Means clustering With varying centroids

I'm created a code book based on k-means clustering algorithm.But the algorithm didn't converge to an optimal code book, each time, the cluster centroids are varying(because of random selection of initial seeds). There is an option in Matlab to give an initial matrix to K-Means.But how we can can select the initial code book from a large data set? Is there any other way to get a unique code book using K-means?
It's somewhat standard to run k-means multiple times using different initial states (e.g., initial seeds) and choose the result with the lowest error as the best result.
It's also typical to seed k-means by randomly choosing k elements from your data set as the initial seeds.
Since by default MATLAB's K-Means uses the K-MEans++ algorithm for initialization it means it uses random numbers.
Hence each call (For sequential calls) to K-Means will probably produce different results.
You have 3 options to make this deterministic:
Set MATLAB's Random Number Generator state to certain state before calling K-Means.
Use the stream option in K-Means options to set the stream inside K-Means.
Write your own version of K-Means which uses a deterministic way to initialize K-Means.