Why K-means is so used for document clustering? - cluster-analysis

Can someone explain to me why the K-means algorithm is so used (especially in documents clustering) despite its defects, instead of K-medoids for example, or CAH, SOM etc.?

Related

How to update datasets without re-clustering the whole datatsets after clustering finished?

What I used is spectral clustering. What should I do to avoid clustering the whole datasets?
There are papers on how to infer the spectral embedding for new data points, e.g.,
Bengio, Y., Paiement, J. F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2004). Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in neural information processing systems (pp. 177-184).
You can then assign them to the nearest k-means cluster and update the means.
But implementing this will require quite some coding work on your behalf. In particular, in order to get this fast.
Clustering isn't really meant to be updatable, and not is spectral embedding, so it is worth looking into alternate algorithms, and to reconsider your objective whether you really need to have this.

Which clustering algorithms can be used with Word Mover's Distance from M. Kusner's paper?

I am new to machine learning and now I am interested in document clustering (short texts with different lengths) according to their semantic similarity (I just want to go beyond the standard TF/IDF approach). I read the paper http://proceedings.mlr.press/v37/kusnerb15.pdf where the Word Mover's distance for word embeddings is explained. In the paper they used it for classification. My question is now - can I use it for clustering? If so, is there a paper where this kind of usage is discribed?
P.S.: I am basically interested in clustering which takes into account the semantic similarity, so even a word2vec or doc2vec approach will do the job - I just couldn't find any papers where they are used in a clustering problem.
If you could afford to compute an entire distance matrix, then you could do hierarchical clustering, for example.
It's easy today find other clusterings that accept any distance and use a threshold. These could even use the bounds for performance. But it's not obvious that they will work on such data.

Moving Data From Scikit-Learn to Elki for Clustering

I have 100,000 sentences that I've processed into TF-IDF vectors using scikit-learn's TfidfVectorizer with highly customized stopwords and nlp stemming. My goal is to cluster the sentences using dbscan or another density-based cluster to discover similar sentences.
In scikit-learn's dbscan implementation, I run out of memory when I cluster more than 40,000 sentences. I have seen suggestions to use ELKI's Java clustering GUI. I'd like to try clustering in Java, but I cannot find a method for moving my TF-IDF vectors from Python to ELKI. ELKI's documentation states that it can handle sparse vectors in a particular format or in .arff.
Most concrete question. Can anyone suggestion how to move TFIDF vectors from scikit-learn into a format that can be loaded into ELKI.
Will ELKI better manage memory than scikit-learn? Or is this pointless work?

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is which clustering algorithm will be best for such kind of data to form clusters ?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The more simple way to do clustering is by kmeans algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would have to find a distance measure for caterogical or nominal attributes, but still kmeans is a good choice. Kmeans is a partitional clustering algorithm... i wouldn't use hierarchical clustering for this case. But that also depends on what you want to do. you need to evaluate if you want to find clusters within clusters or they all have to be totally apart from each other and not included on each other.
Take care.
1) First, try with k-means. If that fulfills your demand that's it. Play with different number of clusters (controlled by parameter k). There are a number of implementations of k-means and you can implement your own version if you have good programming skills.
K-means generally works well if data looks like a circular/spherical shape. This means that there is some Gaussianity in the data (data comes from a Gaussian distribution).
2) if k-means doesn't fulfill your expectations, it is time to read and think more. Then I suggest reading a good survey paper. the most common techniques are implemented in several programming languages and data mining frameworks, many of them are free to download and use.
3) if applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think by yourself or associate with a machine learning expert.
Since most of your data is continuous, and it reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions on the distributions from which those average were calculated.

Clustering classifier and clustering policy

I was going through the K-means algorithm in mahout and when debugging, I noticed that when creating the first clusters it does this following code:
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
I was reading the description of these classes and it was not clear for me...
I was wondering what is the meaning of these cluster classifier and policy?
is it related with hierarchical clustering, centroid based clustering, distribution based
clustering etc?
Because I do not know what is the benefit or the reason of using this cluster classifier and policy when using K-means mahout implementation.
The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the difference between these algorithms.
Mahout is not a good place to study the k-means algorithm, the implementation is quite a mess. It's also slow. As in really really slow. Most of the time, a single CPU implementation will outright beat Mahout on anything that fits into memory. Maybe even on disk of a single machine. Because of all the map-reduce overhead.