I have 100,000 sentences that I've processed into TF-IDF vectors using scikit-learn's TfidfVectorizer with a highly customized stop-word list and NLP stemming. My goal is to cluster the sentences using DBSCAN or another density-based clustering algorithm to discover similar sentences.
With scikit-learn's DBSCAN implementation, I run out of memory when I cluster more than 40,000 sentences. I have seen suggestions to use ELKI's Java clustering GUI. I'd like to try clustering in Java, but I cannot find a way to move my TF-IDF vectors from Python to ELKI. ELKI's documentation states that it can handle sparse vectors in a particular format or in .arff.
My most concrete question: can anyone suggest how to move TF-IDF vectors from scikit-learn into a format that can be loaded into ELKI?
Will ELKI manage memory better than scikit-learn, or is this pointless work?
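So far the .arff path seems most promising. Here is a sketch of the export I have in mind, assuming ELKI's ARFF parser accepts Weka-style sparse ARFF (I have not verified this against ELKI; the file and relation names are my own choices):

# Sketch: export a scikit-learn TF-IDF matrix as a Weka-style sparse ARFF file.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the cat sat on the mat", "dogs chase cats"]  # stand-in data
vectorizer = TfidfVectorizer()  # substitute the customized stopwords/stemming setup
X = vectorizer.fit_transform(sentences).tocsr()  # scipy.sparse CSR matrix
X.sort_indices()  # sparse ARFF wants ascending attribute indices per row

with open("tfidf.arff", "w") as f:
    f.write("@RELATION tfidf\n\n")
    # Real terms may need ARFF quoting/escaping; plain words are fine as-is.
    for term in vectorizer.get_feature_names_out():  # get_feature_names() on older scikit-learn
        f.write("@ATTRIBUTE %s NUMERIC\n" % term)
    f.write("\n@DATA\n")
    for i in range(X.shape[0]):
        row = X.getrow(i)
        # Sparse ARFF rows list only the non-zero entries as "index value" pairs.
        pairs = ", ".join("%d %.6f" % (j, v) for j, v in zip(row.indices, row.data))
        f.write("{%s}\n" % pairs)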
Related
Can someone explain to me why the K-means algorithm is so widely used (especially in document clustering) despite its defects, instead of, for example, K-medoids, hierarchical agglomerative clustering (CAH), SOM, etc.?
Well, I have been studying the different algorithms used for clustering, like k-means, k-medoids, etc., and I was trying to run the algorithms and analyze their performance on the leaf dataset here:
http://archive.ics.uci.edu/ml/datasets/Leaf
I was able to cluster the dataset via k-means by first reading the CSV file, filtering out unneeded attributes, and applying k-means to it. The problem I am facing is that I wish to calculate measures such as entropy, precision, recall, and f-measure for the model developed via k-means. Is there an operator available that allows me to do this, so that I can quantitatively compare the different clustering algorithms available in RapidMiner?
P.S. I know about performance operators like Performance (Classification) that allow me to calculate precision and recall for a model, but I don't know of any that allow me to calculate entropy.
Help would be much appreciated.
The short answer is to use R. Here's a link to a book chapter about this very subject. There is a revised version coming soon that works for the most recent version of RapidMiner.
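If you end up computing the measures yourself rather than through an operator, the entropy calculation is simple enough to hand-roll. A sketch in Python (the helper below is my own, not part of RapidMiner or R); lower weighted entropy means purer clusters:

# Sketch: weighted cluster entropy against ground-truth class labels.
import math
from collections import Counter

def cluster_entropy(labels_true, labels_pred):
    n = len(labels_true)
    total = 0.0
    for c in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == c]
        counts = Counter(members)
        # Entropy of the class distribution within this cluster.
        h = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in counts.values())
        total += (len(members) / n) * h  # weight by cluster size
    return total

# Two clusters: one pure, one mixed 2:1.
print(cluster_entropy([0, 0, 1, 1], [0, 0, 1, 0]))  # ~0.69

Once you map each cluster to its majority class, precision, recall, and f-measure can come from any standard classification-metrics implementation.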
I was going through the K-means algorithm in Mahout and, when debugging, I noticed that when creating the first clusters it runs the following code:
// Policy object encapsulating the k-means-specific rules (e.g. the convergence threshold)
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
// Classifier wrapping the initial (prior) clusters together with that policy
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
// Persist the prior state as sequence files for the clustering iterations to read
prior.writeToSeqFiles(priorClustersPath);
I was reading the description of these classes, and it was not clear to me what the meaning of this cluster classifier and policy is. Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, etc.? I do not understand the benefit of, or the reason for, using this cluster classifier and policy in Mahout's K-means implementation.
The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the differences between those algorithms.
Mahout is not a good place to study the k-means algorithm; the implementation is quite a mess. It's also slow, as in really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on data on the disk of a single machine, because of all the map-reduce overhead.
I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N = column-specific attributes of each run, M = run number). I have looked at PCA and K-means for differentiating the two, but I was curious about best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
It sounds to me like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case, k-nearest neighbors (as mentioned by #amas) is probably the approach most like k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
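For concreteness, a sketch of both routes on synthetic stand-in data (recent scikit-learn assumed; replace the random arrays with your own run features):

# Sketch: unsupervised vs. supervised treatment of two known populations.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)  # known population membership

# Unsupervised route: K-means with K=2 and multiple random starts (n_init),
# then check how well the found clusters match the known classes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster/class agreement:", adjusted_rand_score(y, labels))

# Supervised route: train a classifier directly on the labeled data.
clf = SVC().fit(X, y)
print("training accuracy:", clf.score(X, y))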
It really depends on the data. But just so you know, K-means does get stuck in local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, as with other spectral methods, you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points, see how it works, and then predict and learn each of the new samples with K-NN (I don't know if it is useful for your case).
Check lazy learners and K-NN for prediction.
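A sketch of that pipeline with scikit-learn (synthetic stand-in data):

# Sketch: k-means with multiple random restarts, then K-NN to assign
# new samples to the discovered clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])

# n_init restarts k-means from several random starting points and keeps
# the best (lowest-inertia) solution, which mitigates local minima.
km = KMeans(n_clusters=2, n_init=20, random_state=0).fit(X)

# Learn the discovered cluster labels, then predict membership of new samples.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, km.labels_)
new_samples = rng.normal(1.5, 1, (3, 5))
print(knn.predict(new_samples))

(KMeans itself also has a predict method; K-NN is shown only because that is the lazy-learner route suggested above.)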
All
I am looking to apply the same approach as David Nister and Henrik Stewenius in http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/scalable.pdf
In this paper, they use a large number of SIFT vectors (128-D) as input to hierarchical k-means clustering to construct a hierarchical visual vocabulary tree.
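To illustrate the kind of structure I mean, here is a toy sketch (recursive k-means with branch factor k; with k=10 and depth 6 one would get the 1,000,000 leaves I need, but the tiny values below obviously do not scale to my data):

# Toy sketch of a vocabulary tree: recursive k-means with branch factor k.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, k=4, depth=3):
    if depth == 0 or len(descriptors) < k:
        return None  # leaf node
    km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(descriptors)
    children = []
    for c in range(k):
        # Recurse on the descriptors assigned to each of the k centers.
        subset = descriptors[km.labels_ == c]
        children.append((km.cluster_centers_[c], build_tree(subset, k, depth - 1)))
    return children

fake_sift = np.random.rand(2000, 128).astype(np.float32)  # stand-in descriptors
tree = build_tree(fake_sift, k=4, depth=3)  # 4^3 = 64 leaves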
Does anyone know a good library that I can use to do this clustering?
P.S.: the number of input SIFT descriptors is high (70,000,000), and I want the result to be a vocabulary tree with 1,000,000 leaf nodes.
Thanks very much.
Regards.
The ClusterQuantiser tool in OpenIMAJ should be able to do this if the data is in a supported format. If the tool can't work with your data out of the box, then you could write a driver for the org.openimaj.ml.clustering.kmeans.HierarchicalByteKMeans class (in the svn trunk version) or the org.openimaj.ml.clustering.kmeans.HByteKMeans class in the 1.0.5 release. Both versions of the class support streaming data from disk, so you don't need to hold all the features in memory!
For completeness, vlfeat also has a hierarchical k-means implementation, but I'm not sure how much it scales.
From practical experience, you might also consider sampling the features before clustering. I'm not sure that you'll get much benefit from clustering them all.
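For example (a sketch; the sample size below is an arbitrary placeholder, not a recommendation):

# Sketch: cluster a random subsample of the descriptors instead of all 70M.
import numpy as np

def sample_descriptors(descriptors, n_sample=2000000, seed=0):
    rng = np.random.RandomState(seed)
    n = min(n_sample, len(descriptors))
    # Sample without replacement so no descriptor is counted twice.
    idx = rng.choice(len(descriptors), size=n, replace=False)
    return descriptors[idx]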