term clustering library? - text-processing

Does anybody know an open-source/free library that does term clustering?
Thanks,
yaniv

Apache Mahout provides algorithms for clustering.

Check out NLTK. There are a number of clustering modules that might work for you.

WEKA has a whole suite of tools for text processing along with clustering.

If you're into Python there is NLTK, as already mentioned by its author, but there is also scikit-learn, which provides much more than just clustering. (Link takes you to text-applicable examples.)

Python's scikit-learn has some dedicated packages for text analysis. It also provides a complete suite of clustering algorithms, including k-means, affinity propagation, mean shift, spectral clustering, hierarchical clustering, and DBSCAN (with appropriate evaluation metrics). This may be helpful for your term clustering task.
Link to Scikit Learn latest video tutorial
Link to Scikit Learn Book
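For a concrete starting point, here is a minimal sketch of term clustering with scikit-learn, assuming you characterize each term by the documents it appears in (the corpus below is made up for illustration):

    # Minimal sketch: cluster terms by their document-occurrence profiles.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Illustrative corpus; replace with your own documents.
    corpus = [
        "the striker scored a late goal in the match",
        "parliament passed the new budget after a long debate",
        "the central bank raised interest rates again",
        "the team won the championship final",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

    # Transpose so each row is a term, described by the documents it occurs in.
    terms = vectorizer.get_feature_names_out()
    term_vectors = X.T

    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(term_vectors)

    for cluster_id in range(3):
        print(cluster_id, [t for t, l in zip(terms, labels) if l == cluster_id])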

Related

calculating clustering validity of k-means using rapidminer

Well, I have been studying up on the different algorithms used for clustering, like k-means, k-medoids, etc., and I was trying to run the algorithms and analyze their performance on the leaf dataset right here:
http://archive.ics.uci.edu/ml/datasets/Leaf
I was able to cluster the dataset via k-means by first reading the CSV file, filtering out unneeded attributes, and applying k-means to it. The problem I am facing is that I wish to calculate measures such as entropy, precision, recall, and f-measure for the model developed via k-means. Is there an operator available that allows me to do this, so that I can quantitatively compare the different clustering algorithms available in RapidMiner?
P.S. I know about performance operators like Performance (Classification) that allow me to calculate precision and recall for a model, but I don't know of any that allow me to calculate entropy.
Help would be much appreciated.
The short answer is to use R. Here's a link to a book chapter about this very subject. There is a revised version coming soon that works for the most recent version of RapidMiner.
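For the entropy part specifically, note that cluster entropy is easy to compute by hand once you export the cluster assignments together with the true class labels; here is a minimal Python sketch (outside RapidMiner, with made-up labels):

    # Minimal sketch: entropy of a clustering w.r.t. ground-truth classes.
    # Replace the illustrative labels with assignments exported from your tool.
    from collections import Counter
    from math import log2

    true_labels    = ["a", "a", "b", "b", "b", "c", "c", "a"]
    cluster_labels = [ 0,   0,   1,   1,   0,   2,   2,   2 ]

    def clustering_entropy(clusters, classes):
        n = len(classes)
        total = 0.0
        for c in set(clusters):
            members = [cls for cl, cls in zip(clusters, classes) if cl == c]
            counts = Counter(members)
            # Entropy of the class distribution inside this cluster...
            h = -sum((k / len(members)) * log2(k / len(members))
                     for k in counts.values())
            # ...weighted by the cluster's share of the data.
            total += (len(members) / n) * h
        return total

    print(clustering_entropy(cluster_labels, true_labels))  # lower is better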

RapidMiner and WEKA: different clustering results

I am new to data mining analytics and machine learning. I have been trying to compare the use of predictive analysis and clustering analysis using RapidMiner and Weka for my college assignment.
Just after studying the advantages and disadvantages of both tools and starting the analysis, I ran into some problems. I tried clustering using K-Means (RapidMiner) and SimpleKMeans (Weka), and regression analysis using LinearRegression, and I am not quite satisfied with the results, since they differ significantly even though I used the same numerical dataset for all of them.
I have been spending a lot of time trying to figure something out by studying the initialization of each algorithm in each tool, since the interfaces are different and some parameters exist in RapidMiner but not in Weka, or vice versa, so I am a bit confused. (Is that the problem?)
Apart from that, what do you think is wrong? Is there some initialization step that I missed? Or is it because the code is different in each tool, even though they use the same algorithm?
Thank you for your answer!
Weka often uses built-in normalization, at least in k-means and some other algorithms.
Make sure you have disabled this if you want to make results comparable.
Also understand that k-means is a randomized algorithm. Different results even from the same package are to be expected (and desirable).
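To see both points concretely, here is a small scikit-learn sketch (not Weka or RapidMiner, but the algorithmic behavior is the same): scaling the features and changing the random seed both change the partition, which is why cross-tool comparisons need identical preprocessing and initialization:

    # Minimal sketch: k-means depends on feature scaling and on the
    # random initialization. Data here is synthetic.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    X = np.column_stack([
        rng.normal(0, 1, 300),     # feature on a small scale
        rng.normal(0, 100, 300),   # feature on a much larger scale
    ])

    raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(X))
    reseeded = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

    # Agreement of 1.0 means identical partitions (up to label renaming).
    print(adjusted_rand_score(raw, scaled))    # usually well below 1.0
    print(adjusted_rand_score(raw, reseeded))  # may differ across seeds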
Did you use WEKA itself or RapidMiner's WEKA extension? Did you try to compare the results of WEKA with RM WEKA?

Are there any implementations available online for filter based feature selection methods?

The selection methods I am looking for are ones based on subset evaluation (i.e., methods that do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses the min-features bias [2] which, similarly to CfsSubsetEval, results in the selection of too few features.
I know it is "too few" because I have built classification models with larger subsets, and their classification performance was relatively much better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] H. Liu and L. Yu, Toward Integrating Feature Selection Algorithms for Classification and Clustering, 2005.
Check out Python's scikit-learn: simple and efficient tools for data mining and data analysis. It has various implemented methods for feature selection, classification, and evaluation, plus a lot of documentation and tutorials.
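As a caveat, scikit-learn's subset-evaluating selectors are wrapper-style (they score candidate subsets with an estimator) rather than filter-style like CFS, but they may still be worth a try; a minimal sketch with SequentialFeatureSelector on a built-in dataset:

    # Minimal sketch: greedy subset-based feature selection in scikit-learn.
    # Note: this is a wrapper method, not a filter method such as CFS.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    selector = SequentialFeatureSelector(
        estimator,
        n_features_to_select=10,  # size of the subset to search for
        direction="forward",      # greedily grow the subset
        cv=5,                     # score each candidate subset by cross-validation
    )
    selector.fit(X, y)
    print(selector.get_support(indices=True))  # indices of the selected subset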
My search has led me to the following implementations:
FEAST toolbox: an interesting toolbox, developed by the University of Manchester, that provides implementations of Shannon information-theoretic functions. The implementations can be downloaded from THIS webpage, and they can be used to evaluate individual features as well as subsets of features.
I have also found THIS Matlab code, which is an implementation of a selection algorithm based on Interaction Information.
PY_FS: A Python Package for Feature Selection
I came across this package, which was just released (2021) and contains many methods with references to their original papers.

text classification methods? SVM and decision tree

I have a training set and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports, politics, economics, and so on.
I understand Naive Bayes and kNN completely, but SVM and decision trees are vague to me, and I don't know whether I can implement these methods myself, or whether there are applications that provide them.
What is the best method I can use for classifying documents in this way?
Thanks!
Naive Bayes
Though this is the simplest algorithm and all features are deemed independent, in real text classification cases this method works great. I would try this algorithm first for sure.
KNN
Be careful not to confuse kNN, which is a classification method, with k-means, which does clustering; the two concepts are easy to mix up.
SVM
SVM has SVC (classification) and SVR (regression) variants for classification and prediction. It sometimes works well, but in my experience it performs poorly on text classification, because it demands good tokenizers (filters), and the vocabulary of a real dataset always contains dirty tokens, so the accuracy ends up really bad.
Random Forest (decision tree)
I've never tried this method for text classification. I think a decision tree needs a few key nodes, while it's hard to find "a few key tokens" for text classification, and random forests work badly on highly sparse, high-dimensional data.
FYI
These are all from my experience; for your case, there is no better way to decide which method to use than to try every algorithm and fit your model.
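In that spirit, here is a minimal scikit-learn sketch that tries several of the methods discussed above on the same text data (the 20 newsgroups dataset stands in for your news corpus):

    # Minimal sketch: compare Naive Bayes, kNN, a linear SVM, and a random
    # forest on the same text classification task via cross-validation.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    data = fetch_20newsgroups(
        subset="train",
        categories=["rec.sport.hockey", "talk.politics.misc", "sci.med"],
        remove=("headers", "footers", "quotes"),
    )

    for name, clf in [
        ("naive_bayes", MultinomialNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("linear_svm", LinearSVC()),
        ("random_forest", RandomForestClassifier(n_estimators=100)),
    ]:
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, data.data, data.target, cv=3)
        print(name, round(scores.mean(), 3))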
Apache's Mahout is a great tool for machine learning algorithms. It covers three kinds of algorithms: recommendation, clustering, and classification. You could try this library, but you have to learn some basic knowledge about Hadoop.
As for machine learning in general, Weka is a software toolkit for experiments that integrates many algorithms.
Linear SVMs are one of the top algorithms for text classification problems (along with Logistic Regression). Decision Trees suffer badly in such high-dimensional feature spaces.
The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
EDIT: Multinomial Naive Bayes also works well on text data, though not usually as well as Linear SVMs. kNN can work okay, but it's a slow algorithm and never tops the accuracy charts on text problems.
If you are familiar with Python, you may consider NLTK and scikit-learn. The former is dedicated to NLP, while the latter is a more comprehensive machine learning package (but it has a great inventory of text processing modules). Both are open source and have great community support on SO.

hierarchical k-means clustering for SIFT vectors

Hi all,
I am looking to apply the same approach as David Nister and Henrik Stewenius in http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/scalable.pdf
In this paper, they use a large number of SIFT vectors (128-D) as input to hierarchical k-means clustering to construct a hierarchical visual vocabulary tree.
Does anyone know a good library that I can use to do this clustering?
PS: the number of input SIFT descriptors is high (70,000,000), and I want the result to be a vocabulary tree with 1,000,000 leaf nodes.
Thanks very much.
Regards.
The ClusterQuantiser tool in OpenIMAJ should be able to do this if the data is in a supported format. If the tool can't work with your data out of the box, then you could write a driver for the org.openimaj.ml.clustering.kmeans.HierarchicalByteKMeans class (in the svn trunk version) or the org.openimaj.ml.clustering.kmeans.HByteKMeans class in the 1.0.5 release. Both versions of the class support streaming data from disk, so you don't need to hold all the features in memory!
For completeness, vlfeat also has a hierarchical k-means implementation, but I'm not sure how much it scales.
From practical experience, you might also consider sampling the features before clustering. I'm not sure that you'll get much benefit from clustering them all.
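If you do end up rolling your own, the core of hierarchical k-means is just recursive partitioning; here is a toy-scale Python sketch of the idea (the OpenIMAJ and vlfeat implementations above handle the out-of-core and performance concerns that 70,000,000 descriptors require):

    # Minimal sketch: hierarchical k-means vocabulary tree. Recursively
    # split the data with k-means; the leaves form the visual vocabulary.
    # With branching=10 and depth=6 you would get up to 10**6 leaves.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def build_tree(vectors, branching=10, depth=3):
        if depth == 0 or len(vectors) < branching:
            return {"leaf": True, "center": vectors.mean(axis=0)}
        km = MiniBatchKMeans(n_clusters=branching, n_init=3).fit(vectors)
        children = []
        for c in range(branching):
            subset = vectors[km.labels_ == c]
            if len(subset):
                children.append(build_tree(subset, branching, depth - 1))
        return {"leaf": False, "centers": km.cluster_centers_, "children": children}

    # Toy stand-in for SIFT data: 10,000 random 128-D descriptors.
    descriptors = np.random.rand(10_000, 128).astype(np.float32)
    tree = build_tree(descriptors, branching=10, depth=3)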