I'm currently working on a classification task using language modelling. The first part of the project involved using n-gram language models to classify documents with C5.0. The final part of the project requires me to use cross-entropy to model each class and classify test cases against these models.
Does anyone have experience in using cross-entropy, or links to information about how to use a cross-entropy model for sampling data? Any information at all would be great! Thanks
You can find theoretical background on using cross-entropy with language models in various textbooks, e.g. "Speech and Language Processing" by Jurafsky & Martin, pages 116-118 in the 2nd edition.
As for concrete usage: most language modeling tools do not measure the cross-entropy directly, but rather the perplexity, which is the exponential of the cross-entropy. The perplexity, in turn, can be used to classify documents. See, e.g., the documentation for the 'evallm' command in SLM, the Carnegie Mellon University language modeling toolkit (http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html)
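As an illustration of the idea (not tied to the SLM toolkit), here is a toy sketch: train one add-one-smoothed unigram model per class and assign a test document to the class whose model gives it the lowest perplexity. All data, class names, and function names here are made up for the example:

```python
import math
from collections import Counter

def train_unigram(docs):
    """Count word frequencies over all training docs for one class."""
    counts = Counter(w for doc in docs for w in doc.split())
    return counts, sum(counts.values())

def cross_entropy(model, doc, vocab_size):
    """Per-word cross-entropy (bits) of doc under an add-one-smoothed unigram model."""
    counts, total = model
    words = doc.split()
    log_prob = sum(math.log2((counts[w] + 1) / (total + vocab_size)) for w in words)
    return -log_prob / len(words)

def classify(models, doc, vocab_size):
    """Pick the class whose model gives the lowest perplexity (= 2**cross-entropy)."""
    return min(models, key=lambda c: 2 ** cross_entropy(models[c], doc, vocab_size))

# toy training corpus, two classes
train = {
    "sports": ["the team won the match", "a great goal in the game"],
    "economy": ["the market fell on trade fears", "stocks rose as rates fell"],
}
vocab = {w for docs in train.values() for d in docs for w in d.split()}
models = {c: train_unigram(docs) for c, docs in train.items()}
print(classify(models, "the team scored a goal", len(vocab)))  # → sports
```

With real n-gram models the only change is how the per-word probabilities are computed; the "lowest perplexity wins" decision rule stays the same.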
good luck :)
I want to classify short pieces of text (neuroscience-related) using an ontology (NIF). I have read some papers, but none of them went through the whole procedure of performing an ontology-based classification. Therefore, I wanted to double-check and see if I got it right:
Feature extraction: first, we use the ontology to annotate (tag) the text with ontology concepts. We parse the text and annotate the terms that can be found in the ontology with the classes of those terms.
I guess the accuracy of the annotation process can be increased with techniques such as finding semantic similarities between ontology terms and terms in the text, and with lemmatization.
Using an ontology and a rule-based classification system, there is no need for learning, and we can move straight to classification.
Then, in the classification phase, since we're using a rule-based classifier, we classify the text according to the classes assigned to it. Is this correct? Also, can we move up in the ontology and use superclasses in the annotation, in order to reduce the number of tags (classes) used in the classification?
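A minimal sketch of that annotation step, including the optional roll-up to superclasses. The mini-ontology, term names, and class names below are made up for illustration (they are not real NIF identifiers), and the "lemmatization" is a crude plural-stripping stand-in for a real lemmatizer:

```python
ONTOLOGY = {  # hypothetical mini-ontology, NOT real NIF content
    "terms": {"hippocampus": "BrainRegion", "amygdala": "BrainRegion",
              "dopamine": "Neurotransmitter", "serotonin": "Neurotransmitter"},
    "parents": {"BrainRegion": "AnatomicalStructure",
                "Neurotransmitter": "MolecularEntity"},
}

def lookup(token, terms):
    """Crude stand-in for lemmatization: lowercase, strip punctuation,
    and try removing a plural 's' before matching against ontology terms."""
    t = token.lower().strip(".,;:")
    if t in terms:
        return t
    if t.endswith("s") and t[:-1] in terms:
        return t[:-1]
    return None

def annotate(text, ontology, roll_up=False):
    """Tag every term found in the ontology with its class; optionally
    replace each class with its superclass to reduce the number of tags."""
    tags = []
    for token in text.split():
        lemma = lookup(token, ontology["terms"])
        if lemma is not None:
            cls = ontology["terms"][lemma]
            if roll_up:
                cls = ontology["parents"].get(cls, cls)
            tags.append((lemma, cls))
    return tags

print(annotate("Dopamine neurons project to the hippocampus", ONTOLOGY))
# → [('dopamine', 'Neurotransmitter'), ('hippocampus', 'BrainRegion')]
```

The rule-based classifier then works off the list of assigned classes, and `roll_up=True` shows the superclass idea: fewer distinct tags at the cost of coarser distinctions.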
My other question is: are there enough benefits to using ontologies in classification? I ask because I read in a paper that using ontologies actually decreases the accuracy! What are those benefits? What does using meaningful tags from the ontology allow us to do that arbitrary terms don't?
Has anybody seen any samples of how to set up a simple application to train a DNN and then use it to recognize a limited number of voice commands without being tied to a particular language? I believe the Kaldi API is quite powerful for this, but there is a lack of documentation.
1) Take an existing DNN model or train one yourself. You can use the Tedlium recipe from Kaldi; it is free to run. It does not matter that the model is for English; it will work for other languages too.
2) Extract DNN posteriors for the training keyphrases and for the utterances you want to recognize. The nnet3-am-compute tool can be used for that: it takes a DNN model and returns phonetic or state posteriors for every frame.
3) Implement a DTW algorithm to compare the DNN posteriors. This part you have to do yourself; it is not implemented in Kaldi.
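Step 3 can be sketched as plain DTW over posteriorgrams. The toy vectors below stand in for per-frame posteriors (in practice they would come from the Kaldi tools above), and the frame distance, the negative log of the inner product, is one common choice in the query-by-example literature:

```python
import math

def frame_dist(p, q):
    """Distance between two posterior vectors: -log of their inner product."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), 1e-12))

def dtw(seq_a, seq_b):
    """Classic DTW between two sequences of per-frame posterior vectors,
    normalized by the combined length so scores are comparable."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

# toy posteriorgrams over 3 "phonetic" classes
template = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
match    = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.1, 0.8]]
mismatch = [[0.05, 0.05, 0.9], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05]]
print(dtw(template, match) < dtw(template, mismatch))  # → True
```

A command is then recognized by picking the stored keyphrase template whose DTW distance to the incoming utterance is smallest (with a rejection threshold for "no match").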
Related papers describing the algorithm:
Investigating Neural Network based Query-by-Example Keyword Spotting Approach for Personalized Wake-up Word Detection in Mandarin Chinese
Query-By-Example Spoken Term Detection Using Phonetic Posteriorgram Templates
I was trying to find evaluation mechanisms for the collaborative k-nearest-neighbour algorithm, but I am confused about how to evaluate it. How can I be sure that the recommendations made by this algorithm are correct or good? I have also developed an algorithm of my own that I want to compare with it, but I am not sure how to compare and evaluate the two. The data set I am using is MovieLens.
Your help in evaluating this recommender system will be highly appreciated.
Evaluating recommender systems is a major concern of its research and industry communities. Look at "Evaluating Collaborative Filtering Recommender Systems", a paper by Herlocker et al. The people who publish the MovieLens data (the GroupLens research lab at the University of Minnesota) also publish many papers on recsys topics, and the PDFs are often freely available at http://grouplens.org/publications/.
Check out https://scholar.google.com/scholar?hl=en&q=evaluating+recommender+systems.
In short, you should use a method that hides some data. You train your model on one portion of the data (the "training data") and test on the remainder, which your model has never seen before. There's a formal way to do this called cross-validation, but the general concept of visible training data versus hidden test data is the most important part.
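A minimal sketch of that hold-out idea on MovieLens-style (user, movie, rating) triples. The item-mean predictor below is just a stand-in baseline for your kNN recommender, and RMSE is one common accuracy metric; the data here is synthetic:

```python
import math
import random

def split_ratings(ratings, test_frac=0.2, seed=0):
    """Hide a random portion of the (user, item, rating) triples for testing."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def item_mean_predictor(train):
    """Baseline: predict an item's mean training rating (global mean as fallback)."""
    sums, counts = {}, {}
    for _, item, r in train:
        sums[item] = sums.get(item, 0.0) + r
        counts[item] = counts.get(item, 0) + 1
    global_mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: sums[item] / counts[item] if item in sums else global_mean

def rmse(predict, test):
    """Root-mean-square error over the held-out ratings only."""
    se = sum((predict(u, i) - r) ** 2 for u, i, r in test)
    return math.sqrt(se / len(test))

# synthetic MovieLens-style data: (user, movie, rating)
ratings = [(u, i, float((u + i) % 5 + 1)) for u in range(20) for i in range(10)]
train, test = split_ratings(ratings)
print(round(rmse(item_mean_predictor(train), test), 3))
```

To compare your own algorithm against kNN, evaluate both predictors on the exact same hidden test split; the one with the lower RMSE (or better ranking metric, if you care about top-N lists) wins.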
I also recommend https://www.coursera.org/learn/recommender-systems, a Coursera course on recommender systems taught by GroupLens folks. In that course you'll learn to use LensKit, a recommender systems framework in Java that includes a large evaluation suite. Even if you don't take the course, LensKit may be just what you want.
The selection methods I am looking for are the ones based on subset evaluation (i.e. do not simply rank individual features). I prefer implementations in Matlab or based on WEKA, but implementations in any other language will still be useful.
I am aware of the existence of CfsSubsetEval and ConsistencySubsetEval in WEKA, but they did not lead to good classification performance, probably because they suffer from the following limitations:
CfsSubsetEval is biased toward small feature subsets, which may prevent locally predictive features from being included in the selected subset, as noted in [1].
ConsistencySubsetEval uses the min-features bias [2], which, similarly to CfsSubsetEval, results in the selection of too few features.
I know it is "too few" because I have built classification models with larger subsets and their classification performance was relatively much better.
[1] M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, 1999.
[2] H. Liu and L. Yu, Toward Integrating Feature Selection Algorithms for Classification and Clustering, 2005.
Check out Python's scikit-learn: simple and efficient tools for data mining and data analysis. It has various implemented methods for feature selection, classification, and evaluation, plus a lot of documentation and tutorials.
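For subset (rather than per-feature) evaluation, one wrapper-style option is a greedy forward search that scores each candidate subset by classifier accuracy; in scikit-learn the ready-made version of this is SequentialFeatureSelector. Below is a dependency-free sketch on made-up toy data, using leave-one-out nearest-centroid accuracy as the subset score:

```python
import math

def centroid_accuracy(X, y, feats):
    """Leave-one-out nearest-centroid accuracy using only the given feature subset."""
    correct = 0
    for i in range(len(X)):
        by_class = {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j != i:  # hold out sample i
                by_class.setdefault(label, []).append([row[f] for f in feats])
        means = {c: [sum(col) / len(col) for col in zip(*rows)]
                 for c, rows in by_class.items()}
        probe = [X[i][f] for f in feats]
        pred = min(means, key=lambda c: math.dist(means[c], probe))
        correct += pred == y[i]
    return correct / len(X)

def forward_select(X, y, n_feats):
    """Greedy forward search: grow the subset by whichever feature helps most,
    stopping when no remaining feature improves the subset score."""
    selected, remaining, best_score = [], list(range(n_feats)), 0.0
    while remaining:
        score, f = max((centroid_accuracy(X, y, selected + [f]), f) for f in remaining)
        if selected and score <= best_score:
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# toy data: feature 0 separates the classes, features 1-2 are noise
X = [[0.0, 5.0, 1.0], [0.1, 1.0, 9.0], [0.2, 4.0, 2.0],
     [1.0, 5.0, 1.0], [1.1, 1.0, 9.0], [1.2, 4.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
subset, acc = forward_select(X, y, 3)
print(subset, acc)  # → [0] 1.0
```

Swapping in a stronger classifier and proper k-fold cross-validation for the subset score gives the usual wrapper selection setup, which tends to avoid the "too few features" bias you describe because the stopping criterion is actual classification performance.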
My search has led me to the following implementations:
FEAST toolbox: an interesting toolbox, developed at the University of Manchester, which provides implementations of feature selection criteria based on Shannon's information theory. The implementations can be downloaded from THIS webpage, and they can be used to evaluate individual features as well as subsets of features.
I have also found THIS Matlab code, which implements a selection algorithm based on interaction information.
PY_FS: A Python Package for Feature Selection
I came across this package [1], which was just released (2021) and contains many methods, with references to their original papers.
I have a training set and I want to use a classification method to classify other documents according to my training set. My documents are news articles, and the categories are sports, politics, economics, and so on.
I understand Naive Bayes and kNN completely, but SVMs and decision trees are vague to me. I don't know whether I can implement these methods myself, or whether there are applications for using them.
What is the best method I can use for classifying docs in this way?
Thanks!
Naive Bayes
Though this is the simplest algorithm and all features are assumed independent, in real text classification cases this method works great. I would try this algorithm first for sure.
KNN
Be careful not to confuse kNN with k-means: k-means is a clustering algorithm, while kNN is a supervised classifier, so it does apply to your task. It is, however, slow at prediction time on large corpora.
SVM
SVM comes in SVC (classification) and SVR (regression) variants. It sometimes works well, but from my experience it can perform poorly on text classification, because it is demanding of good tokenizers (filters), and the vocabulary of a real dataset always contains dirty tokens; the accuracy then suffers.
Random Forest (decision tree)
I've never tried this method for text classification. Decision trees need a few key split nodes, and it's hard to find "several key tokens" for text; random forests also tend to do badly on very sparse, high-dimensional data.
FYI
These are all from my own experience; for your case there is no better way to decide which method to use than to try every algorithm and see which fits your data.
Apache Mahout is a great tool for machine learning algorithms. It integrates algorithms from three areas: recommendation, clustering, and classification. You could try this library, but you have to learn some basics of Hadoop.
And for machine learning, Weka is a software toolkit for experiments which integrates many algorithms.
Linear SVMs are one of the top algorithms for text classification problems (along with Logistic Regression). Decision Trees suffer badly in such high dimensional feature spaces.
The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
EDIT: Multinomial Naive Bayes also works well on text data, though usually not as well as linear SVMs. kNN can work okay, but it's an inherently slow algorithm and never tops the accuracy charts on text problems.
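A bare-bones sketch of Pegasos (stochastic sub-gradient descent for a linear SVM) on a sparse bag-of-words representation. The toy corpus and hyperparameters are made up for illustration; in practice you'd reach for LIBLINEAR or scikit-learn's LinearSVC rather than rolling your own:

```python
import random
from collections import Counter

def featurize(doc):
    """Sparse bag-of-words vector as a token→count dict."""
    return Counter(doc.split())

def pegasos_train(data, lam=0.01, epochs=200, seed=0):
    """Pegasos: each step shrinks the weights (regularization) and, when the
    hinge loss is active, moves them toward the misclassified example.
    data is a list of (sparse_features, label) with label in {-1, +1}."""
    rng = random.Random(seed)
    w, t = {}, 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            scale = 1.0 - eta * lam  # = 1 - 1/t
            for f in w:
                w[f] *= scale
            if margin < 1.0:  # hinge loss active
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def predict(w, doc):
    x = featurize(doc)
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) >= 0 else -1

# toy corpus: +1 = sports, -1 = politics
train = [("the team won the final match", 1),
         ("a late goal sealed the game", 1),
         ("parliament passed the new budget", -1),
         ("the senate debated the tax bill", -1)]
data = [(featurize(d), y) for d, y in train]
w = pegasos_train(data)
print(predict(w, "the team scored a goal"))   # expect 1
print(predict(w, "the budget bill passed"))   # expect -1
```

The sparse-dict representation is the point here: text feature spaces are huge but each document touches only a few dimensions, which is exactly why linear models scale so well on this problem.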
If you are familiar with Python, you may consider NLTK and scikit-learn. The former is dedicated to NLP, while the latter is a more comprehensive machine learning package (but it also has a great inventory of text-processing modules). Both are open source and have great community support on SO.