Best method to implement text classification (2 classes)

I have to write a classifier for a corpus of texts, which should separate all my texts into 2 classes.
The corpus is very large (about 4 million texts for testing and 50,000 for training).
But what algorithm should I choose?
Naive Bayesian
Neural networks
SVM
Random forest
kNN (why not?)
I heard that Random forests and SVM are state-of-the-art methods, but maybe someone
has worked with the algorithms listed above and knows which is fastest and which is most accurate?

For a 2-class text classifier, I don't think you need:
(1) KNN: it is a lazy, instance-based classifier (easy to confuse with k-means clustering), and prediction is slow because every query is compared against the training set;
(2) Random forest: decision trees may not be a good option in high-dimensional sparse feature spaces;
You can try:
(1) Naive Bayes: the most straightforward and the easiest to code. It has proved to work well on text classification problems;
(2) Logistic regression: works well if the number of training samples is much larger than the number of features;
(3) SVM: again, when there are many more training samples than features, an SVM with a linear kernel works as well as logistic regression. It is also one of the top algorithms in text classification;
(4) Neural network: seems like a panacea in machine learning. In theory it can learn any model that SVM/logistic regression could. The problem is that there are not as many packages for NNs as there are for SVMs. As a result, optimizing a neural network is time-consuming.
Yet it is hard to say which algorithm is best suited to your case. If you are using Python, scikit-learn includes almost all of these algorithms for you to test. Besides, Weka, which integrates many machine learning algorithms in a user-friendly graphical interface, is also a good way to get to know the performance of each algorithm.
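To give an idea of how little code Naive Bayes needs, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It assumes whitespace tokenization and add-one (Laplace) smoothing; the function names and the toy "spam"/"ham" documents are made up for illustration and are not from the question. For a real corpus a tested library implementation (e.g. the one in scikit-learn) is probably the better choice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior for one document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data, only to show the call pattern.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_docs, train_labels)
prediction_spam = predict_naive_bayes(model, "buy cheap pills")
prediction_ham = predict_naive_bayes(model, "monday project meeting")
```

Working in log space avoids numerical underflow when documents are long, which matters with millions of texts.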

Related

Can KNN be better than other classifiers?

As is known, there are classifiers that have a training or learning step, like SVM or Random Forest. KNN, on the other hand, does not.
Can KNN be better than these classifiers?
If no, why?
If yes, when, how and why?

The main answer is yes, it can, due to the implications of the no free lunch theorem. The theorem can be loosely stated (in terms of classification) as:
There is no universal classifier which is consistently better at any task than others
It can also be (not very strictly) inverted:
For each (well defined) classifier there exists a dataset where it is the best one
And in particular - kNN is a well-defined classifier; moreover, it is consistent with any distribution, which means that given infinitely many training points it converges to the optimal, Bayesian separator.
So can it be better than SVM or RF? Obviously! When? There is no clear answer. First of all, in supervised learning you often get just one training set and try to fit the best model. In such a scenario any model can be the best one. When statisticians/theoretical ML people try to answer whether one model is better than another, we actually try to test "what would happen if we had infinitely many training sets" - so we look at the expected value of the behaviour of the classifiers. In such a setting, we often show that SVM/RF is better than KNN. But it does not mean that they are always better. It only means that for a randomly selected dataset you should expect KNN to work worse, but this is only a probability. And as you can always win a lottery (no matter the odds!), you can also always win with KNN (just to be clear - KNN has a bigger chance of being a good model than you have of winning a lottery :-)).
What are particular examples? Let us for example consider a rotated XOR problem.
If the true decision boundaries are those of the rotated XOR and you only have these four points, 1NN will obviously be much better than SVM (with dot, poly or rbf kernel) or RF. It should also be true once you include more and more training points.
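To make the rotated XOR example concrete, here is a minimal sketch in plain Python. The coordinates are an assumption (the XOR corners rotated by 45 degrees so that they land on the axes), and `one_nn_predict` and `true_label` are names made up for this illustration. It only checks that 1NN built from the four points reproduces the diagonal decision boundaries; it does not benchmark SVM or RF.

```python
import math
import random

R = math.sqrt(2.0)

# XOR corners (1,1), (-1,-1) -> class 0 and (1,-1), (-1,1) -> class 1,
# rotated by 45 degrees, end up on the coordinate axes.
TRAINING_POINTS = [
    ((0.0, R), 0),
    ((0.0, -R), 0),
    ((R, 0.0), 1),
    ((-R, 0.0), 1),
]


def one_nn_predict(x, y):
    """Label of the training point closest to (x, y) in Euclidean distance."""
    def squared_distance(item):
        (px, py), _ = item
        return (x - px) ** 2 + (y - py) ** 2
    return min(TRAINING_POINTS, key=squared_distance)[1]


def true_label(x, y):
    """Rotated XOR: the true boundaries are the diagonals y = x and y = -x."""
    return 1 if abs(x) > abs(y) else 0


# Compare 1NN with the true labelling on random query points.
rng = random.Random(0)
queries = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(1000)]
agreement = sum(one_nn_predict(x, y) == true_label(x, y) for x, y in queries) / len(queries)
```

The Voronoi cells of the four training points are bounded by exactly those two diagonals, which is why 1NN fits this particular problem so well.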

"In general kNN would not be expected to exceed SVM or RF. When kNN does, that says something very interesting about the training data. If many doublets are present i the data set, a nearest neighbor algorithm works very well."
I heard an argument along these lines from Claudia Perlich in this podcast:
http://www.thetalkingmachines.com/blog/2015/6/18/working-with-data-and-machine-learning-in-advertizing
My intuitive understanding of why RF and SVM are better than kNN in general: all algorithms basically assume some local similarity, such that very similar samples get classified alike. kNN can only choose the most similar samples by distance (or some other global kernel). So the samples which can influence a kNN prediction lie within a hypersphere for the Euclidean distance kernel. RF and SVM can learn other definitions of locality, which may stretch far along some features and only a short way along others. The propagation of locality can also take many learned shapes, and these shapes can differ throughout the feature space.

Which predictive modelling technique will be most helpful?

I have a training dataset which gives me the ranking of various cricket players (2008) on the basis of their performance in the past years (2005-2007).
I have to develop a model using this data and then apply it to another dataset to predict the ranking of players (2012) using the data already given to me (2009-2011).
Which predictive modelling technique will be best for this? What are the pros and cons of using the different forms of regression or neural networks?

The type of model to use depends on different factors:
Amount of data: if you have very little data, you had better opt for a simple prediction model like linear regression. If you use a prediction model which is too powerful, you run the risk of over-fitting your model, with the effect that it generalizes badly to new data. Now you might ask, what is little data? That depends on the number of input dimensions and on the underlying distributions of your data.
Your experience with the model: neural networks can be quite tricky to handle if you have little experience with them. There are quite a few parameters to be optimized, like the network layer structure, the number of iterations, the learning rate and the momentum term, just to mention a few. Linear prediction is a lot easier to handle with respect to this "meta-optimization".
A pragmatic approach for you, if you still cannot decide on one of the methods, would be to evaluate a couple of different prediction methods. You take some of your data where you already have target values (the 2008 data), split it into training and test data (take some 10% as test data, for example), train and test using cross-validation, and compute the error rate by comparing the predicted values with the target values you already have.
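The evaluation procedure described above could look roughly like the following sketch in plain Python. It assumes a single numeric input and uses ordinary least squares as the "simple prediction model"; the synthetic data and the names `fit_line` and `cross_validated_mse` are invented for illustration and have nothing to do with the actual cricket dataset. With 10 folds, each round holds out about 10% of the data as test data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def cross_validated_mse(xs, ys, n_folds=10, seed=0):
    """Average mean squared error on the held-out fold over n_folds rounds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Compare predictions with the known targets of the held-out samples.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Synthetic stand-in for "past performance -> ranking score"; not real data.
rng = random.Random(42)
past_scores = [rng.uniform(0, 100) for _ in range(200)]
rank_scores = [0.5 * x + 10 + rng.gauss(0, 5) for x in past_scores]
mse = cross_validated_mse(past_scores, rank_scores)
```

Running the same function with a second model in place of `fit_line` gives two error figures measured on identical splits, which is what makes the comparison fair.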
One great book, which is also available on the web, is Pattern Recognition and Machine Learning by C. Bishop. It has a great introductory section on prediction models.

Which predictive modelling will be best for this? What are the pros and cons of using the different forms of regression or neural networks?
"What is best" depends on the resources you have. Full Bayesian Networks (or k-Dependency Bayesian Networks) with information theoretically learned graphs, are the ultimate 'assumptionless' models, and often perform extremely well. Sophisticated Neural Networks can perform impressively well too. The problem with such models is that they can be very computationally expensive, so models that employ methods of approximation may be more appropriate. There are mathematical similarities connecting regression, neural networks and bayesian networks.
Regression is actually a simple form of neural network with some additional assumptions about the data. Neural networks can be constructed to make fewer assumptions about the data, but, as Thomas789 points out, at the cost of being considerably more difficult to understand (and sometimes monumentally difficult to debug).
As a rule of thumb, the more assumptions and approximations in a model, the easier it is to A: understand it and B: find the computational power necessary, but potentially at the cost of performance or "overfitting" (this is when a model suits the training data well, but doesn't extrapolate to the general case).
Free online books:
http://www.inference.phy.cam.ac.uk/mackay/itila/
http://ciml.info/dl/v0_8/ciml-v0_8-all.pdf

text classification methods? SVM and decision tree

I have a training set and I want to use a classification method to classify other documents according to my training set. My documents are news articles, and the categories are sports, politics, economics and so on.
I understand naive Bayes and KNN completely, but SVM and decision trees are vague to me, and I don't know whether I can implement these methods myself. Or are there applications for using these methods?
What is the best method I can use for classifying documents in this way?
Thanks!

Naive Bayes
Though this is the simplest algorithm and all features are assumed to be independent, in real text classification cases this method works great. I would certainly try this algorithm first.
KNN
KNN is a classification method, not a clustering one (the clustering algorithm with the similar name is k-means). It can be used here, but prediction is slow on large document sets.
SVM
SVM comes in SVC (classification) and SVR (regression) variants for classification and prediction. It sometimes works well, but in my experience it performs badly in text classification, as it places high demands on good tokenizers (filters), and the dictionary of a dataset always contains dirty tokens, so the accuracy is really bad.
Random Forest (decision tree)
I've never tried this method for text classification, because I think a decision tree needs several key nodes, while it's hard to find "several key tokens" for text classification, and random forests work badly in high-dimensional sparse spaces.
FYI
These are all from my own experience, but for your case there is no better way to decide which method to use than to try every algorithm and see which fits your data.
Apache's Mahout is a great tool for machine learning algorithms. It integrates algorithms from three areas: recommendation, clustering, and classification. You could try this library, but you will have to learn some basics of Hadoop.
And for machine learning, Weka is a software toolkit for experiments which integrates many algorithms.

Linear SVMs are one of the top algorithms for text classification problems (along with Logistic Regression). Decision Trees suffer badly in such high dimensional feature spaces.
The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
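To show how short Pegasos is, here is a minimal sketch in plain Python. It assumes the basic variant (one random example per step, step size 1/(lambda*t), no optional projection step, no bias term) and sparse bag-of-words features stored in dicts. The function names and the toy sentences are made up for illustration, and `lam` and `n_steps` are arbitrary values, not tuned ones.

```python
import random


def bag_of_words(text):
    """Sparse feature dict: token -> count (assumes whitespace tokenization)."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0.0) + 1.0
    return counts


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def pegasos_train(examples, lam=0.01, n_steps=2000, seed=0):
    """Stochastic sub-gradient descent on the linear SVM objective.

    examples is a list of (features, label) pairs with label in {-1, +1}.
    """
    rng = random.Random(seed)
    weights = {}
    for t in range(1, n_steps + 1):
        features, label = rng.choice(examples)
        eta = 1.0 / (lam * t)  # decreasing step size
        margin = label * dot(weights, features)  # uses the weights before the update
        # Regularization part of the sub-gradient: shrink every weight.
        shrink = 1.0 - eta * lam
        for name in weights:
            weights[name] *= shrink
        # Hinge loss is active: push the weights towards this example's label.
        if margin < 1.0:
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + eta * label * value
    return weights


def pegasos_predict(weights, features):
    return 1 if dot(weights, features) >= 0 else -1


# Made-up toy data: +1 = sports, -1 = politics.
toy_examples = [
    (bag_of_words("team wins match"), 1),
    (bag_of_words("striker scores goal"), 1),
    (bag_of_words("parliament passes budget"), -1),
    (bag_of_words("minister calls election"), -1),
]
weights = pegasos_train(toy_examples)
sports_guess = pegasos_predict(weights, bag_of_words("striker wins"))
politics_guess = pegasos_predict(weights, bag_of_words("election budget"))
```

Each step touches only one example, so the cost per step does not grow with the size of the training set.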
EDIT: Multinomial Naive Bayes also works well on text data, though not usually as well as Linear SVMs. kNN can work okay, but it's an already slow algorithm and doesn't ever top the accuracy charts on text problems.

If you are familiar with Python, you may consider NLTK and scikit-learn. The former is dedicated to NLP while the latter is a more comprehensive machine learning package (but it has a great inventory of text processing modules). Both are open source and have great community support on SO.

Optimization of Neural Network input data

I'm trying to build an app to detect which images on webpages are advertisements. Once I detect those, I won't allow them to be displayed on the client side.
Basically I'm using the back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset the number of attributes is very high. In fact, one of the mentors of the project told me that if you train the neural network with that many attributes, it will take a lot of time to train. So is there a way to optimize the input dataset? Or do I just have to use that many attributes?

1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
An ANN is slow to train; I'd suggest you use logistic regression or an SVM. Both of them are very fast to train. SVMs in particular have a lot of fast algorithms.
In this dataset you are actually analyzing text, not images. I think a classifier from the linear family, i.e. logistic regression or SVM, is better for your job.
If this is for production and you cannot use open source code, logistic regression is very easy to implement compared to a good ANN or SVM.
If you decide to use logistic regression or SVM, I can further recommend some articles or source code for you to refer to.

If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
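The arithmetic behind that argument, as a small Python sketch. The input, hidden and sample counts come from the text above; the bias terms and the single output neuron are assumptions added here only to get a complete parameter count.

```python
n_inputs = 1558
n_hidden = 10
n_samples = 3279

# Weights between the input layer and the hidden layer (the 1558*10 above).
input_to_hidden_weights = n_inputs * n_hidden

# Assumptions, not stated in the text: one bias per hidden neuron,
# and a single output neuron with its own bias.
hidden_biases = n_hidden
hidden_to_output_weights = n_hidden + 1

total_parameters = input_to_hidden_weights + hidden_biases + hidden_to_output_weights
parameters_per_sample = total_parameters / n_samples
```

Even this very small network has more than four free parameters per training sample.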
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this.) Don't expect the artificial neural network to do that work for you.
Also: remember the famous "no-free-lunch theorem" as presented by Duda & Hart: No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)

Apply a separate ANN to each category of features, for example:
457 inputs, 1 output for url terms (ANN1)
495 inputs, 1 output for origurl (ANN2)
...
Then train all of them and use another main ANN to join the results.

Neural Net Optimize w/ Genetic Algorithm

Is a genetic algorithm the most efficient way to optimize the number of hidden nodes and the amount of training done on an artificial neural network?
I am coding neural networks using the NNToolbox in Matlab. I am open to any other suggestions of optimization techniques, but I'm most familiar with GAs.

Actually, there are multiple things that you can optimize using a GA with regard to NNs.
You can optimize the structure (number of nodes, layers, activation function etc.).
You can also train using a GA, which means setting the weights.
Genetic algorithms will never be the most efficient, but they are usually used when you have little clue as to what numbers to use.
For training, you can use other algorithms, including backpropagation, Nelder-Mead etc.
You said you wanted to optimize the number of hidden nodes; for this, a genetic algorithm may be sufficient, although far from "optimal". The space you are searching is probably too small to need genetic algorithms, but they can still work and, afaik, they are already implemented in Matlab, so no biggie.
What do you mean by optimizing the amount of training done? If you mean the number of epochs, then that's fine; just remember that training depends to some extent on the starting weights, which are usually random, so the fitness function used for the GA won't really be a function (the same input can give different outputs).

A good example of neural networks and genetic programming is the NEAT architecture (Neuro-Evolution of Augmenting Topologies). This is a genetic algorithm that finds an optimal topology. It's also known to be good at keeping the number of hidden nodes down.
They also made a game using this called Nero, with quite unique and very impressive tangible results.
Dr. Stanley's homepage:
http://www.cs.ucf.edu/~kstanley/
Here you'll find just about everything NEAT related as he is the one who invented it.

Genetic algorithms can be usefully applied to optimising neural networks, but you have to think a little about what you want to do.
Most "classic" NN training algorithms, such as Back-Propagation, only optimise the weights of the neurons. Genetic algorithms can optimise the weights, but this will typically be inefficient. However, as you were asking, they can optimise the topology of the network and also the parameters for your training algorithm. You'll have to be especially wary of creating networks that are "over-trained" though.
One further technique, using a modified genetic algorithm, can be useful for overcoming a problem with Back-Propagation. Back-Propagation usually finds local minima, but it finds them accurately and rapidly. Combining a Genetic Algorithm with Back-Propagation, e.g., in a Lamarckian GA, gives the advantages of both. This technique is briefly described in the GAUL tutorial.

It is sometimes useful to use a genetic algorithm to train a neural network when your objective function isn't continuous.

I'm not sure whether you should use a genetic algorithm for this.
I suppose the initial solution population for your genetic algorithm would consist of training sets for your neural network (given a specific training method). Usually the initial solution population consists of random solutions to your problem. However, random training sets would not really train your neural network.
The evaluation function for your genetic algorithm would be a weighted average of the amount of training needed, the quality of the neural network in solving a specific problem and the number of hidden nodes.
So, if you run this, you would get the training set that delivered the best result in terms of neural network quality (= training time, number of hidden nodes, problem solving capabilities of the network).
Or are you considering an entirely different approach?

I'm not entirely sure what kind of problem you're working with, but GA sounds like a little bit of overkill here. Depending on the range of parameters you're working with, an exhaustive (or otherwise unintelligent) search may work. Try plotting your NN's performance with respect to the number of hidden nodes for the first few values, starting small and jumping by larger and larger increments. In my experience, many NNs plateau in performance surprisingly early; you may be able to get a good picture of what range of hidden node numbers makes the most sense.
The same is often true for NNs' training iterations. More training helps networks up to a point, but soon ceases to have much effect.
In the majority of cases, these NN parameters don't affect performance in a very complex way. Generally, increasing them increases performance for a while but then diminishing returns kick in. GA is not really necessary to find a good value on this kind of simple curve; if the number of hidden nodes (or training iterations) really does cause the performance to fluctuate in a complicated way, then metaheuristics like GA may be apt. But give the brute-force approach a try before taking that route.
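The brute-force sweep described in this answer could be sketched like this in plain Python. `sweep_until_plateau` and its parameters are names invented here, and `toy_validation_accuracy` is a hypothetical stand-in for "train a network with this many hidden nodes and measure it on validation data"; in practice you would plug in your own training and scoring routine (for example a call into your Matlab setup).

```python
def sweep_until_plateau(evaluate, start=1, factor=2, max_value=1024, min_gain=0.03):
    """Try start, start*factor, ... and stop once a jump gains less than min_gain."""
    history = []
    best_value, best_score = None, float("-inf")
    value = start
    while value <= max_value:
        score = evaluate(value)
        history.append((value, score))
        if score - best_score < min_gain:
            break  # diminishing returns: the last jump bought almost nothing
        best_value, best_score = value, score
        value *= factor
    return best_value, history


def toy_validation_accuracy(n_hidden):
    """Hypothetical score curve that rises quickly and then flattens out."""
    return 0.9 * n_hidden / (n_hidden + 3.0)


chosen_hidden_nodes, sweep_history = sweep_until_plateau(toy_validation_accuracy)
```

Because the step size doubles each time, the sweep covers a wide range with few training runs; once the plateau is located, a finer search between the last two values is cheap.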

I would tend to say that genetic algorithms are a good idea, since you can start with a minimal solution and grow the number of neurons. It is very likely that the "quality function" for which you want to find the optimal point is smooth and has only a few bumps.
If you have to find this optimal NN frequently, I would recommend using optimization algorithms, in your case quasi-Newton methods as described in Numerical Recipes, which are well suited to problems where the function is expensive to evaluate.
