I have found the following Matlab implementation of a Naive Bayes classifier:
https://github.com/jjedele/Naive-Bayes-Classifier-Octave-Matlab
What is the difference between Gaussian Naive Bayes and Naive Bayes? How could I extend the above implementation to become Gaussian Naive Bayes?
How can I extend the implementation to use it with 4 classes? Just by doing one-vs-all?
Thank you very much for the help.
In Naive Bayes classification we take a set of features (x0, x1, ..., xn) and try to assign them to one of a known set Y of classes (y0, y1, ..., yk). We do that by using training data to calculate the conditional probabilities that tell us how often a particular class had a certain feature value in the training set, and then multiplying those probabilities together.
The result is a score for each class in the set Y. We then take the highest scoring member of Y as the class that our feature set should be assigned to.
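In symbols, the standard Naive Bayes decision rule (this is the kind of TeX the disclosure below refers to) is:

$$\hat{y} = \underset{y_j \in Y}{\arg\max}\; p(y_j) \prod_{i=0}^{n} p(x_i \mid y_j)$$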
Up until this point we haven't made any assumptions about what the p(x|C) distributions look like.
In Gaussian Naive Bayes we assume that all those p(x|C) values are normally distributed. That's the only "difference", and it really isn't much of one: GNB is just a special case of Naive Bayes.
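Concretely, each class-conditional feature likelihood is modelled as a normal density, with the mean and variance estimated per class and per feature from the training data:

$$p(x_i \mid y_j) = \frac{1}{\sqrt{2\pi\sigma_{j,i}^{2}}} \exp\!\left(-\frac{(x_i-\mu_{j,i})^{2}}{2\sigma_{j,i}^{2}}\right)$$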
This can be useful if you don't have a lot of training data, and are willing to make the assumption that the population data is normally distributed about the mean of the sample (training) data you do have.
Full disclosure: the TeX comes from Wikipedia.
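If it helps to see the idea in code, here is a minimal numpy sketch (not the linked Octave/Matlab implementation, just an illustration of the formulas above). Note that it handles any number of classes directly, so using it with 4 classes needs no one-vs-all trick:

import numpy as np

def fit_gnb(X, y):
    # Per class: estimate a mean and a variance for every feature, plus the prior.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_gnb(params, x):
    # Score each class by the log prior plus the summed log Gaussian densities.
    scores = {}
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)  # highest-scoring class wins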
As far as I know, some classifiers, such as Naive Bayes, calculate the posterior probability of the data and produce their result based on it.
My question is: can any classifier produce a posterior probability?
For example, how can a decision tree generate one?
Some classification models such as logistic regression and neural networks compute posterior class probabilities directly. Models based on generative models, such as the quadratic discriminant and models derived from mixture densities, also compute posterior class probabilities. Decision trees can easily be adapted to output a class probability by returning the proportion of positive examples at the leaves of the tree.
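For example, this is roughly how it looks in scikit-learn (a small sketch, assuming sklearn is available): predict_proba on a decision tree returns the class proportions of the leaf each sample falls into.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each row sums to 1: the fraction of training samples of each class in the reached leaf.
print(tree.predict_proba(X[:5]))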
A prominent exception is the support vector machine, which doesn't return a probability out of the box. The usual workaround is Platt scaling, which fits a sigmoid to the SVM's decision values to turn them into probabilities.
See Hastie, Tibshirani, and Friedman, "Elements of Statistical Learning" (or any of many texts) for more about this stuff. Further questions of this kind should probably go to stats.stackexchange.com.
As is known, some classifiers have a training or learning step, like SVM or Random Forest. On the other hand, KNN does not.
Can KNN be better than these classifiers?
If no, why?
If yes, when, how and why?
The main answer is yes, it can, due to the implications of the no free lunch theorem, which can be loosely stated (in terms of classification) as:
There is no universal classifier which is consistently better at every task than the others.
It can also be (not very strictly) inverted:
For each (well-defined) classifier there exists a dataset where it is the best one.
In particular, kNN is a well-defined classifier; it is consistent with any distribution, which means that given infinitely many training points it converges to the optimal Bayesian separator.
So can it be better than SVM or RF? Obviously! When? There is no clear answer. First of all, in supervised learning you often get just one training set and try to fit the best model. In that scenario any model can be the best one. When statisticians and theoretical ML researchers try to answer whether one model is better than another, they are really asking "what would happen if we had infinitely many training sets", so they look at the expected behaviour of the classifiers. In that setting we often show that SVM or RF is better than kNN. But that does not mean they are always better. It only means that for a randomly selected dataset you should expect kNN to work worse, but this is only a probability. And as you can always win the lottery (no matter the odds!), you can also always win with kNN (just to be clear: kNN has a much bigger chance of being a good model than you have of winning the lottery :-)).
What are particular examples? Let us for example consider a rotated XOR problem.
If the true decision boundaries are as above, and you only have these four points, then obviously 1NN will be much better than SVM (with a dot, polynomial or RBF kernel) or RF. This should remain true as you include more and more training points.
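If you want to play with the setup, here is a hypothetical scikit-learn sketch of a rotated XOR with only four training points (just an illustration; run it and compare the two classifiers yourself):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Four training points of a 45-degree-rotated XOR: opposite corners share a class.
X_train = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.0], [-1.0, 0.0]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
svm = SVC(kernel='rbf').fit(X_train, y_train)

# Probe points near each training point; 1NN reproduces the local labels exactly.
X_test = np.array([[0.1, 0.9], [0.9, 0.1], [-0.9, 0.1], [0.1, -0.9]])
print("1NN:", knn.predict(X_test))
print("SVM:", svm.predict(X_test))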
"In general kNN would not be expected to exceed SVM or RF. When kNN does, that says something very interesting about the training data. If many doublets are present i the data set, a nearest neighbor algorithm works very well."
I heard the argument something like as written by Claudia Perlich in this podcast:
http://www.thetalkingmachines.com/blog/2015/6/18/working-with-data-and-machine-learning-in-advertizing
My intuitive understanding of why RF and SVM are better than kNN in general: all of these algorithms basically assume some local similarity, such that samples that are very alike get classified alike. kNN can only choose the most similar samples by distance (or some other global kernel). So the samples that can influence a prediction with kNN lie within a hypersphere for the Euclidean distance kernel. RF and SVM can learn other definitions of locality, which can stretch far along some features and only a short way along others. Also, the notion of locality can take on many learned shapes, and these shapes can differ throughout the feature space.
My goal is to classify an image using a multi-class linear SVM (without a kernel). I would like to write my own SVM classifier.
I am using MATLAB and have trained a linear SVM using the image sets provided.
I have around 20 classes, 5 images in each class (total of 100 images) and I am using one-versus-all strategy.
Each image is a (112,92) matrix. That means 112*92=10304 values.
I am using quadprog(H,f,A,C) to solve the quadratic program for the SVM (finding w and b in y = w'x + b). One call to quadprog returns a w vector of size 10304 for one image. That means I have to call quadprog 100 times.
The problem is that one quadprog call takes 35 seconds to execute. That means for 100 images it will take 3500 seconds. This might be due to the large size of the vectors and matrices involved.
I want to reduce the execution time of quadprog. Is there any way to do it?
First of all, when you do classification using an SVM, you usually extract a feature (like HOG) from each image, so that the dimensionality of the space on which the SVM has to operate gets reduced. You are using raw pixel values, which generates a 10304-dimensional vector. That is not good. Use some standard feature.
Secondly, you do not call quadprog 100 times. You call it only once. The idea behind the optimization is that you want to find a weight vector w and a bias b such that sign(w'x_i + b) = y_i for all images (i.e. all x_i). Note that i goes from 1 to the number of examples in your training set, but w and b stay the same. If you wanted a (w,b) that satisfies only one x, you would not need any fancy optimization. So stack your x into a big matrix of dimension NxD, make y a vector of size Nx1, and then call quadprog. It will take longer than 35 seconds, but you do it only once. This is called training an SVM. When testing on a new image, you just extract its feature and compute w'x + b to get its class.
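For illustration, here is a rough Python/cvxopt sketch of that single QP (the hard-margin primal with the variable stacked as z = [w; b]); it is not your MATLAB code, and cvxopt's solvers.qp just plays the role of quadprog here:

import numpy as np
from cvxopt import matrix, solvers

def train_linear_svm(X, y):
    # Hard-margin linear SVM trained with one QP call over all examples.
    N, D = X.shape
    # Variable z = [w; b].  Objective: minimize 0.5 * ||w||^2 = 0.5 * z' P z.
    P = np.zeros((D + 1, D + 1))
    P[:D, :D] = np.eye(D)
    P[D, D] = 1e-8  # tiny ridge on b keeps P positive definite for the solver
    q = np.zeros(D + 1)
    # Constraints y_i * (w' x_i + b) >= 1, rewritten as G z <= h.
    G = -np.hstack([X * y[:, None], y[:, None]])
    h = -np.ones(N)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:D], z[D]  # w, b

# One QP call over the whole stacked training set, not one call per image.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # reproduces y for this separable toy set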
Third, unless you are doing this as an exercise to understand SVMs, quadprog is not the best way to solve this problem. You need to use some other techniques which work well with large data, for example, Sequential Minimal Optimization.
How can one linear hyperplane classify more than 2 classes? For multi-class classification, SVMs usually use one of two popular approaches (a short scikit-learn sketch of both follows the list):
One-vs-one SVM: For a C-class problem, you train C(C-1)/2 classifiers, and at test time you run all of them and choose the class which receives the most votes.
One-vs-all SVM: As the name suggests, you train one classifier per class, with positive samples from that class and negative samples from all other classes.
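The sketch below (dataset and estimator choices are just illustrative) shows both strategies using scikit-learn's wrappers around a linear SVM:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One-vs-one: C(C-1)/2 binary models (3 for the 3 iris classes).
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
# One-vs-all: one binary model per class.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print(ovo.predict(X[:5]), ovr.predict(X[:5]))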
From LIBSVM FAQs:
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13 (2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
However, also note that a naive implementation of one-vs-one may not be practical for large-scale problems. The LIBSVM website also lists this shortcoming and provides an extension.
LIBLINEAR does not support one-versus-one multi-classification, so we provide an extension here. If k is the number of classes, we generate k(k-1)/2 models, each of which involves only two classes of training data. According to Yuan et al. (2012), one-versus-one is not practical for large-scale linear classification because of the huge space needed to store k(k-1)/2 models. However, this approach may still be viable if model vectors (i.e., weight vectors) are very sparse. Our implementation stores models in a sparse form and can effectively handle some large-scale data.
Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a 2-class problem the cost matrix would be a 2 by 2 square matrix, with A_ij = the cost of classifying i as j.
The main classifier I am using is a Random Forest.
Thanks.
The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']

    # Create an indicator - 0 if correct, 1 otherwise
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))

    # Use the product totals dataset to create results
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')

    # Calculate per-prediction loss
    loss = results.Result * results.SumNetAmount

    return loss.sum()
The scorer becomes:
make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
Where totals_data is a pandas.DataFrame with indexes that match the training set indexes.
You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so choosing the classifier threshold that yields the confusion matrix you want implies some sort of cost-weighting scheme. You then just have to choose the confusion matrix that would imply the cost matrix you are looking for.
On the other hand if you really had your heart set on it, and really want to "train" an algorithm using a cost matrix, you could "sort of" do it in sklearn.
Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you could use a cost-matrix-style setup to tune your hyper-parameters. I've done something similar using a genetic algorithm. It really doesn't do a great job, but it should give a modest boost to performance.
One way to circumvent this limitation is to use under- or oversampling. E.g., if you are doing binary classification with an imbalanced dataset and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn, which is a package from scikit-learn-contrib.
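A rough sketch of that idea (assuming the imbalanced-learn package, with an illustrative synthetic dataset in place of your data):

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# An imbalanced toy problem standing in for your real data.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Duplicate minority-class samples until the classes are balanced.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)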
This may not directly answer your question (since you are asking about Random Forest).
But for SVM (in sklearn), you can use the class_weight parameter to specify the weights of different classes. Essentially, you pass in a dictionary.
You might want to refer to this page to see an example of using class_weight.
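For instance, here is a small sketch (with an illustrative synthetic dataset) that penalizes mistakes on class 1 five times as heavily; RandomForestClassifier accepts the same class_weight argument, so the idea carries over to your random forest:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Errors on class 1 are weighted 5x more heavily than errors on class 0.
clf = SVC(class_weight={0: 1, 1: 5}).fit(X, y)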
After we have created a Naive Bayes classifier object nb (say, with a multivariate multinomial (mvmn) distribution), we can call the posterior function on testing data using the nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and what it means; cpre is the predicted class, based on the maximum over the posterior probabilities for each class.
The question is about logp. It is clear how it is computed (the logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of the Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how well a model fits the data. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
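Concretely, for a single test pattern x the value reported should be the log of its unconditional density under the fitted model (a sketch of the formula, assuming the usual mixture-over-classes form):

$$\log p(x) = \log \sum_{j} p(y_j)\, p(x \mid y_j)$$

Summing this over all test patterns gives the test-set log likelihood.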
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (Gaussian mixture, latent Dirichlet allocation, etc.).