Different results from SMO, NaiveBayes, and BayesNet classifiers in Weka - classification

I am trying different Weka classifiers on my data set. I have a small dataset and I am classifying my data into five classes.
My problem is that when I apply cross-validation or percentage-split evaluation with different classifiers, I get very different results.
For example, when I use the NaiveBayes or BayesNet classifiers, I get an F-score of around 40 for all classes, but using SMO I get an F-score of 20. The worst result is obtained with the LibLinear classifier, which gives me F-scores of around 15.
Maybe I should mention that since the LibLinear classifier doesn't accept nominal attributes, I assign a code to each possible nominal value and use these as numeric values in my dataset.
Can anybody tell me why I get such different results? I expected all classifiers to have roughly similar results.
In addition, when I use LibLinear on my test set, all the data is classified into one of the classes and there are no instances in the other four classes.
Thanks in advance,

Why would you expect similar results? Especially for a small data set, I think different methods could easily lead to different predictions. Also, a linear model has a tolerance threshold that can cause early termination before convergence. It's something you can play with in LibLINEAR or SMO, for instance.
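To make that last point concrete, here is a minimal sketch in scikit-learn rather than Weka (so the parameter names are only analogues of the Weka/LibLINEAR options): `tol` tightens the stopping tolerance and `max_iter` gives the solver more iterations. As an aside, one-hot encoding the nominal attributes usually works better for a linear model than assigning arbitrary integer codes, which impose a fake ordering on the categories.

    # Sketch only (scikit-learn, not Weka): tighten the stopping tolerance and
    # allow more iterations for a linear SVM; one-hot encode nominal attributes.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import LinearSVC

    clf = make_pipeline(
        # indicator columns instead of integer codes for nominal attributes
        OneHotEncoder(handle_unknown="ignore"),
        LinearSVC(C=1.0, tol=1e-5, max_iter=20000),
    )
    # clf.fit(X_train_nominal, y_train); clf.predict(X_test_nominal)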

Related

Can I separately train a classifier (e.g. SVM) with two different types of features and combine the results later?

I am a student working on my first simple machine learning project. The project is about classifying articles as fake or true. I want to use SVM as the classification algorithm and two different types of features:
TF-IDF
Lexical Features like the count of exclamation marks and numbers
I have figured out how to use the lexical features and TF-IDF as features separately. However, I have not managed to figure out how to combine them.
Is it possible to train and test two separate learning algorithms (one with TF-IDF and the other one with lexical features) and later combine the results?
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
One way of combining two models is called model stacking. The idea behind it is that you take the predictions of both models and feed them into a third model (called a meta-model), which is then trained to make predictions given the output of the first two models. There is also another version of model stacking where you additionally feed the original features into the meta-model.
However, in your case another way to combine both approaches would be to simply feed both the TF-IDF and the lexical features into one model and see how that performs.
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
This would unfortunately not work, because there is no combined model actually making predictions, so the averaged metrics would not describe any model you could use.
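As a rough sketch of the simpler suggestion above (feeding both feature sets into one model), here is what it could look like in scikit-learn; the lexical features shown are just illustrative placeholders. Swapping the single SVM for something like sklearn's StackingClassifier would give the stacking variant instead.

    # Sketch: concatenate TF-IDF and hand-crafted lexical features and feed
    # both into a single SVM. The lexical features here are illustrative only.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.svm import LinearSVC

    def lexical_features(texts):
        # per document: number of '!' characters and number of digit characters
        return np.array([[t.count("!"), sum(c.isdigit() for c in t)] for t in texts])

    model = Pipeline([
        ("features", FeatureUnion([
            ("tfidf", TfidfVectorizer()),
            ("lexical", FunctionTransformer(lexical_features)),
        ])),
        ("svm", LinearSVC()),
    ])
    # model.fit(train_texts, train_labels); model.predict(test_texts)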

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross-validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been used for testing exactly once.
When we train the model on 9 folds, should we not get a different model (maybe slightly different from the model we create when using the whole dataset)? I know that we take an average of all the "n" performances.
But what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is the same as the model we created using the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and do not take the average of all the models), then what is the point of calculating the average performance of n different models (since they are trained on different folds of the data and are supposed to be different, right)?
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is a bad idea, because in machine learning we also do model selection, which means choosing between families of predictors, and that is itself something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameters and algorithm from all the data.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is never larger than the average of the individual risks.
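In symbols, with f_k the predictor learned on fold k and R a risk that is convex in its argument, this is just Jensen's inequality:

    R\left(\frac{1}{K}\sum_{k=1}^{K} f_k\right) \;\le\; \frac{1}{K}\sum_{k=1}^{K} R(f_k)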
The PROs and CONs of averaging (vs. retraining) are as follows.
PROs: (1) In each fold, the evaluation you made on the held-out set gives you an unbiased estimate of the risk of the very predictors you obtained, and for these estimates the only source of uncertainty is the estimation of the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when retraining, which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data from the same distribution, you should on average have the same level of performance. But note that this holds only on average, and when retraining from the whole data the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected for exactly the number of data points that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameters is, in theory and in practice, no longer the same, so with retraining you are really crossing your fingers and hoping that the hyperparameters you chose are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the hyperparameters chosen for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training on all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers, especially the second one, consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
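To make the averaging alternative concrete for the linear-regression setting in the question, here is a hedged sketch (synthetic data, scikit-learn) that averages the coefficients of the K per-fold models and compares them with the model retrained on all the data:

    # Sketch: average the K per-fold predictors (for linear regression this means
    # averaging coefficients) and compare with retraining one model on everything.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

    coefs, intercepts = [], []
    for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
        coefs.append(fold_model.coef_)
        intercepts.append(fold_model.intercept_)

    avg_coef, avg_intercept = np.mean(coefs, axis=0), np.mean(intercepts)
    full_model = LinearRegression().fit(X, y)        # the usual "retrain on all data"
    print(avg_coef, avg_intercept)
    print(full_model.coef_, full_model.intercept_)   # typically very close here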
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV you obtain many different estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all the (training) data. The reason for doing this many times (repeats, each using different partition splits) is to get a stable estimate of the performance, which lets you e.g. look at the mean/median performance and its spread (this tells you how well the model usually performs and how likely it is to get better/worse results just by being lucky/unlucky).
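A hedged sketch of that workflow (the dataset and the candidate models are placeholders): compare the candidates with repeated CV, look at the mean and spread, then refit the winner on all training data.

    # Sketch: compare candidate models with repeated cross-validation, then
    # retrain the chosen one on all the training data.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

    candidates = {
        "logreg": LogisticRegression(max_iter=1000),
        "rbf_svm": SVC(C=1.0, gamma="scale"),
    }
    scores = {name: cross_val_score(m, X, y, cv=cv) for name, m in candidates.items()}
    for name, s in scores.items():
        print(name, s.mean(), s.std())            # mean performance and its spread

    best = max(scores, key=lambda name: scores[name].mean())
    final_model = candidates[best].fit(X, y)       # retrain the winner on all data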
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But for those, too, you want to use CV in the end to choose a reasonable model.
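For reference, a minimal sketch of such model averaging (soft voting averages the predicted class probabilities of the base models; the particular base models here are arbitrary):

    # Sketch: average the predictions of several differently trained models.
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("nb", GaussianNB()),
        ],
        voting="soft",    # average predicted probabilities across the models
    )
    # ensemble.fit(X_train, y_train); ensemble.predict(X_test)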
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

How to Combine two classification model in matlab?

I am trying to detect faces using Matlab's built-in Viola-Jones face detection. Is there any way that I can combine two classification models like "FrontalFaceCART" and "ProfileFace" into one in order to get a better result?
Thank you.
You can't combine models. That makes no sense in any classification task, since every classifier is different (it works differently, i.e. there is a different algorithm behind it, and it may also be trained differently).
According to the classification model(s) help (which can be found here), your two classifiers work as follows:
FrontalFaceCART is a model composed of weak classifiers, based on classification and regression tree analysis
ProfileFace is composed of weak classifiers, based on a decision stump
More info can be found in the link provided, but you can easily see that their inner behaviour is rather different, so you can't mix or combine them.
It's like (in Machine Learning) mixing a Support Vector Machine with a K-Nearest Neighbour: the first one uses separating hyperplanes whereas the latter is simply based on distance(s).
You can, however, train several models in parallel (e.g. independently) and choose the model that better suits you (e.g. smaller error rate/higher accuracy): so you basically create as many different classifiers as you like, give them the same training set, evaluate each accuracy (and/or other parameters) and choose the best model.
One option is to make a hierarchical classifier. So in a first step you use the frontal face classifier (assuming that most pictures are frontal faces). If the classifier fails, you try with the profile classifier.
I did that with a dataset of faces and it improved my overall classification accuracy. Furthermore, if you have some a priori information, you can use it. In my case the faces were usually in the upper middle part of the picture.
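A hedged sketch of that fallback logic, written with OpenCV's Haar cascades in Python rather than the MATLAB detectors (the cascade files are the ones shipped with OpenCV; the detection thresholds are placeholders):

    # Sketch of a hierarchical (fallback) detector: try the frontal-face model
    # first and only fall back to the profile model if nothing is found.
    import cv2

    frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

    def detect_faces(image_bgr):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        faces = frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:     # fallback: try the profile-face detector
            faces = profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return faces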
To further improve your performance beyond the two MATLAB classifiers you are using, you would need to change your technique (and probably your programming language). The best method so far is FaceNet.

why we need cross validation in multiSVM method for image classification?

I am new to image classification and am currently working on an SVM (Support Vector Machine) method for classifying four groups of images with a multisvm function. In my algorithm, the training and testing data are randomly selected every time, and the performance varies each time. Someone suggested doing cross-validation, but I do not understand why we need cross-validation and what its main purpose is. My actual data set consists of a training matrix of size 28×40000 and a testing matrix of size 17×40000. How can I do cross-validation with this data set? Thanks in advance.
Cross-validation is used to select your model. The out-of-sample error can be estimated from your validation error, so you would like to select the model with the least validation error. Here "model" refers to the features you want to use and, more importantly, the gamma and C of your SVM. After cross-validation, you use the selected gamma and C with the least average validation error to train on the whole training data.
You may also need to estimate the performance of your features and parameters to avoid both high bias and high variance. Whether your model suffers from underfitting or overfitting can be observed from both the in-sample error and the validation error.
Typically, 10-fold cross-validation is used.
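A hedged sketch of that procedure with scikit-learn (the parameter grid is arbitrary): cross-validate over gamma and C, then train on the whole training set with the best pair (GridSearchCV refits the best estimator on all the training data by default).

    # Sketch: pick gamma and C by cross-validation, then refit on all training data.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
    # note: the number of folds must not exceed the size of the smallest class
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
    # search.fit(X_train, y_train)     # e.g. the 28 x 40000 training matrix from the question
    # print(search.best_params_)       # pair with the least average validation error
    # predictions = search.predict(X_test)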
I'm not familiar with multiSVM but you may want to check out libSVM, it is a popular, free SVM library with support for a number of different programming languages.
Here they describe cross validation briefly. It is a way to avoid over-fitting the model by breaking up the training data into sub groups. In this way you can find a model (defined by a set of parameters) which fits both sub groups optimally.
For example, in the following picture they plot the validation-accuracy contours for the gamma and C values that parameterize the model. From this contour plot you can tell that the heuristically optimal values (among those tested) are the ones that give an accuracy closer to 84 than to 81.
Refer to this link for more detailed information on cross-validation.
You always need to cross-validate your experiments in order to guarantee a correct scientific approach. For instance, if you don't cross-validate, the results you read (such as accuracy) might be highly biased by your test set. In an extreme case, your training step might have been very weak (in terms of fitting data) and your test step might have been very good. This applies to ALL machine learning and optimization experiments, not only SVMs.
To avoid such problems, just divide your initial dataset in two (for instance), then train on the first set and test on the second, and repeat the process in reverse, training on the second and testing on the first. This will make any biases in the data visible to you. As someone suggested, you can perform this with even further division: 10-fold cross-validation means dividing your data set into 10 parts, then training on 9 and testing on 1, and repeating the process until you have tested on all parts.
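A small sketch of that splitting scheme (with placeholder data; n_splits=2 gives the train/test swap described above, n_splits=10 gives 10-fold CV):

    # Sketch: generate the train/test splits described above.
    import numpy as np
    from sklearn.model_selection import KFold

    X, y = np.random.rand(50, 4), np.random.randint(0, 2, size=50)   # placeholder data
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        pass  # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]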

Ensemble classifier with wrapper method

I'm trying to combine multiple classifiers (ANN, SVM, kNN, etc.) using ensemble learning (voting, stacking, etc.).
In order to make a classifier, I'm using more than 20 types of explanatory variables.
However, each classifier has its own best subset of explanatory variables. Thus, while seeking the best combination of explanatory variables for each classifier with a wrapper method,
I would like to combine the multiple classifiers (ANN, SVM, kNN, etc.) using ensemble learning (voting, stacking, etc.).
By using meta-learning in Weka, I should be able to build the ensemble itself.
But I cannot obtain the best combination of explanatory variables, since the wrapper method summarizes the predictions of each classifier.
I am not tied to Weka if this can be solved more easily in, say, MATLAB or R.
With ensemble approaches, the best results have been achieved with very simple classifiers, which on the other hand can be pretty fast, making up for the ensemble cost.
This may seem counterintuitive at first: one would expect a better input classifier to produce a better output. However, there are two reasons why this does not work.
First of all, with simple classifiers you can usually tweak them more to get a diverse set of input classifiers. A full-dimensional method plus feature bagging gives you a diverse set of classifiers, whereas a classifier that internally does feature selection or reduction makes feature bagging largely ineffective for producing variety. Secondly, a complex method such as SVM is more likely to optimize/converge towards the very same result. After all, complex methods are supposed to search a much larger space and find the best result in that space. But that also means you are more likely to get the same result again.
Last but not least, when using very primitive classifiers, the errors are better behaved and more likely to even out when the ensemble is combined.
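A hedged sketch of the "simple classifiers + feature bagging" idea (the choice of shallow trees and the subsampling fractions are arbitrary):

    # Sketch: bag many simple trees, each seeing only a random subset of the
    # features, to get a diverse ensemble of cheap base classifiers.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    ensemble = BaggingClassifier(
        DecisionTreeClassifier(max_depth=2),   # deliberately simple base learner
        n_estimators=50,
        max_features=0.5,                      # feature bagging: half the features per model
        max_samples=0.8,
    )
    # ensemble.fit(X_train, y_train); ensemble.predict(X_test)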