How do AdaBoost and decision tree feature importances differ? - classification

I have a multiclass classification problem and I extracted feature importances based on impurity decrease. I compared a decision tree and an AdaBoost classifier and observed that a feature ranked at the top by the decision tree has a much lower importance according to AdaBoost.
Is that a normal behavior?
Thanks

Yes, this is normal behavior. Feature importance assigns a score to each input feature of a model, but each model computes it with a (slightly) different technique. For example, a linear regression looks at linear relationships: if a feature has a perfect linear relationship with your target, it will get a high feature importance, while features with a non-linear relationship may not improve the accuracy and so end up with a lower feature importance score.
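For instance, here is a minimal sketch (toy data and hyperparameters of my own choosing, not from your setup) showing that the impurity-based rankings from a single decision tree and from AdaBoost can disagree on the same data:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier

    X, y = load_iris(return_X_y=True)

    # A single deep tree vs. a boosted ensemble of shallow trees (stumps by default)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Both expose impurity-based importances, but the rankings need not agree
    print("Decision tree:", tree.feature_importances_)
    print("AdaBoost:     ", ada.feature_importances_)

A single tree credits whatever splits it happens to pick near the root, while the boosted stumps spread credit across whichever features help on the reweighted samples, so a feature that dominates one ranking can sit much lower in the other.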
There is some research related to the difference in feature importance measures. An example is:
https://link.springer.com/article/10.1007/s42452-021-04148-9


scikit adaboost feature_importance_

How exactly does the AdaBoost algorithm implemented in Python assign feature importances to each feature? I am using it for feature selection, and my model performs better when I apply feature selection based on the values of feature_importances_.
feature_importances_ is an attribute available on sklearn's AdaBoost classifier when the base classifier is a decision tree. In order to understand how feature_importances_ is calculated for the AdaBoost algorithm, you first need to understand how it is calculated for a decision tree classifier.
Decision Tree Classifier:
The feature_importances_ will vary depending on which split criterion you choose. When the criterion is set to "entropy", DecisionTreeClassifier(criterion='entropy'), the feature_importances_ are equivalent to the information gain of each feature. Here is a tutorial on how to compute the information gain of each feature (slide 7 in particular). When you change the split criterion, the feature_importances_ are no longer equivalent to the information gain, but the steps you take to calculate them are similar to those in slide 7 (with the new split criterion used in place of entropy).
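For concreteness, here is a small self-contained sketch (my own toy example, not from the linked slides) of computing the information gain of a single categorical feature:

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a label array
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(x, y):
        # entropy of y minus the weighted entropy of y within each value of x
        return entropy(y) - sum(
            (x == v).mean() * entropy(y[x == v]) for v in np.unique(x)
        )

    # Toy example: a binary feature that separates the labels reasonably well
    x = np.array([0, 0, 0, 1, 1, 1])
    y = np.array([0, 0, 1, 1, 1, 1])
    print(information_gain(x, y))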
Ensemble Classifiers:
Now let's return to your original question of how this is determined for the AdaBoost algorithm. According to the docs:
This notion of importance can be extended to decision tree ensembles by simply averaging the feature importance of each tree
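As a rough sketch of that averaging (toy data of my own; to my understanding, sklearn additionally weights each tree by its boosting weight, estimator_weights_, rather than taking a plain mean):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier

    X, y = load_iris(return_X_y=True)
    ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

    # Average the per-tree importances, weighted by each tree's boosting weight
    manual = np.average(
        [tree.feature_importances_ for tree in ada.estimators_],
        axis=0,
        weights=ada.estimator_weights_[:len(ada.estimators_)],
    )
    print(manual)
    print(ada.feature_importances_)  # should match (up to floating point) if the above holds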

Can KNN be better than other classifiers?

As is known, some classifiers have a training or learning step, like SVM or Random Forest. KNN, on the other hand, does not.
Can KNN be better than these classifiers?
If no, why?
If yes, when, how and why?
The main answer is yes, it can, due to the implications of the no free lunch theorem. The NFL theorem can be loosely stated as (in terms of classification):
There is no universal classifier which is consistently better at every task than the others.
It can also be (not very strictly) inverted:
For each (well-defined) classifier there exists a dataset where it is the best one.
In particular, kNN is a well-defined classifier; it is consistent with any distribution, which means that given infinitely many training points it converges to the optimal Bayes separator.
So can it be better than SVM or RF? Obviously! When? There is no clear answer. First of all, in supervised learning you often get just one training set and try to fit the best model; in such a scenario any model can be the best one. When statisticians or theoretical ML researchers try to answer whether one model is better than another, they actually ask "what would happen if we had infinitely many training sets" - so they look at the expected behaviour of the classifiers. In that setting one can often show that SVM/RF is better than kNN. But that does not mean they are always better. It only means that for a randomly selected dataset you should expect kNN to work worse, and this is only a probability. Just as you can always win a lottery (no matter the odds!), you can also always win with kNN (to be clear, kNN has a much bigger chance of being a good model than you have of winning a lottery :-)).
What are particular examples? Let us, for example, consider a rotated XOR problem.
If the true decision boundaries are as above and you only have these four points, obviously 1NN will be much better than SVM (with a dot, poly or rbf kernel) or RF. It should remain true as you include more and more training points.
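A quick sketch of that kind of situation (my own synthetic rotated-XOR data, and only a linear-kernel SVM for the comparison), where 1NN copes with the boundary while the linear SVM cannot:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(400, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like labelling of the quadrants
    theta = np.pi / 4                          # rotate everything by 45 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X = X @ R.T

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    print("1NN:       ", KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te))
    print("linear SVM:", SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te))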
"In general kNN would not be expected to exceed SVM or RF. When kNN does, that says something very interesting about the training data. If many doublets are present i the data set, a nearest neighbor algorithm works very well."
I heard the argument something like as written by Claudia Perlich in this podcast:
http://www.thetalkingmachines.com/blog/2015/6/18/working-with-data-and-machine-learning-in-advertizing
My intuitive understanding of why RF and SVM are better than kNN in general: all of these algorithms basically assume some local similarity, such that very similar samples get classified alike. kNN can only choose the most similar samples by distance (or some other global kernel), so the samples which could influence a prediction with kNN lie within a hypersphere for the Euclidean distance metric. RF and SVM can learn other definitions of locality, which can stretch far along some features and stay short along others. The propagation of locality can also take on many learned shapes, and these shapes can differ throughout the feature space.

How to use weighted vote for classification using weka

I know we can use the vote classifier to combine different classifiers.
May I know if there is any way to combine the classifiers with different weights for each classifier? How would I be able to do that with Weka?
From googling I know that we can add weights to attributes or instances, but I would like to know how to add weights to classifiers.
If weighted vote is not possible, is there any other way I can do that? Thanks.
It does not appear possible to achieve weighted voting with Weka without modifying the Java classes yourself.
Source 1
Source 2
That being said, I believe you can achieve rudimentary weighting by providing multiples of the base classifiers to the voting meta classifier. This appears to be backed up by the source code.
For example:
Classifier 1: J48 decision tree
Classifier 2: J48 decision tree
Classifier 3: naive Bayes
This would allow the decision tree to vote twice and therefore have a higher weight than naive Bayes.
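Roughly, from memory of Weka's command-line conventions (treat the exact flags as an assumption and check the help for weka.classifiers.meta.Vote; data.arff is just a placeholder), the duplicated base classifiers would be supplied like this:

    java -cp weka.jar weka.classifiers.meta.Vote -t data.arff \
        -B "weka.classifiers.trees.J48" \
        -B "weka.classifiers.trees.J48" \
        -B "weka.classifiers.bayes.NaiveBayes"

The same setup can be built in the Explorer GUI by adding J48 twice to the Vote classifier's list of classifiers.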

SVM LibSVM Ignore Feature 1,3,5 when Predicting

This question is about LibSVM, or SVMs in general.
I wonder if it is possible to classify feature vectors of different lengths with the same SVM model.
Let's say we train the SVM with about 1000 instances of the following feature vector:
[feature1 feature2 feature3 feature4 feature5]
Now I want to predict a test vector which has the same length of 5.
If the probability I receive is too poor, I then want to check a subset of my test vector containing only columns 2-5; that is, I want to dismiss the first feature.
My question now is: is it possible to tell the SVM to use only features 2-5 for prediction (e.g. with weights), or do I have to train different SVM models - one for 5 features, another for 4 features, and so on?
Thanks in advance...
marcus
You can always remove features from your test points by fiddling with the file, but I highly recommend against such an approach. An SVM model is only valid when all the features it was trained on are present. If you are using the linear kernel, simply setting a given feature to 0 will implicitly cause it to be ignored (though you should not do this); with other kernels, this is very much a no-no.
Using a different set of features for prediction than the set you used for training is not a good approach.
I strongly suggest training a new model for the subset of features you wish to use in prediction.
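A minimal sketch of that advice (synthetic data and an arbitrary confidence threshold of my own, using scikit-learn's SVC rather than the LibSVM command-line tools): train one model on all 5 features and a second model on features 2-5, and fall back to the second one instead of zeroing out feature 1:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                        # 1000 instances, 5 features
    y = (X[:, 1] + X[:, 3] > 0).astype(int)

    model_full = SVC(probability=True).fit(X, y)          # trained on all 5 features
    model_subset = SVC(probability=True).fit(X[:, 1:], y) # trained on features 2-5 only

    x_new = rng.normal(size=(1, 5))
    if model_full.predict_proba(x_new).max() < 0.9:       # arbitrary "too poor" threshold
        # fall back to the model that was actually trained without feature 1
        print(model_subset.predict(x_new[:, 1:]))
    else:
        print(model_full.predict(x_new))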

Why does the classification accuracy decrease as the number of features increases when using SVM?

I am using libsvm for image classification. Why does my prediction accuracy decrease when I use more features for classification? Shouldn't it increase? My dataset size is fixed at 1600 for training and 400 for testing.
Because the additional features may not be at all useful for separating the classes in the feature space. Accuracy is not necessarily tied to the number of features.
Including lots of poor features may cause your SVM to learn the noise in the data, damaging the accuracy.
For example, if your extra feature looks like this (using 2D plots for clarity):
Then it will not be a very good feature for separating the (in this case) two classes. If, for example, the SVM trains only on this pattern, it will not be very good at predicting the class of a future point. However, there might be a feature in your dataset that looks like this:
A feature like this would be very useful in separating the two classes.
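To illustrate the point with a small sketch (synthetic data of my own, mirroring the 1600/400 split; exact numbers will vary, but the irrelevant features typically drag the test accuracy down):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2000
    informative = rng.normal(size=(n, 1))                # one genuinely useful feature
    y = (informative[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # noisy labels
    noise = rng.normal(size=(n, 200))                    # 200 features unrelated to the class

    for X in (informative, np.hstack([informative, noise])):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=1600, random_state=0)
        acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{X.shape[1]:>3} features -> test accuracy {acc:.2f}")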