scikit adaboost feature_importance_ - feature-selection

How exactly does the AdaBoost algorithm implemented in scikit-learn assign feature importances to each feature? I am using it for feature selection, and my model performs better when I apply feature selection based on the values of feature_importances_.

feature_importances_ is an attribute available on sklearn's AdaBoost classifier when the base estimator is a decision tree. In order to understand how feature_importances_ is calculated for the AdaBoost algorithm, you first need to understand how it is calculated for a decision tree classifier.
Decision Tree Classifier:
The feature_importances_ will vary depending on which split criterion you choose. When the split criterion is set to "entropy" (DecisionTreeClassifier(criterion='entropy')), the feature_importances_ are equivalent to the information gain of each feature. Here is a tutorial on how to compute the information gain of each feature (slide 7 in particular). When you change the split criterion, the feature_importances_ are no longer equivalent to the information gain; however, the steps you take to calculate them are similar to those on slide 7 (with the new split criterion used in place of entropy).
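The slide-7 style calculation can be sketched in a few lines of Python. The toy feature values below are made up purely for illustration:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """Information gain of a categorical feature x with respect to labels y:
    parent entropy minus the size-weighted entropy of each value's subset."""
    weighted = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - weighted

# A feature that perfectly separates the two classes has maximal gain.
x = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
print(information_gain(x, y))  # -> 1.0 bit
```

Replacing entropy with another impurity measure (e.g. Gini) in the same weighted-difference scheme gives the analogous importance for other split criteria.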
Ensemble Classifiers:
Now let's return to your original question of how it is determined for the AdaBoost algorithm. According to the docs:
This notion of importance can be extended to decision tree ensembles by simply averaging the feature importance of each tree
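In sklearn's implementation this average is weighted by the boosting estimator weights. A quick sanity check, using the iris data purely as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# The ensemble importance should match the average of the per-tree
# importances, weighted by each tree's boosting weight.
manual = np.average(
    [tree.feature_importances_ for tree in clf.estimators_],
    axis=0,
    weights=clf.estimator_weights_[: len(clf.estimators_)],
)
print(clf.feature_importances_)
print(manual)
```

Since each tree's importances sum to 1, the ensemble's importances also sum to 1.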

Related

How Adaboost and decision tree features importances differ?

I have a multiclass classification problem, and I extracted feature importances based on impurity decrease. I compared a decision tree and an AdaBoost classifier, and I observed that a feature ranked at the top by the decision tree has a much lower importance according to AdaBoost.
Is that a normal behavior?
Thanks
Yes, it is normal behavior. Feature importance assigns a score to each input feature of a model. However, each model uses a (slightly) different technique. For example, a linear regression looks at linear relationships: if a feature has a perfect linear relationship with your target, it will have a high feature importance. Features with a non-linear relationship may not improve the accuracy, resulting in a lower feature importance score.
There is some research related to the difference in feature importance measures. An example is:
https://link.springer.com/article/10.1007/s42452-021-04148-9

Most important attributes in matlab

So I have a dataset of 77 cancer patients and 12,500+ attributes. I have applied Principal Component Analysis in order to filter the attributes and retain only the ones that explain the most variance.
My question is, are there techniques in Matlab, other than PCA, to identify the attributes with the most predictive power?
There are two main ways to cleverly "reduce the dimensionality" of your dataset. One is Feature Transformation (that includes, for example, PCA), and the other one is Feature Selection.
It seems that you are looking for a Feature Selection algorithm, which would retain the most informative original attributes. A Feature Transformation algorithm, in contrast, generates an entirely new set of attributes!
As for your exact question, there are multiple choices you can make. Keep in mind that, naively, each Feature Selection algorithm will have to choose the best features according to "how well" those features alone can model the problem.
For a MATLAB built-in implementation, if you have the Statistics and Machine Learning Toolbox installed, you can use the "Sequential feature selection" function sequentialfs.

kmean clustering: variable selection

I'm applying a k-means algorithm to cluster my customer base. I'm struggling conceptually with the selection process for the dimensions (variables) to include in the model. I was wondering if there are established methods to compare models with different variables. In particular, I was thinking of using the common SSwithin / SSbetween ratio, but I'm not sure whether that can be applied to compare models with a different number of dimensions...
Any suggestions?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward selection" (SFBS). Those are heuristic methods where you add (or eliminate) one feature at a time based on your performance metric, e.g., mean squared error (MSE). You could also use a genetic algorithm for this if you like.
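The plain (non-floating) forward variant of these algorithms is easy to sketch. Below is a supervised toy setup using cross-validated accuracy as the performance metric; the dataset, model, and subset size k are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, model, k):
    """Greedy sequential forward selection: repeatedly add the feature
    that most improves cross-validated score until k are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=3).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_iris(return_X_y=True)
print(forward_select(X, y, LogisticRegression(max_iter=1000), k=2))
```

The floating variants add a backtracking step (re-test dropping previously selected features after each addition); for an unsupervised setting you would swap the cross-validation score for a clustering criterion.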
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "how do I select the k (a fixed number) best features (where k < d)?", e.g., for computational efficiency or visualization purposes. Now, I think what you were asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I don't think you can really improve the performance via feature selection unless you have the ground-truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.
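Comparing feature subsets by silhouette index looks roughly like this; iris serves as a stand-in dataset, and the two candidate subsets are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Cluster on each candidate feature subset and score it with the
# silhouette index; higher means tighter, better-separated clusters.
results = {}
for cols in [(0, 1), (2, 3)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
    results[cols] = silhouette_score(X[:, cols], labels)
print(results)
```

Note the caveat from the answer above still applies: a subset can score well on silhouette without the clusters being meaningful for your actual business question.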

Fusion Classifier in Weka?

I have a dataset with 20 features: 10 for age and 10 for weight. I want to classify the data for each group separately, then use the results from these two classifiers as input to a third for the final result.
Is this possible with Weka?
Fusion of decisions is possible in WEKA (or with any two models), but not using the approach you describe.
Seeing as you're using classifiers, each model will only output a class. You could use the two labels produced as features for a third model, but the lack of diversity in your inputs would most likely prevent the third model from giving you anything interesting.
At the most basic level, you could implement a voting scheme: give each model a "vote" and then assume that the correct class is the majority-voted class. While this gives a rudimentary form of fusion, if you're familiar with voting theory you know that majority rule somewhat falls apart when you have more than two classes.
I recommend that you use Combinatorial Fusion to fuse the output of the two classifiers. A good paper regarding the technique is available as a free PDF here. In essence, you use the Classifier::distributionForInstance() method provided by WEKA's classifiers and then use the sum of the distributions (called "scores") to rank the classes, choosing the class with the highest rank. The paper demonstrates that this method is superior to voting alone.
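Outside of Weka, the score-sum idea carries over to any library that exposes per-class probability scores. A minimal Python sketch, with models and data chosen arbitrarily for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
a = LogisticRegression(max_iter=1000).fit(X, y)
b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Score-level fusion: add the per-class probability "scores" of both
# models and pick the class whose combined score ranks highest.
scores = a.predict_proba(X) + b.predict_proba(X)
fused = scores.argmax(axis=1)
print((fused == y).mean())
```

Because fusion happens at the score level rather than the label level, the third-stage model (or the argmax here) sees much richer information than two hard votes.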

results of two feature selection algo do not match

I am working with two feature selection algorithms on a real-world problem where the sample size is 30 and the feature count is 80. The first algorithm is wrapper forward feature selection using an SVM classifier; the second is a filter feature selection algorithm using the Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the features selected by these two algorithms do not overlap at all. Is that reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using Libsvm + matlab.
It can definitely happen, as the two strategies do not have the same expressive power.
Trust the wrapper if you want the best feature subset for prediction, trust the correlation if you want all features that are linked to the output/predicted variable. Those subsets can be quite different, especially if you have many redundant features.
Using the top correlated features is a strategy which assumes that the relationships between the features and the output/predicted variable are linear (or at least monotonic, in the case of Spearman's rank correlation), and that the features are statistically independent of one another and do not 'interact' with one another. Those assumptions are most often violated in real-world problems.
Correlations, or other 'filters' such as mutual information, are better used to filter features out, i.e., to decide which features not to consider, rather than to decide which features to keep. Filters are necessary when the initial feature count is very large (hundreds or thousands) in order to reduce the workload for a subsequent wrapper algorithm.
Depending on the distribution of the data, you can use either Spearman or Pearson correlation. The latter is used for normally distributed data, the former for non-normal data. Check the distribution and use the appropriate one.
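The linear-versus-monotonic distinction is easy to see numerically. In this synthetic example (the data are fabricated for illustration), a monotone but non-linear transform of the predictor lowers the Pearson coefficient while leaving the Spearman coefficient essentially unchanged:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x_lin = rng.normal(size=200)
x_mono = np.exp(x_lin)               # monotone, but non-linear, transform
y = x_lin + 0.1 * rng.normal(size=200)

# Pearson rewards linearity; Spearman only requires monotonicity.
print("linear  :", pearsonr(x_lin, y)[0], spearmanr(x_lin, y)[0])
print("monotone:", pearsonr(x_mono, y)[0], spearmanr(x_mono, y)[0])
```

Spearman is unaffected by the transform because it operates on ranks, and exp() preserves rank order.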