Classification with multiple features?

I have:
1) 2 groups of subjects (controls and cancer patients)
2) a set of features for each of them.
I want to find which feature, or which combination of features, discriminates best between the two groups.
I have started with an evaluation of the AUC, then with some k-means clustering, but I don't know how to combine features for the classification.
Thank you

I suggest you use some method of feature importance evaluation. There are many different ways to test the importance of features. An easy one, in my opinion, is the Random Forest classifier. This model has "built-in" feature importance evaluation during training, based on the out-of-bag error. Tree-based classifiers evaluate the information gain obtained from each feature's value during the training process.
You can also test feature importance by checking the model score while modifying the data set, e.g. using a backward elimination strategy.
You can also use PCA or statistical tests. Finally, you can look for dependencies between features, to remove from your data the features that do not provide enough information.
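A minimal sketch of that Random Forest route in Python with scikit-learn (assuming a feature matrix X of shape n_subjects x n_features and binary labels y with 0 = control, 1 = patient; note that scikit-learn's feature_importances_ are impurity-based):

```python
# Sketch: rank features by a Random Forest's built-in importances.
# X and y are assumed to already exist (names chosen for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

print("Out-of-bag accuracy:", rf.oob_score_)

# Impurity-based importances, sorted from most to least important.
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking[:10]:
    print("feature", idx, "importance", round(rf.feature_importances_[idx], 3))
```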

Related

Feature selection for one class classification

I am trying to apply a One-Class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any methods for feature selection that do not need the label of the class?
If yes, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated instead of Stack Exchange, although since you ask for implementations I will answer your question.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminate some of those features or reduce them to fewer features whilst retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which program you're working in, but I am going to presume it's Python. sklearn has implementations for PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know if there's an unsupervised implementation of redundancy analysis in Python, but you could consider making your own. Depending on what you decide to do, it might not be too tricky (especially if you just do something correlation-based).
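For instance, a minimal sketch of those Python options (an illustration rather than a recipe, assuming an unlabeled feature matrix X; the component counts and the 0.95 thresholds are arbitrary choices):

```python
# Sketch: three label-free (unsupervised) reduction techniques.
# X is assumed to be an (n_samples x n_features) NumPy array.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection

# 1) PCA: keep enough components to retain ~95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

# 2) Sparse random projection: cheap dimensionality reduction, no labels needed.
X_rp = SparseRandomProjection(n_components=50, random_state=0).fit_transform(X)

# 3) Home-made correlation-based redundancy filter: drop one feature from every
#    highly correlated pair.
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)                      # consider each pair only once
to_drop = [j for j in range(X.shape[1]) if np.any(upper[:, j] > 0.95)]
X_filtered = np.delete(X, to_drop, axis=1)
```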
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.

Most important attributes in MATLAB

So I have a dataset of 77 cancer patients and 12,500+ attributes. I have applied Principal Component Analysis in order to filter the attributes and only retain the ones that explain the most variance.
My question is: are there techniques in MATLAB, other than PCA, to identify the attributes with the most predictive power?
There are two main ways to cleverly "reduce the dimensionality" of your dataset. One is Feature Transformation (that includes, for example, PCA), and the other one is Feature Selection.
It seems that you are looking for a Feature Selection algorithm, which would retain the most informative original attributes. By contrast, a Feature Transformation algorithm will generate a new set of attributes!
As for your exact question, there are multiple choices you can make. Keep in mind that, naively, each Feature Selection algorithm will have to choose the best features according to "how well" those features alone can model the problem.
For a MATLAB built-in implementation, if you have the Statistics and Machine Learning Toolbox installed, you can use the "Sequential feature selection" function sequentialfs.
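If you ever need the same thing outside MATLAB, scikit-learn offers an analogous routine; a small sketch (assuming a data matrix X and labels y, with an arbitrary target of 10 features):

```python
# Sketch: forward sequential feature selection, the scikit-learn analogue of
# MATLAB's sequentialfs. X and y are assumed placeholders for the data.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,   # illustrative choice
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
```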

k-means clustering: variable selection

I'm applying a k-means algorithm for clustering my customer base. I'm struggling conceptually with the selection process for the dimensions (variables) to include in the model. I was wondering if there are established methods to compare models with different variables. In particular, I was thinking of using the common SSwithin / SSbetween ratio, but I'm not sure whether that can be applied to compare models with a different number of dimensions...
Any suggestions?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward selection" (SFBS). Those are heuristic methods where you eliminate (or add) one feature at a time based on your performance metric, e.g., mean squared error (MSE). You could also use a genetic algorithm for that if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "How do I select the k (a fixed number) best features (where k < d)?", e.g., for computational efficiency or visualization purposes. Now I think what you were asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I don't think you can really improve the performance via feature selection unless you have the ground truth labels.
I have to admit that I have more experience with supervised than with unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I do use dimensionality reduction frequently for data compression, though.
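For concreteness, a small sketch of comparing candidate feature subsets for k-means via the silhouette index (X and the subsets are placeholder names; the number of clusters is arbitrary):

```python
# Sketch: score hand-picked feature subsets for k-means with the silhouette index.
# X is assumed to be an (n_customers x n_variables) NumPy array.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

candidate_subsets = {
    "all_variables": list(range(X.shape[1])),
    "subset_a": [0, 1, 2],        # placeholder column indices
    "subset_b": [0, 3, 4, 5],
}

for name, cols in candidate_subsets.items():
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[:, cols])
    print(name, silhouette_score(X[:, cols], labels))
```

Keep in mind that the silhouette is computed in each subset's own feature space, so scores for subsets of very different dimensionality are not strictly comparable, which is essentially the caveat raised above.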

Fusion Classifier in Weka?

I have a dataset with 20 features: 10 for age and 10 for weight. I want to classify the data for each group separately, then use the results from these 2 classifiers as input to a third one for the final result.
Is this possible with Weka?
Fusion of decisions is possible in WEKA (or with any two models), but not using the approach you describe.
Seeing as you're using classifiers, each model will only output a class. You could use the two labels produced as features for a third model, but the lack of diversity in your inputs would most likely prevent the third model from giving you anything interesting.
At the most basic level, you could implement a voting scheme. Give each model a "vote" and then assume that the correct class is the majority-voted class. While this gives a rudimentary form of fusion, if you're familiar with voting theory you know that majority rule somewhat falls apart when you have more than two classes.
I recommend that you use Combinatorial Fusion to fuse the output of the two classifiers. A good paper regarding the technique is available as a free PDF here. In essence, you use the Classifier::distributionForInstance() method provided by WEKA's classifiers and then use the sum of the distributions (called "scores") to rank the classes, choosing the class with the highest rank. The paper demonstrates that this method is superior to voting alone.
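To make the score-fusion idea concrete outside of WEKA, here is a hedged Python sketch in which scikit-learn's predict_proba plays the role of distributionForInstance (X_age, X_weight and y are hypothetical placeholders for the two 10-feature views and the labels):

```python
# Sketch: fuse two classifiers by summing their class-probability "scores"
# and ranking the classes. The variable names are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

clf_age = RandomForestClassifier(random_state=0).fit(X_age, y)
clf_weight = LogisticRegression(max_iter=1000).fit(X_weight, y)

# Sum of the two probability distributions, then pick the top-ranked class.
scores = clf_age.predict_proba(X_age) + clf_weight.predict_proba(X_weight)
fused_prediction = clf_age.classes_[np.argmax(scores, axis=1)]
```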

Results of two feature selection algorithms do not match

I am working on two feature selection algorithms for a real-world problem where the sample size is 30 and the feature size is 80. The first algorithm is wrapper forward feature selection using an SVM classifier; the second is a filter feature selection algorithm using the Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the features selected by these two algorithms do not overlap at all. Is that reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using Libsvm + matlab.
It can definitely happen, as the two strategies do not have the same expressive power.
Trust the wrapper if you want the best feature subset for prediction; trust the correlation if you want all features that are linked to the output/predicted variable. Those subsets can be quite different, especially if you have many redundant features.
Using the top correlated features is a strategy which assumes that the relationships between the features and the output/predicted variable are linear (or at least monotonic, in the case of Spearman's rank correlation), and that features are statistically independent of one another and do not 'interact' with one another. Those assumptions are most often violated in real-world problems.
Correlations, or other 'filters' such as mutual information, are better used to filter out features, i.e. to decide which features not to consider, rather than to decide which features to keep. Filters are necessary when the initial feature count is very large (hundreds, thousands) to reduce the workload for a subsequent wrapper algorithm.
Depending on the distribution of the data, you can use either Spearman or Pearson. The latter is used for normally distributed data, while the former is for non-normal data. Find the distribution and use the appropriate one.
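To illustrate the filter side, a minimal sketch (assuming a 30 x 80 matrix X and a target vector y, names chosen for illustration) that ranks features by their absolute Pearson and Spearman correlation with the target:

```python
# Sketch: rank features by absolute Pearson and Spearman correlation with y.
# X (30 x 80) and y are assumed to already exist.
import numpy as np
from scipy.stats import pearsonr, spearmanr

pearson_scores, spearman_scores = [], []
for j in range(X.shape[1]):
    r, _ = pearsonr(X[:, j], y)
    rho, _ = spearmanr(X[:, j], y)
    pearson_scores.append(abs(r))
    spearman_scores.append(abs(rho))

print("Top 10 by Pearson :", np.argsort(pearson_scores)[::-1][:10])
print("Top 10 by Spearman:", np.argsort(spearman_scores)[::-1][:10])
```

With only 30 samples, correlation estimates are quite noisy, which alone can make two rankings disagree.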