I am new to ML and I am building a prediction system using Spark ML. I have read that a major part of feature engineering is finding how important each feature is to the prediction. In my problem, I have three categorical features and two string features. I use one-hot encoding to transform the categorical features and a simple HashingTF to transform the string features. These are then fed into the stages of a Pipeline, including a VectorAssembler (to assemble all the features into a single column) and ML NaiveBayes, which is fit and transformed using the training and test data sets respectively.
Everything works, except: how do I decide the importance of each feature? I know I have only a handful of features now, but I will be adding more soon. The closest thing I came across was the ChiSqSelector available in the Spark ML module, but it seems to work only for categorical features.
Thanks, any leads appreciated!
You can see these examples:
The method mentioned in the question's comments:
Information Gain based feature selection in Spark’s MLlib
This package contains several feature selection methods (also InfoGain):
Information Theoretic Feature Selection Framework
Using ChiSqSelector is fine: you can simply discretize your continuous features (the HashingTF values). An example is provided in http://spark.apache.org/docs/latest/mllib-feature-extraction.html; I copy the part of interest here:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
// Even though features are doubles, the ChiSqSelector treats each unique value as a category
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
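The same binning rule can be sketched in plain Python (a hypothetical helper; dividing by 16 yields 16 bins only because the doc example's pixel features range over 0-255):

```python
import math

def discretize(features, bin_width=16):
    # Map each continuous value to a bin index, so that a chi-squared
    # selector can treat every distinct bin as a category.
    return [math.floor(x / bin_width) for x in features]

# Pixel-style features in [0, 255] collapse into the 16 categories 0..15.
print(discretize([0.0, 15.9, 16.0, 255.0]))  # [0, 0, 1, 15]
```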
L1 regularization is also an option.
You may use L1 to get feature importances from the coefficients, and decide accordingly which features to use for the Naive Bayes training.
Example of getting coefficients
Update:
Some conditions under which coefficients do not work well
Related
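As a hedged illustration of the L1 idea, here is a scikit-learn sketch on synthetic data (in Spark ML the analogue would be LogisticRegression with elasticNetParam set to 1.0; the 1e-3 threshold below is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 samples, 4 features; only the first two drive the label.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty shrinks coefficients of uninformative features toward zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# Keep only features whose coefficient magnitude is non-negligible.
selected = [i for i, w in enumerate(clf.coef_[0]) if abs(w) > 1e-3]
print(selected)
```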
I am a student and working on my first simple machine learning project. The project is about classifying articles into fake and true. I want to use SVM as classification algorithm and two different types of features:
TF-IDF
Lexical Features like the count of exclamation marks and numbers
I have figured out how to use the lexical features and TF-IDF as features separately. However, I have not managed to figure out how to combine them.
Is it possible to train and test two separate learning algorithms (one with TF-IDF and the other with lexical features) and later combine the results?
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
One way of combining two models is called model stacking. The idea behind it is that you take the predictions of both models and feed them into a third model (called the meta-model), which is then trained to make predictions given the output of the first two models. There is also a version of model stacking where you additionally feed the original features into the meta-model.
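A minimal stacking sketch with scikit-learn's StackingClassifier, on synthetic stand-in data (in your setting the two base models would see the TF-IDF and the lexical features respectively):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The meta-model (logistic regression) learns from the base models' outputs,
# which are produced out-of-fold via the internal cross-validation.
stack = StackingClassifier(
    estimators=[("svm", LinearSVC(max_iter=5000)), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
print(stack.score(X, y))
```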
However, in your case another way to combine both approaches would be to simply feed both the TF-IDF and the lexical features into one model and see how that performs.
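For the single-model route, one common pattern (sketched here with hypothetical documents and hand-rolled lexical counts) is to stack the two feature blocks column-wise with scipy's sparse hstack:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["Breaking!!! shocking number 1 truth", "A calm, factual report on the budget"]
labels = [1, 0]  # hypothetical: 1 = fake, 0 = true

tfidf = TfidfVectorizer().fit_transform(docs)

# Lexical features: exclamation-mark count and digit count per document.
lexical = csr_matrix(
    [[d.count("!"), sum(c.isdigit() for c in d)] for d in docs], dtype=float
)

# One combined matrix feeds a single SVM.
X = hstack([tfidf, lexical])
clf = LinearSVC().fit(X, labels)
print(X.shape)
```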
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
This would unfortunately not work, because there is no combined model actually making those predictions, so the averaged metrics would not describe any real classifier.
I have:
1) 2 groups of subjects (controls and cancer patients)
2) a group of features, for each of them.
I want to find which feature, or which combination of features, discriminates best between the two groups.
I have started by evaluating the AUC, then tried some k-means clustering, but I don't know how to combine features for the classification.
Thank you
I suggest you use some method of feature-importance evaluation. There are many different ways to test the importance of features. An easy one, in my opinion, is a Random Forest classifier. This model has a built-in feature-importance evaluation computed during training: tree-based classifiers measure how much each split on a feature improves the split criterion (for example, the information gain), and Breiman's original formulation additionally uses the out-of-bag error to compute permutation importance.
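A short scikit-learn sketch on synthetic two-group data (by construction, the first two of five features are the informative ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 100 "subjects", 5 features; with shuffle=False the informative
# features occupy columns 0 and 1.
X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```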
You can also test feature importance by checking how the model score changes when you modify the data set, e.g. using a backward-elimination strategy.
You can also use PCA or statistical tests. Finally, you can look for dependencies between features in order to remove, from your data, features that do not provide enough additional information.
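The backward-elimination idea is available off the shelf as recursive feature elimination; a hedged sketch with scikit-learn's RFE on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

# RFE repeatedly drops the weakest feature, as judged by the model's
# coefficients, until the requested number of features remains.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(selector.support_)   # boolean mask of the surviving features
print(selector.ranking_)   # 1 = kept; higher = eliminated earlier
```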
I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on a Dataset[LabeledPoint], I'm finding it hard to find an API that gives me the probability of each class for a single example. I've read on SO that you can use the pipeline API to get the probabilities, but for my use case this is going to be hard, because I'd have to replicate the RDD of evaluation features 160 times, get the probability for each class, and then do a join to rank the classes by their probabilities. Instead, I want to keep just one copy of the evaluation features, broadcast the 160 models, and do the predictions inside the map function. I find myself having to implement this from scratch, but I wonder whether there's a convenience API in Spark that does the same for different classifiers like logistic regression/RF, converting a Vector of features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
EDIT: I tried to create a function to transform a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of the tree-traversal code in Spark, and almost everywhere I ran into dead ends because some function or variable was private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, because scikit-learn already has such APIs in place.
Thanks
Found the culprit line in Spark MLLib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224
The predict method is marked as protected but it should actually be public for such use cases to be supported.
This has been fixed in version 2.4 as seen here:
https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
So upgrading to version 2.4 should do the trick... although I don't think 2.4 is out yet, so it's a matter of waiting.
EDIT: for people who are interested, apparently this is not only beneficial for multi-label prediction; a 3-4x latency improvement has also been observed for regular classification/regression on single-instance/small-batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).
How exactly does the AdaBoost algorithm implemented in Python assign an importance to each feature? I am using it for feature selection, and my model performs better when I apply feature selection based on the values of feature_importances_.
feature_importances_ is an attribute available on sklearn's AdaBoost algorithm when the base classifier is a decision tree. In order to understand how feature_importances_ is calculated in the AdaBoost algorithm, you first need to understand how it is calculated for a decision tree classifier.
Decision Tree Classifier:
The feature_importances_ will vary depending on which split criterion you choose. When the split criterion is set to "entropy" (DecisionTreeClassifier(criterion='entropy')), the feature_importances_ are equivalent to the information gain of each feature. Here is a tutorial on how to compute the information gain of each feature (slide 7 in particular). When you change the split criterion, the feature_importances_ are no longer equivalent to the information gain; however, the steps you take to calculate them are similar to those in slide 7 (with the new split criterion used in place of entropy).
Ensemble Classifiers:
Now let's return to your original question of how it is determined for the AdaBoost algorithm. According to the docs:
This notion of importance can be extended to decision tree ensembles by simply averaging the feature importance of each tree
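A small sketch with scikit-learn's AdaBoostClassifier on synthetic data (the default base learner is a depth-1 decision tree, so each stump's importances are averaged with the boosting weights):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# 4 features, of which the first two are informative by construction.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Weighted average of each tree's impurity-based importances,
# normalized to sum to 1.
print(ada.feature_importances_)
```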
This question is about LibSVM, or SVMs in general.
I wonder if it is possible to classify feature vectors of different lengths with the same SVM model.
Let's say we train the SVM with about 1000 Instances of the following Feature Vector:
[feature1 feature2 feature3 feature4 feature5]
Now I want to predict a test-vector which has the same length of 5.
If the probability I receive is too poor, I then want to check a subset of my test vector containing only columns 2-5; that is, I want to dismiss the first feature.
My question now is: is it possible to tell the SVM to use only features 2-5 for prediction (e.g. with weights), or do I have to train different SVM models: one for 5 features, another for 4 features, and so on?
Thanks in advance...
marcus
You can always remove features from your test points by fiddling with the file, but I highly recommend not doing so. An SVM model is valid only when all features are present. If you are using the linear kernel, setting a given feature to 0 will implicitly cause it to be ignored (though you should not do this); with other kernels, this is very much a no-no.
Using a different set of features for prediction than the set you used for training is not a good approach.
I strongly suggest training a new model for each subset of features you wish to use in prediction.
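A sketch of that suggestion with scikit-learn's SVC on synthetic data (the 0.6 confidence threshold is an arbitrary stand-in for "the probability is too poor"):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# One model per feature subset: all 5 features, and features 2-5 only.
svm_full = SVC(probability=True, random_state=0).fit(X, y)
svm_sub = SVC(probability=True, random_state=0).fit(X[:, 1:], y)

x = X[:1]
p_full = svm_full.predict_proba(x).max()
# Fall back to the 4-feature model when the full model is not confident.
p = p_full if p_full >= 0.6 else svm_sub.predict_proba(x[:, 1:]).max()
print(p)
```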