Spark ML API to convert a vector to a probability for multi-label classification - Scala

I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API where I can get the probability for each class for a single example. I've read on SO that you can use the pipeline API and get the probabilities, but for my use case this is going to be hard because I'll have to replicate 160 RDDs for my evaluation features, get the probability for each class, and then do a join to rank the classes by their probabilities. Instead, I want to have just one copy of the evaluation features, broadcast the 160 models, and then do the predictions inside the map function. I find myself having to implement this myself, but wonder if there's another convenience API in Spark that does the same for different classifiers like logistic regression/RF, converting a Vector representing features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
EDIT: I tried to create a function to transform a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of tree-traversal code from Spark, and almost everywhere I hit dead ends because some function or variable was private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, because scikit-learn already has such APIs in place to do this.
Thanks

Found the culprit line in the Spark MLlib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224
The predict method is marked as protected, but it should actually be public for such use cases to be supported.
This has been fixed in version 2.4 as seen here:
https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
So upgrading to version 2.4 should do the trick ... although I don't think 2.4 is out yet, so it's a matter of waiting.
EDIT: for people who are interested: apparently this is not only beneficial for multi-label prediction; a 3-4x latency improvement has also been observed for regular classification/regression on single-instance/small-batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).
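For reference, a minimal sketch of the broadcast-and-map approach described above, assuming a Spark version where the single-instance methods are public (predict since 2.4; predictProbability only in a later 3.x release, so check your version). The names spark, models, and featureRdd are illustrative:

import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector

// `models` holds the 160 trained per-label binary models (illustrative name),
// `featureRdd` is the single copy of evaluation feature vectors.
val bcModels = spark.sparkContext.broadcast(models)

val ranked = featureRdd.map { v: Vector =>
  val probs = bcModels.value.zipWithIndex.map { case (m, label) =>
    (label, m.predictProbability(v)(1)) // P(positive) under this label's model
  }
  probs.sortBy(-_._2) // rank the 160 labels by probability
}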

Related

Can I separately train a classifier (e.g. SVM) with two different types of features and combine the results later?

I am a student working on my first simple machine learning project. The project is about classifying articles as fake or true. I want to use SVM as the classification algorithm and two different types of features:
TF-IDF
Lexical Features like the count of exclamation marks and numbers
I have figured out how to use the lexical features and TF-IDF as features separately. However, I have not managed to figure out how to combine them.
Is it possible to train and test two separate learning algorithms (one with TF-IDF and the other with lexical features) and later combine the results?
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
One way of combining two models is called model stacking. The idea behind it is that you take the predictions of both models and feed them into a third model (called a meta-model), which is then trained to make predictions given the output of the first two models. There is also another version of model stacking where you additionally feed the original features into the meta-model.
However, in your case another way to combine both approaches would be to simply feed both the TF-IDF and the lexical features into one model and see how that performs (see the sketch at the end of this answer).
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
This would unfortunately not work, because there is no combined model making those predictions, so the averaged metrics would not describe any actual model.
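If you do want the single-model variant in Spark ML (to match the rest of this thread), the feature concatenation could look roughly like this; the column names, the LinearSVC choice, and trainingData are illustrative:

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.VectorAssembler

// Concatenate the two precomputed feature vectors into a single column.
val assembler = new VectorAssembler()
  .setInputCols(Array("tfidf", "lexical")) // illustrative column names
  .setOutputCol("features")

val svm = new LinearSVC()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = svm.fit(assembler.transform(trainingData))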

spark ml : how to find feature importance

I am new to ML and I am building a prediction system using Spark ML. I read that a major part of feature engineering is to find the importance of each feature in making the required prediction. In my problem, I have three categorical features and two string features. I use the OneHotEncoding technique to transform the categorical features and a simple HashingTF mechanism to transform the string features. These are then input as stages of a Pipeline, including ML NaiveBayes and a VectorAssembler (to assemble all the features into a single column), and fit and transformed using the training and test data sets respectively.
Everything is good, except: how do I decide the importance of each feature? I know I have only a handful of features now, but I will be adding more soon. The closest thing I came across was the ChiSqSelector available in the Spark ML module, but it seems to only work for categorical features.
Thanks, any leads appreciated!
You can see these examples:
The method mentioned in the question's comments
Information Gain based feature selection in Spark’s MLlib
This package contains several feature selection methods (also InfoGain):
Information Theoretic Feature Selection Framework
Using ChiSqSelector is okay; you can simply discretize your continuous features (the HashingTF values). One example is provided in http://spark.apache.org/docs/latest/mllib-feature-extraction.html; I copy here the part of interest:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Discretize data into 16 equal bins, since ChiSqSelector requires categorical features.
// Even though the features are doubles, ChiSqSelector treats each distinct value as a category.
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
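The same docs page then fits the selector on the discretized data; the feature count of 50 is the docs' example value:

import org.apache.spark.mllib.feature.ChiSqSelector

// Keep the 50 most informative features according to the chi-squared test.
val selector = new ChiSqSelector(50)
val transformer = selector.fit(discretizedData)
val filteredData = discretizedData.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}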
L1 regularization is also an option.
You can use L1 to get the feature importances from the coefficients, and decide which features to use for the Naive Bayes training accordingly.
Example of getting coefficients
Update:
Some conditions under which coefficients do not work very well
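As a rough sketch of the L1 route (pure L1 via elasticNetParam = 1.0; the regParam value and trainingData are illustrative):

import org.apache.spark.ml.classification.LogisticRegression

// A pure L1 penalty drives uninformative coefficients to exactly zero.
val lr = new LogisticRegression()
  .setElasticNetParam(1.0) // 1.0 = pure L1, 0.0 = pure L2
  .setRegParam(0.01)       // illustrative regularization strength
val lrModel = lr.fit(trainingData)

// Features whose coefficients survived the penalty are the ones the model kept.
val kept = lrModel.coefficients.toArray.zipWithIndex.filter { case (w, _) => w != 0.0 }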

Fusion Classifier in Weka?

I have a dataset with 20 features: 10 for age and 10 for weight. I want to classify the data for each group separately, then use the results from these two classifiers as input to a third for the final result.
Is this possible with Weka?
Fusion of decisions is possible in WEKA (or with any two models), but not using the approach you describe.
Seeing as you're using classifiers, each model will only output a class. You could use the two labels produced as features for a third model, but the lack of diversity in your inputs would most likely prevent the third model from giving you anything interesting.
At the most basic level, you could implement a voting scheme. Give each model a "vote" and then assume that the correct class is the majority-voted class. While this gives a rudimentary form of fusion, if you're familiar with voting theory you know that majority rule somewhat falls apart when you have more than two classes.
I recommend that you use combinatorial fusion to fuse the output of the two classifiers. A good paper regarding the technique is available as a free PDF here. In essence, you use the Classifier::distributionForInstance() method provided by WEKA's classifiers and then use the sum of the distributions (called "scores") to rank the classes, choosing the class with the highest rank. The paper demonstrates that this method is superior to voting alone.
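A minimal sketch of that score fusion against the WEKA API (in Scala, like the rest of this thread; modelA, modelB, and inst are assumed to be two trained classifiers and a test instance):

import weka.classifiers.Classifier
import weka.core.Instance

// Sum the two per-class probability distributions ("scores") and
// pick the class with the highest combined score.
def fusedPrediction(modelA: Classifier, modelB: Classifier, inst: Instance): Int = {
  val distA = modelA.distributionForInstance(inst)
  val distB = modelB.distributionForInstance(inst)
  val scores = distA.zip(distB).map { case (a, b) => a + b }
  scores.indexOf(scores.max)
}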

How to get the probabilities of each class for the test instances in weka

I have my training data with a binary column at the end, and I need to run a classifier that gives me a numerical probability that its prediction is correct. I've tried running it with the linear regression classifier and I get some negative numbers in the prediction column. I also tried it with the lazy IBk classifier, but only got predictions (of 1) where the binary column was 1.
Yeah, not all algorithms provide that information. Naive Bayes is a standard for that sort of thing and works surprisingly well under many conditions.
I had great results with RandomCommittee where probabilities were a key factor in our application. Not only did we have to pick the top of the list, but we also needed to know if there were any close runners-up. When this happened, the application paused and asked the user for clarification (which was stored back in the database, of course). RandomCommittee was chosen after extensive testing over time.
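A sketch of reading those per-class probabilities and flagging close runners-up (the Naive Bayes choice, the 0.1 threshold, and the train/test Instances are all illustrative; the class index must already be set):

import weka.classifiers.bayes.NaiveBayes

val nb = new NaiveBayes()
nb.buildClassifier(train)

for (i <- 0 until test.numInstances()) {
  val dist = nb.distributionForInstance(test.instance(i)) // per-class probabilities
  val sorted = dist.sorted.reverse
  if (sorted(0) - sorted(1) < 0.1) // arbitrary "close runner-up" threshold
    println(s"Instance $i is ambiguous: ${dist.mkString(", ")}")
}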

different results by SMO, NaiveBayes, and BayesNet classifiers in weka

I am trying different classifiers from Weka on my data set. I have a small dataset and I am classifying my data into five classes.
My problem is that when I apply cross-validation or percentage-split classification with different classifiers, I get very different results.
For example, when I use the NaiveBayes or BayesNet classifiers, I get an F-score of around 40 for all classes, but using SMO I get an F-score of 20. The worst result is obtained with the LibLinear classifier, which gives me F-scores of around 15.
Maybe I should mention that since the LibLinear classifier doesn't accept nominals, I assign a code to each possible nominal value and use them as numeric values in my dataset.
Can anybody tell me why I get such different results? I expected all classifiers to have roughly similar results.
In addition, when I use LibLinear on my test set, all the data is classified under one of the classes and there are no instances in the other four classes.
Thanks in advance,
Why would you expect similar results? Especially for a small data set, different methods can easily lead to different predictions. Also, linear models have a tolerance threshold that can cause early termination before convergence; that's something you can play with in LibLINEAR or SMO, for instance.
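For instance, a sketch of tightening that tolerance on WEKA's SMO before training (the value is illustrative, and train is assumed to be an Instances object with the class index set):

import weka.classifiers.functions.SMO

val smo = new SMO()
smo.setToleranceParameter(1.0e-4) // tighter than the usual 1.0e-3 default
smo.buildClassifier(train)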