How to extract variable weight from spark pipeline logistic model? - scala

I am currently trying to learn Spark Pipeline (Spark 1.6.0). I imported datasets (train and test) as oas.sql.DataFrame objects. After executing the following codes, the produced model is a oas.ml.tuning.CrossValidatorModel.
You can use model.transform (test) to predict based on the test data in Spark. However, I would like to compare the weights that model used to predict with that from R. How to extract the weights of the predictors and intercept (if any) of model? The Scala codes are:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")
// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test =conTest.map{ line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit logistic model
val model = cv.fit(train)
// If you want to predict with test
val pred = model.transform(test)
My spark environment is not accessible. Thus, these codes are retyped and rechecked. I hope they are correct. So far, I have tried searching on webs, asking others. About my coding, welcome suggestions, and criticisms.

// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// you can print lr model coefficients as below
val model = cv.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")
Two steps:
Get the best pipeline from cross validation result.
Get the LR Model from the best pipeline. It's the first stage in your code example.

I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
print(model.weight, model.intercept)

I am still not sure about how to extract weights from "model" above. But by restructuring the process towards the official tutorial, the following works on spark-1.6.0:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weigths
lrmodel.intercept
I noticed the difference between "lrmodel" here and "model" generated above:
model.bestModel --> gives oas.ml.Model[_] = pipeline_****
restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****
That's why we can cast resturcturedModel.bestModel as LogisticRegressionModel but not that of model.bestModel. I'll add more when I understand the reason of the differences.

Related

Use saved Spark mllib decision tree binary classification model to predict on new data

I am using Spark version 2.2.0 and scala version 2.11.8.
I created and saved a decision tree binary classification model using following code:
package...
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession
object DecisionTreeClassification {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local[*]")
.appName("Decision Tree")
.getOrCreate()
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/xyz.txt")
// Split the data into training and test sets (20% held out for testing)
val splits = data.randomSplit(Array(0.8, 0.2))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
impurity, maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n ${model.toDebugString}")
// Save and load model
model.save(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
// $example off$
sparkSession.sparkContext.stop()
}
}
Now, I want to predict a label (0 or 1) for a new data using this saved model. I am new to Spark, can anybody please let me know how to do that?
I found answer to this question so I thought I should share it if someone is looking for the answer to similar question
To make prediction for new data simply add few lines before stopping the spark session:
val newData = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/abc.txt")
val newDataPredictions = newData.map
{ point =>
val newPrediction = model.predict(point.features)
(point.label, newPrediction)
}
newDataPredictions.foreach(f => println("Predicted label", f._2))

Using Spark ML's Logistic Regression model on MultiClass Classification giving error : Column prediction already exists

I am using Spark ML's Logistic Regression model for classification problem having 100 categories (0-99). My columns in dataset are - "_c0,_c1,_c2,_c3,_c4,_c5"
where _c5 is a target variable and rest are the features. My code is following :
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.classification.OneVsRest
val _c0Indexer = new StringIndexer().setInputCol("_c0").setOutputCol("_c0Index")
val _c1Indexer = new StringIndexer().setInputCol("_c1").setOutputCol("_c1Index")
val _c2Indexer = new StringIndexer().setInputCol("_c2").setOutputCol("_c2Index")
val _c3Indexer = new StringIndexer().setInputCol("_c3").setOutputCol("_c3Index")
val _c4Indexer = new StringIndexer().setInputCol("_c4").setOutputCol("_c4Index")
val _c5Indexer = new StringIndexer().setInputCol("_c5").setOutputCol("_c5Index")
val assembler = new VectorAssembler().setInputCols(Array("_c0Index", "_c1Index", "_c2Index", "_c3Index","_c4Index")).setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8).setLabelCol("_c5Index").setFeaturesCol("features")
val ovr = new OneVsRest().setClassifier(lr)
val pipeline = new Pipeline().setStages(Array(_c0Indexer, _c1Indexer, _c2Indexer, _c3Indexer, _c4Indexer,assembler, _c5Indexer, ovr,lr))
val model = pipeline.fit(data)
val predictions = model.transform(testdf)
println(predictions.select("features", "_c5Index", "probability","prediction").show(5))
But it is showing an error :
requirement failed: Column prediction already exists.
Can someone please guide why I am getting this error? Thanks in advance.
Try taking out the "lr" at the end of your pipeline. I think it's unnecessary since ovr uses lr.

Spark DataFrame to RDD and back

I am writing an Apache Spark application using Scala. To handle and store data I use DataFrames. I have a nice pipeline with feature extraction and a MultiLayerPerceptron classifier, using the ML API.
I also want to use SVM (for comparison purposes). The thing is (and correct me if I am mistaken) only the MLLib provides SVM. And MLLib is not ready to handle DataFrames, only RDDs.
So I figured I can maintain the core of my application using DataFrames and to use SVM 1) I just convert the DataFrame's columns I need to an RDD[LabeledPoint] and 2) after the classification add the SVMs prediction to the DataFrame as a new column.
The first part I handled with a small function:
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
I have to specify and convert the type of vector since the feature extraction method uses ML API and not MLLib.
Then, this RDD[LabeledPoint] is fed to the SVM and classification goes smoothly, no issues. At the end and following spark's example I get an RDD[Double]:
val predictions = rdd.map(point => model.predict(point.features))
Now, I want to add the prediction score as column to the original DataFrame and return it. This is where I got stuck. I can convert the RDD[Double] to a DataFrame using
(sql context ommited)
import sqlContext.implicits._
val plDF = predictions.toDF("prediction")
But how do I join the two DataFrames where the second DataFrame becomes a column of the original one? I tried to use methods join and union but got SQL exceptions as the DataFrames have no equal columns to join or unite on.
EDIT
I tried
data.withColumn("prediction", plDF.col("prediction"))
But I get an Analysis Exception :(
I haven't figured out how to do it without recurring to RDDs, but anyway here's how I solved it with RDD. Added the rest of the code so that anyone can understand the complete logic. Any suggestions are appreciated.
package stuff
import java.util.logging.{Level, Logger}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
/**
* Created by camandros on 10-03-2017.
*/
class LinearSVMClassifier extends Classifier with Serializable{
#transient lazy val log: Logger = Logger.getLogger(getClass.getName)
private var model : SVMModel = _
override def train(data : DataFrame): Unit = {
val rdd = dataFrameToRDD(data)
// Run training algorithm to build the model
val numIter : Int = 100
val step = Osint.properties(Osint.SVM_STEPSIZE).toDouble
val c = Osint.properties(Osint.SVM_C).toDouble
log.log(Level.INFO, "Initiating SVM training with parameters: C="+c+", step="+step)
model = SVMWithSGD.train(rdd, numIterations = numIter, stepSize = step, regParam = c)
log.log(Level.INFO, "Model training finished")
// Clear the default threshold.
model.clearThreshold()
}
override def classify(data : DataFrame): DataFrame = {
log.log(Level.INFO, "Converting DataFrame to RDD")
val rdd = dataFrameToRDD(data)
log.log(Level.INFO, "Conversion finished; beginning classification")
// Compute raw scores on the test set.
val predictions = rdd.map(point => model.predict(point.features))
log.log(Level.INFO, "Classification finished; Transforming RDD to DataFrame")
val sqlContext : SQLContext = Osint.spark.sqlContext
val tupleRDD = data.rdd.zip(predictions).map(t => Row.fromSeq(t._1.toSeq ++ Seq(t._2)))
sqlContext.createDataFrame(tupleRDD, data.schema.add("predictions", "Double"))
//TODO this should work it doesn't since this "withColumn" method seems to be applicable only to add
// new columns using information from the same dataframe; therefore I am using the horrible rdd conversion
//val sqlContext : SQLContext = Osint.spark.sqlContext
//import sqlContext.implicits._
//val plDF = predictions.toDF("predictions")
//data.withColumn("prediction", plDF.col("predictions"))
}
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
}

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and manipulating it with the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream to an array which I can input into my analysis code? Currently when I try to convert to Array using val data = lines.flatMap(_.split(",")), it clearly says that:error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs you can get access to the RDDs using foreachRDD function.
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
then each RDD can be converted to array using collect function.
this has already been shown here
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval of
course, you could collect in an array
val arr = new ArrayBuffer[String]();
data.foreachRDD {
arr ++= _.collect()
}
Also keep in mind you could end up having way more data than you want in your driver since a DStream can be huge.
To limit the data for your analysis , I would do this way
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrives continuously, is not trivial for algorithmical reasons — while the answers mentioning collect are technically correct, they're not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.

run multiclass classification using spark ml pipeline

I just started using spark ML pipeline to implement a multiclass classifier using LogisticRegressionWithLBFGS (which accepts as a parameters number of classes)
I followed this example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}
case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)
val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
LabeledDocument(0L, "a b c d e spark", 1.0),
LabeledDocument(1L, "b d", 0.0),
LabeledDocument(2L, "spark f g h", 1.0),
LabeledDocument(3L, "hadoop mapreduce", 0.0)))
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)
// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
Document(4L, "spark i j k"),
Document(5L, "l m n"),
Document(6L, "mapreduce spark"),
Document(7L, "apache hadoop")))
// Make predictions on test documents.
model.transform(test.toDF)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println("($id, $text) --> prob=$prob, prediction=$prediction")
}
sc.stop()
The problem is that the LogisticRegression class used by ML use by default 2 classes (line 176) : override val numClasses: Int = 2
Any idea how to solve this problem?
Thanks
As Odomontois already mentioned, if you'd like to use basic NLP pipelines using Spark ML Pipelines you have only 2 options:
One vs. Rest and pass existing LogisticRegression, i.e. new OneVsRest().setClassifier(logisticRegression)
Use bag of words (CountVectorizer in terms of Spark) and NaiveBayes classifier that supports multiclass classification
But your test samples only have 2 classes.. Why would it do otherwise in "auto" mode? You can force to have a multinomial classifer though:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression
val family: Param[String]
Param for the name of family which is a description of the label distribution to be used in the model. Supported options:
"auto": Automatically select the family based on the number of classes: If numClasses == 1 || numClasses == 2, set to "binomial". Else, set to "multinomial"
"binomial": Binary logistic regression with pivoting.
"multinomial": Multinomial logistic (softmax) regression without pivoting. Default is "auto".