Use a saved Spark MLlib decision tree binary classification model to predict on new data - Scala

I am using Spark version 2.2.0 and Scala version 2.11.8.
I created and saved a decision tree binary classification model using the following code:
package...
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession
object DecisionTreeClassification {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local[*]")
.appName("Decision Tree")
.getOrCreate()
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/xyz.txt")
// Split the data into training and test sets (20% held out for testing)
val splits = data.randomSplit(Array(0.8, 0.2))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
impurity, maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n ${model.toDebugString}")
// Save and load model
model.save(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
sparkSession.sparkContext.stop()
}
}
Now I want to predict a label (0 or 1) for new data using this saved model. I am new to Spark; can anybody please let me know how to do that?

I found the answer to this question, so I thought I should share it in case someone is looking for the answer to a similar question.
To make predictions for new data, simply add a few lines before stopping the Spark session:
val newData = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/abc.txt")
val newDataPredictions = newData.map { point =>
  val newPrediction = model.predict(point.features)
  (point.label, newPrediction)
}
newDataPredictions.foreach(f => println(s"Predicted label: ${f._2}"))
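If you want to score a single record rather than a whole LIBSVM file, you can also build a feature vector by hand and call predict on it directly. A minimal sketch (the feature values below are made up for illustration; the vector length must match the number of features the model was trained on):
import org.apache.spark.mllib.linalg.Vectors
// sameModel is the DecisionTreeModel loaded earlier with DecisionTreeModel.load
val singlePoint = Vectors.dense(0.0, 1.5, 3.2)          // placeholder feature values
val singlePrediction = sameModel.predict(singlePoint)   // returns 0.0 or 1.0
println(s"Predicted label for the single point: $singlePrediction")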

Related

Getting a "Task not Serializable" error when trying to create a Decision Tree using Spark with Scala on my local machine

I am trying to create a fraud transaction detector using Spark with Scala. My code works fine with normal Spark logic. However, when I try the solution using a decision tree approach, I get a "Task not Serializable" error. From my understanding, I get this error when I try fitting my training data into the pipeline. I tried a few solutions, like extending Serializable and converting my indexed data back to strings, but none of them worked.
Can someone please help me understand what I am doing wrong?
Below is my code:
package com.vinspark.frauddetection
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
object FraudDet_ML extends Serializable {
def main(args:Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val spark=SparkSession
.builder()
.appName("FraudDetection")
.master("local[*]")
.config("spark.sql.warehouse.dir","file:///C:/temp")
.getOrCreate()
import spark.sqlContext.implicits._
var df=spark.read.format("csv").option("header", "true").option("mode",
"DROPMALFORMED").option("inferSchema", "true").load("../PS_20174392719_1491204439457_log.csv")
df= df.withColumn("orgDiff", col("newbalanceOrig") -
col("oldbalanceOrg")).withColumn("destDiff",
col("newbalanceDest") - col("oldbalanceDest"))
df= df.withColumn("label",
when((col("oldbalanceOrg") <=56900 && col("type")=="TRANSFER" && col("newbalanceDest") <= 105)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")<=12)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")>12 && col("amount")>1160000),
1)
.otherwise(0)
)
df.createOrReplaceTempView("transaction")
val indexer = new StringIndexer().setInputCol("type").setOutputCol("typeIndexed")
println("indexer created")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel").fit(df)
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
=df.randomSplit(Array(0.8, 0.2), 12345L)
val trainDataFrame = splitDataSet(0)
val testDataFrame = splitDataSet(1)
val train = splitDataSet(0)
val test = splitDataSet(1)
println("train: "+train.count()+" test: "+test.count())
val va = new VectorAssembler().setInputCols(Array("typeIndexed", "amount", "oldbalanceOrg",
"newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff",
"destDiff")).setOutputCol("features")
println("vector assembler created")
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features").setSeed(54321).setMaxDepth(5)
println("decision tree created")
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(indexer,labelIndexer, va, dt))
println("pipeline created")
val pipeConst=pipeline
val model = pipeConst.fit(train)
val model1 = model
println("model created")
val prediction=model1.transform(test)
println("indexer created")
prediction.collect().foreach(println)
}
}

Type TimestampType of column not supported

I use Spark with Scala and have a problem with the type TimestampType.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}
object regressionLinear {
case class X(
time:String,nodeID: Int, posX: Double,posY: Double,
speed: Double,period: Int)
def main (args: Array[String]) {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
/**
* Read the input data
*/
var dataset = "C:\\spark\\A6-d07-h08.csv"
if (args.length > 0) {
dataset = args(0)
}
val spark = SparkSession
.builder
.appName("regressionsol")
.master("local[4]")
.getOrCreate()
import spark.implicits._
val data = spark.sparkContext.textFile(dataset)
.map(line=>line.split(","))
.map(userRecord => (userRecord(0).trim.toString,
userRecord(1).trim.toInt, userRecord(2).trim.toDouble,userRecord(3).trim.toDouble,userRecord(4).trim.toDouble,userRecord(5).trim.toInt))
.toDF("time","nodeID","posX", "posY","speed","period").withColumn("time", $"time".cast("timestamp"))
val assembler = new VectorAssembler()
.setInputCols( Array(
"time","nodeID","posX", "posY","speed","period"))
.setOutputCol("features")
val lr = new LinearRegression()
.setLabelCol("period")
.setFeaturesCol("features")
.setRegParam(0.1)
.setMaxIter(100)
.setSolver("l-bfgs")
val steps =
Array(assembler, lr)
val pipeline = new Pipeline()
.setStages(steps)
val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)
val model = pipeline.fit {
training
}
val holdout = model.transform(test)
holdout.show(20)
val prediction = holdout.select("prediction", "period","nodeID").orderBy(abs(col("prediction")-col("period")))
prediction.show(20)
val rm = new RegressionMetrics(prediction.rdd.map{
x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
})
println(s"RMSE = ${rm.rootMeanSquaredError}")
println(s"R-squared = ${rm.r2}")
spark.stop()
}
}
This is the error:
Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at regressionLinear$.main(regressionLinear.scala:100)
at regressionLinear.main(regressionLinear.scala)
VectorAssembler accepts only numeric columns. Other types of columns have to be encoded first, and considering that you apply LinearRegression, the data has to be encoded anyway.
The exact steps will depend on domain-specific knowledge:
If you expect a linear trend based on time, cast the field to numeric first.
If you expect some type of seasonal effect, you might have to extract individual components (day of week, hour / time of day, month and so on) and typically apply StringIndexer + OneHotEncoder; see the sketch below.
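For example, assuming the rest of the question's code stays the same, the two options might look like the sketch below (hour and month are used purely as illustrations of seasonal components; the column names follow the question's DataFrame):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, hour, month}
// Option 1 (linear trend): cast the timestamp to seconds since the epoch
val withNumericTime = data.withColumn("timeNumeric", col("time").cast("double"))
// Option 2 (seasonal effects): extract components, then index / one-hot encode them
val withComponents = data
  .withColumn("hourOfDay", hour(col("time")))
  .withColumn("monthOfYear", month(col("time")))
// The assembler then uses the numeric columns instead of the raw "time" column
// (and the label column "period" should not be listed among the features)
val assembler = new VectorAssembler()
  .setInputCols(Array("timeNumeric", "nodeID", "posX", "posY", "speed"))
  .setOutputCol("features")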

How to predict kmeans cluster with Spark org.apache.spark.ml.clustering.{KMeans, KMeansModel}

I have a problem with the two different MLlib implementations (org.apache.spark.ml and org.apache.spark.mllib) and KMeans. I am using the new implementation from org.apache.spark.ml, which uses DataFrames, but I'm struggling with the documentation and with how to predict a cluster index.
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
/**
* An example showcasing the use of kMeans
*/
object ExploreKMeans {
// Spark configuration.
// Retrieve sparkContext with spark.sparkContext.
private val spark = SparkSession.builder()
.appName("com.example.ml.exploration.kMeans")
.master("local[*]")
.getOrCreate()
// This import, after the definition of a valid SQLContext defines implicits for converting RDDs to Dataframes over .toDF().
import spark.implicits._
def main(args: Array[String]): Unit = {
val data = spark.sparkContext.parallelize(Array((5.0, 2.0,1.5), (2.0, 2.5,2.3), (1.0, 2.1,4.2), (2.0, 5.5, 8.5)))
val df = data.toDF().map { row =>
val label = row(0).asInstanceOf[Double]
val value1 = row(1).asInstanceOf[Double]
val value2 = row(2).asInstanceOf[Double]
LabeledPoint(label, Vectors.dense(value1,value2))
}
val kmeans = new KMeans().setK(3).setSeed(1L)
val model: KMeansModel = kmeans.fit(df)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(df)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
//TODO How to predict cluster index?
//model.predict(???
}
}
How do I use the model to predict the cluster index of new values? The model.predict function is not visible. This API is really confusing...
Well, an easier way to do this is:
model.summary.predictions.show
Ok, I got it. Predictions are now done with the transform method:
println("Transform ")
val transformed = model.transform(df)
transformed.collect().foreach(println)
Cluster Centers:
[2.25,1.9]
[5.5,8.5]
[2.1,4.2]
Transform:
[5.0,[2.0,1.5],0]
[2.0,[2.5,2.3],0]
[1.0,[2.1,4.2],2]
[2.0,[5.5,8.5],1]
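To get a cluster index for brand-new points rather than the training data, the same transform call works on any DataFrame that has a features column. A minimal sketch with made-up two-dimensional points (matching the dimensionality of the training vectors above):
import org.apache.spark.ml.linalg.Vectors
// Any DataFrame with a "features" column of ml vectors can be scored
val newPoints = Seq(
  Vectors.dense(2.0, 2.0),
  Vectors.dense(5.0, 8.0)
).map(Tuple1.apply).toDF("features")
// transform appends a "prediction" column holding the cluster index
model.transform(newPoints).select("features", "prediction").show()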

How to extract variable weights from a Spark pipeline logistic model?

I am currently trying to learn Spark Pipelines (Spark 1.6.0). I imported datasets (train and test) as oas.sql.DataFrame objects. After executing the following code, the produced model is an oas.ml.tuning.CrossValidatorModel.
You can use model.transform(test) to predict based on the test data in Spark. However, I would like to compare the weights that the model used to predict with those from R. How do I extract the weights of the predictors and the intercept (if any) of the model? The Scala code is:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")
// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test =conTest.map{ line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit logistic model
val model = cv.fit(train)
// If you want to predict with test
val pred = model.transform(test)
My Spark environment is not accessible, so this code was retyped and rechecked; I hope it is correct. So far, I have tried searching the web and asking others. Suggestions and criticism of my code are welcome.
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// you can print lr model coefficients as below
val model = cv.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")
Two steps:
Get the best pipeline from the cross-validation result.
Get the LR model from the best pipeline. It's the first stage in your code example. (The sketch below also shows how to read the intercept.)
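If you also need the intercept that the question asks about, it is available on the same LogisticRegressionModel once it has been pulled out of the pipeline. A short sketch reusing the lrModel value from the snippet above:
// lrModel is the LogisticRegressionModel extracted from the best pipeline above
println(s"Intercept: ${lrModel.intercept}")
// Coefficients are positional: index i corresponds to position i of the feature vector
lrModel.coefficients.toArray.zipWithIndex.foreach { case (w, i) =>
  println(s"feature $i -> $w")
}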
I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
println(lrmodel.weights, lrmodel.intercept)
I am still not sure how to extract weights from "model" above. But by restructuring the process to follow the official tutorial, the following works on Spark 1.6.0:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weights
lrmodel.intercept
I noticed the difference between "lrmodel" here and "model" generated above:
model.bestModel --> gives oas.ml.Model[_] = pipeline_****
restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****
That's why we can cast restructuredModel.bestModel to LogisticRegressionModel, but not model.bestModel. I'll add more when I understand the reason for the difference.

Can I convert an incoming stream of data into an array?

I'm trying to learn about streaming data and how to manipulate it, using the telecom churn dataset provided here. I've written the following code to train a model and make predictions in batch mode:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the Spark website and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream into an array which I can feed into my analysis code? Currently, when I try to convert the result of val data = lines.flatMap(_.split(",")) to an array, I get the error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs; you can get access to them using the foreachRDD function.
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
Each RDD can then be converted to an array using the collect function.
This has already been shown here:
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval; of course, you could collect it into an array:
import scala.collection.mutable.ArrayBuffer
val arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}
Also keep in mind that you could end up with far more data than you want in your driver, since a DStream can be huge.
To limit the data for your analysis, I would do it this way:
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrive continuously, is not trivial for algorithmic reasons: while the answers mentioning collect are technically correct, they are not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.
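If, instead of retraining on the stream, you only want to score incoming records with the model already trained in the batch code, one option is to apply it inside foreachRDD. A rough sketch, assuming the trained RandomForest model (model), the mllib Vectors import, and the same CSV layout are available in the streaming application:
// Parse each micro-batch like the batch code and score it with the trained model
lines.foreachRDD { rdd =>
  val features = rdd.map { line =>
    val parts = line.split(",").map(_.trim)
    val stringvec = Array(parts(1)) ++ parts.slice(4, 20)
    Vectors.dense(stringvec.map(_.toDouble))
  }
  // model.predict on an RDD[Vector] returns an RDD[Double] of predicted labels
  val predictions = model.predict(features)
  // collect is acceptable here only because each micro-batch is assumed to be small
  predictions.collect().foreach(p => println(s"Predicted label: $p"))
}
ssc.start()
ssc.awaitTermination()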