Integrate Spark ML Model in Scala App without embedded Spark Cluster

I have trained a Spark Multilayer Perceptron Classifier to detect spam messages and would like to use it in a webservice in combination with the Play Framework.
My solution (see below) spawns an embedded local Spark cluster, loads the model and classifies messages. Is there a way to use the model without an embedded Spark cluster?
Spark has some dependencies that clash with the Play Framework dependencies. I thought there might be a way to run the model in classification mode without starting an embedded Spark cluster.
My second question is whether I can classify a single message without putting it in a DataFrame first.
Application Loader:
import scala.concurrent.Future
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.tuning.CrossValidatorModel

lazy val sparkSession: SparkSession = {
  val conf: SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Classifier")
    .set("spark.ui.enabled", "false")
  val session = SparkSession.builder()
    .config(conf)
    .getOrCreate()
  // an implicit ExecutionContext must be in scope for the Future below
  applicationLifecycle.addStopHook { () ⇒
    Future { session.stop() }
  }
  session
}

lazy val model: PipelineModel = {
  sparkSession // touch the lazy val so the session exists before loading the model
  CrossValidatorModel.load("mpc-model").bestModel.asInstanceOf[PipelineModel]
}
Classification service (model and spark session are injected):
import sparkSession.implicits._ // needed for .toDS() on the Seq below

val messageDto = Seq(MessageSparkDto(
  sender        = message.sender.email.value,
  text          = featureTransformer.cleanText(text).value,
  messagelength = text.value.length,
  isMultimail   = featureTransformer.isMultimail(message.sender.email)
))
val messageDf = messageDto.toDS()

model.transform(messageDf).head().getAs[Double]("prediction") match {
  case 1.0 ⇒ MessageEvaluationResult(MessageClass.Spam)
  case _   ⇒ MessageEvaluationResult(MessageClass.NonSpam)
}
Edit: As pointed out in the comments, one solution could be to transform the model to PMML and then use another engine to load the model and use it for classification. This sounds to me like a lot of overhead as well. Does anyone have experience with running Spark in local mode with minimal overhead and dependencies in order to use the ML classifiers?
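For reference, a minimal sketch of a trimmed-down local session for single-message scoring. This does not solve the dependency clash itself; the configuration keys are standard Spark settings, but the chosen values are assumptions:

import org.apache.spark.sql.SparkSession

// A local session kept as small as possible for scoring single rows.
val scoringSession = SparkSession.builder()
  .master("local[1]")                          // one local thread is enough for one message at a time
  .appName("Classifier")
  .config("spark.ui.enabled", "false")         // as in the application loader above
  .config("spark.sql.shuffle.partitions", "1") // avoid the default 200 shuffle partitions for tiny DataFrames
  .getOrCreate()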

Although I like the solution proposed in the linked post, the following might also be possible. You could of course copy the model to the server onto which you will deploy the webservice, install a single-machine Spark "cluster" on it, and put spark-jobserver on top of it to handle the requests and access Spark. That would be the no-brainer solution and should work if your model does not need a lot of computational power.

Related

Registering Classes with Kryo via SparkSession in Spark 2+

I'm migrating from Spark 1.6 to 2.3.
I need to register custom classes with Kryo. This is what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
The problem is... everywhere else in Spark 2+ instructions, it indicates that SparkSession is the way to go for everything... and if you need SparkContext it should be through spark.sparkContext and not as a stand-alone val.
So now I use the following (and have wiped any trace of conf, sc, etc. from my code)...
val spark = SparkSession.builder.appName("myApp").getOrCreate()
My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?
I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization
I have a pretty extensive conf.json to set spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes there.
When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession
It makes me think I can do something like the following to augment my spark-defaults.conf:
val spark =
  SparkSession
    .builder
    .appName("myApp")
    .config("spark.kryo.classesToRegister", "???")
    .getOrCreate()
But what is ??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this usage.
Would it be:
.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")
or
.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")
or something else?
EDIT
When I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3"), I never get any error messages no matter what I put in the string any,any2,any3.
I tried making any each of the following formats:
"org.myorg.myapp.myclass"
"myclass"
"class org.myorg.myapp.myclass"
I can't tell if any of these successfully registered anything.
Have you tried the following? It should work, since it is actually part of the SparkConf API, and I think the only thing missing is that you just need to plug it into the SparkSession:
private lazy val sparkConf = new SparkConf()
  .setAppName("spark_basic_rdd")
  .setMaster("local[*]")
  .registerKryoClasses(...) // e.g. Array(classOf[MyClass1], classOf[MyClass2], classOf[MyClass3])

private lazy val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
And if you need a SparkContext you can call:
private lazy val sparkContext: SparkContext = sparkSession.sparkContext
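Not part of the original answer, but if you prefer the pure-config route from the question: per the Spark configuration docs, spark.kryo.classesToRegister takes a comma-separated list of fully qualified class names, so the builder call would look roughly like the sketch below, with the package name taken from the question:

import org.apache.spark.sql.SparkSession

// Class names are fully qualified; org.myorg.myapp is the package assumed from the question.
val spark = SparkSession.builder
  .appName("myApp")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo must be the active serializer
  .config("spark.kryo.classesToRegister",
    "org.myorg.myapp.MyClass1,org.myorg.myapp.MyClass2,org.myorg.myapp.MyClass3")
  .getOrCreate()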

Non working Spark example in Scala, LogisticRegressionTrainingSummary

I tried to implement this example for multinomial logistic regression, but it doesn't recognize features that are being used. Probably some version mismatch. This part of code:
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
println(s"label $label: $rate")
}
None of the members of LogisticRegressionTrainingSummary are being recognized, particularly falsePositiveRateByLabel in the given example, as well as other members used later in the code: truePositiveRateByLabel, precisionByLabel, ...
When I go to the implementation I can't find any similar members that I could use instead. I use mllib 2.11. What am I missing?
You are correct, this is a versioning issue. The GitHub code example you have given is for the current master branch of Spark, where there have been some major changes in this part of the API.
What you have been following is what the code in Spark 2.3 will look like. However, at this time, this version is not yet stable and available for download. This is what the 2.2 branch of the same code example looks like:
val training = spark
  .read
  .format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")

spark.stop()
In other words, the methods you are trying to use are not yet implemented in your Spark version.
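If you need those per-label metrics on Spark 2.2 in the meantime, one hedged workaround is to compute them with the RDD-based MulticlassMetrics on the fitted model's predictions (lrModel and training are the names from the snippet above; prediction and label are the default ml column names):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Build (prediction, label) pairs from the fitted model's output.
val predictionAndLabels = lrModel.transform(training)
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach { label =>
  println(s"label $label: ${metrics.falsePositiveRate(label)}")
}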

Has the limitation of a single SparkContext actually been lifted in Spark 2.0?

There has been plenty of chatter about Spark 2.0 supporting multiple SparkContexts. A configuration variable to support it has been around for much longer, but has not actually been effective.
In $SPARK_HOME/conf/spark-defaults.conf :
spark.driver.allowMultipleContexts true
Let's verify that the property was recognized:
scala> println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
Here is a small proof-of-concept program for it:
import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*, conf*/)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start()
  ssc.awaitTermination()
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}
dirs.foreach { d =>
  createAndStartFileStream(d)
}
However, attempts to use that capability are not succeeding:
16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error,
set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
Anyone have any insight on how to use the multiple contexts?
Per @LostInOverflow, this feature will not be fixed. Here is the info from that JIRA:
SPARK-2243 Support multiple SparkContexts in the same JVM
https://issues.apache.org/jira/browse/SPARK-2243
Sean Owen added a comment - 16/Jan/16 17:35 You say you're concerned
with over-utilizing a cluster for steps that don't require much
resource. This is what dynamic allocation is for: the number of
executors increases and decreases with load. If one context is already
using all cluster resources, yes, that doesn't do anything. But then,
neither does a second context; the cluster is already fully used. I
don't know what overhead you're referring to, but certainly one
context running N jobs is busier than N contexts running N jobs. Its
overhead is higher, but the total overhead is lower. This is more an
effect than a cause that would make you choose one architecture over
another. Generally, Spark has always assumed one context per JVM and I
don't see that changing, which is why I finally closed this. I don't
see any support for making this happen.
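For completeness (not part of the quoted comment), a hedged sketch of the single-context approach that comment points to: one SparkContext/StreamingContext serving several input streams instead of one context per directory, with the directories taken from the question above.

import org.apache.spark._
import org.apache.spark.streaming._

// One SparkContext and one StreamingContext shared by all directories.
val sc  = new SparkContext("local[*]", "Spark-multi-dir")
val ssc = new StreamingContext(sc, Seconds(4))

Seq("data10m", "data50m", "dataSmall")
  .map(d => s"/shared/demo/data/$d")
  .foreach { dir =>
    // Each directory gets its own input stream on the same context.
    ssc.textFileStream(dir).countByValue().print()
  }

ssc.start()
ssc.awaitTermination()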

Spark ML - Save OneVsRestModel

I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it on new data.
Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
I would now like to save my OneVsRestModel, but this does not seem to be supported by the API. I have tried:
ovrModel.save("my-ovr") // Cannot resolve symbol save
ovrModel.models.foreach(_.save("model-" + _.uid)) // Cannot resolve symbol save
Is there a way to save this, so I can load it in a new application for making new predictions?
Spark 2.0.0
OneVsRestModel implements MLWritable, so it should be possible to save it directly. The method shown below can still be useful to save the individual models separately.
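A brief sketch of what that could look like on Spark 2.0.0+ (the "my-ovr" path is taken from the question; newData is a hypothetical DataFrame with the same features column used for training):

import org.apache.spark.ml.classification.OneVsRestModel

// Save the whole OneVsRestModel in one go.
ovrModel.write.overwrite().save("my-ovr")

// Later, in another application:
val restored = OneVsRestModel.load("my-ovr")
val predictions = restored.transform(newData) // newData: DataFrame with the training features column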
Spark < 2.0.0
The problem here is that models returns an Array of ClassificationModel[_, _], not an Array of LogisticRegressionModel (or MLWritable). To make it work you'll have to be specific about the types:
import org.apache.spark.ml.classification.LogisticRegressionModel
ovrModel.models.zipWithIndex.foreach {
case (model: LogisticRegressionModel, i: Int) =>
model.save(s"model-${model.uid}-$i")
}
or to be more generic:
import org.apache.spark.ml.util.MLWritable
ovrModel.models.zipWithIndex.foreach {
case (model: MLWritable, i: Int) =>
model.save(s"model-${model.uid}-$i")
}
Unfortunately, as of now (Spark 1.6), OneVsRestModel doesn't implement MLWritable, so it cannot be saved as a whole.
Note:
All models in the OneVsRest seem to use the same uid, hence we need an explicit index. It will also be useful to identify the model later.
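To load the individually saved models back in another application, a hedged sketch; the paths are placeholders and must match whatever model-${uid}-$i paths the loop above actually produced (e.g. recorded somewhere when saving):

import org.apache.spark.ml.classification.LogisticRegressionModel

// Hypothetical list of the paths written by the save loop above.
val savedPaths = Seq("model-logreg_0-0", "model-logreg_0-1", "model-logreg_0-2")
val restoredModels: Seq[LogisticRegressionModel] =
  savedPaths.map(LogisticRegressionModel.load)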

Run a read-only test in Spark

I want to compare the read performance of different storage systems using Spark, e.g. HDFS and S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure that is possible at all. In order to run a transformation, a subsequent action is necessary to trigger it.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the wide range of actions available. Several lightweight alternatives exist, such as count or first. These leave almost all of the work to the transformation phase, allowing you to measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
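For example, a minimal read-only variant of the program above (same assumed s3n path), where count forces the full read without writing any output:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadOnlyTest")
    val sc = new SparkContext(conf)
    // count is an action: it reads every record but writes nothing back out.
    val lines = sc.textFile("s3n://test/wordtest").count()
    println(s"read $lines lines")
    sc.stop()
  }
}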
Yes."saveAsTextFile" writes the RDD data to text file using given path.