I am using Spark with Scala and want to try different preprocessing on my data. Is there a way for CrossValidator to take multiple models (each with its own ParamMaps) and pick the best one?
For example, what I want to do is:
val discretizer = new QuantileDiscretizer()
val normalizer = new Normalizer()
val lr1 = new LinearRegression()
val lr2 = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(discretizer, normalizer, lr1, lr2))
Now I want my CrossValidator to pick the best of the two models from lr1 and lr2. This is just a small example, I want to extend it to multiple such possibilities with ParamMaps too.
You should be able to evaluate these models against each other using a custom estimator like in How to use CrossValidator to choose between different models.
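A simpler alternative to a custom estimator, sketched here under assumptions, is to run one CrossValidator per candidate pipeline and keep the candidate with the best average metric. This is not the approach from the linked answer. The helper name selectBest, the toy data, the column names and the regParam grids are all made up for illustration, and the setup assumes Spark 2.x or later (SparkSession).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.{Evaluator, RegressionEvaluator}
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.{Dataset, SparkSession}

object ModelSelectionSketch {
  // Hypothetical helper: cross-validates every (pipeline, grid) candidate and
  // returns the winning CrossValidatorModel with its best average metric.
  def selectBest(candidates: Seq[(Pipeline, Array[ParamMap])],
                 evaluator: Evaluator,
                 data: Dataset[_]): (CrossValidatorModel, Double) = {
    val larger = evaluator.isLargerBetter
    val scored = candidates.map { case (pipeline, grid) =>
      val cvModel = new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(evaluator)
        .setEstimatorParamMaps(grid)
        .setNumFolds(3)
        .setSeed(42L) // same seed for every candidate, intended to reuse the same folds
        .fit(data)
      val best = if (larger) cvModel.avgMetrics.max else cvModel.avgMetrics.min
      (cvModel, best)
    }
    if (larger) scored.maxBy(_._2) else scored.minBy(_._2)
  }

  def run(): Double = {
    val spark = SparkSession.builder().master("local[1]").appName("cv-sketch").getOrCreate()
    import spark.implicits._
    try {
      // Toy data (assumption): the label is roughly linear in the first feature.
      val data = (1 to 60).map { i =>
        (2.0 * i + (i % 3), Vectors.dense(i.toDouble, (i % 5).toDouble))
      }.toDF("label", "features")

      // Candidate 1: normalize the features, then regress.
      val normalizer = new Normalizer().setInputCol("features").setOutputCol("normFeatures")
      val lr1 = new LinearRegression().setFeaturesCol("normFeatures")
      val p1 = new Pipeline().setStages(Array(normalizer, lr1))
      val grid1 = new ParamGridBuilder().addGrid(lr1.regParam, Array(0.0, 0.1)).build()

      // Candidate 2: regress on the raw features.
      val lr2 = new LinearRegression()
      val p2 = new Pipeline().setStages(Array(lr2))
      val grid2 = new ParamGridBuilder().addGrid(lr2.regParam, Array(0.0, 0.1)).build()

      // RegressionEvaluator defaults to RMSE, so smaller is better.
      val (_, bestRmse) = selectBest(Seq(p1 -> grid1, p2 -> grid2), new RegressionEvaluator(), data)
      bestRmse
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(s"best average RMSE: ${run()}")
}
```

The fitted pipeline of the winner is available as bestModel on the returned CrossValidatorModel. Each candidate keeps its own ParamMaps, so the idea extends to any number of preprocessing variants.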
I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
"provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
.map(cname => new StringIndexer()
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you would train the model with one part of the data and then test it with another part. Hence, there are two different methods to use to reflect this. You have only used fit() which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a dataframe; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the data used for training and returns a dataframe. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct dataframe is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally. However, if you still get an error, you could try calling cache() after replacing the nulls and then performing an action, to make sure all transformations have been applied.
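Put together, the fit/transform steps could look roughly like the sketch below. The helper name indexColumns, the "_idx" output suffix and the toy rows are assumptions made for illustration, and the session setup assumes Spark 2.x or later.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object StringIndexerSketch {
  // Hypothetical helper: fills nulls, fits one StringIndexer per column and
  // returns the transformed DataFrame (not the PipelineModel).
  def indexColumns(df: DataFrame, cols: Seq[String]): DataFrame = {
    val filled = df.na.fill("Missing", cols).cache()
    filled.count() // action that materializes the cached, null-free data

    val stages: Array[PipelineStage] = cols.map { c =>
      // The "_idx" suffix is an arbitrary naming choice for this sketch.
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"): PipelineStage
    }.toArray

    val model = new Pipeline().setStages(stages).fit(filled) // a PipelineModel
    model.transform(filled)                                  // a DataFrame
  }

  def run(): Array[String] = {
    val spark = SparkSession.builder().master("local[1]").appName("indexer-sketch").getOrCreate()
    import spark.implicits._
    try {
      val df = Seq(("Mon", "north"), ("Tue", null), (null, "south"))
        .toDF("day_of_week", "market")
      indexColumns(df, Seq("day_of_week", "market")).columns
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Note that the same null-free dataframe is passed to both fit() and transform().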
Hope it helps!
I have trained a Spark Multilayer Perceptron Classifier to detect spam messages and would like to use it in a webservice in combination with the Play Framework.
My solution (see below) spawns an embedded local Spark cluster, loads the model and classifies messages. Is there a way to use the model without an embedded Spark cluster?
Spark has some dependencies that clash with the Play Framework dependencies. I thought there might be a way to run the model in classification mode without starting an embedded Spark cluster.
My second question is whether I can classify a single message without putting it in a DataFrame first.
Application Loader:
lazy val sparkSession: SparkSession = {
val conf: SparkConf = new SparkConf()
.set("spark.ui.enabled", "false")
val session = SparkSession.builder()
applicationLifecycle.addStopHook { () ⇒
Future { session.stop() }
lazy val model: PipelineModel = {
Classification service (model and spark session are injected):
val messageDto = Seq(MessageSparkDto(
sender = message.sender.email.value,
text = featureTransformer.cleanText(text).value,
messagelength = text.value.length,
isMultimail = featureTransformer.isMultimail(message.sender.email),
val messageDf = messageDto.toDS()
model.transform(messageDf).head().getAs[Double]("prediction") match {
case 1.0 ⇒ MessageEvaluationResult(MessageClass.Spam)
case _ ⇒ MessageEvaluationResult(MessageClass.NonSpam)
Edit: As pointed out in the comments, one solution could be to convert the model to PMML and then use another engine to load it and use it for classification. This sounds to me like a lot of overhead as well. Does anyone have experience with running Spark in local mode with minimal overhead and dependencies to use the ML classifiers?
Although I like the solution proposed in the linked post, the following might also be possible. You could copy the model to the server on which you will deploy the web service, install a Spark "cluster" consisting of that one machine, and put Spark Job Server on top of it to handle the requests and access Spark. That would be the no-brainer solution and should work if your model does not need a lot of computational power.
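If the embedded local-mode route from the question is kept instead, a stripped-down version might look like the sketch below. It is only an illustration: the toy training data, the logistic regression stand-in for the perceptron, the temporary model path and the object name are all assumptions, and the single message is still wrapped in a one-row DataFrame.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object LocalModelSketch {
  def run(): Double = {
    // Single-threaded local session with the web UI switched off.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("local-model-sketch")
      .config("spark.ui.enabled", "false")
      .getOrCreate()
    import spark.implicits._
    try {
      // Stand-in for a model trained elsewhere (toy data, assumption).
      val training = Seq(
        ("win money now", 1.0),
        ("cheap money offer", 1.0),
        ("meeting at noon", 0.0),
        ("lunch tomorrow", 0.0)
      ).toDF("text", "label")
      val pipeline = new Pipeline().setStages(Array(
        new Tokenizer().setInputCol("text").setOutputCol("words"),
        new HashingTF().setInputCol("words").setOutputCol("features"),
        new LogisticRegression().setMaxIter(10)
      ))
      val path = Files.createTempDirectory("model").resolve("pipeline").toString
      pipeline.fit(training).write.overwrite().save(path)

      // What the web service would do: load once, then classify per request.
      val model = PipelineModel.load(path)
      val message = Seq("win money").toDF("text") // one-row DataFrame
      model.transform(message).head().getAs[Double]("prediction")
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In a real service the session and the loaded model would be created once at startup and reused across requests, as in the application loader from the question.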
I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it on new data.
Currently, the ML implementation of LogisticRegression only supports binary classification. Instead, I am using OneVsRest like so:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = ovr.fit(training)
I would now like to save my OneVsRestModel, but this does not seem to be supported by the API. I have tried:
ovrModel.save("my-ovr") // Cannot resolve symbol save
ovrModel.models.foreach(_.save("model-" + _.uid)) // Cannot resolve symbol save
Is there a way to save this, so I can load it in a new application for making new predictions?
Spark 2.0.0
OneVsRestModel implements MLWritable, so it should be possible to save it directly. The method shown below can still be useful to save the individual models separately.
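A minimal sketch of the direct save and reload on Spark 2.0 or later, assuming toy three-class data and a temporary directory in place of a real model path (both made up for illustration):

```scala
import java.nio.file.Files
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object OvrSaveSketch {
  def run(): Int = {
    val spark = SparkSession.builder().master("local[1]").appName("ovr-save").getOrCreate()
    import spark.implicits._
    try {
      // Toy three-class training set (assumption).
      val training = Seq(
        (0.0, Vectors.dense(0.0, 0.0)), (0.0, Vectors.dense(0.1, 0.2)),
        (1.0, Vectors.dense(5.0, 5.0)), (1.0, Vectors.dense(5.1, 4.9)),
        (2.0, Vectors.dense(10.0, 0.0)), (2.0, Vectors.dense(9.8, 0.3))
      ).toDF("label", "features")

      val lr = new LogisticRegression().setFitIntercept(true)
      val ovrModel = new OneVsRest().setClassifier(lr).fit(training)

      // Save the whole OneVsRestModel, then load it as a new application would.
      val path = Files.createTempDirectory("ovr").resolve("my-ovr").toString
      ovrModel.write.overwrite().save(path)
      val reloaded = OneVsRestModel.load(path)
      reloaded.models.length // one binary model per class
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```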
Spark < 2.0.0
The problem here is that models returns an Array[ClassificationModel[_, _]], not an Array of LogisticRegressionModel (or MLWritable). To make it work you'll have to be specific about the types:
import org.apache.spark.ml.classification.LogisticRegressionModel
ovrModel.models.zipWithIndex.foreach {
case (model: LogisticRegressionModel, i: Int) =>
or to be more generic:
import org.apache.spark.ml.util.MLWritable
ovrModel.models.zipWithIndex.foreach {
case (model: MLWritable, i: Int) =>
Unfortunately, as of now (Spark 1.6), OneVsRestModel doesn't implement MLWritable, so it cannot be saved on its own.
All models in the OneVsRestModel seem to use the same uid, hence we need an explicit index. The index is also useful for identifying each model later.
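The per-model approach could be sketched as follows. The helper name saveParts, the "model-<index>" directory naming and the toy data are assumptions for illustration. The setup uses SparkSession for brevity, so on Spark 1.6 the session and vector imports would differ.

```scala
import java.nio.file.Files
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest, OneVsRestModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.util.MLWritable
import org.apache.spark.sql.SparkSession

object OvrPartsSketch {
  // Hypothetical helper: saves every writable sub-model under dir/model-<index>
  // and returns the paths that were written.
  def saveParts(ovrModel: OneVsRestModel, dir: String): Seq[String] =
    ovrModel.models.zipWithIndex.collect {
      case (model: MLWritable, i) =>
        val path = s"$dir/model-$i" // explicit index, since the uids are not distinct
        model.write.overwrite().save(path)
        path
    }.toSeq

  def run(): Int = {
    val spark = SparkSession.builder().master("local[1]").appName("ovr-parts").getOrCreate()
    import spark.implicits._
    try {
      // Toy three-class training set (assumption).
      val training = Seq(
        (0.0, Vectors.dense(0.0, 0.0)), (0.0, Vectors.dense(0.1, 0.2)),
        (1.0, Vectors.dense(5.0, 5.0)), (1.0, Vectors.dense(5.1, 4.9)),
        (2.0, Vectors.dense(10.0, 0.0)), (2.0, Vectors.dense(9.8, 0.3))
      ).toDF("label", "features")
      val ovrModel = new OneVsRest().setClassifier(new LogisticRegression()).fit(training)

      val dir = Files.createTempDirectory("ovr-parts").toString
      val paths = saveParts(ovrModel, dir)

      // Each part can be loaded back as a plain LogisticRegressionModel.
      paths.map(p => LogisticRegressionModel.load(p)).length
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```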
I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a spark.ml equivalent of sklearn's pipeline.named_steps feature?
I found this answer which gives two options. The first works if I take the k-means model out of my pipeline and fit it separately, but that kinda defeats the purpose of a pipeline. The second option doesn't work - I get error: value getModel is not a member of org.apache.spark.ml.PipelineModel.
EDIT: Example pipeline:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.Pipeline
// create example dataframe
val sentenceData = sqlContext.createDataFrame(Seq(
Tuple1("Hi I heard about Spark"),
Tuple1("I wish Java could use case classes"),
Tuple1("K-means models are neat")
)).toDF("sentence")
// initialize pipeline stages
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val kmeans = new KMeans()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
// fit the pipeline
val fitKmeans = pipeline.fit(sentenceData)
So now fitKmeans is of type org.apache.spark.ml.PipelineModel. My question is, how do I access the cluster centers calculated by the k-means model contained within this pipeline? As noted above, when not contained in a pipeline, this can be done with fitKmeans.clusterCenters.
Answering my own question...I finally stumbled on an example deep in the spark.ml docs that shows how to do this using the stages member of the PipelineModel class. So for the example I posted above, in order to access the k-means cluster centers, do:
val centers = fitKmeans.stages(2).asInstanceOf[KMeansModel].clusterCenters
where fitKmeans is a PipelineModel and 2 is the index of the k-means model in the array of pipeline stages.
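If the position of the k-means stage might change, looking the stage up by type avoids the hard-coded index. This is only a sketch: the helper name kmeansStage, the choice of k = 2 and the fixed seed are assumptions, and the setup uses SparkSession (Spark 2.x or later) rather than the sqlContext from the question.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object KMeansStageSketch {
  // Hypothetical helper: returns the first KMeansModel found among the stages.
  def kmeansStage(model: PipelineModel): Option[KMeansModel] =
    model.stages.collectFirst { case m: KMeansModel => m }

  def run(): Int = {
    val spark = SparkSession.builder().master("local[1]").appName("kmeans-stage").getOrCreate()
    import spark.implicits._
    try {
      val sentenceData = Seq(
        "Hi I heard about Spark",
        "I wish Java could use case classes",
        "K-means models are neat"
      ).toDF("sentence")

      val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
      val kmeans = new KMeans().setK(2).setSeed(1L)
      val fitKmeans = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans)).fit(sentenceData)

      // Dimension of the first cluster center; it matches setNumFeatures above.
      kmeansStage(fitKmeans).map(_.clusterCenters.head.size).getOrElse(0)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```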
Reference: the last line of most of the examples on this page.
I'm looking into creating a pipeline to run logistic regression in spark and I'm running into an issue on whether there is either a way I can extend or bypass the "Tokenizer" object.
Essentially, the problem I'm running into is that the tokenizer is not nearly precise enough for the vectors I'm trying to create (i.e. stemming, lemmatization, bi-grams etc.), but in EVERY example for spark pipelines I see something along the lines of:
val tokenizer = new Tokenizer()
val hashingTF = new HashingTF()
val lr = new LogisticRegression()
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
Must I have a tokenizer stage? Would it be trivial to extend the Tokenizer class to do the string modifications I want? Any help would be highly appreciated!
So I found a pretty decent example of extending the Tokenizer class right here. This should give a pretty good roadmap of what needs to be overridden for proper tokenization.
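As far as I can tell, a Tokenizer stage is not mandatory: any stage that produces an array-of-strings column can feed HashingTF. A minimal sketch of a custom stage built on UnaryTransformer follows. It is not the linked example. The class name BigramTokenizer and its tokenization rule (lowercase, split on whitespace, append bi-grams) are invented for illustration, and the signature assumes a recent Spark version.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

object BigramTokenizer {
  // Plain function holding the tokenization rule, so it can be tried without Spark:
  // lowercase words followed by all adjacent word pairs.
  def tokenize(text: String): Seq[String] = {
    val words = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
    val bigrams = words.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq
    words ++ bigrams
  }
}

// Hypothetical custom stage that can take the place of Tokenizer in a Pipeline.
class BigramTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], BigramTokenizer] {

  def this() = this(Identifiable.randomUID("bigramTok"))

  // The per-row function applied to the input column.
  override protected def createTransformFunc: String => Seq[String] =
    text => BigramTokenizer.tokenize(text)

  // Only string input columns are accepted.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  // The output column is an array of non-null strings.
  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = false)

  override def copy(extra: ParamMap): BigramTokenizer = defaultCopy(extra)
}
```

It would be configured like the built-in class, for example new BigramTokenizer().setInputCol("text").setOutputCol("words"), and placed first in setStages. For bi-grams alone, the built-in NGram transformer in org.apache.spark.ml.feature may already be enough. Stemming or lemmatization would go inside tokenize.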