Spark: Override library method - scala

I would like to modify the Scala code of a Spark library class without having to rebuild the whole of Spark.
Since we can append jar files to the execution of either spark-submit or pySpark, is it possible to compile a modified copy of that class and override Spark's default methods, or at least create new ones? Thanks.

Creating a new class that extends the original one and overrides the respective methods, without modifying the source code, should work.
import org.apache.spark.ml.classification.LogisticRegression
class CustomLogisticRegression extends LogisticRegression {
override def toString(): String = "This is overridden Logistic Regression Class"
}
Running Logistic Regression with the new CustomLogisticRegression class
val data = sqlCtx.createDataFrame(MLUtils.loadLibSVMFile(sc, "/opt/spark/spark-1.5.2-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt"))
val customLR = new CustomLogisticRegression()
val customLRModel = customLR.fit(data)
val originalLR = new LogisticRegression()
val originalLRModel = originalLR.fit(data)
// Print the intercept for logistic regression
println(s"Custom Class's Intercept: ${customLRModel.intercept}")
println(s"Original Class's Intercept: ${originalLRModel.intercept}")
Custom Class's Intercept: 0.22456315961250317
Original Class's Intercept: 0.22456315961250317
This is overridden Logistic Regression Class


Cannot call DecisionTreeClassifier.train()

I am trying to use DecisionTreeClassifier.train() but it fails with the following error:
Error:(218, 41) method train in class DecisionTreeClassifier cannot be accessed in
Access to protected method train not permitted because
enclosing object FeatureSelection in package core is not a subclass of
class DecisionTreeClassifier in package classification where target is defined
val dt = decisionTreeClassifier.train(trainRdd)
It reports that my object FeatureSelection is not a subclass of DecisionTreeClassifier, so it cannot call a protected method of that class. But train() is listed as a public method in the official documentation.
Environment: Scala 2.10.6, Spark 1.6.1 (built for Scala 2.10), JDK 1.8
Here is the code:
object FeatureSelection {
def selectFeatureGreedyDTNoLimit(){
val selectfeature=ArrayBuffer[String]()
val selectsize=selectfeature.size
val tempfeature=selectfeature++ArrayBuffer(line)
val vectorDF = new VectorAssembler()
.select("label", "features")
val Array(trainRdd, testRdd) =
.map(row => LabeledPoint(Common.any2Double(row.get(0)).get, row.getAs[Vector](1)))
.randomSplit(Array(0.5, 0.5), 0L)
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val dt = decisionTreeClassifier.train(trainRdd, categoricalFeaturesInfo, numClasses)
Hoping for someone to help me solve this problem.
You are supposed to use the fit method instead.
train is an internal method, which is why it is protected.
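To illustrate why the compiler rejects the call, here is a minimal plain-Scala sketch of the same pattern. SketchEstimator, MeanEstimator and MeanModel are hypothetical stand-ins made up for this example, not Spark classes; the only assumption is that the real fit is a public method that ends up delegating to the protected train.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
abstract class SketchEstimator[M] {
  // Public entry point: checks the input, then delegates to train.
  def fit(data: Seq[Double]): M = {
    require(data.nonEmpty, "training data must not be empty")
    train(data)
  }

  // Internal hook: callable from subclasses only.
  protected def train(data: Seq[Double]): M
}

final case class MeanModel(mean: Double)

class MeanEstimator extends SketchEstimator[MeanModel] {
  override protected def train(data: Seq[Double]): MeanModel =
    MeanModel(data.sum / data.size)
}

object ProtectedTrainSketch {
  def main(args: Array[String]): Unit = {
    val estimator = new MeanEstimator
    // estimator.train(Seq(1.0, 2.0, 3.0)) // does not compile: train is protected
    val model = estimator.fit(Seq(1.0, 2.0, 3.0))
    println(model) // prints MeanModel(2.0)
  }
}
```

With Spark itself, the equivalent is to call fit(...) on the DecisionTreeClassifier with a DataFrame that has label and features columns, rather than calling train on an RDD.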

Spark MLlib: Should I call .cache before fitting a model?

Imagine that I am training a Spark MLlib model as follows:
val trainingData = loadTrainingData(...)
val logisticRegression = new LogisticRegression()
val logisticRegressionModel = logisticRegression.fit(trainingData)
Does calling trainingData.cache improve performance at training time, or is it not needed?
Does the .fit(...) method for a ML algorithm call cache/unpersist internally?
There is no need to call .cache for Spark's LogisticRegression (and some other models). The train method (called by fit) in LogisticRegression is implemented as follows:
override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE // true if the input is not already cached or persisted
train(dataset, handlePersistence)
}
And later...
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
This is generally even more efficient than a custom call to .cache, because instances in the line above contains only (label, weight, features) and not the rest of the data.
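For estimators that do not manage persistence themselves, the same check can be applied by hand. The following is only a sketch: withPersistence is a hypothetical helper written for this example, and it assumes Spark 2.1 or later, where Dataset.storageLevel is available.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Hypothetical helper: persists df only if the caller has not already done so,
  // runs body, and releases the cache again if this helper created it.
  def withPersistence[T](df: DataFrame)(body: DataFrame => T): T = {
    val handlePersistence = df.storageLevel == StorageLevel.NONE
    if (handlePersistence) df.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      body(df)
    } finally {
      if (handlePersistence) df.unpersist()
    }
  }
}
```

It could be used as PersistenceSketch.withPersistence(trainingData)(df => estimator.fit(df)), where estimator stands for any estimator that does not cache its input internally.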

Spark DataFrame to RDD and back

I am writing an Apache Spark application using Scala. To handle and store data I use DataFrames. I have a nice pipeline with feature extraction and a MultiLayerPerceptron classifier, using the ML API.
I also want to use SVM (for comparison purposes). The thing is (and correct me if I am mistaken) that only MLlib provides SVM, and MLlib does not handle DataFrames, only RDDs.
So I figured I can keep the core of my application on DataFrames and, to use SVM, 1) convert the DataFrame columns I need to an RDD[LabeledPoint] and 2) after the classification, add the SVM's prediction to the DataFrame as a new column.
The first part I handled with a small function:
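private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
// The type argument of getAs was lost in the original; the ML Vector type is assumed here.
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.Vector](1))).map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
rddMl
}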
I have to specify and convert the type of vector since the feature extraction method uses ML API and not MLLib.
Then, this RDD[LabeledPoint] is fed to the SVM and classification goes smoothly, no issues. At the end and following spark's example I get an RDD[Double]:
val predictions = rdd.map(point => model.predict(point.features))
Now, I want to add the prediction score as column to the original DataFrame and return it. This is where I got stuck. I can convert the RDD[Double] to a DataFrame using
(SQL context omitted)
import sqlContext.implicits._
val plDF = predictions.toDF("prediction")
But how do I join the two DataFrames where the second DataFrame becomes a column of the original one? I tried to use methods join and union but got SQL exceptions as the DataFrames have no equal columns to join or unite on.
I tried
data.withColumn("prediction", plDF.col("prediction"))
But I get an AnalysisException :(
I haven't figured out how to do it without resorting to RDDs, but anyway here's how I solved it with RDDs. I added the rest of the code so that anyone can understand the complete logic. Any suggestions are appreciated.
package stuff
import java.util.logging.{Level, Logger}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
// Created by camandros on 10-03-2017.
class LinearSVMClassifier extends Classifier with Serializable{
@transient lazy val log: Logger = Logger.getLogger(getClass.getName)
private var model : SVMModel = _
override def train(data : DataFrame): Unit = {
val rdd = dataFrameToRDD(data)
// Run training algorithm to build the model
val numIter : Int = 100
val step =
val c =
log.log(Level.INFO, "Initiating SVM training with parameters: C="+c+", step="+step)
model = SVMWithSGD.train(rdd, numIterations = numIter, stepSize = step, regParam = c)
log.log(Level.INFO, "Model training finished")
// Clear the default threshold so that predict returns raw scores.
model.clearThreshold()
}
override def classify(data : DataFrame): DataFrame = {
log.log(Level.INFO, "Converting DataFrame to RDD")
val rdd = dataFrameToRDD(data)
log.log(Level.INFO, "Conversion finished; beginning classification")
// Compute raw scores on the test set.
val predictions = rdd.map(point => model.predict(point.features))
log.log(Level.INFO, "Classification finished; Transforming RDD to DataFrame")
val sqlContext : SQLContext = Osint.spark.sqlContext
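// Pair each original row with its prediction and append the prediction as a new field.
val tupleRDD = data.rdd.zip(predictions).map(t => Row.fromSeq(t._1.toSeq ++ Seq(t._2)))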
sqlContext.createDataFrame(tupleRDD, data.schema.add("predictions", "Double"))
//TODO this should work it doesn't since this "withColumn" method seems to be applicable only to add
// new columns using information from the same dataframe; therefore I am using the horrible rdd conversion
//val sqlContext : SQLContext = Osint.spark.sqlContext
//import sqlContext.implicits._
//val plDF = predictions.toDF("predictions")
//data.withColumn("prediction", plDF.col("predictions"))
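}
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
// The type argument of getAs was lost in the original; the ML Vector type is assumed here.
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.Vector](1))).map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
rddMl
}
}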
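As a possible alternative to the RDD round trip in classify, the prediction can be computed per row with a UDF, so that withColumn only uses columns of the same DataFrame. This is a sketch under assumptions, not the original author's method: it assumes Spark 2.x (for Vectors.fromML), a "features" column holding ML vectors, and an already trained SVMModel. SvmUdfSketch and addPredictions are hypothetical names.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object SvmUdfSketch {
  // Adds a "prediction" column by applying the MLlib model to each row's features.
  def addPredictions(data: DataFrame, model: SVMModel): DataFrame = {
    // Convert the ML vector to an MLlib vector, since SVMModel.predict expects the latter.
    val predict = udf { (features: MLVector) =>
      model.predict(OldVectors.fromML(features))
    }
    data.withColumn("prediction", predict(col("features")))
  }
}
```

Because the UDF only reads the features column of the DataFrame it is applied to, no join between two DataFrames is needed.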

Save Spark StandardScaler for later use in Scala

I am still using Spark 1.6 and trained a StandardScaler that I would like to save and reuse on future datasets.
Using the supplied examples I could transform the data successfully but I can't find a way to save the trained normaliser.
Is there any way in which the trained normaliser can be saved?
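Assuming that you have created the scalerModel:
// Save the fitted model.
scalerModel.save("path/folder/")
// Later, for example in another session, load it back.
val scalerModel = StandardScalerModel.load("path/folder/")
The StandardScalerModel class has a save method. After calling the fit method on StandardScaler, the returned object is a StandardScalerModel (see the API docs).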
e.g. similar to the supplied example:
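val dataFrame = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Column names follow the official StandardScaler example.
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")
// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)
// Save the fitted model and load it back with the matching model class.
scalerModel.save("/path/to/the/file")
val sameModel = StandardScalerModel.load("/path/to/the/file")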

How to save RandomForestClassifier Spark model in scala?

I built a random forest model using the following code:
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val training = labelIndexer.transform(df)
val model =
Now I want to save the model so that I can predict later using the following code:
val predictions: DataFrame = model.transform(testData)
I've looked through the Spark documentation and didn't find any option to do that. Any ideas?
It took me a few hours to build the model; if Spark crashes, I won't be able to get it back.
With Spark 1.6 it is possible to save tree-based models to HDFS and reload them using saveAsObjectFile(), for both Pipeline-based and basic models.
Below is an example for a Pipeline-based model.
// model
val model =
// Create rdd using Seq
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://filepath")
// Reload model by using its class
// You can get class of object using object.getClass()
val sameModel = sc.objectFile[PipelineModel]("filepath").first()
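For RandomForestClassifier, saving and loading the model was tested with Spark 1.6.2 and Scala, using the ml package (Spark 2.0 has a direct save option for the model):
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}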
val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
val model =
sc.parallelize(Seq(model), 1).saveAsObjectFile(modelSavePath) //save model
val linRegModel = sc.objectFile[RandomForestClassificationModel](modelSavePath).first() //load model
val predictions1 = linRegModel.transform(testData) // predictions1 is a DataFrame
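The direct save option mentioned for Spark 2.0 could look roughly like the sketch below. It assumes Spark 2.0 or later and a training DataFrame with "label" and "features" columns; RandomForestPersistence, saveAndReload and the path argument are hypothetical names used only for this example.

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.sql.DataFrame

object RandomForestPersistence {
  // Fits a small forest, writes it to path, and reads it back from the same path.
  def saveAndReload(training: DataFrame, path: String): RandomForestClassificationModel = {
    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(20)
      .fit(training)

    // overwrite() replaces anything that already exists at the path.
    model.write.overwrite().save(path)

    RandomForestClassificationModel.load(path)
  }
}
```

The reloaded model can then be used with transform(testData) in the same way as the freshly trained one.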
Saving is part of the MLWriter interface, which is accessed through the write method on your model (for example, model.write.save(path)).
Here is the interface:
abstract class MLWriter extends BaseReadWrite with Logging {
protected var shouldOverwrite: Boolean = false
/** Saves the ML instances to the input path. */
@throws[IOException]("If the input path already exists but overwrite is not enabled.")
def save(path: String): Unit = {
This is a refactoring from earlier versions of mllib/
Update: It appears that the model is not writable:
Exception in thread "main" java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage
which does not implement Writable. Non-Writable stage:
rfc_4e467607406f of type class
So there may not be a straightforward solution for this.
Here is a PySpark v1.6 implementation corresponding to the Scala saveAsObjectFile() answer above.
It coerces the Python objects to and from Java objects to achieve serialisation with saveAsObjectFile().
Without the Java coercion I got weird Py4J errors on serialisation. If anyone has a simpler implementation, please edit or comment.
Save a trained RandomForestClassificationModel object:
# Save RandomForestClassificationModel to hdfs
gateway = sc._gateway
java_list =
modelRdd = sc._jsc.parallelize(java_list)
Load a trained RandomForestClassificationModel object:
# Load RandomForestClassificationModel from hdfs
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)