Spark load model and continue training - scala

I'm using Scala with Spark 2.0 to train a model with LinearRegression.
val lr = new LinearRegression()
.setMaxIter(num_iter)
.setRegParam(reg)
.setStandardization(true)
val model = lr.fit(data)
This works fine and I get good results.
I saved the model and loaded it in another class to make some predictions:
val model = LinearRegressionModel.load("models/LRModel")
val result = model.transform(data).select("prediction")
Now I wanted to continue training the model with new data, so I saved the model and loaded it to continue the training.
Saving:
model.save("models/LRModel")
lr.save("models/LR")
Loading:
val lr = LinearRegression.load("models/LR")
val model = LinearRegressionModel.load("models/LRModel")
The problem is that when I load the model, there is no fit or train method to continue training.
When I load the LinearRegression object, it seems that only the parameters for the algorithm are saved, not the weights.
I tested this by training on the same data for the same number of iterations: the result was exactly the same rootMeanSquaredError, and the model had definitely not converged at that point of training.
I also cannot load the model into the LinearRegression; it results in an error:
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.regression.LinearRegressionModel.<init>(java.lang.String)
So the question is, how do I get the LinearRegression object to use the saved LinearRegressionModel?

You can use a Pipeline to save and load machine learning models.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setLabelCol("labels").setFeaturesCol("features")
  .setMaxIter(10).setRegParam(1.0).setElasticNetParam(1.0)
val pipeline = new Pipeline().setStages(Array(lr))

// fit returns a PipelineModel, which is what actually holds the trained weights
val pipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("hdfs://.../spark/mllib/models/linearRegression")

// load it back and predict (testData: a DataFrame with "features" and "labels" columns)
val sameModel = PipelineModel.load("hdfs://...")
sameModel.transform(testData).select("features", "labels", "prediction").show()
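Note that this still does not resume optimisation from the saved coefficients; as far as I know, the LinearRegression estimator has no public warm-start option. A common workaround, sketched below under the assumption that the earlier training DataFrame is still available as oldData and the new rows as newData, is to reload the saved hyper-parameters and refit on the union of both:

// Sketch: re-train from scratch on old + new rows, reusing the saved estimator parameters.
// oldData and newData are assumed DataFrames with "features" and "label" columns.
val lrEstimator = LinearRegression.load("models/LR")        // parameters only, no weights
val updatedModel = lrEstimator.fit(oldData.union(newData))  // coefficients are re-estimated
updatedModel.write.overwrite().save("models/LRModel")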

Related

spark dataframe format error when using saved model to predict on new data

I am able to train and save a model (Train.scala).
Now I want to use that trained model to predict on new data (Predict.scala).
I create a new VectorAssembler in Predict.scala to featurize the new data. Should I use the same VectorAssembler from Train.scala in Predict.scala? I ask because I am seeing issues with the feature data type after transformation.
For example, when I read in the trained model and try to predict on the new, featurized data, I get this error:
type mismatch;
[error] found : org.apache.spark.sql.DataFrame
[error] (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error] required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] => org.apache.spark.sql.Dataset[?]
[error] val predictions = model.transform(featureData)
Training code:
Train.scala
// assembler (LightGBMClassifier comes from the MMLSpark/SynapseML package; its import depends on the version in use)
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(feature_list)
  .setOutputCol("features")
//read in train data
val trainingData = spark
.read
.parquet(train_data_path)
// generate training features
val trainingFeatures = assembler.transform(trainingData)
//define model
val lightGBMClassifier = new LightGBMClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setIsUnbalance(true)
.setMaxDepth(25)
.setNumLeaves(31)
.setNumIterations(100)
// fit model
val lgbm = lightGBMClassifier.fit(trainingFeatures)
//save model
lgbm
.write
.overwrite()
.save(my_model_s3_path)
Predict code: Predict.scala
val assembler = new VectorAssembler()
.setInputCols(feature_list)
.setOutputCol("features")
// load model
val model = spark.read.parquet(my_model_s3_path)
// load new data
val inputData = spark.read.parquet(new_data_path)
//Assembler to transform new data
val featureData = assembler.transform(inputData)
//predict on new data
val predictions = model.transform(featureData) ### <- got error here
Should I be using a different method to read in my trained model or to transform my data?
"Should I use the same VectorAssembler in the Train.scala for the Predict.scala file?" Yes, however, I would strong recommend to use Pipelines.
// Train.scala
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(assembler, lightGBMClassifier))
val pipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("/path/to/pipelineModel")

// Predict.scala
import org.apache.spark.ml.PipelineModel

val pipelineModel = PipelineModel.load("/path/to/pipelineModel")
val predictions = pipelineModel.transform(inputData)
See if the issue goes away by simply using Pipelines, serializing/deserializing the model correctly, and structuring your code better. Also, make sure that trainingData and inputData both contain the columns listed in feature_list.
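A quick way to catch that kind of schema drift before calling transform is to check the feature columns up front; a minimal sketch, assuming feature_list is the same Array[String] that was given to the assembler:

// Sketch: verify every expected feature column is present in the new data.
val missing = feature_list.filterNot(inputData.columns.contains)
require(missing.isEmpty, s"Input data is missing feature columns: ${missing.mkString(", ")}")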

Spark MLlib: Should I call .cache before fitting a model?

Imagine that I am training a Spark MLlib model as follows:
val trainingData = loadTrainingData(...)
val logisticRegression = new LogisticRegression()
trainingData.cache
val logisticRegressionModel = logisticRegression.fit(trainingData)
Does the call trainingData.cache improve performance at training time, or is it not needed?
Does the .fit(...) method for a ML algorithm call cache/unpersist internally?
There is no need to call .cache for Spark LogisticRegression (and some other models). The train method (called by Predictor.fit(...)) in LogisticRegression is implemented as follows:
override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
  // handlePersistence is true if the dataset is not already cached/persisted
  val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
  train(dataset, handlePersistence)
}
And later...
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
This will generally be even more efficient than a manual .cache call, because the instances dataset in the line above contains only (label, weight, features) and not the rest of the data.
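Caching is still worthwhile when the same DataFrame is scanned more than once at your level of the code, for example to train several models or to evaluate on the training set afterwards. A minimal sketch, assuming trainingData has the usual "features" and "label" columns:

// Sketch: cache pays off because trainingData is used by both fit() and the evaluation pass.
trainingData.cache()
val lrModel = new LogisticRegression().fit(trainingData)
val trainAccuracy = lrModel.transform(trainingData).where("prediction = label").count().toDouble / trainingData.count()
trainingData.unpersist()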

Save Spark StandardScaler for later use in Scala

I am still using Spark 1.6 and have trained a StandardScaler that I would like to save and reuse on future datasets.
Using the supplied examples I could transform the data successfully, but I can't find a way to save the trained normaliser.
Is there any way the trained normaliser can be saved?
Assuming that you have created the scalerModel:
import org.apache.spark.ml.feature.StandardScalerModel
scalerModel.write.save("path/folder/")
val scalerModel = StandardScalerModel.load("path/folder/")
The StandardScalerModel class has a save method. After calling the fit method on a StandardScaler, the returned object is a StandardScalerModel (see the API docs).
e.g. similar to the supplied example:
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

// Save the fitted model, then load it back via its own companion object
// (PipelineModel.load only works for saved PipelineModels).
scalerModel.write.overwrite().save("/path/to/the/file")
val sameModel = StandardScalerModel.load("/path/to/the/file")
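The reloaded scaler can then be applied to any new data that has the same "features" column; a short sketch, where newDataFrame is just an assumed name for that data:

// Sketch: apply the persisted scaler to unseen data.
val scaledNew = sameModel.transform(newDataFrame).select("features", "scaledFeatures")
scaledNew.show()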

How to save RandomForestClassifier Spark model in scala?

I built a random forest model using the following code:
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.IndexToString

val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val training = labelIndexer.transform(df)
val model = rf.fit(training)
Now I want to save the model in order to predict later, using the following code:
val predictions: DataFrame = model.transform(testData)
I've looked into the Spark documentation and didn't find any option to do that. Any idea?
It took me a few hours to build the model; if Spark crashes, I won't be able to get it back.
It's possible to save and reload tree-based models in HDFS with Spark 1.6 using saveAsObjectFile(), for both Pipeline-based and basic models.
Below is an example for a pipeline-based model.
// model
val model = pipeline.fit(trainingData)

// Create an RDD from a Seq containing the model and serialize it with saveAsObjectFile
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://filepath")

// Reload the model by using its class
// (you can get the class of an object using object.getClass())
val sameModel = sc.objectFile[PipelineModel]("filepath").first()
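Once deserialized, the PipelineModel behaves exactly like the original; a short sketch, where testData is an assumed DataFrame with the columns the pipeline expects:

// Sketch: predict with the reloaded PipelineModel.
val predictions = sameModel.transform(testData).select("prediction")
predictions.show()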
To save & load a RandomForestClassifier model (tested with Spark 1.6.2 + Scala, ml package; in Spark 2.0 the model has a direct save option):
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier

val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3)
  .setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
val model = classifier.fit(trainingData)
sc.parallelize(Seq(model), 1).saveAsObjectFile(modelSavePath) // save model
val rfModel = sc.objectFile[RandomForestClassificationModel](modelSavePath).first() // load model
val predictions1 = rfModel.transform(testData) // predictions1 is a DataFrame
It is in the MLWriter interface, which is accessed via the write attribute on your model:
model.asInstanceOf[MLWritable].write.save(path)
Here is the interface:
abstract class MLWriter extends BaseReadWrite with Logging {

  protected var shouldOverwrite: Boolean = false

  /**
   * Saves the ML instances to the input path.
   */
  @Since("1.6.0")
  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
  def save(path: String): Unit = {
    ...
This is a refactoring from earlier versions of mllib/spark.ml
Update: It appears that the Model was not writable:
Exception in thread "main" java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage
which does not implement Writable. Non-Writable stage:
rfc_4e467607406f of type class
org.apache.spark.ml.classification.RandomForestClassificationModel
So there may not be a straightforward solution for this.
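For reference, from Spark 2.0 onwards the tree models implement MLWritable/MLReadable directly, so the detour through saveAsObjectFile() is no longer needed; a sketch of the 2.x API:

import org.apache.spark.ml.classification.RandomForestClassificationModel

// Spark 2.0+: the fitted model can be saved and reloaded directly.
model.write.overwrite().save("/path/to/rfModel")
val reloaded = RandomForestClassificationModel.load("/path/to/rfModel")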
Here is a PySpark v1.6 implementation corresponding to the Scala saveAsObjectFile() answer above.
It coerces the Python objects to/from Java objects to achieve serialisation with saveAsObjectFile().
Without the Java coercion I had weird Py4J errors on serialisation. If anyone has a simpler implementation, please edit or comment.
Save a trained RandomForestClassificationModel object:
# Save RandomForestClassificationModel to hdfs
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")
Load a trained RandomForestClassificationModel object:
# Load RandomForestClassificationModel from hdfs
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)

How to extract variable weight from spark pipeline logistic model?

I am currently trying to learn Spark Pipelines (Spark 1.6.0). I imported datasets (train and test) as oas.sql.DataFrame objects. After executing the following code, the produced model is a oas.ml.tuning.CrossValidatorModel.
You can use model.transform(test) to predict on the test data in Spark. However, I would like to compare the weights that the model uses for prediction with those from R. How do I extract the weights of the predictors and the intercept (if any) from the model? The Scala code is:
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")
// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test = conTest.map { line =>
val parts = line.split(",")
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit logistic model
val model = cv.fit(train)
// If you want to predict with test
val pred = model.transform(test)
My Spark environment is not accessible at the moment, so this code was retyped and rechecked; I hope it is correct. So far I have tried searching the web and asking others. Suggestions and criticism of my code are welcome.
import org.apache.spark.ml.PipelineModel

// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

// fit the cross validator, then you can print the lr model coefficients as below
val cvModel = cv.fit(train)
val model = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")
Two steps:
1. Get the best pipeline from the cross-validation result.
2. Get the LR model from the best pipeline; it's the first stage in your code example.
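The intercept and the tuned hyper-parameters can be read off the same model; a short sketch continuing from the code above:

// Sketch: intercept and the regParam selected by cross validation.
println(s"Intercept: ${lrModel.intercept}")
println(s"Selected regParam: ${lrModel.getRegParam}")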
I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.
import org.apache.spark.ml.classification.LogisticRegressionModel
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
println(s"weights: ${lrmodel.weights}, intercept: ${lrmodel.intercept}")
I am still not sure how to extract the weights from "model" above, but by restructuring the process to follow the official tutorial, the following works on spark-1.6.0:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weights
lrmodel.intercept
I noticed the difference between "lrmodel" here and "model" generated above:
model.bestModel --> gives oas.ml.Model[_] = pipeline_****
restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****
That's why we can cast restructuredModel.bestModel to LogisticRegressionModel but not model.bestModel: in the first setup the estimator handed to the CrossValidator is a Pipeline, so the best model is a PipelineModel that wraps the LogisticRegressionModel as a stage, whereas in the restructured version the estimator is the LogisticRegression itself.
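In other words, when the estimator handed to the CrossValidator is a Pipeline, the weights have to be pulled out of the best PipelineModel's stages; a sketch against the cross-validated "model" from the original code:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// Sketch: extract the LR stage from the best pipeline produced by the CrossValidator.
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val lrStage = bestPipeline.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"coefficients: ${lrStage.coefficients}, intercept: ${lrStage.intercept}")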