How to save a RandomForestClassifier Spark model in Scala?

I built a random forest model using the following code:
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.IndexToString
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val training = labelIndexer.transform(df)
val model = rf.fit(training)
Now I want to save the model so that I can load it later and predict with the following code:
val predictions: DataFrame = model.transform(testData)
I've looked into the Spark documentation here and didn't find any option to do that. Any idea?
It took me a few hours to build the model; if Spark crashes, I won't be able to get it back.

It's possible to save and reload tree-based models in HDFS with Spark 1.6 using saveAsObjectFile(), for both Pipeline-based and basic models.
Below is an example for a pipeline-based model.
import org.apache.spark.ml.PipelineModel
// model
val model = pipeline.fit(trainingData)
// Create an RDD containing the model, using Seq
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://filepath")
// Reload the model by using its class
// (you can get the class of an object with object.getClass)
val sameModel = sc.objectFile[PipelineModel]("hdfs://filepath").first()
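The reloaded model behaves like the original, so (assuming testData is a DataFrame with the columns the pipeline expects) it can be used for scoring directly:
val predictions = sameModel.transform(testData)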

For saving and loading a RandomForestClassifier model: tested with Spark 1.6.2 + Scala in ml (in Spark 2.0 the model has a direct save option):
// imports
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier

val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
val model = classifier.fit(trainingData)
// save model
sc.parallelize(Seq(model), 1).saveAsObjectFile(modelSavePath)
// load model
val rfModel = sc.objectFile[RandomForestClassificationModel](modelSavePath).first()
// predictions1 is a DataFrame
val predictions1 = rfModel.transform(testData)
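For reference, in Spark 2.0 and later RandomForestClassificationModel implements MLWritable, so the objectFile workaround above is no longer needed; a minimal sketch:
// Spark 2.0+: save and load the model directly
model.write.overwrite().save(modelSavePath)
val sameModel = RandomForestClassificationModel.load(modelSavePath)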

It is in the MLWriter interface, which is accessed via the write attribute on your model:
model.asInstanceOf[MLWritable].write.save(path)
Here is the interface:
abstract class MLWriter extends BaseReadWrite with Logging {

  protected var shouldOverwrite: Boolean = false

  /**
   * Saves the ML instances to the input path.
   */
  @Since("1.6.0")
  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
  def save(path: String): Unit = {
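Note that MLWriter also exposes an overwrite() method, which sets the shouldOverwrite flag shown above, so an existing path can be replaced:
model.asInstanceOf[MLWritable].write.overwrite().save(path)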
This is a refactoring from earlier versions of mllib/spark.ml.
Update: it appears that the model was not writable:
Exception in thread "main" java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage
which does not implement Writable. Non-Writable stage:
rfc_4e467607406f of type class
org.apache.spark.ml.classification.RandomForestClassificationModel
So there may not be a straightforward solution for this.

Here is a PySpark v1.6 implementation corresponding to the Scala saveAsObjectFile() answer above.
It coerces the Python objects to/from Java objects to achieve serialisation with saveAsObjectFile().
Without the Java coercion I had weird Py4J errors on serialisation. If anyone has a simpler implementation, please edit or comment.
Save a trained RandomForestClassificationModel object:
# Save RandomForestClassificationModel to hdfs
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")
Load a trained RandomForestClassificationModel object:
# Load RandomForestClassificationModel from hdfs
from pyspark.ml.classification import RandomForestClassificationModel
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)

Related

Can we create an XML file with a specific node with Spark Scala?

I have another question about Spark and Scala. I want to use this technology to get data and generate an XML file.
Therefore, I want to know if it is possible to create the nodes ourselves (not automatically) and what library we can use. I searched but found nothing very interesting (I'm new to this technology, so I don't know many keywords).
I want to know whether Spark has something like the following code (I wrote it in Scala; it works locally, but I can't use new File() in Spark).
import java.io.File
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.transform.{Transformer, TransformerFactory}
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.{Attr, Element}

val docBuilder: DocumentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
val document = docBuilder.newDocument()
var root: Element = document.createElement("<tag name>")
var attr: Attr = document.createAttribute("<attr1>")
attr.setValue("<value attr1>")
root.setAttributeNode(attr)
attr = document.createAttribute("<attr2>")
attr.setValue("<value attr2>")
root.setAttributeNode(attr)
document.appendChild(root)
document.setXmlStandalone(true)
var transformerFactory: TransformerFactory = TransformerFactory.newInstance()
var transformer: Transformer = transformerFactory.newTransformer()
var domSource: DOMSource = new DOMSource(document)
var streamResult: StreamResult = new StreamResult(new File(destination))
transformer.transform(domSource, streamResult)
I want to know if it's possible to do that with Spark.
Thanks for your answer and have a good day.
Not exactly, but you can do something similar by using the Spark XML API or the XStream API on Spark.
First, try the Spark XML API, which is most useful when reading and writing XML files with Spark. However, at the time of writing, Spark XML has the following limitations:
1) Adding attributes to the root element is not supported.
2) It does not support the following structure, where you have header and footer elements:
<parent>
<header></header>
<dataset>
<data attr="1">supports xml tags and data here</data>
<data attr="2">value2</data>
</dataset>
<footer></footer>
</parent>
If you have one root element followed by data, then Spark XML is the go-to API.
Alternatively, you can look at the XStream API. Below are the steps for using it to create custom XML structures.
1) First, create a Scala class matching the structure you want in the XML.
case class XMLData(name:String, value:String, attr:String)
2) Create an instance of this class
val data = XMLData("bookName","AnyValue", "AttributeValue")
3) Convert the data object to XML using the XStream API. If you already have data in a DataFrame, do a map transformation to convert the data to an XML string and store it back in a DataFrame; if you do so, you can skip step 4.
val xstream = new XStream(new DomDriver)
val xmlString = xstream.toXML(data)
4) Now convert xmlString to DataFrame
val df = Seq(xmlString).toDF()
5) Finally, write to a file
df.write.text("file://filename")
Here is a full sample example with the XStream API:
import com.thoughtworks.xstream.XStream
import com.thoughtworks.xstream.io.xml.DomDriver
import org.apache.spark.sql.SparkSession

case class Animal(cri: String, taille: Int)

object SparkXMLUsingXStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder.master("local[*]")
      .appName("sparkbyexamples.com")
      .getOrCreate()

    var animal: Animal = Animal("Rugissement", 150)
    val xstream1 = new XStream(new DomDriver())
    xstream1.alias("testAni", classOf[Animal])
    xstream1.aliasField("cricri", classOf[Animal], "cri")
    val xmlString = Seq(xstream1.toXML(animal))

    import spark.implicits._
    val newDf = xmlString.toDF()
    newDf.show(false)
  }
}
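With the aliases above, the XML string produced for the single row should look roughly like this:
<testAni>
  <cricri>Rugissement</cricri>
  <taille>150</taille>
</testAni>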
Hope this helps!

Create a map to call the POJO for each row of a Spark DataFrame

I built an H2O model in R and saved the POJO code. I want to score parquet files in HDFS using the POJO, but I'm not sure how to go about it. I plan on reading the parquet files into Spark (Scala/SparkR/PySpark) and scoring them there. Below is the excerpt I found on H2O's documentation page.
"How do I run a POJO on a Spark Cluster?
The POJO provides just the math logic to do predictions, so you won’t find any Spark (or even H2O) specific code there. If you want to use the POJO to make predictions on a dataset in Spark, create a map to call the POJO for each row and save the result to a new column, row-by-row"
Does anyone have some example code of how I can do this? I'd greatly appreciate any assistance. I code primarily in R and SparkR, and I'm not sure how I can "map" the POJO to each line.
Thanks in advance.
I just posted a solution that actually uses DataFrame/Dataset. The post used a Star Wars dataset to build a model in R and then scored the MOJO on the test set in Spark. I'll paste only the relevant part here:
Scoring with Spark (and Scala)
You could use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be put in the lib folder of the root directory of your Spark application so it can be added as a dependency during compilation. The following code assumes you're running spark-shell. In order to use h2o-genmodel.jar, you need to append the jar file when launching spark-shell by providing the --jars flag. For example:
/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar
Now in the Spark shell, import the dependencies
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel
Using DataFrame
val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"
// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)
// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)
// score
val dfScore = dfStarWars.map { x =>
  val r = new RowData
  r.put("height", x.getAs[String](1))
  r.put("mass", x.getAs[String](2))
  val score = easyModel.predictBinomial(r).classProbabilities
  (x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")
The variable score holds the two class probabilities, for levels 0 and 1. score(1) is the score for level 1, which is "human". By default the map function returns a DataFrame with unspecified column names "_1", "_2", etc. You can rename the columns by calling toDF.
Using Dataset
To use the Dataset API we just need to create two case classes, one for the input data, and one for the output.
case class StarWars(
  name: String,
  height: String,
  mass: String,
  is_human: String
)

case class Score(
  name: String,
  isHumanScore: Double
)
// Dataset
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map { x =>
  val r = new RowData
  r.put("height", x.height)
  r.put("mass", x.mass)
  val score = easyModel.predictBinomial(r).classProbabilities
  Score(x.name, score(1))
}
With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the values put into RowData have to be Strings, so you might need to cast them manually if the case class defines other types.
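For example, a numeric column could be cast to String on the DataFrame before converting it to the Dataset; a minimal sketch (assuming height were numeric in the source data):
import org.apache.spark.sql.functions.col
// cast the column so it matches the String field in the case class
val dtStarWars = dfStarWars
  .withColumn("height", col("height").cast("string"))
  .as[StarWars]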
If you want to perform scoring with a POJO or MOJO in Spark, you should use RowData, which is provided within h2o-genmodel.jar, as row-by-row input data for the easyPredict method that generates the scores.
Your solution is to read the parquet file from HDFS and then, for each row, convert it to a RowData object by filling each entry, then pass it to your POJO scoring function. Remember that POJO and MOJO both use exactly the same scoring function; the only difference is how the POJO class is used versus the MOJO resources zip package. As MOJOs are backward compatible and work with any newer h2o-genmodel.jar, it is best to use a MOJO instead of a POJO.
The following is the full Scala code you can use on Spark to load a MOJO model and then do the scoring:
import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel
// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/gbm_model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)
// Get Mojo Details
var features = mojo.getNames.toBuffer
// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")
// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities
Here is an example of reading parquet files in Spark and then saving them as CSV. You can use the same approach to read the parquet from HDFS and then pass each row as RowData to the example above.
Here is a detailed example of using a MOJO model in Spark and performing the scoring with RowData.
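Putting the two together, a minimal sketch (the parquet path and column names are hypothetical; easyModel is the wrapper created above, and spark is the active SparkSession):
import spark.implicits._
// read parquet from HDFS
val parquetDf = spark.read.parquet("hdfs:///some/path/data.parquet")
// convert each row to RowData and score it
val scored = parquetDf.map { row =>
  val r = new RowData
  r.put("AGE", row.getAs[Any]("AGE").toString)
  r.put("RACE", row.getAs[Any]("RACE").toString)
  easyModel.predictBinomial(r).classProbabilities(1)
}.toDF("score")
// save the scores as CSV
scored.write.option("header", "true").csv("hdfs:///some/path/scores")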

Save Spark StandardScaler for later use in Scala

I am still using Spark 1.6 and trained a StandardScaler that I would like to save and reuse on future datasets.
Using the supplied examples I could transform the data successfully but I can't find a way to save the trained normaliser.
Is there any way in which the trained normaliser can be saved?
Assuming that you have created the scalerModel:
import org.apache.spark.ml.feature.StandardScalerModel
scalerModel.write.save("path/folder/")
val loadedModel = StandardScalerModel.load("path/folder/")
The StandardScalerModel class has a save method. After calling fit on a StandardScaler, the returned object is a StandardScalerModel: API Docs
e.g. similar to the supplied example:
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}
val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)
scalerModel.write.overwrite().save("/path/to/the/file")
val sameModel = StandardScalerModel.load("/path/to/the/file")

Spark load model and continue training

I'm using Scala with Spark 2.0 to train a model with LinearRegression.
val lr = new LinearRegression()
.setMaxIter(num_iter)
.setRegParam(reg)
.setStandardization(true)
val model = lr.fit(data)
This is working fine and I get good results.
I saved the model and loaded it in another class to make some predictions:
val model = LinearRegressionModel.load("models/LRModel")
val result = model.transform(data).select("prediction")
Now I wanted to continue training the model with new data, so I saved the model and loaded it to continue the training.
Saving:
model.save("models/LRModel")
lr.save("models/LR")
Loading:
val lr = LinearRegression.load("models/LR")
val model = LinearRegressionModel.load("models/LRModel")
The problem is that when I load the model, there is no fit or train method to continue the training.
When I load the LinearRegression object, it seems it does not save the weights, only the parameters for the algorithm.
I tested it by training on the same data for the same number of iterations, and the result was the exact same rootMeanSquaredError, even though it had definitely not converged at that point in the training.
I also cannot load the model into the LinearRegression; it results in an error:
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.regression.LinearRegressionModel.<init>(java.lang.String)
So the question is, how do I get the LinearRegression object to use the saved LinearRegressionModel?
You can use a Pipeline to save and load machine learning models:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setLabelCol("labels").setFeaturesCol("features").setMaxIter(10).setRegParam(1.0).setElasticNetParam(1.0)
val pipeline = new Pipeline().setStages(Array(lr))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("hdfs://.../spark/mllib/models/linearRegression")
val sameModel = PipelineModel.load("hdfs://...")
sameModel.transform(assembler).select("features", "labels", "prediction").show()
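If you want the fitted weights back, a minimal sketch: the trained LinearRegressionModel can be pulled out of the loaded PipelineModel's stages and inspected:
import org.apache.spark.ml.regression.LinearRegressionModel

val lrModel = sameModel.stages(0).asInstanceOf[LinearRegressionModel]
println(s"coefficients: ${lrModel.coefficients}, intercept: ${lrModel.intercept}")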

Running MLlib via Spark Job Server

I was practising developing a sample model using online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do you actually run the model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to the model in a production environment, or is it done some other way?
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is that you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.{SparkConf, SparkContext}

object SparkRunner {

  def main(args: Array[String]) {
    val config: Config = ConfigFactory.load("app-default-config") /* Use a library to read a config file */
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    val svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    val predictedValue = model.predict(Vectors.dense(5, 1, 1, 1, 2, 1, 3, 1, 1))
    println(predictedValue)
  }

  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /* Set more configuration values here */
    new SparkContext(conf)
  }
}
Optionally, you can also use SparkSubmit, the wrapper for the spark-submit script provided in the Spark library itself.
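For example, a sketch of invoking it programmatically (the class name, master, and JAR path are placeholders for your own build):
// Launch the packaged job through Spark's own submit entry point
org.apache.spark.deploy.SparkSubmit.main(Array(
  "--class", "runner.SparkRunner",
  "--master", "local[*]",
  "/path/to/your-app.jar"
))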