How to load a PMML model? - scala

I'm following the instructions in PMML model export - spark.mllib to create a K-means model:
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
// Save and load model: export to PMML
println("PMML Model:\n" + clusters.toPMML("/kmeans.xml"))
But I don't know how to load the PMML after that.
I'm trying
val sameModel = KMeansModel.load(sc, "/kmeans.xml")
and I get:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata
Any idea?
Best regards

As stated in the documentation (for the version you seem to be interested in - 1.6.1, and also for the latest available - 2.1.0), Spark supports exporting to PMML only. The load method actually expects to retrieve a model saved in Spark's own format; that is why it looks for a specific directory layout (including the metadata path) and why the exception is thrown.
If you trained the model with Spark, you can save it and load it later.
If you need to load a model that has not been trained in Spark and has been saved as PMML you can use jpmml-spark to load and evaluate it.
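As a rough illustration (not part of the original answer), here is a minimal sketch of loading and inspecting a PMML file with the JPMML-Evaluator API that the JPMML projects build on; it assumes a recent version of that library, and the path is the one from the question:
import java.io.File
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder

// Load the PMML document and build an evaluator for it
// (sketch, assuming a JPMML-Evaluator version that provides LoadingModelEvaluatorBuilder).
val evaluator = new LoadingModelEvaluatorBuilder()
  .load(new File("/kmeans.xml"))
  .build()

// Inspect which input fields the model expects before feeding it records.
println(evaluator.getInputFields)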

In my limited experience with spark.mllib's KMeans, this is not possible out of the box, but you could develop the feature yourself.
spark.mllib's KMeansModel is PMMLExportable:
class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
  extends Saveable with Serializable with PMMLExportable {
That's why you can use toPMML, which saves a model in the PMML XML format.
(Again, I have very little experience in Spark MLlib.) My understanding is that KMeans is all about centroids, and that's what is loaded when you do KMeansModel.load, which in turn uses KMeansModel.SaveLoadV1_0.load to read the centroids and create a KMeansModel:
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
For KMeansModel.toPMML, Spark MLlib uses pmml-model's PMML (as you can see here):
new PMML("4.2", header, null)
I'd recommend exploring pmml-model's PMML to see how to do saving and loading, as that's beyond Spark's realm.
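To give an idea of what developing the feature yourself could look like, here is a minimal, hypothetical sketch that parses the exported XML with scala-xml and rebuilds a KMeansModel from the cluster centers; it assumes the Spark export places each centroid in a Cluster element whose Array child holds space-separated coordinates:
import scala.xml.XML
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical loader: rebuild a KMeansModel from the PMML file Spark exported.
// Assumes every <Cluster> element holds a single <Array> of space-separated coordinates.
def loadKMeansFromPMML(path: String): KMeansModel = {
  val pmml = XML.loadFile(path)
  val centers = (pmml \\ "Cluster").map { cluster =>
    val coords = (cluster \ "Array").text.trim.split("\\s+").map(_.toDouble)
    Vectors.dense(coords)
  }.toArray
  new KMeansModel(centers)
}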
Side notes
Why would you even want to use Spark to hold the model after you have trained it? It is indeed possible, but you may be wasting your cluster resources by having Spark host the model.
In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features like distribution and parallelism to handle large datasets when building models, and to use those models without the Spark machinery afterwards.
I must be missing something important in my narrow view...

You could use PMML4S-Spark to load a PMML model and evaluate it in Spark, for example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile("/kmeans.xml")
The model is a Spark ML transformer, so you can make predictions against a DataFrame:
val scoreDf = model.transform(df)

PMML files are actually XML files with a schema defined by the Data Mining Group (DMG). For that reason you can either define a deserializer based on the contract published on the DMG PMML web page (see the links below) or use third-party libraries.
I am researching the jpmml library for incorporating Python-prepared models in a Spring application.
Information here:
https://github.com/jpmml
http://dmg.org/pmml/v4-1/GeneralStructure.html
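As a hedged illustration of the third-party route, the jpmml-model library (part of the JPMML projects linked above) generates JAXB classes from the DMG schema and can unmarshal a PMML file directly; the path below is just an example:
import java.io.FileInputStream
import org.dmg.pmml.PMML
import org.jpmml.model.PMMLUtil

// Sketch: deserialize a PMML document into the JAXB object model that
// jpmml-model generates from the DMG schema.
val pmml: PMML = PMMLUtil.unmarshal(new FileInputStream("/kmeans.xml"))
println(pmml.getModels)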

Related

Is there any way to convert pyspark random forest model to pmml?

I have trained a RandomForest in PySpark 2.1, but it was saved as a PySpark model file.
rf_model = RandomForestClassifier(featuresCol='features',
                                  labelCol='click',
                                  maxDepth=10,
                                  maxBins=32,
                                  numTrees=100)
model = rf_model.fit(dftrain)
model_path = 'hdfs://hacluster/user/model'
model.save(model_path)
But now we have downloaded the model without the dftrain data and cannot access HDFS right now. Is there any way to convert the model file to PMML without the exact training data?
I already know about pyspark2pmml and jpmml-sparkml, but both take the training data as input. Like:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, dftrain, pipelineModel)
print(pmmlBytes)
I already know about pyspark2pmml and jpmml-sparkml, but both take the training data as input.
The JPMML-SparkML library (either directly or via the PySpark2PMML wrapper library) is still your only option. However, you should check out its README file to refresh your knowledge of it - your example uses an outdated API (the toPMMLBytes utility method instead of the PMMLBuilder#buildByteArray builder method).
Regarding the need for the training dataset: JPMML-SparkML needs to know the schema (in the form of an org.apache.spark.sql.types.StructType object) of the training dataset, not the actual data. This schema is used to get column names, data types, and other metadata.
If you don't have the original schema available, then it shouldn't be difficult to create one programmatically.
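For example, here is a minimal Scala sketch of rebuilding such a schema programmatically; the column names and types are hypothetical and must match whatever the pipeline was originally fitted on:
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical reconstruction of the training schema; adjust the names and
// types to match the columns the pipeline was trained on.
val schema = StructType(Seq(
  StructField("click", DoubleType, nullable = false),
  StructField("feature_1", DoubleType, nullable = false),
  StructField("feature_2", DoubleType, nullable = false)
))
// This schema (not the data) is what JPMML-SparkML's PMMLBuilder takes,
// together with the fitted PipelineModel.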

Pyspark throwing error: py4j.Py4JException: Method __getstate__([]) does not exist [duplicate]

Background
My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and is related to How to generate tuples of (original lable, predicted label) on Spark with MLlib?
When we use the Scala API, a recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD:
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Unfortunately, a similar approach in PySpark doesn't work so well:
labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Instead, the official documentation recommends something like this:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
So what is going on here? There is no broadcast variable here, and the Scala API defines predict as follows:
/**
 * Predict values for a single data point using the model trained.
 *
 * @param features array representing a single data point
 * @return Double prediction from the trained model
 */
def predict(features: Vector): Double = {
  topNode.predict(features)
}

/**
 * Predict values for the given data set using the model trained.
 *
 * @param features RDD representing data points to be predicted
 * @return RDD of predictions for each of the given data points
 */
def predict(features: RDD[Vector]): RDD[Double] = {
  features.map(x => predict(x))
}
So, at least at first glance, calling it from an action or transformation is not a problem, since prediction seems to be a local operation.
Explanation
After some digging, I figured out that the source of the problem is the JavaModelWrapper.call method invoked from DecisionTreeModel.predict. It accesses the SparkContext, which is required to call the Java function:
callJavaFunc(self._sc, getattr(self._java_model, name), *a)
Question
In the case of DecisionTreeModel.predict there is a recommended workaround, and all the required code is already part of the Scala API, but is there any elegant way to handle problems like this in general?
The only solutions I can think of right now are rather heavyweight:
pushing everything down to the JVM, either by extending Spark classes through implicit conversions or by adding some kind of wrappers
using the Py4J gateway directly
Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the following diagram from the PySpark Internals document [1]:
Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with JVM workers through sockets (see for example PythonRDD / rdd.py).
Theoretically, it would be possible to create a separate Py4J gateway for each worker, but in practice it is unlikely to be useful. Ignoring issues like reliability, Py4J is simply not designed to perform data-intensive tasks.
Are there any workarounds?
Using Spark SQL Data Sources API to wrap JVM code.
Pros: Supported, high level, doesn't require access to the internal PySpark API
Cons: Relatively verbose and not very well documented, limited mostly to the input data
Operating on DataFrames using Scala UDFs (a minimal sketch follows this list).
Pros: Easy to implement (see Spark: How to map Python with Scala or Java User Defined Functions?), no data conversion between Python and Scala if data is already stored in a DataFrame, minimal access to Py4J
Cons: Requires access to Py4J gateway and internal methods, limited to Spark SQL, hard to debug, not supported
Creating a high-level Scala interface in a similar way to how it is done in MLlib.
Pros: Flexible, with the ability to execute arbitrarily complex code. It can be done either directly on RDDs (see for example MLlib model wrappers) or with DataFrames (see How to use a Scala class inside Pyspark). The latter solution seems to be much more friendly, since all ser-de details are already handled by the existing API.
Cons: Low level, requires data conversion, and, like UDFs, requires access to Py4J and internal APIs; not supported
Some basic examples can be found in Transforming PySpark RDD with Scala.
Using an external workflow management tool to switch between Python and Scala/Java jobs, passing data via a DFS.
Pros: Easy to implement, minimal changes to the code itself
Cons: Cost of reading / writing data (Alluxio?)
Using a shared SQLContext (see for example Apache Zeppelin or Livy) to pass data between guest languages via registered temporary tables.
Pros: Well suited for interactive analysis
Cons: Not so well suited for batch jobs (Zeppelin); may require additional orchestration (Livy)
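For the Scala UDF option above, here is a minimal, hypothetical sketch of the JVM side; the object name and the scoring logic are made up, while the udf and spark.udf.register calls are standard Spark SQL API:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical helper compiled into a jar on the Spark classpath.
object ScalaUdfs {
  def register(spark: SparkSession): Unit = {
    // Placeholder scoring logic standing in for a call to a JVM-side model.
    val predict = udf { (feature: Double) => if (feature > 0.5) 1.0 else 0.0 }
    spark.udf.register("predict_udf", predict)
  }
}
Once registered, the UDF is visible to Spark SQL, so the Python side can call it in SQL expressions over a DataFrame; invoking the register method itself from PySpark still goes through the Py4J gateway, which is exactly the unsupported-internals caveat listed above.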
Joshua Rosen. (2014, August 04) PySpark Internals. Retrieved from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Can Scala load SparkR-saved model?

I'm a data analyst. I want to train a model (for example, a random forest) in SparkR that can be saved and loaded by Scala. Since both Scala and R use MLlib for machine learning, can Scala also load a model trained and saved in SparkR?
I found an article saying that it was not compatible:
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
But it was written almost a year ago. Does the latest (or even the development) version of SparkR support this cross-language model compatibility?
Code to save and load a model in Spark:
val model = pipeline.fit(training)
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
For more details, refer to
https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
Hope this helps!

How to use Weka model on Spark

I am new to Spark and Scala.
I have 10 machine learning models which were trained using WEKA.
Now I am moving my application to Spark and want to use these models.
How can I use them in Spark?
For prediction, which model to choose depends on the type of incoming data.
How should I design my application so that I don't have to load all 10 of them into memory together?
Any help would be appreciated.
First of all, the classifiers in Weka are not serializable, therefore you can only apply your models in a tricky way.
On the other hand, it is not clear why you want to apply a Weka-based model in Apache Spark, as you can also train Spark-based ML algorithms with MLlib (http://spark.apache.org/docs/latest/ml-guide.html).
It is well documented, and you can find a lot of useful examples.
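For instance, here is a minimal sketch of training a spark.mllib decision tree (the MLlib counterpart of Weka's J48); it assumes an existing SparkContext sc, and the input path and hyperparameters are placeholders:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load a LIBSVM-formatted training set (placeholder path) as an RDD[LabeledPoint].
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Train a decision tree classifier with placeholder hyperparameters.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)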
Finally, I compared the performance of the Weka J48 decision tree and the Spark decision tree model on the Reuters data set.
It is a document classification problem; I evaluated the models using 10-fold cross-validation.
The F1 score results of Weka:
(ship, 0.5751879699248121)
(grain, 0.7714285714285716)
(money-fx, 0.7308567096285064)
(corn, 0.7334851936218679)
(trade, 0.7641325536062378)
(crude, 0.7815049864007253)
(earn, 0.9310115645354248)
(wheat, 0.7661870503597122)
(acq, 0.8078484438430312)
(interest, 0.6561743341404359)
And the results of Spark:
(ship, 0.5307018372123027)
(grain, 0.7606432455706257)
(money-fx, 0.7476899173974012)
(corn, 0.7210280866934613)
(trade, 0.7607140827384508)
(crude, 0.7450426425908848)
(earn, 0.9337615148649243)
(wheat, 0.751148372254634)
(acq, 0.8009280204333529)
(interest, 0.6837952003315322)
As you can see, there is not a huge difference between the two solutions.
So, I recommend applying Apache Spark MLlib!