Is there any way to convert pyspark random forest model to pmml? - pyspark

I have trained a RandomForest model in PySpark 2.1 and saved it as a PySpark model file.
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(featuresCol='features',
                                  labelCol='click',
                                  maxDepth=10,
                                  maxBins=32,
                                  numTrees=100)
model = rf_model.fit(dftrain)
model_path = 'hdfs://hacluster/user/model'
model.save(model_path)
But now we have downloaded the model without the dftrain data and cannot access HDFS right now. Is there any way to convert the model file to PMML without the exact training data?
I already know about pyspark2pmml and jpmml-sparkml, but both take the training data as input. For example:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, dftrain, pipelineModel)
print(pmmlBytes)

I already knew pyspark2pmml or jpmml-sparkml, both have train data as input.
The JPMML-SparkML library (either directly or via the PySpark2PMML wrapper library) is still your only option. However, you should check out its README file to refresh your knowledge about it: your example uses an outdated API (the toPMMLBytes utility method instead of the PMMLBuilder#buildByteArray builder method).
As for the need for the training dataset: JPMML-SparkML needs to know the schema of the training dataset (in the form of an org.apache.spark.sql.types.StructType object), not the actual data. This schema is used for getting column names, data types, and other metadata.
If you don't have the original schema available, then it shouldn't be difficult to create one programmatically.
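One way to create such a schema programmatically, as a minimal sketch: describe the columns in Spark's JSON schema layout, convert that to a StructType, and build an empty DataFrame that carries only the schema. The column names `age` and `price` are hypothetical placeholders for the real raw input columns (only `click` comes from the question), and the helper names `schema_json` and `empty_frame` are made up for illustration.

```python
def schema_json(columns):
    """Describe (name, spark_type_name) pairs in Spark's JSON schema layout."""
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": type_name, "nullable": True, "metadata": {}}
            for name, type_name in columns
        ],
    }


def empty_frame(spark, columns):
    """Build an empty DataFrame that carries only the schema (needs pyspark)."""
    # Imported lazily so schema_json stays usable without a Spark installation.
    from pyspark.sql.types import StructType

    schema = StructType.fromJson(schema_json(columns))
    return spark.createDataFrame([], schema)


# Hypothetical columns: replace with the real names and types of the training data.
columns = [("age", "double"), ("price", "double"), ("click", "integer")]
print(schema_json(columns)["fields"][0])
```

The empty DataFrame returned by `empty_frame(spark, columns)` could then be passed where dftrain was used before. Whether that is sufficient depends on the saved model being loadable as a full PipelineModel, which is an assumption here.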

Related

Own data set consisting of numbers (csv) at PyTorch

I want to use my own dataset, which consists of numbers, in PyTorch. The data is available as a csv file, for example. What is the easiest way to load this into PyTorch? So far I only know how to use already existing datasets in PyTorch, but I don't want to do that.
You need to create a custom class that inherits from PyTorch's Dataset class.
Then, you need to wrap it with a DataLoader.
Follow this tutorial for an in-depth explanation:
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files
The easiest way to import your dataset would be to:
Use pandas package to load your csv file:
import pandas as pd
data = pd.read_csv("filename.csv")
Then, implement a very simple PyTorch Dataset class as described here.
Finally, pass your Dataset instance as the first parameter of a PyTorch DataLoader.
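The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the CSV has a header row, every cell is numeric, and the last column is the target. The names `read_numeric_csv`, `CsvDataset` and `make_loader` are hypothetical. The standard csv module is used here instead of pandas only to keep the sketch light on dependencies.

```python
import csv


def read_numeric_csv(path, has_header=True):
    """Read a CSV file of numbers into a list of rows of floats."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row
        return [[float(cell) for cell in row] for row in reader if row]


def make_loader(path, batch_size=32):
    """Wrap the CSV rows in a Dataset and a DataLoader (needs torch)."""
    # Imported lazily so read_numeric_csv works without PyTorch installed.
    import torch
    from torch.utils.data import DataLoader, Dataset

    class CsvDataset(Dataset):
        def __init__(self, rows):
            data = torch.tensor(rows, dtype=torch.float32)
            self.features = data[:, :-1]  # assumption: all but the last column
            self.targets = data[:, -1]    # assumption: last column is the target

        def __len__(self):
            return len(self.targets)

        def __getitem__(self, idx):
            return self.features[idx], self.targets[idx]

    dataset = CsvDataset(read_numeric_csv(path))
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

Iterating over `make_loader("filename.csv")` then yields batches of `(features, targets)` tensors.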

Convert PySpark ML Word2Vec model to Gensim Word2Vec model

I've generated a PySpark Word2Vec model like so:
from pyspark.ml.feature import Word2Vec
w2v = Word2Vec(vectorSize=100, minCount=1, inputCol='words', outputCol = 'vector')
model = w2v.fit(df)
(The data that I used to train the model isn't relevant; what's important is that it's all in the right format and successfully yields a pyspark.ml.feature.Word2VecModel object.)
Now I need to convert this model to a Gensim Word2Vec model. How would I go about this?
If you still have the training data, re-training the gensim Word2Vec model may be the most straightforward approach.
If you only need the word-vectors, perhaps PySpark's model can export them in the word2vec.c format that gensim can load with .load_word2vec_format().
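If PySpark offers no direct export, the text variant of the word2vec.c format is simple enough to write by hand: a header line with the vocabulary size and the vector dimension, then one line per word. The sketch below assumes the vectors have already been collected as (word, vector) pairs; `write_word2vec_text` is a hypothetical helper name.

```python
def write_word2vec_text(pairs, path):
    """Write (word, vector) pairs in the plain-text word2vec.c format."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no vectors to write")
    dim = len(pairs[0][1])
    with open(path, "w", encoding="utf-8") as f:
        # Header line: vocabulary size and vector dimensionality.
        f.write(f"{len(pairs)} {dim}\n")
        for word, vector in pairs:
            if len(vector) != dim:
                raise ValueError(f"vector for {word!r} has the wrong length")
            # Words must not contain spaces, since space is the separator.
            values = " ".join(repr(float(x)) for x in vector)
            f.write(f"{word} {values}\n")
```

With a fitted PySpark model, the pairs could come from `model.getVectors().collect()`, taking each row's `word` and `vector` fields, and the resulting file could then be loaded with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`.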
The only reason to port the model would be to continue training. Such incremental training, while possible, involves considering a lot of tradeoffs in balancing the influence of the older and later training to get good results.
If you are in fact wanting to do this conversion in order to do more training in such a manner, it again suggests that using the original training to reproduce a similar model could be plausible.
But, if you have to convert the model, the general approach would be to study the source code and internal data structures of the two models, to discover how they alternatively represent each of the key aspects of the model:
the known word-vectors (model.wv.vectors in gensim)
the known-vocabulary of words, including stats about word-frequencies and the position of individual words (model.wv.vocab in gensim)
the hidden-to-output weights of the model (model.trainables and its properties in gensim)
other model properties describing the model's modes & metaparameters
A reasonable interactive approach could be:
Write some acceptance tests that take models of both types, and test whether they are truly 'equivalent' for your purposes. (This is relatively easy for just checking if the vectors for individual words are present and identical, but nearly as hard as the conversion itself for verifying other ready-to-be-trained-more behaviors.)
Then, in an interactive notebook, load the source model, and also create a dummy gensim model with the same vocabulary size. Consulting the source code, write Python statements to iteratively copy/transform key properties over from the source into the target, repeatedly testing if they verify as equivalent.
When they do, take those steps you did manually and combine them into a utility method to do the conversion. Again verify its operation then try using the converted model however you'd hoped – perhaps discovering overlooked info or discovering other bugs in the process, and then improving the verification method and conversion method.
It's possible that the PySpark model will be missing things the gensim model expects, which might require synthesizing workable replacement values.
Good luck! (But re-train the gensim model from the original data if you want things to just be straightforward and work.)
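As a starting point for the acceptance tests mentioned above, the simplest check compares only the word-vectors. The sketch assumes both models have been reduced to plain word-to-vector mappings; `vectors_equivalent` is a hypothetical helper, and it says nothing about the ready-to-be-trained-more state.

```python
import math


def vectors_equivalent(source, target, tol=1e-6):
    """Return True if both mappings hold the same words with matching vectors."""
    if set(source) != set(target):
        return False  # vocabularies differ
    for word, vector in source.items():
        other = target[word]
        if len(vector) != len(other):
            return False  # dimensionality differs
        if not all(math.isclose(a, b, abs_tol=tol) for a, b in zip(vector, other)):
            return False  # some component differs beyond the tolerance
    return True
```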

Can Scala load SparkR-saved model?

I'm a data analyst. I want to train a model (for example, a random forest) in SparkR and save it so that it can be loaded from Scala. Since both Scala and R use MLlib for machine learning, can Scala also load a model trained and saved in SparkR?
I found an article saying that it was not compatible:
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
But it was written almost a year ago. Does the latest version of SparkR, even a development version, support this cross-compatibility of models?
Code: To Save and Load Model in Spark
val model = pipeline.fit(training)
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
For more details, refer to:
https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
Hope this Helps!!!...

How to load a PMML model?

I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
// Save and load model: export to PMML
println("PMML Model:\n" + clusters.toPMML("/kmeans.xml"))
But I don't know how to load the PMML after that.
I'm trying
val sameModel = KMeansModel.load(sc, "/kmeans.xml")
and this error appears:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata
Any idea?
Best regards
As stated in the documentation (for the version you seem to be interested in, 1.6.1, and also for the latest available, 2.1.0), Spark supports exporting to PMML only. The load method expects to retrieve a model saved in Spark's own format, which is why it looks for a certain path and why the exception was thrown.
If you trained the model with Spark, you can save it and load it later.
If you need to load a model that has not been trained in Spark and has been saved as PMML you can use jpmml-spark to load and evaluate it.
My limited experience in this spark.mllib's KMeans space is that it is not possible, but you could develop the feature yourself.
spark.mllib's KMeansModel is PMMLExportable:
class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
extends Saveable with Serializable with PMMLExportable {
That's why you can use toPMML that saves a model into the PMML XML format.
(Again, I have very little experience with Spark MLlib.) My understanding is that KMeans is all about centroids, and that is what gets loaded when you call KMeansModel.load, which in turn uses KMeansModel.SaveLoadV1_0.load to read the centroids and create a KMeansModel:
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
For KMeansModel.toPMML, Spark MLlib uses pmml-model's PMML (as you can see here):
new PMML("4.2", header, null)
I'd recommend exploring how pmml-model's PMML class handles saving and loading, as that's beyond Spark's realm.
Side notes
Why would you even want to use Spark to have the model after you trained it? It is indeed possible, but you may be wasting your cluster resources for Spark to host the model.
In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features like distribution and parallelism to handle large datasets to build models and use them without the Spark machinery afterwards.
I must be missing something important in my narrow view...
You could use PMML4S-Spark to load a PMML model to evaluate it in Spark, for example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile("/kmeans.xml")
The model is a SparkML transformer, so you can make predictions against a DataFrame:
val scoreDf = model.transform(df)
PMML files are XML files whose schema is defined by the Data Mining Group (DMG). You can therefore either write your own deserializer based on the specification published on the DMG PMML web page, or use third-party libraries.
I am researching the jpmml library for incorporating Python-built models into a Spring application.
Information here:
https://github.com/jpmml
http://dmg.org/pmml/v4-1/GeneralStructure.html
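Because a PMML file is plain XML, a hand-written deserializer for a k-means model can be quite small. The sketch below assumes the file follows the PMML ClusteringModel layout, where each Cluster element holds an Array of space-separated coordinates; `read_cluster_centers` is a hypothetical helper. It only extracts the centers and does not rebuild a Spark KMeansModel.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://www.dmg.org/PMML-4_2}Cluster'."""
    return tag.rsplit("}", 1)[-1]


def read_cluster_centers(pmml_source):
    """Return the cluster centers from a PMML file path or file object."""
    root = ET.parse(pmml_source).getroot()
    centers = []
    for element in root.iter():
        if local_name(element.tag) != "Cluster":
            continue
        for child in element:
            if local_name(child.tag) == "Array":
                # Coordinates are stored as whitespace-separated numbers.
                centers.append([float(x) for x in (child.text or "").split()])
    return centers
```

Calling `read_cluster_centers("/kmeans.xml")` would return one list of coordinates per cluster, provided the exported file matches the assumed layout.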

usage of naive bayes Model for prediction

Hi all, I am new to Scala and Spark MLlib.
I have a dataset of diseases along with their symptoms, in the following format:
Disease,symptom1 symptom2 symptom3
I have almost 300 entries which are in the above mentioned format in a CSV file.
I want to achieve the following functionality:
If a user provides symptoms as input, namely Symptom1, Symptom2, Symptom3, the model must be able to predict the disease.
I have the following questions:
Which machine learning model should I use to achieve this functionality?
I have gone through some models and found the Naive Bayes model; correct me if I am wrong.
Can I provide text input to a Naive Bayes model?
Is there any sample code available to achieve this functionality?
You can use any of the classification algorithms available in Spark MLlib. For further reference, read the official docs and go through this Databricks blog post: https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html
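As one possible shape for such a solution, here is a rough PySpark sketch (Python rather than Scala). Text cannot be fed to Naive Bayes directly, so the symptom words are first turned into count vectors. The helper names `parse_line` and `train_naive_bayes` and the column names are assumptions made for illustration, and the input lines are assumed to follow the "Disease,symptom1 symptom2 symptom3" format from the question.

```python
def parse_line(line):
    """Split 'Disease,symptom1 symptom2 symptom3' into (disease, [symptoms])."""
    disease, _, symptoms = line.strip().partition(",")
    return disease.strip(), symptoms.lower().split()


def train_naive_bayes(spark, lines):
    """Fit an indexer + count vectorizer + Naive Bayes pipeline (needs pyspark)."""
    # Imported lazily so parse_line can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import CountVectorizer, StringIndexer

    rows = [parse_line(line) for line in lines if line.strip()]
    df = spark.createDataFrame(rows, ["disease", "symptoms"])
    pipeline = Pipeline(stages=[
        # Map disease names to numeric label indices.
        StringIndexer(inputCol="disease", outputCol="label"),
        # Turn each list of symptom words into a vector of word counts.
        CountVectorizer(inputCol="symptoms", outputCol="features"),
        NaiveBayes(featuresCol="features", labelCol="label"),
    ])
    return pipeline.fit(df)
```

The fitted pipeline can then score a DataFrame with a `symptoms` column via `transform`; the `prediction` column holds label indices, which the fitted StringIndexer stage's `labels` list maps back to disease names.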