I'm a data analyst. I want to train a model (for example randomforest) and this model can be saved and loaded by Scala. Since both Scala and R are using MLlib for machine learning, can Scala also load the model trained and saved in SparkR?
I found an article saying that it was not compatible:
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
But it was written almost a year ago. Does the latest, even development version, of SparkR support this cross-compatibility of model?
Code: To Save and Load Model in Spark
val model = pipeline.fit(training)
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
For More details refer
https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
Hope this Helps!!!...
Related
I have trained RandomForest in pyspark2.1, but saved as pyspark model file.
rf_model = RandomForestClassifier(featuresCol='features',
labelCol='click',
maxDepth=10,
maxBins=32,
numTrees=100,
)
model = rf_model.fit(dftrain)
model_path = 'hdfs://hacluster/user/model'
model.save(model_path)
But now,we have downloaded the model without the dftrain data and cannot access to the hdfs right now. Is there any way to convert model file to pmml without exact train data?
I already knew pyspark2pmml or jpmml-sparkml, both have train data as input.Like,
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, dftrain, pipelineModel)
print(pmmlBytes)
I already knew pyspark2pmml or jpmml-sparkml, both have train data as input.
The JPMML-SparkML library (either directly or via the PySpark2PMML wrapper library) is still your only option. However, you should check out its README file to refresh your knowledge about it - your example uses outdated API (toPMMLBytes utility method instead of PMMLBuilder#buildByteArray builder method).
Regarding the need for the training dataset, then JPMML-SparkML needs to know the schema (in the form of org.apache.spark.sql.types.StructType object) of the training dataset, not the actual data. This schema is used for getting column names, data types, and other metadata.
If you don't have the original schema available, then it shouldn't be difficult to create one programmatically.
I am new to spark and scala.
I have 10 machine learning models which are trained using WEKA.
Now, i am moving my application to spark and want to use these models.
How can i use them into spark?
For prediction, which model to choose depends on the type of data coming.
How shall i design my application so that i don't have to load all 10 of them in memory together?
Any help would be appreciated.
First of all, the classifiers in weka are not serializable therefore you can only apply your models in a tricky way.
On the other hand, it is not clear why you want to apply weka based model in apache spark as you can also train spark based ML algorithms with MLLib (http://spark.apache.org/docs/latest/ml-guide.html).
It is well documented, and you can find a lot of useful examples.
Finally, I compered the performance of weka J48 decision tree and the spark decision tree model on the reuters data set.
It is a document classification problem, I evaulted the model on 10-fold cross validation manner.
The F1 scores result of weka:
(ship, 0.5751879699248121)
(grain, 0.7714285714285716)
(money-fx, 0.7308567096285064)
(corn, 0.7334851936218679)
(trade, 0.7641325536062378)
(crude, 0.7815049864007253)
(earn, 0.9310115645354248)
(wheat, 0.7661870503597122)
(acq, 0.8078484438430312)
(interest, 0.6561743341404359)
And the results of spark:
(ship, 0.5307018372123027)
(grain, 0.7606432455706257)
(money-fx, 0.7476899173974012)
(corn, 0.7210280866934613)
(trade, 0.7607140827384508)
(crude, 0.7450426425908848)
(earn, 0.9337615148649243)
(wheat, 0.751148372254634)
(acq, 0.8009280204333529)
(interest, 0.6837952003315322)
As you can see, it is not a huge different between the two solution.
So, I recommend you to apply apache spark mllib!
I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
// Save and load model: export to PMML
println("PMML Model:\n" + clusters.toPMML("/kmeans.xml"))
But I don't know how to load the PMML after that.
I'm trying
val sameModel = KMeansModel.load(sc, "/kmeans.xml")
and appears:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata
Any idea?
Best regards
As stated in the documentation (for the version you seem to be interested it - 1.6.1 and also for the latest available - 2.1.0) Spark supports exporting to PMML only. The load method actually expects to retrieve a model saved in Spark own format and this is why the load method expects a certain path to be there and why the exception has been thrown.
If you trained the model with Spark, you can save it and load it later.
If you need to load a model that has not been trained in Spark and has been saved as PMML you can use jpmml-spark to load and evaluate it.
My limited experience in this spark.mllib's KMeans space is that it is not possible, but you could develop the feature yourself.
spark.mllib's KMeansModel is PMMLExportable:
class KMeansModel #Since("1.1.0") (#Since("1.0.0") val clusterCenters: Array[Vector])
extends Saveable with Serializable with PMMLExportable {
That's why you can use toPMML that saves a model into the PMML XML format.
(Again I've got a very little experience in Spark MLlib) My understanding is that KMeans is all about centroids and that's what is loaded when you do KMeansModel.load that in turn uses KMeansModel.SaveLoadV1_0.load that reads the centroids and creates a KMeansModel:
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
For KMeansModel.toPMML, Spark MLlib uses pmml-model's PMML (as you can see here):
new PMML("4.2", header, null)
I'd recommend exploring pmml-model's PMML how to do saving and loading as that's beyond Spark's realm.
Side notes
Why would you even want to use Spark to have the model after you trained it? It is indeed possible, but you may be wasting your cluster resources for Spark to host the model.
In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features like distribution and parallelism to handle large datasets to build models and use them without the Spark machinery afterwards.
I must be missing something important in my narrow view...
You could use PMML4S-Spark to load a PMML model to evaluate it in Spark, for example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile("/kmeans.xml")
The model is a SparkML transformer, so you can make prediction against a dataframe:
val scoreDf = model.transform(df)
PMML files are actually xml files with schemas defined by Data Mining Consortium. For that reason you can either define a deserializer based on the contract given at DMC and PMML web page here or use 3rd party libraries.
I am researching on jpmml library for incorporation python prepared models in Spring application.
Information here:
https://github.com/jpmml
http://dmg.org/pmml/v4-1/GeneralStructure.html
I've been playing around with the Gaussian Mixture Models provided for spark/mllib.
I found it really nice to generate a GaussianMixture from an enormous number of vectors/points. However, this is not always the case in ML. Very often you do not need to generate a model from numberless vectors, but to generate a numberless models -each one- from a few vectors (i.e., building a GMM for each user of a database with hundred of users).
At this point, I do not know how to proceed with the mllib, as I cannot see an easy way to distribute in both by users and by data.
Example:
Let featuresByUser = RDD[user, List[Vectors]],
the natural way to train a GMM for each user might be something like
featuresByUser.mapValues(
feats => new GaussianMixture.set(nGaussians).run(sc.parallelize(feats))
)
However, it is well-known that this is forbidden in spark. The inside sc.parallelize is not in the driver, so this leads to an error.
So the question are,
should the Mllib methods accept Seq[Vector] as input apart from
RDD[Vector] Thus, the programmer could choose one of the other depending on the problem.
Is there any other workaround that I'm missing to deal with this case (using mllib)?
Mllib unfortunately is currently not meant to create many models, but only one at the time, which was confirmed at a recent Spark meetup in London.
What you can do is launch a separate job for each model in a separate thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, running the 'action' that makes the thing run for each on a separate thread.
Another option, if the amount of data per user fits on one instance, you can do a Gaussian mixture on each user with something else than Mllib. This approach was described in the meetup in a case where sklearn was used within PySpark to create multiple models. You'd do something like:
val users: List[Long] = getUsers
val models = sc.parallelize(users).map(user => {
val userData = getDataForUser(user)
buildGM(userData)
})
Gensim's official tutorial explicitly states that it is possible to continue training a (loaded) model. I'm aware that according to the documentation it is not possible to continue training a model that was loaded from the word2vec format. But even when one generates a model from scratch and then tries to call the train method, it is not possible to access the newly created labels for the LabeledSentence instances supplied to train.
>>> sentences = [LabeledSentence(['first', 'sentence'], ['SENT_0']), LabeledSentence(['second', 'sentence'], ['SENT_1'])]
>>> model = Doc2Vec(sentences, min_count=1)
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
>>> sentence = LabeledSentence(['third', 'sentence'], ['SENT_2'])
>>> model.train([sentence])
>>> print(model.vocab.keys())
# At this point I would expect the key 'SENT_2' to be present in the vocabulary, but it isn't
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
Is it at all possible to continue the training of a Doc2Vec model in Gensim with new sentences? If so, how can this be achieved?
My understand is that this is not possible for any new labels. We can only continue training when the new data has the same labels as the old data. As a result, we are training or retuning the weights of the already learned vocabulary, but are not able to learn a new vocabulary.
There is a similar question for adding new labels/words/sentences during training: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ
Also, you might want to keep an eye on this discussion:
https://groups.google.com/forum/#!topic/gensim/UZDkfKwe9VI
Update: If you want to add new words to an already trained model, take a look at online word2vec here:
http://rutumulkar.com/blog/2015/word2vec/
According to gensim documentation online/incremental training is not supported for doc2vec.
refer to https://github.com/RaRe-Technologies/gensim/issues/1019
I could still add new documents to an existing doc2vec model( but some it crashes due to segmentation fault) but most similar query does not work on newly added document(so this approach seems useless).