MLlib: Saving and loading a model - Scala

I'm using LinearRegressionWithSGD and then I save the model weights and intercept.
The file that contains the weights has this format:
1.20455
0.1356
0.000456
The intercept is 0 since I am calling train without setting the intercept, so it can be ignored for the moment. I would now like to initialize a new model object using the saved weights from the above file. We are using CDH 5.1.
Something along these lines:
// Here is the code to load the saved weights and initialize the model with them.
val weights = sc.textFile("linear-weights");
val model = new LinearRegressionWithSGD(weights);
and then use it as:
// Here is where I want to use the trained model to predict on new data.
val valuesAndPreds = testData.map { point =>
// Predicting on new data.
val prediction = model.predict(point.features)
(point.label, prediction)
}
Any pointers on how I can do that?

It appears you are duplicating the training portion of the LinearRegressionWithSGD - which takes a LibSVM file as input.
Are you certain that you want to provide your own weights - instead of allowing the library to do its job in the training phase?
If so, then you can create your own LinearRegressionWithSGD and override the createModel method.
Here would be your steps given you already have calculated your desired weights / performed the training your own way:
// Stick in your weights below ..
var model = algorithm.createModel(weights, 0.0)
// Now you can run the last steps of the 'normal' process
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
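Alternatively, LinearRegressionModel has a public constructor taking a weights vector and an intercept, so the saved weights file can be turned straight back into a model. A minimal sketch, assuming Spark 1.x MLlib and the one-weight-per-line file from the question:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel
// Read one weight per line, collect to the driver, and rebuild the model with intercept 0.0.
val savedWeights = sc.textFile("linear-weights").map(_.trim.toDouble).collect()
val restoredModel = new LinearRegressionModel(Vectors.dense(savedWeights), 0.0)
// restoredModel.predict(point.features) can now be used exactly as in the question.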
BTW for reference here is the more 'standard' approach that includes the training steps:
val data = MLUtils.loadLibSVMFile(sc, inputFile).cache()
val splits = data.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1).cache()
val updater = params.regType match {
case NONE => new SimpleUpdater()
case L1 => new L1Updater()
case L2 => new SquaredL2Updater()
}
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
.setNumIterations(params.numIterations)
.setStepSize(params.stepSize)
.setUpdater(updater)
.setRegParam(params.regParam)
val model = algorithm.run(training)
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
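As an aside, newer MLlib releases (Spark 1.3+, so likely not the CDH 5.1 build mentioned in the question) can persist and restore regression models directly, which avoids hand-rolling a weights file. A minimal sketch:
import org.apache.spark.mllib.regression.LinearRegressionModel
// Persist the fitted model (weights, intercept and metadata) and load it back later.
model.save(sc, "linear-regression-model")
val restored = LinearRegressionModel.load(sc, "linear-regression-model")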

Related

Training/Test data with SparkML in Scala

I've been facing an issue for the past couple of hours.
In theory, when we split data into training and test sets, we should standardize the training data independently, so as not to introduce bias, and then standardize the test set using the same "parameter" values (mean and standard deviation) computed from the training set.
So far I've only managed to do it without the pipeline, looking like this:
val training = splitData(0)
val test = splitData(1)
val assemblerTraining = new VectorAssembler()
.setInputCols(training.columns)
.setOutputCol("features")
val standardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("normFeatures")
.setWithStd(true)
.setWithMean(true)
val scalerModel = standardScaler.fit(training)
val scaledTrainingData = scalerModel.transform(training)
val scaledTestData = scalerModel.transform(test)
How would I go about implementing this with pipelines?
My issue is that if I create a pipeline like so:
val pipelineTraining = new Pipeline()
.setStages(
Array(
assemblerTraining,
standardScaler,
lr
)
)
where lr is a LinearRegression, then there is no way to actually access the scaling model from inside the pipeline.
I've also thought of using an intermediary pipeline to do the scaling like so:
val pipelineScalingModel = new Pipeline()
.setStages(Array(assemblerTraining, standardScaler))
.fit(training)
val pipelineTraining = new Pipeline()
.setStages(Array(pipelineScalingModel,lr))
val scaledTestData = pipelineScalingModel.transform(test)
But I don't know if this is the right way of going about it.
Any suggestions would be greatly appreciated.
In case anybody else meets with this issue, this is how I proceeded:
I realized I was not allowed to modify the [forbiddenColumnName] variable. Therefore I gave up on trying to use pipelines in that phase.
I created my own standardizing function and called it for each individual feature, like so:
def standardizeColumn( dfTrain : DataFrame, dfTest : DataFrame, columnName : String) : Array[DataFrame] = {
val withMeanStd = dfTrain.select(mean(col(columnName)), stddev(col(columnName))).collect
val auxDFTrain = dfTrain.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0))/withMeanStd(0).getDouble(1))
val auxDFTest = dfTest.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0))/withMeanStd(0).getDouble(1))
Array(auxDFTrain, auxDFTest)
}
for (columnName <- training.columns){
if ((columnName != [forbiddenColumnName]) && (columnExists(training, columnName))){
val auxResult = standardizeColumn(training, test, columnName)
training = auxResult(0)
test = auxResult(1)
}
}
Mention: my number of variables is very low (~15), therefore this is not a very lengthy process. I seriously doubt this would be the right way of going about things on much bigger datasets.
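For reference, the intermediate-pipeline idea from the question also reuses the training statistics correctly, because an already-fitted PipelineModel can itself be used as a pipeline stage. A minimal sketch, assuming the assemblerTraining, standardScaler, lr, training and test values defined above (and that lr reads the scaled "normFeatures" column):
import org.apache.spark.ml.Pipeline
// Fit the scaling stages on the training split only ...
val scalingModel = new Pipeline()
  .setStages(Array(assemblerTraining, standardScaler))
  .fit(training)
// ... then reuse that fitted model as the first stage of the full pipeline;
// only lr is trained here, the scaler keeps its training-set mean/std.
val fullModel = new Pipeline()
  .setStages(Array(scalingModel, lr))
  .fit(training)
// The test split is standardized with the training mean/std before prediction.
val predictions = fullModel.transform(test)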

Can I alter spark Pipeline stages by some trained transfomers?

Because I need to alter the training data after the StringIndexer step (appending an unseen feature value so that future predictions don't fail on unseen values), I need to build a pipeline from already trained Transformers.
But I haven't found a way to do this.
Sample code:
// fit by original df
val catIndexer = catFeatures.map(cname => {
new StringIndexer()
.setHandleInvalid("keep") // would get error when future prediction if training data doesn't contain unseen feature
.setInputCol(cname)
.setOutputCol(cname + KeyColumns.stringIndexerSuffix)
})
val indexedCatFeatures = catIndexer.map(idx => idx.getOutputCol)
val stringIndexerPipeline = new Pipeline().setStages(catIndexer)
val stringIndexerPipelineFitted = stringIndexerPipeline.fit(trainDataset) // note: trainDataset
// transform the original df with one extra row (an unseen feature value) appended, to avoid unseen values at prediction time
val rdd = mdContext.get.spark.sparkContext.makeRDD(List(Row(newRow:_*)))
val newDF = mdContext.get.spark.createDataFrame(rdd, trainDataset.schema).na.fill(0)
val patchedTrainDataset = trainDataset.unionByName(newDF)
val strIndexTrainDataset = stringIndexerPipelineFitted.transform(patchedTrainDataset) // note: patchedTrainDataset
// onehot and assemble
val oneHotEncoder = new OneHotEncoderEstimator().setInputCols(indexedCatFeatures).setOutputCols(indexedCatFeatures.map(_+KeyColumns.oneHotEncoderSuffix))
.setDropLast(false)
val predictors = numFeatures ++ oneHotEncoder.getOutputCols
val assembler = new VectorAssembler().setInputCols(predictors).setOutputCol(KeyColumns.features)
val leftPipeline = new Pipeline().setStages(Array(oneHotEncoder, assembler))
// feature transfomers
val transfomers = stringIndexerPipeline.asInstanceOf[PipelineModel].stages ++ leftPipeline.asInstanceOf[PipelineModel].stages
// train model
...
//
val cv = new CrossValidator()
.setEstimator(modelPipeline)
.setEvaluator(new BinaryClassificationEvaluator().setLabelCol(KeyColumns.y))
.setEstimatorParamMaps(paramGrid)
.setNumFolds(cvConfig.folders)
.setParallelism(cvConfig.parallelism)
val transformedTrainDataset = leftPipeline.fit(strIndexTrainDataset).transform(strIndexTrainDataset)
val cvModel = cv.fit(transformedTrainDataset)
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val newStages = transfomers ++ Array[SparkTransformer](bestModel.stages.last)
// !!!error can't new here
val newBestModel = new PipelineModel(bestModel.uid, newStages)
// !!!error can't new here
val newCvModel = new CrossValidatorModel(cvModel.uid, newBestModel, cvModel.avgMetrics)
Thanks for raising a great question.
According to this Q&A, we know that you can't get an instance of PipelineModel via new (its constructor is private to the ml package). There are mainly two ways to obtain one:
PipelineModel.load(file: String)
val pipelineModel = pipeline.fit(dataFrame)
Now here is the thing: you can make fit() effectively a no-op by adding only already trained Models as pipeline stages, and still get a PipelineModel back.
e.g.
// As usual, add your trained models into an array
val trainedModels = cols.map(col => {
new ValueIndexerModel().setInputCol(col).setOutputCol(col + "_indexed").setLevels(level)
})
// just set your models as stages of a pipeline as usual
val pipeline = new Pipeline().setStages(trainedModels)
// fit, which is a no-op for already trained models
val pipelineModel = pipeline.fit(dataFrame)
// then you get your pipelineModel, you can transform now
val transDF = pipelineModel.transform(dataFrame)
The reason we are able to handle it like this can be seen in the source code of Spark's Pipeline.fit:
val transformers = ListBuffer.empty[Transformer]
theStages.view.zipWithIndex.foreach { case (stage, index) =>
if (index <= indexOfLastEstimator) {
val transformer = stage match {
case estimator: Estimator[_] =>
estimator.fit(curDataset)
case t: Transformer =>
t
case _ =>
throw new IllegalArgumentException(
s"Does not support stage $stage of type ${stage.getClass}")
}
if (index < indexOfLastEstimator) {
curDataset = transformer.transform(curDataset)
}
transformers += transformer
} else {
transformers += stage.asInstanceOf[Transformer]
}
}
Your trained models are subclasses of Transformer, so when you call fit, the pipeline skips the whole fitting process for them and returns a PipelineModel containing your trained models. Thanks again to zero323 and user1269298 in the linked Q&A.
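Applied to the question's variables, that means fitting each sub-pipeline once and then letting a final Pipeline of already-fitted stages pass straight through fit. A sketch (untested, reusing the names from the question):
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
// Every stage here is already a fitted Transformer, so the final fit() trains nothing;
// it only wraps the stages into a PipelineModel (see the source snippet above).
val fittedStages: Array[PipelineStage] =
  (stringIndexerPipelineFitted.stages ++
    leftPipeline.fit(strIndexTrainDataset).stages :+
    bestModel.stages.last).toArray[PipelineStage]
val servingModel: PipelineModel =
  new Pipeline().setStages(fittedStages).fit(strIndexTrainDataset)
// servingModel.transform(...) now applies the indexers, encoder, assembler and the best CV model.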

Refit existing Spark ML PipelineModel with new data

I'm using Spark Structured Streaming - more or less - to train a DecisionTreeRegressor on my data.
I'd like to reuse my already fitted PipelineModel to fit again on new data.
Is it possible?
I've already tried to load back my PipelineModel, add its stages to a Pipeline, and fit a new model on the new data.
val modelDirectory = "/mnt/D834B3AF34B38ECE/DEV/hadoop/model"
var model : PipelineModel = _
var newModel : PipelineModel = _
var pipeline : Pipeline = _
..........
val trainingData = // an instance of a DataFrame
val testData = // an instance of a DataFrame
val assembler = new VectorAssembler()
.setInputCols(Array("routeId", "stopId", "month","dayOfWeek","hour","temperature","humidity","pressure","rain","snow","visibility"))
.setOutputCol("features")
val dt = new DecisionTreeRegressor()
.setLabelCol("value")
.setFeaturesCol("features")
.setImpurity("variance")
.setMaxDepth(30)
.setMaxBins(32)
.setMinInstancesPerNode(5)
pipeline = new Pipeline()
try {
model = PipelineModel.load(modelDirectory)
pipeline.setStages(model.stages)
} catch {
case iie: InvalidInputException => {
pipeline.setStages(Array(assembler,dt))
printf(iie.getMessage)
}
case unknownError: UnknownError => {
printf(unknownError.getMessage)
}
}
newModel = pipeline.fit(trainingData)
// Make predictions.
val predictions: DataFrame = model.transform(testData)
// Select example rows to display.
print(s"Predictions based on ${System.currentTimeMillis()} time train: ${System.lineSeparator()}")
predictions.show(10, false)
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("value")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
You can find my full source code in: https://github.com/Hakuhun/bkk-data-process-spark/blob/master/src/main/scala/hu/oe/bakonyi/bkk/BkkDataDeserializer.scala
There is no way to refit an already fitted model in Spark 2.4.4.
For continuous-learning machine learning solutions, check the MLlib documentation. You can achieve that with StreamingLinearRegressionWithSGD, StreamingKMeans, or StreamingLogisticRegressionWithSGD.
Also, keep in mind that this is a streaming application, so your pipeline may effectively be learning continuously.
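For illustration, a minimal sketch of the DStream-based streaming regression API mentioned above (trainingStream and testStream are assumed to be DStream[LabeledPoint]s built elsewhere, and the feature count is hypothetical):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
// 11 features would match the assembler's input columns in the question; adjust as needed.
val numFeatures = 11
val streamingModel = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
// The model keeps updating as new batches arrive on the stream.
streamingModel.trainOn(trainingStream)
streamingModel.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
// (Requires a StreamingContext; ssc.start() / ssc.awaitTermination() omitted here.)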

Apache Flink - Prediction Handling

I am currently working with Apache Flink's SVM-Class to predict some text data.
The class provides a predict function which takes a DataSet[Vector] as input and gives me a DataSet[Prediction] as the result. So far so good.
My problem is that I don't know which prediction belongs to which text, and I can't pass the text through the predict() function to have it available afterwards.
Code:
val tweets: DataSet[SparseVector] =
source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
.map(tweet => featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that I have everything together? Without the context, the prediction doesn't help me.
Or maybe there is a way to predict just one vector instead of a DataSet, so that I could call the function inside the map above.
The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:
Create a subtype of Vector which carries the tweet text as a tag. It will then be looped through the predictor. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
new DenseVectorWithTag(Array(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)
class DenseVectorWithTag(override val data: Array[Double], tag: String)
extends DenseVector(data) {
override def toString: String = "(" + super.toString + ", " + tag + ")"
}
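With that in place, the overridden toString already carries the original text, so printing the result (as a quick usage check) shows each tweet next to its prediction:
// Each element is (DenseVectorWithTag, score); toString renders it as "(vector, tag)".
predictionResult.print()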
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
(DenseVector(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
.join(predictionResult)
.where(0)
.equalTo(0)
.apply((t, p) => (t._2, p._2))

How to implement Kmeans evaluator in Spark ML

I want to select a k-means model in terms of the 'k' parameter, based on the lowest k-means score.
I can find the optimal value of the 'k' parameter by hand, writing something like:
def clusteringScore0(data: DataFrame, k: Int): Double = {
val assembler = new VectorAssembler().
setInputCols(data.columns.filter(_ != "label")).
setOutputCol("featureVector")
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("featureVector")
val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
val kmeansModel = pipeline.fit(data).stages.last.asInstanceOf[KMeansModel]
kmeansModel.computeCost(assembler.transform(data)) / data.count()
}
(20 to 100 by 20).map(k => (k, clusteringScore0(numericOnly, k))).
foreach(println)
Should I use the CrossValidator API?
Something like this:
val paramGrid = new ParamGridBuilder().addGrid(kmeansModel.k, 20 to 100 by 20).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new KMeansEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(3)
There are Evaluators for regression and classification, but no Evaluator for clustering.
So I should implement the Evaluator interface. I am stuck on the evaluate method.
class KMeansEvaluator extends Evaluator {
override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
override def evaluate(data: Dataset[_]): Double = ??? // should I somehow adapt code from KMeansModel.computeCost()?
override val uid = Identifiable.randomUID("cost_evaluator")
}
Hi, ClusteringEvaluator has been available since Spark 2.3.0. You can use it to find the optimal k by including a ClusteringEvaluator in your for-loop. You can also find more detail on silhouette analysis on the scikit-learn page. In short, the score lies in [-1, 1]; the larger the score, the better. I have modified the for-loop below for your code.
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val evaluator = new ClusteringEvaluator()
.setFeaturesCol("featureVector")
.setPredictionCol("cluster")
.setMetricName("silhouette")
// Note: kmeansModel and the assembled features are local to clusteringScore0 above, so they
// have to be returned to this scope; a possible fitKMeans helper that does this is sketched below.
for (k <- 20 to 100 by 20) {
  val (kmeansModel, assembledDF) = fitKMeans(numericOnly, k)
  val transformedDF = kmeansModel.transform(assembledDF)
  val score = evaluator.evaluate(transformedDF)
  println((k, score, kmeansModel.computeCost(assembledDF)))
}
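For completeness, here is a minimal sketch of the hypothetical fitKMeans helper used above. It mirrors clusteringScore0 from the question, but returns the fitted KMeansModel and the assembled feature DataFrame instead of only the cost:
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import scala.util.Random
def fitKMeans(data: DataFrame, k: Int): (KMeansModel, DataFrame) = {
  val assembler = new VectorAssembler()
    .setInputCols(data.columns.filter(_ != "label"))
    .setOutputCol("featureVector")
  val kmeans = new KMeans()
    .setSeed(Random.nextLong())
    .setK(k)
    .setPredictionCol("cluster")
    .setFeaturesCol("featureVector")
  // Assemble once, fit the model on the assembled features, and hand both back to the caller.
  val assembledDF = assembler.transform(data)
  (kmeans.fit(assembledDF), assembledDF)
}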