Refit existing Spark ML PipelineModel with new data - scala

I'm using Spark Structured Streaming - more or less - to train my data with a DecisionTreeRegressor.
I'd like to reuse my already fitted PipelineModel to fit again on new data.
Is it possible?
I've already tried to load back my PipelineModel, add its stages to a Pipeline, and fit that on the new data to get a new model.
val modelDirectory = "/mnt/D834B3AF34B38ECE/DEV/hadoop/model"
var model : PipelineModel = _
var newModel : PipelineModel = _
var pipeline : Pipeline = _
..........
val trainingData = //an instance of a dataframe
val testData = //an instance of a dataframe
val assembler = new VectorAssembler()
.setInputCols(Array("routeId", "stopId", "month","dayOfWeek","hour","temperature","humidity","pressure","rain","snow","visibility"))
.setOutputCol("features")
val dt = new DecisionTreeRegressor()
.setLabelCol("value")
.setFeaturesCol("features")
.setImpurity("variance")
.setMaxDepth(30)
.setMaxBins(32)
.setMinInstancesPerNode(5)
pipeline = new Pipeline()
try {
model = PipelineModel.load(modelDirectory)
pipeline.setStages(model.stages)
} catch {
case iie: InvalidInputException => {
pipeline.setStages(Array(assembler,dt))
printf(iie.getMessage)
}
case unknownError: UnknownError => {
printf(unknownError.getMessage)
}
}
newModel = pipeline.fit(trainingData)
// Make predictions.
val predictions: DataFrame = newModel.transform(testData)
// Select example rows to display.
print(s"Predictions based on ${System.currentTimeMillis()} time train: ${System.lineSeparator()}")
predictions.show(10, false)
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("value")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
You can find my full source code at: https://github.com/Hakuhun/bkk-data-process-spark/blob/master/src/main/scala/hu/oe/bakonyi/bkk/BkkDataDeserializer.scala

There is no way to refit an already fitted model in Spark 2.4.4.
For continuous-learning machine learning solutions, check the MLlib documentation. You can achieve that with StreamingLinearRegressionWithSGD, StreamingKMeans, or StreamingLogisticRegressionWithSGD.
Also, keep in mind that yours is a streaming application, so your pipeline may effectively keep learning continuously. A minimal sketch of the streaming approach follows.
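For illustration, here is a minimal sketch of that streaming approach (not from the original answer). It assumes you already have DStream[LabeledPoint] inputs called trainingStream and testStream, and 11 features to match the VectorAssembler in the question.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Assumption: trainingStream and testStream are DStream[LabeledPoint]s you build elsewhere.
val numFeatures = 11 // assumption: matches the feature vector built by the VectorAssembler above

val streamingModel = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.1)

// The model weights are updated incrementally with every micro-batch that arrives.
streamingModel.trainOn(trainingStream)
// Predict on the test stream, keeping the true label next to each prediction.
streamingModel.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()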

Related

Training/Test data with SparkML in Scala

I've been facing an issue for the past couple of hours.
In theory, when we split data into training and test sets, we should standardize the training data independently, so as not to introduce bias, and only after training the model do we standardize the test set, using the same parameter values as for the training set.
So far I've only managed to do it without the pipeline, looking like this:
val training = splitData(0)
val test = splitData(1)
val assemblerTraining = new VectorAssembler()
.setInputCols(training.columns)
.setOutputCol("features")
val standardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("normFeatures")
.setWithStd(true)
.setWithMean(true)
val scalerModel = standardScaler.fit(training)
val scaledTrainingData = scalerModel.transform(training)
val scaledTestData = scalerModel.transform(test)
How would I go about implementing this with pipelines?
My issue is that if I create a pipeline like so:
val pipelineTraining = new Pipeline()
.setStages(
Array(
assemblerTraining,
standardScaler,
lr
)
)
where lr is a LinearRegression, then there is no way to actually access the scaling model from inside the pipeline.
I've also thought of using an intermediary pipeline to do the scaling like so:
val pipelineScalingModel = new Pipeline()
.setStages(Array(assemblerTraining, standardScaler))
.fit(training)
val pipelineTraining = new Pipeline()
.setStages(Array(pipelineScalingModel,lr))
val scaledTestData = pipelineScalingModel.transform(test)
But I don't know if this is the right way of going about it.
Any suggestions would be greatly appreciated.
In case anybody else runs into this issue, this is how I proceeded:
I realized I was not allowed to modify the [forbiddenColumnName] variable. Therefore I gave up on trying to use pipelines in that phase.
I created my own standardizing function and called it for each individual feature, like so:
def standardizeColumn( dfTrain : DataFrame, dfTest : DataFrame, columnName : String) : Array[DataFrame] = {
// Row 0 holds (mean, stddev) for the column, computed on the training set only.
val withMeanStd = dfTrain.select(mean(col(columnName)), stddev(col(columnName))).collect
val auxDFTrain = dfTrain.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0))/withMeanStd(0).getDouble(1))
// The test set is standardized with the *training* mean and stddev.
val auxDFTest = dfTest.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0))/withMeanStd(0).getDouble(1))
Array(auxDFTrain, auxDFTest)
}
for (columnName <- training.columns){
if ((columnName != [forbiddenColumnName]) && (columnExists(training, columnName))){
val auxResult = standardizeColumn(training, test, columnName)
training = auxResult(0)
test = auxResult(1)
}
}
Note: my number of variables is very low (~15), therefore this is not a very lengthy process. I seriously doubt this would be the right way of going about things on much bigger datasets.
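For reference, a pipeline-based variant is also possible (this is a sketch building on the question's code, not the poster's solution): fit the whole pipeline on the training split only, and pull the fitted StandardScalerModel out of the resulting PipelineModel if you need to inspect it.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.StandardScalerModel

val pipeline = new Pipeline()
  .setStages(Array(assemblerTraining, standardScaler, lr))

// Fitting on the training split only means the scaler's mean/std come from training data.
val pipelineModel: PipelineModel = pipeline.fit(training)

// The fitted scaler is just a stage of the PipelineModel (index 1 in this stage order).
val fittedScaler = pipelineModel.stages(1).asInstanceOf[StandardScalerModel]
println(s"training mean = ${fittedScaler.mean}, training std = ${fittedScaler.std}")

// Transforming the test set reuses those same training statistics.
val scoredTest = pipelineModel.transform(test)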

Can I alter Spark Pipeline stages with some trained transformers?

I need to alter the training data after the StringIndexer step (appending an unseen feature value so that prediction on future data doesn't fail), so I need to build a pipeline from already-trained Transformers.
But I haven't found a way to do this.
sample code:
// fit by original df
val catIndexer = catFeatures.map(cname => {
new StringIndexer()
.setHandleInvalid("keep") // otherwise prediction would fail later on feature values unseen during training
.setInputCol(cname)
.setOutputCol(cname + KeyColumns.stringIndexerSuffix)
})
val indexedCatFeatures = catIndexer.map(idx => idx.getOutputCol)
val stringIndexerPipeline = new Pipeline().setStages(catIndexer)
val stringIndexerPipelineFitted = stringIndexerPipeline.fit(trainDataset) // note: trainDataset
// transform the original df with one extra row (containing the unseen feature value), to avoid unseen features at prediction time
val rdd = mdContext.get.spark.sparkContext.makeRDD(List(Row(newRow:_*)))
val newDF = mdContext.get.spark.createDataFrame(rdd, trainDataset.schema).na.fill(0)
val patchedTrainDataset = trainDataset.unionByName(newDF)
val strIndexTrainDataset = stringIndexerPipelineFitted.transform(patchedTrainDataset) // note: patchedTrainDataset
// onehot and assemble
val oneHotEncoder = new OneHotEncoderEstimator().setInputCols(indexedCatFeatures).setOutputCols(indexedCatFeatures.map(_+KeyColumns.oneHotEncoderSuffix))
.setDropLast(false)
val predictors = numFeatures ++ oneHotEncoder.getOutputCols
val assembler = new VectorAssembler().setInputCols(predictors).setOutputCol(KeyColumns.features)
val leftPipeline = new Pipeline().setStages(Array(oneHotEncoder, assembler))
// feature transformers
val transfomers = stringIndexerPipeline.asInstanceOf[PipelineModel].stages ++ leftPipeline.asInstanceOf[PipelineModel].stages
// train model
...
//
val cv = new CrossValidator()
.setEstimator(modelPipeline)
.setEvaluator(new BinaryClassificationEvaluator().setLabelCol(KeyColumns.y))
.setEstimatorParamMaps(paramGrid)
.setNumFolds(cvConfig.folders)
.setParallelism(cvConfig.parallelism)
val transformedTrainDataset = leftPipeline.fit(strIndexTrainDataset).transform(strIndexTrainDataset)
val cvModel = cv.fit(transformedTrainDataset)
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val newStages = transfomers ++ Array[SparkTransformer](bestModel.stages.last)
// !!!error can't new here
val newBestModel = new PipelineModel(bestModel.uid, newStages)
// !!!error can't new here
val newCvModel = new CrossValidatorModel(cvModel.uid, newBestModel, cvModel.avgMetrics)
Thanks for raising a great question.
According to this Q&A, we know that you can't get an instance of PipelineModel via new (its constructor isn't public). There are mainly two ways:
PipelineModel.load(file: String)
val pipelineModel = pipeline.fit(dataFrame)
Now here is the thing: you can effectively skip fit() by adding only already-trained Models as the stages of a pipeline; fitting that pipeline simply wraps them into a PipelineModel.
e.g.
// Still, add your trained models into an array
val trainedModels = cols.map(col => {
new ValueIndexerModel().setInputCol(col).setOutputCol(col + "_indexed").setLevels(level)
})
// just set your models as stages of a pipeline as usual
val pipeline = new Pipeline().setStages(trainedModels)
// fit, which is skipped for already-trained models
val pipelineModel = pipeline.fit(dataFrame)
// then you get your pipelineModel, you can transform now
val transDF = pipelineModel.transform(dataFrame)
The reason we are able to handle it like this lies in the Spark source code:
val transformers = ListBuffer.empty[Transformer]
theStages.view.zipWithIndex.foreach { case (stage, index) =>
if (index <= indexOfLastEstimator) {
val transformer = stage match {
case estimator: Estimator[_] =>
estimator.fit(curDataset)
case t: Transformer =>
t
case _ =>
throw new IllegalArgumentException(
s"Does not support stage $stage of type ${stage.getClass}")
}
if (index < indexOfLastEstimator) {
curDataset = transformer.transform(curDataset)
}
transformers += transformer
} else {
transformers += stage.asInstanceOf[Transformer]
}
}
Your trained model is a subclass of Transformer, so when you call fit, a pipeline of trained models skips all of the fitting and gives you a PipelineModel containing your trained models. Thanks to zero323 and user1269298 in the Q&A again.
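Applied to the code in the question, that could look roughly like the sketch below (same identifiers as the question; untested, and it assumes every stage is already fitted).
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Collect the already-fitted transformers plus the tuned model's final stage.
val prefitStages = (stringIndexerPipelineFitted.stages ++
  leftPipeline.fit(strIndexTrainDataset).stages :+
  bestModel.stages.last).toArray[PipelineStage]

// fit() here does no training: every stage is a Transformer, so it only wraps them.
val combinedModel = new Pipeline()
  .setStages(prefitStages)
  .fit(strIndexTrainDataset)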

SparkML - Creating a df(feature, feature_importance) of a RandomForestRegressionModel

I am training a Random Forest model in the following way:
//Indexer
val stringIndexers = categoricalColumns.map { colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Idx")
.setHandleInvalid("keep")
.fit(training)
}
//HotEncoder
val encoders = featuresEnconding.map { colName =>
new OneHotEncoderEstimator()
.setInputCols(Array(colName + "Idx"))
.setOutputCols(Array(colName + "Enc"))
.setHandleInvalid("keep")
}
//Adding features into a feature vector column
val assembler = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)
val pipelineRF = new Pipeline()
.setStages(stepsRF)
val paramGridRF = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(800))
.addGrid(rf.featureSubsetStrategy, Array("all"))
.addGrid(rf.minInfoGain, Array(0.05))
.addGrid(rf.minInstancesPerNode, Array(1))
.addGrid(rf.maxDepth, Array(28,29,30))
.addGrid(rf.numTrees, Array(20))
.build()
//Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
//Using cross validation to train the model
//Start with TrainSplit -Cross Validations taking so long so far
val cvRF = new CrossValidator()
.setEstimator(pipelineRF)
.setEvaluator(evaluatorRF)
.setEstimatorParamMaps(paramGridRF)
.setNumFolds(10)
.setParallelism(3)
//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)
What I would like now is to get the importance of each of the features in the model after the training.
I am able to get the importance of each feature as an Array[Double] doing like this:
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val size = bestModel.stages.size-1
val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray
But I only get the importance values against a numerical index; I don't know which feature name inside my model corresponds to each importance value.
I would also like to mention that since I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.
How can I extract the features names used during the training of my model?
I found this possible solution:
import org.apache.spark.ml.attribute._
import org.apache.spark.sql.functions.desc
import spark.implicits._ // needed for .toDF on the RDD below

// predictions is assumed to be a DataFrame produced by bestModel.transform(...)
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]
val schema = predictions.schema
// The VectorAssembler records per-slot metadata on the features column,
// including the expanded one-hot feature names.
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)
// ftrImp is the featureImportances array extracted in the question
val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName","Importance")
.orderBy(desc("Importance"))
display(mdf)

Spark Task not serializable (Array[Vector])

I am new to Spark, and I'm studying the "Advanced Analytics with Spark" book. The code is from the examples in the book. When I try to run the following code, I get a Spark "Task not serializable" exception.
val kMeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
val centroids: Array[Vector] = kMeansModel.clusterCenters
val clustered = pipelineModel.transform(data)
val threshold = clustered.
select("cluster", "scaledFeatureVector").as[(Int, Vector)].
map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }.
orderBy($"value".desc).take(100).last
Also, this is how I build the model:
def oneHotPipeline(inputCol: String): (Pipeline, String) = {
val indexer = new StringIndexer()
.setInputCol(inputCol)
.setOutputCol(inputCol + "_indexed")
val encoder = new OneHotEncoder()
.setInputCol(inputCol + "_indexed")
.setOutputCol(inputCol + "_vec")
val pipeline = new Pipeline()
.setStages(Array(indexer, encoder))
(pipeline, inputCol + "_vec")
}
val k = 180
val (protoTypeEncoder, protoTypeVecCol) = oneHotPipeline("protocol_type")
val (serviceEncoder, serviceVecCol) = oneHotPipeline("service")
val (flagEncoder, flagVecCol) = oneHotPipeline("flag")
// Original columns, without label / string columns, but with new vector encoded cols
val assembleCols = Set(data.columns: _*) --
Seq("label", "protocol_type", "service", "flag") ++
Seq(protoTypeVecCol, serviceVecCol, flagVecCol)
val assembler = new VectorAssembler().
setInputCols(assembleCols.toArray).
setOutputCol("featureVector")
val scaler = new StandardScaler()
.setInputCol("featureVector")
.setOutputCol("scaledFeatureVector")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("scaledFeatureVector").
setMaxIter(40).
setTol(1.0e-5)
val pipeline = new Pipeline().setStages(
Array(protoTypeEncoder, serviceEncoder, flagEncoder, assembler, scaler, kmeans))
val pipelineModel = pipeline.fit(data)
I am assuming the problem is with the line Vectors.sqdist(centroids(cluster), vec). For some reason, I cannot use centroids in my Spark calculations. I have done some Googling, and I know this error happens when "I initialize a variable on the master, but then try to use it on the workers", which in my case is centroids. However, I do not know how to address this problem.
In case you're interested, here is the entire code for this tutorial from the book, and here is the link to the dataset that the tutorial uses.
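For what it's worth, here is a general sketch of the usual workaround (not from the original thread, and not guaranteed to be the root cause here): make sure the closure captures only a local, serializable value, for example by copying the array into a local val or broadcasting it, rather than referencing a field of a non-serializable enclosing class.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import spark.implicits._ // assumes `spark` is the active SparkSession, as in the book's code

// Broadcast the centroids (Array[Vector] is serializable; broadcasting also avoids
// re-shipping them with every task) and reference only the broadcast handle in the closure.
val bcCentroids = spark.sparkContext.broadcast(centroids)

val threshold = clustered.
  select("cluster", "scaledFeatureVector").as[(Int, Vector)].
  map { case (cluster, vec) => Vectors.sqdist(bcCentroids.value(cluster), vec) }.
  orderBy($"value".desc).take(100).last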

MLlib: Saving and loading a model

I'm using LinearRegressionWithSGD and then I save the model weights and intercept.
File that contains weights has this format:
1.20455
0.1356
0.000456
The intercept is 0 since I'm training without setting the intercept, so it can be ignored for the moment. I would now like to initialize a new model object using the weights saved in the above file. We are using CDH 5.1.
Something along these lines:
// Here is the code to load the saved weights and initialize the model with them.
val weights = sc.textFile("linear-weights");
val model = new LinearRegressionWithSGD(weights);
then use it as:
// Here is where I want to use the trained model to predict on new data.
val valuesAndPreds = testData.map { point =>
// Predicting on new data.
val prediction = model.predict(point.features)
(point.label, prediction)
}
Any pointers to how do I do that?
It appears you are duplicating the training portion of the LinearRegressionWithSGD example - which takes a LibSVM file as input.
Are you certain that you want to provide your own weights - instead of allowing the library to do its job in the training phase?
If so, then you can create your own LinearRegressionWithSGD and override createModel.
Here would be your steps given you already have calculated your desired weights / performed the training your own way:
// Stick in your weights below (as an mllib.linalg Vector); 'algorithm' is your
// LinearRegressionWithSGD subclass that exposes createModel.
var model = algorithm.createModel(weights, 0.0)
// Now you can run the last steps of the 'normal' process
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
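Alternatively, here is a sketch of my own (not part of the original answer): since MLlib's LinearRegressionModel has a public constructor, you can rebuild the model directly from the saved weights file without touching the optimizer at all. The "linear-weights" file name and the 0.0 intercept come from the question; sc and testData are the question's SparkContext and test RDD.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Read the weights back in the order they were written (one value per line).
val savedWeights = sc.textFile("linear-weights").map(_.trim.toDouble).collect()
val restoredModel = new LinearRegressionModel(Vectors.dense(savedWeights), 0.0)

// Use it exactly as in the question.
val valuesAndPreds = testData.map { point =>
  (point.label, restoredModel.predict(point.features))
}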
BTW for reference here is the more 'standard' approach that includes the training steps:
val data = MLUtils.loadLibSVMFile(sc, inputFile).cache()
val splits = data.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1).cache()
val updater = params.regType match {
case NONE => new SimpleUpdater()
case L1 => new L1Updater()
case L2 => new SquaredL2Updater()
}
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
.setNumIterations(params.numIterations)
.setStepSize(params.stepSize)
.setUpdater(updater)
.setRegParam(params.regParam)
val model = algorithm.run(training)
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))