I am using the Spark Scala ML API and I am trying to pass a pipeline containing an ALS stage to TrainValidationSplit. The code executes, but I am unable to retrieve the best parameters afterwards. Any thoughts?
val alsPipeline = new Pipeline().setStages(Array(idIndexer , modelIndexer, als))
val paramGrid = new ParamGridBuilder().
addGrid(als.maxIter, Array(5, 10)).
addGrid(als.regParam, Array(0.01, 0.05, 0.1)).
addGrid(als.implicitPrefs).
build()
val tvs = new TrainValidationSplit().
setEstimator(alsPipeline).
setEvaluator(new RegressionEvaluator().
setMetricName("rmse").
setLabelCol("purchases").
setPredictionCol("prediction")).
setEstimatorParamMaps(paramGrid).
setTrainRatio(0.75)
val alsModel = tvs.fit(trainALS)
You can get the RMSE for each parameter combination in the grid from the fitted TrainValidationSplitModel (note it exposes validationMetrics; avgMetrics belongs to CrossValidatorModel):
alsModel.getEstimatorParamMaps.zip(alsModel.validationMetrics)
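If you only need the single best combination, a minimal sketch (assuming the RMSE metric, so lower is better):
val (bestParams, bestRmse) = alsModel.getEstimatorParamMaps
  .zip(alsModel.validationMetrics)
  .minBy(_._2)
println(s"best RMSE $bestRmse for:\n$bestParams")
alsModel.bestModel is the corresponding fitted PipelineModel.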
I've been struggling with an issue for the past couple of hours.
In theory, when we split the data into training and test sets, we should standardize the training data on its own, so as not to introduce bias, and only afterwards standardize the test set using the same "parameters" (mean and standard deviation) learned from the training set.
So far I've only managed to do it without the pipeline, looking like this:
val training = splitData(0)
val test = splitData(1)
val assemblerTraining = new VectorAssembler()
.setInputCols(training.columns)
.setOutputCol("features")
val standardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("normFeatures")
.setWithStd(true)
.setWithMean(true)
val scalerModel = standardScaler.fit(training)
val scaledTrainingData = scalerModel.transform(training)
val scaledTestData = scalerModel.transform(test)
How would I go about implementing this with pipelines?
My issue is that if I create a pipeline like so:
val pipelineTraining = new Pipeline()
  .setStages(
    Array(
      assemblerTraining,
      standardScaler,
      lr
    )
  )
where lr is a LinearRegression, then I can't see a way to actually access the fitted scaling model from inside the pipeline.
I've also thought of using an intermediary pipeline to do the scaling like so:
val pipelineScalingModel = new Pipeline()
.setStages(Array(assemblerTraining, standardScaler))
.fit(training)
val pipelineTraining = new Pipeline()
.setStages(Array(pipelineScalingModel,lr))
val scaledTestData = pipelineScalingModel.transform(test)
But I don't know if this is the right way of going about it.
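If it helps, I think I can at least pull the fitted StandardScalerModel back out of that intermediary pipeline through its stages (just a sketch, assuming the stage order above):
import org.apache.spark.ml.feature.StandardScalerModel

val fittedScaler = pipelineScalingModel
  .stages(1) // stage 0 is the VectorAssembler, stage 1 the fitted scaler
  .asInstanceOf[StandardScalerModel]
println(fittedScaler.mean) // column means learned on the training split
println(fittedScaler.std)  // standard deviations learned on the training split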
Any suggestions would be greatly appreciated.
In case anybody else runs into this issue, this is how I proceeded:
I realized I was not allowed to modify the [forbiddenColumnName] variable. Therefore I gave up on trying to use pipelines in that phase.
I created my own standardizing function and called it for each individual feature, like so:
def standardizeColumn(dfTrain: DataFrame, dfTest: DataFrame, columnName: String): Array[DataFrame] = {
  // mean and standard deviation are computed on the training data only
  val withMeanStd = dfTrain.select(mean(col(columnName)), stddev(col(columnName))).collect
  val trainMean = withMeanStd(0).getDouble(0)
  val trainStd = withMeanStd(0).getDouble(1)
  val auxDFTrain = dfTrain.withColumn(columnName, (col(columnName) - trainMean) / trainStd)
  // the test set is scaled with the statistics learned from the training set
  val auxDFTest = dfTest.withColumn(columnName, (col(columnName) - trainMean) / trainStd)
  Array(auxDFTrain, auxDFTest)
}
// training and test need to be declared as var so they can be reassigned here
for (columnName <- training.columns) {
  if ((columnName != [forbiddenColumnName]) && (columnExists(training, columnName))) {
    val auxResult = standardizeColumn(training, test, columnName)
    training = auxResult(0)
    test = auxResult(1)
  }
}
[MENTION] My number of variables is very low (~15), therefore this is not a very lengthy process. I seriously doubt this would be the right way of going about things on much bigger datasets.
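If the number of columns ever does become a bottleneck, something like this single-aggregation variant might scale better, since it computes all the statistics in one pass instead of one collect per column (just a sketch; forbiddenColumnName stands in for the redacted column above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, mean, stddev}

val colsToScale = training.columns.filter(_ != forbiddenColumnName) // same exclusion as above
val aggExprs = colsToScale.flatMap(c => Seq(mean(col(c)).as(s"${c}_mean"), stddev(col(c)).as(s"${c}_std")))
val stats = training.agg(aggExprs.head, aggExprs.tail: _*).first() // one job for all columns

def scaleAll(df: DataFrame): DataFrame =
  colsToScale.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, (col(c) - stats.getAs[Double](s"${c}_mean")) / stats.getAs[Double](s"${c}_std"))
  }

val scaledTraining = scaleAll(training)
val scaledTest = scaleAll(test)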
I'm using Spark Structured Streaming (more or less) to train a DecisionTreeRegressor on my data.
I'd like to reuse my already fitted PipelineModel to fit again on new data.
Is it possible?
I've already tried loading back my PipelineModel, adding its stages to a Pipeline, and fitting a new model on the data.
val modelDirectory = "/mnt/D834B3AF34B38ECE/DEV/hadoop/model"
var model : PipelineModel = _
var newModel : PipelineModel = _
var pipeline : Pipeline = _
..........
val trainingData = // an instance of a DataFrame
val testData = // an instance of a DataFrame
val assembler = new VectorAssembler()
.setInputCols(Array("routeId", "stopId", "month","dayOfWeek","hour","temperature","humidity","pressure","rain","snow","visibility"))
.setOutputCol("features")
val dt = new DecisionTreeRegressor()
.setLabelCol("value")
.setFeaturesCol("features")
.setImpurity("variance")
.setMaxDepth(30)
.setMaxBins(32)
.setMinInstancesPerNode(5)
pipeline = new Pipeline()
try {
  model = PipelineModel.load(modelDirectory)
  pipeline.setStages(model.stages)
} catch {
  case iie: InvalidInputException => {
    pipeline.setStages(Array(assembler, dt))
    printf(iie.getMessage)
  }
  case unknownError: UnknownError => {
    printf(unknownError.getMessage)
  }
}
newModel = pipeline.fit(trainingData)
// Make predictions.
val predictions: DataFrame = newModel.transform(testData)
// Select example rows to display.
print(s"Predictions based on ${System.currentTimeMillis()} time train: ${System.lineSeparator()}")
predictions.show(10, false)
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("value")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
You can find my full source code at: https://github.com/Hakuhun/bkk-data-process-spark/blob/master/src/main/scala/hu/oe/bakonyi/bkk/BkkDataDeserializer.scala
There is no way to refit an already fitted model in Spark 2.4.4.
For continuous-learning machine learning solutions, check the MLlib documentation. You can achieve that with StreamingLinearRegressionWithSGD, StreamingKMeans, or StreamingLogisticRegressionWithSGD.
Also, keep in mind that this is a streaming application, so your pipeline may effectively be learning continuously anyway.
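For example, a minimal sketch of the StreamingLinearRegressionWithSGD API (note it works on the older spark.mllib DStream/LabeledPoint types, not on Structured Streaming DataFrames; trainingStream and testStream below are assumed to be DStream[LabeledPoint]):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 11 // e.g. the eleven assembled feature columns from the question
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream) // the weights keep updating on every micro-batch
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()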
I am new to Spark, and I'm studying the "Advanced Analytics with Spark" book. The code is from the examples in the book. When I try to run the following code, I get Spark Task not serializable exception.
val kMeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
val centroids: Array[Vector] = kMeansModel.clusterCenters
val clustered = pipelineModel.transform(data)
val threshold = clustered.
select("cluster", "scaledFeatureVector").as[(Int, Vector)].
map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }.
orderBy($"value".desc).take(100).last
Also, this is how I build the model:
def oneHotPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val encoder = new OneHotEncoder()
    .setInputCol(inputCol + "_indexed")
    .setOutputCol(inputCol + "_vec")
  val pipeline = new Pipeline()
    .setStages(Array(indexer, encoder))
  (pipeline, inputCol + "_vec")
}
val k = 180
val (protoTypeEncoder, protoTypeVecCol) = oneHotPipeline("protocol_type")
val (serviceEncoder, serviceVecCol) = oneHotPipeline("service")
val (flagEncoder, flagVecCol) = oneHotPipeline("flag")
// Original columns, without label / string columns, but with new vector encoded cols
val assembleCols = Set(data.columns: _*) --
Seq("label", "protocol_type", "service", "flag") ++
Seq(protoTypeVecCol, serviceVecCol, flagVecCol)
val assembler = new VectorAssembler().
setInputCols(assembleCols.toArray).
setOutputCol("featureVector")
val scaler = new StandardScaler()
.setInputCol("featureVector")
.setOutputCol("scaledFeatureVector")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("scaledFeatureVector").
setMaxIter(40).
setTol(1.0e-5)
val pipeline = new Pipeline().setStages(
Array(protoTypeEncoder, serviceEncoder, flagEncoder, assembler, scaler, kmeans))
val pipelineModel = pipeline.fit(data)
I am assuming the problem is with the line Vectors.sqdist(centroids(cluster), vec). For some reason, I cannot use centroids in my Spark calculations. I have done some Googling, and I know this error happens when "I initialize a variable on the master, but then try to use it on the workers", which in my case is centroids. However, I do not know how to address this problem.
In case you're interested, here is the entire code for this tutorial from the book, and here is the link to the dataset the tutorial uses.
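From what I've read, one common workaround for "initialized on the driver, used on the workers" state is to broadcast it and reference only the broadcast handle inside the closure. A sketch of what I think that would look like here (spark being the SparkSession), though I'm not sure it addresses the root cause:
val bcCentroids = spark.sparkContext.broadcast(centroids)

val threshold = clustered.
  select("cluster", "scaledFeatureVector").as[(Int, Vector)].
  map { case (cluster, vec) => Vectors.sqdist(bcCentroids.value(cluster), vec) }.
  orderBy($"value".desc).take(100).last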
I am running Spark ML cross-validation with regParam on logistic regression as part of the paramGrid.
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.build()
val validator = new CrossValidator()
.setEstimator(estimator)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
estimator here has regParam included as part of the params.
Sample code for saving the model:
class MyModelWriter(instance: MyModel[T]) extends MLWriter {
  override protected def saveImpl(path: String): Unit = {
    new DefaultParamsWriter(instance).save(path)
    instance.model.save(new Path(path, "nameOfModel").toString)
  }
}
MyModel does include the regParam in its params:
MyModel extends HasRegParam
When I call model.save(path) this is the exception I am getting:
java.lang.IllegalArgumentException: requirement failed: ValidatorParams save requires all Params in estimatorParamMaps to apply to this ValidatorParams, its Estimator, or its Evaluator. An extraneous Param was found: logreg_2fb5fdbe5012__regParam
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.tuning.ValidatorParams$$anonfun$validateParams$1$$anonfun$apply$1.apply(ValidatorParams.scala:110)
at org.apache.spark.ml.tuning.ValidatorParams$$anonfun$validateParams$1$$anonfun$apply$1.apply(ValidatorParams.scala:109)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.ml.tuning.ValidatorParams$$anonfun$validateParams$1.apply(ValidatorParams.scala:109)
at org.apache.spark.ml.tuning.ValidatorParams$$anonfun$validateParams$1.apply(ValidatorParams.scala:108)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.tuning.ValidatorParams$.validateParams(ValidatorParams.scala:108)
at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelWriter.(CrossValidator.scala:257)
at org.apache.spark.ml.tuning.CrossValidatorModel.write(CrossValidator.scala:242)
at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:157)
at org.apache.spark.ml.tuning.CrossValidatorModel.save(CrossValidator.scala:210)
at com.criteo.lookalike.sink.Sinks$$anonfun$SavePipelineParam1$1.apply(Sinks.scala:111
The code for ValidatorParams.scala at line 105 says:
// Check to make sure all Params apply to this estimator. Throw an error if any do not.
As per this, it is making sure that each param in the estimatorParamMaps (regParam in this case) applies to the estimator or evaluator, and regParam is indeed present in MyModel above.
Can anyone please tell if my understanding is right and if yes, what could be causing this? Thanks.
I just solved this exact error.
When you add a grid, try passing a Param instance instead; when instantiating the Param, match it to the documented param type as listed in the API docs under https://spark.apache.org/docs/latest/api/scala/...
For example, in RandomForestRegressor there is numTrees: IntParam.
Therefore I build the param grid as follows...
val rf = new RandomForestRegressor()
.{set...()} // (pseudocode)
val numTrees = new IntParam(rf, "numTrees", "Number of trees to train (>= 1) (default = 20)")
// for fun/preference, I make the numTrees values grow like the area of a circle
val numTreesValues = (for (n <- 3 to 20 by 3) yield (math.Pi * math.pow(n, 2)).toInt)
val paramGrid = new ParamGridBuilder()
.addGrid(numTrees, numTreesValues)
.build()
That is, pass the estimator into the Param, then pass the Param and the values into .addGrid.
My validator then looks like this...
val cv = new CrossValidator()
.setEstimator(rf)
.setEstimatorParamMaps(paramGrid)
.{set...()}
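Once it's fitted, the winning setting can be read back off the best model; a quick sketch (trainingData is just a placeholder for whatever DataFrame you fit on):
import org.apache.spark.ml.regression.RandomForestRegressionModel

val cvModel = cv.fit(trainingData)
val bestRf = cvModel.bestModel.asInstanceOf[RandomForestRegressionModel]
println(bestRf.getNumTrees) // the numTrees value that won the grid search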
I am trying to perform a Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0))
}
But this gives me a Spark RDD[Unit], which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not an RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I used getString to convert it to an RDD[String] and then converted that to an RDD[Array[Double]]:
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))