Spark Task not serializable (Array[Vector]) - scala

I am new to Spark, and I'm studying the "Advanced Analytics with Spark" book. The code below comes from the examples in the book. When I try to run it, I get a Spark "Task not serializable" exception.
val kMeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
val centroids: Array[Vector] = kMeansModel.clusterCenters
val clustered = pipelineModel.transform(data)
val threshold = clustered.
select("cluster", "scaledFeatureVector").as[(Int, Vector)].
map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }.
orderBy($"value".desc).take(100).last
Also, this is how I build the model:
def oneHotPipeline(inputCol: String): (Pipeline, String) = {
val indexer = new StringIndexer()
.setInputCol(inputCol)
.setOutputCol(inputCol + "_indexed")
val encoder = new OneHotEncoder()
.setInputCol(inputCol + "_indexed")
.setOutputCol(inputCol + "_vec")
val pipeline = new Pipeline()
.setStages(Array(indexer, encoder))
(pipeline, inputCol + "_vec")
}
val k = 180
val (protoTypeEncoder, protoTypeVecCol) = oneHotPipeline("protocol_type")
val (serviceEncoder, serviceVecCol) = oneHotPipeline("service")
val (flagEncoder, flagVecCol) = oneHotPipeline("flag")
// Original columns, without label / string columns, but with new vector encoded cols
val assembleCols = Set(data.columns: _*) --
Seq("label", "protocol_type", "service", "flag") ++
Seq(protoTypeVecCol, serviceVecCol, flagVecCol)
val assembler = new VectorAssembler().
setInputCols(assembleCols.toArray).
setOutputCol("featureVector")
val scaler = new StandardScaler()
.setInputCol("featureVector")
.setOutputCol("scaledFeatureVector")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("scaledFeatureVector").
setMaxIter(40).
setTol(1.0e-5)
val pipeline = new Pipeline().setStages(
Array(protoTypeEncoder, serviceEncoder, flagEncoder, assembler, scaler, kmeans))
val pipelineModel = pipeline.fit(data)
I am assuming the problem is with the line Vectors.sqdist(centroids(cluster), vec). For some reason, I cannot use centroids in my Spark calculations. I have done some Googling, and I know this error happens when "I initialize a variable on the master, but then try to use it on the workers", which in my case is centroids. However, I do not know how to address this problem.
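A commonly suggested workaround for this error (a sketch, not from the book, assuming the closure is dragging in the enclosing non-serializable object and that spark is the active SparkSession) is to broadcast the centroids, or copy them into a local val, so that only the Array[Vector] itself is shipped to the executors:
// Sketch of the broadcast workaround: the closure captures the broadcast handle
// instead of whatever object `centroids` happens to be a member of.
val bcCentroids = spark.sparkContext.broadcast(centroids)
val threshold = clustered.
  select("cluster", "scaledFeatureVector").as[(Int, Vector)].
  map { case (cluster, vec) => Vectors.sqdist(bcCentroids.value(cluster), vec) }.
  orderBy($"value".desc).take(100).last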
In case you are interested, here is the entire code for this tutorial in the book, and here is the link to the dataset that the tutorial uses.

Related

Can I alter Spark Pipeline stages using already trained Transformers?

I need to alter the training data after StringIndexer (appending an unseen feature value so that prediction on future data does not fail), so I need to build a pipeline from already trained Transformers. But I haven't found a way to do this.
sample code:
// fit by original df
val catIndexer = catFeatures.map(cname => {
new StringIndexer()
.setHandleInvalid("keep") // without this, prediction would fail later on labels not seen in the training data
.setInputCol(cname)
.setOutputCol(cname + KeyColumns.stringIndexerSuffix)
})
val indexedCatFeatures = catIndexer.map(idx => idx.getOutputCol)
val stringIndexerPipeline = new Pipeline().setStages(catIndexer)
val stringIndexerPipelineFitted = stringIndexerPipeline.fit(trainDataset) // note: trainDataset
// transform original df with one new row(unseen feature), to avoid unseen feature when future prediction
val rdd = mdContext.get.spark.sparkContext.makeRDD(List(Row(newRow:_*)))
val newDF = mdContext.get.spark.createDataFrame(rdd, trainDataset.schema).na.fill(0)
val patchedTrainDataset = trainDataset.unionByName(newDF)
val strIndexTrainDataset = stringIndexerPipelineFitted.transform(patchedTrainDataset) // note: patchedTrainDataset
// onehot and assemble
val oneHotEncoder = new OneHotEncoderEstimator().setInputCols(indexedCatFeatures).setOutputCols(indexedCatFeatures.map(_+KeyColumns.oneHotEncoderSuffix))
.setDropLast(false)
val predictors = numFeatures ++ oneHotEncoder.getOutputCols
val assembler = new VectorAssembler().setInputCols(predictors).setOutputCol(KeyColumns.features)
val leftPipeline = new Pipeline().setStages(Array(oneHotEncoder, assembler))
// feature transfomers
val transfomers = stringIndexerPipeline.asInstanceOf[PipelineModel].stages ++ leftPipeline.asInstanceOf[PipelineModel].stages
// train model
...
//
val cv = new CrossValidator()
.setEstimator(modelPipeline)
.setEvaluator(new BinaryClassificationEvaluator().setLabelCol(KeyColumns.y))
.setEstimatorParamMaps(paramGrid)
.setNumFolds(cvConfig.folders)
.setParallelism(cvConfig.parallelism)
val transformedTrainDataset = leftPipeline.fit(strIndexTrainDataset).transform(strIndexTrainDataset)
val cvModel = cv.fit(transformedTrainDataset)
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val newStages = transfomers ++ Array[SparkTransformer](bestModel.stages.last)
// !!!error can't new here
val newBestModel = new PipelineModel(bestModel.uid, newStages)
// !!!error can't new here
val newCvModel = new CrossValidatorModel(cvModel.uid, newBestModel, cvModel.avgMetrics)
Thanks for raising a great question.
According to this Q&A, we know that you cannot get an instance of PipelineModel via new (its constructor is not public). There are mainly two ways to obtain one:
PipelineModel.load(file: String)
val pipelineModel = pipeline.fit(dataFrame)
Now here is the thing: you can effectively skip fit() by adding only already trained Models as stages of the pipeline, and still get a pipelineModel back.
e.g.
// Still, add your trained models into an array
val trainedModels = cols.map(col => {
new ValueIndexerModel().setInputCol(col).setOutputCol(col + "_indexed").setLevels(level)
})
// just set your models as stages of a pipeline as usual
val pipeline = new Pipeline().setStages(trainedModels)
// fit, which is skipped for already trained models
val pipelineModel = pipeline.fit(dataFrame)
// then you get your pipelineModel, you can transform now
val transDF = pipelineModel.transform(dataFrame)
The reason we are able to handle it like this can be seen in the source code of Spark's Pipeline.fit():
val transformers = ListBuffer.empty[Transformer]
theStages.view.zipWithIndex.foreach { case (stage, index) =>
  if (index <= indexOfLastEstimator) {
    val transformer = stage match {
      case estimator: Estimator[_] =>
        estimator.fit(curDataset)
      case t: Transformer =>
        t
      case _ =>
        throw new IllegalArgumentException(
          s"Does not support stage $stage of type ${stage.getClass}")
    }
    if (index < indexOfLastEstimator) {
      curDataset = transformer.transform(curDataset)
    }
    transformers += transformer
  } else {
    transformers += stage.asInstanceOf[Transformer]
  }
}
Your trained model is a subclass of Transformer, so when you call fit, a pipeline made up only of trained models skips the fitting process entirely and gives you back a pipelineModel wrapping your trained models. Thanks to zero323 and user1269298 in the linked Q&A again.
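Applying this to the code in the question, a minimal sketch (reusing transfomers, bestModel, and trainDataset from the snippet above, and assuming transfomers holds already fitted Transformers taken from fitted PipelineModels rather than from unfitted Pipelines):
// Sketch only: every stage is already a trained Transformer, so Pipeline.fit()
// performs no training and simply wraps the stages in a PipelineModel.
val finalStages = transfomers ++ Array(bestModel.stages.last)
val newBestModel: PipelineModel = new Pipeline()
  .setStages(finalStages) // older Spark versions may need an explicit Array[PipelineStage]
  .fit(trainDataset)      // no refitting happens here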

Refit existing Spark ML PipelineModel with new data

I'm using Spark Structured Streaming, more or less, to train a DecisionTreeRegressor on my data.
I'd like to reuse my already fitted PipelineModel to fit again on new data.
Is it possible?
I've already tried to load back my PipelineModel, add its stages to a pipeline, and fit the data to get a new model.
val modelDirectory = "/mnt/D834B3AF34B38ECE/DEV/hadoop/model"
var model : PipelineModel = _
var newModel : PipelineModel = _
var pipeline : Pipeline = _
..........
val trainingData = // an instance of a DataFrame
val testData = // an instance of a DataFrame
val assembler = new VectorAssembler()
.setInputCols(Array("routeId", "stopId", "month","dayOfWeek","hour","temperature","humidity","pressure","rain","snow","visibility"))
.setOutputCol("features")
val dt = new DecisionTreeRegressor()
.setLabelCol("value")
.setFeaturesCol("features")
.setImpurity("variance")
.setMaxDepth(30)
.setMaxBins(32)
.setMinInstancesPerNode(5)
pipeline = new Pipeline()
try {
model = PipelineModel.load(modelDirectory)
pipeline.setStages(model.stages)
} catch {
case iie: InvalidInputException => {
pipeline.setStages(Array(assembler,dt))
printf(iie.getMessage)
}
case unknownError: UnknownError => {
printf(unknownError.getMessage)
}
}
newModel = pipeline.fit(trainingData)
// Make predictions.
val predictions: DataFrame = model.transform(testData)
// Select example rows to display.
print(s"Predictions based on ${System.currentTimeMillis()} time train: ${System.lineSeparator()}")
predictions.show(10, false)
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("value")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
You can find my full source code in: https://github.com/Hakuhun/bkk-data-process-spark/blob/master/src/main/scala/hu/oe/bakonyi/bkk/BkkDataDeserializer.scala
There is no way to refit an already fitted model in Spark 2.4.4.
For continuous-learning machine learning solutions, check the MLlib documentation. You can achieve that with StreamingLinearRegressionWithSGD, StreamingKMeans, or StreamingLogisticRegressionWithSGD.
Also, keep in mind that this is a streaming application, so your pipeline can keep learning continuously.
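For example, a minimal sketch of the DStream-based streaming regression from the MLlib documentation (the paths, batch interval, and feature count here are hypothetical, and sc is assumed to be the SparkContext):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10)) // hypothetical batch interval
// hypothetical directories; each line is a LabeledPoint in "(label,[f1,f2,...])" format
val trainingStream = ssc.textFileStream("/tmp/train").map(LabeledPoint.parse)
val testStream = ssc.textFileStream("/tmp/test").map(LabeledPoint.parse)
val numFeatures = 11 // hypothetical; match your feature vector size
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingStream) // the model keeps updating as new batches arrive
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()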

SparkML - Creating a df(feature, feature_importance) of a RandomForestRegressionModel

I am training a Random Forest model in the following way:
//Indexer
val stringIndexers = categoricalColumns.map { colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Idx")
.setHandleInvalid("keep")
.fit(training)
}
//HotEncoder
val encoders = featuresEnconding.map { colName =>
new OneHotEncoderEstimator()
.setInputCols(Array(colName + "Idx"))
.setOutputCols(Array(colName + "Enc"))
.setHandleInvalid("keep")
}
//Adding features into a feature vector column
val assembler = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)
val pipelineRF = new Pipeline()
.setStages(stepsRF)
val paramGridRF = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(800))
.addGrid(rf.featureSubsetStrategy, Array("all"))
.addGrid(rf.minInfoGain, Array(0.05))
.addGrid(rf.minInstancesPerNode, Array(1))
.addGrid(rf.maxDepth, Array(28,29,30))
.addGrid(rf.numTrees, Array(20))
.build()
//Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
//Using cross validation to train the model
//Start with TrainSplit -Cross Validations taking so long so far
val cvRF = new CrossValidator()
.setEstimator(pipelineRF)
.setEvaluator(evaluatorRF)
.setEstimatorParamMaps(paramGridRF)
.setNumFolds(10)
.setParallelism(3)
//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)
What I would like now is to get the importance of each of the features in the model after the training.
I am able to get the importance of each feature as an Array[Double] doing like this:
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val size = bestModel.stages.size-1
val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray
But I only get the importance values against a numerical index; I don't know which feature name inside my model corresponds to each importance value.
I would also like to mention that, since I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.
How can I extract the features names used during the training of my model?
I found this possible solution:
import org.apache.spark.ml.attribute._
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]
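// `predictions` below is assumed to be the DataFrame produced by bestModel.transform(...),
// and `ftrImp` is the importance array from the snippet above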
val schema = predictions.schema
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)
val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName","Importance")
.orderBy(desc("Importance"))
display(mdf)

Overlap in the data - Spark Scala

I want to find the overlap between each pair of brands. For example, when I compare HORLICKS vs BOOST, I have to find the HORLICKS-only percentage, the BOOST-only percentage, and their intersection. It is basically a Venn-diagram problem. I have computed this for one combination, but I want to compute it for every combination, like HORLICKS vs BOOST, HORLICKS vs NESTLE, HORLICKS vs BOURNVITA, etc.
Can somebody help me? I am new to Spark.
Below is my code:
val A_storeList = sourceDf.where(col("CATEGORY").equalTo(item1(0)) and col("SUBCATEGORY").equalTo(item1(1)) and col("product_char_name").equalTo(item1(2)) and col("product_charval_dsc").equalTo(item1(3))).select("store_id").collect().map(_(0)).distinct
val B_storeList = sourceDf.where(col("CATEGORY").equalTo(item2(0)) and col("SUBCATEGORY").equalTo(item2(1)) and col("product_char_name").equalTo(item2(2)) and col("product_charval_dsc").equalTo(item2(3))).select("store_id").collect().map(_(0)).distinct
val aAndBstoreList = A_storeList.intersect(B_storeList)
val AunionB_storeList = A_storeList.union(B_storeList).distinct
val AOnly_storeList = A_storeList.diff(B_storeList)
val Bonly_storeList = B_storeList.diff(A_storeList)
val subSetOfSourceDf = sourceDf.withColumn("Versus",lit(item1(3)+"Vs"+item2(3)))
val A = subSetOfSourceDf.where(col("store_id").isin(A_storeList:_*)).withColumn("Venn",lit("A"))
val B = subSetOfSourceDf.where(col("store_id").isin(B_storeList:_*)).withColumn("Venn",lit("B"))
val AandB = subSetOfSourceDf.where(col("store_id").isin(aAndBstoreList:_*)).withColumn("product_charval_dsc",when(col("product_charval_dsc").equalTo(item1(3)),item1(3)+" and "+item2(3)).when(col("product_charval_dsc").equalTo(item2(3)),item1(3)+" and "+item2(3)).otherwise(col("product_charval_dsc"))).withColumn("Venn",lit("AintersectB"))
val AunionB = subSetOfSourceDf.where(col("store_id").isin(AunionB_storeList:_*)).withColumn("product_charval_dsc",when(col("product_charval_dsc").equalTo(item2(3)),item1(3)+" and "+item2(3)).when(col("product_charval_dsc").equalTo(item1(3)),item1(3)+" and "+item2(3)).otherwise(col("product_charval_dsc"))).withColumn("Venn",lit("AunionB"))
val AOnly = subSetOfSourceDf.where(col("store_id").isin(AOnly_storeList:_*)).withColumn("Venn",lit("AOnly"))
val BOnly = subSetOfSourceDf.where(col("store_id").isin(Bonly_storeList:_*)).withColumn("Venn",lit("BOnly"))
val allInOne = A.union(B.union(AandB.union(AunionB).union(AOnly.union(BOnly))))
val divisor = allInOne.where((col("Venn").equalTo("A").and(col("product_charval_dsc").equalTo(item1(3)))) or (col("Venn").equalTo("B").and(col("product_charval_dsc").equalTo(item2(3)))) )
.groupBy("CATEGORY","SUBCATEGORY","product_char_name").agg(sum("SALVAL") as "TOTAL")
val finalDf1 = allInOne.groupBy("CATEGORY","SUBCATEGORY","product_char_name","product_charval_dsc","Venn")
.agg(sum("SALVAL") as "SALVAL")
.where(col("product_charval_dsc").equalTo(item1(3)) or col("product_charval_dsc").equalTo(item2(3)) or col("product_charval_dsc").equalTo(item1(3)+" and "+item2(3)))
val outputDf =finalDf1.join(divisor,Seq("CATEGORY","SUBCATEGORY","product_char_name"))
.withColumn("SALE_PERCENT",col("SALVAL")/col("TOTAL") multiply(100)).withColumn("Versus",lit(item1(3)+" Vs "+item2(3)))
With this code I have generated the result for one pair (screenshot omitted), but I want to know how to do this for all combinations.
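One possible direction (a sketch, not part of the original post): wrap the pairwise logic above in a helper, collect all distinct characteristic tuples, and run it over every 2-combination. This assumes the four columns are strings and that the hypothetical computeOverlap returns the outputDf built above:
import org.apache.spark.sql.DataFrame
// hypothetical helper standing in for the single-pair code above
def computeOverlap(item1: Seq[String], item2: Seq[String]): DataFrame = {
  // ... the question's code for one pair goes here, ending with outputDf
  ???
}
val items: Seq[Seq[String]] = sourceDf
  .select("CATEGORY", "SUBCATEGORY", "product_char_name", "product_charval_dsc")
  .distinct()
  .collect()
  .map(r => Seq(r.getString(0), r.getString(1), r.getString(2), r.getString(3)))
  .toSeq
val allPairsDf: DataFrame = items
  .combinations(2)
  .map { case Seq(item1, item2) => computeOverlap(item1, item2) }
  .reduce(_ unionByName _) // union the per-pair results into one DataFrame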

Convert Scala string to RDD[Seq[String]]

// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach{
case (e, i) =>
val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
In the above code snippet, I want to extract collectedResult so it can be reused for hashingTF.transform. How can this be achieved, given that the signature of the tokenize function is:
def tokenize(content: String): Seq[String] = {
...
}
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: RDD[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines = lines.map(tokenize)
val hashes: RDD[Vector] = hashingTF.transform(tokenizedLines)
hashes.cache()
...
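From there, the rest of the question's own TF-IDF flow works unchanged on the reusable vectors:
val idf = new IDF().fit(hashes)
val tfidf: RDD[Vector] = idf.transform(hashes)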