How to implement Kmeans evaluator in Spark ML - scala

I want to choose the 'k' parameter of a k-means model by picking the value that gives the lowest k-means score.
I can find the optimal value of 'k' by hand, writing something like
def clusteringScore0(data: DataFrame, k: Int): Double = {
val assembler = new VectorAssembler().
setInputCols(data.columns.filter(_ != "label")).
setOutputCol("featureVector")
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("featureVector")
val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
val kmeansModel = pipeline.fit(data).stages.last.asInstanceOf[KMeansModel]
kmeansModel.computeCost(assembler.transform(data)) / data.count()
}
(20 to 100 by 20).map(k => (k, clusteringScore0(numericOnly, k))).
foreach(println)
Should I use the CrossValidator API?
Something like this:
val paramGrid = new ParamGridBuilder().addGrid(kmeansModel.k, 20 to 100 by 20).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new KMeansEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(3)
There are Evaluators for regression and classification, but no Evaluator for clustering.
So I should implement the Evaluator interface. I am stuck on the evaluate method.
class KMeansEvaluator extends Evaluator {
override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
override def evaluate(data: Dataset[_]): Double = ??? // should I somehow adapt code from KMeansModel.computeCost()?
override val uid = Identifiable.randomUID("cost_evaluator")
}

ClusteringEvaluator is available from Spark 2.3.0. You can use it to find the optimal k by evaluating each fitted model inside your for-loop. The scikit-learn documentation has more detail on silhouette analysis. In short, the score lies in [-1, 1], and a larger score is better. I have modified the for loop from your code below.
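import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import scala.util.Random
val evaluator = new ClusteringEvaluator()
.setFeaturesCol("featureVector")
.setPredictionCol("cluster")
.setMetricName("silhouette")
val assembler = new VectorAssembler()
.setInputCols(numericOnly.columns.filter(_ != "label"))
.setOutputCol("featureVector")
val featurized = assembler.transform(numericOnly)
for (k <- 20 to 100 by 20) {
// Fit a fresh model for this k; clusteringScore0 only returns a cost, not the model
val kmeans = new KMeans().setSeed(Random.nextLong()).setK(k).setPredictionCol("cluster").setFeaturesCol("featureVector")
val kmeansModel = kmeans.fit(featurized)
val transformedDF = kmeansModel.transform(featurized)
val score = evaluator.evaluate(transformedDF)
// computeCost was deprecated in Spark 2.4 and removed in 3.0
println((k, score, kmeansModel.computeCost(featurized)))
}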
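To make the "between -1 and 1, larger is better" statement concrete, here is a minimal plain-Scala sketch of the textbook silhouette coefficient. It is not Spark's implementation: it assumes plain Euclidean distance and in-memory data, and the names SilhouetteSketch, distance and silhouette are made up for this illustration.

```scala
object SilhouetteSketch {
  type Point = Array[Double]

  // Euclidean distance (an assumption of this sketch)
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Mean silhouette over all points; labels(i) is the cluster of points(i)
  def silhouette(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.size == labels.size, "one label per point")
    require(labels.distinct.size >= 2, "need at least two clusters")
    val indexed = points.zip(labels).zipWithIndex.map { case ((p, l), i) => (i, p, l) }
    val byCluster = indexed.groupBy(_._3)
    val scores = indexed.map { case (i, p, l) =>
      val own = byCluster(l).filter(_._1 != i)
      if (own.isEmpty) 0.0 // a singleton cluster scores 0 by convention
      else {
        // a: mean distance to the other points of the same cluster
        val a = own.map(q => distance(p, q._2)).sum / own.size
        // b: smallest mean distance to the points of any other cluster
        val b = (byCluster - l).values
          .map(c => c.map(q => distance(p, q._2)).sum / c.size)
          .min
        (b - a) / math.max(a, b)
      }
    }
    scores.sum / scores.size
  }
}
```

With four 1-D points 0, 1, 10, 11, the labelling (0, 0, 1, 1) should score close to 1, while the labelling (0, 1, 0, 1) comes out negative, which is why the larger score indicates the better k.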

Related

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one column (say, the n-th) on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features and put everything into a Pipeline, finally fitting on my trainData and getting predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features
val featureEncoder = new OneHotEncoderEstimator().setInputCols(Array(sIndexer1.getOutputCol), ...).
setOutputCols(Array("encoded_feature1", ... ))
// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below
val vectorAssembler = new VectorAssembler().setInputCols(featureEncoder.getOutputCols + ???).
setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ...,featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works just as expected separately on e.g. trainData. Basically, oheEncodingTopN applies one-hot encoding to the input column, but only for the top N most frequent values (e.g. N = 50), and puts all the remaining infrequent values in a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
df.createOrReplaceTempView("data")
val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
I think the cleanest way would be to create your own class that extends the Spark ML Transformer, so that you can use it as you would any other transformer (like OneHotEncoder). Your class would look like this:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Dataset, Column}
class OHEncodingTopN(n :Int, override val uid: String) extends Transformer {
final val inputCol= new Param[String](this, "inputCol", "The input column")
final val outputCol = new Param[String](this, "outputCol", "The output column")
def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def getOutputCol: String = $(outputCol)
def this(n :Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))
def copy(extra: ParamMap): OHEncodingTopN = {
defaultCopy(extra)
}
override def transformSchema(schema: StructType): StructType = {
// Check that the input type is what you want if needed
// val idx = schema.fieldIndex($(inputCol))
// val field = schema.fields(idx)
// if (field.dataType != StringType) {
// throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
// }
// Add the return field
schema.add(StructField($(outputCol), IntegerType, false))
}
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def transform(df: Dataset[_]): DataFrame = {
df.createOrReplaceTempView("data")
val colName = $(inputCol)
val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
}
Now on an OHEncodingTopN object you should be able to call .getOutputCol to get what you want. Good luck.
EDIT: your method, which I just copy-pasted into the transform method, should be slightly modified so that it outputs a single column of type Vector with the name given in setOutputCol.
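One point worth illustrating: a transformer that recomputes the top N values on whatever dataset it receives would encode testData with testData's own frequencies. Below is a plain-Scala sketch (no Spark) of the "learn the categories once on training data, then reuse them" split, which in Spark ML terms roughly corresponds to an Estimator producing a Model. The names TopNEncoderSketch, TopNEncoder, fit and encode are hypothetical, chosen only for this illustration.

```scala
object TopNEncoderSketch {
  // Holds the categories learned from training data
  final case class TopNEncoder(categories: Vector[String]) {
    // Slot i marks categories(i); the last slot is the "default" bucket
    def encode(value: String): Array[Double] = {
      val out = Array.fill(categories.size + 1)(0.0)
      val idx = categories.indexOf(value)
      out(if (idx >= 0) idx else categories.size) = 1.0
      out
    }
  }

  // "Fit": keep the n most frequent values of the training column only
  def fit(trainValues: Seq[String], n: Int): TopNEncoder = {
    val top = trainValues
      .groupBy(identity)
      .toSeq
      .map { case (value, occurrences) => (value, occurrences.size) }
      .sortBy { case (value, count) => (-count, value) } // ties broken alphabetically
      .take(n)
      .map(_._1)
      .toVector
    TopNEncoder(top)
  }
}
```

An encoder fitted on Seq("a", "a", "a", "b", "b", "c") with n = 2 keeps "a" and "b"; "c" and any value never seen in training both land in the last ("default") slot, and the output is a single vector rather than one column per category.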

Apache Flink - Prediction Handling

I am currently working with Apache Flink's SVM-Class to predict some text data.
The class provides a predict-function which is taking a DataSet[Vector] as an input and gives me a DataSet[Prediction] as result. So far so good.
My problem is that I don't have the context of which prediction belongs to which text, and I can't pass the text through the predict() function to have it afterwards.
Code:
val tweets: DataSet[(SparseVector, String)] =
source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
.map(tweet => (featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that everything stays together? Without context the prediction does not help me.
Or maybe there is a way to predict just one vector instead of a DataSet, so that I could call the function inside the map function above.
The SVM predictor expects as input a sub type of Vector. Hence there are two options to solve this problem:
Create a sub type of Vector which contains the tweet text as a tag. It will then be passed through the predictor. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
new DenseVectorWithTag(Array(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)
class DenseVectorWithTag(override val data: Array[Double], tag: String)
extends DenseVector(data) {
override def toString: String = "(" + super.toString + ", " + tag + ")"
}
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
(DenseVector(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
.join(predictionResult)
.where(0)
.equalTo(0)
.apply((t, p) => (t._2, p._2))
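The join approach uses the vector itself as the key, so two texts that map to the same vector share one key. The plain-collections sketch below (no Flink) mimics the idea and deduplicates the vectors before predicting, so each text gets exactly one prediction. The functions vectorize and predict are hypothetical stand-ins, not Flink APIs.

```scala
object JoinPredictionSketch {
  // Hypothetical vectorizer: sum of character codes, as in the example above
  def vectorize(text: String): Vector[Double] = Vector(text.map(_.toDouble).sum)

  // Hypothetical stand-in for the trained predictor
  def predict(v: Vector[Double]): Double = if (v.head > 500.0) 1.0 else -1.0

  def predictWithText(texts: Seq[String]): Seq[(String, Double)] = {
    val vectorized = texts.map(t => (vectorize(t), t))
    // Predict once per distinct vector, keyed by the vector
    val predictionByVector = vectorized.map(_._1).distinct.map(v => v -> predict(v)).toMap
    // "Join" each text back to its prediction through the shared vector key
    vectorized.map { case (v, t) => (t, predictionByVector(v)) }
  }
}
```

In a real join without the deduplication step, inputs with identical vectors would each match every prediction row carrying that vector, so duplicates may need to be handled explicitly.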

Spark Task not serializable (Array[Vector])

I am new to Spark, and I'm studying the "Advanced Analytics with Spark" book. The code is from the examples in the book. When I try to run the following code, I get Spark Task not serializable exception.
val kMeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
val centroids: Array[Vector] = kMeansModel.clusterCenters
val clustered = pipelineModel.transform(data)
val threshold = clustered.
select("cluster", "scaledFeatureVector").as[(Int, Vector)].
map { case (cluster, vec) => Vectors.sqdist(centroids(cluster), vec) }.
orderBy($"value".desc).take(100).last
Also, this is how I build the model:
def oneHotPipeline(inputCol: String): (Pipeline, String) = {
val indexer = new StringIndexer()
.setInputCol(inputCol)
.setOutputCol(inputCol + "_indexed")
val encoder = new OneHotEncoder()
.setInputCol(inputCol + "_indexed")
.setOutputCol(inputCol + "_vec")
val pipeline = new Pipeline()
.setStages(Array(indexer, encoder))
(pipeline, inputCol + "_vec")
}
val k = 180
val (protoTypeEncoder, protoTypeVecCol) = oneHotPipeline("protocol_type")
val (serviceEncoder, serviceVecCol) = oneHotPipeline("service")
val (flagEncoder, flagVecCol) = oneHotPipeline("flag")
// Original columns, without label / string columns, but with new vector encoded cols
val assembleCols = Set(data.columns: _*) --
Seq("label", "protocol_type", "service", "flag") ++
Seq(protoTypeVecCol, serviceVecCol, flagVecCol)
val assembler = new VectorAssembler().
setInputCols(assembleCols.toArray).
setOutputCol("featureVector")
val scaler = new StandardScaler()
.setInputCol("featureVector")
.setOutputCol("scaledFeatureVector")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("scaledFeatureVector").
setMaxIter(40).
setTol(1.0e-5)
val pipeline = new Pipeline().setStages(
Array(protoTypeEncoder, serviceEncoder, flagEncoder, assembler, scaler, kmeans))
val pipelineModel = pipeline.fit(data)
I am assuming the problem is with the line Vectors.sqdist(centroids(cluster), vec). For some reason, I cannot use centroids in my Spark calculations. I have done some Googling, and I know this error happens when "I initialize a variable on the master, but then try to use it on the workers", which in my case is centroids. However, I do not know how to address this problem.
In case you are interested, here is the entire code for this tutorial in the book, and here is the link to the dataset that the tutorial uses.
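One common cause of this kind of error (an assumption here, since the full stack trace is not shown) is that the closure does not capture only centroids but the whole enclosing object, which may hold something that is not serializable. The plain-Scala sketch below reproduces that mechanism with Java serialization and no Spark; the names ClosureCaptureSketch, Analysis, capturesThis, capturesLocal and isSerializable are invented for the illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical enclosing class that is NOT Serializable
  class Analysis {
    val centroids: Array[Double] = Array(0.0, 10.0)

    // Reads the field directly, so the closure captures `this` (the whole Analysis)
    def capturesThis: Int => Double = i => centroids(i)

    // Copies the field into a local val first, so only the array is captured
    def capturesLocal: Int => Double = {
      val local = centroids
      i => local(i)
    }
  }

  // True if Java serialization accepts the object
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

If this is indeed what happens in the question, copying centroids into a local val right before the map, or wrapping it with sc.broadcast and reading .value inside the closure, are the usual things to try.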

How to deal with more than one categorical feature in a decision tree?

I read a piece of code about a binary decision tree in a book. The raw data has only one categorical feature, field(3), which is converted to one-of-k (one-hot) encoding.
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
val rawDataWithHeader = sc.textFile("data/train.tsv")
val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
val lines = rawData.map(_.split("\t"))
val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
val labelpointRDD = lines.map { fields =>
val trFields = fields.map(_.replaceAll("\"", ""))
val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
val categoryIdx = categoriesMap(fields(3))
categoryFeaturesArray(categoryIdx) = 1
val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
val label = trFields(fields.size - 1).toInt
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
}
val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
return (trainData, validationData, testData, categoriesMap)
}
I wonder how to revise the code if there are several categorical features in the raw data, let's say field(3), field(5), field(7) are all categorical features.
I revised the first line:
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) =......
Then, I converted another two fields into 1-of-k encoding as it was done like:
val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised LabeledPoint and return like:
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is it correct?
==================================================
The second problem I encountered is: the following code from that book, in the trainModel, it uses
DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
val startTime = new DateTime()
val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
(model, duration.getMillis())
}
The question is: how do I pass the categoricalFeaturesInfo into this method if it has three categorical features mentioned previously?
I just want to follow the steps in the book to build a prediction system on my own using a decision tree. To be more specific, the data set I chose has several categorical features, like:
Gender: male, female
Education: HS-grad, Bachelors, Master, PH.D, ......
Country: US, Canada, England, Australia, ......
But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot-encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models but arguably it doesn't make sense when working with decision trees.
Keeping that in mind you have two choices:
Index categorical fields without encoding and pass indexed features to categoricalFeaturesInfo.
One-hot-encode categorical features and treat these as numerical variables.
I believe that the former approach is the right approach. The latter one should work in practice but it just artificially increases dimensionality without providing any benefits. It may also be in conflict with some heuristics used by Spark implementation.
One way or another you should consider using ML Pipelines which provide all required indexing, encoding, and merging tools.
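For the first choice (index without one-hot encoding), the bookkeeping can be sketched in plain Scala as below. categoricalFeaturesInfo maps a feature's position in the vector to its number of categories, with category values expected to be 0 until that number. The helper names and the column layout (categorical indices first, numeric columns after) are assumptions made for this illustration.

```scala
object CategoricalIndexSketch {
  // One value -> index map per categorical column (sorted for determinism)
  def buildCategoryMaps(rows: Seq[Array[String]], categoricalCols: Seq[Int]): Seq[Map[String, Int]] =
    categoricalCols.map(c => rows.map(_(c)).distinct.sorted.zipWithIndex.toMap)

  // Feature layout: categorical indices first (as doubles), then numeric columns
  def toFeatures(
      row: Array[String],
      categoricalCols: Seq[Int],
      maps: Seq[Map[String, Int]],
      numericCols: Seq[Int]): Array[Double] = {
    val categorical = categoricalCols.zip(maps).map { case (c, m) => m(row(c)).toDouble }
    val numeric = numericCols.map(c => if (row(c) == "?") 0.0 else row(c).toDouble)
    (categorical ++ numeric).toArray
  }

  // Position of each categorical feature in the vector -> its number of categories
  def categoricalFeaturesInfo(maps: Seq[Map[String, Int]]): Map[Int, Int] =
    maps.zipWithIndex.map { case (m, i) => i -> m.size }.toMap
}
```

The feature array would go into Vectors.dense(...) and the resulting Map[Int, Int] would replace the empty Map[Int, Int]() in the trainClassifier call; note that maxBins has to be at least as large as the largest number of categories.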

Reduce two Scala methods, that only differ in one Object Type

I have the following two methods, using objects from Apache Spark.
def SVMModelScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
val model = SVMModel.load(sc, modelFileName)
val scoreAndLabels =
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = model.predict(point.features)
(score, point.label)
}
return scoreAndLabels
}
def DecisionTreeScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
val model = DecisionTreeModel.load(sc, modelFileName)
val scoreAndLabels =
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = model.predict(point.features)
(score, point.label)
}
return scoreAndLabels
}
My previous attempts to merge these functions have resulted in errors surrounding model.predict.
Is there a way I can use model as a parameter that is weakly typed in Scala?
Disclaimer - I've never used Apache Spark.
It looks to me like the only difference between the two methods is the way the model is instantiated. It's unfortunate that the two model instances don't actually share a common trait that provides predict(...) but we can still make this work by pulling out the part that changes - the scorer:
def scoreWith(sc: SparkContext, scoringDataset: String)(scorer: (Vector)=>Double): RDD[(Double, Double)] = {
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = scorer(point.features)
(score, point.label)
}
}
Now we can get the previous functionality with:
def svmScorer(sc: SparkContext, scoringDataset: String, modelFileName: String) =
scoreWith(sc, scoringDataset)(SVMModel.load(sc, modelFileName).predict)
def dtScorer(sc: SparkContext, scoringDataset: String, modelFileName: String) =
scoreWith(sc, scoringDataset)(DecisionTreeModel.load(sc, modelFileName).predict)
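For trying the pattern without a Spark installation, here is a self-contained sketch with two unrelated, made-up model classes (ThresholdModel and FirstFeatureModel are hypothetical and only stand in for SVMModel and DecisionTreeModel). It shows that the shared method only needs a function from features to score, not a common model type.

```scala
object ScorerSketch {
  // Two unrelated classes that both happen to expose a predict method
  class ThresholdModel(threshold: Double) {
    def predict(features: Seq[Double]): Double = if (features.sum > threshold) 1.0 else 0.0
  }
  class FirstFeatureModel {
    def predict(features: Seq[Double]): Double = if (features.head > 0.0) 1.0 else 0.0
  }

  // The part that varies (the scorer) is passed in as a function
  def scoreWith(data: Seq[(Double, Seq[Double])])(scorer: Seq[Double] => Double): Seq[(Double, Double)] =
    data.map { case (label, features) => (scorer(features), label) }
}
```

Passing `new ThresholdModel(1.0).predict` or `new FirstFeatureModel().predict` as the second argument list works the same way as the two load(...).predict calls above.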