How to choose combining strategy for MLlib's random forests - scala

Is it possible to choose the combining strategy for MLlib's random forests? I can't find any clue on the official API docs.
Here's my code:
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "entropy"
val maxDepth = 2
val maxBins = 320
val model = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
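// testData: the held-out RDD[LabeledPoint] (name assumed; the receiver is missing from the post)
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}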
I know that the predict method (implemented in the TreeEnsembleModel class) takes the combining strategy (Sum, Average or Vote) into account:
def predict(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      // TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      predictByVoting(features)
    case _ =>
      throw new IllegalArgumentException(
        "TreeEnsembleModel given unsupported (algo, combiningStrategy) combination: " +
          s"($algo, $combiningStrategy).")
  }
}

I'd say the only way to do this is to use reflection after the model has been built. That should be possible, because the field is only read when predict is called (I haven't tried to run this code, but something like this should work).
RandomForestModel model = ...;
// combiningStrategy is declared on the parent class TreeEnsembleModel, so look it up there
Field strategy = model.getClass().getSuperclass().getDeclaredField("combiningStrategy");
strategy.setAccessible(true); // the field is not public
strategy.set(model, whatever);
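To make the three strategies concrete, here is a minimal plain-Scala sketch of what Sum, Average and Vote do with per-tree predictions. The object `CombiningSketch` and its `combine` function are hypothetical stand-ins written for illustration; they are not part of the MLlib API and only approximate the behaviour of the predict method quoted in the question.

```scala
// Hypothetical illustration only: none of these names exist in Spark MLlib.
object CombiningSketch {
  sealed trait Strategy
  case object Sum extends Strategy
  case object Average extends Strategy
  case object Vote extends Strategy

  // preds(i) is the prediction of tree i, weights(i) is its weight.
  def combine(preds: Seq[Double], weights: Seq[Double], strategy: Strategy): Double = {
    require(preds.nonEmpty && preds.length == weights.length, "one weight per prediction")
    val weighted = preds.zip(weights)
    strategy match {
      // Weighted sum of the tree outputs.
      case Sum => weighted.map { case (p, w) => p * w }.sum
      // Weighted sum divided by the total weight.
      case Average => weighted.map { case (p, w) => p * w }.sum / weights.sum
      // Each tree votes for its predicted label with its weight; the heaviest label wins.
      case Vote =>
        weighted
          .groupBy(_._1)
          .map { case (label, votes) => label -> votes.map(_._2).sum }
          .maxBy(_._2)
          ._1
    }
  }
}
```

With unit weights, `CombiningSketch.combine(Seq(1.0, 0.0, 1.0), Seq(1.0, 1.0, 1.0), CombiningSketch.Vote)` picks the majority label, which is the behaviour a random forest classifier uses by default.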


Can I alter Spark Pipeline stages using already-trained transformers?

I need to alter the training data after fitting the StringIndexer (appending a row with an unseen feature value so that future predictions don't fail on values the model has never seen). So I need to build a pipeline from already-trained Transformers.
But I haven't found a way to do this.
Sample code:
// fit by original df
val catIndexer = => {
new StringIndexer()
.setHandleInvalid("keep") // would get error when future prediction if training data doesn't contain unseen feature
.setOutputCol(cname + KeyColumns.stringIndexerSuffix)
val indexedCatFeatures = catIndexer.map(idx => idx.getOutputCol)
val stringIndexerPipeline = new Pipeline().setStages(catIndexer)
val stringIndexerPipelineFitted = stringIndexerPipeline.fit(trainDataset) // note: trainDataset
// transform original df with one new row(unseen feature), to avoid unseen feature when future prediction
val rdd = mdContext.get.spark.sparkContext.makeRDD(List(Row(newRow:_*)))
val newDF = mdContext.get.spark.createDataFrame(rdd, trainDataset.schema).na.fill(0)
val patchedTrainDataset = trainDataset.unionByName(newDF)
val strIndexTrainDataset = stringIndexerPipelineFitted.transform(patchedTrainDataset) // note: patchedTrainDataset
// onehot and assemble
val oneHotEncoder = new OneHotEncoderEstimator().setInputCols(indexedCatFeatures).setOutputCols(
val predictors = numFeatures ++ oneHotEncoder.getOutputCols
val assembler = new VectorAssembler().setInputCols(predictors).setOutputCol(KeyColumns.features)
val leftPipeline = new Pipeline().setStages(Array(oneHotEncoder, assembler))
// feature transfomers
val transfomers = stringIndexerPipeline.asInstanceOf[PipelineModel].stages ++ leftPipeline.asInstanceOf[PipelineModel].stages
// train model
val cv = new CrossValidator()
.setEvaluator(new BinaryClassificationEvaluator().setLabelCol(KeyColumns.y))
val transformedTrainDataset =
val cvModel =
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val newStages = transfomers ++ Array[SparkTransformer](bestModel.stages.last)
// !!!error can't new here
val newBestModel = new PipelineModel(bestModel.uid, newStages)
// !!!error can't new here
val newCvModel = new CrossValidatorModel(cvModel.uid, newBestModel, cvModel.avgMetrics)
Thanks for raising a great question.
According to this Q&A, you can't get a PipelineModel instance by calling new, because its constructor is private[ml]. There are mainly two ways to obtain one:
PipelineModel.load(file: String)
val pipelineModel = pipeline.fit(dataFrame)
Now here is the thing: if you add only already-trained Models to the pipeline, fit() implicitly skips the fitting step and still gives you a pipelineModel.
// Still, add your trained models into a array
val trainedModels = => {
new ValueIndexerModel().setInputCol(col).setOutputCol(col + "_indexed").setLevels(level)
// just set your models as stages of a pipeline as usual
val pipeline = new Pipeline().setStages(trainedModels)
// fit, which skips the fitting step for stages that are already models
val pipelineModel = pipeline.fit(dataFrame)
// then you get your pipelineModel, you can transform now
val transDF = pipelineModel.transform(dataFrame)
This works because of how Pipeline.fit is implemented in the Spark source code:
val transformers = ListBuffer.empty[Transformer]
theStages.view.zipWithIndex.foreach { case (stage, index) =>
  if (index <= indexOfLastEstimator) {
    val transformer = stage match {
      case estimator: Estimator[_] =>
        estimator.fit(curDataset)
      case t: Transformer =>
        t
      case _ =>
        throw new IllegalArgumentException(
          s"Does not support stage $stage of type ${stage.getClass}")
    }
    if (index < indexOfLastEstimator) {
      curDataset = transformer.transform(curDataset)
    }
    transformers += transformer
  } else {
    transformers += stage.asInstanceOf[Transformer]
  }
}
Your trained model is a subclass of Transformer, so when you call fit, a pipeline made up only of trained models skips the fitting step entirely and gives you a pipelineModel containing those models. Thanks again to zero323 and user1269298 in that Q&A.
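The control flow quoted above can be imitated without Spark. The following is only a sketch under stated assumptions: `Stage`, `Transformer`, `Estimator`, `AddConst` and `ShiftToZero` are toy types invented for this illustration (operating on `Seq[Int]` instead of a DataFrame), not Spark's classes. It shows that stages which are already transformers are passed through untouched, and that only estimators are fitted.

```scala
import scala.collection.mutable.ListBuffer

// Toy stand-ins for illustration; these are NOT the org.apache.spark.ml types.
object PipelineFitSketch {
  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Seq[Int]): Seq[Int] }
  trait Estimator extends Stage { def fit(data: Seq[Int]): Transformer }

  // An already-"trained" stage: adds a constant to every element.
  final case class AddConst(c: Int) extends Transformer {
    def transform(data: Seq[Int]): Seq[Int] = data.map(_ + c)
  }

  // An estimator: learns the minimum and returns a transformer that subtracts it.
  final class ShiftToZero extends Estimator {
    var fitCalls = 0
    def fit(data: Seq[Int]): Transformer = { fitCalls += 1; AddConst(-data.min) }
  }

  // Same control flow as the quoted excerpt of Pipeline.fit.
  def fit(stages: Seq[Stage], dataset: Seq[Int]): List[Transformer] = {
    val indexOfLastEstimator = stages.lastIndexWhere(_.isInstanceOf[Estimator]) // -1 if none
    var curDataset = dataset
    val transformers = ListBuffer.empty[Transformer]
    stages.zipWithIndex.foreach { case (stage, index) =>
      if (index <= indexOfLastEstimator) {
        val transformer = stage match {
          case e: Estimator    => e.fit(curDataset)
          case tr: Transformer => tr
        }
        if (index < indexOfLastEstimator) curDataset = transformer.transform(curDataset)
        transformers += transformer
      } else {
        // No estimator left: the stage is kept exactly as it was passed in.
        transformers += stage.asInstanceOf[Transformer]
      }
    }
    transformers.toList
  }
}
```

When every stage is already a transformer, `indexOfLastEstimator` is -1, so every stage takes the `else` branch and the returned list contains exactly the objects that were passed in.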

evaluation of word2vec-cosine similarity

I used the Word2Vec algorithm to represent documents as vectors. I want to calculate the RMSE for different thresholds.
def tokenize(line: String): Seq[String] = {
.filter(token => regex.pattern.matcher(token).matches)
.filterNot(token => stopwords.contains(token))
.filterNot(token => rareTokens.contains(token))
.filter(token => token.size >=2)
val tokens = => tokenize(doc))
import org.apache.spark.mllib.feature.Word2Vec
val word2vec = new Word2Vec()
word2vec.setSeed(42) // we do this to generate the same results each time
val word2vecModel = word2vec.fit(tokens)
val synonyms = word2vecModel.findSynonyms("drama", 15)
for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
val MSE = synonyms .map { case (v, p) => math.pow((v - p), 2) }.mean()
val RMSE:Double=math.sqrt(MSE)
When computing the RMSE, this error appears:
value - is not a member of String
How can I solve it?

Spark doesn't conform to expected type TraversableOnce

val num_idf_pairs ="item", "features") => {(x(0), x(1))})
val itemRdd ="item", "features").where("item = 1") => {(x(0), x(1))})
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val sims = num_idf_pairs.flatMap {
case (key, value) =>
val sv1 = value.asInstanceOf[SV]
import breeze.linalg._
val valuesVector = new SparseVector[Double](sv1.indices, sv1.values, sv1.size) {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val xVector = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val sim = / (norm(valuesVector) * norm(xVector))
(id2.toString, key.toString, sim)
The error is: doesn't conform to expected type TraversableOnce.
When I modify it as follows:
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val docSims = num_idf_pairs.flatMap {
case (id1, idf1) =>
val idfs = b_num_idf_pairs.value.filter(_._1 != id1)
val sv1 = idf1.asInstanceOf[SV]
import breeze.linalg._
val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size) {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val cosSim =[Double] / (norm(bsv1) * norm(bsv2))
(id1.toString(), id2.toString(), cosSim)
It compiles, but it causes an OutOfMemoryException. I set --executor-memory 4G.
The first snippet:
num_idf_pairs.flatMap {
  ... { ... }
}
is not only invalid Spark code (nested transformations are not allowed) but also, as you already know, won't type check, because an RDD is not a TraversableOnce.
The second snippet likely fails because the data you are trying to collect and broadcast is too large.
It looks like you are trying to compute the similarity between all pairs of items, so you'll need a Cartesian product. Structure your code roughly like this:
num_idf_pairs
  .cartesian(num_idf_pairs)
  .filter { case ((id1, idf1), (id2, idf2)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) => {
    val cosSim = ??? // Compute similarity
    (id1.toString(), id2.toString(), cosSim)
  }}
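As a rough illustration of the "compute similarity" step, here is a plain-Scala sketch that runs the same pair-filter-map shape on a local collection. It is an assumption-laden stand-in: sparse vectors are modelled as `Map[Int, Double]` (index to value) rather than Spark or Breeze vectors, and `CosineSketch`, `cosine` and `allPairs` are hypothetical names chosen for this example.

```scala
// Hypothetical local sketch; sparse vectors are modelled as index -> value maps.
object CosineSketch {
  type Sparse = Map[Int, Double]

  // Dot product: iterate over the smaller map and look up matching indices in the larger one.
  def dot(a: Sparse, b: Sparse): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (i, v) => v * large.getOrElse(i, 0.0) }.sum
  }

  // Euclidean norm.
  def norm(a: Sparse): Double = math.sqrt(a.valuesIterator.map(v => v * v).sum)

  // Cosine similarity; defined as 0.0 here when either vector has zero norm.
  def cosine(a: Sparse, b: Sparse): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }

  // Local equivalent of cartesian + filter + map: every ordered pair of distinct ids.
  def allPairs(items: Seq[(String, Sparse)]): Seq[(String, String, Double)] =
    for {
      (id1, v1) <- items
      (id2, v2) <- items
      if id1 != id2
    } yield (id1, id2, cosine(v1, v2))
}
```

Note that the number of pairs grows quadratically with the number of items, so on a real dataset the Cartesian product itself can be expensive even though nothing is collected to the driver.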

How to deal with more than one categorical feature in a decision tree?

I read a piece of code about binary decision tree from a book. It has only one categorical feature, which is field(3), in the raw data, and is converted to one-of-k(one-hot encoding).
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
  val rawDataWithHeader = sc.textFile("data/train.tsv")
  // drop the header line, which sits in the first partition
  val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  val lines = rawData.map(_.split("\t"))
  // category value -> position in the one-hot array
  val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
  val labelpointRDD = lines.map { fields =>
    val trFields = fields.map(_.replaceAll("\"", ""))
    val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
    val categoryIdx = categoriesMap(fields(3))
    categoryFeaturesArray(categoryIdx) = 1
    val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    val label = trFields(fields.size - 1).toInt
    LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
  }
  val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
  return (trainData, validationData, testData, categoriesMap)
}
I wonder how to revise the code if there are several categorical features in the raw data, let's say field(3), field(5), field(7) are all categorical features.
I revised the first line:
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int], Map[String, Int]) =......
Then I converted the other two fields into 1-of-k encoding in the same way:
val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised LabeledPoint and return like:
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is it correct?
My second problem is with the following code from that book. In trainModel, it uses
DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
  val startTime = new DateTime()
  val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  (model, duration.getMillis())
}
The question is: how do I pass the categoricalFeaturesInfo into this method if it has three categorical features mentioned previously?
I just want to follow the steps in the book to build a prediction system of my own using a decision tree. To be more specific, the data set I chose has several categorical features, such as:
Gender: male, female
Education: HS-grad, Bachelors, Master, PH.D, ......
Country: US, Canada, England, Australia, ......
But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks like it is wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot-encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models but arguably it doesn't make sense when working with decision trees.
Keeping that in mind, you have two choices:
1. Index categorical fields without encoding and pass indexed features to categoricalFeaturesInfo.
2. One-hot-encode categorical features and treat these as numerical variables.
I believe that the former approach is the right approach. The latter one should work in practice but it just artificially increases dimensionality without providing any benefits. It may also be in conflict with some heuristics used by Spark implementation.
One way or another you should consider using ML Pipelines which provide all required indexing, encoding, and merging tools.
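The first choice (index without encoding) can be sketched in plain Scala as follows. This is a minimal illustration under assumptions: the object `CategoricalIndexSketch`, its function names, and the layout "categorical indices first, numerical values after" are all invented for this example. What is grounded is the meaning of categoricalFeaturesInfo in MLlib: an entry `n -> k` says that feature n of the vector is categorical with k categories indexed 0 to k-1.

```scala
// Hypothetical helper names; only the meaning of categoricalFeaturesInfo comes from MLlib.
object CategoricalIndexSketch {
  // One value -> index map per categorical column (indices start at 0).
  def buildIndexes(rows: Seq[Array[String]], catCols: Seq[Int]): Seq[Map[String, Int]] =
    catCols.map(c => rows.map(_(c)).distinct.zipWithIndex.toMap)

  // Feature layout assumed here: categorical indices first, numerical values after.
  def encode(row: Array[String], catCols: Seq[Int], numCols: Seq[Int],
             indexes: Seq[Map[String, Int]]): Array[Double] = {
    val cats = catCols.zip(indexes).map { case (c, idx) => idx(row(c)).toDouble }
    val nums = numCols.map(c => row(c).toDouble)
    (cats ++ nums).toArray
  }

  // Position in the feature vector -> number of categories of that feature.
  def categoricalFeaturesInfo(indexes: Seq[Map[String, Int]]): Map[Int, Int] =
    indexes.zipWithIndex.map { case (idx, pos) => pos -> idx.size }.toMap
}
```

The array returned by `encode` is what would go into `Vectors.dense(...)`, and the map returned by `categoricalFeaturesInfo` is what would replace the empty `Map[Int, Int]()` in the training call. Keep in mind that maxBins has to be at least as large as the largest number of categories.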

spark scala get uncommon map elements

I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome+"/ratings.csv").map { line=>
val fields = line.split(",")
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way?
I am trying this, but it fails:
val test = ratings.filter(!_.equals(
val test = ratings.subtract(train)
Take a look here.
Here is the code
import scala.reflect.ClassTag

def split[T : ClassTag](data: RDD[T], p: Double, seed: Long =
    System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  // one seed per partition, so every partition draws a reproducible sequence
  val partitionSeeds = data.partitions.map(_ => rand.nextLong)
  val temp = data.mapPartitionsWithIndex((index, iter) => {
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble))
  })
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
  (rdd, prob) => rdd.filter { case (k, v) => prob(k) }.map { case (k, v) => v }
val ranRating = ratings.map(x => (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can start from that point without re-executing the complete lineage.
(*) Note the use of val to define a function instead of def. vals are serializer-friendly.
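Both answers rely on the same idea: attach one random number to each element, then derive the two sets from that single draw so that they are disjoint and together cover the data. A minimal local sketch of that idea, using a plain `Seq` instead of an RDD and a hypothetical `SplitSketch.split` helper, might look like this:

```scala
// Hypothetical local illustration of the "tag once, filter twice" split.
object SplitSketch {
  def split[T](data: Seq[T], p: Double, seed: Long): (Seq[T], Seq[T]) = {
    val rand = new java.util.Random(seed)
    // Tag every element with a single random draw in [0, 1).
    val tagged = data.map(x => (x, rand.nextDouble()))
    // Both halves are derived from the same tags, so they cannot overlap.
    val (train, test) = tagged.partition(_._2 <= p)
    (train.map(_._1), test.map(_._1))
  }
}
```

Calling `SplitSketch.split(1 to 1000, 0.8, 42L)` returns two sequences whose sizes add up to 1000; the training part holds roughly, not exactly, 80% of the elements. On an RDD the tagged data has to be cached (or the random draw made deterministic per partition, as in the first answer), otherwise the two filters may see different random numbers when the lineage is recomputed.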