I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome+"/ratings.csv").map { line=>
val fields = line.split(",")
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way?
I tried this, but it fails:
val test = ratings.filter(!_.equals(

val test = ratings.subtract(train)

Take a look here.
Here is the code
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def split[T: ClassTag](data: RDD[T], p: Double, seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  // One seed per partition, so each partition draws from its own generator.
  val partitionSeeds = data.partitions.map(partition => rand.nextLong)
  val temp = data.mapPartitionsWithIndex((index, iter) => {
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble))
  })
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}

Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
import scala.util.Random
import org.apache.spark.rdd.RDD

val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
  (rdd, prob) => rdd.filter { case (k, v) => prob(k) }.map { case (k, v) => v }
val ranRating = ratings.map(x => (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can start from that point without re-executing the complete lineage.
(*) Note the use of val to define a function instead of def. vals are serializer-friendly.
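The tag-once, filter-twice idea can be tried on a local collection without Spark. The sketch below is only an illustration under that assumption: `ratings` is a hypothetical stand-in for the RDD, and the seed is arbitrary. It also shows why reusing the same tags matters: both halves are derived from one set of random draws, so they are complementary. In Spark, that only holds while the cached tags are reused, because recomputing the lineage would draw new random numbers.

```scala
import scala.util.Random

// Hypothetical stand-in for the ratings RDD: a plain local collection.
val ratings: Vector[Int] = (1 to 1000).toVector

// Tag every element exactly once with a draw from a seeded generator.
val rng = new Random(42L)
val tagged: Vector[(Double, Int)] = ratings.map(x => (rng.nextDouble(), x))

// Both sides are derived from the same tags, so they are complementary.
val train: Vector[Int] = tagged.collect { case (k, v) if k < 0.8 => v }
val test: Vector[Int] = tagged.collect { case (k, v) if k >= 0.8 => v }
```

Every element lands in exactly one of the two halves; the 80/20 proportion is only approximate, since it depends on the random draws.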


Saddle Frame: What's the most idiomatic way to count NaN values?

I build a Saddle Frame like this, for example:
import org.saddle._
import scala.util.Random
val rowIx = Index(0 until 200)
val colIx = Index(0 until 100)
// create example having 15% of NaNs
val nanPerc = 0.15
val nanLength = math.round(nanPerc*rowIx.length*colIx.length).toInt
val nanInd = Random.shuffle(0 until rowIx.length*colIx.length).take(nanLength)
val rawMat = mat.rand(rowIx.length, colIx.length)
// contents gives a single array in row major
val rawMatContents = rawMat.contents
nanInd foreach { i => rawMatContents.update(i, Double.NaN) }
val df = Frame(rawMat, rowIx, colIx)
// now I'd like to test that the number of NaNs is correct but
// most functions for this purpose in Frame e.g. countif exclude NaNs
What's the most idiomatic (Scala, Saddle) way to count the number of NaNs?
Frame.countif is implemented as:
def countif(test: T => Boolean)(implicit ev: S2Stats): Series[CX, Int] = frame.reduce(_.countif(test))
while Vec.countif is implemented as:
def countif(test: Double => Boolean): Int = r.filterFoldLeft(t => sd.notMissing(t) && test(t))(0)((a,b) => a + 1)
We can use the same but remove test and invert the NaN check:
vec.filterFoldLeft(x => x.isNaN)(0)((a, b) => a + 1)
To run this on a Frame:
frame.reduce(_.filterFoldLeft(x => x.isNaN)(0)((a, b) => a + 1))
I found a very simple and direct way:
df.toMat.contents.count(_.isNaN)
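The counting step itself does not depend on Saddle once the contents are a flat `Array[Double]`. The sketch below assumes exactly that: `contents` is a hypothetical stand-in for `rawMat.contents`, filled the same way as in the question. It also shows a pitfall worth knowing: NaN never compares equal to anything, so an equality test finds nothing and `isNaN` is needed.

```scala
import scala.util.Random

val rows = 200
val cols = 100
val nanPerc = 0.15
val nanLength = math.round(nanPerc * rows * cols).toInt

// Hypothetical stand-in for `rawMat.contents`: a flat row-major array.
val rng = new Random(7L)
val contents: Array[Double] = Array.fill(rows * cols)(rng.nextDouble())
rng.shuffle((0 until rows * cols).toVector).take(nanLength).foreach(i => contents(i) = Double.NaN)

// NaN never equals itself, so an equality check matches nothing; use isNaN.
val wrongCount = contents.count(_ == Double.NaN)
val nanCount = contents.count(_.isNaN)
```

Here `nanCount` equals `nanLength`, while `wrongCount` stays at zero.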

Add scoped variable per row iteration in Apache Spark

I'm reading multiple HTML files into a DataFrame in Spark.
I'm converting elements of the HTML to columns in the DataFrame using a custom UDF:
val dataset = spark
.toDF("filepath", "filecontent")
.withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
.withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
def parseDocValue(cssSelectorQuery: String) =
udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call parses the HTML string again, which is redundant.
Is there a way (without using lookup tables or such) to generate one parsed Document (Jsoup.parse(html)) per row from the "filecontent" column and make it available to all withColumn calls in the DataFrame?
Or shouldn't I even try using DataFrames, and just use RDDs?
So the final answer was in fact quite simple:
Just map over the rows and create the object once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  // Treat missing or empty values as None.
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}
val dataset = spark
.wholeTextFiles(inputPath, minPartitions = 265)
.map {
case (filepath, filecontent) => {
implicit val document = Jsoup.parse(filecontent)
val customDataJson = docJson(filecontent, customJsonRegex)
biz_name = docValue(".biz-page-title"),
biz_website = docValue(".biz-website a"),
url = docValue("meta[property=og:url]", attr = Some("content")),
filename = Some(fileName(filepath)),
fileTimestamp = Some(fileTimestamp(filepath))
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
.withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
.withColumn("biz_name", col("temp")(0))
.withColumn("biz_website", col("temp")(1))
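def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    // Parse once, then run every selector against the same document.
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })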
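Both answers come down to the same pattern: parse each row once and read every field from that one parsed value. A minimal sketch of the pattern without Spark or Jsoup is shown here. Everything in it is hypothetical: `parse` stands in for an expensive parser such as Jsoup.parse, `Doc` and `Biz` are made-up types, and the `parseCalls` counter exists only to make the number of parses visible.

```scala
// Hypothetical stand-in for an expensive parser such as Jsoup.parse;
// the counter only exists to show how often parsing happens.
var parseCalls = 0
final case class Doc(fields: Map[String, String])
def parse(raw: String): Doc = {
  parseCalls += 1
  Doc(raw.split(";").map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap)
}

final case class Biz(name: Option[String], website: Option[String])

val rows = Seq("name=Acme;site=acme.example", "name=Bolt")

// Parse each row once, then read every field from the same parsed value.
val parsedOnce: Seq[Biz] = rows.map { raw =>
  val doc = parse(raw)
  Biz(doc.fields.get("name"), doc.fields.get("site"))
}
```

The parser runs once per row regardless of how many fields are extracted, which is the saving over one parse per column.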

How to choose combining strategy for MLlib's random forests

Is it possible to choose the combining strategy for MLlib's random forests? I can't find any clue on the official API docs.
Here's my code:
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "entropy"
val maxDepth = 2
val maxBins = 320
val model = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val predictionAndLabels = { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
I know that the predict method (implemented in the TreeEnsembleModel class) takes the combining strategy (Sum, Average or Vote) into account:
def predict(features: Vector): Double = {
  (algo, combiningStrategy) match {
    case (Regression, Sum) =>
      predictBySumming(features)
    case (Regression, Average) =>
      predictBySumming(features) / sumWeights
    case (Classification, Sum) => // binary classification
      val prediction = predictBySumming(features)
      // TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
      if (prediction > 0.0) 1.0 else 0.0
    case (Classification, Vote) =>
      predictByVoting(features)
    case _ =>
      throw new IllegalArgumentException(
        "TreeEnsembleModel given unsupported (algo, combiningStrategy) combination: " +
          s"($algo, $combiningStrategy).")
  }
}
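I'd say the only way to do it is to use reflection after the model has been built. That has to be possible, because the field is only read when predict is called (I haven't tried to run this code, but something like this should work).
import java.lang.reflect.Field;

RandomForestModel model = ...;
// combiningStrategy is declared on the parent TreeEnsembleModel, not on RandomForestModel itself
Class<?> c = model.getClass().getSuperclass();
Field strategy = c.getDeclaredField("combiningStrategy");
strategy.setAccessible(true); // the field is not public
strategy.set(model, whatever);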
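The reflection technique itself can be checked in isolation. The sketch below does not touch MLlib: `Ensemble` and `Forest` are hypothetical classes that only mirror the shape described above (a value held by a parent class with no setter), and `findField` is a made-up helper. It illustrates two details that are easy to miss: `getDeclaredField` only sees fields declared on the exact class it is called on, and a non-public field needs `setAccessible(true)` before it can be written.

```scala
import java.lang.reflect.Field

// Hypothetical classes that mirror the shape described above:
// the value lives on a parent class and has no public setter.
class Ensemble(protected val strategy: String) {
  def describe: String = s"combining by $strategy"
}
class Forest extends Ensemble("vote")

// Walk up the class hierarchy until some class declares the field.
def findField(cls: Class[_], name: String): Option[Field] =
  if (cls == null) None
  else cls.getDeclaredFields.find(_.getName == name) match {
    case found @ Some(_) => found
    case None => findField(cls.getSuperclass, name)
  }

val model = new Forest
val field = findField(model.getClass, "strategy").get
field.setAccessible(true) // required because the field is private in bytecode
field.set(model, "average")
```

After the `set` call, `model.describe` reports the new value. Whether a real MLlib model tolerates such a change is a separate question that this sketch does not answer.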

RDD transformations and actions can only be invoked by the driver

org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data.map(x => x.user).distinct().count()
  val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
    (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
  })
  val hitsAndMiss: RDD[(Int, Double)] = userRecs.map(x => (x._1, x._2.intersect(x._3).size.toDouble))
  val hits = hitsAndMiss.map(x => x._2).sum() / numDistinctUsers
  return hits
}
I am using the recommendProducts method in MatrixFactorizationModel.scala. I have to map over the users and then call the method to get the results for each user. By doing that I introduce nested mapping, which I believe causes the issue.
I know that the issue actually takes place at:
val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
  (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
})
Because while mapping over the users I am calling model.recommendProducts.
MatrixFactorizationModel is a distributed model so you cannot simply call it from an action or a transformation. The closest thing to what you do here is something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

def computeRatio(model: MatrixFactorizationModel, testUsers: RDD[Rating]) = {
  val testData = testUsers.map(r => (r.user, r.product)).groupByKey
  val n = testData.count
  // Top 20 recommendations for every user, computed as a distributed operation.
  val recommendations = model
    .recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))
  val hits = testData
    .join(recommendations)
    .map { case (_, (xs, ys)) => xs.toSet.intersect(ys.toSet).size }
    .sum
  hits / n
}
distinct is an expensive operation and completely unnecessary here, since you can obtain the same information from the grouped data.
Instead of groupBy followed by a projection (map), project first and group later. There is no reason to transfer full ratings if you only want the product ids.
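The arithmetic behind the ratio is easier to follow on a small local example. The sketch below uses plain collections and made-up data: `testPairs` and `recommended` are hypothetical stand-ins for the projected test ratings and the per-user recommendations. It follows the same steps as the function above: group after projecting, join on the user, count the products present on both sides, and divide by the number of users.

```scala
// Hypothetical test ratings as (user, product) pairs, already projected.
val testPairs = Seq((1, 10), (1, 11), (2, 20), (3, 30), (3, 31))

// Hypothetical top-N recommendations per user (stand-in for the model output).
val recommended: Map[Int, Seq[Int]] = Map(
  1 -> Seq(10, 12),
  2 -> Seq(21, 22),
  3 -> Seq(30, 31)
)

// Group after projecting; the number of groups is the number of distinct users.
val byUser: Map[Int, Set[Int]] =
  testPairs.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2).toSet }
val n = byUser.size

// Join on user, count products present on both sides, then average over users.
val hits = byUser.toSeq.collect {
  case (user, seen) if recommended.contains(user) => seen.intersect(recommended(user).toSet).size
}.sum
val ratio = hits.toDouble / n
```

With this data, users 1, 2 and 3 contribute 1, 0 and 2 hits, so `ratio` is 3 / 3 = 1.0. The group count doubles as the distinct-user count, which is why no separate distinct pass is needed.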

Scala Spark: Split collection into several RDD?

Is there any Spark function that can split a collection into several RDDs according to some criteria? Such a function would avoid excessive iteration. For example:
def main(args: Array[String]): Unit = {
  val logFile = "file.txt"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()
  val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
  val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}
In this example I have to iterate over `logData` twice just to write the results to two separate files:
val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
It would be nice instead to have something like this:
val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
resultMap.writeByKey("a", "linesA.txt")
resultMap.writeByKey("b", "linesB.txt")
Any such thing?
Maybe something like this would work:
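import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def singlePassMultiFilter[T: ClassTag](
    rdd: RDD[T],
    f1: T => Boolean,
    f2: T => Boolean,
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
  // One pass per partition: collect the matches of both filters side by side.
  val tempRDD = rdd mapPartitions { iter =>
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    Iterator.single((abuf1, abuf2))
  }
  tempRDD.persist(level)
  val rdd1 = tempRDD.flatMap(_._1)
  val rdd2 = tempRDD.flatMap(_._2)
  (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}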
Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1), since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, going to be pretty negligible.
You would use singlePassMultiFilter like so:
val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist() //I'm going to need `rdd1` more later...
cleanUp(true) //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
Have a look at the following question.
Write to multiple outputs by key Spark - one Spark job
You can flatMap an RDD with a function like the following and then do a groupBy on the key.
def multiFilter(words:List[String], line:String) = for { word <- words; if line.contains(word) } yield { (word,line) }
val filterWords = List("a","b")
val filteredRDD = logData.flatMap( line => multiFilter(filterWords, line) )
val groupedRDD = filteredRDD.groupBy(_._1)
But depending on the size of your input RDD you may or may not see any performance gain, because any groupBy operation involves a shuffle.
On the other hand, if you have enough memory in your Spark cluster, you can cache the input RDD, so running multiple filter operations may not be as expensive as you think.
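The flatMap-then-group approach behaves the same way on a local collection, which makes its semantics easy to inspect. In the sketch below, `lines` is a hypothetical stand-in for `logData`, and `multiFilter` is the function from the answer above with an explicit return type added.

```scala
def multiFilter(words: List[String], line: String): List[(String, String)] =
  for { word <- words; if line.contains(word) } yield (word, line)

// Hypothetical log lines standing in for `logData`.
val lines = Seq("alpha", "bob", "cccc", "abc")
val filterWords = List("a", "b")

// Key each line by every word it contains, then collect the lines per key.
val grouped: Map[String, Seq[String]] =
  lines
    .flatMap(line => multiFilter(filterWords, line))
    .groupBy(_._1)
    .map { case (word, pairs) => word -> pairs.map(_._2) }
```

Two properties follow from the flatMap: a line that matches several words ("abc") appears under each matching key, and a line that matches none ("cccc") is dropped entirely. That matches what two independent filters would produce.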