RDD transformations and actions can only be invoked by the driver - scala

Error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data.map(x => x.user).distinct().count()
  val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
    (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
  })
  val hitsAndMiss: RDD[(Int, Double)] = userRecs.map(x => (x._1, x._2.intersect(x._3).size.toDouble))
  val hits = hitsAndMiss.map(x => x._2).sum() / numDistinctUsers
  return hits
}
I am using a method from MatrixFactorizationModel.scala; I have to map over the users and then call the method to get the results for each user. By doing that I introduce nested mapping, which I believe causes the issue.
I know the issue actually takes place at:
val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
  (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
})
This is because, while mapping over the users, I am calling model.recommendProducts.

MatrixFactorizationModel is a distributed model so you cannot simply call it from an action or a transformation. The closest thing to what you do here is something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
def computeRatio(model: MatrixFactorizationModel, testUsers: RDD[Rating]) = {
  val testData = testUsers.map(r => (r.user, r.product)).groupByKey
  val n = testData.count
  val recommendations = model
    .recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))
  val hits = testData
    .join(recommendations)
    .values
    .map { case (xs, ys) => xs.toSet.intersect(ys.toSet).size }
    .sum
  hits / n
}
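A hedged usage sketch; model and testRatings below are hypothetical names for an already trained MatrixFactorizationModel and a held-out RDD[Rating]:
// average number of recommendation hits per distinct test user
val ratio = computeRatio(model, testRatings)
println(s"average hits per user: $ratio")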
Notes:
distinct is an expensive operation and completely superfluous here, since you can obtain the same information from the grouped data.
Instead of groupBy followed by a projection (map), project first and group later; there is no reason to shuffle full Rating objects if you want only the product ids (see the sketch below).
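For illustration, a minimal sketch of that second point, assuming the question's test_data: RDD[Rating] and the imports above:
// groups full Rating objects first, then projects: whole Ratings are shuffled
val byUserThenProject = test_data.groupBy(_.user).mapValues(_.map(_.product).toSet)
// projects to (user, product) first, then groups: only the ids are shuffled
val projectThenGroup = test_data.map(r => (r.user, r.product)).groupByKey.mapValues(_.toSet)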

Related

org.apache.spark.SparkException: This RDD lacks a SparkContext error

Complete error is:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
But I don't think I used a nested RDD transformation in my code. How can I solve it?
My Scala code:
stream.foreachRDD { rdd => {
  val nRDD = rdd.map(item => item.value())
  val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  val top = oldRDD.sortBy(item => {
    val arr = item.split(' ')
    arr(0)
  }, ascending = false).take(200)
  val topRDD = sc.makeRDD(top)
  val unionRDD = topRDD.union(nRDD)
  val validRDD = unionRDD.map(item => {
      val arr = item.split(' ')
      ((arr(1), arr(2)), arr(3).toDouble)
    })
    .reduceByKey((f, s) => {
      if (f > s) f else s
    })
    .distinct()
  val ratings = validRDD.map(item => {
    Rating(item._1._2.toInt, item._1._1.toInt, item._2)
  })
  val rank = 10
  val numIterations = 5
  val model = ALS.train(ratings, rank, numIterations, 0.01)
  nRDD.map(item => {
      val arr = item.split(' ')
      arr(2)
    }).toDS()
    .distinct()
    .foreach(item => {
      println("als recommending for user " + item)
      val recommendRes = model.recommendProducts(item.toInt, 10)
      for (elem <- recommendRes) {
        println(elem)
      }
    })
  nRDD.saveAsTextFile("hdfs://localhost:9011/recData/miniApp/mall")
}
}
The error is telling you that you're missing a SparkContext. I'm guessing that the program fails on this line:
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
The documentation provides an example of creating a SparkContext to use in this situation.
From the docs:
val stream: DStream[String] = ...
stream.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  // Do things...
}
Although you're using RDDs instead of DataFrames, the same principles should apply.
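Applied to the code above, a minimal sketch of that principle (the ctx name is mine; the path comes from the question); the key change is taking the context from the RDD handed to foreachRDD instead of closing over an outer sc:
stream.foreachRDD { rdd =>
  // take the context from the RDD itself rather than capturing an outer sc
  val ctx = rdd.sparkContext
  val nRDD = rdd.map(item => item.value())
  val oldRDD = ctx.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  // ... the rest of the processing stays the same, using ctx wherever sc was used ...
}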

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speed-up by using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)

// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count

// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})

val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

Suddenly throwing "This RDD lacks a SparkContext"; it was working before when all the code was in the main method

It was a working piece of code, but it suddenly stopped working after I tried creating the SparkSession from a different Scala object:
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
for (i <- primary_key_distinct) {
  b.foreach(println)
}
Error:
ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
It is still not working even after I reverted the change, and I am not using any separate objects.
Updated code:
object try {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").appName("50columns3nodes").getOrCreate()
    var s = spark.read.csv("/home/hadoopuser/Desktop/input/source.csv").rdd.map(_.mkString(","))
    var k = spark.read.csv("/home/hadoopuser/Desktop/input/destination.csv").rdd.map(_.mkString(","))
    val source_primary_key = s.map(rec => (rec.split(",")(0), rec))
    val destination_primary_key = k.map(rec => (rec.split(",")(0), rec))
    val a = source_primary_key.cogroup(destination_primary_key).filter { x => ((x._2._1) != (x._2._2)) }
    val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
    var extra_In_Dest = a.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty).map(rec => (rec._2._2.mkString("")))
    var extra_In_Src = a.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty).map(rec => (rec._2._1.mkString("")))
    val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
    for (i <- primary_key_distinct) {
      var lengthofarray = 0
      println(i)
      b.foreach(println)
    }
  }
}
The input data follows:
s =
1,david
2,ajay
3,jijo
4,abi
5,surendhar
k =
1,david
2,ajay
3,jijoaa
4,abisdsdd
5,surendhar
val a contains (3, (jijo, jijoaa)) and (5, (abi, abisdsdd)).
If you read the first message carefully:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
It clearly states that actions and transformations cannot be performed inside a transformation.
primary_key_distinct is a transformation done on b, and b itself is a transformation done on a. And b.foreach(println) is an action done inside the transformation of primary_key_distinct.
So if you collect b or primary_key_distinct in the driver, then the code should run properly:
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }.collect
or
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
Or, if you don't use an action inside another transformation, the code should run properly too, as in:
for (i <- 1 to 2) {
  var lengthofarray = 0
  println(i)
  b.foreach(println)
}
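Putting the second option together, a rough sketch of the corrected loop (the distinct keys are collected to the driver, so b.foreach is again launched from driver code):
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
for (i <- primary_key_distinct) {
  println(i)
  b.foreach(println) // b is still an RDD; this action now runs from the driver
}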
I hope the explanation is clear.

scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I going wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(userMap => keywords.map(userMap))
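A hypothetical usage sketch (assuming a user id 1 exists in the data), just to show how the result can be queried:
// fetch one user's predicted ratings map on the driver and print its top 5 keywords
val userOnePredictions = predictions.lookup(1)
userOnePredictions.headOption.foreach(m => println(m.toSeq.sortBy(-_._2).take(5)))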
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.

spark scala get uncommon map elements

I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome + "/ratings.csv").map { line =>
  val fields = line.split(",")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way?
I am trying this, but it fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here: http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code:
def split[T: ClassManifest](data: RDD[T], p: Double, seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  val partitionSeeds = data.partitions.map(partition => rand.nextLong)
  val temp = data.mapPartitionsWithIndex((index, iter) => {
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble))
  })
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
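A quick usage sketch, assuming the ratings RDD built earlier in the question:
// 80/20 split with a fixed seed so the result is reproducible
val (train, test) = split(ratings, 0.8, seed = 42L)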
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
  (rdd, prob) => rdd.filter { case (k, v) => prob(k) }.map { case (k, v) => v }
val ranRating = ratings.map(x => (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can be performed from that point on without incurring the execution of the complete lineage.
Note the use of val to define the function instead of def: vals are serialization-friendly.
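A small sanity check on this approach, assuming the ratings RDD from the question:
// the two segments together should account for every cached record exactly once
println(train.count() + test.count() == ratings.count())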