org.apache.spark.SparkException: This RDD lacks a SparkContext error - scala

The complete error is:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
but I don't think I used any nested RDD transformations in my code.
How can I solve it?
My Scala code:
stream.foreachRDD { rdd => {
  val nRDD = rdd.map(item => item.value())
  val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  val top = oldRDD.sortBy(item => {
    val arr = item.split(' ')
    arr(0)
  }, ascending = false).take(200)
  val topRDD = sc.makeRDD(top)
  val unionRDD = topRDD.union(nRDD)
  val validRDD = unionRDD.map(item => {
      val arr = item.split(' ')
      ((arr(1), arr(2)), arr(3).toDouble)
    })
    .reduceByKey((f, s) => {
      if (f > s) f else s
    })
    .distinct()
  val ratings = validRDD.map(item => {
    Rating(item._1._2.toInt, item._1._1.toInt, item._2)
  })
  val rank = 10
  val numIterations = 5
  val model = ALS.train(ratings, rank, numIterations, 0.01)
  nRDD.map(item => {
      val arr = item.split(' ')
      arr(2)
    }).toDS()
    .distinct()
    .foreach(item => {
      println("als recommending for user " + item)
      val recommendRes = model.recommendProducts(item.toInt, 10)
      for (elem <- recommendRes) {
        println(elem)
      }
    })
  nRDD.saveAsTextFile("hdfs://localhost:9011/recData/miniApp/mall")
}
}

The error is telling you that you're missing a SparkContext. I'm guessing that the program fails on this line:
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
The documentation provides an example of obtaining a usable context in this situation.
From the docs:
val stream: DStream[String] = ...
stream.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  // Do things...
}
Although you're using RDDs instead of DataFrames, the same principles should apply.
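Applied to the code above, a minimal sketch of the same idea (untested) is to take the context from the RDD that foreachRDD hands you, instead of closing over the outer sc:
stream.foreachRDD { rdd =>
  // Use the SparkContext attached to this batch's RDD rather than the outer `sc`,
  // so the reference is still valid after checkpoint recovery (see SPARK-13758).
  val context = rdd.sparkContext
  val nRDD = rdd.map(item => item.value())
  val oldRDD = context.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  val top = oldRDD.sortBy(item => item.split(' ')(0), ascending = false).take(200)
  val topRDD = context.makeRDD(top)
  // ... the rest of the logic stays the same ...
}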

Related

Suddenly throwing "This RDD lacks a SparkContext"; it was working before, when all the code was in the main method

It was a working piece of code, but it suddenly stopped working after I tried creating the SparkSession from a different Scala object.
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
for (i <- primary_key_distinct) {
  b.foreach(println)
}
Error:
ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
It's still not working even after I reverted the change, and I'm no longer using any separate objects.
Updated code:
object try {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").appName("50columns3nodes").getOrCreate()
    var s = spark.read.csv("/home/hadoopuser/Desktop/input/source.csv").rdd.map(_.mkString(","))
    var k = spark.read.csv("/home/hadoopuser/Desktop/input/destination.csv").rdd.map(_.mkString(","))
    val source_primary_key = s.map(rec => (rec.split(",")(0), rec))
    val destination_primary_key = k.map(rec => (rec.split(",")(0), rec))
    val a = source_primary_key.cogroup(destination_primary_key).filter { x => ((x._2._1) != (x._2._2)) }
    val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
    var extra_In_Dest = a.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty).map(rec => (rec._2._2.mkString("")))
    var extra_In_Src = a.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty).map(rec => (rec._2._1.mkString("")))
    val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
    for (i <- primary_key_distinct) {
      var lengthofarray = 0
      println(i)
      b.foreach(println)
    }
  }
}
Input data:
s:
1,david
2,ajay
3,jijo
4,abi
5,surendhar
k:
1,david
2,ajay
3,jijoaa
4,abisdsdd
5,surendhar
val a contains {3,(jijo,jijoaa)} and {4,(abi,abisdsdd)}
If you read the first message carefully:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
it clearly states that actions and transformations cannot be performed inside another transformation.
primary_key_distinct is a transformation on b, and b itself is a transformation on a. b.foreach(println) is an action executed inside the iteration over primary_key_distinct, which is what triggers the error.
So if you collect b or primary_key_distinct to the driver, the code runs properly:
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }.collect
or
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
or, if you don't use an action inside another transformation, the code runs properly too, as in:
for (i <- 1 to 2) {
  var lengthofarray = 0
  println(i)
  b.foreach(println)
}
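For example, with the second option the loop becomes plain driver-side Scala (a sketch of the collected variant):
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
for (i <- primary_key_distinct) {  // an Array[String] on the driver, no longer an RDD
  println(i)
  b.foreach(println)               // b is still an RDD, but this action is now invoked from the driver
}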
I hope the explanation is clear.

ssc.sparkContext.parallelize in foreachPartition raises Task not serializable exception

In foreachPartition I want to turn a list of Strings into an RDD, so I use ssc.sparkContext.parallelize, where ssc is a StreamingContext. The code looks like this:
val ssc = new StreamingContext(sparkConf, Seconds(5))
...
words.foreachRDD(rdd => rdd.foreachPartition { partitionOfRecords => {
  var sourcelist = List[String]()
  partitionOfRecords.foreach(x => { sourcelist = sourcelist.+:("source" + x) })
  if (sourcelist.length > 0) {
    sourcelist.foreach { x => println(x) }
  } else {
    println("aa---------------------none")
  }
  // change the sourcelist to an RDD and convert it to a DStream
  val ssd = ssc.sparkContext.parallelize(sourcelist)
  val resultInputStream = ssc.queueStream(scala.collection.mutable.Queue(ssd))
  val results = resultInputStream.map(x => x)
  results.print()
}})
But the code raises SparkException: Task not serializable. I really don't know how to deal with this. I'd appreciate your help!

RDD transformations and actions can only be invoked by the driver

Error:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data.map(x => x.user).distinct().count()
  val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
    (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
  })
  val hitsAndMiss: RDD[(Int, Double)] = userRecs.map(x => (x._1, x._2.intersect(x._3).size.toDouble))
  val hits = hitsAndMiss.map(x => x._2).sum() / numDistinctUsers
  return hits
}
I am using this method from MatrixFactorizationModel.scala: I have to map over the users and then call the method to get the results for each user. By doing that I introduce nested mapping, which I believe causes the issue.
I know the issue actually takes place at:
val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
  (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
})
because while mapping over the users I am calling model.recommendProducts.
MatrixFactorizationModel is a distributed model, so you cannot simply call it from an action or a transformation. The closest thing to what you're doing here is something like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

def computeRatio(model: MatrixFactorizationModel, testUsers: RDD[Rating]) = {
  val testData = testUsers.map(r => (r.user, r.product)).groupByKey
  val n = testData.count

  val recommendations = model
    .recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  val hits = testData
    .join(recommendations)
    .values
    .map { case (xs, ys) => xs.toSet.intersect(ys.toSet).size }
    .sum

  hits / n
}
Notes:
distinct is an expensive operation and completely unnecessary here, since you can obtain the same information from the grouped data.
Instead of groupBy followed by a projection (map), project first and group later. There is no reason to transfer the full ratings if you only want the product ids.
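For context, a minimal usage sketch of this function; the split ratio and ALS parameters below are placeholders, not values from the question:
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

val ratings: RDD[Rating] = ...  // your full ratings data
val Array(trainData, testData) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = ALS.train(trainData, 10, 5, 0.01)  // rank = 10, 5 iterations, lambda = 0.01
println(s"average hits per user: ${computeRatio(model, testData)}")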

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
  // execute myFunction()
  (new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray) {
  newRDD = sc.union(newRDD, (new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
  (new MyClass(word)).myFunction().collect()
}
You cannot create an RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input with one letter removed, into another function modifiedFunction: String => Seq[String] so that it can be used from within an RDD. That way it will also be executed in parallel on your cluster. With modifiedFunction you can obtain the final RDD with all words simply by calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)

  val input = sc.parallelize(Seq("apple", "ananas", "banana"))

  // RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
  val result = input.flatMap(modifiedFunction)
}

def modifiedFunction(word: String): Seq[String] = {
  word.indices map {
    index => word.substring(0, index) + word.substring(index + 1)
  }
}

Scala Spark: Split collection into several RDD?

Is there any Spark function that allows splitting a collection into several RDDs according to some criteria? Such a function would avoid excessive iteration. For example:
def main(args: Array[String]) {
  val logFile = "file.txt"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()
  val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
  val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}
In this example I have to iterate over logData twice just to write the results to two separate files:
val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
It would be nice instead to have something like this:
val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else ("-", line))
resultMap.writeByKey("a", "linesA.txt")
resultMap.writeByKey("b", "linesB.txt")
Any such thing?
Maybe something like this would work:
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def singlePassMultiFilter[T: ClassTag](
    rdd: RDD[T],
    f1: T => Boolean,
    f2: T => Boolean,
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
  val tempRDD = rdd mapPartitions { iter =>
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    Iterator.single((abuf1, abuf2))
  }
  tempRDD.persist(level)

  val rdd1 = tempRDD.flatMap(_._1)
  val rdd2 = tempRDD.flatMap(_._2)

  (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}
Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1), since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, pretty negligible.
You would use singlePassMultiFilter like so:
val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist() //I'm going to need `rdd1` more later...
println(rdd1.count)
println(rdd2.count)
cleanUp(true) //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
println(rdd1.distinct.count)
Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
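For instance, one possible generalization to a sequence of predicates could look like the sketch below (untested; the name singlePassMultiFilterN is mine):
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def singlePassMultiFilterN[T: ClassTag](
    rdd: RDD[T],
    predicates: Seq[T => Boolean],
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (Seq[RDD[T]], Boolean => Unit) = {
  // One pass over each partition, bucketing every element by the predicates it satisfies.
  val tempRDD = rdd.mapPartitions { iter =>
    val buffers = Array.fill(predicates.length)(ArrayBuffer.empty[T])
    for (x <- iter; i <- predicates.indices if predicates(i)(x)) buffers(i) += x
    Iterator.single(buffers.map(_.toVector))
  }
  tempRDD.persist(level)
  val rdds = predicates.indices.map(i => tempRDD.flatMap(_(i)))
  (rdds, (blocking: Boolean) => { tempRDD.unpersist(blocking); () })
}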
Have a look at the following question.
Write to multiple outputs by key Spark - one Spark job
You can flatMap an RDD with a function like the following and then do a groupBy on the key.
def multiFilter(words: List[String], line: String) =
  for { word <- words; if line.contains(word) } yield { (word, line) }

val filterWords = List("a", "b")
val filteredRDD = logData.flatMap(line => multiFilter(filterWords, line))
val groupedRDD = filteredRDD.groupBy(_._1)
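If you actually want one output file per key, as in the linked question above, a rough sketch using Hadoop's MultipleTextOutputFormat might look like this (untested; the class name and output path are mine):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each (key, line) pair to a file named after the key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()  // keep only the line in the file
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}

filteredRDD.saveAsHadoopFile("multi-output", classOf[String], classOf[String], classOf[KeyBasedOutput])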
But depending on the size of your input RDD, you may or may not see any performance gain, because any groupBy operation involves a shuffle.
On the other hand, if you have enough memory in your Spark cluster, you can cache the input RDD, so running multiple filter operations may not be as expensive as you think.
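For example, a minimal sketch of that caching approach with the code from the question (same paths as above):
logData.cache()  // keep the lines in memory across both passes
logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
logData.unpersist()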