Map function of RDD not being invoked in Scala Spark - scala

When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List("a", "d", "c", "d")
list.map(l => {
  println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
  println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice that "mapping RDD" is not printed to the screen. Why is this occurring?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap( dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]()
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null
Is this due to lazy evaluation ?

Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it, rather on the worker executing that closure. Try looking into the logs of the workers.
A similar thing happens with the HashMap population in the second part of the question. The same piece of code will be executed on each partition, on separate workers, and then serialized back to the driver. Given that closures are 'cleaned' by Spark, testMap is probably being removed from the serialized closure, resulting in a null. Note that if it were only due to the map not being executed, the HashMap should be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
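A hypothetical usage, continuing the example from the question (tm is the RDD built from list):
val testM = getTestMap(tm)
println(testM.get("a")) // now prints "a" instead of null
Note that collect() brings all the data to the driver, so the returned java.util.Map lives entirely in driver memory.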

Related

Scala: java.lang.UnsupportedOperationException: Primitive types are not supported

I've added the following code:
var counters: Map[String, Int] = Map()
val results = rdd.filter(l => l.contains("xyz")).map(l => mapEvent(l)).filter(r => r.isDefined).map(
  i => {
    val date = i.get.getDateTime.toString.substring(0, 10)
    counters = counters.updated(date, counters.getOrElse(date, 0) + 1)
  }
)
I want to get counts for different dates in the RDD in one single iteration. But when I run this I get message saying:
No implicits found for parameters evidence$6: Encoder[Unit]
So I added this line:
implicit val myEncoder: Encoder[Unit] = org.apache.spark.sql.Encoders.kryo[Unit]
But then I get this error.
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.xyz.SparkBatchJob.main(SparkBatchJob.scala)
Caused by: java.lang.UnsupportedOperationException: Primitive types are not supported.
at org.apache.spark.sql.Encoders$.genericSerializer(Encoders.scala:200)
at org.apache.spark.sql.Encoders$.kryo(Encoders.scala:152)
How do I fix this? Or is there a better way to get the counts I want in a single iteration (O(N) time)?
A Spark RDD is a representation of a distributed collection. When you apply a map function to an RDD, the function you use to manipulate the collection is executed across the cluster, so there is no sense in mutating a variable created outside the scope of the map function.
In your code, the problem is that you don't return any value; instead you are trying to mutate a structure, and for that reason the compiler infers that the new RDD created by the transformation is an RDD[Unit].
If you need to create a Map as the result of a Spark action, you should create a pair RDD and then apply a reduceByKey operation.
Include the type of the rdd and the mapEvent function to see how it could be done.
Spark builds a DAG from the transformations and the action; it does not process the data twice.
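As a rough sketch of that approach, keeping the names from your snippet (mapEvent, getDateTime and the record type are assumptions taken from your code; if rdd is actually a Dataset, call .rdd first):
val countsByDate: Map[String, Int] = rdd
  .filter(l => l.contains("xyz"))
  .map(l => mapEvent(l))
  .filter(r => r.isDefined)
  .map(i => (i.get.getDateTime.toString.substring(0, 10), 1)) // (date, 1) pairs
  .reduceByKey(_ + _) // one count per date, computed on the cluster in a single pass
  .collect()          // the per-date result is small, so it is safe to bring to the driver
  .toMap
This avoids mutating counters from inside the closure: the per-date aggregation happens on the executors and only the final counts come back to the driver.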

RDD Remove elements by key

I have 2 RDD's that are pulled in with the following code:
val fileA = sc.textFile("fileA.txt")
val fileB = sc.textFile("fileB.txt")
I then map and reduce by key:
val countsB = fileB.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
val countsA = fileA.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
I now want to find and remove all keys in countsB that also exist in countsA.
I have tried something like:
countsB.keys.foreach(b => {
  if (countsB.collect().exists(_ == b)) {
    countsB.collect().drop(countsB.collect().indexOf(b))
  }
})
but it doesn't seem to remove them by key.
There are 3 issues with your suggested code:
You are collecting the RDDs, which means they are not RDDs anymore; they are copied into the driver application's memory as plain Scala collections, so you lose Spark's parallelism and risk OutOfMemory errors if your dataset is large.
When calling drop on an immutable Scala collection (or an RDD), you don't change the original collection; you get a new collection with those records dropped, so you can't expect the original collection to change.
You cannot access an RDD within a function passed to any of the RDD's higher-order methods (e.g. foreach in this case) - any function passed to these methods is serialized and sent to workers, and RDDs are (intentionally) not serializable - it makes no sense to fetch them into driver memory, serialize them, and send them back to the workers - the data is already distributed on the workers!
To solve all these - when you want to use one RDD's data to transform/filter another one, you usually want to use some type of join. In this case you can do:
// left join, and keep only records for which there was NO match in countsA:
countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }
NOTE that this collect that I'm using here isn't the collect you used - this one takes a PartialFunction as an argument, and behaves like a combination of map and filter, and most importantly: it doesn't copy all data into driver memory.
EDIT: as The Archetypal Paul commented - you have a much shorter and nicer option - subtractByKey:
countsB.subtractByKey(countsA)
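A small self-contained sketch of both options (the word counts here are made up purely for illustration, assuming a SparkContext sc):
val countsA = sc.parallelize(Seq(("apple", 2), ("banana", 1)))
val countsB = sc.parallelize(Seq(("apple", 5), ("cherry", 3)))
// Option 1: left outer join, keep only keys with no match in countsA
val viaJoin = countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }
// Option 2: subtractByKey does the same thing directly
val viaSubtract = countsB.subtractByKey(countsA)
viaJoin.collect().foreach(println)     // (cherry,3)
viaSubtract.collect().foreach(println) // (cherry,3)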

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

I have the following code where rddMap is of org.apache.spark.rdd.RDD[(String, (String, String))], and myHashMap is scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) prints things, afterwards myHashMap still has only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap is put into myHashMap.
Snippet code:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is code similar to what you've posted:
// newHashMap is set up the same way as myHashMap in the question
val newHashMap = new HashMap[String, (String, String)]
newHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
  println("t" + t)
  newHashMap.put(t._1, t._2)
  println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and adds the elements to the copied map.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you put inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you have inside this map() gets serialized and gets transferred around:
.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor, and each executor will update its own version of that HashMap. This is why at the end of the execution the myHashMap you have in your driver never gets changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see link here). Note that broadcast variables are read-only, so again, using broadcasts will not help you here.
Another way is to use Accumulators, but I feel that these are more tuned towards summarizing numeric values, like sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam. See link here.
Coming back to the original question, if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small and manageable collection of elements and then you collect() this final/small RDD.
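For example, a minimal sketch using the rddMap from the question (the filter is purely illustrative; shrink the data however makes sense for your job):
val smallMap = rddMap
  .filter { case (k, _) => k.startsWith("test") } // shrink the data on the cluster first
  .collectAsMap()                                 // then bring the small result to the driver
println(smallMap)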

Spark's RDD.map() will not execute unless the item inside RDD is visited

I'm not quite sure how Scala and Spark work; maybe I wrote the code in the wrong way.
What I want to achieve is, for a given Seq[(String, Int)], to assign a random item from v._2.path to each _._2.
To do that, I implement a method and call it on the next line:
def getVerticesWithFeatureSeq(graph: Graph[WikiVertex, WikiEdge.Value]): RDD[(VertexId, WikiVertex)] = {
  graph.vertices.map(v => {
    // For each token in the sequence, assign an article to it based on its path (root to this node)
    println(v._1 + " before " + v._2.featureSequence)
    v._2.featureSequence = v._2.featureSequence.map(f => (f._1, v._2.path.apply(new scala.util.Random().nextInt(v._2.path.size))))
    println(v._1 + " after " + v._2.featureSequence)
    (v._1, v._2)
  })
}
val dt = getVerticesWithFeatureSeq(wikiGraph)
When I execute it, I expect the println to print out something, but it doesn't.
If I add another line of code
dt.foreach(println)
the println inside map prints correctly.
Is there some laziness in Spark's code execution? As if, when nothing accesses a variable, the computation is deferred or even cancelled?
Is graph.vertices an RDD? That would explain your issue, since Spark transformations are lazy until an action is executed, foreach in your case:
val dt = getVerticesWithFeatureSeq(wikiGraph) //no result is computed yet, map transformation is 'recorded'
dt.foreach(println) //foreach action requires a result, this triggers the computation
RDDs remember the transformations applied to them and are only computed when an action requires a result to be returned to the driver program.
You can check http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations for further details and a list of available transformations and actions.
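If you want to confirm the laziness without running the full job, you can also inspect the recorded lineage (a small sketch using your dt):
val dt = getVerticesWithFeatureSeq(wikiGraph)
println(dt.toDebugString) // prints the recorded lineage; nothing has been computed yet
dt.count()                // an action: only now does the map (and its printlns) actually run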

Nesting of RDD's in Scala Spark

Referring to this question: NullPointerException in Scala Spark, appears to be caused be collection type?
Answer states "Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations."
This code :
val x = sc.parallelize(List(1, 2, 3))
def fun1(n: Int) = {
  fun2(n)
}
def fun2(n: Int) = {
  n + 1
}
x.map(v => fun1(v)).take(1)
prints:
Array[Int] = Array(2)
This is correct.
But does this not disagree with "can't perform transformations or actions on RDDs inside of other RDD operations", since a nested call is occurring inside an RDD operation?
No. In the linked question d.filter(...) returns an RDD, so the type of
d.distinct().map(x => d.filter(_.equals(x)))
is RDD[RDD[String]]. This isn't allowed, but it doesn't happen in your code. If I understand the answer right, you can't refer to d or to other RDDs inside map either, even if you don't end up with an RDD[RDD[SomeType]].
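To make the distinction concrete, here is a small sketch (made-up data; the commented-out line is the disallowed pattern, since the inner reference to d would have to be serialized out to the executors):
val d = sc.parallelize(List("a", "b", "a"))
// NOT allowed: referencing the RDD d inside a transformation of another RDD
// d.distinct().map(x => d.filter(_.equals(x)).count())
// Allowed: bring the small lookup data to the driver first (or use a join)
val counts = d.countByValue()                           // a plain driver-side Map[String, Long]
val perElement = d.distinct().map(x => (x, counts(x)))
perElement.collect().foreach(println)                   // (a,2) and (b,1), in some order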