I have the following code, where rddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))] and myHashMap is a scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) does print things, afterwards myHashMap still contains only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap is put into myHashMap.
Code snippet:
import scala.collection.mutable.HashMap

val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is code similar to what you've posted (with newHashMap pre-populated the same way as your myHashMap):
val newHashMap = scala.collection.mutable.HashMap("test1" -> ("10", "20"))

rddMap.map { t =>
  println("t" + t)
  newHashMap.put(t._1, t._2)
  println(newHashMap.toString)
}.collect
Here is the output of the above code from the Spark shell:
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and adds the elements to that copy, not to the one you defined.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you put inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you reference inside this map() gets serialized and transferred around:
.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor and each executor will be updating its own copy of that HashMap. This is why at the end of the execution the myHashMap you have in your driver never gets changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see broadcast variables in the Spark programming guide). Note that broadcast variables are read-only, so again, using broadcasts will not help you here.
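For reference, broadcasting a driver-side value looks roughly like this (a sketch reusing the question's variables); every executor can read broadcastMap.value, but anything it writes to that copy never comes back to the driver:
val broadcastMap = sc.broadcast(myHashMap)
rddMap.map(t => broadcastMap.value.get(t._1)).collect()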
Another way is to use accumulators, but I feel that these are geared more towards summarizing numeric values, like computing a sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam (see the Spark API docs).
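For reference, here is a rough sketch of what such a custom accumulator could look like with the AccumulatorParam API mentioned above (deprecated in newer Spark versions in favour of AccumulatorV2); the names are illustrative:
import org.apache.spark.AccumulatorParam

// Merges the per-executor maps into a single map readable on the driver.
object MapAccumulatorParam extends AccumulatorParam[Map[String, (String, String)]] {
  def zero(initial: Map[String, (String, String)]): Map[String, (String, String)] =
    Map.empty
  def addInPlace(m1: Map[String, (String, String)],
                 m2: Map[String, (String, String)]): Map[String, (String, String)] =
    m1 ++ m2
}

val acc = sc.accumulator(Map.empty[String, (String, String)])(MapAccumulatorParam)
rddMap.foreach { case (k, v) => acc += Map(k -> v) }
acc.value  // the merged map, available on the driver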
Coming back to the original question: if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small, manageable collection of elements and then collect() this final, small RDD.
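For instance, with the rddMap from your question that boils down to something like this (assuming the whole RDD is small enough to fit in driver memory):
val myMap: Map[String, (String, String)] = rddMap.collect().toMap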
Related
I'm new to Scala and FP in general and am trying to practice on a dummy example.
val counts = ransomNote.map(e=>(e,1)).reduceByKey{case (x,y) => x+y}
The following error is raised:
Line 5: error: value reduceByKey is not a member of IndexedSeq[(Char, Int)] (in solution.scala)
The above example looks similar to the classic FP word-count primer; I'd appreciate it if you could point out my mistake.
It looks like you are trying to use a Spark method on a Scala collection. The two APIs share a few similarities, but reduceByKey is not one of them.
In pure Scala you can do it like this:
val counts =
  ransomNote.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
    (counts, c) => counts.updated(c, counts(c) + 1)
  }
foldLeft iterates over the collection from the left, using the empty map of counts as the accumulated state (a map that returns 0 if no value is found for a key); the function passed as an argument updates that state by incrementing the count of the character it just saw.
Note that accessing a map directly (counts(c)) is likely to be unsafe in most situations (since it will throw an exception if no item is found). In this situation it's fine because in this scope I know I'm using a map with a default value. When accessing a map you will more often than not want to use get, which returns an Option. More on that on the official Scala documentation (here for version 2.13.2).
You can play around with this code here on Scastie.
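For illustration, get makes the absence of a key explicit instead of throwing (reusing the counts map built above, and assuming ransomNote contains at least one 'a' and no 'z'):
counts.get('a')  // Some(<count of 'a'>)
counts.get('z')  // None, rather than an exception
counts('z')      // 0 here, but only because this particular map has withDefaultValue(0)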
On Scala 2.13 you can use the new groupMapReduce
ransomNote.groupMapReduce(identity)(_ => 1)(_ + _)
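For example, assuming ransomNote is a String (which the IndexedSeq[(Char, Int)] in your error suggests), this one-liner counts each character:
val ransomNote = "hello"
val counts = ransomNote.groupMapReduce(identity)(_ => 1)(_ + _)
// counts: Map[Char, Int] containing h -> 1, e -> 1, l -> 2, o -> 1 (key order may vary)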
val str = "hello"
val countsMap: Map[Char, Int] = str
.groupBy(identity)
.mapValues(_.length)
println(countsMap)
As I understand it, a way to create a new ArrayBuffer with one element is to say
val buffer = ArrayBuffer(element)
or something like this:
val buffer = ArrayBuffer[Option[String]](None)
Suppose x is a collection with 3 elements. I'm trying to create a map that creates a new 1-element ArrayBuffer for each element in x and associates the element in x with the new buffer. (These are intentionally separate mutable buffers that will be modified by threads.) I tried this:
x.map(elem => (elem, ArrayBuffer[Option[String]](None))).toMap
However, I found (using System.identityHashCode) that only one ArrayBuffer was created, and all 3 elements were mapped to the same value.
Why is this happening? I expected that the tuple expression would be evaluated for each element of x and that this would result in a new ArrayBuffer being created for each evaluation of the expression.
What would be a good workaround?
I'm using Scala 2.11.
Update
In the process of creating a reproducible example, I figured out the problem. Here's the example; Source is an interface defined in our application.
def test1(x: Seq[Source]): Unit = {
  val containers = x.map(elem => (elem, ArrayBuffer[Option[String]](None))).toMap
  x.foreach(elem => println(
    s"test1: elem=${System.identityHashCode(elem)} container=${System.identityHashCode(containers(elem))}"))
  x.indices.foreach(n => containers(x(n)).update(0, Some(n.toString)))
  x.foreach(elem => println(s"resulting value: ${containers(elem)(0)}"))
}
What I missed was that for the values of x I was trying to use, the class implementing Source was returning true for equals() for all combinations of values. So the resulting map only had one key-value pair.
Apologies for not figuring this out sooner. I'll delete the question after a while.
I think your problem is the toMap. If all three elements are equal, then the Map ends up with just one entry (as they all produce the same key).
I played a bit on Scalafiddle (remove the .toMap and you will have 3 ArrayBuffers).
Let me know if I have misunderstood you.
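To see the effect in isolation, here is a minimal sketch (not your exact Source type): map really does build three separate buffers, but toMap collapses them because the keys are equal.
import scala.collection.mutable.ArrayBuffer

val x = Seq("a", "a", "a")  // three elements that all compare equal
val pairs = x.map(elem => (elem, ArrayBuffer[Option[String]](None)))
pairs.map(p => System.identityHashCode(p._2)).distinct.size  // 3: three distinct buffers
pairs.toMap.size                                             // 1: duplicate keys collapsed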
I cannot seem to replicate the issue, for example
val m =
  List(Some("a"), Some("b"), Some("c"))
    .map(elem => (elem, ArrayBuffer[Option[String]](None)))
    .toMap
m(Some("a")) += Some("42")
m
outputs
res2: scala.collection.immutable.Map[Some[String],scala.collection.mutable.ArrayBuffer[Option[String]]] = Map(
Some(a) -> ArrayBuffer(None, Some(42)),
Some(b) -> ArrayBuffer(None),
Some(c) -> ArrayBuffer(None)
)
where we see Some("42") was added to one buffer whilst others were unaffected.
I have been assigned the task of reading from a csv and creating a ListMap variable. The reason to use this specific class is that for some other use cases they were already using a number of methods that take a ListMap as an input parameter, and they want one more.
What I have done so far is: read from the csv and create an RDD.
The format of the csv is
"field1,field2"
"value1,value2"
"value3,value4"
In this RDD I have tuples of strings. What I would like is to now convert this to a ListMap. So what I have is a variable of type Array[(String, String)] holding (value1,value2) and (value3,value4).
I did this because I find it easy to go from a csv to tuples. The problem is that I cannot find any way to go from here to a ListMap. It seems easier to get a normal Map, but as I said, the final result is required to be a ListMap.
I have been reading but I do not really understand this answer nor this one
Based on the sample data you provided, you can use the collectAsMap api to collect the data as a Map on the driver:
val rdd = sparkSession.sparkContext.textFile("path to the text file")
  .map(line => line.split(","))
  .map(array => array(0) -> array(1))
  .collectAsMap()  // note: despite the name, rdd is now a scala.collection.Map, not an RDD
That's it.
Now if you want to go a step further you can do additional step as
import scala.collection.immutable.ListMap

var listMap: ListMap[String, String] = ListMap.empty[String, String]
for (entry <- rdd) {
  listMap += entry
}
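If insertion order is the reason a ListMap is required, note that collectAsMap makes no ordering guarantee; a variant that keeps the order of the lines in the file would be (a sketch using the same hypothetical path):
import scala.collection.immutable.ListMap

val listMapOrdered: ListMap[String, String] =
  ListMap(
    sparkSession.sparkContext.textFile("path to the text file")
      .map(line => line.split(","))
      .map(array => array(0) -> array(1))
      .collect(): _*  // collect() returns the rows in RDD order
  )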
Array("foo" -> "bar", "baz" -> "bat").toMap gives you a Map.
If you are looking for a ListMap specifically (for the life of me, can't think of a reason why you would), then you need a breakOut:
import scala.collection.immutable.ListMap

val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat")
    .toMap
    .map(identity)(scala.collection.breakOut)
breakOut is sort of a "collection factory" that lets you implicitly convert between different collection types. You can read more about it here: https://docs.scala-lang.org/tutorials/FAQ/breakout.html
Looking for some assistance with a problem of how to do something in Scala using Spark.
I have:
type DistanceMap = HashMap[(VertexId,String), Int]
this forms part of my data in the form of an RDD of:
org.apache.spark.rdd.RDD[(DistanceMap, String)]
in short my dataset looks like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1, (100,9)=2},piece_of_data_3)
What I want to do is flatMap my distance map (which I can do), but at the same time, for each flatMapped DistanceMap entry, I want to retain the associated string. So my resulting data would look like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1},piece_of_data_3)
({(109,S)=2},piece_of_data_3)
As mentioned, I can flatMap the first part using:
x.flatMap(x => x._1).collect.foreach(println)
but am stuck on how I can retain the string from the second part of my original data.
This might work for you:
x.flatMap(x => x._1.map(y => (y,x._2)))
The idea is to convert from (Seq(a,b,c),Value) to Seq( (a,Value), (b, Value), (c, Value)).
This is the same in Scala, so here is a standalone simplified Scala example you can paste in Scala REPL:
Seq((Seq("a","b","c"), 34), (Seq("r","t"), 2)).flatMap( x => x._1.map(y => (y,x._2)))
This results in:
res0: Seq[(String, Int)] = List((a,34), (b,34), (c,34), (r,2), (t,2))
update
I have an alternative solution: flip key with value and use the flatMapValues transformation, then flip key with value again. See pseudo code:
x.map(x => (x._2, x._1)).flatMapValues(x => x).map(x => (x._2, x._1))
previous version
I propose to add one preprocessing step (sorry, I have no computer with a Scala interpreter in front of me till tomorrow to come up with working code).
transform the pair RDD from (DistanceMap, String) into an RDD of lists of Tuple4: List((VertexId, String, Int, String), ... ())
apply flatMap on the result
Pseudocode:
rdd.map( (DistanceMap, String) => List((VertexId,String, Int, String), ... ()))
.flatMap(x=>x)
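For completeness, a possible concrete version of that pseudocode (a sketch, assuming x: RDD[(DistanceMap, String)] as in the question):
val result = x
  .map { case (distances, data) =>
    distances.toList.map { case ((vertexId, label), dist) => (vertexId, label, dist, data) }
  }
  .flatMap(identity)
// result: RDD[(VertexId, String, Int, String)]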
When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List ("a" , "d" , "c" , "d")
list.map(l => {
println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap(dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]()
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; rather, it happens on the worker executing that closure. Try looking at the logs of the workers.
A similar thing is happening with the hashMap population in the second part of the question. The same piece of code will be executed on each partition, on separate workers, and will be serialized back to the driver. Given that closures are 'cleaned' by Spark, testMap is probably being removed from the serialized closure, resulting in the null. Note that if it were only due to the map not being executed, the hashMap should be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
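Used with the snippet from the question, this gives (a quick check):
val testM = getTestMap(tm)
println(testM.get("a"))  // prints "a" instead of null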