How to Reference Spark Broadcast Variables Outside of Scope - scala

All the examples I've seen for Spark broadcast variables define them in the scope of the functions that use them (map(), join(), etc.). I would like to use both a map() function and a mapPartitions() function that reference a broadcast variable, but I would like to modularize them so I can use the same functions for unit testing purposes.
How can I accomplish this?
A thought I had was to curry the function so that I pass a reference to the broadcast variable when using either a map or mapPartitions call.
Are there any performance implications by passing around the reference to the broadcast variable that are not normally found when defining the functions inside the original scope?
I had something like this in mind (pseudo-code):
// firstFile.scala
// ---------------
def mapper(bcast: Broadcast[Map[Int, Int]])(row: SomeRow): Int = {
  bcast.value(row._1)
}

def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
  val broadcastVariable = bcast.value
  for {
    i <- iter
  } yield broadcastVariable(i)
}
// secondFile.scala
// ----------------
import firstFile.{mapMyPartition, mapper}
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
.map(mapper(bcastVariable))
.mapPartitions(mapMyPartition(bcastVariable))

Your solution should work fine. In both cases the function passed to map (or mapPartitions) will contain a reference to the broadcast variable itself when it is serialized, not to its value; bcast.value is only called when the function is evaluated on the executor.
What needs to be avoided is something like
def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  val value = bcast.value // the value is extracted on the driver...
  row => value(row._1)    // ...and captured by the closure, so it is serialized with every task
}
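For contrast, a minimal sketch of the same shape done correctly (my illustration, using the question's placeholder SomeRow type): only the broadcast handle is captured, and bcast.value is read inside the returned closure, i.e. on the executor.
def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  // nothing heavy is captured here, only the lightweight broadcast handle
  row => bcast.value(row._1) // the value is fetched lazily on the executor
}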

You are doing this correctly. You just have to remember to pass the broadcast reference and not the value itself. Using your example the difference might be shown as follows:
a) efficient way:
// the Map[Int, Int] is shipped to each worker once, via the broadcast mechanism
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable)) // only the lightweight broadcast reference is serialized into each task closure
  .mapPartitions(mapMyPartition(bcastVariable)) // only the lightweight broadcast reference is serialized into each task closure
b) inefficient way:
// the broadcast is created, but bcastVariable.value is read back on the driver below, so it buys you nothing
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable.value)) // the whole Map[Int, Int] is captured in the closure and serialized with every task
  .mapPartitions(mapMyPartition(bcastVariable.value)) // the whole Map[Int, Int] is captured in the closure and serialized with every task
Of course, in the second example mapper and mapMyPartition would have slightly different signatures.
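For completeness, here is a sketch of what those value-taking signatures might look like (my illustration, not from the original post); it makes clear why the whole Map ends up inside each serialized task closure:
def mapper(lookup: Map[Int, Int])(row: SomeRow): Int =
  lookup(row._1) // lookup is a plain Map captured by the closure, so it travels with every task

def mapMyPartition(lookup: Map[Int, Int])(iter: Iterator[Int]): Iterator[Int] =
  iter.map(i => lookup(i))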

Related

Add element error by scala/spark

I want to add some elements to a set, but it does not work. What is wrong?
import scala.collection.mutable.HashSet

var set = new HashSet[Int]()
def add(a: Int): Unit = {
  set.add(a)
}
sc.parallelize(List(1, 2, 3)).map(add).collect
set.size
Using sc.parallelize you create a distributed dataset (an RDD). Your add method (and the set it references) is then serialized and sent to the executors. The set variable only lives on the driver and never sees the elements added to the executor-side copies (there is no "global" set).
Solutions:
Use aggregate/combine methods on your RDD:
val set = sc.parallelize(List(1, 2, 3))
  .aggregate(Set.empty[Int])(
    (s: Set[Int], i: Int) => s + i,
    (s1: Set[Int], s2: Set[Int]) => s1 ++ s2
  )
or collect the data as a set
val set = sc.parallelize(List(1,2,3)).collect().toSet
or use an accumulator:
import org.apache.spark.AccumulatorParam

object SetAccumulator extends AccumulatorParam[Set[Int]] {
  def zero(initialValue: Set[Int]) = Set.empty[Int]
  def addInPlace(s1: Set[Int], s2: Set[Int]) = s1 ++ s2
}

val acc = sc.accumulator(Set.empty[Int])(SetAccumulator)
sc.parallelize(List(1, 2, 3)).foreach(i => acc.add(Set(i)))
val set = acc.value
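As a side note, AccumulatorParam is deprecated in newer Spark versions; a rough equivalent using the Spark 2.x AccumulatorV2 API might look like the following (my sketch, not part of the original answer):
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class SetAccumulator extends AccumulatorV2[Int, Set[Int]] {
  private val underlying = mutable.Set.empty[Int]
  def isZero: Boolean = underlying.isEmpty
  def copy(): SetAccumulator = { val acc = new SetAccumulator; acc.underlying ++= underlying; acc }
  def reset(): Unit = underlying.clear()
  def add(v: Int): Unit = underlying += v
  def merge(other: AccumulatorV2[Int, Set[Int]]): Unit = underlying ++= other.value
  def value: Set[Int] = underlying.toSet
}

val acc = new SetAccumulator
sc.register(acc, "set")
sc.parallelize(List(1, 2, 3)).foreach(i => acc.add(i))
val set = acc.value // Set(1, 2, 3)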
The add method of the set returns Unit (it has the side effect of changing the set).
When you map over the RDD received from parallelize, you are changing the local copy of the set on each executor; the original set on the driver is not changed.
If, for example, you had 3 executors, each would end up with a one-element set on that executor, and that data would then be thrown away.
In Spark, you cannot rely on side effects in operations such as map.
A possible solution would be to do something like:
val s = sc.parallelize(List(1,2,3)).distinct().collect.toSet
s would have the set.

Why Source.fromIterator expects a Function0[Iterator[T]] as a parameter instead of Iterator[T]?

Based on: source code
I don't get why the parameter of Source.fromIterator is Function0[Iterator[T]] instead of Iterator[T].
Is there a practical reason for this? Could we change the signature to def fromIterator(iterator: => Iterator[T]) instead (to avoid writing Source.fromIterator(() => myIterator))?
As per the docs:
The iterator will be created anew for each materialization, which is the reason the method takes a function rather than an iterator directly.
Stream stages are supposed to be reusable, so you can materialize them more than once. A given iterator, however, can (often) be consumed only once. If fromIterator created a Source that referred to an existing iterator (whether passed by name or by reference), a second attempt to materialize it could fail because the underlying iterator would already be exhausted.
To get around this, the source needs to be able to instantiate a new iterator, so fromIterator allows you to supply the necessary logic to do this as a supplier function.
Here's an example of something we don't want to happen:
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.stream.scaladsl.Source

implicit val system = akka.actor.ActorSystem.create("test")
implicit val mat = akka.stream.ActorMaterializer(system)

val iter = Iterator.range(0, 2)
// pretend we pass the iterator directly...
val src = Source.fromIterator(() => iter)

Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res0: akka.Done = Done

Await.result(src.runForeach(println), 2.seconds)
// res1: akka.Done = Done
// No results???
That's bad: the Source src is not reusable, since it doesn't give the same output on subsequent runs. However, if we create the iterator lazily, it works:
val iterFunc = () => Iterator.range(0, 2)
val src = Source.fromIterator(iterFunc)
Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res0: akka.Done = Done
Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res1: akka.Done = Done

How to fill a variable inside a map - Scala Spark

I have to read a text file and save its values in a variable of type
Map[Int, collection.mutable.Map[Int, Double]].
I have done it with a foreach and a broadcast variable, and it works properly on my local machine, but not on a yarn-cluster: there the foreach task takes far too long, while the same task takes only about a minute locally.
val data = sc.textFile(fileOriginal)
val dataRDD = data.map(s => s.split(';').map(_.toDouble)).cache()

val datos = collection.mutable.Map[Int, collection.mutable.Map[Int, Double]]()
val bcDatos = sc.broadcast(datos)

dataRDD.foreach { x =>
  if (bcDatos.value.contains(x(0).toInt)) {
    bcDatos.value(x(0).toInt).put(x(1).toInt, x(2) / x(3) * 100)
  } else {
    bcDatos.value.put(x(0).toInt, collection.mutable.Map(x(1).toInt -> x(2) / x(3) * 100))
  }
}
My question is: How can I do the same, but using map? Can I "fill" a variable with that structure inside a map?
Thank you
When using Spark you should never try to update mutable structures in a distributed manner; that's simply not supported. If you mutate a variable created in driver code (whether broadcast or not), a separate copy of that variable is mutated on each executor, and you will never be able to "merge" these mutated partial results and send them back to the driver.
Instead, you should transform your RDD into a new (immutable!) RDD with the data you need.
If I managed to follow your logic correctly, this would give you the map you need:
// assuming dataRDD has type RDD[Array[Double]] and each Array has at least 4 items:
val result: collection.Map[Int, Map[Int, Double]] = dataRDD
  .keyBy(_(0).toInt)
  .mapValues(arr => Map(arr(1).toInt -> arr(2) / arr(3) * 100))
  .reduceByKey((a, b) => a) // you probably want to "merge" maps "a" and "b" here, but your code doesn't seem to do that now either
  .collectAsMap() // collectAsMap returns a collection.Map, hence the type annotation above
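If you actually want the per-key maps merged rather than one of them arbitrarily kept, a minimal variant of the same pipeline (my addition, under the same assumptions about dataRDD) would be:
val merged: collection.Map[Int, Map[Int, Double]] = dataRDD
  .keyBy(_(0).toInt)
  .mapValues(arr => Map(arr(1).toInt -> arr(2) / arr(3) * 100))
  .reduceByKey(_ ++ _) // merge the small per-row maps for each key
  .collectAsMap()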

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
Since the implementation of foo contains many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, it would by convention be better to allow lazy execution: the implementation of foo should run only if the resulting DataFrame or its derivative(s) are actually used on the driver (through another "action").
So far, the only way to reliably achieve this seems to be to write the whole implementation into a SparkPlan and superimpose it onto the DataFrame's SparkExecution, which is very error-prone and involves a lot of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
if (takeFirst) first else second
val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)
foo(
{ println("first"); rdd1.count },
{ println("second"); rdd2.count },
true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create a local lazy binding to make sure that the by-name arguments are not re-evaluated on every access.
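A minimal sketch of that note (my illustration): bind each by-name parameter to a local lazy val, so it is evaluated at most once no matter how often the body refers to it.
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val f = first  // evaluated at most once, and only if actually used
  lazy val s = second
  if (takeFirst) f else s
}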
infinite lazy collections like Stream
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.rdd.RDD

val initial = normalRDD(sc, 1000000L, 10)

// Infinite stream of RDDs and actions, and nothing blows up :)
lazy val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.
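Tying this back to the question, one low-tech option (my sketch, not a dedicated Spark API) is to leave foo as it is and simply defer the call with a lazy val at the call site, so its internal actions run only when the result is first used on the driver:
// assuming source is the input DataFrame from the question
lazy val result = Foo.foo(source) // nothing runs yet
// ...
result.count() // foo's iterative algorithm (and its internal actions) run here, on first access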

Distributed Map in Scala Spark

Does Spark support distributed Map collection types ?
So if I have a HashMap[String, String] of key/value pairs, can this be converted to a distributed Map collection type? To access an element I could use "filter", but I doubt this performs as well as a Map?
Since I found some new info, I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function; I would just like to point out that you should be careful, because if the RDD's partitioner is None, lookup falls back to a plain filter anyway. As for a (K,V) store on top of Spark, it looks like this is in progress, but a usable pull request has been made here. Here is an example usage.
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
It seems like the pull request was well received and will probably be included in future versions of Spark, so it is probably safe to use that pull request in your own code. Here is the JIRA ticket in case you are curious.
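To illustrate the earlier caveat about the partitioner (my own sketch, not from the pull request): giving the pair RDD an explicit partitioner lets lookup go straight to the single partition that can hold the key instead of filtering the whole RDD.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
val partitioned = pairs.partitionBy(new HashPartitioner(16)).cache()
// with a known partitioner, lookup scans only the one partition the key maps to
partitioned.lookup(1234L) // => Seq(0)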
The quick answer: Partially.
You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k, v) pairs, but by doing so you lose the constraint that the keys of a map must form a set, i.e. you lose the semantics of the Map structure.
From a practical perspective, you can still resolve an element into its corresponding value using kvRdd.lookup(element), but the result will be a sequence, given that you have no guarantee that there is a single lookup value, as explained before.
A spark-shell example to make things clear:
val englishNumbers = Map(1 -> "one", 2 ->"two" , 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)
englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one)
val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)
val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd
bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)