Why does Source.fromIterator expect a Function0[Iterator[T]] as a parameter instead of Iterator[T]? - scala

Based on: source code
I don't get why the parameter of Source.fromIterator is Function0[Iterator[T]] instead of Iterator[T].
Is there a practical reason for this? Could we change the signature to def fromIterator(iterator: => Iterator[T]) instead, to avoid writing Source.fromIterator(() => myIterator)?

As per the docs:
The iterator will be created anew for each materialization, which is the reason the method takes a function rather than an iterator directly.
Stream stages are supposed to be reusable, so you can materialize them more than once. A given iterator, however, can (often) be consumed only once. If fromIterator created a Source that referred to an existing iterator (whether passed by name or by reference), a second attempt to materialize it could fail, or produce no elements, because the underlying iterator would already be exhausted.
To get around this, the source needs to be able to instantiate a new iterator for each materialization, so fromIterator lets you supply the necessary logic as a supplier function.
Here's an example of something we don't want to happen:
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.stream.scaladsl.Source

implicit val system = akka.actor.ActorSystem.create("test")
implicit val mat = akka.stream.ActorMaterializer(system)

val iter = Iterator.range(0, 2)
// pretend we pass the iterator directly...
val src = Source.fromIterator(() => iter)

Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res0: akka.Done = Done

Await.result(src.runForeach(println), 2.seconds)
// res1: akka.Done = Done
// No results???
That's bad: src is not reusable, since it doesn't give the same output on subsequent runs. However, if we create the iterator lazily, it works:
val iterFunc = () => Iterator.range(0, 2)
val src = Source.fromIterator(iterFunc)

Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res0: akka.Done = Done

Await.result(src.runForeach(println), 2.seconds)
// 0
// 1
// res1: akka.Done = Done
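As for the proposed by-name signature: it would be semantically close to Function0 (the expression is re-evaluated on each materialization), but it makes it dangerously easy to capture a single existing iterator. A sketch, where fromIteratorByName is a hypothetical helper and not part of the Akka API:
// hypothetical by-name variant -- fine if you pass an expression that builds
// a fresh iterator, but it silently accepts a pre-built one too:
def fromIteratorByName[T](iterator: => Iterator[T]): Source[T, akka.NotUsed] =
  Source.fromIterator(() => iterator)

val iter2 = Iterator.range(0, 2)
val srcByName = fromIteratorByName(iter2)
// every materialization re-evaluates `iter2`, but that is always the same
// (eventually exhausted) iterator instance -- the original problem remains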

Related

Add element error by scala/spark

I want to add some element to a set, but it does not work, what is wrong?
import scala.collection.mutable.HashSet

var set = new HashSet[Int]()

def add(a: Int) {
  set.add(a)
}

sc.parallelize(List(1, 2, 3)).map(add).collect
set.size
Using sc.parallelize, you create a distributed dataset (RDD). Your add method (and the set it references) is serialized and sent to the executors. The variable set only lives on the driver and never sees the elements added to the executors' copies (there is no "global" set).
Solutions:
Use aggregate/combine methods on your RDD:
val set = sc.parallelize(List(1, 2, 3))
  .aggregate(Set.empty[Int])(
    (s: Set[Int], i: Int) => s + i,
    (s1: Set[Int], s2: Set[Int]) => s1 ++ s2
  )
or collect the data as a set
val set = sc.parallelize(List(1,2,3)).collect().toSet
or use an accumulator:
import org.apache.spark.AccumulatorParam

object SetAccumulator extends AccumulatorParam[Set[Int]] {
  def zero(initialValue: Set[Int]) = Set.empty[Int]
  def addInPlace(s1: Set[Int], s2: Set[Int]) = s1 ++ s2
}

val acc = sc.accumulator(Set.empty[Int])(SetAccumulator)
sc.parallelize(List(1, 2, 3)).foreach(i => acc.add(Set(i)))
val set = acc.value
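For reference, AccumulatorParam is deprecated since Spark 2.0; a minimal sketch of the same idea with the newer AccumulatorV2 API (the class name SetAccumulatorV2 is ours, not a Spark class):
import org.apache.spark.util.AccumulatorV2

class SetAccumulatorV2 extends AccumulatorV2[Int, Set[Int]] {
  private var underlying = Set.empty[Int]
  def isZero: Boolean = underlying.isEmpty
  def copy(): SetAccumulatorV2 = {
    val acc = new SetAccumulatorV2
    acc.underlying = underlying
    acc
  }
  def reset(): Unit = underlying = Set.empty[Int]
  def add(v: Int): Unit = underlying += v
  def merge(other: AccumulatorV2[Int, Set[Int]]): Unit = underlying ++= other.value
  def value: Set[Int] = underlying
}

val acc = new SetAccumulatorV2
sc.register(acc, "set")
sc.parallelize(List(1, 2, 3)).foreach(acc.add)
val set = acc.value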
The add method of a set returns Unit (it has the side effect of changing the set).
When you map over the RDD received from parallelize, you are changing the local copy of the set on each executor; the original set on the driver is not changed.
If, for example, you had 3 executors, each would hold a set with one value on that executor, and then the data would simply go away.
In Spark, you cannot rely on side effects when doing operations such as map.
A possible solution would be to do something like:
val s = sc.parallelize(List(1,2,3)).distinct().collect.toSet
s would then hold the set.

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
Since the implementation of foo contains many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation of foo should run only if the resulting DataFrame or its derivative(s) are actually used on the driver (through another "action").
So far, the only way I've found to reliably achieve this is to write the whole implementation into a SparkPlan and superimpose it into the DataFrame's SparkExecution; this is very error-prone and involves lots of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second

val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)

foo(
  { println("first"); rdd1.count },
  { println("second"); rdd2.count },
  true // only `first` will be evaluated
)
// first
// Long = 10000
Note: in practice you should create a local lazy binding to make sure that by-name arguments are not evaluated on every access, as in the sketch below.
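A minimal sketch of that local lazy binding, reusing the foo above (fooOnce is a hypothetical name):
def fooOnce(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val f = first  // evaluated at most once, and only if actually used
  lazy val s = second
  if (takeFirst) f else s
}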
infinite lazy collections like Stream:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

val initial = normalRDD(sc, 1000000L, 10)

// Infinite stream of RDDs and actions and nothing blows :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.
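For the original question, the simplest application of these tools might be to defer foo at the call site (a sketch; the body of foo is elided as in the question, and `source` stands for the question's input DataFrame):
import org.apache.spark.sql.DataFrame

object Foo {
  // stands in for the question's complex iterative algorithm
  def foo(source: DataFrame): DataFrame = ???
}

def source: DataFrame = ??? // the input DataFrame from the question

lazy val transformed: DataFrame = Foo.foo(source) // nothing is executed here
// the expensive actions inside foo run only on first use, e.g.:
// transformed.count()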

How to Reference Spark Broadcast Variables Outside of Scope

All the examples I've seen for Spark broadcast variables define them in the scope of the functions using them (map(), join(), etc.). I would like to use both a map() function and mapPartitions() function that reference a broadcast variable, but I would like to modularize them so I can use the same functions for unit testing purposes.
How can I accomplish this?
A thought I had was to curry the function so that I pass a reference to the broadcast variable when using either a map or mapPartitions call.
Are there any performance implications of passing around the reference to the broadcast variable that are not normally found when defining the functions inside the original scope?
I had something like this in mind (pseudo-code):
// firstFile.scala
// ---------------
def mapper(bcast: Broadcast[Map[Int, Int]])(row: SomeRow): Int =
  bcast.value(row._1)

def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
  val broadcastVariable = bcast.value
  for {
    i <- iter
  } yield broadcastVariable(i)
}

// secondFile.scala
// ----------------
import firstFile.{mapMyPartition, mapper}

val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

rdd
  .map(mapper(bcastVariable))
  .mapPartitions(mapMyPartition(bcastVariable))
Your solution should work fine. In both cases the function passed to map{Partitions} will, when serialized, contain a reference to the broadcast variable itself, not to its value, and will only call bcast.value when it actually runs on the node.
What needs to be avoided is something like
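// BAD: bcast.value is called on the driver, so the whole value is captured
// in the returned closure and serialized with every task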
def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  val value = bcast.value
  row => value(row._1)
}
You are doing this correctly. You just have to remember to pass the broadcast reference and not the value itself. Using your example the difference might be shown as follows:
a) efficient way:
// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
  .mapPartitions(mapMyPartition(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
b) inefficient way:
// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker
  .mapPartitions(mapMyPartition(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker
Of course, in the second example mapper and mapMyPartition would have slightly different signatures.
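For illustration, the inefficient variants would take the raw value instead of the Broadcast handle (hypothetical signatures, reusing the types from the example; SomeRow is the asker's placeholder row type):
// the raw Map travels inside the serialized closure with every task
def mapper(map: Map[Int, Int])(row: SomeRow): Int =
  map(row._1)

def mapMyPartition(map: Map[Int, Int])(iter: Iterator[Int]): Iterator[Int] =
  iter.map(map)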

Passing a code block to a method without executing it

I have the following code:
import java.io._
import com.twitter.chill.{Input, Output, ScalaKryoInstantiator}
import scala.reflect.ClassTag

object serializer {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()

  // `file` is an eager parameter, so it is evaluated before load runs -- this is the problem
  def load[T](file: Any, name: String, cls: Class[T]): T = {
    if (java.nio.file.Files.notExists(new File(name).toPath())) {
      val temp = file
      val baos = new FileOutputStream(name)
      val output = new Output(baos, 4096)
      kryo.writeObject(output, temp)
      output.close()
      temp.asInstanceOf[T]
    } else {
      println("loading from " + name)
      val baos = new FileInputStream(name)
      val input = new Input(baos)
      val result = kryo.readObject(input, cls)
      input.close()
      result
    }
  }
}
I want to use it in this way:
val mylist = serializer.load((1 to 100000).toList,"allAdj.bin",classOf[List[Int]])
I don't want to run (1 to 100000).toList every time, so I want to pass it to the serializer and then either compute it the first time and serialize it for the future, or load it from the file.
The problem is that the code block runs first in my code. How can I pass the code block without executing it?
P.S. Is there any Scala tool that does exactly this for me?
To have parameters not be evaluated before being passed, use pass-by-name, like this:
def method(param: => ParamType)
Whatever you pass won't be evaluated at the time you pass it, but will be evaluated each time you use param, which might not be what you want either. To have it evaluated only the first time you use it, do this:
def method(param: => ParamType) = {
  lazy val p: ParamType = param
  // ... use only `p` from here on ...
}
Then use only p in the body. The first time p is used, param will be evaluated and the value will be stored; all other uses of p will use the stored value.
Note that this happens every time you invoke method. That is, if you call method twice, it won't use the "stored" value of p -- it will evaluate it again on first use. If you want to "pre-compute" something, then perhaps you'd be better off with a class instead?
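Applied to the load method from the question, a by-name parameter gives the behaviour you want (a sketch; Kryo handling kept as in your code):
import java.io._
import com.twitter.chill.{Input, Output, ScalaKryoInstantiator}

object serializer {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()

  // `file` is by-name: the expression is only evaluated if the cache file is missing
  def load[T](file: => T, name: String, cls: Class[T]): T = {
    if (java.nio.file.Files.notExists(new File(name).toPath())) {
      val temp = file // evaluated here, on the first run only
      val output = new Output(new FileOutputStream(name), 4096)
      kryo.writeObject(output, temp)
      output.close()
      temp
    } else {
      println("loading from " + name)
      val input = new Input(new FileInputStream(name))
      val result = kryo.readObject(input, cls)
      input.close()
      result
    }
  }
}

// (1 to 100000).toList is only built when allAdj.bin does not exist yet
val mylist = serializer.load((1 to 100000).toList, "allAdj.bin", classOf[List[Int]])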

Appending element to list in Scala

val indices: List[Int] = List()
val featValues: List[Double] = List()

for (f <- feat) {
  val q = f.split(':')
  if (q.length == 2) {
    println(q.mkString("\n")) // works fine, displays info
    indices :+ (q(0).toInt)
    featValues :+ (q(1).toDouble)
  }
}
println(indices.mkString("\n") + indices.length) // prints nothing and 0?
indices and featValues are not being filled. I'm at a loss here.
You cannot append anything to an immutable data structure such as List stored in a val (immutable named slot).
What your code is doing is creating a new list every time with one element appended, and then throwing it away (by not doing anything with it) — the :+ method on lists does not modify the list in place (even when it's a mutable list such as ArrayBuffer) but always returns a new list.
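A quick way to see this:
val xs = List(1, 2)
val ys = xs :+ 3 // a brand-new list is returned
// xs == List(1, 2)  -- unchanged
// ys == List(1, 2, 3)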
In order to achieve what you want, the quickest way (as opposed to the right way) is either to use a var (typically preferred):
var xs = List.empty[Int]
xs :+= 123 // same as `xs = xs :+ 123`
or a val containing a mutable collection:
import scala.collection.mutable.ArrayBuffer
val buf = ArrayBuffer.empty[Int]
buf += 123
However, if you really want to make your code idiomatic, you should instead just use a functional approach:
val indicesAndFeatVals = feat.flatMap { f =>
  f.split(':') match { // pattern matching in action
    case Array(q0, q1) => Some((q0.toInt, q1.toDouble))
    case _             => None // preserves the original `length == 2` guard
  }
}
which will give you a sequence of pairs, which you can then unzip into 2 separate collections:
val (indices, featVals) = indicesAndFeatVals.unzip
This approach will avoid the use of any mutable data structures as well as vars (i.e. mutable slots).