In Apache Spark, how to make an RDD/DataFrame operation lazy? - scala

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
Since the implementation of foo contains many actions (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation of foo should run only if the resulting DataFrame or its derivative(s) are actually used on the driver (through another action).
So far, the only way I have found to reliably achieve this is to write the whole implementation as a SparkPlan and superimpose it onto the DataFrame's SparkExecution, which is very error-prone and involves a lot of boilerplate code. What is the recommended way to do this?

It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second

val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)

foo(
  { println("first"); rdd1.count },
  { println("second"); rdd2.count },
  true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create a local lazy binding to make sure the arguments are not re-evaluated on every access, as shown in the sketch below.
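For example (a minimal sketch of that pattern applied to the foo above; the firstOnce/secondOnce names are just illustrative):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val firstOnce = first   // evaluated at most once, and only if needed
  lazy val secondOnce = second
  if (takeFirst) firstOnce else secondOnce
}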
infinite lazy collections like Stream:
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.rdd.RDD

val initial = normalRDD(sc, 1000000L, 10)

// Infinite stream of RDDs and actions and nothing blows :)
lazy val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.
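Putting this back into the context of the question, here is a minimal sketch that defers the expensive iterative algorithm until its result is actually requested (fooLazily and expensiveSource are illustrative names, not part of any Spark API):
import org.apache.spark.sql.DataFrame

object Foo {
  def foo(source: DataFrame): DataFrame = ??? // the expensive iterative algorithm

  // Call-by-name + a local lazy val: foo runs at most once,
  // and only when the returned thunk is invoked.
  def fooLazily(source: => DataFrame): () => DataFrame = {
    lazy val result = foo(source)
    () => result
  }
}

val lazyResult = Foo.fooLazily(expensiveSource()) // nothing is executed yet
lazyResult()                                      // runs the algorithm exactly once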

Related

Streamline results in mapPartitions (Spark)

Is there a way to return partial results in mapPartitions()?
Currently I use it like this:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = <some costly init operation>
  val results = ArrayBuffer[OutputType]()
  for (input: InputType <- iter) results += transform(input, additionalData)
  results.iterator
}
But of course, if a partition is too big, the results buffer will cause an OOM error.
So my question: is there a way to emit partial results every once in a while so as to avoid any OOM?
I want to stick to mapPartitions because I initialize a costly object (e.g. get the value of a big broadcast variable) before processing the input, and I don't want to do that for every record as I would with map.
If additionalData doesn't access the iterator, you can just map over it; Iterator.map is lazy, so the results are produced one record at a time and never materialized in memory:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = ???
  iter.map(input => transform(input, additionalData))
}
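As a concrete, self-contained sketch of that pattern, where the costly init is reading a broadcast variable once per partition (the lookup table and values are purely illustrative):
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))

val out = sc.parallelize(Seq(1, 2, 1, 3)).mapPartitions { iter =>
  val table = lookup.value               // costly init: done once per partition
  iter.map(k => table.getOrElse(k, "?")) // lazy: one record at a time, nothing buffered
}

out.collect() // Array(a, b, a, ?)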

Scala TypeTags and performance

There are some answers around for equivalent questions about Java, but is Scala reflection (2.11, TypeTags) really slow? There's a long narrative write-up about it at http://docs.scala-lang.org/overviews/reflection/overview.html, but the answer to this question is hard to extract from it.
I see a lot of advice floating around about avoiding reflection, some of it maybe predating the improvements in 2.11, but if this works well it looks like it could address the debilitating aspects of the JVM's type erasure for Scala code.
Thanks!
Let's measure it.
I've created a simple class C that has one method. All this method does is sleep for 10 ms.
Let's invoke this method
via reflection
directly
and see which is faster and by how much.
I've created three tests.
Test 1. Invoke via reflection. The execution time includes all the work necessary to set up reflection:
create the runtimeMirror, reflect the class, resolve the method declaration and, as the last step, invoke the method.
Test 2. Leave this preparation stage out of the measurement, since it can be re-used.
We measure the time of the reflective method invocation only.
Test 3. Invoke the method directly.
Results:
Reflection from start : job done in 2561ms got 101 (roughly 1.5 s of setup overhead in total, i.e. about 15 ms per invocation)
Invoke method reflection: job done in 1093ms got 101 (< 1 ms of setup per invocation)
No reflection: job done in 1087ms got 101 (< 1 ms of setup per invocation)
Conclusion:
The setup phase increases execution time dramatically, but there is no need to perform the setup on each invocation (it is like class initialization and can be done once). So if you use reflection the right way, with a separate init stage, it shows reasonable performance and can be used in production.
Source code:
import org.scalatest.FunSpec

class C {
  def x = {
    Thread.sleep(10)
    1
  }
}

class XYZTest extends FunSpec {

  def withTime[T](procName: String, f: => T): T = {
    val start = System.currentTimeMillis()
    val r = f
    val end = System.currentTimeMillis()
    print(s"$procName job done in ${end - start}ms")
    r
  }

  describe("SomeTest") {
    it("rebuild each time") {
      val s = withTime("Reflection from start : ", (0 to 100).map { x =>
        val ru = scala.reflect.runtime.universe
        val m = ru.runtimeMirror(getClass.getClassLoader)
        val im = m.reflect(new C)
        val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod
        val mm = im.reflectMethod(methodX)
        mm().asInstanceOf[Int]
      }).sum
      println(s" got $s")
    }

    it("invoke each time") {
      val ru = scala.reflect.runtime.universe
      val m = ru.runtimeMirror(getClass.getClassLoader)
      val im = m.reflect(new C)
      val s = withTime("Invoke method reflection: ", (0 to 100).map { x =>
        val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod
        val mm = im.reflectMethod(methodX)
        mm().asInstanceOf[Int]
      }).sum
      println(s" got $s")
    }

    it("invoke directly") {
      val c = new C()
      val s = withTime("No reflection: ", (0 to 100).map { x =>
        c.x
      }).sum
      println(s" got $s")
    }
  }
}
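As a follow-up sketch (not part of the benchmark above), the method symbol lookup can itself be hoisted into the one-off setup, leaving only the per-instance reflection and the invocation on the hot path; FastReflectiveCall is just an illustrative name:
object FastReflectiveCall {
  import scala.reflect.runtime.{universe => ru}

  // One-off setup: runtime mirror and method symbol are resolved once.
  private val m = ru.runtimeMirror(getClass.getClassLoader)
  private val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod

  // Per call: reflect the instance and invoke.
  def callX(c: C): Int = {
    val mm = m.reflect(c).reflectMethod(methodX)
    mm().asInstanceOf[Int]
  }
}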

Does scala have a lazy evaluating wrapper?

I want to return a wrapper/holder for a result that I want to compute only once and only if the result is actually used. Something like:
def getAnswer(question: Question): Lazy[Answer] = ???
println(getAnswer(q).value)
This should be pretty easy to implement using lazy val:
import scala.util.Try

class Lazy[T](f: () => T) {
  private lazy val _result = Try(f())
  def value: T = _result.get
}
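For illustration, usage would look something like this:
val l = new Lazy(() => { println("computing"); 42 })
// nothing printed yet
l.value // prints "computing" and returns 42
l.value // returns 42 again without recomputing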
But I'm wondering if there's already something like this baked into the standard API.
A quick search pointed at Streams and DelayedLazyVal but neither is quite what I'm looking for.
Streams do memoize the stream elements, but it seems like the first element is computed at construction:
def compute(): Int = { println("computing"); 1 }
val s1 = compute() #:: Stream.empty
// computing is printed here, before doing s1.take(1)
In a similar vein, DelayedLazyVal starts computing upon construction, and it even requires an execution context:
val dlv = new DelayedLazyVal(() => 1, { println("started") })
// immediately prints out "started"
There's scalaz.Need which I think you'd be able to use for this.
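A minimal sketch of what that could look like, assuming scalaz is on the classpath (computeAnswer is a hypothetical stand-in for the expensive work):
import scalaz.Need

def getAnswer(question: Question): Need[Answer] =
  Need(computeAnswer(question)) // nothing is computed here

val answer = getAnswer(q)
println(answer.value) // computed on first access, memoized afterwards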

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala using parallelization. However, when using the ParSeq.fill method, the performance doesn't seem to be any better than sequential initialization (Seq.fill). If I do the same task but initialize the collection with map, it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random

object Timer {
  def apply[A](f: => A): (A, Long) = {
    val s = System.nanoTime
    val ret = f
    (ret, System.nanoTime - s)
  }
}

object ParallelBenchmark extends App {

  def randomIsPrime: Boolean = {
    val n = Random.nextInt(1000000)
    (2 until n).exists(i => n % i == 0)
  }

  val seqSize = 100000

  val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
  println(f"Time Seq:\t\t $timeSeq")

  val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
  println(f"Time Par Fill:\t $timeParFill")

  val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
  println(f"Time Par map:\t $timeParMap")
}
And the result is (times in nanoseconds):
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections; a parallel fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.
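Roughly, that sequential implementation (GenTraversableFactory.fill in the 2.11 standard library, paraphrased here) is just a builder loop:
def fill[A](n: Int)(elem: => A): CC[A] = {
  val b = newBuilder[A]
  b.sizeHint(n)
  var i = 0
  while (i < n) {
    b += elem
    i += 1
  }
  b.result()
}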

Spark RDD equivalent to Scala collections partition

This is a minor wart in one of my Spark jobs which doesn't seem to cause any problems -- yet it annoys me every time I see it and fail to come up with a better solution.
Say I have a Scala collection like this:
val myStuff = List(Try(2/2), Try(2/0))
I can partition this list into successes and failures with partition:
val (successes, failures) = myStuff.partition(_.isSuccess)
Which is nice. The implementation of partition only traverses the source collection once to build the two new collections. However, using Spark, the best equivalent I have been able to devise is this:
val myStuff: RDD[Try[???]] = sourceRDD.map(someOperationThatMayFail)
val successes: RDD[???] = myStuff.collect { case Success(v) => v }
val failures: RDD[Throwable] = myStuff.collect { case Failure(ex) => ex }
Which, aside from the difference of unpacking the Try (which is fine), also requires traversing the data twice. Which is annoying.
Is there any better Spark alternative that can split an RDD without multiple traversals? I.e., something with a signature like this, where partition has the behaviour of the Scala collections' partition rather than RDD partitioning:
val (successes: RDD[Try[???]], failures: RDD[Try[???]]) = myStuff.partition(_.isSuccess)
For reference, I previously used something like the below to solve this. The potentially failing operation is de-serializing some data from a binary format, and the failures have become interesting enough that they need to be processed and saved as an RDD rather than something logged.
def someOperationThatMayFail(data: Array[Byte]): Option[MyDataType] = {
  try {
    Some(deserialize(data))
  } catch {
    case e: MyDeserializationError =>
      logger.error(e)
      None
  }
}
There might be other solutions, but here you go:
Setup:
import scala.util._
val myStuff = List(Try(2/2), Try(2/0))
val myStuffInSpark = sc.parallelize(myStuff)
Execution:
val myStuffInSparkPartitioned = myStuffInSpark.aggregate((List[Try[Int]](), List[Try[Int]]()))(
  (accum, curr) => if (curr.isSuccess) (curr :: accum._1, accum._2) else (accum._1, curr :: accum._2),
  (first, second) => (first._1 ++ second._1, first._2 ++ second._2)
)
Let me know if you need an explanation.
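For illustration, the result is a single pair of driver-side lists; note that aggregate is an action, so both lists are materialized on the driver rather than remaining RDDs:
val (successes, failures) = myStuffInSparkPartitioned
// successes: List(Success(1))
// failures:  List(Failure(java.lang.ArithmeticException: / by zero))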