How can I benchmark performance in Spark console? - scala

I have just started using Spark and my interactions with it revolve around spark-shell at the moment. I would like to benchmark how long various commands take, but could not find how to get the time or run a benchmark. Ideally I would want to do something super-simple, such as:
val t = [current_time]
data.map(etc).distinct().reduceByKey(_ + _)
println([current time] - t)
Edit: Figured it out --
import org.joda.time._
val t_start = DateTime.now()
[[do stuff]]
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()

I suggest you do the following:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e9 + " seconds")
  ret
}
You can pass any expression to the time function as a by-name argument; it evaluates the expression, prints the time the evaluation took, and returns the result.
For example, given a function foobar that takes data as an argument, you would do the following:
val test = time(foobar(data))
test will contain the result of foobar, and the time taken is printed as well.
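In spark-shell, just remember to end the pipeline with an action such as count, otherwise you only time the construction of the lazy transformation graph. A minimal sketch using the helper above (the data RDD here is a made-up example; substitute your own):
// hypothetical sample data; replace with your own pair RDD
val data = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
// time an action so the transformations actually run
val result = time(data.distinct().reduceByKey(_ + _).count())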

Related

For Scala 2.13, what is the fastest method for updating a LongMap, HashMap, or TrieMap with millions of updates?

Goal
I have a mutable Map[Long, Long] with millions of entries. I need to run many iterations of updates, each applying millions of updates, and I would like to do this as fast as possible.
Background
Currently, the fastest method is to use a single-threaded mutable.LongMap[Long]. This type is optimized for Long keys.
Other map types appear to be slower, but I may have implemented them incorrectly: I tried to apply the updates concurrently and/or in parallel without success. It is possible that the parallel updates are not actually happening, or that updating a map in parallel is simply not possible in Scala.
In order of fastest to slowest:
LongMap[Long] (from above)
TrieMap[Long, Long]
ParTrieMap[Long, Long]
HashMap[Long, Long]
ParHashMap[Long, Long]
ParMap[Long, Long]
It is OK if a faster method is not mutable, but I do not think this will be the case. A mutable map is probably best for this use case.
Code to generate test data and time the test
import java.util.Calendar
import scala.collection.mutable
object DictSpeedTest2 {
  //helper constants
  val million: Long = 1000000
  val billion: Long = million * 1000

  //config
  val garbageCollectionWait = 3
  val numEntries: Long = million * 10 //may need to increase JVM memory with something like: -Xmx32g
  val maxValue: Long = billion * million // 1000000000000000L (max Long = 9223372036854775807L)

  def main(args: Array[String]): Unit = {
    //generate random data; initial entries in a; updates in b
    val a = genData(numEntries, maxValue, seed = 1000)
    val b = genData(numEntries, maxValue, seed = 9999)

    //initialization
    val dict = new mutable.LongMap[Long]()
    a.foreach(x => dict += (x._1 -> x._2))

    //run and time test
    println("start test: " + Calendar.getInstance().getTime)
    val start = System.currentTimeMillis
    b.foreach(x => dict += (x._1 -> x._2)) //updates
    val end = System.currentTimeMillis

    //print runtime
    val durationInSeconds = (end - start).toFloat / 1000 + "s"
    println("end test: " + Calendar.getInstance().getTime + " -- " + durationInSeconds)
  }

  def genData(n: Long, max: Long, seed: Long): Array[(Long, Long)] = {
    val r = scala.util.Random
    r.setSeed(seed) //deterministic generation of arrays
    val a = new Array[(Long, Long)](n.toInt)
    a.map(_ => (r.nextInt(), r.nextInt()))
  }
}
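Note that genData as written never uses its max parameter, since r.nextInt() draws from the full Int range. Purely as a sketch (genDataBounded is a hypothetical name, not part of the test above), a variant that respects the bound could look like this:
def genDataBounded(n: Long, max: Long, seed: Long): Array[(Long, Long)] = {
  val r = new scala.util.Random(seed)
  // (x & Long.MaxValue) clears the sign bit, giving a non-negative value before the modulo
  Array.fill(n.toInt)(((r.nextLong() & Long.MaxValue) % max, (r.nextLong() & Long.MaxValue) % max))
}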
Current timings
LongMap[Long] with the above code completes in the following times on my 2018 MacBook Pro:
~3.5 seconds with numEntries = 10 million
~100 seconds with numEntries = 100 million
If you are not limited to Scala/Java maps, then for exceptional performance you can pick a third-party library that provides maps specialized for Long/Long key/value pairs.
Here is a not-so-outdated overview of such libraries, with benchmark results for Int/Int pairs.
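Purely as an illustration (fastutil is one such library; picking it here is my own assumption, and the timings would need to be verified with the test above), a primitive-specialized map could be dropped in roughly like this, assuming the fastutil jar is on the classpath:
import it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap

// hypothetical drop-in for mutable.LongMap[Long] in the test above
val dict = new Long2LongOpenHashMap()
a.foreach(x => dict.put(x._1, x._2)) // initial entries
b.foreach(x => dict.put(x._1, x._2)) // updates, timed the same way as before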

How to properly measure elapsed time in Spark?

I have my code written in Spark and Scala. Now I need to measure elapsed time of particular functions of the code.
Should I use spark.time like this? But then how can I properly assign the value of df?
val df = spark.time(myObject.retrieveData(spark, indices))
Or should I do it in this way?
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
val df = time{myObject.retrieveData(spark, indices)}
Update:
As recommended in the comments, I now run df.rdd.count inside myObject.retrieveData in order to materialise the DataFrame.
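For reference, spark.time returns the value of the block it measures, so the first form does assign df; the remaining concern is laziness. A minimal sketch of keeping the materialisation inside the timed block instead of inside retrieveData (same names as in the question):
// spark.time returns the result of the block, so df is assigned as usual
val df = spark.time {
  val result = myObject.retrieveData(spark, indices)
  result.rdd.count() // force materialisation so the real work is measured
  result
}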

How to time Spark program execution speed

I want to time my Spark program's execution speed, but due to laziness it's quite difficult. Consider this (meaningless) code:
var graph = GraphLoader.edgeListFile(context, args(0))
val graph_degs = graph.outerJoinVertices(graph.degrees).triplets.cache
/* I'd need to start the timer here */
val t1 = System.currentTimeMillis
val edges = graph_degs.flatMap(trip => { /* do something */ })
  .union(graph_degs)
val count = edges.count
val t2 = System.currentTimeMillis
/* I'd need to stop the timer here */
println("It took " + (t2 - t1) + " to count " + count)
The thing is, transformations are lazy, so nothing gets evaluated before the val count = edges.count line. As a result, t1 gets a value even though the code above it has not actually been executed yet: the transformations defined before t1 only run after the timer has started, regardless of where they appear in the code. That's the problem.
In the Spark Web UI I can't find anything helpful, since I need the time spent after that specific line of code. Is there an easy way to see when a group of transformations actually gets evaluated?
Since consecutive transformations (within the same task - meaning, they are not separated by shuffles and performed as part of the same action) are performed as a single "step", Spark does not measure them individually. And from Driver code - you can't either.
What you can do is measure the duration of applying your function to each record, and use an Accumulator to sum it all up, e.g.:
// create accumulator
val durationAccumulator = sc.longAccumulator("flatMapDuration")
// "wrap" your "doSomething" operation with time measurement, and add to accumulator
val edges = rdd.flatMap(trip => {
  val t1 = System.currentTimeMillis
  val result = doSomething(trip)
  val t2 = System.currentTimeMillis
  durationAccumulator.add(t2 - t1)
  result
})
// perform the action that would trigger evaluation
val count = edges.count
// now you can read the accumulated value
println("It took " + durationAccumulator.value + " to flatMap " + count)
You can repeat this for any individual transformation.
Disclaimers:
Of course, this will not include the time Spark spent shuffling things around and doing the actual counting - for that, indeed, the Spark UI is your best resource.
Note that accumulators are sensitive to things like retries - a retried task will update the accumulator twice.
Style Note:
You can make this code more reusable by creating a measure function that "wraps" around any function and updates a given accumulator:
// write this once:
def measure[T, R](action: T => R, acc: LongAccumulator): T => R = input => {
val t1 = System.currentTimeMillis
val result = action(input)
val t2 = System.currentTimeMillis
acc.add(t2 - t1)
result
}
// use it with any transformation:
rdd.flatMap(measure(doSomething, durationAccumulator))
The Spark Web UI records every single action, and even reports the times of every stage of that action - it's all in there! You need to look through the Stages tab, not the Jobs tab. I've found it's only usable, though, if you compile and submit your code; it's useless in the REPL. Are you using the REPL by any chance?
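If you do need stage timings from driver code or the REPL rather than the UI, a SparkListener can report them programmatically. A minimal sketch (standard listener API; what you log from the StageInfo is up to you):
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// print the wall-clock duration of every completed stage
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
})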

Simple scala code runs faster when I run it in an IDE, than from the terminal

I'm quite new to Scala, and today while trying things out I wanted to test the performance of a very simple program:
import java.util.Scanner
import scala.annotation.tailrec
object Tests {
  def main(args: Array[String]): Unit = {
    val sc = new Scanner(System.in)
    println("Enter a positive number : ")
    val length = sc.nextInt
    val fp_t0 = System.currentTimeMillis
    val list1 = fpListFill(List(), length)
    val fp_t1 = System.currentTimeMillis
    println(s"[fp]: time to fill the list = ${fp_t1 - fp_t0} ms")
  }

  @tailrec
  def fpListFill(list: List[Int], count: Int): List[Int] = {
    if (count == 0) count :: list
    else fpListFill(count :: list, count - 1)
  }
}
It simply asks the user for a positive number length, and then builds a list of that length recursively.
I tested it in two different ways, each time asking for a list with length = 1'000'000.
First I compiled the code with scalac and ran with scala in the terminal, which gave me :
Enter a positive number :
1000000
[fp]: time to fill the list = 795 ms
Then, I created a small project in IntelliJ IDEA in which I copied the above code. Running the program there, this is what I got :
Enter a positive number :
1000000
[fp]: time to fill the list = 25 ms
As you can see, there's quite a big difference in performance. Does someone know why? Thanks in advance.
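One likely factor (an assumption, not something measured here) is that a single timed run includes JIT warm-up and other one-off JVM costs, which the two launch paths may amortize differently. A minimal sketch that repeats the measurement so the later runs reflect steady-state performance (reusing fpListFill and length from the program above):
// repeat the fill; the first iterations absorb JIT warm-up
for (run <- 1 to 5) {
  val t0 = System.currentTimeMillis
  val list = fpListFill(List(), length)
  val t1 = System.currentTimeMillis
  println(s"run $run: time to fill the list = ${t1 - t0} ms (size = ${list.size})")
}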

Why this function call in Scala is not optimized away?

I'm running this program with Scala 2.10.3:
object Test {
  def main(args: Array[String]) {
    def factorial(x: BigInt): BigInt =
      if (x == 0) 1 else x * factorial(x - 1)
    val N = 1000
    val t = new Array[Long](N)
    var r: BigInt = 0
    for (i <- 0 until N) {
      val t0 = System.nanoTime()
      r = r + factorial(300)
      t(i) = System.nanoTime() - t0
    }
    val ts = t.sortWith((x, y) => x < y)
    for (i <- 0 to 10)
      print(ts(i) + " ")
    println("*** " + ts(N/2) + "\n" + r)
  }
}
and the call to the pure function factorial with a constant argument is evaluated on every loop iteration (a conclusion based on the timing results). Shouldn't the optimizer reuse the result of the function call after the first call?
I'm using Scala IDE for Eclipse. Are there any optimization flags for the compiler that might produce more efficient code?
Scala is not a purely functional language, so without an effect system it cannot know that factorial is pure (for example, it doesn't "know" anything about the multiplication of big ints).
You need to add your own memoization approach here. Most simply add a val f300 = factorial(300) outside your loop.
Here is a question about memoization.
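Beyond hoisting the single call, a hand-rolled memoization sketch (my own illustration, not code from the linked question) that caches factorial results in a mutable map:
import scala.collection.mutable

val factorialCache = mutable.Map.empty[BigInt, BigInt]

def factorialMemo(x: BigInt): BigInt =
  factorialCache.get(x) match {
    case Some(v) => v
    case None =>
      // compute once, remember the result, and reuse it on later calls
      val v: BigInt = if (x == 0) 1 else x * factorialMemo(x - 1)
      factorialCache(x) = v
      v
  }

// in the loop above, factorialMemo(300) is computed once and then looked up
r = r + factorialMemo(300)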