How to properly measure elapsed time in Spark? - scala

I have my code written in Spark and Scala. Now I need to measure the elapsed time of particular functions in the code.
Should I use spark.time like this? But then how can I properly assign the value of df?
val df = spark.time(myObject.retrieveData(spark, indices))
Or should I do it in this way?
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
val df = time{myObject.retrieveData(spark, indices)}
Update:
As recommended in the comments, I ran df.rdd.count inside myObject.retrieveData in order to materialise the DataFrame.
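Note that spark.time (SparkSession.time) evaluates the block, prints the elapsed time, and returns the block's result, so the assignment in the first snippet does work. With either approach the thing to watch is Spark's laziness: unless the timed block runs an action, you only measure query-plan construction. Below is a minimal sketch of the second approach with an explicit action inside the timed block (the count is just one way to force evaluation; myObject.retrieveData is assumed to return a DataFrame):
import org.apache.spark.sql.DataFrame

// Time the DataFrame construction *and* its evaluation, so the measurement
// covers the actual Spark job rather than just building the lazy plan.
val df: DataFrame = time {
  val result = myObject.retrieveData(spark, indices)
  result.count() // action that forces the DataFrame to be materialised
  result
}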

Related

Why do subsequent runs of scala function go orders of magnitude faster?

I'm seeing a very weird phenomenon. I'm hashing some mockup ssoids and timing it, since our application is timing critical. The first run of the function takes ~210 milliseconds, which is ridiculous and won't work. However, the 2nd, 3rd, and 4th runs take a few thousand nanoseconds each. I'm really confused, what's going on? Is there some caching I can do to make the first run just as fast?
See the code here:
object Main {
  def main(args: Array[String]): Unit = {
    val qq = "847bf1dc46ca22dc93259c5e857d6333"
    val oo = "847bf1dc46ca22dc9eriuerie9duy45"
    val xx = "909909ewufv9026854509erie9ifkf3"
    val ww = "65984jg3oiqh4g3q383423932824344"
    // getBytes and XxHash64 are hashing helpers from our project (definitions not shown here)
    val qqq = getBytes(qq)
    val ooo = getBytes(oo)
    val xxx = getBytes(xx)
    val www = getBytes(ww)
    val t1 = System.nanoTime
    val r1 = XxHash64.hashByteArray(qqq, seed = 0)
    val duration = System.nanoTime - t1
    val t2 = System.nanoTime
    val r2 = XxHash64.hashByteArray(ooo, seed = 0)
    val duration2 = System.nanoTime - t2
    val t3 = System.nanoTime
    val r3 = XxHash64.hashByteArray(xxx, seed = 0)
    val duration3 = System.nanoTime - t3
    val t4 = System.nanoTime
    val r4 = XxHash64.hashByteArray(www, seed = 0)
    val duration4 = System.nanoTime - t4
    println(duration)
    println(duration2)
    println(duration3)
    println(duration4)
    println(r1)
    println(r2)
    println(r3)
    println(r4)
  }
}
(Also note I recognize this is a slipshod way of doing timings, but I've just started.)
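A likely explanation for timings like these is JVM warm-up: the first call pays for class loading and interpreted execution before the JIT compiler optimises the hot path. A small sketch of warming the code up before measuring (the iteration count is arbitrary and purely illustrative):
// Warm up: run the hash many times so the JIT has compiled the hot path.
// The 'sink' variable keeps the results alive so the calls are not optimised away.
var sink = 0L
var i = 0
while (i < 10000) {
  sink ^= XxHash64.hashByteArray(qqq, seed = 0)
  i += 1
}

// The measured run should now reflect steady-state performance.
val tWarm = System.nanoTime
val rWarm = XxHash64.hashByteArray(qqq, seed = 0)
println(s"warmed-up time: ${System.nanoTime - tWarm} ns (sink=$sink)")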

Why does the RDD work slower than a usual List in Spark?

I wanted to compare the efficiency of an RDD against a usual List in Scala Spark.
val list = Range(1, 1000000, 1)
val dist_list = sc.parallelize(list)
val start_time = System.nanoTime
val sum = list.reduce((x,y) => x+y)
println(s"for list time is ${System.nanoTime - start_time}")
val s_time = System.nanoTime
val dist_sum = dist_list.reduce((x,y) => x+y)
println(s"for dist_list time is ${System.nanoTime - s_time}")
and got the result
for list time is 24849500
for dist_list time is 378051900
which means the RDD is about 15 times slower than the plain List operation. Why could that happen?
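Part of the gap is that reduce on an RDD launches a Spark job, so the measurement includes task scheduling, serialisation, and first-job setup, not just the additions. A sketch that separates the one-off setup from the per-job cost by caching the RDD and timing the reduce twice (just one way to look at it):
val cached = sc.parallelize(Range(1, 1000000, 1)).cache()
cached.count() // warm-up action: computes and caches the partitions

val tFirst = System.nanoTime
cached.reduce(_ + _) // still pays per-job scheduling overhead
println(s"first cached reduce:  ${System.nanoTime - tFirst} ns")

val tSecond = System.nanoTime
cached.reduce(_ + _) // gives a steadier picture of the per-job overhead
println(s"second cached reduce: ${System.nanoTime - tSecond} ns")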

Spark - aggregateByKey Type mismatch error

I am trying to find the problem behind this. I am trying to find the maximum Marks for each student using aggregateByKey.
val data = spark.sc.Seq(("R1","M",22),("R1","E",25),("R1","F",29),
    ("R2","M",20),("R2","E",32),("R2","F",52))
  .toDF("Name","Subject","Marks")

def seqOp = (acc: Int, ele: (String, Int)) => if (acc > ele._2) acc else ele._2
def combOp = (acc: Int, acc1: Int) => if (acc > acc1) acc else acc1

val r = data.rdd.map { case (t1, t2, t3) => (t1, (t2, t3)) }.aggregateByKey(0)(seqOp, combOp)
I am getting an error that aggregateByKey expects (Int,(Any,Any)) but the actual is (Int,(String,Int)).
Your map function is incorrect, since you have a Row as input, not a Tuple3.
Fix the last line with:
val r = data.rdd.map { r =>
  val t1 = r.getAs[String](0)
  val t2 = r.getAs[String](1)
  val t3 = r.getAs[Int](2)
  (t1, (t2, t3))
}.aggregateByKey(0)(seqOp, combOp)
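An equivalent way to write the same fix, assuming the Name/Subject/Marks schema above, is to pattern match on the Row directly:
import org.apache.spark.sql.Row

// Extract the three columns by pattern matching on each Row.
val r = data.rdd
  .map { case Row(name: String, subject: String, marks: Int) => (name, (subject, marks)) }
  .aggregateByKey(0)(seqOp, combOp)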

How to merge and aggregate 2 Maps in scala most efficiently?

I have the following 2 maps:
val map12:Map[(String,String),Double]=Map(("Sam","0203") -> 16216.0, ("Jam","0157") -> 50756.0, ("Pam","0129") -> 3052.0)
val map22:Map[(String,String),Double]=Map(("Jam","0157") -> 16145.0, ("Pam","0129") -> 15258.0, ("Sam","0203") -> -1638.0, ("Dam","0088") -> -8440.0,("Ham","0104") -> 4130.0,("Hari","0268") -> -108.0, ("Om","0169") -> 5486.0, ("Shiv","0181") -> 275.0, ("Brahma","0148") -> 18739.0)
In the first approach I am using foldLeft to achieve the merging and accumulation:
val t1 = System.nanoTime()
val merged1 = (map12 foldLeft map22)((map22, map12) => map22 + (map12._1 -> (map12._2 + map22.getOrElse(map12._1, 0.0))))
val t2 = System.nanoTime()
println(" First Time taken :"+ (t2-t1))
In the second approach I am trying to use aggregate() function which supports parallel operation:
def merge(map12:Map[(String,String),Double], map22:Map[(String,String),Double]):Map[(String,String),Double]=
map12 ++ map22.map{case(k, v) => k -> (v + (map12.getOrElse(k, 0.0)))}
val inArr= Array(map12,map22)
val t5 = System.nanoTime()
val mergedNew12 = inArr.par.aggregate(Map[(String,String),Double]())(merge,merge)
val t6 = System.nanoTime()
println(" Second Time taken :"+ (t6-t5))
But I notice that foldLeft is much faster than aggregate.
I am looking for advice on how to make this operation as efficient as possible.
If you want aggregate to be more efficient when run with par, try a Vector instead of an Array; it is one of the best collections for parallel algorithms.
On the other hand, parallelism has some overhead, so if you have too little data it will not pay off.
With the data you gave us, Vector.par.aggregate is better than Array.par.aggregate, but plain Vector.aggregate is better than foldLeft.
val inVector= Vector(map12,map22)
val t7 = System.nanoTime()
val mergedNew12_2 = inVector.aggregate(Map[(String,String),Double]())(merge,merge)
val t8 = System.nanoTime()
println(" Third Time taken :"+ (t8-t7))
These are my times
First Time taken :6431723
Second Time taken :147474028
Third Time taken :4855489
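As with the other timing questions here, single-shot System.nanoTime readings on maps this small are dominated by JIT warm-up and noise; a proper harness such as JMH is the rigorous option. A rough sketch that at least repeats each variant and keeps the best observed time (the iteration count is arbitrary):
// Repeat a block many times and report the minimum observed time in nanoseconds.
def bestTimeNs(iterations: Int)(body: => Unit): Long =
  (1 to iterations).map { _ =>
    val t0 = System.nanoTime()
    body
    System.nanoTime() - t0
  }.min

println("foldLeft best:  " + bestTimeNs(1000) {
  (map12 foldLeft map22)((acc, kv) => acc + (kv._1 -> (kv._2 + acc.getOrElse(kv._1, 0.0))))
} + " ns")

println("aggregate best: " + bestTimeNs(1000) {
  Vector(map12, map22).aggregate(Map[(String, String), Double]())(merge, merge)
} + " ns")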

How can I benchmark performance in Spark console?

I have just started using Spark and my interactions with it revolve around spark-shell at the moment. I would like to benchmark how long various commands take, but could not find how to get the time or run a benchmark. Ideally I would want to do something super-simple, such as:
val t = [current_time]
data.map(etc).distinct().reduceByKey(_ + _)
println([current time] - t)
Edit: Figured it out --
import org.joda.time._
val t_start = DateTime.now()
[[do stuff]]
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()
I suggest you do the following:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e9 + " seconds")
  ret
}
You can pass a function as an argument to the time function; it will compute the result of that function and print the time the function took to run.
Let's consider a function foobar that takes data as an argument; you can then do the following:
val test = time(foobar(data))
test will contain the result of foobar, and you'll get the time taken as well.
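One Spark-specific caveat to add: transformations are lazy, so the timed block should end with an action, otherwise time only measures building the lineage. Using the pipeline from the question (etc stands for whatever mapping function is actually applied):
// Wrap the pipeline plus an action so the timed block actually triggers the job.
val counted = time {
  data.map(etc).distinct().reduceByKey(_ + _).count()
}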