How to time Spark program execution speed - scala

I want to time my Spark program execution speed but due to laziness it's quite difficult. Let's take into account this (meaningless) code here:
var graph = GraphLoader.edgeListFile(context, args(0))
val graph_degs = graph.outerJoinVertices(graph.degrees).triplets.cache
/* I'd need to start the timer here */
val t1 = System.currentTimeMillis
val edges = graph_degs.flatMap(trip => { /* do something*/ })
.union(graph_degs)
val count = edges.count
val t2 = System.currentTimeMillis
/* I'd need to stop the timer here */
println("It took " + t2-t1 + " to count " + count)
The thing is, transformations are lazily so nothing gets evaluated before val count = edges.count line. But according to my point of view t1 gets a value despite the code above hasn't a value... the code above t1 gets evaluated after the timer started despite the position in the code. That's a problem...
In Spark Web UI I can't find anything interesting about it since I need the time spent after that specific line of code. Do you think is there a easy solution to see when a group of transformation gets evaluated for real?

Since consecutive transformations (within the same task - meaning, they are not separated by shuffles and performed as part of the same action) are performed as a single "step", Spark does not measure them individually. And from Driver code - you can't either.
What you can do is measure the duration of applying your function to each record, and use an Accumulator to sum it all up, e.g.:
// create accumulator
val durationAccumulator = sc.longAccumulator("flatMapDuration")
// "wrap" your "doSomething" operation with time measurement, and add to accumulator
val edges = rdd.flatMap(trip => {
val t1 = System.currentTimeMillis
val result = doSomething(trip)
val t2 = System.currentTimeMillis
durationAccumulator.add(t2 - t1)
result
})
// perform the action that would trigger evaluation
val count = edges.count
// now you can read the accumulated value
println("It took " + durationAccumulator.value + " to flatMap " + count)
You can repeat this for any individual transformation.
Disclaimers:
Of course, this will not include the time Spark spent shuffling things around and doing the actual counting - for that, indeed, the Spark UI is your best resource.
Note that accumulators are sensitive to things like retries - a retried task will update the accumulator twice.
Style Note:
You can make this code more reusable by creating a measure function that "wraps" around any function and updates a given accumulator:
// write this once:
def measure[T, R](action: T => R, acc: LongAccumulator): T => R = input => {
val t1 = System.currentTimeMillis
val result = action(input)
val t2 = System.currentTimeMillis
acc.add(t2 - t1)
result
}
// use it with any transformation:
rdd.flatMap(measure(doSomething, durationAccumulator))

The Spark Web UI records every single action, and even reports times of every stage of that action - it's all in there! You need to looks through the stages tab, not the jobs. I've found it's only useable though if you compile and submit your code. It is useless in the REPL, are you using this by any chance?

Related

ZStream ignores parallel operation and executes it sequentially instead

Following code is supposed to execute putStrLn effect in parallel because of mapMPar:
val runtime = zio.Runtime.default
val foo = ZIO.sleep(5.second) *> ZIO("foo")
val bar = ZIO("bar")
val k = ZStream.fromEffect(foo) ++ ZStream.fromEffect(bar)
val r = k.mapMPar(3)(x => console.putStrLn(s"Processing `${x}`"))
runtime.unsafeRun(r.runDrain)
But in fact it always process foo before bar no matter what. Did I miss something or it's a bug?
I think your example just doesn't do what you expect it to do. fromEffect creates a stream which basically says "I have an effect that will eventually generate a single item", the first stream then waits 5 seconds before producing that item. Due to the nature of the stream the ++ or concat operator is lazy, meaning it can't start processing until all items have been consumed from the first stream (which can't happen for 5 seconds). As a result your stream really looks like this:
--5s--(foo)(bar)|
instead of what I imagine you think it should like:
(bar)--5s--(foo)|
The best way to perhaps think about it is that for most of the stream you have a single lane highway, only one item can move at a time, and all subsequent items are blocked by items at the head of the line. Once you hit that Par block you are opening up to multiple lanes of traffic, meaning that faster moving stuff could potentially overtake.
Thus I can achieve the desired behavior by doing something like this instead:
val k = ZStream("foo", "bar")
val r = k.mapMPar(3)(x => putStrLn(s"$x:enter") *> (ZIO.sleep(5.second) *> putStrLn(s"Processing `${x}`")) <* putStrLn(s"$x:exit"))
r.runDrain
Or written slightly more compact:
ZStream("foo", "bar").mapMPar(3)(x => for {
_ <- putStrLn(s"$x:enter")
_ <- ZIO.sleep(5.seconds) *> putStrLn(s"Processing `$x`")
_ <- putStrLn(s"$x:exit")
} yield ()).runDrain

Time taken by a while loop and recursion

I am not asking should i use recursion or iteration, or which is faster between them. I was trying to understand the iteration and recursion time taken, and I come up with an interesting pattern in the time taken of the both, which was what ever was on the top of the file is taking more time than the other.
For example: If I am writing for loop in the beginning it is taking more time than recursion and vice versa. The difference between time taken in both of the process are significently huge aprox 30 to 40 times.
My questions are:-
Is order of a loop and recursion matters?
Is there something related to print?
What could be the possible reason for such behaviour?
following is the code I have in the same file and the language I am using is scala?
def count(x: Int): Unit = {
if (x <= 1000) {
print(s"$x ")
count(x + 1)
}
}
val t3 = System.currentTimeMillis()
count(1)
val t4 = System.currentTimeMillis()
println(s"\ntime taken by the recursion look = ${t4 - t3} mili second")
var c = 1
val t1 = System.currentTimeMillis()
while(c <= 1000)
{
print(s"$c ")
c+=1
}
val t2 = System.currentTimeMillis()
println(s"\ntime taken by the while loop = ${t2 - t1} mili second")
In this situation the time taken for recursion and while loop are 986ms, 20ms respectively.
When I switch the position of loop and recursion which means first loop then recursion, time taken for recursion and while loop are 1.69 sec and 28 ms respectively.
Edit 1:
I can see the same behaviour with bufferWriter if the recursion code is on the top. But not the case when recursion is below the loop. When recursion is below the loop it is taking almost same time with the difference of 2 to 3 ms.
If you wanted to convince yourself that the tailrec-optimization works, without relying on any profiling tools, here is what you could try:
Use way more iterations
Throw away the first few iterations to give the JIT the time to wake up and do the hotspot-optimizations
Throw away all unpredictable side effects like printing to stdout
Throw away all costly operations that are the same in both approaches (formatting numbers etc.)
Measure in multiple rounds
Randomize the number of repetitions in each round
Randomize the order of variants within each round, to avoid any "catastrophic resonance" with the cycles of the garbage collector
Preferably, don't run anything else on the computer
Something along these lines:
def compare(
xs: Array[(String, () => Unit)],
maxRepsPerBlock: Int = 10000,
totalRounds: Int = 100000,
warmupRounds: Int = 1000
): Unit = {
val n = xs.size
val times: Array[Long] = Array.ofDim[Long](n)
val rng = new util.Random
val indices = (0 until n).toList
var totalReps: Long = 0
for (round <- 1 to totalRounds) {
val order = rng.shuffle(indices)
val reps = rng.nextInt(maxRepsPerBlock / 2) + maxRepsPerBlock / 2
for (i <- order) {
var r = 0
while (r < reps) {
r += 1
val start = System.currentTimeMillis
(xs(i)._2)()
val end = System.currentTimeMillis
if (round > warmupRounds) {
times(i) += (end - start)
}
}
}
if (round > warmupRounds) {
totalReps += reps
}
}
for (i <- 0 until n) {
println(f"${xs(i)._1}%20s : ${times(i) / totalReps.toDouble}")
}
}
def gaussSumWhile(n: Int): Long = {
var acc: Long = 0
var i = 0
while (i <= n) {
acc += i
i += 1
}
acc
}
#annotation.tailrec
def gaussSumRec(n: Int, acc: Long = 0, i: Int = 0): Long = {
if (i <= n) gaussSumRec(n, acc + i, i + 1)
else acc
}
compare(Array(
("while", { () => gaussSumWhile(1000) }),
("#tailrec", { () => gaussSumRec(1000) })
))
Here is what it prints:
while : 6.737733046257334E-5
#tailrec : 6.70325653896487E-5
Even the simple hints above are sufficient for creating a benchmark that shows that the while loop and the tail-recursive function take roughly the same time.
Scala does not compile into machine code but into bytecode for the "Java Virtual Machine"(JVM) which then interprets that code on the native processor. The JVM uses multiple mechanisms to optimise code that is run frequently, eventually converting the frequently-called functions ("hotspots") into pure machine code.
This means that testing the first run of a function does not give a good measure of eventual performance. You need to "warm up" the JIT compiler by running the test code many times before attempting to measure the time taken.
Also, as noted in the comments, doing any kind of I/O is going to make timings very unreliable because there is a danger that the I/O will block. Write a test case that does not do any blocking, if possible.

Calculating Requests Per Minute From Timetstamps in RDD during mapping

I am currently trying to enrich data for machine learning with requests per minute. The data is stored in a Kafka topic and on application start the whole content of the topic is loaded and processed - therefore it is not possible to use any window operations of spark streaming to my knowledge, as all data will arrive at the same time.
My approach was to try the following:
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()
(duration, rpm)
})
However on this approach I get the following exception:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Is there a way to achieve what I want to do?
If you need any more information just drop me a comment and I will update what you need.
EDIT:
Broadcasting an RDD does not work. Broadcasting the collected Array does not result in an acceptable performance.
What will be executed but is horribly slow and therefore not really an option:
val collected = kMeansInformationRdd.collect()
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
UPDATE:
This code is at least able to get the job done way faster - but as far as I see it still gets slower the higher the requests per minute are as the filtered array grows - interesting enough it gets slower towards the end what I cannot figure out why. If someone sees the issue - or performance issues that could be generally improved - I would be happy if you let me know.
kMeansInformationRdd = kMeansInformationRdd.cache()
kMeansInformationRdd.sortBy(_._2, true)
var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0
kMeansInformationRdd.collect.foreach(x => {
val ts = x._2
val identifier = x._1 //make sure the identifier represents actually the entity that receives the traffic -e.g. machine (IP?) not only endpoint
var bufferInstance = buffer.get(identifier).getOrElse(Array[Long]())
bufferInstance = bufferInstance ++ Array(ts)
bufferInstance = bufferInstance.filter(p => p > ts-1000)
buffer.put(identifier, bufferInstance)
val rpm = bufferInstance.size.toLong
kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) //identifier, duration, rpm
counter = counter +1
if(counter % 10000==0){
println(counter)
println((identifier, x._3, rpm))
println((instanceSizeBefore, instanceSizeAfter))
}
})
val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
The code that is given in the EDIT section is not correct. It is not the correct way a variable is broadcasted in Spark. The correct way is as follows:
val collected = sc.broadcast(kMeansInformationRdd.collect())
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
Of course, you can use global variables as well instead of sc.broadcast, but that is not recommended. Why?
The reason is that the difference between using an external variable DIRECTLY(as my so called "global variable"), and BROADCASTING a variable using sc.broadcast() is:
When using the external variable directly, spark will send a copy of the serialized variable together with each TASK. Whereas by sc.broadcast, the variable is sent one copy per EXECUTOR. The number of Task is normally 10 times larger than the Executor. So when the variable (say an array) is large enough (more than 10-20K), the former operation may cost a lot time on network transformation and cause frequent GC, which slows the spark down. Hence large variable(>10-20K) is suggested to be broadcasted explicitly.
When using the external variable directly the variable is not persisted, it ends with the task and thus can not be reused. Whereas by sc.broadcast() the variable is auto-persisted in the executors' memory, it lasts until you explicitly unpersist it. Thus sc.broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested.

Joining process with broadcast variable ends up endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams doc1 3-grams doc2
ion - 100772C111 ion - 200772C222
on - 100772C111 gon - 200772C222
n - 100772C111 n - 200772C222
... - .... ... - ....
ion - 3332145654 on - 58898874
mju - 3332145654 mju - 58898874
... - .... ... - ....
In each file, doc numbers (doc1 or doc2) appear one under the other. And as a result of join I would like to get a number of common 3-grams between the docs.e.g.
(100772C111-200772C222,2) --> There two common 3-grams which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. I set my IntelliJ configurations - VM options part with -Xmx64G
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So as seen, during join process using broadcast variable my key values change. So it seems I need to repartition the joined values? And it is highly expensive. As a result, i ended up too much spilling issue, and it never ended. I think 128 GB memory must be sufficient. As far as I understood, when broadcast variable is used shuffling is being decreased significantly? So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark is first moving the ENTIRE dataset into the master node, a huge bottleneck and prone to produce memory errors on the master node. Then this piece of memory is shuffled back to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Figure out also if you have too few keys. For joining Spark basically needs to load the entire key into memory, and if your keys are too few that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs, that gets you the 3gram key and 2 lists of doc#
Process those in a single scala function, output a set of all unique permutations of (doc-pairs).
then do coutByKey or countByKeyAprox to get a count of the number of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Change line => (line.split("\t")(3),line.split("\t")(1))) for line => { val s = line.split("\t"); (s(3),s(1)))
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tunning guides here and here.
Also, if you are running it on a single machine, you might try with a single, huge executor (or even ditch Spark completely), as you don't need overhead of a distributed processing platform (and distributed hardware failure tolerance, etc).

My RDD change his values himself

I have a basic RDD[Object] on which i apply a map with a hashfunction on Object values using nextGaussian and nextDouble scala function. And when i print values there change at each print
def hashmin(x:Data_Object, w:Double) = {
val x1 = x.get_vector.toArray
var a1 = Array(0.0).tail
val b = Random.nextDouble * w
for( ind <- 0 to x1.size-1) {
val nG = Random.nextGaussian
a1 = a1 :+ nG
}
var sum = 0.0
for( ind <- 0 to x1.size-1) {
sum = sum + (x1(ind)*a1(ind))
}
val hash_val = (sum+b)/w
val hash_val1 = (x.get_id,hash_val)
hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There're few options:
Cache the RDD - this will cause the lineage to be broken and the results of the first transformation will be preserved:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, sothat the pseudo-random sequence generated is each time the same.
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.