Spark running time in local changes a lot with "println" - scala

I ran into a puzzling, sharp increase in execution time.
I run my Scala code on local Spark; part of it builds an n*n matrix.
On a small dataset it takes just 5s to finish. The most time-consuming part is building a 2000*2000 matrix. This part runs inside a map and only works with arrays.
However, out of curiosity, I added a "println" inside the matrix-building code to see the number of iterations. Suddenly, the total running time increased to 1min23s.
The final results are the same.
I am new to Spark and have no idea what causes this.
The code is simply:
val x = someRDD.map(buildMatrix)
def buildMatrix(stringVect: Array[String]): Array[Array[Double]] = {
  // var count = 0
  val num = stringVect.length
  var simi_matrix = Array[Array[Double]]()
  for (i <- 0 until num - 1) {
    for (j <- (i + 1) until num) {
      // build the matrix with some computation
      // println(count)
      // count += 1
    }
  }
  simi_matrix
}

TL;DR
This has nothing to do with Spark. I/O access to the console is synchronized and costly. It will slow down any program on the JVM (Scala/Java/Clojure/...).
println defaults to java.lang.System.out, which is a PrintStream. println delegates to PrintStream#println and therefore enters the synchronized block of that implementation to write to the console. There are two expenses:
Getting a synchronized lock
I/O to the console OutputStream
The slowdown observed is to be expected. Just don't use println in hot parts of the code (like a tight loop in this case).
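To see the cost in isolation, the sketch below runs a nested loop of the same shape as buildMatrix and optionally prints the counter to a PrintStream. The names PrintlnCost, countPairs, and timeMs are made up for illustration; nothing here is Spark-specific, and actual timings will vary with the machine and the console.

```scala
import java.io.PrintStream

object PrintlnCost {
  // Same loop shape as the question; prints the counter only when `out` is defined.
  def countPairs(num: Int, out: Option[PrintStream]): Long = {
    var count = 0L
    for (i <- 0 until num - 1; j <- (i + 1) until num) {
      out.foreach(_.println(count)) // synchronized call plus I/O on every iteration
      count += 1
    }
    count
  }

  // Returns the block's result together with the elapsed wall-clock time in ms.
  def timeMs[A](block: => A): (A, Double) = {
    val t0 = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - t0) / 1e6)
  }
}
```

Comparing timeMs(countPairs(2000, None)) with timeMs(countPairs(2000, Some(System.out))) should show the gap. Printing only every N-th iteration is one way to keep some progress output without paying the cost on each pass.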

Related

How to allocate less memory to Scala with IntelliJ

I'm trying to crash my program (run in IntelliJ) with an OutOfMemoryError:
def OOMCrasher(acc: String): String = {
  OOMCrasher(acc + "ADSJKFAKLWJEFLASDAFSDFASDFASERASDFASEASDFASDFASERESFDHFDYJDHJSDGFAERARDSHFDGJGHYTDJKXJCV")
}
OOMCrasher("")
However, it just runs for a very long time. My suspicion is that it simply takes a very long time to fill up all the gigabytes of memory allocated to the JVM with a string. So I'm looking at how to make IntelliJ allocate less memory to the JVM. Here's what I've tried:
In Run Configurations -> VM options:
--scala.driver.memory 1k || --driver.memory 1k
Both of these cause crashes with:
Unrecognized option: --scala.driver.memory
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
I've also tried putting the options in Edit Configurations -> Program Arguments. The program then again runs for a very long time without yielding an OutOfMemoryError.
EDIT:
I will accept any answer that successfully explains how to allocate less memory to the program, since that is the main question.
UPDATE:
Changing the function to:
def OOMCrasher(acc: HisList[String]): HisList[String] = {
  OOMCrasher(acc.add("Hi again!"))
}
OOMCrasher(Cons("Hi again!", Empty))
where HisList is a simple LinkedList implementation, and running with -Xmx3m, caused the wanted exception.
Reaching an OutOfMemoryError functionally is harder than it looks, because recursive functions almost always run into a StackOverflowError first.
But there is an approach using mutable state that will guarantee an OutOfMemoryError: doubling a List over and over again. Scala's Lists are not limited by the maximum array size and thus can expand until there is just no more memory left.
Here's an example:
def explodeList[A](list: List[A]): Unit = {
  var mlist = list
  while (true) {
    mlist = mlist ++ mlist
  }
}
To answer your actual question, set the JVM option -Xmx___m (e.g. -Xmx256m) in the VM options field. This defines the maximum heap size the JVM is allowed to allocate.
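One way to confirm that the setting took effect is to print the heap limit from inside the program. This is a minimal sketch, assuming -Xmx256m (or similar) was placed in the VM options field of the run configuration; HeapInfo is a made-up name, while Runtime.getRuntime.maxMemory is the standard JVM call.

```scala
object HeapInfo {
  // Maximum heap the JVM will attempt to use, in megabytes.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap: $maxHeapMb MB")
}
```

The reported number may differ slightly from the configured value, but it should be in the same range. If it still shows gigabytes, the option most likely did not reach the JVM.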

Count number of elements in a text or list using Spark

I know there are different ways to count the number of elements in a text or list, but I am trying to understand why this one does not work. I am trying to write code equivalent to
A_RDD=sc.parallelize(['a', 1.2, []])
acc = sc.accumulator(0)
acc.value
A_RDD.foreach(lambda _: acc.add(1))
acc.value
Where the result is 3.
To do so I defined the following function called my_count(_), but I don't know how to get the result. A_RDD.foreach(my_count) does not do anything. I didn't receive any error either. What did I do wrong?
counter = 0  # function that counts elements
def my_count(_):
    global counter
    counter += 1
A_RDD.foreach(my_count)
The A_RDD.foreach(my_count) operation doesn't run in your local Python interpreter. It runs on the remote executor nodes. The driver ships your my_count function to each executor node along with the counter variable, since the function refers to it. So each executor node gets its own copy of counter, which is updated by foreach, while the counter variable defined in your driver application is not incremented.
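The copy semantics can be imitated on a plain JVM without Spark. The sketch below is only an analogy, assuming that shipping state to an executor behaves like a serialize/deserialize round trip; ShippedCounter, Counter, ship, and demo are hypothetical names, and no Spark API is involved.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ShippedCounter {
  // Hypothetical stand-in for the driver-side `counter` variable.
  final class Counter(var value: Int) extends java.io.Serializable

  // Serialize and deserialize: the "executor" receives an independent copy.
  def ship[A <: java.io.Serializable](a: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  // Returns (driver-side value, executor-side value) after three increments.
  def demo(): (Int, Int) = {
    val driverCounter = new Counter(0)
    val executorCounter = ship(driverCounter)
    (1 to 3).foreach(_ => executorCounter.value += 1) // updates the copy only
    (driverCounter.value, executorCounter.value)
  }
}
```

The copy is incremented while the original stays untouched, which mirrors why the driver's counter never changes. An accumulator, as in the first snippet of the question, exists precisely to send such updates back to the driver.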
An easy but risky solution is to collect the RDD on the driver and then compute the count as below. This is risky because the entire RDD content is loaded into the driver's memory, which may cause a MemoryError.
>>> len(A_RDD.collect())
3
So what if you are running locally and not on a cluster? In Spark/Scala this behaviour differs between local mode and a cluster: locally the variable would have the expected value, but on a cluster it would not, just as you describe. Does the same thing happen in Spark/Python? My guess is that it does.

Why do the execution time of independent code blocks depend on execution order in Scala? [duplicate]

This question already has answers here:
How do I write a correct micro-benchmark in Java?
(11 answers)
Closed 6 years ago.
I have a program written in Scala. I wanted to measure the execution time of different independent code blocks. When I did it in the obvious way (i.e. inserting System.nanoTime() before and after each block), I observed that the execution time depends on the ordering of the blocks. The first few blocks always took more time than the others.
I created a minimalistic example that reproduces this behaviour. All code blocks are the same and call hashCode() for an array of integers, for simplicity.
package experiments
import scala.util.Random
/**
 * Measuring execution time of a code block
 *
 * Minimalistic example
 */
object CodeBlockMeasurement {
  def main(args: Array[String]): Unit = {
    val numRecords = args(0).toInt
    // number of independent measurements
    val iterations = args(1).toInt
    // Changes results a little bit, but not too much
    // val records2 = Array.fill[Int](1)(0)
    // records2.foreach(x => {})
    for (_ <- 1 to iterations) {
      measure(numRecords)
    }
  }
  def measure(numRecords: Int): Unit = {
    // using a new array every time
    val records = Array.fill[Int](numRecords)(new Random().nextInt())
    // block of code to be measured
    def doSomething(): Unit = {
      records.foreach(k => k.hashCode())
    }
    // measure execution time of the code-block
    elapsedTime(doSomething(), "HashCodeExperiment")
  }
  def elapsedTime(block: => Unit, name: String): Unit = {
    val t0 = System.nanoTime()
    val result = block
    val t1 = System.nanoTime()
    // print out elapsed time in milliseconds
    println(s"$name took ${(t1 - t0).toDouble / 1000000} ms")
  }
}
After running the program with numRecords = 100000 and iterations = 10, my console looks like this:
HashCodeExperiment took 14.630283 ms
HashCodeExperiment took 7.125693 ms
HashCodeExperiment took 0.368151 ms
HashCodeExperiment took 0.431628 ms
HashCodeExperiment took 0.086455 ms
HashCodeExperiment took 0.056458 ms
HashCodeExperiment took 0.055138 ms
HashCodeExperiment took 0.062997 ms
HashCodeExperiment took 0.063736 ms
HashCodeExperiment took 0.056682 ms
Can somebody explain why that is? Shouldn't they all be the same? Which is the real execution time?
Thanks a lot,
Peter
Environment parameters:
OS: ubuntu 14.04 LTS (64 bit)
IDE: IntelliJ IDEA 2016.1.1 (IU-145.597)
Scala: 2.11.7
It's Java's JIT kicking in. Initially the plain bytecode is interpreted, but after a method has been invoked often enough (1.5k/10k invocations by default for Oracle JVM, see -XX:CompileThreshold) it is compiled to optimized native code, which usually results in quite drastic performance improvements.
As Ivan mentions, there's also caching of intermediate bytecode/native code and various other technologies involved. One of the most significant is the garbage collector, which causes even more variance in individual results. Depending on how heavily the code allocates new objects, this might absolutely trash performance whenever GC occurs, but that's a separate issue.
To remove such outliers when microbenchmarking, it is recommended that you benchmark multiple iterations of the action, discard the bottom and top 5-10% of results, and base the performance estimate on the remaining samples.
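The trimming advice above can be sketched as follows. This is a rough illustration rather than a replacement for a dedicated harness such as JMH; TrimmedBenchmark, timeMs, trimmedMean, and benchmark are made-up names, and the warm-up count is an arbitrary assumption.

```scala
object TrimmedBenchmark {
  // Times a single execution of `block`, in milliseconds.
  def timeMs(block: => Unit): Double = {
    val t0 = System.nanoTime()
    block
    (System.nanoTime() - t0) / 1e6
  }

  // Mean of the samples after dropping the lowest and highest `trim` fraction.
  def trimmedMean(samples: Seq[Double], trim: Double = 0.1): Double = {
    val sorted = samples.sorted
    val drop = (sorted.length * trim).toInt
    val kept = sorted.slice(drop, sorted.length - drop)
    kept.sum / kept.length
  }

  // Runs `block` untimed for `warmup` rounds, then returns the trimmed mean of `runs` timed rounds.
  def benchmark(warmup: Int, runs: Int)(block: => Unit): Double = {
    (1 to warmup).foreach(_ => block)
    trimmedMean(Seq.fill(runs)(timeMs(block)))
  }
}
```

For the example in the question, something like TrimmedBenchmark.benchmark(warmup = 20, runs = 100)(records.foreach(k => k.hashCode())) would report a figure closer to the steady-state time than the first few raw measurements.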
Short answer: caching.
These are independent code blocks, but the runs can't be fully independent because they execute in the same JVM instance, in the same process, on the same CPU.
The JVM has a lot of optimizations built in, including caching, and modern CPUs do the same. As a result, it is quite common for a re-run to take less time than the first run.

Slow Performance When Using Scalaz Task

I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which simulates each path sequentially, as follows:
def simulate() = {
  for (path <- 0 to 30000) {
    (0 to 100).foreach(
      x => // do some computation
    )
  }
}
This basically is simulating 30,000 paths and each path has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000).map(n => Task {
  (1 |-> 100).map(x =>
    // do some computation
  )
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work, as I can see that in the activity monitor and the machine fan is going ballistic.
After about two minutes of heavy activity, the work seems to finish, but I don't get any value returned (I am expecting a collection of Doubles from each task that was processed).
My questions are:
Why does this take longer than the sequential example? I am more than likely doing something wrong but I can't see it.
Why don't I get any returned collection of values from the tasks that are apparently being processed?
I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on GitHub about Task.gatherUnordered. This definitely should be fixed.

Efficiency/scalability of parallel collections in Scala (graphs)

So I've been working with parallel collections in Scala for a graph project. I've got the basics of the graph class defined; it currently uses a scala.collection.mutable.HashMap where the key is Int and the value is ListBuffer[Int] (adjacency list). (EDIT: This has since been changed to ArrayBuffer[Int].)
I had done a similar thing a few months ago in C++, with a std::vector<int, std::vector<int> >.
What I'm trying to do now is run a metric between all pairs of vertices in the graph, so in C++ I did something like this:
// myVec = std::vector<int> of vertices
for (std::vector<int>::iterator iter = myVec.begin(); iter != myVec.end(); ++iter) {
    for (std::vector<int>::iterator iter2 = myVec.begin();
         iter2 != myVec.end(); ++iter2) {
        /* Run algorithm between *iter and *iter2 */
    }
}
I did the same thing in Scala, parallelized, (or tried to) by doing this:
// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
vertexList.par.foreach(u =>
  vertexList.foreach(v =>
    /* Run algorithm between u and v */
  )
)
The C++ version is clearly single-threaded, the Scala version has .par so it's using parallel collections and is multi-threaded on 8 cores (same machine). However, the C++ version processed 305,570 pairs in a span of roughly 3 days, whereas the Scala version so far has only processed 23,573 in 17 hours.
Assuming I did my math correctly, the single-threaded C++ version is roughly 3x faster than the Scala version. Is Scala really that much slower than C++, or am I completely misusing Scala (I only recently started - I'm about 300 pages into Programming in Scala)?
Thanks!
-kstruct
EDIT To use a while loop, do I do something like..
// Where vertexList is an Array[Int]
vertexList.par.foreach(u =>
while (i <- 0 until vertexList.length) {
/* Run algorithm between u and vertexList(i) */
}
}
If you guys mean use a while loop for the entire thing, is there an equivalent of .par.foreach for whiles?
EDIT2 Wait a second, that code isn't even right - my bad. How would I parallelize this using while loops? If I have some var i that keeps track of the iteration, then wouldn't all threads be sharing that i?
From your comments, I see that you're updating a shared mutable HashMap at the end of each algorithm run. And if you're randomizing your walks, a shared Random is also a contention point.
I recommend two changes:
Use .map and .flatMap to return an immutable collection instead of modifying a shared collection.
Use a ThreadLocalRandom (from either Akka or Java 7) to reduce contention on the random number generator
Check the rest of your algorithm for further possible contention points.
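The two recommendations can be sketched together as below. Note the swap: this version uses scala.concurrent.Future instead of .par so that it compiles without the separate parallel-collections module needed on newer Scala versions. PairMetric, metric, and allPairs are hypothetical names, and the metric itself is a placeholder for the real algorithm.

```scala
import java.util.concurrent.ThreadLocalRandom
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PairMetric {
  // Placeholder metric: distance plus per-thread random noise in [0, 3).
  def metric(u: Int, v: Int): Int =
    math.abs(u - v) + ThreadLocalRandom.current().nextInt(0, 3)

  // One Future per source vertex; each returns its own immutable results,
  // so no shared mutable map is touched while the work runs.
  def allPairs(vertices: Vector[Int]): Map[(Int, Int), Int] = {
    val perVertex: Vector[Future[Vector[((Int, Int), Int)]]] =
      vertices.map(u => Future(vertices.map(v => (u -> v) -> metric(u, v))))
    Await.result(Future.sequence(perVertex), Duration.Inf).flatten.toMap
  }
}
```

The results are merged into a Map only once, after all tasks have finished, which removes the shared HashMap as a contention point.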
You may try running the inner loop in parallel, too. But without knowing your algorithm, it's hard to know if that will help or hurt. Fortunately, running all combinations of parallel and sequential collections is very simple; just switch out pVertexList and vertexList in the code below.
Something like this:
val pVertexList = vertexList.par
val allResult = for {
  u <- pVertexList
  v <- pVertexList
} yield {
  /* Run algorithm between u and v */
  ((u -> v) -> result)
}
The value allResult will be a ParVector[((Int, Int), Int)]. You may call .toMap on it to convert that into a Map.
Why mutable? I don't think there's a good parallel mutable map on Scala 2.9.x -- particularly because just such a data structure was added to the upcoming Scala 2.10.
On the other hand... you have a List[Int]? Don't use that, use a Vector[Int]. Also, are you sure you aren't wasting time elsewhere, doing the conversions from your mutable maps and buffers into immutable lists? Scala data structures are different from C++'s, so you might well be running into complexity problems elsewhere in the code.
Finally, I think dave might be onto something when he asks about contention. If you have contention, parallelism might well make things slower. How much faster or slower does it run if you do not make it parallel? If making it not parallel makes it faster, then you most likely do have contention issues.
I'm not completely sure about it, but I think foreach loops in foreach loops are rather slow, because lots of objects get created. See: http://scala-programming-language.1934581.n4.nabble.com/for-loop-vs-while-loop-performance-td1935856.html
Try rewriting it using a while loop.
Also, Lists are only efficient for head access; Arrays are probably faster.
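As for sharing the loop variable across threads: if the index is declared inside the function that handles one vertex u, every call gets its own copy. A minimal sketch, assuming vertexList is an Array[Int]; WhileInner, metric, and sumForVertex are made-up names and the metric is a placeholder.

```scala
object WhileInner {
  // Placeholder for the real algorithm between u and v.
  def metric(u: Int, v: Int): Int = u + v

  // Inner loop as a while loop; `i` and `acc` are local to each call,
  // so concurrent calls for different u do not share them.
  def sumForVertex(u: Int, vertexList: Array[Int]): Long = {
    var i = 0
    var acc = 0L
    while (i < vertexList.length) {
      acc += metric(u, vertexList(i))
      i += 1
    }
    acc
  }
}
```

The outer loop can then stay as it was, for example vertexList.par.map(u => WhileInner.sumForVertex(u, vertexList)), with only the inner iteration rewritten.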