Scala parallel collections

I am very naively trying to use Scala .par, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that?
Note: the question is not to make this faster, but to understand why this naive use of .par doesn't yield an immediate speed-up.
Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20 s. The second one I killed after 3 minutes. Not even close. If I let it run longer, I run into a Java heap space exception.
def pi_random(N: Long): Double = {
  // rng is assumed to be a shared scala.util.Random instance (not shown in the question)
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

Hard to know for sure without doing some actual profiling, but I have two theories:
First, you may be losing some benefits of the Range class, specifically near-zero memory usage. When you do (0L until N * N), you create a Range object, which is lazy. It does not actually create an object holding every single number in the range. Neither does map, I think. And sum calculates and adds the numbers one at a time, so it also allocates barely any memory.
I'm not sure the same is all true of ParRange. It seems like it would have to allocate some amount per split, and after map is called, it might have to store some amount of intermediate results in memory as "neighboring" splits wait for each other to complete. Especially the heap space exception makes me think something like this is the case. So you'll lose a lot of time to GC and such.
Second, the calls to rng.nextDouble are probably by far the most expensive part of that inner function. But I believe both the Java and Scala Random classes are essentially single-threaded: they synchronize and block internally. So you won't gain that much from parallelism anyway, and in fact you'll lose some to overhead.
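If the second theory is right, swapping the shared generator for a per-thread one should change the picture. A quick experiment to test that (a sketch of my own, not part of the original question):

import java.util.concurrent.ThreadLocalRandom

def pi_random_tlr(N: Long): Double = {
  val count = (0L until N * N)
    .par
    .map { _ =>
      val rng = ThreadLocalRandom.current  // per-thread generator, no shared state
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

If this version is dramatically faster than the shared-rng one, contention was the bottleneck; if it still crawls, the per-element overhead discussed in the next answer dominates.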

There is not enough work per task; the task granularity is too fine-grained.
Creating each task requires some overhead:
Some object representing the task must be created
It must be ensured that only one thread executes one task at a time
In the case that some threads become idle, some work-stealing procedure must be invoked.
For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers and performs some basic arithmetic and an if-branch. The overhead of creating a task is out of all proportion to the work that each task is doing.
The tasks must be much larger, so that each thread has enough work to do. Furthermore, it's probably faster if you make the RNG thread-local, so that the threads can do their job in parallel without constantly contending for the default random number generator.
Here is an example:
import java.util.concurrent.ThreadLocalRandom
import scala.util.Random

def pi_random(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}
def pi_random_parallel(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}
def pi_random_properly(n: Long): Double = {
  val count = (0L until n).par.map { _ =>
    val rng = ThreadLocalRandom.current
    var sum = 0
    var idx = 0
    while (idx < n) {
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1.0) sum += 1
      idx += 1
    }
    sum
  }.sum
  4 * count.toDouble / (n * n)
}
Here is a little demo and timings:
def measureTime[U](repeats: Long)(block: => U): Double = {
  val start = System.currentTimeMillis
  var iteration = 0
  while (iteration < repeats) {
    iteration += 1
    block
  }
  val end = System.currentTimeMillis
  (end - start).toDouble / repeats
}
// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))
// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential: " + measureTime(Reps)(pi_random(N)))
println("Naive: " + measureTime(Reps)(pi_random_parallel(N)))
println("My proposal: " + measureTime(Reps)(pi_random_properly(N)))
Output:
3.141333
3.143418
3.14142
Sequential: 621.7
Naive: 3032.6
My proposal: 44.7
Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).
I couldn't test it with N = 10000, because the naively parallelized version crashed everything with a "GC overhead limit exceeded" error, which also illustrates that the overhead for creating the tiny tasks is too large.
In my implementation, I've additionally hand-rolled the inner loop as a while: you need only one counter in one register, and there is no need to create a huge collection by mapping over the range.
Edit: Replaced everything by ThreadLocalRandom; it now shouldn't matter whether your compiler version supports SAM or not, so it should also work with earlier versions of 2.11.

Related

Time taken by a while loop and recursion

I am not asking whether I should use recursion or iteration, or which of them is faster. I was trying to understand the time taken by iteration and recursion, and I came across an interesting pattern: whichever of the two is at the top of the file takes more time than the other.
For example: if I write the for loop at the beginning, it takes more time than the recursion, and vice versa. The difference between the times taken by the two is significantly huge, approximately 30 to 40 times.
My questions are:
Does the order of the loop and the recursion matter?
Is there something related to print?
What could be the possible reason for such behaviour?
Following is the code I have in the same file; the language I am using is Scala.
def count(x: Int): Unit = {
  if (x <= 1000) {
    print(s"$x ")
    count(x + 1)
  }
}

val t3 = System.currentTimeMillis()
count(1)
val t4 = System.currentTimeMillis()
println(s"\ntime taken by the recursion = ${t4 - t3} milliseconds")

var c = 1
val t1 = System.currentTimeMillis()
while (c <= 1000) {
  print(s"$c ")
  c += 1
}
val t2 = System.currentTimeMillis()
println(s"\ntime taken by the while loop = ${t2 - t1} milliseconds")
In this situation the times taken by the recursion and the while loop are 986 ms and 20 ms respectively.
When I switch the positions of the loop and the recursion (first the loop, then the recursion), the times taken by the recursion and the while loop are 1.69 s and 28 ms respectively.
Edit 1:
I can see the same behaviour with a BufferedWriter if the recursion code is at the top, but not when the recursion is below the loop. When the recursion is below the loop, the two take almost the same time, with a difference of 2 to 3 ms.
If you wanted to convince yourself that the tail-recursion optimization works, without relying on any profiling tools, here is what you could try:
Use way more iterations
Throw away the first few iterations to give the JIT the time to wake up and do the hotspot-optimizations
Throw away all unpredictable side effects like printing to stdout
Throw away all costly operations that are the same in both approaches (formatting numbers etc.)
Measure in multiple rounds
Randomize the number of repetitions in each round
Randomize the order of variants within each round, to avoid any "catastrophic resonance" with the cycles of the garbage collector
Preferably, don't run anything else on the computer
Something along these lines:
def compare(
xs: Array[(String, () => Unit)],
maxRepsPerBlock: Int = 10000,
totalRounds: Int = 100000,
warmupRounds: Int = 1000
): Unit = {
val n = xs.size
val times: Array[Long] = Array.ofDim[Long](n)
val rng = new util.Random
val indices = (0 until n).toList
var totalReps: Long = 0
for (round <- 1 to totalRounds) {
val order = rng.shuffle(indices)
val reps = rng.nextInt(maxRepsPerBlock / 2) + maxRepsPerBlock / 2
for (i <- order) {
var r = 0
while (r < reps) {
r += 1
val start = System.currentTimeMillis
(xs(i)._2)()
val end = System.currentTimeMillis
if (round > warmupRounds) {
times(i) += (end - start)
}
}
}
if (round > warmupRounds) {
totalReps += reps
}
}
for (i <- 0 until n) {
println(f"${xs(i)._1}%20s : ${times(i) / totalReps.toDouble}")
}
}
def gaussSumWhile(n: Int): Long = {
  var acc: Long = 0
  var i = 0
  while (i <= n) {
    acc += i
    i += 1
  }
  acc
}

@annotation.tailrec
def gaussSumRec(n: Int, acc: Long = 0, i: Int = 0): Long = {
  if (i <= n) gaussSumRec(n, acc + i, i + 1)
  else acc
}

compare(Array(
  ("while", { () => gaussSumWhile(1000) }),
  ("@tailrec", { () => gaussSumRec(1000) })
))
Here is what it prints:
               while : 6.737733046257334E-5
            @tailrec : 6.70325653896487E-5
Even the simple hints above are sufficient for creating a benchmark that shows that the while loop and the tail-recursive function take roughly the same time.
Scala does not compile to machine code but to bytecode for the Java Virtual Machine (JVM), which then interprets that code on the native processor. The JVM uses multiple mechanisms to optimize code that runs frequently, eventually converting frequently-called functions ("hotspots") into pure machine code.
This means that testing the first run of a function does not give a good measure of eventual performance. You need to "warm up" the JIT compiler by running the test code many times before attempting to measure the time taken.
Also, as noted in the comments, doing any kind of I/O is going to make timings very unreliable, because there is a danger that the I/O will block. Write a test case that does not do any blocking, if possible.
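A minimal warm-up-then-measure harness along those lines (a sketch; for serious measurements a dedicated harness such as JMH is the usual tool):

def timeAfterWarmup[A](warmup: Int, measured: Int)(block: => A): Double = {
  var i = 0
  while (i < warmup) { block; i += 1 }  // give the JIT time to compile the hot path
  val start = System.nanoTime
  i = 0
  while (i < measured) { block; i += 1 }
  (System.nanoTime - start) / 1e6 / measured  // mean milliseconds per call
}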

Scala fast way to parallelize collection

My code is equivalent to this:
import scala.util.Random

def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
  val next = (for { i <- 1.to(1000000) }
    yield (prev(Random.nextInt(i)))).toVector
  if (acc < 20) iterate(next, acc + 1)
  else next
}

iterate(1.to(1000000).toVector, 1)
For a large number of iterations, it performs an operation on a collection and yields a value. At the end of the iterations, it converts everything to a vector. Finally, it proceeds to the next recursive self-call, but it cannot proceed until all the iterations are done. The number of recursive self-calls is very small.
I want to parallelize this, so I tried using .par on the 1.to(1000000) range. This used 8 threads instead of 1, but the result was only twice as fast! .toParArray was only slightly faster than .par. I was told it could be much faster if I used something different, like maybe a ThreadPool; this makes sense, because all of the time is spent in constructing next, and I assume that concatenating the outputs of different threads into shared memory will not cause huge slowdowns, even for very large outputs (this is a key assumption, and it might be wrong). How can I do it? If you provide code, parallelizing the code I gave will be sufficient.
Note that the code I gave is not my actual code. My actual code is much longer and more complex (Held-Karp algorithm for TSP with constraints, BitSets and more stuff), and the only notable difference is that in my code, prev's type is ParMap instead of Vector.
Edit, extra information: the ParMap has 350k elements on the worst iteration at the biggest sample size I can handle, and otherwise it's typically 5k-200k (that varies on a log scale). If it inherently needs a lot of time to concatenate the results from the processes into one single process (I assume this is what's happening), then there is nothing much I can do, but I rather doubt this is the case.
I implemented a few versions besides the original one proposed in the question:
rec0 is the original, with a for loop;
rec1 uses par.map instead of the for loop;
rec2 follows rec1, but employs the parallel collection ParArray, for lazy builders (and fast access on bulk traversal operations);
rec3 is a non-idiomatic, non-parallel version with a mutable ArrayBuffer.
Thus
import scala.collection.mutable.ArrayBuffer
import scala.collection.parallel.mutable.ParArray
import scala.util.Random

// Original
def rec0() = {
  def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
    val next = (for { i <- 1.to(1000000) }
      yield (prev(Random.nextInt(i)))).toVector
    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(1.to(1000000).toVector, 1)
}

// par map
def rec1() = {
  def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
    val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toVector
    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(1.to(1000000).toVector, 1)
}

// ParArray par map
def rec2() = {
  def iterate(prev: ParArray[Int], acc: Int): ParArray[Int] = {
    val next = (1 to 1000000).par.map { i => prev(Random.nextInt(i)) }.toParArray
    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate((1 to 1000000).toParArray, 1).toVector
}

// Non-idiomatic non-parallel
def rec3() = {
  def iterate(prev: ArrayBuffer[Int], acc: Int): ArrayBuffer[Int] = {
    val next = ArrayBuffer.tabulate(1000000) { i => i + 1 }
    var i = 0
    while (i < 1000000) {
      next(i) = prev(Random.nextInt(i + 1))
      i = i + 1
    }
    if (acc < 20) iterate(next, acc + 1)
    else next
  }
  iterate(ArrayBuffer.tabulate(1000000) { i => i + 1 }, 1).toVector
}
Then a little testing, averaging the elapsed times:
def elapsed[A](f: => A): Double = {
  val start = System.nanoTime()
  f
  val stop = System.nanoTime()
  (stop - start) * 1e-6d
}

val times = 10
val e0 = (1 to times).map { i => elapsed(rec0) }.sum / times
val e1 = (1 to times).map { i => elapsed(rec1) }.sum / times
val e2 = (1 to times).map { i => elapsed(rec2) }.sum / times
val e3 = (1 to times).map { i => elapsed(rec3) }.sum / times
// time in ms.
e0: Double = 2782.341
e1: Double = 2454.828
e2: Double = 3455.976
e3: Double = 1275.876
shows that the non-idiomatic, non-parallel version proves the fastest on average. Perhaps for larger input data the parallel, idiomatic versions may be beneficial.
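If the goal is to beat rec3 using parallelism, one option is to make each task a whole chunk of the output array rather than a single element. Here is a hedged sketch (my own, not from the answer above) combining rec3's indexed loop with coarse-grained parallel tasks and a per-thread RNG:

import java.util.concurrent.ThreadLocalRandom

// One task per chunk instead of one task per element; each thread fills
// its own disjoint slice of the output array, so no synchronization is needed.
def recChunked(chunks: Int = 8) = {
  val n = 1000000
  def iterate(prev: Vector[Int], acc: Int): Vector[Int] = {
    val next = new Array[Int](n)
    (0 until chunks).par.foreach { c =>
      val rng = ThreadLocalRandom.current
      var i = c * n / chunks
      val end = (c + 1) * n / chunks
      while (i < end) {
        next(i) = prev(rng.nextInt(i + 1))  // same sampling as rec3
        i += 1
      }
    }
    if (acc < 20) iterate(next.toVector, acc + 1) else next.toVector
  }
  iterate(Vector.tabulate(n)(_ + 1), 1)
}

Whether this actually wins depends on how the random reads into the shared prev vector interact with the caches; timing it with the elapsed helper above would settle it.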

Why is this function call in Scala not optimized away?

I'm running this program with Scala 2.10.3:
object Test {
  def main(args: Array[String]) {
    def factorial(x: BigInt): BigInt =
      if (x == 0) 1 else x * factorial(x - 1)
    val N = 1000
    val t = new Array[Long](N)
    var r: BigInt = 0
    for (i <- 0 until N) {
      val t0 = System.nanoTime()
      r = r + factorial(300)
      t(i) = System.nanoTime() - t0
    }
    val ts = t.sortWith((x, y) => x < y)
    for (i <- 0 to 10)
      print(ts(i) + " ")
    println("*** " + ts(N/2) + "\n" + r)
  }
}
and the call to the pure function factorial with a constant argument is evaluated during each loop iteration (a conclusion based on the timing results). Shouldn't the optimizer reuse the result of the function call after the first call?
I'm using Scala IDE for Eclipse. Are there any optimization flags for the compiler that might produce more efficient code?
Scala is not a purely functional language, so without an effect system it cannot know that factorial is pure (for example, it doesn't "know" anything about the multiplication of BigInts).
You need to add your own memoization approach here. Most simply, add a val f300 = factorial(300) outside your loop.
Here is a question about memoization.
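For illustration, a minimal sketch of both routes; the memoize helper below is a hypothetical illustration, not something from the linked question:

// Option 1: hoist the constant call out of the loop.
val f300 = factorial(300)
for (i <- 0 until N) {
  r = r + f300
}

// Option 2: a tiny general-purpose memoizer. Note it only caches top-level
// calls; the recursion inside factorial itself is not memoized.
def memoize[A, B](f: A => B): A => B = {
  val cache = scala.collection.mutable.Map.empty[A, B]
  a => cache.getOrElseUpdate(a, f(a))
}
val memoFactorial = memoize(factorial _)  // first call computes, repeats hit the cache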

Suggest a cleaner functional way

Here is some imperative code:
var sum = 0
val spacing = 6
var x = spacing
for (i <- 1 to 10) {
  sum += x * x
  x += spacing
}
Here are two of my attempts to "functionalize" the above code:
// Attempt 1
(1 to 10).foldLeft((0, 6)) {
  case ((sum, x), _) => (sum + x * x, x + spacing)
}

// Attempt 2
Stream.iterate((0, 6)) { case (sum, x) => (sum + x * x, x + spacing) }.take(11).last
I think there might be a cleaner and better functional way to do this. What would that be?
PS: Please note that the above is just an example code intended to illustrate the problem; it is not from the real application code.
Replacing 10 by N, you have spacing * spacing * N * (N + 1) * (2 * N + 1) / 6.
This follows by noting that you're summing (spacing * i)^2 over the range 1..N. This sum factorizes as spacing^2 * (1^2 + 2^2 + ... + N^2), and the latter sum is well known to be N * (N + 1) * (2 * N + 1) / 6 (see Square Pyramidal Number).
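A quick sanity check of the closed form against the direct sum (illustrative):

val spacing = 6
val N = 10
val direct = (1 to N).map(i => spacing * i * spacing * i).sum
val closed = spacing * spacing * N * (N + 1) * (2 * N + 1) / 6
assert(direct == closed)  // both give 13860 for spacing = 6, N = 10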
I actually like the idea of lazy sequences in this case. You can split your algorithm into two logical steps.
First, you want to work on all natural numbers (ok, not quite all, but up to Int.MaxValue), so you define them like this:
val naturals = 0 to Int.MaxValue
Then you need to define the knowledge of how the numbers that you want to sum can be calculated:
val myDoubles = (naturals by 6 tail).view map (x => x * x)
And putting this all together:
val naturals = 0 to Int.MaxValue
val myDoubles = (naturals by 6 tail).view map (x => x * x)
val mySum = myDoubles take 10 sum
I think this is the way a mathematician would approach the problem. And because all the collections are lazily evaluated, you will not run out of memory.
Edit
If you want to develop the idea of mathematical notation further, you can actually define this implicit conversion:
implicit def math[T, R](f: T => R) = new {
  def ∀(range: Traversable[T]) = range.view map f
}
and then define myDoubles like this:
val myDoubles = ((x: Int) => x * x) ∀ (naturals by 6 tail)
My personal favourite would have to be:
val x = (6 to 60 by 6) map {x => x*x} sum
Or given spacing as an input variable:
val x = (spacing to 10*spacing by spacing) map {x => x*x} sum
or
val x = (1 to 10) map (spacing*) map {x => x*x} sum
There are two different directions to go. If you want to express yourself, assuming that you can't use the built-in range function (because you actually want something more complicated):
Iterator.iterate(spacing)(x => x+spacing).take(10).map(x => x*x).foldLeft(0)(_ + _)
This is a very general pattern: specify what you start with and how to get the next given the previous; then take the number of items you need; then transform them somehow; then combine them into a single answer. There are shortcuts for almost all of these in simple cases (e.g. the last fold is sum) but this is a way to do it generally.
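Using those shortcuts, the same pipeline collapses to the equivalent one-liner:

Iterator.iterate(spacing)(_ + spacing).take(10).map(x => x * x).sum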
But I also wonder: what is wrong with the mutable imperative approach for maximal speed? It's really quite clear, and Scala lets you mix the two styles on purpose:
var x = spacing
val last = spacing * 10
var sum = 0
while (x <= last) {
  sum += x*x
  x += spacing
}
(Note that the for is slower than while since the Scala compiler transforms for loops to a construct of maximum generality, not maximum speed.)
Here's a straightforward translation of the loop you wrote to a tail-recursive function, in an SML-like syntax.
val spacing = 6

fun loop (sum: int, x: int, i: int): int =
  if i > 0 then loop (sum+x*x, x+spacing, i-1)
  else sum

val sum = loop (0, spacing, 10)
Is this what you were looking for? (What do you mean by a "cleaner" and "better" way?)
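For reference, the same tail-recursive loop rendered directly in Scala (a sketch; @tailrec merely asserts that the compiler performs the optimization):

import scala.annotation.tailrec

val spacing = 6

@tailrec
def loop(sum: Int, x: Int, i: Int): Int =
  if (i > 0) loop(sum + x * x, x + spacing, i - 1)
  else sum

val sum = loop(0, spacing, 10)  // 13860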
What about this?
def toSquare(i: Int) = i * i
val spacing = 6
val spaceMultiples = (1 to 10) map (spacing *)
val squares = spaceMultiples map toSquare
println(squares.sum)
You have to split your code into small parts; this can improve readability a lot.
Here is a one-liner:
(0 to 10).reduceLeft((u,v)=>u + spacing*spacing*v*v)
Note that you need to start with 0 in order to get the correct result: reduceLeft seeds the accumulator with the first element of the range as-is, so starting the range at 1 would contribute a raw 1 that is neither squared nor scaled.
Another option is to generate the squares first, using the fact that the running sums of the first k odd numbers are exactly the squares k*k:
(1 to 2*10 by 2).scanLeft(0)(_+_).sum*spacing*spacing

Scala performance - Sieve

Right now, I am trying to learn Scala. I've started small, writing some simple algorithms. I've encountered some problems when I wanted to implement the Sieve algorithm for finding all prime numbers lower than a certain threshold.
My implementation is:
import scala.math

object Sieve {
  // Returns all prime numbers until maxNum
  def getPrimes(maxNum: Int) = {
    def sieve(list: List[Int], stop: Int): List[Int] = {
      list match {
        case Nil => Nil
        case h :: list if h <= stop => h :: sieve(list.filterNot(_ % h == 0), stop)
        case _ => list
      }
    }
    val stop: Int = math.sqrt(maxNum).toInt
    sieve((2 to maxNum).toList, stop)
  }

  def main(args: Array[String]) = {
    val ap = printf("%d ", (_: Int));
    // works
    getPrimes(1000).foreach(ap(_))
    // works
    getPrimes(100000).foreach(ap(_))
    // out of memory
    getPrimes(1000000).foreach(ap(_))
  }
}
Unfortunately it fails when I want to compute all the prime numbers smaller than 1000000 (1 million). I am receiving an OutOfMemoryError.
Do you have any idea how to optimize the code, or how I can implement this algorithm in a more elegant fashion?
PS: I've done something very similar in Haskell, and there I didn't encounter any issues.
I would go with an infinite Stream. Using a lazy data structure lets you code pretty much like in Haskell, and it automatically reads as more "declarative" than the code you wrote.
import Stream._

val primes = 2 #:: sieve(3)

def sieve(n: Int): Stream[Int] =
  if (primes.takeWhile(p => p*p <= n).exists(n % _ == 0)) sieve(n + 2)
  else n #:: sieve(n + 2)

def getPrimes(maxNum: Int) = primes.takeWhile(_ < maxNum)
Obviously, this isn't the most performant approach. Read The Genuine Sieve of Eratosthenes for a good explanation (it's Haskell, but not too difficult). For really big ranges you should consider the Sieve of Atkin.
The code in question is not tail-recursive, so Scala cannot optimize the recursion away. Also, Haskell is non-strict by default, so you can hardly compare it to Scala. For instance, whereas Haskell benefits from foldRight, Scala benefits from foldLeft.
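A small illustration of that last point (a sketch):

val xs = List.range(0, 1000000)
xs.foldLeft(0L)(_ + _)      // iterates over the list with constant stack usage
// xs.foldRight(0L)(_ + _)  // on older Scala versions this recursed element by
//                          // element and could throw a StackOverflowError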
There are many Scala implementations of the Sieve of Eratosthenes, including some on Stack Overflow. For instance:
(n: Int) => (2 to n) |> (r => r.foldLeft(r.toSet)((ps, x) => if (ps(x)) ps -- (x * x to n by x) else ps))
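The |> in that snippet is not in the standard library; it presumably comes from a pipe-forward operator such as the one in Scalaz. A minimal stand-in so the one-liner compiles on its own (an illustrative definition, not the original dependency):

implicit class Piped[A](private val a: A) extends AnyVal {
  def |>[B](f: A => B): B = f(a)  // pipe-forward: x |> f is f(x)
}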
The following answer is about 100 times faster than the "one-liner" answer using a Set (and the results don't need sorting into ascending order). It is also closer to a functional style than the other answer using an array, although it uses a mutable BitSet as the sieving array:
object SoE {
  import scala.annotation.tailrec

  def makeSoE_Primes(top: Int): Iterator[Int] = {
    val topndx = (top - 3) / 2
    val nonprms = new scala.collection.mutable.BitSet(topndx + 1)
    def cullp(i: Int) = {
      val p = i + i + 3
      @tailrec def cull(c: Int): Unit = if (c <= topndx) { nonprms += c; cull(c + p) }
      cull((p * p - 3) >>> 1)
    }
    (0 to (Math.sqrt(top).toInt - 3) >>> 1).filterNot { nonprms }.foreach { cullp }
    Iterator.single(2) ++ (0 to topndx).filterNot { nonprms }.map { i: Int => i + i + 3 }
  }
}
It can be tested by the following code:
object Main extends App {
  import SoE._
  val top_num = 10000000
  val strt = System.nanoTime()
  val count = makeSoE_Primes(top_num).size
  val end = System.nanoTime()
  println(s"Successfully completed without errors. [total ${(end - strt) / 1000000} ms]")
  println(f"Found $count primes up to $top_num.")
  println("Using one large mutable BitSet and functional code.")
}
With the results from the above as follows:
Successfully completed without errors. [total 294 ms]
Found 664579 primes up to 10000000.
Using one large mutable BitSet and functional code.
There is an overhead of about 40 milliseconds for even small sieve ranges, and there are various non-linear responses with increasing range as the size of the BitSet grows beyond the different CPU caches.
It looks like List isn't very efficient space-wise: each cons cell carries an object header and a boxed Integer, so the memory per element is many times the four bytes of a raw Int. You can get an out-of-memory exception by doing something like this:
1 to 2000000 toList
I "cheated" and used a mutable array. Didn't feel dirty at all.
def primesSmallerThan(n: Int): List[Int] = {
  val nonprimes = Array.tabulate(n + 1)(i => i == 0 || i == 1)
  val primes = new collection.mutable.ListBuffer[Int]
  for (x <- nonprimes.indices if !nonprimes(x)) {
    primes += x
    // the (x * x) > 0 guard protects against Int overflow of x * x for large x
    for (y <- x * x until nonprimes.length by x if (x * x) > 0) {
      nonprimes(y) = true
    }
  }
  primes.toList
}
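A quick usage check; since the sieve uses one flat Boolean array plus the final list, the 1000000 case from the question should now fit comfortably in memory:

println(primesSmallerThan(50))  // List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)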