Scala Future multiple cores - slow performance

import java.util.concurrent.{Executors, TimeUnit}
import scala.annotation.tailrec
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.{Failure, Success}
object Fact extends App {
  def time[R](block: => R): Long = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    val t: Long = TimeUnit.SECONDS.convert(t1 - t0, TimeUnit.NANOSECONDS)
    //println("Time taken seconds: " + t)
    t
  }

  def factorial(n: BigInt): BigInt = {
    @tailrec
    def process(n: BigInt, acc: BigInt): BigInt = {
      //println(acc)
      if (n <= 0) acc
      else process(n - 1, n * acc)
    }
    process(n, 1)
  }

  implicit val ec =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

  val f1: Future[Stream[Long]] =
    Future.sequence(
      (1 to 50).toStream.map(x => Future { time(factorial(100000)) }))

  f1.onComplete {
    case Success(s) => println("Success : " + s.foldLeft(0L)(_ + _) + " seconds!")
    case Failure(f) => println("Fails " + f)
  }

  import scala.concurrent.duration._
  Await.ready(Future { 10 }, 10000 minutes)
}
I have the above factorial code, which needs to use multiple cores so the program completes faster.
So I change
Executors.newFixedThreadPool(1) to utilize 1 core
Executors.newFixedThreadPool(2) to utilize 2 cores
With 1 core, the result appears in 127 seconds.
But with 2 cores, I get 157 seconds.
My understanding is that increasing the number of cores (parallelism) should improve performance, but it doesn't. Why?
Please correct me if I am wrong or missing something.
Thanks in advance.

How are you measuring the time? The result you are printing out is not the time execution took, but the sum of times of each individual call.
Running Fact.time(Fact.main(Array.empty)) in the REPL I get 90 and 178 seconds with two and one threads respectively. Seems to make sense...

First of all, Dima is right that what you print is the total execution time of all tasks rather than the time until the last task finishes. The difference is that the former sums the time for all the work done in parallel, while only the latter shows the actual speed-up from multi-threading.
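As a side note, here is a minimal sketch (mine, not the exact harness behind the numbers below) of measuring the wall-clock time until the whole batch is done. It reuses the question's time helper and blocks on the combined future instead of summing per-task times:
// Sketch: measure wall-clock time of the whole batch rather than the sum of per-task times.
// Assumes the time and factorial methods and the implicit ec from the question are in scope.
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

val wallClockSeconds: Long = time {
  val batch = Future.sequence(
    (1 to 50).toStream.map(_ => Future { factorial(100000) }))
  Await.result(batch, 1.hour) // blocks until the last task finishes
}
println("Wall-clock: " + wallClockSeconds + " seconds")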
However, there is another important effect. When I run this code with 1, 2 and 3 threads and measure both the total time (time until f1 is ready) and the total parallel time (the one that you print), I get the following data (I also reduced the number of calculations from 50 to 20 to speed up my tests):
threads - total time (s) - parallel time (s)
1 - 70 - 70
2 - 47 - 94
3 - 43 - 126
At first glance it looks OK, as the parallel time divided by the real time is about the same as the number of threads. But if you look a bit closer, you may notice that the speed-up going from 1 thread to 2 is only about 1.5x, and only 1.1x for the third thread. These figures also mean that the total time of all tasks actually goes up when you add threads. This might seem puzzling.
The answer to this puzzle is that your calculation is not really CPU-bound. The thing is that the result (factorial(100000)) is a pretty big number. In fact, it is so big that it takes about 185KB of memory to store it. This means that at the later stages of the computation your factorial method becomes more memory-bound than CPU-bound, because that size is big enough to overflow the fastest CPU caches. And this is the reason why adding more threads slows down each calculation: yes, you do the calculation faster, but memory doesn't get any faster. So once you saturate the CPU caches and then the memory channel, adding more threads (cores) doesn't improve performance that much.
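As a quick sanity check of that 185KB figure, you can ask the BigInt for its size directly. A small sketch of my own, reusing the question's factorial method:
// Sketch: rough storage size of factorial(100000); assumes the factorial method from the question.
val big = factorial(100000)
println(big.bitLength / 8 / 1024 + " KB")        // roughly 185
println(big.toString.length + " decimal digits") // roughly 456574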

Related

Expensive flatMap() operation on streams originating from Stream.emits()

I just encountered an issue with degrading fs2 performance using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but I observed performance degradation instead.
As far as I can see, it boils down to the following: invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to grow exponentially with the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the last flatMaps are just added to show extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately and looked at the results. The first thing that immediately caught my eye was the number of allocations.
Stream.emit: [JFR allocation profile screenshot]
Stream.range: [JFR allocation profile screenshot]
We can see that Stream.emits allocates a significant number of Append instances, which are the concrete implementation of Catenable[A], the type Stream.emits uses to fold:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This comes from how Catenable[A] implements foldLeft:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
where :+ allocates a new Append object for each element. This means we're generating at least 20000 such Append objects.
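To make that allocation pattern concrete, here is a toy sketch I wrote for illustration (a simplified stand-in, not fs2's actual Catenable): folding with an append-one-element step produces one Append node per element, nested to the left.
// Toy model of the Catenable shape, for illustration only (not the fs2 implementation).
sealed trait Cat[+A]
case object Empty extends Cat[Nothing]
final case class Single[A](a: A) extends Cat[A]
final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

// Appending one element at a time, as the foldLeft above does, allocates one Append
// (plus one Single) per element and builds a deeply left-nested tree.
val built: Cat[Int] =
  (1 to 20000).foldLeft(Empty: Cat[Int])((acc, i) => Append(acc, Single(i)))

@annotation.tailrec
def depth(c: Cat[Int], acc: Int = 0): Int = c match {
  case Append(l, _) => depth(l, acc + 1)
  case _            => acc
}
println(depth(built)) // 20000, i.e. one Append allocation per element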
There is also a hint in the documentation of Stream.range: emits produces the sequence in a single chunk, instead of lazily, which may be bad if we're generating a big range:
/**
  * Lazily produce the range `[start, stopExclusive)`. If you want to produce
  * the sequence in one chunk, instead of lazily, use
  * `emits(start until stopExclusive)`.
  *
  * @example {{{
  * scala> Stream.range(10, 20, 2).toList
  * res0: List[Int] = List(10, 12, 14, 16, 18)
  * }}}
  */
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure,Int] =
  unfold(start){ i =>
    if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
        (by < 0 && i > stopExclusive && start > stopExclusive))
      Some((i, i + by))
    else None
  }
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. Stream.emits, on the other hand, creates an Append object for every element in the sequence, with left containing everything accumulated so far and right containing the current value.
Is this a bug? I would say no, but I would definitely raise it as a performance issue with the fs2 library maintainers.

Redis: Set time = Get time. Why?

Short: Redis set time = get time (strange)
I did some tests: just insert 30000 records and then read them back 30000 times (Redis).
def redis_set(data):
    for k, v in data.iteritems():
        redis_conn.set(k, v)

def redis_get(data):
    for k in data.iterkeys():
        val = redis_conn.get(k)

def do_tests(num, tests):
    # setup dict with key/values to retrieve
    data = {'key' + str(i): 'val' + str(i)*100 for i in range(num)}
    # run tests
    for test in tests:
        start = time.time()
        print "Starting test .. %s" % (test.__name__)
        test(data)
        elapsed = time.time() - start
        print "%s: %d ops in %.2f seconds : %.1f ops/sec" % (test.__name__, num, elapsed, num / elapsed)

tests = [redis_set, redis_get]
do_tests(30000, tests)
Results
Redis:
redis_set: 30000 ops in 106.21 seconds : 282.4 ops/sec
redis_get: 30000 ops in 94.94 seconds : 316.0 ops/sec
Is that OK?
Nothing goes wrong.
Since Redis is single-threaded, there's no lock penalty for read and write operations. Both GET and SET are just a few memory operations, and both are very fast.
According to your benchmark, SET is a little slower than GET. That's also reasonable, since the SET operation needs to allocate memory for the newly added item, and memory allocation costs more than other memory operations.
On the other hand, MongoDB's read operations are much faster than its write operations, because it does a lot of optimization for reads, such as caching. Also, the intent lock that MongoDB uses is much friendlier to read operations: multiple readers can read data from a single slot at the same time, while writers are exclusive.

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently, and each task takes a substantial amount of memory, so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
    s.sum.toLong
  }
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected was 120 => 84609924480, but the output was rather random. The returned collection size differed from execution to execution; most of the time elements were missing, even though, judging by the console output, all of the parallel computations were executed. I thought flatMap would wait for the parallel executions inside map to complete before returning. What should I do to always get the right result using par? Thanks.
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed as of 2.11.6, so if you use that (or a higher) Scala version, you should not see the strange behavior.
As for the overflow, I still think that your expected value is wrong. The sum overflows because the list contains integers (which are 32-bit) while the total exceeds the Int range. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code
val correctValue = 1L * n * (n + 1) / 2 // use the math formula
var bruteForceValue = 0L // in case you don't trust math :) It's Long because of 0L
for (i ← 1 to n) bruteForceValue += i // iterate through the range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks @kaktusito.
It worked after I changed the grouped list to a Vector or Seq, i.e. changing (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...
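For the record, a sketch of my own that combines the two fixes discussed above, assuming the same 2.11/2.12-era parallel collections the question uses: toVector sidesteps the pre-2.11.6 par issue, and summing as Long avoids the Int overflow, so each task yields 5000050000.
// Sketch: toVector for the par bug workaround, foldLeft(0L) to sum without overflowing Int.
def runme(n: Int = 120): Vector[Long] = (1 to n).grouped(2).toVector.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList
    s.foldLeft(0L)(_ + _) // 5000050000 per task
  }
}

val result = runme()
println(result.size + " => " + result.sum) // expected: 120 => 600006000000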

Adding immutable Vectors

I am trying to work more with Scala's immutable collections since they are easy to parallelize, but I struggle with some newbie problems. I am looking for a way to efficiently create a new Vector from an operation. To be precise, I want something like:
val v : Vector[Double] = RandomVector(10000)
val w : Vector[Double] = RandomVector(10000)
val r = v + w
I tested the following:
// 1)
val r : Vector[Double] = (v.zip(w)).map{ t: (Double, Double) => t._1 + t._2 }

// 2)
val vb = new VectorBuilder[Double]()
var i = 0
while (i < v.length) {
  vb += v(i) + w(i)
  i = i + 1
}
val r = vb.result
Both take really long compared to working with Array:
[Vector Zip/Map ] Elapsed time 0.409 msecs
[Vector While Loop] Elapsed time 0.374 msecs
[Array While Loop ] Elapsed time 0.056 msecs
// with warm-up (10000) and avg. over 10000 runs
Is there a better way to do it? I think zip/map/reduce has the advantage that it can run in parallel as soon as the collections support it.
Thanks
Vector is not specialized for Double, so you're going to pay a sizable performance penalty for using it. If you are doing a simple operation, you're probably better off using an array on a single core than a Vector or other generic collection on the entire machine (unless you have 12+ cores). If you still need parallelization, there are other mechanisms you can use, such as using scala.actors.Futures.future to create instances that each do the work on part of the range:
val a = Array(1,2,3,4,5,6,7,8)
(0 to 4).map(_ * (a.length/4)).sliding(2).map(i => scala.actors.Futures.future {
  var s = 0
  var j = i(0)
  while (j < i(1)) {
    s += a(j)
    j += 1
  }
  s
}).map(_()).sum // _() applies the future--blocks until it's done
Of course, you'd need to use this on a much longer array (and on a machine with four cores) for the parallelization to improve things.
You should use lazily built collections when you chain more than one higher-order method:
v1.view zip v2 map { case (a,b) => a+b }
If you don't use a view or an iterator, each method will create a new immutable collection even when it isn't needed.
Immutable code probably won't be as fast as mutable code, but the lazy collection will improve the execution time of your code a lot.
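For completeness, a small sketch of my own that forces the view-based one-liner above back into a Vector; the zip and map steps don't build intermediate collections:
// Sketch: zip and map lazily via views, then force the result into a Vector.
val rnd = new scala.util.Random(42)
val v: Vector[Double] = Vector.fill(10000)(rnd.nextDouble())
val w: Vector[Double] = Vector.fill(10000)(rnd.nextDouble())

val r: Vector[Double] =
  (v.view zip w.view).map { case (a, b) => a + b }.toVector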
Arrays are not type-erased; Vectors are. Basically, the JVM gives Array an advantage over other collections when handling primitives that cannot be overcome. Scala's specialization might decrease that advantage, but, given its cost in code size, it can't be used everywhere.

Scala - calculate average of SomeObj.double in a List[SomeObj]

I'm on my second evening of Scala, and I'm resisting the urge to write things in Scala the way I used to do them in Java while trying to learn the idioms. In this case I'm looking to compute an average using such things as closures, mapping, and perhaps list comprehensions. Irrespective of whether this is the best way to compute an average, I just want to know how to do these things in Scala, for learning purposes only.
Here's an example: the average method below is left pretty much unimplemented. I've got a couple of other methods for looking up the rating an individual userId gave, which use the find method of TraversableLike (I think), but nothing more that is really Scala-specific. How would I compute, in a Scala-like manner, an average over a List[RatingEvent], where RatingEvent.rating is a double value that I'd like to average across all values of that list?
package com.brinksys.liftnex.model
class Movie(val id: Int, val ratingEvents: List[RatingEvent]) {
  def getRatingByUser(userId: Int): Int = {
    return getRatingEventByUserId(userId).rating
  }

  def getRatingEventByUserId(userId: Int): RatingEvent = {
    var result = ratingEvents find { e => e.userId == userId }
    return result.get
  }

  def average(): Double = {
    /*
     * fill in the blanks where an average of all ratingEvent.rating values is expected
     */
    return 3.8
  }
}
How would a seasoned Scala pro fill in that method and use the features of Scala to make it as concise as possible? I know how I would do it in Java, which is what I want to avoid.
If I were doing it in python, I assume the most pythonic way would be:
sum([re.rating for re in ratingEvents]) / len(ratingEvents)
or if I were forcing myself to use a closure (which is something I at least want to learn in scala):
reduce(lambda x, y : x + y, [re.rating for re in ratingEvents]) / len(ratingEvents)
It's the usage of these types of things I want to learn in Scala.
Your suggestions? Any pointers to good tutorials/reference material relevant to this are welcome :D
If you're going to be doing math on things, using List is not always the fastest way to go because List has no idea how long it is--so ratingEvents.length takes time proportional to the length. (Not very much time, granted, but it does have to traverse the whole list to tell.) But if you're mostly manipulating data structures and only occasionally need to compute a sum or whatever, so it's not the time-critical core of your code, then using List is dandy.
Anyway, the canonical way to do it would be with a fold to compute the sum:
(0.0 /: ratingEvents){_ + _.rating} / ratingEvents.length
// Equivalently, though more verbosely:
// ratingEvents.foldLeft(0.0)(_ + _.rating) / ratingEvents.length
or by mapping and then summing (2.8 only):
ratingEvents.map(_.rating).sum / ratingEvents.length
For more information on maps and folds, see this question on that topic.
You might calculate the sum and the length in one go, but I doubt that this helps except for very long lists. It would look like this:
val (s,l) = ratingEvents.foldLeft((0.0, 0))((t, r)=>(t._1 + r.rating, t._2 + 1))
val avg = s / l
I think for this example Rex's solution is much better, but in other use cases the "fold over a tuple" trick can be essential.
Since mean and other descriptive statistics like standard deviation or median are needed in different contexts, you could also use a small reusable implicit helper class to allow for more streamlined chained commands:
implicit class ImplDoubleVecUtils(values: Seq[Double]) {
  def mean = values.sum / values.length
}
val meanRating = ratingEvents.map(_.rating).mean
It even seems to be possible to write this in a generic fashion for all number types.
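Following up on that last remark, here is one hedged way the generic version might look, using Numeric (my sketch, assuming a simple toDouble-based mean is acceptable):
// Sketch: a mean helper for any element type that has a Numeric instance.
implicit class MeanOps[A](values: Seq[A])(implicit num: Numeric[A]) {
  def mean: Double = num.toDouble(values.sum) / values.length
}

println(Seq(1, 2, 3, 4).mean)    // 2.5
println(Seq(1.5, 2.5, 3.5).mean) // 2.5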
A tail-recursive solution can achieve a single traversal and avoid high memory-allocation rates:
def tailrec(input: List[RatingEvent]): Double = {
  @annotation.tailrec
  def go(next: List[RatingEvent], sum: Double, count: Int): Double = {
    next match {
      case Nil => sum / count
      case h :: t => go(t, sum + h.rating, count + 1)
    }
  }
  go(input, 0.0, 0)
}
Here are JMH measurements of the approaches from the answers above, on a list of a million elements:
[info] Benchmark Mode Score Units
[info] Mean.foldLeft avgt 0.007 s/op
[info] Mean.foldLeft:·gc.alloc.rate avgt 4217.549 MB/sec
[info] Mean.foldLeft:·gc.alloc.rate.norm avgt 32000064.281 B/op
...
[info] Mean.mapAndSum avgt 0.039 s/op
[info] Mean.mapAndSum:·gc.alloc.rate avgt 1690.077 MB/sec
[info] Mean.mapAndSum:·gc.alloc.rate.norm avgt 72000009.575 B/op
...
[info] Mean.tailrec avgt 0.004 s/op
[info] Mean.tailrec:·gc.alloc.rate avgt ≈ 10⁻⁴ MB/sec
[info] Mean.tailrec:·gc.alloc.rate.norm avgt 0.196 B/op
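The answer doesn't include the harness itself; here is a rough sketch of how such a benchmark might be declared with sbt-jmh. The RatingEvent case class, the list contents and the run command are my assumptions, not part of the original measurements.
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

// Assumed fixture: RatingEvent as a simple case class, as implied by the question above.
case class RatingEvent(userId: Int, rating: Double)

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.SECONDS)
class Mean {
  val input: List[RatingEvent] =
    List.tabulate(1000000)(i => RatingEvent(i, i.toDouble))

  @Benchmark
  def foldLeft: Double = input.foldLeft(0.0)(_ + _.rating) / input.length

  @Benchmark
  def mapAndSum: Double = input.map(_.rating).sum / input.length

  @Benchmark
  def tailrec: Double = {
    @annotation.tailrec
    def go(next: List[RatingEvent], sum: Double, count: Int): Double = next match {
      case Nil    => sum / count
      case h :: t => go(t, sum + h.rating, count + 1)
    }
    go(input, 0.0, 0)
  }
}
// Run, for example, with: sbt "jmh:run -prof gc .*Mean.*"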
I can suggest 2 ways:
def average(x: Array[Double]): Double = x.foldLeft(0.0)(_ + _) / x.length
def average(x: Array[Double]): Double = x.sum / x.length
Both are fine, but in the first case, using fold, you can not only apply the "+" operation but also replace it with another one (- or *, for example).