I have a lot of calculations contributing to a final result, with no restrictions on the order of the contributions. Seems like Futures should be able to speed this up, and they do, but not in the way I thought they would. Here is code comparing performance of a very silly way to divide integers:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
object scale_me_up {
def main(args: Array[String]) {
val M = 500 * 1000
val N = 5
Thread.sleep(3210) // let launcher settle down
for (it <- 0 until 15) {
val method = it % 3
val start = System.currentTimeMillis()
val result = divide(M, N, method)
val elapsed = System.currentTimeMillis() - start
assert(result == M / N)
if (it >= 6) {
val methods = Array("ordinary", "fast parallel", "nice parallel")
val name = methods(method)
println(f"$name%15s: $elapsed ms")
}
}
}
def is_multiple_of(m: Int, n: Int): Boolean = {
val result = !(1 until n).map(_ + (m / n) * n).toSet.contains(m)
assert(result == (m % n == 0)) // yes, a less crazy implementation exists
result
}
def divide(m: Int, n: Int, method: Int): Int = {
method match {
case 0 =>
(1 to m).count(is_multiple_of(_, n))
case 1 =>
(1 to m)
.map { x =>
Future { is_multiple_of(x, n) }
}
.count(Await.result(_, Duration.Inf))
case 2 =>
Await.result(divide_futuristically(m, n), Duration.Inf)
}
}
def divide_futuristically(m: Int, n: Int): Future[Int] = {
val futures = (1 to m).map { x =>
Future { is_multiple_of(x, n) }
}
Future.foldLeft(futures)(0) { (count, flag) =>
{ if (flag) { count + 1 } else { count } }
}
/* much worse performing alternative:
Future.sequence(futures).map(_.count(identity))
*/
}
}
When I run this, the parallel case 1 is somewhat faster than the ordinary case 0 code (hurray), but case 2 takes twice as long. Of course, it depends on the system and whether enough work needs to be done per future (which here grows with the denominator N) to offset the concurrency overhead. [PS] As expected, decreasing N gives case 0 the lead and increasing N enough makes both case 1 and case 2 about twice as fast as case 0 on my two core CPU.
I'm lead to believe that divide_futuristically is a better way to express this kind of calculation: returning a future with the combined result. Blocking is just something we need here to measure performance. But in reality, the more blocking, the faster everyone is finished. What am I doing wrong? Several alternatives to summarize the futures (like sequence) all take the same penalty.
[PPS] this was with Scala 2.12 running on Java 11 on a 2 core CPU. With Java 12 on a 6 core CPU, the difference is much less significant (though the alternative with sequence still drags its feet). With Scala 2.13, the difference is even less and as you increase the amount of work per iteration, divide_futuristically starts outperforming the competition. The future is here at last...
Seems you did everything right. I've tried by myself different approaches even .par but got the same or worse result.
I've dived into Future.foldLeft and tried to analyze what caused the latency:
/** A non-blocking, asynchronous left fold over the specified futures,
* with the start value of the given zero.
* The fold is performed asynchronously in left-to-right order as the futures become completed.
* The result will be the first failure of any of the futures, or any failure in the actual fold,
* or the result of the fold.
*
* Example:
* {{{
* val futureSum = Future.foldLeft(futures)(0)(_ + _)
* }}}
*
* #tparam T the type of the value of the input Futures
* #tparam R the type of the value of the returned `Future`
* #param futures the `scala.collection.immutable.Iterable` of Futures to be folded
* #param zero the start value of the fold
* #param op the fold operation to be applied to the zero and futures
* #return the `Future` holding the result of the fold
*/
def foldLeft[T, R](futures: scala.collection.immutable.Iterable[Future[T]])(zero: R)(op: (R, T) => R)(implicit executor: ExecutionContext): Future[R] =
foldNext(futures.iterator, zero, op)
private[this] def foldNext[T, R](i: Iterator[Future[T]], prevValue: R, op: (R, T) => R)(implicit executor: ExecutionContext): Future[R] =
if (!i.hasNext) successful(prevValue)
else i.next().flatMap { value => foldNext(i, op(prevValue, value), op) }
and this part:
else i.next().flatMap { value => foldNext(i, op(prevValue, value), op) }
.flatMap produces a new Future which is submitted to executor. In other words, every
{ (count, flag) =>
{ if (flag) { count + 1 } else { count } }
}
is executed as new Future.
I suppose this part causes the experimentally proved latency.
Related
Background
I have the following scenario. I want to execute the method of a class from an external library, repeatedly, and I want to do so until a certain timeout condition and result condition (compared to the previous result) is met. Furthermore I want to collect the return values, even on the "failed" run (the run with the "failing" result condition that should interrupt further execution).
Thus far I have accomplished this with initializing an empty var result: Result, a var stop: Boolean and using a while loop that runs while the conditions are true and modifying the outer state. I would like to get rid of this and use a functional approach.
Some context. Each run is expected to run from 0 to 60 minutes and the total time of iteration is capped at 60 minutes. Theoretically, there's no bound to how many times it executes in this period but in practice, it's generally 2-60 times.
The problem is, the runs take a long time so I need to stop the execution. My idea is to use some kind of lazy Iterator or Stream coupled with scanLeft and Option.
Code
Boiler plate
This code isn't particularly relevant but used in my approach samples and provide identical but somewhat random pseudo runtime results.
import scala.collection.mutable.ListBuffer
import scala.util.Random
val r = Random
r.setSeed(1)
val sleepingTimes: Seq[Int] = (1 to 601)
.map(x => Math.pow(2, x).toInt * r.nextInt(100))
.toList
.filter(_ > 0)
.sorted
val randomRes = r.shuffle((0 to 600).map(x => r.nextInt(10)).toList)
case class Result(val a: Int, val slept: Int)
class Lib() {
def run(i: Int) = {
println(s"running ${i}")
Thread.sleep(sleepingTimes(i))
Result(randomRes(i), sleepingTimes(i))
}
}
case class Baz(i: Int, result: Result)
val lib = new Lib()
val timeout = 10 * 1000
While approach
val iteratorStart = System.currentTimeMillis()
val iterator = for {
i <- (0 to 600).iterator
if System.currentTimeMillis() < iteratorStart + timeout
f = Baz(i, lib.run(i))
} yield f
val iteratorBuffer = ListBuffer[Baz]()
if (iterator.hasNext) iteratorBuffer.append(iterator.next())
var run = true
while (run && iterator.hasNext) {
val next = iterator.next()
run = iteratorBuffer.last.result.a < next.result.a
iteratorBuffer.append(next)
}
Stream approach (Scala.2.12)
Full example
val streamStart = System.currentTimeMillis()
val stream = for {
i <- (0 to 600).toStream
if System.currentTimeMillis() < streamStart + timeout
} yield Baz(i, lib.run(i))
var last: Option[Baz] = None
val head = stream.headOption
val tail = if (stream.nonEmpty) stream.tail else stream
val streamVersion = (tail
.scanLeft((head, true))((x, y) => {
if (x._1.exists(_.result.a > y.result.a)) (Some(y), false)
else (Some(y), true)
})
.takeWhile {
case (baz, continue) =>
if (!baz.eq(head)) last = baz
continue
}
.map(_._1)
.toList :+ last).flatten
LazyList approach (Scala 2.13)
Full example
val lazyListStart = System.currentTimeMillis()
val lazyList = for {
i <- (0 to 600).to(LazyList)
if System.currentTimeMillis() < lazyListStart + timeout
} yield Baz(i, lib.run(i))
var last: Option[Baz] = None
val head = lazyList.headOption
val tail = if (lazyList.nonEmpty) lazyList.tail else lazyList
val lazyListVersion = (tail
.scanLeft((head, true))((x, y) => {
if (x._1.exists(_.result.a > y.result.a)) (Some(y), false)
else (Some(y), true)
})
.takeWhile {
case (baz, continue) =>
if (!baz.eq(head)) last = baz
continue
}
.map(_._1)
.toList :+ last).flatten
Result
Both approaches appear to yield the correct end result:
List(Baz(0,Result(4,170)), Baz(1,Result(5,208)))
and they interrupt execution as desired.
Edit: The desired outcome is to not execute the next iteration but still return the result of the iteration that caused the interruption. Thus the desired result is
List(Baz(0,Result(4,170)), Baz(1,Result(5,208)), Baz(2,Result(2,256))
and lib.run(i) should only run 3 times.
This is achieved by the while approach, as well as the LazyList approach but not the Stream approach which executes lib.run 4 times (Bad!).
Question
Is there another stateless approach, which is hopefully more elegant?
Edit
I realized my examples were faulty and not returning the "failing" result, which it should, and that they kept executing beyond the stop condition. I rewrote the code and examples but I believe the spirit of the question is the same.
I would use something higher level, like fs2.
(or any other high-level streaming library, like: monix observables, akka streams or zio zstreams)
def runUntilOrTimeout[F[_]: Concurrent: Timer, A](work: F[A], timeout: FiniteDuration)
(stop: (A, A) => Boolean): Stream[F, A] = {
val interrupt =
Stream.sleep_(timeout)
val run =
Stream
.repeatEval(work)
.zipWithPrevious
.takeThrough {
case (Some(p), c) if stop(p, c) => false
case _ => true
} map {
case (_, c) => c
}
run mergeHaltBoth interrupt
}
You can see it working here.
Learning Scala and functional programming in general. In the following tail-recursive factorial implementation:
def factorialTailRec(n: Int) : Int = {
#tailrec
def factorialRec(n: Int, f: => Int): Int = {
if (n == 0) f else factorialRec(n - 1, n * f)
}
factorialRec(n, 1)
}
I wonder whether there is any benefit to having the second parameter called by value vs called by name (as I have done). In the first case, every stack frame is burdened with a product. In the second case, if my understanding is correct, the entire chain of products will be carried over to the case if ( n== 0) at the nth stack frame, so we will still have to perform the same number of multiplications. Unfortunately, this is not a product of form a^n, which can be calculated in log_2n steps through repeated squaring, but a product of terms that differ by 1 every time. So I can't see any possible way of optimizing the final product: it will still require the multiplication of O(n) terms.
Is this correct? Is call by value equivalent to call by name here, in terms of complexity?
Let me just expand a little bit what you've already been told in comments.
That's how by-name parameters are desugared by the compiler:
#tailrec
def factorialTailRec(n: Int, f: => Int): Int = {
if (n == 0) {
val fEvaluated = f
fEvaluated
} else {
val fEvaluated = f // <-- here we are going deeper into stack.
factorialTailRec(n - 1, n * fEvaluated)
}
}
Through experimentation I found out that with the call by name formalism, the method becomes... non-tail recursive! I made this example code to compare factorial tail-recursively, and factorial non-tail-recursively:
package example
import scala.annotation.tailrec
object Factorial extends App {
val ITERS = 100000
def factorialTailRec(n: Int) : Int = {
#tailrec
def factorialTailRec(n: Int, f: => Int): Int = {
if (n == 0) f else factorialTailRec(n - 1, n * f)
}
factorialTailRec(n, 1)
}
for(i <-1 to ITERS) println("factorialTailRec(" + i + ") = " + factorialTailRec(i))
def factorial(n:Int) : Int = {
if(n == 0) 1 else n * factorial(n-1)
}
for(i <-1 to ITERS) println("factorial(" + i + ") = " + factorial(i))
}
Observe that the inner tailRec function calls the second argument by name. for which the #tailRec annotation still does NOT throw a compile-time error!
I've been playing around with different values for the ITERS variable, and for a value of 100,000, I receive a... StackOverflowError!
(The result of zero is there because of overflow of Int.)
So I went ahead and changed the signature of factorialTailRec/2, to:
def factorialTailRec(n: Int, f: Int): Int
i.e call by value for the argument f. This time, the portion of main that runs factorialTailRec finishes absolutely fine, whereas, of course, factorial/1 crashes at the exact same integer.
Very, very interesting. It seems as if call by name in this situation maintains the stack frames because of the need of computation of the products themselves all the way back to the call chain.
I am Java developer and learning Scala at the moment. It is generally admitted, that Java is more verbose then Scala. I just need to call 2 or more methods concurrently and then combine the result. Official Scala documentation at docs.scala-lang.org/overviews/core/futures.html suggests to use for-comprehention for that. So I used that out-of-the-box solution straightforwardly. Then I thought how I would do it with CompletableFuture and was surprised that it produced more concise and faster code, then Scala's Future
Let's consider a basic concurrent case: summing up values in array. For simplicity, let's split array in 2 parts(hence it will be 2 worker threads). Java's sumConcurrently takes only 4 LOC, while Scala's version requires 12 LOC. Also Java's version is 15% faster on my computer.
Complete code, not benchmark optimised.
Java impl.:
public class CombiningCompletableFuture {
static int sumConcurrently(List<Integer> numbers) throws ExecutionException, InterruptedException {
int mid = numbers.size() / 2;
return CompletableFuture.supplyAsync( () -> sumSequentially(numbers.subList(0, mid)))
.thenCombine(CompletableFuture.supplyAsync( () -> sumSequentially(numbers.subList(mid, numbers.size())))
, (left, right) -> left + right).get();
}
static int sumSequentially(List<Integer> numbers) {
try {
Thread.sleep(TimeUnit.SECONDS.toMillis(1));
} catch (InterruptedException ignored) { }
return numbers.stream().mapToInt(Integer::intValue).sum();
}
public static void main(String[] args) throws ExecutionException, InterruptedException {
List<Integer> from1toTen = IntStream.rangeClosed(1, 10).boxed().collect(toList());
long start = System.currentTimeMillis();
long sum = sumConcurrently(from1toTen);
long duration = System.currentTimeMillis() - start;
System.out.printf("sum is %d in %d ms.", sum, duration);
}
}
Scala's impl.:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
object CombiningFutures extends App {
def sumConcurrently(numbers: Seq[Int]) = {
val splitted = numbers.splitAt(5)
val leftFuture = Future {
sumSequentally(splitted._1)
}
val rightFuture = Future {
sumSequentally(splitted._2)
}
val totalFuture = for {
left <- leftFuture
right <- rightFuture
} yield left + right
Await.result(totalFuture, Duration.Inf)
}
def sumSequentally(numbers: Seq[Int]) = {
Thread.sleep(1000)
numbers.sum
}
val from1toTen = 1 to 10
val start = System.currentTimeMillis
val sum = sumConcurrently(from1toTen)
val duration = System.currentTimeMillis - start
println(s"sum is $sum in $duration ms.")
}
Any explanations and suggestions how to improve Scala code without impacting readability too much?
A verbose scala version of your sumConcurrently,
def sumConcurrently(numbers: List[Int]): Future[Int] = {
val (v1, v2) = numbers.splitAt(numbers.length / 2)
for {
sum1 <- Future(v1.sum)
sum2 <- Future(v2.sum)
} yield sum1 + sum2
}
A more concise version
def sumConcurrently2(numbers: List[Int]): Future[Int] = numbers.splitAt(numbers.length / 2) match {
case (l1, l2) => Future.sequence(List(Future(l1.sum), Future(l2.sum))).map(_.sum)
}
And all this is because we have to partition the list. Lets say we had to write a function which takes a few lists and returns the sum of their sum's using multiple concurrent computations,
def sumConcurrently3(lists: List[Int]*): Future[Int] =
Future.sequence(lists.map(l => Future(l.sum))).map(_.sum)
If the above looks cryptic... then let me de-crypt it,
def sumConcurrently3(lists: List[Int]*): Future[Int] = {
val listOfFuturesOfSums = lists.map { l => Future(l.sum) }
val futureOfListOfSums = Future.sequence(listOfFuturesOfSums)
futureOfListOfSums.map { l => l.sum }
}
Now, whenever you use the result of a Future (lets say the future completes at time t1) in a computation, it means that this computation is bound to happen after time t1. You can do it with blocking like this in Scala,
val sumFuture = sumConcurrently(List(1, 2, 3, 4))
val sum = Await.result(sumFuture, Duration.Inf)
val anotherSum = sum + 100
println("another sum = " + anotherSum)
But what is the point of all that, you are blocking the current thread while for the computation on those threads to finish. Why not move the whole computation into the future itself.
val sumFuture = sumConcurrently(List(1, 2, 3, 4))
val anotherSumFuture = sumFuture.map(s => s + 100)
anotherSumFuture.foreach(s => println("another sum = " + s))
Now, you are not blocking anywhere and the threads can be used anywhere required.
Future implementation and api in Scala is designed to enable you to write your program avoiding blocking as far as possible.
For the task at hand, the following is probably not the tersest option either:
def sumConcurrently(numbers: Vector[Int]): Future[Int] = {
val (v1, v2) = numbers.splitAt(numbers.length / 2)
Future(v1.sum).zipWith(Future(v2.sum))(_ + _)
}
As I mentioned in my comment there are several issues with your example.
I came up with the following to convert a List[Int] => Try[BigDecimal]:
import scala.util.Try
def f(xs: List[Int]): Try[BigDecimal] =
Try { xs.mkString.toInt }.map ( BigDecimal(_) )
Example:
scala> f(List(1,2,3,4))
res4: scala.util.Try[BigDecimal] = Success(1234)
scala> f(List(1,2,3,55555))
res5: scala.util.Try[BigDecimal] = Success(12355555)
Is there a way to write this function without resorting to a String conversion step?
Not very pretty, and I'm not convinced it's much more efficient. Here's the basic outline.
val pwrs:Stream[BigInt] = 10 #:: pwrs.map(_ * 10)
List(1,2,3,55555).foldLeft(0:BigInt)((p,i) => pwrs.find(_ > i).get * p + i)
Here it is a little more fleshed out with error handling.
import scala.util.Try
def f(xs: List[Int]): Try[BigDecimal] = Try {
lazy val pwrs: Stream[BigDecimal] = 10 #:: pwrs.map(_ * 10)
xs.foldLeft(0: BigDecimal) {
case (acc, i) if i >= 0 => pwrs.find(_ > i).get * acc + i
case _ => throw new Error("bad")
}
}
UPDATE
Just for giggles, I thought I'd plug some code into Rex Kerr's handy benchmarking/profiling tool, Thyme.
the code
import scala.util.Try
def fString(xs: List[Int]): Try[BigInt] = Try { BigInt(xs.mkString) }
def fStream(xs: List[Int]): Try[BigInt] = Try {
lazy val pwrs: Stream[BigInt] = 10 #:: pwrs.map(_ * 10)
xs.foldLeft(0: BigInt) {
case (acc, i) if i >= 0 => pwrs.find(_ > i).get * acc + i
case _ => throw new Error("bad")
}
}
def fLog10(xs: List[Int]): Try[BigInt] = Try {
xs.foldLeft(0: BigInt) {
case (acc, i) if i >= 0 =>
math.pow(10, math.ceil(math.log10(i))).toInt * acc + i
case _ => throw new Error("bad")
}
}
fString() is a slight simplification of Kevin's original question. fStream() is my proposed non-string implementation. fLog10 is the same but with Alexey's suggested enhancement.
You'll note that I'm using BigInt instead of BigDecimal. I found that both non-string methods encountered a bug somewhere around the 37th digit of the result. Some kind of rounding error or something, but there was no problem with BigInt so that's what I used.
test setup
// create a List of 40 Ints and check its contents
val lst = List.fill(40)(util.Random.nextInt(20000))
lst.min // 5
lst.max // 19858
lst.mkString.length // 170
val th = ichi.bench.Thyme.warmed(verbose = print)
th.pbenchWarm(th.Warm(fString(lst)), title="fString")
th.pbenchWarm(th.Warm(fStream(lst)), title="fStream")
th.pbenchWarm(th.Warm(fLog10(lst)), title="fLog10")
results
Benchmark for fString (20 calls in 345.6 ms) Time: 4.015 us 95%
CI 3.957 us - 4.073 us (n=19) Garbage: 109.9 ns (n=2 sweeps
measured)
Benchmark for fStream (20 calls in 305.6 ms) Time: 7.118 us 95%
CI 7.024 us - 7.213 us (n=19) Garbage: 293.0 ns (n=3 sweeps
measured)
Benchmark for fLog10 (20 calls in 382.8 ms) Time: 9.205 us 95%
CI 9.187 us - 9.222 us (n=17) Garbage: 73.24 ns (n=2 sweeps
measured)
So I was right about the efficiency of the non-string algorithm. Oddly, using math._ to avoid Stream creation doesn't appear to be better. I didn't expect that.
takeaway
Number-to-string and string-to-number transitions are reasonably efficient.
import scala.util.{Try, Success}
import scala.annotation.tailrec
def findOrder(i: Int): Long = {
#tailrec
def _findOrder(i: Int, order: Long): Long = {
if (i < order) order
else _findOrder(i, order * 10)
}
_findOrder(i, 1)
}
def f(xs: List[Int]): Try[BigDecimal] = Try(
xs.foldLeft(BigDecimal(0))((acc, i) => acc * findOrder(i) + i)
)
To find the correct power of 10 more efficiently (replace pwrs.find(_ > i).get with nextPowerOf10(i) in #jwvh's answer):
def nextPowerOf10(x: Int) = {
val n = math.ceil(math.log10(x))
BigDecimal(math.pow(10, n))
}
Since you start with an Int, there should be no rounding issues.
I am trying to initialize an array in Scala, using parallelization. However, when using ParSeq.fill method, the performance doesn't seem to be better any better than sequential initialization (Seq.fill). If I do the same task, but initializing the collection with map, then it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random
object Timer {
def apply[A](f: => A): (A, Long) = {
val s = System.nanoTime
val ret = f
(ret, System.nanoTime - s)
}
}
object ParallelBenchmark extends App {
def randomIsPrime: Boolean = {
val n = Random.nextInt(1000000)
(2 until n).exists(i => n % i == 0)
}
val seqSize = 100000
val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
println(f"Time Seq:\t\t $timeSeq")
val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
println(f"Time Par Fill:\t $timeParFill")
val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections, fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.