I'm trying to understand the rationale behind the statement:
For cases where blocking is absolutely necessary, futures can be blocked on (although it is discouraged)
The idea behind ForkJoinPool is to join tasks, which is a blocking operation, and this pool is the main ExecutionContext implementation for futures and actors. So it should be efficient for blocking joins.
I wrote a small benchmark, and it seems that old-style futures (Scala 2.9) are about twice as fast in this very simple scenario.
import scala.concurrent._
import scala.concurrent.duration._

object FuturesBenchmark {

  @inline
  def futureResult[T](future: Future[T]): T = Await.result(future, Duration.Inf)

  @inline
  def futureOld[T](body: => T)(implicit ctx: ExecutionContext): () => T = {
    val f = future(body)
    () => futureResult(f)
  }

  def main(args: Array[String]) {
    import ExecutionContext.Implicits.global
    @volatile
    var res = 0d
    CommonUtil.timer("res1") { // CommonUtil.timer is my own timing helper
      (0 until 100000).foreach { i =>
        val f1 = futureOld(math.exp(1))
        val f2 = futureOld(math.exp(2))
        val f3 = futureOld(math.exp(3))
        res = res + f1() + f2() + f3()
      }
    }
    println("res1 = " + res)
    res = 0
    CommonUtil.timer("res1") {
      (0 until 100000).foreach { i =>
        val f1 = future(math.exp(1))
        val f2 = future(math.exp(2))
        val f3 = future(math.exp(3))
        val f4 = for (r1 <- f1; r2 <- f2; r3 <- f3) yield r1 + r2 + r3
        res = res + futureResult(f4)
      }
    }
    println("res2 = " + res)
  }
}
start:res1
res1 - 1.683 seconds
res1 = 3019287.4850644027
start:res1
res1 - 3.179 seconds
res2 = 3019287.485058338
Most of the point of Futures is that they enable you to create non-blocking, concurrent code that can easily be executed in parallel.
OK, so wrapping a potentially lengthy function in a future returns immediately, so that you can postpone worrying about the return value until you are actually interested in it. But if the part of the code which does care about the value just blocks until the result is actually available, all you gained was a way to make your code a little tidier (and you know, you can do that without futures - using futures to tidy up your code would be a code smell, I think). Unless the functions being wrapped in futures are absolutely trivial, your code is going to spend much more time blocking than evaluating other expressions.
If, on the other hand, you register a callback (e.g. using onComplete or onSuccess) and put in that callback the code which cares about the result, then you can have code which can be organised to run very efficiently and scale well. It becomes event driven rather than having to sit and wait for results.
Your benchmark is of the former type, but since the wrapped functions are so tiny, there is very little to gain from executing them in parallel rather than in sequence. This means that you are mostly measuring the overhead of creating and accessing the futures. Congratulations: you showed that in some circumstances 2.9 futures are faster at doing something trivial than 2.10 - something trivial which does not really play to the strengths of either version of the concept.
Try something a little more complex and demanding. I mean, you're requesting the future values almost immediately! At the very least, you could build an array of 100000 futures, then pull out their results in another loop. That would be testing something slightly meaningful. Oh, and have them compute something based on the value of i.
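For instance, a minimal sketch of that two-pass shape (using the same Future and Await machinery as the benchmark above; the workload is purely illustrative) could look like this:
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// first pass: launch everything; second pass: collect the results
val futures = (0 until 100000).map(i => Future(math.exp(i % 10)))
val total   = futures.map(f => Await.result(f, Duration.Inf)).sum
println(total)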
You could progress from there to:
1. Creating an object to store the results.
2. Registering a callback with each future that inserts its result into the object.
3. Launching your n calculations.
And then benchmarking how long it takes for the actual results to arrive, when you demand them all. That would be rather more meaningful.
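A rough sketch of that progression (the names and the CountDownLatch-based completion signal are my own choices, not from the question):
import java.util.concurrent.CountDownLatch
import scala.collection.concurrent.TrieMap
import scala.concurrent._
import ExecutionContext.Implicits.global

val n       = 100000
val results = TrieMap.empty[Int, Double]
val done    = new CountDownLatch(n)

val start = System.nanoTime()
(0 until n).foreach { i =>
  Future(math.exp(i % 10)).onComplete { r =>
    r.foreach(results.put(i, _)) // the callback stores each result as soon as it is ready
    done.countDown()
  }
}
done.await() // measure how long it takes for all n results to arrive
println(s"all $n results in ${(System.nanoTime() - start) / 1e6} ms")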
EDIT
By the way, your benchmark fails both on its own terms and in its understanding of the proper use of futures.
Firstly, you are counting the time it takes to retrieve each individual future result, but not the actual time it takes to evaluate res once all 3 futures have been created, nor the total time it takes to iterate through the loop. Also, your mathematical calculations are so trivial that you might actually be testing the penalty in the second test of a) the for comprehension and b) the fourth future in which the first three futures are wrapped.
Secondly, the only reason these sums probably add up to something roughly proportional to the overall time used is precisely because there is really no concurrency here.
I'm not trying to beat up on you, it's just that these flaws in the benchmark help illuminate the issue. A proper benchmark of the performance of different futures implementations would require very careful thought.
The Java 7 docs for ForkJoinTask report:
A ForkJoinTask is a lightweight form of Future. The efficiency of
ForkJoinTasks stems from a set of restrictions (that are only
partially statically enforceable) reflecting their intended use as
computational tasks calculating pure functions or operating on purely
isolated objects. The primary coordination mechanisms are fork(), that
arranges asynchronous execution, and join(), that doesn't proceed
until the task's result has been computed. Computations should avoid
synchronized methods or blocks, and should minimize other blocking
synchronization apart from joining other tasks or using synchronizers
such as Phasers that are advertised to cooperate with fork/join
scheduling. Tasks should also not perform blocking IO, and should
ideally access variables that are completely independent of those
accessed by other running tasks. Minor breaches of these restrictions,
for example using shared output streams, may be tolerable in practice,
but frequent use may result in poor performance, and the potential to
indefinitely stall if the number of threads not waiting for IO or
other external synchronization becomes exhausted. This usage
restriction is in part enforced by not permitting checked exceptions
such as IOExceptions to be thrown. However, computations may still
encounter unchecked exceptions, that are rethrown to callers
attempting to join them. These exceptions may additionally include
RejectedExecutionException stemming from internal resource exhaustion,
such as failure to allocate internal task queues. Rethrown exceptions
behave in the same way as regular exceptions, but, when possible,
contain stack traces (as displayed for example using
ex.printStackTrace()) of both the thread that initiated the
computation as well as the thread actually encountering the exception;
minimally only the latter.
Doug Lea's maintenance repository for JSR166 (targeted at JDK8) expands on this:
A ForkJoinTask is a lightweight form of Future. The efficiency of
ForkJoinTasks stems from a set of restrictions (that are only
partially statically enforceable) reflecting their main use as
computational tasks calculating pure functions or operating on purely
isolated objects. The primary coordination mechanisms are fork(), that
arranges asynchronous execution, and join(), that doesn't proceed
until the task's result has been computed. Computations should ideally
avoid synchronized methods or blocks, and should minimize other
blocking synchronization apart from joining other tasks or using
synchronizers such as Phasers that are advertised to cooperate with
fork/join scheduling. Subdividable tasks should also not perform
blocking I/O, and should ideally access variables that are completely
independent of those accessed by other running tasks. These guidelines
are loosely enforced by not permitting checked exceptions such as
IOExceptions to be thrown. However, computations may still encounter
unchecked exceptions, that are rethrown to callers attempting to join
them. These exceptions may additionally include
RejectedExecutionException stemming from internal resource exhaustion,
such as failure to allocate internal task queues. Rethrown exceptions
behave in the same way as regular exceptions, but, when possible,
contain stack traces (as displayed for example using
ex.printStackTrace()) of both the thread that initiated the
computation as well as the thread actually encountering the exception;
minimally only the latter.
It is possible to define and use ForkJoinTasks that may block, but
doing so requires three further considerations: (1) Completion of few
if any other tasks should be dependent on a task that blocks on
external synchronization or I/O. Event-style async tasks that are
never joined (for example, those subclassing CountedCompleter) often
fall into this category. (2) To minimize resource impact, tasks should
be small; ideally performing only the (possibly) blocking action. (3)
Unless the ForkJoinPool.ManagedBlocker API is used, or the number of
possibly blocked tasks is known to be less than the pool's
ForkJoinPool.getParallelism() level, the pool cannot guarantee that
enough threads will be available to ensure progress or good
performance.
tl;dr:
The "blocking join" operation referred to by the fork/join framework is not to be confused with calling some "blocking code" within a task.
The first is about coordinating many independent tasks (which are not independent threads) in order to collect their individual outcomes and evaluate an overall result.
The second is about calling a potentially long-blocking operation within a single task: e.g. IO operations over the network, a DB query, accessing the file system, accessing a globally synchronized object or method...
The second kind of blocking is discouraged for Scala Futures and ForkJoinTasks both.
The main risk is that the thread-pool gets exhausted and is unable to complete tasks awaiting in the queue, while all available threads are busy waiting on blocking operations.
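To make the exhaustion scenario concrete, here is a minimal sketch (the pool size and workload are arbitrary) using a small fixed pool that does not implement BlockContext: every pool thread ends up blocked waiting for an inner future that can never be scheduled, so nothing makes progress.
import java.util.concurrent.Executors
import scala.concurrent._
import scala.concurrent.duration._

implicit val smallPool: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

val stuck = (1 to 2).map { _ =>
  Future {
    val inner = Future(21 + 21)       // queued, but both pool threads are already busy
    Await.result(inner, Duration.Inf) // each outer task blocks its thread forever: deadlock
  }
}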
Related
I am asking myself the question: "When should you use scala.concurrent.blocking?"
If I understood correctly, blocking {} only makes sense in conjunction with a ForkJoinPool. In addition, docs.scala-lang.org highlights that blocking shouldn't be used for long-running executions:
Last but not least, you must remember that the ForkJoinPool is not designed for long-lasting blocking operations.
I assume a long-running execution is a database call or some kind of external IO. In this case a separate thread pool should be used, e.g. a CachedThreadPool. Most IO-related frameworks, like sttp, doobie, or cats, can make use of a provided IO thread pool.
So I am asking myself which use cases still exist for the blocking statement. Is it only useful when working with locking and waiting operations, like semaphores?
Consider the problem of thread pool starvation. Say you have a fixed size thread pool of 10 available threads, something like so:
implicit val myFixedThreadPool =
ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
If for some reason all 10 threads are tied up, and a new request comes in which requires an 11th thread to do its work, then this 11th request will hang until one of the threads becomes available.
The Future { blocking { ... } } construct can be interpreted as saying: please do not tie up a thread from myFixedThreadPool with this blocking work, but instead spin up a new thread outside myFixedThreadPool (provided the execution context honours the blocking hint).
One practical use case for this is if your application can conceptually be considered to be in two parts: one part which, say in 90% of cases, talks to proper async APIs, and another part which in a few special cases has to talk to a very slow external API that takes many seconds to respond and which we have no control over. Using the fixed thread pool for the truly async part is relatively safe from thread pool starvation; using the same fixed thread pool for the second part as well creates the danger that suddenly 10 requests are made to the slow external API, which then causes the other 90% of requests to hang waiting for those slow requests to finish. Wrapping those slow requests in blocking helps minimise the chance of the other 90% of requests hanging.
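A minimal sketch of that first approach (callVerySlowExternalApi is a hypothetical stand-in for the slow third-party call; note that the blocking hint only has an effect on a BlockContext-aware pool such as ExecutionContext.global):
import scala.concurrent._
import ExecutionContext.Implicits.global

def callVerySlowExternalApi(): String = { Thread.sleep(8000); "response" } // hypothetical stand-in

val slow: Future[String] = Future {
  blocking { // tells the pool this task will block, so it may compensate with extra threads
    callVerySlowExternalApi()
  }
}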
Another way of achieving this kind of "swimlaning" of truly async requests away from blocking requests is to offload the blocking requests to a separate dedicated thread pool used just for the blocking calls, something like so:
implicit val myDefaultPool =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

val myPoolForBlockingRequests =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(20))

Future {
  callAsyncApi
} // consumes a thread from myDefaultPool

...

Future {
  callBlockingApi
}(myPoolForBlockingRequests) // consumes a thread from myPoolForBlockingRequests
I am asking myself the question: "When should you use scala.concurrent.blocking?"
Well, since that is mostly useful for Future, and Future should never be used for serious business logic, then: never.
Now, "jokes" aside: when using Futures you should always use blocking when wrapping blocking operations, AND accept a custom ExecutionContext instead of hardcoding the global one. Note that this should always be the case, even for non-blocking operations, but IME most folks using Future don't do it... but that is another discussion.
Then, callers of those blocking operations may decide if they will use their compute EC or a blocking one.
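As a minimal sketch of that shape (readFileAsync and its file-reading body are just illustrative), the operation marks its blocking section and takes whatever ExecutionContext the caller supplies:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.concurrent._

def readFileAsync(path: Path)(implicit ec: ExecutionContext): Future[String] =
  Future {
    blocking { // signals the EC (if it honours the hint) that this task blocks
      new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    }
  }

// the caller decides which pool pays for it:
// readFileAsync(path)(computeEc)  or  readFileAsync(path)(blockingEc)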
When the docs mention long-lasting they don't mean anything specific, mostly because it is too hard to be specific about that; it is context/application specific. What you need to understand is that blocking by default (note that an actual EC may do whatever it wants) will just create a new thread, and if you create a lot of threads and they take too long to be released you will saturate your memory and kill the program with an OOM error.
For those situations, the recommendation is to control the back pressure of your app to avoid creating too many threads. One way to do that is to create a fixed thread pool sized for the maximum number of blocking operations you will support and just enqueue all other pending tasks; such an EC should simply ignore blocking calls. You may also have an unbounded number of threads but manage the back pressure manually in other parts of your code, e.g. with an explicit Queue; this was common advice before: https://gist.github.com/djspiewak/46b543800958cf61af6efa8e072bfd5c
However, having blocked threads always hurts the performance of your app, even if the compute EC is not blocked. The latest talks by Daniel explain this in detail: "The case for effect systems" & "Threads at scale".
So the ecosystem is pushing the state of the art hard to avoid that at all costs, but it is not a simple task. Still, runtimes like the ones provided by cats-effect or ZIO are optimized to handle blocking tasks as well as they can today, and will probably keep improving over the coming years.
There is a list of parameters, each of which is an input for a MongoDB query.
The queries might differ, but for simplicity let's keep only one, encapsulated in callMongoReactive.
fun callMongoReactive(p: Param): Mono<Result>{
// ...
}
fun queryInParallel(parameters: List<Param>): List<Result> =
parameters
.map { async { mongo.findSomething(it).awaitSingle() } }
.awaitAll()
Suppose parameters list size is not greater than 20.
What is the optimal strategy for using coroutine dispatchers so that these async requests run in parallel?
Should a custom dispatcher (with a dedicated thread pool) be created or is there a standard approach for such situations?
If you want a very general answer, I'd say you may or may not want to care at this level. You don't have to.
In this specific case, the original API is reactive (immediately returns Mono), which means the actual work is not going to be performed on the thread calling callMongoReactive/findSomething, but on whatever thread pool mongoDB decided to use to back this Mono. This means the choice of dispatcher in this case really doesn't matter.
So especially in this case, I'd go for the simplest option: don't choose. Use coroutineScope and expose a suspend function so the caller decides on the coroutine context (including the dispatcher):
suspend fun queryInParallel(parameters: List<Param>): List<Result> = coroutineScope {
parameters
.map { async { mongo.findSomething(it).awaitSingle() } }
.awaitAll()
}
This is the usual idiom for "parallel decomposition of work". I'd say it's the most common.
What is the optimal strategy for using coroutine dispatchers so that these async requests run in parallel?
It's worth noting that the idiom above expresses concurrency, but whether the bodies of the asyncs will be run in parallel or not depends on the dispatcher chosen by the caller.
(To reiterate, the dispatcher only affects the body of those asyncs, and in this specific case they don't use the thread that much because they call a non-blocking method anyway, so it really doesn't matter.)
Now in cases where it does matter, any dispatcher backed by more than 1 thread would allow parallelism here. There are several existing dispatchers that may be useful, without needing to create your own thread pool. Dispatchers.Default has a number of threads adapted to the number of cores of the machine it's running on, so it's a good fit for CPU-bound work. Dispatchers.IO scales the number of threads as needed, which is useful if you have a lot of blocking IO and want to avoid starvation.
You can also use limitedParallelism on any dispatcher to get a view of it with only a limited number of threads, which may be useful in some cases where you don't want to create an extra thread pool, but you do want to limit the number of available threads more than what the built-in dispatchers offer.
Creating a custom thread pool can be interesting if you want to isolate parts of your system in case one subsystem misbehaves and starves threads. It does have a memory overhead, though, since you're creating more threads.
With reference to the third point in this accepted answer, are there any cases for which it would be pointless or bad to use blocking for a long-running computation, whether CPU- or IO-bound, that is being executed 'within' a Future?
It depends on the ExecutionContext your Future is being executed in.
Pointless:
If the ExecutionContext is not a BlockContext, then using blocking will be pointless. That is, it would use the DefaultBlockContext, which simply executes the code without any special handling. It probably wouldn't add that much overhead, but pointless nonetheless.
Bad:
Scala's ExecutionContext.Implicits.global is made to spawn new threads in its ForkJoinPool when the pool is about to be exhausted - that is, if it knows this is about to happen because the code is marked with blocking. This can be bad if you're spawning lots of threads: if you're queuing up a lot of work in a short span of time, the global context will happily expand until gridlock. @dk14's answer explains this in more depth, but the gist is that it can be a performance killer, as managed blocking can quickly become unmanageable.
The main purpose of blocking is to avoid deadlocks within thread pools, so it is tangentially related to performance in the sense that reaching a deadlock would be worse than spawning a few more threads. However, it is definitely not a magical performance enhancer.
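A small sketch of the compensation behaviour on the global, BlockContext-aware pool (the sleep is a stand-in for any blocking call): with blocking, the pool spawns extra threads so all 100 sleeps run at once; without it, they would queue behind roughly one thread per core.
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

val sleepers = (1 to 100).map { _ =>
  Future {
    blocking { Thread.sleep(1000) } // managed blocking: the pool may add compensation threads
  }
}
Await.ready(Future.sequence(sleepers), 30.seconds) // completes in roughly one second on default settings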
I've written more about blocking in particular in this answer.
From my practice, blocking + ForkJoinPool may lead to continuous and uncontrollable creation of threads if you have a lot of messages to process and each one requires a long blocking call (which also means that it holds some memory during that time). The ForkJoinPool creates a new thread to compensate for the "managed blocked" one, regardless of MaxThreadCount; say hello to hundreds of threads in VisualVM. And it almost kills backpressure, as there is always a place for a task in the pool's queue (if your backpressure is based on ThreadPoolExecutor's policies). Performance ends up killed by both new-thread allocation and garbage collection.
So:
it's good when the message rate is not much higher than 1/blocking_time, as it allows you to use the full power of your threads. Some smart backpressure might help to slow down incoming messages.
it's pointless if a task actually uses your CPU during blocking {} (no locks), as it will just raise the thread count above the number of real cores in the system.
and it's bad for any other case - you should use a separate fixed thread pool (and maybe polling) then.
P.S. blocking is hidden inside Await.result, so it's not always obvious. In our project someone just did such an Await inside some underlying worker actor.
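A sketch of the anti-pattern described in the P.S. (slowCall is a stand-in): the Await.result inside the outer Future invokes blocking under the hood, so on a ForkJoinPool-backed context each such task can trigger a compensation thread, and with many of them the thread count balloons exactly as described above.
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

def slowCall(): Future[Int] = Future { Thread.sleep(5000); 42 } // stand-in for real work

val outer = Future {
  Await.result(slowCall(), Duration.Inf) + 1 // hidden managed blocking inside a worker
}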
Is there a good reason for the added complexity of Futures (vs parallel collections) when processing a list of items in parallel?
List(...).par.foreach(x=>longRunningAction(x))
vs
Future.traverse(List(...)) (x=>Future(longRunningAction(x)))
I think the main advantage would be that you can access the results of each future as soon as it is computed, while you would have to wait for the whole computation to be done with a parallel collection. A disadvantage might be that you end up creating lots of futures. If you later end up calling Future.sequence, there really is no advantage.
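A tiny sketch of that difference (longRunningAction is a stand-in): with Futures each result can be acted on the moment it completes, whereas list.par.map only hands back the whole collection at the end.
import scala.concurrent._
import ExecutionContext.Implicits.global

def longRunningAction(x: Int): Int = { Thread.sleep(x * 100L); x * x } // stand-in for real work

List(5, 1, 3).foreach { x =>
  Future(longRunningAction(x)).foreach { r =>
    println(s"done: $x -> $r") // fires per element, typically in completion order
  }
}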
Parallel collections will kill off some threads as we get closer to processing all the items, so the last few items might be processed by a single thread.
Please see my question for more details on this behavior: Using ThreadPoolTaskSupport as tasksupport for parallel collections in scala.
Futures do no such thing, and all your threads are in use until all items are processed. Hence, unless your tasks are so small that you don't care about the loss of parallelism for the last few tasks, and you are using a huge number of threads which have to be killed off as soon as possible, Futures are better.
Futures become useful as soon as you want to compose your deferred / concurrent computations. Futures (the good kind, anyway, such as Akka's) are monadic and hence allow you to build arbitrarily complex computational structures with all the concurrency and synchronization handled properly by the Futures library.
I'm new to Scala in general and Actors in particular and my problem is so basic, the online resources I have found don't cover it.
I have a CPU-intensive, easily parallelized algorithm that will be run on an n-core machine (I don't know n). How do I implement this in Actors so that all available cores address the problem?
The first way I thought of was to simply break the problem into m pieces (where m is some medium number like 10,000) and create m Actors, one for each piece, give each Actor its little piece, and let 'em go.
Somehow, this struck me as inefficient. Zillions of Actors just hanging around, waiting for some CPU love, pointlessly switching contexts...
Then I thought, make some smaller number of Actors, and feed each one several pieces. The problem was, there's no reason to expect the pieces are the same size, so one core might get bogged down, with many of its tasks still queued, while other cores are idle.
I noodled around with a Supervisor that knew which Actors were busy, and eventually realized that this has to be a solved problem. There must be a standard pattern (maybe even a standard library) for dealing with this very generic issue. Any suggestions?
Take a look at the Akka library, which includes an implementation of actors. The Dispatchers module gives you more options for limiting actors to CPU threads (HawtDispatch-based event-driven) and/or balancing the workload (work-stealing event-based).
Generally, there are two kinds of actors: those that are tied to threads (one thread per actor), and those that share one or more threads, working behind a scheduler/dispatcher that allocates resources (i.e. the ability to execute a task / handle an incoming message against a controlled thread pool or a single thread).
I assume you use the second type of actors - event-driven actors - because you mention that you run 10k of them. No matter how many event-driven actors you have (thousands or millions), all of them will be fighting for the small thread pool to handle their messages. Therefore you will get even worse performance by dividing your task queue into that huge number of portions - the scheduler will either try to handle messages sent to 10k actors against a fixed thread pool (which is slow), or will allocate new threads in the pool (if the pool is not bounded), which is dangerous (in the worst case, 10k threads will be started to handle the messages).
Event-driven actors are good for short (ideally non-blocking) tasks. If you're dealing with CPU-intensive tasks, I'd limit the number of threads in the scheduler/dispatcher pool (when you use event-driven actors) or the number of actors themselves (when you use thread-based actors) to the number of cores, to achieve the best performance.
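Not tied to any particular actor library, a rough sketch of sizing a pool to the number of cores for CPU-bound work:
import java.util.concurrent.{ExecutorService, Executors}

val cores: Int = Runtime.getRuntime.availableProcessors()
val cpuBoundPool: ExecutorService =
  Executors.newFixedThreadPool(cores) // one worker per core for CPU-bound tasks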
If you want this to be done automatically (adjusting the number of threads in the dispatcher pool to the number of cores), you should use HawtDispatch (or its Akka implementation), as was proposed earlier:
The 'HawtDispatcher' uses the HawtDispatch threading library which is a Java clone of libdispatch. All actors with this type of dispatcher are executed on a single system-wide fixed-size thread pool. The number of threads will match the number of cores available on your system. The dispatcher delivers messages to the actors in the order that they were produced at the sender.
You should look into Futures, I think. In fact, you probably need a thread pool which simply queues work once its maximum number of threads has been reached.
Here is a small example involving futures: http://blog.tackley.net/2010/01/scala-futures.html
I would also suggest that you don't pay too much attention to context switching, since you really can't do anything but rely on the underlying implementation. Of course, a rule of thumb would be to keep the number of active threads around the number of physical cores, but as I noted above this could be handled by a thread pool with a FIFO queue.
NOTE that I don't know if Actors in general or futures are implemented with this kind of pool.
For thread pools, look at this: http://www.scala-lang.org/api/current/scala/concurrent/ThreadPoolRunner.html
and maybe this: http://www.scala-lang.org/api/current/scala/actors/scheduler/ResizableThreadPoolScheduler.html
Good luck
EDIT
Check out this piece of code using futures:
import scala.actors.Futures._

object FibFut {
  def fib(i: Int): Int = if (i < 2) 1 else fib(i - 1) + fib(i - 2)

  def main(args: Array[String]) {
    val fibs = for (i <- 0 to 42) yield future { fib(i) }
    for (future <- fibs) println(future()) // applying a future blocks until its result is ready
  }
}
It showcases a very good point about futures, namely that you decide in which order to receive the results (as opposed to the normal mailbox system, which is FIFO, i.e. the fastest actor gets its result in first).
For any significant project, I generally have a supervisor actor, a collection of worker actors each of which can do any work necessary, and a large number of pieces of work to do. Even though I do this fairly often, I've never put it in a (personal) library because the operations end up being so different each time, and the overhead is pretty small compared to the whole coding project.
Be aware of actor starvation if you end up utilizing the general actor threadpool. I ended up simply using my own algorithm-task-owned threadpool to handle the parallelization of a long-running, concurrent task.
The upcoming Scala 2.9 is expected to include parallel data structures which should automatically handle this for some uses. While it does not use Actors, it may be something to consider for your problem.
While this feature was originally slated for 2.8, it has been postponed until the next major release.
A presentation from the last ScalaDays is here:
http://days2010.scala-lang.org/node/138/140