Mapping Iterator into Iterator[Future] is not behaving as expecting - scala

Using Scala 2.13.0:
implicit val ec = ExecutionContext.global
val arr = (0 until 20).toIterator
.map { x =>
Thread.sleep(500);
println(x);
x
}
val fss = arr.map { slowX =>
Future { blocking { slowX } }
}
Await.result(Future.sequence(fss), Inf)
problem
arr is an iterator where each item needs 500ms processing time. We map the iterator with Future { blocking { ... }} with the purpose of making the processing parallel (using the global execution context). Finally we run Future.sequence
to consume the iterator.
Given the definition of Future.apply[T](body: =>T) and blocking[T](body: =>T), body is passed lazily, which means that body will be processed in the Future. If we inject that in the definition of Iterator.map, we get def next() = Future{blocking(self.next())}, so each item of the iterator should be processed in the Future.
But when I try this example however, I can see that the iterator is consumed sequentially, which is not what is expected!
Is that a Scala bug?? Or am I missing something?

No it's not a bug, because:
val arr = (0 until 20).toIterator
// this map invokes first and executed sequentially, because it executes in same thread.
.map { x =>
Thread.sleep(500);
println(x);
x
}
// This go sequentially because upstream map executed sequentially in same thread.
// So, "Future { blocking { slowX } }" can be replaced with "Future.successfull(slowX)"
// because no computation executed
val fss = arr.map { slowX =>
Future { blocking { slowX } }
}
If you want perform completely asynchronously, you can do something like:
def heavyCalculation(x: Int) = {
Thread.sleep(500);
println(x);
x
}
val result = Future.traverse((0 until 20).toList) { x =>
Future(blocking(heavyCalculation(x)))
}
Await.result(result, 1 minute)
Working Scatie example: https://scastie.scala-lang.org/3v06NpypRHKYkqBgzaeVXg

First, this is not a proper benchmark, you actually haven't show formal proof that this is sequential and not parallel (although is "obvious" from the source code that it isn't).
Second, and Iterator of Futures is probably a bad idea; at this point, it may make sense to look into a streaming solution like Akka-Streams, fs2, Monix or ZIO.
Third, what is even the point of having a bunch of blocking futures? you aren't actually winning too much.
Fourth, the problem is that the second map is not passing the block of the first map, just the result. So, you actually did the sleep before creating the Future.
Fifth, you probably want to do this instead.
val result = Future.traverse(data) { elem =>
Future {
blocking {
// Process elem here.
}
}
}
Await.result(result, Inf)

The other answers were pointing in the right direction, but the formal answer is the following: the signature of Iterator.map(f: A => B) tells us that A that A is computed before f is applied to it (because it is not => A). Therefore, next() is computed in the main thread.

Related

scala's for yield comprehension used with Future. How to wait until future has returned?

I have a function which provides a Context:
def buildContext(s:String)(request:RequestHeader):Future[Granite.Context] = {
.... // returns a Future[Granite.Context]
}
I then have another function which uses a Context to return an Option[Library.Document]:
def getDocument(tag: String):Option[Library.Document] = {
val fakeRequest = play.api.test.FakeRequest().withHeaders(CONTENT_TYPE -> "application/json")
val context = buildContext(tag)(fakeRequest)
val maybeDoc = context.getDocument //getDocument is defined on Granite.Context to return an Option[Library.Document]
}
How would this code take into account if the Future has returned or not? I have seen for/yield used to wait for the return but I always assumed that a for/yield just flatmaps things together and has nothing really to do with waiting for Futures to return. I'm kinda stuck here and don't really no the correct question to ask!
The other two answers are misleading. A for yield in Scala is a compiler primitive that gets transformed into map or flatMap chains. Do not use Await if you can avoid it, it's not a simple issue.
You are introducing blocking behaviour and you have yet to realise the systemic damage you are doing when blocking.
When it comes to Future, map and flatMap do different things:
map
is executed when the future completes. It's an asynchronous way to do a type safe mapping.
val f: Future[A] = someFutureProducer
def convertAToB(a: A): B = {..}
f map { a => convertAToB(a) }
flatMap
is what you use to chain things:
someFuture flatMap {
_ => {
someOtherFuture
}
}
The equivalent of the above is:
for {
result1 <- someFuture
result2 <- someOtherFuture
} yield result2
In Play you would use Async to handle the above:
Async {
someFuture.map(i => Ok("Got result: " + i))
}
Update
I misunderstood your usage of Play. Still, it doesn't change anything. You can still make your logic asynchronous.
someFuture onComplete {
case Success(result) => // doSomething
case Failure(err) => // log the error etc
}
The main difference when thinking asynchronously is that you always have to map and flatMap and do everything else inside Futures to get things done. The performance gain is massive.
The bigger your app, the bigger the gain.
When using a for-comprehension on a Future, you're not waiting for it to finish, you're just saying: when it is finished, use it like this, and For-comprehension returns another Future in this case.
If you want to wait for a future to finish, you should use the Await as follows:
val resultContext = Await.result(context , timeout.duration)
Then run the getDocument method on it as such:
val maybeDoc = resultContext.getDocument
EDIT
The usual way to work with Futures however is to wait until the last moment before you Await. As pointed out by another answer here, Play Framework does the same thing by allowing you to return Future[Result]. So, a good way to do things would be to only use for-comprehensions and make your methods return Futures, etc, until the last moment when you want to finally return your result.
You can use scala.concurrent.Await for that:
import scala.concurrent.duration._
import scala.concurrent.Await
def getDocument(tag: String):Option[Library.Document] = {
val fakeRequest = play.api.test.FakeRequest().withHeaders(CONTENT_TYPE -> "application/json")
val context = Await.result(buildContext(tag)(fakeRequest), 42.seconds)
val maybeDoc = context.getDocument
}
But Await will block thread while future is not completed, so would be better either make buildContext a synchronous operation returning Granite.Context, or make getDocument async too, returning Future[Option[Library.Document]].
Once you are in a future you must stay in the future or you must wait until the future arrives.
Waiting is usually a bad idea because it blocks your execution, so you should work in the future.
Basically you should change your getDocument method to return a Future to something like getDocument(tag: String):Future[Option[Library.Document]]
Then using map ro flatMap, you chain your future calls:
return buildContext(tag)(fakeRequest).map(_.getDocument)
If buildContext fails, map will wrap the Failure
Then call
getDocument("blah").onComplete {
case Success(optionalDoc) => ...
case Failure(e) =>...
}

Are promises flawed?

Like the author of this question I'm trying to understand the reasoning for user-visible promises in Scala 2.10's futures and promises.
Particularly, going again to the example from the SIP, isn't it completely flawed:
import scala.concurrent.{ future, promise }
val p = promise[T]
val f = p.future
val producer = future {
val r = produceSomething()
p success r
continueDoingSomethingUnrelated()
}
val consumer = future {
startDoingSomething()
f onSuccess {
case r => doSomethingWithResult()
}
}
I am imagining the case where the call to produceSomething results in a runtime exception. Because promise and producer-future are completely detached, this means the system hangs and the consumer will never complete with either success or failure.
So the only safe way to use promises requires something like
val producer = future {
try {
val r.produceSomething()
p success r
} catch {
case e: Throwable =>
p failure e
throw e // ouch
}
continueDoingSomethingUnrelated()
}
Which obviously error-prone and verbose.
The only case I can see for a visible promise type—where future {} is insufficient—is the one of the callback hook in M. A. D.'s answer. But the example of the SIP doesn't make sense to me.
This is why you rarely use success and failure unless you already know something is bulletproof. If you want bulletproof, this is what Try is for:
val producer = future {
p complete Try( produceSomething )
continueDoingSomethingUnrelated()
}
It doesn't seem necessary to throw the error again; you've already dealt with it by packing it into the answer to the promise, no? (Also, note that if produceSomething itself returns a future, you can use completeWith instead.)
Combinators
You can use Promise to build additional Future combinators that aren't already in the library.
"Select" off the first future to be satisfied. Return as a result, with the remainder of the Futures as a sequence: https://gist.github.com/viktorklang/4488970.
An after method that returns a Future that is completed after a certain period of time, to "time out" a Future: https://gist.github.com/3804710.
You need Promises to be able to create other combinators like this.
Adapt Callbacks
Use Promise to adapt callback-based APIs to Future-based APIs. For example:
def retrieveThing(key: String): Future[Thing] = {
val p = Promise[Thing]()
val callback = new Callback() {
def receive(message: ThingMessage) {
message.getPayload match {
case t: Thing =>
p success t
case err: SystemErrorPayload =>
p failure new Exception(err.getMessage)
}
}
}
thingLoader.load(key, callback, timeout)
p.future
}
Synchronizers
Build synchronizers using Promise. For example, return a cached value for an expensive operation, or compute it, but don't compute twice for the same key:
private val cache = new ConcurrentHashMap[String, Promise[T]]
def getEntry(key: String): Future[T] = {
val newPromise = Promise[T]()
val foundPromise = cache putIfAbsent (key, newPromise)
if (foundPromise == null) {
newPromise completeWith getExpensive(key)
newPromise.future
} else {
foundPromise.future
}
}
Promise has a completeWith(f: Future) method that would solve this problem by automatically handling the success/failure scenarios.
promise.completeWith( future {
r.produceSomething
})

Incremental processing in an akka actor

I have actors that need to do very long-running and computationally expensive work, but the computation itself can be done incrementally. So while the complete computation itself takes hours to complete, the intermediate results are actually extremely useful, and I'd like to be able to respond to any requests of them. This is the pseudo code of what I want to do:
var intermediateResult = ...
loop {
while (mailbox.isEmpty && computationNotFinished)
intermediateResult = computationStep(intermediateResult)
receive {
case GetCurrentResult => sender ! intermediateResult
...other messages...
}
}
The best way to do this is very close to what you are doing already:
case class Continue(todo: ToDo)
class Worker extends Actor {
var state: IntermediateState = _
def receive = {
case Work(x) =>
val (next, todo) = calc(state, x)
state = next
self ! Continue(todo)
case Continue(todo) if todo.isEmpty => // done
case Continue(todo) =>
val (next, rest) = calc(state, todo)
state = next
self ! Continue(rest)
}
def calc(state: IntermediateState, todo: ToDo): (IntermediateState, ToDo)
}
EDIT: more background
When an actor sends messages to itself, Akka’s internal processing will basically run those within a while loop; the number of messages processed in one go is determined by the actor’s dispatcher’s throughput setting (defaults to 5), after this amount of processing the thread will be returned to the pool and the continuation be enqueued to the dispatcher as a new task. Hence there are two tunables in the above solution:
process multiple steps for a single message (if processing steps are REALLY small)
increase throughput setting for increased throughput and decreased fairness
The original problem seems to have hundreds of such actors running, presumably on common hardware which does not have hundreds of CPUs, so the throughput setting should probably be set such that each batch takes no longer than ca. 10ms.
Performance Assessment
Let’s play a bit with Fibonacci:
Welcome to Scala version 2.10.0-RC1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_07).
Type in expressions to have them evaluated.
Type :help for more information.
scala> def fib(x1: BigInt, x2: BigInt, steps: Int): (BigInt, BigInt) = if(steps>0) fib(x2, x1+x2, steps-1) else (x1, x2)
fib: (x1: BigInt, x2: BigInt, steps: Int)(BigInt, BigInt)
scala> def time(code: =>Unit) { val start = System.currentTimeMillis; code; println("code took " + (System.currentTimeMillis - start) + "ms") }
time: (code: => Unit)Unit
scala> time(fib(1, 1, 1000))
code took 1ms
scala> time(fib(1, 1, 1000))
code took 1ms
scala> time(fib(1, 1, 10000))
code took 5ms
scala> time(fib(1, 1, 100000))
code took 455ms
scala> time(fib(1, 1, 1000000))
code took 17172ms
Which means that in a presumably quite optimized loop, fib_100000 takes half a second. Now let’s play a bit with actors:
scala> case class Cont(steps: Int, batch: Int)
defined class Cont
scala> val me = inbox()
me: akka.actor.ActorDSL.Inbox = akka.actor.dsl.Inbox$Inbox#32c0fe13
scala> val a = actor(new Act {
var s: (BigInt, BigInt) = _
become {
case Cont(x, y) if y < 0 => s = (1, 1); self ! Cont(x, -y)
case Cont(x, y) if x > 0 => s = fib(s._1, s._2, y); self ! Cont(x - 1, y)
case _: Cont => me.receiver ! s
}
})
a: akka.actor.ActorRef = Actor[akka://repl/user/$c]
scala> time{a ! Cont(1000, -1); me.receive(10 seconds)}
code took 4ms
scala> time{a ! Cont(10000, -1); me.receive(10 seconds)}
code took 27ms
scala> time{a ! Cont(100000, -1); me.receive(10 seconds)}
code took 632ms
scala> time{a ! Cont(1000000, -1); me.receive(30 seconds)}
code took 17936ms
This is already interesting: given long enough time per step (with the huge BigInts behind the scenes in the last line), actors don’t much extra. Now let’s see what happens when doing smaller calculations in a more batched way:
scala> time{a ! Cont(10000, -10); me.receive(30 seconds)}
code took 462ms
This is pretty close to the result for the direct variant above.
Conclusion
Sending messages to self is NOT expensive for almost all applications, just keep the actual processing step slightly larger than a few hundred nanoseconds.
I assume from your comment to Roland Kuhn answer that you have some work which can be considered as recursive, at least in blocks. If this is not the case, I don't think there could be any clean solution to handle your problem and you will have to deal with complicated pattern matching blocks.
If my assumptions are correct, I would schedule the computation asynchronously and let the actor be free to answer other messages. The key point is to use Future monadic capabilities and having a simple receive block. You would have to handle three messages (startComputation, changeState, getState)
You will end up having the following receive:
def receive {
case StartComputation(myData) =>expensiveStuff(myData)
case ChangeState(newstate) = this.state = newstate
case GetState => sender ! this.state
}
And then you can leverage the map method on Future, by defining your own recursive map:
def mapRecursive[A](f:Future[A], handler: A => A, exitConditions: A => Boolean):Future[A] = {
f.flatMap { a=>
if (exitConditions(a))
f
else {
val newFuture = f.flatMap{ a=> Future(handler(a))}
mapRecursive(newFuture,handler,exitConditions)
}
}
}
Once you have this tool, everything is easier. If you look to the following example :
def main(args:Array[String]){
val baseFuture:Future[Int] = Promise.successful(64)
val newFuture:Future[Int] = mapRecursive(baseFuture,
(a:Int) => {
val result = a/2
println("Additional step done: the current a is " + result)
result
}, (a:Int) => (a<=1))
val one = Await.result(newFuture,Duration.Inf)
println("Computation finished, result = " + one)
}
Its output is:
Additional step done: the current a is 32
Additional step done: the current a is 16
Additional step done: the current a is 8
Additional step done: the current a is 4
Additional step done: the current a is 2
Additional step done: the current a is 1
Computation finished, result = 1
You understand you can do the same, inside your expensiveStuffmethod
def expensiveStuff(myData:MyData):Future[MyData]= {
val firstResult = Promise.successful(myData)
val handler : MyData => MyData = (myData) => {
val result = myData.copy(myData.value/2)
self ! ChangeState(result)
result
}
val exitCondition : MyData => Boolean = (myData:MyData) => myData.value==1
mapRecursive(firstResult,handler,exitCondition)
}
EDIT - MORE DETAILED
If you don't want to block the Actor, which processes messages from its mailbox in a thread-safe and synchronous manner, the only thing you can do is to get the computation executed on a different thread. This is exactly an high performance non blocking receive.
However, you were right in saying that the approach I propose pays a high performance penalty. Every step is done on a different future, which might be not necessary at all. You can therefore recurse the handler to obtain a single-threaded or multiple-threaded execution. There is no magic formula after all:
If you want to schedule asynchronously and minimize the cost, all the work should be done by a single thread
This however could prevent other work to start, because if all the threads on a thread pool are taken, the futures will queue. You might therefore want to break the operation into multiple futures so that even at full usage some new work can be scheduled before old work has been completed.
def recurseFuture[A](entryFuture: Future[A], handler: A => A, exitCondition: A => Boolean, maxNestedRecursion: Long = Long.MaxValue): Future[A] = {
def recurse(a:A, handler: A => A, exitCondition: A => Boolean, maxNestedRecursion: Long, currentStep: Long): Future[A] = {
if (exitCondition(a))
Promise.successful(a)
else
if (currentStep==maxNestedRecursion)
Promise.successful(handler(a)).flatMap(a => recurse(a,handler,exitCondition,maxNestedRecursion,0))
else{
recurse(handler(a),handler,exitCondition,maxNestedRecursion,currentStep+1)
}
}
entryFuture.flatMap { a => recurse(a,handler,exitCondition,maxNestedRecursion,0) }
}
I have enhanced for testing purposes my handler method:
val handler: Int => Int = (a: Int) => {
val result = a / 2
println("Additional step done: the current a is " + result + " on thread " + Thread.currentThread().getName)
result
}
Approach 1: Recurse the handler on itself so to get all execute on a single thread.
println("Starting strategy with all the steps on the same thread")
val deepestRecursion: Future[Int] = recurseFuture(baseFuture,handler, exitCondition)
Await.result(deepestRecursion, Duration.Inf)
println("Completed strategy with all the steps on the same thread")
println("")
Approach 2: Recurse for a limited depth the handler on itself
println("Starting strategy with the steps grouped by three")
val threeStepsInSameFuture: Future[Int] = recurseFuture(baseFuture,handler, exitCondition,3)
val threeStepsInSameFuture2: Future[Int] = recurseFuture(baseFuture,handler, exitCondition,4)
Await.result(threeStepsInSameFuture, Duration.Inf)
Await.result(threeStepsInSameFuture2, Duration.Inf)
println("Completed strategy with all the steps grouped by three")
executorService.shutdown()
You should not use Actors to make long running computations as these will block the threads that are supposed to run the Actors code.
I would try to go with a design that uses a separate Thread/ThreadPool for the computations and use AtomicReferences to store/query the intermediate results in the lines of the following pseudo code:
val cancelled = new AtomicBoolean(false)
val intermediateResult = new AtomicReference[IntermediateResult]()
object WorkerThread extends Thread {
override def run {
while(!cancelled.get) {
intermediateResult.set(computationStep(intermediateResult.get))
}
}
}
loop {
react {
case StartComputation => WorkerThread.start()
case CancelComputation => cancelled.set(true)
case GetCurrentResult => sender ! intermediateResult.get
}
}
This is a classic concurrency problem. You want want several routines/actors (or whatever you want to call them). Code is mostly correct Go, with obscenely long variable names for context. The first routine handles queries and intermediate results:
func serveIntermediateResults(
computationChannel chan *IntermediateResult,
queryChannel chan chan<-*IntermediateResult) {
var latestIntermediateResult *IntermediateResult // initial result
for {
select {
// an update arrives
case latestIntermediateResult, notClosed := <-computationChannel:
if !notClosed {
// the computation has finished, stop checking
computationChannel = nil
}
// a query arrived
case queryResponseChannel, notClosed := <-queryChannel:
if !notClosed {
// no more queries, so we're done
return
}
// respond with the latest result
queryResponseChannel<-latestIntermediateResult
}
}
}
In your long computation, you update your intermediate result wherever appropriate:
func longComputation(intermediateResultChannel chan *IntermediateResult) {
for notFinished {
// lots of stuff
intermediateResultChannel<-currentResult
}
close(intermediateResultChannel)
}
Finally to ask for the current result, you have a wrapper to make this nice:
func getCurrentResult() *IntermediateResult {
responseChannel := make(chan *IntermediateResult)
// queryChannel was given to the intermediate result server routine earlier
queryChannel<-responseChannel
return <-responseChannel
}

React for futures

I am trying to use a divide-and-conquer (aka fork/join) approach for a number crunching problem. Here is the code:
import scala.actors.Futures.future
private def compute( input: Input ):Result = {
if( pairs.size < SIZE_LIMIT ) {
computeSequential()
} else {
val (input1,input2) = input.split
val f1 = future( compute(input1) )
val f2 = future( compute(input2) )
val result1 = f1()
val result2 = f2()
merge(result1,result2)
}
}
It runs (with a nice speed-up) but the the future apply method seems to block a thread and the thread pool increases tremendously. And when too many threads are created, the computations is stucked.
Is there a kind of react method for futures which releases the thread ? Or any other way to achieve that behavior ?
EDIT: I am using scala 2.8.0.final
Don't claim (apply) your Futures, since this forces them to block and wait for an answer; as you've seen this can lead to deadlocks. Instead, use them monadically to tell them what to do when they complete. Instead of:
val result1 = f1()
val result2 = f2()
merge(result1,result2)
Try this:
for {
result1 <- f1
result2 <- f2
} yield merge(result1, result2)
The result of this will be a Responder[Result] (essentially a Future[Result]) containing the merged results; you can do something effectful with this final value using respond() or foreach(), or you can map() or flatMap() it to another Responder[T]. No blocking necessary, just keep scheduling computations for the future!
Edit 1:
Ok, the signature of the compute function is going to have to change to Responder[Result] now, so how does that affect the recursive calls? Let's try this:
private def compute( input: Input ):Responder[Result] = {
if( pairs.size < SIZE_LIMIT ) {
future(computeSequential())
} else {
val (input1,input2) = input.split
for {
result1 <- compute(input1)
result2 <- compute(input2)
} yield merge(result1, result2)
}
}
Now you no longer need to wrap the calls to compute with future(...) because they're already returning Responder (a superclass of Future).
Edit 2:
One upshot of using this continuation-passing style is that your top-level code--whatever calls compute originally--doesn't block at all any more. If it's being called from main(), and that's all the program does, this will be a problem, because now it will just spawn a bunch of futures and then immediately shut down, having finished everything it was told to do. What you need to do is block on all these futures, but only once, at the top level, and only on the results of all the computations, not any intermediate ones.
Unfortunately, this Responder thing that's being returned by compute() no longer has a blocking apply() method like the Future did. I'm not sure why flatMapping Futures produces a generic Responder instead of a Future; this seems like an API mistake. But in any case, you should be able to make your own:
def claim[A](r:Responder[A]):A = {
import java.util.concurrent.ArrayBlockingQueue
import scala.actors.Actor.actor
val q = new ArrayBlockingQueue[A](1)
// uses of 'respond' need to be wrapped in an actor or future block
actor { r.respond(a => q.put(a)) }
return q.take
}
So now you can create a blocking call to compute in your main method like so:
val finalResult = claim(compute(input))

Simple Scala actor question

I'm sure this is a very simple question, but embarrassed to say I can't get my head around it:
I have a list of values in Scala.
I would like to use use actors to make some (external) calls with each value, in parallel.
I would like to wait until all values have been processed, and then proceed.
There's no shared values being modified.
Could anyone advise?
Thanks
There's an actor-using class in Scala that's made precisely for this kind of problem: Futures. This problem would be solved like this:
// This assigns futures that will execute in parallel
// In the example, the computation is performed by the "process" function
val tasks = list map (value => scala.actors.Futures.future { process(value) })
// The result of a future may be extracted with the apply() method, which
// will block if the result is not ready.
// Since we do want to block until all results are ready, we can call apply()
// directly instead of using a method such as Futures.awaitAll()
val results = tasks map (future => future.apply())
There you go. Just that.
Create workers and ask them for futures using !!; then read off the results (which will be calculated and come in in parallel as they're ready; you can then use them). For example:
object Example {
import scala.actors._
class Worker extends Actor {
def act() { Actor.loop { react {
case s: String => reply(s.length)
case _ => exit()
}}}
}
def main(args: Array[String]) {
val arguments = args.toList
val workers = arguments.map(_ => (new Worker).start)
val futures = for ((w,a) <- workers zip arguments) yield w !! a
val results = futures.map(f => f() match {
case i: Int => i
case _ => throw new Exception("Whoops--didn't expect to get that!")
})
println(results)
workers.foreach(_ ! None)
}
}
This does a very inexpensive computation--calculating the length of a string--but you can put something expensive there to make sure it really does happen in parallel (the last thing that case of the act block should be to reply with the answer). Note that we also include a case for the worker to shut itself down, and when we're all done, we tell the workers to shut down. (In this case, any non-string shuts down the worker.)
And we can try this out to make sure it works:
scala> Example.main(Array("This","is","a","test"))
List(4, 2, 1, 4)