Scala Futures with multiple dependencies - scala

I have to asynchronously compute a set of features that can depend on one another (the dependency graph has no cycles). For example:
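class FeatureEncoderMock(val n:String, val deps: List[String] = List.empty) {
  // Source of the random delay; `r` was not defined in the original snippet
  private val r = new scala.util.Random
  def compute = {
    println(s"starting computation feature $n")
    Thread.sleep(r.nextInt(2500))
    println(s"end computation feature $n")
  }
}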
val registry = Map(
"feat1" -> new FeatureEncoderMock("feat1", List("factLogA", "factLogB")),
"factLogA" -> new FeatureEncoderMock("factLogA"),
"factLogB" -> new FeatureEncoderMock("factLogB"),
"feat2" -> new FeatureEncoderMock("feat2", List("factLogA")),
"feat3" -> new FeatureEncoderMock("feat3", List("feat1")),
"feat4" -> new FeatureEncoderMock("feat4", List("feat3", "factLogB"))
)
What I want is to call a single function on feat4 that triggers the computation of every feature it depends on and respects the dependencies among them. I tried this:
def run(): Unit = {
val requested = "feat4"
val allFeatures = getChainOfDependencies(requested)
val promises = allFeatures.zip(Seq.fill(allFeatures.size)(Promise[Unit])).toMap
def computeWithDependencies(f: String) = Future {
println(s"computing $f")
val encoder = registry(f)
if(encoder.deps.isEmpty) {
promises(f).success(registry(f).compute)
}
else {
val depTasks = promises.filterKeys(encoder.deps.contains)
val depTasksFuture = Future.sequence(depTasks.map(_._2.future))
depTasksFuture.onSuccess({
case _ =>
println(s"all deps for $f has been computed")
promises(f).success(registry(f).compute)
println(s"done for $f")
})
}
}
computeWithDependencies(requested)
}
But I cannot understand why the order of execution is not as expected. I am not sure what is the proper way to feed the future inside a promise. I am quite sure that this piece of code is wrong on that part.

I think you're overthinking it with the promises; Future composition is probably all that you need. Something like this:
import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}
def computeWithDependencies(s: String, cache: mutable.Map[String, Future[Unit]] = mutable.Map.empty)
(implicit ec: ExecutionContext): Future[Unit] = {
cache.get(s) match {
case Some(f) => f
case None => {
val encoder = registry(s)
val depsFutures = encoder.deps.map(d => computeWithDependencies(d, cache))
val result = Future.sequence(depsFutures).flatMap(_ => Future { encoder.compute })
cache += s -> result
result
}
}
}
The call to flatMap ensures that all of the dependency futures complete before the "current" future executes, even if the result (a List[Unit]) is ignored. The business with the cache is just to prevent recomputation if the dependency graph has a "diamond" in it, but could be left out if it won't or if you're ok with recomputing. Anyway, when running this:
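import scala.concurrent.Await
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val futureResult = computeWithDependencies("feat4")
Await.result(futureResult, 30.seconds)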
I see this output:
starting computation feature factLogB
starting computation feature factLogA
end computation feature factLogB
end computation feature factLogA
starting computation feature feat1
end computation feature feat1
starting computation feature feat3
end computation feature feat3
starting computation feature feat4
end computation feature feat4
Which seems correct to me.
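To make the role of the cache concrete, here is a minimal sketch on a diamond-shaped graph. The node names, the `DiamondDemo` object and the `runs` counters are hypothetical and only stand in for the registry and `encoder.compute` above. It counts how often each node is computed when the top node is requested.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object DiamondDemo {
  // Hypothetical diamond: "top" needs "left" and "right", and both need "base".
  val deps: Map[String, List[String]] = Map(
    "base"  -> Nil,
    "left"  -> List("base"),
    "right" -> List("base"),
    "top"   -> List("left", "right")
  )

  // Resolves `node` after its dependencies, reusing futures already in `cache`.
  def compute(node: String,
              cache: mutable.Map[String, Future[Unit]],
              runs: Map[String, AtomicInteger])
             (implicit ec: ExecutionContext): Future[Unit] =
    cache.get(node) match {
      case Some(existing) => existing
      case None =>
        val depFutures = deps(node).map(d => compute(d, cache, runs))
        val result = Future.sequence(depFutures).map { _ =>
          runs(node).incrementAndGet() // stands in for encoder.compute
          ()
        }
        cache += node -> result
        result
    }

  // Returns how many times each node was computed when "top" is requested.
  def runCounts(): Map[String, Int] = {
    import ExecutionContext.Implicits.global
    val runs = deps.keys.map(k => k -> new AtomicInteger(0)).toMap
    Await.result(compute("top", mutable.Map.empty, runs), 10.seconds)
    runs.map { case (k, v) => k -> v.get }
  }
}
```

The recursion that fills the cache runs on the calling thread (only the bodies passed to `map` run on the execution context), which is why a plain mutable map is enough in this sketch.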

Related

How to Promise.allSettled with Scala futures?

I have two scala futures. I want to perform an action once both are completed, regardless of whether they were completed successfully. (Additionally, I want the ability to inspect those results at that time.)
In Javascript, this is Promise.allSettled.
Does Scala offer a simple way to do this?
One last wrinkle, if it matters: I want to do this in a JRuby application.
You can use the transform method to create a Future that will always succeed and return the result or the error as a Try object.
def toTry[A](future: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
future.transform(x => Success(x))
To combine two Futures into one, you can use zip:
def settle2[A, B](fa: Future[A], fb: Future[B])(implicit ec: ExecutionContext)
: Future[(Try[A], Try[B])] =
toTry(fa).zip(toTry(fb))
If you want to combine an arbitrary number of Futures this way, you can use Future.traverse:
def allSettled[A](futures: List[Future[A]])(implicit ec: ExecutionContext)
: Future[List[Try[A]]] =
Future.traverse(futures)(toTry(_))
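As a usage sketch of the helpers above (the `AllSettledDemo` wrapper, the sample futures and the string formatting are assumptions made for illustration), every result can be inspected once all futures have settled, including the failed one:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object AllSettledDemo {
  // Same helpers as in the answer, repeated so the example is self-contained.
  def toTry[A](future: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
    future.transform(t => Success(t))

  def allSettled[A](futures: List[Future[A]])(implicit ec: ExecutionContext): Future[List[Try[A]]] =
    Future.traverse(futures)(toTry(_))

  // Runs three futures, one of which fails, and describes each outcome.
  def demo(): List[String] = {
    import ExecutionContext.Implicits.global
    val futures = List(
      Future(1),
      Future[Int](throw new IllegalStateException("boom")),
      Future(3)
    )
    val settled = Await.result(allSettled(futures), 10.seconds)
    settled.map {
      case Success(v) => s"ok:$v"
      case Failure(e) => s"failed:${e.getMessage}"
    }
  }
}
```

The outcomes come back in the same order as the input list, regardless of which future finished first.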
Normally we would use Future.sequence to turn a collection of Futures into a single Future that you can map over, but Future.sequence short-circuits on the first failed Future and doesn't wait for anything after that (one failure counts as a failure for all), which doesn't fit your case.
In this case you need to map the failed ones to successful ones and then call sequence, e.g.
val settledFuture = Future.sequence(List(future1, future2, ...).map(_.recoverWith { case _ => Future.unit }))
settledFuture.map(//Here it is all settled)
EDIT
Since the results need to be kept, instead of mapping to Future.unit, we map the actual result into another layer of Try:
val settledFuture = Future.sequence(
List(Future(1), Future(throw new Exception))
.map(_.map(Success(_)).recover { case e => Failure(e) })
)
settledFuture.map(println(_))
//Output: List(Success(1), Failure(java.lang.Exception))
EDIT2
It can be further simplified with transform:
Future.sequence(listOfFutures.map(_.transform(Success(_))))
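Perhaps you could use a concurrent counter to keep track of the number of completed Futures, and complete the Promise once all of them have completed:
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future, Promise}
def allSettled[T](futures: List[Future[T]])(implicit ec: ExecutionContext): Future[List[Future[T]]] = {
  val p = Promise[List[Future[T]]]()
  val length = futures.length
  val completedCount = new AtomicInteger(0)
  // An empty input has nothing to wait for; without this the promise would never complete
  if (length == 0) p.trySuccess(futures)
  futures foreach {
    _.onComplete { _ =>
      if (completedCount.incrementAndGet == length) p.trySuccess(futures)
    }
  }
  p.future
}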
val futures = List(
Future(-11),
Future(throw new Exception("boom")),
Future(42)
)
allSettled(futures).andThen(println(_))
// Success(List(Future(Success(-11)), Future(Failure(java.lang.Exception: boom)), Future(Success(42))))

How to create an akka-stream Source from a Flow that generate values recursively?

I need to traverse an API that is shaped like a tree. For example, a directory structure or threads of discussion. It can be modeled via the following flow:
type ItemId = Int
type Data = String
case class Item(data: Data, kids: List[ItemId])
def randomData(): Data = scala.util.Random.alphanumeric.take(2).mkString
// 0 => [1, 9]
// 1 => [10, 19]
// 2 => [20, 29]
// ...
// 9 => [90, 99]
// _ => []
// NB. I don't have access to this function, only the itemFlow.
def nested(id: ItemId): List[ItemId] =
if (id == 0) (1 to 9).toList
else if (1 <= id && id <= 9) ((id * 10) to ((id + 1) * 10 - 1)).toList
else Nil
val itemFlow: Flow[ItemId, Item, NotUsed] =
Flow.fromFunction(id => Item(randomData, nested(id)))
How can I traverse this data? I got the following working:
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
import scala.concurrent.Await
import scala.concurrent.duration.Duration
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val loop =
GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
val source = b.add(Flow[Int])
val merge = b.add(Merge[Int](2))
val fetch = b.add(itemFlow)
val bcast = b.add(Broadcast[Item](2))
val kids = b.add(Flow[Item].mapConcat(_.kids))
val data = b.add(Flow[Item].map(_.data))
val buffer = Flow[Int].buffer(100, OverflowStrategy.dropHead)
source ~> merge ~> fetch ~> bcast ~> data
merge <~ buffer <~ kids <~ bcast
FlowShape(source.in, data.out)
}
val flow = Flow.fromGraph(loop)
Await.result(
Source.single(0).via(flow).runWith(Sink.foreach(println)),
Duration.Inf
)
system.terminate()
However, since I'm using a flow with a buffer, the stream will never complete. From the Flow.buffer documentation:
"Completes when upstream completes and buffered elements have been drained"
I read the Graph cycles, liveness, and deadlocks section multiple times and I'm still struggling to find an answer.
This would create a live lock:
import java.util.concurrent.atomic.AtomicInteger
def unfold[S, E](seed: S, flow: Flow[S, E, NotUsed])(loop: E => List[S]): Source[E, NotUsed] = {
// keep track of how many element flows,
val remaining = new AtomicInteger(1) // 1 = seed
// should be > max loop(x)
val bufferSize = 10000
val (ref, publisher) =
Source.actorRef[S](bufferSize, OverflowStrategy.fail)
.toMat(Sink.asPublisher(true))(Keep.both)
.run()
ref ! seed
Source.fromPublisher(publisher)
.via(flow)
.map{x =>
loop(x).foreach{ c =>
remaining.incrementAndGet()
ref ! c
}
x
}
.takeWhile(_ => remaining.decrementAndGet > 0)
}
EDIT: I added a git repo to test your solution https://github.com/MasseGuillaume/source-unfold
Cause of Non-Completion
I don't think the stream fails to complete because of "using a flow with a buffer". The actual cause, similar to this question, is that merge, with the default eagerClose = false (the parameter is named eagerComplete in current Akka releases), waits for both the source and the buffer to complete before it completes. But buffer is waiting on merge to complete. So merge is waiting on buffer and buffer is waiting on merge.
eagerClose merge
You could set eagerClose = true when creating your merge. Unfortunately, eager close may result in some child ItemId values never being queried.
Indirect Solution
If you materialize a new stream for each level of the tree then the recursion can be extracted to outside of the stream.
You can construct a query function utilizing the itemFlow:
val itemQuery: Iterable[ItemId] => Future[Seq[Item]] =
  itemIds => Source(itemIds.toList)
    .via(itemFlow)
    .runWith(Sink.seq[Item]) // itemFlow emits Item, so collect Items rather than Data
This query function can now be wrapped inside a recursive helper function:
def recQuery(itemIds: Iterable[ItemId], currentData: Seq[Data]): Future[Seq[Data]] =
  itemQuery(itemIds) flatMap { newItems =>
    val allData = currentData ++ newItems.map(_.data)
    val allNewKids = newItems.flatMap(_.kids).toSet
    if (allNewKids.isEmpty)
      Future.successful(allData)
    else
      recQuery(allNewKids, allData)
  }
The number of streams created will be equal to the maximum depth of the tree.
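Because Futures are involved, this recursive function cannot be tail-recursive in the @tailrec sense. With an ordinary thread-pool ExecutionContext each recursive call runs inside a flatMap callback rather than on the caller's stack, so the stack should not grow with the depth of the tree, but it is worth testing against a deep tree before relying on it.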
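The level-by-level idea can be sketched without Akka at all, using plain Futures. In the sketch below, `fetchLevel` is a hypothetical stand-in for running one batch of ids through itemFlow, and `Item` carries its own id instead of random data so the result is easy to check; `nested` reproduces the tree shape from the question.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LevelByLevel {
  type ItemId = Int
  final case class Item(id: ItemId, kids: List[ItemId])

  // Tree shape from the question: 0 -> 1..9, n in 1..9 -> n*10..n*10+9, the rest are leaves.
  def nested(id: ItemId): List[ItemId] =
    if (id == 0) (1 to 9).toList
    else if (1 <= id && id <= 9) ((id * 10) to (id * 10 + 9)).toList
    else Nil

  // Hypothetical stand-in for sending one batch of ids through itemFlow.
  def fetchLevel(ids: List[ItemId])(implicit ec: ExecutionContext): Future[List[Item]] =
    Future.traverse(ids)(id => Future(Item(id, nested(id))))

  // Fetches one level, then recurses on the children found in that level.
  def traverse(ids: List[ItemId], acc: Vector[Item] = Vector.empty)
              (implicit ec: ExecutionContext): Future[Vector[Item]] =
    fetchLevel(ids).flatMap { items =>
      val all  = acc ++ items
      val kids = items.flatMap(_.kids)
      if (kids.isEmpty) Future.successful(all) else traverse(kids, all)
    }

  // Ids of every visited item, in breadth-first order.
  def allIds(): Vector[ItemId] = {
    import ExecutionContext.Implicits.global
    Await.result(traverse(List(0)), 10.seconds).map(_.id)
  }
}
```

Completion is unambiguous here: the recursion stops at the first level that yields no children, so no cycle in a stream graph has to detect it.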
I solved this problem by writing my own GraphStage.
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl._
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import scala.concurrent.ExecutionContext
import scala.collection.mutable
import scala.util.{Success, Failure, Try}
def unfoldTree[S, E](seeds: List[S],
flow: Flow[S, E, NotUsed],
loop: E => List[S],
bufferSize: Int)(implicit ec: ExecutionContext): Source[E, NotUsed] = {
Source.fromGraph(new UnfoldSource(seeds, flow, loop, bufferSize))
}
object UnfoldSource {
implicit class MutableQueueExtensions[A](private val self: mutable.Queue[A]) extends AnyVal {
def dequeueN(n: Int): List[A] = {
val b = List.newBuilder[A]
var i = 0
while (i < n) {
val e = self.dequeue
b += e
i += 1
}
b.result()
}
}
}
class UnfoldSource[S, E](seeds: List[S],
flow: Flow[S, E, NotUsed],
loop: E => List[S],
bufferSize: Int)(implicit ec: ExecutionContext) extends GraphStage[SourceShape[E]] {
import UnfoldSource._ // brings the dequeueN extension method into scope
val out: Outlet[E] = Outlet("UnfoldSource.out")
override val shape: SourceShape[E] = SourceShape(out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) with OutHandler {
// Nodes to expand
val frontier = mutable.Queue[S]()
frontier ++= seeds
// Nodes expanded
val buffer = mutable.Queue[E]()
// Using the flow to fetch more data
var inFlight = false
// Sink pulled but the buffer was empty
var downstreamWaiting = false
def isBufferFull() = buffer.size >= bufferSize
def fillBuffer(): Unit = {
val batchSize = Math.min(bufferSize - buffer.size, frontier.size)
val batch = frontier.dequeueN(batchSize)
inFlight = true
val toProcess =
Source(batch)
.via(flow)
.runWith(Sink.seq)(materializer)
val callback = getAsyncCallback[Try[Seq[E]]]{
case Failure(ex) => {
fail(out, ex)
}
case Success(es) => {
val got = es.size
inFlight = false
es.foreach{ e =>
buffer += e
frontier ++= loop(e)
}
if (downstreamWaiting && buffer.nonEmpty) {
val e = buffer.dequeue
downstreamWaiting = false
sendOne(e)
} else {
checkCompletion()
}
()
}
}
toProcess.onComplete(callback.invoke)
}
override def preStart(): Unit = {
checkCompletion()
}
def checkCompletion(): Unit = {
if (!inFlight && buffer.isEmpty && frontier.isEmpty) {
completeStage()
}
}
def sendOne(e: E): Unit = {
push(out, e)
checkCompletion()
}
def onPull(): Unit = {
if (buffer.nonEmpty) {
sendOne(buffer.dequeue)
} else {
downstreamWaiting = true
}
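// Start a new batch only when none is in flight; the single inFlight flag cannot track overlapping batches
if (!inFlight && !isBufferFull && frontier.nonEmpty) {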
fillBuffer()
}
}
setHandler(out, this)
}
}
Ah, the joys of cycles in Akka streams. I had a very similar problem which I solved in a deeply hacky way. Possibly it'll be helpful for you.
Hacky Solution:
// add a graph stage that will complete successfully if it sees no element within 5 seconds
val timedStopper = b.add(
Flow[Item]
.idleTimeout(5.seconds)
.recoverWithRetries(1, {
case _: TimeoutException => Source.empty[Item]
}))
source ~> merge ~> fetch ~> timedStopper ~> bcast ~> data
merge <~ buffer <~ kids <~ bcast
What this does is that 5 seconds after the last element passes through the timedStopper stage, that stage completes the stream successfully. This is achieved via the use of idleTimeout, which fails the stream with a TimeoutException, and then using recoverWithRetries to turn that failure into a successful completion. (I did mention it was hacky).
This is obviously not suitable if you might have more than 5 seconds between elements, or if you can't afford a long wait between the stream "actually" completing and Akka picking up on it. Thankfully, neither was a concern for us, and in that case it actually works pretty well!
Non-hacky solution
Unfortunately, the only ways I can think of to do this without cheating via timeouts are very, very complicated.
Basically, you need to be able to track two things:
are there any elements still in the buffer, or in process of being sent to the buffer
is the incoming source open
and complete the stream if and only if the answer to both questions is no. Native Akka building blocks are probably not going to be able to handle this. A custom graph stage might, however. An option might be to write one that takes the place of Merge and give it some way of knowing about the buffer contents, or possibly have it track both the IDs it receives and the IDs the broadcast is sending to the buffer. The problem being that custom graph stages are not particularly pleasant to write at the best of times, let alone when you're mixing logic across stages like this.
Warnings
Akka streams just don't work well with cycles, especially how they calculate completion. As a result, this may not be the only problem you encounter.
For instance, an issue we had with a very similar structure was that a failure in the source was treated as the stream completing successfully, with a succeeded Future being materialised. The problem is that by default, a stage that fails will fail its downstreams but cancel its upstreams (which counts as a successful completion for those stages). With a cycle like the one you have, the result is a race as cancellation propagates down one branch but failure down the other. You also need to check what happens if the sink errors; depending on the cancellation settings for broadcast, it's possible the cancellation will not propagate upwards and the source will happily continue pulling in elements.
One final option: avoid handling the recursive logic with streams at all. On one extreme, if there's any way for you to write a single tail-recursive method that pulls out all the nested items at once and put that into a Flow stage, that will solve your problems. On the other, we're seriously considering going to Kafka queueing for our own system.

Submitting operations in created future

I have a Future lazy val that obtains some object and a function which submits operations in the Future.
class C {
def printLn(s: String) = println(s)
}
lazy val futureC: Future[C] = Future{Thread.sleep(3000); new C()}
def func(s: String): Unit = {
futureC.foreach{c => c.printLn(s)}
}
The problem is that when the Future completes, it executes the operations in the reverse of the order in which they were submitted. For example, if I execute sequentially
func("A")
func("B")
func("C")
I get after Future completion
scala> C
B
A
This order is important for me. Is there a way to preserve this order?
Of course I could use an actor that asks for the future and stashes the strings while the future is not ready, but that seems redundant to me.
lazy val futureC: Future[C]
lazy vals in Scala are compiled into code that uses a synchronized block for thread safety.
Here, when func("A") is called, it obtains the lock for the lazy val and that thread goes to sleep.
Therefore func("B") and func("C") will be blocked by the lock.
When those blocked threads run, the order cannot be guaranteed.
If you do it as below, you'll get the order you expect. This is because the for comprehension desugars into a chain of flatMap and map calls that executes sequentially.
lazy val futureC: Future[C] = Future {
Thread.sleep(1000)
new C()
}
def func(s: String) : Future[Unit] = {
futureC.map { c => c.printLn(s) }
}
val x = for {
_ <- func("A")
_ <- func("B")
_ <- func("C")
} yield ()
The order is preserved even without the lazy keyword, so you can remove it unless it is really necessary.
Hope this helps.
You can use Future.traverse to ensure the order of execution.
Something like this. I'm not sure how your func gets a reference to the correct futureC, so I moved it inside.
def func(s: String): Future[Unit] = {
lazy val futureC = Future{Thread.sleep(3000); new C()}
futureC.map{c => c.printLn(s)}
}
def traverse[A,B](xs: Seq[A])(fn: A => Future[B]): Future[Seq[B]] =
xs.foldLeft(Future(Seq[B]())) { (acc, item) =>
acc.flatMap { accValue =>
fn(item).map { itemValue =>
accValue :+ itemValue
}
}
}
traverse(Seq("A","B","C"))(func)
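If the operations are submitted one at a time rather than as a list, the same chaining idea can be kept behind a small helper. This is only a sketch: `OrderedRunner`, `submit` and the StringBuffer used to record the order are hypothetical names introduced for illustration, not part of the question's code.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper: runs submitted operations one after another, in submission order.
final class OrderedRunner[C](resource: Future[C])(implicit ec: ExecutionContext) {
  // The last operation submitted so far; each new operation is chained after it.
  private var tail: Future[Unit] = resource.map(_ => ())

  def submit(op: C => Unit): Future[Unit] = synchronized {
    // Note: if an operation throws, `tail` fails and later operations will not run.
    tail = tail.flatMap(_ => resource.map(op))
    tail
  }
}

object OrderedDemo {
  // Submits three appends before the resource is ready and returns what was written.
  def run(): String = {
    import ExecutionContext.Implicits.global
    val out = new StringBuffer()
    val runner = new OrderedRunner(Future { Thread.sleep(200); out })
    runner.submit(b => { b.append("A"); () })
    runner.submit(b => { b.append("B"); () })
    val last = runner.submit(b => { b.append("C"); () })
    Await.result(last, 10.seconds)
    out.toString
  }
}
```

Because each operation is attached to the previous one rather than directly to the shared Future, the order no longer depends on how the Future dispatches its callbacks.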

cache using functional callbacks/ proxy pattern implementation scala

How to implement cache using functional programming
A few days ago I came across callbacks and a proxy pattern implementation in Scala.
This code should apply the inner function only if the value is not already in the map.
But the map is reinitialized every time and the values are gone (which seems obvious).
How can I reuse the same cache across different function calls?
class Aggregator{
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
val cache = HashMap[Int, Int]()
(t:Int) => {
if (!cache.contains(t)) {
println("Evaluating..."+t)
val r = function.apply(t);
cache.put(t,r)
r
}
else
{
cache.get(t).get;
}
}
}
def memoizedDoubler = memoize( (key:Int) => {
println("Evaluating...")
key*2
})
}
object Aggregator {
def main( args: Array[String] ) {
val agg = new Aggregator()
agg.memoizedDoubler(2)
agg.memoizedDoubler(2)// It should not evaluate again but does
agg.memoizedDoubler(3)
agg.memoizedDoubler(3)// It should not evaluate again but does
}
}
I see what you're trying to do here. The reason it's not working is that every time you call memoizedDoubler, it first calls memoize, which builds a fresh cache. You need to declare memoizedDoubler as a val instead of a def if you want memoize to be called only once.
val memoizedDoubler = memoize( (key:Int) => {
println("Evaluating...")
key*2
})
This answer has a good explanation on the difference between def and val. https://stackoverflow.com/a/12856386/37309
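The difference can be checked with a small sketch that counts evaluations. The names `MemoizeDemo`, `viaDef` and `viaVal` are made up for this example, and `memoize` is a condensed version of the one in the question.

```scala
import scala.collection.mutable

object MemoizeDemo {
  // Condensed version of the question's memoize: one cache per call to memoize.
  def memoize(f: Int => Int): Int => Int = {
    val cache = mutable.HashMap.empty[Int, Int]
    key => cache.getOrElseUpdate(key, f(key))
  }

  // Returns (evaluations through the def, evaluations through the val).
  def evaluations(): (Int, Int) = {
    var defCalls = 0
    var valCalls = 0
    // A def re-runs memoize, and so creates a new empty cache, on every access.
    def viaDef: Int => Int = memoize { k => defCalls += 1; k * 2 }
    // A val runs memoize once and keeps the resulting function and its cache.
    val viaVal: Int => Int = memoize { k => valCalls += 1; k * 2 }
    viaDef(2); viaDef(2)
    viaVal(2); viaVal(2)
    (defCalls, valCalls)
  }
}
```

Calling `MemoizeDemo.evaluations()` yields `(2, 1)`: the def version evaluates the doubler on both calls, while the val version evaluates it only once.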
Aren't you declaring a new Map per invocation ?
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
val cache = HashMap[Int, Int]()
rather than specifying one per instance of Aggregator ?
e.g.
class Aggregator{
private val cache = HashMap[Int, Int]()
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
To answer your question:
How to implement cache using functional programming
In functional programming there is no concept of mutable state. If you want to change something (like a cache), you need to return the updated cache instance along with the result and use it for the next call.
Here is a modification of your code that follows that approach. The function that calculates values and the cache are both incorporated into Aggregator. When memoize is called, it returns a tuple that contains the calculation result (possibly taken from the cache) and the new Aggregator that should be used for the next call.
class Aggregator(function: Function[Int, Int], cache:Map[Int, Int] = Map.empty) {
def memoize:Int => (Int, Aggregator) = {
t:Int =>
cache.get(t).map {
res =>
(res, Aggregator.this)
}.getOrElse {
val res = function(t)
(res, new Aggregator(function, cache + (t -> res)))
}
}
}
object Aggregator {
def memoizedDoubler = new Aggregator((key:Int) => {
println("Evaluating..." + key)
key*2
})
def main(args: Array[String]) {
val (res, doubler1) = memoizedDoubler.memoize(2)
val (res1, doubler2) = doubler1.memoize(2)
val (res2, doubler3) = doubler2.memoize(3)
val (res3, doubler4) = doubler3.memoize(3)
}
}
This prints:
Evaluating...2
Evaluating...3
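Threading the updated instance through many calls by hand gets tedious, so one option is to let a fold carry it. The sketch below assumes a simplified immutable `Memo` case class (a hypothetical stand-in for the Aggregator above) and counts how often the wrapped function actually runs.

```scala
object ImmutableMemoDemo {
  // Simplified immutable memoizer: a lookup returns the result and the next Memo to use.
  final case class Memo(f: Int => Int, cache: Map[Int, Int] = Map.empty) {
    def apply(key: Int): (Int, Memo) = cache.get(key) match {
      case Some(hit) => (hit, this)
      case None =>
        val r = f(key)
        (r, copy(cache = cache + (key -> r)))
    }
  }

  // Looks up every key in order, passing the updated Memo from one call to the next.
  // Returns the results together with the number of real evaluations.
  def run(keys: List[Int]): (List[Int], Int) = {
    var evaluations = 0 // local counter, used only to observe cache hits
    val start = Memo { k => evaluations += 1; k * 2 }
    val (results, _) = keys.foldLeft((List.empty[Int], start)) {
      case ((acc, memo), key) =>
        val (r, next) = memo(key)
        (acc :+ r, next)
    }
    (results, evaluations)
  }
}
```

For the keys 2, 2, 3, 3 this returns the results 4, 4, 6, 6 with only two evaluations, matching the two "Evaluating..." lines printed above.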

Using scalaz-stream as a real time Writer for asynchronous computations

I have a web-app that does a bunch of slow concurrent work to calculate its result. Instead of leaving the end user hanging I'd like to stream back progress updates via a websocket.
My codebase is built up from compositions of Scalaz eithers (\/) like:
type ProcessResult = Error \/ Int
def downloadFile(url: String): Future[Error \/ String] = ???
def doSlowProcessing(data1: String, data2: String): Future[ProcessResult] = ???
/* Very simple however doesn't give any progress update */
def execute(): Future[ProcessResult] = {
val download1 = downloadFile(...)
val download2 = downloadFile(...)
val et = for {
d1 <- download1
d2 <- download2
processed <- doSlowProcessing(d1, d2)
} yield processed
et.run
}
This works very well but of course the entire computation needs to be finished before I get anything out of the Future. Even if I stacked on a Writer monad to do logging I would only get the log once finished, not making my end users any happier.
I toyed around with using a scalaz-stream Queue to send the logs as a side effect while the code is running, however the end result is pretty ugly:
def execute(): Process[Task, String \/ ProcessResult] = {
val (q, src) = async.queue[String \/ ProcessResult]
val download1 = downloadFile(...)
val download2 = downloadFile(...)
val et = for {
d1 <- q.enqueue("Downloading 1".left); download1
d2 <- q.enqueue("Downloading 2".left); download2
processed <- q.enqueue("Doing processing".left); doSlowProcessing(d1, d2)
} yield processed
et.run.onSuccess {
x =>
q.enqueue(x.right)
q.close
}
src
}
It feels like there should be an idiomatic way to achieve this? Turning my SIP-14 Scala futures into Tasks is possible if necessary.
I don't think you need to use a queue. One approach is non-deterministic merging using wye, i.e.
type Result = ???
val download1: Process[Task,File] = ???
val download2: Process[Task,File] = ???
val result: Process[Task,(File,File)] = (download1 yip download2).once
val processed: Process[Task, Result] = result.flatMap(doSlowProcessing)
// Run asynchronously,
processed.runLast.runAsync {
case Some(r) => .... // result computed
case None => .... //no result, hence download1,2 were empty.
}
//or run synchronously awaiting the result
processed.runLast.run match {
case Some(r) => .... // result computed
case None => .... //no result
}
//to capture the error information while download use
val withError: Process[Task,Throwable\/File] = download1.attempt
//or to log and recover to other file download
val withRecovery: Process[Task,File] = download1 onFailure { err => Log(err); download3 }
Does that make sense?
Also, please note that async.queue has been deprecated since 0.5.0 in favor of async.unboundedQueue.