How do you deal with futures and mapAsync in Akka Flow? - scala

I built an Akka graph with the GraphDSL that defines a simple flow. The flow f4 takes 3 seconds to emit an element, while f2 takes 10 seconds.
As a result, I get: 3, 2, 3, 2. This is not what I want: since f2 takes so long, I would like to get: 3, 3, 2, 2. Here's the code:
implicit val actorSystem = ActorSystem("NumberSystem")
implicit val materializer = ActorMaterializer()
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source(List(1, 1))
val out = Sink.foreach(println)
val bcast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)
val f1, f3 = Flow[Int]
val f2= Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper)
val f4= Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper2)
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
bcast ~> f4 ~> merge
ClosedShape
})
g.run()
So where am I going wrong? With the Future, with mapAsync, or somewhere else?
Thanks

Sorry, I'm new to Akka, so I'm still learning. One way to get the expected results is to use async:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source(List(1, 1))
val out = Sink.foreach(println)
val bcast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)
val f1, f3 = Flow[Int]
val f2= Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).map(_+1)
//.mapAsyncUnordered[Int](2)(yourMapper)
val f4= Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).map(_+2)
//.mapAsync[Int](2)(yourMapper2)
in ~> f1 ~> bcast ~> f2.async ~> merge ~> f3 ~> out
bcast ~> f4.async ~> merge
ClosedShape
})
g.run()

As you've already figured out, replacing:
mapAsync(i => Future{i + delta})
with:
map(_ + delta).async
in the two flows would achieve what you want.
The different result boils down to the key difference between mapAsync and map + async. While mapAsync lets the Futures run in parallel on other threads, the mapAsync stages themselves are still managed by the same underlying actor, which carries out operator fusion before execution (generally for cost efficiency).
On the other hand, async actually introduces an asynchronous boundary into the stream, with the individual flow stages handled by separate actors. In your case, each of the two flow stages independently emits elements downstream, and whichever element is emitted first gets consumed first. Inevitably there is a cost for managing the stream across the asynchronous boundary, and Akka Streams uses a windowed buffering strategy to amortize that cost (see this Akka Stream doc).
For more details on the difference between mapAsync and async, this blog post might be of interest.

So you are trying to join together the results coming out of f2 and f4, which is sometimes called the "scatter-gather pattern".
I don't think there are off-the-shelf ways to implement it without adding a custom stateful stage that keeps track of the outputs from f2 and f4 and emits a record when both are available. But there are some complications to bear in mind:
What happens if f2 or f4 fails
What happens if they take too long
You need a unique key for each input record, so you know which output from f2 corresponds to which output from f4 (or vice versa)
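The bookkeeping such a stateful stage would need can be sketched in plain Scala, independent of Akka. The PairingBuffer class and its method names are hypothetical, and the sketch assumes each record carries a unique key. It ignores failures and timeouts, which a real stage would have to handle.

```scala
import scala.collection.mutable

// Hypothetical helper: remembers a result from one branch until the
// result with the same key arrives from the other branch.
final class PairingBuffer[K, A, B] {
  private val pendingLeft  = mutable.Map.empty[K, A]
  private val pendingRight = mutable.Map.empty[K, B]

  // Called with a result from the first branch (e.g. f2).
  def offerLeft(key: K, value: A): Option[(A, B)] =
    pendingRight.remove(key) match {
      case Some(right) => Some((value, right))
      case None =>
        pendingLeft.update(key, value)
        None
    }

  // Called with a result from the second branch (e.g. f4).
  def offerRight(key: K, value: B): Option[(A, B)] =
    pendingLeft.remove(key) match {
      case Some(left) => Some((left, value))
      case None =>
        pendingRight.update(key, value)
        None
    }

  // Number of results still waiting for their counterpart.
  def pending: Int = pendingLeft.size + pendingRight.size
}
```

Inside a custom stage, offerLeft and offerRight would presumably be called from the handlers of the two inputs, and a returned pair would be pushed downstream. Entries that never find a partner stay in the buffer, which is where the timeout question above comes in.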

Related

Akka-streams backpressure on broadcast with async processing

I am struggling with understanding if akka-stream enforces backpressure on Source when having a broadcast with one branch taking a lot of time (asynchronous) in the graph.
I tried buffer and batch to see if there was any backpressure applied on the source but it does not look like it. I also tried flushing System.out but it does not change anything.
object Test extends App {
/* Necessary for akka stream */
implicit val system = ActorSystem("test")
implicit val materializer: ActorMaterializer = ActorMaterializer()
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source.tick(0 seconds, 1 seconds, 1)
in.runForeach(i => println("Produced " + i))
val out = Sink.foreach(println)
val out2 = Sink.foreach[Int]{ o => println(s"2 $o") }
val bcast = builder.add(Broadcast[Int](2))
val batchedIn: Source[Int, Cancellable] = in.batch(4, identity) {
case (s, v) => println(s"Batched ${s+v}"); s + v
}
val f2 = Flow[Int].map(_ + 10)
val f4 = Flow[Int].map { i => Thread.sleep(2000); i}
batchedIn ~> bcast ~> f2 ~> out
bcast ~> f4.async ~> out2
ClosedShape
})
g.run()
}
I would expect to see "Batched ..." in the console when I am running the program and at some point to have it momentarily stuck because f4 is not fast enough to process the values. At the moment, none of those behave as expected as the numbers are generated continuously and no batch is done.
EDIT: I noticed that after some time, the batch messages start to print out in the console. I still don't know why it does not happen sooner as the backpressure should happen for the first elements
This behavior is explained by the internal buffers that Akka introduces at async boundaries. From the documentation, section "Buffers for asynchronous operators", which covers the internal buffers that are introduced as an optimization when using asynchronous operators:
While pipelining in general increases throughput, in practice there is a cost of passing an element through the asynchronous (and therefore thread crossing) boundary which is significant. To amortize this cost Akka Streams uses a windowed, batching backpressure strategy internally. It is windowed because as opposed to a Stop-And-Wait protocol multiple elements might be “in-flight” concurrently with requests for elements. It is also batching because a new element is not immediately requested once an element has been drained from the window-buffer but multiple elements are requested after multiple elements have been drained. This batching strategy reduces the communication cost of propagating the backpressure signal through the asynchronous boundary.
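If the goal is to see backpressure reach the source sooner, the buffer at the asynchronous boundary can be shrunk. A minimal sketch, assuming Akka Streams 2.5.x, where the default maximum is 16 elements (akka.stream.materializer.max-input-buffer-size). The names BufferSketch and slowBranch are made up for illustration:

```scala
import akka.NotUsed
import akka.stream.Attributes
import akka.stream.scaladsl.Flow

object BufferSketch {
  // Same slow stage as f4 in the question, but with the input buffer at the
  // async boundary limited to a single element instead of the default size.
  val slowBranch: Flow[Int, Int, NotUsed] =
    Flow[Int]
      .map { i => Thread.sleep(2000); i }
      .async
      .addAttributes(Attributes.inputBuffer(initial = 1, max = 1))
}
```

Using BufferSketch.slowBranch in place of f4.async in the question's graph should make the source stall after far fewer elements. Other stages keep their own buffers, so a few elements will still be in flight before backpressure becomes visible.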
I understand that this is a toy stream, but if you explain what your goal is, I will try to help you.
You need mapAsync instead of async
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import akka.stream.scaladsl.GraphDSL.Implicits._
val in = Source.tick(0 seconds, 1 seconds, 1).map(x => {println(s"Produced ${x}"); x})
val out = Sink.foreach[Int]{ o => println(s"F2 processed $o") }
val out2 = Sink.foreach[Int]{ o => println(s"F4 processed $o") }
val bcast = builder.add(Broadcast[Int](2))
val batchedIn: Source[Int, Cancellable] = in.buffer(4,OverflowStrategy.backpressure)
val f2 = Flow[Int].map(_ + 10)
val f4 = Flow[Int].mapAsync(1) { i => Future { println("F4 Started Processing"); Thread.sleep(2000); i }(system.dispatcher) }
batchedIn ~> bcast ~> f2 ~> out
bcast ~> f4 ~> out2
ClosedShape
}).run()

How to stop runnable graph

I'm taking my first steps with Akka Streams. I have a graph similar to this one, copied from here:
val topHeadSink = Sink.head[Int]
val bottomHeadSink = Sink.head[Int]
val sharedDoubler = Flow[Int].map(_ * 2)
val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
(topHS, bottomHS) =>
import GraphDSL.Implicits._
val broadcast = builder.add(Broadcast[Int](2))
Source.single(1) ~> broadcast.in
broadcast.out(0) ~> sharedDoubler ~> topHS.in
broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
ClosedShape
})
I can run the graph using g.run(), but how can I stop it?
In what circumstances should I do so (other than when it is no longer used, business-wise)?
This graph is contained within an actor. If the actor crashes, what will happen to the graph's underlying actor? Will it terminate as well?
As described in the documentation, the way to complete a graph from outside the graph is with KillSwitch. The example that you copied from the documentation is not a good candidate to illustrate this approach, as the source is only a single element, and the stream will complete very quickly when you run it. Let's adjust the graph to more easily see the KillSwitch in action:
val topSink = Sink.foreach(println)
val bottomSink = Sink.foreach(println)
val sharedDoubler = Flow[Int].map(_ * 2)
val killSwitch = KillSwitches.single[Int]
val g = RunnableGraph.fromGraph(GraphDSL.create(topSink, bottomSink, killSwitch)((_, _, _)) {
implicit builder => (topS, bottomS, switch) =>
import GraphDSL.Implicits._
val broadcast = builder.add(Broadcast[Int](2))
Source.fromIterator(() => (1 to 1000000).iterator) ~> switch ~> broadcast.in
broadcast.out(0) ~> sharedDoubler ~> topS.in
broadcast.out(1) ~> sharedDoubler ~> bottomS.in
ClosedShape
})
val res = g.run // res is of type (Future[Done], Future[Done], UniqueKillSwitch)
Thread.sleep(1000)
res._3.shutdown()
The source now consists of one million elements, and the sinks now print the broadcasted elements. The stream runs for one second, which is not enough time to churn through all one million elements, before we call shutdown to complete the stream.
If you run a stream inside an actor, whether the lifecycle of the underlying actor (or actors) created to run the stream is tied to the lifecycle of the "enclosing" actor depends on how the materializer is created. Read the documentation for more information. The following blog post by Colin Breck about using an actor and KillSwitch to manage the lifecycle of a stream is helpful as well: http://blog.colinbreck.com/integrating-akka-streams-and-akka-actors-part-ii/
There's a KillSwitch feature that should work for you. Check the answer to this other SO question: Proper way to stop Akka Streams on condition

How do you deal with futures in Akka Flow?

I have built an Akka graph that defines a flow. My objective is to reformat my future response and save it to a file. The flow is outlined below:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val balancer = builder.add(Balance[(HttpRequest, String)](6, waitForAllDownstreams = false))
val merger = builder.add(Merge[Future[Map[String, String]]](6))
val fileSink = FileIO.toPath(outputPath, options)
val ignoreSink = Sink.ignore
val in = Source(seeds)
in ~> balancer.in
for (i <- Range(0,6)) {
balancer.out(i) ~>
wikiFlow.async ~>
// This maps to a Future[Map[String, String]]
Flow[(Try[HttpResponse], String)].map(parseHtml) ~>
merger
}
merger.out ~>
// When we merge we need to map our Map to a file
Flow[Future[Map[String, String]]].map((d) => {
// What is the proper way of serializing future map
// so I can work with it like a normal stream into fileSink?
// I could manually do ->
// d.foreach(someWriteToFileProcess(_))
// with ignoreSink, but this defeats the nice
// akka flow
}) ~>
fileSink
ClosedShape
})
I can hack this workflow to write my future map to a file via foreach, but I'm afraid this could somehow lead to concurrency issues with FileIO and it just doesn't feel right. What is the proper way to handle futures with our akka flow?
The easiest way to create a Flow which involves an asynchronous computation is by using mapAsync.
So, let's say you want to create a Flow that consumes Int and produces String using an asynchronous computation, mapper: Int => Future[String], with a parallelism of 5.
val mapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](5)(mapper)
Now you can use this flow in your graph however you want.
An example usage would be:
val graph = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val intSource = Source(1 to 10)
val printSink = Sink.foreach[String](s => println(s))
val yourMapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](2)(yourMapper)
intSource ~> yourFlow ~> printSink
ClosedShape
}
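Applied to the question, where the elements reaching the merge are already Future[Map[String, String]], mapAsync with identity unwraps each future so that downstream stages see plain maps. This is a sketch under assumptions: the key=value line format and the names FutureFlowSketch, unwrap, render and toBytes are placeholders, not part of the original code.

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow
import akka.util.ByteString
import scala.concurrent.Future

object FutureFlowSketch {
  // Waits for each future and emits its result; output order matches input order.
  val unwrap: Flow[Future[Map[String, String]], Map[String, String], NotUsed] =
    Flow[Future[Map[String, String]]].mapAsync(parallelism = 4)(identity)

  // Placeholder serialization: one line per map, entries rendered as key=value.
  def render(m: Map[String, String]): String =
    m.map { case (k, v) => s"$k=$v" }.mkString("", ",", "\n")

  // FileIO sinks consume ByteString, so convert before handing over to fileSink.
  val toBytes: Flow[Future[Map[String, String]], ByteString, NotUsed] =
    unwrap.map(m => ByteString(render(m)))
}
```

In the question's graph, merger.out ~> FutureFlowSketch.toBytes ~> fileSink would then take the place of the map stage with the open question in it. A failed future fails the stream unless a supervision strategy says otherwise. Calling mapAsync(n)(parseHtml) in each branch instead of map(parseHtml) would avoid putting futures into the stream in the first place.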

Akka Streams filter & group by on a collection of keys

I have a stream of
case class Msg(keys: Seq[Char], value: String)
Now I want to filter for a subset of keys e.g.
val filterKeys = Set[Char]('k','f','c')
with a filter along the lines of Filter(_.keys.exists(filterKeys.contains))
And then split these so certain keys are processed by different flows and then merged back together at the end;
                                 /-key=k-> f1 --\
Source[Msg] ~> Filter ~> router |--key=f-> f2 ----> Merge --> f4
                                 \-key=c-> f3 --/
How should I go about doing this?
FlexiRoute in the old way seemed like a good way to go but in the new API I'm guessing I want to either make a custom GraphStage or create my own graph from the DSL as I see no way to do this through the built-in stages..?
Small Key Set Solution
If your key set is small, and immutable, then a combination of broadcast and filter would probably be the easiest implementation to understand. You first need to define the filter that you described:
def goodKeys(keySet : Set[Char]) = Flow[Msg] filter (_.keys exists keySet.contains)
This can then feed a broadcaster as described in the documentation. All Msg values with good keys will be broadcasted to each of three filters, and each filter will only allow a particular key:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._
  val source : Source[Msg, _] = ???
  val goodKeyFilter = goodKeys(Set('k','f','c'))
  val bcast = builder.add(Broadcast[Msg](3))
  val merge = builder.add(Merge[Msg](3))
  val kKey = goodKeys(Set('k'))
  val fKey = goodKeys(Set('f'))
  val cKey = goodKeys(Set('c'))
  //as described in the question
  val f1 : Flow[Msg, Msg, _] = ???
  val f2 : Flow[Msg, Msg, _] = ???
  val f3 : Flow[Msg, Msg, _] = ???
  val f4 : Sink[Msg,_] = ???
  source ~> goodKeyFilter ~> bcast ~> kKey ~> f1 ~> merge ~> f4
                             bcast ~> fKey ~> f2 ~> merge
                             bcast ~> cKey ~> f3 ~> merge
  ClosedShape
})
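If each message should go to exactly one branch, rather than to every branch whose key it contains, the built-in Partition stage is another option. A rough sketch, assuming a message is routed by the first routing key it contains in a fixed priority order. RouterSketch, keyOrder and route are illustrative names, and Msg is repeated so the block stands on its own:

```scala
import akka.stream.scaladsl.Partition

object RouterSketch {
  case class Msg(keys: Seq[Char], value: String)

  // Output port per key: 'k' -> 0, 'f' -> 1, 'c' -> 2
  val keyOrder: Seq[Char] = Seq('k', 'f', 'c')

  // Index of the first routing key present in the message.
  // Assumes upstream filtering guarantees at least one match;
  // otherwise this returns -1, which Partition rejects.
  def route(msg: Msg): Int = keyOrder.indexWhere(k => msg.keys.contains(k))

  // Fan-out stage with one output per key; each element goes to one output only.
  val router = Partition[Msg](keyOrder.size, route)
}
```

In the graph, builder.add(RouterSketch.router) would stand in for bcast and the three per-key filters, with outputs 0, 1 and 2 wired to f1, f2 and f3 respectively.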
Large Key Set Solution
If your key set is large, then groupBy is better. Suppose you have a Map of keys to functions:
//e.g. Set('k') -> f1
val keyFuncs : Map[Set[Char], (Msg) => Msg] = ???
This map can be used with the groupBy function:
source
  .via(goodKeys(Set('k','f','c')))
  .groupBy(keyFuncs.size, _.keys.toSet)
  //apply one of f1,f2,f3 to the Msg; assumes every key set that occurs has an entry in keyFuncs
  .map(msg => keyFuncs(msg.keys.toSet)(msg))
  .mergeSubstreams

Why Akka streams cycle doesn't end in this graph?

I would like to create a graph that loops n times before going to the sink. I've created this sample that fulfills my requirements, but it doesn't end after going to the sink and I really don't understand why. Can someone enlighten me?
Thanks.
import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.stream.{ActorMaterializer, UniformFanOutShape}
import scala.concurrent.Future
object test {
def main(args: Array[String]) {
val ignore: Sink[Any, Future[Unit]] = Sink.ignore
val closed: RunnableGraph[Future[Unit]] = FlowGraph.closed(ignore) { implicit b =>
sink => {
import FlowGraph.Implicits._
val fileSource = Source.single((0, Array[String]()))
val merge = b.add(MergePreferred[(Int, Array[String])](1).named("merge"))
val afterMerge = Flow[(Int, Array[String])].map {
e =>
println("after merge")
e
}
val broadcastArray: UniformFanOutShape[(Int, Array[String]), (Int, Array[String])] = b.add(Broadcast[(Int, Array[String])](2).named("broadcastArray"))
val toRetry = Flow[(Int, Array[String])].filter {
case (r, s) => {
println("retry " + (r < 3) + " " + r)
r < 3
}
}.map {
case (r, s) => (r + 1, s)
}
val toSink = Flow[(Int, Array[String])].filter {
case (r, s) => {
println("sink " + (r >= 3) + " " + r)
r >= 3
}
}
merge.preferred <~ toRetry <~ broadcastArray
fileSource ~> merge ~> afterMerge ~> broadcastArray ~> toSink ~> sink
}
}
implicit val system = ActorSystem()
implicit val _ = ActorMaterializer()
val run: Future[Unit] = closed.run()
import system.dispatcher
run.onComplete {
case _ => {
println("finished")
system.shutdown()
}
}
}
}
The Stream is never completed because the merge never signals completion.
After formatting your graph structure, it basically looks like:
//ignoring the preferred which is inconsequential
fileSource ~> merge ~> afterMerge ~> broadcastArray ~> toSink ~> sink
merge <~ toRetry <~ broadcastArray
The problem of non-completion is rooted in your merge step :
// 2 inputs into merge
fileSource ~> merge
merge <~ toRetry
Once the fileSource has emitted its single element (namely (0, Array.empty[String])) it sends out a complete message to merge.
However, the fileSource's completion message gets blocked at the merge. From the documentation:
akka.stream.scaladsl.MergePreferred
Completes when all upstreams complete (eagerClose=false) or one
upstream completes (eagerClose=true)
The merge will not send out complete until all of its input streams have completed.
// fileSource is complete ~> merge
// merge <~ toRetry is still running
// complete fileSource + still running toRetry = still running merge
Therefore, merge will wait until toRetry also completes. But toRetry will never complete because it is waiting for merge to complete.
If you want your specific graph to complete after fileSource completes, then just set eagerClose=true, which will cause merge to complete once fileSource completes. E.g.:
//Add true as the second argument (eagerClose)
val merge = b.add(MergePreferred[(Int, Array[String])](1, true).named("merge"))
Without the Stream Cycle
A simpler solution exists for your problem. Just use a single Flow.map stage which utilizes a tail recursive function:
//Note: there is no use of akka in this implementation
type FileInputType = (Int, Array[String])
@scala.annotation.tailrec
def recursiveRetry(fileInput : FileInputType) : FileInputType =
fileInput match {
case (r,_) if r >= 3 => fileInput
case (r,a) => recursiveRetry((r+1, a))
}
Your stream would then be reduced to
//ring-fenced akka code
val recursiveRetryFlow = Flow[FileInputType] map recursiveRetry
fileSource ~> recursiveRetryFlow ~> toSink ~> sink
The result is a cleaner stream & it avoids mixing "business logic" with akka code. This allows unit testing of the retry functionality completely independent from any third party library. The retry loop you have embedded in your stream is the "business logic". Therefore the mixed implementation is tightly coupled to akka going forward, for better or worse.
Also, in the segregated solution the cycle is contained in a tail recursive function, which is idiomatic Scala.
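Because recursiveRetry is a plain function, it can be checked without starting an actor system. A small sketch, repeating the definition so that it runs on its own; the object name RetrySketch is arbitrary:

```scala
object RetrySketch {
  type FileInputType = (Int, Array[String])

  @scala.annotation.tailrec
  def recursiveRetry(fileInput: FileInputType): FileInputType =
    fileInput match {
      case (r, _) if r >= 3 => fileInput
      case (r, a)           => recursiveRetry((r + 1, a))
    }

  def main(args: Array[String]): Unit = {
    // Starting below the threshold, the counter is advanced to 3.
    assert(recursiveRetry((0, Array.empty[String]))._1 == 3)
    // At or above the threshold, the input is returned unchanged.
    assert(recursiveRetry((5, Array("x")))._1 == 5)
  }
}
```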