Akka Streams backpressure on broadcast with async processing - Scala

I am struggling to understand whether akka-stream applies backpressure to the Source when the graph contains a broadcast with one branch that takes a long time (asynchronously).
I tried buffer and batch to see whether any backpressure was applied to the source, but it does not look like it. I also tried flushing System.out, but it does not change anything.
import akka.NotUsed
import akka.actor.{ActorSystem, Cancellable}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}
import scala.concurrent.duration._

object Test extends App {
  /* Necessary for akka streams */
  implicit val system = ActorSystem("test")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._

    val in = Source.tick(0.seconds, 1.second, 1)
    in.runForeach(i => println("Produced " + i))

    val out = Sink.foreach(println)
    val out2 = Sink.foreach[Int] { o => println(s"2 $o") }
    val bcast = builder.add(Broadcast[Int](2))

    val batchedIn: Source[Int, Cancellable] = in.batch(4, identity) {
      case (s, v) => println(s"Batched ${s + v}"); s + v
    }

    val f2 = Flow[Int].map(_ + 10)
    val f4 = Flow[Int].map { i => Thread.sleep(2000); i }

    batchedIn ~> bcast ~> f2 ~> out
                 bcast ~> f4.async ~> out2
    ClosedShape
  })

  g.run()
}
I would expect to see "Batched ..." in the console while the program runs, and at some point to see it momentarily stall because f4 is not fast enough to process the values. At the moment neither happens as expected: the numbers are generated continuously and no batching is done.
EDIT: I noticed that after some time the batch messages do start to print out in the console. I still don't know why it does not happen sooner, as backpressure should apply to the first elements as well.

The reason for this behavior is the internal buffers that Akka introduces when asynchronous boundaries are set. From the Akka Streams documentation section "Buffers for asynchronous operators", which covers the internal buffers that are introduced as an optimization when using asynchronous operators:
While pipelining in general increases throughput, in practice there is a cost of passing an element through the asynchronous (and therefore thread crossing) boundary which is significant. To amortize this cost Akka Streams uses a windowed, batching backpressure strategy internally. It is windowed because as opposed to a Stop-And-Wait protocol multiple elements might be “in-flight” concurrently with requests for elements. It is also batching because a new element is not immediately requested once an element has been drained from the window-buffer but multiple elements are requested after multiple elements have been drained. This batching strategy reduces the communication cost of propagating the backpressure signal through the asynchronous boundary.
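If you want to observe the backpressure sooner in a toy example like this, one option (not part of the original answer; the f4 and system names refer to the question's code) is to shrink that internal buffer, either on a single stage or for a whole materializer:

import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Attributes}

// Shrink the input buffer of one asynchronous stage (here the question's f4):
val f4SmallBuffer = f4.async.addAttributes(Attributes.inputBuffer(initial = 1, max = 1))

// Or shrink the default input buffer for every stage run with a materializer:
val smallBufferMaterializer = ActorMaterializer(
  ActorMaterializerSettings(system).withInputBuffer(initialSize = 1, maxSize = 1))

With buffers of size 1, the "Batched ..." messages should appear after only a few elements instead of only after the default buffers (up to 16 elements per stage) have filled up.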
I understand that this is a toy stream, but if you explain what your goal is, I will try to help you.
You need mapAsync instead of async:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import akka.stream.scaladsl.GraphDSL.Implicits._

  val in = Source.tick(0.seconds, 1.second, 1).map { x => println(s"Produced $x"); x }

  val out = Sink.foreach[Int] { o => println(s"F2 processed $o") }
  val out2 = Sink.foreach[Int] { o => println(s"F4 processed $o") }
  val bcast = builder.add(Broadcast[Int](2))

  val batchedIn: Source[Int, Cancellable] = in.buffer(4, OverflowStrategy.backpressure)

  val f2 = Flow[Int].map(_ + 10)
  val f4 = Flow[Int].mapAsync(1) { i =>
    Future { println("F4 Started Processing"); Thread.sleep(2000); i }(system.dispatcher)
  }

  batchedIn ~> bcast ~> f2 ~> out
               bcast ~> f4 ~> out2
  ClosedShape
}).run()

Related

How do you deal with futures and mapAsync in Akka Flow?

I built an Akka graph DSL defining a simple flow. But flow f4 takes 3 seconds to send an element while f2 takes 10 seconds.
As a result, I got: 3, 2, 3, 2. But this is not what I want. As f2 takes too much time, I would like to get: 3, 3, 2, 2. Here's the code...
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape, ThrottleMode}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._

implicit val actorSystem = ActorSystem("NumberSystem")
implicit val materializer = ActorMaterializer()

val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val in = Source(List(1, 1))
  val out = Sink.foreach(println)

  val bcast = builder.add(Broadcast[Int](2))
  val merge = builder.add(Merge[Int](2))

  val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
  val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)

  val f1, f3 = Flow[Int]
  val f2 = Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper)
  val f4 = Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper2)

  in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
              bcast ~> f4 ~> merge
  ClosedShape
})
g.run()
So where am I going wrong? With Future, with mapAsync, or something else?
Thanks
Sorry, I'm new to Akka, so I'm still learning. To get the expected results, one way is to add async:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val in = Source(List(1, 1))
  val out = Sink.foreach(println)

  val bcast = builder.add(Broadcast[Int](2))
  val merge = builder.add(Merge[Int](2))

  val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
  val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)

  val f1, f3 = Flow[Int]
  val f2 = Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).map(_ + 1)
  //.mapAsyncUnordered[Int](2)(yourMapper)
  val f4 = Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).map(_ + 2)
  //.mapAsync[Int](2)(yourMapper2)

  in ~> f1 ~> bcast ~> f2.async ~> merge ~> f3 ~> out
              bcast ~> f4.async ~> merge
  ClosedShape
})
g.run()
As you've already figured out, replacing:
mapAsync(i => Future{i + delta})
with:
map(_ + delta).async
in the two flows would achieve what you want.
The different result boils down to the key difference between mapAsync and map + async. While mapAsync enables execution of Futures in parallel threads, the multiple mapAsync flow stages are still being managed by the same underlying actor which would carry out operator fusion before execution (for cost efficiency in general).
On the other hand, async actually introduces an asynchronous boundary into the stream flow with the individual flow stages handled by separate actors. In your case, each of the two flow stages independently emits elements downstream and whichever element emitted first gets consumed first. Inevitably there is a cost for managing the stream across the asynchronous boundary and Akka Stream uses a windowed buffering strategy to amortize the cost (see this Akka Stream doc).
For more details re: difference between mapAsync and async, this blog post might be of interest.
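As a rough self-contained illustration of that difference (this sketch is not from the original answers, and the work function is just a stand-in for a slow computation):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system = ActorSystem("compare")
implicit val materializer = ActorMaterializer()
import system.dispatcher

def work(i: Int): Int = { Thread.sleep(100); i }

// mapAsync: the Futures may run in parallel on the dispatcher, but the stage
// itself is still fused with its neighbours onto one underlying actor.
Source(1 to 5).mapAsync(2)(i => Future(work(i))).runWith(Sink.foreach(println))

// map + async: an explicit asynchronous boundary; the stages on either side of
// the boundary run on separate actors, with an internal buffer between them.
Source(1 to 5).map(work).async.runWith(Sink.foreach(println))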
So you are trying to join together the results coming out of f2 and f4, in which case you are trying to do what is sometimes called the "scatter-gather pattern".
I don't think there is an off-the-shelf way to implement it without adding a custom stateful stage that keeps track of the outputs from f2 and from f4 and emits a record when both are available (a rough sketch of a simpler special case follows the list below). There are also some complications to bear in mind:
What happens if f2 or f4 fails
What happens if they take too long
You need a unique key for each input record, so you know which output from f2 corresponds to which output from f4 (and vice versa)
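In the special case where both branches emit exactly one output per input and preserve element order, the gather step can be approximated by pairing the branch outputs positionally with Zip. This is only a sketch under those assumptions (no failure or timeout handling; the flows, source, and sink are illustrative):

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source, Zip}

implicit val system = ActorSystem("scatter-gather")
implicit val materializer = ActorMaterializer()

val f2 = Flow[Int].map(_ + 1)
val f4 = Flow[Int].map(_ + 2)

RunnableGraph.fromGraph(GraphDSL.create() { implicit b: GraphDSL.Builder[akka.NotUsed] =>
  import GraphDSL.Implicits._
  val bcast = b.add(Broadcast[Int](2))
  val zip   = b.add(Zip[Int, Int]())   // pairs the two branch outputs by position

  Source(1 to 10) ~> bcast
  bcast ~> f2 ~> zip.in0
  bcast ~> f4 ~> zip.in1
  zip.out ~> Sink.foreach[(Int, Int)](println)
  ClosedShape
}).run()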

How to stop a runnable graph

Taking my first steps with akka streams. I have a graph similar to this one, copied from here:
val topHeadSink = Sink.head[Int]
val bottomHeadSink = Sink.head[Int]
val sharedDoubler = Flow[Int].map(_ * 2)

val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
  (topHS, bottomHS) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))
    Source.single(1) ~> broadcast.in

    broadcast.out(0) ~> sharedDoubler ~> topHS.in
    broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
    ClosedShape
})
I can run the graph using g.run(), but how can I stop it?
In what circumstances should I do so (other than when it is no longer used, business-wise)?
This graph is contained within an actor. If the actor crashes, what will happen to the graph's underlying actor? Will it terminate as well?
As described in the documentation, the way to complete a graph from outside the graph is with KillSwitch. The example that you copied from the documentation is not a good candidate to illustrate this approach, as the source is only a single element, and the stream will complete very quickly when you run it. Let's adjust the graph to more easily see the KillSwitch in action:
val topSink = Sink.foreach(println)
val bottomSink = Sink.foreach(println)
val sharedDoubler = Flow[Int].map(_ * 2)
val killSwitch = KillSwitches.single[Int]

val g = RunnableGraph.fromGraph(GraphDSL.create(topSink, bottomSink, killSwitch)((_, _, _)) {
  implicit builder => (topS, bottomS, switch) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))

    Source.fromIterator(() => (1 to 1000000).iterator) ~> switch ~> broadcast.in
    broadcast.out(0) ~> sharedDoubler ~> topS.in
    broadcast.out(1) ~> sharedDoubler ~> bottomS.in
    ClosedShape
})
val res = g.run // res is of type (Future[Done], Future[Done], UniqueKillSwitch)
Thread.sleep(1000)
res._3.shutdown()
The source now consists of one million elements, and the sinks now print the broadcasted elements. The stream runs for one second, which is not enough time to churn through all one million elements, before we call shutdown to complete the stream.
If you run a stream inside an actor, whether the lifecycle of the underlying actor (or actors) created to run the stream is tied to the lifecycle of the "enclosing" actor depends on how the materializer is created. Read the documentation for more information. The following blog post by Colin Breck about using an actor and KillSwitch to manage the lifecycle of a stream is helpful as well: http://blog.colinbreck.com/integrating-akka-streams-and-akka-actors-part-ii/
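For example, one common pattern (a sketch under the classic ActorMaterializer API, not taken from the question) is to create the materializer from the enclosing actor's context, so that streams materialized with it are torn down when that actor stops:

import akka.actor.Actor
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

class StreamOwner extends Actor {
  // Bound to this actor's lifecycle: stopping the actor also shuts down
  // any stream materialized with this materializer.
  implicit val materializer: ActorMaterializer = ActorMaterializer()(context)

  Source(1 to 1000000).runWith(Sink.foreach(println))

  def receive = {
    case "stop" => context.stop(self) // the running stream is terminated too
  }
}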
There's a KillSwitch feature that should work for you. Check the answer to this other SO question: Proper way to stop Akka Streams on condition
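For a plain linear stream you do not even need the GraphDSL; a KillSwitch can be inserted with viaMat, roughly like this (a sketch, with an illustrative tick source):

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, KillSwitches, UniqueKillSwitch}
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.duration._

implicit val system = ActorSystem("killswitch-demo")
implicit val materializer = ActorMaterializer()

val (switch: UniqueKillSwitch, done) = Source
  .tick(0.seconds, 100.millis, "tick")
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(Sink.foreach(println))(Keep.both)
  .run()

Thread.sleep(1000)
switch.shutdown() // completes the stream; switch.abort(ex) would fail it instead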

Merge and broadcast, building a (simple) Akka graph

The Akka documentation is vast and there are a lot of tutorials. But either they are outdated or they only cover the basics (or, maybe I simply can't find the right ones).
What I want to create is a websocket application with multiple clients and multiple sources on the server side. As I don't want to get in over my head from the start, I want to take baby steps and incrementally increase the complexity of the software I am building.
After toying around with some simple flows I wanted to start with a more sophisticated graph now.
What I want is:
Two sources, one that pushes "keepAlive" messages from the server to the client (currently only one) and a second one that actually pushes useful data.
Now for the first one I have this:
val tickingSource: Source[Array[Byte], Cancellable] =
  Source.tick(initialDelay = 1.second, interval = 10.seconds, tick = NotUsed)
    .zipWithIndex
    .map { case (_, counter) => SomeMessage().toByteArray }
Where SomeMessage is a protobuf type.
Because I can't find an up-to-date way to add an actor as a source, I tried the following for my second source:
val secondSource = Source(1 to 1000)
val secondSourceConverter = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
My attempt at the graph:
val g: RunnableGraph[NotUsed] = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._

  val sourceMerge = builder.add(Merge[Array[Byte]](2).named("sourceMerge"))
  val x = Source(1 to 1000)
  val y = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
  val out = Sink.ignore

  tickingSource ~> sourceMerge ~> out
  x ~> y ~> sourceMerge
  ClosedShape
})
Now g is of type RunnableGraph[NotUsed] while it should be RunnableGraph[Array[Byte]] for my websocket. And I wonder here: am I already doing something completely wrong?
You need to pass the secondSourceConverter into GraphDSL.create, like in the following example taken from the docs. Here they pass in two sinks, but it's the same technique.
RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
  (topHS, bottomHS) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))
    Source.single(1) ~> broadcast.in

    broadcast.out(0) ~> sharedDoubler ~> topHS.in
    broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
    ClosedShape
})
Your graph is of type RunnableGraph[NotUsed] because you're using Sink.ignore. And you probably want a RunnableGraph[Future[Array[Byte]]] instead of a RunnableGraph[Array[Byte]]:
val byteSink = Sink.fold[Array[Byte], Array[Byte]](Array[Byte]())(_ ++ _)

val g = RunnableGraph.fromGraph(GraphDSL.create(byteSink) { implicit builder => bSink =>
  import GraphDSL.Implicits._

  val sourceMerge = builder.add(Merge[Array[Byte]](2))

  tickingSource ~> sourceMerge ~> bSink.in
  secondSource ~> secondSourceConverter ~> sourceMerge
  ClosedShape
})
// g: RunnableGraph[Future[Array[Byte]]]
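Running the graph then yields the sink's materialized value (a sketch, assuming an implicit materializer and execution context are in scope; note that because tickingSource never completes on its own, this Future only finishes once the stream is stopped externally, for example with a KillSwitch as in the previous question):

val collected: Future[Array[Byte]] = g.run()
collected.foreach(bytes => println(s"Collected ${bytes.length} bytes"))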
I'm not sure how you would like to process incoming messages but here is a simple example. Hope that it'll help you.
path("ws") {
extractUpgradeToWebSocket { upgrade =>
complete {
import scala.concurrent.duration._
val tickSource = Source.tick(1.second, 1.second, TextMessage("ping"))
val messagesSource = Source.queue(10, OverflowStrategy.backpressure)
messagesSource.mapMaterializedValue { queue =>
//do something with out queue
//like myHandler ! RegisterOutQueue(queue)
}
val sink = Sink.ignore
val source = tickSource.merge(messagesSource)
upgrade.handleMessagesWithSinkSource(
inSink = sink,
outSource = source
)
}
}

Akka Streams: stop a stream after processing n elements

I have an Akka Streams flow which reads from a file using Alpakka, processes the data, and writes to a file. I want to stop the flow after n elements have been processed, measure the elapsed time, and terminate the system. How can I achieve this?
My flow looks like this:
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._
  sourceFile ~> mainFlow ~> sinkFile
  ClosedShape
})
graph.run()
Do you have an idea? Thanks
Agreeing with what Viktor said: first of all, you don't need to use the GraphDSL to achieve this, and you can use take(n) to complete the graph.
Secondly, you can use mapMaterializedValue to attach a callback to your Sink's materialized value (which in turn should materialize to a Future[Something]).
val graph: RunnableGraph[Future[FiniteDuration]] =
  sourceFile
    .via(mainFlow)
    .take(N)
    .toMat(sinkFile)(Keep.right)
    .mapMaterializedValue { f =>
      val start = System.nanoTime()
      f.map(_ => FiniteDuration(System.nanoTime() - start, TimeUnit.NANOSECONDS))
    }

graph.run().onComplete { duration =>
  println(s"Elapsed time: $duration")
}
Note that you are going to need an ExecutionContext in scope.
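For example, the actor system's default dispatcher can serve as that ExecutionContext (assuming system is your ActorSystem):

import system.dispatcher // brings an implicit ExecutionContext into scope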
EDIT
Even if you have to use the graphDSL, the same concepts apply. You only need to expose the materialized Future of your sink and map on that.
val graph: RunnableGraph[Future[??Something??]] =
  RunnableGraph.fromGraph(GraphDSL.create(sinkFile) { implicit builder: GraphDSL.Builder[Future[Something]] => snk =>
    import GraphDSL.Implicits._
    sourceFile ~> mainFlow ~> snk
    ClosedShape
  })
val timedGraph: RunnableGraph[Future[FiniteDuration]] =
  graph.mapMaterializedValue { f =>
    val start = System.nanoTime()
    f.map(_ => FiniteDuration(System.nanoTime() - start, TimeUnit.NANOSECONDS))
  }

timedGraph.run().onComplete { duration =>
  println(s"Elapsed time: $duration")
}
No need for GraphDSL here.
val doneFuture = (sourceFile via mainFlow.take(N) runWith sinkFile) transformWith { _ => system.terminate() }
To obtain time, you can use akka-streams-contrib: https://github.com/akka/akka-stream-contrib/blob/master/contrib/src/main/scala/akka/stream/contrib/Timed.scala

Why doesn't the Akka Streams cycle in this graph end?

I would like to create a graph that loops n times before going to the sink. I've just created this sample that fulfills my requirements, but it doesn't end after reaching the sink and I really don't understand why. Can someone enlighten me?
Thanks.
import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.stream.{ActorMaterializer, UniformFanOutShape}
import scala.concurrent.Future

object test {
  def main(args: Array[String]) {
    val ignore: Sink[Any, Future[Unit]] = Sink.ignore

    val closed: RunnableGraph[Future[Unit]] = FlowGraph.closed(ignore) { implicit b => sink =>
      import FlowGraph.Implicits._

      val fileSource = Source.single((0, Array[String]()))
      val merge = b.add(MergePreferred[(Int, Array[String])](1).named("merge"))

      val afterMerge = Flow[(Int, Array[String])].map { e =>
        println("after merge")
        e
      }

      val broadcastArray: UniformFanOutShape[(Int, Array[String]), (Int, Array[String])] =
        b.add(Broadcast[(Int, Array[String])](2).named("broadcastArray"))

      val toRetry = Flow[(Int, Array[String])].filter {
        case (r, s) =>
          println("retry " + (r < 3) + " " + r)
          r < 3
      }.map {
        case (r, s) => (r + 1, s)
      }

      val toSink = Flow[(Int, Array[String])].filter {
        case (r, s) =>
          println("sink " + (r >= 3) + " " + r)
          r >= 3
      }

      merge.preferred <~ toRetry <~ broadcastArray
      fileSource ~> merge ~> afterMerge ~> broadcastArray ~> toSink ~> sink
    }

    implicit val system = ActorSystem()
    implicit val _ = ActorMaterializer()

    val run: Future[Unit] = closed.run()
    import system.dispatcher
    run.onComplete {
      case _ =>
        println("finished")
        system.shutdown()
    }
  }
}
The Stream is never completed because the merge never signals completion.
After formatting your graph structure, it basically looks like:
// ignoring the preferred port, which is inconsequential here
fileSource ~> merge ~> afterMerge ~> broadcastArray ~> toSink ~> sink
              merge <~ toRetry    <~ broadcastArray
The problem of non-completion is rooted in your merge step:
// 2 inputs into merge
fileSource ~> merge
merge <~ toRetry
Once the fileSource has emitted its single element (namely (0, Array.empty[String])), it sends out a completion message to merge.
However, the fileSource's completion message gets blocked at the merge. From the documentation:
akka.stream.scaladsl.MergePreferred
Completes when all upstreams complete (eagerClose=false) or one upstream completes (eagerClose=true)
The merge will not send out complete until all of its input streams have completed.
// fileSource is complete ~> merge
// merge <~ toRetry is still running
// complete fileSource + still running toRetry = still running merge
Therefore, merge will wait until toRetry also completes. But toRetry will never complete because it is waiting for merge to complete.
If you want your specific graph to complete after fileSource completes, then just set eagerClose = true, which will cause merge to complete once fileSource completes. E.g.:
// note the added `true` as the second argument
val merge = b.add(MergePreferred[(Int, Array[String])](1, true).named("merge"))
Without the Stream Cycle
A simpler solution exists for your problem. Just use a single Flow.map stage which utilizes a tail recursive function:
// Note: there is no use of akka in this implementation
type FileInputType = (Int, Array[String])

@scala.annotation.tailrec
def recursiveRetry(fileInput: FileInputType): FileInputType =
  fileInput match {
    case (r, _) if r >= 3 => fileInput
    case (r, a)           => recursiveRetry((r + 1, a))
  }
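As a quick check (not part of the original answer), the function simply folds the retry counter up to the threshold:

recursiveRetry((0, Array.empty[String]))  // returns (3, Array())
recursiveRetry((5, Array.empty[String]))  // already >= 3, so returned unchanged: (5, Array())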
Your stream would then be reduced to
//ring-fenced akka code
val recursiveRetryFlow = Flow[FileInputType] map recursiveRetry
fileSource ~> recursiveRetryFlow ~> toSink ~> sink
The result is a cleaner stream, and it avoids mixing "business logic" with akka code. This allows unit testing of the retry functionality completely independently of any third-party library. The retry loop you embedded in your stream is the "business logic", so the mixed implementation is tightly coupled to akka going forward, for better or worse.
Also, in the segregated solution the cycle is contained in a tail recursive function, which is idiomatic Scala.