Akka-streams mapConcat not working with cycled RunnableGraph - scala

I have a RunnableGraph like the following. When there is a simple map between the broadcast and merge stages, everything is fine. However, when I use mapConcat instead, the graph stops working after consuming the first element.
I want to know why it doesn't work.
RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val M = b.add(MergePreferred[Int](1))
  val B = b.add(Broadcast[Int](2))
  val S = Source(List(3))

  S ~> M ~> Flow[Int].map { s => println(s); s } ~> B ~> Sink.ignore
  M.preferred <~ Flow[Int].map(x => List.fill(3)(x - 1)).mapConcat(x => { println(x); x }).filter(_ > 0) <~ B

  ClosedShape
})
// run() output:
// 3
// List(2,2,2)

The mapConcat stage blocks the feedback loop, and that is expected. Consider the following chain of events:
the mapConcat function prints List(2,2,2)
the mapConcat stage needs demand to emit the first of the 3 available elements (2, 2, 2)
the demand has to come from the Merge stage, and therefore from the Broadcast stage.
the Broadcast stage backpressures if any of its downstreams backpressures. Its downstreams are Sink.ignore (which never backpressures) and the mapConcat itself.
the mapConcat backpressures if "there are still remaining elements from the previously calculated collection", as per the docs. This is indeed the case.
In other words, your cycle is unbalanced. You are introducing more elements in the feedback loop than you are removing.
This issue is explained in detail in this documentation page, where a couple of solutions are also presented. For your specific case, because of the filter stage you have, introducing a buffer larger than 13 would print all the elements. However, note that the graph will just hang and not complete afterwards.
S ~> M ~> Flow[Int].map { s => println(s); s } ~> B ~> Sink.ignore
M.preferred <~ Flow[Int].buffer(20, OverflowStrategy.dropHead) <~ Flow[Int].map(x => List.fill(3)(x-1)).mapConcat(x => {println(x); x}).filter(_ > 0) <~ B

Related

Akka streams pass through flow limiting Parallelism / throughput of processing flow

I have a use case where I want to send a message to an external system, but the flow that sends this message takes and returns a type I can't use downstream. This is a great use case for the pass-through flow. I am using the implementation here. Initially I was worried that if the processingFlow uses mapAsyncUnordered then this flow wouldn't work, since the processing flow may reorder messages and the ZipWith may push out a tuple with an incorrect pair, e.g. in the following example:
val testSource = Source(1 until 50)
val processingFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsyncUnordered(10)(x => Future {
  Thread.sleep(Random.nextInt(50))
  x * 10
})
val passThroughFlow = PassThroughFlow(processingFlow, Keep.both)
val future = testSource.via(passThroughFlow).runWith(Sink.seq)
I would expect that the processing flow could reorder its outputs with respect to its input, and I would get a result such as:
[(30,1), (40,2), (10,3), (10,4), ...]
with the right element (the passed-through value) always in order, but the left element, which goes through my mapAsyncUnordered, potentially joined with an incorrect element to make a bad tuple.
Instead I actually get:
[(10,1), (20,2), (30,3), (40,4), ...]
Every time. Upon further investigation I noticed the code was running slowly, and in fact it's not running in parallel at all despite my mapAsyncUnordered. I tried introducing a buffer before and after, as well as an async boundary, but it always seems to run sequentially. This explains why it's always ordered, but I want my processing flow to have a higher throughput.
I came up with the following work around:
object PassThroughFlow {
  def keepRight[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
    keepBoth[A, A1](processingFlow).map(_._2)

  def keepBoth[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val broadcast = builder.add(Broadcast[A](2))
      val zip = builder.add(ZipWith[A1, A, (A1, A)]((left, right) => (left, right)))

      broadcast.out(0) ~> processingFlow ~> zip.in0
      broadcast.out(1) ~> zip.in1

      FlowShape(broadcast.in, zip.out)
    })
}

object ParallelPassThroughFlow {
  def keepRight[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
    keepBoth(parallelism, processingFlow).map(_._2)

  def keepBoth[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val fanOut = builder.add(Balance[A](outputPorts = parallelism))
      val merger = builder.add(Merge[(A1, A)](inputPorts = parallelism, eagerComplete = false))

      Range(0, parallelism).foreach { n =>
        val passThrough = PassThroughFlow.keepBoth(processingFlow)
        fanOut.out(n) ~> passThrough ~> merger.in(n)
      }

      FlowShape(fanOut.in, merger.out)
    })
}
Two questions:
In the original implementation, why does the zip inside the pass-through flow limit the amount of parallelism of the mapAsyncUnordered?
Is my work-around sound, or could it be improved? I basically fan out the input to multiple copies of the pass-through flow and merge it all back together. It seems to have the properties I want (parallel yet maintains order even if the processing flow reorders), yet something doesn't feel right.
The behavior you're witnessing is a result of how broadcast and zip work: broadcast emits downstream when all of its outputs signal demand; zip waits for all of its inputs before signaling demand (and emitting downstream).
broadcast.out(0) ~> processingFlow ~> zip.in0
broadcast.out(1) ~> zip.in1
Consider the movement of the first element (1) through the above graph. 1 is broadcast to both processingFlow and zip. zip immediately receives one of its inputs (1) and waits for its other input (10), which will take a little longer to arrive. Only when zip gets both 1 and 10 does it pull for more elements from upstream, thus triggering the movement of the second element (2) through the stream. And so on.
As for your ParallelPassThroughFlow, I don't know why "something doesn't feel right" to you.
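For what it's worth, the work-around plugs into the original test in the obvious way; a minimal sketch reusing the question's testSource and processingFlow (a parallelism of 10 matches the mapAsyncUnordered above):

val parallelPassThroughFlow = ParallelPassThroughFlow.keepBoth(parallelism = 10, processingFlow)
val future = testSource.via(parallelPassThroughFlow).runWith(Sink.seq) // Future[Seq[(Int, Int)]]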

How to explain this Akka Streams graph from official doc?

I have a couple of questions for this sample code hosted officially here:
val topHeadSink = Sink.head[Int]
val bottomHeadSink = Sink.head[Int]
val sharedDoubler = Flow[Int].map(_ * 2)
RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
  (topHS, bottomHS) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))
    Source.single(1) ~> broadcast.in

    broadcast.out(0) ~> sharedDoubler ~> topHS.in
    broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
    ClosedShape
})
When do you pass in a graph through create?
Why are topHeadSink, bottomHeadSink passed in through create, but sharedDoubler is not? What is the difference between them?
When do you need builder.add?
Can I create a broadcast outside the graph without builder.add? If I add a couple of flows inside the graph, should I add the flows via builder.add as well? It is very confusing that sometimes we need builder.add and sometimes we do not.
Update
I feel this is still confusing:
The difference between these approaches is that importing using builder.add(...) ignores the materialized value of the imported graph, while importing via the factory method allows its inclusion.
topHS, bottomHS are imported from create, so they will keep their materialized value. What if I do builder.add(topHS)?
And how do you explain sharedDoubler: does it have a materialized value or not? What if I use builder.add with it?
What does the ((_, _)) in GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) mean?
It looks like boilerplate we just need, but I am not sure what it is.
When do you pass in a graph through create?
When you want to obtain the materialized value of the graph that you pass to the create factory method. The type of the RunnableGraph in your question is RunnableGraph[(Future[Int], Future[Int])], meaning that the materialized value of the graph is (Future[Int], Future[Int]):
val g = RunnableGraph.fromGraph(...).run() // (Future[Int], Future[Int])
val topHeadSinkResult = g._1 // Future[Int]
val bottomHeadSinkResult = g._2 // Future[Int]
Now consider the following variant, which defines the sinks "inside" the graph and discards the materialized value:
val g2 = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val topHeadSink = Sink.head[Int]
  val bottomHeadSink = Sink.head[Int]
  val broadcast = builder.add(Broadcast[Int](2))

  Source.single(1) ~> broadcast.in
  broadcast.out(0) ~> sharedDoubler ~> topHeadSink
  broadcast.out(1) ~> sharedDoubler ~> bottomHeadSink
  ClosedShape
}).run() // NotUsed
The value of g2 is NotUsed.
When do you need builder.add?
All of the components of a graph must be added to the builder, but there are variants of the ~> operator that add the most commonly used components, such as Source and Flow, to the builder under the covers. However, junction operations that perform a fan-in (such as Merge) or a fan-out (such as Broadcast) must be explicitly passed to builder.add if you're using the Graph DSL.
Note that for simple graphs, you can use junctions without having to use the Graph DSL. Here is an example from the documentation:
val sendRemotely = Sink.actorRef(actorRef, "Done")
val localProcessing = Sink.foreach[Int](_ => /* do something useful */ ())
val sink = Sink.combine(sendRemotely, localProcessing)(Broadcast[Int](_))
Source(List(0, 1, 2)).runWith(sink)
What does the ((_, _)) in GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) mean?
It's a curried parameter that specifies which materialized value(s) you want to retain. Using ((_, _)) here is the same as:
val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((t, b) => (t, b)) {
  implicit builder => (topHS, bottomHS) =>
    ...
}).run() // (Future[Int], Future[Int])
In other words, ((_, _)) in this context is shorthand for ((t, b) => (t, b)), which preserves the respective materialized values of the two sinks that are passed in. If, for example, you want to keep only the materialized value of topHeadSink, you could change the call to the following:
val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((t, _) => t) {
  implicit builder => (topHS, bottomHS) =>
    ...
}).run() // Future[Int]

Akka streams - resuming graph with broadcast and zip after failure

I have a flow graph with a broadcast and zip inside. If something (regardless of what it is) fails inside this flow, I'd like to drop the problematic element passed to it and resume. I came up with the following solution:
val flow = Flow.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val dangerousFlow = Flow[Int].map {
    case 5 => throw new RuntimeException("BOOM!")
    case x => x
  }
  val safeFlow = Flow[Int]
  val bcast = builder.add(Broadcast[Int](2))
  val zip = builder.add(Zip[Int, Int])

  bcast ~> dangerousFlow ~> zip.in0
  bcast ~> safeFlow ~> zip.in1

  FlowShape(bcast.in, zip.out)
})
Source(1 to 9)
  .via(flow)
  .withAttributes(ActorAttributes.supervisionStrategy(Supervision.restartingDecider))
  .runWith(Sink.foreach(println))
I'd expect it to print:
(1,1)
(2,2)
(3,3)
(4,4)
(5,5)
(6,6)
(7,7)
(8,8)
(9,9)
However, it deadlocks, printing only:
(1,1)
(2,2)
(3,3)
(4,4)
We've done some debugging, and it turns out it applied the "resume" strategy to its children, which caused dangerousFlow to resume after failure and thus to demand an element from bcast. bcast won't emit an element until safeFlow demands another element, which actually never happens (because it's waiting for demand from zip).
Is there a way to resume the graph regardless of what went wrong inside one of the stages?
I think you understood the problem well. When your element 5 crashes dangerousFlow, you would also need to drop the element 5 that is going through safeFlow, because if it reaches the zip stage you get the problem you describe. I don't know how to solve that between the broadcast and zip stages, but what about pushing the problem further downstream, where it is easier to handle?
Consider using the following dangerousFlow:
import scala.util._

val dangerousFlow = Flow[Int].map {
  case 5 => Failure(new RuntimeException("BOOM!"))
  case x => Success(x)
}
Even in case of a problem, dangerousFlow would still emit data. You can then zip as you are currently doing and would just need to add a collect stage as the last step of your graph. On a flow, this would look like:
Flow[(Try[Int], Int)].collect {
  case (Success(s), i) => s -> i
}
Now if, as you wrote, you really expect it to output the (5, 5) tuple, use the following:
Flow[(Try[Int], Int)].collect {
  case (Success(s), i) => s -> i
  case (_, i) => i -> i
}
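Putting those pieces together, a rough sketch of the whole graph with the Try-based dangerousFlow might look like the following (untested; it reuses the imports, Source(1 to 9) and implicit materializer from the question):

import scala.util.{Failure, Success, Try}

val flow = Flow.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  // emit Try[Int] instead of throwing, so no element is lost
  val dangerousFlow = Flow[Int].map {
    case 5 => Failure(new RuntimeException("BOOM!"))
    case x => Success(x)
  }
  val safeFlow = Flow[Int]
  val bcast = builder.add(Broadcast[Int](2))
  val zip = builder.add(Zip[Try[Int], Int])

  bcast ~> dangerousFlow ~> zip.in0
  bcast ~> safeFlow ~> zip.in1

  FlowShape(bcast.in, zip.out)
})

Source(1 to 9)
  .via(flow)
  .collect { case (Success(s), i) => s -> i } // drops (5, 5); add `case (_, i) => i -> i` to keep it
  .runWith(Sink.foreach(println))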

Group stream elements by weight function in Akka Streams

Imagine a
val myFlow: Flow[Element] = ... //some flow..
Given a weight function
val weightFunction: Element => Int
I would like to obtain a
val transformedFlow: Flow[List[Element]]
such that each element of the transformedFlow is a List[Element], such that the sum of the weights of the elements in that list is greater than a given constant.
How would I achieve that?
How about using scan to create a stream of accumulated weights, then zip the results with the original stream of elements and then use splitAfter to create substreams? I have not even tried to compile the following, but I hope you get the idea:
val broadCast = builder.add(Broadcast[Element](2))
val zip = builder.add(Zip[Element, Boolean])

myFlow.shape.out ~> broadCast.in
broadCast.out(0) ~> zip.in0
broadCast.out(1).scan(0) { (totalWeight, elem) =>
  if (totalWeight > Limit) weightFunction(elem)
  else totalWeight + weightFunction(elem)
}.map(_ > Limit) ~> zip.in1

val resultFlow =
  zip.out.splitAfter(_._2)
    .fold(List.empty[Element]) { case (list, (elem, _)) => elem :: list }
    .concatSubstreams
(You might want to consider doing map(_.reverse) on the resultFlow.)
Edit: you don't even need to do the broadcast and zip if you change the return type of the scan a bit - see a runnable code example here: https://gist.github.com/MartinHH/a05a87269b1697d5f57a1c77db269767
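The gist is not reproduced here, but the idea can be sketched roughly as follows. Element, weightFunction and Limit are assumed placeholders (here Element is a String weighted by its length), and the code is untested:

import akka.NotUsed
import akka.stream.scaladsl.Flow

type Element = String                         // assumed placeholder type
val Limit = 10                                // assumed constant
val weightFunction: Element => Int = _.length // assumed weight function

val transformedFlow: Flow[Element, List[Element], NotUsed] =
  Flow[Element]
    .scan((0, Option.empty[Element])) { case ((totalWeight, _), elem) =>
      // reset the accumulator once the previous group exceeded the limit
      if (totalWeight > Limit) (weightFunction(elem), Some(elem))
      else (totalWeight + weightFunction(elem), Some(elem))
    }
    .collect { case (weight, Some(elem)) => (elem, weight > Limit) } // drop scan's initial (0, None)
    .splitAfter(_._2) // close a group after the element that pushed the sum over the limit
    .fold(List.empty[Element]) { case (list, (elem, _)) => elem :: list }
    .map(_.reverse)
    .concatSubstreams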

Is there a nicer way to connect Scan and Broadcast in Akka Stream?

Let's assume I want to create a Flow which takes Ints and outputs tuples (doubled int, sum). So I fan out the ints, map on one edge and scan on the other, then zip them. This is the result:
object Main extends App {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val flow = Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val broadcast = b.add(Broadcast[Int](2))
    val zip = b.add(Zip[Int, Int])
    val flowMap = b.add(Flow[Int].map(_ * 2))
    val flowScan = b.add(Flow[Int].scan(0)(_ + _))

    broadcast.out(0) ~> flowMap ~> zip.in0
    broadcast.out(1) ~> flowScan ~> zip.in1

    FlowShape(broadcast.in, zip.out)
  })

  Source(1 to 5).via(flow).to(Sink.foreach(println)).run()
}
Unfortunately, this doesn't output anything. I researched it a bit and found out that:
Broadcast emits when all of the outputs stop backpressuring and there is an input element available,
Scan backpressures when downstream backpressures.
This makes the whole flow deadlock and nothing happens. Does somebody know how to achieve the result:
(2,0)
(4,1)
(6,3)
(8,6)
(10,10)
in a nice way? The only solution I have found so far is to use .buffer:
val flowScan = b.add(Flow[Int].buffer(1, OverflowStrategy.backpressure).scan(0)(_ + _))
But I don't really like this solution because it is describing not only the logic, but also some technicalities...
The reason for the deadlock is that scan will, upon its first demand, emit the start value (0 in this case) without passing demand upstream. This means that demand only reaches broadcast.out(0), and, as you said, broadcast only emits when there has been demand from all of its downstreams.
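You can see that behaviour of scan in isolation (a quick sketch, assuming an implicit materializer is in scope as in the Main object above):

Source(1 to 3).scan(0)(_ + _).runForeach(println)
// prints the seed before any upstream element is consumed:
// 0
// 1
// 3
// 6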
The buffer might seem like a technicality, but it actually expresses the graph according to what you want to achieve: you want to zip the two branches, but the scan branch will always be one element ahead of the other. This is very central to how akka-streams works.
So your result is not something that broadcast + zip can produce without additional graph nodes. I think the cleanest way to express what you want is to place the buffer separately, before the scan; this makes it clearer that one branch will be ahead of the other:
broadcast.out(0) ~> flowMap ~> zip.in0
broadcast.out(1) ~> buffer ~> flowScan ~> zip.in1
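For completeness, a sketch of the full flow with the buffer as a separate stage (untested, reusing the names and imports from the question):

val flow = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val broadcast = b.add(Broadcast[Int](2))
  val zip = b.add(Zip[Int, Int])
  val flowMap = b.add(Flow[Int].map(_ * 2))
  // absorbs the one-element head start created by scan's initial value
  val buffer = b.add(Flow[Int].buffer(1, OverflowStrategy.backpressure))
  val flowScan = b.add(Flow[Int].scan(0)(_ + _))

  broadcast.out(0) ~> flowMap ~> zip.in0
  broadcast.out(1) ~> buffer ~> flowScan ~> zip.in1

  FlowShape(broadcast.in, zip.out)
})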