Elegant way of reusing akka-stream flows - scala

I am looking for a way to easily reuse akka-stream flows.
I treat the Flow I intend to reuse as a function, so I would like to keep its signature like:
Flow[Input, Output, NotUsed]
Now when I use this flow I would like to be able to 'call' it and keep the result aside for further processing.
So I want to start with a flow emitting [Input], apply my flow, and proceed with a flow emitting [(Input, Output)].
example:
val s: Source[Int, NotUsed] = Source(1 to 10)
val stringIfEven = Flow[Int].filter(_ % 2 == 0).map(_.toString)
val via: Source[(Int, String), NotUsed] = ???
Now this is not possible in a straightforward way, because combining the flow with .via() would give me a flow emitting just [Output]:
val via: Source[String, NotUsed] = s.via(stringIfEven)
An alternative is to make my reusable flow emit [(Input, Output)], but this requires every flow to push its input through all the stages, and it makes my code look bad.
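For illustration, a sketch of that rejected alternative (my own example, not from the original question): every stage has to carry the input along by hand.
val stringIfEvenTupled: Flow[Int, (Int, String), NotUsed] =
  Flow[Int].filter(_ % 2 == 0).map(i => (i, i.toString)) // tuple plumbing leaks into the flow itself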
So I came up with a combiner like this:
def tupledFlow[In, Out](flow: Flow[In, Out, _]): Flow[In, (In, Out), NotUsed] = {
  Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val broadcast = b.add(Broadcast[In](2))
    val zip = b.add(Zip[In, Out])
    broadcast.out(0) ~> zip.in0
    broadcast.out(1) ~> flow ~> zip.in1
    FlowShape(broadcast.in, zip.out)
  })
}
This broadcasts the input both to the flow and, in a parallel line, directly to a Zip stage, where I join the values into a tuple. It can then be elegantly applied:
val tupled: Source[(Int, String), NotUsed] = s.via(tupledFlow(stringIfEven))
Everything works great, but when the given flow performs a filter operation, this combiner gets stuck and stops processing further events.
I guess that is due to Zip's behaviour, which requires all of its inputs to emit: in my case one branch passes the element through directly, so the other branch cannot drop it with filter(), and when it does, the whole flow stops because Zip is still waiting for a push.
Is there a better way to achieve flow composition?
Is there anything I can do in my tupledFlow to get desired behaviour when 'flow' ignores elements with 'filter' ?

Two possible approaches - with debatable elegance - are:
1) avoid filtering stages, turning your filter into a Flow[Int, Option[Int], NotUsed]. This way you can apply your zipping wrapper around your whole graph, as was your original plan. However, the code looks more tainted, and there is added overhead from passing around Nones:
val stringIfEvenOrNone = Flow[Int].map {
  case x if x % 2 == 0 => Some(x.toString)
  case _ => None
}
val tupled: Source[(Int, String), NotUsed] = s.via(tupledFlow(stringIfEvenOrNone)).collect {
  case (num, Some(str)) => (num, str)
}
2) separate the filtering and transforming stages, and apply the filtering ones before your zipping wrapper. This is probably a more lightweight and better compromise:
val filterEven = Flow[Int].filter(_ % 2 == 0)
val toString = Flow[Int].map(_.toString)
val tupled: Source[(Int, String), NotUsed] = s.via(filterEven).via(tupledFlow(toString))
EDIT
3) Posting another solution here for clarity, as per the discussions in the comments.
This flow wrapper emits each element produced by a given flow, paired with the original input element that generated it. It works for any kind of inner flow (emitting 0, 1, or more elements for each input):
def tupledFlow[In, Out](flow: Flow[In, Out, _]): Flow[In, (In, Out), NotUsed] =
  Flow[In].flatMapConcat(in => Source.single(in).via(flow).map(out => in -> out))
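A quick sanity check of my own, reusing the stringIfEven flow from the question: odd inputs are simply dropped inside the inner flow, so filtering no longer deadlocks the graph.
val tupled: Source[(Int, String), NotUsed] = s.via(tupledFlow(stringIfEven))
// emits (2,"2"), (4,"4"), (6,"6"), (8,"8"), (10,"10")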

I came up with an implementation of TupledFlow that works when the wrapped Flow uses filter() or mapAsync(), and when the wrapped Flow emits 0, 1, or N elements for every input:
def tupledFlow[In, Out](flow: Flow[In, Out, _])(implicit materializer: Materializer, executionContext: ExecutionContext): Flow[In, (In, Out), NotUsed] = {
  val v: Flow[In, Seq[(In, Out)], NotUsed] = Flow[In].mapAsync(4) { in: In =>
    val outFuture: Future[Seq[Out]] = Source.single(in).via(flow).runWith(Sink.seq)
    val bothFuture: Future[Seq[(In, Out)]] = outFuture.map(seqOfOut => seqOfOut.map((in, _)))
    bothFuture
  }
  val onlyDefined: Flow[In, (In, Out), NotUsed] = v.mapConcat[(In, Out)](seq => seq.to[scala.collection.immutable.Iterable])
  onlyDefined
}
The only drawback I see here is that I am instantiating and materializing a flow for every single element, just to get the notion of 'calling a flow as a function'.
I didn't do any performance tests on that; however, since the heavy lifting is done in the wrapped Flow, which is executed in a Future, I believe this will be OK.
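A minimal usage sketch of my own (it assumes an implicit ActorSystem, Materializer, and ExecutionContext are in scope), wrapping an inner flow that uses mapAsync:
val asyncStringify = Flow[Int].mapAsync(2)(i => Future(i.toString))
val paired: Source[(Int, String), NotUsed] = Source(1 to 5).via(tupledFlow(asyncStringify))
// emits (1,"1"), (2,"2"), (3,"3"), (4,"4"), (5,"5")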
This implementation passes all the tests from https://gist.github.com/kretes/8d5f2925de55b2a274148b69f79e55ac#file-tupledflowspec-scala

Related

Akka combining Sinks without access to Flows

I am using an API that accepts a single Akka Sink and fills it with data:
def fillSink(sink:Sink[String, _])
Is there a way, without delving into the depths of akka, to handle the output with two sinks instead of one?
For example
val mySink1:Sink = ...
val mySink2:Sink = ...
//something
fillSink( bothSinks )
If I had access to the Flow used by the fillSink method I could use flow.alsoTo(mySink1).to(mySink2) but the flow is not exposed.
The only workaround at the moment is to pass a single Sink which handles the strings and passes them on to two StringBuilders to replace mySink1/mySink2, but it feels like that defeats the point of Akka. Without spending a couple of days learning Akka, I can't tell if there is a way to split output across sinks.
Thanks!
The combine Sink operator, which combines two or more Sinks using a provided Int => Graph[UniformFanOutShape[T, U], NotUsed] function, might be what you're seeking:
def combine[T, U](first: Sink[U, _], second: Sink[U, _], rest: Sink[U, _]*)(strategy: Int => Graph[UniformFanOutShape[T, U], NotUsed]): Sink[T, NotUsed]
A trivialized example:
val doubleSink = Sink.foreach[Int](i => println(s"Doubler: ${i * 2}"))
val tripleSink = Sink.foreach[Int](i => println(s"Tripler: ${i * 3}"))
val combinedSink = Sink.combine(doubleSink, tripleSink)(Broadcast[Int](_))
Source(List(1, 2, 3)).runWith(combinedSink)
// Doubler: 2
// Tripler: 3
// Doubler: 4
// Tripler: 6
// Doubler: 6
// Tripler: 9

Akka streams pass through flow limiting Parallelism / throughput of processing flow

I have a use case where I want to send a message to an external system, but the flow that sends this message takes and returns a type I can't use downstream. This is a great use case for a pass-through flow. I am using the implementation here. Initially I was worried that if the processingFlow uses mapAsyncUnordered then this flow wouldn't work, since the processing flow may reorder messages and the ZipWith may push out a tuple with the incorrect pair. E.g. in the following example:
val testSource = Source(1 until 50)
val processingFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsyncUnordered(10)(x => Future {
  Thread.sleep(Random.nextInt(50))
  x * 10
})
val passThroughFlow = PassThroughFlow(processingFlow, Keep.both)
val future = testSource.via(passThroughFlow).runWith(Sink.seq)
I would expect that the processing flow could reorder its outputs with respect to its inputs, and that I would get a result such as:
[(30,1), (40,2), (10,3), (10,4), ...]
with the right element (the passed-through value) always in order, but the left element, which goes through my mapAsyncUnordered, potentially joined with an incorrect element to make a bad tuple.
Instead I actually get:
[(10,1), (20,2), (30,3), (40,4), ...]
every time. Upon further investigation I noticed the code was running slowly, and in fact it's not running in parallel at all, despite my mapAsyncUnordered. I tried introducing a buffer before and after, as well as an async boundary, but it always seems to run sequentially. This explains why it is always ordered, but I want my processing flow to have a higher throughput.
I came up with the following work around:
object PassThroughFlow {
  def keepRight[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
    keepBoth[A, A1](processingFlow).map(_._2)

  def keepBoth[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val broadcast = builder.add(Broadcast[A](2))
      val zip = builder.add(ZipWith[A1, A, (A1, A)]((left, right) => (left, right)))
      broadcast.out(0) ~> processingFlow ~> zip.in0
      broadcast.out(1) ~> zip.in1
      FlowShape(broadcast.in, zip.out)
    })
}
object ParallelPassThroughFlow {
  def keepRight[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
    keepBoth(parallelism, processingFlow).map(_._2)

  def keepBoth[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val fanOut = builder.add(Balance[A](outputPorts = parallelism))
      val merger = builder.add(Merge[(A1, A)](inputPorts = parallelism, eagerComplete = false))
      Range(0, parallelism).foreach { n =>
        val passThrough = PassThroughFlow.keepBoth(processingFlow)
        fanOut.out(n) ~> passThrough ~> merger.in(n)
      }
      FlowShape(fanOut.in, merger.out)
    })
}
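A usage sketch of my own, reusing testSource and processingFlow from above:
val parallelFlow = ParallelPassThroughFlow.keepBoth(parallelism = 10, processingFlow)
val results: Future[Seq[(Int, Int)]] = testSource.via(parallelFlow).runWith(Sink.seq)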
Two questions:
In the original implementation, why does the zip inside the pass-through flow limit the parallelism of mapAsyncUnordered?
Is my workaround sound, or could it be improved? I basically fan out the input to multiple stacks of the pass-through flow and merge it all back together. It seems to have the properties that I want (parallel, yet maintains order even if the processing flow reorders), but something doesn't feel right.
The behavior you're witnessing is a result of how broadcast and zip work: broadcast emits downstream when all of its outputs signal demand; zip waits for all of its inputs before signaling demand (and emitting downstream).
broadcast.out(0) ~> processingFlow ~> zip.in0
broadcast.out(1) ~> zip.in1
Consider the movement of the first element (1) through the above graph. 1 is broadcast to both processingFlow and zip. zip immediately receives one of its inputs (1) and waits for its other input (10), which will take a little longer to arrive. Only when zip gets both 1 and 10 does it pull for more elements from upstream, thus triggering the movement of the second element (2) through the stream. And so on.
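To make the lockstep concrete, here is a sketch of my own (not part of the original answer): buffering the bypass branch restores demand, so mapAsyncUnordered regains its parallelism. But if processingFlow reorders elements, Zip will then pair mismatched tuples, which is exactly the hazard the question describes, so this variant is only safe for order-preserving flows.
def bufferedKeepBoth[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[A](2))
    val zip = builder.add(Zip[A1, A])
    broadcast.out(0) ~> processingFlow ~> zip.in0
    // the buffer keeps this branch from throttling Broadcast into lockstep with zip
    broadcast.out(1) ~> Flow[A].buffer(parallelism, OverflowStrategy.backpressure) ~> zip.in1
    FlowShape(broadcast.in, zip.out)
  })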
As for your ParallelPassThroughFlow, I don't know why "something doesn't feel right" to you.

How to explain this Akka Streams graph from official doc?

I have a couple of questions for this sample code hosted officially here:
val topHeadSink = Sink.head[Int]
val bottomHeadSink = Sink.head[Int]
val sharedDoubler = Flow[Int].map(_ * 2)
RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
  (topHS, bottomHS) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))
    Source.single(1) ~> broadcast.in
    broadcast.out(0) ~> sharedDoubler ~> topHS.in
    broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
    ClosedShape
})
When do you pass in a graph through create?
Why are topHeadSink, bottomHeadSink passed in through create, but sharedDoubler is not? What is the difference between them?
When do you need builder.add?
Can I create a broadcast outside the graph without builder.add? If I add a couple of flows inside the graph, should I add the flows via builder.add as well? It is very confusing that sometimes we need builder.add and sometimes we do not.
Update
I feel this is still confusing:
The difference between these approaches is that importing using builder.add(...) ignores the materialized value of the imported graph, while importing via the factory method allows its inclusion.
topHS, bottomHS are imported from create, so they will keep their materialized value. What if I do builder.add(topHS)?
And how do you explain sharedDoubler: does it have a materialized value or not? What if I use builder.add with it?
What does this mean, the ((_,_)) of GraphDSL.create(topHeadSink, bottomHeadSink)((_, _))?
It looks like boilerplate we just need, but I am not sure what it is.
When do you pass in a graph through create?
When you want to obtain the materialized value of the graph that you pass to the create factory method. The type of the RunnableGraph in your question is RunnableGraph[(Future[Int], Future[Int])], meaning that the materialized value of the graph is (Future[Int], Future[Int]):
val g = RunnableGraph.fromGraph(...).run() // (Future[Int], Future[Int])
val topHeadSinkResult = g._1 // Future[Int]
val bottomHeadSinkResult = g._2 // Future[Int]
Now consider the following variant, which defines the sinks "inside" the graph and discards the materialized value:
val g2 = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val topHeadSink = Sink.head[Int]
  val bottomHeadSink = Sink.head[Int]
  val broadcast = builder.add(Broadcast[Int](2))
  Source.single(1) ~> broadcast.in
  broadcast.out(0) ~> sharedDoubler ~> topHeadSink
  broadcast.out(1) ~> sharedDoubler ~> bottomHeadSink
  ClosedShape
}).run() // NotUsed
The value of g2 is NotUsed.
When do you need builder.add?
All of the components of a graph must be added to the builder, but there are variants of the ~> operator that add the most commonly used components, such as Source and Flow, to the builder under the covers. However, junction operations that perform a fan-in (such as Merge) or a fan-out (such as Broadcast) must be explicitly passed to builder.add if you're using the Graph DSL.
Note that for simple graphs, you can use junctions without having to use the Graph DSL. Here is an example from the documentation:
val sendRemotely = Sink.actorRef(actorRef, "Done")
val localProcessing = Sink.foreach[Int](_ => /* do something useful */ ())
val sink = Sink.combine(sendRemotely, localProcessing)(Broadcast[Int](_))
Source(List(0, 1, 2)).runWith(sink)
What does the ((_, _)) of GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) mean?
It's a curried parameter that specifies which materialized value(s) you want to retain. Using ((_, _)) here is the same as:
val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((t, b) => (t, b)) {
  implicit builder => (topHS, bottomHS) =>
    ...
}).run() // (Future[Int], Future[Int])
In other words, ((_, _)) in this context is shorthand for ((t, b) => (t, b)), which preserves the respective materialized values of the two sinks that are passed in. If, for example, you want to keep only the materialized value of topHeadSink, you could change the call to the following:
val g = RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((t, _) => t) {
  implicit builder => (topHS, bottomHS) =>
    ...
}).run() // Future[Int]

Convert into a For Comprehension in Scala

Testing this I can see that it works:
def twoHtmlFutures = Action { request =>
  val async1 = as1.index(embed = true)(request) // Future[Result]
  val async2 = as2.index(embed = true)(request) // Future[Result]
  val async1Html = async1.flatMap(x => Pagelet.readBody(x)) // Future[Html]
  val async2Html = async2.flatMap(x => Pagelet.readBody(x)) // Future[Html]
  val source1 = Source.fromFuture(async1Html) // Source[Html, NotUsed]
  val source2 = Source.fromFuture(async2Html) // Source[Html, NotUsed]
  val merged = source1.merge(source2) // Source[Html, NotUsed]
  Ok.chunked(merged)
}
But trying to put it into a For Comprehension is not working for me. This is what I tried:
def twoHtmlFutures2 = Action.async { request =>
  val async1 = as1.index(embed = true)(request)
  val async2 = as2.index(embed = true)(request)
  for {
    async1Res <- async1 // from Future[Result] to Result
    async2Res <- async2 // from Future[Result] to Result
    async1Html <- Pagelet.readBody(async1Res) // from Result to Html
    async2Html <- Pagelet.readBody(async2Res) // from Result to Html
  } yield {
    val source1 = single(async1Html) // from Html to Source[Html, NotUsed]
    val source2 = single(async2Html) // from Html to Source[Html, NotUsed]
    val merged = source1.merge(source2) // Source[Html, NotUsed]
    Ok.chunked(merged)
  }
}
But this just renders on-screen all at once, rather than arriving piece by piece (streamed) as the first example does. Any helpers out there to widen my eyelids? Thanks
Monads are a sequencing shape, and Future models this as causal dependence (first this future completes, then that future completes):
val x = Future(something).map(_ => somethingElse)
// or
val y = Future(something).flatMap(_ => Future(somethingElse))
However, there's a little trick one can do in for comprehensions:
def twoHtmlFutures = Action { request =>
  Ok.chunked(
    Source.fromFutureSource(
      for {
        _ <- Future.unit // for Scala versions <= 2.11 use Future.successful(())
        async1 = as1.index(embed = true)(request) // Future[Result]
        async2 = as2.index(embed = true)(request) // Future[Result]
        async1Html = async1.flatMap(x => Pagelet.readBody(x)) // Future[Html]
        async2Html = async2.flatMap(x => Pagelet.readBody(x)) // Future[Html]
        source1 = Source.fromFuture(async1Html) // Source[Html, NotUsed]
        source2 = Source.fromFuture(async2Html) // Source[Html, NotUsed]
      } yield source1.merge(source2) // Source[Html, NotUsed]
    )
  )
}
I describe this technique in greater detail in this blogpost.
An alternate solution to your problem could be:
def twoHtmlFutures = Action { request =>
  Ok.chunked(
    Source.fromFuture(as1.index(embed = true)(request))
      .merge(Source.fromFuture(as2.index(embed = true)(request)))
      .mapAsyncUnordered(2)(b => Pagelet.readBody(b))
  )
}
A for comprehension, and the flatMap it desugars to, are used to sequence things.
In the context of Future, this means that in a for comprehension, each statement is started only once the previous one has successfully ended.
In your case, you want two Futures to run in parallel, and that is not what flatMap (or a for comprehension) gives you.
What your code does is the following:
do the first index call
when that's over, do the second index call
when that's over, do the first readBody
when that's over, do the second readBody
when that's over, create two (synchronous) sources with the values from the two previous steps, merge them, and start returning the merged source as a chunked response.
What your previous code did was
do the first index call
when that's over, do the first readBody
in the meantime, do the same for the second index and readBody
in the meantime, create a source that will output an element when the first readBody yields a result
in the meantime, do the same for the second
merge these two sources, and immediately start giving the merged output as a chunked response.
So, in this case, you start your chunked response just after receiving the request (but with nothing inside yet, waiting for the Futures to be resolved), while in the former case, you wait at each computation for the previous one to be over, even if you don't need its result to go on.
What you should remember is that you should use flatMap on a Future only if you need the result of a previous computation, or if you want another computation to finish before doing something else. The same goes for for comprehensions, which are just a nice-looking way of chaining flatMaps.
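To make the contrast concrete, a small sketch of my own (slowA and slowB are hypothetical () => Future[Int] calls):
// Sequential: slowB() is not even started until slowA()'s future completes.
val sequential: Future[Int] = for {
  a <- slowA()
  b <- slowB()
} yield a + b

// Parallel: both futures are started before being chained, so they run concurrently.
val fa = slowA()
val fb = slowB()
val parallel: Future[Int] = for {
  a <- fa
  b <- fb
} yield a + b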
Here's what your proposed for comprehension looks like after it's been desugared (courtesy of IntelliJ's "Desugar Scala code ..." menu option):
async1.flatMap((async1Res: Nothing) =>
  async2.flatMap((async2Res: Nothing) =>
    Pagelet.readBody(async1Res).flatMap((async1Html: Nothing) =>
      Pagelet.readBody(async2Res).map((async2Html: Nothing) =>
        Ok.chunked(merged)))))
As you can see, the nesting, and the concluding flatMap/map pair, are very different from your original code plan.
As a general rule, every <- in a single for comprehension is turned into a flatMap() except for the final one, which is a map(), and each is nested inside the previous.
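For example (a generic illustration of the rule, with fa, fb, fc standing for any monadic values):
// for { a <- fa; b <- fb; c <- fc } yield g(a, b, c)
// desugars to:
fa.flatMap(a => fb.flatMap(b => fc.map(c => g(a, b, c))))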

How to deal with source that emits Future[T]?

Let's say I have some iterator:
val nextElemIter: Iterator[Future[Int]] = Iterator.continually(...)
And I want to build a source from that iterator:
val source: Source[Future[Int], NotUsed] =
  Source.fromIterator(() => nextElemIter)
So now my source emits Futures. I have never seen futures being passed between stages in Akka docs or anywhere else, so instead, I could always do something like this:
val source: Source[Int, NotUsed] =
  Source.fromIterator(() => nextElemIter).mapAsync(1)(identity)
And now I have a regular source that emits T instead of Future[T]. But this feels hacky and wrong.
What's the proper way to deal with such situations?
Answering your question directly: I agree with Vladimir's comment that there is nothing "hacky" about using mapAsync for the purpose you described. I can't think of any more direct way to unwrap the Future from around your underlying Int values.
Answering your question indirectly...
Try to stick with Futures
Streams, as a concurrency mechanism, are incredibly useful when backpressure is required. However, pure Future operations have their place in applications as well.
If your Iterator[Future[Int]] is going to produce a known, limited number of Future values, then you may want to stick with using Futures for concurrency.
Imagine you want to filter, map, & reduce the Int values.
def isGoodInt(i: Int): Boolean = ???        // filter
def transformInt(i: Int): Int = ???         // map
def combineInts(i: Int, j: Int): Int = ???  // reduce
Futures provide a direct way of using these functions:
val finalVal: Future[Int] =
  Future.sequence(nextElemIter.toSeq) // launch all of the Futures
    .map { ints =>
      val transformed = for (i <- ints if isGoodInt(i)) yield transformInt(i)
      transformed.reduce(combineInts)
    }
Compared with a somewhat indirect way of using the Stream as you suggested:
val finalVal: Future[Int] =
  Source.fromIterator(() => nextElemIter)
    .via(Flow[Future[Int]].mapAsync(1)(identity))
    .via(Flow[Int].filter(isGoodInt))
    .via(Flow[Int].map(transformInt))
    .runWith(Sink.reduce(combineInts)) // runWith keeps the Sink's materialized Future[Int]