Difficulty with Scatter and Gather pattern with Akka Streams - scala

I have an object like this
case class Foo(id: Int, id1: Option[Int], id2: Option[Int])
Here id1 and id2 are obtained in two separate lookups. So first I scatter using the broadcast and then I do a gather using a merge and a groupBy.
The code I have written is
val source = Source(List(Foo(1), Foo(2), Foo(3), Foo(4)))
val flow1 = Flow[Foo].map(foo => foo.copy(id1 = Some(Random.nextInt())))
val flow2 = Flow[Foo].map(foo => foo.copy(id2 = Some(Random.nextInt())))
val flow3 = Flow[Foo].groupBy(100, foo => foo.id)
val flow4 = Flow[Foo].reduce{case (foo, fooLookup) =>
if (fooLookup.id1.isDefined) foo.copy(id1 = fooLookup.id1)
if (fooLookup.id2.isDefined) foo.copy(id2 = fooLookup.id2)
else foo
}
val sink = Sink.foreach[Foo](println)
val graph = RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder =>
s =>
import GraphDSL.Implicits._
val b = builder.add(Broadcast[Foo](2))
val m = builder.add(Merge[Foo](2))
source ~> b
b ~> flow1 ~> m
b ~> flow2 ~> m
m ~> flow3.mergeSubStreams ~> flow4 ~> s.in
ClosedShape
})
This doesn't compile because the compiler doesn't like the flow3.mergeSubStreams.
My final goal is that the lookup for id1 and id2 happens on two separate branches and I should be able to merge and print the final object which has id, id1 and id2.
Edit:: Another question I have is that since I had split the stream into 2 the reduce function should move forward once it has processed 2 Foos. Right now it seems that the reduce function will wait for the entire stream to end because it doesn't know how many foos it will receive. So is there a way to tell the reducer that once it has received certain number of records it should pass it on to the sink.

Related

Akka Stream - Parallel Processing with Partition

I'm looking for a way to implement/use Fan-out which takes 1 input, and broadcast to N outputs parallel, the difference is that i want to partition them.
Example: 1 input can emit to 4 different outputs, and other input can emit to 2 others outputs, depends on some function f
source ~> partitionWithBroadcast // Outputs to some subset of [0,3] outputs
partitionWithBroadcast(0) ~> ...
partitionWithBroadcast(1) ~> ...
partitionWithBroadcast(2) ~> ...
partitionWithBroadcast(3) ~> ...
I was searching in the Akka documentation but couldn't found any flow which can be suitable
any ideas?
What comes to mind is a FanOutShape with filters attached to each output. NOTE: I am not using the standard Partition operator because it emits to just 1 output. The question asks to emit to any of the connected outputs. E.g.:
def createPartial[E](partitioner: E => Set[Int]) = {
GraphDSL.create[FanOutShape4[E,E,E,E,E]]() { implicit builder =>
import GraphDSL.Implicits._
val flow = builder.add(Flow.fromFunction((e: E) => (e, partitioner(e))))
val broadcast = builder.add(Broadcast[(E, Set[Int])](4))
val flow0 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(0)).map(_._1))
val flow1 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(1)).map(_._1))
val flow2 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(2)).map(_._1))
val flow3 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(3)).map(_._1))
flow.out ~> broadcast.in
broadcast.out(0) ~> flow0.in
broadcast.out(1) ~> flow1.in
broadcast.out(2) ~> flow2.in
broadcast.out(3) ~> flow3.in
new FanOutShape4[E,E,E,E,E](flow.in, flow0.out, flow1.out, flow2.out, flow3.out)
}
}
The partitioner is a function that maps an element from upstream to a tuple having that element and a set of integers that will activate the corresponding output. The graph calculates the desired partitions, then broadcasts the tuple. A flow attached to each of the outputs of the Broadcast selects elements that the partitioner assigned to that output.
Then use it e.g. as:
implicit val system: ActorSystem = ActorSystem()
implicit val ec = system.dispatcher
def partitioner(s: String) = (0 to 3).filter(s(_) == '*').toSet
val src = Source(immutable.Seq("*__*", "**__", "__**", "_*__"))
val sink0 = Sink.seq[String]
val sink1 = Sink.seq[String]
val sink2 = Sink.seq[String]
val sink3 = Sink.seq[String]
def toFutureTuple[X](f0: Future[X], f1: Future[X], f2: Future[X], f3: Future[X]) = f0.zip(f1).zip(f2).map(t => (t._1._1,t._1._2,t._2)).zip(f3).map(t => (t._1._1,t._1._2,t._1._3,t._2))
val g = RunnableGraph.fromGraph(GraphDSL.create(src, sink0, sink1, sink2, sink3)((_,f0,f1,f2,f3) => toFutureTuple(f0,f1,f2,f3)) { implicit builder =>
(in, o0, o1, o2, o3) => {
import GraphDSL.Implicits._
val part = builder.add(createPartial(partitioner))
in ~> part.in
part.out0 ~> o0
part.out1 ~> o1
part.out2 ~> o2
part.out3 ~> o3
ClosedShape
}
})
val result = Await.result(g.run(), 10.seconds)
println("0: " + result._1.mkString(" "))
println("1: " + result._2.mkString(" "))
println("2: " + result._3.mkString(" "))
println("3: " + result._4.mkString(" "))
// Prints:
//
// 0: *__* **__
// 1: **__ _*__
// 2: __**
// 3: *__* __**
First, implement your function to create the Partition:
def partitionFunction4[A](func: A => Int)(implicit builder: GraphDSL.Builder[NotUsed]) = {
// partition with 4 output ports
builder.add(Partition[A](4, inputElement => func(inputElement)))
}
then, create another function to create a Sink with a log function that is going to be used to print in the console the element:
def stream[A](log: A => Unit) = Flow.fromFunction[A, A](el => {
log(el)
el
} ).to(Sink.ignore)
Connect all the elements in the *graph function:
def graph[A](src: Source[A, NotUsed])
(func4: A => Int, log: Int => A => Unit) = {
RunnableGraph
.fromGraph(GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val partition4 = partitionFunction4(func4)
/** Four sinks **/
val flowSet0 = (0 to 4).map(in => log(in))
src ~> partition4.in
partition4.out(0) ~> stream(flowSet0(0))
partition4.out(1) ~> stream(flowSet0(1))
partition4.out(2) ~> stream(flowSet0(2))
partition4.out(3) ~> stream(flowSet0(3))
ClosedShape
})
.run()
}
Create a Source that emits five Int elements. The function to create the partition is "element % 4". Depending on the result of this function the element will be redirected to the specific source:
val source1: Source[Int, NotUsed] = Source(0 to 4)
graph[Int](source1)(f1 => f1 % 4,
in => {
el =>
println(s"Stream ${in} element ${el}")
})
Obtaining as result:
Stream 0 element 0
Stream 1 element 1
Stream 2 element 2
Stream 3 element 3
Stream 0 element 4

How do you deal with futures and mapAsync in Akka Flow?

I built a akka graph DSL defining a simple flow. But the flow f4 takes 3 seconds to send an element while f2 takes 10 seconds.
As a result, I got : 3, 2, 3, 2. But, this is not what I want. As f2 takes too much time, I would like to get : 3, 3, 2, 2. Here's the code...
implicit val actorSystem = ActorSystem("NumberSystem")
implicit val materializer = ActorMaterializer()
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source(List(1, 1))
val out = Sink.foreach(println)
val bcast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)
val f1, f3 = Flow[Int]
val f2= Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper)
val f4= Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).mapAsync[Int](2)(yourMapper2)
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
bcast ~> f4 ~> merge
ClosedShape
})
g.run()
So where am I going wrong ? With future or mapAsync ? or else ...
Thanks
Sorry I'm new in akka, so I'm still learning. To get the expected results, one way is to put async :
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source(List(1, 1))
val out = Sink.foreach(println)
val bcast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val yourMapper: Int => Future[Int] = (i: Int) => Future(i + 1)
val yourMapper2: Int => Future[Int] = (i: Int) => Future(i + 2)
val f1, f3 = Flow[Int]
val f2= Flow[Int].throttle(1, 10.second, 0, ThrottleMode.Shaping).map(_+1)
//.mapAsyncUnordered[Int](2)(yourMapper)
val f4= Flow[Int].throttle(1, 3.second, 0, ThrottleMode.Shaping).map(_+2)
//.mapAsync[Int](2)(yourMapper2)
in ~> f1 ~> bcast ~> f2.async ~> merge ~> f3 ~> out
bcast ~> f4.async ~> merge
ClosedShape
})
g.run()
As you've already figured out, replacing:
mapAsync(i => Future{i + delta})
with:
map(_ + delta).async
in the two flows would achieve what you want.
The different result boils down to the key difference between mapAsync and map + async. While mapAsync enables execution of Futures in parallel threads, the multiple mapAsync flow stages are still being managed by the same underlying actor which would carry out operator fusion before execution (for cost efficiency in general).
On the other hand, async actually introduces an asynchronous boundary into the stream flow with the individual flow stages handled by separate actors. In your case, each of the two flow stages independently emits elements downstream and whichever element emitted first gets consumed first. Inevitably there is a cost for managing the stream across the asynchronous boundary and Akka Stream uses a windowed buffering strategy to amortize the cost (see this Akka Stream doc).
For more details re: difference between mapAsync and async, this blog post might be of interest.
So you are trying to join together the results coming out of f2 and f4. In which case you're trying to do what is sometimes called "scatter gather pattern".
I don't think there are off the shelf ways to implement it, without adding a custom stateful stage that will keep track of outputs from f2 and from f4 and emit a record when both are available. But they are some complications to bear in mind:
What happens if a f2/f4 fails
What happens if they take too long
You need to have unique key for each input record, so you know which output from f2 correspond to f4 (or vice versa)

Akka Streams filter & group by on a collection of keys

I have a stream of
case class Msg(keys: Seq[Char], value: String)
Now I want to filter for a subset of keys e.g.
val filterKeys = Set[Char]('k','f','c') and Filter(k.exists(filterKeys.contains)))
And then split these so certain keys are processed by different flows and then merged back together at the end;
/-key=k-> f1 --\
Source[Msg] ~> Filter ~> router |--key=f-> f2 ----> Merge --> f4
\-key=c-> f3 --/
How should I go about doing this?
FlexiRoute in the old way seemed like a good way to go but in the new API I'm guessing I want to either make a custom GraphStage or create my own graph from the DSL as I see no way to do this through the built-in stages..?
Small Key Set Solution
If your key set is small, and immutable, then a combination of broadcast and filter would probably be the easiest implementation to understand. You first need to define the filter that you described:
def goodKeys(keySet : Set[Char]) = Flow[Msg] filter (_.keys exists keySet.contains)
This can then feed a broadcaster as described in the documentation. All Msg values with good keys will be broadcasted to each of three filters, and each filter will only allow a particular key:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val source : Source[Msg] = ???
val goodKeyFilter = goodKeys(Set('k','f','c'))
val bcast = builder.add(BroadCast[Msg](3))
val merge = builder.add(Merge[Msg](3))
val kKey = goodKeys(Set('k'))
val fKey = goodKeys(Set('f'))
val cKey = goodKeys(Set('c'))
//as described in the question
val f1 : Flow[Msg, Msg, _] = ???
val f2 : Flow[Msg, Msg, _] = ???
val f3 : Flow[Msg, Msg, _] = ???
val f4 : Sink[Msg,_] = ???
source ~> goodKeyFilter ~> bcast ~> kKey ~> f1 ~> merge ~> f4
bcast ~> fKey ~> f2 ~> merge
bcast ~> cKey ~> f3 ~> merge
Large Key Set Solution
If you key set is large, then groupBy is better. Suppose you have a Map of keys to functions:
//e.g. 'k' -> f1
val keyFuncs : Map[Set[Char], (Msg) => Msg]
This map can be used with the groupBy function:
source
.via(goodKeys(Set('k','f','c'))
.groupBy(keyFuncs.size, _.keys)
.map(keyFuncs(_.keys)) //apply one of f1,f2,f3 to the Msg
.mergeSubstreams

How to assemble an Akka Streams sink from multiple file writes?

I'm trying to integrate an akka streams based flow in to my Play 2.5 app. The idea is that you can stream in a photo, then have it written to disk as the raw file, a thumbnailed version and a watermarked version.
I managed to get this working using a graph something like this:
val byteAccumulator = Flow[ByteString].fold(new ByteStringBuilder())((builder, b) => {builder ++= b.toArray})
.map(_.result().toArray)
def toByteArray = Flow[ByteString].map(b => b.toArray)
val graph = Flow.fromGraph(GraphDSL.create() {implicit builder =>
import GraphDSL.Implicits._
val streamFan = builder.add(Broadcast[ByteString](3))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
val output = builder.add(Flow[ByteString].map(x => Success(Done)))
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
streamFan.out(0) ~> rawFileSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
streamFan.out(2) ~> output.in
byteArrayFan.out(0) ~> slowThumbnailProcessing ~> thumbnailFileSink
byteArrayFan.out(1) ~> slowWatermarkProcessing ~> watermarkedFileSink
FlowShape(streamFan.in, output.out)
})
graph
}
Then I wire it in to my play controller using an accumulator like this:
val sink = Sink.head[Try[Done]]
val photoStorageParser = BodyParser { req =>
Accumulator(sink).through(graph).map(Right.apply)
}
The problem is that my two processed file sinks aren't completing and I'm getting zero sizes for both processed files, but not the raw one. My theory is that the accumulator is only waiting on one of the outputs of my fan out, so when the input stream completes and my byteAccumulator spits out the complete file, by the time the processing is finished play has got the materialized value from the output.
So, my questions are:
Am I on the right track with this as far as my approach goes?
What is the expected behaviour for running a graph like this?
How can I bring all my sinks together to form one final sink?
Ok, after a little help (Andreas was on the right track), I've arrived at this solution which does the trick:
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
val graph = Sink.fromGraph(GraphDSL.create(rawFileSink, thumbnailFileSink, watermarkedFileSink)((_, _, _)) {
implicit builder => (rawSink, thumbSink, waterSink) => {
val streamFan = builder.add(Broadcast[ByteString](2))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
streamFan.out(0) ~> rawSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
byteArrayFan.out(0) ~> processorFlow(Thumbnail) ~> thumbSink
byteArrayFan.out(1) ~> processorFlow(Watermarked) ~> waterSink
SinkShape(streamFan.in)
}
})
graph.mapMaterializedValue[Future[Try[Done]]](fs => Future.sequence(Seq(fs._1, fs._2, fs._3)).map(f => Success(Done)))
After which it's dead easy to call this from Play:
val photoStorageParser = BodyParser { req =>
Accumulator(theSink).map(Right.apply)
}
def createImage(path: String) = Action(photoStorageParser) { req =>
Created
}

How to get a Subscriber and Publisher from a broadcasted Akka stream?

I'm having problems in getting Publishers and Subscribers out of my flows when using more complicated graphs. My goal is to provide an API of Publishers and Subscribers and run the Akka streaming internally. Here's my first try, which works just fine.
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val flow = subscriberSource.to(someFunctionSink)
//create Reactive Streams Subscriber
val subscriber: Subscriber[Boolean] = flow.run()
//prints true
Source.single(true).to(Sink(subscriber)).run()
But then with a more complicated broadcast graph, I'm unsure as how to get the Subscriber and Publisher objects out? Do I need a partial graph?
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val publisherSink = Sink.publisher[Boolean]
FlowGraph.closed() { implicit builder =>
import FlowGraph.Implicits._
val broadcast = builder.add(Broadcast[Boolean](2))
subscriberSource ~> broadcast.in
broadcast.out(0) ~> someFunctionSink
broadcast.out(1) ~> publisherSink
}.run()
val subscriber: Subscriber[Boolean] = ???
val publisher: Publisher[Boolean] = ???
When you call RunnableGraph.run() the stream is run and the result is the "materialized value" for that run.
In your simple example the materialized value of Source.subscriber[Boolean] is Subscriber[Boolean]. In your complex example you want to combine materialized values of several components of your graph to a materialized value that is a tuple (Subscriber[Boolean], Publisher[Boolean]).
You can do that by passing the components for which you are interested in their materialized values to FlowGraph.closed() and then specify a function to combine the materialized values:
import akka.stream.scaladsl._
import org.reactivestreams._
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val publisherSink = Sink.publisher[Boolean]
val graph =
FlowGraph.closed(subscriberSource, publisherSink)(Keep.both) { implicit builder ⇒
(in, out) ⇒
import FlowGraph.Implicits._
val broadcast = builder.add(Broadcast[Boolean](2))
in ~> broadcast.in
broadcast.out(0) ~> someFunctionSink
broadcast.out(1) ~> out
}
val (subscriber: Subscriber[Boolean], publisher: Publisher[Boolean]) = graph.run()
See the Scaladocs for more information about the overloads of FlowGraph.closed.
(Keep.both is short for a function (a, b) => (a, b))