How can I pipe the output of an Akka Streams Merge to another Flow? - scala

I'm playing with Akka Streams and have figured out most of the basics, but I'm not clear on how to take the result of a Merge and do further operations (map, filter, fold, etc.) on it.
I'd like to modify the following code so that, instead of piping the merge to a sink, I can manipulate the data further.
implicit val materializer = FlowMaterializer()

val items_a = Source(List(10, 20, 30, 40, 50))
val items_b = Source(List(60, 70, 80, 90, 100))
val sink = ForeachSink[Int](println)

val materialized = FlowGraph { implicit builder =>
  import FlowGraphImplicits._
  val merge = Merge[Int]("m1")
  items_a ~> merge
  items_b ~> merge ~> sink
}.run()
I guess my primary problem is that I can't figure out how to make a flow component that doesn't have a source, and I can't figure out how to do a merge without using the special Merge object and the ~> syntax.
EDIT: This question and its answer were written for and worked with Akka Streams 0.11

If you don't care about the semantics of Merge, where elements go downstream in an effectively random order, then you could just try concat on Source instead, like so:
items_a.concat(items_b).map(_ * 2).map(_.toString).foreach(println)
The difference here is that all items from items_a will flow downstream before any element of items_b. If you really need the behavior of Merge, then you could consider something like the following (keep in mind that you will eventually need a sink, but you can certainly do additional transforms after the merging):
val items_a = Source(List(10, 20, 30, 40, 50))
val items_b = Source(List(60, 70, 80, 90, 100))
val sink = ForeachSink[Double](println)
val transform = Flow[Int].map(_ * 2).map(_.toDouble).to(sink)

val materialized = FlowGraph { implicit builder =>
  import FlowGraphImplicits._
  val merge = Merge[Int]("m1")
  items_a ~> merge
  items_b ~> merge ~> transform
}.run()
In this example, you can see that I use the helper from the Flow companion object to create a Flow without a specific input Source. From there I can attach it to the merge point to get my additional processing.
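A side note for readers on a current Akka Streams release: the answer above uses the long-gone 0.11 API, but the same merge-then-transform shape translates directly to the modern GraphDSL. A sketch, assuming Akka Streams 2.4 or later:
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}

// A standalone Flow[Int] is attached after the merge junction,
// mirroring `transform` from the 0.11 example above.
val modernGraph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val merge = b.add(Merge[Int](2))
  Source(List(10, 20, 30, 40, 50)) ~> merge
  Source(List(60, 70, 80, 90, 100)) ~> merge
  merge.out ~> Flow[Int].map(_ * 2).map(_.toDouble) ~> Sink.foreach[Double](println)
  ClosedShape
})
modernGraph.run() // run() requires an implicit Materializer (implicit ActorSystem on 2.6+)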

Use Source.combine:
val items_a :: items_b :: items_c = List(
  Source(List(10, 20, 30, 40, 50)),
  Source(List(60, 70, 80, 90, 100)),
  Source(List(110, 120, 130, 140, 150))
)

// items_c is bound to the remaining List of sources, hence the `: _*` expansion
Source.combine(items_a, items_b, items_c: _*)(Merge(_))
  .map(_ + 1)
  .runForeach(println)

Or, if you need to preserve the order of the input sources (i.e. everything from items_a before items_b, and everything from items_b before items_c), you can use Concat instead of Merge:
val items_a :: items_b :: items_c = List(
  Source(List(10, 20, 30, 40, 50)),
  Source(List(60, 70, 80, 90, 100)),
  Source(List(110, 120, 130, 140, 150))
)

Source.combine(items_a, items_b, items_c: _*)(Concat(_))
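A third option, not mentioned above: if you want the inputs woven together deterministically rather than concatenated or merged as-available, interleave does a round-robin weave (a sketch; segmentSize = 1 takes one element from each source per turn):
// Emits 10, 60, 20, 70, 30, 80, ...: one element from each source in turn.
items_a.interleave(items_b, segmentSize = 1).runForeach(println)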

Related

Concatenating two Flows in Akka Streams

I am trying to concat two Flows and I am not able to explain the output of my implementation.
val source = Source(1 to 10)
val sink = Sink.foreach(println)
val flow1 = Flow[Int].map(s => s + 1)
val flow2 = Flow[Int].map(s => s * 10)

val flowGraph = Flow.fromGraph(
  GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._
    val concat = builder.add(Concat[Int](2))
    val broadcast = builder.add(Broadcast[Int](2))
    broadcast ~> flow1 ~> concat.in(0)
    broadcast ~> flow2 ~> concat.in(1)
    FlowShape(broadcast.in, concat.out)
  }
)

source.via(flowGraph).runWith(sink)
I expect the following output from this code.
2
3
4
.
.
.
11
10
20
.
.
.
100
Instead, I see only "2" being printed. Can you please explain what is wrong in my implementation and how I should change the program to get the desired output?
From Akka Stream's API docs:
Concat:
Emits when the current stream has an element available; if the current input completes, it tries the next one
Broadcast:
Emits when all of the outputs stops backpressuring and there is an input element available
The two operators won't work in conjunction because of a conflict in how they work: Concat tries to pull all elements from one of Broadcast's outputs before switching to the other one, whereas Broadcast won't emit unless there is demand on ALL of its outputs. The first element squeezes through thanks to the operators' internal buffers; after that, Broadcast waits for demand on the flow2 branch that Concat is still ignoring, and the stream deadlocks.
For what you need, you could concatenate using concat as suggested by commenters:
source.via(flow1).concat(source.via(flow2)).runWith(sink)
or equivalently, use Source.combine like below:
Source.combine(source.via(flow1), source.via(flow2))(Concat[Int](_)).runWith(sink)
Using GraphDSL, which is a simplified version of the implementation of Source.combine:
val sg = Source.fromGraph(
GraphDSL.create(){ implicit builder =>
import GraphDSL.Implicits._
val concat = builder.add(Concat[Int](2))
source ~> flow1 ~> concat
source ~> flow2 ~> concat
SourceShape(concat.out)
}
)
sg.runWith(sink)
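As a side note (my addition, not part of the original answer): the broadcast+concat shape from the question can be forced to complete by buffering the branch that Concat reads second, but only because this particular stream (10 elements) fits in the buffer; it does not generalize to long or unbounded streams:
import akka.stream.OverflowStrategy

val flowGraphBuffered = Flow.fromGraph(
  GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._
    val concat = builder.add(Concat[Int](2))
    val broadcast = builder.add(Broadcast[Int](2))
    broadcast ~> flow1 ~> concat.in(0)
    // The buffer keeps demand open on this branch while Concat drains
    // the first one; this works here only because all 10 elements fit.
    broadcast ~> flow2.buffer(16, OverflowStrategy.backpressure) ~> concat.in(1)
    FlowShape(broadcast.in, concat.out)
  }
)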

Equivalent to balancer, broadcast and merge in pure Akka Streams

In Akka Streams, using the Graph DSL builder, I can use the Balance, Broadcast and Merge operators:
Flow.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val balancer = builder.add(Balance[Result1](2))
  val merger = builder.add(Merge[Result2](2))
  balancer.out(0) ~> step1.async ~> step2.async ~> merger.in(0)
  balancer.out(1) ~> step1.async ~> step2.async ~> merger.in(1)
  FlowShape(balancer.in, merger.out)
})
How can I achieve the same logic using the plain Source, Sink and Flow API?
I can do something like this:
source.mapAsync(2)(Future(...))
But, as far as I can see, semantically it is not fully equivalent to the first example.
Use Source.combine and Sink.combine. From the documentation:
There is a simplified API you can use to combine sources and sinks with junctions like: Broadcast[T], Balance[T], Merge[In] and Concat[A] without the need for using the Graph DSL. The combine method takes care of constructing the necessary graph underneath. In following example we combine two sources into one (fan-in):
val sourceOne = Source(List(1))
val sourceTwo = Source(List(2))
val merged = Source.combine(sourceOne, sourceTwo)(Merge(_))
val mergedResult: Future[Int] = merged.runWith(Sink.fold(0)(_ + _))
The same can be done for a Sink[T] but in this case it will be fan-out:
val sendRemotely = Sink.actorRef(actorRef, "Done")
val localProcessing = Sink.foreach[Int](_ => /* do something useful */ ())
val sink = Sink.combine(sendRemotely, localProcessing)(Broadcast[Int](_))
Source(List(0, 1, 2)).runWith(sink)
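On the mapAsync remark in the question: mapAsync preserves the upstream ordering of results, while a balance/merge graph emits results as each branch finishes. If that reordering is acceptable, mapAsyncUnordered is the closer plain-API analogue. A sketch, where step is a hypothetical stand-in for the question's step1/step2 processing:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import akka.stream.scaladsl.Source

// Hypothetical stand-in for the question's processing steps:
def step(x: Int): Int = x * 2

val source = Source(1 to 10)

// Emits results as the futures complete (like balance+merge), whereas
// plain mapAsync would hold results back to preserve upstream order.
source.mapAsyncUnordered(2)(x => Future(step(x)))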

Eliminating internal collecting when constructing a Source as a response to data

I have a Flow (createDataPointFlow) that is built with a mapAsync stage which collects data points (via Sink.seq), although I would rather stream them directly without collecting first.
However, it is not obvious to me how to do this. It seems I need some mechanism to publish my items directly to the output side of the flow I am creating, but I'm new to this and don't know how to do that without involving explicit actors, which I would like to avoid.
How can I achieve this without collecting things to a Sink first? Remember, what I want is full streaming, without the explicit buffering that Sink.seq(...) introduces.
object MyProcess {

  def createDataSource(job: Job, dao: DataService): Source[JobDataPoint, NotUsed] = {
    // Imagine the below call is equivalent to streaming a parameterized query using Slick
    val publisher: Publisher[JobDataPoint] = dao.streamData(Criteria(job.name, job.data))

    // Convert to a Source
    val src: Source[JobDataPoint, NotUsed] = Source.fromPublisher(publisher)
    src
  }

  def createDataPointFlow(dao: DataService, parallelism: Int = 1): Flow[Job, JobDataPoint, NotUsed] =
    Flow[Job].mapAsync(parallelism)(job =>
      createDataSource(job, dao).toMat(Sink.seq)(Keep.right).run()
    ).mapConcat(identity)

  def apply(src: Source[Job, NotUsed], dao: DataService, parallelism: Int = 5) =
    RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._

      // Source
      val jobs: Outlet[Job] = builder.add(src).out
      //val bcastJobsSrc: Source[Job, NotUsed] = src.toMat(BroadcastHub.sink(256))(Keep.right).run()
      //val bcastOutlet: Outlet[Job] = builder.add(bcastJobsSrc).out

      // Flows
      val bcastJobs: UniformFanOutShape[Job, Job] = builder.add(Broadcast[Job](4))
      val rptMaker = builder.add(MyProcessors.flow(dao, parallelism))
      val dpFlow = createDataPointFlow(dao, parallelism)

      // Sinks
      val jobPrinter: Inlet[Job] = builder.add(Sink.foreach[Job](job => println(s"[MyGraph] Received job: ${job.name} => $job"))).in
      val jobList: Inlet[Job] = builder.add(Sink.fold(List.empty[Job])((list, job: Job) => job :: list)).in
      val reporter: Inlet[ReportTable] = builder.add(Sink.foreach[ReportTable](r => println(s"[Report]: $r"))).in
      val dpSink: Inlet[JobDataPoint] = builder.add(Sink.foreach[JobDataPoint](dp => println(s"[DataPoint]: $dp"))).in

      jobs ~> bcastJobs
      bcastJobs ~> jobPrinter
      bcastJobs ~> jobList
      bcastJobs ~> rptMaker ~> reporter
      bcastJobs ~> dpFlow ~> dpSink

      ClosedShape
    })
}
So, after re-reading the documentation about the various stages available, it turns out that what I needed was flatMapConcat:
def createDataPointFlow(dao: DataService, parallelism: Int = 1): Flow[Job, JobDataPoint, NotUsed] =
  Flow[Job].flatMapConcat(createDataSource(_, dao))
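If interleaving data points across jobs is acceptable and you want to make use of the parallelism parameter, flatMapMerge is the concurrent sibling of flatMapConcat. A sketch (the Merged suffix on the name is my invention):
// Consumes up to `parallelism` per-job sources at once; elements from
// different jobs may interleave, unlike with flatMapConcat.
def createDataPointFlowMerged(dao: DataService, parallelism: Int = 1): Flow[Job, JobDataPoint, NotUsed] =
  Flow[Job].flatMapMerge(parallelism, createDataSource(_, dao))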

Akka Streams filter & group by on a collection of keys

I have a stream of
case class Msg(keys: Seq[Char], value: String)
Now I want to filter for a subset of keys, e.g.
val filterKeys = Set[Char]('k', 'f', 'c')
val filter = Flow[Msg].filter(_.keys.exists(filterKeys.contains))
And then split these so certain keys are processed by different flows and then merged back together at the end;
/-key=k-> f1 --\
Source[Msg] ~> Filter ~> router |--key=f-> f2 ----> Merge --> f4
\-key=c-> f3 --/
How should I go about doing this?
FlexiRoute in the old API seemed like a good way to go, but in the new API I'm guessing I want to either write a custom GraphStage or create my own graph with the DSL, as I see no way to do this through the built-in stages...?
Small Key Set Solution
If your key set is small and immutable, then a combination of broadcast and filter would probably be the easiest implementation to understand. You first need to define the filter that you described:
def goodKeys(keySet : Set[Char]) = Flow[Msg] filter (_.keys exists keySet.contains)
This can then feed a broadcaster as described in the documentation. All Msg values with good keys will be broadcast to each of the three filters, and each filter will only allow a particular key:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  val source: Source[Msg, NotUsed] = ???
  val goodKeyFilter = goodKeys(Set('k', 'f', 'c'))

  val bcast = builder.add(Broadcast[Msg](3))
  val merge = builder.add(Merge[Msg](3))

  val kKey = goodKeys(Set('k'))
  val fKey = goodKeys(Set('f'))
  val cKey = goodKeys(Set('c'))

  //as described in the question
  val f1: Flow[Msg, Msg, _] = ???
  val f2: Flow[Msg, Msg, _] = ???
  val f3: Flow[Msg, Msg, _] = ???
  val f4: Sink[Msg, _] = ???

  source ~> goodKeyFilter ~> bcast ~> kKey ~> f1 ~> merge ~> f4
                             bcast ~> fKey ~> f2 ~> merge
                             bcast ~> cKey ~> f3 ~> merge

  ClosedShape
})
Large Key Set Solution
If your key set is large, then groupBy is better. Suppose you have a Map from key sets to functions:
//e.g. Set('k') -> f1
val keyFuncs: Map[Set[Char], Msg => Msg] = ???
This map can be used with the groupBy function:
source
  .via(goodKeys(Set('k', 'f', 'c')))
  .groupBy(keyFuncs.size, _.keys.toSet)
  .map(msg => keyFuncs(msg.keys.toSet)(msg)) //apply one of f1,f2,f3 to the Msg
  .mergeSubstreams
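To make the sketch concrete, here is a hypothetical keyFuncs and a runnable wiring (the functions and sample messages are my inventions, and the lookup assumes each message's full key set matches exactly one map entry; an implicit materializer is assumed in scope):
import akka.stream.scaladsl.Source

// Hypothetical functions standing in for f1, f2, f3 from the diagram:
val keyFuncs: Map[Set[Char], Msg => Msg] = Map(
  Set('k') -> (m => m.copy(value = m.value.toUpperCase)),
  Set('f') -> (m => m.copy(value = m.value.reverse)),
  Set('c') -> (m => m.copy(value = "c:" + m.value))
)

Source(List(Msg(Seq('k'), "alpha"), Msg(Seq('f'), "beta")))
  .via(goodKeys(Set('k', 'f', 'c')))
  .groupBy(keyFuncs.size, _.keys.toSet)
  .map(msg => keyFuncs(msg.keys.toSet)(msg))
  .mergeSubstreams
  .runForeach(println)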

Akka Streams: No-arg GraphDSL.create() vs GraphDSL.create(sink)

The doc has the following example (only what's relevant to my question is shown):
val resultSink = Sink.head[Int]
val g = RunnableGraph.fromGraph(GraphDSL.create(resultSink) { implicit b => sink =>
  import GraphDSL.Implicits._

  // importing the partial graph will return its shape (inlets & outlets)
  val pm3 = b.add(pickMaxOfThree)

  Source.single(1) ~> pm3.in(0)
  Source.single(2) ~> pm3.in(1)
  Source.single(3) ~> pm3.in(2)
  pm3.out ~> sink.in

  ClosedShape
})
I was curious about why the sink has to be passed as a parameter to GraphDSL.create, so I modified the example slightly:
val resultSink = Sink.head[Int]
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  // importing the partial graph will return its shape (inlets & outlets)
  val pm3 = b.add(pickMaxOfThree)
  val s = b.add(resultSink).in

  Source.single(1) ~> pm3.in(0)
  Source.single(2) ~> pm3.in(1)
  Source.single(3) ~> pm3.in(2)
  pm3.out ~> s

  ClosedShape
})
However, this changes the return type of g.run() from Future[Int] to akka.NotUsed. Why?
I think I found the answer myself. According to the docs:
using builder.add(...), an operation that will make a copy of the blueprint that is passed to it and return the inlets and outlets of the resulting copy so that they can be wired up. Another alternative is to pass existing graphs—of any shape—into the factory method that produces a new graph. The difference between these approaches is that importing using builder.add(...) ignores the materialized value of the imported graph while importing via the factory method allows its inclusion
g.run() returns the materialized value of the graph, hence the change in return type.
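To make the difference concrete, here is a minimal sketch (assuming an implicit materializer is in scope): passing the sink into GraphDSL.create surfaces its materialized value through run(), whereas importing it with builder.add discards it.
import scala.concurrent.Future
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}

// Passing the sink into create() makes its materialized value the
// graph's materialized value:
val withMat: RunnableGraph[Future[Int]] =
  RunnableGraph.fromGraph(GraphDSL.create(Sink.head[Int]) { implicit b => sink =>
    import GraphDSL.Implicits._
    Source.single(42) ~> sink.in
    ClosedShape
  })

val futureResult: Future[Int] = withMat.run() // completes with 42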