Eliminating internal collecting when constructing a Source as a response to data - scala

I have a Flow (createDataPointFlow) which is constructed by performing a mapAsync which collects data points (via Sink.seq) which I would otherwise like to stream directly (i.e. without collecting first).
However, it is not obvious to me how I can do this without collecting items, it seems I need some sort of mechanism to publish my items directly to the output portion of the flow I am creating, but I'm new to this and don't know how to do that without getting explicit actors involved, which I would like to avoid.
How can I achieve this without the need to collect things to a Sink first? Remember what I want to achieve is full streaming without the explicit buffering that Sink.seq(...) is doing.
object MyProcess {
def createDataSource(job:Job, dao:DataService):Source[JobDataPoint,NotUsed] = {
// Imagine the below call is equivalent to streaming a parameterized query using Slick
val publisher: Publisher[JobDataPoint] = dao.streamData(Criteria(job.name, job.data))
// Convert to a Source
val src: Source[JobDataPoint, NotUsed] = Source.fromPublisher(publisher)
src
}
def createDataPointFlow(dao:DataService, parallelism:Int=1): Flow[Job,JobDataPoint, NotUsed] =
Flow[Job].mapAsync(parallelism)(job =>
createDataSource(job,dao).toMat(Sink.seq)(Keep.right).run()
).mapConcat(identity)
def apply(src:Source[Job,NotUsed], dao:DataService,parallelism:Int=5) = RunnableGraph.fromGraph(GraphDSL.create(){ implicit builder =>
import GraphDSL.Implicits._
//Source
val jobs:Outlet[Job] = builder.add(src).out
//val bcastJobsSrc: Source[Job, NotUsed] = src.toMat(BroadcastHub.sink(256))(Keep.right).run()
//val bcastOutlet:Outlet[Job] = builder.add(bcastJobsSrc).out
//Flows
val bcastJobs:UniformFanOutShape[Job,Job] = builder.add(Broadcast[Job](4))
val rptMaker = builder.add(MyProcessors.flow(dao,parallelism))
val dpFlow = createDataPointFlow(dao,parallelism)
//Sinks
val jobPrinter:Inlet[Job] = builder.add(Sink.foreach[Job](job=>println(s"[MyGraph] Received job: ${job.name} => $job"))).in
val jobList:Inlet[Job] = builder.add(Sink.fold(List.empty[Job])((list,job:Job)=>job::list)).in
val reporter: Inlet[ReportTable] = builder.add(Sink.foreach[ReportTable](r=>println(s"[Report]: $r"))).in
val dpSink: Inlet[JobDataPoint] = builder.add(Sink.foreach[JobDataPoint](dp=>println(s"[DataPoint]: $dp"))).in
jobs ~> bcastJobs
bcastJobs ~> jobPrinter
bcastJobs ~> jobList
bcastJobs ~> rptMaker ~> reporter
bcastJobs ~> dpFlow ~> dpSink
ClosedShape
})
}

So after re-reading the documentation about the various stages available it turns out that what I needed was a flatMapConcat:
def createDataPointFlow(dao:DataService, parallelism:Int=1): Flow[Job,JobDataPoint, NotUsed] =
Flow[Job].flatMapConcat(createDataSource(_,dao))

Related

Akka Streams - jump over FlowShape

I have the following Graph.
At the inflateFlow stage, I check if there is already a request processed in DB. If there is already a processed message, I want to return MsgSuccess and not a RequestProcess, but the next FlowShape won't accept that, it needs a RequestProcess. Is there a way to jump from flowInflate to flowWrap without adding Either everywhere?
GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val flowInflate = builder.add(wrapFunctionInFlowShape[MsgRequest, RequestProcess](inflateFlow))
val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
flowInflate ~> flowProcess ~> flowWrite ~> flowWrap
FlowShape(flowInflate.in, flowWrap.out)
}
def wrapFunctionInFlowShape[Input, Output](f: Input => Output): Flow[Input, Output, NotUsed] = {
Flow.fromFunction { input =>
f(input)
}
}
//check for cache
def inflateFlow(msgRequest: MsgRequest): Either[RequestProcess, MsgSuccess] = {
val hash: String = hashMethod(msgRequest)
if(existisInDataBase(hash))
Right(MsgSuccess(hash))
else
Left(inflate(msgRequest))
}
def convertFlow(requestPorocess: RequestPocess): SuccessProcess = {}//process the request}
def writeFlow(successProcess: SuccessProcess): SuccessProcess = {}//write to DB}
def wrapFlow(successProcess: SuccessProcess): MsgSuccess = {}//wrap and return the message}
You can define alternative paths in a stream with a partition. In your case, the PartitionWith stage in the Akka Stream Contrib project could be helpful. Unlike the Partition stage in the standard Akka Streams API, PartitionWith allows the output types to be different: in your case, the output types are RequestProcess and MsgSuccess.
First, to use PartitionWith, add the following dependency to your build.sbt:
libraryDependencies += "com.typesafe.akka" %% "akka-stream-contrib" % "0.8"
Second, replace inflateFlow with the partition:
def split = PartitionWith[MsgRequest, RequestProcess, MsgSuccess] { msgRequest =>
val hash = hashMethod(msgRequest)
if (!existisInDataBase(hash))
Left(inflate(msgRequest))
else
Right(MsgSuccess(hash))
}
Then incorporate that stage into your graph:
val flow = Flow.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val pw = builder.add(split)
val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
val mrg = builder.add(Merge[MsgSuccess](2))
pw.out0 ~> flowProcess ~> flowWrite ~> flowWrap ~> mrg.in(0)
pw.out1 ~> mrg.in(1)
FlowShape(pw.in, mrg.out)
})
If an incoming MsgRequest is not found in the database and is converted to a RequestProcess, then that message goes through your original flow path. If an incoming MsgRequest is in the database and resolves to a MsgSuccess, then it bypasses the intermediate steps in the flow. In both cases, the resulting MsgSuccess messages are merged from the two alternative paths into one flow outlet.

Akka Streams: How do I get Materialized Sink output from GraphDSL API?

This is a really simple, newbie question using the GraphDSL API. I read several related SO threads and I don't see the answer:
val actorSystem = ActorSystem("QuickStart")
val executor = actorSystem.dispatcher
val materializer = ActorMaterializer()(actorSystem)
val source: Source[Int, NotUsed] = Source(1 to 5)
val throttledSource = source.throttle(1, 1.second, 1, ThrottleMode.shaping)
val intDoublerFlow = Flow.fromFunction[Int, Int](i => i * 2)
val sink = Sink.foreach(println)
val graphModel = GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
// I presume I want to change this shape to something else
// but I can't figure out what it is.
ClosedShape
}
// TODO: This is RunnableGraph[NotUsed], I want RunnableGraph[Future[Done]] that gives the
// materialized Future[Done] from the sink. I presume I need to use a GraphDSL SourceShape
// but I can't get that working.
val graph = RunnableGraph.fromGraph(graphModel)
// This works and gives me the materialized sink output using the simpler API.
// But I want to use the GraphDSL so that I can add branches or junctures.
val graphThatIWantFromDslAPI = throttledSource.toMat(sink)(Keep.right)
The trick is to pass the stage you want the materialized value of (in your case, sink) to the GraphDSL.create. The function you pass as a second parameter changes as well, needing a Shape input parameter (s in the example below) which you can use in your graph.
val graphModel: Graph[ClosedShape, Future[Done]] = GraphDSL.create(sink) { implicit b => s =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> s
// ClosedShape is just fine - it is always the shape of a RunnableGraph
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
More info can be found in the docs.
val graphModel = GraphDSL.create(sink) { implicit b: Builder[Future[Done]] => sink =>
import akka.stream.scaladsl.GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
val graphThatIWantFromDslAPI: RunnableGraph[Future[Done]] = throttledSource.toMat(sink)(Keep.right)
The problem with the GraphDSL API is, that the implicit Builder is heavily overloaded. You need to wrap your sink in create, which turns the Builder[NotUsed] into Builder[Future[Done]] and represents now a function from builder => sink => shape instead of builder => shape.

How do you deal with futures in Akka Flow?

I have built an akka graph that defines a flow. My objective is to reformat my future response and save it to a file. The flow can be outlined bellow:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val balancer = builder.add(Balance[(HttpRequest, String)](6, waitForAllDownstreams = false))
val merger = builder.add(Merge[Future[Map[String, String]]](6))
val fileSink = FileIO.toPath(outputPath, options)
val ignoreSink = Sink.ignore
val in = Source(seeds)
in ~> balancer.in
for (i <- Range(0,6)) {
balancer.out(i) ~>
wikiFlow.async ~>
// This maps to a Future[Map[String, String]]
Flow[(Try[HttpResponse], String)].map(parseHtml) ~>
merger
}
merger.out ~>
// When we merge we need to map our Map to a file
Flow[Future[Map[String, String]]].map((d) => {
// What is the proper way of serializing future map
// so I can work with it like a normal stream into fileSink?
// I could manually do ->
// d.foreach(someWriteToFileProcess(_))
// with ignoreSink, but this defeats the nice
// akka flow
}) ~>
fileSink
ClosedShape
})
I can hack this workflow to write my future map to a file via foreach, but I'm afraid this could somehow lead to concurrency issues with FileIO and it just doesn't feel right. What is the proper way to handle futures with our akka flow?
The easiest way to create a Flow which involves an asynchronous computation is by using mapAsync.
So... lets say you want to create a Flow which consumes Int and produces String using an asynchronous computation mapper: Int => Future[String] with a parallelism of 5.
val mapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](5)(mapper)
Now, you can use this flow in your graph however you want.
An example usage will be,
val graph = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val intSource = Source(1 to 10)
val printSink = Sink.foreach[String](s => println(s))
val yourMapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](2)(yourMapper)
intSource ~> yourFlow ~> printSink
ClosedShape
}

How to assemble an Akka Streams sink from multiple file writes?

I'm trying to integrate an akka streams based flow in to my Play 2.5 app. The idea is that you can stream in a photo, then have it written to disk as the raw file, a thumbnailed version and a watermarked version.
I managed to get this working using a graph something like this:
val byteAccumulator = Flow[ByteString].fold(new ByteStringBuilder())((builder, b) => {builder ++= b.toArray})
.map(_.result().toArray)
def toByteArray = Flow[ByteString].map(b => b.toArray)
val graph = Flow.fromGraph(GraphDSL.create() {implicit builder =>
import GraphDSL.Implicits._
val streamFan = builder.add(Broadcast[ByteString](3))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
val output = builder.add(Flow[ByteString].map(x => Success(Done)))
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
streamFan.out(0) ~> rawFileSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
streamFan.out(2) ~> output.in
byteArrayFan.out(0) ~> slowThumbnailProcessing ~> thumbnailFileSink
byteArrayFan.out(1) ~> slowWatermarkProcessing ~> watermarkedFileSink
FlowShape(streamFan.in, output.out)
})
graph
}
Then I wire it in to my play controller using an accumulator like this:
val sink = Sink.head[Try[Done]]
val photoStorageParser = BodyParser { req =>
Accumulator(sink).through(graph).map(Right.apply)
}
The problem is that my two processed file sinks aren't completing and I'm getting zero sizes for both processed files, but not the raw one. My theory is that the accumulator is only waiting on one of the outputs of my fan out, so when the input stream completes and my byteAccumulator spits out the complete file, by the time the processing is finished play has got the materialized value from the output.
So, my questions are:
Am I on the right track with this as far as my approach goes?
What is the expected behaviour for running a graph like this?
How can I bring all my sinks together to form one final sink?
Ok, after a little help (Andreas was on the right track), I've arrived at this solution which does the trick:
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
val graph = Sink.fromGraph(GraphDSL.create(rawFileSink, thumbnailFileSink, watermarkedFileSink)((_, _, _)) {
implicit builder => (rawSink, thumbSink, waterSink) => {
val streamFan = builder.add(Broadcast[ByteString](2))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
streamFan.out(0) ~> rawSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
byteArrayFan.out(0) ~> processorFlow(Thumbnail) ~> thumbSink
byteArrayFan.out(1) ~> processorFlow(Watermarked) ~> waterSink
SinkShape(streamFan.in)
}
})
graph.mapMaterializedValue[Future[Try[Done]]](fs => Future.sequence(Seq(fs._1, fs._2, fs._3)).map(f => Success(Done)))
After which it's dead easy to call this from Play:
val photoStorageParser = BodyParser { req =>
Accumulator(theSink).map(Right.apply)
}
def createImage(path: String) = Action(photoStorageParser) { req =>
Created
}

How to get a Subscriber and Publisher from a broadcasted Akka stream?

I'm having problems in getting Publishers and Subscribers out of my flows when using more complicated graphs. My goal is to provide an API of Publishers and Subscribers and run the Akka streaming internally. Here's my first try, which works just fine.
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val flow = subscriberSource.to(someFunctionSink)
//create Reactive Streams Subscriber
val subscriber: Subscriber[Boolean] = flow.run()
//prints true
Source.single(true).to(Sink(subscriber)).run()
But then with a more complicated broadcast graph, I'm unsure as how to get the Subscriber and Publisher objects out? Do I need a partial graph?
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val publisherSink = Sink.publisher[Boolean]
FlowGraph.closed() { implicit builder =>
import FlowGraph.Implicits._
val broadcast = builder.add(Broadcast[Boolean](2))
subscriberSource ~> broadcast.in
broadcast.out(0) ~> someFunctionSink
broadcast.out(1) ~> publisherSink
}.run()
val subscriber: Subscriber[Boolean] = ???
val publisher: Publisher[Boolean] = ???
When you call RunnableGraph.run() the stream is run and the result is the "materialized value" for that run.
In your simple example the materialized value of Source.subscriber[Boolean] is Subscriber[Boolean]. In your complex example you want to combine materialized values of several components of your graph to a materialized value that is a tuple (Subscriber[Boolean], Publisher[Boolean]).
You can do that by passing the components for which you are interested in their materialized values to FlowGraph.closed() and then specify a function to combine the materialized values:
import akka.stream.scaladsl._
import org.reactivestreams._
val subscriberSource = Source.subscriber[Boolean]
val someFunctionSink = Sink.foreach(Console.println)
val publisherSink = Sink.publisher[Boolean]
val graph =
FlowGraph.closed(subscriberSource, publisherSink)(Keep.both) { implicit builder ⇒
(in, out) ⇒
import FlowGraph.Implicits._
val broadcast = builder.add(Broadcast[Boolean](2))
in ~> broadcast.in
broadcast.out(0) ~> someFunctionSink
broadcast.out(1) ~> out
}
val (subscriber: Subscriber[Boolean], publisher: Publisher[Boolean]) = graph.run()
See the Scaladocs for more information about the overloads of FlowGraph.closed.
(Keep.both is short for a function (a, b) => (a, b))