How to assemble an Akka Streams sink from multiple file writes? - scala

I'm trying to integrate an akka streams based flow in to my Play 2.5 app. The idea is that you can stream in a photo, then have it written to disk as the raw file, a thumbnailed version and a watermarked version.
I managed to get this working using a graph something like this:
val byteAccumulator = Flow[ByteString].fold(new ByteStringBuilder())((builder, b) => {builder ++= b.toArray})
.map(_.result().toArray)
def toByteArray = Flow[ByteString].map(b => b.toArray)
val graph = Flow.fromGraph(GraphDSL.create() {implicit builder =>
import GraphDSL.Implicits._
val streamFan = builder.add(Broadcast[ByteString](3))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
val output = builder.add(Flow[ByteString].map(x => Success(Done)))
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
streamFan.out(0) ~> rawFileSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
streamFan.out(2) ~> output.in
byteArrayFan.out(0) ~> slowThumbnailProcessing ~> thumbnailFileSink
byteArrayFan.out(1) ~> slowWatermarkProcessing ~> watermarkedFileSink
FlowShape(streamFan.in, output.out)
})
graph
}
Then I wire it in to my play controller using an accumulator like this:
val sink = Sink.head[Try[Done]]
val photoStorageParser = BodyParser { req =>
Accumulator(sink).through(graph).map(Right.apply)
}
The problem is that my two processed file sinks aren't completing and I'm getting zero sizes for both processed files, but not the raw one. My theory is that the accumulator is only waiting on one of the outputs of my fan out, so when the input stream completes and my byteAccumulator spits out the complete file, by the time the processing is finished play has got the materialized value from the output.
So, my questions are:
Am I on the right track with this as far as my approach goes?
What is the expected behaviour for running a graph like this?
How can I bring all my sinks together to form one final sink?

Ok, after a little help (Andreas was on the right track), I've arrived at this solution which does the trick:
val rawFileSink = FileIO.toFile(file)
val thumbnailFileSink = FileIO.toFile(getFile(path, Thumbnail))
val watermarkedFileSink = FileIO.toFile(getFile(path, Watermarked))
val graph = Sink.fromGraph(GraphDSL.create(rawFileSink, thumbnailFileSink, watermarkedFileSink)((_, _, _)) {
implicit builder => (rawSink, thumbSink, waterSink) => {
val streamFan = builder.add(Broadcast[ByteString](2))
val byteArrayFan = builder.add(Broadcast[Array[Byte]](2))
streamFan.out(0) ~> rawSink
streamFan.out(1) ~> byteAccumulator ~> byteArrayFan.in
byteArrayFan.out(0) ~> processorFlow(Thumbnail) ~> thumbSink
byteArrayFan.out(1) ~> processorFlow(Watermarked) ~> waterSink
SinkShape(streamFan.in)
}
})
graph.mapMaterializedValue[Future[Try[Done]]](fs => Future.sequence(Seq(fs._1, fs._2, fs._3)).map(f => Success(Done)))
After which it's dead easy to call this from Play:
val photoStorageParser = BodyParser { req =>
Accumulator(theSink).map(Right.apply)
}
def createImage(path: String) = Action(photoStorageParser) { req =>
Created
}

Related

Eliminating internal collecting when constructing a Source as a response to data

I have a Flow (createDataPointFlow) which is constructed by performing a mapAsync which collects data points (via Sink.seq) which I would otherwise like to stream directly (i.e. without collecting first).
However, it is not obvious to me how I can do this without collecting items, it seems I need some sort of mechanism to publish my items directly to the output portion of the flow I am creating, but I'm new to this and don't know how to do that without getting explicit actors involved, which I would like to avoid.
How can I achieve this without the need to collect things to a Sink first? Remember what I want to achieve is full streaming without the explicit buffering that Sink.seq(...) is doing.
object MyProcess {
def createDataSource(job:Job, dao:DataService):Source[JobDataPoint,NotUsed] = {
// Imagine the below call is equivalent to streaming a parameterized query using Slick
val publisher: Publisher[JobDataPoint] = dao.streamData(Criteria(job.name, job.data))
// Convert to a Source
val src: Source[JobDataPoint, NotUsed] = Source.fromPublisher(publisher)
src
}
def createDataPointFlow(dao:DataService, parallelism:Int=1): Flow[Job,JobDataPoint, NotUsed] =
Flow[Job].mapAsync(parallelism)(job =>
createDataSource(job,dao).toMat(Sink.seq)(Keep.right).run()
).mapConcat(identity)
def apply(src:Source[Job,NotUsed], dao:DataService,parallelism:Int=5) = RunnableGraph.fromGraph(GraphDSL.create(){ implicit builder =>
import GraphDSL.Implicits._
//Source
val jobs:Outlet[Job] = builder.add(src).out
//val bcastJobsSrc: Source[Job, NotUsed] = src.toMat(BroadcastHub.sink(256))(Keep.right).run()
//val bcastOutlet:Outlet[Job] = builder.add(bcastJobsSrc).out
//Flows
val bcastJobs:UniformFanOutShape[Job,Job] = builder.add(Broadcast[Job](4))
val rptMaker = builder.add(MyProcessors.flow(dao,parallelism))
val dpFlow = createDataPointFlow(dao,parallelism)
//Sinks
val jobPrinter:Inlet[Job] = builder.add(Sink.foreach[Job](job=>println(s"[MyGraph] Received job: ${job.name} => $job"))).in
val jobList:Inlet[Job] = builder.add(Sink.fold(List.empty[Job])((list,job:Job)=>job::list)).in
val reporter: Inlet[ReportTable] = builder.add(Sink.foreach[ReportTable](r=>println(s"[Report]: $r"))).in
val dpSink: Inlet[JobDataPoint] = builder.add(Sink.foreach[JobDataPoint](dp=>println(s"[DataPoint]: $dp"))).in
jobs ~> bcastJobs
bcastJobs ~> jobPrinter
bcastJobs ~> jobList
bcastJobs ~> rptMaker ~> reporter
bcastJobs ~> dpFlow ~> dpSink
ClosedShape
})
}
So after re-reading the documentation about the various stages available it turns out that what I needed was a flatMapConcat:
def createDataPointFlow(dao:DataService, parallelism:Int=1): Flow[Job,JobDataPoint, NotUsed] =
Flow[Job].flatMapConcat(createDataSource(_,dao))

Akka Streams - jump over FlowShape

I have the following Graph.
At the inflateFlow stage, I check if there is already a request processed in DB. If there is already a processed message, I want to return MsgSuccess and not a RequestProcess, but the next FlowShape won't accept that, it needs a RequestProcess. Is there a way to jump from flowInflate to flowWrap without adding Either everywhere?
GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val flowInflate = builder.add(wrapFunctionInFlowShape[MsgRequest, RequestProcess](inflateFlow))
val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
flowInflate ~> flowProcess ~> flowWrite ~> flowWrap
FlowShape(flowInflate.in, flowWrap.out)
}
def wrapFunctionInFlowShape[Input, Output](f: Input => Output): Flow[Input, Output, NotUsed] = {
Flow.fromFunction { input =>
f(input)
}
}
//check for cache
def inflateFlow(msgRequest: MsgRequest): Either[RequestProcess, MsgSuccess] = {
val hash: String = hashMethod(msgRequest)
if(existisInDataBase(hash))
Right(MsgSuccess(hash))
else
Left(inflate(msgRequest))
}
def convertFlow(requestPorocess: RequestPocess): SuccessProcess = {}//process the request}
def writeFlow(successProcess: SuccessProcess): SuccessProcess = {}//write to DB}
def wrapFlow(successProcess: SuccessProcess): MsgSuccess = {}//wrap and return the message}
You can define alternative paths in a stream with a partition. In your case, the PartitionWith stage in the Akka Stream Contrib project could be helpful. Unlike the Partition stage in the standard Akka Streams API, PartitionWith allows the output types to be different: in your case, the output types are RequestProcess and MsgSuccess.
First, to use PartitionWith, add the following dependency to your build.sbt:
libraryDependencies += "com.typesafe.akka" %% "akka-stream-contrib" % "0.8"
Second, replace inflateFlow with the partition:
def split = PartitionWith[MsgRequest, RequestProcess, MsgSuccess] { msgRequest =>
val hash = hashMethod(msgRequest)
if (!existisInDataBase(hash))
Left(inflate(msgRequest))
else
Right(MsgSuccess(hash))
}
Then incorporate that stage into your graph:
val flow = Flow.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val pw = builder.add(split)
val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
val mrg = builder.add(Merge[MsgSuccess](2))
pw.out0 ~> flowProcess ~> flowWrite ~> flowWrap ~> mrg.in(0)
pw.out1 ~> mrg.in(1)
FlowShape(pw.in, mrg.out)
})
If an incoming MsgRequest is not found in the database and is converted to a RequestProcess, then that message goes through your original flow path. If an incoming MsgRequest is in the database and resolves to a MsgSuccess, then it bypasses the intermediate steps in the flow. In both cases, the resulting MsgSuccess messages are merged from the two alternative paths into one flow outlet.

Merge and broadcast, building a (simple) Akka graph

The Akka documentation is vast and there are a lot of tutorials. But either they are outdated or they only cover the basics (or, maybe I simply can't find the right ones).
What I want to create is a websocket application with multiple clients and multiple sources on the server side. As I don't want to get over my head from the start, I want to make baby steps and incrementally increase the complexity of the software I am building.
After toying around with some simple flows I wanted to start with a more sophisticated graph now.
What I want is:
Two sources, one that pushes "keepAlive" messages from the server to the client (currently only one) and a second one that actually pushes useful data.
Now for the first one I have this:
val tickingSource: Source[Array[Byte], Cancellable] =
Source.tick(initialDelay = 1 second, interval = 10 seconds, tick = NotUsed)
.zipWithIndex
.map{ case (_, counter) => SomeMessage().toByteArray}
Where SomeMessage is a protobuf type.
Because I can't find an up-to-date way to add an actor as a source, I tried the following for my second source:
val secondSource = Source(1 to 1000)
val secondSourceConverter = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
My attempt at the graph:
val g: RunnableGraph[NotUsed] = RunnableGraph.fromGraph(GraphDSL.create()
{
implicit builder =>
import GraphDSL.Implicits._
val sourceMerge = builder.add(Merge[Array[Byte]](2).named("sourceMerge"))
val x = Source(1 to 1000)
val y = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
val out = Sink.ignore
tickingSource ~> sourceMerge ~> out
x ~> y ~> sourceMerge
ClosedShape
})
Now g is of type RunnableGraph[NotUsed] while it should be RunnableGraph[Array[Byte]] for my websocket. And I wonder here: am I already doing something completely wrong?
You need to pass the secondSourceConverter into the GraphDSL.create, like the following example taken from their docs. Here they import 2 sinks, but it's the same technique.
RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
(topHS, bottomHS) =>
import GraphDSL.Implicits._
val broadcast = builder.add(Broadcast[Int](2))
Source.single(1) ~> broadcast.in
broadcast.out(0) ~> sharedDoubler ~> topHS.in
broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
ClosedShape
})
Your graph is of type RunnableGraph[NotUsed] because you're using Sink.ignore. And you probably want a RunnableGraph[Future[Array[Byte]]] instead of a RunnableGraph[Array[Byte]]:
val byteSink = Sink.fold[Array[Byte], Array[Byte]](Array[Byte]())(_ ++ _)
val g = RunnableGraph.fromGraph(GraphDSL.create(byteSink) { implicit builder => bSink =>
import GraphDSL.Implicits._
val sourceMerge = builder.add(Merge[Array[Byte]](2))
tickingSource ~> sourceMerge ~> bSink.in
secondSource ~> secondSourceConverter ~> sourceMerge
ClosedShape
})
// RunnableGraph[Future[Array[Byte]]]
I'm not sure how you would like to process incoming messages but here is a simple example. Hope that it'll help you.
path("ws") {
extractUpgradeToWebSocket { upgrade =>
complete {
import scala.concurrent.duration._
val tickSource = Source.tick(1.second, 1.second, TextMessage("ping"))
val messagesSource = Source.queue(10, OverflowStrategy.backpressure)
messagesSource.mapMaterializedValue { queue =>
//do something with out queue
//like myHandler ! RegisterOutQueue(queue)
}
val sink = Sink.ignore
val source = tickSource.merge(messagesSource)
upgrade.handleMessagesWithSinkSource(
inSink = sink,
outSource = source
)
}
}

Akka Streams: How do I get Materialized Sink output from GraphDSL API?

This is a really simple, newbie question using the GraphDSL API. I read several related SO threads and I don't see the answer:
val actorSystem = ActorSystem("QuickStart")
val executor = actorSystem.dispatcher
val materializer = ActorMaterializer()(actorSystem)
val source: Source[Int, NotUsed] = Source(1 to 5)
val throttledSource = source.throttle(1, 1.second, 1, ThrottleMode.shaping)
val intDoublerFlow = Flow.fromFunction[Int, Int](i => i * 2)
val sink = Sink.foreach(println)
val graphModel = GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
// I presume I want to change this shape to something else
// but I can't figure out what it is.
ClosedShape
}
// TODO: This is RunnableGraph[NotUsed], I want RunnableGraph[Future[Done]] that gives the
// materialized Future[Done] from the sink. I presume I need to use a GraphDSL SourceShape
// but I can't get that working.
val graph = RunnableGraph.fromGraph(graphModel)
// This works and gives me the materialized sink output using the simpler API.
// But I want to use the GraphDSL so that I can add branches or junctures.
val graphThatIWantFromDslAPI = throttledSource.toMat(sink)(Keep.right)
The trick is to pass the stage you want the materialized value of (in your case, sink) to the GraphDSL.create. The function you pass as a second parameter changes as well, needing a Shape input parameter (s in the example below) which you can use in your graph.
val graphModel: Graph[ClosedShape, Future[Done]] = GraphDSL.create(sink) { implicit b => s =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> s
// ClosedShape is just fine - it is always the shape of a RunnableGraph
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
More info can be found in the docs.
val graphModel = GraphDSL.create(sink) { implicit b: Builder[Future[Done]] => sink =>
import akka.stream.scaladsl.GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
val graphThatIWantFromDslAPI: RunnableGraph[Future[Done]] = throttledSource.toMat(sink)(Keep.right)
The problem with the GraphDSL API is, that the implicit Builder is heavily overloaded. You need to wrap your sink in create, which turns the Builder[NotUsed] into Builder[Future[Done]] and represents now a function from builder => sink => shape instead of builder => shape.

How do you deal with futures in Akka Flow?

I have built an akka graph that defines a flow. My objective is to reformat my future response and save it to a file. The flow can be outlined bellow:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val balancer = builder.add(Balance[(HttpRequest, String)](6, waitForAllDownstreams = false))
val merger = builder.add(Merge[Future[Map[String, String]]](6))
val fileSink = FileIO.toPath(outputPath, options)
val ignoreSink = Sink.ignore
val in = Source(seeds)
in ~> balancer.in
for (i <- Range(0,6)) {
balancer.out(i) ~>
wikiFlow.async ~>
// This maps to a Future[Map[String, String]]
Flow[(Try[HttpResponse], String)].map(parseHtml) ~>
merger
}
merger.out ~>
// When we merge we need to map our Map to a file
Flow[Future[Map[String, String]]].map((d) => {
// What is the proper way of serializing future map
// so I can work with it like a normal stream into fileSink?
// I could manually do ->
// d.foreach(someWriteToFileProcess(_))
// with ignoreSink, but this defeats the nice
// akka flow
}) ~>
fileSink
ClosedShape
})
I can hack this workflow to write my future map to a file via foreach, but I'm afraid this could somehow lead to concurrency issues with FileIO and it just doesn't feel right. What is the proper way to handle futures with our akka flow?
The easiest way to create a Flow which involves an asynchronous computation is by using mapAsync.
So... lets say you want to create a Flow which consumes Int and produces String using an asynchronous computation mapper: Int => Future[String] with a parallelism of 5.
val mapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](5)(mapper)
Now, you can use this flow in your graph however you want.
An example usage will be,
val graph = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val intSource = Source(1 to 10)
val printSink = Sink.foreach[String](s => println(s))
val yourMapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](2)(yourMapper)
intSource ~> yourFlow ~> printSink
ClosedShape
}