How to signal the Sink when all elements have been processed? - scala

I have a graph that accepts a sequence of files, processes them one by one ant then at the end of the execution, the program should return success (0) or failure (-1) if all the executions have succeeded or failed.
How could this last step be achieved? How could the Sink know when it is receiving the result for the last file?
val graph = createGraph("path-to-list-of-files")
val result = graph.run()
def createGraph(fileOrPath: String): RunnableGraph[NotUsed] = {
printStage("PREPARING") {
val producer: Source[ProducerFile, NotUsed] = Producer(fileOrPath).toSource()
val validator: Flow[ProducerFile, ProducerFile, NotUsed] = Validator().toFlow()
val provisioner: Flow[ProducerFile, PrivisionerResult, NotUsed] = Provisioner().toFlow()
val executor: Flow[PrivisionerResult, ExecutorResult, NotUsed] = Executor().toFlow()
val evaluator: Flow[ExecutorResult, EvaluatorResult, NotUsed] = Evaluator().toFlow()
val reporter: Sink[EvaluatorResult, Future[Done]] = Reporter().toSink()
val graphResult = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
producer ~> validator ~> provisioner ~> executor ~> evaluator ~> reporter
ClosedShape
})
printLine("The graph pipeline was created")
graphResult
}

Your reporter Sink already materializes to a Future[Done], which you can hook to if you want to run some code when all your elements have processed.
However, at the moment you are not exposing it in your graph. Although there is a way to expose it using the graph DSL, in your case it is even easier to use the fluent DSL to achieve this:
val graphResult: RunnableGraph[Future[Done]] = producer
.via(validator)
.via(provisioner)
.via(executor)
.via(evaluator)
.toMat(reporter)(Keep.right)
This will give you back the Future[Done] when you run your graph
val result: Future[Done] = graph.run()
which then you can hook to - e.g.
result.onComplete {
case Success(_) => println("Success!")
case Failure(_) => println("Failure..")
}

Related

Akka Streams stop stream after process n elements

I have Akka Stream flow which is reading from file using alpakka, processing data and write into a file. I want to stop flow after processed n elements, count the time of duration and call system terminate. How can I achieve it?
My flow looks like that:
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
sourceFile ~> mainFlow ~> sinkFile
ClosedShape
})
graph.run()
Do you have an idea? Thanks
Agreeing with what #Viktor said, first of all you don't need to use the graphDSL to achieve this, and you can use take(n) to complete the graph.
Secondly, you can use mapMaterializedValue to attach a callback to your Sink materialized value (which in turn should materializes to a Future[Something]).
val graph: RunnableGraph[Future[FiniteDuration]] =
sourceFile
.via(mainFlow)
.take(N)
.toMat(sinkFile)(Keep.right)
.mapMaterializedValue { f ⇒
val start = System.nanoTime()
f.map(_ ⇒ FiniteDuration(System.nanoTime() - start, TimeUnit.NANOSECONDS))
}
graph.run().onComplete { duration ⇒
println(s"Elapsed time: $duration")
}
Note that you are going to need an ExecutionContext in scope.
EDIT
Even if you have to use the graphDSL, the same concepts apply. You only need to expose the materialized Future of your sink and map on that.
val graph: RunnableGraph[Future[??Something??]] =
RunnableGraph.fromGraph(GraphDSL.create(sinkFile) { implicit builder: GraphDSL.Builder[Future[Something]] => snk =>
import GraphDSL.Implicits._
sourceFile ~> mainFlow ~> snk
ClosedShape
})
val timedGraph: RunnableGraph[Future[FiniteDuration]] =
graph.mapMaterializedValue { f ⇒
val start = System.nanoTime()
f.map(_ ⇒ FiniteDuration(System.nanoTime() - start, TimeUnit.NANOSECONDS))
}
timedGraph.run().onComplete { duration ⇒
println(s"Elapsed time: $duration")
}
No need for GraphDSL here.
val doneFuture = (sourceFile via mainFlow.take(N) runWith sinkFile) transformWith { _ => system.terminate() }
To obtain time, you can use akka-streams-contrib: https://github.com/akka/akka-stream-contrib/blob/master/contrib/src/main/scala/akka/stream/contrib/Timed.scala

Akka Streams: How do I get Materialized Sink output from GraphDSL API?

This is a really simple, newbie question using the GraphDSL API. I read several related SO threads and I don't see the answer:
val actorSystem = ActorSystem("QuickStart")
val executor = actorSystem.dispatcher
val materializer = ActorMaterializer()(actorSystem)
val source: Source[Int, NotUsed] = Source(1 to 5)
val throttledSource = source.throttle(1, 1.second, 1, ThrottleMode.shaping)
val intDoublerFlow = Flow.fromFunction[Int, Int](i => i * 2)
val sink = Sink.foreach(println)
val graphModel = GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
// I presume I want to change this shape to something else
// but I can't figure out what it is.
ClosedShape
}
// TODO: This is RunnableGraph[NotUsed], I want RunnableGraph[Future[Done]] that gives the
// materialized Future[Done] from the sink. I presume I need to use a GraphDSL SourceShape
// but I can't get that working.
val graph = RunnableGraph.fromGraph(graphModel)
// This works and gives me the materialized sink output using the simpler API.
// But I want to use the GraphDSL so that I can add branches or junctures.
val graphThatIWantFromDslAPI = throttledSource.toMat(sink)(Keep.right)
The trick is to pass the stage you want the materialized value of (in your case, sink) to the GraphDSL.create. The function you pass as a second parameter changes as well, needing a Shape input parameter (s in the example below) which you can use in your graph.
val graphModel: Graph[ClosedShape, Future[Done]] = GraphDSL.create(sink) { implicit b => s =>
import GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> s
// ClosedShape is just fine - it is always the shape of a RunnableGraph
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
More info can be found in the docs.
val graphModel = GraphDSL.create(sink) { implicit b: Builder[Future[Done]] => sink =>
import akka.stream.scaladsl.GraphDSL.Implicits._
throttledSource ~> intDoublerFlow ~> sink
ClosedShape
}
val graph: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(graphModel)
val graphThatIWantFromDslAPI: RunnableGraph[Future[Done]] = throttledSource.toMat(sink)(Keep.right)
The problem with the GraphDSL API is, that the implicit Builder is heavily overloaded. You need to wrap your sink in create, which turns the Builder[NotUsed] into Builder[Future[Done]] and represents now a function from builder => sink => shape instead of builder => shape.

How do you deal with futures in Akka Flow?

I have built an akka graph that defines a flow. My objective is to reformat my future response and save it to a file. The flow can be outlined bellow:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val balancer = builder.add(Balance[(HttpRequest, String)](6, waitForAllDownstreams = false))
val merger = builder.add(Merge[Future[Map[String, String]]](6))
val fileSink = FileIO.toPath(outputPath, options)
val ignoreSink = Sink.ignore
val in = Source(seeds)
in ~> balancer.in
for (i <- Range(0,6)) {
balancer.out(i) ~>
wikiFlow.async ~>
// This maps to a Future[Map[String, String]]
Flow[(Try[HttpResponse], String)].map(parseHtml) ~>
merger
}
merger.out ~>
// When we merge we need to map our Map to a file
Flow[Future[Map[String, String]]].map((d) => {
// What is the proper way of serializing future map
// so I can work with it like a normal stream into fileSink?
// I could manually do ->
// d.foreach(someWriteToFileProcess(_))
// with ignoreSink, but this defeats the nice
// akka flow
}) ~>
fileSink
ClosedShape
})
I can hack this workflow to write my future map to a file via foreach, but I'm afraid this could somehow lead to concurrency issues with FileIO and it just doesn't feel right. What is the proper way to handle futures with our akka flow?
The easiest way to create a Flow which involves an asynchronous computation is by using mapAsync.
So... lets say you want to create a Flow which consumes Int and produces String using an asynchronous computation mapper: Int => Future[String] with a parallelism of 5.
val mapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](5)(mapper)
Now, you can use this flow in your graph however you want.
An example usage will be,
val graph = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val intSource = Source(1 to 10)
val printSink = Sink.foreach[String](s => println(s))
val yourMapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](2)(yourMapper)
intSource ~> yourFlow ~> printSink
ClosedShape
}

AKKA Streams: Performance degradation after connecting to AMQP

I have been trying to create an application that runs recursive task (crawler) with akka streams and QPID message broker.
What I have noticed is that separate parts of the graph alone perform quite good, but when connected together performance drops down significantly.
Here is statistics for the graph running on my local machine:
sending messages to message queue can achieve > 1700 msg/sec;
making HTTP requests approx. 70 req/sec;
whole graph, including reading
messages from queue 2-4 items/sec;
Source code for the pipeline can be found here:
https://gist.github.com/volisoft/3617824b16a3f3b6e01c933a8bdf8049
The pipeline is straightforward:
def main(args: Array[String]): Unit = {
startBroker()
val queueName = "amqp-conn-it-spec-simple-queue-" + System.currentTimeMillis()
val queueDeclaration = QueueDeclaration(queueName)
val in = AmqpSource(
NamedQueueSourceSettings(AmqpConnectionDetails("localhost", 5672, Some(AmqpCredentials("guest", "guest"))), queueName)
.withDeclarations(queueDeclaration),
bufferSize = 1028
).map(_.bytes.utf8String).log(":in")
val out = AmqpSink.simple(
AmqpSinkSettings(AmqpConnectionDetails("localhost", 5672, Some(AmqpCredentials("guest", "guest"))))
.withRoutingKey(queueName).withDeclarations(queueDeclaration))
val urlsSink = Flow[String].map(ByteString(_)).to(out)
val g = RunnableGraph.fromGraph(GraphDSL.create(in, urlsSink)((_,_)){ implicit b => (in, urlsSink0) =>
import GraphDSL.Implicits._
val pool = Http().superPool[String]()(materializer).log(":pool")
val download: Flow[String, Document, NotUsed] = Flow[String]
.map(url => (HttpRequest(method = HttpMethods.GET, Uri(url)), url) )
.via(pool)
.mapAsyncUnordered(8){ case (Success(response: HttpResponse), url) => parse(response, url)}
val filter = Flow[String].filter(notVisited).log(":filter")
val save = Flow[String].map(saveVisited)
val extractLinks: Flow[Document, String, NotUsed] = Flow[Document].mapConcat(document => getUrls(document))
in ~> save ~> download ~> extractLinks ~> filter ~> urlsSink0
ClosedShape
})
g.run()
Source.single(rootUrl).map(s => ByteString(s)).runWith(out)
}
How can this code be optimized to increase performance?

Monitoring a closed graph Akka Stream

If I have created a RunningGraph in Akka Stream, how can I know (from the outside)
when all nodes are cancelled due to completion?
when all nodes have been stopped due to an error?
I don't think there is a way to do it for an arbitrary graph, but if you have your graph under control, you just need to attach monitoring sinks to the output of each node which can fail or complete (these are nodes which have at least one output), for example:
import akka.actor.Status
// obtain graph parts (this can be done inside the graph building as well)
val source: Source[Int, NotUsed] = ...
val flow: Flow[Int, String, NotUsed] = ...
val sink: Sink[String, NotUsed] = ...
// create monitoring actors
val aggregate = actorSystem.actorOf(Props[Aggregate])
val sourceMonitorActor = actorSystem.actorOf(Props(new Monitor("source", aggregate)))
val flowMonitorActor = actorSystem.actorOf(Props(new Monitor("flow", aggregate)))
// create the graph
val graph = GraphDSL.create() { implicit b =>
import GraphDSL._
val sourceMonitor = b.add(Sink.actorRef(sourceMonitorActor, Status.Success(()))),
val flowMonitor = b.add(Sink.actorRef(flowMonitorActor, Status.Success(())))
val bc1 = b.add(Broadcast[Int](2))
val bc2 = b.add(Broadcast[String](2))
// main flow
source ~> bc1 ~> flow ~> bc2 ~> sink
// monitoring branches
bc1 ~> sourceMonitor
bc2 ~> flowMonitor
ClosedShape
}
// run the graph
RunnableGraph.fromGraph(graph).run()
class Monitor(name: String, aggregate: ActorRef) extends Actor {
override def receive: Receive = {
case Status.Success(_) => aggregate ! s"$name completed successfully"
case Status.Failure(e) => aggregate ! s"$name completed with failure: ${e.getMessage}"
case _ =>
}
}
class Aggregate extends Actor {
override def receive: Receive = {
case s: String => println(s)
}
}
It is also possible to create only one monitoring actor and use it in all monitoring sinks, but in that case you won't be able to differentiate easily between streams which have failed.
And there also is watchTermination() method on sources and flows which allows to materialize a future which terminates together with the flow at this point. I think it may be difficult to use with GraphDSL, but with regular stream methods it could look like this:
import akka.Done
import akka.actor.Status
import akka.pattern.pipe
val monitor = actorSystem.actorOf(Props[Monitor])
source
.watchTermination()((f, _) => f pipeTo monitor)
.via(flow).watchTermination((f, _) => f pipeTo monitor)
.to(sink)
.run()
class Monitor extends Actor {
override def receive: Receive = {
case Done => println("stream completed")
case Status.Failure(e) => println(s"stream failed: ${e.getMessage}")
}
}
You can transform the future before piping its value to the actor to differentiate between streams.