Suppose I have following simple graph.
class KafkaSource[A](kI: KafkaIterator) extends GraphStage[SourceShape[A]] {
val out = Outlet[A]("KafkaSource.out")
override val shape = SourceShape.of(out)
override def createLogic(attr: Attributes): GraphStageLogic =
new GraphStageLogic(shape) {
setHandler(out, new OutHandler {
override def onPull(): Unit = {
push(out, kI.next)
}
})
}
}
val g = GraphDSL.create(){ implicit b =>
val source = b.add(new KafkaSource[Message](itr))
val sink = b.add(Sink.foreach[Message](println))
source ~> sink
ClosedShape
}
we're running it as
RunnableGraph.fromGraph(g).run()
I wish to signal the kafkaSource to stop(or artificially complete) instead of pushing the next available element, so that connected stages downstream also stop.
how do I accomplish that ?
The scenario being, we have millions of messages in kafka & we would like to stop processing messages everyday at 9pm (for instance) and assuming we're stopping our running applications with a clean shutdown.
Although probably not relevant for phantomastray any more but helpful for others:
A KillSwitch allows the completion of graphs of FlowShape from the
outside. It consists of a flow element that can be linked to a graph
of FlowShape needing completion control. The KillSwitch trait allows
to complete or fail the graph(s). [Source: Akka Docs]
Related
I am trying to create a Redis Akka Stream Source using the GraphStage construct. The idea is that whenever I get an update from the subscribe method, I push that to the next component. Also if there is no pull signal, the component should backpressure. Here is the code:
class SubscriberSourceShape(channel: String, subscriber: Subscriber)
extends GraphStage[SourceShape[String]] {
private val outlet: Outlet[String] = Outlet("SubscriberSource.Out")
override def shape: SourceShape[String] = SourceShape(outlet)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = {
new GraphStageLogic(shape) {
val callback = getAsyncCallback((msg: String) => push(outlet, msg)
val handler = (msg: String) => callback.invoke(msg)
override def preStart(): Unit = subscriber.subscribe(channel)(handler)
}
}
}
However, when I run this with a simple sink I get this error:
Error in stage [akka.http.impl.engine.server.HttpServerBluePrint$ProtocolSwitchStage#58344854]: No handler defined in stage [...SubscriberSourceShape#6a9193dd] for out port [RedisSubscriberSource.Out(1414886352). All inlets and outlets must be assigned a handler with setHandler in the constructor of your graph stage logic.
What is wrong?
You are getting this error because you haven't set any output handler for your Source, so when the downstream components (Flow or Sink) send a pull signal to this Source, there is no handler to handle the pull signal.
You can add an OutputHandler to get rid of the error. Leave the onPull method empty since you are producing your elements on the asyncCallback. Just add this inside the GraphStageLogic body:
setHandler(outlet, new OutHandler {
override def onPull(): Unit = {}
})
I have an external (that is, I cannot change it) Java API which looks like this:
public interface Sender {
void send(Event e);
}
I need to implement a Sender which accepts each event, transforms it to a JSON object, collects some number of them into a single bundle and sends over HTTP to some endpoint. This all should be done asynchronously, without send() blocking the calling thread, with some fixed-size buffer and dropping new events if the buffer is full.
With akka-streams this is quite simple: I create a graph of stages (which uses akka-http to send HTTP requests), materialize it and use the materialized ActorRef to push new events to the stream:
lazy val eventPipeline = Source.actorRef[Event](Int.MaxValue, OverflowStrategy.fail)
.via(CustomBuffer(bufferSize)) // buffer all events
.groupedWithin(batchSize, flushDuration) // group events into chunks
.map(toBundle) // convert each chunk into a JSON message
.mapAsyncUnordered(1)(sendHttpRequest) // send an HTTP request
.toMat(Sink.foreach { response =>
// print HTTP response for debugging
})(Keep.both)
lazy val (eventsActor, completeFuture) = eventPipeline.run()
override def send(e: Event): Unit = {
eventsActor ! e
}
Here CustomBuffer is a custom GraphStage which is very similar to the library-provided Buffer but tailored to our specific needs; it probably does not matter for this particular question.
As you can see, interacting with the stream from non-stream code is very simple - the ! method on the ActorRef trait is asynchronous and does not need any additional machinery to be called. Each event which is sent to the actor is then processed through the entire reactive pipeline. Moreover, because of how akka-http is implemented, I even get connection pooling for free, so no more than one connection is opened to the server.
However, I cannot find a way to do the same thing with FS2 properly. Even discarding the question of buffering (I will probably need to write a custom Pipe implementation which does additional things that we need) and HTTP connection pooling, I'm still stuck with a more basic thing - that is, how to push the data to the reactive stream "from outside".
All tutorials and documentation that I can find assume that the entire program happens inside some effect context, usually IO. This is not my case - the send() method is invoked by the Java library at unspecified times. Therefore, I just cannot keep everything inside one IO action, I necessarily have to finalize the "push" action inside the send() method, and have the reactive stream as a separate entity, because I want to aggregate events and hopefully pool HTTP connections (which I believe is naturally tied to the reactive stream).
I assume that I need some additional data structure, like Queue. fs2 does indeed have some kind of fs2.concurrent.Queue, but again, all documentation shows how to use it inside a single IO context, so I assume that doing something like
val queue: Queue[IO, Event] = Queue.unbounded[IO, Event].unsafeRunSync()
and then using queue inside the stream definition and then separately inside the send() method with further unsafeRun calls:
val eventPipeline = queue.dequeue
.through(customBuffer(bufferSize))
.groupWithin(batchSize, flushDuration)
.map(toBundle)
.mapAsyncUnordered(1)(sendRequest)
.evalTap(response => ...)
.compile
.drain
eventPipeline.unsafeRunAsync(...) // or something
override def send(e: Event) {
queue.enqueue(e).unsafeRunSync()
}
is not the correct way and most likely would not even work.
So, my question is, how do I properly use fs2 to solve my problem?
Consider the following example:
import cats.implicits._
import cats.effect._
import cats.effect.implicits._
import fs2._
import fs2.concurrent.Queue
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
object Answer {
type Event = String
trait Sender {
def send(event: Event): Unit
}
def main(args: Array[String]): Unit = {
val sender: Sender = {
val ec = ExecutionContext.global
implicit val cs: ContextShift[IO] = IO.contextShift(ec)
implicit val timer: Timer[IO] = IO.timer(ec)
fs2Sender[IO](2)
}
val events = List("a", "b", "c", "d")
events.foreach { evt => new Thread(() => sender.send(evt)).start() }
Thread sleep 3000
}
def fs2Sender[F[_]: Timer : ContextShift](maxBufferedSize: Int)(implicit F: ConcurrentEffect[F]): Sender = {
// dummy impl
// this is where the actual logic for batching
// and shipping over the network would live
val consume: Pipe[F, Event, Unit] = _.evalMap { event =>
for {
_ <- F.delay { println(s"consuming [$event]...") }
_ <- Timer[F].sleep(1.seconds)
_ <- F.delay { println(s"...[$event] consumed") }
} yield ()
}
val suspended = for {
q <- Queue.bounded[F, Event](maxBufferedSize)
_ <- q.dequeue.through(consume).compile.drain.start
sender <- F.delay[Sender] { evt =>
val enqueue = for {
wasEnqueued <- q.offer1(evt)
_ <- F.delay { println(s"[$evt] enqueued? $wasEnqueued") }
} yield ()
enqueue.toIO.unsafeRunAsyncAndForget()
}
} yield sender
suspended.toIO.unsafeRunSync()
}
}
The main idea is to use a concurrent Queue from fs2. Note, that the above code demonstrates that neither the Sender interface nor the logic in main can be changed. Only an implementation of the Sender interface can be swapped out.
I don't have much experience with exactly that library but it should look somehow like that:
import cats.effect.{ExitCode, IO, IOApp}
import fs2.concurrent.Queue
case class Event(id: Int)
class JavaProducer{
new Thread(new Runnable {
override def run(): Unit = {
var id = 0
while(true){
Thread.sleep(1000)
id += 1
send(Event(id))
}
}
}).start()
def send(event: Event): Unit ={
println(s"Original producer prints $event")
}
}
class HackedProducer(queue: Queue[IO, Event]) extends JavaProducer {
override def send(event: Event): Unit = {
println(s"Hacked producer pushes $event")
queue.enqueue1(event).unsafeRunSync()
println(s"Hacked producer pushes $event - Pushed")
}
}
object Test extends IOApp{
override def run(args: List[String]): IO[ExitCode] = {
val x: IO[Unit] = for {
queue <- Queue.unbounded[IO, Event]
_ = new HackedProducer(queue)
done <- queue.dequeue.map(ev => {
println(s"Got $ev")
}).compile.drain
} yield done
x.map(_ => ExitCode.Success)
}
}
We can create a bounded queue that will consume elements from sender and make them available to fs2 stream processing.
import cats.effect.IO
import cats.effect.std.Queue
import fs2.Stream
trait Sender[T]:
def send(e: T): Unit
object Sender:
def apply[T](bufferSize: Int): IO[(Sender[T], Stream[IO, T])] =
for
q <- Queue.bounded[IO, T](bufferSize)
yield
val sender: Sender[T] = (e: T) => q.offer(e).unsafeRunSync()
def stm: Stream[IO, T] = Stream.eval(q.take) ++ stm
(sender, stm)
Then we'll have two ends - one for Java worlds, to send new elements to Sender. Another one - for stream processing in fs2.
class TestSenderQueue:
#Test def testSenderQueue: Unit =
val (sender, stream) = Sender[Int](1)
.unsafeRunSync()// we have to run it preliminary to make `sender` available to external system
val processing =
stream
.map(i => i * i)
.evalMap{ ii => IO{ println(ii)}}
sender.send(1)
processing.compile.toList.start//NB! we start processing in a separate fiber
.unsafeRunSync() // immediately right now.
sender.send(2)
Thread.sleep(100)
(0 until 100).foreach(sender.send)
println("finished")
Note that we push data in the current thread and have to run fs2 in a separate thread (.start).
I am trying to track down a bug in a custom Flow where there were extra requests going upstream. I wrote a test case which should have failed given the behaviour and logging I was seeing but it didn't.
I then made a trivial Flow with an obvious extra pull and yet the test still did not fail.
Here is an example:
class EagerFlow extends GraphStage[FlowShape[String, String]] {
val in: Inlet[String] = Inlet("EagerFlow.in")
val out: Outlet[String] = Outlet("EagerFlow.out")
override def shape: FlowShape[String, String] = FlowShape(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
setHandler(in, new InHandler {
override def onPush() = {
println("EagerFlow.in.onPush")
println("EagerFlow.out.push")
push(out, grab(in))
println("EagerFlow.in.pull")
pull(in)
}
})
setHandler(out, new OutHandler {
override def onPull() = {
println("EagerFlow.out.onPull")
println("EagerFlow.in.pull")
pull(in)
}
})
}
}
Here is a test that should pass, i.e. confirm that there is an extra request.
"Eager Flow" should "pull on every push" in {
val sourceProbe = TestSource.probe[String]
val sinkProbe = TestSink.probe[String]
val eagerFlow = Flow.fromGraph(new EagerFlow)
val (source, sink) = sourceProbe.
via(eagerFlow).
toMat(sinkProbe)(Keep.both).
run
sink.request(1)
source.expectRequest()
source.expectNoMsg()
source.sendNext("hello")
sink.expectNext()
source.expectRequest()
}
In standard output you can see:
0C3E08A8 PULL DownstreamBoundary -> TestSpec$EagerFlow#1d154180 (TestSpec$EagerFlow$$anon$4$$anon$6#7e62ecbd) [TestSpec$EagerFlow$$anon$4#36904dd4]
EagerFlow.out.onPull
EagerFlow.in.pull
0C3E08A8 PULL TestSpec$EagerFlow#1d154180 -> UpstreamBoundary (BatchingActorInputBoundary(id=0, fill=0/16, completed=false, canceled=false)) [UpstreamBoundary]
EagerFlow.in.onPush
EagerFlow.out.push
EagerFlow.in.pull
0C3E08A8 PULL TestSpec$EagerFlow#1d154180 -> UpstreamBoundary (BatchingActorInputBoundary(id=0, fill=0/16, completed=false, canceled=false)) [UpstreamBoundary]
assertion failed: timeout (3 seconds) during expectMsg:
java.lang.AssertionError: assertion failed: timeout (3 seconds) during expectMsg:
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:368)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:737)
at akka.stream.testkit.StreamTestKit$PublisherProbeSubscription.expectRequest(StreamTestKit.scala:665)
at akka.stream.testkit.TestPublisher$Probe.expectRequest(StreamTestKit.scala:172)
(This is with the debug logging in akka.stream.impl.fusing.GraphInterpreter#processEvent, well actually using print-points but effectively the same thing)
Why does this pull stop in the UpstreamBoundary and never make it back to the TestSource.probe?
I have quite a few tests which use expectNoMsg() to ensure correct backpressure but it seems these will just give false positives. How should I be testing this if this doesn't work?
To make your test applicable you'll need to set up your Flow buffer to be of size 1.
val settings = ActorMaterializerSettings(system)
.withInputBuffer(initialSize = 1, maxSize = 1)
implicit val mat = ActorMaterializer(settings)
The default sizes are:
# Initial size of buffers used in stream elements
initial-input-buffer-size = 4
# Maximum size of buffers used in stream elements
max-input-buffer-size = 16
This means that Akka will eagerly request and ingest up to 16 elements anyway at your first request. More info on internal buffers can be found here.
It's possible to create sources and sinks from actors using Source.actorPublisher() and Sink.actorSubscriber() methods respectively. But is it possible to create a Flow from actor?
Conceptually there doesn't seem to be a good reason not to, given that it implements both ActorPublisher and ActorSubscriber traits, but unfortunately, the Flow object doesn't have any method for doing this. In this excellent blog post it's done in an earlier version of Akka Streams, so the question is if it's possible also in the latest (2.4.9) version.
I'm part of the Akka team and would like to use this question to clarify a few things about the raw Reactive Streams interfaces. I hope you'll find this useful.
Most notably, we'll be posting multiple posts on the Akka team blog about building custom stages, including Flows, soon, so keep an eye on it.
Don't use ActorPublisher / ActorSubscriber
Please don't use ActorPublisher and ActorSubscriber. They're too low level and you might end up implementing them in such a way that's violating the Reactive Streams specification. They're a relict of the past and even then were only "power-user mode only". There really is no reason to use those classes nowadays. We never provided a way to build a flow because the complexity is simply explosive if it was exposed as "raw" Actor API for you to implement and get all the rules implemented correctly.
If you really really want to implement raw ReactiveStreams interfaces, then please do use the Specification's TCK to verify your implementation is correct. You will likely be caught off guard by some of the more complex corner cases a Flow (or in RS terminology a Processor has to handle).
Most operations are possible to build without going low-level
Many flows you should be able to simply build by building from a Flow[T] and adding the needed operations onto it, just as an example:
val newFlow: Flow[String, Int, NotUsed] = Flow[String].map(_.toInt)
Which is a reusable description of the Flow.
Since you're asking about power user mode, this is the most powerful operator on the DSL itself: statefulFlatMapConcat. The vast majority of operations operating on plain stream elements is expressable using it: Flow.statefulMapConcat[T](f: () ⇒ (Out) ⇒ Iterable[T]): Repr[T].
If you need timers you could zip with a Source.timer etc.
GraphStage is the simplest and safest API to build custom stages
Instead, building Sources/Flows/Sinks has its own powerful and safe API: the GraphStage. Please read the documentation about building custom GraphStages (they can be a Sink/Source/Flow or even any arbitrary shape). It handles all of the complex Reactive Streams rules for you, while giving you full freedom and type-safety while implementing your stages (which could be a Flow).
For example, taken from the docs, is an GraphStage implementation of the filter(T => Boolean) operator:
class Filter[A](p: A => Boolean) extends GraphStage[FlowShape[A, A]] {
val in = Inlet[A]("Filter.in")
val out = Outlet[A]("Filter.out")
val shape = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
new GraphStageLogic(shape) {
setHandler(in, new InHandler {
override def onPush(): Unit = {
val elem = grab(in)
if (p(elem)) push(out, elem)
else pull(in)
}
})
setHandler(out, new OutHandler {
override def onPull(): Unit = {
pull(in)
}
})
}
}
It also handles asynchronous channels and is fusable by default.
In addition to the docs, these blog posts explain in detail why this API is the holy grail of building custom stages of any shape:
Akka team blog: Mastering GraphStages (part 1, introduction) - a high level overview
... tomorrow we'll publish one about it's API as well...
Kunicki blog: Implementing a Custom Akka Streams Graph Stage - another example implementing sources (really applies 1:1 to building Flows)
Konrad's solution demonstrates how to create a custom stage that utilizes Actors, but in most cases I think that is a bit overkill.
Usually you have some Actor that is capable of responding to questions:
val actorRef : ActorRef = ???
type Input = ???
type Output = ???
val queryActor : Input => Future[Output] =
(actorRef ? _) andThen (_.mapTo[Output])
This can be easily utilized with basic Flow functionality which takes in the maximum number of concurrent requests:
val actorQueryFlow : Int => Flow[Input, Output, _] =
(parallelism) => Flow[Input].mapAsync[Output](parallelism)(queryActor)
Now actorQueryFlow can be integrated into any stream...
Here is a solution build by using a graph stage. The actor has to acknowledge all messages in order to have back-pressure. The actor is notified when the stream fails/completes and the stream fails when the actor terminates.
This can be useful if you don't want to use ask, e.g. when not every input message has a corresponding output message.
import akka.actor.{ActorRef, Status, Terminated}
import akka.stream._
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
object ActorRefBackpressureFlowStage {
case object StreamInit
case object StreamAck
case object StreamCompleted
case class StreamFailed(ex: Throwable)
case class StreamElementIn[A](element: A)
case class StreamElementOut[A](element: A)
}
/**
* Sends the elements of the stream to the given `ActorRef` that sends back back-pressure signal.
* First element is always `StreamInit`, then stream is waiting for acknowledgement message
* `ackMessage` from the given actor which means that it is ready to process
* elements. It also requires `ackMessage` message after each stream element
* to make backpressure work. Stream elements are wrapped inside `StreamElementIn(elem)` messages.
*
* The target actor can emit elements at any time by sending a `StreamElementOut(elem)` message, which will
* be emitted downstream when there is demand.
*
* If the target actor terminates the stage will fail with a WatchedActorTerminatedException.
* When the stream is completed successfully a `StreamCompleted` message
* will be sent to the destination actor.
* When the stream is completed with failure a `StreamFailed(ex)` message will be send to the destination actor.
*/
class ActorRefBackpressureFlowStage[In, Out](private val flowActor: ActorRef) extends GraphStage[FlowShape[In, Out]] {
import ActorRefBackpressureFlowStage._
val in: Inlet[In] = Inlet("ActorFlowIn")
val out: Outlet[Out] = Outlet("ActorFlowOut")
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
private lazy val self = getStageActor {
case (_, StreamAck) =>
if(firstPullReceived) {
if (!isClosed(in) && !hasBeenPulled(in)) {
pull(in)
}
} else {
pullOnFirstPullReceived = true
}
case (_, StreamElementOut(elemOut)) =>
val elem = elemOut.asInstanceOf[Out]
emit(out, elem)
case (_, Terminated(targetRef)) =>
failStage(new WatchedActorTerminatedException("ActorRefBackpressureFlowStage", targetRef))
case (actorRef, unexpected) =>
failStage(new IllegalStateException(s"Unexpected message: `$unexpected` received from actor `$actorRef`."))
}
var firstPullReceived: Boolean = false
var pullOnFirstPullReceived: Boolean = false
override def preStart(): Unit = {
//initialize stage actor and watch flow actor.
self.watch(flowActor)
tellFlowActor(StreamInit)
}
setHandler(in, new InHandler {
override def onPush(): Unit = {
val elementIn = grab(in)
tellFlowActor(StreamElementIn(elementIn))
}
override def onUpstreamFailure(ex: Throwable): Unit = {
tellFlowActor(StreamFailed(ex))
super.onUpstreamFailure(ex)
}
override def onUpstreamFinish(): Unit = {
tellFlowActor(StreamCompleted)
super.onUpstreamFinish()
}
})
setHandler(out, new OutHandler {
override def onPull(): Unit = {
if(!firstPullReceived) {
firstPullReceived = true
if(pullOnFirstPullReceived) {
if (!isClosed(in) && !hasBeenPulled(in)) {
pull(in)
}
}
}
}
override def onDownstreamFinish(): Unit = {
tellFlowActor(StreamCompleted)
super.onDownstreamFinish()
}
})
private def tellFlowActor(message: Any): Unit = {
flowActor.tell(message, self.ref)
}
}
override def shape: FlowShape[In, Out] = FlowShape(in, out)
}
In Akka Stream 2.4.2, PushStage has been deprecated. For Streams 2.0.3 I was using the solution from this answer:
How does one close an Akka stream?
which was:
import akka.stream.stage._
val closeStage = new PushStage[Tpe, Tpe] {
override def onPush(elem: Tpe, ctx: Context[Tpe]) = elem match {
case elem if shouldCloseStream ⇒
// println("stream closed")
ctx.finish()
case elem ⇒
ctx.push(elem)
}
}
How would I close a stream in 2.4.2 immediately, from inside a GraphStage / onPush() ?
Use something like this:
val closeStage = new GraphStage[FlowShape[Tpe, Tpe]] {
val in = Inlet[Tpe]("closeStage.in")
val out = Outlet[Tpe]("closeStage.out")
override val shape = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
setHandler(in, new InHandler {
override def onPush() = grab(in) match {
case elem if shouldCloseStream ⇒
// println("stream closed")
completeStage()
case msg ⇒
push(out, msg)
}
})
setHandler(out, new OutHandler {
override def onPull() = pull(in)
})
}
}
It is more verbose but one the one side one can define this logic in a reusable way and on the other side one no longer has to worry about differences between the stream elements because the GraphStage can be handled in the same way as a flow would be handled:
val flow: Flow[Tpe] = ???
val newFlow = flow.via(closeStage)
Posting for other people's reference. sschaef's answer is correct procedurally, but the connections was kept open for a minute and eventually would time out and throw a "no activity" exception, closing the connection.
In reading the docs further, I noticed that the connection was closed when all upstreams flows completed. In my case, I had more than one upstream.
For my particular use case, the fix was to add eagerComplete=true to close stream as soon as any (rather than all) upstream completes. Something like:
... = builder.add(Merge[MyObj](3,eagerComplete = true))
Hope this helps someone.