Akka Stream - Splitting flow into multiple Sources - scala

I have a TCP connection in Akka Stream that ends in a Sink. Right now all messages go into one Sink. I want to split the stream into an unknown number of Sinks given some function.
The use case is as follows: from the TCP connection I get a continuous stream of something like List[DeltaValue], and I now want to create an actor Sink for each DeltaValue.id so that I can continuously accumulate and implement behaviour per DeltaValue.id. I find this to be a standard use case in stream processing, but I'm not able to find a good example with Akka Streams.
This is what I have right now:
def connect(): ActorRef = tcpConnection
  // SOMEHOW SPLIT HERE and create a ReceiverActor for each message
  .to(Sink.actorRef(system.actorOf(ReceiverActor.props(), ReceiverActor.name), akka.Done))
  .run()
Update:
I now have this; it does not feel super stable, but it should work:
private def spawnActorOrSendMessage(m: ResponseMessage): Unit = {
  implicit val timeout = Timeout(FiniteDuration(1, TimeUnit.SECONDS))
  system.actorSelection("user/" + m.id.toString).resolveOne().onComplete {
    case Success(actorRef) => actorRef ! m
    case Failure(ex)       => system.actorOf(ReceiverActor.props(), m.id.toString) ! m
  }
}

def connect(): ActorRef = tcpConnection
  .to(Sink.foreachParallel(10)(spawnActorOrSendMessage))
  .run()

The below should be a somewhat improved version of what was updated in the question. The main improvement is that your actors are kept in a data structure to avoid actorSelection resolution for every incoming message.
case class DeltaValue(id: String, value: Double)

val src: Source[DeltaValue, NotUsed] = ???

src.runFold(Map[String, ActorRef]()) {
  case (actors, elem) if actors.contains(elem.id) ⇒
    actors(elem.id) ! elem.value
    actors
  case (actors, elem) ⇒
    // name the new actor after the id, so each id gets a distinct actor name
    val newActor = system.actorOf(ReceiverActor.props(), elem.id)
    newActor ! elem.value
    actors.updated(elem.id, newActor)
}
Keep in mind that, when you integrate Akka Streams with bare actors, you lose backpressure support. This is one of the reasons why you should try to implement your logic within the boundaries of Akka Streams whenever possible. That said, this is not always possible - e.g. when remoting is needed.
In your case, you could consider leveraging groupBy and the concept of substreams. The example below folds the elements of each substream by summing them, just to give an idea:
src
  .groupBy(maxSubstreams = Int.MaxValue, f = _.id)
  .fold("" → 0d) {
    case ((id, acc), delta) ⇒ id → (delta.value + acc)
  }
  .mergeSubstreams
  .runForeach(println)
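If what you ultimately need is a separate Sink per id rather than a merged stream, each substream can also be attached to its own Sink. A minimal sketch, assuming the src defined above (the println stands in for whatever per-id accumulation you implement):
src
  .groupBy(maxSubstreams = Int.MaxValue, f = _.id)
  .to(Sink.foreach[DeltaValue] { delta =>
    // each substream (one per id) gets its own copy of this Sink
    println(s"${delta.id}: ${delta.value}")
  })
  .run()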

EventStream
You can send messages to the ActorSystem's EventStream within a stream sink and separately have the Actors subscribe to the stream.
Split At Stream Level
You can split the stream at the stream level using Broadcast. The documentation has a good example of this.
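A minimal sketch of the idea, assuming the src and DeltaValue from the example above and two illustrative sinks:
RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val bcast = b.add(Broadcast[DeltaValue](2))
  src ~> bcast.in
  bcast.out(0) ~> Sink.foreach[DeltaValue](d => println(s"sink A: $d"))
  bcast.out(1) ~> Sink.foreach[DeltaValue](d => println(s"sink B: $d"))
  ClosedShape
}).run()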
Split At Actor Level
You could also use Sink.actorRef in combination with a BroadcastPool to broadcast the messages to multiple Actors.
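A hedged sketch of that combination, assuming ReceiverActor from the question (the pool size of 5 is arbitrary):
val broadcastingPool: ActorRef =
  system.actorOf(akka.routing.BroadcastPool(5).props(ReceiverActor.props()))

src
  .to(Sink.actorRef(broadcastingPool, akka.Done))
  .run()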

Related

Akka Streams, break tuple item apart?

Using the superPool from akka-http, I have a stream that passes down a tuple. I would like to pipeline it to the Alpakka Google Pub/Sub connector. At the end of the HTTP processing, I encode everything for the pub/sub connector and end up with
(PublishRequest, Long) // long is a timestamp
but the interface of the connector is
Flow[PublishRequest, Seq[String], NotUsed]
A first approach is to simply drop one part:
.map{ case(publishRequest, timestamp) => publishRequest }
.via(publishFlow)
Is there an elegant way to create this pipeline while keeping the Long information?
EDIT: added my not-so-elegant solution in the answers. More answers welcome.
I don't see anything inelegant about your solution using GraphDSL.create(), which I think has the advantage of visualizing the stream structure via the diagrammatic ~> clauses. I do see a problem in your code, though: I don't think publisher should be defined by add-ing a flow to the builder.
Below is a skeletal version (briefly tested) of what I believe publishAndRecombine should look like:
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] = ???

val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val bcast  = b.add(Broadcast[(PublishRequest, Long)](2))
  val zipper = b.add(Zip[Seq[String], Long])

  val publisher = Flow[(PublishRequest, Long)]
    .map { case (pr, _) => pr }
    .via(publishFlow)
  val timestamp = Flow[(PublishRequest, Long)]
    .map { case (_, ts) => ts }

  bcast.out(0) ~> publisher ~> zipper.in0
  bcast.out(1) ~> timestamp ~> zipper.in1

  FlowShape(bcast.in, zipper.out)
})
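For completeness, a hypothetical usage of the resulting flow (the requests source name is illustrative):
val requests: Source[(PublishRequest, Long), NotUsed] = ???

requests
  .via(publishAndRecombine) // emits (Seq[String], Long)
  .runForeach { case (ids, ts) => println(s"published $ids at timestamp $ts") }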
There is now a much nicer solution for this which will be released in Akka 2.6.19 (see https://github.com/akka/akka/pull/31123).
In order to use the aforementioned unsafeViaData you would first have to represent (PublishRequest, Long) using FlowWithContext/SourceWithContext. FlowWithContext/SourceWithContext is an abstraction that was specifically designed to solve this problem (see https://doc.akka.io/docs/akka/current/stream/stream-context.html). The problem being that you have a stream with a data part that is typically what you want to operate on (in your case the PublishRequest) and then you have the context (aka metadata) part which you typically just pass along unmodified (in your case the Long).
So in the end you would have something like this
val myFlow: FlowWithContext[PublishRequest, Long, PublishRequest, Long, NotUsed] =
  FlowWithContext.fromTuples(originalFlowAsTuple) // original flow that has `(PublishRequest, Long)` as its output

myFlow.unsafeViaData(publishFlow)
In contrast to the GraphDSL-based answer to Akka Streams, break tuple item apart?, not only does this solution involve much less boilerplate since it's part of Akka, but it also retains the materialized value rather than always ending up with NotUsed.
For the people wondering why the method unsafeViaData has unsafe in the name, it's because the Flow that you pass into this method cannot add, drop or reorder any of the elements in the stream (doing so would mean that the context no longer properly corresponds to the data part of the stream). Ideally we would use Scala's type system to catch such errors at compile time, but doing so would require a lot of changes to akka-stream, especially if the changes need to remain backwards compatible (which, when dealing with Akka, they do). More details are in the PR mentioned earlier.
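To illustrate the data/context split with just the stable API, here is a sketch; tupleSource is assumed to be the Source[(PublishRequest, Long), NotUsed] coming out of your HTTP processing:
val withTimestamps: SourceWithContext[PublishRequest, Long, NotUsed] =
  SourceWithContext.fromTuples(tupleSource)

withTimestamps
  .map(identity) // operators act on the PublishRequest part; the Long context is carried along untouched
  .asSource      // back to a plain Source[(PublishRequest, Long), NotUsed]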
My not-so-elegant solution uses a custom flow that recombines things:
val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val bc = b.add(Broadcast[(PublishRequest, Long)](2))
  val publisher = b.add(Flow[(PublishRequest, Long)]
    .map { case (pr, _) => pr }
    .via(publishFlow))
  val zipper = b.add(Zip[Seq[String], Long])

  bc.out(0) ~> publisher ~> zipper.in0
  bc.out(1).map { case (pr, long) => long } ~> zipper.in1

  FlowShape(bc.in, zipper.out)
})

How to backpressure an ActorPublisher

I'm writing a few samples to understand Akka Streams and backpressure. I'm trying to see how a slow consumer backpressures an ActorPublisher.
My code is as follows.
class DataPublisher extends ActorPublisher[Int] {
  import akka.stream.actor.ActorPublisherMessage._

  var items: List[Int] = List.empty

  def receive = {
    case s: String =>
      println(s"Producer buffer size ${items.size}")
      if (totalDemand == 0)
        items = items :+ s.toInt
      else
        onNext(s.toInt)

    case Request(demand) =>
      if (demand > items.size) {
        items foreach (onNext)
        items = List.empty
      } else {
        val (send, keep) = items.splitAt(demand.toInt)
        items = keep
        send foreach (onNext)
      }

    case other =>
      println(s"got other $other")
  }
}
and
Source.fromPublisher(ActorPublisher[Int](dataPublisherRef)).runWith(sink)
Here the sink is a Subscriber with a sleep to emulate a slow consumer, and the publisher keeps producing data regardless.
--EDIT--
My question is: when the demand is 0, my code programmatically buffers data. How can I make use of backpressure to slow down the publisher instead?
Something like
throttledSource().buffer(10, OverflowStrategy.backpressure).runWith(throttledSink())
This does not affect the publisher, and its buffer keeps growing.
Thanks,
Sajith
Don't use ActorPublisher
Firstly, don't use ActorPublisher - it is a very low-level and deprecated API. We decided to deprecate it because users should not be working at such a low level of abstraction in Akka Streams.
One of the tricky things is exactly what you're asking about -- handling backpressure is entirely in the hands of the developer writing the ActorPublisher if they use this API. So you have to receive the Request(n) messages and make sure that you never signal more elements than you got requests for. This behaviour is specified in the Reactive Streams Specification which you then have to implement correctly. Basically, you're exposed to all the complexities of Reactive Streams (which is a full specification, with many edge cases -- disclaimer: I was/am part of developing Reactive Streams as well as Akka Streams).
Showing how back-pressure manifests in GraphStage
Secondly, to build custom stages you should be using the API designed for it: GraphStage. Please note that such a stage is also pretty low-level. Normally users of Akka Streams don't need to write custom stages; however, it is absolutely expected and fine to write your own stages if they implement some logic that the built-in stages don't provide.
Here's a simplified Filter implementation from the Akka codebase:
case class Filter[T](p: T ⇒ Boolean) extends SimpleLinearGraphStage[T] {
  override def initialAttributes: Attributes = DefaultAttributes.filter

  override def toString: String = "Filter"

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler with InHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        if (p(elem)) push(out, elem)
        else pull(in)
      }

      // this method will NOT be called if the downstream has not signalled enough demand!
      // this method NOT being called is how back-pressure manifests in stages
      override def onPull(): Unit = pull(in)

      setHandlers(in, out, this)
    }
}
As you can see, instead of implementing the entire Reactive Streams logic and rules yourself (which is hard), you get simple callbacks like onPush and onPull. Akka Streams handles the demand management: it will automatically call onPull if the downstream has signalled demand, and it will NOT call it if there is no demand -- which would mean the downstream is applying backpressure to this stage.
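If you want to experiment with this yourself, below is a self-contained sketch of the same idea that avoids the Akka-internal helpers used above (SimpleLinearGraphStage, DefaultAttributes); MyFilter is just an illustrative name:
import akka.stream.{ Attributes, FlowShape, Inlet, Outlet }
import akka.stream.stage.{ GraphStage, GraphStageLogic, InHandler, OutHandler }

final class MyFilter[T](p: T => Boolean) extends GraphStage[FlowShape[T, T]] {
  val in: Inlet[T] = Inlet("MyFilter.in")
  val out: Outlet[T] = Outlet("MyFilter.out")
  override val shape: FlowShape[T, T] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        if (p(elem)) push(out, elem) else pull(in)
      }
      // only ever invoked when the downstream has signalled demand
      override def onPull(): Unit = pull(in)
      setHandlers(in, out, this)
    }
}

// usage: Source(1 to 10).via(new MyFilter[Int](_ % 2 == 0)).runForeach(println)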
This can be accomplished with an intermediate Flow.buffer:
val flowBuffer = Flow[Int].buffer(10, OverflowStrategy.dropHead)

Source
  .fromPublisher(ActorPublisher[Int](dataPublisherRef))
  .via(flowBuffer)
  .runWith(sink)
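If the goal is specifically to slow the publisher down rather than drop elements, OverflowStrategy.backpressure is the strategy that stops requesting from upstream once the buffer is full; the only change would be:
val flowBuffer = Flow[Int].buffer(10, OverflowStrategy.backpressure)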

How to create an Akka flow with backpressure and Control

I need to create a function with the following Interface:
import akka.kafka.scaladsl.Consumer.Control
object ItemConversionFlow {
  def build(config: StreamConfig): Flow[Item, OtherItem, Control] = {
    // Implementation goes here
  }
}
My problem is that I don't know how to define the flow in a way that it fits the interface above.
When I am doing something like this
val flow = Flow[Item]
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
the resulting type is Flow[Item, OtherItem, NotUsed]. I haven't found anything in the Akka documentation so far. Also, the functions on akka.stream.scaladsl.Flow only offer NotUsed instead of Control. It would be great if someone could point me in the right direction.
Some background: I need to set up several pipelines which differ only in the conversion part. These pipelines are substreams of a main stream which might be stopped for some reason (a corresponding message arrives on some Kafka topic). Therefore I need the Control part. The idea is to create a graph template where I just insert the mentioned flow as an argument (a factory returning it). For a specific case we have a solution that works; to generalize it I need this kind of flow.
You actually have backpressure. However, think about what you really need from backpressure: you are not using asynchronous stages to increase your throughput, for example. Backpressure prevents fast producers from overwhelming subscribers (see https://doc.akka.io/docs/akka/2.5/stream/stream-rate.html). In your sample, don't worry about it; your stream will ask the publisher for new elements depending on how long doConversion takes to complete.
In case you want to obtain the result of the stream, use toMat or viaMat. For example, if your stream emits Item and transforms these into OtherItem:
val str = Source.fromIterator(() => List(Item(Some(1))).toIterator)
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
  .toMat(Sink.fold(List[OtherItem]())((a, b) => {
    // Examine the result of your stream
    b :: a
  }))(Keep.right)
  .run()
str will be Future[List[OtherItem]]. Try to extrapolate this to your case.
Or use viaMat with KillSwitches.single, quoting its documentation: "Creates a new [[Graph]] of [[FlowShape]] that materializes to an external switch that allows external completion of that unique materialization. Different materializations result in different, independent switches."
def build(config: StreamConfig): Flow[Item, OtherItem, UniqueKillSwitch] = {
  Flow[Item]
    .map(item => doConversion(item))
    .filter(_.isDefined)
    .map(_.get)
    .viaMat(KillSwitches.single)(Keep.right)
}
val stream =
  Source.fromIterator(() => List(Item(Some(1))).toIterator)
    .viaMat(build(StreamConfig(1)))(Keep.right)
    .toMat(Sink.ignore)(Keep.both)
    .run()

// This stops the stream
stream._1.shutdown()

// When it finishes
stream._2 onComplete (_ => println("Done"))

Testing Akka Reactive Streams

I'm testing code which streams messages over an outgoing stream TCP connection obtained via:
(IO(StreamTcp) ? StreamTcp.Connect(settings, address))
  .mapTo[StreamTcp.OutgoingTcpConnection]
  .map(_.outputStream)
In my tests, I substitute the resulting Subscriber[ByteString] with a dummy subscriber, trigger some outgoing messages, and assert that they have arrived as expected. I use the method below to produce the dummy subscriber and stream result future. (So far, so good)
def testSubscriber[T](settings: FlowMaterializer)(implicit ec: ExecutionContext): (Subscriber[T], Future[Seq[T]]) = {
  var sent = Seq.empty[T]
  val (subscriber, streamComplete) =
    Duct[T].foreach(bs => sent = sent :+ bs)(settings)
  (subscriber, streamComplete.map(_ => sent))
}
My question is this: is there some canonical method for testing that streams output the expected values, something similar to Akka's TestActorRef? And if not, is there some library function similar to the above function?
Testing streams is possible with the Akka Streams TestKit (akka-stream-testkit).
Read about it here: http://doc.akka.io/docs/akka/current/scala/stream/stream-testkit.html
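For example, a minimal sketch with TestSink (assuming the akka-stream-testkit dependency is on the classpath and a recent Akka version where an implicit ActorSystem provides the materializer):
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import akka.stream.testkit.scaladsl.TestSink

implicit val system: ActorSystem = ActorSystem("stream-test")

Source(1 to 3)
  .runWith(TestSink.probe[Int]) // materializes a TestSubscriber.Probe[Int]
  .request(3)                   // the test signals demand explicitly
  .expectNext(1, 2, 3)
  .expectComplete()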

Processing concurrently in Scala

As in my own answer to my own question, I have a situation whereby I am processing a large number of events which arrive on a queue. Each event is handled in exactly the same manner and each event can be handled independently of all other events.
My program takes advantage of the Scala concurrency framework and many of the processes involved are modelled as Actors. As Actors process their messages sequentially, they are not well-suited to this particular problem (even though my other actors are performing actions which are sequential). As I want Scala to "control" all thread creation (which I assume is the point of it having a concurrency system in the first place), it seems I have two choices:
1. Send the events to a pool of event processors, which I control
2. Get my Actor to process them concurrently by some other mechanism
I would have thought that #1 negates the point of using the actor subsystem: how many processor actors should I create? is one obvious question. These things are supposedly hidden from me and solved by the subsystem.
My answer was to do the following:
val eventProcessor = actor {
  loop {
    react {
      case MyEvent(x) =>
        // I want to be able to handle multiple events at the same time,
        // so create a new actor to handle this one
        actor {
          // processing code here
          process(x)
        }
    }
  }
}
Is there a better approach? Is this incorrect?
edit: A possibly better approach is:
val eventProcessor = actor {
  loop {
    react {
      case MyEvent(x) =>
        // pass processing to the underlying ForkJoin framework
        Scheduler.execute(process(x))
    }
  }
}
This seems like a duplicate of another question, so I'll duplicate my answer.
Actors process one message at a time. The classic pattern for processing multiple messages is to have one coordinator actor front for a pool of consumer actors. If you use react then the consumer pool can be large but will still only use a small number of JVM threads. Here's an example where I create a pool of 10 consumers and one coordinator to front for them.
import scala.actors.Actor
import scala.actors.Actor._

case class Request(sender: Actor, payload: String)
case class Ready(sender: Actor)
case class Result(result: String)
case object Stop

def consumer(n: Int) = actor {
  loop {
    react {
      case Ready(sender) =>
        sender ! Ready(self)
      case Request(sender, payload) =>
        println("request to consumer " + n + " with " + payload)
        // some silly computation so the process takes awhile
        val result = ((payload + payload + payload) map { case '0' => 'X'; case '1' => "-"; case c => c }).mkString
        sender ! Result(result)
        println("consumer " + n + " is done processing " + result)
      case Stop => exit
    }
  }
}

// a pool of 10 consumers
val consumers = for (n <- 1 to 10) yield consumer(n)

val coordinator = actor {
  loop {
    react {
      case msg @ Request(sender, payload) =>
        consumers foreach { _ ! Ready(self) }
        react {
          // send the request to the first available consumer
          case Ready(consumer) => consumer ! msg
        }
      case Stop =>
        consumers foreach { _ ! Stop }
        exit
    }
  }
}

// a little test loop - note that it's not doing anything with the results or telling the coordinator to stop
for (i <- 0 to 1000) coordinator ! Request(self, i.toString)
This code tests to see which consumer is available and sends a request to that consumer. Alternatives are to just randomly assign to consumers or to use a round robin scheduler.
Depending on what you are doing, you might be better served with Scala's Futures. For instance, if you don't really need actors then all of the above machinery could be written as
import scala.actors.Futures._

def transform(payload: String) = {
  val result = ((payload + payload + payload) map { case '0' => 'X'; case '1' => "-"; case c => c }).mkString
  println("transformed " + payload + " to " + result)
  result
}

val results = for (i <- 0 to 1000) yield future(transform(i.toString))
If the events can all be handled independently, why are they on a queue? Knowing nothing else about your design, this seems like an unnecessary step. If you could compose the process function with whatever is firing those events, you could potentially obviate the queue.
An actor is essentially a concurrent effect equipped with a queue. If you want to process multiple messages simultaneously, you don't really want an actor. You just want a function (Any => Unit) to be scheduled for execution at some convenient time.
Having said that, your approach is reasonable if you want to stay within the actors library and if the event queue is not within your control.
Scalaz makes a distinction between Actors and concurrent Effects. While its Actor is very light-weight, scalaz.concurrent.Effect is lighter still. Here's your code roughly translated to the Scalaz library:
val eventProcessor = effect { x => process(x) }
This is with the latest trunk head, not yet released.
This sounds like a simple consumer/producer problem. I'd use a queue with a pool of consumers. You could probably write this with a few lines of code using java.util.concurrent.
The purpose of an actor (well, one of them) is to ensure that the state within the actor can only be accessed by a single thread at a time. If the processing of a message doesn't depend on any mutable state within the actor, then it would probably be more appropriate to just submit a task to a scheduler or a thread pool to process. The extra abstraction that the actor provides is actually getting in your way.
There are convenient methods in scala.actors.Scheduler for this, or you could use an Executor from java.util.concurrent.
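A hedged sketch of the java.util.concurrent route (MyEvent, its x field, and process are taken from the question):
import java.util.concurrent.Executors

val pool = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

def handle(e: MyEvent): Unit =
  pool.submit(new Runnable {
    def run(): Unit = process(e.x) // each event is handled independently on a pool thread
  })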
Actors are much more lightweight than threads, and as such one other option is to use actor objects like Runnable objects you are used to submitting to a Thread Pool. The main difference is you do not need to worry about the ThreadPool - the thread pool is managed for you by the actor framework and is mostly a configuration concern.
def submit(e: MyEvent) = actor {
  // no loop - the actor exits immediately after processing the first message
  react {
    case MyEvent(x) =>
      process(x)
  }
} ! e // immediately send the new actor a message
Then to submit a message, say this:
submit(new MyEvent(x))
which corresponds to
eventProcessor ! new MyEvent(x)
from your question.
Tested this pattern successfully with 1 million messages sent and received in about 10 seconds on a quad-core i7 laptop.
Hope this helps.