Akka Streams: How do I model capacity/rate limiting within a system of 2 related streams? - scala

Let's say I have a pizza oven and a line of pizzas I need to bake. My oven only has the capacity to bake 4 pizzas at a time, and it's reasonable to expect that over the course of a day there are always at least 4 in the queue, so the oven needs to be at full capacity as often as possible.
Every time I put a pizza in the oven I set a timer on my phone. Once that goes off, I take the pizza out of the oven, give it to whoever wants it, and capacity becomes available.
I have 2 Sources here: one is the queue of pizzas to be cooked, and the other is the egg timer that goes off when a pizza has been cooked. There are also 2 Sinks in the system: one is the destination for the cooked pizza, and the other is a place to send confirmation that a pizza has been put into the oven.
I'm currently representing these very naively, as follows:
Source.fromIterator(() => pizzas)
  .map(putInOven) // puts in oven and sets a timer
  .runWith(Sink.actorRef(confirmationDest, EndSignal))

Source.fromIterator(() => timerAlerts)
  .map(removePizza)
  .runWith(Sink.actorRef(pizzaDest, EndSignal))
However, these two streams are currently completely independent of each other. The egg timer stream functions correctly, removing a pizza whenever an alert is collected, but it can't signal to the first flow that capacity has become available. In fact, the first flow has no concept of capacity at all, and will just try to cram pizzas into the oven as soon as they join the line.
What Akka concepts can be used to compose these flows in such a way that the first only takes pizzas from the queue when there's capacity, and the second can "alert" the first to a change in capacity when a pizza is removed from the oven?
My initial impression is to implement a flow graph like this:
              ┌─────────────┐
           ┌─>│CapacityAvail│>──┐
           │  └─────────────┘   │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
           │  ┌─────────────┐   ├──>│     Zip     │>─>│  PutInOven  │>─>│   Confirm   │
           │  │    Queue    │>──┘   └─────────────┘   └─────────────┘   └─────────────┘
           │  └─────────────┘
           │  ┌─────────────┐       ┌─────────────┐
           │  │    Done     │>──┬──>│  SendPizza  │
           │  └─────────────┘   │   └─────────────┘
           │                    v
           └────────────────────┘
The principle that underpins this is that there is a fixed number of CapacityAvailable objects populating the CapacityAvail Source. They're zipped with events that come into the pizza queue, meaning that if none are available, no pizza processing starts, as the zip operation will wait for them.
Then, once a pizza is done, a CapacityAvailable object is pushed back into the pool.
The main barrier I'm seeing to this implementation is that I'm not sure how to create and populate a pool for the CapacityAvail source, and I'm also not sure whether a Source can also be a Sink. Are there any Source/Sink/Flow types that would be a suitable implementation for this?

This use case does not generally map well to Akka Streams. Under the hood an Akka Stream is a reactive stream; from the documentation:
Akka Streams implementation uses the Reactive Streams interfaces internally to pass data between the different processing stages.
Your pizza example doesn't map well to streams because you have an external event that is just as much a source of demand as the sink of your stream. The fact that you openly state "the first flow has no concept of capacity at all" means that you aren't using streams for their intended purpose.
It is always possible to use some weird coding jiu-jitsu to awkwardly bend streams to solve a concurrency problem, but you'll likely have difficulties maintaining this code down the line. I recommend you consider using Futures, Actors, or plain old Threads as your concurrency mechanism. If your oven has infinite capacity to hold cooking pizzas then there's no need for streams to begin with.
I would also re-examine your entire design since you are using the passage of clock time as the signaler of demand (i.e. your "egg timer"). This usually indicates a flaw in the process design. If you can't get around this requirement then you should evaluate other design patterns:
Periodic Message Scheduling
Non Thread Block Timeouts
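As a rough illustration of those two patterns (not part of the original answer): the names ovenActor, CheckOven, Pizza and pizza below are placeholders, and an implicit ExecutionContext is assumed to be in scope.
import akka.pattern.after
import scala.concurrent.Future
import scala.concurrent.duration._

// Periodic message scheduling: poke an actor every 30 seconds.
system.scheduler.schedule(0.seconds, 30.seconds, ovenActor, CheckOven)

// Non-thread-blocking timeout: complete a Future after 10 minutes without tying up a thread.
val done: Future[Pizza] = after(10.minutes, system.scheduler)(Future.successful(pizza))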

You can represent the oven with a mapAsyncUnordered stage with parallelism = 4. Completion of the Future can come from a timer (http://doc.akka.io/docs/akka/2.4/scala/futures.html#After) or from deciding to take the pizza out of the oven for some other reason.
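To make that concrete, here is a rough sketch reusing the question's names (pizzas, putInOven, removePizza, pizzaDest, EndSignal); it assumes an implicit ActorSystem, materializer and ExecutionContext are in scope, and uses a fixed 10-minute bake time purely for illustration:
import scala.concurrent.Future
import scala.concurrent.duration._
import akka.pattern.after
import akka.stream.scaladsl.{Sink, Source}

Source.fromIterator(() => pizzas)
  .mapAsyncUnordered(parallelism = 4) { pizza =>   // at most 4 pizzas are "in the oven" at once
    val inOven = putInOven(pizza)                  // put it in and start cooking
    after(10.minutes, system.scheduler) {          // the Future completes when the timer fires
      Future.successful(removePizza(inOven))
    }
  }
  .runWith(Sink.actorRef(pizzaDest, EndSignal))    // cooked pizzas go to whoever wants them
Because mapAsyncUnordered only pulls a new pizza when one of its 4 slots frees up, the capacity bookkeeping from the question's diagram falls out for free.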

This is what I ended up using. It's pretty much an exact implementation of the faux-state machine in the question. The mechanics of Source.queue are much clumsier than I would have hoped, but it's otherwise pretty clean. The real sinks and sources are provided as parameters and are constructed elsewhere, so the actual implementation has a little less boilerplate than this.
RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._

  // Our capacity bucket. Can be refilled by passing CapacityAvailable objects
  // into capacitySrc. Can be consumed by using capacity as a Source.
  val (capacity, capacitySrc) =
    peekMatValue(Source.queue[CapacityAvailable.type](CONCURRENT_CAPACITY,
      OverflowStrategy.fail))

  // Set initial capacity
  capacitySrc.foreach(c =>
    Seq.fill(CONCURRENT_CAPACITY)(CapacityAvailable).foreach(c.offer))

  // Pull pizzas from the RabbitMQ queue
  val cookQ = RabbitSource(rabbitControl, channel(qos = CONCURRENT_CAPACITY),
    consume(queue("pizzas-to-cook")), body(as[TaskRun]))

  // Take the blocking events stream and turn it into a source
  // (blocking happens on a separate dispatcher)
  val cookEventsQ = Source.fromIterator(() => oven.events().asScala)
    .withAttributes(ActorAttributes.dispatcher("blocking-dispatcher"))

  // Split the events stream into two sources so 2 flows can be attached
  val bc = builder.add(Broadcast[PizzaEvent](2))

  // Zip pizzas with the capacity pool. Stops cooking pizzas when the oven is full.
  // When cooking starts, send the confirmation back to RabbitMQ.
  cookQ.zip(AckedSource(capacity)).map(_._1)
    .mapAsync(CONCURRENT_CAPACITY)(pizzaOven.cook)
    .map(Message.queue(_, "pizzas-started-cooking"))
    .acked ~> Sink.actorRef(rabbitControl, HostDied)

  // Send the cook events stream into two flows
  cookEventsQ ~> bc.in

  // The first tops up the capacity pool
  bc.out(0)
    .mapAsync(CONCURRENT_CAPACITY)(e =>
      capacitySrc.flatMap(cs => cs.offer(CapacityAvailable))
    ) ~> Sink.ignore

  // The second sends out cooked events
  bc.out(1)
    .map(p => Message.queue(Cooked(p.id()), "pizzas-cooked")
    ) ~> Sink.actorRef(rabbitControl, HostDied)

  ClosedShape
}).run()
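The peekMatValue helper used above is not defined in this snippet; a commonly used definition (an assumption here, not part of the original answer) exposes the materialized SourceQueue through a Future so it can be offered to from outside the graph:
import akka.stream.scaladsl.Source
import scala.concurrent.{Future, Promise}

def peekMatValue[T, M](src: Source[T, M]): (Source[T, M], Future[M]) = {
  val p = Promise[M]()
  // Complete the promise with the materialized value as a side effect of materialization
  val s = src.mapMaterializedValue { m => p.trySuccess(m); m }
  (s, p.future)
}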

Related

How can an event-sourced entity subscribe to state changes in another entity?

I have an event-sourced entity (C) that needs to change its state in response to state changes in another entity of a different type (P). The logic for deciding whether the state of C should actually change is quite complex, and the data to compute that lives in C; moreover, many instances of C should listen to one instance of P, and the set of instances increases over time, so I'd rather have them pull from a stream knowing the ID of P than have P keep track of the IDs of all the Cs and push to them.
I am thinking of doing something like this:
1. Tag a projection of P's events.
2. Have a Subscribe(P.id) command that gets sent to C.
3. If C is not already subscribing to a P (it can only subscribe to one, and that shouldn't change), fire an event Subscribed(P.id).
4. In response to that event, use Akka Persistence Query to materialize the stream of events tagged in step 1, map them to commands, and run the stream asynchronously with a sink that sends them to my event-sourced entity reference.
Running a stream in the event handler seems a bit like an anti-pattern. I am wondering if there's a better/more supported way to do this without the upstream having to know about the downstream. I decided against Akka pub-sub because it does at-most-once delivery, and I'd like to avoid using Kafka if possible.
You definitely don't want to run the stream in the event handler: the event handler should never side effect.
Assuming that you would like a C to get events from times when that C was not running (including before that C had ever run), this suggests that a stream should be run for each C. Since the subscription will be to one particular P, I'd seriously consider not tagging, but instead using the eventsByPersistenceId stream to get all the events of a P and ignoring the ones that aren't of interest. In the stream, you translate those events into commands in C's API, include the offset in P's event stream with each command, and send it to C (for at-least-once delivery, a mapAsync with an ask is useful). C persists an event recording that it processed the offset; this makes the command idempotent, as C can simply acknowledge the command if the offset is less than or equal to the high-water offset in its state.
This stream gets kicked off by the command handler after successfully persisting a Subscribed(P.id) event (in that case starting from offset 0), and kicked off again after the persistent actor is rehydrated if its state shows it's subscribed (in that case starting from one plus the high-water offset).
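As a rough sketch of the per-C stream described above (readJournal, pId, startFrom, isInteresting, toCommand and cRef are assumptions, not from the question; the exact query plugin, C's command protocol, and the implicit materializer/ExecutionContext are left open):
import akka.pattern.ask
import akka.stream.scaladsl.Sink
import akka.util.Timeout
import scala.concurrent.duration._

implicit val timeout: Timeout = 5.seconds

readJournal
  .eventsByPersistenceId(pId, fromSequenceNr = startFrom, toSequenceNr = Long.MaxValue)
  .filter(env => isInteresting(env.event))          // ignore P's events that C doesn't care about
  .map(env => toCommand(env.event, env.sequenceNr)) // carry P's offset inside the command
  .mapAsync(parallelism = 1)(cmd => cRef ? cmd)     // ask gives at-least-once delivery with backpressure
  .runWith(Sink.ignore)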
The rationale for not using tagging here arises from the assumption that the number of events C isn't interested in is smaller than the number of tagged events from Ps that C isn't subscribed to (note that for most persistence plugins, the more tags there are, the more overhead there is: a tag which is only used by one particular instance of an entity is often not a good idea). If the tag in question is rarely seen, this assumption might not hold, and eventsByTag with filtering by ID could be useful.
This does of course have the downside of running a discrete stream for every C. Depending on how many Cs are subscribed to a given P, the overhead may be substantial, and the streams for subscribers which are already caught up will be especially wasteful. In that scenario, responsibility for delivering commands to the subscribed Cs of a given P can be moved to an actor: the only real change is that where C would run the stream itself, it instead confirms that it is subscribed by asking the actor that feeds events from that P. Because this approach is a marked step up in complexity (especially around managing when Cs join and drop out of the shared "caught-up" stream), I'd tend to recommend starting with the stream-per-C approach and moving to the shared stream later; the transition isn't difficult, since from C's perspective there's no real difference between adapted commands coming from a stream it started and commands coming from a stream run by some other actor. It's also worth noting that there can be multiple shared streams: in fact, I'd tend to have shared streams be per-ActorSystem (e.g. a "node singleton" per P of interest) so as not to involve remoting.

Akka synchronizing timestamped messages from several actors

Imagine the following architecture. There is an Akka actor that receives push messages via a websocket. They have a timestamp, and the interval between those timestamps is 1 minute, though a message with the same timestamp can arrive multiple times via the websocket. These messages are then broadcast to, for example, three further actors (ma). They calculate metrics and push the messages on to one actor (c).
For ma I defined a TimeSeriesBuffer that allows writing to the buffer only if entities have consecutive timestamps. After a successful push to the buffer, the mas emit metrics, which go to c. c can only change its state when it has all three metrics. Therefore I defined a trait Synchronizable and then a SynchronizableTimeSeriesBuffer with a "master-slave" architecture.
On each push to every buffer, a check is triggered to see whether all three SynchronizableTimeSeriesBuffers have new elements with the same timestamp that can be emitted on to c as a single message.
So here are the questions:
1) Is this solution too complicated?
2) Is there a better way to do it in terms of Scala and Akka?
3) Why is the system not as fast or as parallel when the messages, instead of being received one by one, are loaded from the DB in a big batch and fed into the system to backtest the metrics? (One of the buffers fills much faster than the others, while another stays at length 0.) I have an assumption it has something to do with Akka's dispatcher/mailbox settings.
I created a gist with regarding code:
https://gist.github.com/ifif14/18b5f85cd638af7023462227cd595a2f
I would much appreciate the community's help in solving this nontrivial case.
Thanks in advance
Igor
Simplification
It seems like much of your architecture is designed to ensure that your messages are sequentially ordered in time. Why not just add a simple Actor at the beginning that filters out duplicated and out-of-order messages? Then the rest of your system can be relatively simple.
As an example, given a message with a timestamp:
type Payload = ???

case class Message(timestamp: Long, payload: Payload)

You can write the filter Actor:

class FilterActor(ma: Iterable[ActorRef]) extends Actor {
  var currentMaxTime = 0L

  override def receive = {
    case m: Message if m.timestamp > currentMaxTime =>
      currentMaxTime = m.timestamp // remember the newest timestamp seen so far
      ma foreach (_ ! m)
    case _ => // drop duplicates and out-of-order messages
  }
}
Now you can eliminate all of the "TimeSeriesBuffer" and "Synchronizable" logic, since you know that ma and c will only receive time-ordered messages.
Batch Processing
The likely reason why batch processing is not so concurrent is that the mailbox for your ma actor is being filled up by the database query, and whatever processing ma is doing is slower than the processing for c. Therefore ma's mailbox continues to accumulate messages while c's mailbox remains relatively empty.
Thanks so much for your answer. The cutting-off part is what I also implemented in the Synchronizable trait.
// Clean up slaves if their queue is behind the master's latest element
master_last_timestamp match {
  case Some(ts) =>
    slaves.foreach { s =>
      while (s.queue.length > 0 && s.getElementTimestamp(s.queue.front) < ts) {
        s.dequeue()
      }
      // val els = s.dequeueAll { queue_el => s.getElementTimestamp(queue_el) < ts }
    }
  case _ => ()
}
The reason why I started to implement the buffer is that I feel I will be using it a lot in the system, and I don't want to write this part for each actor I use. It seems easier to have a blueprint that does it.
But a more important reason is that, for some reason, one buffer is being filled much more slowly than the other two, or not at all, even though they are being filled by the same actors (just different instances, and the computation time should be pretty much the same). And only after the two other actors have emitted all the messages that were "passed" from the database does the third one start receiving them. It feels to me that this one actor is just not getting processor time, so I think there's a dispatcher setting that could affect this. Are you familiar with this?
Also, I would expect the dispatcher to work more like round-robin, giving each actor a little execution time, but it ends up serving only a limited number of actors and then jumping to the next ones, although they all receive their initial messages at roughly the same time since there is a broadcaster.
I read the Akka documentation on dispatchers and mailboxes, but I still don't understand how to do this.
Thank you
Igor

Adding a Flow for writing to a Kafka subscriber

I need to build the following graph:
val graph = getFromTopic1 ~> doSomeWork ~> writeToTopic2 ~> commitOffsetForTopic1
but trying to implement it in Reactive Kafka has sent me down a rabbit hole. That seems wrong, because this strikes me as a relatively common use case: I want to move data between Kafka topics while guaranteeing at-least-once delivery semantics.
Now it's no problem at all to write in parallel:
val fanOut = new Broadcast(2)
val graph = getFromTopic1 ~> doSomeWork ~> fanOut ~> writeToTopic2
                                           fanOut ~> commitOffsetForTopic1
This code works because writeToTopic2 can be implemented with ReactiveKafka#publish(..), which returns a Sink. But then I lose the at-least-once guarantee, and thus data, when my app crashes.
So what I really need is to write a Flow that writes to a Kafka topic. I have tried using Flow.fromSinkAndSource(..) with a custom GraphStage, but I run up against type issues for the data flowing through; for example, what gets committed in commitOffsetForTopic1 should not be included in writeToTopic2, meaning that I have to keep a wrapper object containing both pieces of data all the way through. But this conflicts with the requirement that writeToTopic2 accept a ProducerMessage[K,V]. My latest attempt to resolve this ran up against private and final classes in the Reactive Kafka library (extending/wrapping/replacing the underlying SubscriptionActor).
I don't really want to maintain a fork to make this happen. What am I missing? Why is this so hard? Am I somehow trying to build a pathological graph node or is this use case an oversight ... or is there something completely obvious I have somehow missed in the docs and source code I've been digging through?
Current version is 0.10.1. I can add more detailed information about any of my many attempts upon request.
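For what it's worth, the later akka-stream-kafka (Alpakka Kafka) API grew a producer Flow that carries a pass-through value alongside each record, which is essentially the "wrapper object" described above: the committable offset for topic1 rides along past the write to topic2 and is committed afterwards. A hedged sketch follows (API names may differ from the 0.10.1 version in the question; the topics, serializers, doSomeWork, and the implicit ActorSystem/materializer are illustrative assumptions):
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerMessage, ProducerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("topic1-to-topic2")

val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")

Consumer.committableSource(consumerSettings, Subscriptions.topics("topic1"))
  .map { msg =>
    ProducerMessage.Message(
      new ProducerRecord[String, String]("topic2", doSomeWork(msg.record.value)),
      msg.committableOffset)                                          // pass-through: topic1 offset
  }
  .via(Producer.flow(producerSettings))                               // write to topic2
  .mapAsync(1)(result => result.message.passThrough.commitScaladsl()) // then commit topic1
  .runWith(Sink.ignore)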

Akka actor pipeline and congested store actor

I am attempting to implement a message processing pipeline using actors. The steps of the pipeline include functions such as reading, filtering, augmentation and, finally, storage into a database.
Something similar to this: http://sujitpal.blogspot.nl/2013/12/akka-content-ingestion-pipeline-part-i.html
The issue is that the reading, filtering and augmentation steps are much faster than the storage step, which results in a congested store actor and an unreliable system.
I am considering the following option: have the store actor pull the messages that are processed and ready to store. Is this a good option? Better suggestions?
Thank you
You may consider several options:
If the order of messages doesn't matter, just execute every storage operation inside a separate actor (or Future); this makes all the data storage happen in parallel, and I recommend using a separate thread pool for it (a sketch of this option follows the list). If some messages are amendments to others, or participate in the same transaction, you may create a separate actor per messageId/transactionId to avoid pessimistic/optimistic locking problems (don't forget to kill such actors at the end of the transaction or after a timeout).
Use bounded mailboxes (back-pressure): new messages from your input will then be blocked while older ones are still unprocessed (for example, you may block the receiving thread until the message is acknowledged by the last actor in the chain). This moves the responsibility to the source system. It works well with JMS durables: messages are stored reliably on the JMS broker side until your system has finally processed them.
Combine the previous two.
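A minimal sketch of the first option (Database, ProcessedMessage and the "blocking-io-dispatcher" name are assumptions; the dispatcher would have to be defined in configuration):
import akka.actor.Actor
import scala.concurrent.{ExecutionContext, Future}

class StoreActor(db: Database) extends Actor {
  // A dedicated pool, so slow writes don't starve the default dispatcher
  implicit val blockingEc: ExecutionContext =
    context.system.dispatchers.lookup("blocking-io-dispatcher")

  def receive: Receive = {
    case msg: ProcessedMessage =>
      Future(db.store(msg)) // each write runs in parallel on the blocking pool
  }
}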
I am using an approach similar to this: Akka Work Pulling Pattern (source code here: WorkPullingPattern.scala). It has the advantage that it works both locally and with Akka Cluster. Plus, the whole approach is fully asynchronous, with no blocking at all.
If your processed "objects" won't all fit into memory, or one of the steps is slow, it is an awesome solution. If you spawn N workers, then N "tasks" will be processed at a time. It might be a good idea to also put the "steps" into BalancingPools with parallelism N (or less).
I have no idea whether your processing "pipeline" is sequential or not, but if it is, just a couple of hours ago I developed a type-safe abstraction based on the above plus the Shapeless library. A glimpse at the code, before it was merged with WorkPullingPattern, is here: Pipeline.
It takes any pipeline of functions (of properly matching signatures), spawns them in BalancingPools, creates Workers and links them to a master actor which can be used for scheduling the tasks.
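Creating such a BalancingPool is a one-liner; a small sketch (Worker here is an assumed actor that performs one pipeline step, not something from the linked code):
import akka.actor.{ActorSystem, Props}
import akka.routing.BalancingPool

val system = ActorSystem("pipeline")
// 4 workers sharing a single mailbox, so idle workers steal pending messages
val augmenters = system.actorOf(BalancingPool(4).props(Props[Worker]), "augmenters")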
The new Akka Streams (still in beta) has back-pressure. It's designed to solve this problem.
You could also use receive pipeline on actors:
class PipelinedActor extends Actor with ReceivePipeline {
  // Increment
  pipelineInner { case i: Int ⇒ Inner(i + 1) }
  // Double
  pipelineInner { case i: Int ⇒ Inner(i * 2) }

  def receive: Receive = { case any ⇒ println(any) }
}
actor ! 5 // prints 12 = (5 + 1) * 2
http://doc.akka.io/docs/akka/2.4/contrib/receive-pipeline.html
It suits your needs best, since you have small pipelining tasks before/after the actor processes the message. It is blocking code, but I believe that is fine for your case.

Why are messages received by an actor unordered?

I've been studying the actor model (specifically the implementation in Scala) but I can't understand why there's a requirement that messages arrive in no particular order.
It seems like there are at least some elegant, actor-based solutions to concurrency problems that would work if only the messages arrived in order (e.g. producer-consumer variants, deferred database writes, concurrency-safe caches).
So why don't actor messages arrive in order? Is it to permit efficient implementation or maybe to prevent some kind of deadlock that would arise when messages are ordered?
My impression is that if two threads send a message to an actor a, there is no particular guarantee about which will be received by the actor first. But if you have code that looks like
a ! "one"
a ! "two"
then a will always get "one" before "two" (though who knows what else might have arrived in between from other threads).
Thus, I don't think it is the case that messages arrive in no particular order at all. Multiple messages from within one thread will (as far as I can tell from the code or from experience) arrive in order.
I'm not privy to the reasons why Scala's Actors (those in the standard library, at any rate -- there are also Akka, Lift and Scalaz implementations of Actors) chose that particular implementation. Probably it was a copy of Erlang's own restrictions -- but without the guarantee for communication between two single threads. Or maybe with that guarantee as well -- I wish Philipp Haller were here to comment.
BUT, I do question your statement about concurrency problems. When studying asynchronous distributed algorithms, a basic tenet is that you can't guarantee any ordering of message receipt.
To quote Distributed Computing: Fundamentals, Simulation and Advanced Topics, by Hagit Attiya and Jennifer Welch,
A system is said to be asynchronous if there is no fixed upper bound on how long it takes for a message to be delivered or how much time elapses between consecutive steps of a processor.
The actor model is an asynchronous one. That enables it to work over distributed hardware -- be it different computers communicating through a network, or different processors on a system that does not provide synchronous guarantees.
Furthermore, even the multi-threading model on a multi-core processor is mostly asynchronous, with the primitives that enable synchronism being extremely expensive.
So a simple answer to the question might be:
Messages are not guaranteed to arrive in order because that's an underlying limitation of asynchronous systems, which is the basic model of computation used by actors.
This model is the one we actually have on any system distributed over TCP/IP, and the most efficient over i386/x64 multicore/multiprocessor hardware.
The following simple example shows messages arriving out of order to a very simple actor:
import scala.actors._
import scala.actors.Actor._
import scala.collection.mutable._

val adder = actor {
  loop {
    react {
      case x: Int => println(" Computing " + x); reply(x + 2)
      case Exit   => println("Exiting"); exit
    }
  }
}

actor {
  for (i <- 1 to 5) {
    println("Sending " + i)
    adder !! (i, { case answer => println("Computed " + i + " -> " + answer) })
  }
  println("Sending Exit")
  adder !! Exit
}
Here is the output from one run of the above code with Scala 2.9.0 final on Windows 64-bit with Sun JDK 1.6.0u25:
Sending 1
Sending 2
Sending 3
Sending 4
Sending 5
Sending Exit
Computing 1
Computed 1 -> 3
Computing 4
Computed 4 -> 6
Computing 3
Computed 3 -> 5
Exiting
What order would you choose? Should it be by when they were sent, or by when they were received? Should we freeze the entire mailbox while we sort the messages? Imagine sorting a large and nearly full mailbox; wouldn't that put an arbitrary lock on the queue? I think messages don't arrive in order because there is no guaranteed way to enforce such an order. We have latency in networks and between processors.
We have no idea where the messages are coming from, only that they have arrived. So how about this: we make the guarantee that we have no ordering and don't even try to think about ordering. Instead of having to come up with some impressive logic to keep things organized while remaining as contention-free as possible, we can just focus on keeping things as contention-free as possible.
Someone else probably has an even better answer than I on this.
Edit:
Now that I've had time to sleep on it, I think it's a stipulation that allows for a much more vibrant Actor ecosystem to be created. Why restrict an Actor to one thread, or to partial ownership of a thread from a thread pool? What if someone wanted an Actor which could grab as many threads as possible to process as many messages in its mailbox as it could?
If you made the stipulation up front that messages had to be processed in the order they arrived, you could never allow for this. The minute multiple threads could be assigned by an Actor to process messages within the mailbox, you'd have no control over which message was processed first.
Phew, what your dreams say about your mind as you sleep.