Akka synchronizing timestamped messages from several actors - scala

Imagine the following architecture. There is an actor in akka that receives push messages via websocket. They have a timestamp and interval between those timestamps is 1 minute. Though the messages with the same timestamp can arrive multiple times via websocket. And then this messages are being broadcasted to as example three further actors (ma). They calculate metrics and push the messages further to the one actor(c).
For ma I defined a TimeSeriesBuffer that allows writing to the buffer only if entities have consequent timestamps. After successfull push to the buffer ma's emit metrics, that go to the c. c can only change it's state when it has all three metrics. Therefore I defined a trait Synchronizable and then a SynchronizableTimeSeriesBuffer with "master-slave" architecture.
On each push to every buffer a check is triggered in order to understand if there are new elements in the buffers of all three SynchronizableTimeSeriesBuffer with the same timestamp that can be emitted further to c as a single message.
So here are the questions:
1) Is it too complicated of a solution?
2) Is there a better way to do it in terms of scala and akka?
3) Why is it not so fast and not so parallel when messages in the system instead of being received "one by one" are loaded from db in a big batch and fed to the system in order to backtest the metrics. (one of the buffers is filling much faster than the others, while other one is at 0 length). I have an assumption it has something to do with akka's settings regarding dispatching/mailbox.
I created a gist with regarding code:
https://gist.github.com/ifif14/18b5f85cd638af7023462227cd595a2f
I would much appreciate the community's help in solving this nontrivial case.
Thanks in advance
Igor

Simplification
It seems like much of your architecture is designed to ensure that your message are sequentially ordered in time. Why not just add a simple Actor at the beginning that filters out duplicated messages? Then the rest of your system could be relatively simple.
As an example; given a message with timestamp
type Payload = ???
case class Message(timestamp : Long, payload : Payload)
You can write the filter Actor:
class FilterActor(ma : Iterable[ActorRef]) extends Actor {
var currentMaxTime = 0L
override def receive = {
case m : Message if m.timestamp > currentMaxTime => ma foreach (_ ! m)
case _ =>
}
}
Now you can eliminate all of the "TimeSeriesBuffer" and "Synchronizable" logic since you know that ma, and c, will only receive time-ordered messages.
Batch Processing
The likely reason why batch processing is not so concurrent is because the mailbox for your ma Actor is being filled up by the database query and whatever processing it is doing is slower than the processing for c. Therefore ma's mailbox continues to accumulate messages while c's mailbox remains relatively empty.

Thanks so much for your answer. The part with cutting off is what I also implemented in Synchronizable Trait.
//clean up slaves. if their queue is behind masters latest element
master_last_timestamp match {
case Some(ts) => {
slaves.foreach { s =>
while ( s.queue.length > 0 && s.getElementTimestamp(s.queue.front) < ts ) {
s.dequeue()
}
// val els = s.dequeueAll { queue_el => s.getElementTimestamp(queue_el) < ts }
}
}
case _ => Unit
}
The reason why I started to implement the buffer is because I feel like I will be using it a lot in the system and I don't think to write this part for each actor I will be using. Seems easier to have a blueprint that does it.
But a more important reason is that for some reason one buffer is either being filled much slower or not at all than the other two. Though they are being filled by the same actors!! (just different instances, and computation time should be pretty much the same) And then after two other actors emitted all messages that were "passed" from the database the third one starts receiving it. It feels to me that this one actor is just not getting processor time. So I think it's a dispatcher's setting that can affect this. Are you familiar with this?
Also I would expect dispatcher work more like round-robin, given each process a little of execution time, but it ends up serving only limited amount of actors and then jumping to the next ones. Although they sort of have to receive initial messages at the same time since there is a broadcaster.
I read akka documentation on dispatchers and mailboxes, but I still don't understand how to do it.
Thank you
Igor

Related

Efficient processing of custom data by actors

I am an Akka newbie trying things out for a particular problem. I am trying to write code for an actor system which would efficiently process custom data coming from multiple clients in the form of events. By custom data, I mean, the content and structure of the data would vary between events from the same client (e.g., we might have instrumented to drop 5 events containing 5 different piece of information for the same client), and between events from different clients (e.g., we might be capturing completely different set of information from one client vs. another). I am wondering what would be a good way to use actor-based processing for this type of scenarios.
This are the alternatives what I have thought so far:
(A) I will write an actor which would load client-specific processor class through reflection, based on the client whose event is being processed. The client-specific processor class would contain logic corresponding to all the type of events that would be received for that client. I will initiate 'n' instances of this actor.
context.actorOf(Props[CustomEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 100)), name = "CustomProcessor")
(B) I will write actors for each client, each containing logic corresponding to all the type of events that would be received for that client. I will initiate 'n' instances of each of these actors.
context.actorOf(Props[CleintXEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 50)), name = "ClientXCustomProcessor")
context.actorOf(Props[CleintYEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 50)), name = "ClientYCustomProcessor")
At this point, I have a few questions:
Would [A] be slower compared to [B] becuase [A] is using reflection? I am assuming that once an actor instance has finished processing a particular event, it dies, so the next actor instance processing an event from the same client would have to start with loading the processor class again. Is this assumption correct?
Given a specific event flow pattern, would a system based on [B] have a heavier runtime memory footprint compared to [A] becuase now each actor for each client can have multiple instances of them in memory?
Any other way to approach this problem?
Thanks for any pointers.
Well,
It could be a bit slower, but I think not really noticeable. And no, you don't have to kill actors between events.
No, because single actor takes like 400 bytes in memory, so you can create a single actor for each event, not only one actor per client.
Yes, via Reactive Streams which I think is a bit clearer solution than actors, but Akka Streams are still experimental, and it may be a bit harder to learn than actors. But you'll have backpressure for free if its needed.

Akka actor pipeline and congested store actor

I am attempting to implement a message processing pipeline using actors. The steps of the pipeline include functions such as reading, filtering, augmentation and, finally, storage into a database.
Something similar to this: http://sujitpal.blogspot.nl/2013/12/akka-content-ingestion-pipeline-part-i.html
The issue is that the reading, filtering and augmentation steps are much faster than the storage step which results in having a congested store actor and an unreliable system.
I am considering the following option: have the store actor pull the processed and ready to store messages. Is this a good option? better suggestions?
Thank you
You may consider several options:
if order of messages doesn't matter - just execute every storage operation inside separate actor (or future). It will cause all data storage to be doing in parallel - I recommend to use separate thread pool for that. If some messages are amendments to others or participate in same transaction - you may create separate actors only for each messageId/transactionId to avoid pessimistic/optimistic lock problems (don't forget to kill such actors on transaction end or by timeout) .
use bounded mailboxes (back-pressure) - then you will block new messages from your input if older are still not processed (for example you may block the receiving thread til message will be acknowledged by last actor in the chain). It will move responsibility to source system. It's working pretty much good with JMS durables - messages are storing in reliable way on JMS-broker side til your system finally have them processed.
combine the previous two
I am using an approach similar to this: Akka Work Pulling Pattern (source code here: WorkPullingPattern.scala). It has the advantage that it works both locally & with Akka Cluster. Plus the whole approach is fully asynchronous, no blocking at all.
If your processed "objects" won't all fit into memory, or one of the steps is slow, it is an awesome solution. If you spawn N workers, then N "tasks" will be processed at one time. It might be a good idea to put the "steps" into BalancingPools also with parallelism N (or less).
I have no idea if your processing "pipeline" is sequential or not, but if it is, just a couple hours ago I have developed a type safe abstraction based on the above + Shapeless library. A glimpse at the code, before it was merged with WorkPullingPattern is here: Pipeline.
It takes any pipeline of functions (of properly matching signatures), spawns them in BalancingPools, creates Workers and links them to a master actor which can be used for scheduling the tasks.
The new AKKA stream (still in beta) has back pressure. It's designed to solve this problem.
You could also use receive pipeline on actors:
class PipelinedActor extends Actor with ReceivePipeline {
// Increment
pipelineInner { case i: Int ⇒ Inner(i + 1) }
// Double
pipelineInner { case i: Int ⇒ Inner(i * 2) }
def receive: Receive = { case any ⇒ println(any) }
}
actor ! 5 // prints 12 = (5 + 1) * 2
http://doc.akka.io/docs/akka/2.4/contrib/receive-pipeline.html
It suits your needs the best as you have small pipelining tasks before/after processing of the message by actor. Also it is blocking code but that is fine for your case, I believe

Scala, Actors, what happens to unread inbox messages?

What happens to unread inbox messages in Scala Actors? For example two cases:1.If forget to implement react case for special message: actor!NoReactCaseMessage2. If messages comes too fast: (timeOfProcessingMessage > timeOfMessageComes)
If first or second case happens, would it be stacked in memory?
EDIT 1 Is there any mechanism to see this type of memory leak is happening? Maybe, control number of unread messages then make some garbage collect or increase actor pool. How to get number of unread messages? How this sort of memory leaks solved in other languages? For example in Erlang?
The mailbox is a queue - if nothing is pulling the messages from the queue (i.e. if the partial function in your react or receive loop returns false for isDefinedAt), then the messages just stay there.
Strictly speaking this is a memory leak (of your application), although the seriousness of it is dependent on how these unread messages grow in number (obviously). For example, I often use actors to merge in a replay query and a "live stream" of messages both identified by a sequence number. My reaction looks like this:
var lastSeq = 0L
loop {
react {
case Msg(seq, data) if seq > lastSeq => lastSeq = seq; process(data)
}
}
This contains a memory leak, but not a "serious" one, as there will be an upper bound in the number of duplicate messages (i.e. there can be no more once the replay query is finished).
However, this may still be an annoyance, as for each reaction, the actor sub-system will scan those messages again to see whether they can be processed.
In fact, thinking about a real mailbox might be a good analogy here. Imagine you left all your junk mail in there: pretty soon, you'd be suffering starvation because of all the junk mail you would have to sift through in order to get to the credit card statement.
How this sort of memory leaks solved in other languages? For example in Erlang?
Same as in Scala. First issue:
If forget to implement react case for special message
You very rarely need to leave a message in the mailbox intentionally and receive it later. I haven't ever encountered such a situation. So you can always include a catch-all clause (case _ => ... in Scala), which will do nothing, log a warning, or throw an exception -- whatever makes most sense in your situation.
If messages comes too fast
Instead of sending messages directly to a process which can't handle them fast enough, add a buffering process. It can throw away extra messages, send them to more than one worker process, etc.

Why are messages received by an actor unordered?

I've been studying the actor model (specifically the implementation in Scala) but I can't understand why there's a requirement that messages arrive in no particular order.
It seems like there are at least some elegant, actor-based solutions to concurrency problems that would work if only the messages arrived in order (e.g. producer-consumer variants, deferred database writes, concurrency-safe caches).
So why don't actor messages arrive in order? Is it to permit efficient implementation or maybe to prevent some kind of deadlock that would arise when messages are ordered?
My impression is that if two threads send a message to an actor a, there is no particular guarantee about which will be received by the actor first. But if you have code that looks like
a ! "one"
a ! "two"
then a will always get "one" before "two" (though who knows what else might have arrived in between from other threads).
Thus, I don't think it is the case that messages arrive in no particular order at all. Multiple messages from within one thread will (as far as I can tell from the code or from experience) arrive in order.
I'm not privy to the reasons why Scala's Actors (those in the standard library, at any rate -- there are also Akka, Lift and Scalaz implementations of Actors) chose that particular implementation. Probably as a copy of Erlang's own restrictions -- but without the guarantees for communication between two single threads. Or maybe with that guarantee as well -- I wish Phillip Haller was here to comment.
BUT, I do question your statement about concurrency problems. When studying asynchronous distributed algorithms, a basic tenet is that you can't guarantee any ordering of message receipt.
To quote Distributed Computing: Fundamentals, Simulation and Advanced Topics, by Hagit Attiya and Jennifer Welch,
A system is said to be asynchronous if there is no fixed upper bound on how long it
takes for a message to be delivered or how much time elapses between consecutive
steps of a processor.
The actor model is an asynchronous one. That enables it to work over distributed hardware -- be it different computers communicating through a network, or different processors on a system that does not provide synchronous guarantees.
Furthermore, even the multi-threading model on a multi-core processor is mostly asynchronous, with the primitives that enable synchronism being extremely expensive.
So a simple answer to the question might be:
Messages are not guaranteed to arrive in order because that's an underlying limitation of asynchronous systems, which is the basic model of computation used by actors.
This model is the one we actually have on any system distributed over TCP/IP, and the most efficient over i386/x64 multicore/multiprocessor hardware.
The following simple example shows messages arriving out of order to a very simple actor:
import scala.actors._
import scala.actors.Actor._
import scala.collection.mutable._
val adder = actor {
loop {
react {
case x: Int => println(" Computing " + x); reply(x+2)
case Exit => println("Exiting"); exit
}
}
}
actor {
for (i <- 1 to 5) {
println("Sending " + i)
adder !! (i, { case answer => println("Computed " + i + " -> " + answer) })
}
println("Sending Exit")
adder !! Exit
}
Here is the output from one run of the above code with Scala 2.9.0 final on Windows 64-bit with Sun JDK 1.6.0u25:
Sending 1
Sending 2
Sending 3
Sending 4
Sending 5
Sending Exit
Computing 1
Computed 1 -> 3
Computing 4
Computed 4 -> 6
Computing 3
Computed 3 -> 5
Exiting
What order would you choose? Should it be by when they were sent or when they were recieved? Should we freeze the entire mailbox whilst we sort the messages? Imagine sorting a large and nearly full mailbox, wouldn't that put an arbitrary lock on the queue? I think the messages don't arrive in order because there is no guaranteed way to enforce such an order. We have latency in networks and between processors.
We have no idea where the messages are coming from, only that they have arrived. So how about this, we make the guarantee that we have no ordering and don't even try to think about ordering. Instead of having to come up with some impressive logic to keep things organized while remaining as contention-free as possible we can just focus on keeping things as contention-free as possible.
Someone else probably has an even better answer than I on this.
Edit:
Now that I've had time to sleep on it, I think it's a stipulation that allows for a much more vibrant Actor ecosystem to be created. Hence, why restrict one Actor or one thread or partial ownership of a thread from a thread pool? What if someone wanted to have an Actor which could grab as many threads as possible to process as many messages in its mailbox as it could?
If you made the stipulation up front that messages had to be done in the order they proceeded you'd never be able to allow for this. The minute multiple threads could be assigned by an Actor to process messages within the mailbox you'd be in the situation whereby you had no control over which message was processed first.
Phew, what your dreams say about your mind as you sleep.

An Actor "queue"?

In Java, to write a library that makes requests to a server, I usually implement some sort of dispatcher (not unlike the one found here in the Twitter4J library: http://github.com/yusuke/twitter4j/blob/master/twitter4j-core/src/main/java/twitter4j/internal/async/DispatcherImpl.java) to limit the number of connections, to perform asynchronous tasks, etc.
The idea is that N number of threads are created. A "Task" is queued and all threads are notified, and one of the threads, when it's ready, will pop an item from the queue, do the work, and then return to a waiting state. If all the threads are busy working on a Task, then the Task is just queued, and the next available thread will take it.
This keeps the max number of connections to N, and allows at most N Tasks to be operating at the same time.
I'm wondering what kind of system I can create with Actors that will accomplish the same thing? Is there a way to have N number of Actors, and when a new message is ready, pass it off to an Actor to handle it - and if all Actors are busy, just queue the message?
Akka Framework is designed to solve this kind of problems, and is exactly what you're looking for.
Look thru this docu - there're lots of highly configurable dispathers (event-based, thread-based, load-balanced, work-stealing, etc.) that manage actors mailboxes, and allow them to work in conjunction. You may also find interesting this blog post.
E.g. this code instantiates new Work Stealing Dispatcher based on the fixed thread pool, that fulfils load balancing among the actors it supervises:
val workStealingDispatcher = Dispatchers.newExecutorBasedEventDrivenWorkStealingDispatcher("pooled-dispatcher")
workStealingDispatcher
.withNewThreadPoolWithLinkedBlockingQueueWithUnboundedCapacity
.setCorePoolSize(16)
.buildThreadPool
Actor that uses the dispatcher:
class MyActor extends Actor {
messageDispatcher = workStealingDispatcher
def receive = {
case _ =>
}
}
Now, if you start 2+ instances of the actor, dispatcher will balance the load between the mailboxes (queues) of the actors (actor that has too much messages in the mailbox will "donate" some to the actors that has nothing to do).
Well, you have to see about the actors scheduler, as actors are not usually 1-to-1 with threads. The idea behind actors is that you may have many of them, but the actual number of threads will be limited to something reasonable. They are not supposed to be long running either, but rather quickly answering to messages they receive. In short, the architecture of that code seems to be wholly at odds with how one would design an actor system.
Still, each working actor may send a message to a Queue actor asking for the next task, and then loop back to react. This Queue actor would receive either queueing messages, or dequeuing messages. It could be designed like this:
val q: Queue[AnyRef] = new Queue[AnyRef]
loop {
react {
case Enqueue(d) => q enqueue d
case Dequeue(a) if q.nonEmpty => a ! (q dequeue)
}
}