Scala, Actors, what happens to unread inbox messages?

What happens to unread inbox messages in Scala Actors? For example, consider two cases:
1. You forget to implement a react case for a particular message: actor ! NoReactCaseMessage
2. Messages come in faster than they can be processed (timeOfProcessingMessage > timeOfMessageComes)
If either case happens, do these messages just pile up in memory?
EDIT 1: Is there any mechanism to detect that this kind of memory leak is happening? For example, could I monitor the number of unread messages and then garbage-collect them or grow the actor pool? How do I get the number of unread messages? And how is this sort of memory leak handled in other languages, for example in Erlang?

The mailbox is a queue - if nothing is pulling the messages from the queue (i.e. if the partial function in your react or receive loop returns false for isDefinedAt), then the messages just stay there.
Strictly speaking this is a memory leak (of your application), although the seriousness of it is dependent on how these unread messages grow in number (obviously). For example, I often use actors to merge in a replay query and a "live stream" of messages both identified by a sequence number. My reaction looks like this:
var lastSeq = 0L

loop {
  react {
    case Msg(seq, data) if seq > lastSeq => lastSeq = seq; process(data)
  }
}
This contains a memory leak, but not a "serious" one, as there will be an upper bound in the number of duplicate messages (i.e. there can be no more once the replay query is finished).
However, this may still be an annoyance, as for each reaction, the actor sub-system will scan those messages again to see whether they can be processed.
In fact, thinking about a real mailbox might be a good analogy here. Imagine you left all your junk mail in there: pretty soon, you'd be suffering starvation because of all the junk mail you would have to sift through in order to get to the credit card statement.

How is this sort of memory leak handled in other languages? For example in Erlang?
Same as in Scala. First issue:
If you forget to implement a react case for a particular message
You very rarely need to leave a message in the mailbox intentionally and receive it later. I haven't ever encountered such a situation. So you can always include a catch-all clause (case _ => ... in Scala), which will do nothing, log a warning, or throw an exception -- whatever makes most sense in your situation.
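A minimal sketch of such a catch-all using the old scala.actors API (the Work message type is made up purely for illustration):
import scala.actors.Actor._

case class Work(id: Int)   // hypothetical message type, just for the example

val worker = actor {
  loop {
    react {
      case Work(id) =>
        println("processing " + id)
      // Catch-all: log and drop anything unexpected so it never sits
      // in the mailbox forever.
      case unexpected =>
        println("WARN: dropping unexpected message: " + unexpected)
    }
  }
}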
If messages come in faster than they are processed
Instead of sending messages directly to a process which can't handle them fast enough, add a buffering process. It can throw away extra messages, send them to more than one worker process, etc.
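For instance, a rough sketch (again with scala.actors, and with made-up names) of a buffering actor that spreads incoming messages round-robin over several workers instead of letting one slow worker's mailbox grow without bound:
import scala.actors.Actor
import scala.actors.Actor._

// Sketch only: forward each incoming message to the next worker in turn.
class Buffer(workers: IndexedSeq[Actor]) extends Actor {
  private var next = 0
  def act() = loop {
    react {
      case msg =>
        workers(next) ! msg
        next = (next + 1) % workers.size
    }
  }
}
A variant that protects against overload could instead count outstanding messages and simply drop (or log) anything beyond a chosen capacity.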

Related

Akka synchronizing timestamped messages from several actors

Imagine the following architecture. There is an actor in Akka that receives push messages via a websocket. The messages carry a timestamp, and the interval between those timestamps is 1 minute, though messages with the same timestamp can arrive multiple times via the websocket. These messages are then broadcast to, for example, three further actors (ma). They calculate metrics and push the messages further to one actor (c).
For ma I defined a TimeSeriesBuffer that allows writing to the buffer only if entities have consecutive timestamps. After a successful push to the buffer, the ma's emit metrics, which go to c. c can only change its state when it has all three metrics. Therefore I defined a trait Synchronizable and then a SynchronizableTimeSeriesBuffer with a "master-slave" architecture.
On each push to every buffer a check is triggered in order to understand if there are new elements in the buffers of all three SynchronizableTimeSeriesBuffer with the same timestamp that can be emitted further to c as a single message.
So here are the questions:
1) Is it too complicated of a solution?
2) Is there a better way to do it in terms of Scala and Akka?
3) Why is it not so fast and not so parallel when, instead of being received "one by one", the messages are loaded from the db in a big batch and fed into the system in order to backtest the metrics? (One of the buffers fills much faster than the others, while another stays at length 0.) I have an assumption it has something to do with Akka's settings regarding dispatching/mailboxes.
I created a gist with the relevant code:
https://gist.github.com/ifif14/18b5f85cd638af7023462227cd595a2f
I would much appreciate the community's help in solving this nontrivial case.
Thanks in advance
Igor
Simplification
It seems like much of your architecture is designed to ensure that your messages are sequentially ordered in time. Why not just add a simple Actor at the beginning that filters out duplicated messages? Then the rest of your system could be relatively simple.
As an example, given a message with a timestamp:
type Payload = ???
case class Message(timestamp : Long, payload : Payload)
You can write the filter Actor:
class FilterActor(ma : Iterable[ActorRef]) extends Actor {
  var currentMaxTime = 0L
  override def receive = {
    case m : Message if m.timestamp > currentMaxTime =>
      currentMaxTime = m.timestamp   // remember the newest timestamp seen so far
      ma foreach (_ ! m)
    case _ =>                        // older or duplicate message: drop it
  }
}
Now you can eliminate all of the "TimeSeriesBuffer" and "Synchronizable" logic since you know that ma and c will only receive time-ordered messages.
Batch Processing
The likely reason why batch processing is not so concurrent is because the mailbox for your ma Actor is being filled up by the database query and whatever processing it is doing is slower than the processing for c. Therefore ma's mailbox continues to accumulate messages while c's mailbox remains relatively empty.
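If the imbalance comes from how threads are shared between actors, one knob worth experimenting with is the dispatcher's throughput setting, which controls how many messages an actor may process before its thread is handed to another actor. A hedged sketch (the dispatcher and actor names are made up, and whether this removes the imbalance depends on what the ma actors actually do per message):
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class MaActor extends Actor {            // stand-in for the metrics actor
  def receive = { case m => /* compute metrics */ }
}

val config = ConfigFactory.parseString(
  """
  fair-dispatcher {
    type = Dispatcher
    executor = "fork-join-executor"
    throughput = 1        // hand the thread back after every message
  }
  """)

val system = ActorSystem("metrics", config.withFallback(ConfigFactory.load()))
val ma = system.actorOf(Props[MaActor].withDispatcher("fair-dispatcher"), "ma")
Treat this as something to measure rather than a fix.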
Thanks so much for your answer. The cutting-off part is what I also implemented in the Synchronizable trait:
// clean up slaves if their queue is behind the master's latest element
master_last_timestamp match {
  case Some(ts) =>
    slaves.foreach { s =>
      while (s.queue.length > 0 && s.getElementTimestamp(s.queue.front) < ts) {
        s.dequeue()
      }
      // val els = s.dequeueAll { queue_el => s.getElementTimestamp(queue_el) < ts }
    }
  case _ => ()
}
The reason why I started to implement the buffer is that I feel I will be using it a lot in the system, and I don't want to write this part for each actor I use. It seems easier to have a blueprint that does it.
But a more important reason is that, for some reason, one buffer is being filled much more slowly than the other two, or not at all, even though they are being filled by the same actors (just different instances, and the computation time should be pretty much the same). Only after the two other actors have emitted all the messages that were "passed" from the database does the third one start receiving them. It feels to me that this one actor is just not getting processor time, so I think a dispatcher setting could affect this. Are you familiar with this?
Also, I would expect the dispatcher to work more like round-robin, giving each actor a little execution time, but it ends up serving only a limited number of actors and then jumping to the next ones, although they should all receive their initial messages at roughly the same time since there is a broadcaster.
I read the Akka documentation on dispatchers and mailboxes, but I still don't understand how to do it.
Thank you
Igor

Using many consumers in SQS Queue

I know that it is possible to consume an SQS queue using multiple threads. I would like to guarantee that each message will be consumed exactly once. I know that it is possible to change the visibility timeout of a message, e.g., to set it equal to my processing time. But if my process spends more time than the visibility timeout (e.g. because of a slow connection), another thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example - there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost - messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but it would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
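A minimal, single-JVM sketch of these steps in Scala (names are made up; a real multi-consumer setup would back the two stores with a database or cache that supports TTLs):
import java.util.concurrent.ConcurrentHashMap

object Dedup {
  // In-memory stand-ins for the "InProgress" and "Completed" stores.
  private val inProgress = ConcurrentHashMap.newKeySet[String]()
  private val completed  = ConcurrentHashMap.newKeySet[String]()

  def handleOnce(messageId: String)(process: => Unit): Unit = {
    if (completed.contains(messageId)) return     // already done: duplicate, stop
    if (!inProgress.add(messageId)) return        // another thread is already on it
    try {
      process                                     // exclusive claim held: do the work
      completed.add(messageId)                    // remember it for the dedup window
    } finally {
      inProgress.remove(messageId)
    }
  }
  // Expiring old entries (the timeouts described above) is omitted; a
  // background sweep or a TTL-backed cache would take care of that.
}
You would wrap whatever does the work for a received message in Dedup.handleOnce(id) { ... }, using the unique id you attached at send time.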
Some notes
Keep in mind that the chances of a duplicate without all of that are already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least 2x your SQS message visibility timeout; the chances of duplication after that are much reduced (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, the "InProgress" entry will eventually expire, and another thread could process this message (either after the SQS visibility timeout also expires or because SQS had a duplicate of it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
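A hedged sketch of that idea over plain JDBC (the table name and the duplicate-key check are assumptions; "23505" is the standard SQLSTATE for a unique-constraint violation, but your driver may report it differently):
import java.sql.{Connection, SQLException}

// Returns true if this messageId was inserted for the first time,
// false if the unique constraint says we have already seen it.
def claimMessage(conn: Connection, messageId: String): Boolean = {
  val stmt = conn.prepareStatement(
    "INSERT INTO processed_messages (message_id) VALUES (?)")
  try {
    stmt.setString(1, messageId)
    stmt.executeUpdate()
    true
  } catch {
    case e: SQLException if e.getSQLState == "23505" => false
  } finally {
    stmt.close()
  }
}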
The AWS SQS API doesn't automatically "consume" a message when you read it; the developer needs to make the call to delete the message themselves.
SQS does have a feature called a "redrive policy" as part of the dead-letter queue settings. You just set the maximum receive count to 1; if the consuming process crashes, a subsequent read of the same message will put it into the dead-letter queue.
The SQS queue visibility timeout can be set up to 12 hours. Unless you have a special need beyond that, you don't need to implement a process that stores the message handle in a database for later inspection.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This can be done by using a ScheduledExecutorService and scheduling a runnable event after half of the initial visibility time. The code snippet below creates and executes the VisibilityTimeExtender after half of the visibilityTime, with a period of half the visibility time. (The visibility is extended by visibilityTime/2 each time before it can run out, so the message stays invisible while it is being processed.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
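For illustration, a hedged Scala version of such an extender, assuming the AWS SDK's changeMessageVisibility call (the constructor parameters are simply whatever identifies the in-flight message):
import com.amazonaws.services.sqs.AmazonSQS

// Each run pushes the message's invisibility further into the future.
class VisibilityTimeExtender(sqs: AmazonSQS,
                             queueUrl: String,
                             receiptHandle: String,
                             visibilityTime: Int) extends Runnable {
  override def run(): Unit =
    sqs.changeMessageVisibility(queueUrl, receiptHandle, visibilityTime)
}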
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.

Why are messages received by an actor unordered?

I've been studying the actor model (specifically the implementation in Scala) but I can't understand why there's a requirement that messages arrive in no particular order.
It seems like there are at least some elegant, actor-based solutions to concurrency problems that would work if only the messages arrived in order (e.g. producer-consumer variants, deferred database writes, concurrency-safe caches).
So why don't actor messages arrive in order? Is it to permit efficient implementation or maybe to prevent some kind of deadlock that would arise when messages are ordered?
My impression is that if two threads send a message to an actor a, there is no particular guarantee about which will be received by the actor first. But if you have code that looks like
a ! "one"
a ! "two"
then a will always get "one" before "two" (though who knows what else might have arrived in between from other threads).
Thus, I don't think it is the case that messages arrive in no particular order at all. Multiple messages from within one thread will (as far as I can tell from the code or from experience) arrive in order.
I'm not privy to the reasons why Scala's Actors (those in the standard library, at any rate -- there are also Akka, Lift and Scalaz implementations of Actors) chose that particular implementation. Probably as a copy of Erlang's own restrictions -- but without the guarantees for communication between two single threads. Or maybe with that guarantee as well -- I wish Philipp Haller were here to comment.
BUT, I do question your statement about concurrency problems. When studying asynchronous distributed algorithms, a basic tenet is that you can't guarantee any ordering of message receipt.
To quote Distributed Computing: Fundamentals, Simulation and Advanced Topics, by Hagit Attiya and Jennifer Welch,
A system is said to be asynchronous if there is no fixed upper bound on how long it
takes for a message to be delivered or how much time elapses between consecutive
steps of a processor.
The actor model is an asynchronous one. That enables it to work over distributed hardware -- be it different computers communicating through a network, or different processors on a system that does not provide synchronous guarantees.
Furthermore, even the multi-threading model on a multi-core processor is mostly asynchronous, with the primitives that enable synchronism being extremely expensive.
So a simple answer to the question might be:
Messages are not guaranteed to arrive in order because that's an underlying limitation of asynchronous systems, which is the basic model of computation used by actors.
This model is the one we actually have on any system distributed over TCP/IP, and the most efficient over i386/x64 multicore/multiprocessor hardware.
The following simple example shows messages arriving out of order to a very simple actor:
import scala.actors._
import scala.actors.Actor._
import scala.collection.mutable._

val adder = actor {
  loop {
    react {
      case x: Int => println(" Computing " + x); reply(x+2)
      case Exit   => println("Exiting"); exit
    }
  }
}

actor {
  for (i <- 1 to 5) {
    println("Sending " + i)
    adder !! (i, { case answer => println("Computed " + i + " -> " + answer) })
  }
  println("Sending Exit")
  adder !! Exit
}
Here is the output from one run of the above code with Scala 2.9.0 final on Windows 64-bit with Sun JDK 1.6.0u25:
Sending 1
Sending 2
Sending 3
Sending 4
Sending 5
Sending Exit
Computing 1
Computed 1 -> 3
Computing 4
Computed 4 -> 6
Computing 3
Computed 3 -> 5
Exiting
What order would you choose? Should it be by when they were sent or when they were received? Should we freeze the entire mailbox whilst we sort the messages? Imagine sorting a large and nearly full mailbox; wouldn't that put an arbitrary lock on the queue? I think the messages don't arrive in order because there is no guaranteed way to enforce such an order. We have latency in networks and between processors.
We have no idea where the messages are coming from, only that they have arrived. So how about this: we guarantee that there is no ordering and don't even try to think about ordering. Instead of having to come up with some impressive logic to keep things organized while remaining as contention-free as possible, we can just focus on keeping things as contention-free as possible.
Someone else probably has an even better answer than I on this.
Edit:
Now that I've had time to sleep on it, I think it's a stipulation that allows a much more vibrant Actor ecosystem to be created. Why restrict an Actor to one thread, or to partial ownership of a thread from a thread pool? What if someone wanted an Actor which could grab as many threads as possible to process as many messages in its mailbox as it could?
If you made the stipulation up front that messages had to be processed in the order they arrived, you'd never be able to allow for this. The minute multiple threads could be assigned by an Actor to process messages within the mailbox, you'd have no control over which message was processed first.
Phew, what your dreams say about your mind as you sleep.

How to improve mailbox scan time in Scala actors

I created following example of actors in Scala: http://pastebin.com/pa3WVpKy
Without the throttling (reducing the number of SendMoney messages) that occurs in these lines:
val processed = iterations - counter.getCount/2
if (processed < i - banksCount * 5) Thread.sleep(1)
message processing in this test is very slow (especially when there are few bank actors).
That's because the actors' mailboxes are full of SendMoney messages, and receiving ReadAccountResponse messages takes a long time (they are usually almost at the end of the mailbox, and the whole mailbox must be scanned).
How to improve mailbox scan time in such cases?
Maybe there is a possibility to define some messages as high priority?
It would be great to have two mailboxes - one for usual messages and one for high priority ones. The high priority mailbox could be scanned first.
Also "reply" method could send messages to high priority mailbox automatically. Or maybe create two mailboxes - for usual messages and responses?
What's your oppinion?
Regards
Wojciech DurczyƄski
One potentially good solution to this problem will be Philipp Haller's translucent functions, in which the Scala compiler reflectively exposes information about what kinds of objects a match expression can match. Then actor mailboxes can be indexed by message class, and lookup can potentially be drastically faster, especially in this kind of "needle in a haystack" scenario.
Here's the API for TranslucentFunction; as you can see, it's pretty straightforward. It seems like the Translucent project has been on hiatus for a while; let's hope it picks up again soon!
I believe that Lift's actors have exactly this prioritisation built in: rather than overriding a single "act" method, there are a number of different methods (not sure of the exact names) which can be implemented depending on the message's priority.
I'm not sure whether this solves the scanning slowdown issue, though.
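With the standard scala.actors API you can also approximate a priority-first scan by hand, at the cost of an extra mailbox pass. A rough sketch (HighPriority is a made-up wrapper type):
import scala.actors.Actor._
import scala.actors.TIMEOUT

case class HighPriority(payload: Any)

val worker = actor {
  loop {
    reactWithin(0) {
      case HighPriority(p) => println("urgent: " + p)
      case TIMEOUT =>                 // no high-priority message waiting right now
        react {
          case HighPriority(p) => println("urgent: " + p)
          case other           => println("normal: " + other)
        }
    }
  }
}
This mirrors the Erlang "selective receive" idiom shown in the next question, and has the same cost: every turn scans the mailbox once looking for the urgent pattern.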

Is it possible to build work queues in Erlang?

I have seen lots of chat examples in Erlang but what about lists, like a work queue? If I want to build a work queue system, like a project management system, is it possible to re-order messages in a process mailbox or do I have to use message priorities? Are there examples of workflow systems built in Erlang?
You cannot reorder messages in process message queues in Erlang.
You can, however do selective receives in which you can receive the message you deem most important first. It's not entirely the same but works for most purposes.
Here's an example:
receive
    {important, Msg} ->
        handle(Msg)
after 0 ->
    ok
end,
receive
    OtherMsg ->
        handle(OtherMsg)
end
It differs from:
receive
    {important, Msg} ->
        handle(Msg);
    OtherMsg ->
        handle(OtherMsg)
end
In that it will always scan the whole message queue for {important, Msg} before continuing handling the rest of the messages. It means that those kinds of messages will always be handled before any others, if they exist. This of course comes at some performance cost (it takes more time scanning the whole queue twice).
Process mailboxes work quite well as-is for job queues.
Just have your messages include sufficient information so that selective receive patterns are easy to write, and you won't feel the need to re-order mailbox contents.
If you do need to reorder messages, you can follow the gatekeeper pattern: reify the mailbox as a separate process. When your original process is ready for another message, the gatekeeper can compute which message to forward, by any rule you choose.
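As an illustration in the Scala actors used elsewhere on this page (the same idea translates directly to an Erlang process), here is a rough sketch of such a gatekeeper, with made-up message types:
import scala.actors.Actor
import scala.actors.Actor._
import scala.collection.mutable

case class Job(priority: Int, payload: String)
case object Next                     // sent by the worker: "ready for another job"

class Gatekeeper(worker: Actor) extends Actor {
  private implicit val byPriority: Ordering[Job] = Ordering.by(_.priority)
  private val jobs = mutable.PriorityQueue.empty[Job]
  private var workerIdle = true
  def act() = loop {
    react {
      case j: Job => jobs.enqueue(j); handOut()
      case Next   => workerIdle = true; handOut()
    }
  }
  private def handOut(): Unit =
    if (workerIdle && jobs.nonEmpty) {
      worker ! jobs.dequeue()        // forward the job the gatekeeper ranks highest
      workerIdle = false
    }
}
The worker simply sends Next back to the gatekeeper whenever it finishes a job, so only one job is ever in its mailbox at a time and the ordering decision stays entirely inside the gatekeeper.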