Akka framework support for finding duplicate messages - scala

I'm trying to build a high-performance distributed system with Akka and Scala.
If a message requesting an expensive (and side-effect-free) computation arrives, and the exact same computation has already been requested before, I want to avoid computing the result again. If the computation requested previously has already completed and the result is available, I can cache it and re-use it.
However, the time window in which duplicate computation can be requested may be arbitrarily small. e.g. I could get a thousand or a million messages requesting the same expensive computation at the same instant for all practical purposes.
There is a commercial product called Gigaspaces that supposedly handles this situation.
However there seems to be no framework support for dealing with duplicate work requests in Akka at the moment. Given that the Akka framework already has access to all the messages being routed through the framework, it seems that a framework solution could make a lot of sense here.
Here is what I am proposing for the Akka framework to do:
1. Create a trait to indicate a type of messages (say, "ExpensiveComputation" or something similar) that are to be subject to the following caching approach.
2. Smartly (hashing etc.) identify identical messages received by (the same or different) actors within a user-configurable time window. Other options: select a maximum buffer size of memory to be used for this purpose, subject to (say LRU) replacement etc. Akka can also choose to cache only the results of messages that were expensive to process; the messages that took very little time to process can be re-processed again if needed; no need to waste precious buffer space caching them and their results.
3. When identical messages (received within that time window, possibly "at the same time instant") are identified, avoid unnecessary duplicate computations. The framework would do this automatically, and essentially, the duplicate messages would never get received by a new actor for processing; they would silently vanish and the result from processing it once (whether that computation was already done in the past, or ongoing right then) would get sent to all appropriate recipients (immediately if already available, and upon completion of the computation if not). Note that messages should be considered identical even if the "reply" fields are different, as long as the semantics/computations they represent are identical in every other respect. Also note that the computation should be purely functional, i.e. free from side-effects, for the caching optimization suggested to work and not change the program semantics at all.
If what I am suggesting is not compatible with the Akka way of doing things, and/or if you see some strong reasons why this is a very bad idea, please let me know.
Thanks,
Is Awesome, Scala

What you are asking does not depend on the Akka framework; it is a matter of how you architect your actors and messages. First, ensure that your messages are immutable and have an appropriately defined identity via the equals/hashCode methods. Case classes give you both for free; however, if you have ActorRefs embedded in the message for reply purposes, you will have to override the identity methods. The case class parameters should also have the same properties recursively (immutable and proper identity).
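One way to get such an identity without hand-writing equals/hashCode is sketched below. It relies on the fact that a case class only uses its first parameter list for the generated equality and hash code. The names Factorize and IdentityDemo are made up for illustration, and a plain String stands in for the ActorRef you would carry in a real message.

```scala
// Hypothetical message type. Only the first parameter list takes part in the
// generated equals/hashCode, so replyTo is ignored when comparing requests.
final case class Factorize(n: BigInt)(val replyTo: String)

object IdentityDemo {
  val a = Factorize(BigInt(91))("client-1")
  val b = Factorize(BigInt(91))("client-2")

  // Same work requested by two different clients.
  val sameWork: Boolean = a == b && a.hashCode == b.hashCode
}
```

Both requests can then serve as the same cache key while each keeps its own reply address.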
Secondly, you need to figure out how the actors will handle storing and identifying current/past computations. The easiest is to uniquely map requests to actors. This way that actor and only that actor will ever process that specific request. This can be done easily given a fixed set of actors and the hashCode of the request. Bonus points if the actor set is supervised, where the supervisor manages the load balancing/mapping and replaces failed actors (Akka makes this part easy).
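The mapping from a request to one of a fixed set of workers could look roughly like this. Routing and workerFor are hypothetical names, and the index would be used to pick an actor out of whatever collection of workers you maintain.

```scala
object Routing {
  // Picks a worker index in [0, workers) for a request.
  // floorMod keeps the result non-negative even for negative hash codes.
  def workerFor(request: Any, workers: Int): Int =
    java.lang.Math.floorMod(request.hashCode, workers)
}
```

Because equal requests have equal hash codes, every duplicate lands on the same worker, which is what lets that worker own the cache entry for it.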
Finally, the actor itself can maintain a response caching behavior based on the criteria you described. Everything is thread safe in the context of the actor, so an LRU cache keyed by the request itself (good identity properties, remember) is easy to build with any type of behavior you want.
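As a minimal sketch of such an LRU cache, assuming it lives inside a single actor and therefore needs no locking, java.util.LinkedHashMap in access-order mode does most of the work. LruCache and LruDemo are illustrative names, not part of Akka.

```scala
import java.util.{LinkedHashMap, Map => JMap}

// Minimal LRU cache intended for use inside one actor (no synchronization).
final class LruCache[K, V](maxEntries: Int)
    extends LinkedHashMap[K, V](16, 0.75f, true) { // true = access order
  // Called by LinkedHashMap after each insertion; returning true evicts
  // the least recently used entry.
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxEntries
}

object LruDemo {
  val cache = new LruCache[String, Int](2)
  cache.put("a", 1)
  cache.put("b", 2)
  cache.get("a")    // touch "a" so that "b" becomes the eldest entry
  cache.put("c", 3) // exceeds the limit, so "b" is evicted
}
```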

As Neil says, this is not really framework functionality; it's rather trivial to implement this and even abstract it into its own trait.
trait CachingExpensiveThings { this: Actor =>
  val cache = ...
  def receive: Actor.Receive = {
    case s: ExpensiveThing => cachedOrCache(s)
  }
  // Reply from the cache if possible; otherwise compute, store, then reply.
  def cachedOrCache(s: ExpensiveThing) = cache.get(s) match {
    case null =>
      val result = compute(s)
      cache.put(s, result) // key the result by the request itself
      self.reply_?(result)
    case cached => self.reply_?(cached)
  }
  def compute(s: ExpensiveThing): Any
}
class MyExpensiveThingCalculator extends Actor with CachingExpensiveThings {
  def compute(s: ExpensiveThing) = s match {
    case l: LastDigitOfPi => ...
    case ts: TravellingSalesman => ...
  }
}
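The trait above caches finished results, but the question also asks about duplicates that arrive while the first computation is still running. One possible sketch, outside of Akka and using only the standard library, is to cache the Future rather than the value, so that every duplicate shares the computation already in flight. Deduplicator and DedupDemo are hypothetical names, and the squaring function is a stand-in for the expensive work.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Caches the Future, not the finished value, so requests arriving while the
// computation is still running share it instead of starting another one.
final class Deduplicator[K, V](compute: K => V)(implicit ec: ExecutionContext) {
  private val inFlightOrDone = new ConcurrentHashMap[K, Future[V]]()

  def request(key: K): Future[V] =
    // computeIfAbsent runs the mapping function at most once per key.
    inFlightOrDone.computeIfAbsent(key, new JFunction[K, Future[V]] {
      def apply(k: K): Future[V] = Future(compute(k))
    })
}

object DedupDemo {
  // Returns (number of times the computation actually ran, all results).
  def run(): (Int, List[Int]) = {
    import ExecutionContext.Implicits.global
    val runs = new AtomicInteger(0)
    val dedup = new Deduplicator[Int, Int](n => { runs.incrementAndGet(); n * n })
    val all = Future.sequence(List.fill(1000)(dedup.request(12)))
    (runs.get, Await.result(all, 10.seconds))
  }
}
```

This sketch never evicts anything; combining it with a bounded or time-limited cache, as the question suggests, is left open.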

I do not know whether all of these responsibilities should be handled by Akka alone. As usual, it all depends on the scale, and in particular on the number of attributes that define the uniqueness of the message.
For the cache mechanism, the already mentioned approach of uniquely mapping requests to actors is the way to go, especially since it could be backed by persistence.
For identity, instead of checking simple equality (which may be a bottleneck), I would rather use a graph-based algorithm like signal-collect.

Related

Correct approach to model network interaction in Enterprise Architect

I have a class Actor whose instances send/receive network messages. (E.g. each instance of that class is part of a different process running on a different physical machine.) The network messages are serialized instances of classes MessageA and MessageB whose attributes are sent over the wire. An incoming message is handled by a callback method of my Actor class. An outgoing message is triggered by calling a method of my Actor class.
Hence, I started to model this situation in a class diagram like this:
The network messages are "signals" in EA terms, i.e. classes with a special prototype (for succinctness the attributes are left out)
My Actor class is a usual class in EA with four corresponding methods
Now, I want to model a typical interaction and started to draw the following sequence diagram:
The messages are not method invocations, but are asynchronous and have kind "signal", which allows me to assign them the correct message type.
However, I wonder how I model
the fact that a message with payload MessageA is handled by onMessageAReceived
that method sendMessageA emits a message with payload MessageA
(Note: In terms of my implementation it is correct that sendMessageA returns void, because sending a network message is asynchronous, offloaded to the underlying OS, and the method returns to its caller after having sent the message.)
in the sequence diagram.
Maybe, my whole approach is completely wrong and I am trying to model something which cannot be modeled like that. In that case some pointers to the correct approach are highly welcome.
Of course there's more than one way to model this (and it does not depend on the tool EA). So you should ask which audience you are talking to and what their domain is.
Technical
An SD is well suited to show a physical transport. In that case you concentrate on how messages are sent, and the physical operations are shown as messages. E.g. using sockets, it would be some (a-)synchronous send(message) which assures that the message content is transported from A to B. This could be at any level of technical implementation, from rough down to single CRCs being sent (or how the operation is internally built to ensure packets are not lost).
Logical
In order to show a more logical aspect it's a good idea to have components (being deployed on multiple hardware) having ports (realizing some interface) along which you have an information flow (which is a connector you will find in EA) that can transport something (that is your message classes).
Overview
You might want to describe both aspects in your model. But likely you will have the focus on the one or other part depending on your overall domain.
There is no single way to model something. Models are always abstractions, which is why we create them. They should show reality, but in a more lightweight form.

Scala objects and thread safety

I am new to Scala.
I am trying to figure out how to ensure thread safety with functions in a Scala object (aka singleton)
From what I have read so far, it seems that I should keep visibility to function scope (or below) and use immutable variables wherever possible. However, I have not seen examples of where thread safety is violated, so I am not sure what other precautions should be taken.
Can someone point me to a good discussion of this issue, preferably with examples of where thread safety is violated?
Oh man. This is a huge topic. Here's a Scala-based intro to concurrency and Oracle's Java lessons actually have a pretty good intro as well. Here's a brief intro that motivates why concurrent reading and writing of shared state (of which Scala objects are one specific case) is a problem and provides a quick overview of common solutions.
There are two (fundamentally related) classes of problems when it comes to thread safety and state mutation:
Clobbering (missing) writes
Inaccurate (changing out from under you) reads
Let's look at each of these in turn.
First clobbering writes:
object WritesExample {
var myList: List[Int] = List.empty
}
Imagine we had two threads concurrently accessing WritesExample, each of which executes the following updateList:
def updateList(x: WritesExample.type): Unit =
WritesExample.myList = 1 :: WritesExample.myList
You'd probably hope when both threads are done that WritesExample.myList has a length of 2. Unfortunately, that might not be the case if both threads read WritesExample.myList before the other thread has finished a write. If when both threads read WritesExample.myList it is empty, then both will write back a list of length 1, with one write overwriting the other, so that in the end WritesExample.myList only has a length of one. Hence we've effectively lost a write we were supposed to execute. Not good.
Now let's look at inaccurate reads.
object ReadsExample {
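// Initialized so that the object compiles; an abstract val is not allowed here.
val myMutableList: collection.mutable.MutableList[Int] = collection.mutable.MutableList.empty[Int]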
}
Once again, let's say we had two threads concurrently accessing ReadsExample. This time each of them executes updateList2 repeatedly.
def updateList2(x: ReadsExample.type): Unit =
ReadsExample.myMutableList += ReadsExample.myMutableList.length
In a single-threaded context, you would expect updateList2, when repeatedly called, to simply generate an ordered list of incrementing numbers, e.g. 0, 1, 2, 3, 4,.... Unfortunately, when multiple threads are accessing ReadsExample.myMutableList with updateList2 at the same time, it's possible that between when ReadsExample.myMutableList.length is read and when the write is finally persisted, ReadsExample.myMutableList has already been modified by another thread. So in theory you could see something like 0, 0, 1, 1 or potentially if one thread takes longer to write than another 0, 1, 2, 1 (where the slower thread finally writes to the list after the other thread has already accessed and written to the list three times).
What happened is that the read was inaccurate/out-of-date; the actual data structure that was updated was different from the one that was read, i.e. was changed out from under you in the middle of things. This is also a huge source of bugs because many invariants you might expect to hold (e.g. every number in the list corresponds exactly to its index or every number appears only once) hold in a single-threaded context, but fail in a concurrent context.
Now that we've motivated some of the problems, let's dive into some of the solutions. You mentioned immutability so let's talk about that first. You might notice that in my example of clobbering writes I use an immutable data structure whereas in my inconsistent reads example I use a mutable data structure. That is intentional. They are in a sense dual to one another.
With immutable data structures you cannot have an "inaccurate" read in the sense I laid out above because you never mutate data structures, but rather place a new copy of a data structure in the same location. The data structure cannot change out from under you because it cannot change! However you can lose a write in the process by placing a version of a data structure back to its original location that does not incorporate a change made previously by another process.
With mutable data structures on the other hand, you cannot lose a write because all writes are in-place mutations of the data structure, but you can end up executing a write to a data structure whose state differs from when you analyzed it to formulate the write.
If it's a "pick your poison" kind of scenario, why do you often hear advice to go with immutable data structures to help with concurrency? Well, immutable data structures make it easier to ensure that invariants about the state being modified hold even if writes are lost. For example, if I rewrote the ReadsExample example to use an immutable List (and a var instead), then I could confidently say that the integer elements of the list will always correspond to the indices of the list. This means that your program is much less likely to enter an inconsistent state (e.g. it's not hard to imagine that a naive mutable set implementation could end up with non-unique elements when mutated concurrently). And it turns out that modern techniques for dealing with concurrency are usually pretty good at dealing with missing writes.
Let's look at some of those approaches that deal with shared state concurrency. At their hearts they can all be summed up as various ways of serializing read/write pairs.
Locks (a.k.a. directly try to serialize read/write pairs): This is usually the one you'll hear first as a fundamental way of dealing with concurrency. Every process that wants to access state first places a lock on it. Any other process is now excluded from accessing that state. The process then writes to that state and on completion releases the lock. Other processes are now free to repeat the process. In our WritesExample, updateList would first acquire the lock, then execute, then release the lock; this would prevent other processes from reading WritesExample.myList until the write was completed, thereby preventing them from seeing old versions of myList that would lead to clobbering writes (note that there are more sophisticated locking procedures that allow for simultaneous reads, but let's stick with the basics for now).
Locks often do not scale well to multiple pieces of state. With multiple locks, often you need to acquire and release locks in a certain order otherwise you can end up deadlocking or livelocking.
The Oracle and Twitter docs linked at the beginning have good overviews of this approach.
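To make the locking idea concrete, here is a minimal sketch of the WritesExample scenario guarded by a lock. SafeWrites, its dedicated lock object, and the run helper are names introduced here for illustration; the read and the write of myList happen inside the same synchronized block, so no write can be clobbered.

```scala
object SafeWrites {
  private var myList: List[Int] = List.empty
  private val lock = new Object

  // The read of myList and the write back happen under the same lock.
  def updateList(): Unit = lock.synchronized { myList = 1 :: myList }

  def length: Int = lock.synchronized { myList.length }

  // Starts `threads` threads that each prepend `perThread` elements,
  // waits for them, and returns the resulting list length.
  def run(threads: Int, perThread: Int): Int = {
    val workers = (1 to threads).map { _ =>
      new Thread(() => (1 to perThread).foreach(_ => updateList()))
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    length
  }
}
```

On a fresh SafeWrites, run(4, 1000) yields a list of length 4000, whereas the unsynchronized version may come up short.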
Describe Your Action, Don't Execute It (a.k.a. build up a serial representation of your actions and have someone else process it): Instead of accessing and modifying state directly, you describe an action of how to do this and then give it to someone else to actually execute the action. For example, you might pass messages to an object (e.g. actors in Scala) that queues up these requests and then executes them one-by-one on some internal state that it never directly exposes to anyone else. In the particular case of actors, this improves the situation over locks by removing the need to explicitly acquire and release locks. As long as you encapsulate all the state you need to access at once in a single object, message passing works great. Actors break down when you distribute state across multiple objects (and as such this is heavily discouraged in this paradigm).
Akka actors are one good example of this in Scala.
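The same idea can be sketched without an actor library by confining state to a single worker thread and submitting closures that describe what to do. This is only an approximation of what an actor mailbox gives you; Counter and CounterDemo are hypothetical names.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// State is confined to one worker thread. Callers only enqueue descriptions
// of what to do (closures here, messages in an actor system).
final class Counter {
  private val worker = Executors.newSingleThreadExecutor()
  private var count = 0 // only ever touched on the worker thread

  def increment(): Unit = worker.execute(() => count += 1)

  // Enqueues a final read behind all pending increments, then shuts down.
  def shutdownAndGet(): Int = {
    val last = worker.submit(new Callable[Int] { def call(): Int = count })
    worker.shutdown()
    last.get(10, TimeUnit.SECONDS)
  }
}

object CounterDemo {
  def run(): Int = {
    val counter = new Counter
    val callers = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => counter.increment()))
    }
    callers.foreach(_.start())
    callers.foreach(_.join())
    counter.shutdownAndGet()
  }
}
```

No caller ever takes a lock; the queue inside the executor serializes the updates.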
Transactions (a.k.a. temporarily isolate some reads and writes from others and let the isolation system serialize things for you): Wrap all your read/writes in transactions that ensure during the course of your reads and writes your view of the world is isolated from any other changes. There's usually two ways of achieving this. Either you go for an approach similar to locks where you prevent other people from accessing the data while a transaction is running or you restart a transaction from the very beginning whenever you detect that a change has occurred to the shared state and throw away any progress you've made (usually the latter for performance reasons). On the one hand, transactions, unlike locks and actors, scale to disparate pieces of state very well. Just wrap all your accesses in transactions and you're good to go. On the other hand, your reads and writes have to be side-effect-free because they might be thrown away and retried many times and you can't really undo most side effects.
And if you're really unlucky, although you usually can't truly deadlock with a good implementation of transactions, a long-lived transaction can constantly be interrupted by other short-lived transactions such that it keeps getting thrown away and retried and never actually succeeds (which amounts to something like livelocking). In effect you're giving up direct control of serialization order and hoping your transaction system orders things sensibly.
Scala's STM library is a good example of this approach.
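The "throw away and retry" flavour of transactions can be illustrated for a single piece of state with a compare-and-set loop. To be clear, this is not STM and does not use the ScalaSTM API; it is a hand-rolled optimistic update on one AtomicReference, with OptimisticList as a made-up name.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

object OptimisticList {
  private val ref = new AtomicReference[List[Int]](List.empty)

  // Read a snapshot, build the new value, and commit only if nobody else
  // committed in between; otherwise discard the work and retry.
  @tailrec def prepend(x: Int): Unit = {
    val seen = ref.get()
    if (!ref.compareAndSet(seen, x :: seen)) prepend(x)
  }

  def snapshot: List[Int] = ref.get()

  // Starts `threads` threads that each prepend `perThread` elements and
  // returns the final length.
  def run(threads: Int, perThread: Int): Int = {
    val workers = (1 to threads).map { _ =>
      new Thread(() => (1 to perThread).foreach(prepend))
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    snapshot.length
  }
}
```

Building the new list is side-effect-free, which is exactly why it is safe to retry it; an STM generalizes this to several pieces of state at once.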
Remove Shared State: The final "solution" is to rethink the problem altogether and try to think about whether you truly need global, shared state that is writable. If you don't need writable shared state, then concurrency problems go away altogether!
Everything in life is about trade-offs and concurrency is no exception. When thinking about concurrency first understand what state you have and what invariants you want to preserve about that state. Then use that to guide your decision as to what kind of tools you want to use to tackle the problem.
The Thread Safety Problem section within this Scala concurrency article might be of interest to you. In essence, it illustrates the thread safety problem using a simple example and outlines 3 different approaches to tackle the problem, namely synchronization, volatile and AtomicReference:
When you enter synchronized points, access volatile references, or
deference AtomicReferences, Java forces the processor to flush their
cache lines and provide a consistent view of data.
There is also a brief overview comparing the cost of the 3 approaches:
AtomicReference is the most costly of these two choices since you
have to go through method dispatch to access values. volatile and
synchronized are built on top of Java’s built-in monitors. Monitors
cost very little if there’s no contention. Since synchronized allows
you more fine-grained control over when you synchronize, there will be
less contention so synchronized tends to be the cheapest option.
This is not specific to Scala. If your object contains state that can be modified concurrently, thread safety can be violated depending on the implementation. For example:
object BankAccount {
private var balance: Long = 0L
def deposit(amount: Long): Unit = balance += amount
}
In this case the object is not thread safe. There are many approaches to make it thread safe, for example using Akka or synchronized blocks. For simplicity I will write it using synchronized blocks:
object BankAccount {
private var balance: Long = 0L
def deposit(amount: Long): Unit =
this.synchronized {
balance += amount
}
}

Atomic function/method in scala (without introducing actor system overheads)

I currently use an Akka actor to establish a code block that is executed atomically and in a thread safe manner (Akka mailbox semantics impose atomicity by virtue of processing one message at a time).
However this introduces the need for an actor system, and additional side-effects or bloat (having to manually propagate exceptions to the caller, losing type safety on ask, and in general using message semantics rather than function calls).
Can a thread-safe atomic code block be accomplished in Scala in a simpler way? Would you apply @volatile to a function?
It depends on what kind of shared state you want to protect here:
The easiest and universal choice is plain old synchronized. However, unlike Akka, it is completely blocking, so it may easily kill your performance and of course the code style, as it's hard to control messy side effects. It may also allow for deadlocks.
Java's locks are the same approach, but might be a little better for performance.
Another option is plain old Java's AtomicReference (which implements CAS operations) and related classes. The positive thing about them is that they're non-blocking; developers actually use them to build high-performance collections. The ways of using locks and CAS are described here. They are both pretty low-level mechanisms, so I would not recommend using them much, especially for business logic (any actor implementation would be better).
If your shared state is a collection, you may want to use Java's concurrent collections (they have atomic operations like putIfAbsent). Scala has an interesting non-blocking TrieMap, for instance.
Scala STM is also an alternative.
Finally, this question is dedicated to lightweight actor model implementations.
P.S. The volatile annotation is nothing more than the analog of Java's volatile keyword. You can put it on a method only because any annotation can be put on anything.
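For the concurrent-collection option, a small sketch with Scala's TrieMap shows what an atomic putIfAbsent buys you: the check and the insert happen as one step, so two callers can never both believe they were first. Registry and claim are illustrative names.

```scala
import scala.collection.concurrent.TrieMap

object Registry {
  private val owners = TrieMap.empty[String, String]

  // Atomically claims `resource` for `owner` and returns whoever holds it
  // after the call; the first claimant wins.
  def claim(resource: String, owner: String): String =
    owners.putIfAbsent(resource, owner).getOrElse(owner)
}
```

putIfAbsent returns the previous value as an Option, so None means this call performed the insert.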
Depending on what you're trying to achieve, the simplest might be old synchronized:
//your mutable state
private var x = 0
//better than locking on 'this' is to have a dedicated lock
private val lock = new Object
def add(i:Int) = lock.synchronized { x += i }
This is the 'old Java' way, but it might work for you depending on what you're doing. Of course, this is the fastest way to deadlocks if your synchronized operation is more complex and/or you need high throughput.

What is the best way to manage mutable state?

I just finished Martin Odersky's scala class at Coursera. Scala being my first FP language, I was excited by the idea of limiting mutable state. This allows for much easier concurrency and also makes the code super maintainable.
While learning all this, I realized you could guarantee the immutability of an object as long as it had no mutable variables and only referenced immutable objects. So now I can do everything by creating a new state instead of modifying an old one, and use tail recursion when possible.
Great. But I can only do this so far up the chain. At some point, my application needs to be able to modify some existing state. I know where to put in concurrency control at this point: locks, blah blah. I'm still defaulting to the standard multi-threaded concurrency control I've always used.
Oh scala community, is there a better way? Monads maybe?
EDIT: this question is a bit general, so I wanted to give a use case:
I have a machine learning algorithm that stores several collections of data. They have functions that return updated representations of the data (training, etc), all immutable. Ultimately I can keep this return-updated-state pattern going up the chain to the actual object running the simulation. This has a state that is mutable and holds references to the collections. I may want to distribute this across multiple cores or multiple systems.
This is a bit of a subjective question, so I won't attempt to answer the 'which is best' part of it. If your chief concern is state in the context of multithreaded concurrency, then one option may be Software Transactional Memory.
There is an implementation of STM (see the quickstart) provided by Akka. Depending on your use case, it might be heavyweight or overkill, but then again, it might be preferable to a mess of locks. Unlike locks, STM tends to be optimistic, in the same way as database transactions are. As with database transactions, you make changes to shared state explicitly in a transactional context, and the changes you describe will be committed atomically or re-attempted if a conflict is detected. Basically you have to wrap all your state in Refs, which can be manipulated only in an 'atomic' block. This is implemented as a method that takes a closure within which you manipulate your Refs, and ScalaSTM ensures that the whole set of operations on your state either succeeds or fails; there will be no half-way or inconsistent changes.
This leverages Scala's implicit parameters: all operations on Refs require a transaction object as an argument. This object is received by the closure given to atomic and can be declared implicit, so all the code within atomic can be written in a very natural yet safe style.
The catch is, for this to be useful, you do need to use the transactional data structures provided; so that will mean using TSet instead of Set, TMap instead of Map. These provide all-or-nothing update semantics when used in the transactional context (within an atomic block). These are very much like Clojure's persistent collections. You can also build your own transactional data structures out of Refs for use within these atomic blocks.
If you are not averse to parentheses, the Clojure explanation of refs is really good: http://clojure.org/refs
Depending on your use case you might be able to stick with deeply immutable object structures which you partially copy instead of actually mutating them (similar to an "updated" immutable list that shares a suffix with its original list). So-called lenses are a nice way of dealing with such structures, read about them in this SO question or in this blog post.
Sticking with immutable structures of course only works if you don't want changes to be globally observable. An example where immutable structures are most likely not an option are two concurrent clients working on a shared list, where the modifications done by client A must be observable by client B, and vice versa.
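The partial-copy idea can be shown with plain case classes before reaching for a lens library. The Address and Person types below are invented for the example; a lens would mainly remove the nested copy boilerplate.

```scala
final case class Address(city: String, zip: String)
final case class Person(name: String, address: Address)

object CopyDemo {
  val before = Person("Ann", Address("Oslo", "0150"))

  // Only the path to the changed field is rebuilt; `name` and `zip` are
  // shared between the old and the new value, and `before` is untouched.
  val after = before.copy(address = before.address.copy(city = "Bergen"))
}
```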
I suggest the best way is to store the mutable variable inside an Akka actor and use message passing in and out of the actor to send and receive this mutable reference. Use immutable data structures.
I have a StorageActor as follows. The variable entityMap gets updated every time something is stored via the StoreEntity. Also it doesn't need to be volatile and still works.
The Akka actor is the place where things can change, messages are passed in and out into the pure functional world.
import akka.actor.Actor
import java.util.UUID
import com.orsa.minutesheet.entity.Entity
case class EntityRef(entity: Option[Entity])
case class FindEntity(uuid: UUID)
case class StoreEntity[T >: Entity](uuid: UUID, entity: Option[T])
class StorageActor extends Actor {
private var entityMap = Map[UUID, Entity]()
private def findEntityByUUID(uuid:UUID): Option[Entity] = entityMap.get(uuid)
def receive = {
case FindEntity(uuid) => sender ! EntityRef( findEntityByUUID(uuid) )
case StoreEntity(uuid, entity) =>
entity match {
case Some(store) => entityMap += uuid -> store.asInstanceOf[Entity]
case None => entityMap -= uuid
}
}
}

Akka for simulations

I'm new to Akka and the actor pattern, so I'm not sure whether it fits my needs.
I want to create a simulation with Akka and millions of entities (think of them as domain objects, later actors) that can influence each other. So, thinking of it as a simulation with a more-or-less "fuzzy" result, we have an array of entities, where each of these entities has a speed but is thwarted by the entities in front of it. When the simulation starts, each entity should move n fields or, if thwarted by others, fewer fields. We have multiple iterations, and in the end we have a new order. This is repeated for some rounds until we want to see a "snapshot" of the leading entities (which are then possibly removed before the next round starts).
So I don't understand if I can create this with akka, because:
Is it possible to have global list with the position of each actor, so they know at which position they are and which are in front of them?
As far as I understand, this violates the encapsulation of the actors. I can put the position of the actor in the actor itself, but how can I see/notify the actors around this actor?
Besides this, the global list will create synchronization problems and hurt performance, which is exactly the opposite of the desired behaviour (and contrary to Akka and the actor pattern).
What did I miss? Do I have to search for another design approach?
Thanks for suggestions.
Update: working with the event bus and classifiers doesn't seem to be an option either. Referring to the documentation:
"hence it is not well-suited to use cases in which subscriptions change with very high frequency"
The actor model is a very good fit for your scenario. Actors communicate by sending messages, so each actor can send messages containing its position to its neighbors. Of course, each actor cannot know about every other actor in the system (not efficiently anyway), so you will also have to devise a scheme through which each actor knows which actors are its neighbors.
As for getting a snapshot of the system, simply have a central actor that is known by everybody and knows everybody.
It seems like you're just getting started with actors. Read a bit more - the akka site is a good resource - and come back and refine your question, if needed.
Your problem sounds like an n-body simulation sort of thing, so looking into that might help also.