What is the best way to manage mutable state? - scala

I just finished Martin Odersky's scala class at Coursera. Scala being my first FP language, I was excited by the idea of limiting mutable state. This allows for much easier concurrency and also makes the code super maintainable.
While learning all this, I realized you could guarantee the mutability of an object as long as it had no mutable variables and only referenced immutable objects. So now I can do everything by creating a new state instead of modifying an old one, use tail recursion when possible.
Great. So I can only do this so far up the chain. At some point, my application needs to be able to modify some existing state. I know where put in concurrency control at this point, locks, blah blah. I'm still defaulting to my standard multi-threaded concurrency control I've always used.
Oh scala community, is there a better way? Monads maybe?
EDIT: this question is a bit general, so I wanted to give a use case:
I have a machine learning algorithm that stores several collections of data. They have functions that return updated representations of the data (training, etc), all immutable. Ultimately I can keep this return-updated-state pattern going up the chain to the actual object running the simulation. This has a state that is mutable and holds references to the collections. I may want to distributed to multi-cores, or multi-system.

This is a bit of a subjective question, so I won't attempt to answer the 'which is best' part of it. If your chief concern is state in the context of multithreaded concurrency, then one option may be Software Transactional Memory.
There is an Implementation (see the quickstart) of STM as provided by Akka. Depending on your use-case, it might be heavy-weight or overkill, but then again, it might be preferable to a mess of locks. Unlike locks, STM tends to be optimistic, in the same way as database transactions are. As with database transactions, you make changes to shared state explicitly in a transactional context, and the changes you describe will be committed atomically or re-attempted if a conflict is detected. Basically you have to wrap all your state in Refs which can be manipulated only in an 'atomic' block - implemented as a method that takes a closure within which you use manipulate your Refs and ScalaSTM ensures that the whole set of operations on your state either succeed or fail - there will be no half-way or inconsistent changes.
This leverages Scala's implicit parameters - all operation to Refs require a transaction object as an argument, and this is received by the closure given to atomic and can be declared implicit, so all the code within atomic will can be written in a very natural yet safe style.
The catch is, for this to be useful, you do need to use the transactional data-structures provided; so that will mean using TSet instead of Set, TMap instead of Map. These provide all-or-nothing update semantics when used in the transactional context (within an atomic block). This are very much like clojure's persistent collections. You can also build your own transactional data structures out of Refs for use within these atomic blocks.
If you are not averse to parenthesis, the clojure explanation of refs is really good: http://clojure.org/refs

Depending on your use case you might be able to stick with deeply immutable object structures which you partially copy instead of actually mutating them (similar to an "updated" immutable list that shares a suffix with its original list). So-called lenses are a nice way of dealing with such structures, read about them in this SO question or in this blog post.
Sticking with immutable structures of course only works if you don't want changes to be globally observable. An example where immutable structures are most likely not an option are two concurrent clients working on a shared list, where the modifications done by client A must be observable by client B, and vice versa.

I suggest the best way is to store the mutable variable inside a Akka actor, use message passing in and out of the Akka actor to send and receive this mutable reference. Use immutable data structures.
I have a StorageActor as follows. The variable entityMap gets updated every time something is stored via the StoreEntity. Also it doesn't need to be volatile and still works.
The Akka actor is the place where things can change, messages are passed in and out into the pure functional world.
import akka.actor.Actor
import java.util.UUID
import com.orsa.minutesheet.entity.Entity
case class EntityRef(entity: Option[Entity])
case class FindEntity(uuid: UUID)
case class StoreEntity[T >: Entity](uuid: UUID, entity: Option[T])
class StorageActor extends Actor {
private var entityMap = Map[UUID, Entity]()
private def findEntityByUUID(uuid:UUID): Option[Entity] = entityMap.get(uuid)
def receive = {
case FindEntity(uuid) => sender ! EntityRef( findEntityByUUID(uuid) )
case StoreEntity(uuid, entity) =>
entity match {
case Some(store) => entityMap += uuid -> store.asInstanceOf[Entity]
case None => entityMap -= uuid
}
}
}

Related

Scala immutable collections cannot be shared without synchronization?

From the «Learning concurrent programming in Scala» book:
In current versions of Scala (2.11.1), however, certain collections that are
deemed immutable, such as List and Vector, cannot be shared without
synchronization. Although their external API does not allow you to
modify them, they contain non-final fields.
Could anyone demonstrate this with a small example? And does this still apply to 2.11.7?
The behavior of changes made in one thread when viewed from another is governed by the Java Memory Model. In particular, these rules are extremely weak when it comes to something like building a collection and then passing the built-and-now-immutable collection to another thread. The JMM does not guarantee that the other thread won't see an earlier view where the collection was not fully built!
Since synchronized blocks enforce an ordering, they can be used to get a consistent view if they're used on every single operation.
In practice, though, this is rarely actually necessary. On the CPU side, there is typically a memory barrier operation that can be used to enforce memory consistency (i.e. if you write the tail of your list and then pass a memory barrier, no other thread can see the tail un-set). And in practice, JVMs usually have to implement synchronized by using memory barriers. So one could hope that you could just pass the created list within a synchronzied block, trusting that a memory barrier would be issued, and everything thereafter would be fine.
Unfortunately, the JMM doesn't require that it be implemented in this way (and you can't assume that the memory-barrier-like behavior of object creation will actually be a full memory barrier that applies to everything in that thread as opposed to simply the final fields of that object), which is both why the recommendation is what it is, and why it's not fixed (yet, anyway) in the library.
For what it's worth, on x86 architectures, I've never observed a problem if you hand off the immutable object within a synchronized block. I have observed problems if you try to do it with CAS (e.g. by using the java.util.concurrent.atomic classes).
As an addition to the excellent answer from Rex Kerr:
it should be noted that most common use cases of immutable collections in a multithreading context are not affected by this problem. The only situation where this might affect you is when you do something that you probably should not do in the first place.
E.g. you have a variable var x: Vector[Int], which you write from one thread A and read from another thread B.
If you mark x with #volatile, there will be no problem, since the volatile write introduces a memory barrier. So you will never be able to observe the Vector in an inconsistent state. The same is true when using a synchronized { } block when writing and reading, or when using java.util.concurrent.atomic.AtomicReference.
If you don't mark x with #volatile, you might observe the vector in an inconsistent state (not just wrong elements, but internally inconsistent!). But in that case your code is arguably broken to begin with. It is completely undefined when you will see the changes from A in B.
You might see them
immediately
after there is a memory barrier somewhere else in your program
not at all
depending on the architecture you`re running on, the phase of the moon, whatever. So as Viktor Klang put it: "Unsafe publication is unsafe..."
Note that if you use a higher level concurrency framework such as akka actors, it is also guaranteed that receivers of messages can not see immutable collections in an inconsistent state.

Atomic function/method in scala (without introducing actor system overheads)

I currently use an Akka actor to establish a code block that is executed atomically and in a thread safe manner (Akka mailbox semantics impose atomicity by virtue of processing one message at a time).
However this introduces the need for an actor system, and additional side-effects or bloat (having to manually propagate exceptions to the caller, losing type safety on ask, and in general using message semantics rather than function calls).
Can a thread-safe atomic code block be accomplished in scala in a simpler way? would you apply #volatile to a function?
It depends on what kind of shared state you want to protect here:
The easiest and universal choice is using same old synchronized. However, unlike the Akka, it's completely blocking, so may easily kill your performance and of course the code-style, as it's hard to control messy side effects. It may also allow for dead-locks.
Java's locks is same approach, but might be a little better for performance.
Another option is same old Java's AtomicReference(implements CAS operations) and related classes. The positive thing about is that they're non-blocking - developers actually use them to build high-performant collections. The ways of using locks and CAS are decribed here. They both are pretty low-level mechanizms, so I would not recommend to use them much, especially for business-logic (any actor's implementation would be better).
If your shared state is a collection - you may want use same old Java's concurrent collections (they have atomic operations like putIfAbscent). Scala has interesting non-blocking TrieMap for instance.
Scala STM is also an alternative
Finally, this question is dedicated to lightweight actor model implementations.
P.S. Volatile annotation is nothing more than volatile keyword analog from Java. You can put it on the method just because any annotation can be put on anything.
Depending on what you're trying to achieve, the simplest might be old synchronized:
//your mutable state
private var x = 0
//better than locking on 'this' is to have a dedicated lock
private val lock = new Object
def add(i:Int) = lock.synchronized { x += i }
This is the 'old Java' way, but it might work for you depending on what you're doing. Of course, this is the fastest way to deadlocks if your synchronize operation is more complex and/or you need high throughput.

Scala: var List vs val MutableList

In Odersky et al's Scala book, they say use lists. I haven't read the book cover to cover but all the examples seem to use val List. As I understand it one also is encouraged to use vals over vars. But in most applications is there not a trade off between using a var List or a val MutableList?. Obviously we use a val List when we can. But is it good practice to be using a lot of var Lists (or var Vectors etc)?
I'm pretty new to Scala coming from C#. There I had a lot of:
public List<T> myList {get; private set;}
collections which could easily have been declared as vals if C# had immutability built in, because the collection itself never changed after construction, even though elements would be added and subtracted from the collection in its life time. So declaring a var collection almost feels like a step away from immutability.
In response to answers and comments, one of the strong selling points of Scala is: that it can have many benefits without having to completely change the way one writes code as is the case with say Lisp or Haskell.
Is it good practice to be using a lot of var Lists (or var Vectors
etc)?
I would say it's better practice to use var with immutable collections than it is to use val with mutable ones. Off the top of my head, because
You have more guarantees about behaviour: if your object has a mutable list, you never know if some other external object is going to update it
You limit the extent of mutability; methods returning a collection will yield an immutable one, so you only have mutablility within your one object
It's easy to immutabilize a var by simply assigning it to a val, whereas to make a mutable collection immutable you have to use a different collection type and rebuild it
In some circumstances, such as time-dependent applications with extensive I/O, the simplest solution is to use mutable state. And in some circumstances, a mutable solution is just more elegant. However in most code you don't need mutability at all. The key is to use collections with higher order functions instead of looping, or recursion if a suitable function doesn't exist. This is simpler than it sounds. You just need to spend some time getting to know the methods on List (and other collections, which are mostly the same). The most important ones are:
map: applies your supplied function to each element in the collection - use instead of looping and updating values in an array
foldLeft: returns a single result from a collection - use instead of looping and updating an accumulator variable
for-yield expressions: simplify your mapping and filtering especially for nested-loop type problems
Ultimately, much of functional programming is a consequence of immutability and you can't really have one without the other; however, local vars are mostly an implementation detail: there's nothing wrong with a bit of mutability so long as it cannot escape from the local scope. So use vars with immutable collections since the local vars are not what will be exported.
You are assuming either the List must be mutable, or whatever is pointing to it must be mutable. In other words, that you need to pick one of the two choices below:
val list: collection.mutable.LinkedList[T]
var list: List[T]
That is a false dichotomy. You can have both:
val list: List[T]
So, the question you ought to be asking is how do I do that?, but we can only answer that when you try it out and face a specific problem. There's no generic answer that will solve all your problems (well, there is -- monads and recursion -- but that's too generic).
So... give it a try. You might be interested in looking at Code Review, where most Scala questions pertain precisely how to make some code more functional. But, ultimately, I think the best way to go about it is to try, and resort to Stack Overflow when you just can't figure out some specific problem.
Here is how I see this problem of mutability in functional programming.
Best solution: Values are best, so the best in functional programming usage is values and recursive functions:
val myList = func(4);
def func(n) = if (n>0) n::func(n) else Nil
Need mutable stuff: Sometimes mutable stuff is needed or makes everything a lot easier. My impression when we face this situation is to use the mutables structures, so to use val list: collection.mutable.LinkedList[T] instead of var list: List[T], this is not because of a real improvement on performances but because of mutable functions which are already defined in the mutable collection.
This advice is personal and maybe not recommended when you want performance but it is a guideline I use for daily programming in scala.
I believe you can't separate the question of mutable val / immutable var from the specific use case. Let me deepen a bit: there are two questions you want to ask yourself:
How am I exposing my collection to the outside?
I want a snapshot of the current state, and this snapshot should not change regardless of the changes that are made to the entity hosting the collection. In such case, you should prefer immutable var. The reason is that the only way to do so with a mutable val is through a defensive copy.
I want a view on the state, that should change to reflect changes to the original object state. In this case, you should opt for an immutable val, and return it wrapped through an unmodifiable wrapper (much like what Collections.unmodifiableList() does in Java, here is a question where it's asked how to do so in Scala). I see no way to achieve this with an immutable var, so I believe your choice here is mandatory.
I only care about the absence of side effects. In this case the two approaches are very similar. With an immutable var you can directly return the internal representation, so it is slightly clearer maybe.
How am I modifying my collection?
I usually make bulk changes, namely, I set the whole collection at once. This makes immutable var a better choice: you can just assign what's in input to your current state, and you are fine already. With an immutable val, you need to first clear your collection, then copy the new contents in. Definitely worse.
I usually make pointwise changes, namely, I add/remove a single element (or few of them) to/from the collection. This is what I actually see most of the time, the collection being just an implementation detail and the trait only exposing methods for the pointwise manipulation of its status. In this case, a mutable val may be generally better performance-wise, but in case this is an issue I'd recommend taking a look at Scala's collections performance.
If it is necessary to use var lists, why not? To avoid problems you could for example limit the scope of the variable. There was a similar question a while ago with a pretty good answer: scala's mutable and immutable set when to use val and var.

Akka framework support for finding duplicate messages

I'm trying to build a high-performance distributed system with Akka and Scala.
If a message requesting an expensive (and side-effect-free) computation arrives, and the exact same computation has already been requested before, I want to avoid computing the result again. If the computation requested previously has already completed and the result is available, I can cache it and re-use it.
However, the time window in which duplicate computation can be requested may be arbitrarily small. e.g. I could get a thousand or a million messages requesting the same expensive computation at the same instant for all practical purposes.
There is a commercial product called Gigaspaces that supposedly handles this situation.
However there seems to be no framework support for dealing with duplicate work requests in Akka at the moment. Given that the Akka framework already has access to all the messages being routed through the framework, it seems that a framework solution could make a lot of sense here.
Here is what I am proposing for the Akka framework to do:
1. Create a trait to indicate a type of messages (say, "ExpensiveComputation" or something similar) that are to be subject to the following caching approach.
2. Smartly (hashing etc.) identify identical messages received by (the same or different) actors within a user-configurable time window. Other options: select a maximum buffer size of memory to be used for this purpose, subject to (say LRU) replacement etc. Akka can also choose to cache only the results of messages that were expensive to process; the messages that took very little time to process can be re-processed again if needed; no need to waste precious buffer space caching them and their results.
3. When identical messages (received within that time window, possibly "at the same time instant") are identified, avoid unnecessary duplicate computations. The framework would do this automatically, and essentially, the duplicate messages would never get received by a new actor for processing; they would silently vanish and the result from processing it once (whether that computation was already done in the past, or ongoing right then) would get sent to all appropriate recipients (immediately if already available, and upon completion of the computation if not). Note that messages should be considered identical even if the "reply" fields are different, as long as the semantics/computations they represent are identical in every other respect. Also note that the computation should be purely functional, i.e. free from side-effects, for the caching optimization suggested to work and not change the program semantics at all.
If what I am suggesting is not compatible with the Akka way of doing things, and/or if you see some strong reasons why this is a very bad idea, please let me know.
Thanks,
Is Awesome, Scala
What you are asking is not dependent on the Akka framework but rather it's how you architect your actors and messages. First ensuring that your messages are immutable and have an appropriately defined identities via the equals/hashCode methods. Case classes give you both for free however if you have actorRefs embedded in the message for reply purposes you will have to override the identity methods. The case class parameters should also have the same properties recursively (immutable and proper identity).
Secondly you need to figure out how the actors will handle storing and identifying current/past computations. The easiest is to uniquely map requests to actors. This way that actor and only that actor will ever process that specific request. This can be done easily given a fixed set of actors and the hashCode of the request. Bonus points if the actor set is supervised where the supervisor is managing the load balancing/mapping and replacing failed actors ( Akka makes this part easy ).
Finally the actor itself can maintain a response caching behavior based on the criteria you described. Everything is thread safe in the context of the actor so a LRU cache keyed by the request itself ( good identity properties remember ) is easy with any type of behavior you want.
As Neil says, this is not really framework functionality, it's rather trivial to implement this and even abstract it into it's own trait.
trait CachingExpensiveThings { self: Actor =>
val cache = ...
def receive: Actor.Receive = {
case s: ExpensiveThing => cachedOrCache(s)
}
def cacheOrCached(s: ExpensiveThing) = cache.get(s) match {
case null => val result = compute(s)
cache.put(result)
self.reply_?)(result)
case cached => self.reply_?)(cached)
}
def compute(s: ExpensiveThing): Any
}
class MyExpensiveThingCalculator extends Actor with CachingExpensiveThings {
def compute(s: ExpensiveThing) = {
case l: LastDigitOfPi => ...
case ts: TravellingSalesman => ...
}
}
I do not know if all of these responsibilities should be handled only by the Akka. As usual, it all depends on the scale, and in particular - the number of attributes that defines the uniqueness of the message.
In case of cache mechanism, already mentioned approach with uniquely mapping requests to actors is way to go especially that it could be supported by the persistency.
In case of identity, instead of checking simple equality (which may be bottleneck) I will rather use graph based algorithm like signal-collect.

Using Scala, does a functional paradigm make sense for analyzing live data?

For example, when analyzing live stockmarket data I expose a method to my clients
def onTrade(trade: Trade) {
}
The clients may choose to do anything from counting the number of trades, calculating averages, storing high lows, price comparisons and so on. The method I expose returns nothing and the clients often use vars and mutable structures for their computation. For example when calculating the total trades they may do something like
var numTrades = 0
def onTrade(trade: Trade) {
numTrades += 1
}
A single onTrade call may have to do six or seven different things. Is there any way to reconcile this type of flexibility with a functional paradigm? In other words a return type, vals and nonmutable data structures
You might want to look into Functional Reactive Programming. Using FRP, you would express your trades as a stream of events, and manipulate this stream as a whole, rather than focusing on a single trade at a time.
You would then use various combinators to construct new streams, for example one that would return the number of trades or highest price seen so far.
The link above contains links to several Haskell implementations, but there are probably several Scala FRP implementations available as well.
One possibility is using monads to encapsulate state within a purely functional program. You might check out the Scalaz library.
Also, according to reports, the Scala team is developing a compiler plug-in for an effect system. Then you might consider providing an interface like this to your clients,
def callbackOnTrade[A, B](f: (A, Trade) => B)
The clients define their input and output types A and B, and define a pure function f that processes the trade. All "state" gets encapsulated in A and B and threaded through f.
Callbacks may not be the best approach, but there are certainly functional designs that can solve such a problem. You might want to consider FRP or a state-monad solution as already suggested, actors are another possibility, as is some form of dataflow concurrency, and you can also take advantage of the copy method that's automatically generated for case classes.
A different approach is to use STM (software transactional memory) and stick with the imperative paradigm whilst still retaining some safety.
The best approach depends on exactly how you're persisting the data and what you're actually doing in these state changes. As always, let a profiler be your guide if performance is critical.