How to handle concurrent access to a Scala collection?

How to handle concurrent access to a Scala collection? - scala

I have an Actor that - in its very essence - maintains a list of objects. It has three basic operations, an add, update and a remove (where sometimes the remove is called from the add method, but that aside), and works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls interleaving each other constantly.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain - and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.

Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mixin the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]

You don't need to synchronize the state of the actors. The aim of the actors is to avoid tricky, error prone and hard to debug concurrent programming.
Actor model will ensure that the actor will consume messages one by one and that you will never have two thread consuming message for the same Actor.

Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed as explained here the Akka documentation.
the actor send rule: where the send of the message to an actor happens before the receive of the same actor.
the actor subsequent processing rule: where processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."

What collection type should I use in a concurrent access situation, and how is it used?
See #hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.

Related

Should Akka Actors do real processing tasks?

I'm writing an application that reads relatively large text files, validates and transforms the data (every line in a text file is an own item, there are around 100M items/file) and creates some kind of output. There already exists a multihreaded Java application (using BlockingQueue between Reading/Processing/Persisting Tasks), but I want to implement a Scala application that does the same thing.
Akka seems to be a very popular choice for building concurrent applications. Unfortunately, due to the asynchronous nature of actors, I still don't understand what a single actor can or can't do, e.g. if I can use actors as traditional workers that do some sort of calculation.
Several documentations say that Actors should never block and I understand why. But the given examples for blocking code always only mention such things as blocking file/network IO.. things that make the actor waiting for a short period of time which is of course a bad thing.
But what if the actor is "blocking" because it actually does something useful instead of waiting? In my case, the processing and transformation of a single line/item of text takes 80ms which is quite a long time (pure processing, no IO involved). Can this work be done by an actor directly or should I use a Future instead (but then, If I have to use Futures anyway, why use Akka in the first place..)?.
The Akka docs and examples show that work can be done directly by actors. But it seems that the authors only do very simplistic work (such as calling filter on a String or incrementing a counter and that's it). I don't know if they do this to keep the docs simple and concise or because you really should not do more that within an actor.
How would you design an Akka-based application for my use case (reading text file, processing every line which takes quite some time, eventually persisting the result)? Or is this some kind of problem that does not suit to Akka?

It all depends on the type of an actor.
I use this rule of thumb: if you don't need to talk to this actor and this actor does not have any other responsibilities, then it's ok to block in it doing actual work. You can treat it as a Future and this is what I would call a "worker".
If you block in an actor that is not a leaf node (worker), i.e. work distributor then the whole system will slow down.
There are a few patterns that involve work pulling/pushing or actor per request model. Either of those could be a fit for your application. You can have a manager that creates an actor for each piece of work and when the work is finished actor sends result back to manager and dies. You can also keep an actor alive and ask for more work from that actor. You can also combine actors and Futures.
Sometimes you want to be able to talk to a worker if your processing is more complex and involves multiple stages. In that case a worker can delegate work yet to another actor or to a future.
To sum-up don't block in manager/work distribution actors. It's ok to block in workers if that does not slow your system down.
disclaimer: by blocking I mean doing actual work, not just busy waiting which is never ok.

Doing computations that take 100ms is fine in an actor. However, you need to make sure to properly deal with backpressure. One way would be to use the work-pulling pattern, where your CPU bound actors request new work whenever they are ready instead of receiving new work items in a message.
That said, your problem description sounds like a processing pipeline that might benefit from using a higher level abstraction such as akka streams. Basically, produce a stream of file names to be processed and then use transformations such as map to get the desired result. I have something like this in production that sounds pretty similar to your problem description, and it works very well provided the data used by the individual processing chunks is not too large.
Of course, a stream will also be materialized to a number of actors. But the high level interface will be more type-safe and easier to reason about.

Can an actor return an object to a future waiting on an ask?

I would like to use an actor to synchronize access to a pool of objects. The actor would manage the objects in the pool including their state (busy v.s. free to allocate). When asked by non-actor code, it would return an object from the pool once available. Thus the calling code has an abstraction for obtaining an object to work with.
To get this kind of abstraction, I need the actor to be able to respond to its message senders' ask messages with the object the actor is allocating to them. Can this be accomplished and would it be resource intensive to pass a whole object via a message?

There is nothing wrong in returning future that will be completed later, by actor.
Keep an eye on this matter, however: will you complete the future with some mutable internal actor state or not?
If the answer is no - it is ok and there is nothing you should worry about.
If the answer is yes - you'll have to take care of synchronization, since actor/external code may mutate this state in different threads (which, kind of defeats the purpose of using actors).
Otherwise it is legit.
BTW, this is not something that is specific to futures only. You have to follow this for any message you send from actor.
UPDATE: extending my answer to address OP's comment.
Question was primarily about returning an object that's not an actor, from an actor, not just about the scenario... not sure if this answer relates to just that... an object can be much heavier than just a "regular" message... unless Akka passes a message in a way equivalent to a reference / interning.
There is no special requirements about "heaviness" of the message in akka and in general it can be any object (you can infer this from the fact that Any is used for the type of the message instead of some akka-defined message class/trait with set of defined limitations).
Of course, you have to treat specially situations when messages should be persisted or sent to remote host, but this is kind of special case. In this scenario you have to ensure that serialization is handled properly.
Anyway, if the message (object) does not leave boundaries of the same jvm - it is ok for object to hold any amount of state.

Stateless Akka actors

I am just starting with Akka and I am trying to split some messy functions in more manageable pieces, each of which would be carried out by a different actor.
My task seems exactly what is suitable for the actor model. I have a tree of objects, which are saved in a database. Each node has some attributes; lets' concentrate on just one and call it wealth. The wealth of the children depends on the wealth of the parent. When one computes the wealth on a node, this should trigger a similar computation on the children. I want to collect all the updated instances of nodes and save them at the same time in the DB.
This seems easy: an actor just performs the computation and then starts another actor for each child of the current node. When all children send back a message with the results of the computation, the actor collects them, adds the current node and sends a message to the actor above him.
The problem is that I do not know a way to be sure that one has received a message from all the children. A trivial way would be to use a counter and increase it whenever a message from a child arrives.
But what happens if two independent parts of the system require such a computation to be performed and the same actor is reused? The actor will spawn twice as many children and the counting will not be reliable anymore. What I would need is to make sure that the same actor is not reused from the external system, but new actors are generated each time the computation is triggered.
Is this even possible? If not, is there some automatic mechanism in Akka to make sure that every child has performed its task?
Am I just using the actor model in a situation that is not suitable here? Essentially I am doing nothing more than could be done with functions - the actors themselves are stateless, but they allow me to simplify and parallelize the computation.

The way that you describe things, I think what you really want is a recursive function that returns a Future[Tree[NodeWealth]], the function would spawn a Future every time it is called and it would call itself for each child in the hierarchy, at the end of the function it would compose the Futures from those calls into a single Future[Result]. When the recursive function terminates and returns the Future[Tree[NodeWealth]] you can then compose it with another Future which updates your DB. Take a look at the Akka documentation here on Futures. And pay particular attention to the section "Composing Futures".
The composition of futures should allow you to avoid state and implement this easily.
Another option would be to use actors and futures, and use the ask pattern on child actors, compose the resulting futures into a single future which you return to the sender (parent actor). This is essentially the same thing, just intertwined with actors. I would probably only go this route if you are already representing your nodes as actors for some other reason.

Actors (scala/akka): is it implied that the receive method will be accessed in a threadsafe manner?

I assume that the messages will be received and processed in a threadsafe manner. However, I have been reading (some) akka/scala docs but I didn't encounter the keyword 'threadsafe' yet.

It is probably because the actor model assumes that each actor instance processes its own mailbox sequentially. That means it should never happen, that two or more concurrent threads execute single actor instance's code. Technically you could create a method in an actor's class (because it is still an object) and call it from multiple threads concurrently, but this would be a major departure from the actor's usage rules and you would do it "at your own risk", because then you would lose all thread-safety guarantees of that model.
This is also one of the reasons, why Akka introduced a concept of ActorRef - a handle, that lets you communicate with the actor through message passing, but not by calling its methods directly.

I think we have it pretty well documented: http://doc.akka.io/docs/akka/2.3.9/general/jmm.html

Actors are 'Treadsafe'. The Actor System (AKKA), provides each actor with its own 'light-weight thread'. Meaning that this is not a tread, but the AKKA system will give the impression that an Actor is always running in it's own thread to the developer. This means that any operations performed as a result of acting on a message are, for all purposes, thread safe.
However, you should not undermine AKKA by using mutable messages or public state. If you develop you actors to be stand alone units of functionality, then they will be threadsafe.
See also:
http://doc.akka.io/docs/akka/2.3.12/general/actors.html#State
and
http://doc.akka.io/docs/akka/2.3.12/general/jmm.html for a more indepth study of the AKKA memory model and how it manages 'tread' issues.

Why does the Scala Actor implementation involve synchronized code?

My understanding is that the queue based approach to concurrency can be implemented without locking. But I see lots of synchronized keywords in the Actor.scala file (looking at 2.8.1). Is it synchronized, is it necessary, would it make a difference if there was an implementation that was not synchronized?
Apparently the question wasn't clear enough: my understanding is that this can be implemented with a non-blocking queue. Why was it not done so? Why use the synchronized keyword anywhere in here? There may be a very good reason, or it might be just because that's the way it was done and it's not necessary. I was just curious which.

The point is that the reactions, which you write in the "act" method, do not need to concern themselves with synchronization. Also, assuming that you do not expose the actor's state, your program will be fully thread safe.
That is not to say that there is no sync at all: synchronization is absolutely necessary [1] to implement read/write access to the actor's mailbox (i.e. the sending and receiving of messages) and also to ensure the actor's private state is consistent across any subsequent reacts.
This is achieved by the library itself and you, the user, need not concern yourself with how it is done. Your state is safe (you don't even need to use volatile fields) because of the JMM's happens before guarantees. That is, if a main-memory write happens before a sync point, then any read occurring after a sync will observe the main memory state left by the write.
[1] - by "synchronization", I mean some mechanism to guarantee a happens-before relationship in the Java Memory Model. This includes the synchronized keyword, the volatile modifier and/or the java.util.concurrent locking primitives

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse