Is spawning a new actor from an existing one thread-safe? - scala

As it was said several times before here, you can spwan a new thread (putting a long-processing computation into actor{} block) inside an actor, and spawned computation will be safely run on the same thread pool (used by the actor scheduler).
actor{
var i = 0
case msg => actor {
// computation
i = i + 1 // is `i` still thread safe?
// looks like it can be access simultaneosly from 2 two threads now
// should I make it #volatile?
}
reply(i)
}
However, will it be thread-safe, and does it follow, in general, the original design, which states that only one thread in a moment of time can work with one actor?

Spawning an actor from another is completely safe.
However, sharing mutable state between actors isn't, regardless of how & where they were spawned.
The whole point of actors is that they should communicate with messages, via their mailboxes. Abuse this model and actors offer no more protection from concurrency issues than you get with raw threads.

In your example, i can indeed be accessed from multiple threads; that doing the computation and the thread calling reply. Perhaps you should reply with aFuture?
reply( future { /* perform computation */ } )
The caller would then be able to react to the futures input channel (non-blocking), or simply apply() (blocking).

Actors can only process one message simultanously, so anything you are doing within your actor is threadsafe by design you have never to think about thread safety.
This is the complete idea behind actors and it's concurrency design.

Related

Why I should use context.become() to store internal state in Actor?

I've read scala-best-practices and have a question about "5.2. SHOULD mutate state in actors only with context.become". I don't understand why it is so bad to store internal state using var. If actor executes all messages sequentially I just can't see any source of problems. What do I miss?
Consider the first example in the article that you referenced:
class MyActor extends Actor {
val isInSet = mutable.Set.empty[String]
def receive = {
case Add(key) =>
isInSet += key
case Contains(key) =>
sender() ! isInSet(key)
}
}
There's nothing inherently incorrect with this example, so there isn't some vital understanding that you're missing. As you know, an actor processes the messages in its mailbox sequentially, so it is safe to represent its state in an internal variable and to mutate this state, as long as the actor doesn't expose that state1.
become is often used to dynamically switch an actor's behavior (e.g., changing the kind of messages that the actor handles) and/or its state. In the second example in the article, the actor's state is encoded in its behavior as a parameter. This is an elegant alternative to the first example, but in this simple case it's a matter of preference.
One scenario in which become can really shine, however, is an actor that has many state transitions. Without the use of become, the actor's logic in its receive block can grow unmanageably large or turn into a jungle of if-else statements. As an example of using become to model state transitions, check out this sample project that models the "Dining Philosophers" problem.
1A potential issue is that while isInSet is a val, it's a mutable Set, so weird things can happen if the actor exposes this state to something outside of the actor itself (which it is not doing in the example). For example, if the actor sends this Set in a message to another actor, then the external actor can change this state, causing unexpected behavior or race conditions. One can mitigate this issue by changing the val to a var, and the mutable Set to an immutable Set.
I don't think there's necessarily anything wrong with using vars in actors, for exactly the reasons you mentioned (Keep in mind though, that this is only for code executed in the context of the receive(...), i.e., if you start a thread, or use a Future, even if it's from within the receive, that code is no longer executed sequentially).
However, I personally prefer to use context.become(...) for controlling state, mainly because it clearly shows me the states in the actor that can change (vars can be scattered all over).
I also prefer to limit it to 0 or 1 call to context.become(...) per message handled, so it's clear where this state transition is happening.
That said, you can get the same benefits by using a convention where you define all your vars in one place, and make sure to re-assign them in one place (say towards the end) in your message handling.

Akka Actor running infinite loop after initialization message

New to Akka and Actors - I need to start a few actors that will basically spend their lives reading from Kafka topics and writing to an Ignite cache. I configured my dispatcher like this:
kafka-dispatcher {
executor = "thread-pool-executor"
type = PinnedDispatcher
}
My actors are created with .withDispatcher("kafka-dispatcher"), and my assumption is that each actor will be assigned an individual thread.
These actors basically spend their lives like this:
override def receive: Receive = LoggingReceive {
case InitWorker => {
initialize()
pollTopic() // This never returns
}
}
In other words, they receive an initialization message and then call the pollTopic() method, which never returns - it runs a loop reading (which will block until there is data) and then writing the data.
My questions:
Is this kosher?
Is there a better, i.e. more idiomatic, way to do this? Note that the read call inside pollTopic() blocks.
Answering to your point 2 and from the description of what you're trying to do, maybe you want to consider using Akka streams with the reactive-kafka library. Akka streams uses actors under the hood but manages all of this for you, so you can focus only on implementing reusable small components that do exactly one thing.
You will then be able to write data processing pipelines, using Kafka as a Source for your data flow. I don't know much about Ignite cache, but chances are you either will write a Sink for it or - if you're talking about a blocking API, mapAsync will be your friend.

Scala system process hangs

I have an actor that uses ProcessBuilder to execute an external process:
def act {
while (true) {
receive {
case param: String => {
val filePaths = Seq("/tmp/file1","/tmp/file2")
val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes())
val output = s"myExecutable.sh ${param}" #< fileList !!<
doSomethingWith(output)
}
}
}
}
I run hundreds this actors running in parallel. Sometimes, for an unknown reason, the execution of the process (!!) never returns. It hangs forever. This specific actor cannot handle new messages. Is there any way to setup a timeout for this process to return, and if it exceeds retry?
What could be the reason for these executions to hold forever? Because these commands are not supposed to last more than a few milliseconds.
Edit 1:
Two important facts that I observed:
This problem does not occur on Max OS X, only in Linux
When I don't use ByteArrayInputStream as input for the execution, the program does not hang
I have an actor that uses ProcessBuilder to execute an external process: ... I run hundreds this actors running in parallel ...
That's some very heavy processing happening in parallel just to achieve a few millisecs of work in each case. Concurrent processing mechanisms rank as follows (from worst to best in terms of resource-usage, scalability and performance):
process = heavy-weight
thread = medium-weight (dozens of threads can execute within a single process space)
actor = light-weight (dozens of actors can execute by leveraging a single shared thread or multiple shared threads)
Concurrently spawning many processes takes significant operating system resources - for process creation and termination. In extreme cases, the O/S overhead to start & end processes could consume hundreds or thousands more CPU and memory resources than the actual job execution. That's why the thread-model was created (and the more efficient actor model). Think of your current processing as doing 'CGI-like' non-scalable O/S-stressing-processing from within your extremely-scalable actors - that's an anti-pattern. It doesn't take much to stress some operating systems to the point of breakage: this could be happening.
Also, if the files being read are very large in size, it would be best for scalability and reliability to limit the number of processes that concurrently read files on the same disk. It might be OK for up to 10 processes to read concurrently, I doubt it would be OK for 100.
How should an Actor invoke an external program?
Of course, if you converted your logic in myExecutable.sh into Scala, you would not need to create processes at all. Achieving scalability, performance and reliability would be more straightforward.
Assuming this is not possible/desirable, you should limit the total number of processes created and you should reuse them across different Actors / requests over time.
First solution option: (1) create a pool of processes that are reused (say size 10) (2) create actors (say 100) that communicate to/from the processes via ProcessIO (3) if all processes are busy with processing, then it is OK/appropriate that Actors block until one becomes available. The issue with this option: complexity; the 100 actors must do work to interact with the process pool and the actors themselves add little value when the processes are the bottle-neck.
Better solution option: (1) create a limited number of actors (say 10) (2) have each actor create 1 private long-running process (i.e. no pool as such) (3) have each actor communicate to/from via ProcessIO, blocking if the process is busy. Issue: still not as simple as possible; actors interact poorly with blocking processes.
Best solution option: (1) no actors, a simple for-loop from your main thread will achieve the same benefits as actors (2) create a limited number of processes (10) (3) via for-loop, sequentially interact each process using ProcessIO (if busy - block or skip to next iteration)
Is there any way to setup a timeout for this process to return, and if it exceeds retry?
Indeed there is. One of the most powerful features of actors is the ability for some actors to spawn other actors and to act as supervisor of them (receiving failure or timeout messages, from which they can recover/restart). With 'native scala actors' this is done via rudimentary programming, generating your own checks and timeout messages. But I won't cover that because the Akka approaches are more powerful and simpler. Plus the next major Scala release (2.11) will use Akka as the supported actor model, with 'native scala actors' deprecated.
Here's an example Akka supervising actor with programmatic timeout/restart (not compiled/tested). Of course, this is not useful if you go with the 3rd solution option):
import scala.concurrent.duration._
import scala.collection.immutable.Set
class Supervisor extends Actor {
override val supervisorStrategy =
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) {
case _: ArithmeticException => Resume // resumes (reuses) all child actors
case _: NullPointerException => Restart // restarts all child actors
case _: IllegalArgumentException => Stop // terminates this actor & all children
case _: Exception => Escalate // supervisor to receive exception
}
val worker = context.actorOf(Props[Worker]) // creates a supervised child actor
var pendingRequests = Set.empty[WorkerRequest]
def receive = {
case req: WorkRequest(sender, jobReq) =>
pendingRequests = pendingRequests + req
worker ! req
system.scheduler.scheduleOnce(10 seconds, self, WorkTimeout(req))
case resp: WorkResponse(req # WorkRequest(sender, jobReq), jobResp) =>
pendingRequests = pendingRequests - req
sender ! resp
case timeout: WorkTimeout(req) =>
if (pendingRequests get req != None) {
// restart the unresponsive worker
worker restart
// resend all pending requests
pendingRequests foreach{ worker ! _ }
}
}
}
A word of caution: this approach to actor supervision will not overcome poor architecture & design. If you start with suitable process/thread/actor design to meet your requirements, then supervision will promote reliability. But if you start with poor design, then there's a risk that using 'brute-force' recovery from O/S-level failures could exacerbate your problems - making process reliability worse or even causing the machine to crash.
I don't have enough info to reproduce the issue, so I can't diagnose it exactly, but here's how I'd go about diagnosing it if I were in your shoes. The basic approach is a differential diagnosis - identify possible causes, and tests that would prove or rule them out.
The first thing I'd do is to validate that the myExecutable.sh process spawned by the application is actually terminating.
If the process isn't terminating, then this is part of the problem, so we need to understand why. One thing we could do is to run something other than myExecutable.sh. You suggested that ByteArrayInputStream may be part of the problem, which suggests that myExecutable.sh is getting bad input on stdin. If that's the case, then you could instead run a script that simply logs its input to a file, which would show this. If the input is invalid, then ByteArrayInputStream is providing bad data for some reason - thread safety and unicode are the obvious culprits, but looking at the actual bad data should give you a clue. If the input is valid, then it's a bug in myExecutable.sh.
If the process is terminating, then the problem is somewhere else. My first guesses would be that it's either related to actor scheduling (actor libraries typically use ForkJoin for execution, which is great, but doesn't deal well with blocking code), or a bug in the scala.sys.process library (wouldn't be unprecedented - I had to drop scala.sys.process from a project I was working on because of a memory leak).
Looking at the stack trace for a hung thread should give you some clues (VisualVM is your friend), as you should be able to see what's waiting. You can then find the relevant code in the OpenJDK or Scala standard library source code. Where you go from there depends on what you find.
Can you not fire off this process and its handling in a future and use a timed wait against it?
I don't think we can figure it out witout knowing myExecutable.sh or doSomethingWith.
When it hangs, try killing all the myExecutable.sh processes.
If it helps, you should inspect the myExecutable.sh.
If it does not help, you should inspect the doSomethingWith function.

What is the best way to get a reference to an Akka Actor

I am a little confused by how I am supposed to get a reference to my Actor once it has been crerated in the Akka system. Say I have the following:
class MyActor(val x: Int) extends Actor {
def receive = {
case Msg => doSth()
}
}
And at some stage I would create the actor as follows:
val actorRef = system.actorOf(Props(new MyActor(2), name = "Some actor")
But when I need to refer to the MyActor object I cannot see any way to get it from the actor ref?
Thanks
Des
I'm adding my comment as an answer since it seems to be viewed as something worth putting in as an answer.
Not being able to directly access the actor is the whole idea behind actor systems. You are not supposed to be able to get at the underlying actor class instance. The actor ref is a lightweight proxy to your actor class instance. Allowing people to directly access the actor instance could lead down the path of mutable data issues/concurrent state update issues and that's a bad path to go down. By forcing you to go through the ref (and thus the mailbox), state and data will always be safe as only one message is processed at a time.
I think cmbaxter has a good answer, but I want to make it just a bit more clear. The ActorRef is your friend for the following reasons:
You cannot ever access the underlying actor. Actors receive their thread of execution from the Dispatcher given to them when they are created. They operate on one mailbox message at a time, so you never have to worry about concurrency inside of the actor unless YOU introduce it (by handling a message asynchronously with a Future or another Actor instance via delegation, for example). If someone had access to the underlying class behind the ActorRef, they could easily call into the actor via a method using another thread, thus negating the point of using the Actor to avoid concurrency in the first place.
ActorRefs provide Location Transparency. By this, I mean that the Actor instance could exist locally on the same JVM and machine as the actor from which you would like to send it a message, or it could exist on a different JVM, on a different box, in a different data center. You don't know, you don't care, and your code is not littered with the details of HOW the message will be sent to that actor, thus making it more declarative (focused on the what you want to do business-wise, not how it will be done). When you start using Remoting or Clusters, this will prove very valuable.
ActorRefs mask the instance of the actor behind it when failure occurs, as well. With the old Scala Actors, you had no such proxy and when an actor "died" (threw an Exception) that resulted in a new instance of the Actor type being created in its place. You then had to propagate that new instance reference to anyone who needed it. With ActorRef, you don't care that the Actor has been reincarnated - it's all transparent.
There is one way to get access to the underlying actor when you want to do testing, using TestActorRef.underlyingActor. This way, you can write unit tests against functionality written in methods inside the actor. Useful for checking basic business logic without worrying about Akka dynamics.
Hope that helps.

Scala actors: receive vs react

Let me first say that I have quite a lot of Java experience, but have only recently become interested in functional languages. Recently I've started looking at Scala, which seems like a very nice language.
However, I've been reading about Scala's Actor framework in Programming in Scala, and there's one thing I don't understand. In chapter 30.4 it says that using react instead of receive makes it possible to re-use threads, which is good for performance, since threads are expensive in the JVM.
Does this mean that, as long as I remember to call react instead of receive, I can start as many Actors as I like? Before discovering Scala, I've been playing with Erlang, and the author of Programming Erlang boasts about spawning over 200,000 processes without breaking a sweat. I'd hate to do that with Java threads. What kind of limits am I looking at in Scala as compared to Erlang (and Java)?
Also, how does this thread re-use work in Scala? Let's assume, for simplicity, that I have only one thread. Will all the actors that I start run sequentially in this thread, or will some sort of task-switching take place? For example, if I start two actors that ping-pong messages to each other, will I risk deadlock if they're started in the same thread?
According to Programming in Scala, writing actors to use react is more difficult than with receive. This sounds plausible, since react doesn't return. However, the book goes on to show how you can put a react inside a loop using Actor.loop. As a result, you get
loop {
react {
...
}
}
which, to me, seems pretty similar to
while (true) {
receive {
...
}
}
which is used earlier in the book. Still, the book says that "in practice, programs will need at least a few receive's". So what am I missing here? What can receive do that react cannot, besides return? And why do I care?
Finally, coming to the core of what I don't understand: the book keeps mentioning how using react makes it possible to discard the call stack to re-use the thread. How does that work? Why is it necessary to discard the call stack? And why can the call stack be discarded when a function terminates by throwing an exception (react), but not when it terminates by returning (receive)?
I have the impression that Programming in Scala has been glossing over some of the key issues here, which is a shame, because otherwise it's a truly excellent book.
First, each actor waiting on receive is occupying a thread. If it never receives anything, that thread will never do anything. An actor on react does not occupy any thread until it receives something. Once it receives something, a thread gets allocated to it, and it is initialized in it.
Now, the initialization part is important. A receiving thread is expected to return something, a reacting thread is not. So the previous stack state at the end of the last react can be, and is, wholly discarded. Not needing to either save or restore the stack state makes the thread faster to start.
There are various performance reasons why you might want one or other. As you know, having too many threads in Java is not a good idea. On the other hand, because you have to attach an actor to a thread before it can react, it is faster to receive a message than react to it. So if you have actors that receive many messages but do very little with it, the additional delay of react might make it too slow for your purposes.
The answer is "yes" - if your actors are not blocking on anything in your code and you are using react, then you can run your "concurrent" program within a single thread (try setting the system property actors.maxPoolSize to find out).
One of the more obvious reasons why it is necessary to discard the call stack is that otherwise the loop method would end in a StackOverflowError. As it is, the framework rather cleverly ends a react by throwing a SuspendActorException, which is caught by the looping code which then runs the react again via the andThen method.
Have a look at the mkBody method in Actor and then the seq method to see how the loop reschedules itself - terribly clever stuff!
Those statements of "discarding the stack" confused me also for a while and I think I get it now and this is my understanding now. In case of "receive" there is a dedicated thread blocking on the message (using object.wait() on a monitor) and this means that the complete thread stack is available and ready to continue from the point of "waiting" on receiving a message.
For example if you had the following code
def a = 10;
while (! done) {
receive {
case msg => println("MESSAGE RECEIVED: " + msg)
}
println("after receive and printing a " + a)
}
the thread would wait in the receive call until the message is received and then would continue on and print the "after receive and printing a 10" message and with the value of "10" which is in the stack frame before the thread blocked.
In case of react there is no such dedicated thread, the whole method body of the react method is captured as a closure and is executed by some arbitrary thread on the corresponding actor receiving a message. This means only those statements that can be captured as a closure alone will be executed and that's where the return type of "Nothing" comes to play. Consider the following code
def a = 10;
while (! done) {
react {
case msg => println("MESSAGE RECEIVED: " + msg)
}
println("after react and printing a " + a)
}
If react had a return type of void, it would mean that it is legal to have statements after the "react" call ( in the example the println statement that prints the message "after react and printing a 10"), but in reality that would never get executed as only the body of the "react" method is captured and sequenced for execution later (on the arrival of a message). Since the contract of react has the return type of "Nothing" there cannot be any statements following react, and there for there is no reason to maintain the stack. In the example above variable "a" would not have to be maintained as the statements after the react calls are not executed at all. Note that all the needed variables by the body of react is already be captured as a closure, so it can execute just fine.
The java actor framework Kilim actually does the stack maintenance by saving the stack which gets unrolled on the react getting a message.
Just to have it here:
Event-Based Programming without Inversion of Control
These papers are linked from the scala api for Actor and provide the theoretical framework for the actor implementation. This includes why react may never return.
I haven't done any major work with scala /akka, however i understand that there is a very significant difference in the way actors are scheduled.
Akka is just a smart threadpool which is time slicing execution of actors...
Every time slice will be one message execution to completion by an actor unlike in Erlang which could be per instruction?!
This leads me to think that react is better as it hints the current thread to consider other actors for scheduling where as receive "might" engage the current thread to continue executing other messages for the same actor.