I am trying to implement fault tolerance within the actor system of my Scala project, to identify errors and handle them. I am using Classic actors. Each supervisor actor has 5 child actors. If one of these child actors fails, I want to restart that actor, log the error, and handle the problem that caused the actor to fail.
I implemented a One-For-One SupervisorStrategy in my Supervisor actor class as such:
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1.minute) {
    case e: ArithmeticException =>
      logger.error(s"Supervisor: $e from $sender; Restarting!")
      Restart
    case e: NullPointerException =>
      logger.error(s"Supervisor: $e from $sender; Restarting!")
      Restart
    case e: IllegalArgumentException =>
      logger.error(s"Supervisor: $e from $sender; Restarting!")
      Restart
    case _: Exception =>
      // The directive below is Restart, so the message says so as well.
      logger.error(s"Supervisor: Unknown exception from $sender; Restarting!")
      Restart
  }
I have also tried something as simple as:
override val supervisorStrategy =
  OneForOneStrategy() {
    case _ =>
      logger.error(s"Supervisor: Unknown exception from $sender; Restarting!")
      Restart
  }
I added the following code into one of the supervisor methods that tells the actors to start working:
if (!errorSent) {
  errorSent = true
  actor ! new NullPointerException
}
Here errorSent is a var declared in my supervisor, and the actor throws the exception when it receives an exception message. The flag makes sure this happens only once, so that the actor does not end up restarting indefinitely. I did this because the system never (or rarely) fails on its own. But in the log file I do not get what I expect; instead I get the following:
13:41:19.893 [ClusterSystem-akka.actor.default-dispatcher-21] ERROR akka.actor.OneForOneStrategy - null
java.lang.NullPointerException: null
I have looked at many examples, as well as the Akka documentation on fault tolerance for both typed and classic actors, but I cannot get this to work.
Here's the thing I am kind of stuck on. I have a SupervisorActor which creates Actor A and B and so on. There are no child actors of ActorA or ActorB. Let's say both Actor A and B hit the database and get a SQL exception. This is propagated up the chain to the SupervisorActor. When I catch SQL exception, I also need to Log that Actor A had a SQL exception. But how can I achieve this?
One way I can think of is that Actor A logs it and rethrows the exception up the call stack. But I would need a try-catch block in my code, which kind of defeats the purpose here.
Another way I can think of is for Actor A and B to create a new child actor A1 which would send it up the chain, but that's not an option because that's a common library without actors.
Is there a way to achieve something similar to:
try {
  saveUser()
} catch {
  case b: BatchUpdateException =>
    logger.error("We received a BatchUpdateException when trying to save the user")
    throw b
  case e: Exception =>
    logger.error("Some other exception occurred")
    throw e
}
try {
  saveSeller()
} catch {
  case b: BatchUpdateException =>
    logger.error("We received a BatchUpdateException when trying to save the Seller details")
    throw b
  case e: Exception =>
    logger.error("Some other exception occurred")
    throw e
}
PS: I am not sure whether a supervision strategy is the right approach for what I am trying to achieve. I am trying to explore new possibilities.
Centralizing fault-handling logic within a supervisor actor using SupervisorStrategy is a better approach than scattering/duplicating it across individual actors. In particular, expressing the Exception-handling logic as the decider parameter of type PartialFunction[Throwable, Directive] helps improve code maintainability.
When I catch SQL exception, I also need to Log that Actor A had a SQL exception. But how can I achieve this?
Within the supervisor actor, you can always log Exceptions from individual child actors by including the corresponding actor reference, which is available via sender. Below is a trivialized example of a supervisor actor logging actor-specific Exceptions from a couple of child actors and taking the corresponding Resume/Stop/Escalate actions:
import akka.actor.{Actor, ActorSystem, Props, ActorLogging}
import akka.actor.OneForOneStrategy
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._
import java.sql.SQLException
implicit val system = ActorSystem("system")
implicit val ec = system.dispatcher
case class CreateWorker(props: Props, name: String)
case class BogusQuery(ex: Exception)
def doQuery(q: BogusQuery) = throw q.ex
class MySupervisor extends Actor with ActorLogging {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1.minute) {
      case e: SQLException =>
        log.error(s"Supervisor: $e from $sender; Resuming!")
        Resume
      case e: NullPointerException =>
        log.error(s"Supervisor: $e from $sender; Stopping!")
        Stop
      case _: Exception =>
        log.error(s"Supervisor: Unknown exception from $sender; Escalating!")
        Escalate
    }
  def receive = {
    case w: CreateWorker => sender ! context.actorOf(w.props, w.name)
  }
}
class MyWorker extends Actor with ActorLogging {
  def receive = {
    case q: BogusQuery =>
      log.info(s"$self: Received '$q'!")
      doQuery(q)
    case x =>
      log.error(s"$self: Unknown value '${x}'!")
  }
}
val supervisor = system.actorOf(Props[MySupervisor], "supervisor")
supervisor ! CreateWorker(Props[MyWorker], "workerA")
supervisor ! CreateWorker(Props[MyWorker], "workerB")
val workerA = system.actorSelection("/user/supervisor/workerA")
val workerB = system.actorSelection("/user/supervisor/workerB")
workerA ! BogusQuery(new SQLException)
// [INFO] [<timestamp>] [<dispatcher>] [akka://system/user/supervisor/workerA]
// Actor[akka://system/user/supervisor/workerA#-2129514903]:
// Received 'BogusQuery(java.sql.SQLException)'!
// [ERROR] [<timestamp>] [<dispatcher>] [akka://system/user/supervisor]
// Supervisor: java.sql.SQLException from
// Actor[akka://system/user/supervisor/workerA#-2129514903]; Resuming!
// [WARN] [<timestamp>] [<dispatcher>] [akka://system/user/supervisor/workerA] null
workerB ! BogusQuery(new NullPointerException)
// [ERROR] [<timestamp>] [<dispatcher>] [akka://system/user/supervisor]
// Supervisor: java.lang.NullPointerException from
// Actor[akka://system/user/supervisor/workerB#-1563197689]; Stopping!
// [ERROR] [<timestamp>] [<dispatcher>] [akka://system/user/supervisor/workerB] null
// java.lang.NullPointerException ...
I have an actor that is created at application startup as a child of another actor. Once per day it receives a message from the parent telling it to fetch some files from an SFTP server.
Now, there might be minor, temporary connection exceptions that cause the operation to fail. In this case, a retry is needed.
But there might also be cases in which the exception thrown is not going to be resolved by a retry (e.g. file not found, improper configuration, etc.).
So, in this case, what would be an appropriate retry mechanism and supervision strategy, considering that the actor receives messages only at long intervals (once a day)?
In this case, the message sent to the actor is not bad input - it is just a trigger. Example:
case object FileFetch
If I have a supervision strategy in the parent like this, it is going to restart the failing child on every minor/major exception without retries.
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = -1, withinTimeRange = Duration.Inf) {
    case _: Exception => Restart
  }
What I want to have is something like this:
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = -1, withinTimeRange = Duration.Inf) {
    case _: MinorException => ??? // wanted: retry the same message 2 or 3 times, then Restart
    case _: Exception => Restart
  }
"Retrying" or resending a message in the event of an exception is something that you have to implement yourself. From the documentation:
If an exception is thrown while a message is being processed (i.e. taken out of its mailbox and handed over to the current behavior), then this message will be lost. It is important to understand that it is not put back on the mailbox. So if you want to retry processing of a message, you need to deal with it yourself by catching the exception and retry[ing] your flow. Make sure that you put a bound on the number of retries since you don’t want a system to livelock (so consuming a lot of cpu cycles without making progress).
If you want to resend the FileFetch message to the child in the event of a MinorException without restarting the child, then you could catch the exception in the child to avoid triggering the supervision strategy. In the try-catch block, you could send a message to the parent and have the parent track the number of retries (and perhaps include a timestamp in this message, if you want the parent to enact some kind of backoff policy, for example). In the child:
def receive = {
  case FileFetch =>
    try {
      ...
    } catch {
      case m: MinorException =>
        val now = System.nanoTime
        context.parent ! MinorIncident(self, now)
    }
  case ...
}
In the parent:
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = -1, withinTimeRange = Duration.Inf) {
    case _: Exception => Restart
  }
var numFetchRetries = 0
def receive = {
  case MinorIncident(fetcherRef, time) =>
    log.error(s"${fetcherRef} threw a MinorException at ${time}")
    if (numFetchRetries < 3) { // possibly use the time in the retry logic; e.g., a backoff
      numFetchRetries = numFetchRetries + 1
      fetcherRef ! FileFetch
    } else {
      numFetchRetries = 0
      context.stop(fetcherRef)
      ... // recreate the child
    }
  case SomeMsgFromChildThatFetchSucceeded =>
    numFetchRetries = 0
  case ...
}
Alternatively, instead of catching the exception in the child, you could set the supervisor strategy to Resume the child in the event of a MinorException, while still having the parent handle the message retry logic:
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = -1, withinTimeRange = Duration.Inf) {
    case m: MinorException =>
      val child = sender() // inside the decider, sender() is the failed child
      val now = System.nanoTime
      self ! MinorIncident(child, now)
      Resume
    case _: Exception => Restart
  }
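The code comment above about using the timestamp for a backoff could be realised roughly as follows. This is only a sketch in plain Scala: RetryPolicy, maxRetries, baseDelay and nextDelay are hypothetical names, not part of Akka, and the parent would still have to schedule the resend itself (for example with the system scheduler).

```scala
import scala.concurrent.duration._

// Hypothetical helper, not an Akka API: decides whether another retry is
// allowed and how long to wait before resending the trigger message.
final case class RetryPolicy(maxRetries: Int, baseDelay: FiniteDuration) {
  // `attempt` is the number of retries already made (0 before the first retry).
  // Returns Some(delay) while retries remain, None once the bound is reached.
  def nextDelay(attempt: Int): Option[FiniteDuration] =
    if (attempt < 0 || attempt >= maxRetries) None
    else Some(baseDelay * (1L << attempt)) // doubles: base, 2 * base, 4 * base, ...
}
```

In the parent, policy.nextDelay(numFetchRetries) could take the place of the numFetchRetries < 3 check: on Some(delay) the parent increments the counter and schedules FileFetch after that delay, and on None it resets the counter and stops (and recreates) the child. The bound on retries is what keeps the system from livelocking, as the quoted documentation warns.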
I've been testing Akka's fault-tolerance system, and so far it has worked well for retrying a message according to the specified maxNrOfRetries.
However, it does not spread the restarts over the given time range; it performs them all at once, ignoring withinTimeRange.
I tried both AllForOneStrategy and OneForOneStrategy, but it does not change anything.
I am trying to follow this blog post: http://letitcrash.com/post/23532935686/watch-the-routees, and this is the code I've been working on.
class Supervisor extends Actor with ActorLogging {
  var replyTo: ActorRef = _
  val child = context.actorOf(
    Props(new Child)
      .withRouter(
        RoundRobinPool(
          nrOfInstances = 5,
          supervisorStrategy =
            AllForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 10.second) {
              case _: NullPointerException => Restart
              case _: Exception => Escalate
            })), name = "child-router")
  child ! GetRoutees
  def receive = {
    case RouterRoutees(routees) =>
      routees foreach context.watch
    case "start" =>
      replyTo = sender()
      child ! "error"
    case Terminated(actor) =>
      replyTo ! -1
      context.stop(self)
  }
}
class Child extends Actor with ActorLogging {
  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    log.info("***** RESTARTING *****")
    // Resend the message that caused the failure to the restarted instance.
    message foreach { self forward _ }
  }
  def receive = LoggingReceive {
    case "error" =>
      log.info("***** GOT ERROR *****")
      throw new NullPointerException
  }
}
object Boot extends App {
  val system = ActorSystem()
  val supervisor = system.actorOf(Props[Supervisor], "supervisor")
  supervisor ! "start"
}
Am I doing anything wrong to accomplish that?
EDIT
Actually, I misunderstood the purpose of the withinTimeRange.
To schedule my retries in a time range, I'm doing the following:
override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
  log.info("***** RESTARTING *****")
  // scheduleOnce needs an implicit ExecutionContext in scope, e.g. import context.dispatcher
  message foreach { msg =>
    context.system.scheduler.scheduleOnce(30.seconds, self, msg)
  }
}
It seems to work ok.
I think you have misunderstood the purpose of the withinTimeRange arg. That value is supposed to be used in conjunction with maxNrOfRetries to provide a window in which to support the limiting of the number of retries. For example, as you have specified, the implication is that the supervisor will no longer restart an individual child if that child needs to be restarted more than 3 times in 10 seconds.
From docs:
maxNrOfRetries - the number of times a child actor is allowed to be restarted, negative value means no limit, if the limit is exceeded the child actor is stopped
withinTimeRange - duration of the time window for maxNrOfRetries, Duration.Inf means no window
Your code means that when any child fails with a NullPointerException more than 3 times within 10 seconds, it will not be restarted again. Because of AllForOneStrategy, all routees are restarted as soon as the first routee fails. And because you've overridden preRestart to resend the failed message, this situation repeats until it reaches 3 failures within 10 seconds (which happens in less than a second).
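To make the interaction between the two parameters concrete, here is a small model of the bookkeeping in plain Scala. It is a simplified sketch, not Akka's source code: RestartWindow and permitRestart are hypothetical names, timestamps are passed in by the caller as milliseconds, and maxRetries is assumed to be at least 1.

```scala
// Simplified model (not Akka's implementation) of "at most maxRetries restarts
// within a time window". Timestamps are plain milliseconds supplied by the caller.
final class RestartWindow(maxRetries: Int, windowMillis: Long) {
  private var windowStart: Option[Long] = None
  private var restartsInWindow = 0

  // Returns true if a restart requested at time nowMillis is still permitted.
  def permitRestart(nowMillis: Long): Boolean = {
    val start = windowStart.getOrElse(nowMillis)
    if (nowMillis - start <= windowMillis) {
      // Still inside the current window: count this restart against the limit.
      windowStart = Some(start)
      restartsInWindow += 1
      restartsInWindow <= maxRetries
    } else {
      // The window has expired: this restart opens a fresh window.
      windowStart = Some(nowMillis)
      restartsInWindow = 1
      true
    }
  }
}
```

With maxRetries = 3 and a 10 second window, four failures a few milliseconds apart exhaust the budget on the fourth request, which matches the behaviour described in the question. Failures spaced 30 seconds apart, as with the scheduleOnce workaround, each land in a new window and are always permitted in this model.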
How should I handle an exception thrown by the DbActor here? I'm not sure how to handle it. Should I pipe the Failure case?
class RestActor extends Actor with ActorLogging {
  import context.dispatcher
  val dbActor = context.actorOf(Props[DbActor])
  implicit val timeout = Timeout(10 seconds)
  override val supervisorStrategy: SupervisorStrategy = {
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 10 seconds) {
      case x: Exception => ???
    }
  }
  def receive = {
    case GetRequest(reqCtx, id) => {
      // perform db ask
      ask(dbActor, ReadCommand(reqCtx, id)).mapTo[SomeObject] onComplete {
        case Success(obj) => {
          // some stuff
        }
        case Failure(err) => err match {
          case x: Exception => ???
        }
      }
    }
  }
}
I would be glad to get your thoughts. Thanks in advance!
There are a couple of questions I can see here based on the questions in your code sample:
What types of things can I do when I override the default supervisor behavior in the definition of how to handle exceptions?
When using ask, what types of things can I do when I get a Failure result on the Future that I am waiting on?
Let's start with the first question first (usually a good idea). When you override the default supervisor strategy, you gain the ability to change how certain types of unhandled exceptions in the child actor are handled, with regard to what to do with that failed child actor. The key word in that previous sentence is unhandled. For actors that are doing request/response, you may actually want to handle (catch) specific exceptions and return certain response types instead (or fail the upstream future, more on that later) as opposed to letting them go unhandled. When an unhandled exception happens, you basically lose the ability to respond to the sender with a description of the issue, and the sender will probably get a TimeoutException instead because their Future will never be completed. Once you have figured out what you will handle explicitly, you can consider all the remaining exceptions when defining your custom supervisor strategy. Inside this block here:
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 10 seconds) {
  case x: Exception => ???
}
You get a chance to map an exception type to a failure Directive, which defines how the failure will be handled from a supervision standpoint. The options are:
Stop - Completely stop the child actor and do not send any more messages to it
Resume - Resume the failed child, not restarting it thus keeping its current internal state
Restart - Similar to resume, but in this case, the old instance is thrown away and a new instance is constructed and internal state is reset (preStart)
Escalate - Escalate up the chain to the parent of the supervisor
So let's say that given a SQLException you wanted to resume and given all others you want to restart then your code would look like this:
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 10 seconds) {
  case x: SQLException => Resume
  case other => Restart
}
Now for the second question, which pertains to what to do when the Future itself returns a Failure response. In this case, I guess it depends on what was supposed to happen as a result of that Future. If the rest actor itself was responsible for completing the http request (let's say that reqCtx has a complete(statusCode:Int, message:String) function on it), then you could do something like this:
ask(dbActor, ReadCommand(reqCtx, id)).mapTo[SomeObject] onComplete {
  case Success(obj) => reqCtx.complete(200, "All good!")
  case Failure(err: TimeoutException) => reqCtx.complete(500, "Request timed out")
  case Failure(ex) => reqCtx.complete(500, ex.getMessage)
}
Now if another actor upstream was responsible for completing the http request and you needed to respond to that actor, you could do something like this:
val origin = sender // capture the sender before the asynchronous callback runs
ask(dbActor, ReadCommand(reqCtx, id)).mapTo[SomeObject] onComplete {
  case Success(obj) => origin ! someResponseObject
  case Failure(ex) => origin ! Status.Failure(ex)
}
This approach assumes that in the success block you first want to massage the result object before responding. If you don't want to do that and you want to defer the result handling to the sender then you could just do:
val origin = sender
val fut = ask(dbActor, ReadCommand(reqCtx, id))
fut pipeTo origin
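The difference between massaging the result yourself and simply passing the future on can also be seen with plain Scala futures, independent of Akka. In the sketch below, lookup, Reply, Found and Failed are hypothetical stand-ins for the ask to dbActor and for whatever response type the upstream caller expects; none of them come from the original code.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RecoverSketch {
  sealed trait Reply
  final case class Found(value: String) extends Reply
  final case class Failed(reason: String) extends Reply

  // Hypothetical stand-in for ask(dbActor, ReadCommand(...)).
  def lookup(id: Int): Future[String] =
    if (id >= 0) Future.successful(s"row-$id")
    else Future.failed(new IllegalArgumentException("negative id"))

  // Massage the success value and turn any failure into a reply value,
  // so the caller always receives a successfully completed future.
  def lookupReply(id: Int): Future[Reply] =
    lookup(id).map[Reply](v => Found(v)).recover {
      case ex: Exception => Failed(ex.getMessage)
    }

  // Blocking is for demonstration only; inside an actor you would pipe the future.
  def await(id: Int): Reply = Await.result(lookupReply(id), 2.seconds)
}
```

A future shaped like lookupReply can then be piped to the original sender, which receives a Found or a Failed message instead of having to deal with a failed future.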
For simpler systems one may want to catch and forward all of the errors. For that I made this small function to wrap the receive method, without bothering with supervision:
import akka.actor.Actor.Receive
import akka.actor.ActorContext
/**
 * Meant for wrapping the receive method with try/catch.
 * A failed try will result in a reply to sender with the exception.
 * @example
 * def receive: Receive = honestly {
 *   case msg => sender ! riskyCalculation(msg)
 * }
 * ...
 * (honestActor ? "some message") onComplete {
 *   case Success(e: Throwable) => ...process error
 *   case Success(r) => ...process result
 * }
 * @param receive
 * @return Actor.Receive
 *
 * @author Bijou Trouvaille
 */
def honestly(receive: => Receive)(implicit context: ActorContext): Receive = { case msg =>
  try receive(msg) catch { case error: Throwable => context.sender ! error }
}
You can then place it into a package file and import it, a la akka.pattern.pipe and such. Obviously, this won't deal with exceptions thrown by asynchronous code.
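One thing to watch with a wrapper written as { case msg => ... } is that the resulting partial function is defined for every message, so a message that the inner receive does not match surfaces as a MatchError, which is then caught and sent back instead of being treated as unhandled. Below is a sketch of a variant that keeps the inner function's domain. It is written against plain PartialFunction[Any, Unit] (the type that Actor.Receive aliases) so that it runs without Akka; guarded and onError are hypothetical names, and it catches NonFatal rather than every Throwable, which is a deliberate change from the function above.

```scala
import scala.util.control.NonFatal

object GuardedReceive {
  type Receive = PartialFunction[Any, Unit] // same shape as akka.actor.Actor.Receive

  // Wraps `inner` so that exceptions thrown while handling a message are passed
  // to `onError`, while messages that `inner` does not match stay unmatched.
  def guarded(inner: Receive)(onError: Throwable => Unit): Receive =
    new PartialFunction[Any, Unit] {
      def isDefinedAt(msg: Any): Boolean = inner.isDefinedAt(msg)
      def apply(msg: Any): Unit =
        try inner(msg)
        catch { case NonFatal(e) => onError(e) }
    }
}
```

Inside an actor, onError could be something like e => sender() ! Status.Failure(e), so that a caller using ask sees a failed future rather than a successful one carrying a Throwable.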
I am currently trying to get started with Akka and I am facing a weird problem. I've got the following code for my Actor:
class AkkaWorkerFT extends Actor {
  def receive = {
    case Work(n, c) if n < 0 => throw new Exception("Negative number")
    case Work(n, c) => self reply n.isProbablePrime(c);
  }
}
And this is how I start my workers:
val workers = Vector.fill(nrOfWorkers)(actorOf[AkkaWorkerFT].start());
val router = Routing.loadBalancerActor(SmallestMailboxFirstIterator(workers)).start()
And this is how I shut everything down:
futures.foreach( _.await )
router ! Broadcast(PoisonPill)
router ! PoisonPill
Now what happens is if I send the workers messages with n > 0 (no exception is thrown), everything works fine and the application shuts down properly. However, as soon as I send it a single message which results in an exception, the application does not terminate because there is still an actor running, but I can't figure out where it comes from.
In case it helps, this is the stack of the thread in question:
Thread [akka:event-driven:dispatcher:event:handler-6] (Suspended)
Unsafe.park(boolean, long) line: not available [native method]
LockSupport.park(Object) line: 158
AbstractQueuedSynchronizer$ConditionObject.await() line: 1987
LinkedBlockingQueue<E>.take() line: 399
ThreadPoolExecutor.getTask() line: 947
ThreadPoolExecutor$Worker.run() line: 907
MonitorableThread(Thread).run() line: 680
MonitorableThread.run() line: 182
PS: The thread which is not terminating isn't any of the worker threads: I've added a postStop callback, and every one of them stops properly.
PPS: Actors.registry.shutdownAll works around the problem, but I think shutdownAll should only be used as a last resort, shouldn't it?
The proper way to handle problems inside Akka actors is not to throw an exception but rather to set up supervisor hierarchies.
"Throwing an exception in concurrent code (let’s assume we are using
non-linked actors), will just simply blow up the thread that currently
executes the actor.
There is no way to find out that things went wrong (apart from
inspecting the stack trace).
There is nothing you can do about it."
see Fault Tolerance Through Supervisor Hierarchies (1.2)
* note * the above is true for old versions of Akka (1.2)
In newer versions (e.g. 2.2) you'd still set a supervisor hierarchy but it will trap Exceptions thrown by child processes. e.g.
class Child extends Actor {
  var state = 0
  def receive = {
    case ex: Exception ⇒ throw ex
    case x: Int ⇒ state = x
    case "get" ⇒ sender ! state
  }
}
and in the supervisor:
class Supervisor extends Actor {
  import akka.actor.OneForOneStrategy
  import akka.actor.SupervisorStrategy._
  import scala.concurrent.duration._
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) {
      case _: ArithmeticException ⇒ Resume
      case _: NullPointerException ⇒ Restart
      case _: IllegalArgumentException ⇒ Stop
      case _: Exception ⇒ Escalate
    }
  def receive = {
    case p: Props ⇒ sender ! context.actorOf(p)
  }
}
see Fault Tolerance Through Supervisor Hierarchies (2.2)
Turning off the logging to make sure things terminate, as proposed by Viktor, is a bit strange. What you can do instead is:
EventHandler.shutdown()
that cleanly shuts down all the (logger) listeners that keep the world running after the exception:
def shutdown() {
  foreachListener(_.stop())
  EventHandlerDispatcher.shutdown()
}
Turn off the logger in akka.conf.