I am using the BackoffSupervisor strategy to create a child actor that has to process some message. I want to implement a very simple restart strategy in which, in case of an exception:
The child propagates the failing message to the supervisor.
The supervisor restarts the child and sends the failing message again.
The supervisor gives up after 3 retries.
Akka persistence is not an option.
So far what I have is this:
Supervisor definition:
val childProps = Props(new SenderActor())
val supervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    childProps,
    childName = cmd.hashCode.toString,
    minBackoff = 1.seconds,
    maxBackoff = 2.seconds,
    randomFactor = 0.2
  ).withSupervisorStrategy(
    OneForOneStrategy(maxNrOfRetries = 3, loggingEnabled = true) {
      case msg: MessageException =>
        println("caught specific message!")
        SupervisorStrategy.Restart
      case _: Exception => SupervisorStrategy.Restart
      case _ => SupervisorStrategy.Escalate
    })
)

val sup = context.actorOf(supervisor)
sup ! cmd
The child actor that is supposed to send the e-mail but fails (throws some exception) and propagates the exception back to the supervisor:
import scala.util.control.NonFatal

class SenderActor() extends Actor {

  def fakeSendMail(): Unit = {
    Thread.sleep(1000)
    throw new Exception("surprising exception")
  }

  override def receive: Receive = {
    case cmd: NewMail =>
      println("new mail received routee")
      try {
        fakeSendMail()
      } catch {
        case NonFatal(t) => throw MessageException(cmd, t) // wrap and rethrow so the supervisor sees it
      }
  }
}
In the above code I wrap any exception into the custom class MessageException, which gets propagated to the SupervisorStrategy, but how do I propagate it further to the new child to force reprocessing? Is this the right approach?
Edit: I attempted to resend the message to the actor in the preRestart hook, but somehow the hook is not being triggered:
import scala.util.control.NonFatal

class SenderActor() extends Actor {

  def fakeSendMail(): Unit = {
    Thread.sleep(1000)
    // println("mail sent!")
    throw new Exception("surprising exception")
  }

  override def preStart(): Unit = {
    println("child starting")
  }

  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    reason match {
      case m: MessageException =>
        println("aaaaa")
        message.foreach(self ! _)
      case _ => println("bbbb")
    }
  }

  override def postStop(): Unit = {
    println("child stopping")
  }

  override def receive: Receive = {
    case cmd: NewMail =>
      println("new mail received routee")
      try {
        fakeSendMail()
      } catch {
        case NonFatal(t) => throw MessageException(cmd, t)
      }
  }
}
This gives me something similar to the following output:
new mail received routee
caught specific message!
child stopping
[ERROR] [01/26/2018 10:15:35.690] [example-akka.actor.default-dispatcher-2] [akka://example/user/persistentActor-4-scala/$a/1962829645] Could not process message sample.persistence.MessageException: Could not process message <stacktrace>
child starting
But there are no logs from the preRestart hook.
The reason that the child's preRestart hook is not invoked is because Backoff.onFailure uses BackoffOnRestartSupervisor underneath the covers, which replaces the default restart behavior with a stop-and-delayed-start behavior that is consistent with the backoff policy. In other words, when using Backoff.onFailure, when a child is restarted, the child's preRestart method is not called because the underlying supervisor actually stops the child, then starts it again later. (Using Backoff.onStop can trigger the child's preRestart hook, but that's tangential to the present discussion.)
The BackoffSupervisor API doesn't support the automatic resending of a message when the supervisor's child restarts: you have to implement this behavior yourself. An idea for retrying messages is to let the BackoffSupervisor's supervisor handle it. For example:
val supervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    ...
  ).withReplyWhileStopped(ChildIsStopped)
   .withSupervisorStrategy(
     OneForOneStrategy(maxNrOfRetries = 3, loggingEnabled = true) {
       case msg: MessageException =>
         println("caught specific message!")
         self ! Error(msg.cmd) // replace cmd with whatever the property name is
         SupervisorStrategy.Restart
       case ...
     })
)
val sup = context.actorOf(supervisor)

// Note: the enclosing actor must mix in akka.actor.Timers for `timers` below.
def receive = {
  case cmd: NewMail =>
    sup ! cmd
  case Error(cmd) =>
    // We assume that NewMail has an id field. Also, adjust the time as needed.
    timers.startSingleTimer(cmd.id, Replay(cmd), 10.seconds)
  case Replay(cmd) =>
    sup ! cmd
  case ChildIsStopped =>
    println("child is stopped")
}
In the above code, the NewMail message embedded in the MessageException is wrapped in a custom case class (in order to easily distinguish it from a "normal"/new NewMail message) and sent to self. In this context, self is the actor that created the BackoffSupervisor. This enclosing actor then uses a single timer to replay the original message at some point. This point in time should be far enough in the future such that the BackoffSupervisor can potentially exhaust SenderActor's restart attempts, so that the child can have ample opportunity to get in a "good" state before it receives the resent message. Obviously this example involves only one message resend regardless of the number of child restarts.
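For reference, Error, Replay, and ChildIsStopped above are not Akka built-ins but custom messages you would define yourself; a minimal sketch of the definitions this code assumes:

// Custom messages for the retry protocol (names are illustrative)
case class Error(cmd: NewMail)   // carries the failed command from the strategy to the enclosing actor
case class Replay(cmd: NewMail)  // fired by the timer to trigger the resend
case object ChildIsStopped       // reply configured via withReplyWhileStopped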
Another idea is to create a BackoffSupervisor-SenderActor pair for every NewMail message, and have the SenderActor send the NewMail message to itself in the preStart hook. One concern with this approach is the cleaning up of resources; i.e., shutting down the BackoffSupervisors (which will, in turn, shut down their respective SenderActor children) when the processing is successful or when the child restarts are exhausted. A map of NewMail ids to (ActorRef, Int) tuples (in which the ActorRef is a reference to a BackoffSupervisor actor, and the Int is the number of restart attempts) would be helpful in this case:
class Overlord extends Actor {

  var state = Map[Long, (ActorRef, Int)]() // assuming the mail id is a Long

  def receive = {
    case cmd: NewMail =>
      val childProps = Props(new SenderActor(cmd, self))
      val supervisor = BackoffSupervisor.props(
        Backoff.onFailure(
          ...
        ).withSupervisorStrategy(
          OneForOneStrategy(maxNrOfRetries = 3, loggingEnabled = true) {
            case msg: MessageException =>
              println("caught specific message!")
              self ! Error(msg.cmd)
              SupervisorStrategy.Restart
            case ...
          })
      )
      val sup = context.actorOf(supervisor)
      state += (cmd.id -> (sup, 0))

    case ProcessingDone(cmdId) =>
      state.get(cmdId) match {
        case Some((backoffSup, _)) =>
          context.stop(backoffSup)
          state -= cmdId
        case None =>
          println(s"${cmdId} not found")
      }

    case Error(cmd) =>
      val cmdId = cmd.id
      state.get(cmdId) match {
        case Some((backoffSup, numRetries)) =>
          if (numRetries == 3) {
            println(s"${cmdId} has already been retried 3 times. Giving up.")
            context.stop(backoffSup)
            state -= cmdId
          } else
            state += (cmdId -> (backoffSup, numRetries + 1))
        case None =>
          println(s"${cmdId} not found")
      }

    case ...
  }
}
Note that SenderActor in the above example takes a NewMail and an ActorRef as constructor arguments. The latter argument allows the SenderActor to send a custom ProcessingDone message to the enclosing actor:
class SenderActor(cmd: NewMail, target: ActorRef) extends Actor {

  override def preStart(): Unit = {
    println(s"child starting, sending ${cmd} to self")
    self ! cmd
  }

  def fakeSendMail(): Unit = ...

  def receive = {
    case cmd: NewMail => ...
  }
}
Obviously the SenderActor is set up to fail every time with the current implementation of fakeSendMail. I'll leave the additional changes needed in SenderActor to implement the happy path, in which SenderActor sends a ProcessingDone message to target, to you.
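For illustration, a minimal sketch of that happy path might look like the following (assuming NewMail has an id field, with ProcessingDone and MessageException as above; the mail-sending body is a placeholder):

import akka.actor.{Actor, ActorRef}
import scala.util.control.NonFatal

class SenderActor(cmd: NewMail, target: ActorRef) extends Actor {

  override def preStart(): Unit = {
    self ! cmd // kick off processing of the one command this actor owns
  }

  def sendMail(mail: NewMail): Unit = {
    // talk to the real mail server here
  }

  def receive: Receive = {
    case mail: NewMail =>
      try {
        sendMail(mail)
        target ! ProcessingDone(mail.id) // tell the Overlord to tear this pair down
      } catch {
        case NonFatal(t) => throw MessageException(mail, t) // let the BackoffSupervisor restart us
      }
  }
}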
In the good solution that #chunjef provides, he alerts about the risk of scheduling a job resend before the backoff supervisor has started the worker:
This enclosing actor then uses a single timer to replay the original message at some point. This point in time should be far enough in the future such that the BackoffSupervisor can potentially exhaust SenderActor's restart attempts, so that the child can have ample opportunity to get in a "good" state before it receives the resent message.
If this happens, the jobs will end up in dead letters and no further progress will be made.
I've made a simplified fiddle with this scenario.
So the schedule delay should be larger than maxBackoff, and this could have an impact on job completion time.
A possible solution to avoid this scenario is to make the worker actor send a message to its parent when it is ready to work, like here; a rough sketch of that handshake follows.
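A minimal sketch, assuming the names from the earlier answer (WorkerReady is a hypothetical message; the enclosing actor's ref is passed in explicitly because context.parent of the worker is the BackoffSupervisor, not the enclosing actor):

import akka.actor.{Actor, ActorRef, Props}

case object WorkerReady // hypothetical readiness signal

class SenderActor(overlord: ActorRef) extends Actor {
  // announce readiness every time the BackoffSupervisor (re)starts us
  override def preStart(): Unit = overlord ! WorkerReady

  def receive: Receive = {
    case cmd: NewMail => // process the mail as before
  }
}

class Overlord(supervisorProps: Props) extends Actor {
  val sup: ActorRef = context.actorOf(supervisorProps)
  var pending: List[NewMail] = Nil

  def receive: Receive = {
    case Replay(cmd) =>
      pending ::= cmd // hold the job until the worker announces itself
    case WorkerReady =>
      pending.foreach(sup ! _) // safe to replay: the worker exists again
      pending = Nil
  }
}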
The failed child actor is available as the sender in your supervisor strategy. Quoting https://doc.akka.io/docs/akka/current/fault-tolerance.html#creating-a-supervisor-strategy:
If the strategy is declared inside the supervising actor (as opposed to within a companion object) its decider has access to all internal state of the actor in a thread-safe fashion, including obtaining a reference to the currently failed child (available as the sender of the failure message).
Sending emails is a dangerous operation that involves some third-party software in your case. Why not apply the Circuit Breaker pattern and skip the sender actor entirely? Also, you can still have an actor (with some backoff supervisor) and a circuit breaker inside it (if that makes sense for you).
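For instance, a minimal sketch of wrapping the HTTP call from the question in akka.pattern.CircuitBreaker (GetRequest and http.doGet are the names from the question's code; the thresholds are illustrative):

import akka.actor.{Actor, ActorLogging}
import akka.pattern.{CircuitBreaker, pipe}
import scala.concurrent.Future
import scala.concurrent.duration._

class HttpActor extends Actor with ActorLogging {
  import context.dispatcher // ExecutionContext for the breaker and the future

  val breaker = new CircuitBreaker(
    context.system.scheduler,
    maxFailures = 3,          // open the circuit after 3 consecutive failures
    callTimeout = 5.seconds,  // calls slower than this count as failures
    resetTimeout = 30.seconds // after this delay, allow one probe call through
  ).onOpen(log.warning("circuit breaker opened, resource looks down"))

  def receive: Receive = {
    case g: GetRequest =>
      // while open, this fails fast with CircuitBreakerOpenException
      // instead of hammering the unavailable resource
      breaker.withCircuitBreaker(Future(http.doGet(g))) pipeTo sender()
  }
}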
case class FeatureFilter(s3Client: AmazonS3) extends Actor with ActorLogging {

  override def preStart(): Unit = {
    self ! Initialize
  }

  override def receive: Receive = {
    case Initialize =>
      // long-running operation
      val tryfile = S3Connection(s3Client).downloadObject(...)
      tryfile match {
        case Success(file) =>
          context.become(active(file))
        case Failure(exception) =>
          self ! PoisonPill
      }
  }

  def active(file: File): Receive = {
    case Query(key) =>
      // do some processing and reply to sender
  }
}
I am using the below test for the above actor:
"an actor" should {
// mocked S3 client
val client = ...
"test for presence of keys" in {
val actor = system.actorOf(Props(FeatureFilter(client)))
for (i <- 1 to 100) {
actor ! Query("test_string")
expectMsg(SomeMessage)
}
}
}
The above test fails with
java.lang.AssertionError: assertion failed: timeout (3 seconds) during expectMsg while waiting ...
I think this is because when the message actor ! Query("test_string") is sent to the actor, its handler is still receive, so it doesn't respond, and hence the timeout.
But I even tried adding a handler for Query(key) in the receive method (just like in the active method), and I still get the same error.
Could someone please point out what the issue is here?
Also, when I move the S3 download task to preStart(), the issue remains the same. Isn't preStart() a blocking call? How can the code in the test proceed before preStart() has completed?
Akka's stash sounds like what you are looking for. For any message that the actor supports but cannot handle yet, call stash(), and unstash all of them once active is reached.
Look at the Stash documentation for details and example usage.
Your code could look like:
case msg => stash()
...
unstashAll()
context.become(active(file))
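Putting it together, a minimal sketch assuming the FeatureFilter from the question (AmazonS3, S3Connection, Initialize, and Query are the question's own types; the download arguments are elided as in the question):

import java.io.File
import akka.actor.{Actor, Stash}
import scala.util.{Failure, Success}

class FeatureFilter(s3Client: AmazonS3) extends Actor with Stash {

  override def preStart(): Unit = self ! Initialize

  def receive: Receive = {
    case Initialize =>
      S3Connection(s3Client).downloadObject(/* bucket/key as in the question */) match {
        case Success(file) =>
          unstashAll()              // release every Query that arrived too early
          context.become(active(file))
        case Failure(_) =>
          context.stop(self)
      }
    case _ => stash()               // any Query arriving before init is parked
  }

  def active(file: File): Receive = {
    case Query(key) => // do some processing and reply to sender
  }
}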
I'm trying to keep a count of each successful import. But here is the problem: the counter works if the router receives a message from its parent, but if I send a message from its children, it receives the message yet doesn't update the counter variable defined outside the handler's scope.
I know it sounds complicated. Let me show you the code.
Here is the router
class Watcher(size: Int) extends Actor {

  var router = {
    val routees = Vector.fill(size) {
      val w = context.actorOf(Props[Worker])
      context.watch(w)
      ActorRefRoutee(w)
    }
    Router(RoundRobinRoutingLogic(), routees)
  }

  var sent = 0

  // supervisorStrategy is a parameterless member in Akka, so override it without parentheses
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy(maxNrOfRetries = 100) {
    case _: DocumentNotFoundException => Resume
    case _: Exception => Escalate
  }

  override def receive: Receive = {
    case container: MessageContainer =>
      router.route(container, sender())
    case Success =>
      sent += 1
    case GetValue =>
      sender() ! sent
    case Terminated(a) =>
      router = router.removeRoutee(a) // removeRoutee returns a new Router; reassign it
      val w = context.actorOf(Props[Worker])
      context.watch(w)
      router = router.addRoutee(w)
    case undef =>
      println(s"${this.getClass} received undefinable message: $undef")
  }
}
Here is the worker
class Worker() extends Actor with ActorLogging {

  var messages = Seq[MessageContainer]()
  var received = 0

  override def receive: Receive = {
    case container: MessageContainer =>
      try {
        importMessage(container.message, container.repo)
        context.parent ! Success
      } catch {
        case e: Exception =>
          throw e
      }
    case e: Error =>
      log.info(s"Error occurred $e")
      sender() ! e
    case undef => println(s"${this.getClass} received undefinable message: $undef")
  }
}
So on supervisor ? GetValue I get 0, but I expect to get 1000. The strangest thing is that when I debug it with a breakpoint right on case Success => ..., the value is incremented every time a new message arrives. But supervisor ? GetValue still returns 0.
Let's assume I count on case container: MessageContainer => ... instead, and it will magically work; I'll get the desired number, but it won't show whether I actually imported anything. What's going on?
Here is the test case.
@Test
def testRouter(): Unit = {
  val system = ActorSystem("RouterTestSystem")
  // val serv = AddressFromURIString("akka.tcp://master#host:1334")
  val supervisor = system.actorOf(Props(new Watcher(20))) //.withDeploy(akka.actor.Deploy(scope = RemoteScope(serv))))
  val repo = coreSession.getRepositoryName
  val containers = (0 until num)
    .map(_ => MessageContainer(MessageFactory.generate("/"), repo))

  val watch = Stopwatch.createStarted()
  (0 until num).par
    .foreach(i => supervisor ! containers.apply(i))

  implicit val timeout = Timeout(60 seconds)
  val future = supervisor ? GetValue
  val result = Await.result(future, timeout.duration).asInstanceOf[Int]
  val speed = result / (watch.elapsed(TimeUnit.MILLISECONDS) / 1000.0)
  println(f"Import speed: $speed%.2f")
  assertEquals(num, result)
}
Can you please explain this in detail? Why is it happening? Why only for messages received from the children? Is there another approach?
Well... there can be many potential problems hidden in the parts of code that you have not shared. But, for the sake of this discussion I will assume that everything else is fine and we will just discuss problems with your shared code.
Now, let me explain a bit about Actors. To put things simply, every actor has a mailbox (where it keeps messages in the sequence they were received) and processes them one by one in the order they were received. Since the mailbox is used like a Queue we will refer to it as a Queue in this discussion.
Also... I don't know what containers.apply(i) is going to return, so I will refer to the return value of containers.apply(1) as MessageContainer__1.
In your test runner you are first creating an instance of Watcher,
val supervisor = system.actorOf(Props(new Watcher(20)))
Now, let's say that you are sending these 2 messages (num = 2) to supervisor.
So supervisor's mailbox will look something like,
Queue(MessageContainer__0, MessageContainer__1)
Then you send it another message GetValue so the mailbox will look like,
Queue(MessageContainer__0, MessageContainer__1, GetValue)
Now the actor will process the first message and pass it to the workers, the mail-box will look like,
Queue(MessageContainer__1, GetValue)
Now even if your worker is ultra-fast and instantaneous in sending the reply the mailbox will look like,
Queue(MessageContainer__1, GetValue, Success)
And now, since your worker is super-ultra-fast and instantaneously replies with a Success, the state after passing the second MessageContainer will look like,
Queue(GetValue, Success, Success)
And... here is the root of your problem: the Supervisor sees the GetValue message before any Success messages, no matter how fast your workers are.
And thus it will process GetValue and reply with current value of sent which is 0.
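A sketch of one way around this in the test, assuming it can use akka-testkit's TestKit (whose awaitAssert polls an assertion until it passes or a deadline expires), is to keep asking GetValue until the counter catches up, instead of racing a single GetValue against the workers' replies:

import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

implicit val timeout: Timeout = Timeout(5.seconds)

// Poll until every Success has been processed, rather than asking once.
awaitAssert({
  val result = Await.result(supervisor ? GetValue, timeout.duration).asInstanceOf[Int]
  assertEquals(num, result)
}, max = 60.seconds, interval = 500.millis)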
I have been using Akka Supervisor Strategy to handle business logic exceptions.
Reading one of the most famous Scala blog series, Neophyte, I found it giving a different purpose to something I have always been doing.
Example:
Let's say I have an HttpActor that should contact an external resource and in case it's down, I will throw an Exception, for now a ResourceUnavailableException.
In case my Supervisor catches that, I will call a Restart on my HttpActor, and in my HttpActor's preRestart method I will do a scheduleOnce to retry the message.
The actor:
class HttpActor extends Actor with ActorLogging {

  implicit val system = context.system
  import context.dispatcher // ExecutionContext needed by scheduleOnce

  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    log.info(s"Restarting Actor due: ${reason.getCause}")
    message foreach { msg =>
      context.system.scheduler.scheduleOnce(10.seconds, self, msg)
    }
  }

  def receive = LoggingReceive {
    case g: GetRequest =>
      doRequest(http.doGet(g), g.httpManager.url, sender())
  }
}
A Supervisor:
class HttpSupervisor extends Actor with ActorLogging with RouterHelper {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5) {
      case _: ResourceUnavailableException => Restart
      case _: Exception => Escalate
    }

  var router = makeRouter[HttpActor](5)

  def receive = LoggingReceive {
    case g: GetRequest =>
      router.route(g, sender())
    case Terminated(a) =>
      router = router.removeRoutee(a)
      val r = context.actorOf(Props[HttpActor])
      context watch r
      router = router.addRoutee(r)
  }
}
What's the point here?
In case my doRequest method throws a ResourceUnavailableException, the supervisor will catch it and restart the actor, forcing it to resend the message after some time, according to the scheduler. The advantage I see is that I get the retry count for free and a nice way to handle the exception itself.
Now looking at the blog, he shows a different approach in case you need a retry stuff, just sending messages like this:
def receive = {
  case EspressoRequest =>
    val receipt = register ? Transaction(Espresso)
    receipt.map((EspressoCup(Filled), _)).recover {
      case _: AskTimeoutException => ComebackLater
    } pipeTo sender
  case ClosingTime => context.system.shutdown()
}
Here, in case of an AskTimeoutException on the Future, he pipes the result back as a ComebackLater object, which he handles by doing this:
case ComebackLater =>
  log.info("grumble, grumble")
  context.system.scheduler.scheduleOnce(300.millis) {
    coffeeSource ! EspressoRequest
  }
For me this is pretty much what you can do with the supervisor strategy, but done manually, with no built-in retry-count logic.
So what is the best approach here and why? Is my concept of using akka supervisor strategy completely wrong?
You can use BackoffSupervisor:
Provided as a built-in pattern the akka.pattern.BackoffSupervisor implements the so-called exponential backoff supervision strategy, starting a child actor again when it fails, each time with a growing time delay between restarts.
val supervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    childProps,
    childName = "myEcho",
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2 // adds 20% "noise" to vary the intervals slightly
  ).withAutoReset(10.seconds) // the backoff is reset if the child does not fail within 10 seconds
   .withSupervisorStrategy(
     OneForOneStrategy() {
       case _: MyException => SupervisorStrategy.Restart
       case _ => SupervisorStrategy.Escalate
     }))
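Note that while the child is running, the BackoffSupervisor simply forwards every message it receives to it, so you interact with the pair through the supervisor's ref; a quick usage sketch (assuming an ActorSystem named system):

// messages sent to the BackoffSupervisor ref are forwarded to the "myEcho" child
val echoSupervisor = system.actorOf(supervisor, name = "echoSupervisor")
echoSupervisor ! "hello"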
I've been testing the fault-tolerance features of Akka, and so far they have worked well when it comes to retrying a message according to the maxNrOfRetries specified.
However, it does not restart the actor within the given time range; it restarts it all at once, ignoring withinTimeRange.
I tried with AllForOneStrategy and OneForOneStrategy but that does not change anything.
Trying to follow this blog post: http://letitcrash.com/post/23532935686/watch-the-routees, this is the code I've been working on.
class Supervisor extends Actor with ActorLogging {

  var replyTo: ActorRef = _

  val child = context.actorOf(
    Props(new Child).withRouter(
      RoundRobinPool(
        nrOfInstances = 5,
        supervisorStrategy =
          AllForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 10.second) {
            case _: NullPointerException => Restart
            case _: Exception => Escalate
          })), name = "child-router")

  child ! GetRoutees

  def receive = {
    case RouterRoutees(routees) =>
      routees foreach context.watch
    case "start" =>
      replyTo = sender()
      child ! "error"
    case Terminated(actor) =>
      replyTo ! -1
      context.stop(self)
  }
}
class Child extends Actor with ActorLogging {

  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    log.info("***** RESTARTING *****")
    message foreach { msg => self forward msg }
  }

  def receive = LoggingReceive {
    case "error" =>
      log.info("***** GOT ERROR *****")
      throw new NullPointerException
  }
}

object Boot extends App {
  val system = ActorSystem()
  val supervisor = system.actorOf(Props[Supervisor], "supervisor")
  supervisor ! "start"
}
Am I doing anything wrong to accomplish that?
EDIT
Actually, I misunderstood the purpose of the withinTimeRange.
To schedule my retries in a time range, I'm doing the following:
override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
  log.info("***** RESTARTING *****")
  message foreach { msg =>
    context.system.scheduler.scheduleOnce(30.seconds, self, msg)
  }
}
It seems to work ok.
I think you have misunderstood the purpose of the withinTimeRange arg. That value is supposed to be used in conjunction with maxNrOfRetries to provide a window in which to support the limiting of the number of retries. For example, as you have specified, the implication is that the supervisor will no longer restart an individual child if that child needs to be restarted more than 3 times in 10 seconds.
From the docs:
maxNrOfRetries - the number of times a child actor is allowed to be restarted; a negative value means no limit; if the limit is exceeded the child actor is stopped
withinTimeRange - duration of the time window for maxNrOfRetries; Duration.Inf means no window
Your code means that when any child fails with a NullPointerException more than 3 times within 10 seconds, it will not be restarted again. Because of AllForOneStrategy, after the first routee fails, all routees are restarted. And because you've overridden preRestart to resend the failed message, this situation repeats until it reaches 3 failures within 10 seconds (which is achieved in less than a second).
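In other words, the two parameters together define a rate limit on restarts rather than a schedule; a sketch of the semantics in comments:

// At most 3 restarts within any 10-second window; exceeding that stops the
// child for good. The window does not delay or space out the restarts.
// With withinTimeRange = Duration.Inf there is no window, so the child is
// allowed 3 restarts in total over its whole lifetime.
AllForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 10.seconds) {
  case _: NullPointerException => Restart
  case _: Exception            => Escalate
}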
I have a very simple example where I have an actor (SimpleActor) that performs a periodic task by sending a message to itself. The message is scheduled in the constructor of the actor. In the normal case (i.e., without faults) everything works fine.
But what if the actor has to deal with faults? I have another actor (SimpleActorWithFault). This actor could have faults; in this case, I'm generating one myself by throwing an exception. When a fault happens (i.e., SimpleActorWithFault throws an exception) it is automatically restarted. However, this restart messes up the scheduler inside the actor, which no longer functions as expected. And if the faults happen rapidly enough, it generates even more unexpected behavior.
My question is: what's the preferred way of dealing with faults in such cases? I know I can use try blocks to handle exceptions. But what if I'm extending another actor where I cannot put a try in the superclass, or a case where an unexpected fault happens in the actor?
import akka.actor.{Props, ActorLogging}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import akka.actor.Actor

case object MessageA
case object MessageToSelf

class SimpleActor extends Actor with ActorLogging {

  //schedule a message to self every second
  context.system.scheduler.schedule(0 seconds, 1 seconds, self, MessageToSelf)

  //keeps track of some internal state
  var count: Int = 0

  def receive: Receive = {
    case MessageA =>
      log.info("[SimpleActor] Got MessageA at %d".format(count))
    case MessageToSelf =>
      //update state and tell the world about its current state
      count = count + 1
      log.info("[SimpleActor] Got scheduled message at %d".format(count))
  }
}

class SimpleActorWithFault extends Actor with ActorLogging {

  //schedule a message to self every second
  context.system.scheduler.schedule(0 seconds, 1 seconds, self, MessageToSelf)

  var count: Int = 0

  def receive: Receive = {
    case MessageA =>
      log.info("[SimpleActorWithFault] Got MessageA at %d".format(count))
    case MessageToSelf =>
      count = count + 1
      log.info("[SimpleActorWithFault] Got scheduled message at %d".format(count))
      //at some point generate a fault
      if (count > 5) {
        log.info("[SimpleActorWithFault] Going to throw an exception now %d".format(count))
        throw new Exception("Excepttttttiooooooon")
      }
  }
}

object MainApp extends App {
  implicit val akkaSystem = akka.actor.ActorSystem()

  //Run the Actor without any faults or exceptions
  akkaSystem.actorOf(Props(classOf[SimpleActor]))

  //comment the above line and uncomment the following to run the actor with faults
  //akkaSystem.actorOf(Props(classOf[SimpleActorWithFault]))
}
The correct way is to push the risky behavior down into its own actor. This pattern is called the Error Kernel pattern (see Akka Concurrency, Section 8.5):
This pattern describes a very common-sense approach to supervision that differentiates actors from one another based on any volatile state that they may hold.
In a nutshell, it means that actors whose state is precious should not be allowed to fail or restart. Any actor that holds precious data is protected such that any risky operations are relegated to a slave actor who, if restarted, only causes good things to happen.
The error kernel pattern implies pushing levels of risk further down the tree.
See also another tutorial here.
So in your case it would be something like this:
SimpleActor
|- ActorWithFault
Here SimpleActor acts as a supervisor for ActorWithFault. The default supervising strategy of any actor is to restart a child on Exception and escalate on anything else:
http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html
Escalating means that the actor itself may get restarted. Since you really don't want to restart SimpleActor you could make it always restart the ActorWithFault and never escalate by overriding the supervisor strategy:
class SimpleActor extends Actor {

  override def preStart(): Unit = {
    // our faulty actor --- we will supervise it from now on
    context.actorOf(Props[ActorWithFault], "FaultyActor")
  }

  ...

  override val supervisorStrategy = OneForOneStrategy() {
    case _: ActorKilledException => Escalate
    case _: ActorInitializationException => Escalate
    case _ => Restart // keep restarting the faulty actor
  }
}
To avoid messing up the scheduler:
class SimpleActor extends Actor with ActorLogging {

  private var cancellable: Option[Cancellable] = None

  override def preStart(): Unit = {
    //schedule a message to self every second
    cancellable = Option(context.system.scheduler.schedule(0 seconds, 1 seconds, self, MessageToSelf))
  }

  override def postStop(): Unit = {
    cancellable.foreach(_.cancel())
    cancellable = None
  }
...
To correctly handle exceptions (akka.actor.Status.Failure is the correct reply to an ask, in case the sender used the ask pattern):
...
def receive: Receive = {
  case MessageA =>
    try {
      log.info("[SimpleActor] Got MessageA at %d".format(count))
    } catch {
      case e: Exception =>
        sender() ! akka.actor.Status.Failure(e)
        log.error(e.getMessage, e)
    }
  ...
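On the asking side, such a Status.Failure reply completes the ask's Future with the wrapped exception; a small usage sketch (simpleActor is an assumed ref to the actor above):

import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

implicit val timeout: Timeout = Timeout(5.seconds)

// a Status.Failure(e) reply fails this Future with e
(simpleActor ? MessageA).onComplete {
  case Success(reply) => println(s"got $reply")
  case Failure(e)     => println(s"actor reported failure: ${e.getMessage}")
}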