Akka - what after an actor has been stopped? - scala

I am not sure what approach to follow for Akka supervision.
I have an Akka actor that lists files from an FTP server when a message triggers it. If the connection is broken, the actor fails with an exception (say, an IOException), which triggers supervision. At this point, I see two alternatives:
I keep resuming / restarting the actor until the server comes back up, maybe with an exponential backoff
I set parameters (such as maxNrOfRetries = xy) in a way that the supervisor will give up and stop the actor after xy times
The first strategy seems wasteful, but the second one raises another issue: how would the actor eventually be restarted? I have the feeling that tuning the parameters of the backoff supervisor is the best way to go, but maybe I'm missing something?

If you need to restart the actor eventually, without knowing when the connection will be up again, exponential backoff (with an upper limit of, say, 60 seconds) seems reasonable.
This way you get a fast reconnect if the connection is only lost for a few seconds, and with the backoff you're not wasting resources. The upper limit on the backoff bounds the maximum time your actor stays offline even though the connection might already be back up.
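For example, here is a minimal sketch using the classic BackoffSupervisor (the FtpListingActor name is hypothetical, and the Backoff/BackoffOpts API varies slightly across Akka versions):

import scala.concurrent.duration._
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.{Backoff, BackoffSupervisor}

class FtpListingActor extends Actor { // hypothetical: throws IOException on a broken connection
  def receive = { case _ => /* list files on the FTP server */ }
}

val system = ActorSystem("ftp")

// Restart the child on failure after 3s, 6s, 12s, ... capped at 60s.
val supervisorProps = BackoffSupervisor.props(
  Backoff.onFailure(
    childProps   = Props[FtpListingActor],
    childName    = "ftpLister",
    minBackoff   = 3.seconds,
    maxBackoff   = 60.seconds,
    randomFactor = 0.2 // jitter, so many actors don't all restart in lockstep
  ))

system.actorOf(supervisorProps, "ftpListerSupervisor")

The randomFactor adds noise to each delay, which matters when many connections fail at the same moment.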


Akka Actor Messaging Delay

I'm experiencing issues scaling my app with multiple requests.
Each request sends an ask to an actor, which then spawns other actors. This is fine; however, under load (5+ asks at once), the ask takes a massive amount of time to deliver the message to the target actor. The original design was to bulkhead requests evenly, but this is causing a bottleneck. Example:
In this picture, the ask is sent right after the query plan resolver; however, there is a multi-second gap before the actor receives the message. This only happens under load (5+ requests/sec). I first thought this was a starvation issue.
Design:
Each planner-executor is a separate instance for each request. It spawns a new 'Request Acceptor' actor each time (it logs 'requesting score' when it receives a message).
I gave the ActorSystem a custom global executor (a big one). I noticed the threads were not utilized beyond the core thread-pool size, even during this massive delay.
I made sure all ExecutionContexts in the child actors used the correct ExecutionContext.
I made sure all blocking calls inside actors ran in a Future.
I gave the parent actor (and all children) a custom dispatcher with core size 50 and max size 100. It did not request more threads (it stayed at 50), even during these delays.
Finally, I tried creating a totally new ActorSystem for each request (inside the planner-executor). This also had no noticeable effect!
I'm a bit stumped by this. From these tests it does not look like a thread-starvation issue. Back at square one, I have no idea why the message takes longer and longer to deliver the more concurrent requests I make. The Zipkin trace before reaching this point does not degrade with more requests until it reaches the ask here. Before then, the server is able to handle multiple steps, e.g. verify the request, talk to the DB, and then finally go inside the planner-executor. So I doubt the application itself is running out of CPU time.
We had a very similar issue with Akka: we observed huge delays in the ask pattern delivering messages to the target actor under peak load.
Most of these issues are related to heap memory consumption, not to the usage of dispatchers.
We finally fixed them by tuning the configuration below and making a few changes.
1) Make sure you stop entities/actors that are no longer required. If it's a persistent actor, you can always bring it back when you need it.
Refer : https://doc.akka.io/docs/akka/current/cluster-sharding.html#passivation
2) If you are using cluster sharding then check the akka.cluster.sharding.state-store-mode. By changing this to persistence we gained 50% more TPS.
3) Minimize your log entries (set it to info level).
4) Tune your logging to publish messages to your logging system frequently; update the batch size, batch count, and interval accordingly so that memory is freed. In our case, a huge amount of heap was used to buffer log messages and send them in bulk; if the interval is too long, you may fill the heap, which hurts performance (more GC activity required).
5) Run blocking operations on a separate dispatcher (see the sketch after this list).
6) Use custom serializers (protobuf) and avoid JavaSerializer.
7) Add the JAVA_OPTS below when launching your jar:
export JAVA_OPTS="$JAVA_OPTS -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -Djava.security.egd=file:/dev/./urandom"
The main one is -XX:MaxRAMFraction=2, which lets the heap use half of the available memory. The default is 4, which means your application will use only one fourth of the available memory, and that might not be sufficient.
Refer : https://blog.csanchez.org/2017/05/31/running-a-jvm-in-a-container-without-getting-killed/
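For point 5, a minimal sketch of a dedicated dispatcher (the dispatcher name and the BlockingWorker actor are hypothetical):

// In application.conf:
//   blocking-io-dispatcher {
//     type = Dispatcher
//     executor = "thread-pool-executor"
//     thread-pool-executor { fixed-pool-size = 16 }
//   }
import akka.actor.Props

// Pin the blocking actor to the dedicated pool so slow calls cannot
// starve the default dispatcher that delivers everyone else's messages.
val props = Props[BlockingWorker].withDispatcher("blocking-io-dispatcher")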

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be two kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need it anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (for heartbeats, we add one more message from the master to start the exchange). But let's assume that it is expensive to do all the heartbeat housekeeping on the master side for N workers, and that we don't want to waste resources keeping several TCP connections to the worker servers, so we have just one.
Is there any solution under these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
Same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what their accountabilibuddy is supposed to be doing) only works so far; if a and b crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the raft consensus algorithm. It's not too hard to understand, but still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you would extract from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft does its heartbeating from the master's end; it is not very expensive network-wise, and you don't need to keep connections open. Every 200 ms to a second you send a heartbeat; if a member doesn't reply, the master tells the other nodes to remove member x from the list.
What if the master dies? Basically, you need preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
If you want to join an existing cluster, it is basically the same as above: if the node you contact is not the leader, it responds "not leader" along with the leader's address.
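A minimal sketch of the follower-side heartbeat/election logic described above (the message types and the Follower actor are hypothetical):

import scala.concurrent.duration._
import scala.util.Random
import akka.actor.{Actor, ActorRef, Cancellable}

case object Heartbeat
case object RequestVote
case object HeartbeatTimeout

// Reset an election timer on every heartbeat from the master; if it
// fires, stand as a candidate and request votes from the cluster.
class Follower(peers: Seq[ActorRef]) extends Actor {
  import context.dispatcher
  private var timer: Option[Cancellable] = None
  resetTimer()

  private def resetTimer(): Unit = {
    timer.foreach(_.cancel())
    // Randomized timeout so several candidates rarely start elections at once.
    val t = (1000 + Random.nextInt(500)).millis
    timer = Some(context.system.scheduler.scheduleOnce(t, self, HeartbeatTimeout))
  }

  def receive = {
    case Heartbeat        => resetTimer()                   // the leader is alive
    case HeartbeatTimeout => peers.foreach(_ ! RequestVote) // become a candidate
  }
}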

Akka actor Kill/restart behavior

I am confused by behavior I am seeing in Akka. Briefly, I have a set of actors performing scientific calculations (star formation simulation). They have some state. When an error occurs such that one or more enter an invalid state, I want to restart the whole set to start over. I also want to do this if a single calc (over the entire set) takes too long (there is no way to predict in advance how long it may run).
So, there is the set of Simulation actors at the bottom of the tree, then a Director above them (that creates them via a Router, and sends them messages via that Router as well). There is one more Director level above that to create Directors on different machines and collect results from them all.
I handle the timeout case by using the Akka Scheduler to create a one-time timeout event, in the local Director, when the simulation is started. When the Director gets this event, if all its Simulation actors have not finished, it does this:
children ! Broadcast(Kill)
where children is the Router that owns/created them - this sends a Kill to all the children (SimulActors).
What I thought would occur is that all the child actors would be restarted. However, their preRestart() hook method is never called. I see the Kill message received, but that's it.
I must be missing something fundamental here. I have read the Akka docs on this topic and I have to say I find them less than clear (especially the page on Supervisors). I would really appreciate either a thorough explanation of the Kill/restart process, or just some other references (Google wasn't very helpful).
Note: "If the child of a router terminates, the router will not automatically spawn a new child. In the event that all children of a router have terminated the router will terminate itself." - taken from the Akka docs.
I would consider using a supervision strategy - Akka has built-in behaviour for restarting all children together (the all-for-one strategy), and you can define the specific response, e.g. restart.
I think a more idiomatic way to run this would be to have the actors throw an exception if they're not done after a period of time, and have the supervisor handle that via the supervision strategy.
You could throw a NotDoneException from the child and then define the behaviour like so:
import akka.actor.AllForOneStrategy
import akka.actor.SupervisorStrategy.{Restart, Stop}

override val supervisorStrategy =
  AllForOneStrategy(maxNrOfRetries = 0) {
    case _: NotDoneException => Stop    // give up on runs that exceeded the deadline
    case _: Exception        => Restart // restart all children together
  }
It's important to understand that a restart means stopping the old actor and creating a new, separate object/Actor.
References:
http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html
http://doc.akka.io/docs/akka/snapshot/general/supervision.html

How to handle bursts with Actors?

Suppose I have an actor which handles X requests per second. That is fine on average, but sometimes there are bursts and clients send Y > X requests per second. Suppose also that all requests have timeouts, so they cannot wait in the queue forever.
Assuming we program in Scala and Akka what are the best practices/design patterns to make the actor handle those bursts? Are there any code examples, which handle bursts?
As long as your machine can handle the increased load (i.e. has enough CPUs), then I would suggest pooling the Actor using a Router. It sounds like from your example, a dynamically resizing router might be the best fit, but even a standard Round Robin or Smallest Mailbox might be enough. Below is the link for the routers section from the Akka documentation. I hope this helps.
http://doc.akka.io/docs/akka/2.1.2/scala/routing.html
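For illustration, here is a minimal sketch of a dynamically resizing pool (this uses the router API of newer Akka versions than the 2.1.2 docs linked above; Worker is a hypothetical actor class):

import akka.actor.{ActorSystem, Props}
import akka.routing.{DefaultResizer, RoundRobinPool}

val system = ActorSystem("bursts")

// The pool grows from 2 up to 15 routees as mailbox pressure increases.
val resizer = DefaultResizer(lowerBound = 2, upperBound = 15)
val workerPool = system.actorOf(
  RoundRobinPool(nrOfInstances = 2, resizer = Some(resizer)).props(Props[Worker]),
  "workerPool")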
You could also consider distributing the actor across multiple nodes, but that might be overkill for your scenario. If you have interest in that approach, let me know and I can post more context on doing that.
As for what to do when you have pooled the actors but the system is still getting backlogged, that's really up to you, but here are a few options. If you can handle the occasional increase in latency due to bursting, then do nothing: the actors' mailboxes will just get a little backed up, and they will clear as soon as the burst eases off. If not, the question is how to handle incoming messages while the actors are backlogged. If you want to fail fast in that situation and not accept the message, you might want to look into using a bounded mailbox (http://doc.akka.io/docs/akka/2.1.2/scala/dispatchers.html). When the mailbox reaches its size limit and can no longer queue messages, the caller gets a failure sending the message (I think). Not awesome, but at least it will lead to the system stabilizing faster.
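A minimal sketch of attaching a bounded mailbox (the config section name is arbitrary; Worker is again hypothetical):

// In application.conf:
//   bounded-mailbox {
//     mailbox-type = "akka.dispatch.BoundedMailbox"
//     mailbox-capacity = 1000
//     mailbox-push-timeout-time = 100ms
//   }
import akka.actor.Props

// Once 1000 messages are queued, a send blocks for up to 100ms and is
// then routed to dead letters instead of piling up without bound.
val worker = system.actorOf(Props[Worker].withMailbox("bounded-mailbox"))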
I assume you are doing ask (?) (i.e. request/response); when you do that, you get a Future. That Future will time out (with an implicitly defined timeout value) if it does not receive a response in time, so during a burst the Futures attached to calls into the backlogged actors will simply start timing out; they will not be stuck there forever.
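For example (Job and JobResult are hypothetical message types, sent to the workerPool from the sketch above):

import scala.concurrent.duration._
import akka.pattern.ask
import akka.util.Timeout

case class Job(payload: String)
case class JobResult(score: Double)

implicit val timeout: Timeout = Timeout(3.seconds)

// The Future fails with an AskTimeoutException if no reply arrives in
// time, so backlogged calls time out during a burst rather than hang.
val result = (workerPool ? Job("data")).mapTo[JobResult]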

Akka: Adding a delay to a durable mailbox

I am wondering if there is some way to delay an Akka message from processing.
My use case: for every request, I have a small amount of work that I need to do immediately, and then I need to do additional work two hours later.
Is there any easy way to delay the processing of a message in Akka? I know I could probably set up an external distributed queue such as ActiveMQ or RabbitMQ, which probably has this feature, but I'd rather not.
I know I would need to make the mailbox durable so it can survive restarts or crashes. We already have Mongo set up, so I'd probably use the MongoBasedMailbox for durability.
Temporal Workflow can support your use case with minimal effort. You can think of it as a durable-actor platform, where actor state, including threads and local variables, is preserved across process restarts.
Temporal offers a lot of other features for task processing:
Built-in exponential retries with an unlimited expiration interval.
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long-running heartbeating operations.
The ability to implement complex task dependencies, for example chaining of calls, or compensation logic in case of unrecoverable failures (SAGA).
Complete visibility into the current state of the update. For example, when using queues, all you know is whether there are messages in a queue, and you need an additional DB to track overall progress. With Temporal, every event is recorded.
The ability to cancel an update in flight.
Throttling of requests.
See the presentation that goes over the Temporal programming model. It talks about Cadence which is the predecessor of Temporal.
It's not ideal, but the Akka Camel Quartz scheduler would do the trick. It is more heavyweight than the built-in ActorSystem scheduler, and be aware that Quartz has its own issues.
You could still use the normal Akka scheduler; you will just have to keep state via actor persistence to avoid losing the job if the server restarts.
I have recently used PersistentFsmActor, which keeps the state of the actor persisted.
I'm not sure that in your case you have to use an FSM (finite state machine), so you could basically just use a PersistentActor to save the time the job was inserted and start a scheduler for that time. This way, even if you restarted the server, the actor will start up, use the persisted data to calculate the time left, and schedule the job again.
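A minimal sketch of that approach with a classic PersistentActor (the event and actor names are hypothetical): persist when the work is due, and re-schedule during recovery so a restart re-creates the pending job.

import scala.concurrent.duration._
import akka.persistence.PersistentActor

case class TaskDue(at: Long) // persisted event: epoch millis when the follow-up is due
case object RunTask

class DelayedWorker(id: String) extends PersistentActor {
  import context.dispatcher
  override def persistenceId = s"delayed-worker-$id"

  private def schedule(at: Long): Unit = {
    // Schedule whatever delay is left; zero if the due time has already passed.
    val delay = math.max(0L, at - System.currentTimeMillis()).millis
    context.system.scheduler.scheduleOnce(delay, self, RunTask)
  }

  override def receiveRecover = {
    case TaskDue(at) => schedule(at) // after a restart, pick the pending job back up
  }

  override def receiveCommand = {
    case "start" =>
      // do the small amount of immediate work here, then persist the due time
      persist(TaskDue(System.currentTimeMillis() + 2.hours.toMillis))(evt => schedule(evt.at))
    case RunTask =>
      // do the additional work two hours later here
  }
}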