Scala/Akka actor migration between machines

Can running actors be migrated to a different node during their life cycle? According to the Akka roadmap here, automatic actor migration upon failure should become available with the "Rollins" release. I was wondering whether this actor migration can somehow be done manually, via some special message or anything similar? Furthermore, is there anything like this in plain Scala?

Depending on the use case, akka-cluster with Cluster Sharding may be a fit.
The module is capable of reallocating shards to other cluster nodes.
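For illustration, a minimal classic Cluster Sharding sketch might look like this (the Envelope/Worker names are made up; the akka-cluster-sharding dependency and cluster configuration with seed nodes are omitted):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

// Hypothetical message envelope carrying an entity id for routing.
final case class Envelope(entityId: String, payload: String)

class Worker extends Actor {
  def receive = {
    case payload: String =>
      // Entity state lives here; if its shard is rebalanced to another node,
      // the entity is recreated there (state survives only if you add
      // persistence, e.g. Akka Persistence).
      println(s"${self.path.name} got $payload")
  }
}

object ShardingApp extends App {
  val system = ActorSystem("cluster")

  val extractEntityId: ShardRegion.ExtractEntityId = {
    case Envelope(id, payload) => (id, payload)
  }
  val extractShardId: ShardRegion.ExtractShardId = {
    case Envelope(id, _) => math.abs(id.hashCode % 100).toString
  }

  // The region ActorRef routes to entities wherever their shard currently
  // lives; shards are reallocated when nodes join, leave, or crash.
  val region = ClusterSharding(system).start(
    typeName = "Worker",
    entityProps = Props[Worker](),
    settings = ClusterShardingSettings(system),
    extractEntityId = extractEntityId,
    extractShardId = extractShardId)

  region ! Envelope("worker-1", "hello")
}
```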

Related

General guidelines to self-organized task allocation within a Microservice

I have some general problems/questions regarding self-managed microservices (in Kubernetes).
The Situation:
I have a provider (Discord API) for my desired state, which tells me the count (or multiples of the count) of sharded connections (websocket -> stateful in some way) I should establish with the provider.
Currently I have a "monolithic" microservice (it can't be deployed as an autoscaling service and has to be stateful), which determines the count of connections I should have, and a factor based on the currently active pods that can establish a connection to this API.
It further manages the state of every pod (by heartbeating and updating the connection target of all those pods) and drives the swarm toward this target configuration.
It also handles a pod being removed from the service, and changes of the target configuration, by rolling out the updated target and discontinuing the old connections only after the target has been updated.
The Cons:
This does not in any way resemble a good microservice architecture
A failure of the manager (even when persisting the current state in a cache or DB of some sort) means the target from the target provider is no longer achieved, and a pod may fail without the manager handling it gracefully.
The Pros:
Its "easy" to understand and maintain a centrally managed system
There is no case (assuming a running manager system) where a pod can fail and it wont be handled -> connection resumed on another pod
My Plan:
I would like these websocket connection pods to manage themselves in some way.
Theoretically there has to be a way in which a "swarm" (swarm here is just a descriptive word for pods within a service) can determine a swarm wide accepted target.
The tasks to achieve this target (or change of target) should then be allocated across the swarm by the swarm itself.
Every failure of a member of the swarm has to be recognized, and the now unhandled tasks (in my case websocket connections) have to be resumed on different members of the swarm.
Also, updates of the target have to be rolled out across the swarm in a controlled manner, retaining the tasks for the old target until all tasks for the new target are handled.
My ideas so far:
As a general syncing point, a cache like Redis or a DB like MongoDB could be used.
Here the current target (and the old target, for creating the smooth target changes mentioned earlier) could be stored, along with all tasks that have to be handled to achieve this desired target.
This should be relatively easy to set up, and a "voting process" for the current target would also be possible, if even necessary (every swarm member checks the current target of the target provider, and the target determined by most of the swarm members is set as the vote outcome).
But now we face the problem already mentioned in the pros of the managed system: I currently can't think of a way the failure of a swarm member can be recognized and handled by the swarm consistently.
How should a failure be determined without a constant exchange between swarm members, which I think should be avoided because:
swarms should operate entirely target-driven and interact with each other as little as possible
Kubernetes itself isn't really designed for easy intra-service communication
Every contribution, idea or further question here helps.
My tech stack would be, but isn't limited to:
Java with Micronaut for the application
gRPC as the only exchange protocol
Kubernetes as the orchestrator
Since you're on the JVM, you could use Akka Cluster to take care of failure detection between the pods (even in Kubernetes, though some care is needed with service meshes to exempt the pod-to-pod communication from being routed through the mesh). You could then use, as one of many possibilities, Distributed Data's implementations of CRDTs to distribute state (in this case the target) among the pods.
This wouldn't require you to use Akka HTTP or Akka's gRPC implementations, so you could still use Micronaut for external interactions. It would effectively create a stateful self-organizing service which presents to Kubernetes as a regular stateless service.
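As a rough sketch of the Distributed Data idea (classic API; the key name, actor, and message protocol here are made up, and cluster configuration is omitted), each pod could subscribe to a replicated LWWRegister holding the target:

```scala
import akka.actor.Actor
import akka.cluster.ddata.{DistributedData, LWWRegister, LWWRegisterKey, SelfUniqueAddress}
import akka.cluster.ddata.Replicator._

class TargetSharer extends Actor {
  private val replicator = DistributedData(context.system).replicator
  private implicit val node: SelfUniqueAddress =
    DistributedData(context.system).selfUniqueAddress

  // Replicated register holding the swarm-wide connection target (made-up name).
  private val TargetKey = LWWRegisterKey[Int]("connection-target")

  // Every pod subscribes, so a change made anywhere reaches all pods eventually.
  replicator ! Subscribe(TargetKey, self)

  def receive = {
    case newTarget: Int =>
      // Local write; the replicator gossips the merged value to the other pods.
      replicator ! Update(TargetKey, LWWRegister.create(0), WriteLocal)(_.withValueOf(newTarget))

    case c @ Changed(TargetKey) =>
      val target = c.get(TargetKey).value
      // ... reconcile this pod's websocket connections toward `target` ...
  }
}
```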
If for some reason Akka isn't appealing, looking through the code and docs for its failure detection (phi-accrual) might provide some ideas for implementing a failure detector using (e.g.) periodic updates to a DB.
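A naive timeout-based version of that idea (much cruder than phi-accrual; the Heartbeat record and the query that produces it are hypothetical) could look like:

```scala
import java.time.{Duration, Instant}

// Hypothetical record of a member's last heartbeat as written to the DB.
final case class Heartbeat(memberId: String, lastSeen: Instant)

// Each member periodically updates its own row; every member runs this
// check on a timer against the full table.
def detectFailures(heartbeats: Seq[Heartbeat], now: Instant, timeout: Duration): Seq[String] =
  heartbeats
    .filter(hb => Duration.between(hb.lastSeen, now).compareTo(timeout) > 0)
    .map(_.memberId) // members whose tasks should be reassigned by the survivors
```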
Disclaimer: I am employed by Lightbend, which provides commercial support for Akka and employs or has employed at some point most of the contributors to and maintainers of Akka.

Is it possible to run a single container Flink cluster in Kubernetes with high-availability, checkpointing, and savepointing?

I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to job clusters; however, there are reservations about the number of containers which will be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high availability and the ability to use checkpoints and restore from/take savepoints, as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)
One of the responsibilities of the job manager is to monitor the task manager(s) and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may be moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
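For context, a job running in such a local environment might look roughly like this (Scala API; checkpoint storage, state backend, and the restart/recovery wiring are omitted and would be on you):

```scala
import org.apache.flink.streaming.api.scala._

object MiniClusterJob {
  def main(args: Array[String]): Unit = {
    // Starts an embedded JobManager and TaskManager inside this one JVM.
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    env.enableCheckpointing(10000L) // checkpoint every 10s; durable storage still needed

    env.fromElements(1, 2, 3)
      .map(_ * 2)
      .print()

    env.execute("local minicluster sketch")
  }
}
```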
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.
Check out this operator, which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta, but you can still get some ideas from it about how to do this, or use the operator directly if it fits your requirements. There, the JobManager and TaskManager are separate Kubernetes deployments.

Is it possible to use dolphinscheduler without zookeeper?

ZooKeeper plays several roles in the open-source workflow framework DolphinScheduler, such as heartbeat detection among masters and workers, the task queue, event listening, and distributed locking.
(DolphinScheduler framework diagram)
Is it possible to replace it with a database (MySQL)? The main reason is to simplify the project structure.
ZooKeeper in DS is mainly used as:
a task queue, for the master sending tasks to workers
a lock, for the communication between hosts (masters and workers)
an event watcher: the master listens for events such as a worker being added or removed
Replacing ZooKeeper with MySQL would be costly.
ZooKeeper mainly assumes the responsibility of the registry and monitors the application status. ZooKeeper is very mature in this area and is a recognized solution in the industry. If MySQL were to take this on, the technical implementation cost would be higher, and it might not achieve the desired effect.
BTW, their team is currently working on SPI development for the registry, so in later versions you may be able to use other components, such as etcd, to achieve similar functionality.
For now, the MasterServer and WorkerServer nodes in the system all use ZooKeeper for cluster management and fault tolerance. In addition, the system performs event monitoring and distributed locking based on ZooKeeper. We have also implemented queues based on Redis, but we hope that DolphinScheduler relies on as few components as possible, so we finally removed the Redis implementation.
So for now DolphinScheduler can't work without ZooKeeper; maybe it will be possible in the future.
(DolphinScheduler system architecture diagram)
For more documentation, please refer to the Official Document.

Akka remoting and Heroku

I'm investigating the use of Scala/Play/Akka on Heroku and I'm curious about something. Suppose I have an application structured as a network of Akka actors. Some of the actors will be in-process with the web application itself, but I may want to set aside one or more nodes as dedicated Akka actors: for example, a group of cache manager actors.
To configure Akka remoting, I need to supply IP addresses in akka.conf. But since Heroku nodes are somewhat ephemeral, I won't know each node's address at the time I write the config file.
It might simplify things to have a central "registration" node, but even there, I won't know the IP address of that node in advance.
So how do my Akka nodes refer to each other on Heroku?
Heroku dynos can't talk directly to each other, so you will have to use an external messaging system like RabbitMQ. There is a great article on the Heroku Dev Center about how to coordinate Akka actors in this way:
https://devcenter.heroku.com/articles/scaling-out-with-scala-and-akka
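The gist of that approach, as a hedged sketch (RabbitMQ Java client; the queue name, the CLOUDAMQP_URL env var from the Heroku add-on, and the CacheManager actor are placeholders): dynos never address each other directly, they only share a queue.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

class CacheManager extends Actor {
  def receive = {
    case body: String => // handle work delivered via the queue
      println(s"received: $body")
  }
}

object QueueBridge extends App {
  val system = ActorSystem("app")
  val manager = system.actorOf(Props[CacheManager](), "cacheManager")

  val factory = new ConnectionFactory()
  factory.setUri(sys.env("CLOUDAMQP_URL")) // AMQP URI provided by the add-on
  val channel = factory.newConnection().createChannel()
  channel.queueDeclare("work", true, false, false, null)

  // Web dynos publish to the queue; worker dynos consume and forward
  // each delivery to a local actor.
  channel.basicConsume("work", true, new DefaultConsumer(channel) {
    override def handleDelivery(tag: String, env: Envelope,
                                props: AMQP.BasicProperties, body: Array[Byte]): Unit =
      manager ! new String(body, "UTF-8")
  })
}
```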

Distributed Actors in Akka

I'm fairly new to Akka and new to distributed programming in general. Using Akka's Mist component, I've created supervised actors to handle HTTP requests asynchronously. Everything is currently running on one physical machine with local actors. What I don't understand is how to build a truly fault-tolerant system with more than one box. As stated in the Akka docs:
Also, you (usually) need to know if one box is down and/or the service you are talking to on the other box is down. Here actor supervision/linking is a critical tool for not only monitoring the health of remote services, but to actually manage the service, do something about the problem if the actor or node is down. Such as restarting actors on the same node or on another node.
How do I do this? I'm looking for an example or pointers on how to begin making my application distributed. Other services in our group use Apache gateways in front of multiple Tomcat instances, so the event of a Tomcat server going down is transparent to the user. I'm deploying my service to the Akka microkernel and need to achieve a similar level of high availability across more than one physical box.
I'm using Akka 1.1.3.
Remote supervision works only with client-managed remote actors for the Akka 1.x series.
Akka 2.0, which is currently under development, will support transparent clustering, cluster-wide supervision, and cluster-wide lifecycle monitoring.
You might consider putting an HTTP load balancer in front of Akka Microkernel instances running Mist, this would match what your group does with 'Apache gateways'.
Another approach would be to expose remote actors on a number of instances and then use Akka's LoadBalancer or Actor Pool to send messages around; see here.
The second approach is a bit of a pain if you have a dynamic pool of machines, because the pool of devices has to be specified programmatically. Akka 2.0 addresses this with cluster support that is set up in the akka.conf file.
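For reference, exposing and looking up a server-managed remote actor in Akka 1.x went roughly like this (a reconstruction from the 1.x docs; host, port, and service name are placeholders):

```scala
import akka.actor.Actor
import akka.actor.Actor._

class CacheActor extends Actor {
  def receive = {
    case key: String => self.reply("value-for-" + key)
  }
}

// On each server instance: start the remote server and register the actor by name.
remote.start("10.0.0.1", 2552)
remote.register("cache-service", actorOf[CacheActor])

// On a client: look up the remote actor by name, host, and port, then message it.
val cache = remote.actorFor("cache-service", "10.0.0.1", 2552)
cache ! "some-key"
```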
As far as the release date of 2.0 goes, for what it's worth, 1.2 was just released on 2011-Sept-19.