Different use cases for Akka cluster-aware routers & Akka cluster sharding? - scala

Cluster aware router:
import akka.actor.Props
import akka.cluster.routing.{ClusterRouterPool, ClusterRouterPoolSettings}
import akka.routing.RoundRobinPool

val router = system.actorOf(
  ClusterRouterPool(
    RoundRobinPool(0),
    ClusterRouterPoolSettings(
      totalInstances = 20,
      maxInstancesPerNode = 1,
      allowLocalRoutees = false,
      useRole = None
    )
  ).props(Props[Worker]),
  name = "router")
Here, we can send messages to the router, and they are distributed across a set of remote routee actors.
Cluster sharding (not considering persistence):
import akka.actor.{Actor, ActorRef, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings}

class NewShoppers extends Actor {

  ClusterSharding(context.system).start(
    "shardshoppers",
    Props(new Shopper),
    ClusterShardingSettings(context.system),
    Shopper.extractEntityId,
    Shopper.extractShardId
  )

  def proxy: ActorRef =
    ClusterSharding(context.system).shardRegion("shardshoppers")

  override def receive: Receive = {
    case msg => proxy forward msg
  }
}
Here, we can send messages to the proxy, and they are routed to the appropriate sharded actors (a.k.a. entities).
So, my question is: it seems both methods can distribute tasks across many actors. What is the design rationale behind the two? In which situation should which one be used?

The pool router is for when you just want to send some work to whatever node and have some processing happen; two messages sent in sequence will likely not end up in the same actor for processing.
Cluster sharding is for when each actor has a unique id of some kind and there are too many of them to fit on one node, but you want every message with that id to always end up in the actor for that id. For example, modelling a User as an entity: you want all commands about that user to end up with that user's actor, but you want the actor to be moved if the cluster topology changes (nodes removed or added) and the actors to stay reasonably balanced across the existing nodes.
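To make that concrete, here is a minimal sketch of the two message extractors sharding needs for such a User entity; UserCommand and the shard count of 30 are made up for the example:

import akka.cluster.sharding.ShardRegion

// Hypothetical command type carrying the entity id.
case class UserCommand(userId: String, action: String)

// Same userId => same entity actor, wherever it currently lives in the cluster.
val extractEntityId: ShardRegion.ExtractEntityId = {
  case cmd: UserCommand => (cmd.userId, cmd)
}

// Entities are grouped into a fixed number of shards (30 is arbitrary).
val extractShardId: ShardRegion.ExtractShardId = {
  case cmd: UserCommand => (math.abs(cmd.userId.hashCode) % 30).toString
}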

Credit to johanandren and the linked article as the basis for the following answer:
Both a router and sharding distribute work. Sharding is required if, in addition to load balancing, the recipient actors have to reliably manage state that is directly associated with the entity identifier.
To recap, the entity identifier is a key, derived from the message being sent, that determines the message's recipient actor in the cluster.
First of all, can you manage state associated with an identifier across different nodes using a consistent hashing router? (A consistent hashing router will always send messages with the same identifier to the same target actor.) The answer is: no, as explained below.
The hash-based method stops working when nodes in the cluster go down or come up, because this changes the associated actor for some identifiers. If a node goes down, messages that were associated with it are now sent to a different actor in the network, but that actor is not informed about the former state of the actor it is replacing. Likewise, if a new node comes up, it will take care of messages (identifiers) that were previously associated with a different actor, and neither the new node nor the old one is informed about this.
With sharding, on the other hand, the actors that are created are aware of the entity identifier that they manage. Sharding will make sure that there is exactly one actor managing the entity in the cluster, and it will re-create sharded actors on a different node if their host node goes down. So, using persistence, they retain their (persisted) state across nodes when the number of nodes changes. Thanks to sharding you also don't have to worry about concurrency issues if an actor is re-created on a different node. Furthermore, if a message with a new entity identifier is encountered, for which an actor does not exist yet, a new actor is created.
A consistent hashing router may still be of use for caching, because messages with the same key generally do go to the same actor. To manage a stateful entity that exists only once in the cluster, sharding is required.
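For completeness, a minimal sketch of such a consistent hashing router used as a cache; Get, Put and CacheActor are made up for the example, and note there is no guarantee about what happens to entries when nodes join or leave:

import akka.actor.{Actor, Props}
import akka.routing.ConsistentHashingPool
import akka.routing.ConsistentHashingRouter.ConsistentHashMapping

case class Get(key: String)
case class Put(key: String, value: String)

class CacheActor extends Actor {
  private var cache = Map.empty[String, String]
  def receive = {
    case Put(k, v) => cache += (k -> v)
    case Get(k)    => sender() ! cache.get(k)
  }
}

// Messages with the same key are hashed to the same routee.
def hashMapping: ConsistentHashMapping = {
  case Get(key)    => key
  case Put(key, _) => key
}

val cacheRouter = system.actorOf(
  ConsistentHashingPool(10, hashMapping = hashMapping).props(Props[CacheActor]),
  name = "cacheRouter")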
Use routers for load balancing, use Sharding for managing stateful entities in a distributed manner.

Related

Why don't the first Receptionist.Subscribe messages in Akka contain all cluster members registered with the specified key?

I have two actors: the first already in the cluster (all on localhost) at port 25457 and the second to be started at port 25458.
Both have the following behaviour:
val addressBookKey = ServiceKey[Message]("address_book")
val listingResponseAdapter = ctx.messageAdapter[Receptionist.Listing] {
  case addressBookKey.Listing(p) => OnlinePlayers(p)
}

Cluster(ctx.system).manager ! Join(address)
ctx.system.receptionist ! Register(addressBookKey, ctx.self)
ctx.system.receptionist ! Subscribe(addressBookKey, listingResponseAdapter)

Behaviors.receiveMessagePartial {
  case m =>
    System.err.println(m)
    Behaviors.same
}
When the second actor joins, stderr prints Set(), Set(Actor[akka://system/user#0]) and then Set(Actor[akka://system/user#0], Actor[akka://system@localhost:27457/user#0])
When the second actor leaves, the first actor prints Set(Actor[akka://system/user#0]) twice.
How can the second actor directly receive all cluster participants?
Why does the first actor print the set twice after the second leaves?
Thanks
Joining the cluster is an async process; you have only triggered joining by sending Join to the manager, and actually joining the cluster happens at some point after that. The receptionist can only know about registered services on nodes that have completed joining the cluster.
This means that when you subscribe to the receptionist, joining the cluster has likely not completed yet, so you get the locally registered services (because of ordering guarantees, the receptionist will always get the local Register message before it receives the Subscribe); then, once joining the cluster completes, the receptionist learns about services on other nodes and the subscriber is updated.
To be sure other nodes are known, you would have to delay the subscription until the node has joined the cluster. This can be achieved by subscribing to cluster state and only subscribing to the receptionist after the node itself has been marked as Up.
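A rough sketch of that ordering in Akka Typed; NodeIsUp is a made-up message in the subscribing actor's own protocol:

import akka.cluster.typed.{Cluster, SelfUp, Subscribe => ClusterSubscribe}

// Get notified once this node itself has been marked Up.
val upAdapter = ctx.messageAdapter[SelfUp](_ => NodeIsUp)
Cluster(ctx.system).subscriptions ! ClusterSubscribe(upAdapter, classOf[SelfUp])

Behaviors.receiveMessagePartial {
  case NodeIsUp =>
    // Only now subscribe to the receptionist; it will already include
    // (or soon learn about) services registered on other nodes.
    ctx.system.receptionist ! Receptionist.Subscribe(addressBookKey, listingResponseAdapter)
    Behaviors.same
}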
In general it is often good to make something that works regardless of cluster node join, as it makes it easier to test and also to run the same component without a cluster. So, for example, switch the behaviour of the subscribing actor depending on whether there are no registered services, at least one, or a minimum count of services for the service key.
Not entirely sure why you see the duplicated update when the actor on the other node "leaves", but there are some specifics around the CRDT used for the receptionist registry where it may have to re-remove a service for consistency; that could perhaps explain it.

Find actor by persistence id

I have a system that has an actor per user. Users send messages rarely, but when they do, they usually send not just one but a few.
Currently, I have a map where I store persistenceId -> ActorRef. When I receive a new message for an actor, I look into the map: if there is an ActorRef, I use it; if it is missing, I create the actor and put it into the map. I definitely don't want to have two instances of the same persistent actor at the same time. Also, I don't want to create and destroy the actor for each message, as recovery could take some time.
I feel there should be some cleaner way of "locating or creating" an actor, something like actorSystem.getOrCreate(persistenceId, props). I thought that sharding might help me with that, but I couldn't find an exact example of this. Also, I know there is actorSelection, which has downsides:
using it in too many places, with hardcoded paths that are tricky to maintain
using it to send too many messages, as it has a performance cost
So basically the question is: what is the best way of locating a persistent actor within one service if the actor's persistenceId is the userId? If I decide to use sharding, then it will be one shard per actor. Is this OK?
Actor sharding is pretty much what you need - you can think of it as a distributed map of actors, and there is no need for additional solutions. Sharding takes care of summoning the actor behind the scenes, so there is no need for you to manage actors yourself.
val sharding = ClusterSharding(system).start(
  typeName = CustomerActor.shardName,
  entityProps = CustomerActor.props,
  settings = ClusterShardingSettings(system),
  extractEntityId = CustomerActor.extractEntityId,
  extractShardId = CustomerActor.extractShardId)
where extractEntityId is a function which routes messages to the appropriate entity actors:
val extractEntityId: ShardRegion.ExtractEntityId = {
  case gc: GetCustomer => (gc.customerId, gc)
}
And a final example:

case class GetCustomer(customerId: String)

sharding ! GetCustomer("customer-id")
More details here https://doc.akka.io/docs/akka/2.5/cluster-sharding.html

Akka cluster-sharding: moving actor shards based on communication patterns

I am building an open-source distributed economic simulation platform using Akka (especially the remote and cluster packages). A key bottleneck in such simulations is the fact that communication patterns between actors evolve over the course of the simulation and often actors will end up sending loads of messages over the wire between nodes in the cluster.
I am looking for a mechanism to detect actors on some node that are communicating heavily with actors on some other node and move them to that other node. Is this possible using existing Akka cluster sharding functionality? Perhaps this is what Roland Kuhn meant by "automatic actor tree partitioning" in his answer to this SO question.
Moving shards around according to your own logic is doable by implementing a custom ShardAllocationStrategy.
You just have to extend ShardAllocationStrategy and implement these two methods:
def allocateShard(
    requester: ActorRef,
    shardId: ShardId,
    currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardId]]): Future[ActorRef]

def rebalance(
    currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardId]],
    rebalanceInProgress: Set[ShardId]): Future[Set[ShardId]]
The first one determines which region will be chosen when allocating a new shard, and provides you with the shards already allocated. The second one is called regularly and lets you control which shards to rebalance to another region (for example, if they became too unbalanced).
Both functions return a Future, which means that you can even query another actor to get the information you need (for example, an actor that has the affinity information between your actors).
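A minimal sketch of such a strategy; the allocation and rebalance decisions here are placeholders, and an affinity-aware version would replace both bodies, e.g. by asking an actor that holds the affinity statistics:

import scala.collection.immutable
import scala.concurrent.Future
import akka.actor.ActorRef
import akka.cluster.sharding.ShardCoordinator.ShardAllocationStrategy
import akka.cluster.sharding.ShardRegion.ShardId

class AffinityAwareStrategy extends ShardAllocationStrategy {

  // Placeholder: allocate a new shard to the region that currently hosts the fewest shards.
  override def allocateShard(
      requester: ActorRef,
      shardId: ShardId,
      currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardId]]): Future[ActorRef] =
    Future.successful(currentShardAllocations.minBy { case (_, shards) => shards.size }._1)

  // Placeholder: never rebalance; a real implementation would return the shards
  // whose traffic mostly originates from a different node.
  override def rebalance(
      currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardId]],
      rebalanceInProgress: Set[ShardId]): Future[Set[ShardId]] =
    Future.successful(Set.empty)
}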
For the affinity itself, I think you have to implement something yourself. For example, actors could collect statistics about their sender nodes and post that regularly to a cluster singleton that would determine which actors should be moved to the same node.

Starting Actors on-demand by identifier in Akka

I'm currently implementing a system that receives inbound messages from an external monitoring system. I'm translating these messages into more concise 'events', and I'm using these to alter the state of 'Managed System' objects. Akka Actors seemed like a good fit for encapsulating mutable state in concurrent applications.
The managed systems are identified by a name (99% of the time this is a hostname). Whenever a proper event is received, the system routes the message to the correct actor based on the name property. At first I used actorSelection and the complete paths of said actors, but that was very ugly, and I saw several people advise against relying on the fully qualified name of an actor to deliver messages.
So I've set up a simple EventBus, which is great as I can now simply do:
eventBus.subscribe(subscriber1, "/managedSystem01")
eventBus.subscribe(subscriber2, "/managedSystem02")
eventBus.publish(MonitoringEvent("/managedSystem01", MonitoringMessage("managedSystem01", "N", "CPU_LOAD_HIGH", true)))
eventBus.publish(MonitoringEvent("/managedSystem02", MonitoringMessage("managedSystem02", "Y", "DISK_USAGE_HIGH", true)))
Of course, I now have the issue that, should I receive an event that concerns a managed system for which I've not spawned an actor yet (this is entirely possible, since it is impossible for me to get an absolute list of managed systems, unfortunately), the message will be routed to the dead-letter mailbox.
Ideally I don't want this to happen. When it is unable to address a specific actor, I want to spawn a new one dynamically.
I suppose that, theoretically, I could subscribe to DeadLetter messages, but:
That sounds a little 'hacky', since those messages are essentially reserved for the system
Is it even possible to recover the original message (in my case, the MonitoringMessage) that was sent to the DeadLetter mailbox?
Alternatively, is there a way to check if there are ZERO subscribers to a certain "topic"?
What you describe ("send to an Actor by some identifier; if it does not exist, buffer until it gets created and then deliver to that newly on-demand created Actor") is implemented in Akka as Cluster Sharding.
While it is designed primarily for sharding load (work) across a cluster, you could use it locally as well, since your requirement is essentially a scaled-down (to one node) version of the problem that it solves. It takes care of starting new Actors if they don't exist for a given identifier, etc., so you'd simply subscribe the shard region to the events and it will take care of creating the actors for you.
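A rough sketch under those assumptions; ManagedSystemActor and the exact shapes of MonitoringEvent / MonitoringMessage are guesses based on the question, and the shard count of 50 is arbitrary:

import akka.actor.{Actor, ActorRef, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings}

// Assumed message shapes, per the question.
case class MonitoringMessage(systemName: String, flag: String, alert: String, active: Boolean)
case class MonitoringEvent(topic: String, msg: MonitoringMessage)

// Assumed entity actor holding one managed system's mutable state.
class ManagedSystemActor extends Actor {
  def receive = {
    case MonitoringEvent(_, msg) => // update this managed system's state
  }
}

val managedSystems: ActorRef = ClusterSharding(system).start(
  typeName = "managedSystems",
  entityProps = Props[ManagedSystemActor],
  settings = ClusterShardingSettings(system),
  extractEntityId = { case ev @ MonitoringEvent(_, msg) => (msg.systemName, ev) },
  extractShardId = { case MonitoringEvent(_, msg) => (math.abs(msg.systemName.hashCode) % 50).toString }
)

// Any event delivered to the region spawns the entity on demand, so previously
// unseen managed systems no longer end up in dead letters.
managedSystems ! MonitoringEvent("/managedSystem03", MonitoringMessage("managedSystem03", "N", "CPU_LOAD_HIGH", true))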

How to generate a unique id for an Actor?

Suppose I have an application that uses actors for processing users, so there is one UserActor per user. Every user actor is mapped to a user via an id, e.g. to process actions for a concrete user you obtain the actor like this:
ActorSelection actor = actorSystem.actorSelection("/user/1");
where 1 is the user id.
So the problem is: how can a unique id be generated inside the cluster efficiently? First, it needs to be checked that the new id does not duplicate an existing one. I could create one actor for generating ids, which would live on one node, and before creating any new UserActor the generator would be asked for an id, but this leads to an additional request inside the cluster whenever a user is created. Is there a more efficient way to do this? Are there built-in Akka techniques for that?
P.S. Maybe this architecture for using actors is not effective; any suggestion/best practice is welcome.
I won't say whether or not your approach is a good idea; that's going to be up to you to decide. If I understand your problem correctly, though, I can suggest a high-level approach to making it work for you: you have a cluster, and for any given userId there should be an actor in the system that handles requests for it, and it should only be on one node and be consistently reachable based on the user id. If that's correct, then consider the following approach.
Let's start with a simple actor, let's call it UserRequestForwarder. This actor's job is to find the actor instance for a request for a particular user id and forward the request on to it. If that actor instance does not yet exist, then this actor will create it before forwarding. A very rough sketch could look like this:
class UserRequestForwarder extends Actor {
  def receive = {
    case req @ DoSomethingForUser(userId) =>
      val childName = s"user-request-handler-$userId"
      // create the child with a stable name so context.child can find it again next time
      val child = context.child(childName)
        .getOrElse(context.actorOf(Props[UserRequestHandler], childName))
      child forward req
  }
}
Now this actor would be deployed onto every node in the cluster via a ConsistentHashingPool router configured so that there is one instance per node. You just need to make sure that there is something in every request travelling through this router that allows it to be consistently hashed to the node that handles requests for that user (hopefully the user id).
So if you pass all requests through this router, they will always land on the node that is responsible for that user, ending up in the UserRequestForwarder which will then find the correct user actor on that node and pass the request on to it.
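A sketch of how that router deployment might look; DoSomethingForUser comes from the snippet above, and the total-instances figure is arbitrary:

import akka.actor.Props
import akka.cluster.routing.{ClusterRouterPool, ClusterRouterPoolSettings}
import akka.routing.ConsistentHashingPool
import akka.routing.ConsistentHashingRouter.ConsistentHashMapping

// Hash every request on its userId so a given user always lands on the same node.
def hashMapping: ConsistentHashMapping = {
  case DoSomethingForUser(userId) => userId
}

val forwarders = system.actorOf(
  ClusterRouterPool(
    ConsistentHashingPool(0, hashMapping = hashMapping),
    ClusterRouterPoolSettings(
      totalInstances = 100,
      maxInstancesPerNode = 1, // one UserRequestForwarder per node
      allowLocalRoutees = true,
      useRole = None)
  ).props(Props[UserRequestForwarder]),
  name = "userRequestForwarders")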
I have not tried this approach myself, but it might work for what you are trying to do provided I understood your problem correctly.
Not an akka expert, so I can't offer code, but shouldn't the following approach work:
Have a single actor responsible for creating the actors, and have it keep a HashSet of the names of actors that it created and that haven't died yet.
If you have to spread the load between multiple actors, you can dispatch the task based on the first n digits of the hash code of the actor name that has to be created.
It seems like you have your answer on how to generate the unique ID. In terms of your larger question, this is what Akka cluster sharding is designed to solve. It will handle distributing shards across your cluster, finding or starting your actors within the cluster, and even rebalancing.
http://doc.akka.io/docs/akka/2.3.5/contrib/cluster-sharding.html
There's also an activator with a really nice example.
http://typesafe.com/activator/template/akka-cluster-sharding-scala