I want to check whether ClusterSharding has started or not for one region. Here is the code:
import akka.actor.{ActorRef, ActorSystem}
import akka.contrib.pattern.ClusterSharding
import com.typesafe.config.ConfigFactory

def someMethod(): Unit = {
  val system = ActorSystem("ClusterSystem", ConfigFactory.load())
  val region: ActorRef = ClusterSharding(system).shardRegion("someActorName")
}
The method akka.contrib.pattern.ClusterSharding#shardRegion throws IllegalArgumentException if it does not find the shard region. I do not like the approach of catching IllegalArgumentException just to check whether ClusterSharding has started.
Is there another approach like ClusterSharding(system).isStarted(shardRegionName = "someActorName")?
Or is it assumed that I should start all shard regions at ActorSystem startup?
You should indeed start all regions as soon as possible. According to the docs:
"When using the sharding extension you are first, typically at system startup on each node in the cluster, supposed to register the supported entry types with the ClusterSharding.start method."
Startup of a region is not immediate. In particular, even in the local case, it takes at the very least the time specified in the akka.contrib.cluster.sharding.retry-interval parameter of your configuration (the name is misleading: this value is both the initial delay of registration and the retry interval) before your sharded actors can effectively receive messages. Messages sent in that period are not lost, but they are not delivered until later.
If you want to be 100% sure that your region started, you should have one of your sharded actors respond to an identify message after you call ClusterSharding.start. Once it replies, you are guaranteed that your region is up and running. You can use the ask pattern if you want to block and await the result of the ask future.
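For example, a minimal blocking sketch of that check. Ping is a hypothetical probe message, not part of any API: it must be routable by your idExtractor/shardResolver, and the entry actor must reply to it.

import scala.concurrent.Await
import scala.concurrent.duration._
import akka.pattern.ask
import akka.util.Timeout

case class Ping(entryId: String)   // hypothetical probe message routed to some entry id

implicit val timeout: Timeout = Timeout(10.seconds)
val probe = region ? Ping("probe-entry")   // region from ClusterSharding(system).shardRegion(...)
Await.result(probe, timeout.duration)      // throws AskTimeoutException if the region is not ready in time

If Await.result returns, the region resolved the entry and delivered the message, so sharding is up for that type.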
I am currently testing the scaling of my application and I ran into something I did not expect.
The application is running on a 5-node cluster; it has multiple services/actor types and uses a shared process model.
For some components it uses actor events as a best-effort pub/sub system (there are fallbacks in place, so if a notification is dropped there is no issue).
The problem arises when the number of actors (i.e. subscription topics) grows. The actor service is partitioned into 100 partitions at the moment.
The number of topics at that point is around 160,000, where each topic is subscribed to 1-5 times (on the nodes where it is needed), with an average of 2.5 subscriptions (roughly 400k subscriptions in total).
At that point, communication in the cluster starts breaking down: new subscriptions are not created and unsubscribes time out.
It also affects other services: internal calls to a diagnostics service time out (asking each of the 5 replicas). This is probably due to the resolving of partition/replica endpoints, since outside calls to the web page are fine (those endpoints use the same technology/code stack).
The event viewer is full of warnings and errors like:
EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
I've tried scaling the application without this subscription model active, and I easily reach a workload twice as large without any issues.
So there are a couple of questions:
Are there limits known/advised for actor events?
Would increasing the partition count and/or node count help here?
Is the communication interference logical? Why are other service endpoints having issues as well?
After some time spent on the support ticket we found some information, so I will post my findings here in case they help someone.
The actor events use a resubscription model to make sure they are still connected to the actor. By default this is done every 20 seconds. This meant a lot of resources were being used, and eventually the whole system was overloaded with loads of idle threads waiting to resubscribe.
You can decrease the load by setting resubscriptionInterval to a higher value when subscribing. The drawback is that the client will potentially miss events in the meantime (if a partition is moved).
To counteract the delay in resubscribing, it is possible to hook into the lower-level Service Fabric events. The following pseudocode was offered to me in the support call.
1. Register for endpoint change notifications for the actor service:
fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
{
var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
/*
* Add additional logic for optimizations
* - check if the endpoint is not empty
* - If multiple listeners are registered, check if the endpoint change notification is for the desired endpoint
 * Please note, all the endpoints are sent in the notification. User code should have the logic to cache the endpoint seen during the subscription call and compare it with the newer one
*/
List<long> keys;
if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
{
foreach (var key in keys)
{
// 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync()
// 2. Resubscribe by calling ActorProxy.SubscribeAsync()
}
}
};
await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
2. Change the resubscription interval to a value that fits your need.
3. Cache the partition id to actor id mapping. This cache will be used to resubscribe when the replica's primary endpoint changes (see step 1).
await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /*Tune the value according to the need*/);
ResolvedServicePartition rsp;
((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);
var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
keys.Add(communicationId);
The above approach ensures the following:
The subscriptions are resubscribed at regular intervals
If the primary endpoint changes in between, the ActorProxy resubscribes from the service notification callback
This ends the pseudocode from the support call.
Answering my original questions:
Are there limits known/advised for actor events?
No hard limits, only resource usage.
Would increasing the partition count and/or node count help here?
Increasing the partition count will not; increasing the node count might, but only if it means there are fewer subscribing entities per node.
Is the communication interference logical? Why are other service endpoints having issues as well?
Yes, resource contention is the reason.
I'm currently implementing a system that receives inbound messages from an external monitoring system. I'm translating these messages into more concise 'events', and I'm using these to alter the state of 'Managed System' objects. Akka actors seemed like a good fit for encapsulating mutable state in a concurrent application.
The managed systems are identified by a name (99% of the time this is a hostname). Whenever a proper event is received, the system routes the message to the correct actor based on the name property. At first I used actorSelection and the complete paths of said actors, but that was very ugly, and I saw several people advise against relying on the fully qualified path of an actor to deliver messages.
So I've set up a simple EventBus, which is great as I can now simply do:
eventBus.subscribe(subscriber1, "/managedSystem01")
eventBus.subscribe(subscriber2, "/managedSystem02")
eventBus.publish(MonitoringEvent("/managedSystem01", MonitoringMessage("managedSystem01", "N", "CPU_LOAD_HIGH", true)))
eventBus.publish(MonitoringEvent("/managedSystem02", MonitoringMessage("managedSystem02", "Y", "DISK_USAGE_HIGH", true)))
Of course, I now have the issue that, should I receive an event that concerns a managed system for which I've not spawned an actor yet (this is entirely possible; unfortunately it is impossible for me to get an absolute list of managed systems), the message will be routed to the dead-letter mailbox.
Ideally I don't want this to happen. When it is unable to address a specific actor, I want to spawn a new one dynamically.
I suppose that, theoretically, I could subscribe to DeadLetter messages but:
That sounds a little 'hacky', since those messages are essentially reserved for the system
Is it even possible to recover the original message (in my case, the MonitoringMessage) that was sent to the dead-letter mailbox? (See the sketch after this list.)
Alternatively is there a way to check if there are ZERO subscribers to a certain "topic"?
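For reference, the dead-letter route is technically possible: akka.actor.DeadLetter wraps the original message together with its sender and recipient, so the MonitoringMessage can be recovered. A minimal sketch, assuming system is the ActorSystem and MonitoringMessage is the case class used in the snippet above:

import akka.actor.{Actor, ActorSystem, DeadLetter, Props}

class DeadLetterListener extends Actor {
  def receive = {
    case DeadLetter(msg: MonitoringMessage, from, to) =>
      // msg is the original MonitoringMessage; a missing managed-system actor could be spawned here
    case _ => // ignore other dead letters
  }
}

val listener = system.actorOf(Props[DeadLetterListener], "dead-letter-listener")
system.eventStream.subscribe(listener, classOf[DeadLetter])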
What you describe ("send to Actor by some identifier, if it does not exist buffer until it gets created and then deliver to that newly on-demand created Actor") is implemented in Akka as Cluster Sharding.
While it is designed primarily for sharding load (work) across a cluster, you could use it locally as well, since your requirement is essentially a scaled-down (to one node) version of the problem it solves. It takes care of starting new actors if they don't exist for a given identifier, so you'd simply subscribe the shard region to the events and it will create the actors for you.
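A minimal sketch of that setup, using the newer akka.cluster.sharding module (the contrib module referenced elsewhere in this thread has an equivalent start method). It assumes MonitoringMessage is a four-field case class whose first field is the managed-system name, that system is the ActorSystem, and that ManagedSystemActor and the shard count of 100 are placeholders:

import akka.actor.{Actor, ActorRef, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

// Entity actor holding one managed system's state.
class ManagedSystemActor extends Actor {
  def receive = {
    case msg: MonitoringMessage => // update this managed system's state
  }
}

// Route each MonitoringMessage by its managed-system name (its first field above).
val extractEntityId: ShardRegion.ExtractEntityId = {
  case msg @ MonitoringMessage(name, _, _, _) => (name, msg)
}
val extractShardId: ShardRegion.ExtractShardId = {
  case MonitoringMessage(name, _, _, _) => (math.abs(name.hashCode) % 100).toString
}

val managedSystemRegion: ActorRef = ClusterSharding(system).start(
  typeName        = "ManagedSystem",
  entityProps     = Props[ManagedSystemActor],
  settings        = ClusterShardingSettings(system),
  extractEntityId = extractEntityId,
  extractShardId  = extractShardId)

// The region creates the entity on first use, so unknown systems no longer end up in dead letters.
managedSystemRegion ! MonitoringMessage("managedSystem03", "N", "CPU_LOAD_HIGH", true)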
Suppose I have an application that uses actors for processing users, so there is one UserActor per user. Every UserActor is mapped to a user via an id, e.g. to process actions for a concrete user you get the actor like this:
ActorSelection actor = actorSystem.actorSelection("/user/1");
where 1 is the user id.
So the problem is: how do I generate a unique id inside the cluster efficiently? First, it needs to be checked that a new id does not duplicate an existing one. I could create one actor for generating ids, living on one node, and ask it for an id before creating any new UserActor, but this means an additional request inside the cluster whenever a user is created. Is there a more efficient way to do this? Are there built-in Akka techniques for that?
P.S. Maybe this architecture for using actors is not effective; any suggestion/best practice is welcome.
I won't say whether or not your approach is a good idea; that's up to you to decide. If I understand your problem correctly, though, I can suggest a high-level approach to making it work: you have a cluster, and for any given userId there should be one actor in the system that handles requests for it, living on only one node and consistently reachable based on the user id. If that's correct, then consider the following approach.
Let's start with a simple actor; let's call it UserRequestForwarder. This actor's job is to find the actor instance for a particular user id and forward the request on to it. If that actor instance does not yet exist, it will create it before forwarding. A very rough sketch could look like this:
import akka.actor.{Actor, Props}

class UserRequestForwarder extends Actor {
  def receive = {
    case req @ DoSomethingForUser(userId) =>
      val childName = s"user-request-handler-$userId"
      // create the child under a per-user name so the lookup finds it next time
      val child = context.child(childName).getOrElse(context.actorOf(Props[UserRequestHandler], childName))
      child forward req
  }
}
Now this actor would be deployed onto every node in the cluster via a ConsistentHashingPool router configured so that there is one instance per node. You just need to make sure that every request that travels through this router carries something that allows it to be consistently hashed to the node that handles requests for that user (ideally the user id).
So if you pass all requests through this router, they will always land on the node that is responsible for that user, ending up in the UserRequestForwarder which will then find the correct user actor on that node and pass the request on to it.
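A rough sketch of the local half of that wiring, assuming a hypothetical DoSomethingForUser(userId) case class (matching the pattern in the code above) and the UserRequestForwarder from that sketch; for a real cluster you would additionally wrap the pool in a cluster-aware router pool with one instance per node, which is omitted here:

import akka.actor.{ActorSystem, Props}
import akka.routing.ConsistentHashingPool
import akka.routing.ConsistentHashingRouter.ConsistentHashMapping

case class DoSomethingForUser(userId: String)   // hypothetical request type

// Hash every request by its user id so requests for the same user always reach
// the same UserRequestForwarder instance.
val hashMapping: ConsistentHashMapping = {
  case DoSomethingForUser(userId) => userId
}

val system = ActorSystem("UserSystem")
val forwarders = system.actorOf(
  ConsistentHashingPool(nrOfInstances = 5, hashMapping = hashMapping)
    .props(Props[UserRequestForwarder]),
  name = "userRequestForwarders")

forwarders ! DoSomethingForUser("42")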
I have not tried this approach myself, but it might work for what you are trying to do provided I understood your problem correctly.
Not an akka expert, so I can't offer code, but shouldn't the following approach work:
Have a single actor responsible for creating the actors, and have it keep a HashSet of the names of the actors it has created that haven't died yet.
If you have to spread the load between multiple such actors, you can dispatch the task based on the first n digits of the hash code of the actor name that has to be created.
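A minimal sketch of that single-registry idea (all names here, such as UserRegistry, CreateUser and UserActor, are hypothetical and not from the question):

import akka.actor.{Actor, ActorRef, Props, Terminated}

case class CreateUser(name: String)
case class UserCreated(ref: ActorRef)
case object AlreadyExists

class UserActor extends Actor {
  def receive = { case msg => /* handle actions for this user */ }
}

// Single registry actor: it is the only creator of UserActors, so its map of
// known names is authoritative and duplicate ids can be rejected cheaply.
class UserRegistry extends Actor {
  private var known = Map.empty[String, ActorRef]

  def receive = {
    case CreateUser(name) if known.contains(name) =>
      sender() ! AlreadyExists
    case CreateUser(name) =>
      val ref = context.actorOf(Props[UserActor], name)
      context.watch(ref)                 // forget the name again when the actor dies
      known += name -> ref
      sender() ! UserCreated(ref)
    case Terminated(ref) =>
      known = known.filter { case (_, r) => r != ref }
  }
}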
It seems like you have your answer on how to generate the unique ID. In terms of your larger question, this is what Akka cluster sharding is designed to solve. It will handle distributing shards among your cluster, finding or starting your actors within the cluster and even rebalancing.
http://doc.akka.io/docs/akka/2.3.5/contrib/cluster-sharding.html
There's also an activator with a really nice example.
http://typesafe.com/activator/template/akka-cluster-sharding-scala
I am confused by behavior I am seeing in Akka. Briefly, I have a set of actors performing scientific calculations (star formation simulation). They have some state. When an error occurs such that one or more enter an invalid state, I want to restart the whole set to start over. I also want to do this if a single calc (over the entire set) takes too long (there is no way to predict in advance how long it may run).
So, there is the set of Simulation actors at the bottom of the tree, then a Director above them (that creates them via a Router, and sends them messages via that Router as well). There is one more Director level above that to create Directors on different machines and collect results from them all.
I handle the timeout case by using the Akka Scheduler to create a one-time timeout event, in the local Director, when the simulation is started. When the Director gets this event, if all its Simulation actors have not finished, it does this:
children ! Broadcast(Kill)
where children is the router that owns/created them; this sends a Kill to all the children (the SimulActors).
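For context, the one-shot timeout described above can be set up roughly like this (a sketch only; SimulationTimedOut and this Director skeleton are placeholders, not the actual code):

import scala.concurrent.duration._
import akka.actor.Actor

case object SimulationTimedOut   // hypothetical timeout message

class Director extends Actor {
  import context.dispatcher

  def startRun(): Unit =
    // one-shot timer: if the Simulation actors have not finished by then,
    // the Director receives SimulationTimedOut and broadcasts the Kill
    context.system.scheduler.scheduleOnce(30.minutes, self, SimulationTimedOut)

  def receive = {
    case SimulationTimedOut => // children ! Broadcast(Kill), as above
    case _                  => // collect results, track completion, etc.
  }
}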
What I thought would occur is that all the child actors would be restarted. However, their preRestart() hook method is never called. I see the Kill message received, but that's it.
I must be missing something fundamental here. I have read the Akka docs on this topic and I have to say I find them less than clear (especially the page on Supervisors). I would really appreciate either a thorough explanation of the Kill/restart process, or just some other references (Google wasn't very helpful).
Note, taken from the Akka docs:
"If the child of a router terminates, the router will not automatically spawn a new child. In the event that all children of a router have terminated, the router will terminate itself."
I would consider using a supervision strategy: Akka has built-in behavior for acting on all of a supervisor's children at once (the all-for-one strategy), and you can define the specific directive, e.g. restart.
I think a more idiomatic way to run this would be to have the actors throw an exception if they're not done after a period of time, and then have the supervisor handle that via its supervision strategy.
You could throw a NotDoneException from the child and then define the behaviour like so:
import akka.actor.AllForOneStrategy
import akka.actor.SupervisorStrategy.{Restart, Stop}

// In the parent (Director): the chosen directive is applied to all children at once.
override val supervisorStrategy =
  AllForOneStrategy(maxNrOfRetries = 0) {
    case _: NotDoneException => Stop
    case _: Exception        => Restart
  }
It's important to understand that a restart means stopping the old actor and creating a new, separate object/actor instance.
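A sketch of the child side under those assumptions (NotDoneException is the hypothetical exception named above; here a ReceiveTimeout stands in for an external scheduler):

import scala.concurrent.duration._
import akka.actor.{Actor, ReceiveTimeout}

class NotDoneException(msg: String) extends RuntimeException(msg)

class SimulActor extends Actor {
  // if no message arrives within the deadline, the actor receives ReceiveTimeout
  context.setReceiveTimeout(30.minutes)

  def receive = {
    case ReceiveTimeout => throw new NotDoneException("simulation took too long")
    case work           => // run the next calculation step
  }
}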
References:
http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html
http://doc.akka.io/docs/akka/snapshot/general/supervision.html
How can I check if a remote actor, for which I have obtained an actorRef via actorFor, is alive? Any reference to documentation would be appreciated. I am using Akka from Scala.
I've seen reference to supervisors and deathwatch, but don't really feel my use-case needs such heavy machinery. I just want to have my client check if the master is up using a known path, and if it is send a message introducing itself. If the master is not up, then it should wait for a bit then retry.
Update 2:
Suggestions are that I just use a ping-pong ask test to see if it's alive. I understand this to be something like:
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout

implicit val timeout = Timeout(5.seconds)
val future = actor ? AreYouAlive
try {
  Await.result(future, timeout.duration)
} catch {
  case e: AskTimeoutException => println("It's not there: " + e)
}
I think I've been confused by the presence of exceptions in the logs, which are still there now, e.g.:
Error: java.net.ConnectException:Connection refused
Error: java.nio.channels.ClosedChannelException:null
Perhaps this is just how it works and I must accept the errors/warning in the logs rather than try to protect against them?
Just send it messages. Its machine could become unreachable the nanosecond after you send your message anyway. If you don't get any reply, it is most likely dead. There's a large chapter on this in the docs: http://doc.akka.io/docs/akka/2.0.1/general/message-send-semantics.html
You should never assume that the network is available. Our architect here always says that there are two key concepts that come into play in distributed system design.
They are:
Timeout
Retry
Messages should 'time out' if they don't make it after some period of time, and then you can retry the message. With a timeout you don't have to worry about the specific error, only that the message response has failed. For high levels of availability you may want to consider using tools such as ZooKeeper to handle clustering/availability monitoring. See leader election here, for example: http://zookeeper.apache.org/doc/trunk/recipes.html
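Sketching that timeout-and-retry idea in Akka terms (a hypothetical helper, not part of any library: the ask timeout plays the 'timeout' role and bounded recursion plays 'retry'):

import scala.concurrent.{ExecutionContext, Future}
import akka.actor.ActorRef
import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout

// Ask `actor` and, on timeout, retry up to `retries` more times before failing.
def askWithRetry(actor: ActorRef, msg: Any, retries: Int = 3)
                (implicit timeout: Timeout, ec: ExecutionContext): Future[Any] =
  (actor ? msg).recoverWith {
    case _: AskTimeoutException if retries > 0 => askWithRetry(actor, msg, retries - 1)
  }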