The Service Fabric documentation states that:
Actors may receive duplicate messages from the same client.
Does this hold for reminders as well? If I set a single reminder for my actor instance, could it be called twice at the same time?
My team submitted a similar question to Service Fabric support, and this was their response...
*"If there is a failover (i.e. current primary becomes secondary or primary process crashes) while ‘ReceiveReminderAsync()’ call back is executing or failover kicks in after ‘ReceiveReminderAsync()’ completes but before ActorRuntime does automatic save state and notes down completion, on the new primary this reminder will fire again immediately.
Note that in this scenario, as the new primary comes up and invokes the reminder, the reminder callback in previous primary may be still be executing (and will eventually fail to make any local state changes as replica has become secondary)."*
This behavior is entirely consistent with the reasons a public actor method can be invoked twice.
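For completeness (this is not from the support response), a common mitigation is to make the reminder callback idempotent. A minimal sketch, assuming the actor records when the reminder was last handled in its own state; the state name "lastHandledUtc" and the DoReminderWorkAsync helper are hypothetical:

public async Task ReceiveReminderAsync(string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
{
    // Hypothetical dedup marker persisted in actor state.
    var lastHandled = await StateManager.GetOrAddStateAsync("lastHandledUtc", DateTime.MinValue);
    var now = DateTime.UtcNow;

    // If the previous primary already handled this firing, skip the work.
    if (period > TimeSpan.Zero && now - lastHandled < period)
    {
        return;
    }

    await DoReminderWorkAsync(); // hypothetical: the actual reminder work

    // Saved with the rest of the actor state at the end of the turn; if a failover happens
    // before that save completes, the reminder can still fire (and be handled) twice.
    await StateManager.SetStateAsync("lastHandledUtc", now);
}

The time-window check is deliberately coarse; the real point is that any side effects performed in the callback must tolerate being executed more than once.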
I am currently testing the scaling of my application and I ran into something I did not expect.
The application is running on a 5-node cluster; it has multiple services/actor types and uses a shared process model.
For some components it uses actor events as a best-effort pub/sub system (there are fallbacks in place, so if a notification is dropped there is no issue).
The problem arises when the number of actors (i.e. subscription topics) grows. The actor service is currently partitioned into 100 partitions.
The number of topics at that point is around 160,000, where each topic is subscribed to 1-5 times (on the nodes where it is needed) with an average of 2.5 subscriptions (roughly 400k subscriptions in total).
At that point communication in the cluster starts breaking down: new subscriptions are not created and unsubscribes are timing out.
But it is also affecting other services: internal calls to a diagnostics service are timing out (asking each of the 5 replicas). This is probably due to the resolving of partition/replica endpoints, as outside calls to the web page are fine (these endpoints use the same technology/code stack).
The Event Viewer is full of warnings and errors like:
EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
I've tried scaling the application without this subscribe model active, and I easily reach a workload twice as large without any issues.
So there are a couple of questions:
Are there limits known/advised for actor events?
Would increasing the partition count and/or node count help here?
Is the communication interference logical? Why are other service endpoints having issues as well?
After time spent on the support ticket we found some info, so I will post my findings here in case they help someone.
The actor events use a resubscription model to make sure they are still connected to the actor. By default this is done every 20 seconds. This meant a lot of resources were being used, and eventually the whole system was overloaded with idle threads waiting to resubscribe.
You can decrease the load by setting resubscriptionInterval to a higher value when subscribing. The drawback is that the client may then miss events in the meantime (e.g. if a partition is moved).
To counteract the delay in resubscribing it is possible to hook into the lower-level Service Fabric events. The following pseudo code was offered to me in the support call.
Register for endpoint change notifications for the actor service
fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
{
    var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
    /*
     * Add additional logic for optimizations:
     * - check that the endpoint is not empty
     * - if multiple listeners are registered, check that the endpoint change notification is for the desired endpoint
     * Please note, all the endpoints are sent in the notification. User code should have the logic to cache the
     * endpoint seen during the subscription call and compare it with the newer one.
     */
    List<long> keys;
    if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
    {
        foreach (var key in keys)
        {
            // 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync()
            // 2. Resubscribe by calling ActorProxy.SubscribeAsync()
        }
    }
};
await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
Change the resubscription interval to a value which fits your need.
Cache the partition id to actor id mapping. This cache will be used to resubscribe when the replica's primary endpoint changes (ref #1).
await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /*Tune the value according to the need*/);
ResolvedServicePartition rsp;
((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);
var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
keys.Add(communicationId);
The above approach ensures the following:
The subscriptions are resubscribed at regular intervals.
If the primary endpoint changes in between, the actor proxy resubscribes from the service notification callback.
This ends the pseudo code from the support call.
Answering my original questions:
Are there limits known/advised for actor events?
No hard limits are known; it comes down to resource usage.
Would increasing the partition count and/or node count help here? Increasing the partition count would not; increasing the node count might, but only if it means there are fewer subscribing entities per node.
Is the communication interference logical? Why are other service endpoints having issues as well?
Yes, resource contention is the reason.
I have an application which listens to a stream of events. These events tend to come in chunks: 10 to 20 of them within the same second, with minutes or even hours of silence between them. These events are processed and result in an aggregate state, and this updated state is sent further downstream.
In pseudo code, it would look something like this:
kafkaSource()
.mapAsync(1)((entityId, event) => entityProcessor(entityId).process(event)) // yields entityState
.mapAsync(1)(entityState => submitStateToExternalService(entityState))
.runWith(kafkaCommitterSink)
The thing is that the downstream submitStateToExternalService has no use for 10-20 updated states per second - it would be far more efficient to just emit the last one and only handle that one.
With that in mind, I started looking at whether it would be possible not to emit the state immediately after processing, and instead wait a little while to see if more events are coming in.
In a way, it's similar to conflate, but that emits elements as soon as the downstream stops backpressuring, and my processing is actually fast enough to keep up with the events coming in, so I can't rely on backpressure.
I came across groupedWithin, but this emits elements whenever the window ends (or the max number of elements is reached). What I would ideally want is a time window where the waiting time before emitting downstream is reset by each new element in the group.
Before I implement something to do this myself, I wanted to make sure that I didn't just overlook a way of doing this that is already present in akka-streams, because this seems like a fairly common thing to do.
Honestly, I would make entityProcessor into a cluster-sharded persistent actor.
case class ProcessEvent(entityId: String, evt: EntityEvent)
val entityRegion = ClusterSharding(system).shardRegion("entity")
kafkaSource()
.mapAsync(parallelism) { (entityId, event) =>
entityRegion ? ProcessEvent(entityId, event)
}
.runWith(kafkaCommitterSink)
With this, you can safely increase the parallelism so that you can handle events for multiple entities simultaneously without fear of mis-ordering the events for any particular entity.
Your entity actors would then update their state in response to the process commands and persist the events using a suitable persistence plugin, sending a reply to complete the ask pattern. One way to get the compaction effect you're looking for is for them to schedule the update of the external service after some period of time (after cancelling any previously scheduled update).
There is one potential pitfall with this scheme (it's also a potential issue with a homemade Akka Stream solution to allow n > 1 events to be processed before updating the state): what happens if the service fails between updating the local view of state and updating the external service?
One way you can deal with this is to encode in the entity's state whether the entity is dirty (i.e. has state which hasn't propagated to the external service), and at startup build a list of entities and run through them so that dirty entities update the external state.
If the entities are doing more than just tracking state for publishing to a single external datastore, it might be useful to use Akka Persistence Query to build a full-fledged read-side view to update the external service. In this case, though, since the read-side view's (State, Event) => State transition would be the same as the entity processor's, it might not make sense to go this way.
A midway alternative would be to offload the scheduling etc. to a different actor or set of actors which get told "this entity updated its state" and then schedule an ask of the entity for its current state, with a timestamp of when the state was locally updated. When the response is received, the external service is updated if the timestamp is newer than the last update.
I'm evaluating Service Fabric for an IoT-style application using the model that each device has its own actor, along with other actors in the system. I understand that inactive actors will be garbage-collected automatically but their state will persist for when they are reactivated. I also see there is a way to explicitly delete an actor and its state.
In my scenario I'm wondering if there are any patterns or recommendations on how to handle devices that go dormant, fail or "disappear" and never send another message. Without an explicit delete their state will persist forever and I would like to clean it up automatically, e.g.: after six months.
Here's a method that works.
private async Task Kill()
{
// Do other required cleanup
var actorToDelete = ActorServiceProxy.Create(ServiceUri, Id);
await actorToDelete.DeleteActorAsync(Id, CancellationToken.None).ConfigureAwait(false);
}
Then just call this method using the following line:
var killTask = Task.Run(Kill);
This starts a new task that references the actor; it will be blocked until the current turn has ended. When the task finally gets access to the actor, it deletes it. The beauty is that this can be called from within the actor itself, meaning actors can "self-delete".
You'll have to do this kind of clean-up yourself by writing a "clean-up" service that periodically checks for dormant actors and deletes them. The actor framework doesn't keep track of the last deactivation time, so your individual actors will have to do that themselves (which is easy enough: you can override OnDeactivateAsync in your actor class and save a timestamp there).
This clean-up service can even be your actor service itself, where you can implement RunAsync and do the periodic clean-up work there.
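As a rough sketch of that idea (not official guidance), assume each actor saves a "lastDeactivatedUtc" DateTime from OnDeactivateAsync; a custom ActorService can then sweep its own partition in RunAsync and delete actors that have been dormant for six months. The state name, page size and sweep interval below are illustrative:

// Requires Microsoft.ServiceFabric.Actors, .Actors.Client, .Actors.Query, .Actors.Runtime and System.Fabric.
internal class CleanupActorService : ActorService
{
    public CleanupActorService(StatefulServiceContext context, ActorTypeInformation typeInfo)
        : base(context, typeInfo)
    {
    }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            ContinuationToken continuationToken = null;
            do
            {
                // Page through the actor ids owned by this partition.
                var page = await StateProvider.GetActorsAsync(100, continuationToken, cancellationToken);
                foreach (var actorId in page.Items)
                {
                    // "lastDeactivatedUtc" is the hypothetical timestamp the actor saves in OnDeactivateAsync.
                    if (!await StateProvider.ContainsStateAsync(actorId, "lastDeactivatedUtc", cancellationToken))
                    {
                        continue;
                    }

                    var lastSeen = await StateProvider.LoadStateAsync<DateTime>(actorId, "lastDeactivatedUtc", cancellationToken);
                    if (DateTime.UtcNow - lastSeen > TimeSpan.FromDays(180))
                    {
                        // Delete the dormant actor and all of its persisted state.
                        var proxy = ActorServiceProxy.Create(Context.ServiceName, actorId);
                        await proxy.DeleteActorAsync(actorId, cancellationToken);
                    }
                }

                continuationToken = page.ContinuationToken;
            } while (continuationToken != null);

            // Sweep once a day.
            await Task.Delay(TimeSpan.FromHours(24), cancellationToken);
        }
    }
}

You would register this in place of the default service with ActorRuntime.RegisterActorAsync<MyActor>((context, typeInfo) => new CleanupActorService(context, typeInfo)).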
Actors have a method OnPostActorMethodAsync which is called after every actor method is invoked (unless the method throws an exception, but I believe that's a bug). You could schedule a "kill me" reminder in that method to fire after X period of time. Every time an actor method is called, that time gets pushed back. When the "kill me" reminder finally does fire, simply delete all the actor's state and unregister any reminders. Eventually SF will kick it out of memory, and at that point the actor has essentially been deleted (no longer in memory, no persisted state).
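A minimal sketch of that approach (the reminder name, the 180-day window and the ActorCallType check are illustrative additions); the actor must implement IRemindable:

protected override async Task OnPostActorMethodAsync(ActorMethodContext actorMethodContext)
{
    // Only re-arm on real actor interface calls, not when the reminder or a timer fires.
    if (actorMethodContext.CallType == ActorCallType.ActorInterfaceMethod)
    {
        // Registering a reminder under an existing name replaces it, so every call pushes the deadline back.
        await RegisterReminderAsync(
            "kill-me",
            null,
            dueTime: TimeSpan.FromDays(180),
            period: Timeout.InfiniteTimeSpan); // fire once
    }

    await base.OnPostActorMethodAsync(actorMethodContext);
}

public async Task ReceiveReminderAsync(string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
{
    if (reminderName != "kill-me")
    {
        return;
    }

    // Remove all persisted state so nothing survives deactivation (ToList avoids modifying while enumerating).
    foreach (var stateName in (await StateManager.GetStateNamesAsync()).ToList())
    {
        await StateManager.RemoveStateAsync(stateName);
    }

    // Unregister the reminder itself; once the actor goes idle, SF deactivates it and nothing is left behind.
    await UnregisterReminderAsync(GetReminder("kill-me"));
}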
NEventStore: 5.1
Simple setup: WebApp (Asp.NET 4.5) == command-side
I'm searching for the "right" way to avoid losing commands, with an eye on sagas/process-managers which might otherwise wait endlessly for an event produced by a command that was actually never handled.
Old: Dispatchers
I initially used sync commands, but with an eye on sagas/process-managers I thought it would be safer to first store them and then get them through SyncDispatcher (or AsyncDispatcher). Otherwise, and that's my concern, if a saga tried to send a command and the command didn't finish due to an app crash/power loss/..., it would be lost and no one would know.
So I created a command stream and appended each command to it. The IsDispatched flag showed whether that command had already been handled.
That worked.
PollingClient and Command-Stream
Now that the dispatchers are obsolete, I switched to PollingClient. What I lost is the Dispatched information.
A startup-issue arose:
I naively started polling forward from the current latest checkpoint, but when the application restarted there was a chance that commands had been stored but not executed before the crash, and were therefore lost (which actually happened).
I just came across the idea:
Store the basic outcome of commands as (non-domain) events in another stream.
This stream would contain CommandSucceeded and CommandFailed events.
Whenever the application starts, the latest command id or command checkpoint number gets extracted and used to load the commands right after that one...
Questions
Are my concerns, that sync command-handling leads to the danger of losing a saga-generated command, wrong? If yes, why?
Is this generally a good idea: one big command stream?
Is this generally a good idea: store generic command-outcome-events in a stream?
You can:
Store your command in a command queue | persistent log
Use command id (guid) as Commit Id on NEventStore
Mark your command as executed in your Command Handler | Pipeline Hook | Polling Client
NEventStore gives you idempotency on the same AggregateId (stream id) + CommitId, so if your app crashes before the command is marked as processed and you replay the command, the resulting commits are automatically discarded by NES.
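A minimal sketch of that flow, assuming NEventStore 5.x's synchronous IStoreEvents API; MyCommand, ExecuteDomainLogic and MarkCommandExecuted are placeholders for your own command type, aggregate logic and command log:

public void Handle(IStoreEvents store, MyCommand command)
{
    using (var stream = store.OpenStream(command.AggregateId.ToString()))
    {
        foreach (var evt in ExecuteDomainLogic(command)) // placeholder: run the aggregate
        {
            stream.Add(new EventMessage { Body = evt });
        }

        try
        {
            // The command id doubles as the CommitId, which is what makes a replay idempotent.
            stream.CommitChanges(command.CommandId);
        }
        catch (DuplicateCommitException)
        {
            // This command was already committed before the crash; nothing more to do.
        }
    }

    MarkCommandExecuted(command.CommandId); // placeholder: flag the command as processed in your log
}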
Afaik NEventStore is meant to be the storage for event sourcing, i.e. storing domain objects as a stream of events. Commands and sagas have nothing to do with it. It's your service bus that should take care of durability and saga management.
Personally, I treat the event store simply as a repository detail. The application service (command handler) will dispatch the generated events, after they've been persisted.
If the app crashes and the service bus is durable (not a memory one) then the event/command will be handled again automatically, because the service bus should detect if a message wasn't successfully handled. Of course, your message handlers should be idempotent for that reason.
Let's say I have deployed an NSB endpoint that subscribes to events A,B, and C.
6 months later, version 1.1 of the endpoint adds a handler for event D, but the handler for event B is removed. What is a sensible process for removing the persisted subscription record for event B? I presume there is no automagic way for this to happen, and my choices would be:
Delete the entire contents of the subscription table and restart all endpoints.
Delete selectively based on what I know about the delta
Have some shutdown mode where my subscriber would call Unsubscribe on all its message types on the way down (and therefore would start with a clean slate on the way up)
Has anyone implemented any of these strategies, or am I missing some alternative?
The best solution would probably be option 1. The operational overhead involved in this would be fairly small:
Shut down publisher host
Clear down subscriptions db
Bounce all subscribers
Start up publisher host
Option 3 would also be possible but would involve making an unsubscribe call from every subscriber, which is IMO much higher overhead (plus it would require a redeployment if the unsubscribe call is not already implemented, and then a shutdown to trigger the call). If you did go this way, the call itself is small; see the sketch below.
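For reference, a hedged sketch assuming an NServiceBus 5.x-style IBus (newer versions expose Unsubscribe on the message session instead), with EventA/EventB/EventC standing in for your actual event types:

class SubscriptionCleanup : IWantToRunWhenBusStartsAndStops
{
    readonly IBus bus;

    public SubscriptionCleanup(IBus bus)
    {
        this.bus = bus;
    }

    public void Start() { }

    public void Stop()
    {
        // Clear this endpoint's subscriptions on the way down so it starts with a clean slate next deployment.
        bus.Unsubscribe<EventA>();
        bus.Unsubscribe<EventB>();
        bus.Unsubscribe<EventC>();
    }
}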
Option 2 seems a bit hacky but would be lowest cost as you can just run a sql statement against the publisher db and bob's your mother's brother.
I would recommend option 1.