Service Fabric increasing thread count - azure-service-fabric

We've noticed that the thread count of our actors and WebApi in Service Fabric increases over time, even when the cluster is idle.
Details:
On-premises cluster running version 5.7.198.9494.
Development cluster (i.e. all node types on one box, not multi-machine).
~100 unique actor IDs, each with ~15 actor services (e.g. UserActor, WishlistActor, etc.).
Each of the actor services emits ActorEvents. These are subscribed to by the WebApi, which uses SignalR v2 to send the event data to UI clients.
My gut feeling is that I'm subscribing to ActorEvents incorrectly (related question about subscribing to ActorEvents). Since it's difficult to know whether I've already subscribed to an actor, I might subscribe multiple times. However, I'm not seeing an increase in thread count when I do this, at least not immediately.
Looking at procexp (Process Explorer), the threads using CPU are in clr.dll!InstallCustomModule+0x1c00. There are also tons of threads in ntdll.dll!RtlReleaseSRWLockExclusive+0x1370 that use very little CPU individually, but it adds up over the whole. They don't seem to be released.
Any ideas how to prevent this decay in usability?
Edit: Fixed procexp's name

Related

What are the limits on ActorEvents in Service Fabric?

I am currently testing the scaling of my application and I ran into something I did not expect.
The application is running on a 5-node cluster; it has multiple services/actor types and uses a shared process model.
One component uses actor events as a best-effort pub/sub system (there are fallbacks in place, so a dropped notification is not an issue).
The problem arises when the number of actors (i.e. subscription topics) grows. The actor service is currently partitioned into 100 partitions.
The number of topics at that point is around 160,000, each subscribed to 1-5 times (on the nodes where it is needed), with an average of 2.5 subscriptions (roughly 400k subscriptions in total).
At that point communication in the cluster starts breaking down: new subscriptions are not created and unsubscribes time out.
It also affects other services: internal calls to a diagnostics service time out (asking each of the 5 replicas). This is probably due to the resolving of partition/replica endpoints, since outside calls to the web page are fine (those endpoints use the same technology/code stack).
The Event Viewer is full of warnings and errors like:
EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
I've tried scaling the application without this subscribe model active, and I easily reach a workload twice as large without any issues.
So there are a couple of questions:
Are there limits known/advised for actor events?
Would increasing the partition count and/or node count help here?
Is the communication interference logical? Why are other service endpoints having issues as well?
After time spent on the support ticket we found some info, so I will post my findings here in case they help someone.
Actor events use a resubscription model to make sure they are still connected to the actor. By default this happens every 20 seconds. This consumed a lot of resources, and eventually the whole system was overloaded with idle threads waiting to resubscribe.
You can decrease the load by passing a higher resubscriptionInterval when subscribing. The drawback is that the client may then miss events in the meantime (if a partition is moved).
To counteract the delay in resubscribing, it is possible to hook into the lower-level Service Fabric events. The following pseudocode was offered to me in the support call.
1. Register for endpoint change notifications for the actor service:

fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
{
    var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
    /*
     * Additional logic for optimizations:
     * - check that the endpoint is not empty
     * - if multiple listeners are registered, check that the endpoint change notification is for the desired endpoint
     * Note: all endpoints are sent in the notification. User code should cache the endpoint
     * seen during the subscription call and compare it with the newer one.
     */
    List<long> keys;
    if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
    {
        foreach (var key in keys)
        {
            // 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync().
            // 2. Resubscribe by calling ActorProxy.SubscribeAsync().
        }
    }
};
await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(
    new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
2. Change the resubscription interval to a value that fits your needs.
3. Cache the partition-id-to-actor-id mapping. This cache is used to resubscribe when the replica's primary endpoint changes (see step 1):

await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /* tune the value according to your needs */);

ResolvedServicePartition rsp;
((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);

// resubscriptions maps partition id -> subscription keys, e.g. a ConcurrentDictionary<Guid, List<long>>.
var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
keys.Add(communicationId); // communicationId identifies this subscription in the cache
This approach ensures the following:
Subscriptions are renewed at regular intervals.
If the primary endpoint changes in between, the ActorProxy resubscribes from the service notification callback.
This ends the pseudocode from the support call.
Answering my original questions:
Are there limits known/advised for actor events?
No hard limits, only resource usage.
Would increasing the partition count and/or node count help here? Partition count: no. Node count: maybe, but only if it means there are fewer subscribing entities per node.
Is the communication interference logical? Why are other service endpoints having issues as well?
Yes, resource contention is the reason.

High Scalability Question: How to sync data across multiple microservices

I have the following use cases:
Assume you have two microservices, AccountManagement and ActivityReporting, that process event U.
When a user registers, event U containing the user information is published to a broker for the two microservices to process.
The AccountManagement and ActivityReporting microservices are each replicated across two instances for performance and scalability reasons.
Each microservice instance has a consumer listening on the broker topic. A topic is used so that both AccountManagement and ActivityReporting can process U concurrently.
However, I want only one instance of AccountManagement to process event U, and one instance of ActivityReporting to process event U.
Please share your experience implementing a consume-once-per-application-group broker system, as this would effectively solve the problem.
If all your consumer listeners, even those from different instances, have the same group.id property, then only one of them will receive each message. You need to set this property when you initialise the consumer. So in your case you will need one group.id for AccountManagement and another for ActivityReporting, as shown in the sketch below.
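For illustration, here is a minimal sketch of such a consumer using the Confluent.Kafka client for C#; the broker address, topic name, and group id are assumptions for this example:

using System;
using Confluent.Kafka;

class AccountManagementWorker
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",    // assumed broker address
            GroupId = "account-management",         // one group.id per microservice
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        // Every AccountManagement instance uses this same group.id, so the broker
        // delivers each message on the topic to exactly one of those instances.
        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("user-registered");      // hypothetical topic carrying event U

        while (true)
        {
            var result = consumer.Consume();
            Console.WriteLine($"Processing event U: {result.Message.Value}");
        }
    }
}

An ActivityReporting instance would run the same code with GroupId = "activity-reporting", so both services see every event while each service processes it only once.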
I would recommend Cadence Workflow, which is a much more powerful solution for microservice orchestration.
It offers a lot of advantages over using queues for your use case:
Built-in exponential retries with an unlimited expiration interval.
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed within a configured interval.
Support for long-running, heartbeating operations.
Ability to implement complex task dependencies, for example chaining of calls or compensation logic in case of unrecoverable failures (saga).
Complete visibility into the current state of the update. For example, when using queues all you know is whether there are messages in a queue, and you need an additional DB to track overall progress; with Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.

Akka Actor Messaging Delay

I'm experiencing issues scaling my app with multiple requests.
Each request sends an ask to an actor, which then spawns other actors. This is fine; however, under load (5+ asks at once) the ask takes a massive amount of time to deliver the message to the target actor. The original design was to bulkhead requests evenly, but this is causing a bottleneck. Example:
In this picture, the ask is sent right after the query plan resolver, yet there is a multi-second gap before the actor receives the message. This only happens under load (5+ requests/sec). I first thought this was a starvation issue.
Design:
Each planner-executor is a separate instance for each request. It spawns a new 'Request Acceptor' actor each time (it logs 'requesting score' when it receives a message).
I gave the actor system a custom global executor (a big one). I noticed the threads were not utilized beyond the core thread-pool size, even during this massive delay.
I made sure all execution contexts in the child actors used the correct execution context.
I made sure all blocking calls inside actors used a Future.
I gave the parent actor (and all children) a custom dispatcher with core size 50 and max size 100. It did not request more threads (it stayed at 50) even during these delays.
Finally, I tried creating a totally new ActorSystem for each request (inside the planner-executor). This also had no noticeable effect!
I'm a bit stumped by this. From these tests it does not look like a thread starvation issue. Back at square one, I have no idea why the message takes longer and longer to deliver the more concurrent requests I make. The Zipkin trace before this point does not degrade with more requests until it reaches this ask. Before then, the server is able to handle multiple steps, e.g. verify the request, talk to the DB, and finally go inside the planner-executor, so I doubt the application itself is running out of CPU time.
We had a very similar issue with Akka. We observed huge delays in the ask pattern delivering messages to the target actor at peak load.
Most of these issues are related to heap memory consumption, not dispatcher usage.
We finally fixed them by tuning the configuration below and making these changes.
1) Make sure you stop entities/actors which are no longer required. If it's a persistent actor, you can always bring it back when you need it (see the configuration sketch after this list).
Refer : https://doc.akka.io/docs/akka/current/cluster-sharding.html#passivation
2) If you are using cluster sharding, check akka.cluster.sharding.state-store-mode. By changing this to persistence we gained 50% more TPS.
3) Minimize your log entries (set the log level to info).
4) Tune your logging to publish messages to your logging system frequently; update the batch size, batch count, and interval accordingly so that memory is freed. In our case, a huge amount of heap was used for buffering log messages and sending them in bulk. If the interval is too long you may fill your heap memory, which hurts performance (more GC activity required).
5) Run blocking operations on a separate dispatcher.
6) Use custom serializers (e.g. protobuf) and avoid JavaSerializer.
7) Add the below JAVA_OPTS to your jar:
export JAVA_OPTS="$JAVA_OPTS -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -Djava.security.egd=file:/dev/./urandom"
The main thing is -XX:MaxRAMFraction=2, which lets the JVM use half of the available memory for the heap. By default it is 4, meaning your application will use only a quarter of the available memory, which might not be sufficient.
Refer : https://blog.csanchez.org/2017/05/31/running-a-jvm-in-a-container-without-getting-killed/
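For reference, a minimal configuration sketch covering points 1, 2, and 5 above (key names as documented by Akka; the values are illustrative, not recommendations):

akka.cluster.sharding {
  # Point 2: store sharding coordinator state via persistence instead of ddata.
  state-store-mode = persistence
  # Point 1: automatically passivate entities that have been idle for a while
  # (this setting is available since Akka 2.5.18).
  passivate-idle-entity-after = 2m
}

# Point 5: a dedicated dispatcher for blocking operations, looked up in code
# with system.dispatchers.lookup("blocking-io-dispatcher").
blocking-io-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 16
  }
}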

Why doesn't my Azure Function scale up?

For a test, I created a new function app. I added two functions: an HTTP trigger that, when invoked, pushed 500 messages to a queue, and a queue trigger to read the messages. The queue trigger function code was set up to read a message and randomly sleep from 1 to 30 seconds. This was intended to simulate longer-running tasks.
I invoked the HTTP trigger to create the messages, then watched the queue fill up and the messages get processed by the other trigger. I also wired up Application Insights to this function app, but I did not see it scale beyond 1 server.
Do Azure Functions scale solely on the number of messages in the queue?
Also, I implemented these functions in PowerShell.
If you're running in the Azure Functions consumption plan, we monitor both the length and the throughput of your queue to determine whether additional VM resources are needed.
Note that a single function app instance can process multiple queue messages concurrently without needing to scale across multiple VMs. So if all 500 messages can be consumed relatively quickly (again, in the consumption plan), then it's possible that you won't scale at all.
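For reference, that per-instance queue concurrency is configurable in host.json; a minimal sketch, assuming a v2+ runtime (property names from the Functions documentation, values purely illustrative):

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}

Each instance dequeues up to batchSize messages in parallel and fetches the next batch once the in-flight count drops to newBatchThreshold, which is why a single VM can drain a small backlog on its own.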
The exact algorithm for scaling isn't published (it's subject to lots of tweaking), but generally speaking you can expect the system to automatically scale you out if messages are getting added to the queue faster than your functions can process them. Your app will also scale out if the latency of the first message in the queue is continuously increasing (meaning, messages are sitting idle and not getting processed). The time between VMs getting added is usually in the tens of seconds.
There are some thresholds based on queue count as well. For example, the system tries to ensure that there is at least 1 VM for every 1K queue messages, but usually the scale decisions are based on message throughput as I described earlier.
I think @Chris Gillum put it well: it's hard for us to push the limits of the server to the point that things will start to scale.
Some other options available are:
Use durable functions and scale with Threading:
https://learn.microsoft.com/en-us/azure/azure-functions/durable-functions-cloud-backup
Another method could be to use Event Hubs, which are designed for massive scale. Instead of queues, have function #1 publish an event and have function #2 subscribe via an Event Hub trigger (see the sketch below). Adding Stream Analytics could also be an option to expand capabilities further if needed.
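As a rough C# sketch of that Event Hub pairing (the hub name "work-items", the connection setting name, and the function names are hypothetical; the attributes come from the Microsoft.Azure.WebJobs Event Hubs extension):

using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class EventHubScalingSketch
{
    // Function #1: an HTTP trigger that publishes an event instead of a queue message.
    [FunctionName("Producer")]
    public static void Producer(
        [HttpTrigger(AuthorizationLevel.Function, "post")] string input,
        [EventHub("work-items", Connection = "EventHubConnection")] out string outputEvent,
        ILogger log)
    {
        outputEvent = input;
    }

    // Function #2: consumes from the Event Hub; the runtime fans out across partitions.
    [FunctionName("Consumer")]
    public static void Consumer(
        [EventHubTrigger("work-items", Connection = "EventHubConnection")] string message,
        ILogger log)
    {
        log.LogInformation($"Processing {message}");
    }
}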

How can a (messaging) queue be scalable?

I frequently see queues in software architecture, especially ones described as "scalable", with the Actor from the Akka.io multi-actor platform as a prominent representative. However, how can a queue be scalable if we have to synchronize placing messages into the queue (and therefore operate in a single thread rather than multiple threads), and again synchronize taking messages out of the queue (to ensure each message is taken exactly once)? It gets even more complicated when those messages can change the state of the (actor) system; in that case, even after a message is taken out of the queue, it cannot be load balanced but must still be processed in a single thread.
1. Is it correct that putting messages into a queue must be synchronized?
2. Is it correct that taking messages out of a queue must be synchronized?
3. If 1 or 2 is correct, how is the queue scalable? Doesn't synchronization to a single thread immediately create a bottleneck?
4. How can an (actor) system be scalable if it is stateful?
5. Does a stateful actor/bean mean that I have to process messages in a single thread and in order?
6. Does statefulness mean that I have to have a single copy of the bean/actor for the entire system?
7. If 6 is false, how do I share this state between instances?
8. When I am trying to connect my new P2P node to the network, I believe I need some "server" that tells me who the other peers are; is that correct? When I am trying to download a torrent, I have to connect to a tracker; if there is a "server", why do we call it P2P? And if the tracker goes down, I cannot connect to peers; is that correct?
9. Are synchronization and statefulness destroying scalability?
Is it correct, that putting messages in queue must be synchronized?
Is it correct, that putting messages out of queue must be synchronized?
No.
Assuming we're talking about the Java synchronized keyword, that is a reentrant mutual-exclusion lock on the object. Even multiple threads accessing that lock can be fast as long as contention is low. And each object has its own lock, so there are many locks, each of which only needs to be held for a short time, i.e. it is fine-grained locking.
But even if it did imply locking, queues need not be implemented via mutual-exclusion locks: lock-free and even wait-free queue data structures exist. So the mere presence of locks does not automatically imply single-threaded execution.
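To make that concrete, a minimal C# sketch (the same idea applies on the JVM, e.g. with ConcurrentLinkedQueue): ConcurrentQueue<T> is built on interlocked operations rather than one global lock, so several producers and consumers make progress concurrently while each item is still dequeued exactly once.

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class LockFreeQueueDemo
{
    static async Task Main()
    {
        var queue = new ConcurrentQueue<int>();
        int consumed = 0;

        // Four producers enqueue concurrently; no user-level lock anywhere.
        var producers = Enumerable.Range(0, 4).Select(p => Task.Run(() =>
        {
            for (int i = 0; i < 1000; i++)
                queue.Enqueue(p * 1000 + i);
        })).ToArray();

        // Four consumers dequeue concurrently; TryDequeue hands each item to exactly one caller.
        var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
        {
            while (Volatile.Read(ref consumed) < 4000)
                if (queue.TryDequeue(out _))
                    Interlocked.Increment(ref consumed);
        })).ToArray();

        await Task.WhenAll(producers.Concat(consumers));
        Console.WriteLine($"Consumed {consumed} of 4000 items.");
    }
}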
The rest of your questions should be asked separately because they are not about message queuing.
Of course you are correct in that a single queue is not scalable. The point of the Actor Model is that you can have millions of actors and therefore distribute the load over millions of queues, provided you have that many cores in your cluster. Always remember what Carl Hewitt said:
One Actor is no actor. Actors come in systems.
Each single actor is a fully sequential and single-threaded unit of computation. The whole model is constructed such that it is perfectly suited to describing distribution, though; this means that you create as many actors as you need.
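A tiny C# sketch of that idea (the names are mine, not from any actor library): each actor owns its own mailbox and a single sequential consumer loop, so its state needs no locks, while the system scales by adding more actors rather than more threads per queue.

using System;
using System.Collections.Concurrent;
using System.Threading.Channels;
using System.Threading.Tasks;

class Actor
{
    private readonly Channel<string> mailbox = Channel.CreateUnbounded<string>();
    private int state; // touched only by the single loop below, so no locking is needed

    public Actor()
    {
        // One sequential consumer per actor: processing within an actor is single-threaded.
        Task.Run(async () =>
        {
            await foreach (var message in mailbox.Reader.ReadAllAsync())
                state++; // stand-in for real message handling
        });
    }

    public void Tell(string message) => mailbox.Writer.TryWrite(message);
}

class ActorSystemDemo
{
    static void Main()
    {
        // Load is spread over many small queues instead of funnelled into one big one.
        var actors = new ConcurrentDictionary<int, Actor>();
        for (int i = 0; i < 100_000; i++)
            actors.GetOrAdd(i % 1_000, _ => new Actor()).Tell($"message {i}");
        // The demo exits without draining the mailboxes; a real system would
        // complete the writers and await the consumer loops.
    }
}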