I am creating a number of Service Bus clients in the RunAsync method of a stateful service.
The method takes longer than the 4 seconds allowed when running in the development environment, and the application fails to start.
It fails because of what appears to be latency between my dev machine and Azure.
I am in Thailand. Azure is in South Central US.
Can that 4000 millisecond timeout be increased locally on my dev box?
The error I get is:
{
  "Message": "RunAsync has been cancelled for a stateful service replica. The cancellation will be considered 'slow' if RunAsync does not halt execution within 4000 milliseconds.",
  "Level": "Informational",
  "Keywords": "0x0000F00000000000",
  "EventName": "StatefulRunAsyncCancellation",
  "Payload": {
    [...]
    "slowCancellationTimeMillis": 4000.0
  }
}
The problem is not that RunAsync takes too long to run, it's that it takes too long to cancel.
Service Fabric has a default (hard-coded) timeout of 4 seconds for reporting "slow" cancellations. That just means that after that time it will produce a health event (which may be important during upgrades, as it affects the health status of the entire application).
The real problem is the implementation of RunAsync. When you remove a service or upgrade one, Service Fabric will wait (potentially forever) for it to shut down gracefully. Your code should respect the CancellationToken passed to the RunAsync method. For example, if you have multiple IO operations, call cancellationToken.ThrowIfCancellationRequested() after each one (whenever possible) and pass it to any method that accepts it as a parameter.
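For example, a minimal sketch of a cancellation-aware RunAsync (CreateServiceBusClientsAsync and ProcessNextMessageAsync are hypothetical placeholders, not an actual API):

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // Hypothetical helper that creates the Service Bus clients; forwarding the token lets
    // slow network calls be abandoned as soon as cancellation is requested.
    var clients = await CreateServiceBusClientsAsync(cancellationToken);

    while (true)
    {
        // Throws OperationCanceledException when Service Fabric asks the replica to stop,
        // so RunAsync halts quickly instead of being reported as a 'slow' cancellation.
        cancellationToken.ThrowIfCancellationRequested();

        // Placeholder for the real work; pass the token to every call that accepts it.
        await ProcessNextMessageAsync(clients, cancellationToken);
    }
}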
Related
I have a stateless service that pulls messages from an Azure queue and processes them. The service also starts some threads in charge of cleanup operations. We recently ran into an issue where these threads, which ideally should have been killed when the service shuts down, continued to remain active (definitely a bug in our service shutdown process).
Looking further at the logs, it seemed that the RunAsync method's cancellation token received a cancellation request, and later, within the same process, a new instance of the stateless service that was registered in ServiceRuntime.RegisterServiceAsync was created.
Is it expected behavior that Service Fabric can re-use the same process to start a new instance of the stateless service after shutting down the current instance?
The Service Fabric documentation (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-hosting-model) does seem to suggest that different services can be hosted in the same process, but I could not find the above behavior mentioned there.
In the shared process model, there's one process per ServicePackage instance on every node. Adding or replacing services will reuse the existing process.
This is done to save some (process-level) overhead on the node that runs the service, so hosting is more cost-efficient. Also, it enables port sharing for services.
You can change this (default) behavior by configuring 'exclusive process' mode in the application manifest, so every replica/instance runs in a separate process:
<Service Name="VotingWeb" ServicePackageActivationMode="ExclusiveProcess">
As you mentioned, you can monitor the CancellationToken to abort the separate worker threads, so they can stop when the service stops.
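A minimal sketch of that pattern, assuming the cleanup work is started from RunAsync (CleanupLoopAsync, PullAndProcessMessageAsync and DoCleanup are illustrative placeholders):

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // Start the cleanup worker with the same token Service Fabric cancels on shutdown,
    // so it stops together with the service instance instead of outliving it.
    Task cleanupTask = CleanupLoopAsync(cancellationToken);

    try
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            await PullAndProcessMessageAsync(cancellationToken); // placeholder for the queue processing
        }
    }
    catch (OperationCanceledException)
    {
        // Expected during shutdown.
    }

    // Wait for the worker to observe the cancellation before RunAsync returns.
    try { await cleanupTask; } catch (OperationCanceledException) { }
}

private static async Task CleanupLoopAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        DoCleanup(); // placeholder for the cleanup work
        await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
    }
}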
More info about the app model and configuring the process activation mode can be found in the Service Fabric hosting model documentation linked above.
I'm experimenting with Cloud Run (fully managed) and Cloud Tasks, and I have been seeing weird results in terms of latency.
I have a queue in Cloud Tasks that invokes an API in Cloud Run. The tasks are created with the following values:
{
'http_request': {
'http_method': 'GET',
'url': 'https://<app>.run.app/<endpoint>'
}
}
I have tried many queue configurations, same results.
Cloud Run runs a Python server that processes the request (nothing fancy) and returns a response.
The problem is that the latency is very high (~15 min) for requests coming from the queue; however, if I curl the endpoint directly (curl https://<app>.run.app/<endpoint>), it only takes a couple of seconds.
The ~15 min is the value I get for Cloud Run's request latency metric; it does not include the delay in the queue.
I also found this in the known issues but it refers to custom domains, which I'm not using, so I'm not sure it is the same problem.
Has anyone faced (and hopefully solved 😊) something similar? What could I be doing wrong?
I have a case where a decision times out after 5 seconds even though the timeout is set to 10:
17 2019-06-13T17:46:59Z DecisionTaskScheduled {TaskList:{Name:maxim-C02XD0AAJGH6:db09fd84-98bf-4546-a0d8-fb51e30c2b41}, StartToCloseTimeoutSeconds:10, Attempt:0}
18 2019-06-13T17:47:04Z DecisionTaskTimedOut {ScheduledEventId:17, StartedEventId:0, TimeoutType:SCHEDULE_TO_START}
It is using the Cadence service running in local Docker, and I can reproduce it reliably.
The 5s timeout is due to the Cadence Sticky Execution feature. Sticky Execution is enabled by default on the Cadence worker and allows the workflow state to be cached on the worker after it responds with decisions. This lets the Cadence server dispatch new decision tasks directly to the same worker, which can then reuse the cached state and produce new decisions without replaying the entire execution history.
The decision SCHEDULE_TO_START timeout is in place so that the decision can be sent to another worker when the original worker restarts and there is no poller on the sticky tasklist for a workflow execution. When it fires, the Cadence server clears the stickiness for that execution and dispatches the decision to the original tasklist, so it can be picked up by any other worker. The timeout is controlled by the worker option shown below:
// Optional: Sticky schedule to start timeout.
// default: 5s
// The resolution is seconds. See details about StickyExecution on the comments for DisableStickyExecution.
StickyScheduleToStartTimeout time.Duration
In a Service Fabric cluster I have a stateless service with a while(true) loop that runs continuously in the RunAsync method. Due to this while loop I am finding it hard to delete the application from the cluster: an error occurs every time I try to delete it, stating that the process cannot be detached. Normally I try to deploy the application once to remove the code; to redeploy the code on top of the application I have to deploy twice. Is there a workaround for this without removing the infinite while loop?
Update: RunAsync method
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // making sure the thread is active
    while (true)
    {
        // do something
    }
}
Thank you for the input.
During shutdown, the cancellation token passed to RunAsync is canceled.
You need to check the cancellation token's IsCancellationRequested property in your main loop. When it becomes true, exit the loop; alternatively, calling the token's ThrowIfCancellationRequested method throws an OperationCanceledException once cancellation has been requested, which also ends RunAsync.
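For example, a minimal cancellation-aware version of the loop above (DoSomething stands in for the placeholder in the question):

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Exits the loop promptly when Service Fabric cancels the token during a delete or upgrade.
        cancellationToken.ThrowIfCancellationRequested();

        DoSomething(); // placeholder for the real work

        // Optional pause between iterations; passing the token ends the delay immediately on shutdown.
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}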
If your service does not respond to these API calls in a reasonable amount of time, Service Fabric can forcibly terminate your service. Usually this only happens during application upgrades or when a service is being deleted. This timeout is 15 minutes by default.
See this document for a good reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle#stateless-service-shutdown
We are wondering if there is a built-in way to warm up services as part of service upgrades in Service Fabric, similar to the various ways you can warm up, e.g., IIS-based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery, etc.) before being considered started and available for other services to contact. This warm-up should be part of the upgrade-domain processing, so the upgrade process should wait for the warm-up to complete and the service to report OK/Ready.
How are others handling such scenarios and controlling the process for signalling to Service Fabric that a specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec: The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit health events yourself. For instance, you can report health 'Unknown' while warming up, and adjust your health policy (HealthCheckWaitDurationSec) to check this.
Reporting health can help, but you can't report Unknown; you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact the upgrade. To clear the Error, your service can report health state Ok with RemoveWhenExpired=true and a low TTL (read more on how to report); a minimal sketch of this reporting pattern follows the list below.
You must increase HealthCheckRetryTimeout based on the maximum warm-up time. Otherwise, if a health check is performed while the application is still evaluated as Error, the upgrade will fail (and roll back or pause, per your policy).
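If you drive the upgrade from code, these values live on the monitored upgrade policy. A rough sketch using FabricClient from System.Fabric (the application name, version and durations are made-up values):

using System;
using System.Fabric;
using System.Fabric.Description;

// Sketch: start a monitored upgrade whose health checks allow for a long warm-up.
var upgradeDescription = new ApplicationUpgradeDescription
{
    ApplicationName = new Uri("fabric:/MyApp"),      // hypothetical application
    TargetApplicationTypeVersion = "2.0.0",          // hypothetical target version
    UpgradePolicyDescription = new MonitoredRollingApplicationUpgradePolicyDescription
    {
        UpgradeMode = RollingUpgradeMode.Monitored,
        MonitoringPolicy = new RollingUpgradeMonitoringPolicy
        {
            FailureAction = UpgradeFailureAction.Rollback,
            HealthCheckWaitDuration = TimeSpan.FromSeconds(60),  // minimum expected warm-up time
            HealthCheckRetryTimeout = TimeSpan.FromMinutes(20),  // must cover the maximum warm-up time
            HealthCheckStableDuration = TimeSpan.FromSeconds(60)
        }
    }
};

using (var fabricClient = new FabricClient())
{
    await fabricClient.ApplicationManager.UpgradeApplicationAsync(upgradeDescription);
}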
So, the order of events is:
your service reports Error - "Warming up in progress"
upgrade waits for the fixed HealthCheckWaitDurationSec (you can set this to the minimum time needed to warm up)
upgrade performs health checks: if the service hasn't warmed up yet, the health state is Error, so the upgrade retries until either HealthCheckRetryTimeout is reached or your service is no longer in Error (warm-up completed and your service cleared the Error)
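A minimal sketch of that reporting pattern from inside the service (this assumes a stateless service, running where this.Partition is available such as RunAsync; a stateful service would call ReportReplicaHealth instead; the source and property names are made up):

// HealthInformation and HealthState live in System.Fabric.Health.
// Before warm-up starts: report Error so the upgrade's health checks hold this upgrade domain.
var warmingUp = new HealthInformation("WarmUpSource", "WarmUp", HealthState.Error)
{
    Description = "Warming up in progress"
};
this.Partition.ReportInstanceHealth(warmingUp);

// ... cache loading, recovery, etc. ...

// When warm-up completes: clear the Error with an Ok report that expires and is then removed.
var warmedUp = new HealthInformation("WarmUpSource", "WarmUp", HealthState.Ok)
{
    TimeToLive = TimeSpan.FromMinutes(5),
    RemoveWhenExpired = true
};
this.Partition.ReportInstanceHealth(warmedUp);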