In a Service Fabric cluster I have a stateless service with a while(true) loop running continuously in the RunAsync method. Because of this loop I am finding it hard to delete the application from the cluster. Every time I try to delete it, an error occurs stating that the process cannot be detached. Normally I deploy the application once to remove the code; to redeploy the code on top of the application I have to deploy twice. Is there a workaround for this without removing the infinite while loop?
Update: RunAsync method
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // making sure the thread is active
    while (true)
    {
        // do something
    }
}
Thank you for the input.
During shutdown, the cancellation token passed to RunAsync is canceled.
You need to check the cancellation token's IsCancellationRequested property in your main loop. Alternatively, calling the token's ThrowIfCancellationRequested method throws an OperationCanceledException once cancellation has been requested, which also ends the loop.
If your service does not respond to these API calls in a reasonable amount of time, Service Fabric can forcibly terminate your service. Usually this only happens during application upgrades or when a service is being deleted. This timeout is 15 minutes by default.
See this document for a good reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle#stateless-service-shutdown
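For example, the loop from the question can keep its shape and still shut down cleanly (a minimal sketch; DoSomethingAsync is a hypothetical stand-in for the loop body):
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Throws OperationCanceledException once Service Fabric requests
        // shutdown; the runtime treats that as a graceful exit.
        cancellationToken.ThrowIfCancellationRequested();

        await DoSomethingAsync(cancellationToken); // hypothetical work method
    }
}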
I have a stateless service that pulls messages from an Azure queue and processes them. The service also starts some threads in charge of cleanup operations. We recently ran into an issue where these threads, which ideally should have been killed when the service shuts down, continued to remain active (definitely a bug in our service shutdown process).
Looking further at the logs, it seemed that the RunAsync method's cancellation token received a cancellation request, and that later, within the same process, a new instance of the stateless service registered in ServiceRuntime.RegisterServiceAsync was created.
Is it expected behavior that Service Fabric can reuse the same process to start a new instance of the stateless service after shutting down the current instance?
The Service Fabric documentation https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-hosting-model does seem to suggest that different services can be hosted in the same process, but I could not find the above behavior mentioned there.
In the shared process model, there's one process per ServicePackage instance on every node. Adding or replacing services will reuse the existing process.
This is done to save some (process-level) overhead on the node that runs the service, so hosting is more cost-efficient. Also, it enables port sharing for services.
You can change this (default) behavior by configuring 'exclusive process' mode in the application manifest, so every replica/instance will run in a separate process:
<Service Name="VotingWeb" ServicePackageActivationMode="ExclusiveProcess">
As you mentioned, you can monitor the CancellationToken to abort the separate worker threads, so they can stop when the service stops.
More info about the app model and configuring the process activation mode can be found in the hosting model documentation linked above.
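A minimal sketch of tying a cleanup worker to the service's lifetime (CleanupLoopAsync and ProcessNextMessageAsync are hypothetical names):
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // Hand the service's token to the background worker so it stops with
    // this instance instead of lingering in a reused shared process.
    Task cleanupTask = CleanupLoopAsync(cancellationToken);

    while (!cancellationToken.IsCancellationRequested)
    {
        await ProcessNextMessageAsync(cancellationToken); // hypothetical
    }

    await cleanupTask; // observe the worker's completion
}

private static async Task CleanupLoopAsync(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        // ... periodic cleanup work ...
        try
        {
            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
        }
        catch (OperationCanceledException)
        {
            return; // service is shutting down
        }
    }
}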
I have deployed an application to a 5-node standalone cluster. The deployment succeeded, but the application did not start because of a bug in the application.
I tried removing the application from the cluster using the Service Fabric Explorer but this fails.
The health State of the application is “Error” and the status is “Deleting”
The application has 9 services. 6 services show a Health State “Unknown” with a question mark and a Status “Unknown”. 3 services show a health state “Ok” but with a Status “Deleting”.
I have also tried to remove it using PowerShell:
Remove-ServiceFabricApplication -ApplicationName fabric:/appname -Force -ForceRemove
The result was an "Operation timed out" error.
I also tried the script below that I found in some other post.
Connect-ServiceFabricCluster -ConnectionEndpoint localhost:19000
$nodes = Get-ServiceFabricNode
foreach($node in $nodes)
{
$replicas = Get-ServiceFabricDeployedReplica -NodeName $node.NodeName - ApplicationName "fabric:/MyApp"
foreach ($replica in $replicas)
{
Remove-ServiceFabricReplica -ForceRemove -NodeName $node.NodeName -PartitionId $replica.Partitionid -ReplicaOrInstanceId $replica.ReplicaOrInstanceId
}
}
This also produced no result; the script did not find any replicas to remove.
Around the time we started removing the application, one of the system services also changed state.
The fabric:/System/NamingService service shows a “Warning” health state.
This is on partition 00000000-0000-0000-0000-000000001002.
The primary replica shows:
Unhealthy event: SourceId='System.NamingService', Property='Duration_PrimaryRecovery', HealthState='Warning', ConsiderWarningAsError=false.
The PrimaryRecovery started at 2016-10-06 07:55:21.252 is taking longer than 30:00.000.
I also restarted every node (one at a time) with no result.
How can I force removal of the application without recreating the cluster? Recreating it is not an option for a production environment.
Yeah, this can happen if you don't allow your code to exit RunAsync or the Open/Close calls of your ICommunicationListener.
Some background:
Your service has a lifecycle that is driven by Service Fabric. A small component in your service - you know it as FabricRuntime - drives this. For stateless service instances, it's a simple open/close lifecycle. For stateful services, it's a bit more complex: a stateful service replica opens and closes, but also changes role between primary, secondary, and none. Lifecycle changes are initiated by Service Fabric and show up as a method call or a cancellation token trigger in your code. For example, when a replica is switched to primary, we call your RunAsync method. When it switches from primary to something else, or needs to shut down, the cancellation token is triggered. Either way, the system waits for you to finish your work.
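For a stateless instance, those lifecycle points show up roughly as overrides like these (a minimal sketch; MyStatelessService is a hypothetical class):
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class MyStatelessService : StatelessService
{
    public MyStatelessService(StatelessServiceContext context)
        : base(context) { }

    protected override Task OnOpenAsync(CancellationToken cancellationToken)
    {
        // Called when Service Fabric opens the instance.
        return Task.CompletedTask;
    }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // Main work; the token is triggered when the instance must stop.
        // Task.Delay throws OperationCanceledException on cancellation,
        // which the runtime accepts as a graceful exit.
        while (true)
        {
            await Task.Delay(1000, cancellationToken);
        }
    }

    protected override Task OnCloseAsync(CancellationToken cancellationToken)
    {
        // Graceful close; return promptly, or deletes get stuck.
        return Task.CompletedTask;
    }
}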
When you go to delete a service, we tell your service to change role and close. If your code doesn't respond, it will get stuck in that state.
To get out of that state, you can run Remove-ServiceFabricReplica -ForceRemove. This essentially drops the replica from the system - as far as Service Fabric is concerned, the replica is gone. But your process is still running, so you have to go in and kill the process too.
The error in the script is the '- ApplicationName' (note the stray space), which should be '-ApplicationName'.
And after correcting the parameter, this DID remove the hosed-up pieces and got me back to a state where I could correct and redeploy the application to the cluster.
I am creating a number of Service Bus clients in the RunAsync method of a stateful service.
The method takes longer than the 4 seconds allowed when running in the development environment, and the application fails to start.
It fails because of what appears to be latency between my dev machine and Azure.
I am in Thailand. Azure is in South Central US.
Can that 4000 millisecond timeout be increased locally on my dev box?
The error I get is:
{
  "Message": "RunAsync has been cancelled for a stateful service replica. The cancellation will be considered 'slow' if RunAsync does not halt execution within 4000 milliseconds.",
  "Level": "Informational",
  "Keywords": "0x0000F00000000000",
  "EventName": "StatefulRunAsyncCancellation",
  "Payload": {
    [...]
    "slowCancellationTimeMillis": 4000.0
  }
}
The problem is not that RunAsync takes too long to run, it's that it takes too long to cancel.
Service Fabric has a default (hard-coded) timeout of 4 seconds for reporting "slow" cancellations. That just means that after that time it will produce a health event (which may matter during upgrades, as it affects the health status of the entire application).
The real problem is the implementation of RunAsync. When you remove a service or upgrade one, Service Fabric will wait (potentially forever) for it to shut down gracefully. Your code should respect the CancellationToken passed to the RunAsync method. For example, if you have multiple IO operations, call cancellationToken.ThrowIfCancellationRequested() after each one (whenever possible) and pass it to any method that accepts it as a parameter.
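A minimal sketch of that pattern (FirstIoAsync and SecondIoAsync are hypothetical stand-ins for the real IO calls):
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Pass the token into each IO call and check it between calls,
        // so a delete or upgrade is never blocked for long.
        await FirstIoAsync(cancellationToken);
        cancellationToken.ThrowIfCancellationRequested();

        await SecondIoAsync(cancellationToken);
        cancellationToken.ThrowIfCancellationRequested();
    }
}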
I'm working on an Azure Service Fabric solution.
I have implemented an Actor with some state, everything works fine.
Can anyone explain what happens if a user exception is thrown in my Actor implementation? What does the Service Fabric environment do if a call to an actor throws an exception? Is there any default retry logic that forces the call again?
If an actor throws an exception, it gets handled inside ActorRemotingExceptionHandler or another default implementation of IExceptionHandler. Currently, if the exception is an ordinary exception, not related to network issues or cluster/node availability, it will be rethrown on the client side, where you will be able to handle it.
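On the client side that looks roughly like this (a sketch; IMyActor and DoWorkAsync are hypothetical):
// Assumes an actor interface such as:
// public interface IMyActor : IActor { Task DoWorkAsync(); }
IMyActor proxy = ActorProxy.Create<IMyActor>(new ActorId("user-42"), "fabric:/MyApp");
try
{
    await proxy.DoWorkAsync();
}
catch (InvalidOperationException)
{
    // An ordinary exception thrown inside the actor surfaces here,
    // so the client decides whether to retry or handle it.
}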
I'm trying to test out various performance aspects of Azure Service Fabric to understand how it all works, and have hit some problems when cancelling a service.
In one particular test, I create 101 tasks: 50 reading from a queue, 50 writing to a queue, and 1 showing and reporting progress.
When the service gets stopped, for example by redeploying the application, I can see that the cancellation token receives the request and some tasks cancel, but I see a lot of events in the Event Viewer that basically say:
AsyncCalloutAdapter-22542743: end delegate threw an exception
System.OperationCanceledException: Operation canceled. ---> System.Runtime.InteropServices.COMException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))
at System.Fabric.Interop.NativeRuntime.IFabricStateReplicator2.EndReplicate(IFabricAsyncOperationContext context)
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of inner exception stack trace ---
The only way to recover is to reset the local cluster.
When using a smaller number of tasks it all seems to work OK.
This is all using a local development cluster.