Service fabric fails to roll back application when deployment fails - azure-service-fabric

I have a 3 node cluster for service fabric where the deployment is stuck for 10hr on the third node. Looking at the SF explorer we saw that there is wrong SQL creds being passed hence the deployment is stuck.
1) Why is SF recognizing it at a "warning" rather than an "Error"
2) Why is it stuck and not doing a roll back?
3) Is there extra setting I need to do so it does auto rollback sooner?

Generally, it rollback when the deployment fail, but it will depend on the parameter you pass for the upgrade, like FailureAction, UpgradeMode and Timeouts.
UpgradeMode values can be:
Monitored: Indicates that the upgrade mode is monitored. After the cmdlet finishes an upgrade for an upgrade domain, if the health of the upgrade domain and the cluster meet the health policies that you define, Service Fabric upgrades the next upgrade domain. If the upgrade domain or cluster fails to meet health policies, the upgrade fails and Service Fabric rolls back the upgrade for the upgrade domain or reverts to manual mode per the policy specified on FailureAction. This is the recommended mode for application upgrades in a production environment.
Unmonitored Auto: Indicates that the upgrade mode is unmonitored automatic. After Service Fabric upgrades an upgrade domain, Service Fabric upgrades the next upgrade domain irrespective of the application health state. This mode is not recommended for production, and is only useful during development of an application.
Unmonitored Manual: Indicates that the upgrade mode is unmonitored manual. After Service Fabric upgrades an upgrade domain, it waits for you to upgrade the next upgrade domain by using the Resume-ServiceFabricApplicationUpgrade cmdlet.
FailureAction is the compensating action to perform when a Monitored upgrade encounters monitoring policy or health policy violations. The values can be:
Rollback specifies that the upgrade will automatically roll back to the pre-upgrade version.
Manual indicates that the upgrade will switch to the UnmonitoredManual upgrade mode.
Invalid indicates that the failure action is invalid and does nothing.
Given that, if the upgrade is not set as Monitored for UpgradeMode and Rollback for FailureAction, the upgrade will wait a manual action from the operator(user).
If the upgrade is already set to these values, the problem can be either:
The health check and retries are too long, preventing the upgrade to fail quickly, an example is when you HealthCheckDuration is too long or there are too much delay between checks.
The old version is also failing
The following docs give all details: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-parameters

Related

Whole Application level rolling update

My kubernetes application is made of several flavors of nodes, a couple of “schedulers” which send tasks to quite a few more “worker” nodes. In order for this app to work correctly all the nodes must be of exactly the same code version.
The deployment is performed using a standard ReplicaSet and when my CICD kicks in it just does a simple rolling update. This causes a problem though since during the rolling update, nodes of different code versions co-exist for a few seconds, so a few tasks during this time get wrong results.
Ideally what I would want is that deploying a new version would create a completely new application that only communicates with itself and has time to warm its cache, then on a flick of a switch this new app would become active and start to get new client requests. The old app would remain active for a few more seconds and then shut down.
I’m using Istio sidecar for mesh communication.
Is there a standard way to do this? How is such a requirement usually handled?
I also had such a situation. Kubernetes alone cannot satisfy your requirement, I was also not able to find any tool that allows to coordinate multiple deployments together (although Flagger looks promising).
So the only way I found was by using CI/CD: Jenkins in my case. I don't have the code, but the idea is the following:
Deploy all application deployments using single Helm chart. Every Helm release name and corresponding Kubernetes labels must be based off of some sequential number, e.g. Jenkins $BUILD_NUMBER. Helm release can be named like example-app-${BUILD_NUMBER} and all deployments must have label version: $BUILD_NUMBER . Important part here is that your Services should not be a part of your Helm chart because they will be handled by Jenkins.
Start your build with detecting the current version of the app (using bash script or you can store it in ConfigMap).
Start helm install example-app-{$BUILD_NUMBER} with --atomic flag set. Atomic flag will make sure that the release is properly removed on failure. And don't delete previous version of the app yet.
Wait for Helm to complete and in case of success run kubectl set selector service/example-app version=$BUILD_NUMBER. That will instantly switch Kubernetes Service from one version to another. If you have multiple services you can issue multiple set selector commands (each command executes immediately).
Delete previous Helm release and optionally update ConfigMap with new app version.
Depending on your app you may want to run tests on non user facing Services as a part of step 4 (after Helm release succeeds).
Another good idea is to have preStop hooks on your worker pods so that they can finish their jobs before being deleted.
You should consider Blue/Green Deployment strategy

Upgrade domain selection for service fabric

If I have say few upgrade domains in service fabric. how does service fabric selects upgrade domains while performing upgrades?
From https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade#rolling-upgrades-overview
Update domains do not receive updates in a particular order.
However, using the start-servicefabricclusterupgrade commandlet, you can specify -SortOrder, which
Defines the order in which an upgrade proceeds through the cluster.
Note that Default
Indicates that the default sort order (as specified in cluster manifest) will be used.
In my experience (mostly on-prem standalone clusters) for Configuration updates, I'm 99% sure it does them in sequential order: UD0, UD1, etc.

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialize due to this 404 error response.
I have noticed that during a deployment fabric apparently shuffles around the old version of the service spinning up new primaries as needed to free up the vm it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail causing a rollback.
My problem is, once I create a new storage account, I am still left seemingly no way to bring the existing services back to healthy states. My existing services are using Storage account urls with AccountKeys that no longer exists in azure. Attempts to upgrade fail because the old service instances can’t instantiate due to now bad configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommend solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade to change the account information a configuration only upgrade. This would ensure that SF tries to change the config in-place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

How isUpgrade setting affects deployment process in Service Fabric Application Deployment task in Azure DevOps

Azure Devops has a standard task for deploying apps to ServiceFabric. The task is named Service Fabric Application Deployment and is documented here. Among other settings, it contains an optional boolean isUpgrade setting (default value 'true'). I tried to set it explicitly to true and false, but I did not find any difference in the behavior of the task. In both cases, the deployment was successful, all previously deployed packages were still provisioned, and Azure Pipelines logs were the same. The time of the deployment was the same, too.
My question is what the setting affects? Maybe, somebody has used it in his CI pipelines.
There are 2 types of deployment in Service Fabric. This isUpgrade flag controls which type op upgrade you are executing.
Regular
Basically this removes the old application and deploys the new version. So if you have Statefull services, this will remove all state. You will have downtime when you do a regular upgrade.
Upgrade
An upgrade will do a lot of things, It will keep the state, it will do health checking, make sure the services are available. Does a rollback when the healthcheck fails, ...
If your application or services didn't change, nothing changes in your cluster.
Typically an upgrade will take more time (This is highly dependent on your health check rules). See the application upgrade flowchart
More info about the 2 types
If you look at the code of the task. You see that it will only take effect if overridePublishProfileSettings is true. Otherwise the PulishProfile.xml is used.

Warmup services on upgrade in Service Fabric

We are wondering if there is a built-in way to warm up services as part of the service upgrades in Service Fabric, similar to the various ways you could warm up e.g. IIS based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery etc.) before being considered as started and available for other services to contact. This warmup should be part of the upgrade domain processing so the upgrade process should wait for the warmup to be completed and the service reported as OK/Ready.
How are others handling such scenarios, controlling the process for signalling to the service fabric that the specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit Health events yourself. For instance, you can report health 'Unknown' while warming up. And adjust your health policy (HealthCheckWaitDurationSec) to check this.
Reporting health can help. You can't report Unknown, you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact upgrade. To clear the Error, your service can report health state Ok, RemoveWhenExpired=true, low TTL (read more on how to report).
You must increase HealthCheckRetryTimeout based on the max warm up time. Otherwise, if a health check is performed and cluster is evaluated to Error, the upgrade will fail (and rollback or pause, per your policy).
So, the order the events is:
your service reports Error - "Warming up in progress"
upgrade waits for fixed HealthCheckWaitDurationSec (you can set this to min time to warm up)
upgrade performs health checks: if the service hasn't yet warmed up, the health state is Error, so upgrade retries until either HealthCheckRetryTimeout is reached or your service is not in Error anymore (warm up completed and your service cleared the Error).