Upgrade domain selection for service fabric - azure-service-fabric

Say I have a few upgrade domains in Service Fabric. How does Service Fabric select the order of upgrade domains while performing an upgrade?

From https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade#rolling-upgrades-overview
Update domains do not receive updates in a particular order.
However, with the Start-ServiceFabricClusterUpgrade cmdlet you can specify -SortOrder, which
Defines the order in which an upgrade proceeds through the cluster.
Note that the value Default
Indicates that the default sort order (as specified in cluster manifest) will be used.
In my experience (mostly on-prem standalone clusters), for configuration updates I'm 99% sure it processes them in sequential order: UD0, UD1, etc.
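To make the effect of a sort order concrete, here is a minimal Python sketch of how different orderings would walk a set of upgrade domain names. It is only an illustration: the value names (Numeric, Lexicographical, ReverseNumeric, ReverseLexicographical) are my recollection of the cmdlet documentation, the helper `order_upgrade_domains` is hypothetical, and how Service Fabric actually compares UD names internally is an assumption.

```python
import re

def order_upgrade_domains(domains, sort_order="Numeric"):
    """Return UD names in the order a given sort order would visit them (toy model)."""
    def numeric_key(name):
        # Assumption: compare by the trailing integer in names like "UD10".
        match = re.search(r"(\d+)$", name)
        return int(match.group(1)) if match else -1

    if sort_order == "Numeric":
        return sorted(domains, key=numeric_key)
    if sort_order == "ReverseNumeric":
        return sorted(domains, key=numeric_key, reverse=True)
    if sort_order == "Lexicographical":
        return sorted(domains)
    if sort_order == "ReverseLexicographical":
        return sorted(domains, reverse=True)
    raise ValueError(f"unsupported sort order: {sort_order}")

uds = ["UD10", "UD2", "UD0", "UD1"]
print(order_upgrade_domains(uds, "Numeric"))          # ['UD0', 'UD1', 'UD2', 'UD10']
print(order_upgrade_domains(uds, "Lexicographical"))  # ['UD0', 'UD1', 'UD10', 'UD2']
```

The difference only shows up once there are ten or more domains, which may be why small clusters appear to upgrade strictly as UD0, UD1, and so on.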

Related

Service fabric fails to roll back application when deployment fails

I have a 3-node Service Fabric cluster where a deployment has been stuck on the third node for 10 hours. Looking at SF Explorer, we saw that the wrong SQL credentials are being passed, which is why the deployment is stuck.
1) Why is SF reporting it as a "Warning" rather than an "Error"?
2) Why is it stuck and not rolling back?
3) Is there an extra setting I need so that it rolls back automatically, and sooner?
Generally, it rolls back when the deployment fails, but that depends on the parameters you pass for the upgrade, such as FailureAction, UpgradeMode and the timeouts.
UpgradeMode values can be:
Monitored: Indicates that the upgrade mode is monitored. After the cmdlet finishes an upgrade for an upgrade domain, if the health of the upgrade domain and the cluster meet the health policies that you define, Service Fabric upgrades the next upgrade domain. If the upgrade domain or cluster fails to meet health policies, the upgrade fails and Service Fabric rolls back the upgrade for the upgrade domain or reverts to manual mode per the policy specified on FailureAction. This is the recommended mode for application upgrades in a production environment.
UnmonitoredAuto: Indicates that the upgrade mode is unmonitored automatic. After Service Fabric upgrades an upgrade domain, Service Fabric upgrades the next upgrade domain irrespective of the application health state. This mode is not recommended for production, and is only useful during development of an application.
UnmonitoredManual: Indicates that the upgrade mode is unmonitored manual. After Service Fabric upgrades an upgrade domain, it waits for you to upgrade the next upgrade domain by using the Resume-ServiceFabricApplicationUpgrade cmdlet.
FailureAction is the compensating action to perform when a Monitored upgrade encounters monitoring policy or health policy violations. The values can be:
Rollback specifies that the upgrade will automatically roll back to the pre-upgrade version.
Manual indicates that the upgrade will switch to the UnmonitoredManual upgrade mode.
Invalid indicates that the failure action is invalid and does nothing.
Given that, if the upgrade is not set to Monitored for UpgradeMode and Rollback for FailureAction, the upgrade will wait for a manual action from the operator (user).
If the upgrade is already set to these values, the problem can be either:
The health checks and retries take too long, which prevents the upgrade from failing quickly, for example when your HealthCheckDuration is too long or there is too much delay between checks.
The old version is also failing.
The following docs give all details: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-parameters
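The interaction between UpgradeMode and FailureAction described above can be sketched as a small Python toy model. This is not Service Fabric code: `run_upgrade`, the state names and the `is_healthy` callback are hypothetical, and real upgrades also involve health check waits, retries and timeouts that are left out here.

```python
def run_upgrade(domains, is_healthy, upgrade_mode="Monitored", failure_action="Rollback"):
    """Walk upgrade domains in order and report the outcome (toy model)."""
    upgraded = []
    for ud in domains:
        upgraded.append(ud)
        if upgrade_mode == "UnmonitoredAuto":
            # Health is ignored; move straight on to the next domain.
            continue
        if upgrade_mode == "UnmonitoredManual":
            # Stops after each domain until the operator resumes the upgrade.
            return {"state": "WaitingForOperator", "upgraded": upgraded}
        if not is_healthy(ud):
            if failure_action == "Rollback":
                # Monitored + Rollback: revert to the pre-upgrade version.
                return {"state": "RolledBack", "upgraded": []}
            # Monitored + Manual: switch to waiting for the operator.
            return {"state": "WaitingForOperator", "upgraded": upgraded}
    return {"state": "Completed", "upgraded": upgraded}

def third_node_fails(ud):
    # Stand-in health check: the third domain never becomes healthy.
    return ud != "UD2"

domains = ["UD0", "UD1", "UD2"]
print(run_upgrade(domains, third_node_fails, "Monitored", "Rollback")["state"])  # RolledBack
print(run_upgrade(domains, third_node_fails, "Monitored", "Manual")["state"])    # WaitingForOperator
```

In this model, an upgrade that sits on the third node without rolling back corresponds to anything other than Monitored with Rollback, or to a health check that has not yet been declared failed.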

Service Fabric Application - changing instance count on application update fails

I am building a CI/CD pipeline to release SF Stateless Application packages into clusters using parameters for everything. This is to ensure environments (DEV/UAT/PROD) can be scoped with different settings.
For example in a DEV cluster an application package may have an instance count of 3 (in a 10 node cluster)
I have noticed that if an application is already running in the cluster with an instance count of (for example) 3, and I change the deployment parameter to anything else (e.g. 5), the application package uploads and the type registers, but the rolling upgrade of the running application fails.
The same happens in the other direction, e.g. if the running app has an instance count of -1 and you want to reduce the count on the next rolling deployment.
Have I missed a setting or config somewhere, or is this how it is supposed to work? At present it does not lend itself to being easily scaled without downtime.
At its simplest form we just want to be able to change instance counts on application updates, as we have an infrastructure-as-code approach to changes, builds and deployments for full tracking ability.
Thanks in advance
This is a common error when using default services.
It has already been answered several times in these places:
Default service descriptions can not be modified as part of upgrade set EnableDefaultServicesUpgrade to true
https://blogs.msdn.microsoft.com/maheshk/2017/05/24/azure-service-fabric-error-to-allow-it-set-enabledefaultservicesupgrade-to-true/
https://github.com/Microsoft/service-fabric/issues/253#issuecomment-442074878
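As a rough illustration of the setting those links refer to, the Python sketch below adds EnableDefaultServicesUpgrade to a list shaped like the fabricSettings array of a cluster definition. The section name ClusterManager and the string value "true" are my recollection of the linked answers, so treat them as assumptions and check them against the links; the helper name is hypothetical.

```python
import json

def enable_default_services_upgrade(fabric_settings):
    """Return a copy of fabric_settings with EnableDefaultServicesUpgrade set to true."""
    # Copy sections and their parameters so the caller's data is not mutated.
    settings = [
        dict(s, parameters=[dict(p) for p in s.get("parameters", [])])
        for s in fabric_settings
    ]
    # Assumption: the parameter lives in the "ClusterManager" section.
    section = next((s for s in settings if s["name"] == "ClusterManager"), None)
    if section is None:
        section = {"name": "ClusterManager", "parameters": []}
        settings.append(section)
    for param in section["parameters"]:
        if param["name"] == "EnableDefaultServicesUpgrade":
            param["value"] = "true"
            break
    else:
        # Parameter not present yet: add it.
        section["parameters"].append(
            {"name": "EnableDefaultServicesUpgrade", "value": "true"}
        )
    return settings

print(json.dumps(enable_default_services_upgrade([]), indent=2))
```

The printed fragment is what would be merged into the cluster configuration before retrying the application upgrade with the new instance count.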

Adding Desired State Configuration extension to a service fabric VMSS

We recently needed to add the Microsoft.Powershell.DSC extension to the VMSS that contains our Service Fabric cluster. We redeployed the cluster using our ARM template, with the addition of the new extension for DSC. During the deployment we observed that as many as 4 out of 5 scale set instances were in the restarting stage at a given time. The services in our cluster were also unresponsive during that time. The outage was only a few minutes long, but this seems like something that should not happen.
Reliability Level: Silver
Durability Level: Bronze
This is caused by the selected durability level 'bronze'.
The durability tier is used to indicate to the system the privileges
that your VMs have with the underlying Azure infrastructure. In the
primary node type, this privilege allows Service Fabric to pause any
VM level infrastructure request (such as a VM reboot, VM reimage, or
VM migration) that impact the quorum requirements for the system
services and your stateful services. In the non-primary node types,
this privilege allows Service Fabric to pause any VM level
infrastructure requests like VM reboot, VM reimage, VM migration etc.,
that impact the quorum requirements for your stateful services running
in it.
Bronze - No privileges. This is the default and is recommended if you are only running stateless workloads in your cluster.
I suggest reading this article. It's an MS employee's blog. I'll copy out the relevant part:
If you don’t mind all your VMs being rebooted at the same time, you can set upgradePolicy to “Automatic”. Otherwise set it to “Manual” and take care of applying changes to the scale set model to individual VMs yourself. It is fairly easy to script rolling out the update to VMs while maintaining application uptime. See https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set for more details.
If your scale set is in a Service Fabric cluster, certain updates like changing OS version are blocked (currently – that will change in future), and it is recommended that upgradePolicy be set to “Automatic”, as Service Fabric takes care of safely applying model changes (like updated extension settings) while maintaining availability.

Meaning of parameter "Mode" of Set-AzureDeployment

What is the meaning of the "Mode" parameter of Set-AzureDeployment?
-Mode
Specifies the mode of upgrade. Supported values are: "Auto", "Manual", and "Simultaneous".
What do "Auto", "Manual", and "Simultaneous" mean?
I am particularly interested in "Simultaneous". Does it mean my package will be deployed to multiple instances simultaneously?
Thanks
Mode specifies the type of update to initiate. Role instances are allocated to update domains when the service is deployed. Updates can be initiated manually in each update domain or initiated automatically in all update domains.
If not specified, the default value is Auto. If set to Manual, WalkUpgradeDomain must be called to apply the update. If set to Auto, the update is automatically applied to each update domain in sequence.
To perform an automatic update of a deployment, call Upgrade Deployment or Change Deployment Configuration with the Mode element set to automatic. The update proceeds from that point without a need for further input. You can call Get Operation Status to determine when the update is complete.
To perform a manual update, first call Upgrade Deployment with the Mode element set to manual. Next, call Walk Upgrade Domain to update each domain within the deployment. You should make sure that the operation is complete by calling Get Operation Status before updating the next domain. For more information, please refer to this link.
One of the new deployment options we now support is the ability to do a “Simultaneous Update” of a Cloud Service (we sometimes also refer to this as the “Blast Option”). When you use this option we bypass the normal upgrade domain walk that is done by default with Cloud Services (where we upgrade parts of the Cloud Service sequentially to avoid ever bringing the entire service down) and we instead upgrade all roles and instances simultaneously. With today’s release this simultaneous update logic now happens within Windows Azure (on the cloud side). This has the benefit of enabling the Cloud Service update to happen much faster. For more information, please refer to this link.
I am particularly interested in "Simultaneous". Does it mean my
package will be deployed to multiple instances simultaneously?
The answer is yes.
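The three modes can be summarised with a small Python toy model. It only restates the behaviour quoted above; `plan_update` and its return shape are hypothetical and are not part of any Azure SDK.

```python
def plan_update(update_domains, mode="Auto"):
    """Describe how an update would walk the update domains (toy model)."""
    if mode == "Simultaneous":
        # All roles and instances are updated in one batch, skipping the domain walk.
        batches = [list(update_domains)]
    elif mode in ("Auto", "Manual"):
        # One update domain at a time, in sequence.
        batches = [[ud] for ud in update_domains]
    else:
        raise ValueError(f"unsupported mode: {mode}")
    # In Manual mode the operator has to advance to each next domain.
    return {"batches": batches, "operator_advances": mode == "Manual"}

print(plan_update([0, 1, 2], "Auto"))
print(plan_update([0, 1, 2], "Simultaneous"))
```

With Simultaneous there is a single batch, so the whole service can be briefly unavailable during the update, which is the trade-off for the faster rollout.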

How can we route a request to every pod under a kubernetes service on Openshift?

We are building a JBoss BRMS application with two microservices in Spring Boot, one for rule generation (SRV1) and one for rule execution (SRV2).
The idea is to generate the rules using the generation microservice (SRV1) and persist them in the database with versioning. The next part of the process is having the execution microservice load these persisted rules into each pod's memory by querying the shared database.
There are two scenarios in which this should happen:
When the rule execution service pod(s) start up, each pod running the execution application queries the shared db for the latest version and loads those rules.
The second scenario is that we want to manually trigger the loading of a specific version of rules on every pod running the execution application, preferably via a REST call.
Which is where the problem lies!
Whenever we try and issue a rest request to the api, since it is load balanced under a kubernetes service, the request hits only one of the pods and the rest of them do not load the specific rules.
Is there a programmatic or design change that may help us achieve that, or is there any other way to construct our application so that we can load a certain version of rules on all pods serving the execution microservice?
The second scenario is that we want to manually trigger the loading of a specific version of rules on every pod running the execution application, preferably via a REST call.
What about using rolling updates? When you want to change the version of rules fetched by all execution pods, tell OpenShift to do a rolling update, which kills and starts your pods one by one until all pods are on the new version, so they fetch the specific version of rules at startup. The trigger for the rolling update and the way you define the version resolution are up to you. For instance: have an ENV var within the pod that defines the version of rules to fetch from the db, then change the ENV var to a new value and perform a rolling update. At the end, you should end up with a new set of pods, all of them fetching the rules version based on the new value of the ENV var you set.
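The startup logic this relies on could look like the following sketch. It is written in Python for brevity even though the services in the question are Spring Boot; the variable name RULES_VERSION, the function `load_rules_at_startup` and the in-memory dict standing in for the shared database are all hypothetical.

```python
import os

def load_rules_at_startup(db, env=None):
    """Pick the rules version from the environment, falling back to the latest one."""
    env = os.environ if env is None else env
    requested = env.get("RULES_VERSION")  # hypothetical variable name
    # No pinned version: use the highest version number stored in the db.
    version = int(requested) if requested else max(db)
    if version not in db:
        raise KeyError(f"rules version {version} not found")
    return version, db[version]

# Stand-in for the shared database: version number -> rules.
rules_db = {1: ["rule-a"], 2: ["rule-a", "rule-b"]}

print(load_rules_at_startup(rules_db, env={"RULES_VERSION": "1"}))  # (1, ['rule-a'])
print(load_rules_at_startup(rules_db, env={}))                      # (2, ['rule-a', 'rule-b'])
```

Because every pod runs the same startup code with the same ENV var, changing that value and rolling the deployment brings all pods onto the same rules version without needing to address pods individually.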