Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialized, due to the resulting 404 error response.
I have noticed that during a deployment, Service Fabric apparently shuffles around the old version of the service, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail, causing a rollback.
My problem is, once I create a new storage account, I am seemingly still left with no way to bring the existing services back to a healthy state. My existing services are using storage account URLs with account keys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now-invalid configuration.
Are there any ways to deal with this situation?

The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
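A rough sketch of what that could look like from PowerShell (the application name, version, and upgrade domain name below are placeholders for your own values):
Connect-ServiceFabricCluster
# Start the upgrade with no health monitoring; nothing advances automatically
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion "2.0.1" -UnmonitoredManual
# Check progress, do your manual checks, then explicitly walk each upgrade domain
Get-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp
Resume-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -UpgradeDomainName "UD1"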
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters, depending on the exact situation.
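For example, something along these lines (the service type name and percentages are illustrative, and as far as I recall the map value is "MaxPercentUnhealthyPartitionsPerService,MaxPercentUnhealthyReplicasPerPartition,MaxPercentUnhealthyServices" - check the cmdlet docs for the exact format):
# Allow the backup-dependent service type to be 100% unhealthy during this upgrade only
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion "2.0.1" `
    -Monitored -FailureAction Rollback `
    -ServiceTypeHealthPolicyMap @{ "MyStatefulServiceType" = "100,100,100" }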
A third recommendation, or maybe something to improve in the future, would be to make the upgrade that changes the account information a configuration-only upgrade. This would ensure that SF tries to change the config in place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.
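As a hedged sketch of that flow (paths, names, and versions are placeholders): bump only the ConfigPackage version in ServiceManifest.xml and the application type version in ApplicationManifest.xml, keep the CodePackage version unchanged, then register and upgrade as usual:
# Only Settings.xml changed (ConfigPackage version bumped, CodePackage version untouched),
# so the upgrade can be applied as configuration-only without restarting the replicas.
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath .\pkg\Release -ApplicationPackagePathInImageStore MyApp_2.0.1
Register-ServiceFabricApplicationType -ApplicationPathInImageStore MyApp_2.0.1
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion "2.0.1" -Monitored -FailureAction Rollback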

Related

Expressing that a service requires another

I'm new to k8s, so this question might be kind of weird, please correct me as necessary.
I have an application which requires a redis database. I know that I should configure it to connect to <redis service name>.<namespace> and the cluster DNS will get me to the right place, if it exists.
It feels to me like I want to express the relationship between the application and the database. Like I want to say that the application shouldn't be deployable until the database is there and working, and maybe that it's in an error state if the DB goes away. Is that something you'd normally do, and if so - how? I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Is the alternative to try to connect early and exit 1, so that the cluster keeps on retrying? Feels like that would work but it's not very declarative.
Design for resiliency
Modern applications and Kubernetes are (or should be) designed for resiliency. Applications should be designed without a single point of failure and be resilient to changes in e.g. network topology. Also see The Twelve-Factor App: IV. Backing services.
This means that your Redis typically should be a cluster of e.g. 3 instances. It also means that your app should retry failed connections - this can happen at any time after startup as well - since upgrades of a cluster (or a rolling upgrade of an app) are done by terminating one instance at a time while a new instance is launched in its place. E.g. the instance (of a cluster) that your app is currently connected to might go away, and your app needs to reconnect, perhaps establishing a connection to a different instance in the same cluster.
SQL Databases and schemas
I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Yes, this is a different case. On Kubernetes your app is typically deployed with at least 2 replicas, or more (for high-availability reasons). You need to consider that when managing schema changes for your app. Common tools to manage the schema are Flyway or Liquibase, and they can be run as Jobs. E.g. first launch a Job to create your DB tables, and after that deploy your app. A few weeks later, when you want to change some tables, launch a new Job for that schema migration.
As you've seen, YAML objects cannot express such dependencies. As suggested by #fabian-lopez, your application Pod may include an initContainer that waits for its dependencies to be available before the main container starts.
Now, if you want a state machine capable of provisioning a database, initializing its schema, maybe importing some records, and only then creating your application: you're looking for an operator. Then you may use the operator-sdk ( https://github.com/operator-framework/operator-sdk ), or pretty much anything that integrates with a Kubernetes cluster's API.
I think Init Containers are something you could leverage for this use case
This is up to your application code, not something Kubernetes helps nor hinders.

Windows OS Update/Patch handling - best practices for SF today

I'm aware that SF doesn't yet automatically handle OS upgrades/patching in any way like Cloud Services do. I eagerly await that feature when it's ready. But for now I am curious what I should expect by default.
Since SF uses Scale Sets and standard Windows VMs, should I expect that the instances will have the default Windows Update settings and thus will reboot automatically every so often as updates are applied? I believe the defaults are to install updates automatically and reboot during the defined maintenance window (3am?), is that correct?
If that is true, can I expect that SF will gracefully handle the reboot? By that I mean any services running on it are shut down and the load balancer is notified to stop sending requests to any externally visible endpoints on that host?
But taking that a step further, if all of the above happens to be true, is there anything preventing all nodes in my cluster from hitting the maintenance window and rebooting at the same time? That would seem catastrophic to me.
Given all that, what is the best practice and general advice for handling Windows Updates in SF today?
You're correct that there could be catastrophic results if you just turn on Windows Update and let it go. There will be no coordination of when the nodes reboot, and you could lose part or all of your application or cluster if the nodes cause the Service Fabric services to lose quorum.
The only safe approach is to install the patches/updates on a single node at a time and not move to the next node until the cluster is healthy again. This can be scripted to make it easier or, worst case, done manually.
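A simplified sketch of the per-node loop (the node name is a placeholder, and a real script should poll cluster health between every step):
Connect-ServiceFabricCluster
# Drain the node safely; SF moves replicas off it before you patch
Disable-ServiceFabricNode -NodeName "_Node_0" -Intent Restart -Force
# ...install Windows updates and reboot the node here...
# Bring the node back and confirm the cluster is healthy before moving to the next node
Enable-ServiceFabricNode -NodeName "_Node_0"
Get-ServiceFabricClusterHealth | Select-Object AggregatedHealthState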
There may be another approach that has to do with adding nodetypes, but it is not yet tested, so I don't want to give details until we know it works.

Service Fabric: removed actors and now upgrade fails

I'm trying to upgrade a Service Fabric application with a mix of stateful and stateless actors. I did some refactoring and so removed some actors I didn't need any more. Now, when I try to upgrade the application, I get the following error:
Services must be explicitly deleted before removing their Service Types.
After thinking about it a little bit, I think I understand the trouble that could come from removed services and upgrades, but then what's the correct way to do this?
You need to remove the service instances before you can upgrade to a version that doesn't contain the removed service package. Either:
In SF Explorer, navigate to the service and click Actions > Delete Service
In PowerShell:
Connect-ServiceFabricCluster
Remove-ServiceFabricService -ServiceName fabric:/MyApp/MyService
DO BE CAREFUL - If you're deleting a stateful service you'll lose all its data. Always be sure to have a periodic backup of production data.
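Once the delete has completed, you can confirm the service is really gone before kicking off the upgrade again, e.g. (application name is a placeholder):
# Should no longer list the removed service
Get-ServiceFabricService -ApplicationName fabric:/MyApp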

Service Fabric stateful service no longer replicates

FURTHER UPDATE: this error has not occurred since the November update.
EDIT: you may want to read this if your stateful service stops working for no apparent reason. A typical sign, using a WordCount-like app (for example): the service deployment reports that one partition is remaining and gives up after 5 tries. The stateless service starts OK. The diagnostics report multiple "Constructed instance of type WordCountService" messages. If you have this, then you may have the same problem I have. No amount of uninstalling VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate it each time this error occurs (it is rare but has happened several times). I assume MSFT is aware and fixing it for the beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on the WordCount example. All of a sudden, after deployment, the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed OK (one instance, no replicas).
The partition status of the primary partition is reporting "Partition is below target replica or instance count". The replica status is "InBuild" for the replicas; the Primary is OK.
On the primary node, there is a warning: "Replica had multiple failures during open. Error = -2147024894."
I have tried cleaning the cluster, uninstalling/reinstalling service fabric, deleting the SfDevCluster directory entirely etc.
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week, but it caused the host service not to start. I tried uninstall/reinstall/clean/remove SDKs, removed Visual Studio, etc. The only thing that fixed it was a reinstall of Windows.

Azure Service Fabric

Please help me understand: is there any option in Azure Service Fabric to delay deprovisioning? I have a microservice application hosted in Service Fabric, with instances distributed across different nodes. If I disengage/deprovision the service from the portal, can Service Fabric internally check whether a transaction is in progress on any of the instances and, if one is, wait for it to complete? Also, if Microsoft does not provide such a feature, is there a PowerShell command to check the instance status?
Thanks
I assume that by "disengage/deprovision the service from portal" you are referring to deleting the service via the Service Fabric Explorer web app (perhaps via a link followed from the portal). Please correct me if this is wrong.
To answer your question directly, the framework will not wait for in-flight operations to complete during a service delete. Every replica for the service will lose its read and write permissions, causing all in-flight operations to fail. We do not offer a way to stall during this step in order to, for example, allow currently open transactions to be completed.
The reason we do not offer this semantic is that service deletion is expected to be rare or permanent, and delaying deletion for the final operation doesn't enable any additional scenarios. In either case, if a client is attempting operations on a service being deleted, either:
The last client operation may fail due to delete racing and revoking read/write permissions
Every subsequent client operation will fail due to the service no longer existing
or
The last client operation will succeed due to deletion being delayed
Every subsequent client operation will fail due to the service no longer existing
The expectation is that any client or dependent service should have already been updated or deleted prior to deleting the service they depend on, as you are making the permanent decision that this service should no longer exist.
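Regarding the last part of your question: there are PowerShell cmdlets for inspecting a service's partitions and replicas (the service name below is a placeholder), though they report partition/replica status rather than in-flight transactions:
Connect-ServiceFabricCluster
# List the partitions of the service and the replicas (or instances) in each one
Get-ServiceFabricPartition -ServiceName fabric:/MyApp/MyService | Get-ServiceFabricReplica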