Service Fabric: removed actors and now upgrade fails - azure-service-fabric

I'm trying to upgrade a Service Fabric application with a mix of stateful and stateless actors. I did some refactoring and so removed some actors I didn't need any more. Now, when I try to upgrade the application, I get the following error:
Services must be explicitly deleted before removing their Service Types.
After thinking about it a little bit, I think I understand the trouble that could come from removed services and upgrades, but then what's the correct way to do this?

You need to remove the service instances before you can upgrade to a version that doesn't contain the removed service package. Either:
In SF Explorer, navigate to the service and click Actions > Delete Service
In PowerShell:
Connect-ServiceFabricCluster
Remove-ServiceFabricService -ServiceName fabric:/MyApp/MyService
DO BE CAREFUL - if you're deleting a stateful service, you'll lose all of its data. Always make sure you have a periodic backup of production data.
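Put together, a minimal PowerShell sketch of the whole sequence might look like this (the application name, service name, and version below are placeholders, not values from the question):
Connect-ServiceFabricCluster
# List the application's current services and spot the ones whose service type was removed
Get-ServiceFabricService -ApplicationName fabric:/MyApp
# Delete each service instance that no longer has a matching service type in the new version
Remove-ServiceFabricService -ServiceName fabric:/MyApp/RemovedActorService -Force
# With the orphaned instances gone, the upgrade can proceed (assuming version 2.0.0 is already registered)
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -Monitored -FailureAction Rollback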

Related

Expressing that a service requires another

I'm new to k8s, so this question might be kind of weird, please correct me as necessary.
I have an application which requires a redis database. I know that I should configure it to connect to <redis service name>.<namespace> and the cluster DNS will get me to the right place, if it exists.
It feels to me like I want to express the relationship between the application and the database. Like I want to say that the application shouldn't be deployable until the database is there and working, and maybe that it's in an error state if the DB goes away. Is that something you'd normally do, and if so - how? I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Is the alternative to try to connect early and exit 1, so that the cluster keeps on retrying? Feels like that would work but it's not very declarative.
Design for resiliency
Modern applications and Kubernetes are (or should be) designed for resiliency. Applications should be designed without a single point of failure and should be resilient to changes in, e.g., network topology. Also see the Twelve-Factor App: IV. Backing services.
This means that your Redis should typically be a cluster of, e.g., 3 instances. It also means that your app should retry connections when they fail - and this can happen at any time while the app is running, not just at startup - since upgrades of a cluster (or a rolling upgrade of an app) are done by terminating one instance at a time while a replacement instance is launched. E.g. the Redis instance your app is currently connected to might go away, and your app needs to reconnect, perhaps establishing a connection to a different instance in the same cluster.
SQL Databases and schemas
I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Yes, this is a different case. On Kubernetes your app is typically deployed with at least 2 replicas, or more, for high-availability reasons. You need to take that into account when managing schema changes for your app. Common tools for managing the schema are Flyway and Liquibase, and they can be run as Kubernetes Jobs: e.g. first launch a Job to create your DB tables, and after that deploy your app. Some weeks later, when you want to change some tables, you launch a new Job for that schema migration.
As you've seen, YAML manifests cannot express such dependencies directly. As suggested by #fabian-lopez, your application's Pod may include an initContainer that waits for the dependencies to be available before the main container starts.
Now, if you want a state machine capable of provisioning a database, initializing its schema, maybe importing some records, and only then creating your application: you're looking for an operator. For that, you could use the operator-sdk ( https://github.com/operator-framework/operator-sdk ), or pretty much anything else that integrates with the Kubernetes cluster API.
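For illustration, a rough sketch of the initContainer approach suggested above - every name, namespace, and image here is a hypothetical placeholder - could gate the app on Redis answering a PING:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      initContainers:
      - name: wait-for-redis
        image: redis:7
        # Block Pod startup until the Redis service resolves and answers PING
        command: ['sh', '-c', 'until redis-cli -h my-redis.my-namespace -p 6379 ping; do echo waiting for redis; sleep 2; done']
      containers:
      - name: my-app
        image: my-app:1.0.0
The main container only starts once the init container exits successfully, which gives you the "don't start until the database is reachable" behaviour declaratively; it does not help once the app is already running, which is where the retry logic described above comes in.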
I think init containers are something you could leverage for this use case.
This is up to your application code, not something Kubernetes helps nor hinders.

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialized, due to the resulting 404 error response.
I have noticed that during a deployment, Service Fabric apparently shuffles the old version of the service around, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process fails and causes a rollback.
My problem is that once I create a new storage account, I am seemingly still left with no way to bring the existing services back to a healthy state. My existing services are using storage account URLs with account keys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now-invalid configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
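For reference, kicking off such an unmonitored manual upgrade from PowerShell might look roughly like this (package path, versions, and the upgrade domain name are placeholders):
Connect-ServiceFabricCluster
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath .\FixedPkg -ApplicationPackagePathInImageStore FixedPkg -ImageStoreConnectionString fabric:ImageStore
Register-ServiceFabricApplicationType -ApplicationPathInImageStore FixedPkg
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.1 -UnmonitoredManual
# After you have verified each upgrade domain yourself, move the upgrade forward manually
Resume-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -UpgradeDomainName "1"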
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
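A hedged sketch of what that could look like in PowerShell (the service type name and percentages are placeholders; the map value is roughly a "MaxPercentUnhealthyPartitionsPerService,MaxPercentUnhealthyReplicasPerPartition,MaxPercentUnhealthyServices" string):
# Treat the known-bad service type as allowed-to-be-unhealthy during this upgrade only
$policyMap = @{ "MyBackupStatefulServiceType" = "100,100,100" }
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.1 -Monitored -FailureAction Rollback -ServiceTypeHealthPolicyMap $policyMap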
A third recommendation, or maybe something to improve in the future, would be to make the upgrade that changes the account information a configuration-only upgrade. This would ensure that SF tries to change the config in place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering these issues. This is demonstrated in this example.

How isUpgrade setting affects deployment process in Service Fabric Application Deployment task in Azure DevOps

Azure Devops has a standard task for deploying apps to ServiceFabric. The task is named Service Fabric Application Deployment and is documented here. Among other settings, it contains an optional boolean isUpgrade setting (default value 'true'). I tried to set it explicitly to true and false, but I did not find any difference in the behavior of the task. In both cases, the deployment was successful, all previously deployed packages were still provisioned, and Azure Pipelines logs were the same. The time of the deployment was the same, too.
My question is: what does this setting affect? Maybe somebody has used it in their CI pipelines.
There are 2 types of deployment in Service Fabric, and the isUpgrade flag controls which type you are executing.
Regular
Basically, this removes the old application and deploys the new version. So if you have stateful services, this will remove all of their state. You will also have downtime while a regular deployment runs.
Upgrade
An upgrade does a lot more: it keeps the state, performs health checks, makes sure the services remain available, rolls back when a health check fails, and so on.
If your application or services didn't change, nothing changes in your cluster.
Typically an upgrade will take more time (this is highly dependent on your health check rules). See the application upgrade flowchart.
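Roughly speaking, the two paths correspond to the following PowerShell (names, versions, and paths are placeholders; the task does more than this, so treat it only as a sketch of the difference):
# Regular: remove the old application, then deploy the new version (state is lost, downtime occurs)
Remove-ServiceFabricApplication -ApplicationName fabric:/MyApp -Force
Unregister-ServiceFabricApplicationType -ApplicationTypeName MyAppType -ApplicationTypeVersion 1.0.0 -Force
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath .\pkg -ApplicationPackagePathInImageStore MyApp -ImageStoreConnectionString fabric:ImageStore
Register-ServiceFabricApplicationType -ApplicationPathInImageStore MyApp
New-ServiceFabricApplication -ApplicationName fabric:/MyApp -ApplicationTypeName MyAppType -ApplicationTypeVersion 2.0.0

# Upgrade: register the new version and upgrade in place (state is kept, health is monitored)
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath .\pkg -ApplicationPackagePathInImageStore MyApp -ImageStoreConnectionString fabric:ImageStore
Register-ServiceFabricApplicationType -ApplicationPathInImageStore MyApp
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/MyApp -ApplicationTypeVersion 2.0.0 -Monitored -FailureAction Rollback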
More info about the 2 types
If you look at the code of the task, you'll see that the flag only takes effect if overridePublishProfileSettings is true; otherwise the PublishProfile.xml is used.

Handle upgrades with spring boot admin

I am using SBA for monitoring our microservices within AWS ECS clusters.
Everything looks OK except during upgrades: when we spin up a new version of a service, we shut down the old one once the new one becomes healthy. The problem is that the old instance is shown as down and keeps issuing notifications until we manually remove it.
Any solution?
I tried to use the instance de-registration setting, but it doesn't work well since ECS probably just kills the tasks rather than gracefully shutting down the application context.
You can issue a DELETE request to /api/applications/<id> from your deployment scripts to remove the application from the admin server.
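A small PowerShell sketch of that call (the admin-server URL and instance id are placeholders; you can typically list the registered instances with a GET on the same endpoint first):
$adminServer = "http://my-spring-boot-admin:8080"
# Id of the old, now-down instance as reported by the admin server
$instanceId = "1a2b3c4d5e6f"
Invoke-RestMethod -Method Delete -Uri "$adminServer/api/applications/$instanceId"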

Service Fabric stateful service no longer replicates

FURTHER UPDATE: this error has not occurred since the November update.
EDIT: you may want to read this if your stateful service stops working for no apparent reason. The typical sign (using a WordCount-like app as an example) is that the service deployment reports one partition remaining and gives up after 5 tries. The stateless service starts OK. The diagnostics report multiple "Constructed instance of type WordCountService" entries. If you see this, then you may have the same problem I have. No amount of uninstalling the VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate the VM each time this error occurs (it is rare but has happened several times). I assume MSFT is aware and fixing this for the beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on the WordCount example. All of a sudden, after deployment, the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed OK (one instance, no replicas).
The partition status of the primary partition is reporting "Partition is below target replica or instance count". The replica status is "InBuild" for the replicas; the primary is OK.
On the primary node there is a warning: "Replica had multiple failures during open. Error = -2147024894".
I have tried cleaning the cluster, uninstalling/reinstalling Service Fabric, deleting the SfDevCluster directory entirely, etc.
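For reference, a full clean-and-recreate of the local dev cluster typically amounts to something like the following sketch (run from an elevated PowerShell prompt; the script and data paths are the default SDK locations and may differ by machine or SDK version):
# Tear down the local cluster and remove its data
& "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\CleanCluster.ps1"
Remove-Item "C:\SfDevCluster" -Recurse -Force -ErrorAction SilentlyContinue
# Recreate the local dev cluster from scratch
& "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1"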
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week, but in my case it caused the host service not to start. I tried uninstalling/reinstalling/cleaning/removing the SDKs, removing Visual Studio, etc. The only thing that fixed it was a reinstall of Windows.