Service Fabric stateful service no longer replicates - azure-service-fabric

FURTHER UPDATE: this error has not occurred since the November update.
EDIT: you may want to read this if your stateful service stops working for no apparent reason. Typical sign is using WordCount-like app (for example), the service deployment reports that one partition is remaining and after 5 tries gives up. The stateless service starts ok. The diagnostics reports multiple "Constructed instance of type WordCountService". If You have this, then you may have the same problem I have. No amount of uninstalling VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate it each time this error occurs (it is rare but has happened several times). Assume MSFT is aware and fixing for beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on Wordcount example. All of a sudden, after deployment the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed ok (one instance, no replicas).
The partition status of the primary partition is reporting "Partition is below target replica or instance count". The replica status is "InBuild" for replicas, Primary is OK.
On the primary node, there is a warning "Replica had multiple failures during open. Error = -2147024894.
I have tried cleaning the cluster, uninstalling/reinstalling service fabric, deleting the SfDevCluster directory entirely etc.
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week but it caused the host service not to start. Tried uninstall/reinstall/clean/remove SDKs, remove Visual Studio, etc. The only thing that fixed it was a reinstall of windows.

Related

Rabbit MQ Shovel Plugin- Creating duplicate data in case of node failure

I am creating shovel plugin in rabbit mq, that is working fine with one pod, However, We are running on Kubernetes cluster with multiple pods and in case of pod restart, it is creating multiple instance of shovel on each pod independently, which is causing duplicate message replication on destination.
Detail steps are below
We are deploying rabbit mq on Kubernetes cluster using helm chart.
After that we are creating shovel using Rabbit MQ Management UI. Once we are creating it from UI, shovels are working fine and not replicating data multiple time on destination.
When any pod get restarted, it create separate shovel instance. That start causing issue of duplicate message replication on destination from different shovel instance.
When we saw shovel status on Rabbit MQ UI then we found that, there are multiple instance of same shovel running on each pod.
When we start shovel from Rabbit MQ UI manually, then it will resolved this issue and only once instance will be visible in UI.
So problem which we concluded that, in case of pod failure/restart, shovel is not able to sync with other node/pod, if any other shovel is already running on node/pod. Since we are able to solve this issue be restarting of shovel form UI, but this not a valid approach for production.
This issue we are not getting in case of queue and exchange.
Can anyone help us here to resolve this issue.
as we lately have seen similar problems - this seems to be an issue since some 3.8. version - https://github.com/rabbitmq/rabbitmq-server/discussions/3154
it should be fixed as far as I have understood from version 3.8.20 on. see
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.19
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.20
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.2
didn't have time yet to check if this is really fixed with those versions.

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialize due to this 404 error response.
I have noticed that during a deployment fabric apparently shuffles around the old version of the service spinning up new primaries as needed to free up the vm it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail causing a rollback.
My problem is, once I create a new storage account, I am still left seemingly no way to bring the existing services back to healthy states. My existing services are using Storage account urls with AccountKeys that no longer exists in azure. Attempts to upgrade fail because the old service instances can’t instantiate due to now bad configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommend solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade to change the account information a configuration only upgrade. This would ensure that SF tries to change the config in-place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

Service Fabric Application - changing instance count on application update fails

I am building a CI/CD pipeline to release SF Stateless Application packages into clusters using parameters for everything. This is to ensure environments (DEV/UAT/PROD) can be scoped with different settings.
For example in a DEV cluster an application package may have an instance count of 3 (in a 10 node cluster)
I have noticed that if an application is in the cluster and running with an instance count (for example) of 3, and I change the deployment parameter to anything else (e.g. 5), the application package will upload and register the type, but will fail on attempting to do a rolling upgrade of the running application.
This also works the other way e.g. if the running app is -1 and you want to reduce the count on next rolling deployment.
Have I missed a setting or config somewhere, is this how it is supposed to be? At present its not lending itself to being something that is easily scaled without downtime.
At its simplest form we just want to be able to change instance counts on application updates, as we have an infrastructure-as-code approach to changes, builds and deployments for full tracking ability.
Thanks in advance
This is a common error when using Default services.
This has been already answered multiple times in these places:
Default service descriptions can not be modified as part of upgrade set EnableDefaultServicesUpgrade to true
https://blogs.msdn.microsoft.com/maheshk/2017/05/24/azure-service-fabric-error-to-allow-it-set-enabledefaultservicesupgrade-to-true/
https://github.com/Microsoft/service-fabric/issues/253#issuecomment-442074878

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well it gives a few of these messages but gets past them) and the other does have this problem (thousands of this message and taking over 24 hours before I killed it).
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world" deployed from the console triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script as the instance will be destroyed when the monitor realises that the process has finished
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.

Is it possible to control Service Fabric hosted service restart behaviour?

I can't find much documentation on the action that Service Fabric takes when a service it is hosting fails. I have performed some experimentation (using a stateless service in a local cluster), the results of which are below. My question is: is it possible to change this behaviour?
There are two distinct scenarios that I tested.
An exception thrown from the RunAsync() method.
The hosted service is restarted immediately on another cluster node. If no other node is available then it is restarted on the same node. There does not appear to be any limit to the number of times the restart will be attempted or any kind of back-off in terms of the interval between attempts.
The hosted service fails to start (e.g. an exception is thrown before RunAsync() is called).
The hosted service is restarted on the same node. In my test environment there appears to be a fixed interval between restart attempts (15 seconds) but no limit to the number of attempts.
I can see in the cluster configuration that there are some parameters in the Hosting section that look like they might be relevant (ActivationMaxRetryInterval, ActivationRetryBackoffInterval, ActivationMaxFailureCount) and I am guessing that these cover scenario (2) above (assuming that Activation == service start). These affect the entire cluster by the looks of it.