How to clear/acknowledge a warning in the Service Fabric Explorer - azure-service-fabric

I was testing things with a non-prod cluster of On Premises Service Fabric and I stopped the process for one of my services on one of my nodes (to see how it would respond).
That caused the explorer to show the following warning:
There was an error during CodePackage activation. The service host terminated with exit code:1
Service Fabric recovered well. It made a new instance of the process and kept going.
But I can't figure out how to clear the warning from the Explorer. In my non-prod environment, I can just drop and redeploy the app.
But in prod I can't do that. Especially not to just get rid of a warning message!
Is there a way to tell Service Fabric, "thanks for telling me that, I know about it now and you can forget about it and carry on?"
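
For context, the warning shown in Explorer is backed by a health event reported against the deployed service package on that node. A minimal PowerShell sketch for inspecting that event, assuming a connection to the cluster and using placeholder application, service manifest, and node names:

# Connect to the cluster (add endpoint/certificate parameters for a secure cluster)
Connect-ServiceFabricCluster
# List the health events behind the Explorer warning; fabric:/MyApp, MyServicePkg and _Node_0 are placeholders
Get-ServiceFabricDeployedServicePackageHealth -ApplicationName fabric:/MyApp -ServiceManifestName MyServicePkg -NodeName _Node_0 | Select-Object -ExpandProperty HealthEvents

Each event lists the SourceId, Property, health state, and time-to-live that Explorer rolls up into the warning, so you can at least see exactly which report you would be acknowledging.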

Related

GKE Metadata Server is unavailable when Horizontal Pod Autoscaler starts to work

Running pods with Workload Identity causes a Google credential error when autoscaling starts.
My application is configured with Workload Identity to use Google Pub/Sub, and a HorizontalPodAutoscaler is set to scale the pods up to 5 replicas.
The problem arises when the autoscaler creates replicas of the pod: GKE's metadata server does not work for a few seconds, and after 5 to 10 seconds the errors stop.
Here is the error log after a pod is created by the autoscaler:
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server
Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
What exactly is the problem here?
When I read the doc from here: workload identity docs
"The GKE metadata server takes a few seconds to start to run on a newly created Pod"
I think the problem is related to this, but is there a solution for this kind of situation?
Thanks
There is no specific solution other than to ensure your application can cope with this. Kubernetes uses DaemonSets to launch per-node apps like the metadata intercept server, but as the docs clearly tell you, that takes a few seconds (noticing the new node, scheduling the pod, pulling the image, starting the container).
You can use an initContainer to prevent your application from starting until some script returns, which can just try to hit a GCP API until it works. But that's probably more work than just making your code retry when those errors happen.

How to send alert when Azure Service fabric health goes bad

Recently, our Azure Service Fabric health went bad after a deployment. The deployment was successful, but the cluster health went bad due to a code issue, and it was not rolling back. Only on looking into Service Fabric Explorer did we know that the cluster had gone bad.
Is there a way to get an email alert when the Service Fabric health goes bad?
Scenarios where Service Fabric failed:
Whole cluster: one service went bad (shown in red) and was consuming a lot of memory, which in turn caused other services, and eventually the whole cluster, to go bad. I had to log into the scale set to see which services were taking the most memory.
In another case, we added another reliable collection, alongside the existing one, to a stateful service. This caused a failure.
In each of these cases I had to look at Service Fabric Explorer and then go to each scale set to see the actual error message.
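
There is no built-in e-mail alerting in the cluster itself, so a common pattern is a small watchdog that polls cluster health and alerts when it isn't Ok. A minimal PowerShell sketch, assuming the Service Fabric SDK is installed and using placeholder SMTP settings:

# Connect to the cluster (add -ConnectionEndpoint and certificate parameters as needed)
Connect-ServiceFabricCluster
$health = Get-ServiceFabricClusterHealth
if ($health.AggregatedHealthState -ne 'Ok') {
    # smtp.example.com and the addresses below are placeholders
    Send-MailMessage -SmtpServer 'smtp.example.com' -From 'sf-watchdog@example.com' -To 'ops@example.com' -Subject "Cluster health: $($health.AggregatedHealthState)" -Body ($health.UnhealthyEvaluations | Out-String)
}

Run on a schedule, this gives a basic alert; forwarding the cluster's health and diagnostic events to a monitoring service such as Azure Monitor is the more complete route.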

Service Fabric Cluster Upgrade Failing

I have an on-premises, secure, development cluster that I wish to upgrade. The current version is 5.7.198.9494. I've followed the steps listed here.
At the time of writing, the latest version of SF is 6.2.283.9494. However, running Get-ServiceFabricRuntimeUpgradeVersion -BaseVersion 5.7.198.9494 shows that I must first update to 6.0.232.9494 before upgrading to 6.2.283.9494.
I run the following in PowerShell, and the upgrade does start:
Copy-ServiceFabricClusterPackage -Code -CodePackagePath .\MicrosoftAzureServiceFabric.6.0.232.9494.cab -ImageStoreConnectionString "fabric:ImageStore"
Register-ServiceFabricClusterPackage -Code -CodePackagePath MicrosoftAzureServiceFabric.6.0.232.9494.cab
Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion 6.0.232.9494 -Monitored -FailureAction Rollback
However, after a few minutes the following happens:
PowerShell IDE crashes
The Service Fabric Cluster becomes unreachable
Service Fabric Local Cluster Manager disappears from the task bar
Event Viewer will log the events, see below.
Quite some time later, the VM will reboot. Service Fabric Local Cluster Manager will only give options to Setup or Restart the local cluster.
Event Viewer has logs under Applications and Services Logs -> Microsoft-Service Fabric -> Operational. Most are informational, about opening, closing, and aborting one of the upgrade domains. There are some warnings about a VM failing to open an upgrade domain, stating error: Lease Failed.
This behavior happens consistently, and I've not yet been able to update the cluster. My guess is that we are not able to upgrade a development cluster, but I've not found an article that states that.
Am I doing something incorrectly here, or is it impossible to upgrade a development cluster?
I will assume you have a development cluster with a single node or multiple nodes in a single VM.
As described in the first section of the documentation from the same link you provided:
service-fabric-cluster-upgrade-windows-server
"You can upgrade your cluster to the new version only if you're using a production-style node configuration, where each Service Fabric node is allocated on a separate physical or virtual machine. If you have a development cluster, where more than one Service Fabric node is on a single physical or virtual machine, you must re-create the cluster with the new version."
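
If you are unsure which kind of cluster you have, listing the nodes from PowerShell shows whether several of them share one machine; a quick check, assuming a cluster connection:

Connect-ServiceFabricCluster
# On a development (one-box) cluster, multiple nodes report the same IP/FQDN
Get-ServiceFabricNode | Select-Object NodeName, IpAddressOrFQDN, NodeStatus, CodeVersion

If every node comes back with the same IpAddressOrFQDN, it is a one-box development cluster and, per the quoted documentation, it has to be re-created at the new version rather than upgraded in place.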

Load Balancer has mystery "Failed" status

I have a testing/staging Service Fabric cluster with which I've been working through a proof-of-concept system for a project. As of last night, my load balancer is showing a "Failed" message on it, but with no other detail or messages that I can find, I have no idea what the issue actually is.
I'm still able to hit my services within the cluster, and they're executing requests successfully. All Load Balancer rules seem to be working and passing traffic, but the options to add/delete/modify anything within the Load Balancer are greyed out. The "Diagnose and solve problems" sheet shows 0 failed deployments, 0 errors, and 0 alerts. I've restarted the cluster and republished all the applications on it, but the only options I appear to have on the load balancer itself are to delete it or move it to another resource group.
How can I troubleshoot this further without tearing up my cluster and starting over (again)?

Service Fabric - Node in Error state

I've spun up a Service Fabric cluster in Azure and suddenly my node is in an error state. Unfortunate, but this can happen.
When I check Service Fabric Explorer I can see that the node is in error state but the error doesn't really give me any hints since I really didn't do anything.
I haven't found a way to fix it; the worst case would be to restart the node, but I was unable to find that capability.
Has anybody had this issue before, or can anybody help me with this?
In Service Fabric Explorer, there is a Nodes view below the cluster. You can select the node and choose Details to see more information about it; you may be able to see something that indicates what is wrong. There are also five actions that can be taken on the node: Activate, Deactivate (pause), Deactivate (restart), Deactivate (remove data), and Remove node state.
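
If the Explorer buttons aren't enough, the same actions are available from PowerShell; a hedged sketch, with '_Node_0' as a placeholder node name:

Connect-ServiceFabricCluster
# Deactivate with an intent of Pause, Restart, or RemoveData...
Disable-ServiceFabricNode -NodeName '_Node_0' -Intent Restart
# ...and bring the node back afterwards
Enable-ServiceFabricNode -NodeName '_Node_0'
# Remove node state is only for a node that will never return
Remove-ServiceFabricNodeState -NodeName '_Node_0'

If the node itself is unresponsive, restarting the underlying virtual machine scale set instance from the Azure portal is another option.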