Service Fabric - Node in Error state

I've spun up a Service Fabric cluster in Azure and suddenly one of my nodes is in an Error state - unfortunate, but this can happen.
When I check Service Fabric Explorer I can see that the node is in the Error state, but the error doesn't really give me any hints, since I didn't actually do anything.
I haven't found a way to fix it; my worst-case plan was to restart the node, but I couldn't find that capability.
Did anybody have this issue before or can anybody help me with this?

In Service Fabric Explorer, there is a Nodes view below the cluster. You can select the node and choose Details to see more information about it; you may be able to spot something that indicates what is wrong. There are also five actions that can be taken on the node: Activate, Deactivate (pause), Deactivate (restart), Deactivate (remove data) and Remove node state.
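If you would rather script this than click through Explorer, the same actions are exposed by the cluster's HTTP management endpoint (port 19080 by default). Below is a minimal sketch in Python, assuming an unsecured test cluster; the cluster address and node name are placeholders, and a secured cluster would also need certificate or token authentication:

```python
import time
import requests

CLUSTER = "http://mycluster.westeurope.cloudapp.azure.com:19080"  # placeholder endpoint
NODE = "_Node_0"                                                  # placeholder node name
API = {"api-version": "6.0"}

def node_status():
    # GET /Nodes/{nodeName} returns node info including its NodeStatus
    r = requests.get(f"{CLUSTER}/Nodes/{NODE}", params=API)
    r.raise_for_status()
    return r.json().get("NodeStatus")

# Deactivate with the "Restart" intent (the "Deactivate (restart)" action in Explorer).
requests.post(f"{CLUSTER}/Nodes/{NODE}/$/Deactivate",
              params=API, json={"DeactivationIntent": "Restart"}).raise_for_status()

# Wait until the node reports Disabled, then re-activate it.
while node_status() != "Disabled":
    time.sleep(10)
requests.post(f"{CLUSTER}/Nodes/{NODE}/$/Activate", params=API).raise_for_status()
```

Remove node state is the last resort: it tells the cluster to stop waiting for the data that lived on a node that is permanently gone, so only use it once you are sure the node is never coming back.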

Related

Cloud Composer 2: prevent eviction of worker pods

I am currently planning to upgrade our Cloud Composer environment from Composer 1 to 2. However, I am quite concerned about disruptions that could occur in Cloud Composer 2 due to the new autoscaling behavior inherited from GKE Autopilot. In particular, since nodes will now auto-scale based on demand, it seems like nodes with running workers could be killed off if GKE thinks the workers could be rescheduled elsewhere. This would be bad because my code isn't currently very tolerant of retries.
I think that this can be prevented by adding the following annotation to the worker pods: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
However, I don't know how to add annotations to worker pods created by Composer (I'm not creating them myself, after all). How can I do that?
EDIT: I think this issue is made more complex by the fact that it should still be possible for the cluster to evict a pod once it's finished processing all its Airflow tasks. If the annotation is added but doesn't go away once the pod is finished processing, I'm worried that could prevent the cluster from ever scaling down.
So a more dynamic solution may be needed, perhaps one that takes into account the actual tasks that Airflow is processing.
If I have understood your problem correctly, could you please try this solution:
In the Cloud Composer environment, navigate to the Kubernetes Engine --> Workloads page in the GCP Console.
Find the worker pod you want to add the annotation to and click on the name of the pod.
On the pod details page, click on the Edit button.
In the Pod template section, find the Annotations field and click on the pencil icon to edit.
In the Edit annotations field, add the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Click on the Save button to apply the change.
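If you prefer to apply the same change from a script instead of the console, here is a minimal sketch using the official Kubernetes Python client; the namespace and the airflow-worker name prefix are assumptions about how your Composer environment names its worker pods:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

namespace = "composer-user-workloads"   # assumption - check which namespace your workers run in
patch = {"metadata": {"annotations": {
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}}}

for pod in v1.list_namespaced_pod(namespace).items:
    if pod.metadata.name.startswith("airflow-worker"):   # assumption about the pod name prefix
        v1.patch_namespaced_pod(pod.metadata.name, namespace, patch)
```

Note that this only patches pods that already exist; workers created later by autoscaling would not carry the annotation, which ties into the concern in the EDIT above about needing a more dynamic approach.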
Let me know if it works fine. Good luck.

How to delete node in EKS managed node group if the Kubelet crashes or stops reporting?

I am using AWS EKS with a managed node group. Twice in the past couple of weeks I had a case where the kubelet on one of the nodes crashed or stopped reporting back to the control plane.
In this case I would expect the Auto Scaling group to identify the node as unhealthy and replace it. However, this is not what happens. I have recreated the issue by creating a node and manually stopping the kubelet; see the image below:
My first thought was to create an Event Bus alert that would trigger a Lambda to take care of this, but I couldn't find the EKS service in the list of services in Event Bus, so …
Does anyone know of a tool or configuration that would help with this?
To be clear I am looking for something that would:
Detect that the kubelet isn't connecting to the control plane
Delete the node from the cluster
Terminate the underlying EC2 instance
THANKS!!
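For illustration only, here is a rough sketch (not an official AWS mechanism) of the kind of automation described above, using the Kubernetes Python client and boto3: it finds nodes whose Ready condition is no longer True, deletes them from the cluster, and terminates the backing EC2 instance so the managed node group brings up a replacement. The region is a placeholder, and in practice this would run on a schedule with a grace period before acting:

```python
import boto3
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ec2 = boto3.client("ec2", region_name="eu-west-1")   # region is a placeholder

for node in v1.list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    if ready.status == "True":
        continue
    # NOTE: a real implementation should require the node to have been
    # NotReady/Unknown for several minutes before acting on it.
    # On EKS the providerID looks like aws:///<az>/<instance-id>
    instance_id = node.spec.provider_id.rsplit("/", 1)[-1]
    v1.delete_node(node.metadata.name)                 # remove the node object from the cluster
    ec2.terminate_instances(InstanceIds=[instance_id]) # the managed node group replaces the instance
```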

Service Fabric Upgrade stuck on PreUpgradeSafetyCheck

I received a warning that a new version of Service Fabric is available; however, when I tried to upgrade, the process got stuck at PreUpgradeSafetyCheck on node Rep_247. I've tried -Force and -ForceRestart, but it hasn't helped.
Cluster Map
This issue is most likely happening because Service Fabric can't take down a service in a safe manner in order to upgrade the node or application.
Whenever a node is upgraded, the services activated on that node must move to another node first, so that the node can be restarted without affecting the availability of your applications/services.
In this case, doing so may cause a quorum loss if a service can't be placed on another node, perhaps because there is no other node available, because of placement constraints on the service, or because there is only one instance of the service.
Because SF can't guarantee the reliability of the service, it will halt the upgrade process until a fix is applied and the process can continue.
From your cluster map and the message it is possible to identify the issue: your cluster has only one node of node type ReportServerType (Rep_247). I am assuming you have services with placement constraints that restrict them to this node type; taking the node down would make these services unavailable, because the placement constraints prevent them from moving to another node type.
If the services are not constrained to that node type, the problem might be:
The service is failing to activate on other nodes (for example, because dependencies are missing on those nodes), so the minimum number of replicas can't be reached.
The service has only one instance, and taking that instance down would make the service unavailable.
PS: the same applies to the node MR_236 (node type MRType).
PreUpgradeSafetyCheck
An UpgradePhase of PreUpgradeSafetyCheck means there were issues preparing the upgrade domain before it was performed. The most common issues in this case are service errors in the close or demotion from primary code paths.
Possible solutions for this case are:
Add more replicas/instances of the service so that the minimum quorum is met.
Remove the placement constraints on the service so that it can move to other nodes.
Add an extra node of the same node type so that the service can move out safely.
Take the service down and recreate it once the node is upgraded (last resort, and only if the service is not stateful, otherwise you will lose data).
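If it is not obvious which service or safety check is blocking the rollout, the cluster's management endpoint reports what the upgrade is currently waiting on. A minimal sketch in Python, assuming an unsecured cluster on the default port; the address is a placeholder and the exact response field names may differ slightly between runtime versions:

```python
import json
import requests

CLUSTER = "http://mycluster.westeurope.cloudapp.azure.com:19080"   # placeholder endpoint

# GET /$/GetUpgradeProgress returns the cluster upgrade state and,
# per node in the current upgrade domain, any pending safety checks.
r = requests.get(f"{CLUSTER}/$/GetUpgradeProgress", params={"api-version": "6.0"})
r.raise_for_status()
progress = r.json()

print("Upgrade state:", progress.get("UpgradeState"))
domain = progress.get("CurrentUpgradeDomainProgress", {})
for node in domain.get("NodeUpgradeProgressList", []):
    print(node.get("NodeName"), node.get("UpgradePhase"),
          json.dumps(node.get("PendingSafetyChecks", [])))
```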
You might be interested in these related issues:
GitHub issue #1279
GitHub issue #377

How to send an alert when Azure Service Fabric health goes bad

Recently our Azure Service Fabric cluster's health went bad after a deployment. The deployment itself was successful, but the cluster health went bad due to a code issue, and it was not rolling back. Only on looking into Service Fabric Explorer did we realize the cluster had gone bad.
Is there a way to get an email alert when the Service Fabric health goes bad?
Scenarios where Service Fabric failed:
Whole cluster: one service went bad (showed in red) and was consuming a lot of memory, which in turn caused other services to go bad, and eventually the whole cluster. I had to log into the scale set to see which service was taking most of the memory.
In another case we added another reliable collection to a stateful service that already had a reliable collection. This caused a failure.
In each of these cases I had to look at Service Fabric Explorer and then go to each scale set to see the actual error message.
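One possible approach (an illustration, not the only or official way) is to poll the cluster health REST endpoint on a schedule and send a mail whenever the aggregated state is not Ok; the cluster address and SMTP details below are placeholders:

```python
import smtplib
from email.message import EmailMessage

import requests

CLUSTER = "http://mycluster.westeurope.cloudapp.azure.com:19080"   # placeholder endpoint

# GET /$/GetClusterHealth returns the aggregated state (Ok/Warning/Error)
# plus the unhealthy evaluations that explain why it is not Ok.
r = requests.get(f"{CLUSTER}/$/GetClusterHealth", params={"api-version": "6.0"})
r.raise_for_status()
health = r.json()
state = health.get("AggregatedHealthState")

if state != "Ok":
    msg = EmailMessage()
    msg["Subject"] = f"Service Fabric cluster health is {state}"
    msg["From"] = "alerts@example.com"            # placeholder sender
    msg["To"] = "oncall@example.com"              # placeholder recipient
    msg.set_content(str(health.get("UnhealthyEvaluations", [])))
    with smtplib.SMTP("smtp.example.com") as s:   # placeholder SMTP relay
        s.send_message(msg)
```

In practice you would run something like this from a timer-triggered job, and you could extend it to drill into per-application or per-node health, which the same endpoint family also exposes.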

How to clear/acknowledge a warning in the Service Fabric Explorer

I was testing things with a non-prod cluster of On Premises Service Fabric and I stopped the process for one of my services on one of my nodes (to see how it would respond).
That caused the explorer to show the following warning:
There was an error during CodePackage activation.The service host terminated with exit code:1
Service Fabric recovered well. It made a new instance of the process and kept going.
But I can't seem to see how to clear out the warning from the Explorer. In my non-prod env, I can just drop and redeploy the app.
But in prod I can't do that. Especially not to just get rid of a warning message!
Is there a way to tell Service Fabric, "thanks for telling me that, I know about it now and you can forget about it and carry on?"