How to debug GKE node pool nodes being re-created - kubernetes

GKE Cloud Logging and Monitoring are not leading me to the root cause. Over some period every single node was replaced (I could verify this by node age with kubectl), leading to a short complete outage (several minutes) for all services, as detected by external monitoring.
Nodes are not preemptible.
gcloud container operations list does not show any node-upgrade operations.
In Cloud Logging node event logs, there were multiple entries like these:
Node <...> status is now: NodeNotReady
Deleting node <...> because it does not exist in the cloud provider
Node <...> event: Removing Node <...> from Controller
The node-problem-detector logs have a bunch of these:
"2022/05/26 20:35:10 Failed to export to Stackdriver: rpc error: code = DeadlineExceeded desc = One or more TimeSeries could not be written: Deadline expired before operation could complete.: gce_instance{zone:europe-west2-a,instance_id:<...>} timeSeries[0-199]: kubernetes.io/internal/node/guest/net/rx_multicast{instance_name<...>,interface_namegkeb5dd8ca7306}"
The cluster autoscaler shows only a few nodes added and removed, spread out over several hours.
During the period building up to this, one service in the cluster was receiving a DDoS attack, so network pressure was high, but there was no CPU throttling and there were no OOM kills.
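For reference, a minimal sketch of the checks described above, plus the node-level events; the flags are standard and the grep pattern is only illustrative:
$ kubectl get nodes --sort-by=.metadata.creationTimestamp        # confirm when each node was (re)created
$ kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
$ gcloud container operations list | grep -iE 'upgrade|repair'   # node auto-repair operations would also show up here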

Related

How to rescue a Kubernetes cluster when multiple pods enter "unknown" status

I am trying to understand the lessons from a failed K8s cluster. I am running MicroK8s 1.22.5. I had 3 rock-solid (physical) nodes. I tried to add a fourth node (a KVM guest) to satisfy the requirements of MinIO. Within 24h, the KVM node had entered "Unknown" status together with its pods. Within 48h, multiple pods on all of the nodes had "Unknown" status. Most of the deployments and statefulsets are down, including multiple DBs (Postgres, Elasticsearch), so it's really painful (extra tips on how to save these are welcome). According to the official docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
1. The Node object is deleted (either by you, or by the Node Controller).
2. The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
3. Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc.), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
So I tried draining the node (option 1), but no dice: I get an error about not being able to violate a disruption budget. Option 2 is not happening, and option 3 has no effect. It looks like the failing node poisoned the whole cluster. Any advice on how to avoid this in the future? Many thanks.
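For reference, a minimal sketch of what options 1 and 3 look like on the command line, assuming the node is confirmed dead; names are placeholders, and --disable-eviction bypasses PodDisruptionBudgets by deleting pods instead of evicting them (with MicroK8s, prefix the commands with microk8s if needed):
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --disable-eviction
$ kubectl delete node <node-name>                                         # option 1: remove the Node object
$ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force   # option 3: force-delete a stuck Pod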

How to implement graceful termination of nodes without service downtime when using cluster auto-scaler?

I have set up a K8s cluster using EKS. The Cluster Autoscaler (CA) has been configured to increase/decrease the number of nodes based on resource availability for pods. The CA terminates a node if it is unneeded and its pods can be scheduled onto another node. However, the CA terminates the node before the pods are rescheduled, so the pods only get scheduled onto another node after the node has been terminated. Hence, there is some downtime for some services until the rescheduled pods become healthy.
How can I avoid the downtime by ensuring that the pods get scheduled onto another node before the node gets terminated?
The graceful termination period for nodes is set to 10 minutes (the default).
You need to have multiple replicas of your application running. That will allow your application to survive even a sudden node death. You may also want to configure podAntiAffinity rules in your app manifest to ensure that the replicas reside on different nodes, as sketched below.
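A minimal sketch of such a Deployment, assuming a hypothetical app called my-app; the image and label names are placeholders:
$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0    # placeholder image
EOF
With required anti-affinity the scheduler will never co-locate two replicas on the same node; if that is too strict for your node count, preferredDuringSchedulingIgnoredDuringExecution is the softer variant.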

kube-dns. High Availability. Error handling in Kubernetes

I have a Kubernetes cluster with several nodes. I have kube-dns running on 3 nodes.
The issue I'm having is that if 1 of those 3 nodes goes down, the requests between my pods/containers start to fail roughly 1 out of 3 times.
This is because when a container resolves a K8s service hostname, it calls the kube-dns service to resolve that hostname, and the kube-dns service has three endpoints, but one of those three endpoints is not valid because its node is down. K8s does not update the service endpoints until it detects that the node is down. (Currently I have that time set to 60 seconds.)
Any idea about how to mitigate this? Is there any kind of retry that could be configured outside the application? Something in the container or at k8s level.
Thank you.
The main contributor to communication between the Kubernetes resources on a particular Node and the kube-apiserver is the kubelet; its role can be described as that of a node agent. The kubelet therefore plays a significant role in the cluster life cycle, with primary duties like managing liveness and readiness probes for the objects it runs, writing resource metadata to the API server (which stores it in etcd), and periodically refreshing its own health status to the kube-apiserver at the interval specified by the --node-status-update-frequency flag in the kubelet configuration.
--node-status-update-frequency duration - Specifies how often kubelet posts node status to master. Note: be cautious when changing the constant, it must work with nodeMonitorGracePeriod in nodecontroller. (default 10s)
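If you want to check what value a given kubelet is actually running with, one way (a sketch, assuming you can reach the node through the API server proxy) is the kubelet's /configz endpoint:
$ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | grep -o '"nodeStatusUpdateFrequency":"[^"]*"'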
On the other side, there is a specific component in Kubernetes called the node controller. One of its essential roles is to check the status of the worker nodes by watching the heartbeats sent by each kubelet. There are some specific flags that describe this behavior, and by default these flags are part of the kube-controller-manager configuration:
--node-monitor-period - the interval at which the node controller checks node status (default 5s);
--node-monitor-grace-period - how long the controller manager considers the kubelet healthy without hearing from it before marking the node unhealthy (default 40s);
--pod-eviction-timeout - the grace period before deleting pods on failed nodes (default 5m).
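To see which of these values your cluster is actually using, a quick sketch (assuming a kubeadm-style control plane where kube-controller-manager runs as a static pod labelled component=kube-controller-manager; on managed control planes these flags are usually not visible):
$ kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep -E 'node-monitor|pod-eviction'
# if nothing matches, the defaults listed above are in effect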
Whenever you want to mitigate a DNS Pod outage in case a Node goes down, you should consider these options. You can also take a look at the DNS horizontal autoscaler in order to keep a stable replica count for the DNS Pods; however, it brings some additional machinery to run, which can consume more compute resources in the cluster.
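A minimal sketch of tuning the DNS horizontal autoscaler, assuming the standard cluster-proportional-autoscaler deployment that reads a ConfigMap named kube-dns-autoscaler in kube-system (names and numbers are illustrative):
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |
    {"coresPerReplica": 256, "nodesPerReplica": 16, "preventSinglePointFailure": true, "min": 3}
EOF
The min of 3 keeps at least three DNS replicas even on a small cluster, and preventSinglePointFailure forces at least two replicas whenever there is more than one node.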

How to scale out a Kubernetes Cluster based on allocatable resources?

I am looking for a way to add a new worker node to my Kubernetes cluster when its allocatable resources fall below a specified minimum.
To scale out a cluster automatically I found the cluster-autoscaler, but the documentation says it only scales out if:
there are pods that failed to run in the cluster due to insufficient resources
Therefore, a new node gets started only after a pod is in the Pending state. The startup of a new node takes about 2 minutes, and I would like to avoid a 2-minute pending time for a pod.
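If you want to observe that trigger in practice, a quick sketch (the cluster-autoscaler-status ConfigMap name below is the autoscaler's default and may differ in your deployment):
$ kubectl get pods -A --field-selector=status.phase=Pending                # the unschedulable pods that trigger a scale-up
$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml   # the autoscaler's own view of scale-up/scale-down activity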

Node status changes to unknown on a high resource requirement pod

I have a Jenkins deployment pipeline which involves the Kubernetes plugin. Using the Kubernetes plugin, I create a slave pod for building a Node.js application with Yarn. The requests and limits for CPU and memory are set.
When the Jenkins master schedules the slave, sometimes (I haven't seen a pattern so far) the pod makes the entire node unreachable and changes the node's status to Unknown. On careful inspection in Grafana, the CPU and memory usage seem to be well within range, with no visible spike. The only spike that occurs is in disk I/O, which peaks at ~4 MiB.
I am not sure if that is the reason for the node being unable to present itself as a cluster member. I need help with a few things here:
a) How to diagnose in depth the reasons for the node leaving the cluster.
b) If the reason is disk IOPS, are there any default requests/limits for IOPS at the Kubernetes level?
PS: I am using EBS (gp2)
As per the docs, for the node to be 'Ready':
True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
It would seem that when you run your workloads, your kube-apiserver doesn't hear from your node (kubelet) for 40 seconds. There could be multiple reasons; some things that you can try:
To see the 'Events' on your node, run:
$ kubectl describe node <node-name>
To check whether there is anything unusual on your kube-apiserver, run the following on your active master:
$ docker logs <container-id-of-kube-apiserver>
To check whether there is anything unusual in your kube-controller-manager when your node goes into the 'Unknown' state, run the following on your active master:
$ docker logs <container-id-of-kube-controller-manager>
Increase the --node-monitor-grace-period option of your kube-controller-manager. You can add it to the command line in /etc/kubernetes/manifests/kube-controller-manager.yaml; since that is a static pod manifest, the kubelet will recreate the kube-controller-manager container once the file is saved.
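A sketch of that change, assuming a kubeadm-style static pod manifest; 80s is only an example value:
$ sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# under spec.containers[0].command, add or adjust:
#   - --node-monitor-grace-period=80s
# the kubelet watches this directory and recreates the pod with the new flag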
When the node is in the 'Unknown' state, can you ssh into it and see if you can reach the kube-apiserver? Try both the <master-ip>:6443 and the kubernetes.default.svc.cluster.local:443 endpoints.
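A minimal sketch of those checks (-k skips certificate verification; even a 403 response proves network connectivity, and the service hostname resolves only through the cluster DNS, so run the second check from a pod on the affected node):
$ curl -k https://<master-ip>:6443/healthz
$ curl -k https://kubernetes.default.svc.cluster.local:443/healthz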
Considering that the node was previously working and only recently stopped showing the Ready status, restart your kubelet service. Just ssh into the affected node and execute:
$ /etc/init.d/kubelet restart
(or, on systemd-based systems, systemctl restart kubelet)
Back on your master node, run kubectl get nodes to check whether the node is Ready now.