Node status changes to unknown on a high resource requirement pod - kubernetes

I have a Jenkins deployment pipeline that uses the Kubernetes plugin. With the Kubernetes plugin I create a slave pod that builds a Node application with yarn. The requests and limits for CPU and memory are set.
When the Jenkins master schedules the slave, sometimes (I haven't spotted a pattern so far) the pod makes the entire node unreachable and changes the node's status to Unknown. On closer inspection in Grafana, CPU and memory usage seem to be well within range, with no visible spike. The only spike is in disk I/O, which peaks at ~4 MiB.
I am not sure whether that is the reason the node can no longer report itself as a cluster member. I need help with a few things here:
a) How do I diagnose in depth why the node leaves the cluster?
b) If the reason is disk IOPS, are there any default requests/limits for IOPS at the Kubernetes level?
PS: I am using EBS (gp2)
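For reference, the requests and limits are set on the slave pod's build container along these lines (the values below are placeholders, not the exact ones in use):

  resources:
    requests:
      cpu: "500m"      # placeholder values
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"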

As per the docs, for the node to be 'Ready':
True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
It would seem that when you run your workloads, your kube-apiserver doesn't hear from your node (kubelet) within 40 seconds. There could be multiple reasons; some things that you can try:
To see the 'Events' on your node, run:
$ kubectl describe node <node-name>
Check whether you see anything unusual on your kube-apiserver. On your active master run:
$ docker logs <container-id-of-kube-apiserver>
Check whether you see anything unusual on your kube-controller-manager when your node goes into the 'Unknown' state. On your active master run:
$ docker logs <container-id-of-kube-controller-manager>
Increase the --node-monitor-grace-period option in your kube-controller-manager. You can add it to the command line in the /etc/kubernetes/manifests/kube-controller-manager.yaml and restart the kube-controller-manager container.
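A minimal sketch of that change in the static pod manifest, assuming a kubeadm-style layout (only the relevant lines are shown; 1m0s is just an example value):

  # /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
  spec:
    containers:
    - command:
      - kube-controller-manager
      - --node-monitor-grace-period=1m0s   # example value; default is 40s
      # ...keep the rest of the existing flags as they are

Because static pod manifests are watched by the kubelet, saving the file is normally enough to restart the kube-controller-manager container.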
When the node is in the 'Unknown' state, can you ssh into it and see if you can reach the kube-apiserver? Both on <master-ip>:6443 and also the kubernetes.default.svc.cluster.local:443 endpoint.
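For example, something along these lines from the affected node (even a 401/403 response proves the API server is reachable over the network; the service DNS name will only resolve if the node queries the cluster DNS):

  $ curl -k https://<master-ip>:6443/healthz
  $ curl -k https://kubernetes.default.svc.cluster.local:443/healthz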

Considering that the node was previously working and only recently stopped showing the Ready status, restart your kubelet service. Just ssh into the affected node and execute:
/etc/init.d/kubelet restart
Back on your master node, run kubectl get nodes to check whether the node is working now.
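On distributions where the kubelet runs as a systemd unit instead of a SysV init script, the equivalent would be:

  $ sudo systemctl restart kubelet
  $ sudo systemctl status kubelet   # confirm it came back up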

Related

How to rescue a Kubernetes cluster when multiple pods enter "unknown" status

I am trying to understand the lessons from a failed K8s cluster. I am running MicroK8s 1.22.5. I had 3 rock-solid (physical) nodes. I tried to add a fourth node (a KVM guest) to satisfy the requirements of MinIO. Within 24h, the KVM node had entered "unknown" status together with its pods. Within 48h, multiple pods on all of the nodes had "unknown" status. Most of the deployments and statefulsets are down, including multiple DBs (Postgres, Elastic), so it's really painful (extra tips on how to save these are welcome). According to the official docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc.), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
So I tried draining the node (option 1), but no dice: I get an error about not being able to violate a disruption budget. Option 2 is not happening, and option 3 has no effect. It looks like the failing node poisoned the whole cluster. Any advice on how to avoid this in the future? Many thanks.
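For reference, the drain and force-deletion attempts mentioned above look roughly like this (node, pod and namespace names are placeholders; --force with --grace-period=0 bypasses graceful termination and should be a last resort):

  $ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  $ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force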

Kubernetes cluster recovery after linux host reboot

We are still in the design phase of moving away from a monolithic architecture towards microservices with Docker and Kubernetes. We did some basic research on Docker and Kubernetes and have a basic understanding. We still have a couple of open questions, considering we will be creating a K8s cluster with multiple Linux hosts (for various reasons we can't consider the cloud right now).
Consider a scenario where we have a K8s cluster spanning multiple Linux hosts (5+).
1) If one of the Linux worker nodes crashes and we bring it back up, will enabling the kubelet via systemctl in advance be sufficient to bring up the required K8s components so that the node is detected by the master again?
2) I believe that once a worker node crashes (with X pods on it), after the pod eviction timeout the master will reschedule those X pods onto other healthy node(s). Once the node is back up, it won't get those X pods redeployed, since the master has already scheduled them onto other nodes, but it will be ready to accept new work from the master.
Is this correct?
Yes, that should be the default behavior; check your cluster deployment tool.
Yes, Kubernetes handles this automatically for Deployments. For StatefulSets (with local volumes) and DaemonSets, things can be node-specific and Kubernetes will wait for the node to come back.
It's best to create a test environment and actually test these failure scenarios.
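For question 1, "enabling kubelet as part of systemctl" typically amounts to the following on a systemd-based host, so the kubelet starts automatically after a reboot and re-registers the node with the master:

  $ sudo systemctl enable kubelet   # start on boot
  $ sudo systemctl start kubelet
  $ sudo systemctl status kubelet   # verify it is running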

kube-dns. High Availability. Error handling in Kubernetes

I have a Kubernetes cluster with several nodes, and I have kube-dns running on 3 of them.
The issue I'm having is that if 1 of those 3 nodes goes down, requests between my pods/containers start to fail roughly one out of three times.
This is because when a container resolves a k8s service hostname, it calls the kube-dns service to resolve that hostname, and the kube-dns k8s service has three endpoints, but one of those three endpoints is no longer valid because its node is down. K8s does not update the service until it detects that the node is down. (Currently I have that time set to 60 seconds.)
Any idea how to mitigate this? Is there any kind of retry that could be configured outside the application? Something in the container or at the k8s level?
Thank you.
The main component responsible for communication between the Kubernetes resources on a particular Node and the kube-apiserver is the kubelet, which acts as the Node agent. The kubelet therefore plays a significant role in the cluster life cycle, with primary duties such as managing liveness and readiness probes for the objects it runs, reporting resource metadata to the kube-apiserver (which persists it in etcd), and periodically refreshing its own health status to the kube-apiserver at the interval specified by the --node-status-update-frequency flag in the kubelet configuration.
--node-status-update-frequency duration - Specifies how often kubelet posts node status to master. Note: be cautious when changing the constant, it must work with nodeMonitorGracePeriod in nodecontroller. (default 10s)
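If the kubelet is managed through a configuration file rather than command-line flags, the same setting lives in the KubeletConfiguration object; a minimal sketch, assuming the common /var/lib/kubelet/config.yaml location (the path varies by installation):

  # /var/lib/kubelet/config.yaml (excerpt)
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  nodeStatusUpdateFrequency: "10s"   # how often the kubelet reports node status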
However, there is a specific component in Kubernetes called the Node controller. One of its essential roles is to check the status of the worker nodes by watching the heartbeats coming from the kubelet. There are specific flags that describe this behavior, and by default they are part of the kube-controller-manager configuration:
--node-monitor-period - the interval at which the Node controller checks kubelet status (default 5s);
--node-monitor-grace-period - how long the controller manager keeps considering the kubelet healthy without hearing from it (default 40s);
--pod-eviction-timeout - the grace period for deleting pods on failed nodes (default 5m).
Whenever you want to mitigate a DNS Pod outage caused by a Node going down, these are the options to consider tuning. You can also take a look at the DNS horizontal autoscaler to keep a stable replica count for the DNS Pods; however, it adds another component to run, which can consume additional compute resources on the cluster.
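Another knob at the container level is the resolver's own retry behaviour: the Pod dnsConfig maps directly to resolv.conf options, so a lookup that times out against a dead endpoint is retried quickly (and the retry will typically land on a healthy endpoint). A minimal sketch with illustrative names and values:

  apiVersion: v1
  kind: Pod
  metadata:
    name: dns-retry-example        # hypothetical name
  spec:
    containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
    dnsConfig:
      options:
      - name: timeout              # seconds to wait for a reply before retrying
        value: "1"
      - name: attempts             # how many times the resolver sends the query
        value: "3"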

What makes a kubernetes node unhealthy?

We've experienced 4 AUTO_REPAIR_NODES events (revealed by the command gcloud container operations list) on our GKE cluster during the past month. The consequence of node auto-repair is that the node gets recreated with a new external IP attached, and the new external IP, which was not whitelisted by third-party services, eventually caused failures of the services running on that new node.
I noticed that we have "Automatic node repair" enabled in our Kubernetes cluster and felt tempted to disable that, but before I do that, I need to know more about the situation.
My questions are:
What are some common causes that make a node unhealthy in the first place? I'm aware of this article https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair#node_repair_process which says that "a node reports a NotReady status on consecutive checks over the given time threshold" would trigger auto repair. But what could cause a node to become NotReady?
I'm also aware of this article https://kubernetes.io/docs/concepts/architecture/nodes/#node-status which mentions the full list of node conditions: {OutOfDisk, Ready, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable, ConfigOK}. I wonder: if any of {OutOfDisk, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable} becomes true for a node, would that node become NotReady?
What negative consequences could I face after disabling "Automatic node repair" in the cluster? I'm basically wondering whether we could end up in a worse situation than auto-repaired nodes and newly attached, non-whitelisted IPs. Once "Automatic node repair" is disabled, then for the pods running on an unhealthy node that would otherwise have been auto-repaired, would Kubernetes create new pods on other nodes?
The confusion here is that the 'Ready' and 'NotReady' states shown when you run kubectl get nodes are reported by the kube-apiserver. But these are independent, and it is unclear from the docs how they relate to the kubelet-level states.
You can also see the kubelet states (in events) when you run kubectl describe nodes
To answer some parts of the questions:
What can make a node NotReady, as reported by the kube-apiserver: the kubelet being down, or docker/containerd/cri-o being down (depending on the shim you are using).
The kubelet condition states are less clear. For those, the kubelet will start evicting or not scheduling pods, except for the Ready condition (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/). It's unclear from the docs how these get reported to the kube-apiserver.
On disabling auto-repair: you could have unhealthy nodes sitting in your cluster not being used, and you'd still be paying for that usage.
Yes, k8s will reschedule the pods after certain readiness probes fail (configurable); if the kubelet is down or the node is down, k8s will consider the pods down as well.
Assuming your nodes go down, you could end up with less capacity than you need to schedule your workloads, so k8s would not be able to schedule them anyway.
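If you do decide to turn auto-repair off, it is configured per node pool; roughly like this with gcloud (pool, cluster and zone are placeholders, and the exact flag may depend on your gcloud version):

  $ gcloud container node-pools update <pool-name> \
      --cluster <cluster-name> --zone <zone> \
      --no-enable-autorepair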
Hope it helps!
Not my answer, but this answer on SF points in the right direction, regarding using a NAT gateway and whitelisting that IP
https://serverfault.com/a/930963/429795

GKE, automatic restart of stuck node

Sometimes a node backing a GKE cluster goes down, with NotReady status:
$ kubectl get nodes
NAME                        STATUS     AGE   VERSION
gke-my-pool-f8045547-60gw   Ready      10d   v1.6.2
gke-my-pool-f8045547-7c7e   NotReady   10d   v1.6.2
A node can stay stuck in NotReady for days, until I restart it manually.
I have a health check for my pods, so all of them move to other nodes, but the problem is that this stale node still has GCE disks attached. As a result, some of the pods are unable to start on any of the other nodes until I manually detach the disks (or restart the stale node).
This basically kills the whole idea of Kubernetes, because it happens a few times a day, so I have to babysit it all day. Is there any way to configure Kubernetes or GCE to automate this? The simplest way would be an automatic restart of NotReady nodes, but it seems there is no way to configure a health check for the nodes themselves. Another option would be to automatically detach a disk when it is requested from another machine, but I don't see any way to configure that either.
GKE has a node auto-repair functionality that will monitor the node's health status and trigger an automatic repair event (currently a node recreation for NotReady nodes). It's currently in Beta, but you can try it: https://cloud.google.com/container-engine/docs/node-auto-repair
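Enabling it on an existing node pool is a one-liner with gcloud (names are placeholders; while the feature is in Beta the command may need the beta component, depending on your gcloud version):

  $ gcloud container node-pools update <pool-name> \
      --cluster <cluster-name> --zone <zone> \
      --enable-autorepair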