GKE, automatic restart of stuck node - kubernetes

Sometimes a node backing GKE cluster goes down, with NotReady status:
$ kubectl get nodes
NAME STATUS AGE VERSION
gke-my-pool-f8045547-60gw Ready 10d v1.6.2
gke-my-pool-f8045547-7c7e NotReady 10d v1.6.2
Node can stuck for days in NotReady, until I manually restart it.
I have a Health check for my pods, so all of them go to other nodes, but the problem that this stale node still has GCE disks attached. So some of pods are unable to start on any of other nodes, until I manually detach disks (or restart stale node).
This basically kills whole idea of Kubernetes, because it this happens few times a day, so I have to babysit it whole day. Is there any way to configure Kubernetes or GCE to automate this? Most simple way would be automatic restart of NotReady nodes, but it seems that there no way to configure health check for nodes itself. Another option would be automatic unmount of disks, when it requested from another machine, but I don't see any way to configure that too.

GKE has a node auto-repair functionality that will monitor the node's health status and trigger an automatic repair event (currently a node recreation for NotReady nodes). It's currently in Beta, but you can try it: https://cloud.google.com/container-engine/docs/node-auto-repair

Related

How to rescue a Kubernetes cluster when multiple pods enter "unknown" status

I am trying to understand the lessons from a failed K8s cluster. I am running Microk8s 1.22.5. I had 3 rock solid (physical) nodes. I tried to add a fourth node (KVM guest) to satisfy the requirements of Minio. Within 24h, the KVM host had entered "unknown" status together with its pods. Within 48h, multiple pods on all of the nodes had "unknown" status. Most of the deployments and statefulsets are down, including multiple DBs (postgres, Elastic) so it's really painful (extra tips on how to save these are welcome). According to the official docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or
'Unknown' state after a timeout. Pods may also enter these states when
the user attempts graceful deletion of a Pod on an unreachable Node.
The only ways in which a Pod in such a state can be removed from the
apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod
and removes the entry from the apiserver.
Force deletion of the Pod by
the user. The recommended best practice is to use the first or second
approach. If a Node is confirmed to be dead (e.g. permanently
disconnected from the network, powered down, etc), then delete the
Node object. If the Node is suffering from a network partition, then
try to resolve this or wait for it to resolve. When the partition
heals, the kubelet will complete the deletion of the Pod and free up
its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer
running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
So I tried draining the node (option 1), but no dice. I get some error about not being able to violate a disruption budget. Option 2 is not happening and option 3 has no effect. It looks like the failing node poisoned the whole cluster. Any advice on how to avoid this in the future? Many thanks

Behaviour of Multi node Kubernetes cluster with a single master when the master goes down?

What would be the behavior of a multi node kubernetes cluster if it only has a single master node and if the node goes down?
The control plane would be unavailable. Existing pods would continue to run, however calls to the API wouldn't work, so you wouldn't be able to make any changes to the state of the system. Additionally self-repair systems like pods being restarted on failure would not happen since that functionality lives in the control plane as well.
You wouldn't be able to create or query kubernetes objects(pods, deployments etc) since the required control plane components(api-server and etcd) are not running.
Existing pods on the worker nodes will keep running. If a pod crashes, kubelet on that node would restart it as well.
If worker node goes down while master is down, even the pods created by a controllers like deployment/replicaset won't be re-scheduled to different node since controller-manager(control plane component) is not running.

Kubernetes Version Upgrades and Downtime

I just tested Ranche RKE , upgrading kubernetes 13.xx to 14.xx , during upgrade , an already running nginx Pod got restarted during upgrade. Is this expected behavior?
Can we have Kubernetes cluster upgrades without user pods restarting?
Which tool supports un-intruppted upgrades?
What are the downtimes that we can never aviod? ( apart from Control plane )
The default way Kubernetes upgrades is by doing a rolling upgrade of the nodes, one at a time.
This works by draining and cordoning (marking the node as unavailable for new deployments) each node that is being upgraded so that there no pods running on that node.
It does that by creating a new revision of the existing pods on another node (if it's available) and when the new pod starts running (and answering to the readiness/health probes), it stops and remove the old pod (sending SIGTERM to each pod container) on the node that was being upgraded.
The amount of time Kubernetes waits for the pod to graceful shutdown, is controlled by the terminationGracePeriodSeconds on the pod spec, if the pod takes longer than that, they are killed with SIGKILL.
The point is, to have a graceful Kubernetes upgrade, you need to have enough nodes available, and your pods must have correct liveness and readiness probes (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/).
Some interesting material that is worth a read:
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime (specific to GKE but has some insights)
https://blog.gruntwork.io/zero-downtime-server-updates-for-your-kubernetes-cluster-902009df5b33
Resolved By configuring the container runtime on the hosts not to restart containers on docker restart.

Node status changes to unknown on a high resource requirement pod

I have a Jenkins deployment pipeline which involves kubernetes plugin. Using kubernetes plugin I create a slave pod for building a node application using yarn. The requests and limits for CPU and Memory are set.
When the Jenkins master schedules the slave, sometimes (as I haven’t seen a pattern, as of now), the pod makes the entire node unreachable and changes the status of node to be Unknown. On careful inspection in Grafana, the CPU and Memory Resources seem to be well within the range with no visible spike. The only spike that occurs is with the Disk I/O, which peaks to ~ 4 MiB.
I am not sure if that is the reason for the node unable to address itself as a cluster member. I would be needing help in a few things here:
a) How to diagnose in depth the reasons for node leaving the cluster.
b) If, the reason is Disk IOPS, is there any default requests, limits for IOPS at Kubernetes level?
PS: I am using EBS (gp2)
As per the docs, for the node to be 'Ready':
True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
If would seem that when you run your workloads your kube-apiserver doesn't hear from your node (kubelet) in 40 seconds. There could be multiple reasons, some things that you can try:
To see the 'Events' in your node run:
$ kubectl describe node <node-name>
To see if you see anything unusual on your kube-apiserver. On your active master run:
$ docker logs <container-id-of-kube-apiserver>
To see if you see anything unusual on your kube-controller-manager when your node goes into 'Unknown' state. On your active master run:
$ docker logs <container-id-of-kube-controller-manager>
Increase the --node-monitor-grace-period option in your kube-controller-manager. You can add it to the command line in the /etc/kubernetes/manifests/kube-controller-manager.yaml and restart the kube-controller-manager container.
When the node is in the 'Unknown' state can you ssh into it and see if you can reach the kubeapi-server? Both on <master-ip>:6443 and also the kubernetes.default.svc.cluster.local:443 endpoints.
Considering that the node was previously working and recently stopped showing the ready status restart your kubelet service. Just ssh into the affected node and execute:
/etc/init.d/kubelet restart
Back on your master node run kubectl get nodes to check if the node is working now

What makes a kubernetes node unhealthy?

We've experienced 4 AUTO_REPAIR_NODES events(revealed by the command gcloud container operations list) on our GKE cluster during the past 1 month. The consequence of node-auto-repair is that the node gets recreated and gets attached a new external IP, and the new external IP, which was not whitelisted by third-party services, eventually caused failure of services running on that the new node.
I noticed that we have "Automatic node repair" enabled in our Kubernetes cluster and felt tempted to disable that, but before I do that, I need to know more about the situation.
My questions are:
What are some common causes that makes a node unhealthy in the first place? I'm aware of this article https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair#node_repair_process which says, "a node reports a NotReady status on consecutive checks over the given time threshold" would trigger auto repair. But what could cause a node to become NotReady?
I'm also aware of this article https://kubernetes.io/docs/concepts/architecture/nodes/#node-status which mentions the full list of node status: {OutOfDisk, Ready, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable, ConfigOK}. I wonder, if any of {OutOfDisk, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable} becomes true for a node, would that node becomes NotReady?
What negative consequences could I get after I disable "Automatic node repair" in the cluster? I'm basically wondering whether we could end up in a worse situation than auto-repaired nodes and newly-attached-not-whitelisted IP. Once "Automatic node repair" is disabled, then for the pods that are running on an Unhealthy node that would've been auto-repaired, would Kubernetes create new pods on other nodes?
The confusion lies here in that there are 'Ready' and 'NotReady' states that are shown when you run kubectl get nodes which are reported by the kube-apiserver. But these are independent and unclear from the docs how they relate to the kubelet states described here
You can also see the kubelet states (in events) when you run kubectl describe nodes
To answer some parts of the questions:
As reported by the kube-apiserver
Kubelet down
docker or containerd or crio down (depending on the shim you are using)
kubelet states - unclear.
For these, the kubelet will start evicting or not scheduling pods except for Ready (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/). Unclear from the docs how these get reported from the kubeapi-server.
You could have nodes on your cluster not being used and you'd be paying for that usage.
Yes, k8s will reschedule the pods after a certain readiness probes fail (configurable). If the kubelet is down or the node down k8s will think the pods are down.
Assuming your nodes go down, you could end up with less capacity than what you need to schedule your workloads to k8s would not be able to schedule them anyway.
Hope it helps!
Not my answer, but this answer on SF points in the right direction, regarding using a NAT gateway and whitelisting that IP
https://serverfault.com/a/930963/429795