How to rescue a Kubernetes cluster when multiple pods enter "unknown" status - kubernetes

I am trying to understand the lessons from a failed K8s cluster. I am running Microk8s 1.22.5. I had 3 rock solid (physical) nodes. I tried to add a fourth node (KVM guest) to satisfy the requirements of Minio. Within 24h, the KVM host had entered "unknown" status together with its pods. Within 48h, multiple pods on all of the nodes had "unknown" status. Most of the deployments and statefulsets are down, including multiple DBs (postgres, Elastic) so it's really painful (extra tips on how to save these are welcome). According to the official docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or
'Unknown' state after a timeout. Pods may also enter these states when
the user attempts graceful deletion of a Pod on an unreachable Node.
The only ways in which a Pod in such a state can be removed from the
apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod
and removes the entry from the apiserver.
Force deletion of the Pod by
the user. The recommended best practice is to use the first or second
approach. If a Node is confirmed to be dead (e.g. permanently
disconnected from the network, powered down, etc), then delete the
Node object. If the Node is suffering from a network partition, then
try to resolve this or wait for it to resolve. When the partition
heals, the kubelet will complete the deletion of the Pod and free up
its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer
running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
So I tried draining the node (option 1), but no dice. I get some error about not being able to violate a disruption budget. Option 2 is not happening and option 3 has no effect. It looks like the failing node poisoned the whole cluster. Any advice on how to avoid this in the future? Many thanks

Related

Can we have --pod-eviction-timeout=300m?

I have a k8s cluster, in our cluster we do not want the pods to get evicted, because pod eviction causes lot of side effects to the applications running on it.
To prevent pod eviction from happening, we have configured all the pods as Guaranteed QoS. I know even with this the pod eviction can happen if there are any resource starvation in the system. We have monitors to alert us when there are resource starvation within the pod and node. So we get to know way before a pod gets evicted. This helps us in taking measures before pod gets evicted.
The other reasons for pod eviction to happen is if the node is in not-ready state, then kube-controller-manager will check the pod-eviction-timeout and it will evict the pods after this timeout. We have monitor to alert us when the node goes to not-ready state. now after this alert we wanted to take some measures to clean-up from application side, so the application will end gracefully. To do this clean-up we need more than few hours, but pod-eviction-timeout is by default 5 minutes.
Is it fine to increase the pod eviction timeout to 300m? what are the impacts of increasing this timeout to such a limit?
P.S: I know during this wait time, if the pod utilises more resources, then kubelet can itself evict this pod. I wanted to know what other impact of waiting for such a long time?
As #coderanger said, your limits are incorrect and this should be fixed instead of lowering self-healing capabilities of Kubernetes.
If your pod dies no matter what was the issue with it, by default it will be rescheduled based on your configuration.
If you are having a problem with this then I would recommend redoing your architecture and rewriting the app to use Kubernetes how it's supposed to be used.
if you are getting problems with a pod still being send requests when it's unresponsive, you should implement a LB in front or queue the requests,
if you are getting a problem with IPs that are being changed after pod restarts, this should be fixed by using DNS and service instead of connecting directly to a pod,
if your pod is being evicted check why, make the limits and requests,
As for the node, there is a really nice blog post about Improving Kubernetes reliability: quicker detection of a Node down, it's opposite of what you are thinking of doing but it also mentions why 340s is too much
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on –pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to change default values to higher you can look into changing these:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
to higher ones.
If you provide more details I'll try to help more.

kube-dns. High Availability. Error handling in kuberntes

I have a kubernetes cluster with several nodes. I have kube-dns running in 3 nodes.
The issue I'm having is that if 1 of those 3 nodes goes down the requests between my pods/containers start to fail more or less 1 of 3 times.
This is because when the container resolve a k8s service hostname it calls the kube-dns service to resolve that hostname and the kube-dns k8s services has three endpoints but one of those three endpoints is not valid as the node is down. K8s does not update the service until it detects the node is down. (Currently I have that time set to 60 seconds).
Any idea about how to mitigate this? Is there any kind of retry that could be configured outside the application? Something in the container or at k8s level.
Thank you.
The main contributor for communication between underlying Kubernetes resources on the particular Node and kube-apiserver is kubelet. Its role can be determined as a Node agent. Therefore, kubelet plays a significant role in the cluster life cycle, due to primary duties like managing liveness and readiness probes for the nested objects, updating ETCD storage in order to write metadata for the resources and periodically refreshing own health status to kube-apiserver, specified by --node-status-update-frequency flag in kubelet configuration.
--node-status-update-frequency duration Specifies how often kubelet posts node status to master. Note: be cautious when changing the
constant, it must work with nodeMonitorGracePeriod in nodecontroller.
(default 10s)
However, there is a specific component in Kubernetes called Node controller. One of the essential roles of Node controller is to check the status of the involved workers by controlling relevant heartbeat from kubelet. There are some specific flags that describe this behavior and by default these flags have been included in kube-controller-manager configuration:
--node-monitor-period - Check kubelet status with specified time
interval (default value 5s);
--node-monitor-grace-period - The time that Kubernetes controller
manager considers healthy status of Kubelet (default value 40s);
--pod-eviction-timeout - The grace timeout for deleting pods on
failed nodes (default value 5m).
Whenever you want to mitigate DNS Pods outage, in case a Node goes down, you should consider these options. You can also take a look at DNS horizontal autoscaller in order to align to stable replica count for DNS Pods, however it brings some additional logic structure to be implemented, which can consume more compute resources on the cluster engine.

Kubernetes can't detect unhealthy node

I am shutting down my k8s node manually to see if this affect the master.
After shutdown I check status of nodes:
kubectl get nodes
The node which went down is still seen Ready in Status. As a consequence k8s still tries to schedule pods on this node but actually cannot. And even worst it doesn't reschedule pods on other healthy nodes.
After a while (5-10 mins) k8s notices the node has gone.
Is that expected behavior? If not how can I fix this?
I did research do find out how K8s checks node health, I couldn't find anything valuable.
I found the problem myself.
I was cutting connection at network layer with firewall rules. Since kubelet opened a session before new deny rules node was seen Ready. As it was ready it was receiving traffic. And the traffic would be blocked by the new rules since they have no open session.
So this inconsistency happens only when you change firewall rules.

What makes a kubernetes node unhealthy?

We've experienced 4 AUTO_REPAIR_NODES events(revealed by the command gcloud container operations list) on our GKE cluster during the past 1 month. The consequence of node-auto-repair is that the node gets recreated and gets attached a new external IP, and the new external IP, which was not whitelisted by third-party services, eventually caused failure of services running on that the new node.
I noticed that we have "Automatic node repair" enabled in our Kubernetes cluster and felt tempted to disable that, but before I do that, I need to know more about the situation.
My questions are:
What are some common causes that makes a node unhealthy in the first place? I'm aware of this article https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair#node_repair_process which says, "a node reports a NotReady status on consecutive checks over the given time threshold" would trigger auto repair. But what could cause a node to become NotReady?
I'm also aware of this article https://kubernetes.io/docs/concepts/architecture/nodes/#node-status which mentions the full list of node status: {OutOfDisk, Ready, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable, ConfigOK}. I wonder, if any of {OutOfDisk, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable} becomes true for a node, would that node becomes NotReady?
What negative consequences could I get after I disable "Automatic node repair" in the cluster? I'm basically wondering whether we could end up in a worse situation than auto-repaired nodes and newly-attached-not-whitelisted IP. Once "Automatic node repair" is disabled, then for the pods that are running on an Unhealthy node that would've been auto-repaired, would Kubernetes create new pods on other nodes?
The confusion lies here in that there are 'Ready' and 'NotReady' states that are shown when you run kubectl get nodes which are reported by the kube-apiserver. But these are independent and unclear from the docs how they relate to the kubelet states described here
You can also see the kubelet states (in events) when you run kubectl describe nodes
To answer some parts of the questions:
As reported by the kube-apiserver
Kubelet down
docker or containerd or crio down (depending on the shim you are using)
kubelet states - unclear.
For these, the kubelet will start evicting or not scheduling pods except for Ready (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/). Unclear from the docs how these get reported from the kubeapi-server.
You could have nodes on your cluster not being used and you'd be paying for that usage.
Yes, k8s will reschedule the pods after a certain readiness probes fail (configurable). If the kubelet is down or the node down k8s will think the pods are down.
Assuming your nodes go down, you could end up with less capacity than what you need to schedule your workloads to k8s would not be able to schedule them anyway.
Hope it helps!
Not my answer, but this answer on SF points in the right direction, regarding using a NAT gateway and whitelisting that IP
https://serverfault.com/a/930963/429795

Killing containers in an isolated minon

What is the strategy of kubelet managing containers within the machine if the connection with the master is broken due to some network issue? Is it possible to configure kubelet to kill all containers in such a situation?
Nodes in Kubernetes checkin with the master on regular intervals. If they fail to check in AND the master is still up then the pod eviction timout flag comes into play.
It basically waits this time before the pods are rescheduled elsewhere in the cluster. This is common to wait if the machine is just rebooting or something similar.
The flag is in the controller manager: --pod-eviction-timeout=5m0s: The grace period for deleting pods on failed nodes.
The second scenario is when the master goes down (or more specifically the controller-manager). If it stops responding then the cluster will still function as-is without interuption.