Killing containers in an isolated minion - Kubernetes

What is the kubelet's strategy for managing containers on the machine if the connection with the master is broken due to a network issue? Is it possible to configure the kubelet to kill all containers in such a situation?

Nodes in Kubernetes check in with the master at regular intervals. If they fail to check in AND the master is still up, then the pod eviction timeout flag comes into play.
The controller manager basically waits that long before the pods are rescheduled elsewhere in the cluster; the wait exists in case the machine is just rebooting or something similar.
The flag is on the controller manager: --pod-eviction-timeout=5m0s: The grace period for deleting pods on failed nodes.
The second scenario is when the master goes down (or, more specifically, the controller manager). If it stops responding, the cluster will keep functioning as-is without interruption.
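As a minimal sketch of where that flag lives, assuming a kubeadm-style control plane where the controller manager runs as a static pod (the manifest path below is the kubeadm default):
# Edit the controller-manager static pod manifest on a control-plane node.
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# Under spec.containers[0].command, add or adjust the flag, e.g.:
#   - --pod-eviction-timeout=1m0s
# The kubelet watches the manifest directory and restarts the controller manager
# with the new flag automatically.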

Related

How to rescue a Kubernetes cluster when multiple pods enter "unknown" status

I am trying to understand the lessons from a failed K8s cluster. I am running MicroK8s 1.22.5. I had 3 rock-solid (physical) nodes. I tried to add a fourth node (a KVM guest) to satisfy the requirements of MinIO. Within 24h, the KVM node had entered "unknown" status together with its pods. Within 48h, multiple pods on all of the nodes had "unknown" status. Most of the deployments and statefulsets are down, including multiple DBs (Postgres, Elastic), so it's really painful (extra tips on how to save these are welcome). According to the official docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
1. The Node object is deleted (either by you, or by the Node Controller).
2. The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
3. Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
So I tried draining the node (option 1), but no dice: I get an error about not being able to violate a disruption budget. Option 2 is not happening and option 3 has no effect. It looks like the failing node poisoned the whole cluster. Any advice on how to avoid this in the future? Many thanks.
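For context, a rough sketch of the commands behind the options quoted above; the node and pod names are placeholders, and note that kubectl drain has a --disable-eviction flag that bypasses PodDisruptionBudgets, so it should only be used during a rescue:
# Option 1: delete the Node object for a node that is confirmed dead.
kubectl delete node <dead-node-name>

# Option 3: force-delete a Pod that is stuck in Unknown/Terminating.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

# Draining normally respects PodDisruptionBudgets; to drain anyway during a rescue:
kubectl drain <dead-node-name> --ignore-daemonsets --force --disable-eviction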

How to ignore terminationGracePeriod when the host is down?

I have a Cassandra StatefulSet in my Kubernetes cluster with a high terminationGracePeriod to handle data handover.
The problem is that when a host machine goes down, K8s waits the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule pods immediately?
The problem is that when a host machine goes down, K8s waits the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
I think this is a wrong assumption. When a host machine goes down, the node health check is used to detect this; typically this takes about 5 minutes. Only after that are the pods scheduled to other nodes.
See Node Condition and pod eviction:
If the status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), then the node controller triggers API-initiated eviction for all Pods assigned to that node. The default eviction timeout duration is five minutes.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule pods immediately?
I don't think terminationGracePeriod is related to this. A pod gets a SIGTERM to shut down; only if it has not shut down successfully within the whole terminationGracePeriod is it killed with SIGKILL.
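As a small sketch (the StatefulSet name "cassandra" is assumed), the grace period lives in the pod template and can be inspected or lowered, which does not change how node failures are handled:
# Show the current terminationGracePeriodSeconds in the pod template.
kubectl get statefulset cassandra -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

# Lower it (this only changes the SIGTERM -> SIGKILL window on a healthy kubelet;
# it does not speed up eviction from a dead node).
kubectl patch statefulset cassandra -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":30}}}}'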

Endpoint update very slow when shutting down a node

When I run shutdown -h now to shut down a node in the Kubernetes cluster, the endpoint updates its state after about 40 seconds, but when I run kubectl delete pod POD-NAME, the endpoint updates its state very quickly. Can anyone explain why?
When you "shutdown" a node, you should do it gracefully with kubectl drain. This will evict the pods in a controlled manner and this should be more friendly to your traffic.
The article Kubernetes best practices: terminating with grace has a detailed description on all steps that happen when a Pod is gracefully terminated. For all planned maintenance, use gracefully shutdown - for unplanned maintenance you can not do much.
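A minimal drain sketch, assuming a node named my-node and that emptyDir data may be discarded:
# Cordon the node and evict its pods before planned maintenance.
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data
# ... shut down, patch, reboot ...
# Allow scheduling on the node again once it is back.
kubectl uncordon my-node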
I found the solution from here:
It may be caused by the node being marked as NotReady after a grace period. This is configurable. After that, the pod will be terminating and rescheduled.
--node-monitor-grace-period duration Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status.
refer to: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
When I changed the value of --node-monitor-grace-period to 10s, the endpoint updated its state more quickly.
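A quick way to confirm the flag on a kubeadm-style control plane (the manifest path is assumed); keep in mind the documented constraint that the value should be N times the kubelet's nodeStatusUpdateFrequency (10s by default), so 10s leaves the kubelet no retries:
# Check whether the controller manager already carries the flag.
grep node-monitor-grace-period /etc/kubernetes/manifests/kube-controller-manager.yaml
# Expected once set:
#   - --node-monitor-grace-period=10s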

K8s pods stuck in terminating state after worker node is shut down

Whenever I shut down a worker node, the pods that were running on the node get stuck in a "Terminating 1/1" state. After the default 5 minutes for the probe check, the pods are redeployed onto healthy worker nodes, but the pods from the previously shut-down node still show as 1/1 and Terminating. They stay in this state indefinitely. Is there any way to make this process cleaner, so that whenever the pods are redeployed to new worker nodes, the old pods are removed and not left in the Terminating state?
This is expected behavior: the pods need to stay in the Terminating state until the node comes back, so that the master remembers to tell the node to stop these pods and to gather all remaining logs, events, etc. This is because a node can go into a NotReady state not only because of a shutdown but also, for example, because of a temporary network partition, in which case the pods would still be running there once the link is back.
I've had the same problem and had to push a change to K8s to fix it.
The change garbage collects pods in such a state. No matter what, even if the node comes back before the pod's terminationGracePeriodSeconds is over, as soon as the pod is marked "Terminating" it will be deleted. Of course, it's always better to terminate gracefully, since that allows resources to be released safely.
This is to be used carefully. On my side, I'm dealing with embedded systems where the nodes must always be up together, so it makes sense to terminate pods stuck in such a state, especially when some of those pods are attached to a ReadWriteOnce volume, which would prevent any other pod from taking over.
Pull request is here: https://github.com/kubernetes/kubernetes/pull/103916
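Until something like that lands in your cluster, a rough way to list the pods stuck this way (requires jq; this is just a diagnostic sketch, not part of the linked change):
# Pods whose API objects carry a deletionTimestamp but are still present,
# i.e. the ones shown as Terminating.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.namespace)/\(.metadata.name)"'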

Kubernetes can't detect unhealthy node

I am shutting down my k8s node manually to see if this affects the master.
After shutdown I check status of nodes:
kubectl get nodes
The node which went down is still shown as Ready in its status. As a consequence, k8s still tries to schedule pods on this node but actually cannot. And even worse, it doesn't reschedule the pods onto other healthy nodes.
After a while (5-10 mins) k8s notices the node has gone.
Is that expected behavior? If not how can I fix this?
I did some research to find out how K8s checks node health, but I couldn't find anything valuable.
I found the problem myself.
I was cutting the connection at the network layer with firewall rules. Since the kubelet had opened a session before the new deny rules were added, the node was still seen as Ready. And because it was Ready, it kept receiving traffic, but that traffic was blocked by the new rules since those connections had no established session.
So this inconsistency happens only when you change firewall rules.
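If you want to watch this happen, a small sketch (the node name is a placeholder) that prints the Ready condition and its last heartbeat, which is what the controller manager's node monitor acts on:
# Show the Ready condition status and the last heartbeat the control plane received.
kubectl get node my-node -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"  "}{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}'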