How to automatically force delete pods stuck in 'Terminating' after node failure? - kubernetes

I have a deployment that deploys a single pod with a persistent volume claim. If I switch off the node it is running on, after a while k8s terminates the pod and tries to spin it up elsewhere. However the new pod cannot attach the volume (Multi-Attach error for volume "pvc-...").
I can manually delete the old 'Terminating' pod with kubectl delete pod <PODNAME> --grace-period=0 --force and then things recover.
Is there a way to get Kubernetes to force delete the 'Terminating' pods after a timeout or something? Tx.

According to the docs:
A Pod is not deleted automatically when a node is unreachable. The
Pods running on an unreachable Node enter the 'Terminating' or
'Unknown' state after a timeout. Pods may also enter these states when
the user attempts graceful deletion of a Pod on an unreachable Node.
The only ways in which a Pod in such a state can be removed from the
apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
So I assume you are neither deleting nor draining the node that is being shut down.
In general I'd advise making sure any broken nodes are deleted from the node list; that should cause the controller manager to delete the Terminating pods.
Node deletion normally happens automatically, at least on Kubernetes clusters running on the main cloud providers, but if that's not happening for you then you need a way to remove nodes that are not healthy.

Use Recreate in .spec.strategy.type of your Deployment. This tells Kubernetes to delete the old pods before creating new ones.
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
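For reference, a minimal Deployment sketch using the Recreate strategy could look like this (all names and the image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 1
  strategy:
    type: Recreate             # delete old pods before creating new ones
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-image:latest # placeholder image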

Related

Kubernetes Pods Not Being Evicted

I have multiple pods on Kubernetes (v1.23.5) that are not being evicted and rescheduled in case of node failure.
According to Kubernetes documentation, this process must begin after 300s:
Kubernetes automatically adds a Toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300 unless you, or a controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after detecting one of these problems.
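For reference, those automatically added tolerations look roughly like this in the pod spec (a sketch based on the defaults quoted above; setting tolerationSeconds explicitly to a lower value is one way to shorten the wait):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300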
Unfortunately, pods get stuck in terminating status and do not get evicted. However, in one test on a pod without any PVC attached, it was evicted and started running on another node.
I'm trying to understand how I can make other pods evict after the default 300s time.
I don't know why it would not happen automatically, and I must drain the pod stuck in a terminating state to make it work properly.
Update
I have seen the kvaps/kube-fencing project. There seems to be a fencing procedure that runs in case of a node failure, but I couldn't make it solve my problem. I don't know whether that is because of my lack of comprehension of this project, or whether it is solely used to handle the node in case of a failure and not the pods stuck in a terminating state that need to be evicted.
There are two ways to handle this problem.
The first is to use kvaps/kube-fencing. You need to configure a PodTemplate in which you set whether a node should be deleted from the cluster when it becomes NotReady, or flushed. If you have volumes attached to the pod, the pods will remain in the ContainerCreating state.
These are the annotations in the PodTemplate (use one mode or the other):
annotations:
  fencing/mode: 'delete'
  # or
  fencing/mode: 'flush'
The second way is to use Kubernetes Non-Graceful Node Shutdown. This is not available in Kubernetes v1.23.5, so you have to upgrade.
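As a rough sketch of the second way: with Non-Graceful Node Shutdown (available from v1.24 onward), you manually taint the failed node with the out-of-service taint, after which the pods and volume attachments on it are cleaned up. The resulting Node spec would contain something like this (the value nodeshutdown is just a conventional label):
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute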

What is default behavior of Kubernetes when pod crashes?

In Kubernetes deployment with 4 static pods and no autoscaling, what happens by default if one pod crashes? Will it be re-created automatically with the same ID/different ID or will the application continue running on 3 pods?
When a pod crashes, it will automatically be restarted. You will see this in the incrementing "Restarts" count for the pod when you do kubectl get pods
From the documentation: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
In other words, a deployment will ALWAYS restart your pod, regardless, and you cannot change that behaviour.
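For illustration, this is roughly where that field sits in a Deployment manifest (a partial sketch with a placeholder container); leaving it out has the same effect, since Always is the default:
spec:
  template:
    spec:
      restartPolicy: Always        # the only value a Deployment accepts
      containers:
      - name: my-app               # placeholder
        image: my-image:latest     # placeholder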
A restart will not change the name of the pod (or ID, as you have called it).
The only time the pod name will change is if the pod gets deleted. This can happen during autoscaling processes or if the pod gets evicted from a node.
You've specified no autoscaling in your deployment, but if you have specified a value of 4 replicas, as I suspect you have, then the eviction will cause that one pod to change names, as it gets recreated on another node in order to meet your request for 4 replicas.
By "changing names" I just mean the hash at the end of the pod name will change. So your pod named my-test-g4gsv may be renamed to my-test-4dsv4 after it goes to a new node.
There is a backoff policy for restarts. So if Kubernetes detects a pod has been restarted repeatedly, it will start delaying its restart attempts. You will notice this as a CrashLoopBackOff value under the pod status (instead of Running). While in this state, the pod is not started, so during this time your deployment is essentially running with reduced replicas until Kubernetes restarts the pod.

Does a Kubernetes POD with restart policy always have to be under the auspice of a controller to work?

If I create a POD manifest (pod-definition.yaml) and set the restartPolicy: Always does that Pod also need to be associated with any controller (i.e., a Replicaset or Deployment)? The end goal here it to auto-start the container in the Pod should it die. Without a Pod being associated with a controller will that container automatically restart? What happens if the Pod has only one container?
The documentation is not clear here, but it led me to believe that the Pod must be under a controller for this to work, i.e., if you implicitly create a K8s object and specify a restart policy of Never you'll get a pod. If you specify Always (the default) you'll get a deployment.
A Pod without a controller (Deployment, replication controller, etc.) and only with restartPolicy will not be restarted or rescheduled if the node (to be exact, the kubelet on that node) where it is running dies, is drained or rebooted, or if the pod is evicted from the node for some other reason. If the node is in a good state and the pod crashes for some reason, it will be restarted on the same node without the need of a controller.
The reason is that pod restartPolicy is handled by the kubelet, i.e. the pod is restarted by the kubelet of the node. Now if the node dies, the kubelet is also dead and cannot restart the pod. Hence you need to have a controller which will restart it on another node.
From the docs
restartPolicy only refers to restarts of the Containers by the kubelet
on the same node
In short if you want pods to survive a node failure or a kubelet failure of a node you should have a higher level controller.
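As a concrete example, a bare Pod like this (placeholder names) will be restarted by the kubelet if its container crashes, but it will not be rescheduled anywhere else if the node itself dies:
apiVersion: v1
kind: Pod
metadata:
  name: standalone-pod             # placeholder name
spec:
  restartPolicy: Always            # handled by the kubelet on this node only
  containers:
  - name: app
    image: nginx:latest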

Would Kubernetes bring up the down-ed Pod if only Pod definition file exists?

I have Pod definition file only. Kubernetes will bring up the pod. What happens if it goes down? Would Kubernetes bring it up automatically? Or if we want certain numbers of pods up at all time, we MUST take the help of ReplicationController( or ReplicaSet in new versions)?
Although your question is not clear, yes: if you have deployed the pod through a Deployment or ReplicaSet, then Kubernetes will create another one if you or someone else deletes that pod.
If you have just the pod without any controller like a ReplicaSet, then it is gone forever, as there is nothing to take care of it.
In case the app crashes inside the pod:
A CrashloopBackOff means that you have a pod starting, crashing, starting again, and then crashing again.
A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never which applies to all containers in a pod. The default value is Always and the restartPolicy only refers to restarts of the containers by the kubelet on the same node (so the restart count will reset if the pod is rescheduled in a different node). Failed containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes, and is reset after ten minutes of successful execution.
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/
The pod restartPolicy only refers to restarts of the containers by the kubelet on the same node. If there is no replication controller or deployment, then if a node goes down Kubernetes will not reschedule or restart that node's pods on any other node. This is the reason bare pods are not recommended to be used directly in production.

Delete all the contents from a kubernetes node

How to delete all the contents from a kubernetes node? Contents include deployments, replica sets etc. I tried to delete deployments separately. But kubernetes recreates all the pods again. Is there any way to delete all the replica sets present in a node?
If you are testing things, the easiest way would be
kubectl delete deployment --all
Although if you are using minikube, the easiest would probably be to delete the machine and start again with a fresh node:
minikube delete
minikube start
If we are talking about a production cluster, Kubernetes has a built-in feature to drain a node of the cluster, removing all the objects from that node safely.
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node. Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
Note: By default kubectl drain will ignore certain system pods on the node that cannot be killed; see the kubectl drain documentation for more details.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones excluded as described in the previous paragraph) have been safely evicted (respecting the desired graceful termination period, and without violating any application-level disruption SLOs). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
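Since drain respects PodDisruptionBudgets, a minimal PDB sketch (placeholder names and labels) that keeps at least one replica of an application running while its node is drained could look like this:
apiVersion: policy/v1                # policy/v1 requires Kubernetes v1.21+
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                   # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app                    # placeholder label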
First, identify the name of the node you wish to drain. You can list all of the nodes in your cluster with
kubectl get nodes
Next, tell Kubernetes to drain the node:
kubectl drain <node name>
Once it returns (without giving an error), you can power down the node (or equivalently, if on a cloud platform, delete the virtual machine backing the node). drain waits for graceful termination. You should not operate on the machine until the command completes.
If you leave the node in the cluster during the maintenance operation, you need to run
kubectl uncordon <node name>
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
Please, note that if there are any pods that are not managed by ReplicationController, ReplicaSet, DaemonSet, StatefulSet or Job, then drain will not delete any pods unless you use --force, as mentioned in the docs.
kubectl drain <node name> --force
In case you are using minikube:
minikube delete --all
It will let you start a new clean cluster.
In case you run on Kubernetes:
kubectl delete pods,deployments -A --all
It will remove them from all namespaces; you can add more objects in the same command.
Kubernetes provides the Namespace object for isolation and separation of concerns. Therefore, it is recommended to apply all of the k8s resource objects (Deployment, ReplicaSet, Pods, Services and others) in a custom namespace.
Now, if you want to remove all of the relevant and related k8s resources, you just need to delete the namespace, which will remove all of these resources.
kubectl create namespace custom-namespace
kubectl create -f deployment.yaml --namespace=custom-namespace
kubectl delete namespaces custom-namespace
I have attached a link for further research.
Namespaces
I tried so many variations to delete old pods from tutorials, including everything here.
What finally worked for me was:
kubectl delete replicaset --all
Deleting them one at a time didn't seem to work; it was only with the --all flag that all pods were deleted without being recreated.