From PodInterface, the two operations Delete and Evict seem to have the same effect: deleting the old Pod and creating a new Pod.
If the two operations have the same effect, why do we need two APIs to delete a Pod and create a new one?
Deletion of a pod is done by an end user and is a normal activity. It means the pod will be deleted from etcd and the Kubernetes control plane. Unless there is a higher-level controller such as a Deployment, DaemonSet, StatefulSet, etc., the pod will not be created again and scheduled to a Kubernetes worker node.
Eviction happens when a pod's resource consumption exceeds the limit and the kubelet triggers eviction of the pod, or when a user performs kubectl drain or manually invokes the eviction API. It's generally not a normal activity. Sometimes evicted pods are not automatically deleted from etcd and the Kubernetes control plane. Unless there is a higher-level controller such as a Deployment, DaemonSet, StatefulSet, etc., the evicted pod will not be created again and scheduled to a Kubernetes worker node.
It's preferable to use delete instead of evict because evict carries more risk: in some cases eviction may leave an application in a broken state, for example if the replacement pod created by the application's controller (Deployment, etc.) does not become ready, or if the last pod evicted has a very long termination grace period.
The pod evict operation (assuming you're referring to the Eviction API) is a sort of smarter delete operation, which respects PodDisruptionBudget and thus the high-availability requirements of your application (as long as the PodDisruptionBudget is configured correctly). Normally you don't manually evict a pod; however, pod eviction can be initiated as part of a node drain operation (which can be invoked manually with the kubectl drain command or automatically by the Cluster Autoscaler component).
The pod delete operation, on the other hand, doesn't respect PodDisruptionBudget and thus can affect the availability of your application. Unlike the evict operation, this operation is normally invoked manually (e.g. with the kubectl delete command).
Also, besides the Eviction API, pods can be evicted by the kubelet due to node-pressure conditions, in which case the PodDisruptionBudget is not respected by Kubernetes (see Node-pressure Eviction).
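For illustration, this is roughly what an API-initiated eviction looks like: instead of calling DELETE on the pod, you POST an Eviction object to the pod's eviction subresource (a sketch; the pod name and namespace below are placeholders):
apiVersion: policy/v1
kind: Eviction
metadata:
  name: my-pod          # placeholder: pod to evict
  namespace: default    # placeholder: the pod's namespace
If completing the eviction would violate a PodDisruptionBudget, the API server refuses it (typically with 429 Too Many Requests), whereas a plain delete goes through regardless.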
Related
I have a pod with a Pod Disruption Budget that says at least one has to be running. While it generally works very well, it leads to a peculiar problem.
Sometimes this pod is in a failed state (due to some development), so I have two pods, scheduled on two different nodes, both in a CrashLoopBackOff state.
Now, if I want to run a drain or a k8s version upgrade, the pod can never be evicted, since Kubernetes knows that at least one should be running, which will never happen.
So k8s does not evict a pod due to the Pod Disruption Budget even if the pod is not running. Is there a way to do something about this? I think ideally k8s should treat failed pods as candidates for eviction regardless of the budget (as deleting a failing pod cannot "break" anything anyway).
...if I want to run a drain or a k8s version upgrade, the pod can never be evicted, since Kubernetes knows that at least one should be running...
kubectl drain --disable-eviction <node> will delete pods that are protected by a PDB. Since you are fully aware of the downtime, you can first delete the PDB in question before draining the node.
I hit this issue too during a k8s upgrade. FYI, as mentioned in the other answer, kubectl drain --disable-eviction <node> may cause service downtime, and deleting pods might not always work when the deleted pods are immediately recreated by the deployment managing them. Also, even if the pods are deleted successfully, it may cause service downtime depending on the PodDisruptionBudget.
Instead, I increased the number of replicas of the pods in the deployment so that PodDisruptionBudget.minAvailable or PodDisruptionBudget.maxUnavailable could still be satisfied, and was able to successfully upgrade k8s while honoring the PodDisruptionBudget.
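For reference, a minimal PodDisruptionBudget of the kind discussed here might look like this (a sketch; the name, selector label, and minAvailable value are placeholders to adapt to your deployment):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
With minAvailable: 1 and only one replica in the deployment, any eviction would violate the budget, which is exactly why the drain gets stuck; with two or more healthy replicas the drain can evict one pod at a time.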
I have multiple pods on Kubernetes (v1.23.5) that are not being evicted and rescheduled in case of node failure.
According to Kubernetes documentation, this process must begin after 300s:
Kubernetes automatically adds a Toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300 unless you, or a controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after detecting one of these problems.
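Written out explicitly in a Pod spec, those default tolerations would look roughly like this (a sketch of what the documentation above describes):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300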
Unfortunately, pods get stuck in terminating status and would not evict. However, in one test on a pod without any PVC attached, it evicted and started running on another node.
I'm trying to understand how I can make other pods evict after the default 300s time.
I don't know why it does not happen automatically, and why I must manually clear the pod stuck in the terminating state to make it work properly.
Update
I have seen the kvaps/kube-fencing project. There seems to be a fencing procedure that runs in case of a node failure, but I couldn't make it solve my problem. I don't know whether that is because of my lack of comprehension of this project, or whether it is solely meant to handle the node itself in case of a failure and not to evict the pods stuck in the terminating state.
There are two ways to handle this problem.
The first one is to use kvaps/kube-fencing. You need to configure a PodTemplate in which you can set the node to be deleted from the cluster when it becomes NotReady, or to flush the node. If you have volumes attached to the pod, the pods will remain in the ContainerCreating state.
These are the annotations in the PodTemplate (use one of the two modes):
annotations:
  fencing/mode: 'delete'
or
annotations:
  fencing/mode: 'flush'
The second way is to use the Kubernetes Non-Graceful Node Shutdown feature. This is not available in Kubernetes v1.23.5, so you would have to upgrade.
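If you do upgrade, the non-graceful shutdown flow relies on you marking the dead node with the out-of-service taint. Expressed as a fragment of the Node spec it would look roughly like this (a sketch; in practice the taint is usually applied with kubectl taint):
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute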
I have a deployment that deploys a single pod with a persistent volume claim. If I switch off the node it is running on, after a while k8s terminates the pod and tries to spin it up elsewhere. However, the new pod cannot attach the volume (Multi-Attach error for volume "pvc-...").
I can manually delete the old 'Terminating' pod with kubectl delete pod <PODNAME> --grace-period=0 --force and then things recover.
Is there a way to get Kubernetes to force delete the 'Terminating' pods after a timeout or something? Tx.
According to the docs:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
So I assume you are neither deleting nor draining the node that is being shut down.
In general I'd advise ensuring any broken nodes are deleted from the node list, and that should cause Terminating pods to be deleted by the controller manager.
Node deletion normally happens automatically, at least on Kubernetes clusters running on the main cloud providers, but if that's not happening for you then you need a way to remove nodes that are not healthy.
Use Recreate in .spec.strategy.type of your Deployment. This tells Kubernetes to delete the old pods before creating new ones.
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
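For example, in the Deployment manifest that would be (a minimal fragment):
spec:
  strategy:
    type: Recreate
Note that Recreate means the old pod is stopped before the new one is created, so there is a brief gap in availability; it helps here mainly because only one pod at a time will try to attach the volume.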
I am relatively new to Kubernetes, and my current task at work is to debug evicted pods.
I'm trying to replicate the behaviour on a local k8s cluster in minikube.
So far I just cannot get any pods to be evicted.
Can you help me trigger this mechanism?
The eviction of pods is managed by the QoS classes (Quality of Service classes of the pods).
There are 3 categories:
Guaranteed (limits = requests for CPU and RAM), evicted last
Burstable
BestEffort
If you want to test this mechanism, you can run a pod that consumes a lot of memory or CPU, and before that launch your example pods with different requests and limits to observe this behavior. This behavior only applies to eviction, so your pods must already be started before the load begins.
If instead you want to test the scheduling mechanism at launch time, you can configure a priorityClassName to get a pod scheduled even if the cluster is full.
For example, if your cluster is full, you can't schedule a new pod because your pod doesn't have sufficient priority.
If you want to schedule a pod anyway despite that, you can add the priorityClassName system-node-critical or create your own PriorityClass, and one of the pods with a lower priority will be evicted (preempted) so that your pod can be launched.
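To make the QoS classes concrete, here is a sketch of two test pods you could run in minikube (the image and resource values are just placeholders): the first lands in the Guaranteed class because requests equal limits, while the second sets no requests or limits at all and ends up BestEffort, so under memory pressure it is among the first to be evicted:
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 100m
        memory: 128Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: nginx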
Does kubectl drain first make sure that pods with replicas=1 are healthy on some other node?
Assuming the pod is controlled by a deployment, and the pods can indeed be moved to other nodes.
Currently, as I see it, it only evicts (deletes) the pods from the node, without scheduling them elsewhere first.
In addition to Suresh Vishnoi's answer:
If no PodDisruptionBudget is specified and you have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.
To make sure your application remains available during the node draining process, you have to specify a PodDisruptionBudget and create more replicas. If you have 1 pod with minAvailable: 30%, the drain will refuse to proceed with the following error:
error when evicting pod "pod01" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Briefly, this is how the draining process works:
As explained in the documentation, the kubectl drain command "safely evicts all of your pods from a node before you perform maintenance on the node and allows the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified".
Drain does two things:
Cordons the node - this means the node is marked as unschedulable, so new pods cannot be scheduled on it. Makes sense - if we know the node will be under maintenance, there is no point in scheduling a pod there only to reschedule it on another node because of that maintenance. From the Kubernetes perspective, it adds a taint to the node: node.kubernetes.io/unschedulable:NoSchedule (see the sketch after this list).
Evicts/deletes the pods - after the node is marked as unschedulable, it tries to evict the pods that are running on the node. It uses the Eviction API, which takes PodDisruptionBudgets into account (if the Eviction API is not supported, it will delete the pods instead). It calls the DELETE method on the API server but honors GracePeriodSeconds, so it lets a pod finish its processes.
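For illustration, after cordoning, the relevant fields of the Node object end up roughly like this (a sketch):
spec:
  unschedulable: true
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule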
New pods are scheduled when the desired number of pods is not available (desired state != current state), whether that is caused by draining or by node failure.
With the PodDisruptionBudget resource you can manage the disruption during the draining of the node.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget. maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, “desired replicas” is the scale of the controller managing the pods being selected by the PodDisruptionBudget. https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget’s selector.
Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
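As a sketch, a PodDisruptionBudget corresponding to Example 4 could look like this (the name and selector label are placeholders):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: "30%"
  selector:
    matchLabels:
      app: my-app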