How to identify pod eviction policy? - kubernetes

I have a Kubernetes cluster deployed on GCP with a single node, 4 CPUs and 15GB of memory. There are a few pods, all bound to a persistent volume through a persistent volume claim. I have observed that the pods restart automatically and the data in the persistent volume is lost.
When I ran kubectl describe pod, I noticed the error below.
0/1 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
How do I identify the pod eviction policy of my cluster and change it so that this does not happen in the future?

How do I identify the pod eviction policy of my cluster and change it?
These eviction thresholds are kubelet flags, and you can tune them according to your requirements. You can edit the kubelet config file; the details are in the config-file documentation.
Dynamic Kubelet Configuration allows you to edit these values on a live cluster.
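As an illustration only, eviction thresholds in a kubelet config file look roughly like this (the threshold values below are made up, not recommendations):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                    # evict immediately once a threshold is crossed
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:                    # soft thresholds tolerate the condition for a grace period
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
The same signals can also be passed as --eviction-hard and --eviction-soft kubelet flags if you are not using a config file.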
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
Your pod has been rescheduled due to a node issue (not enough disk space).
The restart policy of my pods is "always".
It means that if the pod is not up and running, the kubelet will try to restart it.
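For reference, restartPolicy lives in the pod spec; a minimal sketch (the name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: my-app              # hypothetical name
spec:
  restartPolicy: Always     # kubelet restarts the containers in place on the same node
  containers:
  - name: app
    image: nginx
Note that restartPolicy only governs restarts by the kubelet on the same node; it does not move the pod to another node.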

Related

Why do we need PodDisruptionBudget on AKS?

I am going to implement PDB on AKS. Can someone please tell me why we need it when we can use the node autoscaler?
Also, does PDB allow zero unavailability by creating a node when one of the nodes fails?
PDB allows you to set rules before evicting your pods from a node.
Let's say you have a 2 nodes cluster and a deployment with 1 replica and you want to update your nodes.
kubectl drain will cordon node 1 so no pods can be scheduled on that node
kubectl drain will evict the pod scheduled on node 1
the scheduler will then place your pod on node 2
Now if you set a PDB with minAvailable: 50%, that drain command would fail as it would violate the rule.
The pod is killed and the scheduler then tries to place it somewhere else.
PDB allows you to prevent downtime by budgeting pods before evicting them.
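For the 2-node example above, a PodDisruptionBudget could look roughly like this (a sketch; the name and labels are placeholders, and clusters older than 1.21 use apiVersion policy/v1beta1):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  minAvailable: "50%"         # at least half of the selected pods must stay available
  selector:
    matchLabels:
      app: my-app             # must match your deployment's pod labels
With a single replica, 50% rounds up to 1 pod, so evicting the only pod would violate the budget and the drain is refused until you scale up or relax the budget.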
Scenario without PDB
You perform the node 1 update and node 2 cannot host the evicted pod:
pod is killed on node 1
the scheduler cannot place the pod anywhere
autoscaling provisions a third node
pod is scheduled on that new node
During that whole time your evicted pod was not running anywhere and your application was down.

Kubernetes Pod won't start - 1 node(s) had a volume affinity conflict

I have a pod that won't start with a volume affinity conflict. This is a bare-metal cluster so it's unrelated to regions. The pod has 4 persistent volume claims which are all reporting bound so I'm assuming it's not one of those. There are 4 nodes, one of them is tainted so that the pod will not start on it, one of them is tainted specifically so that the pod WILL start on it. That's the only affinity I have set up to my knowledge. The message looks like this:
0/4 nodes are available: 1 node(s) had taint {XXXXXXX},
that the pod didn't tolerate, 1 node(s) had volume node
affinity conflict, 2 Insufficient cpu, 2 Insufficient memory.
This is what I would have expected apart from the volume affinity conflict. There are no other affinities set other than to point it at this node. I'm really not sure why it's doing this or where to even begin. The message isn't super helpful. It does NOT say which node or which volume the problem is with. The one thing I don't really understand is how binding works. One of the PVCs is mapped to a PV on another node; however, it is reporting as bound, so I'm not completely certain whether that's the problem. I am using local-storage as the storage class. I'm wondering if that's the problem, but I'm fairly new to Kubernetes and I'm not sure where to look.
You have 4 nodes, but none of them is available for scheduling due to a different set of conditions. Note that each node can be affected by multiple issues, so the numbers can add up to more than the total number of nodes. Let's try to address these issues one by one:
Insufficient memory: Execute kubectl describe node <node-name> to check how much free memory is available there. Check the requests and limits of your pods. Note that Kubernetes reserves the full amount of memory a pod requests, regardless of how much the pod actually uses.
Insufficient cpu: Analogous to the above.
node(s) had volume node affinity conflict: Check whether the nodeAffinity of your PersistentVolume (kubectl describe pv) matches a label on the node (kubectl get nodes --show-labels). Check whether the nodeSelector in your pod also matches. Make sure you set up the Affinity and/or AntiAffinity rules correctly; a sketch of a matching PersistentVolume follows this list. More details on that can be found here.
node(s) had taint {XXXXXXX}, that the pod didn't tolerate: You can use kubectl describe node to check taints and kubectl taint nodes <node-name> <taint-name>- in order to remove them. Check the Taints and Tolerations documentation for more details.
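For the volume affinity conflict above, this is a sketch of what has to line up; the name, path and hostname are placeholders for your own values:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example                # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /mnt/disks/disk1              # hypothetical path on the node
  nodeAffinity:                         # required for local volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-2                    # must be the node that actually holds the disk
If the pod's nodeSelector, affinity or taints steer it to a different node than the one listed under values, the scheduler reports a volume node affinity conflict even though the PVC shows as Bound.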

kubectl drain and rolling update, downtime

Does kubectl drain first make sure that pods with replicas=1 are healthy on some other node?
Assuming the pod is controlled by a deployment, and the pods can indeed be moved to other nodes.
Currently, as I see it, drain only evicts (deletes) the pods from the node, without scheduling replacements elsewhere first.
In addition to Suresh Vishnoi's answer:
If a PodDisruptionBudget is not specified and you have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on another node.
To make sure your application is available during the node draining process, you have to specify a PodDisruptionBudget and create more replicas. If you have 1 pod with minAvailable: 30%, the drain will be refused with the following error:
error when evicting pod "pod01" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Briefly, this is how the draining process works:
As explained in the documentation, the kubectl drain command "safely evicts all of your pods from a node before you perform maintenance on the node and allows the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified".
Drain does two things:
cordons the node: the node is marked as unschedulable, so new pods cannot be scheduled on it. This makes sense: if we know the node will be under maintenance, there is no point in scheduling a pod there only to reschedule it on another node because of that maintenance. From the Kubernetes perspective it adds a taint to the node: node.kubernetes.io/unschedulable:NoSchedule
evicts/deletes the pods: after the node is marked as unschedulable, it tries to evict the pods that are running on the node. It uses the Eviction API, which takes PodDisruptionBudgets into account (if that API is not supported, it falls back to deleting the pods). It issues a DELETE call to the Kubernetes API but respects GracePeriodSeconds, so a pod can finish its processes.
New pods are scheduled when the desired number of pods is not available (desired state != current state), whether that is due to draining or to node failure.
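As a sketch, a typical maintenance flow looks like this (flag names vary slightly across versions; --delete-local-data was later renamed --delete-emptydir-data):
kubectl cordon <node-name>          # only marks the node unschedulable
kubectl drain <node-name> --ignore-daemonsets --delete-local-data --grace-period=60
# ...perform the maintenance...
kubectl uncordon <node-name>        # make the node schedulable again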
With the PodDisruptionBudget resource you can manage the disruption during the draining of the node.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget. maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, "desired replicas" is the scale of the controller managing the pods being selected by the PodDisruptionBudget. https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget's selector.
Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
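A maxUnavailable budget is written the same way; a sketch with placeholder names (use policy/v1beta1 on clusters older than 1.21):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # hypothetical name
spec:
  maxUnavailable: "30%"         # or an absolute number, e.g. 1
  selector:
    matchLabels:
      app: web                  # must match the pods of the managing controller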

Kubernetes is failing to schedule Daemonset pods on nodes in an auto scaling GKE node pool

We are seeing an issue with the GKE kubernetes scheduler being unable or unwilling to schedule Daemonset pods on nodes in an auto scaling node pool.
We have three node pools in the cluster; however, the pool-x pool is used exclusively to schedule a single Deployment backed by an HPA, with the nodes in this pool having the taint "node-use=pool-x:NoSchedule" applied to them. We have also deployed a filebeat Daemonset on which we have set a very lenient tolerations spec of operator: Exists (hopefully this is correct) to ensure the Daemonset is scheduled on every node in the cluster.
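For reference, a toleration of that form sits in the DaemonSet's pod template and looks roughly like this (only the relevant fragment is shown):
spec:
  template:
    spec:
      tolerations:
      - operator: Exists        # with no key, this tolerates every taint, including node-use=pool-x:NoSchedule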
Our assumption is that, as pool-x is auto-scaled up, the filebeat Daemonset would be scheduled on the node prior to scheduling any of the pods assigned to that node. However, we are noticing that as new nodes are added to the pool, the filebeat pods are failing to be placed on the node and are in a perpetual "Pending" state. Here is an example of the describe output (truncated) of the filebeat Daemonset:
Number of Nodes Scheduled with Up-to-date Pods: 108
Number of Nodes Scheduled with Available Pods: 103
Number of Nodes Misscheduled: 0
Pods Status: 103 Running / 5 Waiting / 0 Succeeded / 0 Failed
And the events for one of the "Pending" filebeat pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18m (x96 over 68m) default-scheduler 0/106 nodes are available: 105 node(s) didn't match node selector, 5 Insufficient cpu.
Normal NotTriggerScaleUp 3m56s (x594 over 119m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 6 node(s) didn't match node selector
Warning FailedScheduling 3m14s (x23 over 15m) default-scheduler 0/108 nodes are available: 107 node(s) didn't match node selector, 5 Insufficient cpu.
As you can see, the node does not have enough resources to schedule the filebeat pod: the CPU requests are exhausted by the other pods running on the node. However, why is the Daemonset pod not placed on the node prior to scheduling any other pods? It seems like the very definition of a Daemonset necessitates priority scheduling.
Also of note, if I delete a pod on a node where filebeat is "Pending" scheduling due to being unable to satisfy the CPU requests, filebeat is immediately scheduled on that node, indicating that there is some scheduling precedence being observed.
Ultimately, we just want to ensure the filebeat Daemonset is able to schedule a pod on every single node in the cluster and have that priority work nicely with our cluster autoscaling and HPAs. Any ideas on how we can achieve this?
We'd like to avoid having to use Pod Priority, as it's apparently an alpha feature in GKE and we are unable to make use of it at this time.
The behavior you are expecting, where DaemonSet pods are scheduled onto a node first, is no longer the reality (since 1.12). Since 1.12, DaemonSet pods are handled by the default scheduler and rely on pod priority to determine the order in which pods are scheduled. You may want to consider creating a PriorityClass specific to DaemonSets with a relatively high value to ensure they are scheduled ahead of most of your other pods.
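A sketch of such a PriorityClass and of how the DaemonSet would reference it (the name and value are illustrative; older clusters use scheduling.k8s.io/v1beta1):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-priority            # hypothetical name
value: 1000000                        # relatively high, so these pods win resource contention
globalDefault: false
description: "Priority for node-level agents such as the filebeat DaemonSet"
The DaemonSet's pod template then points at it:
spec:
  template:
    spec:
      priorityClassName: daemonset-priority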
Before Kubernetes 1.12, DaemonSets were scheduled by their own controller; since that version, DaemonSet pods are scheduled by the default scheduler, in the hope that priority, preemption and tolerations cover all the cases.
If you want DaemonSet scheduling to be handled by the DaemonSet controller again, check the ScheduleDaemonSetPods feature gate.
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/

Kubernetes pod eviction schedules evicted pod to node already under DiskPressure

We are running a Kubernetes (1.9.4) cluster with 5 masters and 20 worker nodes. We are running one statefulset with 3 replicas among other pods in this cluster. Initially the statefulset pods were distributed across 3 nodes. However, pod-2 on node-2 got evicted due to disk pressure on node-2. When pod-2 was evicted, it went to node-1, where pod-1 was already running and which was already experiencing disk pressure. As per our understanding, the kubernetes-scheduler should not have scheduled a (non-critical) pod to a node that is already under disk pressure. Is it the default behavior not to schedule pods to a node under disk pressure, or is it allowed? The reason we ask is that, at the same time, we observed node-0 without any disk issue, so we were hoping that the pod evicted from node-2 would land on node-0 instead of node-1, which is under disk pressure.
Another observation: when pod-2 on node-2 was evicted, we saw that the same pod was successfully scheduled, spawned, and moved to the Running state on node-1. However, we still see the "Failed to admit pod" error on node-2 many times for the same pod-2 that was evicted. Is this an issue with the kube-scheduler?
Yes, the scheduler should not assign a new pod to a node with a DiskPressure condition.
However, I think you can approach this problem from a few different angles.
Look into the configuration of your scheduler:
./kube-scheduler --write-config-to kube-config.yaml
and check whether it needs any adjustments. You can find info about additional options for kube-scheduler here:
You can also configure additional scheduler(s) depending on your needs. A tutorial for that can be found here.
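If you do add an extra scheduler, pods opt into it through spec.schedulerName; a minimal sketch with placeholder names:
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod          # hypothetical name
spec:
  schedulerName: my-custom-scheduler  # must match the name your additional scheduler runs with
  containers:
  - name: app
    image: nginx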
Check the logs:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log (on the master)
Look more closely at the kubelet's eviction thresholds (soft and hard) and at how much node memory capacity is set; see the flag sketch after the notes below.
Bear in mind that:
Kubelet may not observe resource pressure fast enough
or
Kubelet may evict more pods than needed due to stats collection timing gaps
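As an illustration (the threshold values are arbitrary), the relevant kubelet flags look like this:
kubelet --eviction-hard=memory.available<200Mi,nodefs.available<10% \
  --eviction-soft=memory.available<500Mi \
  --eviction-soft-grace-period=memory.available=1m30s \
  --eviction-max-pod-grace-period=60 \
  --housekeeping-interval=10s         # a shorter interval narrows the stats collection gap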
Please check out my suggestions and let me know if they helped.