Kubernetes pod eviction schedules evicted pod to node already under DiskPressure

We are running a Kubernetes (1.9.4) cluster with 5 masters and 20 worker nodes. Among other workloads, we run a StatefulSet with 3 replicas, and initially its pods were distributed across 3 nodes. pod-2 on node-2 was evicted due to disk pressure on node-2, but after eviction it was scheduled onto node-1, where pod-1 was already running and which was itself already under disk pressure. As we understand it, the kube-scheduler should not schedule a (non-critical) pod onto a node that is already under disk pressure. Is avoiding such nodes the default behavior, or is scheduling onto them allowed? At the same time we could see node-0 with no disk issues at all, so we expected the evicted pod from node-2 to land on node-0 rather than on node-1, which is under disk pressure.
Another observation: when pod-2 was evicted from node-2, we saw the same pod successfully scheduled, spawned and moved to the Running state on node-1. However, we still see a "Failed to admit pod" error on node-2 many times for the same pod-2 that was evicted. Is this an issue with the kube-scheduler?
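For reference, both the placement and the node condition can be confirmed with commands along these lines (the pod and node names are the placeholders used in the description above):
kubectl get pods -o wide                 # shows which node each pod was scheduled onto
kubectl describe node node-1             # the Conditions section shows whether DiskPressure is True
kubectl get node node-1 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'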

Yes, the scheduler should not assign a new pod to a node with a DiskPressure condition.
However, I think you can approach this problem from a few different angles.
Look into the configuration of your scheduler:
./kube-scheduler --write-config-to kube-config.yaml
and check whether it needs any adjustments. You can find information about the additional kube-scheduler options in its reference documentation.
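On a 1.9-era cluster, DiskPressure is enforced through the scheduler's CheckNodeDiskPressure predicate, so if you start the scheduler with a custom --policy-config-file it is worth confirming that this predicate has not been dropped. A minimal, illustrative policy sketch (only a few of the default predicates and priorities are shown):
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "CheckNodeDiskPressure"},
    {"name": "CheckNodeMemoryPressure"},
    {"name": "PodFitsResources"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1}
  ]
}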
You can also configure additional scheduler(s) depending on your needs; a tutorial for that can be found in the Kubernetes documentation.
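If you do add a second scheduler, a pod opts into it through spec.schedulerName. A minimal sketch (my-custom-scheduler is just a placeholder name):
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  schedulerName: my-custom-scheduler   # must match the name the custom scheduler registers with
  containers:
  - name: app
    image: nginx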
Check the logs:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log (on the master)
Look more closely at Kubelet's Eviction Thresholds (soft and hard) and how much node memory capacity is set.
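For illustration, these thresholds are set through kubelet flags; the values below are example numbers only, not recommendations:
kubelet \
  --eviction-hard=memory.available<100Mi,nodefs.available<10%,imagefs.available<15% \
  --eviction-soft=memory.available<300Mi,nodefs.available<15% \
  --eviction-soft-grace-period=memory.available=1m,nodefs.available=1m \
  --eviction-max-pod-grace-period=60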
Bear in mind that:
Kubelet may not observe resource pressure fast enough
or
Kubelet may evict more Pods than needed due to the stats collection timing gap
Please check out my suggestions and let me know if they helped.

Related

Kubernetes Pods Not Being Evicted

I have multiple pods on Kubernetes (v1.23.5) that are not being evicted and rescheduled in case of node failure.
According to Kubernetes documentation, this process must begin after 300s:
Kubernetes automatically adds a Toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300 unless you, or a controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after detecting one of these problems.
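For reference, those automatically added tolerations are equivalent to declaring the following in the Pod spec; lowering tolerationSeconds is the knob that shortens the 5-minute window (a sketch to be merged into your own spec):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300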
Unfortunately, the pods get stuck in the Terminating status and are not evicted. However, in one test with a pod that had no PVC attached, the pod was evicted and started running on another node.
I'm trying to understand how I can make the other pods evict after the default 300s.
I don't know why it does not happen automatically, and why I must drain the node hosting the pod stuck in the Terminating state to make things work properly.
Update
I have looked at the kvaps/kube-fencing project. It seems to run a fencing procedure when a node fails, but I couldn't make it solve my problem. I don't know whether that is because of my limited understanding of the project, or because it is only meant to handle the failed node itself rather than the pods stuck in the Terminating state that need evicting.
There are two ways to handle this problem.
The first is to use kvaps/kube-fencing. You need to configure a PodTemplate in which you can set the node to be deleted from the cluster when it becomes NotReady, or to have the node flushed. If you have volumes attached to the pod, the pods will remain in the ContainerCreating state.
These are the corresponding annotations in the PodTemplate (set one of the two modes):
annotations:
  fencing/mode: 'delete'   # delete the failed node from the cluster
or
annotations:
  fencing/mode: 'flush'    # only flush the node
The second way is to use Kubernetes Non-Graceful Node Shutdown handling. This is not available in Kubernetes v1.23.5, so you would have to upgrade.
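If you do upgrade to a release that supports it, the non-graceful path still requires manually marking the dead node as out of service, roughly like this (the node name is a placeholder):
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
Once the taint is applied, pods on the shut-down node that do not tolerate it are force-deleted and their volume attachments are cleaned up, so the stateful pods can start on another node.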

What should I do to find the reason a pod was evicted?

Today when I checked the Kubernetes cluster, some of the pods showed an Evicted status, but I only see that status and cannot find any detailed logs explaining why the pods were evicted. Disk pressure? CPU pressure? What should I do to find the reason for the eviction?
You can try looking at the logs of that particular pod.
Do a describe on that pod and see if you find anything.
kubectl get pods -o wide
Try the above command to see which node it was running on, then describe that node; you will find at least some information related to the eviction.
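Concretely, the eviction reason is usually kept on the pod object itself and in the cluster events; commands along these lines should surface it (the pod and node names are placeholders):
kubectl describe pod <pod-name>                      # Status/Reason/Message plus the Events section
kubectl get pod <pod-name> -o jsonpath='{.status.reason}: {.status.message}'
kubectl get events --field-selector reason=Evicted   # recent evictions in the namespace
kubectl describe node <node-name>                    # Conditions show MemoryPressure / DiskPressure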
Eviction is a process in which a Pod assigned to a Node is asked to terminate. One common related case in Kubernetes is preemption, where, in order to schedule a new Pod onto a Node with limited resources, another Pod has to be terminated to free resources for the first one.
So, to answer your question, the pod was most likely evicted because the node ran short of an incompressible resource such as memory or disk.

Is it possible to get the details of the node where the pod ran before restart?

I'm running a Kubernetes cluster of 20+ nodes, and one pod in a namespace got restarted. The pod was killed due to OOM with exit code 137 and restarted again as expected, but I would like to know which node the pod was running on earlier. Is there any place we could check the logs for that information, such as tiller, kubelet, kube-proxy, etc.?
But I would like to know which node the pod was running on earlier.
If a pod is killed with ExitCode: 137, e.g. when it used more memory than its limit, it will be restarted on the same node - not re-scheduled. For this, check your metrics or container logs.
But Pods can also be killed due to over-committing a node, see e.g. How to troubleshoot Kubernetes OOM and CPU Throttle.
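Since a plain container restart keeps the pod on the same node, the node shown now is also the node it ran on before the restart; something like this confirms both the node and the last OOM termination (the pod name is a placeholder):
kubectl get pod <pod-name> -o wide                                                                      # NODE column is unchanged by container restarts
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'     # e.g. OOMKilled
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'   # e.g. 137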

How to identify pod eviction policy?

I have a Kubernetes cluster deployed on GCP with a single node, 4 CPUs and 15 GB of memory. There are a few pods, all of them bound to a persistent volume through a persistent volume claim. I have observed that the pods restarted automatically and the data in the persistent volume was lost.
After some research, I suspect that this could be because of the pod eviction policy. When I ran kubectl describe pod, I noticed the error below.
0/1 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
The restart policy of my pods is "Always", so I think the pods restarted after being deprived of resources.
How do I identify the pod eviction policy of my cluster and change it so that this does not happen in the future?
pod eviction policy of my cluster and change
These thresholds (for pod eviction) are kubelet flags; you can tune the values according to your requirements by editing the kubelet config file (see the kubelet configuration file documentation for details).
Dynamic Kubelet Configuration allows you to edit these values in a live cluster.
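As an illustration only (the numbers are example values, not recommendations), the eviction-related part of a kubelet configuration file looks roughly like this:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "300Mi"
evictionSoftGracePeriod:
  memory.available: "1m"
evictionMaxPodGracePeriod: 60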
The restart policy of my pods is "Always", so I think the pods restarted after being deprived of resources.
Your pod was rescheduled due to a node issue (not enough disk space).
The restart policy of my pods is "Always".
It means that whenever the pod's containers are not up and running, the kubelet tries to restart them.
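For completeness, the restart policy is set per pod spec and defaults to Always, which is what keeps restarting the containers in place; a minimal sketch:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  restartPolicy: Always   # other options: OnFailure, Never
  containers:
  - name: app
    image: nginx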

Kubernetes Horizontal Pod Autoscaler on GKE - "failed to get CPU utilization"

I am fairly new to Kubernetes and GKE (Google Container Engine) as a whole, so I was playing with the horizontal pod autoscaling and cluster autoscaling features. I hit my load balancer hard enough that it scaled up enough pods to need more instances; the cluster scaled those up too, but then some pods were left in the Pending state because the cluster had also reached its maximum number of instances.
I then stopped the load test, hoping it would scale down on its own, but it wouldn't. Looking at kubectl describe hpa, I would see errors like:
7m 18s 18 {horizontal-pod-autoscaler } Warning FailedGetMetrics failed to get CPU consumption and request: metrics obtained for 4/5 of pods
7m 18s 18 {horizontal-pod-autoscaler } Warning FailedComputeReplicas failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 4/5 of pods
There are actually only 4 pods running (and none in the Pending state), and looking at the heapster logs (kubectl logs -f heapster-v1.1.0-<id> --namespace=kube-system heapster) I can see that it is still looking for metrics from a pod that no longer exists (the mysterious 5th pod it complains about).
The issue is that, because it is missing that 5th pod, it can't finish computing the current CPU utilization for the 4 pods that are running, and so horizontal pod autoscaling doesn't work.
Any ideas how to get out of a situation like this?
I've tried removing the hpa and creating it again, but it didn't help.
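For anyone debugging the same symptom, here is a sketch of the checks implied above (the my-hpa name, the app=my-app selector, the heapster pod name, and the k8s-app=heapster label are placeholders/assumptions about a heapster-era GKE setup): compare what actually exists with what the autoscaler and heapster think exists, and if heapster is still tracking a deleted pod, deleting its pod so its controller recreates it with a fresh cache is a commonly suggested workaround.
kubectl describe hpa my-hpa                                            # which metric collection is failing
kubectl get pods -l app=my-app -o wide                                 # pods that really exist for the scale target
kubectl logs --namespace=kube-system heapster-v1.1.0-<id> heapster     # does it still mention the deleted pod?
kubectl delete pod --namespace=kube-system -l k8s-app=heapster         # assumed label; the Deployment recreates heapster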