How to force Eviction on a Kubernetes Cluster (minikube) - kubernetes

I am relatively new to Kubernetes, and my current task at work is to debug evicted pods.
I'm trying to replicate the behaviour on a local k8s cluster in minikube.
So far I just cannot get any pods to be evicted.
Can you help me trigger this mechanism?

The eviction of pods under resource pressure is governed by the QoS classes (quality of service).
There are 3 categories:
Guaranteed (limits equal requests for CPU and memory) - evicted last
Burstable
BestEffort - evicted first
If you want to test this mechanism, first launch your example pods with different requests and limits, and then run (or scale up) a pod that consumes a lot of memory. This behaviour only applies to eviction, so your example pods must already be running before you generate the load.
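As a rough sketch of such a test (the pod names, image names, and sizes below are illustrative, not a prescribed setup; on minikube you may also need to start the cluster with a small memory budget so the thresholds can actually be crossed), you could deploy pods in the three QoS classes and then a memory hog:

# BestEffort: no requests/limits -> evicted first under node memory pressure.
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
---
# Burstable: request lower than limit.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        memory: 50Mi
      limits:
        memory: 200Mi
---
# Guaranteed: requests equal limits -> evicted last.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 100m
        memory: 100Mi
---
# Memory hog: keeps allocating until the node crosses the kubelet's
# memory eviction threshold (adjust --vm-bytes to your node size).
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: main
    image: polinux/stress
    command: ["stress", "--vm", "1", "--vm-bytes", "1G", "--vm-hang", "0"]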
If instead you want to test the scheduling mechanism at launch time, you can configure a priorityClassName so that a pod gets scheduled even if the cluster is full.
For example, if your cluster is full you cannot schedule a new pod, because your pod does not have sufficient priority.
If you want to schedule a pod anyway despite that, you can set priorityClassName to system-node-critical or create your own PriorityClass; one of the pods with a lower priority will then be evicted (preempted) and your pod will be launched.
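A minimal sketch of that second scenario (the class name, its value, and the pod below are placeholders, not something your cluster already defines):

# Hypothetical PriorityClass and a pod that references it.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "May preempt lower-priority pods when the cluster is full."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        memory: 100Mi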

Related

kubernetes pod scheduling with resource quotas

I have read about k8s resource management, but things are still not very clear to me. Let's say we have 2 k8s nodes, each with 22 MB of memory.
Let's say Pod A has a request of 10 MB and a limit of 15 MB (but its actual usage is 5 MB), so this pod is scheduled on node 1. Node 1 has 22 MB of memory, 5 MB is used by Pod A, and another 17 MB is available if Pod A needs more memory.
Pod B has a request of 10 MB and a limit of 15 MB (basically the same as Pod A), so this pod is scheduled on node 2.
So both nodes have 5 MB of usage out of 22 MB. If Pod C has a request of 5 MB and a limit of 10 MB, will this pod be scheduled on any of the nodes? If yes, what would happen if Pod C needs 10 MB of memory and the other pod needs 15 MB of memory?
What would happen if Pod C has a request of 13 MB and a limit of 15 MB? In this case 13 MB (request of Pod C) + 10 MB (request of Pod A) would be 23 MB (more than 22 MB).
Does k8s try to make sure that requests of all pods < available memory && limits of all pods < available memory?
Answering the questions from the post:
Let's say Pod A has a request of 10 MB and a limit of 15 MB (but its actual usage is 5 MB), so this pod is scheduled on node 1. Node 1 has 22 MB of memory, 5 MB is used by Pod A, and another 17 MB is available if Pod A needs more memory. Pod B has a request of 10 MB and a limit of 15 MB (basically the same as Pod A), so this pod is scheduled on node 2.
This is not really a question, but I think this part needs some perspective on how Pods are scheduled onto the nodes. The component that is responsible for telling Kubernetes where a Pod should be scheduled is kube-scheduler. It could well produce the situation you describe:
Pod A, req:10M, limit: 15M -> Node 1, mem: 22MB
Pod B req:10M, limit: 15M -> Node 2, mem: 22MB
Citing the official documentation:
Node selection in kube-scheduler
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
-- Kubernetes.io: Docs: Concepts: Scheduling eviction: Kube scheduler: Implementation
So both nodes have 5 MB of usage out of 22 MB. If Pod C has a request of 5 MB and a limit of 10 MB, will this pod be scheduled on any of the nodes? If yes, what would happen if Pod C needs 10 MB of memory and the other pod needs 15 MB of memory?
In this particular example I would focus much more on the requests than on the actual usage. Assuming there is no other factor that denies the scheduling, Pod C should be spawned on one of the nodes (whichever one kube-scheduler chooses). Why is that?
resources.limits will not deny the scheduling of the Pod (the sum of limits on a node can be higher than the node's memory)
resources.requests will deny the scheduling of the Pod (the sum of requests cannot be higher than the node's allocatable memory)
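To illustrate those two points with the question's numbers (this is only a sketch: the name pod-c and the busybox image are placeholders, and a real workload would normally use Mi units):

# Hypothetical Pod C from the question.
apiVersion: v1
kind: Pod
metadata:
  name: pod-c
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        memory: 5Mi   # only this value is checked by the PodFitsResources filter
      limits:
        memory: 10Mi  # not considered during scheduling; enforced at runtime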
I encourage you to check the following articles for more background:
Sysdig.com: Blog: Kubernetes limits requests
Cloud.google.com: Blog: Products: Containers Kubernetes: Kubernetes best practices (this is a GKE blog post, but it gives the baseline idea; see the section "The lifecycle of a Kubernetes Pod")
What would happen if Pod C has a request of 13 MB and a limit of 15 MB? In this case 13 MB (request of Pod C) + 10 MB (request of Pod A) would be 23 MB (more than 22 MB).
In that example the Pod will not be scheduled, as the sum of requests would exceed the node's memory (assuming no Pod priority/preemption). The Pod will stay in the Pending state.
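If you try this on a real cluster you can watch it happen (pod-c is the hypothetical pod from the sketch above; the exact event wording varies between Kubernetes versions):
kubectl get pod pod-c        # STATUS stays Pending
kubectl describe pod pod-c   # Events show a FailedScheduling entry citing insufficient memory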
Additional resources:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Kube scheduler: What's next
Youtube.com: Deep Dive Into the Latest Kubernetes Scheduler Features
- from CNCF [Cloud Native Computing Foundation] conference

AutoScaling work loads without running out of memory

I have a number of pods running and a horizontal pod autoscaler assigned to target them; the cluster I am using can also add and remove nodes automatically based on the current load.
But we recently had the cluster go offline with OOM errors, and this caused a disruption in service.
Is there a way to monitor the load on each node so that, if usage reaches say 80% of a node's memory, Kubernetes does not schedule more pods on that node but waits for another node to come online?
Pending pods are what you should monitor, and you should define resource requests, which affect scheduling.
The Scheduler uses Resource requests Information when scheduling the pod
to a node. Each node has a certain amount of CPU and memory it can allocate to
pods. When scheduling a pod, the Scheduler will only consider nodes with enough
unallocated resources to meet the pod’s resource requirements. If the amount of
unallocated CPU or memory is less than what the pod requests, Kubernetes will not
schedule the pod to that node, because the node can’t provide the minimum amount
required by the pod. The new Pods will remain in Pending state until new nodes come into the cluster.
Example:
apiVersion: v1
kind: Pod
metadata:
  name: requests-pod
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      requests:
        cpu: 200m
        memory: 10Mi
When you don’t specify a request for CPU, you’re saying you don’t care how much
CPU time the process running in your container is allotted. In the worst case, it may
not get any CPU time at all (this happens when a heavy demand by other processes
exists on the CPU). Although this may be fine for low-priority batch jobs, which aren’t
time-critical, it obviously isn’t appropriate for containers handling user requests.
Short answer: add resource requests but don't add limits; otherwise you will face the CPU throttling issue.
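A rough sketch of that recommendation (the Deployment name, image, and numbers are placeholders): requests give the scheduler the information it needs, while omitting limits avoids CPU throttling.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        resources:
          requests:
            cpu: 200m      # used by the scheduler and by the HPA's utilization target
            memory: 256Mi  # keeps the scheduler from overpacking nodes
          # no limits: CPU is never throttled; add a memory limit only if you
          # want a hard per-container OOM boundary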

What's the difference between pod deletion and pod eviction?

From PodInterface, the two operations Delete and Evict seem to have the same effect: deleting the old Pod and creating a new Pod.
If the two operations have the same effect, why do we need two APIs to delete a Pod and create a new one?
Deletion of a pod is done by an end user and is a normal activity. It means the pod will be deleted from etcd and the Kubernetes control plane. Unless there is a higher-level controller such as a Deployment, DaemonSet, or StatefulSet, the pod will not be created again and scheduled onto a Kubernetes worker node.
Eviction happens when resource consumption on a node exceeds the eviction thresholds and the kubelet evicts pods to reclaim resources, or when a user performs kubectl drain or manually invokes the Eviction API. It is generally not a normal activity. Sometimes evicted pods are not automatically deleted from etcd and the Kubernetes control plane. Unless there is a higher-level controller such as a Deployment, DaemonSet, or StatefulSet, the evicted pod will not be created again and scheduled onto a Kubernetes worker node.
It's preferable to use delete instead of evict, because evict comes with more risk than delete: eviction may in some cases leave an application in a broken state, for example if the replacement pod created by the application's controller (Deployment etc.) does not become ready, or if the last pod evicted has a very long termination grace period.
Pod evict operation (assuming you're referring to the Eviction API) is a sort of smarter delete operation, which respects PodDisruptionBudget and thus it does respect the high-availability requirements of your application (as long as PodDisruptionBudget is configured correctly). Normally you don't manually evict a pod, however the pod eviction can be initiated as a part of a node drain operation (which can be manually invoked by kubectl drain command or automatically by the Cluster Autoscaler component).
Pod delete operation on the other hand doesn't respect PodDisruptionBudget and thus can affect availability of your application. As opposite to the evict operation this operation is normally invoked manually (e.g. by kubectl delete command).
Also, besides the Eviction API, pods can be evicted by the kubelet due to node-pressure conditions, in which case PodDisruptionBudget is not respected by Kubernetes (see Node-pressure Eviction).
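As a small illustration of the difference (the names and numbers below are made up), a PodDisruptionBudget only protects an application against the evict path, not against a plain delete:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # evictions are refused if fewer than 2 pods would stay ready
  selector:
    matchLabels:
      app: web

With this in place, kubectl drain <node> (which goes through the Eviction API) will refuse to evict a pod if doing so would drop the app below 2 ready replicas, while kubectl delete pod removes the pod regardless; node-pressure eviction by the kubelet also ignores the budget.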

Which component in Kubernetes is responsible for resource allocation?

After scheduling the pods in a node in Kubernetes, which component is responsible for sharing resources among the pods in that node?
From https://kubernetes.io/docs/concepts/overview/components :
kube-scheduler - Component on the master that watches newly created pods
that have no node assigned, and selects a node for them to run on.
Factors taken into account for scheduling decisions include individual
and collective resource requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality,
inter-workload interference and deadlines.
After the pod is scheduled, the node's kubelet is responsible for dealing with the pod's requests and limits. Depending on the pod's quality of service and the node's resource pressure, the pod can be evicted or restarted by the kubelet.
After scheduling
That will be the OS kernel.
You can reserve/limit pod resources: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits.
Then it is passed from the kubelet down to the container runtime (e.g. Docker), then to cgroups, and finally to the kernel.
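As a rough illustration of that chain (the pod below is hypothetical, and the exact cgroup file names depend on the cgroup version and container runtime in use), the kubelet and runtime translate the resources block into kernel cgroup settings roughly like this:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        cpu: 250m       # -> cpu.shares ~ 256 on cgroup v1 (relative weight under contention)
        memory: 64Mi    # used for scheduling and eviction ranking, not enforced by the kernel
      limits:
        cpu: 500m       # -> cpu.cfs_quota_us = 50000 with cfs_period_us = 100000 (throttled above this)
        memory: 128Mi   # -> memory.limit_in_bytes = 128Mi (the kernel OOM-kills the container above this)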

Kubernetes pod eviction schedules evicted pod to node already under DiskPressure

We are running a Kubernetes (1.9.4) cluster with 5 masters and 20 worker nodes. We are running one StatefulSet pod with replication 3, among other pods, in this cluster. Initially the StatefulSet pods were distributed across 3 nodes. However, pod-2 on node-2 got evicted due to disk pressure on node-2. When pod-2 was evicted it went to node-1, where pod-1 was already running and node-1 was already experiencing disk pressure. As per our understanding, the kube-scheduler should not have scheduled a (non-critical) pod to a node that is already under disk pressure. Is it the default behaviour to not schedule pods to a node under disk pressure, or is it allowed? At the same time we observed that node-0 had no disk issue, so we were hoping that the pod evicted from node-2 would ideally land on node-0 instead of node-1, which is under disk pressure.
Another observation we had was that when pod-2 on node-2 was evicted, we saw the same pod successfully scheduled, spawned, and moved to the Running state on node-1. However, we still see a "Failed to admit pod" error on node-2 many times for the same pod-2 that was evicted. Is this an issue with the kube-scheduler?
Yes, the scheduler should not assign a new pod to a node with a DiskPressure condition.
However, I think you can approach this problem from a few different angles.
Look into the configuration of your scheduler:
./kube-scheduler --write-config-to kube-config.yaml
and check whether it needs any adjustments. You can find info about additional options for kube-scheduler here:
You can also configure additional scheduler(s) depending on your needs. A tutorial for that can be found here.
Check the logs:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log (on the master)
Look more closely at the kubelet's eviction thresholds (soft and hard) and at how much node memory capacity is set; a configuration sketch follows after the points below.
Bear in mind that:
The kubelet may not observe resource pressure fast enough
or
The kubelet may evict more Pods than needed due to a stats collection timing gap
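For reference, on a reasonably recent kubelet these thresholds live in the KubeletConfiguration file (older releases such as 1.9 set them via --eviction-hard/--eviction-soft flags); the values below are placeholders, not recommendations:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m"
  nodefs.available: "2m"
evictionMaxPodGracePeriod: 60   # seconds allowed for pod termination on soft eviction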
Please check out my suggestions and let me know if they helped.