Kubernetes: do evicted pods with no resource requests get rescheduled successfully? - kubernetes

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?

A pod will be scheduled as long as there's an eligible node that can satisfy its requested resources. So if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a condition on a running pod that has already been scheduled.
If you overcommit the actual resources on a node, you will run into the typical issues: overcommit on memory and the node will start to swap; overcommit on CPU and there will just be a general slowdown. Either way, the node and the pods on it will become unresponsive. That is difficult to deal with, and tools like requests and limits set up sane boundaries that help you avoid taking things that far; instead you'll simply see the pod fail to schedule.

When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gave you some understanding on how it internally works. So answers for your questions:
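At most 10 pods will be scheduled onto your node: when a memory limit is set and the request is left unset, Kubernetes uses the limit (100MB here) as the request.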
If there is no free memory on the node, the evicted pods will stay Pending. Also, k8s can simply evict a pod if it exceeds its limits when resources are needed for other pods and services.
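The "at most 10 pods" figure can be checked with a little arithmetic. The following is only a sketch: it assumes the whole 1GB (taken as 1000MB) is allocatable to pods, which a real node would not offer, and the function names are made up for illustration. It relies on the Kubernetes behaviour that a container with a limit but no request gets a request equal to its limit.

```python
def effective_request_mb(request_mb, limit_mb):
    """Return the memory request the scheduler would see for one container."""
    # With a limit set and no request, the limit is used as the request.
    if request_mb is None and limit_mb is not None:
        return limit_mb
    return request_mb or 0


def max_schedulable_pods(node_allocatable_mb, request_mb, limit_mb, replicas):
    """Count how many identical pods fit by request, ignoring actual usage."""
    per_pod = effective_request_mb(request_mb, limit_mb)
    if per_pod == 0:
        # No request at all: memory never blocks scheduling in this model.
        return replicas
    return min(replicas, node_allocatable_mb // per_pod)


# Scenario from the question: 100 replicas, 100MB limit, request unset.
print(max_schedulable_pods(1000, None, 100, 100))   # 10
# Same deployment with neither request nor limit set.
print(max_schedulable_pods(1000, None, None, 100))  # 100
```

The remaining 90 pods in the first case would stay Pending until more capacity appears.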

Related

Kubernetes reserving headroom on nodes to allow memory peaks

I'm running a pod on k8s that has a baseline memory usage of about 1GB. Sometimes, depending on user behaviour, the pod can consume much more memory (10-12GB) for a few minutes, and then drop down to the baseline level.
What I would like to configure in k8s is a memory request that is quite low (1GB), while the pod runs on a node with a much higher memory capacity. This way, when needed, the pod will just grow and shrink back. The reason I don't configure the request to be higher is that I have multiple replicas of this pod, and ideally I would want all of them to be hosted on 1-2 nodes, and let each one peak when needed, without spending too much.
It was counterintuitive to find out that the memory limit configuration does not affect node selection, meaning that if I configure the limit to be 12GB I can still get a 4GB node.
Is there any way to configure my pods to share some large nodes, so they will be able to extend their memory usage without crashing?
Resource Requests vs Limits
The memory limit doesn't affect node selection, but the memory request does. This is because resource requests and resource limits serve different purposes.
A resource request makes sure that your pod has at least the requested amount of resources at all times.
A resource limit makes sure that your pod does not have more than the limit amount of resources at any time. For example, not setting memory limits correctly leads to OOM (out-of-memory) kills on pods, and this happens very often.
Resource requests are the parameters used by the scheduler to determine how to schedule your pods on certain nodes. Limits are not used for this purpose, and it is up to the application designer to set pod limits correctly.
So What Can You Do
You can set your pod request the same way you are doing currently. Say, 1.5GiB of memory. You can then, from your experience, check that your pod can consume as much as 12GiB of memory. And you are running 2 pods of this application. Therefore, if you want to schedule both these pods on a single node and not have any issues, you have to make sure your node has:
total_memory = 2 * pod_limit + overhead (system) OR
total_memory ~= 2 * pod_limit
Overhead is the memory used by your idle node: certain OS components and cluster system components such as the kubelet and kube-proxy binaries. This is generally a very small value.
This will allow you to select the right node for your pods.
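As a quick worked example of the sizing rule above, here is a minimal sketch. The function name is hypothetical, and the 1 GiB overhead is an assumed placeholder rather than a measured value.

```python
def required_node_memory_gib(replicas, pod_limit_gib, overhead_gib=0.0):
    """total_memory = replicas * pod_limit + overhead (system)."""
    return replicas * pod_limit_gib + overhead_gib


# Two replicas that can each peak at 12 GiB, plus an assumed 1 GiB of overhead.
print(required_node_memory_gib(2, 12, overhead_gib=1))  # 25
```

This sizes the node for the worst case where every replica peaks at once; if peaks rarely overlap, a smaller node may be acceptable at the cost of occasional memory pressure.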
Summary
You will have to make sure that your pod has access to a lot more memory in order to handle that spike (but not too much, which is why you set a limit). You can read more about this here.
Note: Strictly speaking, pods don't shrink and grow; they have a fixed resource profile, which is determined by the resource requests and resource limits in the definition. Also note that a pod is always "occupying" the requested amount of resources, even if it is not using them.
Other Elements to Control Scheduling
Additionally, you can use node affinities to add a 'preference' for the scheduler to schedule your pods. I say a preference because node affinities are not definite rules, but guidelines. You can also use anti-affinities to make sure certain pods don't get scheduled on the same node. Just keep in mind that if a pod is in danger of not being scheduled, the affinity or anti-affinity rules can possibly be ignored by the scheduler.
The last option is, you can use nodeSelector on your pod, to make sure it lands on the node which specifies the conditions of the selector. If no node matches this selector, your pod will be stuck in Pending state, meaning it cannot be scheduled anywhere. You can read about this here.
So, after you've decided which nodes you want your pod to be scheduled on, you can label them, and use a selector to ensure the pods are scheduled on the matching node.
It is also possible to provide a specific nodeName to your pod, to force it to schedule on a particular node. This option is rarely used, simply because nodes can be added to or deleted from your cluster, and you would have to change your pod definition every time this happens. Using nodeSelector is better, since you can specify general attributes like a label, which you can attach to any new node you add to the cluster; neither the pod scheduling nor the pod definition is affected.
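The nodeSelector behaviour described above amounts to a label match: a node is eligible only if it carries every key/value pair in the selector. A rough model, with hypothetical node names and a made-up `memory-class` label:

```python
def matches_node_selector(node_labels, node_selector):
    """A node matches when it has every label the selector asks for."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


def eligible_nodes(nodes, node_selector):
    """Return the names of nodes whose labels satisfy the selector."""
    return [name for name, labels in nodes.items()
            if matches_node_selector(labels, node_selector)]


# Hypothetical cluster: the labels are illustrative, not standard ones.
nodes = {
    "node-a": {"memory-class": "high"},
    "node-b": {"memory-class": "standard"},
}

print(eligible_nodes(nodes, {"memory-class": "high"}))  # ['node-a']
print(eligible_nodes(nodes, {"memory-class": "huge"}))  # [] (pod stays Pending)
```

Labelling a newly added node with `memory-class: high` would make it eligible without touching the pod definition, which is the advantage over nodeName.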
Hope this helps.

Are Kubernetes requests really guaranteed?

I'm running a pod on an EKS node with 2500m of requests and no limits - it happily uses around 3000m typically. I wanted to test whether requests were really guaranteed, so I am running a CPU stress test pod on the same node, with 3000m requests and no limits again.
This caused the original pod to not be able to use more than ~1500m of CPU, well below its requests. Then when I turned off the stress pod, it returned to using 3000m.
There are a number of Kubernetes webpages which say that requests are what the pod is "guaranteed", but does this only mean guaranteed for scheduling, or should it actually be a guarantee at runtime? If it is guaranteed, why might my pod's CPU usage have been restricted (noting that there is no throttling for pods without limits)?
Requests are not a guarantee that resources (especially CPU) will be available at runtime. If you set requests and limits very close together you have better expectations, but you need every pod in the system to cooperate to have a real guarantee.
Resource requests only affect the initial scheduling of the pod. In your example, you have one pod that requests 2.5 CPU and a second pod that requests 3 CPU. If your node has 8 CPU, both can be scheduled on the same node, but if the node only has 4 CPU, they need to go on separate nodes (if you have the cluster autoscaler, it can create a new node).
To carry on with the example, let's say the pods get scheduled on the same node with 8 CPU. Now that they've been scheduled the resource requests don't matter any more. Neither pod has resource limits, but let's say the smaller pod actually tries to use 3 CPU and the larger pod (a multi-threaded stress test) uses 13 CPU. This is more than the physical capacity of the system, so the kernel will allocate processor cycles to the two processes.
For CPU usage, if the node is overcommitted, you'll just see slow-downs in all of the processes. Either memory or disk ("ephemeral storage") can cause pods to be Evicted and rescheduled on different nodes; the pods that get evicted are the ones that exceed their resource requests by the most. Memory can also cause the node to run out of physical memory, and pods can get OOMKilled.
If every pod sets resource requests and limits to the same value, then you do have an approximate guarantee that resources will be available, since nothing will be able to use more resources than the pod scheduler allocates it. For an individual pod and for non-CPU resources, if resource requests and limits are the same, your pod won't get evicted if the node is overcommitted (because it can't exceed its requests). On the other hand, most processes won't generally use exactly their resource requests, so setting requests high enough that you're guaranteed not to be evicted also means you're leaving resources on the node unused. Your cluster as a whole will be less efficient (it needs more nodes to do the same work and is more expensive) but more reliable, since pods won't get killed off as often.
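The eviction rule mentioned above (pods that exceed their requests by the most go first) can be modelled as a sort. This is a deliberately simplified sketch with invented pod names and numbers; the real kubelet also takes other factors, such as pod priority, into account.

```python
def eviction_order(pods):
    """Order pod names by how far memory usage exceeds the request, largest first."""
    ranked = sorted(pods,
                    key=lambda p: p["usage_mb"] - p["request_mb"],
                    reverse=True)
    return [p["name"] for p in ranked]


pods = [
    # Requests equal to the limit and usage below it: never over its request.
    {"name": "guaranteed", "usage_mb": 900, "request_mb": 1000},
    # Small request, large usage: the biggest overshoot.
    {"name": "bursty", "usage_mb": 800, "request_mb": 200},
    # No request at all: everything it uses counts as overshoot.
    {"name": "no-request", "usage_mb": 500, "request_mb": 0},
]

print(eviction_order(pods))  # ['bursty', 'no-request', 'guaranteed']
```

The pod whose usage stays under its request ends up last in line, which is the reason for setting requests equal to limits when eviction must be avoided.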

Can we have --pod-eviction-timeout=300m?

I have a k8s cluster in which we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on it.
To prevent pod eviction from happening, we have configured all the pods as Guaranteed QoS. I know that even with this, pod eviction can happen if there is any resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we get to know well before a pod gets evicted. This helps us take measures before the pod gets evicted.
The other reason for pod eviction is a node in the not-ready state: kube-controller-manager will check the pod-eviction-timeout and evict the pods after this timeout. We have a monitor to alert us when a node goes to the not-ready state. After this alert, we want to take some measures to clean up on the application side, so the application ends gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod utilises more resources, the kubelet can itself evict this pod. I wanted to know what the other impacts of waiting for such a long time are.
As #coderanger said, your limits are incorrect, and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue was, by default it will be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend redoing your architecture and rewriting the app to use Kubernetes the way it's supposed to be used:
if a pod is still being sent requests when it's unresponsive, you should put a LB in front or queue the requests;
if IPs changing after pod restarts is a problem, this should be fixed by using DNS and a service instead of connecting directly to a pod;
if your pod is being evicted, check why, and fix the limits and requests.
As for the node, there is a really nice blog post, Improving Kubernetes reliability: quicker detection of a Node down. It argues for the opposite of what you are thinking of doing, but it also mentions why 340s is too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to change default values to higher you can look into changing these:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
to higher ones.
If you provide more details I'll try to help more.
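To see where the 340s figure comes from and what 300m would mean, the two controller-manager settings listed above can simply be added up. This is a back-of-the-envelope sketch: it assumes the delay is just node-monitor-grace-period plus pod-eviction-timeout and ignores the smaller polling intervals. The function name is made up.

```python
from datetime import timedelta


def time_until_pod_eviction(node_monitor_grace_period, pod_eviction_timeout):
    """Approximate delay between a node going silent and its pods being evicted."""
    return node_monitor_grace_period + pod_eviction_timeout


# Defaults listed above: 40s grace period + 5m eviction timeout.
default = time_until_pod_eviction(timedelta(seconds=40), timedelta(minutes=5))
print(default.total_seconds())   # 340.0

# Proposed setting from the question: pod-eviction-timeout=300m.
proposed = time_until_pod_eviction(timedelta(seconds=40), timedelta(minutes=300))
print(proposed.total_seconds())  # 18040.0
```

With the proposed value, pods on a dead node would remain registered behind their services for roughly five hours, and requests routed to them would fail for that whole period.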

What does it mean OutOfcpu error in kubernetes?

I got OutOfcpu in Kubernetes on Google Cloud. What does it mean? My pods seem to be working now; however, there were pods in this same revision which got OutOfcpu.
It means that the kube-scheduler can't find any node with available CPU to schedule your pods:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it’s feasible to
schedule the Pod. For example, the PodFitsResources filter checks
whether a candidate Node has enough available resource to meet a Pod’s
specific resource requests.
[...]
PodFitsResources: Checks if the
Node has free resources (eg, CPU and Memory) to meet the requirement
of the Pod.
Also, as per Assigning Pods to Nodes:
If the named node does not have the resources to accommodate the pod,
the pod will fail and its reason will indicate why, e.g. OutOfmemory
or OutOfcpu.
In addition to how-kube-scheduler-schedules-pods, I think this will be helpful for understanding why the OutOfcpu error showed up.
When you create a Pod, the Kubernetes scheduler selects a node for the
Pod to run on. Each node has a maximum capacity for each of the
resource types: the amount of CPU and memory it can provide for Pods.
The scheduler ensures that, for each resource type, the sum of the
resource requests of the scheduled Containers is less than the
capacity of the node. Note that although actual memory or CPU resource
usage on nodes is very low, the scheduler still refuses to place a Pod
on a node if the capacity check fails. This protects against a
resource shortage on a node when resource usage later increases, for
example, during a daily peak in request rate.
Ref: how-pods-with-resource-requests-are-scheduled
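The filter-then-score flow and the request-based capacity check quoted above can be sketched as follows. This is a toy model, not the scheduler's actual code: node names and numbers are invented, CPU is in millicores, and "pick the node with the most unrequested CPU" is an assumed stand-in for the real scoring step.

```python
def free_cpu(node):
    """CPU not yet claimed by requests; actual usage is deliberately ignored."""
    return node["allocatable"] - sum(node["requests"])


def filter_nodes(nodes, pod_request):
    """Filtering: keep only nodes whose unrequested CPU covers the pod's request."""
    return [name for name, node in nodes.items()
            if pod_request <= free_cpu(node)]


def pick_node(nodes, pod_request):
    """Scoring (simplified assumption): prefer the feasible node with most free CPU."""
    feasible = filter_nodes(nodes, pod_request)
    if not feasible:
        return None  # no feasible node: the pod cannot be placed
    return max(feasible, key=lambda name: free_cpu(nodes[name]))


nodes = {
    "n1": {"allocatable": 4000, "requests": [2500, 1000]},
    "n2": {"allocatable": 4000, "requests": [1000]},
}

print(pick_node(nodes, 1000))  # n2
print(pick_node(nodes, 3500))  # None
```

Note that `n1` is rejected for the 1000m pod even if its running pods are idle, because only the sum of requests is compared against capacity.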

k8s - how scheduler assigns the nodes

I am just curious to know how k8s master/scheduler will handle this.
Let's say I have a k8s master with 2 nodes. Assume that each node has 8GB RAM and that each node is running a pod which consumes 3GB RAM.
node A - 8GB
- pod A - 3GB
node B - 8GB
- pod B - 3GB
Now I would like to schedule another pod, say pod C, which requires 6GB RAM.
Question:
Will the k8s master shift pod A or B to other node to accommodate the pod C in the cluster or will the pod C be in the pending status?
If the pod C is going to be in pending status, how to use the resources efficiently with k8s?
Unfortunately I could not try this with my minikube. If you know how k8s scheduler assigns the nodes, please clarify.
Most of the Kubernetes components are split by responsibility and workload assignment is no different. We could define the workload assignment process as Scheduling and Execution.
The Scheduler, as the name suggests, is responsible for the Scheduling step. The process can be briefly described as "get a list of pods; if a pod is not scheduled to a node, assign it to one node with capacity to run the pod". There is a nice blog post from Julia Evans here explaining schedulers.
The Kubelet is responsible for the Execution of pods scheduled to its node. It gets a list of pod definitions allocated to its node and makes sure they are running with the right configuration; if they are not running, it starts them.
With that in mind, the scenario you described will behave as expected: the pod will not be scheduled, because you don't have a node with capacity available for it.
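The scenario from the question can be replayed with a toy scheduler. This sketch assumes the 3GB figures are the pods' memory requests and that the full 8GB per node is allocatable; the function names are made up.

```python
def make_cluster():
    """Two 8GB nodes, each already running one pod that requests 3GB."""
    return {
        "node A": {"capacity_gb": 8, "pods": {"pod A": 3}},
        "node B": {"capacity_gb": 8, "pods": {"pod B": 3}},
    }


def schedule(nodes, pod_name, request_gb):
    """Bind the pod to the first node with enough unrequested memory."""
    for node_name, node in nodes.items():
        free_gb = node["capacity_gb"] - sum(node["pods"].values())
        if request_gb <= free_gb:
            node["pods"][pod_name] = request_gb
            return node_name
    # Nothing fits: the pod stays Pending and running pods are not moved.
    return None


cluster = make_cluster()
print(schedule(cluster, "pod C", 6))  # None
print(schedule(cluster, "pod D", 5))  # node A
```

Each node has only 5GB unrequested, so a 6GB pod fits nowhere even though the cluster has 10GB free in total.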
Resource balancing is mainly decided at the scheduling level. A nice way to see this is to add a new node to the cluster: if there are no pods pending allocation, the node will not receive any pods. A brief description of the logic used for resource balancing can be seen in this PR.
The solutions,
Kubernetes ships with a default scheduler. If the default scheduler does not suit your needs, you can implement your own scheduler as described here. The idea would be to implement an extension for the scheduler that reschedules pods already running when the cluster has capacity, but it is not distributed well enough to allocate the new load.
Another option is to use tools created for scenarios like this. The Descheduler is one: it monitors the cluster and evicts pods from nodes so that the scheduler re-allocates the pods with a better balance. There is a nice blog post here describing these scenarios.
PS:
Keep in mind that the total memory of a node is not all allocatable. Depending on which provider you use, the allocatable capacity will be much lower than the total. Take a look at this SO question: Cannot create a deployment that requests more than 2Gi memory
Find the answers below.
Will the k8s master shift pod A or B to other node to accommodate the pod C in the cluster or will the pod C be in the pending status?
No. pod A and pod B would still be running, pod C will not be scheduled.
If the pod C is going to be in pending status, how to use the resources efficiently with k8s?
Neither node can meet the resource requirements needed to run pod C, and hence it can't be scheduled.
You mentioned that the node capacity is 8 GB RAM. Note that the whole 8 GB is not available to run workloads: a certain amount of RAM is reserved for kube-proxy, the kubelet and other node management activities.