I'm running a pod on an EKS node with 2500m of requests and no limits - it happily uses around 3000m typically. I wanted to test whether requests were really guaranteed, so I am running a CPU stress test pod on the same node, with 3000m requests and no limits again.
This caused the original pod to be unable to use more than ~1500m of CPU - well below its requests. Then when I turned off the stress pod, it returned to using 3000m.
There are a number of Kubernetes web pages which say that requests are what the pod is "guaranteed" - but does this only mean guaranteed for scheduling, or is it meant to be an actual runtime guarantee? If it is guaranteed, why might my pod's CPU usage have been restricted (noting that there is no throttling for pods without limits)?
Requests are not a guarantee that resources (especially CPU) will be available at runtime. If you set requests and limits very close together you have better expectations, but you need every pod in the system to cooperate to have a real guarantee.
Resource requests only affect the initial scheduling of the pod. In your example, you have one pod that requests 2.5 CPU and a second pod that requests 3 CPU. If your node has 8 CPU, both can be scheduled on the same node, but if the node only has 4 CPU, they need to go on separate nodes (if you have the cluster autoscaler, it can create a new node).
To carry on with the example, let's say the pods get scheduled on the same node with 8 CPU. Once they've been scheduled, the resource requests stop acting as a gate. Neither pod has resource limits, but let's say the smaller pod actually tries to use 3 CPU and the larger pod (a multi-threaded stress test) tries to use 13 CPU. This is more than the physical capacity of the system, so the kernel will divide the available processor cycles between the two processes, weighted by their requests (which map to cgroup CPU shares).
For CPU usage, if the node is overcommitted, you'll just see slowdowns in all of the processes. Memory or disk ("ephemeral storage") pressure can cause pods to be evicted and rescheduled on different nodes; the pods that get evicted are the ones that exceed their resource requests by the most. Memory pressure can also cause the node to run out of physical memory, and pods can get OOMKilled.
If every pod sets resource requests and limits to the same value then you do have an approximate guarantee that resources will be available, since nothing will be able to use more resource than the pod scheduler allocates it. For an individual pod and for non-CPU resources, if resource requests and limits are the same, your pod won't get evicted if the node is overcommitted (because it can't exceed its requests). On the other hand, most processes won't use exactly their resource requests, so setting requests high enough that you're guaranteed not to be evicted also leaves the node with unused resources: your cluster as a whole becomes less efficient (more nodes to do the same work, at higher cost), but more reliable, since pods won't get killed off as often.
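As a minimal sketch of that "requests == limits" pattern (the pod name, image, and sizes here are made up):

apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: "2500m"
          memory: "1Gi"
        limits:
          cpu: "2500m"          # limits == requests: the pod can never use
          memory: "1Gi"         # more than the scheduler accounted for

If every container on the node is pinned like this, nothing can burst past its allocation and the requests become an effective runtime guarantee.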
I am debugging a problem with pod eviction in Kubernetes.
It looks like it is related to the configured number of PHP-FPM child processes.
I assigned a minimum memory of 128 MB and Kubernetes is evicting my pod, apparently when it exceeds 10x that amount ("The node was low on resource: memory. Container phpfpm was using 1607600Ki, which exceeds its request of 128Mi.").
How can I prevent this? I thought that the requested resources were the minimum and that the pod could use whatever is available if there's no upper limit.
Requested memory is not "the minimum"; it is exactly what the name says - the amount of memory requested by the pod. When Kubernetes schedules a pod, it uses the request as guidance to choose a node that can accommodate the workload, but it doesn't guarantee that the pod won't be killed if the node runs short on memory.
As per the docs (https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run):
if a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
If you want to guarantee a certain memory window for your pods, you should use limits, but in that case, if your pod doesn't use most of this memory, it will be "wasted".
So to answer your question "How can I prevent this?", you can:
reconfigure your php-fpm in a way that prevents it from using 10x its requested memory (i.e. reduce the worker count), and configure autoscaling. That way your overloaded pods won't be evicted, and Kubernetes will schedule new pods in the event of higher load
set a memory limit to guarantee a certain amount of memory to your pods (see the sketch after this list)
Increase memory on your nodes
Use affinity to schedule your demanding pods on some dedicated nodes and other workloads on separate nodes
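As a sketch of option 2 (the sizes are illustrative guesses, not taken from your deployment; set them from the real working set of your php-fpm pod):

containers:
  - name: phpfpm
    image: php:8-fpm              # placeholder image
    resources:
      requests:
        memory: "512Mi"           # raised from 128Mi to match actual usage
      limits:
        memory: "512Mi"           # request == limit: the pod can no longer exceed
                                  # its request, so it stops being an eviction
                                  # candidate; instead the container is OOM-killed
                                  # if it really goes past 512Mi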
How are Kubernetes resource requests handled in practice if the "owner" does not use these resources, but another pod would require them? Will they temporarily be granted to the other pod, or do they lead to idle capacity?
Example: Given two Pods/Deployments (on the same node):
Pod A (Requests 40% CPU)
Pod B (Requests 60% CPU, Limit 80% CPU)
Pod A crashes internally, so it will never use any resources (actual usage 0%).
Questions:
Can Pod B use the 80%, or will it be limited to 60% (to reserve the guaranteed 40% for Pod A, even though Pod A will never use those 40%, effectively leaving 40% idle)?
Could Pod B even get more than the 80% (on an otherwise idle system), or is the 80% absolutely hard enforced?
I have not found any docs explaining this in detail (they only talk about scheduling based on resources); any links would be very much appreciated. Background/motivation: I have an extremely slow Node app on a pod and I suspect this is related to resource requests/limits. Thanks very much!
On the node, the only Kubernetes components are the kubelet, kube-proxy, and the container runtime, which is responsible for enforcing the limits established by Kubernetes. So how the limits are enforced depends on which container runtime you have. Let's suppose you are using Docker. Then get inside your node and check how Docker is establishing limits for your pod.
For example, memory limits (where <container-id> is a placeholder for your pod's container ID):
docker inspect -f "{{.HostConfig.Memory}}" <container-id>
Resource requests only affect pod placement on nodes. In your example, the node has 100% CPU requests' worth of pods on it, so (assuming the node has 1 CPU) nothing else can get scheduled there.
CPU limits throttle the process running. If B is in a busy loop of some sort, it will never get more than 80% CPU. It could use all of that 80% CPU, even if it only requested 60% and other scheduled pods have requested the other 40%; only the limit matters here. It could get less, if A is also trying to use 40% or 100%; the kernel will allocate things according to its own policy.
(Memory works similarly, except that the kernel can't time-slice memory if the system is overcommitted, so it will kill off an arbitrary process, usually the one with the highest memory usage.)
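To make the example concrete, assuming a 1-CPU node so that the percentages map directly to millicores (names and images are made up):

apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  containers:
    - name: main
      image: registry.example.com/a:latest   # placeholder
      resources:
        requests:
          cpu: "400m"     # 40%: a scheduling weight only, not a reservation
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  containers:
    - name: main
      image: registry.example.com/b:latest   # placeholder
      resources:
        requests:
          cpu: "600m"     # 60%: scheduling weight
        limits:
          cpu: "800m"     # 80%: hard ceiling, enforced even when A is idle

With A crashed and idle, B can burst up to its 800m limit but never beyond it; A's unused 400m is not held in reserve by the kernel.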
We have a Java application distributed over multiple pods on Google Cloud Platform. We also set memory requests to give the pod a certain part of the memory available on the node for heap and non-heap space.
The application is very resource-intensive in terms of CPU while starting up, but barely uses the CPU after the pod is ready (only 0.5% is used). If we use container resource "requests", the pod does not release these resources after startup has finished.
Does Kubernetes allow specifying that a pod may use (nearly) all the CPU power available during startup and release those resources afterwards? Thanks to rolling updates, we can ensure that no two pods start at the same time.
Thanks for your help.
If you specify requests without a limit, the value will be used for scheduling the pod onto an appropriate node that satisfies the requested available CPU bandwidth. The kernel scheduler will assume that the requests match the actual resource consumption, but it will not prevent usage beyond them; the excess is effectively 'stolen' from other containers.
If you specify a limit as well, your container will get throttled if it tries to exceed that value. You can combine both to allow bursting CPU usage that exceeds the usual requests without taking everything from the node and slowing down other processes.
"Does Kubernetes allow to specify that a pod is allowed to use
(nearly) all the cpu power available during start and release those
resources after that?"
A key word here is "available". The answer is "yes" and it can be achieved by using Burstable QoS (Quality of Service) class. Configure CPU request to a value you expect the container will need after starting up, and either:
configure a CPU limit higher than the CPU request, or
don't configure a CPU limit, in which case either the namespace's default CPU limit will apply if defined, or the container "...could use all of the CPU resources available on the Node where it is running".
If there isn't CPU available on the Node for bursting, the container won't get anything beyond the requested value, and as a result the starting of the application could be slower.
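A minimal sketch of that Burstable setup (the numbers are assumptions to show the shape, not recommendations):

containers:
  - name: java-app
    image: registry.example.com/java-app:1.0   # placeholder
    resources:
      requests:
        cpu: "250m"       # what the app needs once it is up and serving
      limits:
        cpu: "4"          # headroom for the CPU-hungry startup phase; omit the
                          # limits block entirely to let the container use
                          # whatever CPU is free on the node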
It is worth mentioning what the docs explain for Pods with multiple Containers:
The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
If running Kubernetes v1.12+ and have access to configure kubelet, the Node CPU Management Policies could be of interest.
One factor for scheduling pods onto nodes is resource availability, and the Kubernetes scheduler calculates used resources from the request value of each pod. If you do not assign any value to the request parameter, the request for that deployment will be zero. The request parameter doesn't ensure that the pod will actually use that much CPU or RAM; you can get the current resource usage from "kubectl top pods / nodes".
The request parameter reserves (buffers) resources for a pod, whereas the limit puts a cap on a pod's resource usage.
You can get more information here: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/. This will give you a rough idea of requests and limits.
I've been reading the kubernetes documentation https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
But it's still not clear to me what the difference is between spec.containers[].resources.limits.cpu and spec.containers[].resources.requests.cpu, and what the impact on resource limitation is.
Can you please suggest some reads or books where this is explained in plain English?
Thanks in advance
When a Kubernetes pod is scheduled onto a particular node, the node must have enough resources for the pod to run. Kubernetes knows the resources of its nodes, but how does it know beforehand how much resource a pod will take, so that it can schedule effectively? That is what requests are for. When we specify a resource request, Kubernetes guarantees (for scheduling purposes) that the pod will get that amount of resource.
On the other hand, a limit caps the resource usage of a pod: Kubernetes will not allow a pod to take more resources than its limit. For CPU, if a pod tries to exceed its limit, Kubernetes throttles its CPU artificially. If a pod exceeds a memory limit, it will be terminated. To keep it simple: the limit is always greater than or equal to the request.
This example will give you an idea of requests and limits. Say there is a pod with a memory request of 7GB and a memory limit of 10GB. There are three nodes in your cluster: node1 has 2GB of memory, node2 has 8GB, and node3 has 16GB. Your pod will never be scheduled on node1, but it will be scheduled on either node2 or node3, depending on their available resources. Wherever it runs, it will be terminated in any scenario where it exceeds 10GB of memory usage.
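In the pod spec, that example corresponds to (a sketch):

resources:
  requests:
    memory: "7Gi"     # rules out node1 (2GB); node2 and node3 qualify
  limits:
    memory: "10Gi"    # crossing this gets the pod terminated on any node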
Memory is kind of trivial to understand: requests is guaranteed and limits is something that cannot be exceeded. This also means that when you issue kubectl describe nodes | tail -10, for example, you could see a phrase like:
"Total limits may be over 100 percent, i.e., overcommitted".
This means that the total sum of requests.memory is <= 100% (otherwise pods could not be scheduled, and this is the meaning of guaranteed memory). At the same time, if you see a value higher than 100%, it means that the total sum of limits.memory can go above 100% (and this is the overcommitted part of the message). So when a node tries to schedule a pod, it will only check requests.memory to see if it has enough memory.
The CPU part is more complicated.
requests.cpu translates to CPU shares, and without looking at all the pods on the node it might make little to no sense, to be honest. IMHO, the easiest way to understand this property is by looking at an example.
Suppose you have 100 cores available on a node, you deploy a single pod and set requests.cpu = 1000m. In such a case, your pod can use all 100 CPUs, both min and max.
You have the same machine (100 cores), but you deploy two pods with requests.cpu = 1000m. In such a case, your pods can use 50 cores each minimum, and 100 max.
Same node, 4 pods (requests.cpu = 1000m). Each pod can use 25 cpu min, and 100 max.
You get the picture, it matters what all pods set for requests.cpu to get an overall picture.
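As a rule of thumb, when every pod on the node is trying to burn as much CPU as it can (a simplification that ignores system daemons and any limits):

cpu(pod) = node_cores * requests.cpu(pod) / sum(requests.cpu over all pods on the node)

and anything between that value and the whole node when the other pods are idle.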
limits.cpu is a lot more interesting, and it translates to two properties on the cgroup: CPU period and CPU quota. It says how much time (quota) you can get in a certain timeframe (period). An example should make things simpler here as well.
Suppose period=100ms and quota=20ms and you get a request that will finish in 50ms on your pod.
This is how it will look:
| 100ms || 100ms || 100ms |
| 20 ms ......|| 20 ms ......|| 10 ms ......|
Because it takes 50ms to process a request, and we have only 20ms available for every 100ms, it will take 300ms in total, to process our request.
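In pod-spec terms, the 20ms-per-100ms example above is what you get from a CPU limit of 200m, since the kubelet uses a 100ms CFS period by default:

resources:
  limits:
    cpu: "200m"   # -> cpu.cfs_period_us = 100000, cpu.cfs_quota_us = 20000:
                  # 20ms of CPU time per 100ms window, exactly as in the diagram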
That being said, there are quite a lot of people that recommend not setting CPU limits at all - Google engineers, Zalando, Monzo, etc., including us. We do not set them, and there are strong reasons for that (which go beyond this question).
In short:
for CPU & memory requests: k8s guarantees that what you declared at scheduling time is what you will get.
for CPU & memory limits: k8s guarantees you cannot exceed the value you set.
The results when your pod exceeds the limits:
for CPU: k8s throttles your container
for memory: OOM, k8s kills your pod
Concept
Containers specify a request, which is the amount of that resource the system will guarantee to the container.
Containers specify a limit, which is the maximum amount the system will allow the container to use.
Best practices for CPU limits and requests on Kubernetes
Use CPU requests for everything and make sure they are accurate
Do NOT use CPU limits.
Best practices for Memory limits and requests on Kubernetes
Use memory limits and memory requests
Set memory limit = memory request
For more details on limits and request setting, please refer to this answer
Details
Containers can specify a resource request and limit, 0 <= request <= Node Allocatable & request <= limit <= Infinity
If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on requests and not limits
The pods and its containers will not be allowed to exceed the specified limit. How the request and limit are enforced depends on whether the resource is compressible or incompressible
Compressible Resource Guarantees
Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests 600 milli CPUs and container B requests 300 milli CPUs, and both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio.
Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
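A sketch of the 600m/300m example above, as two containers on a node with 1 allocatable CPU (images are placeholders):

containers:
  - name: a
    image: registry.example.com/a:latest   # placeholder
    resources:
      requests:
        cpu: "600m"   # under full contention A gets 600m plus ~67m of the spare 100m
  - name: b
    image: registry.example.com/b:latest   # placeholder
    resources:
      requests:
        cpu: "300m"   # ...and B gets 300m plus ~33m (the 2:1 split)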
Incompressible Resource Guarantees
Pods will get the amount of memory they request, if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
When Pods use more memory than their limit, a process that is using the most amount of memory, inside one of the pod's containers, will be killed by the kernel.
Purpose
Kubernetes provides different levels of Quality of Service to pods depending on what they request. Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority. The relationship between "Requests and Limits" and "QoS Classes" is subtle.
If limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed.
If requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable. When limits are not specified, they default to the node capacity.
If requests and limits are not set for all of the resources, across all containers, then the pod is classified as Best-Effort.
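Minimal sketches of the three classes (the values are arbitrary):

# Guaranteed: limits set and equal to requests, for every resource, in every container
resources:
  requests: { cpu: "500m", memory: "256Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }

# Burstable: requests set, limits higher or absent
resources:
  requests: { cpu: "250m", memory: "128Mi" }
  limits:   { cpu: "1", memory: "512Mi" }

# Best-Effort: no requests and no limits anywhere in the pod
resources: {}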
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
Source: Resource Quality of Service in Kubernetes
I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources, so if you do not specify requests the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a condition for a pod that is already scheduled and running.
If you overcommit the actual resources on a node you will run into typical issues: if you overcommit on memory the node will start to swap, and with CPU there will just be a general slowdown. Either way, the node and the pods on it will become unresponsive. That is difficult to deal with; requests and limits set up sane boundaries that help you avoid taking things that far, where instead you'll simply see a pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, answers to your questions:
At most 10 pods will be scheduled onto your node: when you set a memory limit without a request, Kubernetes defaults the request to the limit, so each pod effectively requests 100Mi.
If there is no free memory on the node, the evicted pods will stay Pending. Also, k8s can simply evict a pod that is using more than it requested when the resources are needed for other pods and services.
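For reference, a sketch of the container spec from the question, with the request defaulting spelled out:

resources:
  limits:
    memory: "100Mi"
  # no requests block: Kubernetes copies the limit into the request, so the
  # scheduler accounts 100Mi per pod - roughly 10 pods on a 1GB node, not 100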