QoS class of Guaranteed for Pod in Kubernetes

On my kubernetes nodes there are
prioritized pods
dispensable pods
Therefore I would like to have QoS class of Guaranteed for the prioritized pods.
To achieve the Guaranteed class, the CPU/memory requests and limits must meet certain conditions, in particular:
For every Container in the Pod, the CPU limit must equal the CPU
request
But I would like to set a higher CPU limit than request, so that the prioritized pods can use any free CPU resources that are available.
Simple example: A Node with 4 cores has:
1 prioritized pod with a 2000m CPU request and a 3900m CPU limit
3 dispensable pods, each with a 500m CPU request and limit.
If the prioritized pod had a 2000m CPU request and limit, 2 cores would be wasted, because the dispensable pods don't use their CPU most of the time.
If the prioritized pod had a 3900m CPU request and limit, I would need an extra node for the dispensable pods.
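For reference, a minimal sketch of the prioritized pod's resources section as described above (values taken from the example; note that with request != limit the pod is classified as Burstable, not Guaranteed):
resources:
  requests:
    cpu: "2000m"   # what the scheduler reserves on the node
  limits:
    cpu: "3900m"   # what the pod may burst up to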
Questions
Is it possible to explicitly set the Guaranteed class on a pod even with different CPU request and limit?
If it's not possible: Why is there no way to explicitly set the QoS class?
Remarks
There's a system-cluster-critical priority class, but I think this should only be used for critical k8s add-on pods, not for critical applications.

Is it possible to explicitly set the Guaranteed class on a pod even with different CPU request and limit?
Yes, however you will need to use an additional plugin: capacity-scheduling, used together with a PriorityClass:
There is increasing demand to use Kubernetes to manage batch workloads (ML/DL). In those cases, one challenge is to improve cluster utilization while ensuring that each user has a reasonable amount of resources. The problem can be partially addressed by the Kubernetes ResourceQuota. The native Kubernetes ResourceQuota API can be used to specify the maximum overall resource allocation per namespace. The quota enforcement is done through an admission check. A quota consumer (e.g., a Pod) cannot be created if the aggregated resource allocation exceeds the quota limit. In other words, the overall resource usage is aggregated based on Pod's spec (i.e., cpu/mem requests) when it's created.
The Kubernetes quota design has the limitation: the quota resource usage is aggregated based on the resource configurations (e.g., Pod cpu/mem requests specified in the Pod spec). Although this mechanism can guarantee that the actual resource consumption will never exceed the ResourceQuota limit, it might lead to low resource utilization as some pods may have claimed the resources but failed to be scheduled. For instance, actual resource consumption may be much smaller than the limit.
Pods can be created at a specific priority. You can control a pod's consumption of system resources based on a pod's priority, by using the scopeSelector field in the quota spec.
A quota is matched and consumed only if scopeSelector in the quota spec selects the pod.
When quota is scoped for priority class using scopeSelector field, quota object is restricted to track only following resources:
pods
cpu
memory
ephemeral-storage
limits.cpu
limits.memory
limits.ephemeral-storage
requests.cpu
requests.memory
requests.ephemeral-storage
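As an illustration, a ResourceQuota scoped to a PriorityClass might look roughly like the sketch below; the quota name, the values and the PriorityClass name "high" are assumptions, not taken from the question:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota     # illustrative name
spec:
  hard:
    pods: "10"
    cpu: "20"
    memory: 40Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high"]          # assumes a PriorityClass named "high" exists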
This plugin also supports preemption (example for ElasticQuota):
Preemption happens when a pod is unschedulable, i.e., failed in PreFilter or Filter phases.
In particular for capacity scheduling, the failure reasons could be:
Prefilter Stage
sum(allocated res of pods in the same elasticquota) + pod.request > elasticquota.spec.max
sum(allocated res of pods in the same elasticquota) + pod.request > sum(elasticquota.spec.min)
So the preemption logic will attempt to make the pod schedulable, with a cost of preempting other running pods.
Examples of yaml files and usage can be found in the plugin description.
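As a rough sketch (the plugin's own examples are authoritative; names and values here are illustrative), an ElasticQuota combined with a PriorityClass might look like:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-prioritized       # illustrative name
  namespace: prioritized        # illustrative namespace
spec:
  min:
    cpu: "2000m"                # resources guaranteed to this quota
  max:
    cpu: "3900m"                # upper bound this quota may consume
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prioritized-apps        # illustrative name
value: 1000000
globalDefault: false
description: "Priority class for the prioritized application pods"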

Related

Kubernetes: Scheduling Pod without resource limits

Kubernetes: what happens when a pod has no resources limits / requests defined?
How much resources can a pod use in Kubernetes (GKE) when it has no (or only partial) resource limits/requests defined?
For example, I have a pod with only memory limits and memory requests, but it has no cpu specs.
Will the cpu available to this pod be:
0
as much as left on the node/namespace (total minus all other pod claims)
as much as possible regarding actual use by other pods on the node/namespace
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container can use all of the CPU resources available on the Node where the Pod is running. So in your case it will be the second option you listed in your question: as much as is left on the node/namespace.
Normally a Kubernetes cluster administrator defines a default limit for each namespace in the cluster. If the Container is running in a namespace that has a default CPU limit, the Container is automatically assigned that default limit.
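Such a namespace default is typically configured with a LimitRange; a minimal sketch (names and values are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults            # illustrative name
  namespace: my-namespace       # illustrative namespace
spec:
  limits:
  - type: Container
    default:                    # default CPU limit for containers that define none
      cpu: 500m
    defaultRequest:             # default CPU request for containers that define none
      cpu: 250m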
A ResourceQuota should be defined for each namespace; it comes in handy to prevent pods that don't have resource requests or limits from eating up all the resources. It means a pod cannot be scheduled until you specify its resource requirements for that particular namespace, which is the recommended best practice.
For more information you could refer to this section : https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#if-you-do-not-specify-a-cpu-limit

Kubernetes release requested cpu

We have a Java application distributed over multiple pods on Google Cloud Platform. We also set memory requests to give the pod a certain part of the memory available on the node for heap and non-heap space.
The application is very resource-intensive in terms of CPU while starting the pod, but it barely uses the CPU after the pod is ready (only 0.5% is used). If we use container resource "requests", the pod does not release these resources after startup has finished.
Does Kubernetes allow us to specify that a pod may use (nearly) all the CPU power available during startup and release those resources afterwards? Thanks to rolling updates, we can ensure that no two pods start at the same time.
Thanks for your help.
If you specify requests without a limit, the value will be used for scheduling the pod onto an appropriate node that satisfies the requested available CPU bandwidth. The kernel scheduler will assume that the requests match the actual resource consumption but will not prevent usage from exceeding them; the excess will be 'stolen' from other containers.
If you specify a limit as well, your container will get throttled if it tries to exceed that value. You can combine both to allow bursty CPU usage that exceeds the usual request without allocating everything on the node and slowing down other processes.
"Does Kubernetes allow to specify that a pod is allowed to use
(nearly) all the cpu power available during start and release those
resources after that?"
A key word here is "available". The answer is "yes" and it can be achieved by using Burstable QoS (Quality of Service) class. Configure CPU request to a value you expect the container will need after starting up, and either:
configure CPU limit higher than the CPU request, or
don't configure a CPU limit, in which case either the namespace's default CPU limit will apply if defined, or the container "...could use all of the CPU resources available on the Node where it is running".
If there isn't CPU available on the Node for bursting, the container won't get anything beyond the requested value, and as a result the starting of the application could be slower.
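A minimal sketch of such a Burstable configuration (values are illustrative, not taken from the question):
resources:
  requests:
    cpu: "250m"     # roughly the steady-state usage after start-up
  limits:
    cpu: "2000m"    # headroom for the start-up burst; alternatively, omit the limit entirely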
It is worth mentioning what the docs explain for Pods with multiple Containers:
The CPU request for a Pod is the sum of the CPU requests for all the
Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of
the CPU limits for all the Containers in the Pod.
If you are running Kubernetes v1.12+ and have access to configure the kubelet, the Node CPU Management Policies could be of interest.
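For example, the static CPU manager policy is enabled in the kubelet configuration; a minimal sketch (how the kubelet config is rolled out depends on your cluster setup):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static   # gives exclusive CPUs to Guaranteed pods with integer CPU requests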
One factor for scheduling pods onto nodes is resource availability, and the Kubernetes scheduler calculates used resources from the request value of each pod. If you do not assign any value to the request parameter, then the request for this deployment will be zero. The request parameter doesn't ensure that the pod will actually use that much CPU or RAM; you can get the current resource usage from "kubectl top pods" / "kubectl top nodes".
The request parameter reserves resources for a pod, whereas the limit puts a cap on the pod's resource usage.
You can get more information here: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/.
This will give you a rough idea of requests and limits.

What does the OutOfcpu error mean in Kubernetes?

I got OutOfcpu in Kubernetes on Google Cloud; what does it mean? My pods seem to be working now, however there were pods in this same revision which got OutOfcpu.
It means that the kube-scheduler can't find any node with available CPU to schedule your pods:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it’s feasible to
schedule the Pod. For example, the PodFitsResources filter checks
whether a candidate Node has enough available resource to meet a Pod’s
specific resource requests.
[...]
PodFitsResources: Checks if the
Node has free resources (eg, CPU and Memory) to meet the requirement
of the Pod.
Also, as per Assigning Pods to Nodes:
If the named node does not have the resources to accommodate the pod,
the pod will fail and its reason will indicate why, e.g. OutOfmemory
or OutOfcpu.
In addition to how-kube-scheduler-schedules-pods, I think this will be helpful to understand why the OutOfcpu error showed up.
When you create a Pod, the Kubernetes scheduler selects a node for the
Pod to run on. Each node has a maximum capacity for each of the
resource types: the amount of CPU and memory it can provide for Pods.
The scheduler ensures that, for each resource type, the sum of the
resource requests of the scheduled Containers is less than the
capacity of the node. Note that although actual memory or CPU resource
usage on nodes is very low, the scheduler still refuses to place a Pod
on a node if the capacity check fails. This protects against a
resource shortage on a node when resource usage later increases, for
example, during a daily peak in request rate.
Ref: how-pods-with-resource-requests-are-scheduled

What's the difference between Pod resources.limits and resources.requests in Kubernetes?

I've been reading the kubernetes documentation https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
But it's still not clear to me what the difference is between spec.containers[].resources.limits.cpu and spec.containers[].resources.requests.cpu, and what the impact on resource limitation is.
Can you please suggest some reads or books where this is explained in plain English?
Thanks in advance
When a Kubernetes pod is scheduled onto a particular node, the node must have enough resources for the pod to run. Kubernetes knows the resources of its nodes, but how does Kubernetes know beforehand how many resources a pod will take, so that it can schedule pods effectively onto nodes? That is what requests are for. When we specify a resource request, Kubernetes guarantees that the pod will get that amount of the resource.
On the other hand, a limit caps the resource usage of a pod. Kubernetes will not allow a pod to take more resources than its limit. For CPU, if a pod tries to use more than its limit, Kubernetes throttles the pod's CPU artificially; for memory, if a pod exceeds its limit it will be terminated. To keep it simple, the limit is always greater than or equal to the request.
This example will give you an idea about requests and limits. Suppose there is a pod where you have specified its memory request as 7GB and its memory limit as 10GB. There are three nodes in your cluster, where node1 has 2GB of memory, node2 has 8GB and node3 has 16GB. Your pod will never be scheduled on node1. It will be scheduled on either node2 or node3, depending on their available memory. But wherever it is scheduled, it will be terminated in any scenario where it exceeds 10GB of memory usage.
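Expressed as a container resources stanza, the example above would be roughly:
resources:
  requests:
    memory: 7Gi     # what the scheduler checks against the node's allocatable memory
  limits:
    memory: 10Gi    # exceeding this gets the container OOM-killed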
Memory is kind of trivial to understand. requests is guaranteed and limits is something that can not be exceeded. This also means that when you issue kubectl describe nodes | tail -10 for example, you could see a phrase like:
"Total limits may be over 100 percent, i.e., overcommitted".
This means that the total sum of requests.memory is <= 100% (otherwise pods could not be scheduled, and this is the meaning of guaranteed memory). At the same time, if you see a value that is higher than 100%, it means that the total sum of limits.memory can go above 100% (and this is the overcommitted part of the message). So when the scheduler tries to place a pod on a node, it only checks requests.memory to see whether the node has enough memory.
The CPU part is more complicated.
requests.cpu translates to CPU shares, and without looking at all the pods on the node, it might make little to no sense, to be honest. IMHO, the easiest way to understand this property is by looking at an example.
Suppose you have 100 cores available on a node, you deploy a single pod and set requests.cpu = 1000m. In such a case, your pod can use all 100 CPUs, both min and max.
You have the same machine (100 cores), but you deploy two pods with requests.cpu = 1000m. In such a case, your pods can use 50 cores each minimum, and 100 max.
Same node, 4 pods (requests.cpu = 1000m). Each pod can use 25 cpu min, and 100 max.
You get the picture: to get an overall view, it matters what all the pods on the node set for requests.cpu.
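Under the hood (on cgroup v1), requests.cpu is translated to cpu.shares at roughly 1024 shares per core; a sketch:
resources:
  requests:
    cpu: "1000m"    # -> cpu.shares ~ 1024; under contention CPU time is split in proportion to shares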
limits.cpu is a lot more interesting, and it translates to two properties on the cgroup: CPU period and CPU quota. It means how much time (quota) you can get in a certain timeframe (period). An example should make things simpler here as well.
Suppose period=100ms and quota=20ms and you get a request that will finish in 50ms on your pod.
This is what it will look like:
| 100ms || 100ms || 100ms |
| 20 ms ......|| 20 ms ......|| 10 ms ......|
Because it takes 50ms to process the request and we have only 20ms available in every 100ms period, it will take roughly 300ms in total (three periods) to process our request.
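For reference, the period/quota values in this example correspond roughly to a CPU limit of 200m (0.2 cores x 100ms period = 20ms quota); a sketch:
resources:
  limits:
    cpu: "200m"     # -> cpu.cfs_period_us = 100000 (100ms), cpu.cfs_quota_us = 20000 (20ms)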
That being said, quite a lot of people recommend not setting a CPU limit at all: Google engineers, Zalando, Monzo, etc., including us. We do not set CPU limits, and there are strong reasons for that (which go beyond this question).
In short:
for CPU & memory requests: k8s guarantees that you will get what you declared when the scheduler schedules your pods.
for CPU & memory limits: k8s guarantees that you cannot exceed the value you set.
The results when your pod exceeds the limits:
for CPU: k8s throttles your container
for memory: OOM, k8s kills your pod
Concept
Containers specify a request, which is the amount of that resource that the system will guarantee to the container
Containers specify a limit which is the maximum amount that the system will allow the container to use.
Best practices for CPU limits and requests on Kubernetes
Use CPU requests for everything and make sure they are accurate
Do NOT use CPU limits.
Best practices for Memory limits and requests on Kubernetes
Use memory limits and memory requests
Set memory limit = memory request
For more details on limits and request setting, please refer to this answer
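A resources stanza following those recommendations might look roughly like this (values are illustrative):
resources:
  requests:
    cpu: "500m"     # accurate CPU request, no CPU limit
    memory: 1Gi
  limits:
    memory: 1Gi     # memory limit equal to the memory request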
Details
Containers can specify a resource request and limit, 0 <= request <= Node Allocatable & request <= limit <= Infinity
If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on requests and not limits
The pod and its containers will not be allowed to exceed the specified limit. How the request and limit are enforced depends on whether the resource is compressible or incompressible.
Compressible Resource Guarantees
Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
Incompressible Resource Guarantees
Pods will get the amount of memory they request, if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
When Pods use more memory than their limit, a process that is using the most amount of memory, inside one of the pod's containers, will be killed by the kernel.
Purpose
Kubernetes provides different levels of Quality of Service to pods depending on what they request. Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority. The relationship between "Requests and Limits" and "QoS Classes" is subtle.
If limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed.
If requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable. When limits are not specified, they default to the node capacity.
If requests and limits are not set for all of the resources, across all containers, then the pod is classified as Best-Effort.
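As an illustration of the first rule, a pod like the following sketch (name, images and values are made up) would be classified as Guaranteed, which is visible in its status.qosClass field:
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example          # illustrative name
spec:
  containers:
  - name: app
    image: nginx                    # illustrative image
    resources:
      requests: {cpu: "500m", memory: 256Mi}
      limits: {cpu: "500m", memory: 256Mi}
  - name: sidecar
    image: busybox                  # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests: {cpu: "100m", memory: 64Mi}
      limits: {cpu: "100m", memory: 64Mi}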
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
Source: Resource Quality of Service in Kubernetes

How to handle CPU contention for burstable k8s pods?

The use case I'm trying to get my head around takes place when you have various burstable pods scheduled on the same node. How can you ensure that the workload in a specific pod takes priority over another pod when the node's kernel is scheduling CPU and the CPU is fully burdened? In a typical Linux host my thoughts on contention between processes immediately go to the 'niceness' of the processes, however I don't see any equivalent k8s mechanism allowing for specification of CPU scheduling priority between the processes within pods on a node.
I've read about the newest capabilities provided by k8s which (if I interpret the documentation correctly) just provide a mechanism for pinning CPUs to pods, which doesn't really scratch my itch. I'd still like to maximize CPU utilization by the "second class" pods if the higher priority pods don't have an active workload, while allowing the higher priority workload to have CPU scheduling priority should the need arise.
So far, having not found a satisfactory answer I'm thinking that the community will opt for an architectural solution, like auto-scaling or segregating the workloads between nodes. I don't consider these to be truly addressing the issue, but really just throwing more CPUs at it which is what I'd like to avoid. Why spin up more nodes when you've got idle CPU?
Let me first explain how CPU allocation and utilization happen in k8s (memory is a bit different).
You define the CPU requirement as below, where CPU is expressed in thousandths of a core (millicores):
resources:
  requests:
    cpu: 50m
  limits:
    cpu: 100m
In the above example, we ask for a minimum of 5% and a maximum of 10% of one CPU core.
Requests are used by Kubernetes to schedule the pod: only if a node has more than 5% of a CPU free is the pod scheduled on that node.
The requests are translated to cpu.shares in cgroups, while the limits are passed to docker (or any other runtime) as a CFS quota, which caps CPU usage.
So if you request 5% of a CPU and use only 1%, the remainder is not locked to this pod and other pods can use the free CPU; this ensures that every pod gets the CPU it needs and keeps the node's CPU utilization high.
If you set a limit of 10% and then try to use more than that, Linux will throttle the CPU usage, but it won't kill the pod.
So, coming to your question: you can set higher limits for your burstable pods, and as long as all pods are not bursting CPU at the same time you are OK. If they do burst at the same time, the available CPU is divided between them (in proportion to their CPU requests).
You can use pod affinity and anti-affinity to schedule all burstable pods onto different nodes.
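A sketch of such an anti-affinity rule (a fragment of the pod spec), assuming the burstable pods carry an illustrative label tier=burstable:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          tier: burstable           # illustrative label
      topologyKey: kubernetes.io/hostname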
The CPU request correlates to cgroup CPU priority. Basically if Pod A has a request of 100m CPU and Pod B has 200m, even in a starvation situation B will get twice as many run seconds as A.
As already mentioned, resource management in Pods is declared with requests and limits.
There are 3 QoS Classes in Kubernetes based on requests and limits configuration:
Guaranteed (limits == requests)
Burstable (limits > requests)
Best Effort (limits and requests are unspecified)
Both Burstable and Best Effort might be considered "burstable" in the sense that they may consume more resources than requested.
The closest fit for your case might be using the Burstable class for higher-priority Pods and Best Effort for all others.