Can kubernetes resource request/limit definition slow down pods? - kubernetes

I am running 3 deployments of the same app on a kubernetes cluster. I recently got around to setting resource requests and limits for one of the deployments.
resources:
  limits:
    cpu: 350m
    memory: 225Mi
  requests:
    cpu: 250m
    memory: 150Mi
After setting these, the affected pods have a much higher computation time than the two unchanged deployments, which does not make sense given the Kubernetes documentation as I understand it.
Running kubectl top pods allows me to confirm my pods are running at or below requested resources. When visualizing computation time (Prometheus+Grafana), it is clear one of the deployments is significantly slower:
Two deployments at ~ 60ms and one at ~ 120ms
As this is the only change I have made, I don't understand why there should be any performance degradation. Am I missing something?
Edit
Removing the CPU limit but keeping the request brings pod performance back to what it is supposed to be. Keep in mind that these pods are running at the CPU request level (around 250mCPU), which is 100mCPU below the limit.
Additional information: these pods are running a NodeJS app.

Reading this link, I understand that if a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on the requests field in the YAML, not the limits field, but the pod and its containers will not be allowed to exceed the specified limit.
Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
Refer to the link for the complete read:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md#compressible-resource-guarantees

Kubernetes CPU limits might not work as one would assume. I suggest watching this presentation starting at 13:38.
A solution to the negative effect of CPU limits in k8s might be to set a different CFS quota period. By default it is 100ms; a better value might be 5ms. There is also an issue about this.
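If you manage your own nodes, the CFS quota period can be tuned through the kubelet configuration file. A minimal sketch, assuming a recent Kubernetes version (the `cpuCFSQuotaPeriod` field is gated behind the `CustomCPUCFSQuotaPeriod` feature gate, and the file path varies per distribution):

```yaml
# e.g. /var/lib/kubelet/config.yaml (path is distribution-dependent)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CustomCPUCFSQuotaPeriod: true
cpuCFSQuotaPeriod: 5ms   # default is 100ms
```

After changing this the kubelet has to be restarted; verify the behavior under load before rolling it out cluster-wide.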

The pods which do not have any requests and limits can use the node's resources without any restriction, so they can be faster.
The pods which have limits will be throttled at the limit, so they can be slower.
Please check the resource consumption metrics of both deployments.

Related

How to use CPU effectively with a large number of inactive pods in Kubernetes?

I have many services. In a day, a few services are busy for about ten hours, while most other services are idle or use a small amount of CPU.
In the past, I put all services in a virtual machine with two CPUs and scaled by CPU usage; there are two virtual machines at the busiest time, but most of the time there is only one.
| services | instances | busy time in a day | cpu when busy (core/service) | cpu when idle (core/service) |
|---|---|---|---|---|
| busy services | 2 | 8~12 hours | 0.5~1 | 0.1~0.5 |
| busy services | 2 | 8~12 hours | 0.3~0.8 | 0.1~0.3 |
| inactive services | 30 | 0~1 hours | 0.1~0.3 | < 0.1 |
Now I want to put them in Kubernetes. Each node has two CPUs, and I use node autoscaling and HPA. To make node autoscaling work, I must set a CPU request for all services, which is exactly the difficulty I encountered.
This is my setting.
| services | instances | busy time | requests cpu (cpu/service) | total requests cpu |
|---|---|---|---|---|
| busy services | 2 | 8~12 hours | 300m | 600m |
| busy services | 2 | 8~12 hours | 300m | 600m |
| inactive services | 30 | 0~1 hours | 100m | 3000m |
Note: The inactive services' CPU request is set to 100m because they do not work well with less than 100m when they are busy.
With this setting, the number of nodes will always be greater than three, which is too costly. I think the problem is that although these services require 100m of CPU to work properly, they are mostly idle.
I really hope that all services can autoscale; I think this is the benefit of Kubernetes, which can help me assign pods more flexibly. Is my idea wrong? Shouldn't I set a CPU request for an inactive service?
Even if I ignore inactive services, I find that Kubernetes often has more than two nodes. If I have more active services, even in off-peak hours the total requested CPU will exceed 2000m. Is there any solution?
I put all services in a virtual machine with two CPUs and scale by CPU usage; there are two virtual machines at the busiest time, but most of the time there is only one.
First, if you have any availability requirements, I would recommend always having at least two nodes. If you have only one node and it crashes (e.g. hardware failure or kernel panic), it will take some minutes before this is detected and some more minutes before a new node is up.
The inactive service requests cpu is set to 100m because it will not work well if it is less than 100m when it is busy.
I think the problem is that although these services require 100m of cpu to work properly, they are mostly idle.
The CPU request is a guaranteed reserved resource amount. Here you reserve too many resources for your almost-idle services. Set the CPU request lower, maybe as low as 20m or even 5m. But since these services will need more resources during busy periods, set a higher limit so that the container can "burst", and also use the Horizontal Pod Autoscaler for them. When using the Horizontal Pod Autoscaler, more replicas will be created and the traffic will be load balanced across all replicas. Also see Managing Resources for Containers.
This is also true for your "busy services", reserve less CPU resources and use Horizontal Pod Autoscaling more actively so that the traffic is spread to more nodes during high load, but can scale down and save cost when the traffic is low.
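This advice could look like the following sketch: a small CPU request, a higher limit for bursting, and an HPA on CPU utilisation (the service name, image, and numbers are illustrative, not from the question):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mostly-idle-service   # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: mostly-idle-service}
  template:
    metadata:
      labels: {app: mostly-idle-service}
    spec:
      containers:
      - name: app
        image: example/app:latest   # placeholder image
        resources:
          requests:
            cpu: 20m          # small reservation while mostly idle
          limits:
            cpu: 500m         # room to burst during busy periods
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mostly-idle-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mostly-idle-service
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```

Note that with such a low request the HPA threshold is also low in absolute terms, so tune `averageUtilization` together with the request.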
I really hope that all services can autoscaling, I think this is the benefit of kubernetes, which can help me assign pods more flexibly. Is my idea wrong?
Yes, I agree with you.
Shouldn't I set a request cpu for an inactive service?
It is a good practice to always set some value for request and limit, at least for a production environment. The scheduling and autoscaling will not work well without resource requests.
If I have more active services, even in off-peak hours, the requests cpu will exceed 2000m. Is there any solution?
In general, try to use lower resource requests and use Horizontal Pod Autoscaling more actively. This is true for both your "busy services" and your "inactive services".
I find that kubernetes more often has more than two nodes.
Yes, there are two aspects of this.
If you only use two nodes, your environment probably is small, and the Kubernetes control plane probably consists of more nodes and is the majority of the cost. For very small environments, Kubernetes may be expensive and it would be more attractive to use e.g. a serverless alternative like Google Cloud Run.
Second, for availability. It is good to have at least two nodes in case of an abrupt crash e.g. hardware failure or a kernel panic, so that your "service" is still available meanwhile the node autoscaler scales up a new node. This is also true for the number of replicas for a Deployment, if availability is important, use at least two replicas. When you e.g. drain a node for maintenance or node upgrade, the pods will be evicted - but not created on a different node first. The control plane will detect that the Deployment (technically ReplicaSet) has less than the desired number of replicas and create a new pod. But when a new Pod is created on a new node, the container image will first be pulled before the Pod is running. To avoid downtime during these events, use at least two replicas for your Deployment and Pod Topology Spread Constraints to make sure that those two replicas run on different nodes.
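A hedged sketch of such a two-replica Deployment with a Pod Topology Spread Constraint spreading the replicas across nodes (name, labels, and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service   # hypothetical name
spec:
  replicas: 2        # at least two, for availability
  selector:
    matchLabels: {app: my-service}
  template:
    metadata:
      labels: {app: my-service}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname   # spread across nodes
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: {app: my-service}
      containers:
      - name: app
        image: example/app:latest   # placeholder image
```

With `whenUnsatisfiable: DoNotSchedule` the second replica stays Pending until a second node exists, which is usually what you want when availability is the goal.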
Note: You might run into the same problem as How to use K8S HPA and autoscaler when Pods normally need low CPU but periodically scale and that should be mitigated by an upcoming Kubernetes feature: KEP - Trimaran: Real Load Aware Scheduling

Kubernetes resource request relative to node resources

The HPA needs the pod to have a resource request defined in order to calculate metrics and know when to schedule new pods.
Is there a way to specify that the CPU request of a pod is the 75% of the node resources? Like if the machine has 4 cores, it should use cpu: 3, and if the machine has 8 cores, it should use cpu: 6.
I would like to be able to scale up the node fleet without having to change the kubernetes definitions.
I'm placing this answer for better visibility for the community. As already stated in the comments, this is not possible. A fixed percentage request might be problematic since the absolute amount of resources it represents depends on the node the pod is scheduled to. Nodes in a pool can differ from each other, so you might end up in a situation where one pod requests 75% CPU of node-1, which has 16 CPUs, while in another situation the same pod requests 75% of node-2, which has only 4 CPUs. This makes the value very inconsistent across the node pool and not very specific. For that very reason Kubernetes uses millicores to request resources.
This was well described in the comments above:
"Somewhere 75% is X, somewhere 75% is Y" – [Konstantin Vustin]
You can read more about resources here.

Kubernetes: CPU Resource allocation for POD

I am trying to assign CPU resources for services running in Kubernetes pods. The services are mostly Node.js-based REST endpoints with some DB operations.
During the load test, I tried different combinations between 100m and 1000m for the pods.
For the expected number of requests per second, when the value is less than 500m, more pods are spawned by the HPA than when the value is greater than 500m. I am using a CPU-based trigger for the HPA.
I couldn't figure out what I should base my choice of a particular CPU resource value on. Can someone help me in this regard?
Two points:
If you configured the HPA to autoscale based on CPU utilisation, it makes sense that there are more replicas if the CPU request is 500m than if it's 1000m. This is because the target utilisation that you define for the HPA is relative to the CPU request of the Pods.
For example, if your target utilisation is 80% and the CPU request of the Pods is 500m, then the HPA scales your app up if the actual CPU usage exceeds 400m. On the other hand, if the CPU requests are 1000m, then the HPA only scales up if the CPU usage exceeds 800m.
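As a sketch, the 80% target from the example above would be expressed like this in an `autoscaling/v2` HPA (the Deployment name and replica bounds are illustrative); with Pods requesting 500m, scaling kicks in once average usage exceeds 400m:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # 80% of a 500m request = 400m actual usage
```

Halving the CPU request while keeping `averageUtilization` at 80 halves the absolute scale-up threshold, which is why lower requests produce more replicas under the same load.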
Selecting resource requests (e.g. CPU) for a container is very important, but it's also an art in itself. The CPU request is the minimum amount of CPU that your container needs to run reliably. What you could do to find out this value is run your app locally and evaluate how much CPU it actually uses, for example with ps or top.

What's the difference between Pod resources.limits and resources.requests in Kubernetes?

I've been reading the kubernetes documentation https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
But it's still not clear to me what the difference is between spec.containers[].resources.limits.cpu and spec.containers[].resources.requests.cpu, and what the impact on resource limitation is.
Can you please suggest some reads or books where this is explained in plain English?
Thanks in advance
When a Kubernetes pod is scheduled on a particular node, the node is required to have enough resources for the pod to run. Kubernetes knows the resources of its nodes, but how does Kubernetes know beforehand how much resources a pod will take, so it can schedule pods effectively across nodes? That is what requests are for. When we specify a resource request, Kubernetes guarantees that the pod will get that amount of the resource.
On the other hand, a limit caps the resource usage of a pod. Kubernetes will not allow a pod to take more resources than its limit. For CPU, if a pod tries to use more than its limit, Kubernetes throttles the pod's CPU artificially. If a pod exceeds its memory limit, it will be terminated. To keep it simple: the limit is always greater than or equal to the request.
This example will give you an idea about requests and limits. Think of a pod where you have specified its memory request as 7GB and memory limit as 10GB. There are three nodes in your cluster: node1 has 2GB of memory, node2 has 8GB and node3 has 16GB. Your pod will never be scheduled on node1, but it can be scheduled on either node2 or node3, depending on their current memory usage. If in any scenario the pod exceeds the memory usage of 10GB, it will be terminated.
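The example above as a manifest fragment (the memory values come straight from the example; the pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:latest   # placeholder image
    resources:
      requests:
        memory: 7Gi    # must fit on the node, so node1 (2GB) is excluded
      limits:
        memory: 10Gi   # exceeding this gets the container OOM-killed
```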
Memory is kind of trivial to understand: requests is guaranteed and limits is something that cannot be exceeded. This also means that when you issue kubectl describe nodes | tail -10 for example, you could see a phrase like:
"Total limits may be over 100 percent, i.e., overcommitted".
This means that the total sum of requests.memory is <= 100% (otherwise pods could not be scheduled and this is the meaning of guaranteed memory). At the same time if you see a value that is higher then 100%, it means that the total sum of limits.memory can go above 100% (and this is the overcommitted part in the message). So when a node tries to schedule a pod, it will only check requests.memory to see if it has enough memory.
The CPU part is more complicated.
requests.cpu translates to CPU shares, and without looking at all pods on the node it might make little to no sense, to be honest. IMHO, the easiest way to understand this property is by looking at an example.
Suppose you have 100 cores available on a node, you deploy a single pod and set requests.cpu = 1000m. In such a case, your pod can use 100 CPUs, both min and max.
You have the same machine (100 cores), but you deploy two pods with requests.cpu = 1000m. In such a case, your pods can use 50 cores each minimum, and 100 max.
Same node, 4 pods (requests.cpu = 1000m). Each pod can use 25 cpu min, and 100 max.
You get the picture, it matters what all pods set for requests.cpu to get an overall picture.
limits.cpu is a lot more interesting, and it translates to two properties on the cgroup: CPU period and CPU quota. It means how much time (quota) you can get in a certain timeframe (period). An example should make things simpler here as well.
Suppose period=100ms and quota=20ms and you get a request that will finish in 50ms on your pod.
This is how it will look like:
| 100ms || 100ms || 100ms |
| 20 ms ......|| 20 ms ......|| 10 ms ......|
Because it takes 50ms to process a request, and we have only 20ms available for every 100ms, it will take 300ms in total, to process our request.
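For reference, the quota=20ms / period=100ms pair in this example corresponds to a CPU limit of 200m, assuming the default 100ms CFS period (20ms out of every 100ms is one fifth of a core):

```yaml
resources:
  limits:
    cpu: 200m   # -> CFS quota of 20ms per 100ms period
```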
That being said, quite a lot of people recommend not setting the CPU limit at all: Google engineers, Zalando, Monzo, etc., including us. We do not set it, and there are strong reasons for that (which go beyond this question).
In short:
For CPU & memory requests: k8s guarantees that you will get what you declared when the scheduler schedules your pods.
For CPU & memory limits: k8s guarantees that you cannot exceed the value you set.
The results when your pod exceeds the limits:
for CPU: k8s throttles your container
for memory: OOM, k8s kills your pod
Concept
Containers specify a request, which is the amount of that resource that the system will guarantee to the container.
Containers specify a limit, which is the maximum amount that the system will allow the container to use.
Best practices for CPU limits and requests on Kubernetes
Use CPU requests for everything and make sure they are accurate
Do NOT use CPU limits.
Best practices for Memory limits and requests on Kubernetes
Use memory limits and memory requests
Set memory limit = memory request
For more details on limits and request setting, please refer to this answer
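Following those best practices, a resources block could look like this sketch (the numbers are illustrative): an accurate CPU request with no CPU limit, and memory limit equal to memory request:

```yaml
resources:
  requests:
    cpu: 250m        # accurate CPU request; deliberately no CPU limit
    memory: 256Mi
  limits:
    memory: 256Mi    # memory limit == memory request
```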
Details
Containers can specify a resource request and limit, 0 <= request <= Node Allocatable & request <= limit <= Infinity
If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on requests and not limits
The pods and its containers will not be allowed to exceed the specified limit. How the request and limit are enforced depends on whether the resource is compressible or incompressible
Compressible Resource Guarantees
Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
Incompressible Resource Guarantees
Pods will get the amount of memory they request, if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
When Pods use more memory than their limit, a process that is using the most amount of memory, inside one of the pod's containers, will be killed by the kernel.
Purpose
Kubernetes provides different levels of Quality of Service to pods depending on what they request. Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority. The relationship between "Requests and Limits" and "QoS Classes" is subtle.
If limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed.
If requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable. When limits are not specified, they default to the node capacity.
If requests and limits are not set for all of the resources, across all containers, then the pod is classified as Best-Effort.
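To illustrate the classification rules above, a sketch of a Pod that would be classified as Guaranteed (requests equal limits for every resource in every container; the names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:latest   # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m       # equal to the request
        memory: 256Mi   # equal to the request
```

Dropping the limits while keeping the requests would make this Pod Burstable; dropping both requests and limits would make it Best-Effort.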
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
Source: Resource Quality of Service in Kubernetes

How to handle CPU contention for burstable k8s pods?

The use case I'm trying to get my head around takes place when you have various burstable pods scheduled on the same node. How can you ensure that the workload in a specific pod takes priority over another pod when the node's kernel is scheduling CPU and the CPU is fully burdened? In a typical Linux host my thoughts on contention between processes immediately goes to 'niceness' of the processes, however I don't see any equivalent k8s mechanism allowing for specification of CPU scheduling priority between the processes within pods on a node.
I've read of the newest capabilities provided by k8s which (if I interpret the documentation correctly) is just providing a mechanism for CPU pinning to pods which doesn't really scratch my itch. I'd still like to maximize CPU utilization by the "second class" pods if the higher priority pods don't have an active workload while allowing the higher priority workload to have CPU scheduling priority should the need arise.
So far, having not found a satisfactory answer I'm thinking that the community will opt for an architectural solution, like auto-scaling or segregating the workloads between nodes. I don't consider these to be truly addressing the issue, but really just throwing more CPUs at it which is what I'd like to avoid. Why spin up more nodes when you've got idle CPU?
Let me first explain how CPU allocation and utilization happen in k8s (memory is a bit different).
You define the CPU requirement as below, where CPU is expressed in millicores (thousandths of a core).
resources:
  requests:
    cpu: 50m
  limits:
    cpu: 100m
In the above example, we ask for a minimum of 5% and a maximum of 10% of a CPU core.
Requests are used by Kubernetes to schedule the pod. A pod is scheduled on a node only if the node has more than 5% of a CPU core free.
The requests are passed to Docker (or any other runtime), which configures cpu.shares in cgroups; the limits are enforced through the CFS quota.
So if you request 5% of a CPU and use only 1%, the remainder is not locked to this pod; other pods can use the free CPU. This ensures that every pod gets the CPU it needs while keeping the node's CPU utilization high.
If you set a limit of 10% and then try to use more than that, Linux will throttle the CPU usage, but it won't kill the pod.
So, coming to your question: you can set higher limits for your burstable pods, and unless all pods burst their CPU at the same time, you are OK. If they do burst at the same time, they will share the available CPU.
You can use pod affinity and anti-affinity to schedule all burstable pods on different nodes.
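A hedged sketch of such an anti-affinity rule, assuming the burstable pods carry a `tier: burstable` label (the label, pod name, and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod   # hypothetical name
  labels:
    tier: burstable
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            tier: burstable   # keep burstable pods on different nodes
  containers:
  - name: app
    image: example/app:latest   # placeholder image
```

With the `required...` variant a second burstable pod stays Pending if no other node is available; use `preferredDuringSchedulingIgnoredDuringExecution` for a soft spread instead.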
The CPU request correlates to cgroup CPU priority (cpu.shares). Basically, if Pod A has a request of 100m CPU and Pod B has 200m, then even in a starvation situation B will get twice as many run-seconds as A.
As already mentioned, resource management for Pods is declared with requests and limits.
There are 3 QoS Classes in Kubernetes, based on the requests and limits configuration:
1. Guaranteed (limits == requests)
2. Burstable (limits > requests)
3. Best Effort (limits and requests are unspecified)
Both 2) and 3) (Burstable and Best Effort) might be considered "burstable" in the sense that such a pod may consume more resources than requested.
The closest fit for your case might be using the Burstable class for higher-priority Pods and Best Effort for all others.