Pod CPU throttling in Kubernetes

I'm experiencing a strange issue when using CPU Requests/Limits in Kubernetes. Before setting any CPU Requests/Limits at all, all my services performed very well. I recently added some Resource Quotas to avoid future resource starvation, with values based on the actual usage of those services. But to my surprise, after those were added, some services started to increase their response time drastically. My first guess was that I might have set the wrong Requests/Limits, but looking at the metrics revealed that none of the services facing this issue were even near those values. In fact, some of them were closer to the Requests than to the Limits.
Then I started looking at CPU throttling metrics and found that all my pods are being throttled. I then increased the limits for one of the services to 1000m (from 250m) and I saw less throttling in that pod, but I don't understand why I should set that higher limit if the pod wasn't reaching its old limit (250m).
So my question is: if I'm not reaching the CPU limits, why are my pods being throttled? Why is my response time increasing if the pods are not using their full capacity?
Here are some screenshots of my metrics (CPU Request: 50m, CPU Limit: 250m):
CPU Usage (here we can see the CPU of this pod never reached its limit of 250m):
CPU Throttling:
After setting the limit of this pod to 1000m, we can observe less throttling:
kubectl top
P.S.: Before setting these Requests/Limits there was no throttling at all (as expected).
P.S 2: None of my nodes are facing high usage. In fact, none of them are using more than 50% of CPU at any time.
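For reference, the resources block on the affected pod presumably looked something like this (reconstructed from the values above; the actual manifest isn't shown here):

    resources:
      requests:
        cpu: 50m
      limits:
        cpu: 250m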
Thanks in advance!

If you look at the documentation, you'll see that when you set a CPU Request, Kubernetes passes it to Docker as the --cpu-shares option, which in turn sets the cpu.shares attribute of the cpu,cpuacct cgroup on Linux. A value of 50m translates to roughly --cpu-shares=51, since one full CPU (1000m) corresponds to 1024 shares; 51 shares is therefore about 5% of one CPU's worth. That's pretty low to begin with, but the important factor here is that shares are relative to how many pods/containers you have on the node and to the shares those have (are they using the default?).
So let's say that on your node you have another pod/container with 1024 shares (the default) alongside this pod/container with its 51 shares. Under CPU contention, this container will get only about 5% of the CPU time, while the other pod/container will get about 95% of the CPU (if it has no limits). So again, it all depends on how many pods/containers you have on the node and what their shares are.
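As a rough sketch of that arithmetic (assuming the cgroup v1 cpu.shares mechanics described above, and only these two workloads competing on the node):

    # pod A - the pod from the question
    resources:
      requests:
        cpu: 50m      # -> cpu.shares of about 51 (50/1000 of 1024)

    # pod B - a neighbour with no request set at all
    # (the Docker/cgroup default is cpu.shares = 1024)
    #
    # under CPU contention the split is proportional to shares:
    #   pod A:   51 / (51 + 1024)  ->  roughly  5% of the CPU
    #   pod B: 1024 / (51 + 1024)  ->  roughly 95% of the CPU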
Also, and this is not very well documented in the Kubernetes docs, if you set a Limit on a pod it basically uses two flags in Docker, --cpu-period and --cpu-quota, which set the cpu.cfs_period_us and cpu.cfs_quota_us attributes of the cpu,cpuacct cgroup on Linux. This was introduced because cpu.shares alone doesn't provide an upper bound, so you'd get cases where containers would grab most of the CPU.
So, as far as this limit is concerned, you may never hit it if you have other containers on the same node that don't have limits (or have higher limits) but have a higher cpu.shares, because they will end up taking the idle CPU first. This could be what you are seeing, but again it depends on your specific case.
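And a sketch of how the limit from the question maps onto those CFS attributes (assuming Kubernetes' default 100ms CFS period):

    resources:
      limits:
        cpu: 250m
    # with a 100ms period this becomes roughly:
    #   cpu.cfs_period_us = 100000   # 100ms accounting window
    #   cpu.cfs_quota_us  =  25000   # 25ms of CPU time allowed per window
    # once the 25ms are used up within a window, the container is throttled
    # until the next window starts, no matter how idle the node is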
A longer explanation for all of the above here.

Kubernetes uses CFS (Completely Fair Scheduler) quota to enforce CPU limits on pod containers. See "How does the CPU Manager work" described in https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/ for further details.
The CFS is a Linux feature, added with the 2.6.23 kernel, and the quota mechanism is based on two parameters: cpu.cfs_period_us and cpu.cfs_quota_us.
To visualize these two parameters, I'd like to borrow the following picture from Daniele Polencic from his excellent blog (https://twitter.com/danielepolencic/status/1267745860256841731):
If you configure a CPU limit in K8s, it sets the period and the quota. If a process running in a container uses up its quota within a period, it is preempted and has to wait for the next period. It is throttled.
So this is the effect you are experiencing. The period-and-quota algorithm should not be thought of as a ceiling below which processes run unthrottled; throttling can happen even when the limit is never reached on average.
The behavior is confusing, and there is also a K8s issue for this: https://github.com/kubernetes/kubernetes/issues/67577
The recommendation given in https://github.com/kubernetes/kubernetes/issues/51135 is to not set CPU limits for pods that shouldn't be throttled.
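In practice that recommendation just means dropping the limits stanza and keeping the request, e.g. (values illustrative):

    resources:
      requests:
        cpu: 250m
      # no cpu limit: the container can use idle CPU on the node, and the
      # request (via cpu.shares) still guarantees its fair share under contention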

TL;DR: remove your CPU limits. (Unless this alert fires on metrics-server, in which case that won't work.) CPU limits are actually a bad practice, not a best practice.
Why this happens
I'll focus on what to do, but first let me give a quick example showing why this happens:
Imagine a pod with a CPU limit of 100m which is equivalent to 1/10 vCPU.
The pod does nothing for 10 minutes.
Then it uses the CPU nonstop for 200ms. The usage during the burst is equivalent to 2/10 vCPU, hence the pod is over its limit and will be throttled.
On the other hand, the average CPU usage will be incredibly low.
In a case like this you'll be throttled, but the burst is so small (200 milliseconds) that it won't show up in any graphs (the sketch below puts numbers on it).
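A sketch with the default 100ms CFS period, using the numbers from the example above:

    resources:
      limits:
        cpu: 100m   # -> a quota of 10ms of CPU time per 100ms period
    # the 200ms burst needs about 200ms of CPU time, but only 10ms is allowed
    # per period, so the work gets smeared over ~20 throttled periods (~2s of
    # extra wall-clock latency), while the usage averaged over the 10 idle
    # minutes still looks like almost nothing on a graph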
What to do
You actually don't want CPU limits in most cases because they prevent pods from using spare resources. There are Kubernetes maintainers on the record saying you shouldn't use CPU limits and should only set requests.
More info
I wrote a whole wiki page on why CPU throttling can occur despite low CPU usage and what to do about it. I also go into some common edge cases like how to deal with this for metrics-server which doesn't follow the usual rules.

Related

CPU limits (cores) more than 100% on nodes

I have just noticed on my kubernetes dashboard this:
CPU requests (cores): 0.66 (16.50%)
CPU limits (cores): 4.7 (117.50%)
I am quite confused as to why the limit is at 117.50%. Is one of my services using too much? But wouldn't that show up in the requests? Looking at kubectl describe node, I don't see any service using more than 2% (there are 43 of them, which would be 86% at most).
Thank you.
My approximate understanding is that Kubernetes lets you overcommit, that is, have resource limits on a particular node that add up to more than the node's capacity, to let you be a little more efficient with your resource use.
For instance, suppose you're running deployments A and B, both of which require only 100 MB of memory (200 MB total) when they're idle, but require 1 GB of memory when they're actively processing a request. You could set things up to have each one of them run on a node with 1 GB of memory available. You could also put them on a single node with 1.5 GB of memory, assuming that A and B won't have to process traffic simultaneously, thereby saving yourself from a huge resource allocation.
This might be especially reasonable if you're using lots of microservices: you might even know that B can't process data until A has completed a request anyway, providing you a stronger guarantee things won't overlap and cause problems.
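A sketch of what A and B could declare in that setup (values taken from the example above, expressed as Kubernetes quantities; the 1.5Gi node size is the assumption from the scenario):

    # both deployment A and deployment B declare:
    resources:
      requests:
        memory: 100Mi   # idle footprint - this is what the scheduler counts
      limits:
        memory: 1Gi     # peak while actively processing a request
    # the scheduler only adds up the requests (2 x 100Mi = 200Mi), so both pods
    # fit on a node with ~1.5Gi allocatable even though the summed limits (2Gi)
    # exceed it - that is the overcommitment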
Whether Kubernetes overcommits resources or not depends on the quality of service (QoS) class that your requests and limits give the deployment. For instance, you won't get overcommitment with the Guaranteed QoS class, but you may see overcommitting with the BestEffort class, which is what you get when you don't specify any resources at all.
You can read more about QoS classes in the Kubernetes documentation.
Limits (unlike requests) are allowed to overcommit the resources of the node. Requests cannot, so they should never add up to more than 100% of what's available. Basically, the idea is that a "request" is a minimum requirement, while a "limit" is a maximum burst range, and it's not very likely that everyone will burst at once. If that is likely for you, you should set your requests and limits to the same value.
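A minimal sketch of that last point, with illustrative values; setting requests equal to limits removes the burst range entirely (and, as a side effect, gives the pod the Guaranteed QoS class):

    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m       # equal to the request: no bursting, no overcommit
        memory: 512Mi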

Request vs limit cpu in kubernetes/openshift

I have a dilemma about choosing the right request and limit settings for a pod in OpenShift. Some data:
during start up, the application requires at least 600 millicores to be able to fulfill the readiness check within 150 seconds.
after start up, 200 millicores should be sufficient for the application to stay in idle state.
So my understanding from documentation:
CPU Requests
Each container in a pod can specify the amount of CPU it requests on a node. The scheduler uses CPU requests to find a node with an appropriate fit for a container.
The CPU request represents a minimum amount of CPU that your container may consume, but if there is no contention for CPU, it can use all available CPU on the node. If there is CPU contention on the node, CPU requests provide a relative weight across all containers on the system for how much CPU time the container may use.
On the node, CPU requests map to Kernel CFS shares to enforce this behavior.
I note that the scheduler refers to the requested CPU to perform allocation on the node, and that it is then a guaranteed resource once allocated.
On the other hand, I might be allocating extra CPU, since the 600 millicores might only be required during startup.
So should I go for

    resources:
      limits:
        cpu: 1
      requests:
        cpu: 600m

for guaranteed resources, or

    resources:
      limits:
        cpu: 1
      requests:
        cpu: 200m

for better CPU saving?
I think you haven't quite got the idea of Requests vs Limits; I would recommend you take a look at the docs before you make that decision.
As a brief explanation:
Request is how much of the resource is virtually allocated to the container. It is a guarantee that you can use it when you need it, but it does not mean it is reserved exclusively for the container. With that said, if you request 200MB of RAM but only use 100MB, the other 100MB will be "borrowed" by other containers when they consume all of their requested memory, and will be "claimed back" when your container needs it.
Limit, in simple terms, is how much the container can consume (requested + borrowed from other containers) before it is shut down for consuming too many resources.
If a Container exceeds its memory limit, it will probably be terminated.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
In simple terms, the limit is an absolute value; it should be equal to or higher than the request. The good practice is to avoid setting limits higher than requests for all containers, and to only do it for workloads that really need it. The reason is that most containers can then consume more resources (e.g. memory) than they requested, and Pods may suddenly start being evicted from the node in an unpredictable way, which is worse than having a fixed limit for each one.
There is also a nice post in the Docker docs about resource limits.
The scheduling rule is the same for CPU and memory: K8s will only assign a Pod to a node if the node has enough allocatable CPU and memory to fit all the resources requested by the containers within the pod.
The execution rule is a bit different:
Memory is a limited resource on the node, and its capacity is an absolute limit; containers can't consume more than the node has.
CPU, on the other hand, is measured as CPU time. When you reserve CPU capacity, you are saying how much CPU time a container can use; if the container needs more time than requested, it can be throttled and sent to an execution queue until other containers have consumed their allocated time or finished their work. In summary it is very similar to memory, but it is very unlikely that a container will be killed for consuming too much CPU. The container will be able to use more CPU when the other containers don't use the full CPU time allocated to them. The main issue is that when a container uses more CPU than it was allocated, the throttling will degrade the performance of the application, and at a certain point it might stop working properly. If you do not provide limits, a container can start affecting other containers on the node.
Regarding the values to use, there is no single right value or formula. Each application requires a different approach, and only by measuring multiple times can you find the right values. My advice is to identify the min and the max, start somewhere in the middle, and then keep monitoring to see how it behaves; if you feel it is wasting or lacking resources, you can reduce or increase towards an optimal value. If the service is something crucial, start with higher values and reduce them afterwards.
As for the readiness check, you should not use it as a parameter for choosing these values; instead, you can delay readiness using the initialDelaySeconds parameter in the probe to give the pod's containers extra time to start.
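For example (path, port and timings are illustrative, not taken from the question):

    readinessProbe:
      httpGet:
        path: /health            # assumed health endpoint
        port: 8080               # assumed container port
      initialDelaySeconds: 120   # give the container time to start on a smaller CPU request
      periodSeconds: 10
      failureThreshold: 6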
PS: I put "borrowed" and "claimed back" in quotes because a container is not actually borrowing from another container. In general, the node has a pool of memory and gives a chunk of it to a container when it needs it, so the memory is not technically borrowed from another container but from the pool.

How to determine the resource limits required for my Tomcat application's OpenShift pods?

I have a web application (a SOAP service) running on a Tomcat 8 server in OpenShift. The payload size is relatively small (5-10 elements) and the traffic is also small (300 calls per day, 5-10 max threads at a time). I'm a little confused about the pod resource restrictions. How do I come up with min and max CPU and memory limits for each pod if I'm going to use a minimum of 1 and a maximum of 3 pods for my application?
It's tricky to configure accurate limit values without a performance test, because we can't predict how many resources your application needs to process each request. A good rule of thumb is to set the limits based on the heaviest workload in your environment. A memory limit can trigger the OOM killer, so you should set a comfortable value based on your Tomcat heap plus static memory size.
A CPU limit, by contrast, will not kill your pod when the limit is reached, but it will slow processing down.
My suggested starting point for each limit is as follows (see the sketch after this list):
Memory: Tomcat (Java) memory size + a 30% buffer.
CPU: personally, I think a CPU limit does little to maximize processing performance and efficiency. Even when CPU is available and the pod could use the full CPU resources to process requests as quickly as possible, the limit setting can get in the way. But if you need to spread resource usage evenly to suppress some aggressive resource eater, you can consider a CPU limit.
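A hedged sketch of such a starting point, assuming (purely for illustration) a Tomcat/JVM footprint of around 512Mi:

    resources:
      requests:
        memory: 512Mi   # assumed JVM footprint (heap + metaspace + native)
        cpu: 200m       # illustrative; adjust from measurements under load
      limits:
        memory: 700Mi   # roughly footprint + 30% buffer, to stay clear of the OOM killer
        # no cpu limit here, per the point above about CPU limits hurting throughput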
This answer might not be what you wanted, but I hope it helps you think through your capacity planning.

What is the use case of setting memory request less than limit in K8s?

I understand the use case of setting a CPU request less than the limit: it allows for CPU bursts in each container if the instance has free CPU, hence resulting in max CPU utilization.
However, I cannot really find the use case for doing the same with memory. Most applications don't release memory after allocating it, so effectively they will grow toward their 'limit' memory (which has the same effect as setting request = limit). The only exception is containers running on an instance that already has all of its memory allocated. I don't really see any pros in this, and the cons are more nondeterministic behaviour that is hard to monitor (one container having higher latencies than another due to heavy GC).
The only use case I can think of is a sharded in-memory cache, where you want to allow for a spike in memory usage. But even in this case you would be running the risk of one of the nodes underperforming.
Maybe not a real answer, but a point of view on the subject.
The difference between the CPU and memory limits is what happens when the limit is reached. In the case of CPU, the container keeps running but its CPU usage is capped. If the memory limit is reached, the container gets killed and restarted.
In my use case, I often set the memory request to the amount of memory my application uses on average, and the limit to +25%. This allows me to avoid containers being killed most of the time (which is good), but of course it exposes me to memory overallocation (and this could be a problem, as you mentioned).
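As a concrete instance of that rule of thumb (numbers are illustrative), for an app averaging around 400Mi:

    resources:
      requests:
        memory: 400Mi   # average observed usage
      limits:
        memory: 500Mi   # average + 25%: headroom for spikes before an OOM kill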
Actually, the topic you mention is interesting and at the same time complex, just as Linux memory management is. As we know, when a process uses more memory than its limit, it quickly moves up the potential "to-kill" process ladder. Going further, the purpose of the limit is to tell the kernel when it should consider the process a candidate to be killed. Requests, on the other hand, are a direct statement "my container will need this much memory", and beyond that they provide valuable information to the scheduler about where the Pod can be scheduled (based on available node resources).
If there is no memory request but a high limit, Kubernetes will default the request to the limit (this might make scheduling fail even if the pod's real requirements could be met).
If you set a request but no limit, the container will use the default limit for the namespace (if there is none, it will be able to use all of the available node memory); see the LimitRange sketch below.
Setting a memory request that is lower than the limit gives your pods room for activity bursts, and it also makes sure that the memory available for the pod to consume during a burst is capped at a reasonable amount.
Setting memory limit == memory request is not desirable simply because an activity spike will put the pod on the fast track to being OOM-killed by the kernel. Memory, unlike CPU, cannot be throttled in Kubernetes, so under memory pressure that is the most probable scenario (let's also remember that there is no swap partition).
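The namespace default mentioned above comes from a LimitRange object; a minimal sketch (name, namespace and values are illustrative):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: memory-defaults     # illustrative name
      namespace: my-namespace   # illustrative namespace
    spec:
      limits:
      - type: Container
        defaultRequest:
          memory: 256Mi   # applied when a container specifies no memory request
        default:
          memory: 512Mi   # applied when a container specifies no memory limit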
Quoting Will Tomlin and his interesting article on Requests vs Limits which I highly recommend:
You might be asking if there’s reason to set limits higher than requests. If your component has a stable memory footprint, you probably shouldn’t since when a container exceeds its requests, it’s more likely to be evicted if the worker node encounters a low memory condition.
To summarize: there is no straightforward and easy answer. You have to determine your memory requirements, use monitoring and alerting tools to stay in control, and be ready to adjust the configuration according to your needs.

How to set the right CPU millicores for a container?

I want to configure the CPU cores optimally, without over- or under-allocation. How can I measure the required CPU millicores for a given container? It also raises the question of how much traffic a proxy will send to any given pod based on CPU consumption, so we can use the compute optimally.
Currently I send requests and monitor with:

    kubectl top pod

Is there any tool that can measure requests, CPU and memory over time and suggest an optimal CPU recommendation for the pods?
Monitoring over time and per pod, yes: there are suggestions at https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/. One of the more popular options is the Prometheus-Grafana combination: https://grafana.com/dashboards/315
As for automatic suggestion of requests and limits, I don't think there is anything. Keep in mind that Kubernetes already tries to balance giving each Pod what it needs without letting it take too much; the limits and requests that you set are there to help it do this more safely. There are limits to automatic inference, since an under-resourced Pod can still work, just respond a bit slower; it is up to you to decide what level of slowness you would tolerate. It is also up to you to decide what level of resource consumption is acceptable at peak load, as opposed to excessive consumption that might indicate a bug in your app or even an attack. There's a further limitation in that the metric units are themselves an attempt to approximate resource power, which can vary with the type of hardware (memory and CPUs can differ in mode of operation as well as quantity), and so can vary across clusters or even across nodes in a cluster if the hardware isn't all equal.
What you are doing with top seems to me a good way to get started. You'll want to monitor resource usage for the cluster anyway, so keeping track of this and adjusting limits as you go is a good idea. If you can run the same app outside of Kubernetes, and read around to see what other apps using the same language do, that can help indicate whether there's anything you can do to improve utilisation (memory consumption on the JVM in containers, for example, famously requires some tweaking to get right).