Which component in Kubernetes is responsible for resource limits?

When a pod is created but no resource limit is specified, which component is responsible for calculating or assigning a resource limit to that pod? Is it the kubelet or Docker?

If a pod doesn't specify any resource limits, it's ultimately up to the Linux kernel scheduler on the node to decide whether to assign CPU cycles to the process, and up to the kernel to OOM-kill either the pod or other processes on the node if memory use is excessive. Neither Kubernetes nor Docker will assign or guess at any sort of limits.
So if you have a process with a massive memory leak, and it gets scheduled on a very large but quiet instance with 256 GB of available memory, it will get to use almost all of that memory before it gets OOM-killed. If a second replica gets scheduled on a much smaller instance with only 4 GB, it's liable to fail much sooner. Usually you'll want to actually set limits for consistent behavior.
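As a minimal sketch of what "actually set limits" could look like, the following builds a pod manifest as a Python dict and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The pod name, image, and all sizes are hypothetical placeholders, not recommendations.

```python
import json

# Hypothetical pod manifest: the name, image, and sizes are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "leaky-app"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "example.com/leaky-app:latest",
                "resources": {
                    # Requests are what the scheduler uses to pick a node.
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    # Limits are enforced at runtime: CPU is throttled,
                    # memory use beyond the limit gets the container OOM-killed.
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                },
            }
        ]
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```

With a limit like this in place, the leaking process would be killed at the same point on the 256 GB node and on the 4 GB node, which is the consistent behavior described above.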

Related

Kubernetes physical memory requests and limits and linux virtual memory

In Kubernetes, is it possible to enforce virtual memory (physical page swapping to disk) on a pod/container with memory requests and limits set?
For instance, as per the Kubernetes documentation, “if you set a memory limit of 4GiB for a container, the kubelet (and container runtime) enforce the limit. The runtime prevents the container from using more than the configured resource limit. For example: when a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.”
Hence, is it possible to configure the pod (and hence the Linux kernel) to enforce virtual memory (that is, paging and memory swapping) at the pod's specified physical memory limit (4GiB) instead of raising an OOM error? Am I missing something?
Reading the kernel documentation on this leads me to believe this is not possible. And I don't think this is a desirable behavior. Let's just think about the following scenario: You have a machine with 64GB of physical memory with 10GB of those used. Then you start a process with a "physical" memory limit of 500MB. If this memory limit is reached the kernel would start swapping and the process would stall even though there is enough memory available to service the memory requests of the process.
The memory limit you specify on the container is actually not a physical memory limit, but a virtual memory limit with overcommit allowed. This means your process can allocate as much memory as it wants (until you reach the overcommit limit), but it gets killed as soon as it tries to use too much memory.

Pod restart in OpenShift after deployment

A few pods in my OpenShift cluster are still restarted multiple times after deployment.
with describe output:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Also, memory usage is well below the memory limits.
Any other parameter which I am missing to check?
There are no issues with the cluster in terms of resources.
"OOMKilled" means your container memory limit was reached and the container was therefore restarted.
Especially Java-based applications can consume a large amount of memory when starting up. After the startup, the memory usage often drops considerably.
So in your case, increase 'resources.limits.memory' to avoid these OOMKills. Note that the request ('resources.requests.memory') can still be lower and should roughly reflect what your container consumes after startup.
Basically status OOM means the container memory limit has been crossed (Out of Memory).
If the memory allocated by all of the processes in a container exceeds the memory limit, the node Out of Memory (OOM) killer will immediately select and kill a process in the container [1].
If the container does not exit immediately, an OOM kill is detectable as follows:
A container process exited with code 137, indicating it received a SIGKILL signal
The oom_kill counter in /sys/fs/cgroup/memory/memory.oom_control is incremented
If one or more processes in a pod are OOM killed, when the pod subsequently exits, whether immediately or not, it will have phase Failed and reason OOMKilled. An OOM killed pod may be restarted depending on the value of restartPolicy [2].
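The exit code 137 mentioned above follows the common convention of 128 plus the signal number, and SIGKILL is signal 9 on Linux. A small sketch of decoding such codes (the helper name describe_exit_code is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

Note that 137 only says the process received SIGKILL; it is the OOMKilled reason in the pod status that ties the kill to the memory limit.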
To check the status:
oc get pod <pod name> -o yaml
There are applications that consume huge amounts of memory only during startup.
In this article one can find two solutions to handle the OOMKilled issues
You’ll need to size the container workload for different node configurations when using memory limits. Unfortunately there is no formula that can be applied to calculate the rate of increase in container memory usage with increasing number of cpus on the node.
One of the kernel tuneables that can help reduce the memory usage of containers is slub_max_order. A value of 0 (default is 3) can help bring down the overall memory usage of the container but can have negative performance implication for certain workloads. It’s advisable to benchmark the container workload with this tuneable. [3]
References:
[1] Configuring cluster memory to meet container memory and risk requirements
[2] Application memory sizing
[3] OOMKilled Containers and Number of CPUs

Is there a way to force the use of the same physical CPU while allocating cores to a pod in Kubernetes?

I was wondering if it is possible to force Kubernetes to allocate cores from the same CPU when spinning up a pod. As new pods are created, I would like the cores allocated to them to come from, let's say, CPU1 as long as it still has cores available; cores from CPU2, CPU3, etc. should not be used in the newly created pod. I would like my pods to have cores allocated from a single CPU whenever possible.
Is there a way to achieve this?
Also, is there a way to see which physical CPUs the cores (cpu) of a pod are coming from?
Thanks a lot.
Edit: Let me explain why I want to do this.
We are running a Spark cluster on Kubernetes. The lead of system/linux administration team warned us about the concept of NUMA. He told us that we could improve the performance of our executor pods if we were to allocate the cores from the same physical CPU. That is why I started digging into this.
I found this Kubernetes CPU Manager. The documentation says:
CPU Manager allocates CPUs in a topological order on a best-effort
basis. If a whole socket is free, the CPU Manager will exclusively
allocate the CPUs from the free socket to the workload. This boosts
the performance of the workload by avoiding any cross-socket traffic.
Also on the same page:
Allocate all the logical CPUs (hyperthreads) from the same physical
CPU core if available and the container requests an entire core worth
of CPUs.
So now I am starting to think maybe what I need is to enable the static policy for the CPU manager to get what I want.
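For the second question (seeing which physical CPUs a pod's cores come from), one possible approach on Linux is to take the list of CPUs a process may run on and look up each CPU's socket in sysfs. The sketch below assumes the Linux CPU-list format (as in the Cpus_allowed_list field of /proc/<pid>/status) and the sysfs file topology/physical_package_id; the function names and the two-socket layout in the demo are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a Linux CPU list such as '0-3,8,10-11' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_socket_id(cpu: int) -> int:
    """Read the physical package (socket) id of a logical CPU from sysfs."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/physical_package_id")
    return int(path.read_text().strip())


def group_by_socket(
    cpus: Iterable[int], socket_of: Callable[[int], int] = read_socket_id
) -> Dict[int, List[int]]:
    """Group logical CPU ids by the socket they belong to."""
    groups = defaultdict(list)
    for cpu in cpus:
        groups[socket_of(cpu)].append(cpu)
    return dict(groups)


# Hypothetical two-socket node: CPUs 0-7 on socket 0, CPUs 8-15 on socket 1.
def fake_socket_of(cpu: int) -> int:
    return 0 if cpu < 8 else 1


print(group_by_socket(parse_cpu_list("2-3,10-11"), fake_socket_of))
```

Run inside a container on a real node, the same function could be called as group_by_socket(sorted(os.sched_getaffinity(0))) to use the actual sysfs topology. A result with more than one key would mean the container's CPUs span sockets. Keep in mind that the static policy only gives exclusive CPUs to containers in Guaranteed pods that request whole CPUs.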

Kubernetes limit and request of resource would be better to be closer

I was told by a more experienced DevOps person that resource (CPU and memory) limits and requests should be close to each other for scheduling pods.
Intuitively I can imagine that less scaling up and down would require less computing power from K8s? Or can someone explain it in more detail?
The resource requests and limits do two fundamentally different things. The Kubernetes scheduler places a pod on a node based only on the sum of the resource requests: if the node has 8 GB of RAM, and the pods currently scheduled on that node requested 7 GB of RAM, then a new pod that requests 512 MB will fit there. The limits control how much resource the pod is actually allowed to use, with it getting CPU-throttled or OOM-killed if it uses too much.
In practice many workloads can be "bursty". Something might require 2 GB of RAM under peak load, but far less than that when just sitting idle. It doesn't necessarily make sense to provision enough hardware to run everything at peak load, but then to have it sit idle most of the time.
If the resource requests and limits are far apart then you can "fit" more pods on the same node. But, if the system as a whole starts being busy, you can wind up with many pods that are all using above their resource request, and actually use more memory than the node has, without any individual pod being above its limit.
Consider a node with 8 GB of RAM, and pods with 512 MB RAM resource requests and 2 GB limits. 16 of these pods "fit". But if each pod wants to use 1 GB RAM (allowed by the resource limits) that's more total memory than the node has, and you'll start getting arbitrary OOM-kills. If the pods request 1 GB RAM instead, only 8 will "fit" and you'll need twice the hardware to run them at all, but in this scenario the cluster will run happily.
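The arithmetic in this example can be checked with a short sketch. It assumes the whole 8 GB is available to pods (a real node reserves some memory for the system) and treats 1 GB as 1024 MB; the function names are made up for illustration.

```python
MB_PER_GB = 1024


def pods_that_fit(node_mb: int, request_mb: int) -> int:
    """Number of pods the scheduler can place, based only on requests."""
    return node_mb // request_mb


def is_overcommitted(node_mb: int, pods: int, usage_mb: int) -> bool:
    """True if the pods' actual usage adds up to more than the node has."""
    return pods * usage_mb > node_mb


node = 8 * MB_PER_GB

# 512 MB requests, 2 GB limits, each pod actually using 1 GB.
small_requests = pods_that_fit(node, 512)
print(small_requests, is_overcommitted(node, small_requests, 1024))

# 1 GB requests, same 1 GB of actual usage per pod.
large_requests = pods_that_fit(node, 1024)
print(large_requests, is_overcommitted(node, large_requests, 1024))
```

The first case places 16 pods that together want 16 GB on an 8 GB node, so some get OOM-killed; the second places 8 pods whose usage exactly fits.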
One strategy for dealing with this in a cloud environment is what your ops team is asking for: make the resource requests and limits very close to each other. If a node fills up, an autoscaler will automatically request another node from the cloud. Scaling down is a little trickier. But this approach avoids problems where things die randomly because the Kubernetes nodes are overcommitted, at the cost of needing more hardware for the idle state.

limit the amount of memory kube-controller-manager uses

I'm running v1.10 and I notice that kube-controller-manager's memory usage spikes and then it OOMs all the time. It wouldn't be so bad if the system didn't slow to a crawl before this happens, though.
I tried modifying /etc/kubernetes/manifests/kube-controller-manager.yaml to set resources.limits.memory=1Gi, but the kube-controller-manager pod never seems to come back up.
Any other options?
There is a bug in kube-controller-manager, and it's fixed in https://github.com/kubernetes/kubernetes/pull/65339
First of all, you did not mention how much memory you use per node.
Second, what do you mean by "system didn't fall to a crawl" - do you mean nodes are swapping?
All Kubernetes masters and nodes are expected to have swap disabled - it's recommended by the Kubernetes community, as mentioned in the Kubernetes documentation.
Support for swap is non-trivial and degrades performance.
Turn off swap on every node by:
sudo swapoff -a
Finally,
resources.limits.memory=1Gi
sets a hard limit on the container's memory. A container that reaches this level of allocated memory can be OOM-killed, even if the node has gigabytes of unallocated memory.