Kubernetes Pod vs. Container OOMKilled

If I understand correctly the conditions for Kubernetes to OOM kill a pod or container (from komodor.com):
If a container uses more memory than its memory limit, it is terminated with an OOMKilled status. Similarly, if overall memory usage on all containers, or all pods on the node, exceeds the defined limit, one or more pods may be terminated.
This means that if a container in the pod exceeds its memory limit, that container will be killed, but not the pod itself. Similarly, if there are multiple containers in a pod and the pod as a whole exceeds its memory limitation, which is the sum of the memory limits of all containers in that pod, the pod will be OOM killed. However, the latter only seems possible if one of the containers exceeds its own memory allowance. In that case, wouldn't the container be killed first?
I'm trying to understand the actual conditions in which a pod is OOM killed instead of a container.
I've also noticed that when there is one container in the pod and that container repeatedly exceeds its memory allowance, the pod and the container are killed intermittently. What I observed is that the container would restart (observable by watching the logs from the pod), and every second time the pod itself is killed and restarted, incrementing its restart count.
If it helps to understand the behavior - the QOS class of the pod is Burstable.

Pods aren't OOM killed at all. OOMKilled is a status ultimately caused by a kernel mechanism (the OOM killer) that kills processes (containers are processes), which the kubelet then recognises and records as the status on the container. If the main container in a pod is killed, then by default it will be restarted by the kubelet. A pod cannot be terminated by the OOM killer, because a pod is a data structure rather than a process. Similarly, it cannot have a memory (or CPU) limit of its own; rather, it is limited by the sum of its component containers' limits.
The article you reference uses imprecise language, and I think this is causing some confusion. There is a better article on Medium that covers this more accurately.
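As a rough illustration (the names, images, and sizes below are made up for the example), limits are declared per container, and the pod's effective ceiling is just the sum of those limits:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical pod
spec:
  containers:
  - name: app                  # hypothetical main container
    image: nginx
    resources:
      requests:
        memory: "128Mi"
      limits:
        memory: "256Mi"        # only this container is OOM killed if it exceeds 256Mi
  - name: sidecar              # hypothetical second container
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"        # the sidecar has its own, independent limit

If app exceeds 256Mi, that container is killed and restarted by the kubelet; sidecar keeps running, and the Pod object itself is never "killed", it only reports the event in its container statuses.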

Related

If you have a pod with multiple containers and one triggers the OOM killer, does it restart the entire pod?

Trying to plan out a deployment for an application and am wondering if it makes sense to have multiple containers in a pod vs. putting them in separate pods. I expect one of the containers to potentially be operating near its allocated memory limit. My understanding is that this presents the risk of that container getting OOMKilled. If that's the case, would it restart the entire pod (so the other container in the pod is restarted as well), or will it only restart the OOMKilled container?
No, only the specific container.
For the whole pod to be recreated there needs to be a change in the Pod's owner object (typically a ReplicaSet) or a scheduling decision by kube-scheduler.
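If you want to verify this yourself, you can watch the per-container restart counts, which increment independently (the pod name is a placeholder):

kubectl get pod <pod-name> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'

Only the restart count of the OOMKilled container goes up; the other container's count stays the same.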

If requested memory is "the minimum", why is Kubernetes killing my pod when it exceeds 10x the requested amount?

I am debugging a problem with pod eviction in Kubernetes.
It looks like it is related to the configured number of PHP-FPM child processes.
I assigned a minimum memory of 128 MB, and Kubernetes is evicting my pod apparently when it exceeds roughly 10x that amount (The node was low on resource: memory. Container phpfpm was using 1607600Ki, which exceeds its request of 128Mi.).
How can I prevent this? I thought that the requested resources were the minimum and that the pod could use whatever is available if there's no upper limit.
Requested memory is not "the minimum"; it is exactly what it is called: the amount of memory requested by the pod. When Kubernetes schedules a pod, it uses the request as guidance to choose a node that can accommodate the workload, but it doesn't guarantee that the pod won't be killed if the node is short on memory.
As per docs https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run
if a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
If you want to guarantee a certain memory window for your pods you should use limits, but in that case, if your pod doesn't use most of this memory, it will be "wasted".
So to answer your question "How can I prevent this?", you can:
Reconfigure your php-fpm in a way that prevents it from using 10x the requested memory (e.g. reduce the worker count), and configure autoscaling; that way your overloaded pods won't be evicted, and Kubernetes will schedule new pods in the event of higher load.
Set a memory limit to guarantee a certain amount of memory to your pods (see the sketch after this list).
Increase the memory on your nodes.
Use affinity to schedule your demanding pods on dedicated nodes and other workloads on separate nodes.
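As a minimal sketch of the second suggestion (the sizes are illustrative, not taken from the question), a php-fpm container with both a request and a limit would carry something like this in its spec; with a limit in place, the container is OOM killed at the limit rather than the whole pod being evicted later under node memory pressure:

resources:
  requests:
    memory: "128Mi"   # used by the scheduler to place the pod
  limits:
    memory: "512Mi"   # hard ceiling; exceeding it OOM-kills the phpfpm container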

Kubernetes - what happens if you don't set a pod CPU request or limit?

I understand the concept of setting a request or a limit on a Kubernetes pod for both CPU and/or memory resources, but I'm trying to understand what happens if you don't set either a request or a limit for, say, CPU.
We have configured an NGINX pod, but it doesn't have either a request or a limit set for its CPU. I'm assuming it will get at minimum 1 millicore and that the node will give the pod as many millicores as it needs and are available. If the node has exhausted all available cores, does it just stay stuck at 1 millicore?
What happens if you don't set either request or limit for say a CPU?
When you don’t specify a request for CPU, you’re saying you don’t care how much CPU time the process running in your container is allotted. In the worst case, it may not get any CPU time at all (this happens when other processes place heavy demand on the CPU). Although this may be fine for low-priority batch jobs, which aren’t time-critical, it obviously isn’t appropriate for containers handling user requests.
Similarly, when you specify a memory request of N mebibytes for a container, you’re saying that you expect the processes running inside the container to use at most N mebibytes of RAM. They might use less, but you’re not expecting them to use more than that in normal circumstances.
Understanding how resource requests affect scheduling
By specifying resource requests, you’re specifying the minimum amount of resources your pod needs. This information is what the Scheduler uses when scheduling the pod to a node.
Each node has a certain amount of CPU and memory it can allocate to pods. When scheduling a pod, the Scheduler will only consider nodes with enough unallocated resources to meet the pod’s resource requirements.
If the amount of unallocated CPU or memory is less than what the pod requests, Kubernetes will not schedule the pod to that node, because the node can’t provide the minimum amount required by the pod.
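For concreteness, here is a sketch of what such a request looks like in a container spec (the values are arbitrary); the Scheduler will only place this pod on a node that still has at least this much unallocated CPU and memory:

resources:
  requests:
    cpu: "250m"       # a quarter of a CPU core
    memory: "64Mi"    # 64 mebibytes of RAM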
Understanding what happens when the limits are exceeded
With CPU
CPU is a compressible resource, and it’s only natural for a process to want to consume all of the CPU time when not waiting for an I/O operation.
When a CPU limit is set for a container, the process’s CPU usage is throttled, so it isn’t given more CPU time than the configured limit.
With Memory
With memory, it’s different. When a process tries to allocate memory over its limit, the process is killed (it’s said the container is OOMKilled, where OOM stands for Out Of Memory).
If the pod’s restart policy is set to Always or OnFailure, the process is restarted immediately, so you may not even notice it getting killed. But if it keeps going over the memory limit and getting killed, Kubernetes will begin restarting it with increasing delays between restarts. You’ll see a CrashLoopBackOff status in that case.
kubectl get po
NAME        READY   STATUS             RESTARTS   AGE
memoryhog   0/1     CrashLoopBackOff   3          1m
Note: The CrashLoopBackOff status doesn’t mean the Kubelet has given up. It means that after each crash, the Kubelet is increasing the time period before restarting the container.
To examine why the container crashed, describe the pod:
kubectl describe pod
Name: ...
...
Containers:
  main:
    ...
    State:      Terminated
      Reason:   OOMKilled
...
Pay attention to the Reason attribute: OOMKilled. The container was killed because it ran out of memory (OOM).
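If you want to reproduce this state to experiment with, a pod whose container deliberately allocates more memory than its limit will cycle through OOMKilled and CrashLoopBackOff. This is only a sketch, assuming a stress-testing image such as polinux/stress is available; the sizes are arbitrary:

apiVersion: v1
kind: Pod
metadata:
  name: memoryhog                  # matches the pod name in the output above
spec:
  containers:
  - name: main
    image: polinux/stress          # assumption: an image that ships the stress tool
    command: ["stress", "--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
    resources:
      limits:
        memory: "100Mi"            # deliberately smaller than the 250M the process allocates

Running kubectl describe pod memoryhog afterwards should show State: Terminated with Reason: OOMKilled, as above.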

What happens if a Kubernetes pod exceeds its memory resources 'limit'?

There's a bit of vagueness in the Kubernetes documentation about what happens if a pod's memory footprint increases to the point where it exceeds the value specified under resources.limits.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
It specifies that if a pod's footprint grows to exceed its limits value, it "might be" terminated. What does "might be" mean? What conditions would result in it being terminated, versus which ones would allow it to continue on as-is?
Q: What happens if a Kubernetes pod exceeds its memory resources 'limit'?
It will be restarted.
Unlike Pod eviction, if a Pod container is OOM killed, it may be restarted by the kubelet based on its RestartPolicy.
You can Configure Out Of Resource Handling for your Node.
Evicting end-user Pods
If the kubelet is unable to reclaim sufficient resource on the node, kubelet begins evicting Pods.
The kubelet ranks Pods for eviction first by whether or not their usage of the starved resource exceeds requests, then by Priority, and then by the consumption of the starved compute resource relative to the Pods’ scheduling requests.
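Out-of-resource handling is configured on the kubelet itself. Here is a hedged sketch of a KubeletConfiguration fragment with hard eviction thresholds (the values are illustrative, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"   # start evicting pods when free node memory drops below this
  nodefs.available: "10%"     # same idea for node disk space

With thresholds like these, the kubelet begins ranking and evicting pods as described above once the node crosses them.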
Limits apply to both memory and CPU.
If memory consumption exceeds the limit the container is terminated, but in the case of CPU the OS may allow brief time slices that consume more than the allowed share; the container is throttled rather than killed.

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources, so if you do not specify a request the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a condition for a pod that is already running.
If you overcommit the actual resources on a node you will run into the typical issues: overcommit on memory and the node will start to swap; overcommit on CPU and everything just generally slows down. Either way the node and the pods on it can become unresponsive. That situation is difficult to deal with, and requests and limits set up sane boundaries that keep you from getting that far; instead you'll simply see the pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, to answer your questions:
At most 10 pods will be scheduled onto your node: when a memory limit is set but the request is left unset, the request defaults to the limit (100MB here), so the scheduler treats each pod as requesting 100MB and only about 10 of them fit into 1GB.
If there is no free memory on the node, the evicted pods will stay Pending. Also, k8s can simply evict a pod whose usage exceeds its requests when resources are needed for other pods and services.
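To address the last part of the question (detecting that this eviction loop is happening), one rough approach is to watch eviction events and failed pods; these are standard kubectl invocations, though the exact event reason strings can vary by version:

kubectl get events --all-namespaces --field-selector reason=Evicted
kubectl get pods --all-namespaces --field-selector status.phase=Failed

If you run kube-state-metrics or a similar exporter, evicted pods also show up in its metrics, which you can alert on.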