Scheduling according to the available memory of the node - kubernetes

How can I set this up? When the available memory of a node drops below 8G, the node should become unschedulable, and when its available memory rises back above 8G, it should automatically become schedulable again. While the node is unschedulable, the pods already running on it should continue to run unaffected.
Alternatively, can I configure pods so that they are never scheduled onto a node with less than 8G of available memory? Because the memory required by different pods varies widely, this cannot be solved simply by setting requests.
I tried setting the kubelet configuration on the node to evictionHard: memory.available: 8Gi, but then the pods running on the node are evicted as soon as its available memory drops below 8G.
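That configuration corresponds roughly to this fragment of the kubelet configuration file (simplified; only the relevant field is shown):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "8Gi"    # pods are evicted once available node memory drops below 8Gi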

Related

Kubernetes Pod vs. Container OOMKilled

If I understand correctly the conditions for Kubernetes to OOM kill a pod or container (from komodor.com):
If a container uses more memory than its memory limit, it is terminated with an OOMKilled status. Similarly, if overall memory usage on all containers, or all pods on the node, exceeds the defined limit, one or more pods may be terminated.
This means that if a container in the pod exceeds its memory limit, that container will be killed, but not the pod itself. Similarly, if there are multiple containers in a pod and the pod as a whole exceeds its memory limitation - the sum of the memory limits of all the containers in that pod - the pod will be OOM killed. However, the latter only seems possible if one of the containers exceeds its own memory allowance. In that case - wouldn't the container be killed first?
I'm trying to understand the actual conditions in which a pod is OOM killed instead of a container.
I've also noticed that when there is one container in the pod and that container repeatedly exceeds its memory allowance, the pod and the container are killed intermittently. I observed this by watching the logs from the pod: the container would restart, and every second time the pod itself is killed and restarted, incrementing its restart count.
If it helps to understand the behavior - the QOS class of the pod is Burstable.
Pods aren't OOM killed at all. OOMKilled is a status ultimately caused by a kernel process (the OOM killer) that kills processes (containers are processes); the kubelet then recognises this and sets the status on the container. If the main container in a pod is killed, then by default the kubelet restarts it, which is what increments the pod's restart count. A pod cannot be terminated by the OOM killer, because a pod is a data structure rather than a process. Similarly, it cannot have a memory (or CPU) limit itself; rather, it is limited by the sum of its component parts.
The article you reference uses imprecise language, and I think this is causing some confusion. There is a better article on Medium that covers this more accurately.
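To make the per-container nature of limits concrete, here is a minimal sketch of a two-container pod (names, images and values are illustrative): each container carries its own memory limit, the OOM killer acts on the processes of whichever container exceeds its own limit, and there is no separate pod-level limit object.

apiVersion: v1
kind: Pod
metadata:
  name: oom-demo               # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"        # this container is OOM killed if it exceeds 128Mi
  - name: sidecar
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        memory: "32Mi"
      limits:
        memory: "64Mi"         # limits apply per container, never to the pod as a whole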

How does k8s manage containers using more cpu than requested without limits?

I'm trying to understand what happens when a container is configured with a CPU request and without a limit, and it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources.
Will k8s keep the container throttled in its current node, or will it be moved to another node with available resources? Do we know how/when k8s decides to move the container when it's throttled in such a case?
I would appreciate any extra resources to read on this matter, as I couldn't find anything that goes into detail on this specific scenario.
Q1) What happens when a container is configured with a CPU request and without a limit?
ANS:
If you do not specify a CPU limit
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
If you specify a CPU limit but do not specify a CPU request
If you specify a CPU limit for a Container but do not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit. Similarly, if a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit.
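A minimal sketch of the first case above (a CPU request with no limit; names and values are illustrative): the scheduler uses the 250m request to place the pod, but once running the container may burst into any idle CPU on its node.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-burst-demo         # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "250m"            # only used for scheduling decisions
      # no cpu limit: the container can consume whatever CPU is idle on the node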
Q2) What if it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources?
ANS:
The Kubernetes scheduler is a control plane process which assigns Pods to Nodes. The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources. The scheduler then ranks each valid Node and binds the Pod to a suitable Node. Multiple different schedulers may be used within a cluster; kube-scheduler is the reference implementation. See scheduling for more information about scheduling and the kube-scheduler component.
Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.
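For completeness, priority is expressed through a PriorityClass object that pods reference by name; a rough sketch (the names and the value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority          # hypothetical name
value: 1000000                 # pods with higher values can preempt lower-priority pods
globalDefault: false
description: "Workloads that may preempt lower-priority pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app          # hypothetical name
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx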
Q3) Will k8s keep the container throttled in its current node or will it be moved to another node with available resources?
ANS:
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or involuntarily.
Voluntary disruptions are started intentionally by application owners or cluster administrators. Involuntary disruptions are unintentional and can be triggered by unavoidable issues like Nodes running out of resources, or by accidental deletions.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application.
Examples are:
a hardware failure of the physical machine backing the node
cluster administrator deletes VM (instance) by mistake
cloud provider or hypervisor failure makes VM disappear
a kernel panic
the node disappears from the cluster due to cluster network partition
eviction of a pod due to the node being out-of-resources.
Suggestion:
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Command:
kubectl taint nodes node1 key1=value1:NoSchedule
Example:
kubectl taint nodes node1 node.kubernetes.io/disk-pressure:NoSchedule
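A pod that should still be allowed onto a node tainted as above needs a matching toleration; a sketch for the first taint (key1=value1:NoSchedule), with illustrative names:

apiVersion: v1
kind: Pod
metadata:
  name: tolerant-app           # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"       # matches the taint applied with kubectl taint above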

If requested memory is "the minimum", why is kubernetes killing my pod when it exceeds 10x the requested?

I am debugging a problem with pod eviction in Kubernetes.
It looks like it is related to the configured number of PHP-FPM child processes.
I assigned a minimum memory of 128 MB, and Kubernetes is evicting my pod apparently when it exceeds 10x that amount (The node was low on resource: memory. Container phpfpm was using 1607600Ki, which exceeds its request of 128Mi.).
How can I prevent this? I thought that the requested resources were the minimum and that the pod could use whatever is available if there's no upper limit.
Requested memory is not "the minimum"; it is exactly what it is called - the amount of memory requested by the pod. When Kubernetes schedules a pod, it uses the request as guidance to choose a node that can accommodate the workload, but it doesn't guarantee that the pod won't be killed if the node runs short on memory.
As per docs https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run
if a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
If you want to guarantee a certain memory window for your pods, you should use limits (ideally with requests set to match); but in that case, if your pod doesn't use most of this memory, it will be "wasted".
So to answer your question "How can I prevent this?", you can:
reconfigure your php-fpm in a way that prevents it from using 10x the requested memory (i.e. reduce the worker count), and configure autoscaling; that way your overloaded pods won't be evicted, and Kubernetes will schedule new pods in the event of higher load
set a memory limit (and a realistic request) to guarantee a certain amount of memory to your pods
increase the memory on your nodes
use affinity to schedule your demanding pods on dedicated nodes and other workloads on separate nodes
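For the phpfpm container from the question, the resources stanza might look roughly like this (the 512Mi figure and image tag are purely illustrative; size the values to what your worker count actually needs):

  containers:
  - name: phpfpm
    image: php:8-fpm           # image tag is an assumption
    resources:
      requests:
        memory: "512Mi"        # keep the request close to real usage so scheduling reflects reality
      limits:
        memory: "512Mi"        # the container is OOM killed once it tries to exceed this,
                               # instead of the whole pod being evicted under node pressure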

What are the Kubernetes memory commitment limits on worker nodes?

I am running a single-node K8s cluster and my machine has 250GB of memory. I am trying to launch multiple pods, each needing at least 50GB of memory.
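I am setting each pod spec to request 50GB of memory, roughly like this (simplified; the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: big-memory-pod         # placeholder name
spec:
  containers:
  - name: app
    image: nginx               # placeholder image
    resources:
      requests:
        memory: "50Gi"         # roughly 50GB requested per pod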
Now I can launch the first four pods and they start Running immediately; however, the fifth pod stays Pending. kubectl describe pod shows 0/1 nodes are available: 1 Insufficient memory.
At first this does not make sense - I have 250GB of memory, so I should be able to launch five pods each requesting 50GB of memory. Shouldn't I?
On second thought, this might be too tight for K8s to fulfill: if it schedules the fifth pod, the worker node has no memory left for anything else (kubelet, kube-proxy, the OS, etc.). If this is the reason for the denial, then K8s must be keeping some amount of node resources (CPU, memory, storage, network) always free or reserved. What is this amount? Is it a static value or N% of total node resource capacity? Can we change these values when we launch the cluster?
Where can I find more details on this topic? I tried googling, but nothing came up that specifically addresses these questions. Can you help?
At first this does not make sense - I have 250GB of memory, so I should be able to launch five pods each requesting 50GB of memory. Shouldn't I?
This depends on how much memory is "allocatable" on your node. Some memory may be reserved for e.g. the OS or other system tasks.
First list your nodes:
kubectl get nodes
Then you get a list of your nodes, now describe one of your nodes:
kubectl describe node <node-name>
And now, you should see how much "allocatable" memory the node has.
Example output:
...
Allocatable:
cpu: 4
memory: 2036732Ki
pods: 110
...
Set custom reserved resources
K8s must be keeping some amount of node resources (CPU, memory, storage, network) always free or reserved. What is this amount? Is it a static value or N% of total node resource capacity? Can we change these values when we launch the cluster?
Yes, this can be configured. See Reserve Compute Resources for System Daemons.
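The reservations are configured through kubelet flags or the kubelet configuration file; a rough sketch of the relevant fields (the sizes are illustrative, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:                # held back for OS daemons (sshd, journald, ...)
  cpu: "500m"
  memory: "1Gi"
kubeReserved:                  # held back for Kubernetes daemons (kubelet, container runtime)
  cpu: "500m"
  memory: "1Gi"
evictionHard:                  # node-pressure eviction threshold
  memory.available: "500Mi"

Allocatable is then roughly the node capacity minus systemReserved, kubeReserved and the hard eviction threshold, which is why a node reports less allocatable memory than its physical total.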

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction/rescheduling cycle keep happening over and over? If so, is there some metric I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources, so if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, while a limit applies to a pod that is already scheduled and running.
If you overcommit the actual resources on a node, you will run into the typical issues: overcommitting memory leads to swapping (or, on nodes with swap disabled, to OOM kills and evictions), and overcommitting CPU just causes a general slowdown. Either way, the node and the pods on it can become unresponsive. That situation is difficult to deal with; requests and limits set up sane boundaries that help you avoid taking things that far, where instead you simply see a pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, answers to your questions:
At most 10 pods will be scheduled onto your node (following the excerpt above: 1GB of node memory divided by the 100MB limit per pod).
If there is no free memory on the node, the evicted pods will stay Pending. Also, k8s can simply evict a pod that exceeds its limits when resources are needed for other pods and services.
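To detect that this churn is happening, restart counts, pod phases and events are the most direct signals; for example (standard kubectl queries; exact event reasons can vary by version):

# evicted pods show up as Failed with reason Evicted
kubectl get pods --field-selector=status.phase=Failed

# recent eviction events, newest last
kubectl get events --field-selector=reason=Evicted --sort-by=.lastTimestamp

# watch restart counts climbing on a deployment's pods ('app=my-app' is a placeholder label)
kubectl get pods -l app=my-app -w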