Kubernetes reserving headroom on nodes to allow memory peaks

I'm running a pod on k8s that has a baseline memory usage of about 1GB. Sometimes, depending on user behaviour, the pod can consume much more memory (10-12GB) for a few minutes, and then drop down to the baseline level.
What I would like to configure in k8s is a low memory request (1GB), while ensuring the pod runs on a node with much higher memory capacity. That way the pod can grow when needed and shrink back afterwards. The reason I don't set the request higher is that I have multiple replicas of this pod, and ideally I would want all of them to be hosted on 1-2 nodes, letting each one peak when needed, without spending too much.
It was counterintuitive to find out that the memory limit configuration does not affect node selection, meaning that if I configure the limit to be 12GB I can still get a 4GB node.
Is there any way to configure my pods to share some large nodes, so they will be able to extend their memory usage without crashing?

Resource Requests vs Limits
Memory limit doesn't affect node selection, but memory request does. This is because resource requests and resource limits serve different purposes.
The resource request makes sure that your pod has at least the requested amount of resources available to it at all times.
The resource limit makes sure that your pod never uses more than the limit amount of resources. For example, a badly chosen memory limit frequently leads to OOM (out-of-memory) kills of pods.
Resource requests are the parameters the scheduler uses to decide which node your pods go on. Limits are not used for this purpose, and it is up to the application designer to set them correctly.
So What Can You Do
You can keep the pod request the way you are setting it currently. Say, 1.5GiB of memory. From experience, you know that your pod can consume as much as 12GiB of memory, and you are running 2 pods of this application. Therefore, if you want to schedule both of these pods on a single node and not have any issues, you have to make sure your node has:
total_memory = 2 * pod_limit + overhead (system) OR
total_memory ~= 2 * pod_limit
Overhead is the memory consumed by an idle node, which accounts for OS components and cluster system components such as the kubelet and kube-proxy binaries. This is generally a very small value.
This will allow you to select the right node for your pods.
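For illustration, here is a minimal sketch of what that could look like in the pod definition, using the numbers from this example (the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bursty-app                       # hypothetical name
spec:
  containers:
    - name: app
      image: example/bursty-app:latest   # hypothetical image
      resources:
        requests:
          memory: "1536Mi"   # 1.5GiB - what the scheduler uses for placement
        limits:
          memory: "12Gi"     # hard cap - exceeding it gets the container OOM-killed
```

With two such replicas on one node, the node should have roughly 2 * 12GiB plus system overhead, so something in the 26-32GiB range gives both pods room to peak at the same time.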
Summary
You will have to ensure that the node hosting your pods has considerably more memory than the sum of their requests in order to handle those spikes (but not unboundedly more, which is why you also set a limit).
Note: Strictly speaking, pods don't shrink and grow; they have a fixed resource profile, which is determined by the resource requests and resource limits in the definition. Also note that a pod always "occupies" the requested amount of resources from the scheduler's point of view, even if it is not actually using them.
Other Elements to Control Scheduling
Additionally, you can use node affinities to add a 'preference' for the scheduler when placing your pods. I say a preference because preferred node affinities are not hard rules but guidelines (required affinities, by contrast, are enforced). You can also use pod anti-affinities to make sure certain pods don't get scheduled on the same node. Just keep in mind that preferred rules may be ignored by the scheduler if honouring them would prevent a pod from being scheduled at all.
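As a rough sketch (the label keys and values are hypothetical), a preferred node affinity combined with a preferred pod anti-affinity could look like this in the pod spec:

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-size            # hypothetical node label
                operator: In
                values: ["highmem"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: bursty-app           # hypothetical pod label
```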
Another option is to use a nodeSelector on your pod, to make sure it lands on a node that matches the selector. If no node matches the selector, your pod will be stuck in the Pending state, meaning it cannot be scheduled anywhere.
So, after you've decided which nodes you want your pod to be scheduled on, you can label them, and use a selector to ensure the pods are scheduled on the matching node.
It is also possible to set an explicit nodeName on your pod to force it onto a particular node. This option is rarely used, because nodes can be added to or deleted from your cluster, which would require you to change the pod definition every time that happens. Using nodeSelector is better: you select on general attributes such as a label, which you can attach to any new node you add to the cluster, so neither pod scheduling nor the pod definition is affected.
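A minimal nodeSelector sketch, assuming you have already labeled your large nodes (the label is hypothetical):

```yaml
# label the large nodes first, e.g.:
#   kubectl label node <node-name> pool=highmem
spec:
  nodeSelector:
    pool: highmem   # hypothetical label; the pod stays Pending if no node matches
```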
Hope this helps.

Related

Are Kubernetes requests really guaranteed?

I'm running a pod on an EKS node with 2500m of requests and no limits - it happily uses around 3000m typically. I wanted to test whether requests were really guaranteed, so I am running a CPU stress test pod on the same node, with 3000m requests and no limits again.
This caused the original pod to be unable to use more than ~1500m of CPU - well below its requests. Then when I turned off the stress pod, it returned to using 3000m.
There are a number of Kubernetes webpages which say that requests are what the pod is "guaranteed" - but does this only mean guaranteed for scheduling, or should it actually be a guarantee? If it is guaranteed, why might my pod's CPU usage have been restricted (noting that there is no throttling for pods without limits)?
Requests are not a guarantee that resources (especially CPU) will be available at runtime. If you set requests and limits very close together you have better expectations, but you need every pod in the system to cooperate to have a real guarantee.
Resource requests only affect the initial scheduling of the pod. In your example, you have one pod that requests 2.5 CPU and a second pod that requests 3 CPU. If your node has 8 CPU, both can be scheduled on the same node, but if the node only has 4 CPU, they need to go on separate nodes (if you have the cluster autoscaler, it can create a new node).
To carry on with the example, let's say the pods get scheduled on the same node with 8 CPU. Now that they've been scheduled, the resource requests don't matter any more. Neither pod has resource limits, but let's say the smaller pod actually tries to use 3 CPU and the larger pod (a multi-threaded stress test) uses 13 CPU. This is more than the physical capacity of the node, so the kernel's CPU scheduler divides the available cycles between the two processes, weighted by their cgroup CPU shares, which are derived from the resource requests.
For CPU usage, if the node is overcommitted, you'll just see slow-downs in all of the processes. Either memory or disk ("ephemeral storage") can cause pods to be Evicted and rescheduled on different nodes; the pods that get evicted are the ones that exceed their resource requests by the most. Memory can also cause the node to run out of physical memory, and pods can get OOMKilled.
If every pod sets resource requests and limits to the same value then you do have an approximate guarantee that resources will be available, since nothing will be able to use more resources than the scheduler has allocated to it. For an individual pod and for non-CPU resources, if resource requests and limits are the same, your pod won't get evicted when the node is overcommitted (because it can't exceed its requests). On the other hand, most processes won't generally use exactly their resource requests, so setting requests high enough to guarantee you're never evicted also leaves the node with unused resources. Your cluster as a whole becomes less efficient (more nodes for the same work, and therefore more expensive), but also more reliable, since pods won't get killed off as often.
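For reference, a sketch of a container with requests equal to limits; if every container in the pod does this, the pod gets the Guaranteed QoS class and is the last candidate for eviction (names and numbers are illustrative):

```yaml
containers:
  - name: app                        # hypothetical name
    image: example/app:latest        # hypothetical image
    resources:
      requests:
        cpu: "2500m"
        memory: "4Gi"
      limits:
        cpu: "2500m"                 # equal to the request
        memory: "4Gi"                # equal to the request => Guaranteed QoS
```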

Half of My Kubernetes Cluster Is Never Used, Due to the Way Pods Are Scheduled

I have a Kubernetes cluster with 4 nodes and, ideally, my application should have 4 replicas, evenly distributed to each node. However, when pods are scheduled, they almost always end up on only two of the nodes, or if I'm very lucky, on 3 of the 4. My app has quite a bit of traffic and I would really like to use all the resources that I pay for.
I suspect the reason why this happens is that Kubernetes tries to schedule the new pods on the nodes that have the most available resources, which is nice as a concept, but it would be even nicer if it would reschedule the pods once the old nodes become available again.
What options do I have? Thanks.
You have lots of options!
First and foremost: Pod Affinity and Anti-affinity, to make sure your Pods prefer to be placed on a host that does not already have a Pod with the same label.
Second, you could set up Pod Topology Spread Constraints. This is newer and a bit more advanced, but usually a better solution than simple anti-affinity (see the sketch below, after this list).
Thirdly, you can pin your Pods to a specific node using a NodeSelector.
Finally, you could write your own scheduler or modify the default scheduler settings, but that's a more advanced topic. Don't forget to always set your resource requests correctly; they should be set to a value that more or less covers usage during peak traffic, so that a node has enough resources available to max out the Pod without interfering with other Pods.
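As mentioned above, here is a sketch of a topology spread constraint that asks for an even spread of pods across nodes (the app label is hypothetical):

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway   # use DoNotSchedule for a hard rule
      labelSelector:
        matchLabels:
          app: my-app                     # hypothetical pod label
```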

GKE Limit RAM & CPU

I am using GKE (Google-managed Kubernetes) and I have a requirement where I want to leave around 10% of memory on each node idle, so that during burst workload scenarios the pods already deployed on that node can make use of those idle resources (within their limit range).
Basically, what I want to achieve is to avoid a scenario where pods get scheduled onto a node until 100% of its resources are consumed. Assuming all the pods/services are utilizing their allocated resources (set via requests), and one of the pods hits a burst workload, or gets restarted and needs more memory during boot-up, it should be able to make use of those idle resources.
After going through the documentation I have come across this, but since GKE is a managed service these properties aren't exposed anywhere. Are there any other ways to achieve the same?
GKE is a managed service and therefore you will not be able to customize the worker node kubelet parameters like --eviction-hard or --system-reserved.
As a workaround, you need to calculate your pods' memory requests and memory limits so as to effectively cap the number of pods per node. This way you control how many pods run on each node and how much spare CPU and memory is left for your pods to use in case of a burst.
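As a rough sketch of that workaround, assume a node with about 12GiB of allocatable memory and a target of three pods per node, keeping roughly 10% of the node free (all numbers are made up for illustration):

```yaml
resources:
  requests:
    memory: "3584Mi"   # 3 pods * 3.5GiB = 10.5GiB requested, leaving ~1.5GiB idle
  limits:
    memory: "5Gi"      # lets a single pod burst into the spare headroom
```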

What does the OutOfcpu error mean in Kubernetes?

I got OutOfcpu in Kubernetes on Google Cloud - what does it mean? My pods seem to be working now, however there were pods in this same revision which got OutOfcpu.
It means that the kube-scheduler can't find any node with available CPU to schedule your pods:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests.
[...]
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
Also, as per Assigning Pods to Nodes:
If the named node does not have the resources to accommodate the pod, the pod will fail and its reason will indicate why, e.g. OutOfmemory or OutOfcpu.
In addition to how the kube-scheduler schedules pods (how-kube-scheduler-schedules-pods), I think this will be helpful for understanding why the OutOfcpu error showed up.
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
Ref: how-pods-with-resource-requests-are-scheduled
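To make the quoted case concrete, here is a sketch of a pod that can end up with the OutOfcpu reason: because nodeName bypasses the scheduler's filtering, the kubelet itself rejects the pod if the node cannot supply the requested CPU (node, pod and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-hungry              # hypothetical name
spec:
  nodeName: small-node-1        # hypothetical node; skips the scheduler's filtering step
  containers:
    - name: app
      image: example/app:latest # hypothetical image
      resources:
        requests:
          cpu: "4"              # if small-node-1 has fewer than 4 CPUs unreserved,
                                # the kubelet rejects the pod with reason OutOfcpu
```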

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources. So if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a constraint on a pod that is already scheduled and running.
If you overcommit the actual resources on a node you will run into the typical issues: overcommitting memory puts the node under memory pressure (swap is normally disabled on Kubernetes nodes, so pods get OOM-killed or evicted), and overcommitting CPU just causes general slowdown. Either way the node and the pods on it can become unresponsive. That situation is difficult to deal with, and requests and limits set up sane boundaries that keep you from taking things that far - instead you'll simply see a pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So the answers to your questions:
At most 10 pods will be scheduled onto your node. Even though the memory request is unset, when a container specifies a limit without a request, Kubernetes defaults the request to the limit, so each pod effectively requests 100MB.
If there is no free memory on the node, the evicted pods will stay Pending. Also note that k8s can evict a pod that exceeds its requests when resources are needed for other pods and services.
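To tie this back to the question, here is a sketch of the deployment described there, with the defaulting behaviour noted (names are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spiky                         # hypothetical name
spec:
  replicas: 100
  selector:
    matchLabels: { app: spiky }
  template:
    metadata:
      labels: { app: spiky }
    spec:
      containers:
        - name: app
          image: example/spiky:latest # hypothetical image
          resources:
            limits:
              memory: "100Mi"   # no explicit request, so the request is defaulted
                                # to this limit; only ~10 such pods fit on a 1GB node
```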