GKE Limit RAM & CPU - kubernetes

I am using GKE (Google-managed Kubernetes) and I have a requirement to leave around 10% of memory on each Node idle, so that during burst-workload scenarios the pods already deployed on that Node can make use of those idle resources (within their limit range).
Basically, what I want to avoid is a scenario where pods get scheduled onto a Node until 100% of its resources are consumed. Assuming all the pods/services are utilizing their allocated resources (set via requests), if one of the pods hits a burst-workload scenario, or gets restarted and needs more memory during boot-up, it should be able to make use of those idle resources.
After going through the documentation I have come across this, but since GKE is a managed service these properties aren't exposed anywhere. Are there any other ways to achieve the same?

GKE is a managed service and therefore you will not be able to customize the worker node kubelet parameters like --eviction-hard or --system-reserved.
As a workaround, you need to calculate your pods' memory requests and memory limits in order to configure a maximum number of pods per node. This way you control how many pods run on your node, and the spare CPU and memory can be used by your pods in case of a burst.
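For illustration, here is a minimal sketch of that sizing idea (all names, images and numbers are hypothetical): the request is chosen so that only a handful of pods fit on a node and roughly 10% of its memory stays free for bursts.

```yaml
# Hypothetical node with ~2.7Gi allocatable memory: with a 400Mi request,
# at most 6 such pods fit (6 * 400Mi = 2400Mi), leaving ~300Mi (~10%) of
# headroom that running pods can burst into, up to their limits.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-app                    # hypothetical name
spec:
  containers:
  - name: app
    image: gcr.io/my-project/app:latest  # placeholder image
    resources:
      requests:
        memory: "400Mi"
        cpu: "200m"
      limits:
        memory: "700Mi"                  # burst ceiling above the request
        cpu: "500m"
```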

Related

Kubernetes reserving headroom on nodes to allow memory peaks

I'm running a pod on k8s that has a baseline memory usage of about 1GB. Sometimes, depending on user behaviour, the pod can consume much more memory (10-12GB) for a few minutes, and then drop down to the baseline level.
What I would like to configure in k8s is that the memory request would be quite low (1GB), but that the pod will run on a node with a much higher memory capacity. This way, when needed, the pod will just grow and shrink back. The reason I don't configure the request to be higher is that I have multiple replicas of this pod, and ideally I would want all of them to be hosted on 1-2 nodes, letting each one peak when needed, without spending too much.
It was counterintuitive to find out that the memory limit configuration does not affect node selection, meaning if I configure the limit to be 12GB I can still get a 4GB node.
Is there any way to configure my pods to share some large nodes, so they will be able to extend their memory usage without crashing?
Resource Requests vs Limits
Memory limit doesn't affect node selection, but memory request does. This is because resource requests and resource limits serve different purposes.
The resource request makes sure that your pod has at least the requested amount of resources at all times.
The resource limit makes sure that your pod does not have more than the limit amount of resources at any time. E.g., not setting memory limits correctly leads to OOM (out-of-memory) kills on pods, and this happens very often.
Resource requests are the parameters used by the scheduler to determine how to schedule your pods onto nodes. Limits are not used for this purpose, and it is up to the application designer to set pod limits correctly.
So What Can You Do
You can set your pod request the same way you are doing currently; say, 1.5GiB of memory. You then know, from experience, that your pod can consume as much as 12GiB of memory, and you are running 2 pods of this application. Therefore, if you want to schedule both these pods on a single node and not have any issues, you have to make sure your node has:
total_memory = 2 * pod_limit + overhead (system) OR
total_memory ~= 2 * pod_limit
Overhead is the memory overhead of your idle node, which accounts for usage by OS components and cluster system components like the kubelet and kube-proxy binaries. But this is generally a very small value.
This will allow you to select the right node for your pods.
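As a sketch using the example numbers above (1.5GiB request, 12GiB limit, 2 replicas; the name and image are placeholders), the resources section would look like this:

```yaml
# Two replicas on one node need roughly total_memory ~= 2 * 12Gi + overhead,
# so you would pick a node with about 25-26GiB of memory.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spiky-app                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spiky-app
  template:
    metadata:
      labels:
        app: spiky-app
    spec:
      containers:
      - name: app
        image: registry.example.com/spiky-app:latest   # placeholder image
        resources:
          requests:
            memory: "1536Mi"       # 1.5GiB - what the scheduler uses for placement
          limits:
            memory: "12Gi"         # the peak the pod may burst to
```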
Summary
You will have to ensure your pod can get a lot more memory than it requests in order to handle that spike (but not too much, so you should set a limit). You can read more about this here.
Note: strictly speaking, pods don't shrink and grow; they have a fixed resource profile, which is determined by the resource requests and resource limits in the definition. Also note that a pod is always "occupying" the requested amount of resources, even if it is not using them.
Other Elements to Control Scheduling
Additionally, you can also use node affinities to add a "preference" for the scheduler when placing your pods. I say a preference because node affinities are not definite rules, but guidelines. You can also use anti-affinities to make sure certain pods don't get scheduled on the same node. Just keep in mind that if a pod is in danger of not being scheduled, the affinity or anti-affinity rules can possibly be ignored by the scheduler.
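For example, a preferred node affinity plus a soft pod anti-affinity could look like the sketch below, placed inside the pod's spec (the memory-class and app labels are made up for illustration):

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # a preference, not a hard rule
    - weight: 100
      preference:
        matchExpressions:
        - key: memory-class                            # hypothetical node label
          operator: In
          values: ["high"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # soft rule: spread replicas apart
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: spiky-app                             # hypothetical pod label
        topologyKey: kubernetes.io/hostname
```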
The last option is to use nodeSelector on your pod, to make sure it lands on a node that satisfies the selector's conditions. If no node matches this selector, your pod will be stuck in the Pending state, meaning it cannot be scheduled anywhere. You can read about this here.
So, after you've decided which nodes you want your pod to be scheduled on, you can label them, and use a selector to ensure the pods are scheduled on the matching node.
It is also possible to provide a specific nodeName in your pod, to force it to schedule on a particular node. This option is rarely used, simply because nodes can be added/deleted in your cluster, and this would require you to change your pod definition every time that happens. Using nodeSelector is better: you can specify a general attribute like a label, attach it to any new node you add to the cluster, and neither the pod scheduling nor the pod definition will be affected.
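A minimal nodeSelector sketch, assuming you have labelled your large nodes with a hypothetical label such as memory-class=high (for instance via kubectl label nodes <node-name> memory-class=high):

```yaml
# The pod only schedules onto nodes carrying the (hypothetical) label memory-class=high;
# if no such node exists, it stays Pending.
apiVersion: v1
kind: Pod
metadata:
  name: spiky-app
spec:
  nodeSelector:
    memory-class: "high"
  containers:
  - name: app
    image: registry.example.com/spiky-app:latest   # placeholder image
```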
Hope this helps.

Kubernetes release requested cpu

We have a Java application distributed over multiple pods on Google Cloud Platform. We also set memory requests to give the pod a certain part of the memory available on the node for heap and non-heap space.
The application is very resource-intensive in terms of CPU while the pod is starting, but barely uses the CPU after the pod is ready (only about 0.5% is used). If we use container resource "requests", the pod does not release these resources after startup has finished.
Does Kubernetes allow specifying that a pod may use (nearly) all the available CPU power during start-up and release those resources after that? Thanks to rolling updates, we can ensure that two pods are not started at the same time.
Thanks for your help.
If you specify requests without a limit, the value will be used for scheduling the pod onto an appropriate node that satisfies the requested available CPU bandwidth. The kernel scheduler will assume that the request matches the actual resource consumption, but it will not prevent usage beyond it; that excess will be "stolen" from other containers.
If you specify a limit as well, your container will get throttled if it tries to exceed that value. You can combine both to allow bursty CPU usage that exceeds the usual request, without allocating everything on the node and slowing down other processes.
"Does Kubernetes allow to specify that a pod is allowed to use
(nearly) all the cpu power available during start and release those
resources after that?"
A key word here is "available". The answer is "yes", and it can be achieved by using the Burstable QoS (Quality of Service) class. Configure the CPU request to a value you expect the container will need after starting up, and either:
configure the CPU limit higher than the CPU request, or
don't configure a CPU limit at all, in which case either the namespace's default CPU limit will apply if defined, or the container "...could use all of the CPU resources available on the Node where it is running".
If there isn't CPU available on the Node for bursting, the container won't get anything beyond the requested value, and as a result starting the application could be slower.
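A minimal sketch of such a Burstable setup (names and numbers are hypothetical): the CPU request reflects steady-state usage after start-up, while the higher limit lets the container use spare CPU on the node while it boots.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app                   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/java-app:latest   # placeholder image
    resources:
      requests:
        cpu: "250m"                # steady-state need once the app is ready
        memory: "1Gi"
      limits:
        cpu: "2"                   # start-up burst ceiling; omit to allow any spare CPU
        memory: "1Gi"
```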
It is worth mentioning what the docs explain for Pods with multiple Containers:
The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
If you are running Kubernetes v1.12+ and have access to configure the kubelet, the Node CPU Management Policies could be of interest.
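For reference only (this is not something you can configure on GKE's managed nodes), a kubelet configuration enabling the static CPU manager policy might look roughly like this sketch; the reserved value is a placeholder:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static           # default is "none"
kubeReserved:
  cpu: "500m"                      # the static policy requires some reserved CPU
```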
One factor in scheduling pods onto nodes is resource availability, and the Kubernetes scheduler calculates used resources from the request value of each pod. If you do not assign any value to the request parameter, then for this deployment the request will be zero. The request parameter doesn't ensure that the pod will actually use that much CPU or RAM; you can get the current resource usage from "kubectl top pods / nodes".
The request parameter reserves resources for a pod, whereas the limit puts a cap on a pod's resource usage.
You can get more information here: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/.
This will give you a rough idea of requests and limits.

Question about 100 pods per node limitation

I'm trying to build a web app where each user gets their own instance of the app, running in its own container. I'm new to kubernetes so I'm probably not understanding something correctly.
I will have a few physical servers to use, which in Kubernetes, as I understand, are called nodes. For each node there is a limitation of 100 pods. So if I am building the app so that each user gets their own pod, will I be limited to 100 users per physical server? (If I have 10 servers, can I only have 1000 users?) I suppose I could run multiple VMs that act as nodes on each physical server, but doesn't that defeat the purpose of containerization?
The main issue with having too many pods on a node is that it degrades the node's performance and makes it slower (and sometimes unreliable) to manage the containers; each pod is managed individually, so increasing the count takes more time and more resources.
When you create a pod, the runtime needs to keep constant track of it: probes (readiness and liveness), monitoring, routing rules and many other small bits that add up to the load on the node.
Containers also require processor time to run properly. Even though you can allocate fractions of a CPU, adding too many containers/pods will increase context switching and degrade performance once the pods are consuming their quota.
Each platform provider also sets their own limits to provide a good quality of service and SLAs. Overloading the nodes is also a risk, because a node is a single point of failure, and any fault in high-density nodes might have a huge impact on the cluster and applications.
You should consider either:
smaller nodes, adding more of them to the cluster, or
using actors instead, where each client is one actor and many actors run in a single container; to keep the load balanced around the cluster, you partition the actors across multiple container instances.
Regarding the limits, this thread has a good discussion of the concerns.
Because of the hard limit, if you have 10 servers you're limited to 1000 pods.
You might also want to count the system pods against your 1000 available pods. Usually located in the kube-system namespace, these can include (but are not limited to):
node log exporters (1 per node)
metrics exporters
kube proxy (usually 1 per node)
kubernetes dashboard
DNS (scaling according to the number of nodes)
controllers like certmanager
A pretty good rule of thumb could be 80-90 application pods per node, so 10 nodes will be able to handle 800-900 clients, assuming you don't have any other big deployments on those nodes.
If you're using containers in order to gain performance, creating VMs to act as nodes works against your goal. But if you're using containers as a way to deploy coherent environments and scale stateless applications, then using VMs as nodes can make sense.
There are no magic rules and your context will dictate what to do.
As managing a virtualization cluster and a Kubernetes cluster may skyrocket your infrastructure complexity, maybe Kubernetes is not the most efficient tool to manage your workload.
You may also want to take a look at Nomad, which does not seem to have those kinds of limitations and may provide features that are closer to your needs.

Kubernetes - NodeUnderMemoryPressure Issue

I'm very new to Kubernetes. We are using Kubernetes cluster on Google Cloud Platform.
I have created a Cluster, Services, Pods and Replication Controllers.
I have created a Horizontal Pod Autoscaler based on CPU params.
Cluster details
Default running node count is set to 3
3GB allocatable memory per node
After running for 1 hour, the Services and Nodes are showing NodeUnderMemoryPressure issues.
How can I resolve this?
If you need any more details, please ask.
Thanks
I don't know how much traffic is hitting your cluster, but I would highly recommend running Prometheus in your cluster.
Prometheus is an open-source monitoring and alerting tool, and integrates very well with Kubernetes.
This tool should give you a much better view of memory consumption, CPU usage, amongst many other monitoring capabilities, that will allow you to effectively troubleshoot these types of issues.
There are several ways to address this issue, depending on the type of your workloads.
The easiest is simply to scale your nodes, but that can be useless if there is a memory leak. Even if you are not affected by one now, you should always consider the possibility of a memory leak happening; therefore the best practice is to always introduce memory limits for pods and namespaces.
Scale the cluster
If you have many pods running and none of them is way bigger than the others, it is useful to scale your cluster horizontally; this way the number of running pods per node will decrease and the NodeUnderMemoryPressure warning should disappear.
If you are running few pods, or some of them are capable of making the cluster suffer on their own, then the only option is to scale the nodes vertically, adding a new node pool with Compute Engine instances that have more memory and possibly deleting the old one.
If your workload is correct and your memory suffers because at certain moments of the day you receive 100 times the usual traffic and create more pods to support it, you should consider making use of the Autoscaler.
Check for memory leaks
On the other hand, if it is not a "healthy" situation and you have pods consuming way more RAM than expected, then you should follow the advice of grizzthedj, understand why your pods are consuming so much, and verify whether some of your containers are affected by a memory leak; in that case, scaling the amount of RAM is useless since at some point you will run out of it anyway.
Therefore, start by identifying which pods are consuming too much and then troubleshoot why they behave this way. If you do not want to make use of Prometheus, simply SSH into the container and check with the classic Linux commands.
Limit the RAM consumed by pods
To prevent this from happening in the future, I advise you, when writing the YAML files, to always limit the amount of RAM the pods can make use of; this way you control them and are sure there is no risk that they cause the Kubernetes node agent to fail because it runs out of memory.
Consider also limiting the CPU and introducing minimum requirements (requests) of both RAM and CPU for pods, to help the scheduler place the pods properly and avoid hitting NodeUnderMemoryPressure under high workload.
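As a sketch of what that can look like at the namespace level, a LimitRange applies default requests and limits to containers that do not declare their own (all names and numbers here are placeholders):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-cpu-defaults           # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:                # applied when a container sets no request
      memory: "256Mi"
      cpu: "100m"
    default:                       # applied when a container sets no limit
      memory: "512Mi"
      cpu: "500m"
    max:                           # hard ceiling for any container in the namespace
      memory: "1Gi"
      cpu: "1"
```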

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources. So if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, while a limit applies to a pod that is already running and scheduled.
If you overcommit the actual resources on a node you will run into the typical issues: if you overcommit on memory the node will start to swap, and with CPU there will just be a general slowdown. Either way the node and the pods on it will become unresponsive. This is difficult to deal with, and tools like requests and limits set up sane boundaries that help you not take things quite that far, where instead you'll simply see the pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So the answers to your questions:
At most 10 pods will be scheduled onto your node.
If there is no free memory on the node, the evicted pods will remain pending. Also, k8s can simply evict a pod if it exceeds its limits when resources are needed for other pods and services.
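One detail worth noting for the scenario above: when a container declares a limit but no request, Kubernetes copies the limit into the request, so a 100MB limit is effectively a 100MB request for scheduling purposes. A minimal sketch (the image is a placeholder):

```yaml
# With only a limit declared, the request defaults to the same value,
# so on a 1GB node roughly 10 such pods fit (ignoring system overhead).
containers:
- name: app
  image: registry.example.com/app:latest   # placeholder image
  resources:
    limits:
      memory: "100Mi"
```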