High total CPU request but low total usage (kubernetes resources)

I have a bunch of pods in a cluster that together request almost all (7.35/8) of the available CPU on a node, even though their actual total usage is almost nothing (0.34/8).
The pod that currently requests the most asks for only 210m, which I guess is not an outrageous amount - also, I would like to enforce some sensible minimum request size for all pods in the cluster. Of course those requests add up when there are lots of pods.
It seems I could easily scale down the request by a factor of 10 and leave the limits where they are to begin with.
But is there something else that I should look into instead before doing that - reducing replica count etc.?
Also it looks a bit strange that the pods are not more evenly distributed between the nodes.

Your request values seem overestimated.
You need time and metrics to find the right request/limit for your workload.
Keep in mind that if you change those values, your pods will restart.
Also, it's normal to find some unbalanced nodes in your cluster. Kubernetes will never move a running pod to rebalance nodes unless you ask it to.
For example, if you create a cluster with 3 nodes, fill those 3 nodes with pods, and then add another 3 nodes, the new nodes will stay empty.
You can set up a HorizontalPodAutoscaler on your cluster to adapt the number of pods to your workload.
Doing that, your workload will spread among the nodes with a reasonable balance (if you use the default scheduling policy).
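As a hedged sketch of that (the Deployment name, replica bounds and threshold below are illustrative, not taken from the question), an autoscaler targeting average CPU utilization could look like this:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa                  # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                    # hypothetical Deployment to scale
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # add replicas when average usage passes 70% of the CPU request

This requires a working metrics pipeline (metrics-server) in the cluster.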

I suggest the following:
Resource allocation: Based on historical usage, set your requests to a meaningful value with some buffer. To get guaranteed resource allocation for a pod, it can also be a good idea to set request and limit to the same value, but that means the pod cannot burst beyond its request. One more thing to note is that scheduling only happens based on the requested value, so if a pod bursts above its request toward its limit and the node has no resources left, the pod can be killed and rescheduled.
Resource quotas: Check Kubernetes Resource Quotas to set sensible namespace-level quotas and keep developers from over-provisioning resources.
Affinity/anti-affinity: Check the concept of anti-affinity to have your replicas or different pods scheduled across your cluster. You can ensure, for example, that one host or availability zone has only one replica of your pod (helps with HA), or spread different pods to different nodes (see the sketch after this list) - Check this video
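For the anti-affinity point above, a minimal sketch (the Deployment name and app label are illustrative): requiring that no two replicas land on the same node.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app                  # hypothetical name
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: kubernetes.io/hostname   # at most one replica per node
          containers:
          - name: my-app
            image: my-app:latest                      # placeholder image

Using topologyKey: topology.kubernetes.io/zone instead would spread the replicas across availability zones rather than nodes.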

There are good answers already but I would like to add some more info.
It is very important to have a good strategy when calculating how much resources you would need for each container.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
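As a hedged illustration of that starting point (the numbers are made up, not measured): if a container's steady usage were around 80m CPU and 100Mi memory, its spec might look like this.

    resources:
      requests:
        cpu: 100m        # observed usage (~80m) plus some headroom
        memory: 128Mi
      limits:
        cpu: 200m        # CPU usage above the limit is throttled, not killed
        memory: 256Mi    # memory usage above the limit gets the container OOM-killed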
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
Requests and limits
Resource types
Resource requests and limits of Pod and Container
Resource units in Kubernetes
How Pods with resource requests are scheduled
How Pods with resource limits are run, etc
Just in case you'll need a reference.

Related

How to assign resource limits dynamically?

I would like to define a policy to dynamically assign resource limits to pods and containers. For example, if 4 pods are scheduled on a specific node and the node's memory capacity is 100Mi, each pod would be assigned a 25Mi memory limit - in other words, a fair share of the node capacity.
So, is it necessary to change the code in scheduler.go, or do I need to change other objects as well?
I do agree with Arslanbekov's answer: it's contrary to the ideology of scalability used by Kubernetes.
The principle is that you define what resources your application needs, and the cluster does all it can to give those resources to the pod, scaling the resources (pods, nodes) depending on the global consumption of all apps.
What you are asking is the reverse: give resources to the pod depending on the node's resources. That could make automatic scalability of the nodes very difficult, as node capacity would itself become the target to reach (I may be confusing in my explanation, but that shows how difficult it could be).
One way to do roughly what you want would be to size all your pods the same so that together they use 80% of the node, but this would prove wrong if an app needs more resources.
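If the goal is simply "every pod gets the same size" rather than a true per-node fair share, one hedged option that needs no scheduler changes is a LimitRange that applies a default request/limit to containers that don't declare their own (the namespace and values below are illustrative):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-sizing          # hypothetical name
      namespace: my-namespace       # hypothetical namespace
    spec:
      limits:
      - type: Container
        defaultRequest:             # applied when a container sets no request
          cpu: 100m
          memory: 25Mi
        default:                    # applied when a container sets no limit
          cpu: 250m
          memory: 25Mi

Note this is static sizing per container; it does not recompute the share as the number of pods on a node changes.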
I think this is contrary to the ideology of Kubernetes. With this approach, a new application would not be able to get onto the node:
at each point in time, from the scheduler's point of view, every node would be at 100% utilization.

Kubernetes "nice" a pod's CPU usage

I have a cluster w/ 3 nodes. Hurray! The cluster doesn't autoscale nodes.
These nodes run an amazing web app, yet most of the time do almost nothing.
I also have a background process that could use an infinite amount of CPU (the usefulness drops rapidly but remains useful).
I want these background pods to run on each Node and be slowed down to leave 20% CPU headroom on the Node. Or similar.
That's the shape of a DaemonSet.
Can I tell Kubernetes to deprioritize the DaemonSet Pods w/ a 20% headroom?
Can the DaemonSet Pods detect the Node's CPU usage and deprioritize themselves (risky if buggy)?
QoS looks like it's for scheduling and evicting pods to make room for other pods, but they don't get 'niced'.
Priority also looks like it's for eviction.
You may achieve what you're looking for in many ways.
I imagine that you've already read this and that, based on the theory of this other.
Also, Red Hat has nice documentation about setting hardware limits via software.
Here you can find how to restrict CPU usage, which may be set inside a container to achieve what you're looking for.
So, to recap: with K8S you can set requests and limits, and inside the container you can set even further restrictive limits.
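A sketch of the Kubernetes side of that (names and values are illustrative): a DaemonSet whose container requests very little CPU and sets no CPU limit can soak up idle CPU, but is throttled back in favour of other pods when the node is busy, because the CPU shares the kernel uses for arbitration are derived from the requests.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: background-cruncher            # hypothetical name
    spec:
      selector:
        matchLabels:
          app: background-cruncher
      template:
        metadata:
          labels:
            app: background-cruncher
        spec:
          containers:
          - name: worker
            image: background-worker:latest   # placeholder image
            resources:
              requests:
                cpu: 50m          # small request = small CPU share under contention
                memory: 64Mi
              # no CPU limit: free to use idle CPU, yields to other pods when contended

This does not enforce an exact 20% headroom; for that, the in-container throttling mentioned above (or a CPU limit sized per node) would still be needed.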
Hope this gives you the solution or at least the path to follow in order to achieve what you want.

How exactly k8s reserves resources for a namespace?

I have the following questions regarding request/limit quota for ns:
Considering the following namespace resource setup:
- request: 1 core/1GiB
- limit: 2 core/2GiB
Does it mean a namespace is guaranteed to have 1 core/1GiB? How is that achieved physically on the cluster nodes? Does k8s somehow strictly reserve these values for a namespace (at the time it's created)? At which point in time does the reservation take place?
Limit 2 core/2GiB - does it mean it's not guaranteed for a namespace and depends on the cluster's current state? For example, if the cluster currently has only 100MiB of free RAM available, but at runtime a pod needs 200MiB above its resource request - will the pod be restarted? Where does k8s take this resource from if a pod needs to go above its request?
Regarding namespace granularity and k8s horizontal autoscaling: consider we have 2 applications and 2 namespaces - 1 namespace per app. We set both namespace quotas such that there's some free buffer for 2 extra pods, and horizontal autoscaling up to 2 pods at a certain CPU threshold. Is there really a point in such a setup? My concern is that if a namespace reserves its resources and no other namespace can utilize them, we could just create 2 extra pods in each namespace's replica set with no autoscaling and use those pods constantly. I can see a point in using autoscaling if we have more than 1 application in 1 namespace, so that these apps could share the same resource buffer for scaling. Is this assumption correct?
Do you think it is good practice to have 1 namespace per app? Why?
P.S. I know what resource requests/limits are and the difference between them; most sources give only a very high-level explanation of the concept.
Thanks in advance.
The docs clearly state the following:
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces, there may be contention for resources. This is handled on a first-come-first-served basis.
and
ResourceQuotas are independent of the cluster capacity. They are expressed in absolute units. So, if you add nodes to your cluster, this does not automatically give each namespace the ability to consume more resources.
and
resource quota divides up aggregate cluster resources, but it creates no restrictions around nodes: pods from several namespaces may run on the same node
A ResourceQuota is a constraint set on a namespace and does not reserve capacity; it just sets a limit on the resources that can be consumed within that namespace.
To effectively "reserve" capacity, you have to set such restrictions on all namespaces, so that other namespaces do not use more resources than your cluster can provide. This way you have a stronger guarantee that a namespace will have the capacity available to run its load.
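For reference, the quota described in the question would look roughly like this (the namespace name is illustrative):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: app-ns              # hypothetical namespace
    spec:
      hard:
        requests.cpu: "1"            # cap on the sum of all pod CPU requests in the namespace
        requests.memory: 1Gi
        limits.cpu: "2"              # cap on the sum of all pod CPU limits in the namespace
        limits.memory: 2Gi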
The docs suggests:
Proportionally divide total cluster resources among several teams (namespaces).
Allow each team to grow resource usage as needed, but have a generous limit to prevent accidental resource exhaustion.
Detect demand from one namespace, add nodes, and increase quota.
Given that, the answers to your questions are:
It is not reserved capacity; the reservation happens at resource (pod) creation time.
Running resources are not affected after the reservation. New resources are rejected if creating them would overcommit the quota (limits).
As stated in the docs, if the quotas are higher than the capacity, reservation happens on a first-come-first-served basis.
This could be its own question on SO; in simple terms, it's for resource isolation and management.

Define a priority for termination of K8s pods

I have a bunch of pods running in the same cluster. Sometimes there are not enough resources and some pods need to terminate.
That's OK, but how do I set the priority of which pods are killed first?
It usually kills my most important service first :\
Thanks!
I suggest you take a look at resource QoS.
Have your important stuff (including monitoring) specify limit=request, which in turn will land it in the Guaranteed QoS class.
Specifically,
The system computes pod level requests and limits by summing up per-resource requests and limits across all containers. When request == limit, the resources are guaranteed (...)
Also, overstepping CPU limits only results in throttling, so it's more important to get memory limits (per container) right.
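A hedged sketch of what that looks like in a pod spec (name, image and values are illustrative): with requests equal to limits for every container, the pod is classed as Guaranteed QoS and is among the last to be evicted under node pressure.

    apiVersion: v1
    kind: Pod
    metadata:
      name: important-service              # hypothetical name
    spec:
      containers:
      - name: app
        image: important-service:latest    # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 500m        # equal to the request
            memory: 512Mi    # equal to the request -> Guaranteed QoS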

Is there any tool for GKE node autoscaling based on total pods requested in Kubernetes?

When I resize a replication controller using kubectl, if the cluster does not have enough resources, one or more pods will stay pending forever.
Is there any tool that will automatically resize the GKE cluster when resources are running out?
I had a similar requirement (for the Go build system): wanted to know when scheduled vs. available CPU or memory was > 1, and scale out nodes when that was true (or, more accurately, when it was ~.8). There's not a built-in metric, but as you suggest you can do it with a custom metric.
This was all done in Go, but it will give you the basic idea:
Create the metrics (memory and CPU, in my case)
Put values to the metrics
The key takeaway IMO is that you have to iterate over each pod in the cluster to determine how much capacity is consumed, then iterate over each node in the cluster to determine how much capacity is available. It's then just a matter of pointing your autoscaler to the custom metric(s).
Big big big thing worth noting: I ultimately determined that scaling on the built-in CPU utilization metric was just as good as (if not better than, but more on that in a bit) the custom metric. Each pod we scheduled pegged the CPU, so when pods were maxed out so was CPU. The built-in CPU utilization metric is probably better because you don't have the latency that comes with periodically putting custom metrics.
You can turn on autoscaling for the Instance Group that your GKE nodes belong to.