I would like to define a policy that dynamically assigns resource limits to pods and containers. For example, if 4 pods are scheduled on a specific node and its memory capacity is 100Mi, each pod should be assigned a 25Mi memory limit. In other words, each pod gets a fair share of the node's capacity.
So, is it enough to change the code in scheduler.go, or do I need to change other objects as well?
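The fair-share arithmetic described in the question can be sketched as follows. This is only an illustration of the calculation, not scheduler code; the function names are made up for the example and only `Mi` quantities are handled.

```python
def parse_mebibytes(quantity: str) -> int:
    """Parse a quantity such as '100Mi' into an integer number of MiB."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"only Mi quantities are handled in this sketch: {quantity!r}")
    return int(quantity[:-2])


def fair_share_limit(node_capacity: str, pod_count: int) -> str:
    """Split the node's memory evenly across the pods currently on it."""
    if pod_count <= 0:
        raise ValueError("pod_count must be positive")
    # Integer division rounds down, so the shares never add up to more than capacity.
    return f"{parse_mebibytes(node_capacity) // pod_count}Mi"


print(fair_share_limit("100Mi", 4))  # 25Mi
```

Note that every time a pod is added to or removed from the node, every other pod's limit would have to be recomputed.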
I agree with Arslanbekov's answer: this is contrary to the scalability model Kubernetes is built on.
The principle is that you declare the resources your application needs and the cluster does all it can to give those resources to the pod, scaling pods and nodes depending on the overall consumption of all apps.
What you are asking for is the reverse: giving resources to the pod depending on the node's resources. That would make automatic scaling of the nodes very difficult, because the node's capacity would become the target to reach rather than the workload's demand (my explanation may be confusing, but that shows how difficult it could be).
One way to do what you want would be to give all your pods the same size so that together they use 80% of the node, but this breaks down as soon as an app needs more resources.
I think this is contrary to the philosophy of Kubernetes. With this approach, a new application would not be able to get onto the node.
From the scheduler's point of view, every node would be 100% utilized at every point in time.
Related
I am fairly new to Kubernetes, and I think I understand the basics of provisioning nodes and setting memory limits for pods. Here's the problem I have: my application can require dramatically different amounts of memory, depending on the input (and there is no fool-proof way to predict it). Some jobs require 50MB, some require 50GB. How can I set up my K8s deployment to handle this situation?
I have one strategy that I'd like to try out, but I don't know how to do it: start with small instances (nodes with not a lot of memory), and if the job fails with out-of-memory, then automatically send it to increasingly bigger instances until it succeeds. How hard would this be to implement in Kubernetes?
Thanks!
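The retry strategy described in the question can be sketched as a simple escalation loop. Everything here is hypothetical: `OutOfMemory`, `run_with_escalation` and `fake_job` are illustrative names, and a real setup would have to submit the job to the cluster and detect an OOM kill itself.

```python
from typing import Callable, Sequence, Tuple


class OutOfMemory(Exception):
    """Hypothetical signal that a job attempt ran out of memory."""


def run_with_escalation(run_job: Callable[[int], str],
                        tiers_mib: Sequence[int]) -> Tuple[int, str]:
    """Try the job on each memory tier in order until one attempt succeeds."""
    if not tiers_mib:
        raise ValueError("at least one memory tier is required")
    for limit in tiers_mib:
        try:
            return limit, run_job(limit)
        except OutOfMemory:
            continue  # move on to the next, larger tier
    raise OutOfMemory(f"job did not fit in the largest tier ({tiers_mib[-1]} MiB)")


def fake_job(limit_mib: int) -> str:
    # Stand-in for a real job: pretends to need 3000 MiB.
    if limit_mib < 3000:
        raise OutOfMemory()
    return "done"


print(run_with_escalation(fake_job, [512, 2048, 8192, 51200]))  # (8192, 'done')
```

The cost of this approach is that a large job is run and killed once per smaller tier before it finally succeeds.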
Natively, K8s supports horizontal autoscaling, i.e. automatically deploying more replicas of a deployment based on a chosen metric such as CPU or memory usage: Horizontal Pod Autoscaling
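As a rough illustration, the scaling rule given in the Horizontal Pod Autoscaler documentation is the current replica count multiplied by the ratio of the current metric to the target metric, rounded up. The sketch below shows only that core formula; the real controller also applies a tolerance and min/max replica bounds, which are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 replicas averaging 90% CPU against a 60% target
print(desired_replicas(4, 90, 60))  # 6
```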
What you are describing here though is vertical scaling. It is not supported out of the box, but there is a subproject that seems to be able to fulfill your requirements: vertical-pod-autoscaler
I have a bunch of pods in a cluster that are requesting almost all (7.35/8) of the available CPU resources on a node:
even though their actual total usage is almost nothing (0.34/8).
The pod that is currently requesting the most only requests 210m which I guess is not an outrageous amount - also I would like to enforce some sensible minimum request size for all pods in the cluster. Of course that will accumulate when there are lots of pods.
It seems I could easily scale down the request by a factor of 10 and leave the limits where they are to begin with.
But is there something else that I should look into instead before doing that - reducing replica count etc.?
Also it looks a bit strange that the pods are not more evenly distributed between the nodes.
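To put the numbers from the question side by side, here is a small sketch that converts CPU quantities to millicores and compares requested versus used capacity. The helper names are made up for the example, and the "divide by 10" step simply mirrors the idea floated above; it is not a recommendation.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '210m' or '7.35' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def percent_of(part_m: int, whole_m: int) -> float:
    """Express part_m as a percentage of whole_m."""
    return 100 * part_m / whole_m


capacity = cpu_to_millicores("8")
requested = cpu_to_millicores("7.35")
used = cpu_to_millicores("0.34")

print(percent_of(requested, capacity))        # share of the node that is requested
print(percent_of(used, capacity))             # share of the node actually in use
print(percent_of(requested // 10, capacity))  # requested share after scaling requests down 10x
```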
Your request values seem overestimated.
You need time and metrics to find the right request/limit for your workload.
Keep in mind that if you change those values, your pods will restart.
Also, it's normal to find some unbalanced nodes in your cluster. Kubernetes will never remove a running pod unless you ask it to.
For example, if you create a cluster with 3 nodes, fill those 3 nodes with pods, and then add another 3 nodes, the new nodes will stay empty.
You can set up a HorizontalPodAutoscaler on your cluster to adapt the number of pods to your workload.
That way, your workload will spread across the nodes with a reasonable balance (if you use the default scheduling policy).
I suggest the following:
Resource Allocation: Based on historical values, set your requests to a meaningful value with some buffer. To get guaranteed resource allocation for a pod, it may also be a good idea to set request and limit to the same value, but that means your pod cannot burst beyond its request. One more thing to note is that scheduling is based only on the requested value, so if a pod tries to burst toward its limit on a node that has no resources left, the pod may be killed and rescheduled.
Resource quotas: Check Kubernetes Resource Quotas to set sensible namespace-level quotas that keep developers from over-provisioning resources.
Affinity/AntiAffinity: Check the concept of anti-affinity to have your replicas or different pods spread across your cluster. You can ensure, for example, that one host or availability zone runs only one replica of your pod (which helps with HA), or spread different pods across different nodes (layer scheduling etc.) - Check this video
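The point about setting request and limit to the same value relates to how Kubernetes assigns a QoS class to a pod. The sketch below approximates that rule for CPU and memory. It is a simplification: the input format is made up for the example, and quantities are compared as plain strings, so "1" and "1000m" would wrongly count as different.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Approximate the QoS class of a pod from its containers' resources."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))  # Guaranteed
print(qos_class([{"requests": same}]))                  # Burstable
print(qos_class([{}]))                                  # BestEffort
```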
There are good answers already but I would like to add some more info.
It is very important to have a good strategy for calculating how many resources each container needs.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
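The 25% margin rule of thumb can be written down as a small check. This is just a sketch of that heuristic; the function name and the labels it returns are made up for the example, and usage and request only need to be in the same unit.

```python
def request_fit(usage: float, request: float, margin: float = 0.25) -> str:
    """Classify observed usage against a request with a +/- margin band."""
    low = request * (1 - margin)
    high = request * (1 + margin)
    if usage < low:
        return "over-provisioned"   # request is wasting resources
    if usage > high:
        return "under-provisioned"  # request is too low, risk of performance issues
    return "ok"


# Request of 200 (e.g. millicores): the acceptable band is 150 to 250.
print(request_fit(100, 200))  # over-provisioned
print(request_fit(180, 200))  # ok
print(request_fit(260, 200))  # under-provisioned
```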
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
Requests and limits
Resource types
Resource requests and limits of Pod and Container
Resource units in Kubernetes
How Pods with resource requests are scheduled
How Pods with resource limits are run, etc
Just in case you'll need a reference.
// I'm almost certain this must be a dup or at least a solved problem, but I could not find what I was after through searching the many k8 communities.
We have jobs that run for between a minute and many hours. Given that we assign them resource values that afford them QoS Guaranteed status, how could we minimize resource waste across the nodes?
The problem is that downscaling rarely happens, because each node eventually gets assigned one of the long-running jobs. They are not common, but they keep all of the nodes running, even when we have no need for them.
The dumb strategy that seems to avoid this would be a depth-first scheduling algorithm, wherein among the nodes that have capacity, the one that is already most filled is assigned first. In other words, if we have two nodes in total, one with 90% of its CPU/memory assigned and one with 10%, the 90% node would always be chosen first, provided it has sufficient capacity.
Open to any input here and/or ideas. Thanks kindly.
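The "most filled node first" idea from the question can be sketched like this. It is a toy model with a single resource dimension and made-up node data, not how kube-scheduler is implemented.

```python
from typing import Dict, Optional


def pick_most_allocated(nodes: Dict[str, Dict[str, int]],
                        pod_request: int) -> Optional[str]:
    """Among nodes with enough free capacity, pick the one already most filled.

    nodes maps a node name to {"capacity": ..., "allocated": ...} in one unit.
    Returns None when no node can fit the pod.
    """
    best_name = None
    best_fraction = -1.0
    for name, node in nodes.items():
        free = node["capacity"] - node["allocated"]
        if free < pod_request:
            continue  # the pod does not fit here
        fraction = node["allocated"] / node["capacity"]
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name


nodes = {
    "node-a": {"capacity": 100, "allocated": 90},
    "node-b": {"capacity": 100, "allocated": 10},
}
print(pick_most_allocated(nodes, 5))   # node-a
print(pick_most_allocated(nodes, 20))  # node-b
```

With this policy the lightly loaded node only receives pods that do not fit elsewhere, which gives it a chance to drain and be removed.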
As of now there seems to be this kube-scheduler profile plugin:
NodeResourcesMostAllocated: Favors nodes that have a high allocation of resources.
But it has been in alpha since k8s v1.18, so it is probably not safe to use in production.
There is also this parameter you can set for kube-scheduler that I have found here:
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
and here is an example on how to configure it.
One last thing that comes to my mind is using node affinity.
Using nodeAffinity on long-running pods (more specifically, preferredDuringSchedulingIgnoredDuringExecution) will make the scheduler prefer to place these pods on the nodes that run all the time and prefer not to place them on nodes that are being autoscaled. This approach requires excluding some nodes from autoscaling and labeling them appropriately so that the scheduler can make use of node affinity.
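As a sketch of what such a preference looks like in a pod spec, the snippet below builds the affinity fragment as a Python dictionary and prints it as JSON. The label key and value (`example.com/pool`, `static`) are hypothetical; you would use whatever label you put on the nodes that are excluded from autoscaling.

```python
import json


def prefer_stable_nodes(label_key: str, label_value: str, weight: int = 100) -> dict:
    """Build a pod-spec fragment that prefers nodes carrying the given label."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "affinity": {
            "nodeAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": weight,
                        "preference": {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [label_value],
                                }
                            ]
                        },
                    }
                ]
            }
        }
    }


# Hypothetical label for the always-on node pool.
fragment = prefer_stable_nodes("example.com/pool", "static")
print(json.dumps(fragment, indent=2))
```

Because the rule is only a preference, long-running pods can still land on autoscaled nodes when the stable ones are full.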
I have a cluster w/ 3 nodes. Hurray! The cluster doesn't autoscale nodes.
These nodes run an amazing web app, yet most of the time do almost nothing.
I also have a background process that could use an infinite amount of CPU (the usefulness drops rapidly but remains useful).
I want these background pods to run on each Node and slowed down to leave a 20% CPU headroom on the Node. Or similar.
That's the shape of a DaemonSet.
Can I tell Kubernetes to deprioritize the DaemonSet Pods w/ a 20% headroom?
Can the DaemonSet Pods detect the Node's CPU usage and deprioritize themselves (risky if buggy)?
QoS looks like it's for scheduling and evicting pods to make room for other pods, but they don't get 'niced'.
Priority also looks like it's for eviction.
You may achieve what you're looking for in many ways.
I imagine that you've already read this and that, based on the theory of this other.
Also, Red Hat has nice documentation about setting hardware limits via software.
Here you can find how to restrict cpu usage, which may be set inside a container to achieve what you're looking for.
So, to recap: with K8S you can set requests and limits, and inside the container you can set even further restrictive limits.
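As one possible reading of the "set limits" option, the sketch below computes a CPU limit for the background pod that leaves a fixed percentage of the node untouched. The function name is made up, and note the assumption: a static limit like this caps the background pod at 80% of the node regardless of what the web app is doing, so it is only an approximation of the 20% headroom asked for.

```python
def background_cpu_limit(node_millicores: int, headroom_percent: int = 20) -> str:
    """Return a CPU limit string that leaves headroom_percent of the node free."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic keeps the result an exact millicore value.
    limit = node_millicores * (100 - headroom_percent) // 100
    return f"{limit}m"


print(background_cpu_limit(8000))  # 6400m on an 8-core node
print(background_cpu_limit(4000))  # 3200m on a 4-core node
```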
Hope this gives you the solution or at least the path to follow in order to achieve what you want.
When I resize a replication controller using kubectl and the cluster does not have enough resources, one or more pods stay in Pending indefinitely.
Is there any tool that will automatically resize a GKE cluster when resources are running out?
I had a similar requirement (for the Go build system): wanted to know when scheduled vs. available CPU or memory was > 1, and scale out nodes when that was true (or, more accurately, when it was ~.8). There's not a built-in metric, but as you suggest you can do it with a custom metric.
This was all done in Go, but it will give you the basic idea:
Create the metrics (memory and CPU, in my case)
Put values to the metrics
The key takeaway IMO is that you have to iterate over each pod in the cluster to determine how much capacity is consumed, then iterate over each node in the cluster to determine how much capacity is available. It's then just a matter of pointing your autoscaler to the custom metric(s).
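The original Go code is not reproduced here, but the calculation it describes can be sketched in a few lines. The sketch assumes you have already collected per-pod requests and per-node allocatable capacity (for example in millicores); fetching them from the cluster and publishing the custom metric are left out, and the function names are made up.

```python
from typing import Iterable


def scheduled_ratio(pod_requests: Iterable[int],
                    node_allocatable: Iterable[int]) -> float:
    """Ratio of capacity requested by all pods to capacity offered by all nodes."""
    total_requested = sum(pod_requests)
    total_available = sum(node_allocatable)
    if total_available == 0:
        raise ValueError("no node capacity available")
    return total_requested / total_available


def should_scale_out(ratio: float, threshold: float = 0.8) -> bool:
    """Scale out once the ratio reaches the threshold (~0.8 in the answer)."""
    return ratio >= threshold


# Three pods requesting 2000m, 2000m and 2500m on two 4000m nodes.
ratio = scheduled_ratio([2000, 2000, 2500], [4000, 4000])
print(ratio, should_scale_out(ratio))  # 0.8125 True
```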
Big big big thing worth noting: I ultimately determined that scaling on the built-in CPU utilization metric was just as good as (if not better than, but more on that in a bit) the custom metric. Each pod we scheduled pegged the CPU, so when pods were maxed out so was CPU. The built-in CPU utilization metric is probably better because you don't have the latency that comes with periodically putting custom metrics.
You can turn on autoscaling for the Instance Group that your GKE nodes belong to.