Resource Allocation in Kubernetes: How are pods scheduled?

In Kubernetes, the role of the scheduler is to find a suitable node for each pod. Once a pod is assigned to a node, it shares that node with other pods, and those pods compete for resources. Given this competition, how does Kubernetes allocate resources? Is there any source code in Kubernetes that computes resource allocation?

I suppose you can take a look at the articles below to see whether they answer your query:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_algorithm.md#ranking-the-nodes
https://jvns.ca/blog/2017/07/27/how-does-the-kubernetes-scheduler-work/
The filtered nodes are considered suitable to host the Pod, and often more than one node remains. Kubernetes prioritizes the remaining nodes to find the "best" one for the Pod. The prioritization is performed by a set of priority functions. For each remaining node, a priority function gives a score on a scale from 0 to 10, with 10 representing "most preferred" and 0 "least preferred". Each priority function is weighted by a positive number, and the final score of each node is calculated by adding up all the weighted scores. For example, suppose there are two priority functions, priorityFunc1 and priorityFunc2, with weighting factors weight1 and weight2 respectively; the final score of some NodeA is:
finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
After the scores of all nodes are calculated, the node with the highest score is chosen as the host of the Pod. If more than one node ties for the highest score, one of them is chosen at random.
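As a rough illustration of that ranking step, here is a minimal Python sketch (the real kube-scheduler is written in Go, and all names below are made up for the example):

import random

# Minimal sketch: sum the weighted scores of each priority function per node,
# then pick the highest-scoring node, breaking ties at random.
def pick_node(nodes, priority_funcs, pod):
    # priority_funcs is a list of (func, weight) pairs; each func(pod, node)
    # is assumed to return a score between 0 and 10.
    scored = [
        (node, sum(weight * func(pod, node) for func, weight in priority_funcs))
        for node in nodes
    ]
    best = max(score for _, score in scored)
    return random.choice([node for node, score in scored if score == best])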
Currently, the Kubernetes scheduler provides some practical priority functions, including:
LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. (A rough numeric sketch of this function and of BalancedResourceAllocation follows after this list.)
CalculateNodeLabelPriority: Prefer nodes that have the specified label.
BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
CalculateSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
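To make the resource-based formulas above concrete, here is a rough numeric sketch in Python. It is illustrative only: the actual implementations live in the Go scheduler code, and the BalancedResourceAllocation formula here is only an approximation of the behaviour described above.

def least_requested_score(capacity, requested):
    # Score 0-10: higher when a larger fraction of the node would remain free.
    return 10 * (capacity - requested) / capacity

def balanced_resource_score(cpu_fraction, mem_fraction):
    # Score 0-10: higher when the CPU and memory utilization fractions are close together.
    return 10 * (1 - abs(cpu_fraction - mem_fraction))

# Example node: 4000m CPU / 8 GiB memory capacity, with 3000m CPU and 2 GiB
# already requested, and a new Pod requesting 500m CPU and 512 MiB.
cpu_score = least_requested_score(4000, 3000 + 500)            # ~1.25 (CPU mostly claimed)
mem_score = least_requested_score(8192, 2048 + 512)            # ~6.9  (memory mostly free)
least_requested = (cpu_score + mem_score) / 2                  # CPU and memory weighted equally
balanced = balanced_resource_score(3500 / 4000, 2560 / 8192)   # fractions far apart -> low score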

Related

How to setup Kubernetes HPA to scale based on maximum available memory in a given pod?

I’d like to autoscale the pods not based on average memory, but rather based on the largest amount of available memory in any given pod.
Example:
Let’s say the target maximum available memory is 50%.
If we already have 7 pods and 6 of them have 90% of their memory occupied, but a single pod has only 40% of its memory occupied, that would satisfy my criterion and we won't need to scale up. But the moment that last pod drops below 50% available memory, we'll scale up.
I know it’s not a wise scaling criterion in the majority of cases, but in my particular circumstances it fits.

How to configure Kubernetes cluster autoscaler to scale down only?

I'd like to run the kubernetes cluster autoscaler so that unneeded nodes will be removed automatically, but I don't want the autoscaler to add nodes automatically. I prefer to handle scaling up myself. Is this possible?
I found maxNodesTotal, but I worry the semantics of setting this to 0 might mean all my nodes will go away. I also found scaleDownEnabled, but no corresponding option for scaling up.
The Kubernetes Cluster Autoscaler (CA) will attempt to scale up whenever it identifies pending pods that are waiting to be scheduled but request more resources (CPU/RAM) than any available node can serve.
You can use the maxNodesTotal parameter to limit the maximum number of nodes the CA is allowed to spin up.
For example, if you don't want your cluster to consist of more than 3 nodes during peak utilization, you would set maxNodesTotal to 3.
There are different considerations that you should be aware of in terms of cost savings, performance and availability.
I will list some considerations related to cost savings and efficient utilization, as I suspect that is the aspect you are most interested in.
Make sure you size your pods consistently with their actual utilization, because scale-up is triggered by a Pod's resource requests, not by its actual resource utilization (a rough sketch of this follows below).
Also, bigger Pods are less likely to fit together on the same node, and in addition the CA won't be able to scale down any semi-utilized nodes, resulting in wasted spend.
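Here is a small hedged sketch of that point, with made-up node names and numbers: the Cluster Autoscaler decides based on whether the pending Pod's requests fit into any node's unreserved capacity (allocatable minus already-requested), not on how busy the nodes really are.

# Illustrative only: scale-up is driven by requests vs. allocatable capacity,
# not by observed utilization. All names and values here are invented.
nodes = [
    {"name": "node-1", "allocatable_cpu_m": 4000, "requested_cpu_m": 3600},
    {"name": "node-2", "allocatable_cpu_m": 4000, "requested_cpu_m": 3800},
]
pending_pod_request_m = 500  # the pending Pod requests 500m CPU

fits_somewhere = any(
    n["allocatable_cpu_m"] - n["requested_cpu_m"] >= pending_pod_request_m
    for n in nodes
)
# Even if the nodes are nearly idle in practice, the request does not fit anywhere,
# so the autoscaler would add a node (subject to maxNodesTotal).
print("scale up needed:", not fits_somewhere)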
Since you tagged this question with EKS, I will assume you are on AWS. On AWS, the ASG (Auto Scaling Group) for each NodeGroup has a Max setting that is honoured by the cluster autoscaler. You can set this to prevent scaling above the set number of nodes. If the Min and Max on the ASG are the same value, the autoscaler will never scale up or down. If the Min and Max are different, the autoscaler can scale both up and down within that range. This is not exactly "never scale up", but it limits the upper end.
If you have multiple NodeGroups (ASGs), then each one can have different Min and Max nodes values.
You can also configure the cluster autoscaler itself in different ways. For example, you can set the utilization threshold: if a node's utilization falls below this threshold, the cluster autoscaler considers the node for scale-down. See the FAQ.
The FAQ entry above that one may also apply. You can add an annotation to any node you do not want the cluster autoscaler to consider for scale-down: kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-disabled=true, or annotate the nodes as they are created. You can do this with entries in your AWS node group setup.

Kubernetes HPA Auto Scaling Velocity

We have defined an HPA for an application with min 1 and max 4 replicas and 80% CPU as the threshold.
What we wanted was: if a pod's CPU goes beyond 80%, the app should be scaled up one replica at a time.
Instead, what is happening is that the application gets scaled up to the maximum number of replicas.
How can we define the scaling velocity so that only one pod is added at a time? And if one of the pods again consumes more than 80% CPU, scale up one more pod, but not up to the maximum number of replicas.
Let me know how we can achieve this.
First of all, the 80% CPU utilisation is not a threshold but a target value.
The HPA algorithm for calculating the desired number of replicas is based on the following formula:
X = N * (C/T)
Where:
X: desired number of replicas
N: current number of replicas
C: current value of the metric
T: target value for the metric
In other words, the algorithm aims at calculating a replica count that keeps the observed metric value as close as possible to the target value.
In your case, this means if the average CPU utilisation across the pods of your app is below 80%, the HPA tends to decrease the number of replicas (to make the CPU utilisation of the remaining pods go up). On the other hand, if the average CPU utilisation across the pods is above 80%, the HPA tends to increase the number of replicas, so that the CPU utilisation of the individual pods decreases.
The number of replicas that are added or removed in a single step depends on how far apart the current metric value is from the target value and on the current number of replicas. This decision is internal to the HPA algorithm and you can't directly influence it. The only contract that the HPA has with its users is to keep the metric value as close as possible to the target value.
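A minimal Python sketch of that formula (the real controller, described in the documentation linked below, additionally rounds up to whole replicas and applies a tolerance before acting):

import math

def desired_replicas(current_replicas, current_metric, target_metric):
    # X = N * (C / T), rounded up to a whole number of replicas.
    return math.ceil(current_replicas * current_metric / target_metric)

# Example: 2 replicas averaging 90% CPU with an 80% target -> scale to 3 replicas,
# which should pull the average utilisation back towards the target.
print(desired_replicas(2, 90, 80))  # 3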
If you need a very specific autoscaling behaviour, you can write a custom controller (or operator) to autoscale your application instead of using the HPA.
This - https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details - explains the algorithm the HPA uses, including the formula for calculating the number of "desired replicas".
If I recall, there were some (positive) changes to the HPA algo with v1.12.
As of today, the HPA has total control over scale-up. You can only fine-tune the scale-down operation with the following parameter:
--horizontal-pod-autoscaler-downscale-stabilization
The good news is that there is a proposal for Configurable scale up/down velocity for HPA

Pods and nodes CPU usage in Kubernetes

Is the total CPU usage of a node the sum of the usage of all pods running on that node? What is the relation between millicpu (millicores of CPU) and the % CPU usage of a node? Do request and limit control the CPU usage of pods? If so, when a pod's CPU usage reaches its limit, is the pod killed and moved to another node, or does it continue executing on the same node, capped at the limit?
Millicores is an absolute number (one core divided by 1000). A given node typically has multiple cores, so the relation between the number of millicores and the total percentage varies. For example 1000 millicores (one core) would be 25% on a four core node, but 50% on a node with two cores.
Request determines how much CPU a pod is guaranteed. The pod will not be scheduled onto a node unless the node can deliver that much.
Limit determines how much CPU a pod can get. The pod will not be killed or moved if it hits the limit; it is simply not allowed to exceed it (its CPU is throttled instead).
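A small worked example of the millicore-to-percentage relation described above (the numbers are arbitrary):

def node_cpu_percent(millicores, node_cores):
    # What share of a node's total CPU a given millicore amount represents.
    return 100 * millicores / (node_cores * 1000)

print(node_cpu_percent(1000, 4))  # 1000m (one core) = 25% of a 4-core node
print(node_cpu_percent(1000, 2))  # the same 1000m = 50% of a 2-core node
print(node_cpu_percent(250, 4))   # a 250m request = 6.25% of a 4-core node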

Does TORQUE work on heterogeneous clusters

Does TORQUE work on heterogeneous clusters?
I would like to use a set of old servers I have at home as a cluster, but they do not have the same characteristics (number of CPUs, memory, etc.).
Torque schedules jobs to a cluster where each node has an individual set of attributes collected by the PBS agents on the nodes. Those attributes include
ncpus - number of CPU cores
physmem - amount of physical memory
totmem - amount of available memory including swap
The following two attributes are used for scheduling decisions
np - number of processing units (by default it is set to ncpus)
gpus - number of graphics accelerators
Additional boolean properties can be set by the cluster administrator to help users and administrators influence node selection. Thus, if you have a cluster whose nodes have different CPU counts, Torque can ensure that nodes are not overloaded by matching job requirements against the available CPUs on each node.
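As a rough illustration of that matching (this is not Torque source code; the node names and numbers are invented), a scheduler can filter out nodes whose free processing units (np minus what is already assigned) cannot cover a job's request:

nodes = [
    {"name": "old-server-1", "np": 4,  "assigned": 3},   # small box, nearly full
    {"name": "old-server-2", "np": 16, "assigned": 10},  # bigger box, room left
]

def candidates(job_ppn, nodes):
    # Nodes that still have enough free processing units for the job.
    return [n["name"] for n in nodes if n["np"] - n["assigned"] >= job_ppn]

print(candidates(2, nodes))  # -> ['old-server-2']; the 4-core node would be overloaded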