What metrics does the Kubernetes scheduler depend on?

Does the Kubernetes scheduler place pods on nodes based only on their requested resources and the nodes' available resources at the current snapshot of the cluster, or does it also take into account the nodes' historical resource utilization?

In the official Kubernetes documentation we can find the process and criteria that kube-scheduler uses to choose a node for a pod.
Basically, this is a 2-step process:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step is responsible for producing the list of nodes that are actually able to run the pod:
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
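For example, the requests that PodFitsResources evaluates come directly from the pod spec. A minimal sketch (name, image and values are only illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: web                # illustrative name
    spec:
      containers:
        - name: web
          image: nginx:1.25    # illustrative image
          resources:
            requests:
              cpu: 500m        # the filter compares these declared requests
              memory: 256Mi    # against each node's allocatable resources

Only the declared requests are compared against each node's allocatable capacity; the node's actual current usage is not part of this check.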
The scoring step is responsible for choosing the best node from the list generated by the filtering step:
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
When the node with the highest score is chosen, the scheduler notifies the API server:
...picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that are taken into consideration for scheduling:
Individual and collective resource requirements
Hardware
Policy constraints
Affinity and anti-affinity specifications
Data locality
Inter-workload interference
Others...
More detailed information about these parameters can be found here:
The following predicates implement filtering:
PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
MatchNodeSelector: Checks if a Pod's Node Selector matches the Node's label(s).
NoVolumeZoneConflict: Evaluates if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that's over a configured limit.
PodToleratesNodeTaints: Checks if a Pod's tolerations can tolerate the Node's taints.
CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.
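To make a few of these concrete, here is a sketch of a Pod that exercises PodFitsResources, MatchNodeSelector and PodToleratesNodeTaints (the label, taint and resource values are made up for illustration):

    apiVersion: v1
    kind: Pod
    metadata:
      name: filtered-pod            # illustrative name
    spec:
      nodeSelector:
        disktype: ssd               # MatchNodeSelector: only nodes carrying this label pass
      tolerations:
        - key: "dedicated"          # PodToleratesNodeTaints: tolerates a hypothetical taint
          operator: "Equal"
          value: "batch"
          effect: "NoSchedule"
      containers:
        - name: app
          image: busybox:1.36       # illustrative image
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: 250m             # PodFitsResources: checked against node allocatable
              memory: 128Mi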
The following priorities implement scoring:
SelectorSpreadPriority: Spreads Pods across hosts, considering Pods that belong to the same Service, StatefulSet or ReplicaSet.
InterPodAffinityPriority: Implements preferred inter-pod affinity and anti-affinity.
LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
BalancedResourceAllocation: Favors nodes with balanced resource usage.
NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn't run on the same Node.
NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node's rank taking that list into account.
ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don't have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
EqualPriority: Gives an equal weight of one to all nodes.
EvenPodsSpreadPriority: Implements preferred pod topology spread constraints.
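As a side note, in newer Kubernetes releases these predicates and priorities are implemented as scheduling-framework plugins, and the weight of the scoring plugins can be tuned through a KubeSchedulerConfiguration. A rough sketch; the exact apiVersion and plugin names depend on your cluster version, so treat them as assumptions to verify:

    apiVersion: kubescheduler.config.k8s.io/v1   # older clusters use v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            enabled:
              - name: ImageLocality                      # counterpart of ImageLocalityPriority
                weight: 2
              - name: NodeResourcesBalancedAllocation    # counterpart of BalancedResourceAllocation
                weight: 1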
Answering your question:
Does it take into account the node's historical resource utilization?
As you can see, the above list contains no parameters related to historical resource utilization. I also did some research and didn't find any information about it.


Kubernetes scheduler

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case since it is mentioned that the nodes are iterated round robin.
I want to make sure that the pod scheduling is not being done in parallel.
Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container
in pods has different requirements for resources and every pod also
has different requirements. Therefore, existing nodes need to be
filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod
are called feasible nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the
highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called
binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod
in the scheduling queue according to constraints and available
resources.
Reference - kube-scheduler - synopsis.
In short, kube-scheduler picks up pods one by one, assesses each pod and its requests, and then finds an appropriate feasible node to schedule it on.
Scheduler and nodes
The link you mentioned relates to nodes: they are iterated round robin so that all feasible nodes get a fair chance to run pods.
Nodes in a cluster that meet the scheduling requirements of a Pod are
called feasible Nodes for the Pod
The information here relates to the default kube-scheduler; there are alternative schedulers that can be used, and it's even possible to implement your own. It's also possible to run multiple schedulers in a cluster.
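For example, a Pod can opt into a non-default scheduler through spec.schedulerName; the scheduler name below is a placeholder for whatever additional scheduler you deploy:

    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-scheduled-pod          # illustrative name
    spec:
      schedulerName: my-custom-scheduler  # placeholder; must match a scheduler running in the cluster
      containers:
        - name: app
          image: nginx:1.25               # illustrative image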
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kubernetes and would like to ask about some concepts related to Kubernetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replicas.
(1)
Assume that there are 4 nodes, each being a different physical server with different CPU and memory.
When the deployment is made, how would Kubernetes assign the pods to the nodes? Will there be a scenario where it puts multiple pods on the same server while another server has no pod assigned (due to resource considerations)?
(2)
Assume there are 4 nodes (on 4 identical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would Kubernetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one has more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application's point of view is to set appropriate resource requests in your pod specifications. As long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
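For reference, requests (and optionally limits) are declared per container in the pod spec; the values below are only an example:

    # inside the pod template's container spec
    resources:
      requests:
        cpu: 100m          # what the scheduler reserves on a node
        memory: 256Mi
      limits:
        cpu: 500m          # what the container may use at runtime
        memory: 512Mi

The scheduler only considers the requests; limits are enforced later by the kubelet and the container runtime.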
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This is based on the resource availability of the nodes. So if a node does not have enough capacity to host a pod, it's possible that more than one replica of a pod is scheduled onto the same node.
Kubernetes will reschedule pods from the failed node to another available node which has enough capacity to host them. Again, if there are not enough nodes that can host the replicas, there is a possibility that more than one replica is scheduled on the same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler with node and pod affinity and anti-affinity.
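For instance, a preferred pod anti-affinity rule in the Deployment's pod template nudges the scheduler to spread replicas across nodes without making it a hard requirement (the app label is just an example):

    # in the Deployment's spec.template.spec
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app                    # example label shared by the replicas
              topologyKey: kubernetes.io/hostname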

What does the OutOfcpu error mean in Kubernetes?

I got OutOfcpu in Kubernetes on Google Cloud. What does it mean? My pods seem to be working now; however, there were pods in this same revision which got OutOfcpu.
It means that the kube-scheduler can't find any node with available CPU to schedule your pods:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it’s feasible to
schedule the Pod. For example, the PodFitsResources filter checks
whether a candidate Node has enough available resource to meet a Pod’s
specific resource requests.
[...]
PodFitsResources: Checks if the
Node has free resources (eg, CPU and Memory) to meet the requirement
of the Pod.
Also, as per Assigning Pods to Nodes:
If the named node does not have the resources to accommodate the pod,
the pod will fail and its reason will indicate why, e.g. OutOfmemory
or OutOfcpu.
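To illustrate the quoted case: a Pod pinned to a node with spec.nodeName bypasses the scheduler entirely, and if that node cannot cover the request, the kubelet rejects the Pod with a reason such as OutOfcpu (node name and request below are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-pod       # illustrative name
    spec:
      nodeName: worker-1     # placeholder node; the scheduler is skipped for this Pod
      containers:
        - name: app
          image: nginx:1.25
          resources:
            requests:
              cpu: "8"       # if worker-1 has fewer free CPUs, the Pod fails with OutOfcpu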
In addition to how-kube-scheduler-schedules-pods, I think this will be helpful for understanding why the OutOfcpu error showed up.
When you create a Pod, the Kubernetes scheduler selects a node for the
Pod to run on. Each node has a maximum capacity for each of the
resource types: the amount of CPU and memory it can provide for Pods.
The scheduler ensures that, for each resource type, the sum of the
resource requests of the scheduled Containers is less than the
capacity of the node. Note that although actual memory or CPU resource
usage on nodes is very low, the scheduler still refuses to place a Pod
on a node if the capacity check fails. This protects against a
resource shortage on a node when resource usage later increases, for
example, during a daily peak in request rate.
Ref: how-pods-with-resource-requests-are-scheduled

CPU/Memory requests and limits per node

kubectl describe nodes gives information on the requests and limits for resources such as CPU and memory. However, the api endpoint api/v1/nodes doesn't provide this information.
Alternatively, I could also hit the api/v1/pods endpoint to get this information per pod which I can accumulate across nodes. But is there already a kubernetes API endpoint which provides the information pertaining to cpu/memory requests and limits per node?
From what I've found in the documentation, the endpoint responsible for that is the Kubernetes API server.
CPU and memory are each a resource type. A resource type has a base unit. CPU is specified in units of cores, and memory is
specified in units of bytes.
CPU and memory are collectively referred to as compute resources,
or just resources. Compute resources are measurable quantities that
can be requested, allocated, and consumed. They are distinct from
API resources.
API resources, such as Pods and Services
are objects that can be read and modified through the Kubernetes API
server.
Going further, here is what a node is:
Unlike pods and services, a node is not inherently created by Kubernetes: it is created externally by cloud providers like Google Compute Engine, or exists in your pool of physical or virtual machines. What this means is that when Kubernetes creates a node, it is really just creating an object that represents the node. After creation, Kubernetes will check whether the node is valid or not.
[...]
Currently, there are three components that interact with the Kubernetes node interface: node controller, kubelet, and kubectl.
[...]
The capacity of the node (number of cpus and amount of memory) is part of the node object. Normally, nodes register themselves and report their capacity when creating the node object. If you are doing manual node administration, then you need to set node capacity when adding a node.
The Kubernetes scheduler ensures that there are enough resources for
all the pods on a node. It checks that the sum of the requests of
containers on the node is no greater than the node capacity. It
includes all containers started by the kubelet, but not containers
started directly by Docker nor processes not in containers.
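For reference, the capacity a node registers (and the allocatable portion the scheduler actually budgets against) is visible in the Node object's status; the numbers below are illustrative:

    # excerpt of: kubectl get node <node-name> -o yaml
    status:
      capacity:
        cpu: "4"
        memory: 16393352Ki
        pods: "110"
      allocatable:           # capacity minus system and kube reservations
        cpu: 3920m
        memory: 15276168Ki
        pods: "110"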
Edit:
Alternatively, I could also hit the api/v1/pods endpoint to get this
information per pod which I can accumulate across nodes.
This is actually how it is done; kubectl itself gathers the data this way (see the list of requests below).
But is there already a kubernetes API endpoint which provides the
information pertaining to cpu/memory requests and limits per node?
The answer to this question is: no, there is not.
Unfortunately, there is no endpoint that provides this information directly. kubectl makes several requests to build the output of describe nodes. When you run kubectl -v=8 describe nodes, you can see GET calls in this order:
/api/v1/nodes?includeUninitialized=true
/api/v1/nodes/minikube
/api/v1/pods
/api/v1/events?fieldSelector

How do I schedule the same pod on different nodes using kubectl scale?

New to Kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
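A minimal DaemonSet sketch for that last option (name and image are illustrative); it runs exactly one copy of the pod on every eligible node:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: my-daemon            # illustrative name
    spec:
      selector:
        matchLabels:
          app: my-daemon
      template:
        metadata:
          labels:
            app: my-daemon
        spec:
          containers:
            - name: app
              image: nginx:1.25  # illustrative image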
Scaling a Deployment (or ReplicationController) tells the controller-manager to create more pods; the new pods are then subject to scheduling. The Kubernetes scheduler will attempt to find the most reasonable placement for your pods. This does not guarantee that pods will launch on different nodes, but it makes that a rather likely scenario if you have the required resources. Unfortunately, it also means that if all pods can fit on one node, there are situations where the scheduler might actually do just that (e.g. all other nodes being in an unschedulable state for some reason). If that happens, the pods will not be rescheduled when conditions change.
To have a solid guarantee that pods will not get colocated on the same node, you have two options:
Legacy hack: define a hostPort in your pod template. As a given host port is a resource that can be assigned only once per node, your pods will never run more than once per node.
Alpha feature: you can look into Pod AntiAffinity; it is quite early and not really battle-proven yet.
The first one has a disadvantage: you can never have more than one pod of this type per node, so it affects, e.g., rolling deployments and limits your capacity for scaling (you can never have more active pods than the number of nodes).
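For completeness, a sketch of a required pod anti-affinity rule (the app label is an example) that turns colocation into a hard constraint rather than a preference:

    # in the pod template's spec
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app                        # example label shared by the replicas
            topologyKey: kubernetes.io/hostname    # at most one matching pod per node

With this in place, replicas beyond the number of matching nodes simply stay Pending.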