Kubernetes scheduler - kubernetes

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case since it is mentioned that the nodes are iterated round robin.
I want to make sure that the pod scheduling is not being done in parallel.

Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container
in pods has different requirements for resources and every pod also
has different requirements. Therefore, existing nodes need to be
filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod
are called feasible nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the
highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called
binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod
in the scheduling queue according to constraints and available
resources.
Reference - kube-scheduler - synopsis.
In short words, kube-scheduler picks up pods one by one, assess them and its requests, then proceeds to finding appropriate feasible nodes to schedule pods on.
Scheduler and nodes
Mentioned link is related to nodes to give a fair chance to run pods across all feasible nodes.
Nodes in a cluster that meet the scheduling requirements of a Pod are
called feasible Nodes for the Pod
Information here is related to default kube-scheduler, there are solutions which can be used or even it's possible to implement self-written one. Also it's possible to run multiple schedulers in cluster.
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way how I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operative stuff.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus faster start time)
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment
I would then expect the new Deployment managed Pod to arrive on that other Node because it by definition has less resources in use but also has the container image required
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). Which surprised me, as that would seem to be a good default to me?
Some experimentation shows that the situation indeed changes, once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.

What metrics Kubernetes scheduler depends on?

Does the Kubernetes scheduler place the pods on the nodes only based on their requested resources and nodes' available resources at the current snapshot of the server or it also takes into account the node's historical resource utilization?
In the official Kubernetes documentation we can find process and metrics used by kube-scheduler for choosing node for pod.
Basically this is 2-step process:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
Filtering step is responsible for getting list of nodes which actually are able to run a pod:
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
Scoring step is responsible for choosing the best node from the list generated by the filtering step:
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
When the node with the highest score is chosen, scheduler notifies the API server:
...picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that are taken into consideration for scheduling:
Individual and collective resource requirements
Hardware
Policy constraints
Affinity and anti-affinity specifications
Data locality
Inter-workload interference
Others...
More detailed information about parameters be found here:
The following predicates implement filtering:
PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
MatchNodeSelector: Checks if a Pod's Node Selector matches the Node's label(s).
NoVolumeZoneConflict: Evaluate if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that's over a configured limit.
PodToleratesNodeTaints: checks if a Pod's tolerations can tolerate the Node's taints.
CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.
The following priorities implement scoring:
SelectorSpreadPriority: Spreads Pods across hosts, considering Pods that belong to the same Service, StatefulSet or ReplicaSet.
InterPodAffinityPriority: Implements preferred inter pod affininity and antiaffinity.
LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
BalancedResourceAllocation: Favors nodes with balanced resource usage.
NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn't run on the same Node.
NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node's rank taking that list into account.
ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don't have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
EqualPriority: Gives an equal weight of one to all nodes.
EvenPodsSpreadPriority: Implements preferred pod topology spread constraints.
Answering your question:
Does it take into account the node's historical resource utilization?
As can see, on the above list there are no parameters related to the historical resource utilization. Also, I did research and I didn't find any information about it.

Can a pod run on multiple nodes?

I have one kubernetes master and three kubernetes nodes. I made one pod which is running on specific node. I want to run that pod on 2 nodes. how can I achieve this? do replica concept help me? if yes how?
Yes, you can assign pods to one or more nodes of your cluster, and here are some options to achieve this:
nodeSelector
nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.
affinity and anti-affinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
nodeSelector provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity feature, greatly expands the types of constraints you can express. The key enhancements are
The affinity/anti-affinity language is more expressive. The language offers more matching rules besides exact matches created with a logical AND operation;
you can indicate that the rule is "soft"/"preference" rather than a hard requirement, so if the scheduler can't satisfy it, the pod will still be scheduled;
you can constrain against labels on other pods running on the node (or other topological domain), rather than against labels on the node itself, which allows rules about which pods can and cannot be co-located
DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
running a cluster storage daemon on every node
running a logs collection daemon on every node
running a node monitoring daemon on every node
Please check this link to read more about how to assign pods to nodes.
It's not a good practice to run the pods directly on the nodes as the nodes/pods can crash at any time. It's better use the K8S controllers as mentioned in the K8S documentation here.
K8S supports multiple containers and depending on the requirement the appropriate controller can be used. By looking at the OP it's difficult to say which controller to use.
You can use daemonset, if you want to run pod on each node.
What I see is you are trying to deploy pod on each node, it's better if you allow the scheduler to make decision where the pod need to be deployed based on the resources.
This would be best in all worst scenario's.
I'm mean in case of node failures.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

How do I schedule the same pod on different nodes using kubectl scale?

New to kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
Scaling a Deployment (or RC) tells controller-manager to create more pods, new pods are then subject to scheduling. K8S scheduler will attempt to find most reasonable placement to schedule your pods to. This does not guarantee that pods will launch on different nodes, but makes it a rather likely scenario, if you have the required resources. Unfortunately it also means that if all pods can fit on one node, there are situations where scheduler might actually do just that (ie. all other nodes in unschedulable state for some reason). If that happens, the pods will not reschedule when conditions change.
To have a solid guarantee that pods wil not get colocated on the same node you have two options:
legacy hack : define a hostPort in your pod template. As given host port is a resource that can be assigned only once per node, your pods will never exist more then once per node
alpha feature : you can look into Pod AntiAffinity, quite early and not really battle proven yet
First one has a dissadvantage - you can never have more then one pod of this type per node, so it ie. affects rolling deployments and limits your capacity for scaling (you can never have more active pods then number of nodes)