Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS? - kubernetes

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests or limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS itself via DaemonSets (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling decisions, just operational messages.

I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus get a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then, after the Pod has been scheduled and has attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to arrive on that other Node, because it by definition has fewer resources in use but also has the required container image.
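A rough sketch of that throwaway Pod, with a placeholder node name and image (substitute your real values):
apiVersion: v1
kind: Pod
metadata:
  name: image-prepull-test                    # hypothetical name for the throwaway Pod
spec:
  nodeName: akswin000002                      # hypothetical: one of the other Windows nodes
  containers:
  - name: app
    image: myregistry.azurecr.io/my-app:1.0   # hypothetical: the same image your Deployment uses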

Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint), which surprised me, as that seems like it would be a sensible default.
Some experimentation shows that the situation indeed changes once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
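For illustration, a minimal sketch of the Deployment we're moving towards, combining small resource requests with a topology spread constraint (all names and values are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                  # placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:                # spread replicas across nodes
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
      containers:
      - name: app
        image: myregistry.azurecr.io/my-app:1.0 # placeholder image
        resources:
          requests:                             # even minimal requests take the Pods out of BestEffort
            cpu: 100m
            memory: 256Mi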

Related

Kubernetes scheduler

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case, since it is mentioned that the nodes are iterated over in round-robin fashion.
I want to make sure that the pod scheduling is not being done in parallel.
Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler selects an optimal node for them to run on. However, every container in pods has different requirements for resources and every pod also has different requirements. Therefore, existing nodes need to be filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources.
Reference - kube-scheduler - synopsis.
In short, kube-scheduler picks up pods one by one, assesses them and their requests, then proceeds to find appropriate feasible nodes to schedule the pods on.
Scheduler and nodes
The link you mentioned relates to how nodes are iterated over, to give all feasible nodes a fair chance to run pods.
Nodes in a cluster that meet the scheduling requirements of a Pod are called feasible Nodes for the Pod.
The information here relates to the default kube-scheduler. There are alternative schedulers that can be used, and it is even possible to implement your own. It is also possible to run multiple schedulers in a cluster.
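If you do run an additional scheduler, the Pod spec's schedulerName field selects which scheduler handles the Pod; a minimal sketch (the scheduler name here is hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  schedulerName: my-custom-scheduler   # hypothetical: name of a second scheduler deployed in the cluster
  containers:
  - name: app
    image: nginx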
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Half of My Kubernetes Cluster Is Never Used, Due to the Way Pods Are Scheduled

I have a Kubernetes cluster with 4 nodes and, ideally, my application should have 4 replicas, evenly distributed to each node. However, when pods are scheduled, they almost always end up on only two of the nodes, or if I'm very lucky, on 3 of the 4. My app has quite a bit of traffic and I would really want to use all the resources that I pay for.
I suspect the reason why this happens is that Kubernetes tries to schedule the new pods on the nodes that have the most available resources, which is nice as a concept, but it would be even nicer if it would reschedule the pods once the old nodes become available again.
What options do I have? Thanks.
You have lots of options!
First and foremost: Pod Affinity and Anti-affinity, to make sure your Pods prefer to be placed on a host that does not already have a Pod with the same label.
Second, you could set up Pod Topology Spread Constraints. This is newer and a bit more advanced, but usually a better solution than simple anti-affinity.
Thirdly, you can pin your Pods to a specific node using a NodeSelector.
Finally, you could write your own scheduler or modify the default scheduler settings, but that's a more advanced topic. Don't forget to always set your resource requests correctly; they should be set to a value that more or less encapsulates the usage during peak traffic, to make sure that a node has enough resources available to max out the Pod without interfering with other Pods.
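As an illustration of the first option, a preferred (soft) Pod anti-affinity in the Pod template might look roughly like this (the app label is a placeholder for whatever label your Pods carry):
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: my-app              # placeholder: label shared by this Deployment's Pods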

Will k8s scale a pod within HPA range to evict it and meet disruption budget?

Excuse me for asking something that has a lot of overlap with many specific questions about the same knowledge area. I am curious to know whether Kubernetes will scale a pod in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
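For reference, a minimal sketch of that setup (names are placeholders; the actual HPA fields are minReplicas/maxReplicas, and the CPU target is just an example):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1                 # "minCount" above
  maxReplicas: 2                 # "maxCount" above
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # example target
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app                # placeholder: label on the Deployment's Pods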
Why am I asking this? (The question behind the question ;)
Well, we ran into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minAvailable to 2, effectively increasing the pod count only to accommodate future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.

How do I make my EKS AutoScalingGroup start a node with a specific instance-type if one is not already running?

I've been combing through documentation trying to find out if we are able to implement a specific EKS architecture for our use-case, and have not found a clear answer on how to do it or if it's possible.
Scope
We have several small pods that run 24/7, monitoring for new task requests. When a new task is detected, these monitor pods spin up worker pods, which are much heavier on CPU and Memory requirements.
What we would like to accomplish on EKS is to have 2 (or more) instance types within a single AutoScalingGroup:
Small, cheap instance to run the 24/7 pods
Large, expensive instances to run the tasks and then get terminated.
According to the documentation, having multiple instance types in an ASG is no problem, but it doesn't specify how to ensure that the small pods get assigned to the small instance, and the large pods to the large instance.
Testing done so far:
(Max = 3, Min = 1, Desired = 1)
We currently have the large instance type as priority 1, and the small one as priority 2. So at launch, one large instance gets started.
When I start my small pod with node-selector set to a small instance type, it remains in "pending" state because of the error event:
0/1 nodes are available: 1 node(s) didn't match node selector.
So currently, my question is:
How do I make my ASG start a node with a specific instance-type if it is not already running, based on the pod's requirement for that specific instance-type?
Any pointers to links, documentation, or suggestions for better approaches are appreciated, thank you!
This appears to be a lack of understanding on my part of how the auto-scaler and ASGs work. Based on feedback from someone in a different forum, I learned that
A) auto-scaler runs as a pod on the cluster itself (hence why the out-of-the-box EKS does not support a minimum of 0 nodes; at least one node is required to run the kube-system/auto-scaler pods).
and B) the single auto-scaler pod is able to scale the multiple ASGs that exist on the cluster. So this allows us to separate our instances into separate ASGs by cost, and ensure that the expensive instances are only used when requested by the worker pods.
Our solution so far is this:
Set up at least 2 ASGs:
ASG 1: runs the 24/7 pods and the kube-system pods. This ASG uses the smaller, cheaper instance types.
ASG 2 (or more, if it fits the use case): runs the burstable pods. This ASG uses the larger, more expensive instance types that are required for the task processing.
Apply identifying labels to the ASGs. The EKS recommended approach (especially if you want to use Spot instances) is to use the instance size (e.g. micro, large, 4xlarge). This lets you easily add instances with the same resource sizes to an existing ASG for more reliability. Example:
Labels: asgsize=xlarge
Apply the node-selector in the pod yaml to match the desired node:
spec:
  nodeSelector:
    asgsize: xlarge
Set the 24/7, small-instance ASG to min=1, desired=1, max=1 (at least; max can be bigger if that fits your needs)
Set the burstable, large-instance ASG to min=0, desired=0, max=(whatever is required for your environment)
When we implemented this approach, we were able to successfully have a small instance running 24/7, and have the larger instances burst up from 0 only when a pod with that label is created.
Disclaimer:
We also ran into this little bug on our auto-scaler where the large ASG was not scaling up from 0 initially:
https://github.com/kubernetes/autoscaler/issues/2418
The workaround solution in that issue worked for us. We forced our large ASG to have a min=1. Then we started a pod on that group, set the min=0 again, and deleted the pod. The instance auto-scaled down and got terminated, and then the next time we requested the pod, it auto-scaled up correctly.
I never had this use case, but I think you should try a combination of the cluster autoscaler with nodeAffinity.
Refer: Special note on GPU instances
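A rough sketch of the nodeAffinity variant, reusing the asgsize label from the answer above (purely illustrative):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: asgsize
            operator: In
            values:
            - xlarge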

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I define an Affinity, specifically a weighted Pod anti-affinity, to discourage the batch-job Pods from scheduling on the same machine.
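Roughly, the relevant pieces look like this (names and values are placeholders, not the exact manifests):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low                  # placeholder name
value: -10                         # lower than the PriorityClass used for everything else
preemptionPolicy: Never            # batch Pods never preempt anything themselves
globalDefault: false
description: Low priority for the media-processing batch workers.
---
# Pod template fragment for the batch-job Deployment
spec:
  priorityClassName: batch-low
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: media-worker    # placeholder: label on the batch-job Pods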
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to distinguish normal-priority pending pods from low-priority pending pods, and give precedence to the more urgent ones.
The pro way to go about this, IMHO, is to leverage the prometheus adapter and use an HPA that targets the current capacity of your cluster via a prometheus query. This gives you a continuous view of the cluster's capacity and the ability to autoscale accordingly. This medium article has a pretty good introduction to the concept.
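For completeness, such an HPA driven by an external metric from the prometheus adapter might look roughly like this (the metric name and target value are purely hypothetical and depend entirely on your adapter configuration):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: media-worker                          # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: media-worker
  minReplicas: 1
  maxReplicas: 100                            # effectively "as many as will fit"
  metrics:
  - type: External
    external:
      metric:
        name: cluster_allocatable_cpu_cores   # hypothetical metric exposed via the prometheus adapter
      target:
        type: AverageValue
        averageValue: "1"                     # hypothetical: roughly one spare core per worker Pod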