Kubernetes pods N:M scheduling how-to

I run batch computations (Monte Carlo) from a Docker image, as multiple jobs on Google Cloud managed by Kubernetes. There are no Replication Controllers, just multiple pods with a NoRestart policy delivering computed payloads to our server. So far so good. The problem is that I have a cluster with N nodes/minions and M jobs to compute, where M > N. I would like to fire off M pods at once and tell Kubernetes to schedule them so that only N are running at any given time, with everything else kept in the Pending state. As soon as one pod is done, the next one is scheduled, moving from Pending to Running, and so on until all M pods are done.
Is it possible to do so?

Yes. You can have them all ask for a resource of which there's only one on each node; the scheduler then won't be able to schedule more than N at a time. The most common way to do this is to have each pod ask for a hostPort in the ports section of its containers spec.
However, I can't say I'm completely sure why you would want to limit the system to one such pod per node. If there are enough resources available to run multiple at a time on each node, it should speed up your job to let them run.
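For illustration, here is a rough sketch of the hostPort trick, written as Python that builds the Pod manifests as plain dicts. The pod names, image and port number are made up; any port that is otherwise unused on the nodes should do. Since a given hostPort can only be claimed by one pod per node, the extra pods should just sit in Pending until a node frees up.

```python
import json


def make_batch_pod(name, image, host_port=31999):
    """Build a Pod manifest that claims a fixed hostPort.

    The port number, names and image are hypothetical. Only one pod per
    node can hold a given hostPort, so at most one such pod runs per node.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",  # run once, do not restart
            "containers": [{
                "name": "worker",
                "image": image,
                "ports": [{"containerPort": host_port, "hostPort": host_port}],
            }],
        },
    }


# M = 5 pods; each dict can be dumped to a file and passed to kubectl create -f
pods = [make_batch_pod(f"mc-job-{i}", "example.com/monte-carlo:latest")
        for i in range(5)]
print(json.dumps(pods[0], indent=2))
```

All M manifests can be submitted at once; the scheduler works out the ordering.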

Just for the record, after discussion with Alex, trial and error and a binary search for a good number, what worked for me was setting the CPU resource limit in the Pod JSON to:
"resources": {
  "limits": {
    "cpu": "490m"
  }
}
I have no idea how and why this particular value influences the Kubernetes scheduler, but it keeps nodes churning through the jobs, with exactly one pod per node running at any given moment.
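One possible explanation, offered as a guess rather than something confirmed in this thread: when a container sets a CPU limit but no request, the request defaults to the limit, and the scheduler fits pods onto nodes by comparing requests with the node's allocatable CPU. The numbers below (1000m allocatable per node, 100m already requested by system pods) are assumptions picked only to show the arithmetic.

```python
def pods_per_node(allocatable_millicpu, system_requests_millicpu, pod_request_millicpu):
    """How many pods with the given CPU request fit on one node.

    All inputs are in millicores. The node figures used below are
    assumptions for illustration, not measurements from the cluster.
    """
    free = max(allocatable_millicpu - system_requests_millicpu, 0)
    return free // pod_request_millicpu


# With 900m free, a 490m request fits once, a 450m request fits twice.
print(pods_per_node(1000, 100, 490))  # 1
print(pods_per_node(1000, 100, 450))  # 2
```

If that is what is going on, any value slightly above half of the free CPU per node would give one pod per node, which would match a threshold found by binary search.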

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there a way to reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information about scheduling decisions, just operational messages.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus get a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then, after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to arrive on that other Node, because it by definition has fewer resources in use but also has the required container image.
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see that the ImageLocality plugin assigns a very high score because the Windows container images are really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate for this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). This surprised me, as that would seem to be a good default.
Some experimentation shows that the situation does change once we start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
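A minimal sketch of what that could look like, built as a Python dict so it can be dumped to JSON and applied. The deployment name, image and request values are placeholders, and whenUnsatisfiable is set to ScheduleAnyway on the assumption that spreading should be a preference rather than a hard rule.

```python
import json


def make_deployment(name, image, replicas=2):
    """Deployment with small resource requests and a soft spread constraint.

    Name, image and the request sizes are hypothetical placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Prefer spreading replicas across nodes (soft rule).
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Requests give the resource-based scoring something to work with.
                        "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
                    }],
                },
            },
        },
    }


deployment = make_deployment("perf-test", "example.azurecr.io/perf-test:1.0")
print(json.dumps(deployment, indent=2))
```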

Kubernetes scheduler

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case since it is mentioned that the nodes are iterated round robin.
I want to make sure that the pod scheduling is not being done in parallel.
Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container
in pods has different requirements for resources and every pod also
has different requirements. Therefore, existing nodes need to be
filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod
are called feasible nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the
highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called
binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod
in the scheduling queue according to constraints and available
resources.
Reference - kube-scheduler - synopsis.
In short, kube-scheduler picks up pods one by one, assesses each pod and its requests, and then looks for feasible nodes to schedule the pod on.
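To make the one-pod-at-a-time flow concrete, here is a toy model in Python. It is not the kube-scheduler code: the only resource is CPU in millicores, and the "score" is simply the amount of free CPU, both of which are simplifications chosen for the sketch.

```python
from collections import deque


def schedule(pending_pods, nodes):
    """Toy filter -> score -> bind loop, one pod at a time.

    pending_pods: list of (pod_name, cpu_request_millicores)
    nodes: dict of node_name -> free millicores (mutated as pods are bound)
    """
    queue = deque(pending_pods)
    bindings, unschedulable = {}, []
    while queue:
        pod, request = queue.popleft()
        # Filter: keep only nodes that can fit the request.
        feasible = [n for n, free in nodes.items() if free >= request]
        if not feasible:
            unschedulable.append(pod)  # stays pending
            continue
        # Score: here, the node with the most free CPU wins.
        best = max(feasible, key=lambda n: nodes[n])
        # Bind: record the decision and account for the resources.
        nodes[best] -= request
        bindings[pod] = best
    return bindings, unschedulable


nodes = {"node-a": 1000, "node-b": 500}
bindings, pending = schedule([("p1", 600), ("p2", 400), ("p3", 600)], nodes)
print(bindings)  # {'p1': 'node-a', 'p2': 'node-b'}
print(pending)   # ['p3']
```

Because each decision updates the node state before the next pod is looked at, the second pod sees the resources already taken by the first.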
Scheduler and nodes
The link mentioned in the question is about nodes: the scheduler iterates over them so that all feasible nodes get a fair chance to run pods.
Nodes in a cluster that meet the scheduling requirements of a Pod are
called feasible Nodes for the Pod
The information here applies to the default kube-scheduler. There are alternative schedulers that can be used, and it's even possible to write your own. It's also possible to run multiple schedulers in a cluster.
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I define an affinity, specifically a WeightedPod(Anti)Affinity, to discourage the batch-job pods from being scheduled on the same machine.
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to distinguish normal-priority pending pods from low-priority pending pods and give precedence to the more urgent ones.
The pro way to go about this issue, IMHO, is to leverage the Prometheus adapter and use an HPA to target the current capacity of your cluster using a Prometheus query. This can give you a continuous view of the cluster capacity and the ability to autoscale accordingly. This medium article has a pretty good introduction to the concept.
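For the simpler "just over-ask" approach, a rough sketch of the manifests could look like this, built as Python dicts. The class name, priority value, image, replica count and request sizes are all made up. The resource requests are an assumption worth noting: without them, nothing would stop all replicas from being scheduled at once.

```python
import json


def priority_class(name, value, description):
    """PriorityClass manifest; name and value are hypothetical."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def worker_deployment(replicas, image, priority_class_name):
    """Queue-worker Deployment that deliberately asks for too many replicas."""
    labels = {"app": "media-worker"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "media-worker"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": priority_class_name,
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # Requests make the surplus replicas stay Pending
                        # instead of piling onto already busy nodes.
                        "resources": {"requests": {"cpu": "500m", "memory": "512Mi"}},
                    }],
                },
            },
        },
    }


batch_low = priority_class("batch-low", 1000, "Low-priority media processing")
workers = worker_deployment(50, "example.com/media-worker:latest", "batch-low")
print(json.dumps([batch_low, workers], indent=2))
```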

Can I force Kubernetes not to run more than X replicas of a pod in the same node?

I have a tiny Kubernetes cluster consisting of just two nodes running on t3a.micro AWS EC2 instances (to save money).
I have a small web app that I am trying to run in this cluster. I have a single Deployment for this app. This deployment has spec.replicas set to 4.
When I run this Deployment, I noticed that Kubernetes scheduled 3 of its pods in one node and 1 pod in the other node.
Is it possible to force Kubernetes to schedule at most 2 pods of this Deployment per node? Having 3 instances on the same node puts me dangerously close to running out of memory on these tiny EC2 instances.
Thanks!
The correct solution for this would be to set memory requests and limits on every pod so that they match your steady-state and burst RAM consumption levels. The scheduler will then do all this math for you.
But for the future and for others, there is a new feature which kind of allows this: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/. It's not an exact match. You can't set a global cap; rather, you can require that pods be evenly spread over the cluster, subject to a maximum skew.
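To show why a skew limit is not the same as a cap, here is a small Python simulation of a hard constraint with maxSkew 1 over hostnames. It is a toy model, not scheduler code: it always picks the first node that passes the skew check, which is the least favourable choice for spreading.

```python
def place_with_max_skew(replicas, node_names, max_skew=1):
    """Place replicas one by one, rejecting nodes that would exceed max_skew.

    Skew for a candidate node is (its count after placement) minus
    (the lowest current count on any node). Node names are hypothetical.
    """
    counts = {n: 0 for n in node_names}
    for _ in range(replicas):
        lowest = min(counts.values())
        feasible = [n for n in node_names if counts[n] + 1 - lowest <= max_skew]
        counts[feasible[0]] += 1  # worst case: always take the first feasible node
    return counts


print(place_with_max_skew(4, ["node-1", "node-2"]))  # {'node-1': 2, 'node-2': 2}
print(place_with_max_skew(5, ["node-1", "node-2"]))  # {'node-1': 3, 'node-2': 2}
```

With 4 replicas on 2 nodes you get 2 and 2, but a fifth replica is still allowed onto a node that already has 2, so memory requests are what actually protect the node.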

Kubernetes batch performance with activation of thousands of pods using jobs

I am writing a pipeline with kubernetes in google cloud.
I sometimes need to start a few pods within a second, where each pod runs a single task.
I plan to call kubectl run with a Kubernetes job, wait for it to complete (polling all the running pods every second), and then activate the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I can run tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
This is a batch operation, so I can handle the delay.
Is it the best approach? How long does it take to create a pod in Kubernetes?
If you want to run tens of thousands of jobs at the same time, you will definitely need to plan resource allocation. You need to estimate the number of nodes that you need. After that you can either create all the nodes at once or use the GKE cluster autoscaler to add new nodes automatically in response to resource demand. If you preallocate all the nodes at once, you will probably have a high bill at the end of the month, but pods can be created very quickly. If you create only a small number of nodes initially and use the cluster autoscaler, you will face large delays, because nodes take several minutes to start. You must decide what your approach will be.
If you use the cluster autoscaler, do not forget to specify the maximum number of nodes in the cluster.
Another important thing: you should put your jobs into the Guaranteed quality of service class in Kubernetes. Otherwise, if you use Best Effort or Burstable pods, you will end up with an eviction nightmare, which is really terrible and uncontrolled.
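As a quick way to check which class a pod spec would land in, here is a small Python sketch of the QoS rules as I understand them: Guaranteed needs every container to have CPU and memory limits with requests equal to them (an unset request defaults to the limit), BestEffort means nothing is set at all, and everything else is Burstable. The function compares quantity strings literally, which is a simplification ("500m" and "0.5" would be treated as different).

```python
def qos_class(containers):
    """Classify a pod's QoS class from its containers' resources.

    Simplified sketch: quantities are compared as plain strings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        res = container.get("resources", {})
        limits = res.get("limits", {})
        requests = dict(limits)              # unset requests default to limits
        requests.update(res.get("requests", {}))
        if limits or requests:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical job container with requests equal to limits.
job_container = {
    "name": "task",
    "image": "example.com/task:latest",
    "resources": {
        "requests": {"cpu": "500m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"},
    },
}
print(qos_class([job_container]))                                   # Guaranteed
print(qos_class([{"resources": {"requests": {"cpu": "100m"}}}]))    # Burstable
print(qos_class([{"name": "bare"}]))                                # BestEffort
```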