How to autoscale a pod that's pulling tasks from a queue - kubernetes

I've tried a few approaches to this. The docs suggest that there is a way of getting autoscaling to deal with queues (without an external solution), https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/, but they don't explain how.
I created a Deployment whose pods pull from a Redis queue (there is a Redis service in the cluster). I want pods to be scaled horizontally so that they pull tasks from the queue and execute them. Executing a task can take an unpredictable and variable amount of time.
If pod A pulls a task from the queue and is busy, I want to spin up pod B to pull the next task. At the moment I am using polling, so if the queue is empty a pod will just keep trying to pull from it.
I've used horizontal pod autoscaling, which at least scales out when pod 1 is working. But because pod 2, once it is running a task, doesn't decrease the average utilization, the autoscaler just keeps spinning up new pods up to the maximum. For my use case this is semi-fine: when the queue is empty, any pods polling an empty queue bring the utilization percentage down, and in theory the excess pods will all spin down. It doesn't feel very efficient, though, and the real problem is that the autoscaler will scale down pods that are in the middle of running jobs.
I've looked at using the newer metrics API, but it seems I'll need to create a custom metrics API to implement this, which seems extreme for such a simple use case.
I've also looked at using Jobs, but this doesn't seem to accommodate autoscaling at all?
I really want to be able to scale down based on the CPU utilization of the specific pod that's about to be removed, rather than an average across all the pods.

HorizontalPodAutoscaler will always scale based on the average utilization of all available pods. I'd say Jobs are the most suitable for your use case. Queue with pod per work item is one of the example use cases for Kubernetes Jobs in their official documentation.
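To make the averaging concrete, here is a rough sketch of the replica calculation described in the HPA docs, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The function name and the utilization numbers are made up for illustration, and the real controller also applies a tolerance, min/max bounds and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, pod_utilizations: list[float], target_utilization: float) -> int:
    """HPA-style rule: ceil(currentReplicas * averageMetric / targetMetric)."""
    average = sum(pod_utilizations) / len(pod_utilizations)
    return math.ceil(current_replicas * average / target_utilization)


# One busy worker at 90% CPU against a 50% target.
print(desired_replicas(1, [90.0], 50.0))  # 2
# The second worker also picks up a task, so the average stays at 90%.
print(desired_replicas(2, [90.0, 90.0], 50.0))  # 4
# Queue drained: one pod still busy, three idle pods polling at 5%.
print(desired_replicas(4, [90.0, 5.0, 5.0, 5.0], 50.0))  # 3
```

The last case is the awkward one from the question: the average only says that one pod should go, not which one, so the pod that is still busy is not protected from being removed.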
You could also look at KEDA, which is a framework for event-driven autoscaling.
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.
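The idea behind scaling on "events needing to be processed" can be sketched roughly as below. This is not KEDA's API, just the arithmetic for sizing the replica count from the queue length; the function and parameter names are hypothetical.

```python
import math


def replicas_for_backlog(queue_length: int, tasks_per_pod: int, min_replicas: int, max_replicas: int) -> int:
    """Size the worker set from the number of waiting tasks, clamped to [min, max]."""
    wanted = math.ceil(queue_length / tasks_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


print(replicas_for_backlog(0, 1, 0, 10))   # empty queue: scale to the minimum
print(replicas_for_backlog(7, 5, 1, 10))   # 7 waiting tasks, 5 per pod
print(replicas_for_backlog(25, 1, 0, 10))  # large backlog: capped at the maximum
```

Note that a plain queue length only counts tasks that are still waiting, so on its own it does not tell the autoscaler which workers are busy.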

Related

Scaling a Kubernetes cluster to process jobs in a queue?

(I'm new to Kubernetes and not sure this is best practice)
I have a pipeline of jobs in a Firestore database that need to be completed as quickly as possible.
I want to set up a Kubernetes cluster (on GKE) that will scale up when there is a large backlog of tasks to complete. Each pod/node needs a single GPU to complete the task.
Is it possible to use a cloud function to manually scale the number of nodes depending on the number of jobs in the pipeline?
I was planning on using the clusters.nodePools.setSize() function from the GKE client library but I'm not sure if this is just intended for initial cluster setup rather than manual scaling.
Thanks
https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters.nodePools/setSize
You can configure and use Horizontal Pod Autoscaling on your cluster to scale the number of Pods.
As suggested by #somethingsomething, refer to these links on Horizontal Pod Autoscaler and Cluster autoscaler:
The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.
Horizontal Pod autoscaling helps to ensure that your workload functions consistently in different situations, and allows you to control costs by only paying for extra capacity when you need it.
It's not always easy to predict the indicators that show whether your workload is under-resourced or under-utilized. The Horizontal Pod Autoscaler can automatically scale the number of Pods in your workload based on one or more metrics.
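As a minimal sketch of the "external metrics" option, the snippet below builds an autoscaling/v2 HorizontalPodAutoscaler manifest that targets an average backlog per Pod. The Deployment name and the metric name firestore_pending_jobs are hypothetical, and it assumes that something (for example a metrics adapter) already exposes the Firestore backlog to the cluster as an external metric.

```python
import json


def hpa_manifest(deployment: str, metric_name: str, jobs_per_pod: int, max_replicas: int) -> dict:
    """Build an HPA that keeps roughly `jobs_per_pod` pending jobs per replica."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-backlog"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment},
            "minReplicas": 1,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "External",
                "external": {
                    # Hypothetical metric name; it must match what the adapter exposes.
                    "metric": {"name": metric_name},
                    "target": {"type": "AverageValue", "averageValue": str(jobs_per_pod)},
                },
            }],
        },
    }


manifest = hpa_manifest("gpu-worker", "firestore_pending_jobs", 1, 20)
print(json.dumps(manifest, indent=2))  # JSON can be passed to kubectl apply -f
```

With one GPU per Pod, the node count would then follow from the Pod count if the cluster autoscaler is enabled on the GPU node pool, rather than from calling setSize by hand.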

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS DaemonSets itself (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operational messages.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus get a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then after the Pod has been scheduled and attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to arrive on that other Node, because it by definition has fewer resources in use but also has the required container image.
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see the ImageLocality plugin assigns a very high score due to the Windows container images being really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate for this.
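For anyone who wants to play with the numbers, here is a simplified Python model of how I understand the ImageLocality scoring to work. The thresholds (23 MB and 1000 MB per container) and the scaling by the fraction of nodes that hold the image are recalled from the upstream scheduler source and may differ between versions, so treat them as assumptions rather than a reference.

```python
MB = 1024 * 1024
MIN_THRESHOLD = 23 * MB              # assumed lower clamp
MAX_CONTAINER_THRESHOLD = 1000 * MB  # assumed upper clamp per container
MAX_NODE_SCORE = 100


def image_locality_score(images_on_node: list[tuple[int, int]], total_nodes: int, num_containers: int) -> int:
    """images_on_node: (size_bytes, nodes_with_image) for each required image already on this node."""
    # Each image counts with its size scaled by how widely it is spread across the cluster.
    scaled_sum = sum(int(size * (nodes_with_image / total_nodes)) for size, nodes_with_image in images_on_node)
    max_threshold = MAX_CONTAINER_THRESHOLD * num_containers
    clamped = min(max(scaled_sum, MIN_THRESHOLD), max_threshold)
    return MAX_NODE_SCORE * (clamped - MIN_THRESHOLD) // (max_threshold - MIN_THRESHOLD)


# A 5000 MB image cached on 1 of 4 nodes: the node that has it gets the full score.
print(image_locality_score([(5000 * MB, 1)], total_nodes=4, num_containers=1))  # 100
# A node without the image scores nothing.
print(image_locality_score([], total_nodes=4, num_containers=1))  # 0
# A 100 MB image in the same situation barely registers.
print(image_locality_score([(100 * MB, 1)], total_nodes=4, num_containers=1))  # 0
```

Under these assumptions, a multi-gigabyte image gives the one node that already has it a maximum score, which would match what we observed.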
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). That surprised me, as it would seem to be a good default.
Some experimentation shows that the situation indeed changes once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
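A minimal sketch of what that could look like, written as a Python dict that is dumped to JSON for kubectl. The label, container name, image and request values are placeholders, not our real configuration.

```python
import json


def spread_across_nodes(pod_labels: dict, max_skew: int = 1) -> dict:
    """Soft constraint: prefer an even spread of matching Pods across nodes."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "kubernetes.io/hostname",
        # ScheduleAnyway is a preference; DoNotSchedule would make it a hard rule.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": pod_labels},
    }


pod_spec_fragment = {
    "topologySpreadConstraints": [spread_across_nodes({"app": "perf-test"})],
    "containers": [{
        "name": "app",                          # placeholder
        "image": "example.invalid/app:latest",  # placeholder
        "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    }],
}
print(json.dumps(pod_spec_fragment, indent=2))
```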

Is it possible to use Kubernetes autoscale on cron job pods

Some context: I have multiple cron jobs running daily, weekly, hourly and some of which require significant processing power.
I would like to add requests and limits to these cron job containers to try to enable vertical scaling and ensure that the assigned node will have enough capacity when the pod is initialized. This will prevent me from having to keep multiple large nodes available at all times, and will also let me easily change how many crons I can run in parallel.
I would like to try and avoid timed scaling since the cron jobs processing time can increase as the application grows.
Edit - Additional Information :
Currently I am using DigitalOcean and its UI for cluster autoscaling. I have it working with HPAs on deployments but not crons. Adding limits to crons does not trigger cluster autoscaling to my knowledge.
I have tried to enable HPA scaling with the cron but with no success. Basically it just sits on a pending status signalling that there is insufficient CPU available and does not generate a new node.
Does HPA scaling work with cron job pods and is there a way to achieve the same type of scaling?
HPA is used to scale more pods when pod loads are high, but this won't increase the resources on your cluster.
I think you're looking for the cluster autoscaler (works on AWS, GKE and Azure), which will increase cluster capacity when pods can't be scheduled.
This is a Community Wiki answer so feel free to edit it and add any additional details you consider important.
As Dom already mentioned, "this won't increase the resources on your cluster." More specifically, it won't create an additional node, as the Horizontal Pod Autoscaler doesn't have that capability and in fact has nothing to do with cluster scaling. Its name is pretty self-explanatory. HPA can only scale Pods, and it scales them horizontally; in other words, it can automatically increase or decrease the number of replicas of your "replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics)", as per the docs.
As for cluster autoscaling, as Dom already said, such solutions are implemented in so-called managed Kubernetes offerings such as GKE on GCP, EKS on AWS or AKS on Azure, and many more. In many of them you don't need to install anything yourself, although you usually still have to enable autoscaling and set node-count limits on the node pool.
You may wonder how HPA and CA fit together. It's really well explained in FAQ section of the Cluster Autoscaler project:
How does Horizontal Pod Autoscaler work with Cluster Autoscaler?
Horizontal Pod Autoscaler changes the deployment's or replicaset's
number of replicas based on the current CPU load. If the load
increases, HPA will create new replicas, for which there may or may
not be enough space in the cluster. If there are not enough resources,
CA will try to bring up some nodes, so that the HPA-created pods have
a place to run. If the load decreases, HPA will stop some of the
replicas. As a result, some nodes may become underutilized or
completely empty, and then CA will terminate such unneeded nodes.
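Applied to cron jobs, this means the Pods created by the CronJob need resource requests, since those are what the scheduler uses to decide whether a Pod fits and therefore whether it ends up pending. Below is a minimal sketch of such a manifest built in Python; the name, schedule, image and sizes are placeholders.

```python
import json


def cronjob_manifest(name: str, schedule: str, image: str, cpu: str, memory: str) -> dict:
    """Build a batch/v1 CronJob whose container declares requests and matching limits."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{"name": name, "image": image, "resources": resources}],
            }}}},
        },
    }


# Placeholder values for illustration only.
manifest = cronjob_manifest("nightly-report", "0 2 * * *", "example.invalid/report:latest", "2", "4Gi")
print(json.dumps(manifest, indent=2))
```

No HPA is involved here: the CronJob creates the Pod on schedule, and if no node has room for the requested CPU and memory, the pending Pod is what the cluster autoscaler reacts to.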

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I define an Affinity, specifically a WeightedPod(Anti)Affinity, to discourage the batch-job pods from scheduling on the same machine.
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to discern normal priority pending pods over low priority pending pods, and be able to give precedence to the more urgent ones.
The pro way to go about this issue, IMHO, is to leverage the Prometheus adapter and use an HPA that targets the current capacity of your cluster via a Prometheus query. This can give you a continuous view of the cluster capacity and the ability to autoscale accordingly. This Medium article has a pretty good introduction to the concept.
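As a rough illustration of the number such a query would need to produce, the sketch below counts how many extra batch pods fit into the spare capacity of each node. The field names and figures are hypothetical; in practice the inputs would come from your metrics rather than being hard-coded.

```python
def pods_that_fit(nodes: list[dict], pod_cpu_m: int, pod_mem_mi: int) -> int:
    """Count how many pods of the given size fit into the unrequested capacity, node by node."""
    total = 0
    for node in nodes:
        free_cpu = node["allocatable_cpu_m"] - node["requested_cpu_m"]
        free_mem = node["allocatable_mem_mi"] - node["requested_mem_mi"]
        # A pod needs both CPU and memory on the same node, so take the tighter limit.
        total += max(0, min(free_cpu // pod_cpu_m, free_mem // pod_mem_mi))
    return total


nodes = [
    {"allocatable_cpu_m": 4000, "requested_cpu_m": 2500, "allocatable_mem_mi": 16000, "requested_mem_mi": 4000},
    {"allocatable_cpu_m": 4000, "requested_cpu_m": 3900, "allocatable_mem_mi": 16000, "requested_mem_mi": 15000},
]
print(pods_that_fit(nodes, pod_cpu_m=500, pod_mem_mi=1024))  # 3
```

Counting per node matters because spare capacity that is split across nodes cannot be pooled into one pod.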

Kubernetes batch performance with activation of thousands of pods using jobs

I am writing a pipeline with Kubernetes on Google Cloud.
I sometimes need to start a few pods within a second, where each pod runs a single task.
I plan to call kubectl run with a Kubernetes Job, wait for it to complete (polling all the running pods every second), and then activate the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I can run tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
This is a batch operation, so I can tolerate the delay.
Is this the best approach? How long does it take to create a pod in Kubernetes?
If you want to run tens of thousands of jobs at the same time, you will definitely need to plan resource allocation and estimate the number of nodes you need. You can then either create all the nodes up front or use the GKE cluster autoscaler to add nodes automatically in response to resource demand. If you preallocate all the nodes, pods can be created very quickly, but you will probably have a high bill at the end of the month. If you create only a small number of nodes initially and rely on the cluster autoscaler, you will face large delays, because nodes take several minutes to start. You must decide which trade-off suits you.
If you use the cluster autoscaler, do not forget to specify the maximum number of nodes in the cluster.
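A back-of-the-envelope estimate of the node count could look like the sketch below. All sizes are made-up examples, and it ignores system pods and scheduling overhead, so the real number will be somewhat higher.

```python
import math


def nodes_needed(jobs: int, job_cpu_m: int, job_mem_mi: int,
                 node_cpu_m: int, node_mem_mi: int, max_nodes: int) -> int:
    """Estimate nodes for `jobs` concurrent pods, capped at the autoscaler maximum."""
    per_node = min(node_cpu_m // job_cpu_m, node_mem_mi // job_mem_mi)
    if per_node == 0:
        raise ValueError("a single job does not fit on one node")
    return min(math.ceil(jobs / per_node), max_nodes)


# 10,000 jobs of 500m CPU / 512Mi on nodes with 3900m / 14000Mi allocatable, capped at 500 nodes.
print(nodes_needed(10_000, 500, 512, 3900, 14000, max_nodes=500))  # 500
# A smaller burst of 100 jobs on the same node type.
print(nodes_needed(100, 500, 512, 3900, 14000, max_nodes=500))  # 15
```

When the cap is hit, as in the first call, the remaining jobs simply stay pending until earlier ones finish.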
Another important thing: you should put your jobs into the Guaranteed quality-of-service class in Kubernetes. If you use BestEffort or Burstable pods instead, you will end up with an eviction nightmare that is really terrible and hard to control.
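To check which class a pod spec would land in, here is a small sketch of the QoS rules as I understand them: Guaranteed needs CPU and memory limits on every container with requests equal to the limits, BestEffort means nothing is set at all, and everything else is Burstable. The helper is illustrative only and compares quantity strings literally.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources (simplified, string comparison only)."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if limits or requests:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit.
            request = requests.get(resource, limit)
            # Literal comparison: "1" and "1000m" would be treated as different here.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))  # BestEffort
print(qos_class([{"resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}}]))  # Guaranteed
print(qos_class([{"resources": {"requests": {"cpu": "500m", "memory": "256Mi"}}}]))  # Burstable
```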