Kubernetes batch performance when activating thousands of pods using Jobs

I am writing a pipeline with kubernetes in google cloud.
I sometimes need to start several pods within a second, where each pod runs a single task.
I plan to launch each task as a Kubernetes Job via kubectl, wait for it to complete (polling all running pods every second), and then trigger the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I may end up running tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
I am running the batch operation so I can handle the delay.
Is it the best approach? How long does it take to create a pod in Kubernetes?

If you want to run tens of thousands of jobs at the same time, you definitely need to plan resource allocation. Estimate the number of nodes you need, and then either create all the nodes up front or use the GKE cluster autoscaler to add nodes automatically in response to resource demand. If you preallocate all the nodes at once, you will probably end up with a high bill at the end of the month, but pods themselves can be created very quickly. If you start with only a small number of nodes and rely on the cluster autoscaler, you will face larger delays, because new nodes take several minutes to start. You have to decide which trade-off suits you.
If you use the cluster autoscaler, do not forget to set a maximum number of nodes for the cluster.
Another important point: put your jobs into the Guaranteed quality-of-service class. If you use BestEffort or Burstable pods instead, you can end up with an eviction nightmare that is very hard to control.
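As a rough sketch (the names, image, and sizes below are illustrative, not from the question), a Job whose containers set resource requests equal to limits is placed in the Guaranteed QoS class:

apiVersion: batch/v1
kind: Job
metadata:
  name: pipeline-task            # illustrative name
spec:
  backoffLimit: 2                # retry a failed pod up to 2 times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: gcr.io/my-project/task:latest   # placeholder image
        resources:
          requests:              # requests == limits on every container => Guaranteed QoS
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

With requests equal to limits on every container, the kubelet classifies the pod as Guaranteed, so it is the last to be evicted under node pressure.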

Related

How to autoscale a pod that's pulling tasks from a queue

I've tried a few approaches to this. The docs suggest that there is a way of getting autoscaling to deal with queues (without an external solution), https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/, but don't explain how.
I created a deployment which deploys pods that pull from a Redis queue (there is a Redis service in the cluster). I want a system where pods are scaled horizontally to pull tasks from the queue and execute them. Executing a task can take an unpredictable and variable amount of time.
If pod A pulls a task from the queue and is busy, I want to spin up pod B to pull the next task. At the moment I am using polling, so if the queue is empty the pod in question just keeps trying to pull from it.
I've used horizontal pod autoscaling, which at least scales out when pod A is working; but because pod B, once running, doesn't decrease the average utilization, the autoscaler just keeps spinning up new pods up to the maximum. For my use case this is semi-fine, because if the queue is empty, any pods hitting an empty queue pull the utilization percentage down, and in theory when the queue is empty the excess pods all spin down. But it doesn't feel very efficient, and the real problem is that the autoscaler will scale down pods that are in the middle of running jobs.
I've looked at using the newer metrics API, but it seems I'll need to create a custom metrics API to implement this, which seems extreme for such a simple use case.
I've also looked at using Jobs, but this doesn't seem to accommodate autoscaling at all?
I really want to be able to scale down based on the CPU utilization of the specific pod that's about to be scaled down, rather than an average across all the pods.
A HorizontalPodAutoscaler will always scale based on the average utilization across all available pods. I'd say Jobs are the most suitable for your use case: a queue with one pod per work item is one of the example use cases in the official Kubernetes Jobs documentation.
You could also look at KEDA, a framework for event-driven autoscaling.
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.
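As a rough, untested sketch (the deployment name, Redis address, and list name are placeholders; check the exact trigger fields against the KEDA docs for your version), a ScaledObject that scales the worker Deployment on the Redis list length could look like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler      # illustrative name
spec:
  scaleTargetRef:
    name: queue-worker           # the Deployment that pulls from the queue
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc.cluster.local:6379   # placeholder Redis service
      listName: tasks            # placeholder queue key
      listLength: "5"            # target list items per replica

Note that scaling a Deployment in can still interrupt a busy worker; KEDA's ScaledJob resource, which launches Jobs that run to completion, sidesteps that particular problem.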

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I define an affinity, specifically a weighted pod anti-affinity, to discourage the batch-job pods from scheduling on the same machine.
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
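For reference, the two-tier priority setup described above could be expressed roughly like this (names and values are purely illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low                # illustrative class for the media-processing pods
value: 1000
globalDefault: false
description: "Low-priority batch work; preempted first when resources are needed"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-high         # illustrative class for everything else
value: 100000
globalDefault: true
description: "Normal workloads; can preempt batch pods"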
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to distinguish normal-priority pending pods from low-priority pending pods and give precedence to the more urgent ones.
The pro way to go about this, IMHO, is to leverage the Prometheus adapter and use an HPA that targets the current capacity of your cluster via a Prometheus query. That gives you a continuous view of the cluster's capacity and the ability to autoscale accordingly. This Medium article has a pretty good introduction to the concept.
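As a hedged sketch of that idea (the metric name cluster_free_cpu_cores is hypothetical and would have to be wired up in the prometheus-adapter configuration, which is not shown here), the HPA side might look like:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: media-batch-hpa          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: media-worker           # placeholder Deployment of queue workers
  minReplicas: 1
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        name: cluster_free_cpu_cores   # hypothetical metric backed by a Prometheus query
      target:
        type: AverageValue
        averageValue: "1"              # aim for roughly one worker per free core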

Is it possible to schedule a pod to run for, say, 24 hours and then remove the deployment/statefulset, or do I need to use jobs?

We have a bunch of pods running in a dev environment. The pods are auto-provisioned by an application on every business action. The problem is that, across various namespaces, they accumulate and eat up the available resources in EKS.
Is there a way, without Jenkins or Kubernetes Jobs, to simply put some parameter on the pod manifest telling it to self-destruct in, say, 24 hours?
Add to your pod.spec:
activeDeadlineSeconds: 86400
After the deadline your Pod will be stopped for good, with the status DeadlineExceeded.
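A complete (illustrative) manifest using it would look like:

apiVersion: v1
kind: Pod
metadata:
  name: short-lived-task         # illustrative name
spec:
  activeDeadlineSeconds: 86400   # terminate the pod 24 hours after it starts
  restartPolicy: Never
  containers:
  - name: worker
    image: busybox               # placeholder image
    command: ["sh", "-c", "run-the-business-action"]   # placeholder command

The countdown starts when the pod starts running; the same field also exists on the Job spec if you move to Jobs later.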
If I understood your situation properly, you would like to scale your cluster down in order to save resources.
Kubernetes has a built-in ability to autoscale your application in a cluster. Concretely, that means Kubernetes can start additional pods when the load is increasing and terminate excess pods when the load is decreasing.
It is possible to downscale the application to zero pods, but, in this case, you will have a delay serving the first request while the pod is starting.
This functionality relies on performance metrics. In practice that means autoscaling does not happen instantly, because it takes some time for the performance metrics to reach the configured threshold.
The Kubernetes feature in question, the HPA (Horizontal Pod Autoscaler), is described in this document.
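A minimal CPU-based HPA (the target name and threshold are illustrative) looks roughly like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa                  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                 # placeholder Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out above 70% average CPU

Note that minReplicas here is 1; a stock HPA does not scale a workload all the way to zero on its own.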
In case you are running your cluster on GCP or GKE, you are able to go further and automatically start additional nodes for your cluster when you need more computing capacity and shut down nodes when they are not running application pods anymore.
More information about this functionality can be found by following the link.
Last but not least, you can use a tool like Ansible to manage all your Kubernetes assets (it can create and manage deployments via playbooks).
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

Stateful jobs in Kubernetes

I have a requirement to run an ad-hoc job, once in a while. The job needs some state to work. Building the state takes a lot of time. So, it is desired to keep the state persistent and reusable in subsequent runs, for a fast turnaround time. I want this job to be managed as K8s pods.
This is a complete set of requirements:
Pods will go down after the work finishes. The K8s controller should not try to bring the pods back up.
Each pod should have a persistent volume attached to it. There should be 1 volume per pod. I am planning to use EBS.
We should be able to manually bring the pods back up in future.
Future runs may have more or less replicas than the past runs.
I know K8s supports both Jobs and Statefulsets. Is there any Controller which supports both at the same time?
Pods will go down after the work finishes. The K8s controller should not try to bring the pods back up.
This is what Jobs do: run to completion. You only control whether you want to retry when the exit code is > 0.
Pods should have a persistent volume attached to them.
The same volume for all of them? Will they write or only read? What volume backend do you have, AWS EBS or similar? Depending on the answers, you might want to split the input data between a few volumes, or use separate volumes for writing plus a finalization job that assembles everything into one volume (a kind of map-reduce), or use a volume backend that supports multi-mount read-write: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes (see the table for ReadWriteMany).
We should be able to manually bring the pods back up in future.
Jobs fit here: You launch it when you need it, and it runs till completion.
Future runs may have more or less replicas than the past runs.
Jobs fit here. Specify different completions or parallelism when you launch a job: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs
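A rough sketch of launching such a run as a Job (names and counts are placeholders; note that an EBS-backed volume is ReadWriteOnce, so for one volume per pod you would need a separate claim per pod rather than the single claim shown here):

apiVersion: batch/v1
kind: Job
metadata:
  name: adhoc-run                # illustrative name
spec:
  completions: 5                 # total work items for this run
  parallelism: 2                 # how many pods run at once
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-registry/worker:latest   # placeholder image
        volumeMounts:
        - name: state
          mountPath: /state
      volumes:
      - name: state
        persistentVolumeClaim:
          claimName: worker-state          # placeholder PVC holding the prebuilt state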
StatefulSets are a different concept: they are mostly used for clustered software that runs continuously and needs to persist a role per pod (e.g. a shard).

Kubernetes pods N:M scheduling how-to

Batch computations (Monte Carlo) using a Docker image, with multiple jobs running on Google Cloud and managed by Kubernetes. No replication controllers, just multiple pods with a no-restart policy delivering computed payloads to our server. So far so good. The problem is that I have a cluster with N nodes/minions and M jobs to compute, where M > N. I would like to fire off all M pods at once and tell Kubernetes to schedule them in such a way that only N are running at a given time, with everything else kept in the Pending state. As soon as one pod is done, the next is scheduled to run, moving from Pending to Running, and so on until all M pods are done.
Is it possible to do so?
Yes: you can have them all ask for a resource of which there is only one per node; then the scheduler won't be able to schedule more than N at a time. The most common way to do this is to have each pod ask for a hostPort in the ports section of its container spec.
However, I can't say I'm completely sure why you would want to limit the system to one such pod per node. If there are enough resources available to run several at a time on each node, letting them run should speed up your job.
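A minimal sketch of the hostPort trick (the port number and names are arbitrary): because only one pod on a given node can bind a particular hostPort, the scheduler will never place two such pods on the same node, so at most N run at once and the rest stay Pending.

apiVersion: v1
kind: Pod
metadata:
  name: mc-worker-001            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: my-registry/montecarlo:latest   # placeholder image
    ports:
    - containerPort: 8080
      hostPort: 8080             # only one pod per node can claim this host port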
Just for the record: after the discussion with Alex, some trial and error, and a binary search for a good number, what worked for me was setting the CPU resource limit in the pod JSON to:
"resources": {
"limits": {
"cpu": "490m"
}
}
I have no idea how and why this particular value influences the Kubernetes scheduler, but it keeps nodes churning through the jobs, with exactly one pod per node running at any given moment.