Kubernetes Job should use all available resources

I run a Kubernetes job with thousands of workers, following the pattern described in Coarse Parallel Processing Using a Work Queue. I define the job programmatically with the Python client for the Kubernetes API. The cluster does not scale automatically, and the resources available are unknown at the time the job is defined.
The goal is to use all available cluster resources for my job. I have tried to optimise the .spec.parallelism setting. If I set .spec.parallelism and .spec.completions to the same value, all pods for the job are started at the beginning, but most of them cannot be scheduled due to resource requirements (e.g. insufficient CPU). As the first pods finish, resources are freed and more pods are scheduled. But after some time (2.4 hours on my cluster) Kubernetes gives up scheduling the remaining pods and marks them as failed, which eventually causes the whole job to fail.
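For reference, this is the shape of the Job spec I am generating with the Python client (a minimal sketch; the name, image, and counts are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: work-queue-job          # placeholder name
spec:
  completions: 1000             # total number of work items (example value)
  parallelism: 1000             # equal to completions: all pods are created up front
  template:
    spec:
      containers:
      - name: worker
        image: example.com/queue-worker:latest   # placeholder image
        resources:
          requests:
            cpu: "1"            # per-pod request; the cluster cannot fit all pods at once
      restartPolicy: OnFailure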
Is there a pattern for a job on a Kubernetes cluster to use all available resources?

Related

How to autoscale a pod that's pulling tasks from a queue

I've tried a few approaches to this. The docs (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/) suggest there is a way of getting autoscaling to deal with queues without an external solution, but they don't explain how.
I created a deployment which deploys pods that pull from a Redis queue (there is a Redis service in the cluster). I want to create a system where pods are scaled horizontally to pull tasks from the queue and execute them. Executing a task can take an unpredictable and variable amount of time.
If pod A pulls a task from the queue and is busy, I want to spin up pod B to pull the next task. At the moment I am using polling, so if the queue is empty the pod in question just keeps trying to pull from it.
I've used horizontal pod autoscaling, which at least scales out when pod 1 is working. But because pod 2, once running, doesn't decrease the average utilization, the autoscaler just keeps spinning up new pods up to the maximum. For my use case this is semi-fine: if the queue is empty, any pods hitting an empty queue drag the utilization percentage down, and in theory, once the queue is empty, the excess pods will all spin down. But it doesn't feel very efficient, and the real problem is that the autoscaler will scale down pods that are in the middle of running jobs.
I've looked at using the newer metrics API, but it seems I'll need to create a custom metrics API to implement this, which seems extreme for such a simple use case.
I've also looked at using Jobs, but they don't seem to accommodate autoscaling at all?
I really want to be able to scale down based on the CPU utilization of the specific pod that's about to be scaled down, rather than on an average across all the pods.
The HorizontalPodAutoscaler always scales based on the average utilization across all available pods. I'd say Jobs are the most suitable fit for your use case: a queue with one pod per work item is one of the example use cases for Kubernetes Jobs in the official documentation.
You could also look at KEDA, which is a framework for event-driven autoscaling.
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.
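For illustration, here is a sketch of a KEDA ScaledObject that scales a Deployment based on the length of a Redis list; the Deployment name, Redis address, and list name are assumptions about your setup:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler                # hypothetical name
spec:
  scaleTargetRef:
    name: queue-worker                     # your Deployment (assumed name)
  minReplicaCount: 0                       # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc.cluster.local:6379   # in-cluster Redis service (assumed)
      listName: tasks                      # Redis list the workers pull from (assumed)
      listLength: "5"                      # target list length per replica

Because KEDA scales on queue length rather than CPU, idle workers do not keep the replica count inflated the way a CPU-based HPA can.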

Kubernetes scheduling ignores pod count per worker node

We have a Kubernetes cluster with three worker nodes, built manually, borrowing from the 'Kubernetes, the hard way' tutorial.
Everything on this cluster works as expected, with one exception:
The scheduler does not (or seems not to) honor the limit of 110 pods per worker node.
Example:
Worker Node 1: 60 pods
Worker Node 2: 100 pods
Worker Node 3: 110 pods
When I want to deploy a new pod, it often happens that the scheduler decides it would be best to schedule the new pod onto Worker Node 3. The kubelet refuses to run it, since it does honor its 110-pod limit. The scheduler tries again and again and never succeeds.
I do not understand why this is happening. I think I might be missing some detail about this problem.
From my understanding and from what I have read about the scheduler itself, there is no resource or metric for 'number of pods per node' that is considered while scheduling, or at least I haven't found anything suggesting otherwise in the Kubernetes Scheduler documentation. Of course the scheduler considers CPU requests/limits, memory requests/limits, and disk requests/limits; that is all fine and working. So I don't even know how the scheduler could consider the number of pods on a worker, but there has to be some kind of functionality doing that, right? Or am I mistaken?
Is my cluster broken? Is there some misconception I have about how scheduling should/does work?
Kubernetes binary versions: v1.17.2
Usually this means the other nodes are unsuitable, either explicitly (via taints and the like) or, more often, because of things like resource request space.
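Since the cluster was built by hand, it is also worth checking that each kubelet actually advertises its pod capacity: the scheduler counts pods against the node's .status.allocatable.pods, which the kubelet derives from its maxPods setting. A minimal sketch of the relevant KubeletConfiguration fragment, assuming the kubelet is configured via a config file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110    # reported as the node's pod capacity and allocatable

If a node's allocatable pod count is missing or wrong, the scheduler has nothing to check against, which could explain the behaviour you are seeing.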

Is it possible to schedule a pod to run for, say, 24 hours and then remove the deployment/statefulset, or do I need to use jobs?

We have a bunch of pods running in a dev environment. The pods are auto-provisioned by an application on every business action. The problem is that, across various namespaces, they accumulate and eat up the available resources in EKS.
Is there a way, without Jenkins or Kubernetes jobs, to simply put some parameter in the pod manifest telling the pod to self-destruct after, say, 24 hours?
Add to your pod.spec:
activeDeadlineSeconds: 86400
After the deadline, your Pod will be stopped for good with the status DeadlineExceeded.
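For example, in a complete (minimal, hypothetical) Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: short-lived-pod            # hypothetical name
spec:
  activeDeadlineSeconds: 86400     # 24 hours; the pod is terminated after this
  restartPolicy: Never
  containers:
  - name: app
    image: example.com/dev-app:latest   # placeholder image

Note that this works for pods created directly or through a Job; as far as I know, Deployments and ReplicaSets reject activeDeadlineSeconds in their pod template.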
If I understood your situation properly, you would like to scale your cluster down in order to save resources.
Kubernetes has the ability to autoscale your application in a cluster: it can start additional pods when the load increases and terminate excess pods when the load decreases.
It is possible to downscale the application to zero pods, but in that case the first request will be delayed while a pod starts up.
This functionality relies on performance metrics. In practice this means that autoscaling does not happen instantly, because it takes some time for the performance metrics to reach the configured threshold.
The Kubernetes feature in question, the HPA (Horizontal Pod Autoscaler), is described in this document.
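As a sketch, a minimal HPA manifest using the autoscaling/v2 API (the Deployment name and thresholds are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                  # assumed Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of requests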
If you are running your cluster on GCP or GKE, you can go further and automatically start additional nodes for your cluster when you need more computing capacity, and shut nodes down when they no longer run application pods.
More information about this functionality can be found following the link.
Last but not least, you can use a tool like Ansible to manage all your Kubernetes assets (it can create and manage deployments via playbooks).
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

Kubernetes batch performance with activation of thousands of pods using jobs

I am writing a pipeline with Kubernetes on Google Cloud.
I sometimes need to start a few pods within a second, where each task runs inside its own pod.
I plan to call kubectl run with a Kubernetes Job, wait for it to complete (polling all running pods every second), and then activate the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I can run tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
I am running a batch operation, so I can tolerate the delay.
Is it the best approach? How long does it take to create a pod in Kubernetes?
If you want to run tens of thousands of jobs at the same time, you will definitely need to plan resource allocation. Estimate the number of nodes you need; then you can either create all the nodes at once or use the GKE cluster autoscaler to add nodes automatically in response to resource demand. If you preallocate all nodes at once, you will probably have a high bill at the end of the month, but pods can be created very quickly. If you start with only a small number of nodes and rely on the cluster autoscaler, you will face large delays, because nodes take several minutes to start. You must decide which trade-off suits you.
If you use the cluster autoscaler, do not forget to specify the maximum number of nodes for the cluster.
Another important thing: put your jobs into the Guaranteed quality-of-service class in Kubernetes. Otherwise, if you use BestEffort or Burstable pods, you will end up with an eviction nightmare, which is really terrible and uncontrolled.
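A pod lands in the Guaranteed QoS class when every container sets resource limits and its requests equal its limits. A minimal sketch (image and values are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-task            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: example.com/task:latest      # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m                  # requests equal to limits => Guaranteed QoS
        memory: 512Mi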

Kubernetes automatic shutdown after some idle time

Do Kubernetes or Helm support shutting down pods that have been idle for more than a given threshold time?
This would be very useful in a development environment, to free up room for other processes and save cost.
Kubernetes has the ability to autoscale your application in a cluster: it can start additional pods when the load increases and terminate excess pods when the load decreases.
It is possible to downscale the application to zero pods, but in that case the first request will be delayed while a pod starts up.
This functionality relies on performance metrics provided by the Heapster application (since superseded by metrics-server), which must be running in the cluster. In practice this means that autoscaling does not happen instantly, because it takes some time for the performance metrics to reach the configured threshold.
The Kubernetes feature in question, the HPA (Horizontal Pod Autoscaler), is described in this document.
If you are running your cluster on GCP or GKE, you can go further and automatically start additional nodes for your cluster when you need more computing capacity, and shut nodes down when they no longer run application pods.
More information about this functionality can be found following the link.
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics