Delay in Kubernetes Job status update when running many jobs in parallel - kubernetes

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace - as soon as a Job's status transitions to 'succeeded 1', we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.

Related

Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR; Yes, default behavior is to allow sharing but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working on notebooks OR you are submitting jobs to an existing, all-purpose cluster and it is NOT a job cluster with multiple spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is specific Spark term that represents an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?.
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc, you're going to run into the fact that Spark is going to partition your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stage with 8, 200, and 1 tasks respectively.
In a scenario like this FIFO scheduling, as is the default, will result in one of these applications blocking the other since the number of executors is completely overwhelmed by the number of tasks in just one stage.
In a FAIR scheduling mode, there will still be some blocking since the number of executors is small but some work will be done on each job since FAIR scheduling does a round-robin at the task level.
In Apache Spark, you have tighter control by creating different pools of the resources and submitting apps only to those pools where they have "isolated" resources. The "better" way of doing this is with Databricks Job clusters that have isolated compute dedicated to the application being ran.

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi look this link https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it but it seems possible to launch several pods and configure the end of the job when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions.
the Job represents the overall task, and is complete when there is
one successful Pod for each value in the range 1 to
.spec.completions
not implemented yet: Each Pod is passed a different index in the
range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
do not specify .spec.completions, default to .spec.parallelism
the Pods must coordinate amongst themselves or an external service to
determine what each should work on.
For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are
created
once at least one Pod has terminated with success and all Pods are
terminated, then the Job is completed with success
once any Pod has exited with success, no other Pod should still be
doing any work for this task or writing any output. They should all
be in the process of exiting
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
You can also take a look on single job which starts controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time take under consideration that completition status of Job set by dafault - when specified number of successful completions is reached it ensure that all tasks are processed properly. Applying this status before all tasks are finished is not secure solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is official documentations: jobs-parallel-processing , parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is that you can create special script which will continuously check values you look for. Using job then will not be necessary, you can simply use deployment.

Which kubernetes mode to chose

I have a situation where each message in a message queue has to be processed by a separate instance (one pod can process one message at a time). Many messages can be processed at once, but there is a limit of parallel executions. Once it's reached, no new messages are being pulled from the queue. Message processing takes about 30 minutes. No state needs to be stored on the pods between calls (all data is read from a database when pod starts processing a message). A new message should spawn a new pod, once the processing finishes, the pod should die.
Should I use Deployments, ReplicaSets, StatefulSets, Services? (we use Kubernetes with Azure) I guess the main
I've tried ReplicaSets, but in a situation when three messages are being processed and one finishes, scaling down a ReplicaSet can kill a working pod, which is definately not what I need.
I would say that since you do not need to handle state you must discard the StatefulSets, on the other hand, a Deployment is a higher-level concept of ReplicaSets, so, you should use a Deployment as it takes care of the replica set. Lastly, as your processing is under demand I would consider using Jobs, once the job completes the task it frees the resources and dies, this would require extra code to create the jobs based on a helper but could be very handy.

Queue mode in HPC Pack 2016 can scale up?

I'm working on a project where we need to execute a lot of jobs (say 60000 jobs) each time in HPC cluster.
From HPC documentation, i noticed HPC has 2 mode
- Queued mode: tart jobs in queue order, and attempt to allocate the maximum requested resources to running jobs.
- Balanced mode: Attempt to start all incoming jobs as soon as possible at their minimum resource requirements
https://learn.microsoft.com/en-us/powershell/high-performance-computing/understanding-policy-configuration?view=hpc16-ps
But i'm not sure about tolerance of this balance mode in HPC. Does it can scale like other queue service like SQS in AWS or Queue Storage in Azure?
Queue mode and balance mode are referring to the scheduling policy. There is no real queue included.
Basically, in Queue mode, your jobs run in a FIFO manner. If the next job needs more cores then currently available, it waits despite the fact that there are available resources. While in Balance mode, HPC Pack tries to run as many jobs as possible at the same time and makes the best effort to ensure they use the same amount of resources., to maximize the resource usage.
Both two policies will not effect scale target of the cluster.

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP
GKE
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can’t pull anymore messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples of using a yaml file however we would probably want the node which did the portioning of work to create the job as it knows how many parallel pods should be run. Would it be recommended to use the python sdk to create the job spec programatically? Or if jobs are long lived would you simply hit the k8 api and modify the parallel pods required then re-run job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are be assigned a specific task, i.e. to process a single queue item.
However, if you want to process multiple items in a work queue with a single instance then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items in the queue, scaling the number of pool workers dependent on the number of items in the queue. If there are no work items remaining then you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
To create and control your workloads in Kubernetes the best-practice would be to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl using the SDK simplifies configuration and error handling, as well as allowing for simplified introspection of resources in the cluster as well.