I have many jobs running and pending. I would like to indicate the relative priority of jobs that I have submitted to the queue, that are pending, but not yet running. Is it possible to set this priority after submission? Is it possible to set this priority before submission?
You can move jobs that are pending with the btop command.
btop job_ID | "job_ID[index_list]" [position]
If you add [position], the job is moved to that position in the queue.
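For example, assuming a pending job with ID 123456 (a placeholder), the following would move it to the top of the queue (position 1):
btop 123456 1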
By default, LSF dispatches jobs in a queue in the order of their arrival (that is, first come, first served), subject to availability of suitable server hosts.
That said, the priority of the job itself is unchanged, so you will only be able to change the order among jobs that have the same priority.
Depending on your LSF version, see the following links for details about btop
LSF 10.1.0 > Command Reference > btop
LSF 9.1.3 > Command Reference > btop
Related
I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace - as soon as a Job's status transitions to 'succeeded 1', we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.
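For reference, this is roughly how the delay can be observed from the command line (a sketch only; the namespace and job name are placeholders), by comparing when a Job's Pod actually finished with when the Job's Complete condition was set:
kubectl get pods -n my-namespace -l job-name=my-job -o jsonpath='{.items[*].status.containerStatuses[*].state.terminated.finishedAt}'
kubectl get job my-job -n my-namespace -o jsonpath='{.status.conditions[?(@.type=="Complete")].lastTransitionTime}'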
I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi, take a look at this link: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it, but it seems possible to launch several Pods and have the Job finish once x Pods have completed. In your case, x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
- specify a non-zero positive value for .spec.completions.
- the Job represents the overall task, and is complete when there is one successful Pod for each value in the range 1 to .spec.completions.
- not implemented yet: each Pod is passed a different index in the range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
- do not specify .spec.completions; it defaults to .spec.parallelism.
- the Pods must coordinate amongst themselves or with an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
- each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
- when any Pod from the Job terminates with success, no new Pods are created.
- once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
- once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
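For your use case the work-queue form is the relevant one. A minimal sketch of such a Job (the name, image, and parallelism are placeholders, and the workers themselves still have to notice that a peer has succeeded and exit):
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: hash-search                          # placeholder name
spec:
  parallelism: 10                            # run up to 10 worker Pods at once
  # .spec.completions is left unset, giving "work queue" semantics:
  # the Job completes once at least one Pod succeeds and all Pods have terminated
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/hash-search:latest   # placeholder image
EOF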
You can also take a look at the single Job which starts a controller Pod pattern:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, while keeping complete control over what Pods are created and how work is assigned to them.
At the same time, take into consideration that by default a Job's completion status is only set when the specified number of successful completions is reached; this ensures that all tasks are processed properly. Marking the Job complete before all tasks have finished is not a safe solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is the official documentation: jobs-parallel-processing, parallel-jobs.
A useful blog post: article-parallel job.
EDIT:
Another option is to create a special script that continuously checks for the value you are looking for. A Job would then not be necessary; you could simply use a Deployment.
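A minimal sketch of that idea (the label selector and the "FOUND" marker in the worker logs are assumptions about how the workers report a result):
while true; do
  for pod in $(kubectl get pods -l app=hash-search -o name); do
    # a worker is assumed to print FOUND once it sees a hash below the threshold
    if kubectl logs "$pod" 2>/dev/null | grep -q "FOUND"; then
      echo "Result found in $pod"
      kubectl delete pods -l app=hash-search    # stop the remaining workers
      exit 0
    fi
  done
  sleep 5
done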
If one is running an array job on a slurm cluster, how can one restart a failed worker job?
In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate the job should be restarted if it fails--what is the Slurm equivalent of this flag?
You can use --requeue
#SBATCH --requeue ### On failure, requeue for another try
--requeue
Specifies that the batch job should be eligible for requeueing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher-priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
See more here: https://slurm.schedmd.com/sbatch.html#lbAE
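For an array job this could look roughly like the following minimal batch script (the array range and worker command are placeholders):
#!/bin/bash
#SBATCH --array=1-100          ### placeholder: 100 worker tasks
#SBATCH --requeue              ### make each task eligible for requeueing
./worker "$SLURM_ARRAY_TASK_ID"
If a single task has already failed, it can also be requeued by hand, for example scontrol requeue 123456_7 (placeholder job and task IDs), assuming your Slurm version accepts the jobid_taskid form.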
Is it possible to change already queued jobs in Windows HPC 2012?
I need to move some files from the head node before running another queued job to free space for that job.
I found this statement in Microsoft TechNet:
The order of the job queue is based on job priority level and submit time. Jobs with higher priority levels run before lower priority jobs. The job submit time determines the order within each priority level.
So, as my already queued jobs are all of "Normal" priority, I can set the priority of my move job higher than "Normal", such as "Highest", to get the job done.
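If you prefer to do this from the command line rather than HPC Job Manager, and assuming the HPC Pack PowerShell cmdlets are available, something along these lines should raise the priority of a queued job (the job ID is a placeholder):
Set-HpcJob -Id 42 -Priority Highest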
I have a bunch of jobs lined up for processing on a Sun Grid Engine queue, but I have just submitted a new job that I would like to prioritize. (The new job is 163981 in the left-most column.) Is there a command I can run to ask the server to process the 163981 job next, rather than the next job in the 140522 job array? I would be grateful for any advice others can offer on this question.
With admin/manager access you could use:
qalter -p <positive number up to 1024> <job id of the job you want to run sooner>
Without admin/manager access you could use:
qalter -p <negative number down to -1023> <job id of the other job you don't want to run next>
These may not work, depending on how long the lag is between when the older job was submitted and the current time, and on how much weight the administrator has put on waiting time.
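For example, using the job IDs from your question, and assuming you have the necessary access for positive values:
qalter -p 1024 163981
or, without manager access, push the older array job back instead:
qalter -p -1023 140522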
Another option without admin/manager access would be to put the job you don't want to run on hold.
qalter -h u <job id of the job you don't want to run now>
This will leave the job you want to run as the only one eligible. Once it has started running, you can remove the hold on the other job with
qalter -h U <job id>
Does changing the job share (-js option of qsub) accomplish what you want?
Assuming other jobs are running and in queue with -js value of 0 (default), submit new job with higher priority like so:
qsub -js 10 high_priority.sh
Source: http://www.lifesci.dundee.ac.uk/services/lsc/services/cluster/using-cluster