I have a bunch of jobs lined up for processing on a Sun Grid Engine queue, but I have just submitted a new job that I would like to prioritize. (The new job is 163981 in the left-most column.) Is there a command I can run to ask the server to process the 163981 job next, rather than the next job in the 140522 job array? I would be grateful for any advice others can offer on this question.
With admin/manager access you could use:
qalter -p <positive number up to 1024> <job id of the job you want to run sooner>
Without admin/manager access you could use:
qalter -p <negative number down to -1023> <job id of the other job you don't want to run next>
These may not work, depending on how long ago the older job was submitted and how much weight the administrator has put on waiting time.
Another option without admin/manager access would be to put the job you don't want to run on hold.
qalter -h u <job id of the job you don't want to run now>
This will make the job you want to run be the only one eligible. Once it has started running you can remove the hold on the other job with
qalter -h U <job id>
Does changing the job share (-js option of qsub) accomplish what you want?
Assuming other jobs are running and in queue with -js value of 0 (default), submit new job with higher priority like so:
qsub -js 10 high_priority.sh
Source: http://www.lifesci.dundee.ac.uk/services/lsc/services/cluster/using-cluster
I'm training several neural networks on a server at my university. Because resources are shared among all the students, a job scheduling system (Slurm) queues everyone's runs, and we are only allowed to run our commands with a time limit (24h). Once a run exceeds this time, the process is killed to free resources for the others.
Specifically, I'm training GANs and I need more than 24h of training time.
Right now, I'm saving the checkpoints of my model to restart from the same training point before the process closure. But, I must execute the same command again after 24h.
For this reason I would like to schedule this execution every 24h automatically.
Currently I'm using 'tmux' to execute the command and be able to close the terminal.
Any suggestions on how to automate this kind of execution?
Thank you in advance!
You can set up your job to automatically resubmit itself when it's close to the time limit.
Note that Slurm's time granularity is 1 minute, so don't set the signal timer to anything less than 60 seconds.
#!/bin/bash
#SBATCH --signal=B:USR1@300 # Tell Slurm to send signal USR1 to the batch shell 300 seconds before the time limit
#SBATCH -t 24:00:00
resubmit() {
echo "It's time to resubmit"; # <----- Run whatever is necessary. Ideally resubmit the job using the checkpointing files
sbatch ...
}
trap "resubmit" USR1 # Register signal handler
YOUR_TRAINING_COMMAND & # It's important to run this in the background; otherwise bash will not process the signal until the command finishes
wait # wait until all the background processes are finished. If a signal is received this will stop, process the signal and finish the script.
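The resubmission only helps if the training command picks up where the previous run stopped. As a rough sketch of that side (not the asker's actual GAN code), a training script could record the last finished epoch in a small checkpoint file and resume from it on the next run. The function names, the JSON format and the per-run epoch budget are all assumptions made for illustration; a real script would also save model and optimizer state.

```python
import json
import os
import tempfile


def load_epoch(path):
    """Return the next epoch to run, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["epoch"] + 1


def save_epoch(path, epoch):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch}, f)
    os.replace(tmp, path)


def train(path, total_epochs, max_epochs_this_run):
    """Run up to max_epochs_this_run epochs, resuming from the checkpoint."""
    start = load_epoch(path)
    end = min(total_epochs, start + max_epochs_this_run)
    for epoch in range(start, end):
        # Placeholder for one real training epoch (hypothetical).
        save_epoch(path, epoch)
    return end  # next epoch to run; equals total_epochs when training is done
```

Each resubmitted job would call train() with the same checkpoint path, so every run continues from the last saved epoch rather than from zero.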
I'm not sure I understand how result_expires works.
I read,
result_expires
Default: Expire after 1 day.
Time (in seconds, or a timedelta object) for when after stored task tombstones will be deleted.
A built-in periodic task will delete the results after this time (celery.backend_cleanup), assuming that celery beat is enabled. The task runs daily at 4am.
...
When using the database backend, celery beat must be running for the results to be expired.
(from here: http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-result_expires)
So, in order for this to work, I have to actually do something like this:
python -m celery -A myapp beat -l info --detach
?
Is that what the documentation is referring to by "celery beat is enabled"? Or, rather than executing this manually, there is some configuration that needs to be set which would cause celery beat to be called automatically?
Re: celery beat: you are correct. If you use a database backend, you have to run celery beat as shown in your question. By default celery beat sets up a daily task that deletes older results from the results database. If you are using a redis results backend, you do not have to run celery beat. How you choose to run celery beat is up to you; personally, we do it via systemd.
If you want to configure the default expiration time to be something other than the default 1 day, you can use the result_expires setting in celery to set the number of seconds after a result is recorded that it should be deleted. e.g., 1800 for 30 minutes.
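As a minimal sketch of that setting, a Celery configuration module could express the expiry as a timedelta instead of raw seconds. The module name celeryconfig.py and the 30-minute value are assumptions for illustration; only the result_expires setting itself comes from the documentation quoted in the question.

```python
# celeryconfig.py (hypothetical module name), loaded with
# app.config_from_object("celeryconfig")
from datetime import timedelta

# Expire stored results 30 minutes after they are recorded.
# Celery accepts either a number of seconds or a timedelta here.
result_expires = timedelta(minutes=30)

# The same value in seconds, matching the 1800 mentioned above.
result_expires_seconds = int(result_expires.total_seconds())
```

With a database backend, the cleanup still only happens while celery beat is running.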
I would like a job newly submitted with bsub to start running immediately instead of going to PEND.
If possible I would like to limit this to N jobs.
If you want two jobs to run in parallel, it's probably best to submit a single parallel job (bsub -n) that runs two different processes, potentially on two different hosts.
The LSF admin can force a PENDing job to run with the brun command. However, this will cause the execution host to be temporarily overloaded.
I have many jobs running and pending. I would like to indicate the relative priority of jobs that I have submitted to the queue, that are pending, but not yet running. Is it possible to set this priority after submission? Is it possible to set this priority before submission?
You can move jobs that are pending with the btop command.
btop job_ID | "job_ID[index_list]" [position]
If you add [position] it means that the job will be put at that place in the queue.
By default, LSF dispatches jobs in a queue in the order of their
arrival (that is, first come, first served), subject to availability
of suitable server hosts.
Having said this, the priority of the job is unchanged. So you will only be able to change the order if the jobs have the same priority.
Depending on your LSF version, see the following links for details about btop
LSF 10.1.0 > Command Reference > btop
LSF 9.1.3 > Command Reference > btop
A job has been submitted and there is an entry for it in dba_jobs, but the job never reaches the running state, so there is no entry for it in dba_jobs_running. The parameter 'JOB_QUEUE_PROCESS' has the value 10 and there are no jobs in the running state. Please suggest how to solve this problem.
SELECT NEXT_DATE, NEXT_SEC, BROKEN, FAILURES, WHAT
FROM DBA_JOBS
WHERE JOB = :JOB_ID
What's that return? A BROKEN job won't kick off, and if the NEXT_DATE/NEXT_SEC is in the past, it won't kick off either.
I hope you spelled that database parameter correctly, i.e. 'JOB_QUEUE_PROCESSES=10' (note the trailing "ES").
This is typically why a job won't run.
Also check that the user/schema that is running the job is correct too.
An alternative is to use a different scheduling tool to run the job (e.g. cron on Linux).