Get a list of all Slurm job state codes and the corresponding output from sacct

I would like a list of all job state codes from Slurm, their meaning, and the associated output of sacct -X -j job_id when run on a job in each of these states.
I cannot seem to find this in the Slurm documentation, and the codes in slurm/contribs/perlapi/libslurm/perl/lib/Slurm/Constant.pm don't quite match up with what sacct reports.
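For reference, the kind of query I am running looks like this (with <job_id> as a placeholder); --helpformat lists the output fields sacct can print:

sacct -X -j <job_id> --format=JobID,State,ExitCode
sacct --helpformat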

Related

snakemake --max-jobs-per-second parameter ignored

I am currently running Snakemake on my department's cluster (SGE). For this I use a profile template from a workshop to submit jobs to the scheduler and run my scripts within the different rules. The profile template is taken from this snakemake-gridengine repository.
However, I am running into an issue where Snakemake is not submitting the max number of jobs it should be able to the cluster at once.
snakemake --snakefile pop_split_imputation_pipeline.smk \
-j 1000 --max-status-checks-per-second 0.01 \
--profile ~/snakemake/profile -f --rerun-incomplete --use-conda
The command above is an example of how I submit a .smk pipeline, which in theory should be able to generate 1000 jobs per rule. However, on my cluster only 10-50 jobs are being submitted at any one time. In my config.yaml I have already set max-jobs-per-second: 1000, so it should be able to submit all of these jobs at once, yet it doesn't.
Can anyone point me to something that would improve the submission speed of these jobs?
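For reference, the same limit can also be passed directly on the command line; this is just a sketch of the invocation I would expect to be equivalent to the profile setting (the flag names are standard Snakemake options, not anything specific to my profile):

snakemake --snakefile pop_split_imputation_pipeline.smk \
  -j 1000 --max-jobs-per-second 1000 \
  --max-status-checks-per-second 0.01 \
  --profile ~/snakemake/profile -f --rerun-incomplete --use-conda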

How can we get a list of failed dataproc jobs and their start time using gcloud or python

How can we get a list of failed dataproc jobs and their start time using gcloud or python? I don't see much info about this in the documentation.
It's tricky to do exactly what you are asking for, but this command almost matches it:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)"
This will print out the Job UUID, final state, and start time for all jobs that are no longer running.
Where this falls short of what you asked is that the returned list includes all of failed, cancelled, and done jobs, rather than just the failed jobs.
The issue is that the Dataproc jobs list API supports filtering on job state, but only on the broad categories "ACTIVE" or "INACTIVE". The "INACTIVE" category includes jobs in the "ERROR" state, but also jobs that are "DONE" or "CANCELLED".
The simplest way I found to get exactly what you asked for is to pipe the output of that command through grep:
gcloud dataproc jobs list --filter="status.state=INACTIVE" --format="table(jobUuid,status.state,statusHistory[0].stateStartTime)" | grep ERROR
That will only list the failed jobs, but it is Unix-specific.
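If matching "ERROR" anywhere in the line feels too fragile, a slightly more targeted (but still Unix-specific) sketch is to emit csv and match only the state column; this assumes gcloud's csv[no-heading] output format:

gcloud dataproc jobs list --filter="status.state=INACTIVE" \
  --format="csv[no-heading](jobUuid,status.state,statusHistory[0].stateStartTime)" \
  | awk -F, '$2 == "ERROR"'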

Remove resource requirements of SGE jobs using qalter

So I have pending jobs in an SGE queue. The jobs were submitted with the parameters
-l h_rt=1200,h_vmem=16g
I know that I can use qalter to change the parameters.
qalter -l h_rt=1300 jid
But how do I remove these parameters?
I am not aware of a method to reset these values to their default values.
However, you can manually check what the defaults are and then apply those.
This will show the default values for queue all.q:
qconf -sq all.q
If you did not apply complex modifications to your SGE setup, the hard resource limits will likely be INFINITY. So you can reset the limits via
qalter -l h_rt=INFINITY jid
Keep in mind that the above command will also replace your other limits (e.g. h_vmem). If you wish to keep those other settings, you will have to re-specify them in the same qalter command.
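For example, to reset only h_rt while keeping the 16g memory request from the original submission, you would re-specify both limits in one call (a sketch using the values from the question):

qalter -l h_rt=INFINITY,h_vmem=16g jid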

How to find out if a K8s job failed or succeeded using kubectl?

I have a Kubernetes job that runs for some time, and I need to check if it failed or was successful.
I am checking this periodically:
kubectl describe job/myjob | grep "1 Succeeded"
This works, but I am concerned that a change in Kubernetes could break it; say the message is changed to "1 completed with success" (stupid text, but you know what I mean) and then my grep will no longer find what it is looking for.
Any suggestions? This is being done in a bash script.
You can get this information from the job using jsonpath filtering to select the .status.succeeded field of the job you are interested in; the command then returns only that value.
From kubectl explain job.status.succeeded:
The number of pods which reached phase Succeeded.
This command will get you that field for the particular job specified:
kubectl get job <jobname> -o jsonpath={.status.succeeded}
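In a bash script you can poll both counters; .status.failed is the analogous field for failed pods, and either field may be absent until the job has finished, hence the :-0 defaults. A rough sketch, with myjob standing in for your job name:

succeeded=$(kubectl get job myjob -o jsonpath='{.status.succeeded}')
failed=$(kubectl get job myjob -o jsonpath='{.status.failed}')
if [ "${succeeded:-0}" -ge 1 ]; then
  echo "job succeeded"
elif [ "${failed:-0}" -ge 1 ]; then
  echo "job failed"
else
  echo "job still running"
fi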

Prioritizing Jobs on Sun Grid Engine Queue

I have a bunch of jobs lined up for processing on a Sun Grid Engine queue, but I have just submitted a new job that I would like to prioritize. (The new job is 163981 in the left-most column.) Is there a command I can run to ask the server to process the 163981 job next, rather than the next job in the 140522 job array? I would be grateful for any advice others can offer on this question.
With admin/manager access you could use:
qalter -p <positive number up to 1024> <job id of the job you want to run sooner>
Without admin/manager access you could use:
qalter -p <negative number down to -1023> <job id of the other job you don't want to run next>
These may not work, depending on how long the older job has already been waiting and how much weight the administrator has given to waiting time in the scheduling policy.
Another option without admin/manager access would be to put the job you don't want to run on hold.
qalter -h u <job id of the job you don't want to run now>
This will leave the job you want to run as the only one eligible to start. Once it has started running, you can remove the hold on the other job with
qalter -h U <job id>
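With the job IDs from the question, that would look something like this (assuming 140522 is the pending job array you want to delay and 163981 is the job you want to start next):

qalter -h u 140522    # put a user hold on the older job array
qalter -h U 140522    # release the hold once job 163981 has started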
Does changing the job share (-js option of qsub) accomplish what you want?
Assuming the other jobs are running or queued with a -js value of 0 (the default), submit the new job with a higher priority like so:
qsub -js 10 high_priority.sh
Source: http://www.lifesci.dundee.ac.uk/services/lsc/services/cluster/using-cluster