Slurm error: "slurmstepd: error: no task list created!"

I'm attempting to run a simple job on Slurm but am getting a cryptic error message:
slurmstepd: error: no task list created!
I've run thousands of other jobs identical to the job I'm running here (with different parameters), but only this one run has yielded this error. Does anyone know what might cause this error? Any help would be appreciated!
Here's my full job file:
#!/bin/bash
#SBATCH --job-name=candidates
#SBATCH --output=logs/candidates.%A.%a.log
#SBATCH --error=logs/candidates-error.%A.%a.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --time=1-00:00:00
#SBATCH --mem=40g
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=douglas.duhaime@gmail.com
bin/candidates numbers numbers numbers
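One way to narrow this down is to compare the accounting record of the failing array task with one that succeeded; a minimal sketch using sacct (the job ID 123456 and the task indices are placeholders):
# Show state, exit code, node, and peak memory for the task that failed
sacct -j 123456_42 --format=JobID,State,ExitCode,NodeList,MaxRSS
# Compare against a task that completed normally
sacct -j 123456_41 --format=JobID,State,ExitCode,NodeList,MaxRSS
If the scavenge partition preempts jobs, it may also be worth checking whether the failing task was requeued or preempted before the error appeared.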

Related

How can I know whether I am running MAKER with MPI (MPICH)?

I am running MAKER to annotate a genome, but I have never run MAKER with MPI before, so it is very slow: a 300 Mb genome has been running for 22 days and is still going.
I re-read the MAKER manual, which says it supports MPI, so I installed MPI accordingly.
Now the question is: how do I know whether it is actually running with MPI?
The following is my Slurm script:
#!/bin/bash
#SBATCH -J MPI-test
#SBATCH -p xhacnormala
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --mem 0
#SBATCH -o MPI.out
#SBATCH -e MPI.err
source /public1/home/casdao/kdylvfenghua/kdy_sy2021/maker_env
module load mpi/mpich/3.3.2/gcc-485
mpiexec -n 64 maker
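As a sanity check, it may help to first run a trivial command under the same allocation and confirm that the 64 ranks are spread across all 4 nodes; a minimal sketch (hostname stands in for the real workload):
# Each of the 4 nodes should appear 16 times if the ranks are distributed
mpiexec -n 64 hostname | sort | uniq -c
If only a single hostname shows up, or far fewer than 64 lines appear in total, the launch is not using all the requested ranks and the MPI module or mpiexec invocation needs another look.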

How to save/record SLURM script's config parameters to the output file?

I'm new to HPC and SLURM in particular. Here is an example code that I use to run my python script:
#!/bin/bash
# Slurm submission script, serial job
#SBATCH --time 48:00:00
#SBATCH --mem 0
#SBATCH --mail-type ALL
#SBATCH --partition gpu_v100
#SBATCH --gres gpu:4
#SBATCH --nodes 4
#SBATCH --ntasks-per-node=4
#SBATCH --output R-%x.%j.out
#SBATCH --error R-%x.%j.err
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
module load python3-DL/torch/1.6.0-cuda10.1
srun python3 contrastive_module.py \
--gpus 4 \
--max_epochs 1024 \
--batch_size 256 \
--num_nodes 4 \
--num_workers 8
Every time I run this script with sbatch run.sl, it generates a .err and a .out file, and the only things I can encode into their names are the script name ("run.sl") and the job ID. How can I save a copy of all the parameters I set in the script above, both the Slurm configuration and the Python arguments, tied to the job ID and the generated .out and .err files?
For example, if I run the script above four times in a row, each time with different parameters, it is not clear from those files which output corresponds to which run unless I manually keep track of the parameters and job IDs. There should be some way to automate this in SLURM, no?
You add the following two lines at the end of your submission script:
scontrol show job $SLURM_JOB_ID
scontrol write batch_script $SLURM_JOB_ID -
This will write the job description and the job submission script at the end of the .out file.
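If you would rather keep the record in separate files tied to the job ID instead of appending it to the .out file, scontrol can also write to named files; a minimal sketch (the file name pattern is just an example):
# Dump the full job description (partition, nodes, GRES, ...) keyed by job ID
scontrol show job $SLURM_JOB_ID > job_${SLURM_JOB_ID}.params
# Save an exact copy of the submitted batch script, including the Python arguments
scontrol write batch_script $SLURM_JOB_ID job_${SLURM_JOB_ID}.sl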

How do I control LSF to send an e-mail at the start/completion/abort of my job?

I see that PBS has -m b|e|a options, but I did not see any equivalent in LSF besides the -u option. How do I get LSF to send an e-mail at the start/completion/abort of my job? If I just use '-u', does it send an e-mail only at the completion of the job?
Maybe I have to use -B (when the job begins) and -N (when the job completes) together with the -u option?
BSUB -B -N -u my_email@biglab.com
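If it helps, a minimal sketch of that combination, either on the command line or as directives in the job script (the script name and address are placeholders):
# Command line: -B mails when the job is dispatched, -N mails a report when it finishes
bsub -B -N -u my_email@biglab.com < my_job.sh
# Equivalent directives inside my_job.sh
#BSUB -B
#BSUB -N
#BSUB -u my_email@biglab.com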

Forcing LSF to execute jobs on different hosts

I have a setup consisting of 3 workers and a management node, which I use for submitting tasks. I would like to execute a setup script concurrently on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" mpirun setup.sh
As far as I understand, I could use the 'ptile' resource constraint to force execution on all workers:
bsub -q queue -n 3 -m 'h0 h1 h2' -J "%J_%I" -R 'span[ptile=1]' mpirun setup.sh
However, occasionally my script gets executed several times on the same worker.
Is this expected behavior, or is there a bug in my setup? Is there a better way to enforce multi-worker execution?
Your understanding of span[ptile=1] is correct. LSF will only use 1 core per host for your job. If there aren't enough hosts to satisfy -n, the job will pend until something frees up.
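If the job does pend for that reason, the long-format listing shows why (the job ID is a placeholder):
# The long format includes a PENDING REASONS section
bjobs -l 12345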
However, occasionally my script gets executed several times on the same worker.
I suspect it's something in your script. For example, LSF appends to the stdout file by default; use -oo to overwrite it instead.
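For example, keeping one output file per job and truncating it on each run could look something like this (the file name pattern is just an illustration):
# -oo truncates the output file instead of appending to it
bsub -q queue -n 3 -m 'h0 h1 h2' -R 'span[ptile=1]' -J "%J_%I" -oo 'setup.%J.out' mpirun setup.sh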

Upstart script: shell arithmetic in script stanza producing incorrect values; equivalent /bin/sh script works

I have an Upstart init script, but my dev/testing/production machines have different numbers of CPUs/cores. I'd like to compute the number of worker processes as 4 * the number of cores within the init script.
The Upstart docs say that script stanzas use /bin/sh syntax.
I created a /bin/sh script to see what was going on, and I'm getting drastically different results from my Upstart script.
script stanza from my upstart script:
script
# get the number of cores
CORES=`lscpu | grep -v '#' | wc -l`
# set the number of worker processes to 4 * num cores
WORKERS=$(($CORES * 4))
echo exec gunicorn -b localhost:8000 --workers $WORKERS tutalk_site.wsgi > tmp/gunicorn.txt
end script
which outputs:
exec gunicorn -b localhost:8000 --workers 76 tutalk_site.wsgi
my equivalent /bin/sh script
#!/bin/sh
CORES=`lscpu -p | grep -v '#' | wc -l`
WORKERS=$(($CORES * 4))
echo exec gunicorn -b localhost:8000 --workers $WORKERS tutalk_site.wsgi
which outputs:
exec gunicorn -b localhost:8000 --workers 8 tutalk_site.wsgi
I'm hoping this is a rather simple problem and a few other pairs of eyes will locate the issue.
Any help would be appreciated.
I suppose I should have answered this several days ago. I first attempted using environment variables instead but didn't have any luck.
I solved the issue by replacing the computation with a Python one-liner:
WORKERS=$(python -c "import os; print os.sysconf('SC_NPROCESSORS_ONLN') * 2")
and that worked out just fine.
I'm still curious why my Bourne-shell script came up with the correct value while the Upstart script, whose docs say it uses Bourne-shell syntax, didn't.
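For reference, a shell-only way to get the core count without parsing lscpu at all (assuming getconf, or alternatively nproc, is available on the hosts):
# Online processor count straight from the system, times 4 as originally intended
CORES=$(getconf _NPROCESSORS_ONLN)
WORKERS=$((CORES * 4))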