How to configure Slurm to email the output file?

I'm a newbie to Slurm and I'm trying to configure my batch script so that, in case a job fails, the notification email contains the corresponding standard output. I've managed to configure email notifications, but how can I make the body of the email contain the standard output?
#!/bin/bash
#SBATCH -n 2 # two cores
#SBATCH --mem=3G
#SBATCH --time=48:00:00 # total run time limit (HH:MM:SS)
#SBATCH --mail-user=rylansch
#SBATCH --mail-type=FAIL
export PYTHONPATH=.
python -u model_train.py # -u flushes output buffer immediately
I don't see an answer in "How to configure the content of slurm notification emails?" or "How to let SBATCH send stdout via email?".

See my solution here
#!/bin/bash
#SBATCH -J MyModel
#SBATCH -n 1                    # Number of cores
#SBATCH -t 1-00:00              # Runtime in D-HH:MM
#SBATCH -o JOB%j.out            # File to which STDOUT will be written
#SBATCH -e JOB%j.out            # File to which STDERR will be written
#SBATCH --mail-type=BEGIN
#SBATCH --mail-user=my@email.com

# Convert a number of seconds to H:M:S
secs_to_human(){
    echo "$(( ${1} / 3600 )):$(( (${1} / 60) % 60 )):$(( ${1} % 60 ))"
}

start=$(date +%s)
echo "$(date -d @${start} "+%Y-%m-%d %H:%M:%S"): ${SLURM_JOB_NAME} start id=${SLURM_JOB_ID}"

### exec task here
( << replace with your task here >> ) \
&& (cat JOB${SLURM_JOB_ID}.out | mail -s "${SLURM_JOB_NAME} Ended after $(secs_to_human $(($(date +%s) - ${start}))) id=${SLURM_JOB_ID}" my@email.com && echo mail sent) \
|| (code=$?; cat JOB${SLURM_JOB_ID}.out | mail -s "${SLURM_JOB_NAME} Failed after $(secs_to_human $(($(date +%s) - ${start}))) id=${SLURM_JOB_ID}" my@email.com && echo mail sent; exit $code)
You can also edit this to send separate stdout/stderr logs or attach them as files.
This snippet is also shared on GitHub Gists.

As a regular user, you do not get to choose the contents of the email sent to you; only the administrators can do that.
But you can add a command at the end of your submission script to send the email yourself, as explained here.
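For example, a minimal sketch of that approach (the address is a placeholder, and it assumes a mail command is available on the compute node):
#!/bin/bash
#SBATCH -o slurm-%j.out            # capture stdout/stderr in one file
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=me@example.com # hypothetical address

python -u model_train.py
status=$?

# On failure, mail yourself the captured output before exiting
if [ $status -ne 0 ]; then
    mail -s "Job ${SLURM_JOB_ID} failed" me@example.com < slurm-${SLURM_JOB_ID}.out
fi
exit $status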

Related

I cannot submit a job on HPC (QSUB)

For the past 2 months I have been trying to find out why I cannot submit a job on our HPC (using QSUB). Recently I found out that my home directory is
/export/home/wrfuser
while my co-workers' home directories are
/home/wrfuser1
(note the /export prefix). I can submit a job, but it never shows a result. Here's my sample hello.qsub:
#!/bin/bash --login
#PBS -j oe
#PBS -l walltime=00:01:00,nodes=1,ppn=1,mem=50mb
export WORKDIR=/mnt/NFS003/WRF/WRF_hist/qsub_test
cd ${WORKDIR}
echo "HELLO WORLD"
[wrfuser@HPC qsub_test]$ vi hello.qsub
[wrfuser@HPC qsub_test]$ qsub hello.qsub
Your job 7618 ("hello.qsub") has been submitted
[wrfuser@HPC qsub_test]$ qstat
job-ID  prior    name        user     state  submit/start at       queue  slots  ja-task-ID
7617    0.55500  hello.qsub  wrfuser  Eqw    04/06/2018 10:21:35          1
7618    0.55500  hello.qsub  wrfuser  Eqw    04/06/2018 10:35:15          1
[wrfuser@HPC qsub_test]$
If it's not possible to do that on /export/home, is there any other way to submit a job on the HPC?
I solved it! I changed my qsub script to:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 64
echo "HELLO JOHN"
mkdir Hello_world
[wrfuser@CADHPC01 run]$
I was using number of nodes, ppn, and memory in my previous script, and I changed it to a number of cores with #$ -pe orte 64. However, I'm not 100% sure that was the main reason for the error.
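As an aside: when a job sits in the Eqw state, the scheduler records the reason, and (assuming a Grid Engine-style scheduler, which the qstat output above suggests) you can read it directly with:
qstat -j 7617 | grep -i error
with 7617 replaced by the stuck job's ID.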
I am a newbie here on Stack Overflow and it feels like I will learn and enjoy exponentially here! Thanks! :D

SLURM Job Array output file in command

I have a command list like this
bedtools intersect -a BED1 -b BED2 >BED1_BED2_overlaps.txt
...
with over 100 files.
Here is the header of my job submission script:
#SBATCH -t 0-08:00
#SBATCH --job-name=JACCARD_DNase
#SBATCH -o /oasis/scratch/XXX/XXX/temp_project/logs/JACCARD_DNase_%a_out
#SBATCH -e /oasis/scratch/XXX/XXX/temp_project/logs/JACCARD_DNase_%a_err
#SBATCH --array=1-406%50
When I submit the job, I get this error:
Error: Unable to open file >BED1_BED2_overlaps.txt Exiting.
I tried to pipe an echo command like this
bedtools intersect -a BED1 -b BED2 | echo "BED1 BED2"
And I got
Error: Unable to open file |. Exiting.
So what gives? How can I submit array jobs with Bash syntax like > output and | pipes?
It looks like you are missing the shebang; your submission script should start with
#!/bin/bash
or any other shell you like.
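To make it concrete, here is a minimal sketch of the fixed array script; the BED file naming scheme is hypothetical, but once a shell interprets the script, > redirection and | pipes behave as usual:
#!/bin/bash
#SBATCH -t 0-08:00
#SBATCH --job-name=JACCARD_DNase
#SBATCH --array=1-406%50

# Hypothetical naming scheme: pick this task's input pair by array index
A=BED_A_${SLURM_ARRAY_TASK_ID}.bed
B=BED_B_${SLURM_ARRAY_TASK_ID}.bed

# bash, not Slurm, now parses the redirection
bedtools intersect -a "$A" -b "$B" > "overlaps_${SLURM_ARRAY_TASK_ID}.txt"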

Log bjobs info immediately after bsub

I'm looking for a way to log information about a submitted job to a file immediately after the job starts.
Normally the job status is appended to the log file after the job has completed, but I'd like the information available as soon as it starts.
I know there's the -B flag but I want it in a file, and I could also do something like:
bsub -J jobby -o run_job.log bjobs -l -J jobby > jobby.log; run_job
but maybe someone knows of a funkier way of doing this.
There are some subtle variations that essentially accomplish the same thing:
You can use a pre-exec to do a similar thing instead of running bjobs as part of the command:
bsub -J jobby -E "bjobs -l -J jobby > jobby.log" run_job
You can use the job's environment to get your own job ID instead of using -J, if you write your submission as a script:
#!/bin/sh
#BSUB -o run_job.log
# LSF sets $LSB_JOBID in the job's environment
bjobs -l $LSB_JOBID > $LSB_JOBID.log
run_job
Then submit your job like this:
bsub < jobscript.sh
You can do some combination of the above: use $LSB_JOBID in a pre-execution script.
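For instance, a minimal sketch of that combination (note the single quotes, so $LSB_JOBID is expanded in the job's environment on the execution host rather than at submission time):
bsub -E 'bjobs -l $LSB_JOBID > $LSB_JOBID.log' -o run_job.log run_job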
That's about as 'funky' as it gets AFAIK :)

Torque qsub -o option does not work

I made a test script test.qsub:
#!/bin/bash
#PBS -q batch
#PBS -o output.txt
#PBS -e Error.err
echo "hello world"
When I run qsub test.qsub, it generates neither the output.txt file nor the Error.err file. I believe the other options do not work either; I'd appreciate your help! It is said that you should configure torque.cfg, but in my installation that file was not generated, and it is not in /var/spool/torque.
Try "#PBS -k oe". This directs pbs to keep stdout and stderr.

LSF - Get ID of submitted job

Say I submit a job using something like bsub pwd. Now I would like to get the job ID of that job in order to build a dependency for the next job. Is there some way I can get bsub to return the job ID?
Nils and Andrey have the answers to this specific question for shell and C/C++ environments respectively. For the purposes of building dependencies, you can also name your job with -J, then build the dependency based on the job name:
bsub -J "job1" <cmd1>
bsub -J "job2" <cmd2>
bsub -w "done(job1) && done(job2)" <cmd>
There's a bit more info here.
This also works with job arrays:
bsub -J "ArrayA[1-10]" <cmd1>
bsub -J "ArrayB[1-10]" <cmd2>
bsub -w "done(ArrayA[3]) && done(ArrayB[5])" <cmd>
You can even do element-by-element dependency. The following job's i-th element will only run when the corresponding element in ArrayB reaches DONE status:
bsub -w "done(ArrayB[*])" -J "ArrayC[1-10]" <cmd3>
You can find more info on the various things you can specify in -w here.
Just as a reference, this is the best solution I could come up with so far. It takes advantage of the fact that bsub writes a line containing the job ID to STDOUT.
function nk_jobid {
    # Run the given bsub command and extract the ID from "Job <1234> is submitted ..."
    output=$("$@")
    echo "$output" | head -n1 | cut -d'<' -f2 | cut -d'>' -f1
}
Usage:
jobid=$(nk_jobid bsub pwd)
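Tying this back to the original question, the captured ID can then be used to build the dependency for the next job, e.g.:
jobid=$(nk_jobid bsub pwd)
# Run the follow-up job only after the first one completes successfully
bsub -w "done($jobid)" second_command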
In case you are using C++, you can use LSBLIB, the LSF C API, to submit jobs. The input and output are structs; in particular, the output struct contains the job ID.
#include <lsf/lsbatch.h>
/* Returns the new job's ID on success, -1 on failure */
LS_LONG_INT lsb_submit(struct submit *jobSubReq, struct submitReply *jobSubReply)
$jobid = "0"
bsub pwd > $jobid
cat $jobid
If you just want to view the JOBID after submission, most of the time I just use bhist or bhist -l to view the running jobs and their details.
$ bhist
Summary of time in seconds spent in various states:
JOBID  USER   JOB_NAME  PEND  PSUSP  RUN     USUSP  SSUSP  UNKWN  TOTAL
8664   F14r3  sample    2     0      187954  0      0      0      187956