Detect errors with torque and grid engine and prevent execution of dependent tasks - hpc

I have a shell script that queues multiple tasks for execution on an HPC cluster. The same job submission script works for either Torque or Grid Engine with some minor conditional logic. This is a pipeline where the output of earlier tasks is fed to later tasks for further processing. I'm using qsub to define job dependencies, so later tasks wait for earlier tasks to complete before starting execution. So far so good.
Sometimes, a task fails. When a failure happens, I don't want any of the dependent tasks to attempt processing the output of the failed task. However, the dependent tasks have already been queued for execution long before the failure occurred. What is a good way to prevent the unwanted processing?

You can use the afterok dependency argument. For example, the qsub command may look like:
qsub -W depend=afterok:<jobid> submit.pbs
Torque will only start the next job if job <jobid> exits without errors. See the documentation on the Adaptive Computing page.

Here is what I eventually implemented. The key to making this work is returning error code 100 on error. Sun Grid Engine stops execution of subsequent jobs upon seeing error code 100. Torque stops execution of subsequent jobs upon seeing any non-zero error code.
qsub starts a sequence of bash scripts. Each of those bash scripts has this code:
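handleTrappedErrors()
{
    errorCode=$?                    # exit status of the command that failed
    bashCommand="$BASH_COMMAND"     # text of the command being run when the trap fired
    scriptName=$(basename "$0")
    lineNumber=${BASH_LINENO[0]}    # line from which the handler was invoked
    # log an error message to a log file here -- not shown
    exit 100                        # 100 for Grid Engine; Torque reacts to any non-zero code
}
trap handleTrappedErrors ERR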
Torque (as Derek mentioned):
qsub -W depend=afterok:<jobid> ...
Sun Grid Engine:
qsub -hold_jid <jobid> ...
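The two dependency options can be wrapped in a small helper so that one submission script serves both schedulers. This is a minimal sketch, not the original script: the names `dependency_flag` and `submit_after` and the script names in the usage note are hypothetical, and it assumes Grid Engine's `-terse` option is used so that qsub prints only the job ID.

```shell
#!/bin/bash
# Sketch only: function names are illustrative, not from the original script.

# Print the scheduler-specific option that makes a job wait for <jobid>.
dependency_flag() {
    local scheduler="$1" jobid="$2"
    case "$scheduler" in
        torque) printf -- '-W depend=afterok:%s\n' "$jobid" ;;
        sge)    printf -- '-hold_jid %s\n' "$jobid" ;;
        *)      echo "unknown scheduler: $scheduler" >&2; return 1 ;;
    esac
}

# Submit <script> so that it waits for <jobid>; qsub prints the new job's ID.
submit_after() {
    local scheduler="$1" jobid="$2" script="$3"
    local flag
    flag=$(dependency_flag "$scheduler" "$jobid") || return 1
    if [ "$scheduler" = sge ]; then
        # -terse makes Grid Engine print only the job ID.
        # $flag is intentionally unquoted so it splits into option and value.
        qsub -terse $flag "$script"
    else
        qsub $flag "$script"
    fi
}
```

A chain could then be built with `first=$(qsub step1.sh)` followed by `second=$(submit_after torque "$first" step2.sh)`, where step1.sh and step2.sh stand in for the real pipeline scripts.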

Related

Abort a Datastage job at a specified time

I have a scheduled parallel Datastage (11.7) job.
This job has a Hive Connector with a Before and After Statement.
The Before statement runs OK, but the After statement remains in a running state for several hours (the Hue log shows that this job finished in 1 hour) and I have to abort it manually in DataStage Director.
Is there a way to "program an abort"?
For example, I want to schedule the interruption of the running job every morning at 6.
I hope I was clear :)
Even though you can stop the job with dsjob, as other responses suggest, this may have no effect because the After statement has been issued synchronously; the job is waiting for it to finish, and is probably not processing kill signals and the like in the meantime. You would be better advised to work out why the After command is taking so long, and address that.

Datastage: How to keep continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector Stage to read a Kafka queue and then load into the database. That job runs in Continuous Mode, so it never concludes: it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues etc.) that job may terminate with a failure. In general, that happens after about 300 hours of running. So, to keep the job alive I have to check the job status manually and then do a Reset and Run.
The problem is that several hours can pass between the job termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and to reduce that gap by automating the job invocation.
I tried to use Control-M to run the job daily, but with no success. On the first day Control-M called the job and it ran fine. On the next day, when Control-M attempted to start the job again, it failed (since the job was already running). Besides, DataStage will never report back to Control-M that the job concluded successfully, since the job's nature won't allow that.
That said, I would like to hear any ideas you have.
The first thing that came to mind is to create an intermediate Sequence and then schedule it in Control-M. This new Sequence would then call the continuous job asynchronously by using a command line stage.
For the case where just this one job terminates unexpectedly and you want it to be restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be setup to loop running this job.
Thus the sequence starts the job and waits for it to finish. When the job finishes, the sequence loops and starts the job again. You could add conditions on job exit (for example, if the job aborted, then based on that end status you could reset the job before re-running it).
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance or possibly an error), in which case all jobs end, including your new sequence. The same applies to a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as a DataStage engine stop) your team would need to have a process in place for jobs/sequences that need to be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether the job runs solo or from a sequence) that sleeps/loops on 5-10 minute intervals, checks the status of your job using the dsjob command, and, if the job is not running, starts that job/sequence (also via dsjob). You can decide whether that script is started at DataStage startup, at machine startup, or from Control-M or another scheduler.
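The monitor script described above could look roughly like the following. It is a sketch under stated assumptions: the project and job names are placeholders, the function names are invented for illustration, and it assumes that `dsjob -jobinfo` prints a "Job Status" line containing the word RUNNING while the job is active. Check the output format on your installation before relying on the pattern.

```shell
#!/bin/bash
# Sketch only: PROJECT, JOB, and the function names are placeholders.
PROJECT="${PROJECT:-MyProject}"
JOB="${JOB:-MyContinuousJob}"
INTERVAL="${INTERVAL:-300}"   # seconds between checks (5 minutes)

# Succeed if dsjob reports the job as running.
# Assumes a "Job Status : RUNNING ..." line in the -jobinfo output.
job_is_running() {
    dsjob -jobinfo "$PROJECT" "$JOB" 2>/dev/null | grep -q 'Job Status.*RUNNING'
}

# Reset the job, then start it again.
restart_job() {
    dsjob -run -mode RESET -wait "$PROJECT" "$JOB"
    dsjob -run "$PROJECT" "$JOB"
}

# One monitoring pass: restart only when the job is not running.
check_once() {
    if job_is_running; then
        echo "$JOB is running"
    else
        echo "$JOB is not running; restarting"
        restart_job
    fi
}

# Loop forever; call this from a scheduler or a startup script.
monitor_loop() {
    while true; do
        check_once
        sleep "$INTERVAL"
    done
}
```

Calling `monitor_loop` at the end of the script starts the polling. Keeping `check_once` separate also lets a scheduler such as Control-M run a single pass on its own timetable instead of leaving a loop running.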

Schedule a python execution every 24h

I'm training several neural networks on a server at my university. Because resources are shared among all the students, there is a job scheduling system (Slurm) that queues everyone's runs, and we are only allowed to run our commands with a time limit of 24h. Once a process exceeds this limit, it is closed to free resources for the others.
Specifically, I'm training GANs and I need more than 24h of training time.
Right now, I'm saving checkpoints of my model so that I can restart from the training point reached before the process was closed. But I must execute the same command again after 24h.
For this reason I would like to schedule this execution every 24h automatically.
Currently I'm using 'tmux' to execute the command so that I can close the terminal.
Any suggestions on how to automate this kind of execution?
Thank you in advance!
You can set up your job to automatically resubmit when it's close to the time limit.
Note that Slurm's time granularity is 1 minute, so don't set the signal timer to anything less than 60 seconds.
#!/bin/bash
#SBATCH --signal=B:USR1@300 # Tell Slurm to send signal USR1 300 seconds before timelimit
#SBATCH -t 24:00:00
resubmit() {
echo "It's time to resubmit"; # <----- Run whatever is necessary. Ideally resubmit the job using the checkpointing files
sbatch ...
}
trap "resubmit" USR1 # Register signal handler
YOUR_TRAINING_COMMAND & # It's important to run on the background otherwise bash will not process the signal until this command finishes
wait # wait until all the background processes are finished. If a signal is received this will stop, process the signal and finish the script.

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation when running Snakemake

I am running a snakemake pipeline on an HPC that uses slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, snakemake will encounter a problem when attempting to submit a job. This results in the error
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):
I run the pipeline via a sbatch file with the following snakemake call
snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
This results in not only an output for snakemake sbatch job, but also for the jobs that snakemake creates. The above error appears in the slurm.out for the sbatch file.
The specific job step the error indicates will run successfully, and give output, but the pipeline fails. The logs of the job step show that the job id ran without a problem. I have googled this error, and it appears to happen often with slurm, and especially when the scheduler is under high IO, which suggests it will be an inevitable and regular occurrence. I was hoping someone has encountered this problem, and could offer suggestions for a work around, so that the entire pipeline doesn't fail.
snakemake has the options --max-jobs-per-second and --max-status-checks-per-second, each with a default of 10. Maybe try decreasing them to reduce strain on the scheduler? Also, maybe try to reduce -j 999?
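As a sketch of that suggestion, the call from the question could be throttled as follows. The values 100 and 1 are illustrative starting points rather than tuned recommendations, and the `snakemake_args` array and `run_pipeline` function are names introduced here for readability.

```shell
#!/bin/bash
# Sketch only: the numeric limits below are assumptions to be tuned per cluster.
snakemake_args=(
    -j 100                              # fewer concurrent cluster jobs than -j 999
    -p
    --max-jobs-per-second 1             # default is 10
    --max-status-checks-per-second 1    # default is 10
    --cluster-config cluster.json
    --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
)

# Run snakemake with the throttled settings; extra arguments are passed through.
run_pipeline() {
    snakemake "${snakemake_args[@]}" "$@"
}
```

Calling `run_pipeline` from the sbatch file then replaces the original one-line snakemake call.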

Make job submitted with bsub run in parallel with job that was submitted before and still running

I would like a job newly submitted with bsub not to PEND but to start running immediately.
If possible I would like to limit this to N jobs.
If you want two jobs to run in parallel, it's probably best to submit a single parallel job (bsub -n) that runs two different processes, potentially on two different hosts.
The LSF admin can force a PENDing job to run with the brun command. However, this will cause the execution host to be temporarily overloaded.