Slurm: How to restart failed worker job - hpc

If one is running an array job on a slurm cluster, how can one restart a failed worker job?
In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate the job should be restarted if it fails--what is the Slurm equivalent of this flag?

You can use --requeue
#SBATCH --requeue ### On failure, requeue for another try
--requeue
Specifies that the batch job should eligible to being requeue. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
See more here: https://slurm.schedmd.com/sbatch.html#lbAE

Related

Dash Celery setup

I have docker-compose setup for my Dash application. I need suggestion or preferred way to setup my celery image.
I am using celery for following use-cases and these are cancellable/abortable/revoked task:
Upload file
Model training
Create train, test set
Case-1. Create one service as celery,
command: ["celery", "-A", "tasks", "worker", "--loglevel=INFO", "--pool=prefork", "--concurrency=3", "--statedb=/celery/worker.state"]
So, here we are using default queue, single worker (main) and 3 child/worker processes(ie can execute 3 tasks simultaneously)
Now, if I revoke any task, will it kill the main worker or just that child worker processes executing that task?
Case-2. Create three services as celery-{task_name} ie celery-upload etc,
command: ["celery", "-A", "tasks", "worker", "--loglevel=INFO", "--pool=prefork", "--concurrency=1", , "--statedb=/celery/worker.state", "--queues=upload_queue", , "--hostname=celery_worker_upload_queue"]
So, here we are using custom queue, single worker (main) and 1 child/worker processe(ie can execute 1 task) in its container. This way one service for each task.
Now, if I revoke any task, it will only kill the main worker or just the only child worker processes executing that task in respective container and rest celery containers will be alive?
I tried using below signals with command task.revoke(terminate=True)
SIGKILL and SIGTERM
In this, I observed #worker_process_shutdown.connect and #task_revoked.connect both gets fired.
Does this means main worker and concerned child worker process for whom revoke command is issued(or all child processes as main worker is down) are down?
SIGUSR1
In this, I observed only #task_revoked.connect gets fired.
Does this means main worker is still running/alive and only concerned child worker process for whom revoke command is issued is down?
Which case is preferred?
Is it possible to combine both cases? ie having single celery service with individual workers(main) and individual child worker process and individual queues Or
having single celery service with single worker (main), individual/dedicated child worker processes and individual queues for respective tasks?
One more doubt, As I think, using celery is required for above listed tasks, now say I have button for cleaning a dataframe will this too requires celery?
ie wherever I am dealing with dataframes should I need to use celery?
Please suggest.
UPDATE-2
worker processes = child-worker-process
This is how I am using as below
# Start button
result = background_task_job_one.apply_async(args=(n_clicks,), queue="upload_queue")
# Cancel button
result = result_from_tuple(data, app=celery_app)
result.revoke(terminate=True, signal=signal.SIGUSR1)
# Task
#celery_app.task(bind=True, name="job_one", base=AbortableTask)
def background_task_job_one(self, n_clicks):
msg = "Aborted"
status = False
try:
msg = job(n_clicks) # Long running task
status = True
except SoftTimeLimitExceeded as e:
self.update_state(task_id=self.request.id, state=states.REVOKED)
msg = "Aborted"
status = True
raise Ignore()
finally:
print("FINaLLY")
return status, msg
Is this way ok to handle cancellation of running task? Can you elaborate/explain this line [In practice you should not send signals directly to worker processes.]
Just for clarification from line [In prefork concurrency (the default) you will always have at least two processes running - Celery worker (coordinator) and one or more Celery worker-processes (workers)]
This means
celery -A app worker -P prefork -> 1 main worker and 1 child-worker-process. Is it same as below
celery -A app worker -P prefork -c 1 -> 1 main worker and 1 child-worker-process
Earlier, I tried using class AbortableTask and calling abort(), It was successfully updating the state and status as ABORTED but task was still alive/running.
I read to terminate currently executing task, it is must to pass terminate=True.
This is working, the task stops executing and I need to update task state and status manually to REVOKED, otherwise default PENDING. The only hard-decision to make is to use SIGKILL or SIGTERM or SIGUSR1. I found using SIGUSR1 the main worker process is alive and it revoked only the child worker process executing that task.
Also, luckily I found this link I can setup single celery service with multiple dedicated child-worker-process with its dedicated queues.
Case-3: Celery multi
command: ["celery", "multi", "show", "start", "default", "model", "upload", "-c", "1", "-l", "INFO", "-Q:default", "default_queue", "-Q:model", "model_queue", "-Q:upload", "upload_queue", "-A", "tasks", "-P", "prefork", "-p", "/proj/external/celery/%n.pid", "-f", "/proj/external/celery/%n%I.log", "-S", "/proj/external/celery/worker.state"]
But getting error,
celery service exited code 0
command: bash -c "celery multi start default model upload -c 1 -l INFO -Q:default default_queue -Q:model model_queue -Q:upload upload_queue -A tasks -P prefork -p /proj/external/celery/%n.pid -f /proj/external/celery/%n%I.log -S /proj/external/celery/worker.state"
Here also getting error,
celery | Usage: python -m celery worker [OPTIONS]
celery | Try 'python -m celery worker --help' for help.
celery | Error: No such option: -p
celery | * Child terminated with exit code 2
celery | FAILED
Some doubts, what is preferred 1 worker vs multi worker?
If multi worker with dedicated queues, creating docker service for each task increases the docker-file and services too. So I am trying single celery service with multiple dedicated child-worker-process with its dedicated queues which is easy to abort/revoke/cancel a task.
But getting error with case-3 i.e. celery multi.
Please suggest.
If you revoke a task, it may terminate the working process that was executing the task. The Celery worker will continue working as it needs to coordinate other worker processes. If the life of container is tied to the Celery worker, then container will continue running.
In practice you should not send signals directly to worker processes.
In prefork concurrency (the default) you will always have at least two processes running - Celery worker (coordinator) and one or more Celery worker-processes (workers).
To answer the last question we may need more details. It would be easier if you could run Celery task when all dataframes are available. If that is not the case, then perhaps run individual tasks to process dataframes. It is worth having a look at Celery workflows and see if you can build Chunk-ed workflow. Keep it simple, start with assumption that you have all dataframes available at once, and build from there.

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi look this link https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it but it seems possible to launch several pods and configure the end of the job when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions.
the Job represents the overall task, and is complete when there is
one successful Pod for each value in the range 1 to
.spec.completions
not implemented yet: Each Pod is passed a different index in the
range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
do not specify .spec.completions, default to .spec.parallelism
the Pods must coordinate amongst themselves or an external service to
determine what each should work on.
For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are
created
once at least one Pod has terminated with success and all Pods are
terminated, then the Job is completed with success
once any Pod has exited with success, no other Pod should still be
doing any work for this task or writing any output. They should all
be in the process of exiting
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
You can also take a look on single job which starts controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time take under consideration that completition status of Job set by dafault - when specified number of successful completions is reached it ensure that all tasks are processed properly. Applying this status before all tasks are finished is not secure solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is official documentations: jobs-parallel-processing , parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is that you can create special script which will continuously check values you look for. Using job then will not be necessary, you can simply use deployment.

How to run kubernetes pod for a set period of time each day?

I'm looking for a way to deploy a pod on kubernetes to run for a few hours each day. Essentially I want it to run every morning at 8AM and continue running until about 5:30 PM.
I've been researching a lot and haven't found a way to deploy the pod with a specific timeframe in mind. I've found cron jobs, but that seems to be to be for pods that terminate themselves, whereas mine should be running constantly.
Is there any way to deploy my pod on kubernetes this way? Or should I just set up the pod itself to run its intended application based on its internal clock?
According to the Kubernetes architecture, a Job creates one or more pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the job tracks the successful completions. When a specified number of successful completions is reached, the job itself is complete.
In simple words, Jobs run until completion or failure. That's why there is no option to schedule a Cron Job termination in Kubernetes.
In your case, you can start a Cron Job regularly and terminate it using one of the following options:
A better way is to terminate a container by itself, so you can add such functionality to your application or use Cron. More information about how to add Cron to the Docker container, you can find here.
You can use another Cron Job to terminate your Cron Job. You need to run a command inside a Pod to find and delete a Pod related to your Job. For more information, you can look through this link. But it is not a good way, because your Cron Job will always have failed status.
In both cases, you need to check with what status your Cron Job was finished and use the correct RestartPolicy accordingly.
It seems you can implement using a cronjob object,
[ https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#creating-a-cron-job ]

Make job submitted with bsub run in parallel with job that was submitted before and still running

I would like the new submitted job to bsub not PEND but start immediate run.
If possible I would like to limit this to N jobs.
If you want two jobs to run in parallel, its probably best to submit a single parallel job (bsub -n) that runs two different processes, potentially on two different hosts.
The LSF admin can force a PENDing job to run with the brun command. However, this will cause the execution host to be temporarily overloaded.

Change priorities of my own submitted jobs

I have many jobs running and pending. I would like to indicate the relative priority of jobs that I have submitted to the queue, that are pending, but not yet running. Is it possible to set this priority after submission? Is it possible to set this priority before submission?
You can move jobs that are pending with the btop command.
btop job_ID | "job_ID[index_list]" [position]
If you add [position] it means that the job will be put at that place in the queue.
By default, LSF dispatches jobs in a queue in the order of their
arrival (that is, first come, first served), subject to availability
of suitable server hosts.
Having said this, the priority of the job is unchanged. So you will only be able to change the order if the jobs have the same priority.
Depending on your LSF version, see the following links for details about btop
LSF 10.1.0 > Command Reference > btop
LSF 9.1.3 > Command Reference > btop