Celery loses worker

I use Celery 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the Celery worker is lost and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
  File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?

WorkerLostError is a bit like an OutOfMemory error - it cannot really be "solved" and will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not. Not all tasks can be made idempotent, for example, and Celery still has bugs in the way it handles WorkerLostError. You therefore need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed: was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? All these circumstances can be handled one way or another...
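A minimal sketch of the "idempotent task + broker redelivery" approach described above, assuming a Redis broker URL and hypothetical already_classified(), run_classifier() and save_result() helpers standing in for your own logic:

from celery import Celery

app = Celery('classifier', broker='redis://localhost:6379/0')  # assumed broker URL

@app.task(
    acks_late=True,              # acknowledge only after the task finishes
    reject_on_worker_lost=True,  # let the broker redeliver if the worker dies
)
def classify_document(doc_id):
    if already_classified(doc_id):   # idempotency guard: safe to run twice
        return
    result = run_classifier(doc_id)  # the work that may crash the worker
    save_result(doc_id, result)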

Parallel h5py - The MPI_Comm_dup() function was called before MPI_INIT was invoked

I am experiencing the issue below with Parallel h5py on macOS Ventura 13.0.1, on a 2020 MacBook Pro with a 4-core Intel Core i7.
I installed h5py and its dependencies by following both these docs and this guide.
Running a job which requires only mpi4py runs and finishes without any issues. The problem comes when I try to run a job which requires Parallel h5py, e.g. trying out this code.
I get back the following:
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(the block above is printed four times, once per process)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20469,1],3]
Exit code: 1
--------------------------------------------------------------------------
I found this GitHub issue, but it didn't help in my case.
I should also point out that I managed to install and use Parallel h5py on a MacBook Air with macOS Monterey, though that one is only dual-core, so it doesn't allow me to test Parallel h5py with as many cores without using -overcommit.
Since I have not found any ideas on how to resolve this, apart from the above GitHub issue, I would appreciate any suggestions.
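For reference, the kind of minimal Parallel h5py job meant above looks roughly like the following (close to the collective-write example in the h5py documentation; it assumes an MPI-enabled h5py build and is launched with something like mpiexec -n 4 python demo.py):

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# Every rank opens the same file collectively with the MPI-IO driver,
# then writes its own element of a shared dataset.
with h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('test', (comm.size,), dtype='i')
    dset[comm.rank] = comm.rank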

How to handle Rundeck kill job signal

I have a Python script that is being executed via Rundeck. I have already implemented handlers for signal.SIGINT and signal.SIGTERM, but when the script is terminated via the Rundeck KILL JOB button the signal is not caught.
Does anyone know what the KILL JOB button in Rundeck uses under the hood to kill the process?
Example of how I'm catching signals; it works in a standard command-line execution:
import logging
import os
import sys
from signal import SIGINT

import psutil

# Error and SIGINT_EXIT are defined elsewhere in the script.

def sigint_handler(signum, frame):
    proc = psutil.Process(os.getpid())
    children_procs = proc.children(recursive=True)
    children_procs.reverse()  # terminate the deepest children first
    for child_proc in children_procs:
        try:
            if child_proc.is_running():
                msg = f'removing: {child_proc.pid}, {child_proc.name()}'
                logging.debug(msg)
                os.kill(child_proc.pid, SIGINT)
        except OSError as exc:
            raise Error('Error removing processes', detail=str(exc))
    sys.exit(SIGINT_EXIT)
With the debug logging level enabled in Rundeck I get this:
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Interrupted: Engine interrupted, stopping engine...
Disconnecting from 9.11.56.44 port 22
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] WillShutdown: Workflow engine shutting down (interrupted? true)
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] OperationFailed: operation failed: java.util.concurrent.CancellationException: Task was cancelled.
SSH command execution error: Interrupted: Connection was interrupted
Caught an exception, leaving main loop due to Socket closed
Failed: Interrupted: Connection was interrupted
[workflow] finishExecuteNodeStep(mario): NodeDispatch: Interrupted: Connection was interrupted
1: Workflow step finished, result: Dispatch failed on 1 nodes: [mario: Interrupted: Connection was interrupted + {dataContext=MultiDataContextImpl(map={ContextView(step:1, node:mario)=BaseDataContext{{exec={exitCode=-1}}}, ContextView(node:mario)=BaseDataContext{{exec={exitCode=-1}}}}, base=null)} ]
[workflow] Finish step: 1,NodeDispatch
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Complete: Workflow complete: [Step{stepNum=1, label='null'}: CancellationException]
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Cancellation while running step [1]
[workflow] Finish execution: node-first: [Workflow result: , Node failures: {mario=[]}, status: failed]
[Workflow result: , Node failures: {mario=[]}, status: failed]
Execution failed: 57 in project iLAB: [Workflow result: , Node failures: {mario=[]}, status: failed]
Is it just closing the connection?
Rundeck can't manage internal threads in that way (directly). With the kill button you can only kill the Rundeck job; the only way to manage this is to put all the logic in your script (detect the thread and, depending on some option/behavior, kill the thread). That was requested here and here.
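Judging by the "Disconnecting from ... port 22" line in the log, the kill button appears to simply tear down the SSH session rather than send SIGINT or SIGTERM to your script. When a session's controlling terminal goes away, the remote process group typically receives SIGHUP, so one thing worth trying (a sketch only, reusing the sigint_handler from the question) is registering the same handler for SIGHUP as well:

import signal

# Reuse the question's sigint_handler for every catchable signal that a
# dropped SSH session or a plain kill may deliver. SIGKILL cannot be caught.
for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
    signal.signal(sig, sigint_handler)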

Spring Batch Job Stop Using jobOperator

I have started my job using jobLauncher.run(processJob, jobParameters); and when I try to stop the job from another request using jobOperator.stop(jobExecution.getId()); I get this exception:
org.springframework.batch.core.launch.JobExecutionNotRunningException:
JobExecution must be running so that it can be stopped
Set<JobExecution> jobExecutionsSet = jobExplorer.findRunningJobExecutions("processJob");
for (JobExecution jobExecution : jobExecutionsSet) {
    System.err.println("job status : " + jobExecution.getStatus());
    if (jobExecution.getStatus() == BatchStatus.STARTED
            || jobExecution.getStatus() == BatchStatus.STARTING
            || jobExecution.getStatus() == BatchStatus.STOPPING) {
        jobOperator.stop(jobExecution.getId());
        System.out.println("###########Stopped#########");
    }
}
When I print the job status I always get job status : STOPPING, but the batch job keeps running.
It's a web app: a CSV file is uploaded first and some operation is started using Spring Batch; if the user needs to stop it during this execution, a stop request comes in through another controller method and tries to stop the running job.
Please help me stop the running job.
If you stop a job while it is running (typically in a STARTED state), you should not get this exception. If you have this exception, it means you have stopped your job while it is currently stopping (that is what the STOPPING status means).
jobExplorer.findRunningJobExecutions returns only running executions, so if on the very next line you see a job in STOPPING status, the status must have changed right after the call to jobExplorer.findRunningJobExecutions. You need to be aware that this is possible, and your controller should handle this case.
When you tell Spring Batch to stop a job it goes into the STOPPING state. This means it will attempt to complete the chunk (unit of work) it is currently processing and then stop. What is likely happening is that you have a long-running step that never finishes its current chunk (is it hung?), so the job can't move from STOPPING to STOPPED.
Calling stop a second time rightly leads to an exception, because your job is already STOPPING from the first call.

Celery - handle WorkerLostError exception with Task.retry()

I'm using celery 4.4.7
Some of my tasks are using too much memory and are getting killed with signal 9 (SIGKILL). I would like to retry them later, since I'm running with concurrency on the machine and they might succeed on another run.
However, as far as I understand, you can't catch a WorkerLostError exception thrown within a task, i.e. this won't work as I expect:
from billiard.exceptions import WorkerLostError

@celery_app.task(acks_late=True, max_retries=2, autoretry_for=(WorkerLostError,))
def some_task():
    ...  # task code
I also don't want to use task_reject_on_worker_lost, as it requeues the tasks and max_retries is not applied.
What would be the best approach to handle my use case?
Thanks in advance for your time :)
Gal
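One possible direction, sketched below under the assumption that the memory-heavy step can be moved into a separate script (run_heavy_step.py here is hypothetical): run that step in a throwaway child process so the pool worker itself is never killed; if the OS kills the child, raise an ordinary exception and let autoretry_for/max_retries handle the retries.

import subprocess
import sys

from celery import Celery

celery_app = Celery('app', broker='redis://localhost:6379/0')  # assumed broker URL


class HeavyStepKilled(Exception):
    """Raised when the child process doing the heavy work is killed."""


@celery_app.task(acks_late=True, max_retries=2, autoretry_for=(HeavyStepKilled,))
def some_task(payload_path):
    # run_heavy_step.py is a hypothetical script holding the memory-heavy code.
    result = subprocess.run(
        [sys.executable, 'run_heavy_step.py', payload_path],
        capture_output=True,
    )
    if result.returncode < 0:  # negative return code: child killed by a signal
        raise HeavyStepKilled(f'heavy step killed with {result.returncode}')
    return result.stdout.decode()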

Can a sleeping Perl program be killed using kill(SIGKILL, $$)?

I am running a Perl program; a module in the program is triggered by an external process to kill all the child processes and terminate its execution.
This works fine.
But when a certain function, say xyz(), is executing, there is a sleep(60) statement that runs on a condition.
Right now the function is executed repeatedly, as it is waiting for some value.
When I trigger the kill process as mentioned above, the kill does not take place.
Does anybody have a clue as to why this is happening?
I don't understand how you are trying to kill a process from within itself (the $$ in your question's subject) when it's sleeping.
If you are killing it from a DIFFERENT process, then that process will have its own $$. You need to find out the PID of the original process first (by trawling the process list or by somehow communicating it from the original process).
Killing a sleeping process works very well:
$ ( date ; perl5.8 -e 'sleep(100);' ; date ) &
Wed Sep 14 09:48:29 EDT 2011
$ kill -KILL 8897
Wed Sep 14 09:48:54 EDT 2011
This also works with other "killish" signals ('INT', 'ABRT', 'QUIT', 'TERM').
UPDATE: Upon re-reading, maybe the issue you meant is that the "triggered by an external process" part doesn't happen. If that's the case, you need to:
Set up a CATCHABLE signal handler in your process before going to sleep ($SIG{'INT'}) - SIGKILL cannot be caught by a handler.
Send SIGINT from said "external process".
Do all the needed cleanup once sleep() is interrupted by SIGINT, from the SIGINT handler.