Airflow: Job <job-id> was killed before it finished (likely due to running out of memory) - Kubernetes

I have a linear DAG with two tasks - the first task's truthy/falsy return value decides whether the second task is executed. I am using ShortCircuitOperator for the first task so that the second task can be bypassed if needed. Following is my DAG code:
DAG_VERSION = "1.0.0"
with DAG(
"sample_dag",
catchup=False,
tags=[DAG_VERSION],
max_active_runs=1,
schedule_interval=None,
default_args=DEFAULT_ARGS,
) as dag:
dag.doc_md = "Sample DAG"
TASK_1 = ShortCircuitOperator(
task_id="task_1",
python_callable=test_script_1,
executor_config=EXECUTOR_CONFIG,
)
TASK_2 = PythonOperator(
task_id="task_2",
python_callable=test_script_2,
executor_config=EXECUTOR_CONFIG,
)
TASK_1 >> TASK_2
However, when I try to run the DAG, I get the following in the log for the first task when it returns a truthy value:
task_1 logs
Marking task as SUCCESS. dag_id=sample_dag, task_id=task_1, execution_date=20220606T060000, start_date=20220606T070012, end_date=20220606T070014
[2022-06-06, 07:00:17 UTC] State of this instance has been externally set to success. Terminating instance.
[2022-06-06, 07:00:17 UTC] Sending Signals.SIGTERM to GPID 18
[2022-06-06, 07:01:17 UTC] process psutil.Process(pid=18, name='airflow task runner: sample_dag task_1 scheduled__2022-06-06T06:00:00+00:00 7037', status='sleeping', started='07:00:12') did not respond to SIGTERM. Trying SIGKILL
[2022-06-06, 07:01:17 UTC] Process psutil.Process(pid=18, name='airflow task runner: sample_dag task_1 scheduled__2022-06-06T06:00:00+00:00 7037', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='07:00:12') (18) terminated with exit code Negsignal.SIGKILL
[2022-06-06, 07:01:17 UTC] Job 7037 was killed before it finished (likely due to running out of memory)
I am using the return value of the first task in the second task. When I try to log the XCom value of the first task inside the second task, I get None, which causes the second task to fail. This is my code for accessing the first task's XCom value inside the second task:
def test_script_2(**context: models.xcom) -> List[str]:
    task_instance = context["task_instance"]
    return_value = task_instance.xcom_pull(task_ids="task_1")
    print("logging return value of first task ", return_value)
I am running Airflow 2.2.2 with the Kubernetes executor.
Is the None XCom value due to the out-of-memory issue in the first task? I tried returning a fixed value from the first task, but again None was returned in the second task, with the following log:
task_2 logs
Marking task as SUCCESS. dag_id=sample_dag, task_id=task_2, execution_date=20220606T095611, start_date=20220606T095637, end_date=20220606T095638
[2022-06-06, 09:56:42 UTC] State of this instance has been externally set to success. Terminating instance.
[2022-06-06, 09:56:42 UTC] Sending Signals.SIGTERM to GPID 18
[2022-06-06, 09:57:42 UTC] process psutil.Process(pid=18, name='airflow task runner: sample_dag task_2 manual__2022-06-06T09:56:11.804005+00:00 7051', status='sleeping', started='09:56:37') did not respond to SIGTERM. Trying SIGKILL
[2022-06-06, 09:57:42 UTC] Process psutil.Process(pid=18, name='airflow task runner: sample_dag task_2 manual__2022-06-06T09:56:11.804005+00:00 7051', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='09:56:37') (18) terminated with exit code Negsignal.SIGKILL
[2022-06-06, 09:57:42 UTC] Job 7051 was killed before it finished (likely due to running out of memory)
I am unable to find the issue in the code. I would appreciate any hint on where I am going wrong and how to get the XCom value of the first task.
Thanks

Related

Parallel h5py - The MPI_Comm_dup() function was called before MPI_INIT was invoked

I am experiencing the below issue with Parallel h5py on macOS Ventura 13.0.1, on a 2020 MacBook Pro 4-Core Intel Core i7.
I installed h5py and its dependencies by following both of these docs and this guide.
A job which requires only mpi4py runs and finishes without any issues. The problem comes when I try to run a job which requires Parallel h5py, e.g. trying out this code.
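(For context, the kind of parallel h5py job involved is roughly the standard collective-write example; a minimal sketch, with an arbitrary file name, is the following.)

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# Each rank writes its own element of a shared dataset via the MPI-IO driver.
with h5py.File("parallel_test.hdf5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("test", (comm.size,), dtype="i")
    dset[comm.rank] = comm.rank

This is run with something like mpiexec -n 4 python parallel_write.py.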
I get back the following:
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20469,1],3]
Exit code: 1
--------------------------------------------------------------------------
I found this GitHub issue, but it didn't help in my case.
I should also point out that I managed to install and use Parallel h5py on a MacBook Air with macOS Monterey, though that one is only dual-core, so it doesn't allow me to test Parallel h5py with as many cores without using -overcommit.
Since I have not found any ideas on how to resolve this apart from the above GitHub issue, I would appreciate any suggestions.

How to handle Rundeck kill job signal

I have a Python script that is being executed via Rundeck. I have already implemented handlers for signal.SIGINT and signal.SIGTERM, but when the script is terminated via Rundeck's KILL JOB button, the signal is not caught.
Does anyone know what the KILL JOB button in Rundeck uses under the hood to kill the process?
Example of how I'm catching signals; it works in a standard command-line execution:
import logging
import os
import sys

import psutil
from signal import SIGINT

# Error and SIGINT_EXIT are defined elsewhere in the script.
def sigint_handler(signum, frame):
    proc = psutil.Process(os.getpid())
    children_procs = proc.children(recursive=True)
    children_procs.reverse()
    for child_proc in children_procs:
        try:
            if child_proc.is_running():
                msg = f'removing: {child_proc.pid}, {child_proc.name()}'
                logging.debug(msg)
                os.kill(child_proc.pid, SIGINT)
        except OSError as exc:
            raise Error('Error removing processes', detail=str(exc))
    sys.exit(SIGINT_EXIT)
Adding the debug log level in Rundeck, I get this:
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Interrupted: Engine interrupted, stopping engine...
Disconnecting from 9.11.56.44 port 22
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] WillShutdown: Workflow engine shutting down (interrupted? true)
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] OperationFailed: operation failed: java.util.concurrent.CancellationException: Task was cancelled.
SSH command execution error: Interrupted: Connection was interrupted
Caught an exception, leaving main loop due to Socket closed
Failed: Interrupted: Connection was interrupted
[workflow] finishExecuteNodeStep(mario): NodeDispatch: Interrupted: Connection was interrupted
1: Workflow step finished, result: Dispatch failed on 1 nodes: [mario: Interrupted: Connection was interrupted + {dataContext=MultiDataContextImpl(map={ContextView(step:1, node:mario)=BaseDataContext{{exec={exitCode=-1}}}, ContextView(node:mario)=BaseDataContext{{exec={exitCode=-1}}}}, base=null)} ]
[workflow] Finish step: 1,NodeDispatch
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Complete: Workflow complete: [Step{stepNum=1, label='null'}: CancellationException]
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Cancellation while running step [1]
[workflow] Finish execution: node-first: [Workflow result: , Node failures: {mario=[]}, status: failed]
[Workflow result: , Node failures: {mario=[]}, status: failed]
Execution failed: 57 in project iLAB: [Workflow result: , Node failures: {mario=[]}, status: failed]
Is it just closing the connection?
Rundeck can't manage internal threads in that way (directly). With the kill button you can only kill the Rundeck job; the only way to manage this is to put all the logic in your script (detect the thread and, depending on some option/behavior, kill it). That was requested here and here.
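As an illustration only (an assumption about how the termination reaches the script, since Rundeck runs the step over SSH and the log above shows the connection being interrupted; whether any signal is delivered at all depends on how the session is torn down), you can register the same cleanup logic for SIGHUP in addition to SIGINT and SIGTERM:

import signal

def cleanup_handler(signum, frame):
    # Reuse the child-process cleanup from sigint_handler in the question.
    sigint_handler(signum, frame)

# A dropped SSH session is often surfaced as SIGHUP rather than SIGINT.
for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
    signal.signal(sig, cleanup_handler)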

Celery lose worker

I use Celery 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the Celery worker is lost and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost human_status(exitcode)), billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?
WorkerLostError is a bit like an OutOfMemory error - it can't really be "solved" and will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not. Not all tasks can be idempotent, for example. Celery still has bugs in the way it handles WorkerLostError. Therefore you need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed - was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? All of these circumstances can be handled one way or another...
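As a sketch of what "let Celery retry" can look like in practice (an illustration, not a drop-in fix; it assumes the acks_late and reject_on_worker_lost task options of Celery 4.x and uses a placeholder broker URL), you can let the broker re-deliver a task whose worker died:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task(
    bind=True,
    acks_late=True,              # acknowledge only after the task finishes
    reject_on_worker_lost=True,  # re-queue the message if the worker dies mid-task
)
def classify(self, features):
    # Must be idempotent: with the options above it may run more than once.
    ...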

Starting Parpool in MATLAB

I tried starting parpool in MATLAB R2015b. The command is as follows:
parpool('local',3);
This command should allocate 3 workers, but instead I received an error stating that it failed to start the parallel pool. The error message is as follows:
Error using parpool (line 94)
Failed to start a parallel pool. (For information in addition to
the causing error, validate the profile 'local' in the Cluster Profile
Manager.)
A similar query was posted at (https://nl.mathworks.com/matlabcentral/answers/196549-failed-to-start-a-parallel-pool-in-matlab2015a). I followed the same procedure to validate the local profile as per the suggestions.
Using distcomp.feature('LocalUseMpiexec', false); or distcomp.feature('LocalUseMpiexec', true) in startup.m didn't bring any improvement. Attempting to validate the local profile still gives the following error message:
VALIDATION DETAILS
Profile: local
Scheduler Type: Local
Stage: Cluster connection test (parcluster)
Status: Passed
Description:Validation Passed
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Job test (createJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 24 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job24.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: SPMD job test (createCommunicatingJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 25 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job25.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: Pool job test (createCommunicatingJob)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Parallel pool test (parpool)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
I am receiving these errors only on my cluster machine; launching parpool on my standalone PC works perfectly. Is there a way to rectify this issue?

How to stop and resume a spring batch job

Goal: I am using Spring Batch for data processing and I want to have an option to stop/resume (where it left off).
Issue: I am able to send a stop signal to a running job and it gets stopped successfully. But when I try to send a start signal to the same job, it creates a new instance of the job and starts as a fresh job.
My question is: how can we achieve resume functionality for a stopped job in Spring Batch?
You just have to run it with the same parameters. Just make sure you haven't marked the job as non-restartable and that you're not using RunIdIncrementer or similar to automatically generate unique job parameters.
See, for instance, this example. After the first run, we have:
INFO: Job: [SimpleJob: [name=myJob]] completed with the following parameters: [{}] and the following status: [STOPPED]
Status is: STOPPED, job execution id 0
#1 step1 COMPLETED
#2 step2 STOPPED
And after the second:
INFO: Job: [SimpleJob: [name=myJob]] completed with the following parameters: [{}] and the following status: [COMPLETED]
Status is: COMPLETED, job execution id 1
#3 step2 COMPLETED
#4 step3 COMPLETED
Note that stopped steps will be re-executed. If you're using chunk-oriented steps, make sure that at least the ItemReader implements ItemStream (and does it with the correct semantics).
Steps marked with allowStartIfComplete will always be re-run.