Airflow Celery Workers Crashing, Cannot Complete Tasks

I've set up a Docker environment running:
Airflow Webserver
Airflow Scheduler
Flower
2 Airflow Workers (though the issue is reproducible with just 1 Worker)
Redis
Six containers in total, across four t2.small EC2 instances in a single ECS cluster, with a db.t2.micro PostgreSQL RDS instance.
Using the CeleryExecutor, nearly all queued tasks sent to the workers fail. Upon receiving tasks, the workers seem to lose communication with each other and/or the scheduler: they drift apart, miss heartbeats, and are eventually killed by the host system.
I'm able to reproduce this behavior on Airflow 1.10.3 (and the latest 1.10.4RC) using the latest versions of both Redis and RabbitMQ, with Celery 4.3.0.
I've tuned the suggested configuration options (an airflow.cfg sketch follows this list), including:
scheduler__scheduler_heartbeat_sec (currently 180 seconds)
scheduler__job_heartbeat_sec (currently default 5 seconds)
scheduler__max_threads (currently just 1 thread)
celery_broker_transport_options__visibility_timeout (currently 21600 seconds)
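For reference, here is roughly how those settings look in airflow.cfg (the values mirror the list above; section and key names are the Airflow 1.10 ones, so double-check them against your own config). They can equivalently be set as environment variables on the containers, e.g. AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=180:
[scheduler]
scheduler_heartbeat_sec = 180
job_heartbeat_sec = 5
max_threads = 1

[celery_broker_transport_options]
visibility_timeout = 21600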
Below are logs from a DAG run that executes five SQL queries to set permissions across schemas.
Running these queries manually takes seconds.
With the LocalExecutor in a non-dockerized environment, the DAG completes in ~30 seconds.
With the CeleryExecutor in this new Docker environment, the tasks are still on their first try ~300 seconds into the run.
Scheduler:
[2019-07-29 01:20:23,407] {{jobs.py:1106}} INFO - 5 tasks up for execution:
<TaskInstance: ldw_reset_permissions.service_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.marketing_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.finance_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.engineering_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.bi_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
[2019-07-29 01:20:23,414] {{jobs.py:1144}} INFO - Figuring out tasks to run in Pool(name=None) with 128 open slots and 5 task instances in queue
[2019-07-29 01:20:23,418] {{jobs.py:1182}} INFO - DAG ldw_reset_permissions has 0/16 running and queued tasks
[2019-07-29 01:20:23,418] {{jobs.py:1182}} INFO - DAG ldw_reset_permissions has 1/16 running and queued tasks
[2019-07-29 01:20:23,418] {{jobs.py:1182}} INFO - DAG ldw_reset_permissions has 2/16 running and queued tasks
[2019-07-29 01:20:23,422] {{jobs.py:1182}} INFO - DAG ldw_reset_permissions has 3/16 running and queued tasks
[2019-07-29 01:20:23,422] {{jobs.py:1182}} INFO - DAG ldw_reset_permissions has 4/16 running and queued tasks
[2019-07-29 01:20:23,423] {{jobs.py:1223}} INFO - Setting the follow tasks to queued state:
<TaskInstance: ldw_reset_permissions.service_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.marketing_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.finance_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.engineering_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
<TaskInstance: ldw_reset_permissions.bi_readers 2019-07-29 01:20:17.300679+00:00 [scheduled]>
[2019-07-29 01:20:23,440] {{jobs.py:1298}} INFO - Setting the following 5 tasks to queued state:
<TaskInstance: ldw_reset_permissions.service_readers 2019-07-29 01:20:17.300679+00:00 [queued]>
<TaskInstance: ldw_reset_permissions.marketing_readers 2019-07-29 01:20:17.300679+00:00 [queued]>
<TaskInstance: ldw_reset_permissions.finance_readers 2019-07-29 01:20:17.300679+00:00 [queued]>
<TaskInstance: ldw_reset_permissions.engineering_readers 2019-07-29 01:20:17.300679+00:00 [queued]>
<TaskInstance: ldw_reset_permissions.bi_readers 2019-07-29 01:20:17.300679+00:00 [queued]>
[2019-07-29 01:20:23,440] {{jobs.py:1334}} INFO - Sending ('ldw_reset_permissions', 'service_readers', datetime.datetime(2019, 7, 29, 1, 20, 17, 300679, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2019-07-29 01:20:23,444] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'ldw_reset_permissions', 'service_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,445] {{jobs.py:1334}} INFO - Sending ('ldw_reset_permissions', 'marketing_readers', datetime.datetime(2019, 7, 29, 1, 20, 17, 300679, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2019-07-29 01:20:23,446] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'ldw_reset_permissions', 'marketing_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,446] {{jobs.py:1334}} INFO - Sending ('ldw_reset_permissions', 'finance_readers', datetime.datetime(2019, 7, 29, 1, 20, 17, 300679, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2019-07-29 01:20:23,446] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'ldw_reset_permissions', 'finance_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,446] {{jobs.py:1334}} INFO - Sending ('ldw_reset_permissions', 'engineering_readers', datetime.datetime(2019, 7, 29, 1, 20, 17, 300679, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2019-07-29 01:20:23,447] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'ldw_reset_permissions', 'engineering_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,447] {{jobs.py:1334}} INFO - Sending ('ldw_reset_permissions', 'bi_readers', datetime.datetime(2019, 7, 29, 1, 20, 17, 300679, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2019-07-29 01:20:23,447] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'ldw_reset_permissions', 'bi_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:21:25,589] {{jobs.py:1468}} INFO - Executor reports execution of ldw_reset_permissions.marketing_readers execution_date=2019-07-29 01:20:17.300679+00:00 exited with status failed for try_number 1
[2019-07-29 01:21:25,599] {{jobs.py:1468}} INFO - Executor reports execution of ldw_reset_permissions.engineering_readers execution_date=2019-07-29 01:20:17.300679+00:00 exited with status failed for try_number 1
[2019-07-29 01:21:56,111] {{jobs.py:1468}} INFO - Executor reports execution of ldw_reset_permissions.service_readers execution_date=2019-07-29 01:20:17.300679+00:00 exited with status failed for try_number 1
[2019-07-29 01:22:28,133] {{jobs.py:1468}} INFO - Executor reports execution of ldw_reset_permissions.bi_readers execution_date=2019-07-29 01:20:17.300679+00:00 exited with status failed for try_number 1
Worker 1:
[2019-07-29 01:20:23,593: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[cb066498-e350-43c1-a23d-1bc33929717a]
[2019-07-29 01:20:23,605: INFO/ForkPoolWorker-15] Executing command in Celery: ['airflow', 'run', 'ldw_reset_permissions', 'service_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,627: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[d835c30a-e2bd-4f78-b291-d19b7bccad68]
[2019-07-29 01:20:23,637: INFO/ForkPoolWorker-1] Executing command in Celery: ['airflow', 'run', 'ldw_reset_permissions', 'finance_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:25,260] {{settings.py:182}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=44
[2019-07-29 01:20:25,263] {{settings.py:182}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=45
[2019-07-29 01:20:25,878] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-07-29 01:20:25,881] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-07-29 01:20:26,271] {{__init__.py:305}} INFO - Filling up the DagBag from /usr/local/airflow/dags/ldw_reset_permissions.py
[2019-07-29 01:20:26,276] {{__init__.py:305}} INFO - Filling up the DagBag from /usr/local/airflow/dags/ldw_reset_permissions.py
[2019-07-29 01:20:26,601] {{cli.py:517}} INFO - Running <TaskInstance: ldw_reset_permissions.finance_readers 2019-07-29T01:20:17.300679+00:00 [queued]> on host b4b0a799a7ca
[2019-07-29 01:20:26,604] {{cli.py:517}} INFO - Running <TaskInstance: ldw_reset_permissions.service_readers 2019-07-29T01:20:17.300679+00:00 [queued]> on host b4b0a799a7ca
[2019-07-29 01:20:39,364: INFO/MainProcess] missed heartbeat from celery#0f9db941bdd7
[2019-07-29 01:21:46,121: WARNING/MainProcess] Substantial drift from celery#0f9db941bdd7 may mean clocks are out of sync. Current drift is
70 seconds. [orig: 2019-07-29 01:21:46.117058 recv: 2019-07-29 01:20:36.485961]
[2019-07-29 01:21:46,127: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:42 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:21:46,294: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2019-07-29 01:21:49,853: ERROR/MainProcess] Process 'ForkPoolWorker-17' pid:62 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:29,230: ERROR/MainProcess] Process 'ForkPoolWorker-18' pid:63 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:44,002: INFO/MainProcess] missed heartbeat from celery#0f9db941bdd7
[2019-07-29 01:22:52,073: ERROR/MainProcess] Process 'ForkPoolWorker-19' pid:64 exited with 'signal 9 (SIGKILL)'
Worker 2:
[2019-07-29 01:20:23,605: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dbb9b813-255e-4284-b067-22b990d8b9a2]
[2019-07-29 01:20:23,609: INFO/ForkPoolWorker-15] Executing command in Celery: ['airflow', 'run', 'ldw_reset_permissions', 'marketing_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,616: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[42ee3e3a-620e-47da-add2-e5678973d87e]
[2019-07-29 01:20:23,622: INFO/ForkPoolWorker-1] Executing command in Celery: ['airflow', 'run', 'ldw_reset_permissions', 'engineering_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:23,632: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[be609901-60bc-4dcc-9374-7c802171f2db]
[2019-07-29 01:20:23,638: INFO/ForkPoolWorker-3] Executing command in Celery: ['airflow', 'run', 'ldw_reset_permissions', 'bi_readers', '2019-07-29T01:20:17.300679+00:00', '--local', '-sd', '/usr/local/airflow/dags/ldw_reset_permissions.py']
[2019-07-29 01:20:26,124] {{settings.py:182}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=45
[2019-07-29 01:20:26,127] {{settings.py:182}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=46
[2019-07-29 01:20:26,135] {{settings.py:182}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=44
[2019-07-29 01:20:27,025] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-07-29 01:20:27,033] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-07-29 01:20:27,047] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-07-29 01:20:27,798] {{__init__.py:305}} INFO - Filling up the DagBag from /usr/local/airflow/dags/ldw_reset_permissions.py
[2019-07-29 01:20:27,801] {{__init__.py:305}} INFO - Filling up the DagBag from /usr/local/airflow/dags/ldw_reset_permissions.py
[2019-07-29 01:20:27,806] {{__init__.py:305}} INFO - Filling up the DagBag from /usr/local/airflow/dags/ldw_reset_permissions.py
[2019-07-29 01:20:28,426] {{cli.py:517}} INFO - Running <TaskInstance: ldw_reset_permissions.engineering_readers 2019-07-29T01:20:17.300679+00:00 [queued]> on host 0f9db941bdd7
[2019-07-29 01:20:28,426] {{cli.py:517}} INFO - Running <TaskInstance: ldw_reset_permissions.marketing_readers 2019-07-29T01:20:17.300679+00:00 [queued]> on host 0f9db941bdd7
[2019-07-29 01:20:28,437] {{cli.py:517}} INFO - Running <TaskInstance: ldw_reset_permissions.bi_readers 2019-07-29T01:20:17.300679+00:00 [queued]> on host 0f9db941bdd7
[2019-07-29 01:20:56,752: INFO/MainProcess] missed heartbeat from celery#b4b0a799a7ca
[2019-07-29 01:20:56,764: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:42 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:20:56,903: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2019-07-29 01:20:57,623: WARNING/MainProcess] Substantial drift from celery#b4b0a799a7ca may mean clocks are out of sync. Current drift is
25 seconds. [orig: 2019-07-29 01:20:57.622959 recv: 2019-07-29 01:20:32.629294]
[2019-07-29 01:20:57,631: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:24 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:20:57,837: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2019-07-29 01:20:58,513: ERROR/MainProcess] Process 'ForkPoolWorker-17' pid:65 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:23,076: INFO/MainProcess] missed heartbeat from celery#b4b0a799a7ca
[2019-07-29 01:22:23,089: ERROR/MainProcess] Process 'ForkPoolWorker-19' pid:67 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:23,105: ERROR/MainProcess] Process 'ForkPoolWorker-18' pid:66 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:23,116: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:26 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:23,191: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
[2019-07-29 01:22:26,758: ERROR/MainProcess] Process 'ForkPoolWorker-22' pid:70 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:26,770: ERROR/MainProcess] Process 'ForkPoolWorker-21' pid:69 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:26,781: ERROR/MainProcess] Process 'ForkPoolWorker-20' pid:68 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:29,988: WARNING/MainProcess] process with pid=65 already exited
[2019-07-29 01:22:29,991: ERROR/MainProcess] Process 'ForkPoolWorker-24' pid:75 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:30,002: ERROR/MainProcess] Process 'ForkPoolWorker-23' pid:71 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:22:30,017: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:43 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:14,202: INFO/MainProcess] missed heartbeat from celery#b4b0a799a7ca
[2019-07-29 01:23:14,206: ERROR/MainProcess] Process 'ForkPoolWorker-28' pid:79 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:14,221: ERROR/MainProcess] Process 'ForkPoolWorker-27' pid:78 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:14,231: ERROR/MainProcess] Process 'ForkPoolWorker-26' pid:77 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:14,242: ERROR/MainProcess] Process 'ForkPoolWorker-25' pid:76 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:14,252: ERROR/MainProcess] Process 'ForkPoolWorker-14' pid:41 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:19,503: ERROR/MainProcess] Process 'ForkPoolWorker-33' pid:87 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:19,572: ERROR/MainProcess] Process 'ForkPoolWorker-32' pid:86 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:19,622: ERROR/MainProcess] Process 'ForkPoolWorker-31' pid:85 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:19,646: ERROR/MainProcess] Process 'ForkPoolWorker-30' pid:84 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:19,828: ERROR/MainProcess] Process 'ForkPoolWorker-29' pid:83 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:43,361: WARNING/MainProcess] process with pid=84 already exited
[2019-07-29 01:23:43,723: ERROR/MainProcess] Process 'ForkPoolWorker-38' pid:92 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:44,119: ERROR/MainProcess] Process 'ForkPoolWorker-37' pid:91 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:44,536: ERROR/MainProcess] Process 'ForkPoolWorker-36' pid:90 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:45,203: ERROR/MainProcess] Process 'ForkPoolWorker-35' pid:89 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:23:45,510: ERROR/MainProcess] Process 'ForkPoolWorker-34' pid:88 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:24:10,947: WARNING/MainProcess] process with pid=68 already exited
[2019-07-29 01:24:11,579: ERROR/MainProcess] Process 'ForkPoolWorker-43' pid:97 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:24:12,288: ERROR/MainProcess] Process 'ForkPoolWorker-42' pid:96 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:24:13,880: ERROR/MainProcess] Process 'ForkPoolWorker-41' pid:95 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:24:14,775: ERROR/MainProcess] Process 'ForkPoolWorker-40' pid:94 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:24:15,762: ERROR/MainProcess] Process 'ForkPoolWorker-39' pid:93 exited with 'signal 9 (SIGKILL)'
[2019-07-29 01:25:05,623: WARNING/MainProcess] process with pid=75 already exited
Suggestions on what's going on and how to remedy this?

It turns out this was an issue with AWS ECS, not with the Airflow configuration itself.
Through more testing and monitoring with htop, I noticed that regardless of the worker node count, one worker node would always spike its CPU continuously until the system killed it, as shown in the logs above.
The container/task definitions for the Airflow workers did not have CPU units explicitly set, on the assumption that they would be managed automatically since it wasn't a required field.
Specifying enough CPU units that each worker container/task landed on its own EC2 instance cleared things right up.
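For illustration only (names and numbers below are examples, not my actual task definition): a t2.small exposes 1024 CPU units (1 vCPU), so reserving the full amount for each worker container forces ECS to place one worker per instance. The relevant fragment of the worker's container definition would look something like:
{
  "name": "airflow-worker",
  "image": "my-airflow-image:latest",
  "cpu": 1024,
  "memory": 1900,
  "command": ["worker"]
}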

Related

Setup Airflow with remote Celery worker

I have Apache Airflow set up on a virtual machine within the local network and would like an extra Celery worker running on my local machine that still syncs with the rest of the Airflow system.
So far, after I start the worker on my local machine, the DAGs present on the local machine are not visible on the webserver (which runs on the VM) right away, but they appear briefly after I run airflow dags reserialize on the local machine.
I get these messages in the worker logs after doing so:
[2022-06-07 09:54:41,661] {dagbag.py:507} INFO - Filling up the DagBag from /Users/wilbertung/Documents/lowitest/airflow/dags
[2022-06-07 09:54:41,680] {dagbag.py:507} INFO - Filling up the DagBag from None
[2022-06-07 09:54:41,809] {dag.py:2379} INFO - Sync 2 DAGs
[2022-06-07 09:54:41,853] {dag.py:2923} INFO - Setting next_dagrun for ChiSo to 2022-06-06T01:54:41.852752+00:00, run_after=2022-06-07T01:54:41.852752+00:00
[2022-06-07 09:54:41,853] {dag.py:2923} INFO - Setting next_dagrun for lowi17 to 2022-06-06T16:00:00+00:00, run_after=2022-06-07T16:00:00+00:00
Then, in the scheduler logs I get the following messages:
[2022-06-07 09:54:42,473] {scheduler_job.py:353} INFO - 3 tasks up for execution:
<TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 0/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 1/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 2/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
<TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
[2022-06-07 09:54:42,476] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='台灣醒報', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,476] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '台灣醒報', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,477] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='台灣新生報', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,477] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '台灣新生報', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,477] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='華視新聞網', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,477] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '華視新聞網', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.台灣醒報 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.台灣新生報 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.華視新聞網 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,626] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=台灣醒報, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=2022-06-06 06:00:06.678844+00:00, run_end_date=2022-06-06 06:51:33.138733+00:00, run_duration=3086.459889, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=83, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=31538
[2022-06-07 09:54:42,627] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,639] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=台灣新生報, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=2022-06-06 06:00:06.005933+00:00, run_end_date=2022-06-06 06:51:33.156305+00:00, run_duration=3087.150372, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=85, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=31535
[2022-06-07 09:54:42,639] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,645] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=華視新聞網, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=None, run_end_date=2022-06-06 06:51:33.162201+00:00, run_duration=None, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=None
[2022-06-07 09:54:42,645] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,672] {dagrun.py:547} ERROR - Marking run <DagRun lowi17 # 2022-06-06 06:00:03.787848+00:00: manual__2022-06-06T06:00:03.787848+00:00, externally triggered: True> failed
[2022-06-07 09:54:42,672] {dagrun.py:607} INFO - DagRun Finished: dag_id=lowi17, execution_date=2022-06-06 06:00:03.787848+00:00, run_id=manual__2022-06-06T06:00:03.787848+00:00, run_start_date=2022-06-06 06:00:03.844994+00:00, run_end_date=2022-06-07 01:54:42.672853+00:00, run_duration=71678.827859, state=failed, external_trigger=True, run_type=manual, data_interval_start=2022-06-05 06:00:03.787848+00:00, data_interval_end=2022-06-06 06:00:03.787848+00:00, dag_hash=7f2d9c074e59bc29ace385f688864720
[2022-06-07 09:54:42,675] {dag.py:2923} INFO - Setting next_dagrun for lowi17 to 2022-06-06T06:00:03.787848+00:00, run_after=2022-06-07T06:00:03.787848+00:00
After this moment, the DAG becomes invisible on the webserver as if it never existed...
I am sure I am missing some important configuration of some sort. If so, which one?
Even if there is a way to put the DAG files into different absolute but identical relative folders and make that work, the most common and direct method, and the one I went with, was to mount a shared folder on both the main node and the remote worker so that they both access the same DAG folder (a minimal sketch is shown after the link below).
More details about it can be found here:
https://github.com/apache/airflow/discussions/24275
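As a minimal sketch of the shared-folder approach (host name and paths are placeholders, not from my actual setup): export the DAG directory from the main node over NFS, mount it on the remote worker, and point both machines' airflow.cfg at the same path:
# on the remote worker, mount the main node's exported DAG directory
sudo mount -t nfs main-node:/opt/airflow/dags /opt/airflow/dags
# in airflow.cfg on both machines
[core]
dags_folder = /opt/airflow/dags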

Spark Driver terminates with code 137 and no error message. What is the cause?

My Spark program is failing, and neither the scheduler, driver, nor executors provide any useful error apart from Exit status 137. What could be causing Spark to fail?
The crash seems to happen during the conversion of an RDD to a Dataframe:
val df = sqlc.createDataFrame(processedData, schema).persist()
Right before the crash, the logs look like this:
Scheduler
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 WARN TaskSetManager: Stage 11 contains a task of very large size (22028 KB). The maximum recommended task size is 100 KB.
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 23, 10.141.1.247, executor 1133b735-967d-136c-2bbf-ffcb3884c88c-1548129213980, partition 0, PROCESS_LOCAL, 22557269 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 24, 10.141.3.144, executor a92ceb18-b46a-c986-4672-cab9086c54c2-1548129202094, partition 1, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 25, 10.141.1.56, executor b9167d92-bed2-fe21-46fd-08f2c6fd1998-1548129206680, partition 2, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 26, 10.141.3.146, executor 0cf7394b-540d-2a6c-258a-e27bbedbdd0e-1548129212488, partition 3, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:09 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
...
19/01/22 04:13:45 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
19/01/22 04:13:46 INFO JobUtils: driver Terminated -- Exit status 137
19/01/22 04:13:46 INFO JobUtils: driver Restarting -- Restart within policy
Driver
19/01/22 04:01:12 INFO DAGScheduler: Job 7 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 8.008375 s
19/01/22 04:01:12 INFO SparkHadoopMapReduceWriter: Job job_20190122040104_0032 committed.
19/01/22 04:01:13 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
Executors (Some variation of this)
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
19/01/22 04:13:45 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.141.2.48:21297 disassociated! Shutting down.
19/01/22 04:13:45 INFO DiskBlockManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Deleting directory /alloc/spark-ce736cb6-8b8e-4891-b9c7-06ea9d9cf797

Spark unable to read Kafka topic, giving error "unable to connect to zookeeper server within timeout 6000"

I'm trying to execute the example program shipped with Spark on the HDP cluster, /spark2/examples/src/main/python/streaming/kafka_wordcount.py, which tries to read a Kafka topic but fails with a ZooKeeper server timeout error.
Spark is installed on the HDP cluster and Kafka is running on the HDF cluster; they are different clusters but sit in the same VPC on AWS.
The command executed to run the Spark example on the HDP cluster is:
bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.0.jar examples/src/main/python/streaming/kafka_wordcount.py HDF-cluster-ip-address:2181 topic
Error log:
-------------------------------------------
Time: 2018-06-20 07:51:56
-------------------------------------------
18/06/20 07:51:56 INFO JobScheduler: Finished job streaming job 1529481116000 ms.0 from job set of time 1529481116000 ms
18/06/20 07:51:56 INFO JobScheduler: Total delay: 0.171 s for time 1529481116000 ms (execution: 0.145 s)
18/06/20 07:51:56 INFO PythonRDD: Removing RDD 94 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 94
18/06/20 07:51:56 INFO BlockRDD: Removing RDD 89 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 89
18/06/20 07:51:56 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[89] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481116000 ms
18/06/20 07:51:56 INFO ReceivedBlockTracker: Deleting batches: 1529481114000 ms
18/06/20 07:51:56 INFO InputInfoTracker: remove old batch metadata: 1529481114000 ms
18/06/20 07:51:57 INFO JobScheduler: Added jobs for time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Starting job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Registering RDD 107 (call at /usr/hdp/2.6.5.0-292/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py:2257)
18/06/20 07:51:57 INFO DAGScheduler: Got job 27 (runJob at PythonRDD.scala:141) with 1 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 54 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 53)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 7.0 KB, free 366.0 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 4.1 KB, free 366.0 MB)
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_27_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 27 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 54.0 with 1 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 54.0 (TID 53, localhost, executor driver, partition 0, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 54.0 (TID 53)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -881, init = 921, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -881, init = 922, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 54.0 (TID 53). 1493 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 54.0 (TID 53) in 48 ms on localhost (executor driver) (1/1)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 54.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 54 (runJob at PythonRDD.scala:141) finished in 0.055 s
18/06/20 07:51:57 INFO DAGScheduler: Job 27 finished: runJob at PythonRDD.scala:141, took 0.058062 s
18/06/20 07:51:57 INFO ZooKeeper: Session: 0x0 closed
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Got job 28 (runJob at PythonRDD.scala:141) with 3 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 56 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 55)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Called receiver onStop
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Deregistering receiver 0
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 7.0 KB, free 365.9 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 4.1 KB, free 365.9 MB)
18/06/20 07:51:57 INFO ClientCnxn: EventThread shut down
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1, 2, 3))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 56.0 with 3 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 56.0 (TID 54, localhost, executor driver, partition 1, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 1.0 in stage 56.0 (TID 55, localhost, executor driver, partition 2, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 2.0 in stage 56.0 (TID 56, localhost, executor driver, partition 3, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 1.0 in stage 56.0 (TID 55)
18/06/20 07:51:57 INFO Executor: Running task 2.0 in stage 56.0 (TID 56)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 56.0 (TID 54)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopped receiver 0
18/06/20 07:51:57 INFO BlockGenerator: Stopping BlockGenerator
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -944, init = 985, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 1.0 in stage 56.0 (TID 55). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 1.0 in stage 56.0 (TID 55) in 52 ms on localhost (executor driver) (1/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 45, boot = -944, init = 989, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -32, init = 72, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 56.0 (TID 54). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 56.0 (TID 54) in 56 ms on localhost (executor driver) (2/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -33, init = 73, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 2.0 in stage 56.0 (TID 56). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 2.0 in stage 56.0 (TID 56) in 58 ms on localhost (executor driver) (3/3)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 56.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 56 (runJob at PythonRDD.scala:141) finished in 0.063 s
18/06/20 07:51:57 INFO DAGScheduler: Job 28 finished: runJob at PythonRDD.scala:141, took 0.065728 s
-------------------------------------------
Time: 2018-06-20 07:51:57
-------------------------------------------
18/06/20 07:51:57 INFO JobScheduler: Finished job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Total delay: 0.169 s for time 1529481117000 ms (execution: 0.149 s)
18/06/20 07:51:57 INFO PythonRDD: Removing RDD 102 from persistence list
18/06/20 07:51:57 INFO BlockManager: Removing RDD 102
18/06/20 07:51:57 INFO BlockRDD: Removing RDD 97 from persistence list
18/06/20 07:51:57 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[97] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481117000 ms
18/06/20 07:51:57 INFO BlockManager: Removing RDD 97
18/06/20 07:51:57 INFO ReceivedBlockTracker: Deleting batches: 1529481115000 ms
18/06/20 07:51:57 INFO InputInfoTracker: remove old batch metadata: 1529481115000 ms
18/06/20 07:51:57 INFO RecurringTimer: Stopped timer for BlockGenerator after time 1529481117400
18/06/20 07:51:57 INFO BlockGenerator: Waiting for block pushing thread to terminate
18/06/20 07:51:57 INFO BlockGenerator: Pushing out the last 0 blocks
18/06/20 07:51:57 INFO BlockGenerator: Stopped block pushing thread
18/06/20 07:51:57 INFO BlockGenerator: Stopped BlockGenerator
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
18/06/20 07:51:57 ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO TaskSchedulerImpl: Cancelling stage 0
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 0 (start at NativeMethodAccessorImpl.java:0) failed in 13.256 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
18/06/20 07:51:57 ERROR ReceiverTracker: Receiver has been stopped. Try to restart it.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even in the same VPC, check the security groups of the two systems. If they have different security groups, you probably need to allow the relevant inbound and outbound ports. Another way to verify connectivity is to telnet and ping each system from the other.
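For example, from the Spark (HDP) side you can quickly check that the ZooKeeper port is reachable (replace the host with your HDF ZooKeeper address):
ping HDF-cluster-ip-address
telnet HDF-cluster-ip-address 2181
# or, if telnet is not installed:
nc -vz HDF-cluster-ip-address 2181
If the connection is refused or times out, fix the security group rules (allow TCP 2181 between the two clusters) before changing anything in Spark.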

Spark&hbase: java.io.IOException: Connection reset by peer

I would appreciate it if you could help me.
While implementing Spark streaming from Kafka to HBase (code is attached), we have run into the issue "java.io.IOException: Connection reset by peer" (full log is attached).
The issue comes up when we write to HBase with the dynamic allocation option enabled in the Spark settings. If we write the data to HDFS (a Hive table) instead of HBase, or if dynamic allocation is off, no errors occur.
We have tried changing the ZooKeeper connections, the Spark executor idle timeout, and the network timeout. We have also tried changing the shuffle block transfer service (NIO), but the error is still there. If we cap the min/max executor count for dynamic allocation below 80, there are no problems either.
What might the problem be? There are many nearly identical issues in Jira and on Stack Overflow, but nothing has helped.
Versions:
HBase 1.2.0-cdh5.14.0
Kafka 3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1
Spark settings:
--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4
Cluster:
1+8 nodes, 70 CPUs, 755 GB RAM, 10x HDD
Log:
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.
Try increasing these two parameters (a minimal sketch of setting them follows the short list below). Also try caching the DataFrame before writing to HBase.
spark.network.timeout
spark.executor.heartbeatInterval
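Hypothetical illustration only (the values and the kafka-to-hbase app name are placeholders, not tested recommendations): applying those two settings plus the caching suggestion from Scala might look roughly like this.

import org.apache.spark.SparkConf

// Raise the network timeout and keep the heartbeat interval well below it,
// so a busy executor is not declared dead while blocks are still in flight.
val conf = new SparkConf()
  .setAppName("kafka-to-hbase")                    // placeholder app name
  .set("spark.network.timeout", "800s")            // default is 120s
  .set("spark.executor.heartbeatInterval", "60s")  // default is 10s; must stay below the network timeout

// Inside each batch, assuming `df` is the DataFrame about to be written to HBase:
// df.cache()
// df.count()      // materialise it once so the HBase write does not recompute it
// ...write df to HBase...
// df.unpersist()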
Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark
It also took me a while to understand why Cloudera states the following:
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable
dynamic allocation by setting spark.dynamicAllocation.enabled to false
when running streaming applications.
Reference: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_dynamic_allocation_streaming
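Given that the error only appears when dynamic allocation is on, the simplest experiment is to follow that recommendation and pin the executor count. A sketch under that assumption, reusing the 80 executors already passed via --num-executors:

import org.apache.spark.SparkConf

// Sketch only: disable dynamic allocation for the streaming job and fall back
// to a fixed executor count instead of min/max bounds.
val fixedConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "80")  // matches --num-executors from the question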

Spark program hangs at Job finished: toArray - workers throw java.util.concurrent.TimeoutException

So I have a simple Spark job where I'm trying to work out how to write bytes into a sequence file. It was working fine, then suddenly the job hangs, seemingly at the end - in particular at this line:
14/06/06 10:57:48 INFO SparkContext: Job finished: toArray at XXXX.scala:104, took 44.439736728 s
So I had a look at the stderr logs on the workers and I see this:
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:162)
at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
The job output has some weird INFO messages I've not seen before:
14/06/06 11:08:28 INFO TaskSetManager: Finished TID 2 in 2163 ms on ip-172-31-23-17.ec2.internal (progress: 0/5)
14/06/06 11:08:28 INFO DAGScheduler: Completed ResultTask(1, 0)
14/06/06 11:08:30 INFO TaskSetManager: Finished TID 3 in 3635 ms on ip-172-31-29-86.ec2.internal (progress: 1/5)
14/06/06 11:08:30 INFO DAGScheduler: Completed ResultTask(1, 1)
^^ Normal output - I see this in jobs all the time. But below that there are lots of weird messages.
14/06/06 11:08:50 INFO BlockManagerMasterActor$BlockManagerInfo: Added taskresult_6 in memory on ip-172-31-30-95.ec2.internal:41661 (size: 253.9 MB, free: 2.6 GB)
14/06/06 11:08:50 INFO SendingConnection: Initiating connection to [ip-172-31-30-95.ec2.internal/172.31.30.95:41661]
14/06/06 11:08:50 INFO SendingConnection: Connected to [ip-172-31-30-95.ec2.internal/172.31.30.95:41661], 1 messages pending
14/06/06 11:08:50 INFO ConnectionManager: Accepted connection from [ip-172-31-30-95.ec2.internal/172.31.30.95]
14/06/06 11:08:52 INFO TaskSetManager: Finished TID 6 in 25831 ms on ip-172-31-30-95.ec2.internal (progress: 2/5)
14/06/06 11:08:52 INFO BlockManagerMasterActor$BlockManagerInfo: Removed taskresult_6 on ip-172-31-30-95.ec2.internal:41661 in memory (size: 253.9 MB, free: 2.9 GB)
14/06/06 11:08:53 INFO DAGScheduler: Completed ResultTask(1, 4)
14/06/06 11:08:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added taskresult_4 in memory on ip-172-31-22-58.ec2.internal:46736 (size: 329.3 MB, free: 2.6 GB)
14/06/06 11:08:57 INFO SendingConnection: Initiating connection to [ip-172-31-22-58.ec2.internal/172.31.22.58:46736]
14/06/06 11:08:57 INFO SendingConnection: Connected to [ip-172-31-22-58.ec2.internal/172.31.22.58:46736], 1 messages pending
14/06/06 11:08:57 INFO ConnectionManager: Accepted connection from [ip-172-31-22-58.ec2.internal/172.31.22.58]
14/06/06 11:09:00 INFO TaskSetManager: Finished TID 4 in 33738 ms on ip-172-31-22-58.ec2.internal (progress: 3/5)
14/06/06 11:09:00 INFO BlockManagerMasterActor$BlockManagerInfo: Removed taskresult_4 on ip-172-31-22-58.ec2.internal:46736 in memory (size: 329.3 MB, free: 2.9 GB)
14/06/06 11:09:02 INFO DAGScheduler: Completed ResultTask(1, 2)
If I'm then very patient, eventually the job spits out some more weird stuff:
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-30-95.ec2.internal,41661)
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/9 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-30-95.ec2.internal,41661)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl#6b071630
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-22-58.ec2.internal,46736)
14/06/06 11:14:15 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-172-31-22-58.ec2.internal,46736)
14/06/06 11:14:15 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl#6b071630
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/9 removed: Command exited with code 50
14/06/06 11:14:15 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, ip-172-31-30-95.ec2.internal, 41661, 0) with no recent heart beats: 132381ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 9 disconnected, so removing it
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(6, ip-172-31-17-30.ec2.internal, 43082, 0) with no recent heart beats: 132382ms exceeds 45000ms
14/06/06 11:14:15 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(<driver>, ip-172-31-23-17.ec2.internal, 55101, 0) with no recent heart beats: 132385ms exceeds 45000ms
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 9 (already removed): Uncaught exception
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl#3c39a92
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(8, ip-172-31-22-58.ec2.internal, 46736, 0) with no recent heart beats: 132377ms exceeds 45000ms
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/10 on worker-20140606110717-ip-172-31-21-172.ec2.internal-7078 (ip-172-31-21-172.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl#3c39a92
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:267)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(9, ip-172-31-21-172.ec2.internal, 42635, 0) with no recent heart beats: 132384ms exceeds 45000ms
14/06/06 11:14:15 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl#46000f2b
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/10 on hostPort ip-172-31-21-172.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-28-236.ec2.internal, 35129, 0) with no recent heart beats: 132379ms exceeds 45000ms
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/10 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/4 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/4 removed: Command exited with code 50
14/06/06 11:14:15 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl#46000f2b
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:267)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 4 on ip-172-31-28-73.ec2.internal: Uncaught exception
14/06/06 11:14:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-172-31-28-236.ec2.internal,35129)
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/11 on worker-20140606110708-ip-172-31-28-73.ec2.internal-7078 (ip-172-31-28-73.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 4 (epoch 0)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/11 on hostPort ip-172-31-28-73.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/3 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/3 removed: Command exited with code 50
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 1 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 1 on ip-172-31-30-95.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 3 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 3 (already removed): Uncaught exception
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 7 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 7 on ip-172-31-28-236.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(2, ip-172-31-23-17.ec2.internal, 44685, 0) with no recent heart beats: 132373ms exceeds 45000ms
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-172-31-24-194.ec2.internal, 47896, 0) with no recent heart beats: 132382ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 5 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 5 on ip-172-31-29-86.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/12 on worker-20140606110708-ip-172-31-26-188.ec2.internal-7078 (ip-172-31-26-188.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-172-31-29-86.ec2.internal, 48078, 0) with no recent heart beats: 132380ms exceeds 45000ms
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 8 disconnected, so removing it
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 8 on ip-172-31-22-58.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/12 on hostPort ip-172-31-26-188.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/6 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 2 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/6 removed: Command exited with code 50
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 1 (epoch 1)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost executor 2 on ip-172-31-23-17.ec2.internal: remote Akka client disassociated
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 7 (epoch 2)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster.
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 0 (already removed): remote Akka client disassociated
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor 6 disconnected, so removing it
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
14/06/06 11:14:15 ERROR TaskSchedulerImpl: Lost an executor 6 (already removed): remote Akka client disassociated
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 5 (epoch 3)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster.
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/13 on worker-20140606110717-ip-172-31-17-30.ec2.internal-7078 (ip-172-31-17-30.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/13 on hostPort ip-172-31-17-30.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 8 (epoch 4)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 8 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 8 successfully in removeExecutor
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/11 is now RUNNING
14/06/06 11:14:15 INFO DAGScheduler: Executor lost: 2 (epoch 5)
14/06/06 11:14:15 INFO BlockManagerMasterActor: Trying to remove executor 2 from BlockManagerMaster.
14/06/06 11:14:15 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/13 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/12 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/0 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/0 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/14 on worker-20140606110706-ip-172-31-24-194.ec2.internal-7078 (ip-172-31-24-194.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/14 on hostPort ip-172-31-24-194.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/14 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/5 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/5 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/15 on worker-20140606110706-ip-172-31-29-86.ec2.internal-7078 (ip-172-31-29-86.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/15 on hostPort ip-172-31-29-86.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/15 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/1 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/1 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/16 on worker-20140606110708-ip-172-31-30-95.ec2.internal-7078 (ip-172-31-30-95.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/16 on hostPort ip-172-31-30-95.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/16 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/8 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/8 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/17 on worker-20140606110708-ip-172-31-22-58.ec2.internal-7078 (ip-172-31-22-58.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/17 on hostPort ip-172-31-22-58.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/17 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/7 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/7 removed: Command exited with code 50
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor added: app-20140606110822-0000/18 on worker-20140606110706-ip-172-31-28-236.ec2.internal-7078 (ip-172-31-28-236.ec2.internal:7078) with 8 cores
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140606110822-0000/18 on hostPort ip-172-31-28-236.ec2.internal:7078 with 8 cores, 5.0 GB RAM
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/18 is now RUNNING
14/06/06 11:14:15 INFO AppClient$ClientActor: Executor updated: app-20140606110822-0000/2 is now FAILED (Command exited with code 50)
14/06/06 11:14:15 INFO SparkDeploySchedulerBackend: Executor app-20140606110822-0000/2 removed: Command exited with code 50
14/06/06 11:14:15 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/06 11:14:15 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
Then it just hangs again ... and if I'm patient, it eventually spits out the following and hangs once more:
14/06/06 11:14:15 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/06 11:16:54 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3, ip-172-31-26-188.ec2.internal, 55392, 0) with no recent heart beats: 159686ms exceeds 45000ms
14/06/06 11:19:42 WARN BlockManagerMaster: Error sending message to BlockManagerMaster in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:162)
at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
After 10 mins, my patience runs out and I kill -9 it (normal interrupt doesn't work).
The question is, how do I get my cluster back to the state where it worked? It seems Spark is holding some state somewhere that we can't zap. We have tried deleting the Spark cache files, i.e. .../spark/spark-*, and we have tried restarting all the workers and the master!
UPDATE:
I think the problem could be that the file I thought I was reading got corrupted in some way, which made it grow to about 370 MB. Calling toArray on that much data may have caused things to go haywire. After deleting the file and trying again on other files, things returned to normal. Nevertheless, I'm leaving the question open, as the observed behaviour isn't what one would expect - one would simply expect a long wait, possibly followed by an OOM.
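For what it's worth, one way to sidestep pulling that much data back to the driver with toArray is to write the bytes into the sequence file directly from the executors. A rough sketch, assuming an RDD[Array[Byte]] like the one implied above (the RDD contents, app name, and output path below are placeholders):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits on 1.x-era Spark

val sc = new SparkContext(new SparkConf().setAppName("bytes-to-seqfile"))

// Stand-in for the real RDD[Array[Byte]] built earlier in the job.
val bytesRdd = sc.parallelize(Seq(Array[Byte](1, 2, 3), Array[Byte](4, 5)))

bytesRdd
  .map(b => (NullWritable.get(), new BytesWritable(b)))  // wrap each record for Hadoop I/O
  .saveAsHadoopFile(
    "hdfs:///tmp/output-seq",                            // placeholder output path
    classOf[NullWritable],
    classOf[BytesWritable],
    classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]])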