Airflow dags are getting stuck in "running" state indefinitely

My DAGs get stuck in the "running" state indefinitely.
Even when I mark them as "failed" and rerun them, they still get stuck. When I check the Airflow UI, the DAG is shown in the "running" state.
When I check my Airflow Celery worker logs, these are the last lines, and nothing is logged after them:
[...]
[2021-05-24 14:14:31,486] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,490] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,498] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,505] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,502] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,508] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:56,679: WARNING/ForkPoolWorker-23] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,682: WARNING/ForkPoolWorker-21] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,719: WARNING/ForkPoolWorker-8] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,722: WARNING/ForkPoolWorker-18] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,742: WARNING/ForkPoolWorker-26] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,754: WARNING/ForkPoolWorker-28] Running <TaskInstance: *****A1 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
Below are some of the logs from the scheduler:
[2021-05-24 14:09:18,552] {scheduler_job.py:1006} INFO - DAG *** has 32/32 running and queued tasks
[2021-05-24 14:09:18,552] {scheduler_job.py:1014} INFO - Not executing <TaskInstance: **** 2021-05-24 05:45:57.736363+00:00 [scheduled]> since the number of tasks running or queued from DAG **** is >= to the DAG's task concurrency limit of 32.
[...]
/home/ubuntu/**** 1 0 0.65s 2021-05-23T07:46:28
/home/ubuntu/**** 1 0 0.56s 2021-05-23T07:46:27
/home/ubuntu/**** 1 0 0.47s 2021-05-23T07:46:54
/home/ubuntu/**** 1 0 1.18s 2021-05-23T07:47:03
/home/ubuntu/**** 2 0 1.25s 2021-05-23T07:46:20
/home/ubuntu/**** 2 0 1.26s 2021-05-23T07:46:20
/home/ubuntu/**** 2 0 1.30s 2021-05-23T07:46:19
/home/ubuntu/**** 113 0 2.91s 2021-05-23T07:47:05
/home/ubuntu/**** 459 0 7.85s 2021-05-23T07:46:38
================================================================================
[2021-05-23 15:47:58,271] {scheduler_job.py:182} INFO - Started process (PID=13794) to work on ********
[2021-05-23 15:47:58,272] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:47:58,273] {scheduler_job.py:629} INFO - Processing file ******************** for tasks to queue
[2021-05-23 15:47:58,273] {dagbag.py:451} INFO - Filling up the DagBag from ********************
[2021-05-23 15:47:58,273] {scheduler_job.py:190} INFO - Processing ******** took 0.014 seconds
[2021-05-23 15:47:58,274] {scheduler_job.py:641} WARNING - No viable dags retrieved from ************
[2021-05-23 15:47:58,275] {scheduler_job.py:182} INFO - Started process (PID=13797) to work on **************
[2021-05-23 15:47:58,275] {scheduler_job.py:190} INFO - Processing ******** took 0.014 seconds
[2021-05-23 15:47:58,276] {scheduler_job.py:629} INFO - Processing file ************** for tasks to queue
[2021-05-23 15:47:58,276] {dagbag.py:451} INFO - Filling up the DagBag from **************
[2021-05-23 15:47:58,277] {scheduler_job.py:641} WARNING - No viable dags retrieved from **************
[2021-05-23 15:47:58,278] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,281] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,285] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,287] {scheduler_job.py:190} INFO - Processing ************** took 0.015 seconds
[2021-05-23 15:48:02,300] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:02,302] {scheduler_job.py:182} INFO - Started process (PID=13932) to work on *****************.py
[2021-05-23 15:48:02,303] {scheduler_job.py:629} INFO - Processing file ***************** for tasks to queue
[2021-05-23 15:48:02,304] {dagbag.py:451} INFO - Filling up the DagBag from *****************
[2021-05-23 15:48:02,434] {scheduler_job.py:641} WARNING - No viable dags retrieved from *****************
[2021-05-23 15:48:02,445] {scheduler_job.py:190} INFO - Processing ***************** took 0.144 seconds
[2021-05-23 15:48:02,452] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:02,455] {scheduler_job.py:182} INFO - Started process (PID=13934) to work on *****************
[2021-05-23 15:48:02,456] {scheduler_job.py:629} INFO - Processing file ***************** for tasks to queue
[2021-05-23 15:48:02,456] {dagbag.py:451} INFO - Filling up the DagBag from *****************
[2021-05-23 15:48:03,457] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:03,460] {scheduler_job.py:182} INFO - Started process (PID=13959) to work on ********
[2021-05-23 15:48:03,461] {scheduler_job.py:629} INFO - Processing file ******** for tasks to queue
[2021-05-23 15:48:03,461] {dagbag.py:451} INFO - Filling up the DagBag from ********
[2021-05-23 15:48:03,501] {scheduler_job.py:641} WARNING - No viable dags retrieved from *****************
[2021-05-23 15:48:03,514] {scheduler_job.py:190} INFO - Processing ***************** took 1.061 seconds
[2021-05-23 15:48:04,547] {scheduler_job.py:639} INFO - DAG(s) dict_keys(['****']) retrieved from *******
[2021-05-23 15:48:04,559] {dag.py:1824} INFO - Sync 1 DAGs
[2021-05-23 15:48:04,568] {dag.py:2280} INFO - Setting next_dagrun for **** to 2021-05-23T04:00:00+00:00
[2021-05-23 15:48:04,572] {scheduler_job.py:190} INFO - Processing ********took 1.115 seconds
[2021-05-23 15:48:05,538] {dag_processing.py:1092} INFO - Finding 'running' jobs without a recent heartbeat
[2021-05-23 15:48:05,539] {dag_processing.py:1096} INFO - Failing jobs without heartbeat after 2021-05-23 07:43:05.539102+00:00
[2021-05-23 15:48:05,546] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:05,549] {scheduler_job.py:182} INFO - Started process (PID=14077) to work on ********
[2021-05-23 15:48:05,549] {scheduler_job.py:629} INFO - Processing file ********for tasks to queue
[2021-05-23 15:48:05,550] {dagbag.py:451} INFO - Filling up the DagBag from ********
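The scheduler line "DAG *** has 32/32 running and queued tasks" is the kind of message that can be checked mechanically when scanning a long scheduler log. The following is a minimal sketch, assuming the log format shown above; the function name and the DAG ids in the sample are hypothetical (the real ids are masked in the excerpt), and it only reads text, it does not talk to Airflow.

```python
import re

# Matches scheduler lines such as:
#   "... INFO - DAG my_dag has 32/32 running and queued tasks"
SLOT_LINE = re.compile(
    r"DAG (?P<dag>\S+) has (?P<used>\d+)/(?P<limit>\d+) running and queued tasks"
)

def saturated_dags(log_lines):
    """Return {dag_id: (used, limit)} for DAGs whose latest report hit the limit."""
    latest = {}
    for line in log_lines:
        match = SLOT_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the last report counts.
            latest[match["dag"]] = (int(match["used"]), int(match["limit"]))
    return {dag: usage for dag, usage in latest.items() if usage[0] >= usage[1]}

# Hypothetical sample lines in the same format as the excerpt above.
sample = [
    "[2021-05-24 14:09:18,552] {scheduler_job.py:1006} INFO - DAG example_dag has 32/32 running and queued tasks",
    "[2021-05-24 14:09:19,100] {scheduler_job.py:1006} INFO - DAG other_dag has 3/16 running and queued tasks",
]
print(saturated_dags(sample))
```

A DAG that stays in this result across many scheduler loops while the worker log shows no progress matches the symptom described here: all slots are held by task instances that never finish.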
To make it work again, I have to:
Empty the running slots and queued slots in Admin > Pools by marking the tasks as Failed
Restart the Airflow Celery worker
I saw on GitHub that other people have recently faced this issue:
https://github.com/apache/airflow/issues/13542
Link to my GitHub issue:
https://github.com/apache/airflow/issues/15978
I also noticed odd behavior with Airflow backfilling: my previous DAG runs are still being queued and run even after doing all of the following:
Setting catchup_by_default=False in airflow.cfg
Setting catchup=False in the DAG definition
Using LatestOnlyOperator
I would appreciate it if someone could help me solve this issue. Thank you!
Configuration:
Apache Airflow version: 2.0.2
OS : Ubuntu 18.04.3 (AWS EC2)
Install tools: celery = 4.4.7, redis = 3.5.3

Setup Airflow with remote Celery worker

I have Apache Airflow set up on a virtual machine within the local network, and I would like an extra Celery worker to run on my local machine while staying in sync with the rest of the Airflow system.
So far, after I start the worker on my local machine, the DAGs present on the local machine are not visible on the webserver (which runs on the VM) right away, but they do appear briefly after I run airflow dags reserialize on the local machine.
I get these messages in the worker logs after doing so:
[2022-06-07 09:54:41,661] {dagbag.py:507} INFO - Filling up the DagBag from /Users/wilbertung/Documents/lowitest/airflow/dags
[2022-06-07 09:54:41,680] {dagbag.py:507} INFO - Filling up the DagBag from None
[2022-06-07 09:54:41,809] {dag.py:2379} INFO - Sync 2 DAGs
[2022-06-07 09:54:41,853] {dag.py:2923} INFO - Setting next_dagrun for ChiSo to 2022-06-06T01:54:41.852752+00:00, run_after=2022-06-07T01:54:41.852752+00:00
[2022-06-07 09:54:41,853] {dag.py:2923} INFO - Setting next_dagrun for lowi17 to 2022-06-06T16:00:00+00:00, run_after=2022-06-07T16:00:00+00:00
Then, in the scheduler logs I get the following messages:
[2022-06-07 09:54:42,473] {scheduler_job.py:353} INFO - 3 tasks up for execution:
<TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 0/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 1/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:418} INFO - DAG lowi17 has 2/16 running and queued tasks
[2022-06-07 09:54:42,473] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
<TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
<TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [scheduled]>
[2022-06-07 09:54:42,476] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='台灣醒報', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,476] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '台灣醒報', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,477] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='台灣新生報', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,477] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '台灣新生報', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,477] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='lowi17', task_id='華視新聞網', run_id='manual__2022-06-06T06:00:03.787848+00:00', try_number=3, map_index=-1) to executor with priority 1 and queue default
[2022-06-07 09:54:42,477] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'lowi17', '華視新聞網', 'manual__2022-06-06T06:00:03.787848+00:00', '--local', '--subdir', '/Users/wilbertung/Documents/lowitest/airflow/dags/DAG_lowi50.py']
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.台灣醒報 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.台灣新生報 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,621] {scheduler_job.py:599} INFO - Executor reports execution of lowi17.華視新聞網 run_id=manual__2022-06-06T06:00:03.787848+00:00 exited with status failed for try_number 3
[2022-06-07 09:54:42,626] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=台灣醒報, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=2022-06-06 06:00:06.678844+00:00, run_end_date=2022-06-06 06:51:33.138733+00:00, run_duration=3086.459889, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=83, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=31538
[2022-06-07 09:54:42,627] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.台灣醒報 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,639] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=台灣新生報, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=2022-06-06 06:00:06.005933+00:00, run_end_date=2022-06-06 06:51:33.156305+00:00, run_duration=3087.150372, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=85, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=31535
[2022-06-07 09:54:42,639] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.台灣新生報 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,645] {scheduler_job.py:643} INFO - TaskInstance Finished: dag_id=lowi17, task_id=華視新聞網, run_id=manual__2022-06-06T06:00:03.787848+00:00, map_index=-1, run_start_date=None, run_end_date=2022-06-06 06:51:33.162201+00:00, run_duration=None, state=queued, executor_state=failed, try_number=3, max_tries=2, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=BashOperator, queued_dttm=2022-06-07 01:54:42.474017+00:00, queued_by_job_id=100, pid=None
[2022-06-07 09:54:42,645] {scheduler_job.py:672} ERROR - Executor reports task instance <TaskInstance: lowi17.華視新聞網 manual__2022-06-06T06:00:03.787848+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-06-07 09:54:42,672] {dagrun.py:547} ERROR - Marking run <DagRun lowi17 # 2022-06-06 06:00:03.787848+00:00: manual__2022-06-06T06:00:03.787848+00:00, externally triggered: True> failed
[2022-06-07 09:54:42,672] {dagrun.py:607} INFO - DagRun Finished: dag_id=lowi17, execution_date=2022-06-06 06:00:03.787848+00:00, run_id=manual__2022-06-06T06:00:03.787848+00:00, run_start_date=2022-06-06 06:00:03.844994+00:00, run_end_date=2022-06-07 01:54:42.672853+00:00, run_duration=71678.827859, state=failed, external_trigger=True, run_type=manual, data_interval_start=2022-06-05 06:00:03.787848+00:00, data_interval_end=2022-06-06 06:00:03.787848+00:00, dag_hash=7f2d9c074e59bc29ace385f688864720
[2022-06-07 09:54:42,675] {dag.py:2923} INFO - Setting next_dagrun for lowi17 to 2022-06-06T06:00:03.787848+00:00, run_after=2022-06-07T06:00:03.787848+00:00
After this moment, the DAG becomes invisible on the webserver as if it never existed...
I am sure I am missing some important configuration of some sort. If so, which one?

There may be a way to make this work by placing the DAG files in folders that have different absolute paths but the same RELATIVE path on each machine. However, the most common and direct method, and the one I went with, is to mount a shared folder on both the main node and the remote worker so that both can access the same DAG folder.
More details about it can be found here:
https://github.com/apache/airflow/discussions/24275
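To illustrate why the absolute path matters, here is a small sketch using only the standard library. The local DAG folder and file name are taken from the scheduler log in the question; the VM-side folder `/opt/airflow/dags` and the helper names are hypothetical, and this shows the general idea only, not how Airflow resolves paths internally.

```python
from pathlib import PurePosixPath

def is_under(path, folder):
    """True if `path` lies inside `folder` (pure path arithmetic, no disk access)."""
    try:
        PurePosixPath(path).relative_to(folder)
        return True
    except ValueError:
        return False

def to_relative(dag_file, dags_folder):
    """Express an absolute DAG file path relative to a machine's DAG folder."""
    return PurePosixPath(dag_file).relative_to(dags_folder)

def resolve_on(relative, dags_folder):
    """Rebuild the absolute path of a DAG file on another machine."""
    return PurePosixPath(dags_folder) / relative

local_dags = "/Users/wilbertung/Documents/lowitest/airflow/dags"  # from the log above
vm_dags = "/opt/airflow/dags"                                     # hypothetical VM folder
dag_file = local_dags + "/DAG_lowi50.py"

# The absolute path recorded by the local machine means nothing on the VM...
print(is_under(dag_file, vm_dags))
# ...but the path relative to the DAG folder can be resolved on either side.
relative = to_relative(dag_file, local_dags)
print(relative)
print(resolve_on(relative, vm_dags))
```

Mounting one shared folder at the same location on every node sidesteps the problem, because the absolute path is then valid everywhere.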

Timeout while streaming messages from message queue

I am processing messages from IBM MQ with a Scala program. It was working fine and then stopped working without any code change.
The timeout occurs from time to time, with no specific pattern.
I run the application like this:
spark-submit --conf spark.streaming.driver.writeAheadLog.allowBatching=true --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 --class com.ibm.spark.streaming.mq.SparkMQExample --master yarn --deploy-mode client --num-executors 1 $jar_file_loc lots of args here >> script.out.log 2>> script.err.log < /dev/null
I tried two properties:
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
spark.streaming.driver.writeAheadLog.allowBatching true
See error:
2021-12-14 14:13:05 WARN ReceivedBlockTracker:90 - Exception thrown while writing record: BatchAllocationEvent(1639487580000 ms,AllocatedBlocks(Map(0 -> Queue()))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:238)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:209)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2021-12-14 14:13:05 INFO ReceivedBlockTracker:57 - Possibly processed batch 1639487580000 ms needs to be processed again in WAL recovery
2021-12-14 14:13:05 INFO JobScheduler:57 - Added jobs for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
rdd is empty
2021-12-14 14:13:05 INFO JobScheduler:57 - Starting job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Finished job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Total delay: 5.011 s for time 1639487580000 ms (execution: 0.001 s)
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
2021-12-14 14:13:05 INFO BlockRDD:57 - Removing RDD 284 from persistence list
2021-12-14 14:13:05 INFO PluggableInputDStream:57 - Removing blocks of RDD BlockRDD[284] at receiverStream at JmsStreamUtils.scala:64 of time 1639487580000 ms
2021-12-14 14:13:05 INFO BlockManager:57 - Removing RDD 284
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
Any kind of information would be useful. Thank you!
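One detail worth making explicit: the command line passes batchingTimeout=15000, while the stack trace reports a timeout after 5000 milliseconds. The sketch below only makes that comparison mechanical; the helper names are hypothetical, the command is abbreviated from the one above, and it does not explain why the two values differ.

```python
import re
import shlex

def conf_from_command(command):
    """Collect --conf key=value pairs from a spark-submit command line."""
    tokens = shlex.split(command)
    conf = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "--conf" and "=" in value:
            key, _, val = value.partition("=")
            conf[key] = val
    return conf

def timeout_from_log(line):
    """Extract the timeout (ms) from a 'Futures timed out' line, or None."""
    match = re.search(r"Futures timed out after \[(\d+) milliseconds\]", line)
    return int(match.group(1)) if match else None

# Abbreviated from the command shown in the question.
command = (
    "spark-submit "
    "--conf spark.streaming.driver.writeAheadLog.allowBatching=true "
    "--conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 "
    "--master yarn --deploy-mode client"
)
log_line = "java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]"

key = "spark.streaming.driver.writeAheadLog.batchingTimeout"
configured = int(conf_from_command(command)[key])
observed = timeout_from_log(log_line)
print(configured, observed, configured == observed)
```

If the two numbers disagree, as they do here, it may be worth checking which configuration values the running driver actually picked up before tuning the timeout further.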

Slow running Airflow 1.10.2 ETL when using ExternalTaskSensor for DAG task dependency?

I have two DAGs that I need to run with Airflow 1.10.2 + the CeleryExecutor. The first DAG (DAG1) is a long-running data load from s3 into Redshift (3+ hours). My second DAG (DAG2) performs computations on data loaded by DAG1. I want to include an ExternalTaskSensor in DAG2 so that the computations are reliably performed after the data loads. Theoretically so simple!
I can successfully get DAG2 to wait for DAG1 to complete by ensuring both DAGs are scheduled to start at the same time (schedule="0 8 * * *" for both DAGs) and making DAG2 dependent on the final task in DAG1. But I'm seeing a massive delay in our ETL on DAG1 when I introduce the sensor. At first I thought it was because my original implementation used mode="poke", which I understand locks a worker. However, even after I changed this to mode="reschedule", as described in the docs https://airflow.readthedocs.io/en/stable/_modules/airflow/sensors/base_sensor_operator.html, I still see a massive ETL delay.
I'm using the ExternalTaskSensor code below in DAG2:
# Import path for Airflow 1.10.x; `dag` is the DAG2 object defined elsewhere in the file
from airflow.sensors.external_task_sensor import ExternalTaskSensor

wait_for_data_load = ExternalTaskSensor(
    dag=dag,
    task_id="wait_for_data_load",
    external_dag_id="dag1",
    external_task_id="dag1_final_task_id",
    mode="reschedule",
    poke_interval=1800,  # check every 30 min
    timeout=43200,  # timeout after 12 hours (catch delayed data load runs)
    soft_fail=False,  # if the task fails, we assume a failure
)
If the code were working properly, I'd expect the sensor to perform a quick check of whether DAG1 had finished and, if not, reschedule itself for 30 minutes later as defined by the poke_interval, causing no delay to the DAG1 ETL. If DAG1 failed to complete within 12 hours, DAG2 would stop poking and fail.
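The behaviour I expect can be sketched as a plain-Python simulation. This is a simplified model under my own assumptions, not Airflow's implementation: the function name is hypothetical, times are seconds since the first poke, and the 3 h 10 min finish time for DAG1 is only an example.

```python
def simulate_reschedule_sensor(poke_interval, timeout, upstream_done_at):
    """Simplified model of a sensor in reschedule mode (not Airflow's code).

    Returns the list of poke times and the final outcome.
    """
    pokes = []
    t = 0
    while True:
        pokes.append(t)  # a worker slot is held only for this quick check
        if t >= upstream_done_at:
            return pokes, "success"
        if t >= timeout:
            return pokes, "failed"  # soft_fail=False: timing out is a failure
        t += poke_interval  # slot is released until the next poke

# DAG1 finishes 3 h 10 min after the sensor starts (example value).
pokes, outcome = simulate_reschedule_sensor(
    poke_interval=1800, timeout=43200, upstream_done_at=3 * 3600 + 600
)
print(len(pokes), pokes[-1], outcome)
```

Under this model the sensor occupies a worker for only a handful of short checks over the whole load, which is why I would not expect it to slow DAG1 down.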
Instead, I'm getting frequent errors for each of the tasks in DAG1 saying (for example) Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? even though the tasks are completing successfully (with some delay). Just before this error is sent, I see a line in our Sentry logs saying Executor reports dag1.data_table_temp_redshift_load execution_date=2019-05-20 08:00:00+00:00 as failed for try_number 1 though (again) I can see the task succeeded.
The logs on DAG2 are also looking a bit strange. I'm seeing repeated attempts logged at the same time intervals like the excerpt below:
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag1. wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1. dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:01:48,125] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1.dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:33:31,875] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
Though all logs say Starting attempt 1 of 4, I do see attempt records about every 30 min, but with multiple copies of the logs for each time interval (10+ copies of the same logs printed for each 30 min interval).
From searching around I see other people are using sensors in production flows https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff, which makes me think there's a way around this or I'm implementing something wrong. But I'm also seeing open issues in the airflow project related to this issue, so perhaps there's a deeper issue in the project? I also found a related, but unanswered post here Apache Airflow 1.10.3: Executor reports task instance ??? finished (failed) although the task says its queued. Was the task killed externally?
Also, we are using the following config settings:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

These symptoms were actually caused by a call to Variable.set() in the body of DAG1, which DAG2 then used to retrieve DAG1's dynamically generated dag_id. The Variable.set() call was causing an error (discovered in the worker logs). As described here, the scheduler polls the DAG definitions with every heartbeat to keep DAGs up-to-date. That meant an error with every heartbeat, which caused a large ETL delay.
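The difference between code in the body of a DAG file and code inside a task can be shown with a small stand-in that needs no Airflow at all. Everything here is hypothetical: parse_dag_file plays the role of the scheduler importing the DAG file, and the counter stands in for a side effect such as Variable.set().

```python
side_effects = {"top_level": 0, "inside_task": 0}

def parse_dag_file():
    """Stand-in for the scheduler importing a DAG file (hypothetical helper)."""
    # Module-level statement: runs on every parse, like Variable.set() in the DAG body.
    side_effects["top_level"] += 1

    def task_callable():
        # Runs only when the task itself executes.
        side_effects["inside_task"] += 1

    return task_callable

# The scheduler re-parses the file repeatedly; the task runs once.
for _ in range(5):
    task = parse_dag_file()
task()
print(side_effects)
```

Moving the side effect into the task callable means it happens once per task run rather than once per parse, so a failure there no longer repeats on every scheduler heartbeat.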

JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
I have two computers (203 and 204) on which I created a standalone-mode HA Flink v1.6.1 cluster.
Each computer runs both a JobManager and a TaskManager (2 task slots each).
After I start a job (the SocketWindowWordCount.jar example: ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) on the JobManager node, I kill the working TaskManager instance.
In the web dashboard I can see the job being cancelled and then failed.
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
full logs
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
exception
When I kill the working TM, I get an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 gives JM logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink#hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
kill TM
The body is limited to 30000 characters, so please read these JM logs from when the TM was killed.

The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies check out the documentation.
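
What "depleted its restart attempts" means can be sketched with a toy retry loop. This is a simplified model of a fixed-delay restart policy, not Flink's implementation: run_with_fixed_delay_restart, make_flaky_job, and JobFailed are hypothetical names, and the model assumes that a job may be restarted at most restart_attempts times before the failure becomes final.

```python
import time


class JobFailed(Exception):
    """Raised by the toy job to imitate a task failure (hypothetical)."""


def run_with_fixed_delay_restart(job, restart_attempts, delay_seconds, sleep=time.sleep):
    """Run `job`, restarting it up to `restart_attempts` times after a failure.

    Returns (result, restarts_used). Re-raises the failure once the
    restart budget is used up.
    """
    restarts_used = 0
    while True:
        try:
            return job(), restarts_used
        except JobFailed:
            if restarts_used >= restart_attempts:
                raise  # budget depleted: the job ends up failed
            restarts_used += 1
            sleep(delay_seconds)  # fixed delay between restarts


def make_flaky_job(failures_before_success):
    """Build a job that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def job():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise JobFailed("lost TaskManager")
        return "RUNNING"

    return job


if __name__ == "__main__":
    # One failure, ten restarts allowed: the job recovers after one restart.
    print(run_with_fixed_delay_restart(make_flaky_job(1), 10, 0))
```

With restart_attempts set to 0, which models having no restart budget at all, the first failure is final. That matches the symptom in the question, where the job goes to FAILED right after the TaskManager is killed.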

Airflow scheduler keeps on Failing jobs without heartbeat

I'm new to Airflow and I tried to manually trigger a job through the UI. When I did that, the scheduler kept logging that it is "Failing jobs without heartbeat", as follows:
[2018-05-28 12:13:48,248] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:48,250] {jobs.py:1672} INFO - Heartbeating the scheduler
[2018-05-28 12:13:48,259] {jobs.py:368} INFO - Started process (PID=58141) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,264] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:48,265] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,275] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,298] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:48,299] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:48.299278
[2018-05-28 12:13:48,304] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.045 seconds
[2018-05-28 12:13:49,266] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:49,267] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:49,271] {dag_processing.py:537} INFO - Started a process (PID: 58149) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,272] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:49,283] {jobs.py:368} INFO - Started process (PID=58149) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,288] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:49,289] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,300] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,326] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:49,327] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:49.327218
[2018-05-28 12:13:49,332] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.049 seconds
[2018-05-28 12:13:50,279] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:50,280] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:50,283] {dag_processing.py:537} INFO - Started a process (PID: 58150) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,285] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:50,296] {jobs.py:368} INFO - Started process (PID=58150) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,301] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:50,302] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,312] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,338] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:50,339] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:50.339147
[2018-05-28 12:13:50,344] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.048 seconds
And the status of the job in the UI is stuck at running. Is there something I need to configure to solve this issue?

It seems that it's not a "Failing jobs" problem but a logging problem. Here's what I found when I tried to fix this problem.
Does this message indicate that something is wrong that I should be concerned about?
No.
"Finding 'running' jobs" and "Failing jobs..." are INFO level logs generated by the find_zombies function of the heartbeat utility. So these logs are generated every heartbeat interval even if you don't have any failing jobs running.
How do I turn it off?
The logging_level option in airflow.cfg does not control the scheduler logging.
There is a hard-coded value in airflow/settings.py:
LOGGING_LEVEL = logging.INFO
You could change this to:
LOGGING_LEVEL = logging.WARN
Then restart the scheduler and the problem will be gone.

I think in point 2, if you just change logging_level = INFO to WARN in airflow.cfg, you won't get INFO logs. You don't need to modify the settings.py file.
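
Either way, the change works by raising the logger threshold from INFO to WARN. A minimal sketch with Python's standard logging module shows the effect. It does not touch Airflow's configuration: emitted_messages and the logger names are hypothetical, and the warning text is made up for illustration.

```python
import io
import logging


def emitted_messages(level):
    """Log one INFO and one WARNING record at the given threshold; return what got through."""
    stream = io.StringIO()
    # A separate, hypothetical logger per level so the runs don't interfere.
    logger = logging.getLogger("heartbeat_demo_%s" % logging.getLevelName(level))
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(handler)
    try:
        logger.info("Finding 'running' jobs without a recent heartbeat")
        logger.warning("a hypothetical warning")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    print(emitted_messages(logging.INFO))
    print(emitted_messages(logging.WARN))
```

At the INFO threshold both records are written. At WARN the heartbeat-style INFO line is dropped and only the warning remains. Which of airflow.cfg or settings.py actually governs the scheduler's threshold depends on the Airflow version and is not settled by this sketch.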