I am new to Quartz.net and have been following these tutorials to get started. Since configuring the logging system, I have noticed that the SampleJob the author recommends checking is still running:
17:13:15 [ServerScheduler_Worker-3] INFO Quartz.Server.SampleJob - SampleJob running...
17:13:20 [ServerScheduler_Worker-3] INFO Quartz.Server.SampleJob - SampleJob run finished.
17:13:25 [ServerScheduler_Worker-4] INFO Quartz.Server.SampleJob - SampleJob running...
17:13:30 [ServerScheduler_Worker-4] INFO Quartz.Server.SampleJob - SampleJob run finished.
17:13:35 [ServerScheduler_Worker-5] INFO Quartz.Server.SampleJob - SampleJob running...
17:13:40 [ServerScheduler_Worker-5] INFO Quartz.Server.SampleJob - SampleJob run finished.
17:13:45 [ServerScheduler_Worker-6] INFO Quartz.Server.SampleJob - SampleJob running...
17:13:50 [ServerScheduler_Worker-6] INFO Quartz.Server.SampleJob - SampleJob run finished.
How do I stop this?
IScheduler has different methods if you want to:
Pause a job: PauseJob
Pause all jobs: PauseAll
Pause a trigger: PauseTrigger
Unschedule a job: UnscheduleJob
More references here.
It could be that a quartz_jobs.xml file was accidentally left in the service folder.
My dags are getting stuck in "running" state indefinitely.
Even when I mark them as "failed" and rerun them, they still get stuck. When I check the Airflow UI, the DAG is in the "running" state:
Screenshot
When I check my airflow celery logs the last lines are the following and nothing else happened :
[...]
[2021-05-24 14:14:31,486] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,490] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,498] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,505] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,502] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:31,508] {dagbag.py:451} INFO - Filling up the DagBag from /home/******.py
[2021-05-24 14:14:56,679: WARNING/ForkPoolWorker-23] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,682: WARNING/ForkPoolWorker-21] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,719: WARNING/ForkPoolWorker-8] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,722: WARNING/ForkPoolWorker-18] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,742: WARNING/ForkPoolWorker-26] Running <TaskInstance: ***** 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
[2021-05-24 14:14:56,754: WARNING/ForkPoolWorker-28] Running <TaskInstance: *****A1 2021-05-24T06:14:30.718313+00:00 [queued]> on host **********
Below are some of the logs I got from the scheduler :
[2021-05-24 14:09:18,552] {scheduler_job.py:1006} INFO - DAG *** has 32/32 running and queued tasks
[2021-05-24 14:09:18,552] {scheduler_job.py:1014} INFO - Not executing <TaskInstance: **** 2021-05-24 05:45:57.736363+00:00 [scheduled]> since the number of tasks running or queued from DAG **** is >= to the DAG's task concurrency limit of 32
[...]
/home/ubuntu/**** 1 0 0.65s 2021-05-23T07:46:28
/home/ubuntu/**** 1 0 0.56s 2021-05-23T07:46:27
/home/ubuntu/**** 1 0 0.47s 2021-05-23T07:46:54
/home/ubuntu/**** 1 0 1.18s 2021-05-23T07:47:03
/home/ubuntu/**** 2 0 1.25s 2021-05-23T07:46:20
/home/ubuntu/**** 2 0 1.26s 2021-05-23T07:46:20
/home/ubuntu/**** 2 0 1.30s 2021-05-23T07:46:19
/home/ubuntu/**** 113 0 2.91s 2021-05-23T07:47:05
/home/ubuntu/**** 459 0 7.85s 2021-05-23T07:46:38
================================================================================
[2021-05-23 15:47:58,271] {scheduler_job.py:182} INFO - Started process (PID=13794) to work on ********
[2021-05-23 15:47:58,272] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:47:58,273] {scheduler_job.py:629} INFO - Processing file ******************** for tasks to queue
[2021-05-23 15:47:58,273] {dagbag.py:451} INFO - Filling up the DagBag from ********************
[2021-05-23 15:47:58,273] {scheduler_job.py:190} INFO - Processing ******** took 0.014 seconds
[2021-05-23 15:47:58,274] {scheduler_job.py:641} WARNING - No viable dags retrieved from ************
[2021-05-23 15:47:58,275] {scheduler_job.py:182} INFO - Started process (PID=13797) to work on **************
[2021-05-23 15:47:58,275] {scheduler_job.py:190} INFO - Processing ******** took 0.014 seconds
[2021-05-23 15:47:58,276] {scheduler_job.py:629} INFO - Processing file ************** for tasks to queue
[2021-05-23 15:47:58,276] {dagbag.py:451} INFO - Filling up the DagBag from **************
[2021-05-23 15:47:58,277] {scheduler_job.py:641} WARNING - No viable dags retrieved from **************
[2021-05-23 15:47:58,278] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,281] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,285] {scheduler_job.py:190} INFO - Processing ******** took 0.016 seconds
[2021-05-23 15:47:58,287] {scheduler_job.py:190} INFO - Processing ************** took 0.015 seconds
[2021-05-23 15:48:02,300] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:02,302] {scheduler_job.py:182} INFO - Started process (PID=13932) to work on *****************.py
[2021-05-23 15:48:02,303] {scheduler_job.py:629} INFO - Processing file ***************** for tasks to queue
[2021-05-23 15:48:02,304] {dagbag.py:451} INFO - Filling up the DagBag from *****************
[2021-05-23 15:48:02,434] {scheduler_job.py:641} WARNING - No viable dags retrieved from *****************
[2021-05-23 15:48:02,445] {scheduler_job.py:190} INFO - Processing ***************** took 0.144 seconds
[2021-05-23 15:48:02,452] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:02,455] {scheduler_job.py:182} INFO - Started process (PID=13934) to work on *****************
[2021-05-23 15:48:02,456] {scheduler_job.py:629} INFO - Processing file ***************** for tasks to queue
[2021-05-23 15:48:02,456] {dagbag.py:451} INFO - Filling up the DagBag from *****************
[2021-05-23 15:48:03,457] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:03,460] {scheduler_job.py:182} INFO - Started process (PID=13959) to work on ********
[2021-05-23 15:48:03,461] {scheduler_job.py:629} INFO - Processing file ******** for tasks to queue
[2021-05-23 15:48:03,461] {dagbag.py:451} INFO - Filling up the DagBag from ********
[2021-05-23 15:48:03,501] {scheduler_job.py:641} WARNING - No viable dags retrieved from *****************
[2021-05-23 15:48:03,514] {scheduler_job.py:190} INFO - Processing ***************** took 1.061 seconds
[2021-05-23 15:48:04,547] {scheduler_job.py:639} INFO - DAG(s) dict_keys(['****']) retrieved from *******
[2021-05-23 15:48:04,559] {dag.py:1824} INFO - Sync 1 DAGs
[2021-05-23 15:48:04,568] {dag.py:2280} INFO - Setting next_dagrun for **** to 2021-05-23T04:00:00+00:00
[2021-05-23 15:48:04,572] {scheduler_job.py:190} INFO - Processing ********took 1.115 seconds
[2021-05-23 15:48:05,538] {dag_processing.py:1092} INFO - Finding 'running' jobs without a recent heartbeat
[2021-05-23 15:48:05,539] {dag_processing.py:1096} INFO - Failing jobs without heartbeat after 2021-05-23 07:43:05.539102+00:00
[2021-05-23 15:48:05,546] {scheduler_job.py:161} INFO - Closing parent pipe
[2021-05-23 15:48:05,549] {scheduler_job.py:182} INFO - Started process (PID=14077) to work on ********
[2021-05-23 15:48:05,549] {scheduler_job.py:629} INFO - Processing file ********for tasks to queue
[2021-05-23 15:48:05,550] {dagbag.py:451} INFO - Filling up the DagBag from ********
To make it work, I have to:
Empty the running slots and queued slots in Admin > Pools by setting the tasks as Failed
Restart airflow celery worker
I saw on GitHub that other people have recently faced this issue:
https://github.com/apache/airflow/issues/13542
Link to my github issue:
https://github.com/apache/airflow/issues/15978
Also, I noticed odd behavior with Airflow backfilling: my previous DAG runs are still queuing and running even after doing the following:
Setting catchup_by_default=False in airflow.cfg
Setting catchup=False in the DAG definition
Using LatestOnlyOperator
I would appreciate it if someone could help me solve this issue. Thank you!
Configuration:
Apache Airflow version: 2.0.2
OS : Ubuntu 18.04.3 (AWS EC2)
Install tools: celery = 4.4.7, redis = 3.5.3
I have two DAGs that I need to run with Airflow 1.10.2 + the CeleryExecutor. The first DAG (DAG1) is a long-running data load from s3 into Redshift (3+ hours). My second DAG (DAG2) performs computations on data loaded by DAG1. I want to include an ExternalTaskSensor in DAG2 so that the computations are reliably performed after the data loads. Theoretically so simple!
I can successfully get DAG2 to wait for DAG1 to complete by ensuring both DAGs are scheduled to start at the same time (schedule="0 8 * * *" for both DAGs) and DAG2 is dependent on the final task in DAG1. But I'm seeing a massive delay in our ETL on DAG1 when I introduce the sensor. At first I thought it was because my original implementation used mode="poke", which I understand locks a worker. However, even when I changed this to mode="reschedule", as I read in the docs https://airflow.readthedocs.io/en/stable/_modules/airflow/sensors/base_sensor_operator.html I still see a massive ETL delay.
I'm using the ExternalTaskSensor code below in DAG2:
from airflow.sensors.external_task_sensor import ExternalTaskSensor  # import path for Airflow 1.10.x

wait_for_data_load = ExternalTaskSensor(
    dag=dag,
    task_id="wait_for_data_load",
    external_dag_id="dag1",
    external_task_id="dag1_final_task_id",
    mode="reschedule",
    poke_interval=1800,  # check every 30 min
    timeout=43200,  # timeout after 12 hours (catch delayed data load runs)
    soft_fail=False,  # if the task fails, we assume a failure
)
If the code were working properly, I'd expect the sensor to quickly check whether DAG1 had finished and, if not, reschedule itself for 30 minutes later, as defined by poke_interval, causing no delay to the DAG1 ETL. If DAG1 fails to complete within 12 hours, DAG2 would stop poking and fail.
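The expected behaviour described above can be sketched as a small simulation. This is only a toy model of a reschedule-style sensor, not Airflow's implementation: the function name, the `ready_at` parameter, and the exact timeout rule (give up when the next poke would fall after the timeout) are assumptions made for illustration.

```python
def simulate_reschedule_sensor(poke_interval, timeout, ready_at):
    """Return (outcome, poke_times) for a sensor that pokes at t=0 and then
    every `poke_interval` seconds, giving up once the next poke would fall
    after `timeout` seconds.

    `ready_at` is the hypothetical time (in seconds) at which the upstream
    task finishes; pass None if it never finishes.
    """
    poke_times = []
    t = 0
    while True:
        poke_times.append(t)
        if ready_at is not None and t >= ready_at:
            return "success", poke_times
        if t + poke_interval > timeout:
            return "timeout", poke_times
        t += poke_interval


# Upstream load finishes after 3.5 hours: detected on the poke at 12600 s.
outcome, pokes = simulate_reschedule_sensor(1800, 43200, ready_at=12600)
print(outcome, len(pokes))
```

With the values from the question (poke_interval=1800, timeout=43200), this model makes at most 25 short checks over 12 hours, which is why the sensor alone would not be expected to delay DAG1.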
Instead, I'm getting frequent errors for each of the tasks in DAG1 saying (for example) Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? even though the tasks are completing successfully (with some delay). Just before this error is sent, I see a line in our Sentry logs saying Executor reports dag1.data_table_temp_redshift_load execution_date=2019-05-20 08:00:00+00:00 as failed for try_number 1 though (again) I can see the task succeeded.
The logs on DAG2 are also looking a bit strange. I'm seeing repeated attempts logged at the same time intervals like the excerpt below:
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag1. wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1. dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:01:48,125] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1.dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:33:31,875] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
Though all the logs say Starting attempt 1 of 4, I do see attempt records about every 30 min, and I see multiple logs for each time interval (10+ copies of the same logs printed for each 30 min interval).
From searching around I see other people are using sensors in production flows https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff, which makes me think there's a way around this or I'm implementing something wrong. But I'm also seeing open issues in the airflow project related to this issue, so perhaps there's a deeper issue in the project? I also found a related, but unanswered post here Apache Airflow 1.10.3: Executor reports task instance ??? finished (failed) although the task says its queued. Was the task killed externally?
Also, we are using the following config settings:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
These symptoms were actually caused by a call to Variable.set() in the body of DAG1, which DAG2 then used to retrieve DAG1's dynamically generated dag_id. The Variable.set() call was causing an error (discovered in the worker logs). As described here, the scheduler polls the DAG definitions with every heartbeat to keep DAGs up-to-date. That meant an error with every heartbeat, which caused a large ETL delay.
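A toy illustration of why a top-level call in a DAG file is costly. It does not import Airflow; the source string, the `calls` list, and the `dag_id` value are all hypothetical stand-ins. The only point it shows is that module-level code runs again each time the file is parsed.

```python
# Toy stand-in for a DAG file: everything at module level runs on every parse.
DAG_FILE_SOURCE = """
calls.append("Variable.set")   # stands in for a top-level Variable.set() call
dag_id = "dag1_generated"      # hypothetical dynamically generated dag_id
"""


def parse_dag_file(source, calls):
    """Execute the file's top-level code, as a parser would, and return its namespace."""
    namespace = {"calls": calls}
    exec(compile(source, "<dag1.py>", "exec"), namespace)
    return namespace


calls = []
for _ in range(3):  # three simulated scheduler parse cycles
    parse_dag_file(DAG_FILE_SOURCE, calls)

print(len(calls))  # 3: the side effect ran once per parse
```

One common way to avoid this, assuming the value is only needed at run time, is to move such calls inside a task's callable so they run when the task executes rather than when the file is parsed.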
My effort does not work:
/usr/local/spark/spark-2.3.2-bin-hadoop2.7/bin/spark-submit --driver-memory 6g --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 runspark.py && bokeh serve --show bokeh_app
runspark.py contains the instantiation of spark, and bokeh_app is the folder of the bokeh server app. spark is being used to update a streaming dask dataframe.
WHAT HAPPENS:
The Spark instance starts running and loads as it normally would without the bokeh server. However, as soon as the bokeh server app kicks in (i.e., the web page opens), the Spark instance shuts down. It doesn't report any errors in the console output.
OUTPUT BELOW:
2018-11-26 21:04:05 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4f0492c9{/static/sql,null,AVAILABLE,#Spark}
2018-11-26 21:04:06 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-11-26 21:04:06 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-11-26 21:04:06 INFO AbstractConnector:318 - Stopped Spark#4f3c4272{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2018-11-26 21:04:06 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.1.25:4041
2018-11-26 21:04:06 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-11-26 21:04:06 INFO MemoryStore:54 - MemoryStore cleared
2018-11-26 21:04:06 INFO BlockManager:54 - BlockManager stopped
2018-11-26 21:04:06 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-11-26 21:04:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-11-26 21:04:07 INFO SparkContext:54 - Successfully stopped SparkContext
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Shutdown hook called
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-c42ce0b3-d49e-48ce-962c-277b42166267
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f/pyspark-117d2a10-7cb9-4eb3-b4d0-f92f9046522c
2018-11-26 21:04:08,542 Starting Bokeh server version 0.13.0 (running on Tornado 5.1.1)
2018-11-26 21:04:08,547 Bokeh app running at: http://localhost:5006/aion_analytics
2018-11-26 21:04:08,547 Starting Bokeh server with process id: 10769
Ok, I found the answer. The idea is simply to embed the bokeh server in the pyspark code instead of running the bokeh server from the command line. Use the pyspark submit command as normal.
https://github.com/bokeh/bokeh/blob/1.0.1/examples/howto/server_embed/standalone_embed.py
I did exactly what is shown in the link above.
I'm new to Airflow and I tried to manually trigger a job through the UI. When I did that, the scheduler kept logging that it is Failing jobs without heartbeat, as follows:
[2018-05-28 12:13:48,248] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:48,250] {jobs.py:1672} INFO - Heartbeating the scheduler
[2018-05-28 12:13:48,259] {jobs.py:368} INFO - Started process (PID=58141) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,264] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:48,265] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,275] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,298] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:48,299] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:48.299278
[2018-05-28 12:13:48,304] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.045 seconds
[2018-05-28 12:13:49,266] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:49,267] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:49,271] {dag_processing.py:537} INFO - Started a process (PID: 58149) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,272] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:49,283] {jobs.py:368} INFO - Started process (PID=58149) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,288] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:49,289] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,300] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,326] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:49,327] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:49.327218
[2018-05-28 12:13:49,332] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.049 seconds
[2018-05-28 12:13:50,279] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:50,280] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:50,283] {dag_processing.py:537} INFO - Started a process (PID: 58150) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,285] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:50,296] {jobs.py:368} INFO - Started process (PID=58150) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,301] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:50,302] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,312] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,338] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:50,339] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:50.339147
[2018-05-28 12:13:50,344] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.048 seconds
And the status of the job in the UI is stuck at running. Is there something I need to configure to solve this issue?
It seems that it's not a "Failing jobs" problem but a logging problem. Here's what I found when I tried to fix this problem.
Does this message indicate that something is wrong that I should be concerned about?
No.
"Finding 'running' jobs" and "Failing jobs..." are INFO-level logs generated by the find_zombies function of the heartbeat utility, so they are written at every heartbeat interval even if you don't have any failing jobs running.
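As a rough sketch of what the "Failing jobs without heartbeat after ..." line reports: the timestamp is a cutoff, and only jobs whose last heartbeat is older than it would be failed. The function names, the dictionary data shape, and the 300-second threshold below are assumptions for illustration, not Airflow's code.

```python
from datetime import datetime, timedelta


def heartbeat_cutoff(now, threshold_seconds=300):
    """Jobs whose latest heartbeat is older than the returned time count as stale."""
    return now - timedelta(seconds=threshold_seconds)


def find_stale(latest_heartbeats, now, threshold_seconds=300):
    """Return ids of jobs with no heartbeat since the cutoff (hypothetical data shape)."""
    cutoff = heartbeat_cutoff(now, threshold_seconds)
    return [job_id for job_id, beat in latest_heartbeats.items() if beat < cutoff]


now = datetime(2018, 5, 28, 6, 43, 48)
print(heartbeat_cutoff(now))   # the cutoff is computed and could be logged every time
print(find_stale({}, now))     # [] : nothing is failed when no job is stale
```

The cutoff is computed on every pass, so a log line mentioning it can appear even when the list of stale jobs is empty.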
How do I turn it off?
The logging_level option in airflow.cfg does not control the scheduler logging.
The level is hard-coded in airflow/settings.py:
LOGGING_LEVEL = logging.INFO
You could change this to:
LOGGING_LEVEL = logging.WARN
Then restart the scheduler and the problem will be gone.
I think that for point 2, if you just change logging_level = INFO to WARN in airflow.cfg, you won't get INFO logs. You don't need to modify the settings.py file.
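Whichever place the level is set, the effect is the same as raising the threshold on a standard Python logger. A minimal sketch using only the standard library's logging module; the logger name and the messages are made up for the demonstration.

```python
import io
import logging


def emit_at_level(level):
    """Log one INFO and one WARNING record through a logger set to `level`
    and return the lines that were actually written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger = logging.getLogger("demo.scheduler")  # hypothetical logger name
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(level)
    logger.info("Finding 'running' jobs without a recent heartbeat")
    logger.warning("something that deserves attention")
    return stream.getvalue().splitlines()


print(emit_at_level(logging.INFO))  # both records are written
print(emit_at_level(logging.WARN))  # only the WARNING record is written
```

Note that this hides every INFO message, not just the heartbeat ones, so it trades log noise for less detail when debugging.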
Well I have my JAVA_HOME set correctly. And I am getting this error now.
C:\projects\zookeeper\zk\bin>call "C:\Program Files\Java\jdk-9"\bin\java "-Dzookeeper.log.dir=C:\projects\zookeeper\zk\bin\..\logs" "-Dzookeeper.root.logger=INFO,CONSOLE" "-Dzookeeper.log.file=zookeeper-User-server-HUNTER-PC.log" "-XX:+HeapDumpOnOutOfMemoryError" "-XX:OnOutOfMemoryError=cmd /c taskkill /pid %%p /t /f" -cp "C:\projects\zookeeper\zk\bin\..\build\classes;C:\projects\zookeeper\zk\bin\..\build\lib\*;C:\projects\zookeeper\zk\bin\..\*;C:\projects\zookeeper\zk\bin\..\lib\*;C:\projects\zookeeper\zk\bin\..\conf" org.apache.zookeeper.server.quorum.QuorumPeerMain "C:\projects\zookeeper\zk\bin\..\conf\zoo.cfg" start
2017-09-29 10:44:10,183 [myid:] - INFO [main:DatadirCleanupManager#78] - autopurge.snapRetainCount set to 3
2017-09-29 10:44:10,183 [myid:] - INFO [main:DatadirCleanupManager#79] - autopurge.purgeInterval set to 0
2017-09-29 10:44:10,183 [myid:] - INFO [main:DatadirCleanupManager#101] - Purge task is not scheduled.
2017-09-29 10:44:10,183 [myid:] - WARN [main:QuorumPeerMain#122] - Either no config or no quorum defined in config, running in standalone mode
2017-09-29 10:44:10,183 [myid:] - INFO [main:ManagedUtil#46] - Log4j found with jmx enabled.
2017-09-29 10:44:10,330 [myid:] - ERROR [main:ZooKeeperServerMain#64] - Invalid arguments, exiting abnormally
java.lang.NumberFormatException: For input string: "C:\projects\zookeeper\zk\bin\..\conf\zoo.cfg"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at org.apache.zookeeper.server.ServerConfig.parse(ServerConfig.java:61)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:101)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:62)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:125)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
2017-09-29 10:44:10,330 [myid:] - INFO [main:ZooKeeperServerMain#65] - Usage: ZooKeeperServerMain configfile | port datadir [ticktime] [maxcnxns]
Usage: ZooKeeperServerMain configfile | port datadir [ticktime] [maxcnxns]
C:\projects\zookeeper\zk\bin>endlocal
Here is my config file: zoo.cfg
tickTime=2000
dataDir=c:/projects/zookeeper/zk/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
I have changed the dataDir repeatedly, from C:/tmp to /usr/, making sure the directories are actually there. I am at a loss. I am running this on Windows 10. I have the newest JDK 9 installed and the path is accurate.
Assuming that you've renamed the conf\zoo_sample.cfg file to conf\zoo.cfg and set the dataDir property correctly, which you've done already, use the following command to start the ZooKeeper server:
C:\zookeeper-3.4.12\bin>.\zkServer.cmd
Not the following
C:\zookeeper-3.4.12\bin>.\zkServer.cmd start
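Why the extra start argument breaks things can be read off the usage line and stack trace in the question: with more than one argument, the first is treated as a port number. Below is a rough Python paraphrase of that argument handling, not the actual Java code; the function name and the returned dictionary keys are invented for illustration.

```python
def parse_server_args(args):
    """Paraphrase of the usage line: configfile | port datadir [ticktime] [maxcnxns]."""
    if len(args) == 1:
        # A single argument is taken to be the path of the config file.
        return {"config_file": args[0]}
    if not 2 <= len(args) <= 4:
        raise ValueError("Usage: configfile | port datadir [ticktime] [maxcnxns]")
    # With two or more arguments, the first one must parse as a port number.
    parsed = {"port": int(args[0]), "data_dir": args[1]}
    if len(args) >= 3:
        parsed["tick_time"] = int(args[2])
    if len(args) == 4:
        parsed["max_cnxns"] = int(args[3])
    return parsed


print(parse_server_args(["conf/zoo.cfg"]))

try:
    # The script already passes the config path, so "start" becomes a second argument.
    parse_server_args(["conf/zoo.cfg", "start"])
except ValueError as exc:
    print("rejected:", exc)
```

In this model, the config path followed by start falls into the port/datadir branch, and converting the path to an integer fails, which mirrors the NumberFormatException in the question.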
Okay, so I figured out my own answer. Everything was installed correctly, but even when running from an Administrator-mode command prompt, ZooKeeper was not working. I had to right-click zkServer.cmd in the bin folder and choose Run as Administrator. I have had this issue before with SigWebTablet software. I think it's an issue with running Windows 10 developer. Others around the office have this issue with Windows. Hopefully this will help someone else.
In my case, running as Administrator didn't work. I needed a way to tell ZooKeeper to set up three instances according to the three config files. What did work was to edit the zkServer.cmd file and remove the "%ZOOCFG%" parameter from the line that begins with call. Now I can run bin\zkServer.cmd conf\zoo.cfg, bin\zkServer.cmd conf\zoo2.cfg, and bin\zkServer.cmd conf\zoo3.cfg in three command prompt windows and get my cluster up and running :)