Parfor-loop restarts after executing the first set of iterations - matlab

I'm having a problem with a script that executes a parfor-loop, and I hope you can help me with it.
I didn't have this problem before, and I don't think I changed anything that could cause it.
The problem is that the parfor-loop restarts after the parallel pool with 4 workers is started and the first 4 iterations are executed. This happens once, and then all the iterations execute normally as they should.
Here is my code, simplified in order to show this problem:
parfor loopVariable = 1 : 21
    fprintf('%s - Running iteration %i/%i \n', datestr(datetime), loopVariable, 21)
    *statements*
end
And this is the output I get; note that the first 4 iterations are repeated:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
04-May-2020 11:43:21 - Running iteration 1/21
04-May-2020 11:43:21 - Running iteration 2/21
04-May-2020 11:43:21 - Running iteration 4/21
04-May-2020 11:43:21 - Running iteration 7/21
Analyzing and transferring files to the workers ...done.
04-May-2020 15:01:12 - Running iteration 7/21
04-May-2020 15:01:12 - Running iteration 1/21
04-May-2020 15:01:12 - Running iteration 2/21
04-May-2020 15:01:12 - Running iteration 4/21
04-May-2020 15:24:29 - Running iteration 3/21
04-May-2020 16:21:16 - Running iteration 6/21
04-May-2020 16:12:52 - Running iteration 13/21
04-May-2020 16:20:32 - Running iteration 10/21
04-May-2020 18:34:27 - Running iteration 12/21
04-May-2020 18:39:20 - Running iteration 9/21
04-May-2020 20:33:04 - Running iteration 5/21
04-May-2020 20:50:08 - Running iteration 11/21
04-May-2020 21:07:43 - Running iteration 8/21
04-May-2020 22:42:34 - Running iteration 15/21
05-May-2020 01:09:18 - Running iteration 14/21
04-May-2020 23:05:16 - Running iteration 18/21
04-May-2020 23:53:35 - Running iteration 19/21
05-May-2020 01:50:12 - Running iteration 17/21
05-May-2020 04:40:23 - Running iteration 16/21
05-May-2020 01:52:47 - Running iteration 21/21
05-May-2020 03:34:10 - Running iteration 20/21
I don't know if this is relevant, but I'm running the script remotely using:
nohup matlab -nodisplay -nosplash -r scriptFile -logfile outputFile.txt < /dev/null &
Thanks in advance for the help.

The segment of code you show is correct. Are you sure the problem isn't in some other part of the code, or in some of the *statements*?
>> parfor loopVariable = 1 : 21
fprintf('%s - Running iteration %i/%i \n', datestr(datetime), loopVariable, 21)
end
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).
05-May-2020 11:10:56 - Running iteration 1/21
05-May-2020 11:10:56 - Running iteration 6/21
05-May-2020 11:10:56 - Running iteration 5/21
05-May-2020 11:10:56 - Running iteration 15/21
05-May-2020 11:10:56 - Running iteration 19/21
05-May-2020 11:10:56 - Running iteration 2/21
05-May-2020 11:10:56 - Running iteration 8/21
05-May-2020 11:10:56 - Running iteration 7/21
05-May-2020 11:10:56 - Running iteration 13/21
05-May-2020 11:10:56 - Running iteration 17/21
05-May-2020 11:10:56 - Running iteration 3/21
05-May-2020 11:10:56 - Running iteration 10/21
05-May-2020 11:10:56 - Running iteration 9/21
05-May-2020 11:10:56 - Running iteration 14/21
05-May-2020 11:10:56 - Running iteration 18/21
05-May-2020 11:10:56 - Running iteration 4/21
05-May-2020 11:10:56 - Running iteration 12/21
05-May-2020 11:10:56 - Running iteration 11/21
05-May-2020 11:10:56 - Running iteration 16/21
05-May-2020 11:10:56 - Running iteration 21/21
05-May-2020 11:10:56 - Running iteration 20/21

Related

Celery. Running single celery beat + multiple celery workers scale

Having a single celery beat running with:
celery -A app:celery beat --loglevel=DEBUG
and three workers running with:
celery -A app:celery worker -E --loglevel=ERROR -n n1
celery -A app:celery worker -E --loglevel=ERROR -n n2
celery -A app:celery worker -E --loglevel=ERROR -n n3
The same Redis DB is used as the message broker for all workers and beat.
All workers are started on the same machine for development purposes; in production they will be deployed in different Kubernetes pods. The main idea of using multiple workers is to distribute 50-150 tasks between different Kube pods, each running on a 4-8 core machine. We expect that no pod will take more tasks than it has cores as long as there is any worker with fewer tasks than available cores, so that the maximum number of tasks runs concurrently.
So I am having trouble testing this locally.
Here the local beat triggers three tasks:
[2021-08-23 21:35:32,700: DEBUG/MainProcess] Current schedule:
<ScheduleEntry: task-5872-accrual Task5872Accrual() <crontab: 36 21 * * * (m/h/d/dM/MY)>
<ScheduleEntry: task-5872-accrual2 Task5872Accrual2() <crontab: 37 21 * * * (m/h/d/dM/MY)>
<ScheduleEntry: task-5872-accrual3 Task5872Accrual3() <crontab: 38 21 * * * (m/h/d/dM/MY)>
[2021-08-23 21:35:32,700: DEBUG/MainProcess] beat: Ticking with max interval->5.00 minutes
[2021-08-23 21:35:32,701: DEBUG/MainProcess] beat: Waking up in 27.29 seconds.
[2021-08-23 21:36:00,017: DEBUG/MainProcess] beat: Synchronizing schedule...
[2021-08-23 21:36:00,026: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual (Task5872Accrual)
[2021-08-23 21:36:00,035: DEBUG/MainProcess] Task5872Accrual sent. id->96e671f8-bd07-4c36-a595-b963659bee5c
[2021-08-23 21:36:00,035: DEBUG/MainProcess] beat: Waking up in 59.95 seconds.
[2021-08-23 21:37:00,041: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual2 (Task5872Accrual2)
[2021-08-23 21:37:00,043: DEBUG/MainProcess] Task5872Accrual2 sent. id->532eac4d-1d10-4117-9d7e-16b3f1ae7aee
[2021-08-23 21:37:00,043: DEBUG/MainProcess] beat: Waking up in 59.95 seconds.
[2021-08-23 21:38:00,027: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual3 (Task5872Accrual3)
[2021-08-23 21:38:00,029: DEBUG/MainProcess] Task5872Accrual3 sent. id->68729b64-807d-4e13-8147-0b372ce536af
[2021-08-23 21:38:00,029: DEBUG/MainProcess] beat: Waking up in 5.00 minutes.
I expect each worker to take a single task so the load is balanced between workers, but unfortunately that is not how the tasks end up distributed.
So I am not sure whether different workers synchronize with each other to distribute the load between them smoothly. If not, can I achieve that somehow? I tried searching Google, but the results are mostly about concurrency between tasks in a single worker; what should I do if I need to run more tasks concurrently than a single machine in the Kube cluster can handle?
You should do two things in order to achieve what you want:
Run workers with the -O fair option. Example: celery -A app:celery worker -E --loglevel=ERROR -n n1 -O fair
Make workers prefetch as little as possible with worker_prefetch_multiplier=1 in your config.
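For reference, a minimal sketch of where the prefetch setting would go (assuming the Celery app object is created in app.py as celery, to match the -A app:celery commands above; the broker URL is a placeholder):
from celery import Celery

celery = Celery("app", broker="redis://localhost:6379/0")  # placeholder broker URL

# With a multiplier of 1, each worker reserves at most one task per process slot
# instead of prefetching a larger batch, so tasks stay queued until a worker is free.
celery.conf.worker_prefetch_multiplier = 1
Together with the -O fair option, this keeps queued tasks from being reserved by workers that are already busy while other workers sit idle.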

Airflow Webserver Shutting down

My Airflow webserver shuts down abruptly at around the same time, about 16:37 GMT.
My Airflow scheduler runs fine (no crash) and tasks still run.
There is not much in the error logs except:
Handling signal: ttou
Worker exiting (pid: 118711)
ERROR - No response from gunicorn master within 120 seconds
ERROR - Shutting down webserver
Handling signal: term
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Shutting down: Master
Could it be caused by memory?
My cfg settings for the webserver are standard:
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
Update:
OK, it doesn't crash every day, but today I have a log showing that gunicorn was unable to restart the workers:
ERROR - [0/0] Some workers seem to have died and gunicorn did not restart them as expected
Update: 30 October 2020
[CRITICAL] WORKER TIMEOUT (pid:108237)
I am getting this even though I have increased the timeout to 240, twice the default value.
Does anyone know why this keeps happening?

Slow running Airflow 1.10.2 ETL when using ExternalTaskSensor for DAG task dependency?

I have two DAGs that I need to run with Airflow 1.10.2 + the CeleryExecutor. The first DAG (DAG1) is a long-running data load from s3 into Redshift (3+ hours). My second DAG (DAG2) performs computations on data loaded by DAG1. I want to include an ExternalTaskSensor in DAG2 so that the computations are reliably performed after the data loads. Theoretically so simple!
I can successfully get DAG2 to wait for DAG1 to complete by ensuring both DAGs are scheduled to start at the same time (schedule="0 8 * * *" for both DAGs) and DAG2 is dependent on the final task in DAG1. But I'm seeing a massive delay in our ETL on DAG1 when I introduce the sensor. I at first thought it was because my original implementation used mode="poke", which I understand locks a worker. However, even after I changed this to mode="reschedule", as I read in the docs https://airflow.readthedocs.io/en/stable/_modules/airflow/sensors/base_sensor_operator.html, I still see a massive ETL delay.
I'm using the ExternalTaskSensor code below in DAG2:
wait_for_data_load = ExternalTaskSensor(
    dag=dag,
    task_id="wait_for_data_load",
    external_dag_id="dag1",
    external_task_id="dag1_final_task_id",
    mode="reschedule",
    poke_interval=1800,  # check every 30 min
    timeout=43200,       # timeout after 12 hours (catch delayed data load runs)
    soft_fail=False      # if the task fails, we assume a failure
)
If the code were working properly, I'd expect the sensor to perform a quick check of whether DAG1 had finished and, if not, reschedule itself for 30 minutes' time as defined by the poke_interval, causing no delay to the DAG1 ETL. If DAG1 failed to complete within 12 hours, then DAG2 would stop poking and fail.
Instead, I'm getting frequent errors for each of the tasks in DAG1 saying (for example) Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? even though the tasks are completing successfully (with some delay). Just before this error is sent, I see a line in our Sentry logs saying Executor reports dag1.data_table_temp_redshift_load execution_date=2019-05-20 08:00:00+00:00 as failed for try_number 1 though (again) I can see the task succeeded.
The logs on DAG2 are also looking a bit strange. I'm seeing repeated attempts logged at the same time intervals like the excerpt below:
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag1. wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1. dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:01:48,125] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1.dag1_final_task_id on 2019-05-20T08:00:00+00:00 ...
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:33:31,875] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1571}} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------
Though all the logs say Starting attempt 1 of 4, I do see attempt records about every 30 min, but I see multiple logs for each time interval (10+ copies of the same logs printed for each 30-minute interval).
From searching around I see other people are using sensors in production flows https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff, which makes me think there's a way around this or I'm implementing something wrong. But I'm also seeing open issues in the airflow project related to this issue, so perhaps there's a deeper issue in the project? I also found a related, but unanswered post here Apache Airflow 1.10.3: Executor reports task instance ??? finished (failed) although the task says its queued. Was the task killed externally?
Also, we are using the following config settings:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
These symptoms were actually caused by a call to Variable.set() in the body of DAG1, which DAG2 then used to retrieve DAG1's dynamically generated dag_id. The Variable.set() call was causing an error (discovered in the worker logs). As described here, the scheduler polls the DAG definitions with every heartbeat to keep the DAGs up-to-date. That meant an error with every heartbeat, which caused a large ETL delay.
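To illustrate the pattern (the variable name and helper below are hypothetical): top-level code in a DAG file runs every time the scheduler parses the definitions, so a Variable.set() placed there executes, and in this case errored, on every heartbeat, whereas the same call inside a task callable only runs when that task actually executes.
from airflow.models import Variable

# Problematic: module-level code in the DAG file runs on every scheduler parse.
# Variable.set("dag1_dag_id", dag.dag_id)

# Safer: defer the write to execution time, e.g. as a PythonOperator callable,
# so it only runs when the task itself runs (names here are illustrative).
def publish_dag_id(**context):
    Variable.set("dag1_dag_id", context["dag"].dag_id)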

Angular end-to-end tests on random port

We are running Angular e2e tests in a CI environment where tests are run at the same time on the same build slave (for example for different branches).
Since our recent upgrade to Angular 7 (from Angular 5), Protractor uses the default port (4200).
CI jobs now occasionally fail because port 4200 is already in use.
I found the option to run with port 0:
ng e2e --port 0
This does seem to use a random port, but the tests fail; see below.
I tried this on a clean new Angular project created with the Angular CLI, with no changes.
ng e2e
works fine
How can I get the e2e tests to run without port number clashes?
Angular CLI: 7.1.0
node: v8.9.3
OS: Windows 10 Version 1607
Thanks in advance,
Rob
This output is from the failing run:
$ ng e2e --port 0
** Angular Live Development Server is listening on localhost:49152, open your browser on http://localhost:49152/ **
Date: 2018-12-03T15:16:48.890Z
Hash: 97006afaee956149f40f
Time: 7337ms
chunk {main} main.js, main.js.map (main) 9.77 kB [initial] [rendered]
chunk {polyfills} polyfills.js, polyfills.js.map (polyfills) 223 kB [initial] [rendered]
chunk {runtime} runtime.js, runtime.js.map (runtime) 6.08 kB [entry] [rendered]
chunk {styles} styles.js, styles.js.map (styles) 16.3 kB [initial] [rendered]
chunk {vendor} vendor.js, vendor.js.map (vendor) 3.43 MB [initial] [rendered]
i 「wdm」: Compiled successfully.
[16:16:49] I/update - chromedriver: file exists C:\Users\rob.gansevles\tmp\noot\node_modules\protractor\node_modules\webdriver-manager\selenium\chromedriver_2.44.zip
[16:16:49] I/update - chromedriver: unzipping chromedriver_2.44.zip
[16:16:49] I/update - chromedriver: chromedriver_2.44.exe up to date
[16:16:50] I/launcher - Running 1 instances of WebDriver
[16:16:50] I/direct - Using ChromeDriver directly...
DevTools listening on ws://127.0.0.1:50805/devtools/browser/7ee01341-be32-4d52-ae53-0794c11c8864
Jasmine started
[23540:12920:1203/161652.814:ERROR:tcp_socket_win.cc(861)] connect failed: 10049
[23540:12920:1203/161652.814:ERROR:tcp_socket_win.cc(861)] connect failed: 10049
[23540:12920:1203/161652.823:ERROR:tcp_socket_win.cc(861)] connect failed: 10049
[23540:12920:1203/161652.823:ERROR:tcp_socket_win.cc(861)] connect failed: 10049
[16:17:03] E/protractor - Could not find Angular on page http://localhost:0/ : retries looking for angular exceeded
workspace-project App
× should display welcome message
- Failed: Angular could not be found on the page http://localhost:0/. If this is not an Angular application, you may need to turn off waiting for Angular.
Please see
https://github.com/angular/protractor/blob/master/docs/timeouts.md#waiting-for-angular-on-page-load
Please see
https://github.com/angular/protractor/blob/master/docs/timeouts.md#waiting-for-angular-on-page-load
at executeAsyncScript_.then (C:\Users\rob.gansevles\tmp\noot\node_modules\protractor\built\browser.js:720:27)
at ManagedPromise.invokeCallback_ (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:1376:14)
at TaskQueue.execute_ (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:3084:14)
at TaskQueue.executeNext_ (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:3067:27)
at asyncRun (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:2927:27)
at C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:668:7
at
at process._tickCallback (internal/process/next_tick.js:188:7)
From: Task: Run it("should display welcome message") in control flow
at ControlFlow.emit (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\events.js:62:21)
at ControlFlow.shutdown_ (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:2674:10)
at shutdownTask_.MicroTask (C:\Users\rob.gansevles\tmp\noot\node_modules\selenium-webdriver\lib\promise.js:2599:53)
From asynchronous test:
Error
at Suite. (C:\Users\rob.gansevles\tmp\noot\e2e\src\app.e2e-spec.ts:10:3)
at Object. (C:\Users\rob.gansevles\tmp\noot\e2e\src\app.e2e-spec.ts:3:1)
at Module._compile (module.js:635:30)
at Module.m._compile (C:\Users\rob.gansevles\tmp\noot\node_modules\ts-node\src\index.ts:439:23)
at Module._extensions..js (module.js:646:10)
at Object.require.extensions.(anonymous function) [as .ts] (C:\Users\rob.gansevles\tmp\noot\node_modules\ts-node\src\index.ts:442:12)
Failures *
1) workspace-project App should display welcome message
- Failed: Angular could not be found on the page http://localhost:0/. If this is not an Angular application, you may need to turn off waiting for Angular.
Please see
https://github.com/angular/protractor/blob/master/docs/timeouts.md#waiting-for-angular-on-page-load
Executed 1 of 1 spec (1 FAILED) in 10 secs.
[16:17:03] I/launcher - 0 instance(s) of WebDriver still running
[16:17:03] I/launcher - chrome #01 failed 1 test(s)
[16:17:03] I/launcher - overall: 1 failed spec(s)
[16:17:03] E/launcher - Process exited with error code 1
An unexpected error occurred: undefined
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command

Apache Airflow - Scheduler Slowness

We are using Airflow 1.7.3 with the CeleryExecutor. The Airflow scheduler is set up as a systemd service with --num-run set to 10, so that it stops and restarts after every 10th run (as suggested here).
We noticed that every 9th loop of the scheduler takes considerably longer (about 160 seconds or more) compared to the regular loop of ~16 seconds. As per the logs, this is the loop in which the scheduler fills the DagBag by refreshing all the DAGs. This time increases as the number of DAGs/tasks in our Airflow installation grows.
Most of our tasks are very small and take just a few seconds to run, but they get stuck in the "undefined" state and do not get queued while the scheduler is busy "filling up the DagBag". In the meantime the Celery workers sit idle. We have tried the following:
increased celeryd_concurrency (which gave us the ability to send more tasks to the workers)
increased non_pooled_task_slot_count (so that more tasks can get queued)
also increased parallelism and dag_concurrency
All these measures allow more tasks to be launched, but only if the scheduler queues them, which it does not do effectively when it goes into that refresh stage. Here are the timings for each scheduler loop:
[2016-11-07 23:18:28,106] {jobs.py:680} INFO - Starting the scheduler
[2016-11-07 23:21:26,515] {jobs.py:744} INFO - Loop took: 16.422769 seconds
[2016-11-07 23:21:46,186] {jobs.py:744} INFO - Loop took: 16.058172 seconds
[2016-11-07 23:22:02,800] {jobs.py:744} INFO - Loop took: 14.410493 seconds
[2016-11-07 23:22:21,310] {jobs.py:744} INFO - Loop took: 16.275255 seconds
[2016-11-07 23:22:41,470] {jobs.py:744} INFO - Loop took: 17.93543 seconds
[2016-11-07 23:22:59,176] {jobs.py:744} INFO - Loop took: 15.484449 seconds
[2016-11-07 23:23:17,455] {jobs.py:744} INFO - Loop took: 16.130971 seconds
[2016-11-07 23:23:35,948] {jobs.py:744} INFO - Loop took: 16.311113 seconds
[2016-11-07 23:23:55,043] {jobs.py:744} INFO - Loop took: 16.830728 seconds
[2016-11-07 23:26:57,044] {jobs.py:744} INFO - Loop took: 179.613778 seconds
[2016-11-07 23:27:09,328] {jobs.py:680} INFO - Starting the scheduler
[2016-11-07 23:29:57,988] {jobs.py:744} INFO - Loop took: 16.881139 seconds
[2016-11-07 23:30:17,584] {jobs.py:744} INFO - Loop took: 17.021958 seconds
[2016-11-07 23:30:36,062] {jobs.py:744} INFO - Loop took: 16.148552 seconds
[2016-11-07 23:30:56,975] {jobs.py:744} INFO - Loop took: 18.532384 seconds
[2016-11-07 23:31:16,214] {jobs.py:744} INFO - Loop took: 16.907037 seconds
[2016-11-07 23:31:39,060] {jobs.py:744} INFO - Loop took: 15.637057 seconds
[2016-11-07 23:31:56,231] {jobs.py:744} INFO - Loop took: 15.003683 seconds
[2016-11-07 23:32:13,618] {jobs.py:744} INFO - Loop took: 15.215657 seconds
[2016-11-07 23:32:35,738] {jobs.py:744} INFO - Loop took: 19.938704 seconds
[2016-11-07 23:35:33,905] {jobs.py:744} INFO - Loop took: 176.030812 seconds
[2016-11-07 23:35:45,908] {jobs.py:680} INFO - Starting the scheduler
Questions:
Is --num-run still required in version 1.7.1.3 (as mentioned in the pitfalls page: https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls)? Do we still have to restart the scheduler after every n runs?
Would increasing the max_threads value (to launch multiple scheduler threads) help? I think the default is 2.
Thanks for any help.