Delay between Celery tasks scheduled by Airflow

I was trying to run the following simple workflow using the CeleryExecutor in Airflow:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'depends_on_past': False,
    'start_date': datetime.now(),
}

dag = DAG('HelloWorld', schedule_interval=None, default_args=default_args)

t1 = BashOperator(
    task_id='task_1',
    bash_command='echo "Hello World from Task 1"; sleep 0.1',
    dag=dag)

t2 = BashOperator(
    task_id='task_2',
    bash_command='echo "Hello World from Task 2"; sleep 0.2',
    dag=dag)

t2.set_upstream(t1)
However, there is always a ~5 second delay between task_1 and task_2. Here is the relevant airflow.cfg snippet:
[scheduler]
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 0.1
# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 1
It looks like Celery is what causes the delay, but if so, how do I set the Celery worker heartbeat interval (or polling rate) from the Airflow config or API?

As a batch scheduler, Airflow doesn't currently guarantee super low latency. The aim for the project has been to make sub-minute latency possible at scale, but it's common for this to go up to a few minutes in larger environments.
If latency is around 1 minute, it doesn't make sense to run a chain of 1-2 second tasks. Typically the duration of an Airflow task should be counted in minutes, not seconds (there are exceptions, though). Airflow is not Amazon Lambda.
It's probably possible to fine-tune your way down to, say, <= 5 seconds, but it becomes impossible to provide those guarantees as you scale the system.
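In practice that means collapsing chains of very short tasks into one task. As a hedged illustration (not part of the original answer), the two sub-second commands from the question could run inside a single BashOperator, which removes the scheduler/Celery hop between them entirely:

# Hypothetical variant of the DAG above: one task instead of two,
# so there is no scheduling latency between the two echo commands.
t_combined = BashOperator(
    task_id='task_1_and_2',
    bash_command=(
        'echo "Hello World from Task 1"; sleep 0.1; '
        'echo "Hello World from Task 2"; sleep 0.2'
    ),
    dag=dag)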

Related

Airflow tasks stuck in queued state

We're running Airflow 1.10.12, with KubernetesExecutor and KubernetesPodOperator.
In the past few days, we're seeing tasks getting stuck in the queued state for a long time (to be honest, unless we restart the scheduler, they remain stuck in that state), while new tasks of the same DAG are getting scheduled properly.
The only thing that helps is either clearing the task manually or restarting the scheduler service.
We usually see it happen when we run our E2E tests, which spawn ~20 DAG runs for every one of our 3 DAGs; due to limited parallelism, some runs get queued (which is fine by us).
These are our parallelism params in airflow.cfg:
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
Two of our DAGs override max_active_runs and set it to 10.
Any idea what could be causing it?
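For reference, the per-DAG override mentioned above is normally done in the DAG definition itself. A minimal sketch, assuming Airflow 1.10.x (the DAG id and start_date are hypothetical):

from datetime import datetime
from airflow import DAG

# Hypothetical DAG showing the per-DAG override of max_active_runs
# described above; it takes precedence over max_active_runs_per_dag.
dag = DAG(
    dag_id='e2e_test_dag',
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    max_active_runs=10,
)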

Who waits in celery apply_async(countdown=10)

I run a Celery worker with concurrency 20:
"-c", "20", "-P", "eventlet", "-Ofair"
and generate batches of 20 tasks for this worker's queue in another task:
async_call.apply_async(
    (call_id, engine),
    expires=60,
    countdown=60 * random(),  # random delay, to prevent spikes
)
In the Flower viewer I see that there are never more than 20 tasks.
The question is: where does the countdown wait happen? Is it inside the queue, or inside the worker process (i.e., idle time for the worker that picked up the task)?
If it is inside the worker, then to use all CPU I would need to increase concurrency according to the ratio of countdown time (idle time) to work time.
The wait happens inside the worker: each Celery worker has a process/thread that polls the broker for new tasks, and tasks sent with a countdown/ETA are held there until they are due.
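One way to see this, assuming a running worker (the app, broker URL, and task names below are hypothetical): a task sent with a countdown shows up in the worker's "scheduled" (ETA) list rather than waiting in the broker queue.

from celery import Celery

# Hypothetical app and broker for illustration only.
app = Celery('demo', broker='redis://localhost:6379/0')

@app.task
def async_call(call_id, engine):
    pass  # placeholder body

# Send a task with a 30-second countdown...
async_call.apply_async(('call-1', 'engine-a'), countdown=30, expires=60)

# ...then ask the workers where it is: it appears in the 'scheduled'
# (ETA) list of the worker that reserved it, not in the broker queue.
print(app.control.inspect().scheduled())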

Airflow Workers starving for queued tasks

I am using Airflow (1.10.3) with AWS RDS Postgres as the metastore and Celery result backend, SQS as the queue service, and the CeleryExecutor. I have one master machine running the Airflow webserver and scheduler, and one worker machine.
The Airflow worker is always starving for more (queued) tasks while leaving a lot of resources unused (CPU and RAM usage always below 20%). I've observed the worker picking up tasks in batches; for example, if there are 10 tasks in the queue and 2 running tasks, it will wait for those 2 tasks to complete before picking the next batch from the queue.
Parallelism settings in airflow.cfg on the worker instance:
parallelism = 32
dag_concurrency = 32
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 32
max_threads = 2 (no issues in the scheduler though, as tasks are queued immediately)
One important thing to point out about my implementation: an Airflow task here is not a single-process task; each individual task spawns multiple processes (3-5). Even after accounting for those process counts, my Airflow worker never reaches full parallelism.
Any suggestions on:
a) Is there a way to fully utilise parallel execution of tasks on an Airflow worker? Or is there some setup information that I am missing?
b) Do the parallelism settings above treat an Airflow task as the atomic unit, or do they count the threads/processes that a task spawns?
Thanks!
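One knob not listed above (an assumption about this setup, not something stated in the question): with the CeleryExecutor in Airflow 1.10.x, the number of task instances a single worker runs at once is governed by worker_concurrency in the [celery] section of airflow.cfg, separately from the scheduler-side settings shown. A minimal sketch:

[celery]
# How many task instances one Celery worker will execute concurrently.
# 16 is the 1.10.x default; raise it if the worker has spare CPU/RAM.
worker_concurrency = 16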

Change timeout for builtin celery tasks (i.e. celery.backend_cleanup)

We're using Celery 4.2.1 and Redis, with global soft and hard timeouts set for our tasks. All of our custom tasks are designed to stay under the limits, but every day the builtin backend_cleanup task ends up forcibly killed by the timeouts.
I'd rather not raise our global timeout just to accommodate builtin Celery tasks. Is there a way to set the timeout of these builtin tasks directly?
I've had trouble finding any documentation on this, or even anyone else hitting the same problem.
Relevant source from celery/app/builtins.py:
@connect_on_app_finalize
def add_backend_cleanup_task(app):
    """Task used to clean up expired results.

    If the configured backend requires periodic cleanup this task is also
    automatically configured to run every day at 4am (requires
    :program:`celery beat` to be running).
    """
    @app.task(name='celery.backend_cleanup', shared=False, lazy=False)
    def backend_cleanup():
        app.backend.cleanup()
    return backend_cleanup
You can set the backend cleanup schedule directly in celery.py:
app.conf.beat_schedule = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 600,  # 10 minutes
    },
}
And then run the celery beat process:
celery -A YOUR_APP_NAME beat -l info --detach
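If the goal is to give just this builtin task more headroom rather than running it more often, a possible variation (an assumption on my part, not part of the answer above) is that a beat_schedule entry accepts an options dict forwarded to apply_async, which can carry per-dispatch time limits:

from celery.schedules import crontab

app.conf.beat_schedule = {
    'backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': crontab(hour=4, minute=0),  # keep the default daily run
        # Execution options forwarded to apply_async; the limits below are
        # illustrative values, not recommendations.
        'options': {'soft_time_limit': 600, 'time_limit': 660},
    },
}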

Scaling periodic tasks in celery

We have a 10-queue setup in our Celery deployment. It's a large setup: each queue has a group of 5 to 10 tasks, and each queue runs on a dedicated machine (some on multiple machines for scaling).
On the other hand, we have a bunch of periodic tasks running as a single instance on a separate machine, and some of those periodic tasks are taking long to execute; I want to run them on the 10 queues instead.
Is there a way to scale celery beat, or to use it purely to trigger the tasks on a different destination (one of the 10 queues)?
Please advise.
Use Celery routing to dispatch the task to the queue you need:
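As a hedged sketch of what Celery routing typically looks like (the app, broker, task, and queue names are hypothetical, not taken from the answer):

from celery import Celery

# Hypothetical app, broker, and queue names for illustration only.
app = Celery('myapp', broker='redis://localhost:6379/0')

# Route the slow periodic task to one of the existing worker queues, so
# beat only publishes the message and a worker bound to that queue runs it.
app.conf.task_routes = {
    'myapp.tasks.slow_periodic_task': {'queue': 'queue_1'},
}

# Alternatively, pin the queue on the beat schedule entry itself:
app.conf.beat_schedule = {
    'slow-periodic': {
        'task': 'myapp.tasks.slow_periodic_task',
        'schedule': 300.0,                # every 5 minutes
        'options': {'queue': 'queue_1'},  # dispatch to that queue
    },
}

celery beat itself only publishes the schedule messages, so it does not need to scale; the workers consuming whichever queue the task is routed to do the actual work.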