Airflow distributed model services - Celery

Switching from LocalExecutor to CeleryExecutor.
In this model, I have:
Masternode1 - Airflow webserver, Airflow scheduler, RabbitMQ
Masternode2 - Airflow webserver, RabbitMQ
Workernode1 - Airflow worker
Workernode2 - Airflow worker
Workernode3 - Airflow worker
Question:
Where does the Flower service run for Celery? Is it required to run it on all nodes, or just on any one of the nodes (since it's only a UI)?
Are there any other components needed to manage a production workload?
Is using Kafka as the broker a reality and available to use?
Thank you

Celery Flower is yet another (optional) service that you may want to run independently, either on a dedicated machine or sharing one machine with a few other Airflow services.
You may, for example, run the webserver and Flower on one machine, and the scheduler and a few Airflow workers each on dedicated machines.
Kafka as a broker for Celery is something people talk about quite a lot, but as far as I know there is no concrete work in Celery for it. However, considering there is interest in having Kafka support in Kombu, I would assume that the moment Kombu gets Kafka support, Celery will soon follow, as Kombu is the core Celery dependency.
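If you want to see which broker transports your installed Kombu version actually registers (and whether a Kafka transport has landed yet in the release you have), you can inspect Kombu's transport alias table. A minimal sketch, assuming Kombu is installed in the current environment:

from kombu.transport import TRANSPORT_ALIASES

# List every broker transport alias this Kombu release knows about
# (e.g. amqp, pyamqp, redis, SQS, zookeeper, ...).
for alias in sorted(TRANSPORT_ALIASES, key=str.lower):
    print(alias)

# Whether "kafka" shows up depends entirely on the installed Kombu version.
print("kafka transport registered:",
      any("kafka" in alias.lower() for alias in TRANSPORT_ALIASES))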

Airflow fault tolerance

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one worker node goes down?
Second, is it possible that the whole Airflow server goes down? If yes, is there a backup that runs automatically to continue the work?
Note: I have started learning Airflow recently.
Thanks in advance
This is a theoretical question I faced while learning Apache Airflow. I have read the documentation,
but it did not mention how fault tolerance is handled.
What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to watch the state of the workers (task pods) on each change in order to discover failed pods. When a worker pod goes down, the scheduler detects this failure and changes the state of the failed tasks in the metadata database, so these tasks can be rescheduled and executed based on their retry configuration.
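For context, that retry behaviour is configured on the tasks themselves. A minimal sketch (DAG name, operator, and timings are illustrative, not from the original question):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 3,                         # re-schedule the task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}

with DAG(
    dag_id="fault_tolerance_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # If the worker pod running this task disappears, the scheduler marks the
    # task failed and reschedules it according to the retry settings above.
    BashOperator(task_id="resilient_task", bash_command="echo hello")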
Is it possible that the whole Airflow server goes down?
Yes, it is possible for different reasons, and there are different solutions/tips for each one:
Problem with the metadata database: the most important part of Airflow is the metadata database, which is the central point used to communicate between the different schedulers and workers. It stores the state of all the DAG runs and tasks, shares messages between tasks, and holds variables and connections, so when it goes down, everything fails:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to enable HA mode; in this case, if one scheduler goes down, the other schedulers stay up
Problem with the webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can access the UI/API through the other replicas (a short health-check sketch follows below)
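To detect such outages early, the webserver exposes a /health endpoint that reports the status of the metadata database and the scheduler. A minimal monitoring sketch; the webserver address is an assumption for your own deployment:

import requests

AIRFLOW_BASE_URL = "http://airflow-webserver:8080"  # assumed address

def check_airflow_health():
    # Typical payload: {"metadatabase": {"status": "healthy"},
    #                   "scheduler": {"status": "healthy", ...}}
    resp = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
    resp.raise_for_status()
    health = resp.json()
    for component, state in health.items():
        print(f"{component}: {state.get('status')}")
    return health

if __name__ == "__main__":
    check_airflow_health()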

Brokers for Celery Executor in Airflow

Is it possible to use the following brokers instead of Redis or RabbitMQ:
Zookeeper
IBM MQ
Kafka
Megacache
If so, how would I be able to use them?
Thanks
As per the Celery documentation on broker transport support, RabbitMQ and Redis are fully featured and qualify as stable solutions.
From the list of alternatives you've provided, Zookeeper might also be adopted as a broker for the CeleryExecutor in Airflow, but only as an experimental option with some functional limitations.
Installation details for the Zookeeper broker implementation can be found here.
Using Python package:
$ pip install "celery[zookeeper]"
You can check out all the available extensions in the source setup.py code.
Referencing Airflow documentation:
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to setup a Celery backend
(RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery
settings.
After the particular Celery backend has been prepared, adjust the appropriate settings in the airflow.cfg file; if any doubts come up, refer to this example.
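Before pointing broker_url in airflow.cfg at Zookeeper, it can be worth sanity-checking that Celery can actually open a connection to it. A minimal sketch, assuming Celery 4+, a local Zookeeper ensemble on the default port, and the zookeeper extra installed as shown above:

from celery import Celery

# Host/port are assumptions; this only verifies the experimental transport,
# it does not start any Airflow component.
app = Celery("broker_check", broker="zookeeper://localhost:2181/")

# Opening a broker connection forces Kombu to load the zookeeper transport;
# this raises if Zookeeper is unreachable or the transport is not installed.
with app.connection_for_write() as conn:
    conn.ensure_connection(max_retries=1)
    print("Broker reachable via:", conn.transport.__class__.__module__)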

Balancing task distribution by performance in Apache Airflow

I would like to distribute work to my workers in Apache Airflow based on the health and current load of each worker. Something similar to HAProxy leastconn is what I am after.
Is there a way for workers to report their load/health and have tasks distributed accordingly? I am fine with Dask or Celery, but I am most familiar with Celery.
If you use Dask, it should do this automatically. The Dask scheduler takes care of load balancing and node failover. I would expect Celery to do the same, though I'm less familiar there.
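With the CeleryExecutor, Airflow does not do load-aware balancing out of the box; the closest built-in knob is static routing: pin heavy tasks to a dedicated Celery queue and run appropriately sized workers that subscribe to it (started with something like `airflow worker -q heavy` on Airflow 1.x). A minimal sketch, with the queue names being assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="queue_routing_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
) as dag:
    light_task = BashOperator(
        task_id="light_task",
        bash_command="echo light",
        queue="default",   # picked up by general-purpose workers
    )
    heavy_task = BashOperator(
        task_id="heavy_task",
        bash_command="echo heavy",
        queue="heavy",     # picked up only by workers subscribed to 'heavy'
    )
    light_task >> heavy_task

This is static routing rather than the load-aware balancing that Dask performs automatically, but it lets you reserve bigger machines for the heavy queue.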

Airflow 1.9 - Tasks stuck in queue

Latest Apache-Airflow install from PyPI (1.9.0)
Set up includes:
Apache-Airflow
Apache-Airflow[celery]
RabbitMQ 3.7.5
Celery 4.1.1
Postgres
I have the installation across 3 hosts.
Host #1
Airflow Webserver
Airflow Scheduler
RabbitMQ Server
Postgres Server
Host #2
Airflow Worker
Host #3
Airflow Worker
I have a simple DAG that executes a BashOperator task every 1 minute. I can see the scheduler "queue" the job; however, it never gets added to a Celery/RabbitMQ queue and never gets picked up by the workers. I have a custom RabbitMQ user, and authentication seems fine. Flower, however, doesn't show any of the queues populating with data. It does see the two worker machines listening on their respective queues.
Things I've checked:
Airflow Pool configuration
Airflow environment variables
Upgrade/Downgrade Celery and RabbitMQ
Postgres permissions
RabbitMQ Permissions
DEBUG level airflow logs
I read the documentation section about jobs not running. My "start_date" variable is a static date that exists before the current date.
OS: CentOS 7
I was able to figure it out, but I'm not sure why this is the answer.
Changing the "broker_url" setting to use "pyamqp" instead of "amqp" was the fix.
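A plausible explanation (not confirmed in the original thread): with Celery/Kombu, the bare amqp:// scheme can resolve to the librabbitmq C transport when that library is installed, while pyamqp:// always forces the pure-Python py-amqp transport, which is the better-supported path with Celery 4.x. You can check which transport each scheme resolves to outside of Airflow; a minimal sketch with assumed credentials and host:

from kombu import Connection

for scheme in ("amqp", "pyamqp"):
    # Credentials, host, and vhost are placeholders for your own setup.
    url = f"{scheme}://airflow_user:airflow_pass@localhost:5672//"
    with Connection(url) as conn:
        # 'amqp' may select the librabbitmq transport if it is installed,
        # while 'pyamqp' always uses the pure-Python py-amqp transport.
        print(scheme, "->", conn.transport.__class__.__module__)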

Airflow: When to use CeleryExecutor and when to use MesosExecutor

I am pretty new to Airflow and trying to understand how we should set it up in our environment (on AWS).
I read that Airflow uses Celery with a Redis broker.
How is it different from Mesos? I have not used Celery before, but I tried to set up celery-redis on my dev machine and it worked with ease. However, adding new components means more monitoring.
Since we already use Mesos for our cluster management, I am trying to figure out what I would be missing if I don't choose Celery and go with the MesosExecutor instead.
Using Celery is the more proven/stable approach at the moment.
For us, managing dependencies using containers is more convenient than managing dependencies on the Mesos instances, which is what you have to do if you choose the MesosExecutor. As such, we are finding Celery more flexible.
We are currently using Celery + RabbitMQ, but we will switch to the MesosExecutor in the future as our codebase stabilises.
Airflow with the CeleryExecutor doesn't necessarily need to use the Redis broker. Any broker that Celery can use is compatible with Airflow, though it is recommended to use either the RabbitMQ broker or the Redis broker.
Celery is quite different from Mesos. While Airflow supports the MesosExecutor too, it is recommended to use the CeleryExecutor if you are planning to distribute the workers. From what I know, Airbnb uses the CeleryExecutor and actively maintains it.
For us, the MesosExecutor cannot be used. We need an abstraction level to handle dependencies for jobs; we cannot (and shouldn't) rely on any dependencies being installed on the Mesos slaves. When Docker containers and/or Mesos containers are supported by the MesosExecutor, we can turn to it. Also, I like seeing the allocated workers inside Marathon. I am working on how to autoscale workers with Marathon.
The MesosExecutor is still experimental at this stage: it does not support running Docker containers or different resource limits per task, and it probably has many other limitations.
I plan to work on this though; it's a community effort, and having spent some effort to deploy a Mesos cluster, I feel that adding Celery and another MQ broker is a waste of resources.