Brokers for Celery Executor in Airflow

Is it possible to use the following brokers instead of Redis or RabbitMQ:
Zookeeper
IBM MQ
Kafka
Memcached
If so, how would I be able to use them?
Thanks

According to the Celery documentation on broker transports, RabbitMQ and Redis are the fully featured brokers and qualify as stable solutions.
From the list you've provided, Zookeeper can also be adopted as a Celery broker for Airflow, but only as an experimental option with some functional limitations.
Installation details for the Zookeeper broker implementation can be found here.
Install it as a Python package extra:
$ pip install "celery[zookeeper]"
You can check out all the available extras in Celery's setup.py source code.
Referencing Airflow documentation:
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to setup a Celery backend
(RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery
settings.
Once the chosen Celery backend is prepared, adjust the appropriate settings in the airflow.cfg file; for any remaining doubts, refer to this example.
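As a minimal sketch, assuming a local Zookeeper on its default port and a Postgres result backend (the hostnames, ports, and credentials below are placeholders, not values from the question), the relevant airflow.cfg entries could look like:
[core]
executor = CeleryExecutor

[celery]
# Experimental Zookeeper transport provided by Kombu (assumed host/port)
broker_url = zookeeper://localhost:2181/
# Celery still needs a result backend; a database is typical (assumed credentials)
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow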

Related

How do I view metrics for a confluentinc/cp-kafka container?

Hi, I have a Kafka container built using the image 'confluentinc/cp-kafka:6.1.0'.
How do I view the metrics from the container?
You can add an environment variable for JMX_PORT, then attach a tool like JConsole or VisualVM to that port.
This is mentioned in the docs, though I think part of it might be incorrect (at least, trying to use /jmx on Zookeeper; also, the variable is simply JMX_PORT and shouldn't be different inside the container).
If you want to use Prometheus/Grafana, then you'll need to extend the container to add the JMX exporter.
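If you go the Prometheus route, a rough sketch of attaching the JMX exporter as a Java agent via the container environment, assuming the image honors KAFKA_OPTS (the jar path, port, and config file are assumptions; you would have to add or mount them yourself):
environment:
  KAFKA_OPTS: "-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka.yml"
ports:
  - "7071:7071"   # Prometheus scrapes the exporter on this port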
I set up Kafka using https://docs.confluent.io/platform/current/quickstart/ce-docker-quickstart.html#ce-docker-quickstart. This launches Kafka with JMX installed.
This installation provides Confluent Control Center so you can view metrics there.
However I wanted the raw metrics exposed by JMX so I proceeded to the next steps.
I installed VisualVM from here https://visualvm.github.io/download.html.
(You can also use jconsole, available in your local JDK's bin folder, but I had connectivity issues running jconsole against the container's JMX.)
Install the VisualVM-MBeans plugin in VisualVM.
Add a JMX connection using the KAFKA_JMX_HOSTNAME:KAFKA_JMX_PORT values from your docker-compose.yml in Step 1.
Bingo, you can see the metrics from Confluent Kafka running in the container!
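For reference, the broker service in that quickstart's docker-compose.yml sets the JMX values roughly like this (the exact port may differ in your copy):
  KAFKA_JMX_HOSTNAME: localhost
  KAFKA_JMX_PORT: 9101
so the JMX connection to add in VisualVM (or jconsole) would be localhost:9101.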

Airflow distributed model services

Switching from LocalExecutor to CeleryExecutor.
In this model, I have
Masternode1 - airflow webserver, airflow scheduler, rabbitmq
Masternode2 - airflow webserver, rabbitmq
Workernode1 - airflow worker
Workernode2 - airflow worker
Workernode3 - airflow worker
Question:
Where does the Flower service run for Celery? Is it required to run it on all nodes, or just on any one of the nodes (since it's only a UI)?
Are there any other components needed to manage a production workload?
Is using Kafka as the broker a realistic, available option?
Thank you
Celery Flower is yet another (optional) service that you may want to run independently, either on a dedicated machine or sharing one machine among a few Airflow services.
You may, for example, run the webserver and Flower on one machine, and the scheduler and a few Airflow workers each on a dedicated machine.
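For illustration, starting Flower on whichever node you pick is a single command; the exact form depends on your Airflow version, and both assume the Celery/broker settings are already in airflow.cfg:
$ airflow flower          # Airflow 1.10.x
$ airflow celery flower   # Airflow 2.x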
Kafka as a broker for Celery is something people talk about quite a lot, but as far as I know there is no concrete work in Celery for it. However, considering there is interest in having Kafka support in Kombu, I would assume that the moment Kombu gets Kafka support, Celery will soon follow, as Kombu is Celery's core dependency.

Airflow 1.9 - Tasks stuck in queue

Latest Apache-Airflow install from PyPI (1.9.0)
Set up includes:
Apache-Airflow
Apache-Airflow[celery]
RabbitMQ 3.7.5
Celery 4.1.1
Postgres
I have the installation across 3 hosts.
Host #1
Airflow Webserver
Airflow Scheduler
RabbitMQ Server
Postgres Server
Host #2
Airflow Worker
Host #3
Airflow Worker
I have a simple DAG that executes a BashOperator task every 1 minute. I can see the scheduler "queue" the job; however, it never gets added to a Celery/RabbitMQ queue and picked up by the workers. I have a custom RabbitMQ user, and authentication seems fine. Flower, however, doesn't show any of the queues populating with data. It does see the two worker machines listening on their respective queues.
Things I've checked:
Airflow Pool configuration
Airflow environment variables
Upgrade/Downgrade Celery and RabbitMQ
Postgres permissions
RabbitMQ Permissions
DEBUG level airflow logs
I read the documentation section about jobs not running. My "start_date" variable is a static date that exists before the current date.
OS: Centos 7
I was able to figure it out, but I'm not sure why this is the answer.
Changing the "broker_url" variable to use "pyamqp" instead of "amqp" was the fix.
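For anyone hitting the same thing, the change amounts to one line in airflow.cfg (user, password, host, and vhost below are placeholders):
broker_url = pyamqp://airflow_user:airflow_pass@rabbitmq-host:5672/airflow_vhost
A likely explanation is that the amqp:// scheme lets Kombu pick the librabbitmq C transport if it happens to be installed, while pyamqp:// forces the pure-Python py-amqp transport.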

Airflow: When to use CeleryExecutor and when to use MesosExecutor

I am pretty new to Airflow and trying to understand how we should set it up in our environment (on AWS).
I read that Airflow uses Celery with a Redis broker.
How is it different from Mesos? I have not used Celery before, but I tried to set up celery-redis on my dev machine and it worked with ease. But adding new components means more monitoring.
Since we already use Mesos for our cluster management, I am trying to figure out what I would be missing if I don't choose Celery and go with the MesosExecutor instead.
Using Celery is the more proven/stable approach at the moment.
For us, managing dependencies using containers is more convenient than managing dependencies on the Mesos instances, which is the case if you choose the MesosExecutor. As such, we find Celery more flexible.
We are currently using Celery + RabbitMQ, but we will switch to the MesosExecutor in the future as our codebase stabilises.
Airflow with the CeleryExecutor doesn't necessarily need to use the Redis broker. Any broker that Celery can use is compatible with Airflow, though it is recommended to use either the RabbitMQ broker or the Redis broker.
Celery is quite different from Mesos. While Airflow supports the MesosExecutor too, it is recommended to use the CeleryExecutor if you are planning to distribute the workers. From what I know, Airbnb uses the CeleryExecutor and actively maintains it.
For us, the MesosExecutor cannot be used. We need an abstraction level to handle dependencies for jobs, and we cannot (and shouldn't) rely on any dependencies being installed on the Mesos slaves. When Docker containers and/or Mesos containers are supported by the MesosExecutor, we can turn to it. Also, I like seeing the allocated workers inside Marathon. I am working on how to autoscale workers with Marathon.
The MesosExecutor is still experimental at this stage: it does not support running Docker containers or having different resource limits per task, and it probably has many other limitations.
I plan to work on this though; it's a community effort, and having spent some effort to deploy a Mesos cluster, I feel that adding Celery and another MQ broker is a waste of resources.

Celery Flower Broker Tab not populating with broker_api set for rabbitmq api

I'm trying to populate the Broker tab on Celery Flower but when I pass a broker_api like the following example:
python manage.py celery flower --broker_api=http://guest:guest@localhost:15672/api/
I get the following error:
state.py:108 (run) Failed to inspect the broker: 'list' object is not callable
I'm confident the credentials I'm using are correct and the RabbitMQ Management Plugin is enabled. I'm able to access the RabbitMQ monitoring page through the browser.
flower==0.6.0
RabbitMQ 3.2.1
Does anyone know how to fix this?
Try removing the trailing slash, so the URL ends in /api:
python manage.py celery flower --broker_api=http://guest:guest@localhost:15672/api
Had the same issue on an Airflow setup with Celery 5.2.6 and Flower 1.0.0. The solution for me was to launch Flower using:
airflow celery flower --broker-api=http://guest:guest@rabbitmq:15672/api/
For non-Airflow readers, I believe the command should be:
celery flower --broker=amqp://guest:guest@rabbitmq:5672 --broker_api=http://guest:guest@rabbitmq:15672/api/
A few remarks:
The above assumes a shared Docker network. If that's not the case, every @rabbitmq should be replaced with e.g. @localhost.
--broker is not needed if running under Airflow's umbrella (it's passed from the Airflow config)
A good test to verify the API works is to access http://guest:guest@localhost:15672/api/index.html locally.
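An equivalent check from the command line, using the same default guest credentials assumed above:
$ curl -u guest:guest http://localhost:15672/api/overview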