Airflow: When to use CeleryExecutor and when to use MesosExecutor

I am pretty new to Airflow and trying to understand how we should set it up in our environment (on AWS).
I read that Airflow uses Celery with a Redis broker.
How is it different from Mesos? I have not used Celery before, but I tried setting up celery-redis on my dev machine and it worked with ease. Still, adding new components means more monitoring.
Since we already use Mesos for our cluster management, I am trying to understand what I would be missing if I don't choose Celery and go with the MesosExecutor instead.

Using Celery is the more proven/stable approach at the moment.
For us, managing dependencies using containers is more convenient than managing dependencies on the Mesos instances, which is what you have to do if you choose the MesosExecutor. As such, we find Celery more flexible.
We are currently using Celery + RabbitMQ, but we plan to switch to the MesosExecutor in the future as our codebase stabilises.

Airflow with the CeleryExecutor doesn't necessarily need to use the Redis broker. Any broker that Celery can use is compatible with Airflow, though it is recommended to use either the RabbitMQ broker or the Redis broker.
Celery is quite different from Mesos. While Airflow supports the MesosExecutor too, it is recommended to use the CeleryExecutor if you are planning to distribute the workers. From what I know, Airbnb uses the CeleryExecutor and actively maintains it.

For us, the MesosExecutor cannot be used. We need an abstraction level to handle dependencies for jobs; we cannot (and shouldn't) rely on any dependencies being installed on the Mesos slaves. Once Docker containers and/or Mesos containers are supported by the MesosExecutor, we can switch to it. Also, I like seeing the allocated workers inside Marathon. I am working on how to autoscale workers with Marathon.

The MesosExecutor is still experimental at this stage and does not support running Docker containers or setting different resource limits per task, and it probably has many other limitations.
I plan to work on this though; it's a community effort, and having spent some effort deploying a Mesos cluster, I feel that adding Celery and another MQ broker is a waste of resources.

Related

Brokers for Celery Executor in Airflow

Is it possible to use the following brokers instead of Redis or RabbitMQ:
Zookeeper
IBM MQ
Kafka
Megacache
If so, how would I be able to use them?
Thanks
As per the Celery documentation on supported broker transports, RabbitMQ and Redis are fully featured and qualify as stable solutions.
Of the alternatives you've listed, Zookeeper can also be adopted as a Celery broker in Airflow, but only as an experimental option with some functional limitations.
Installation details for the Zookeeper broker implementation can be found here.
Using the Python package:
$ pip install "celery[zookeeper]"
You can check out all the available extras in Celery's setup.py source.
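To illustrate, a minimal sketch of pointing Celery itself at a Zookeeper broker; the zookeeper:// URL scheme comes from Kombu's experimental transport, and the host, port and task below are placeholders I've assumed rather than anything from the question:

from celery import Celery

# Placeholder broker address; requires the celery[zookeeper] extra shown above
app = Celery('tasks', broker='zookeeper://localhost:2181')

@app.task
def add(x, y):
    return x + y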
Referencing the Airflow documentation:
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to setup a Celery backend
(RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery
settings.
Once the particular Celery backend has been prepared, adjust the appropriate settings in your airflow.cfg file; if anything is unclear, refer to this example.
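As a rough sketch, the relevant airflow.cfg entries would look something like the following (the URLs are placeholders, and the exact key names can vary slightly between Airflow versions, so treat this as an assumption rather than a definitive config):

[core]
executor = CeleryExecutor

[celery]
# Placeholder connection strings; point these at your actual broker and result backend
broker_url = redis://my-redis-host:6379/0
result_backend = db+postgresql://airflow:airflow@my-postgres-host/airflow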

How do we choose --nthreads and --nprocs per worker in Dask distributed running via Helm on Kubernetes?

I'm running some I/O intensive Python code on Dask and want to increase the number of threads per worker. I've deployed a Kubernetes cluster that runs Dask distributed via helm. I see from the worker deployment template that the number of threads for a worker is set to the number of CPUs, but I'd like to set the number of threads higher unless that's an anti-pattern. How do I do that?
It looks like, from this similar question, that I can SSH into the Dask scheduler and spin up workers with dask-worker. But ideally I'd be able to configure the worker resources via Helm so that I don't have to interact with the scheduler other than submitting jobs to it via the Client.
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. For more information, please follow link 1 (best practices described in Dask's official documentation) and link 2.
Threading in Python is a careful art and really depends on your code. To take the easy one first: --nprocs should almost certainly be 1; if you want more processes, launch more replicas instead. For the thread count, first remember that the GIL means only one thread can be running Python code at a time. So you only get concurrency gains in two main situations: 1) some threads are blocked on I/O, such as waiting to hear back from a database or web API, or 2) some threads are running non-GIL-bound C code inside NumPy or friends. In the second situation, you still can't get more concurrency than the number of CPUs, since that's how many slots there are to run at once, but the first can benefit from more threads than CPUs in some situations.
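As a hedged illustration, an I/O-bound worker started by hand might therefore keep one process and raise the thread count; the scheduler address and numbers below are placeholders:

$ dask-worker tcp://my-scheduler:8786 --nprocs 1 --nthreads 8 --memory-limit 4GB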
There's a limitation of Dask's helm chart that doesn't allow for the setting of --nthreads in the chart. I confirmed this with the Dask team and filed an issue: https://github.com/helm/charts/issues/18708.
In the meantime, use Dask Kubernetes for a higher degree of customization.
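A minimal sketch with Dask Kubernetes, assuming a worker-spec.yml pod spec whose container args pass --nthreads explicitly (the file name, image and values are illustrative assumptions, not a definitive recipe):

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# worker-spec.yml is a hypothetical pod spec whose container args include
# something like: [dask-worker, --nthreads, "8", --memory-limit, "4GB"]
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(3)          # request three worker pods
client = Client(cluster)  # submit work through this client as usual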

Airflow distributed model services

Switching from the LocalExecutor to the CeleryExecutor.
In this model, I have
Masternode1 - airflow webserver, airflow scheduler, rabbitmq
Masternode2 - airflow webserver, rabbitmq
Workernode1 - airflow worker
Workernode2 - airflow worker
Workernode3 - airflow worker
Question:
Where does the Flower service run for Celery? Is it required to run it on all nodes, or just on any one of the nodes (since it's only a UI)?
Are there any other components that are essential for managing a production workload?
Is using Kafka as the broker a realistic option that is available to use?
Thank you
Celery Flower is yet another (optional) service that you may want to run independently, either on a dedicated machine or sharing one machine with a few other Airflow services.
You may, for example, run the webserver and Flower on one machine, and the scheduler and a few Airflow workers each on a dedicated machine.
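For example (a sketch assuming the Airflow 1.x CLI; exact commands depend on your version and process supervisor), the services could be started like this:

# On Masternode1
$ airflow webserver
$ airflow scheduler

# On whichever single node you pick for Flower (one instance is enough, since it is only a UI)
$ airflow flower

# On each Workernode
$ airflow worker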
Kafka as a broker for Celery is something people talk about quite a lot, but as far as I know there is no concrete work in Celery for it. However, considering there is interest in having Kafka support in Kombu, I would assume that the moment Kombu gets Kafka support, Celery will soon follow, as Kombu is the core Celery dependency.

Balancing task distribution by performance in Apache Airflow

I would like to distribute work to my workers in Apache Airflow based on the health and current load of each worker. Something similar to HAProxy leastconn is what I am after.
Is there a way for workers to report their load/health and have tasks distributed accordingly? I am fine with Dask or Celery, but I am most familiar with Celery.
If you use Dask, it should do this automatically. The Dask scheduler takes care of load balancing and node failover. I would expect Celery to do the same, though I'm less familiar there.
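In other words, with Dask you simply submit the work and let the scheduler place it on whichever workers have capacity; a minimal sketch (the scheduler address and task function are placeholders):

from dask.distributed import Client

client = Client('tcp://my-scheduler:8786')   # connect to the existing cluster

def process(record):                         # placeholder task
    return record * 2

futures = client.map(process, range(1000))   # the scheduler balances these across workers
results = client.gather(futures)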

Apache Mesos vs Google Kubernetes

What's the difference between Apache's Mesos and Google's Kubernetes?
I read the accepted answers but I'm still confused about what the differences are.
If Kubernetes is a cluster management tool, then what does Mesos do? (I understand what it does from watching a bunch of videos, but I suppose I'm more confused about how those two work together.)
From what I've read, both Kubernetes and Marathon are "frameworks" sitting on top of Mesos?
What is Mesos responsible for, what are Kubernetes/Marathon responsible for, and how do they work with each other?
EDIT:
I think the better question is: when would I want to use Kubernetes on top of Mesos vs just running Mesos alone?
Mesos is another abstraction layer. It simply abstracts the underlying hardware, so the software that wants to run on top of it only has to define the resources it requires, without having to know anything else about the machines.
Kubernetes could do a similar thing, but without the abstraction provided by Mesos you can't run other frameworks (e.g., Spark or Cassandra) on the same machines without manually dividing the machines between those frameworks.
Apache Mesos is a resource manager that shares resources (CPU shares, RAM, disk, ports) across a cluster of machines in a fair way. By sharing, I mean it offers these resources to so-called framework schedulers (such as Marathon) and thereby has a clear separation of concerns in terms of resource management and scheduling decisions (which are implemented, depending on the job type, for example long-running or batch, by the framework scheduler). See also the Mesos architecture for further details.
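For example, a framework scheduler like Marathon accepts an app definition that only declares the resources the app needs, and Mesos offers machines that can satisfy it; a hedged sketch of such a definition (the id, image and values are placeholders):

{
  "id": "/example-service",
  "instances": 2,
  "cpus": 0.5,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "nginx:latest" }
  }
}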