Balancing task distribution by performance in Apache Airflow - celery

I would like to distribute work to my workers in Apache Airflow based on the health and current load of each worker. Something similar to HAProxy leastconn is what I am after.
Is there a way for workers to report their load/health and have tasks distributed accordingly? I am fine with Dask or Celery, but I am most familiar with Celery.

If you use Dask, it should do this automatically. The Dask scheduler takes care of load balancing and node failover. I would expect Celery to do the same, though I'm less familiar with it.
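To make the leastconn idea from the question concrete, here is a minimal sketch in plain Python. It is only an illustration of the policy, not how Dask or Celery choose workers internally; the `Worker` class, `pick_least_loaded`, and `assign` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record of what a worker might report about itself.
    name: str
    active_tasks: int = 0
    healthy: bool = True


def pick_least_loaded(workers):
    """Return the healthy worker with the fewest active tasks (leastconn-style)."""
    candidates = [w for w in workers if w.healthy]
    if not candidates:
        raise RuntimeError("no healthy workers available")
    # min() returns the first minimum, so ties go to the earliest worker listed.
    return min(candidates, key=lambda w: w.active_tasks)


def assign(task_id, workers):
    """Assign a task to the least-loaded healthy worker and record the new load."""
    worker = pick_least_loaded(workers)
    worker.active_tasks += 1
    return worker.name


workers = [
    Worker("w1", active_tasks=2),
    Worker("w2", active_tasks=0),
    Worker("w3", active_tasks=1, healthy=False),  # unhealthy, never chosen
]
assignments = [assign(f"task-{i}", workers) for i in range(3)]
print(assignments)  # ['w2', 'w2', 'w1']
```

The unhealthy worker is skipped even though it has less load than `w1`, which is the behaviour the question asks for.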

Related

Airflow fault tolerance

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one of the worker nodes goes down?
Second, is it possible for the whole Airflow server to go down? If so, is there a backup that starts automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation,
but it does not mention how fault tolerance is handled.
what does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to watch the state of the worker pods (tasks) and to discover failed pods on each change. When a worker pod goes down, the scheduler detects the failure and updates the state of the failed tasks in the metadata database; these tasks can then be rescheduled and executed according to their retry configuration.
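As a rough sketch of what "rescheduled based on the retry configuration" means: `retries` and `retry_delay` are standard Airflow task arguments, usually passed through `default_args`. The `run_with_retries` function below is a hypothetical simulation written for this example, not Airflow code; it only mimics the state transitions (`up_for_retry`, `failed`, `success`) a task goes through.

```python
from datetime import timedelta

# `retries` and `retry_delay` are real Airflow task arguments;
# the values here are illustrative assumptions.
default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}


def run_with_retries(attempt_outcomes, retries):
    """Simulate task attempts; attempt_outcomes[i] is True if attempt i+1 succeeds.

    Attempts beyond the end of attempt_outcomes are treated as failures.
    """
    max_attempts = retries + 1  # the first try plus the configured retries
    history = []
    for attempt in range(1, max_attempts + 1):
        succeeded = attempt <= len(attempt_outcomes) and attempt_outcomes[attempt - 1]
        if succeeded:
            history.append((attempt, "success"))
            return "success", history
        # A failed attempt is retried unless the retry budget is used up.
        state = "up_for_retry" if attempt < max_attempts else "failed"
        history.append((attempt, state))
    return "failed", history


# The worker pod dies on the first attempt, the retry succeeds.
final_state, history = run_with_retries([False, True], default_args["retries"])
print(final_state, history)
```

With `retries=2`, a task whose pod is lost once ends in `success` on the second attempt; a task that fails three times in a row ends in `failed`.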
is it possible that the whole Airflow server gets down?
Yes, it is possible, for several reasons, and there are different solutions/tips for each one:
Problem in the metadata database: this is the most important part of Airflow. It is the central point of communication between the schedulers and workers, and it stores the state of all DAG runs and tasks, the messages shared between tasks, and the variables and connections. When it goes down, everything fails:
You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
You can deploy it on your K8S cluster in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to activate HA mode; in this case, if one scheduler goes down, the other schedulers will still be up
Problem with the webserver pod: this doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can access the UI/API through the other replicas

how do we choose --nthreads and --nprocs per worker in dask distributed running via helm on kubernetes?

I'm running some I/O intensive Python code on Dask and want to increase the number of threads per worker. I've deployed a Kubernetes cluster that runs Dask distributed via helm. I see from the worker deployment template that the number of threads for a worker is set to the number of CPUs, but I'd like to set the number of threads higher unless that's an anti-pattern. How do I do that?
It looks like from this similar question that I can ssh to the dask scheduler and spin up workers with dask-worker? But ideally I'd be able to configure the worker resources via helm so that I don't have to interact with the scheduler other than submitting jobs to it via the Client.
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. For more information, see link 1 (best practices described in Dask's official documentation) and link 2.
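A small sketch of what "matching" could look like: derive the dask-worker arguments from the same numbers you put in the pod's resource limits. The function names, the scheduler address, and the `threads_per_cpu` knob are assumptions made up for this example; only the `--nthreads` and `--memory-limit` flags come from the text above.

```python
# Multipliers for a few Kubernetes memory suffixes (binary first, then decimal).
_K8S_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
    "k": 1000, "M": 10 ** 6, "G": 10 ** 9,
}


def k8s_memory_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '4Gi' or '500M' to bytes."""
    for suffix, factor in _K8S_MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a number of bytes


def dask_worker_command(scheduler_address, cpu_limit, memory_limit, threads_per_cpu=1):
    """Build a dask-worker argument list consistent with the pod's limits.

    threads_per_cpu is a hypothetical knob: 1 matches the CPU limit exactly,
    larger values oversubscribe threads, which may suit I/O-bound work.
    """
    nthreads = max(1, int(cpu_limit * threads_per_cpu))
    return [
        "dask-worker", scheduler_address,
        "--nthreads", str(nthreads),
        "--memory-limit", str(k8s_memory_to_bytes(memory_limit)),
    ]


# Hypothetical pod with limits cpu: 2 and memory: 4Gi.
command = dask_worker_command("tcp://scheduler:8786", 2, "4Gi", threads_per_cpu=2)
print(" ".join(command))
```

Keeping both sets of values derived from one place avoids the worker believing it has more memory or CPU than Kubernetes will actually give it.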
Threading in Python is a careful art and really depends on your code. To take the easy one first, --nprocs should almost certainly be 1; if you want more processes, launch more replicas instead. For the thread count, first remember that the GIL means only one thread can be running Python code at a time. So you only get concurrency gains in two main situations: 1) some threads are blocked on I/O, like waiting to hear back from a database or web API, or 2) some threads are running non-GIL-bound C code inside NumPy or friends. In the second situation, you still can't get more concurrency than the number of CPUs, since that's just how many slots there are to run at once, but the first can benefit from more threads than CPUs in some situations.
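To see the first situation in isolation, here is a small standard-library sketch, independent of Dask. `fake_io_call` is a stand-in for a database or web API call (it just sleeps, which releases the GIL), and the task count and sleep time are arbitrary choices for the demo.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(_):
    # Sleeping releases the GIL, much like waiting on a socket would.
    time.sleep(0.05)
    return 1


def timed_run(n_threads, n_tasks=8):
    """Run n_tasks fake I/O calls on n_threads threads; return (seconds, completed)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_io_call, range(n_tasks)))
    return time.perf_counter() - start, sum(results)


one_thread_time, one_total = timed_run(1)
many_threads_time, many_total = timed_run(8)
print(f"1 thread: {one_thread_time:.2f}s, 8 threads: {many_threads_time:.2f}s")
```

The 8-thread run finishes in roughly the time of a single call regardless of how many CPUs the machine has, because the threads spend their time waiting rather than running Python code. Replace the sleep with a pure-Python loop and the gain should largely disappear.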
Dask's helm chart has a limitation: it does not allow setting --nthreads in the chart. I confirmed this with the Dask team and filed an issue: https://github.com/helm/charts/issues/18708.
In the meantime, use Dask Kubernetes for a higher degree of customization.

Airflow distributed model services

Switching from localexecutor to celeryexecutor.
In this model, I have
Masternode1 - airflow webserver, airflow scheduler, rabbitmq
Masternode2 - airflow webserver, rabbitmq
Workernode1 - airflowworker
Workernode2 - airflowworker
Workernode3 - airflowworker
Question:
Where does the Flower service run for Celery? Is it required to run it on all nodes, or just on any one of the nodes (since it's only a UI)?
Are there any other components needed to manage a production workload?
Is using Kafka as the broker realistic, and is it available to use?
Thank you
Celery Flower is yet another (optional) service that you may want to run independently, either on a dedicated machine or on a machine shared among a few Airflow services.
You may, for example, run the webserver and Flower on one machine, and the scheduler and a few Airflow workers each on a dedicated machine.
Kafka as a broker for Celery is something people talk about quite a lot, but as far as I know there is no concrete work in Celery for it. However, considering there is interest in adding Kafka support to Kombu, I would assume that the moment Kombu gets Kafka support, Celery will soon follow, as Kombu is the core Celery dependency.

Load balancer and celery result backends

I have a task that takes approximately 3 minutes to run. It pulls data from a remote server and performs CPU-intensive analysis on it. This task will be invoked by an API call. Upon the API call, I am planning to give the client a unique task id and assign the task to a Celery worker. The client will then poll the server with the given task id to see whether the task has been completed by the Celery worker and its result saved to a result backend. I am thinking of using nginx, gunicorn, and flask, and dockerizing them for an easy deployment in case I need to distribute this architecture across multiple machines.
The problem is that the client may poll different servers due to load balancer and if not handled well, the polled server’s celery’s result backend might not have the task’s result but other server’s celery result backend has it.
Is it possible to use a single result backend across multiple Celery instances and have the different Celery instances query the same result backend? What other ways might there be to solve this, other than using cloud storage like S3?
Would I have this problem only if I have multiple machines or would it happen even if I have multiple gunicorn instances in a single machine where nginx acts as a load balancer on them?
Not only is it possible to use a single result backend for all Celery workers, it is the only setting that makes sense! The same goes for the broker in most cases, unless you have a complicated Celery infrastructure with exchanges and complicated routes...
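Here is a toy simulation of why a single shared backend removes the problem, using only plain Python. `SharedResultBackend` and `ApiServer` are hypothetical classes written for this illustration, with an in-memory dict standing in for Redis or a database. In a real deployment the equivalent would be pointing every Celery app at the same `result_backend` setting and looking results up by task id.

```python
import uuid


class SharedResultBackend:
    """In-memory stand-in for one result store that every instance points at."""

    def __init__(self):
        self._results = {}

    def store(self, task_id, result):
        self._results[task_id] = result

    def get(self, task_id):
        return self._results.get(task_id)


class ApiServer:
    """Stand-in for one gunicorn/flask instance behind the load balancer."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend

    def submit(self):
        # Hand the client a unique task id (queuing to a worker is omitted here).
        return str(uuid.uuid4())

    def poll(self, task_id):
        result = self.backend.get(task_id)
        if result is None:
            return {"server": self.name, "state": "PENDING"}
        return {"server": self.name, "state": "SUCCESS", "result": result}


backend = SharedResultBackend()
server_a = ApiServer("a", backend)
server_b = ApiServer("b", backend)

task_id = server_a.submit()       # the request lands on server a
before = server_b.poll(task_id)   # the load balancer sends the poll to server b
backend.store(task_id, 42)        # a worker finishes and writes the result
after = server_b.poll(task_id)
print(before, after)
```

Because both servers read from the same store, it does not matter which one the load balancer picks for a poll, whether they are separate machines or several gunicorn instances on one machine.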

Airflow: When to use CeleryExecutor and when to use MesosExecutor

I am pretty new to Airflow and trying to understand how we should set it up in our environment (on AWS).
I read that Airflow uses Celery with a Redis broker.
How is it different from Mesos? I have not used Celery before, but I tried setting up celery-redis on my dev machine and it worked with ease. But adding new components means adding more monitoring.
Since we already use Mesos for our cluster management, I am trying to work out what I would be missing if I don't choose Celery and go with MesosExecutor instead.
Using Celery is the more proven/stable approach at the moment.
For us, managing dependencies using containers is more convenient than managing dependencies on the Mesos instances, which is the case if you choose MesosExecutor. As such we are finding Celery more flexible.
We are currently using Celery + RabbitMQ, but we will switch to MesosExecutor in the future as our codebase stabilises.
Airflow with the CeleryExecutor doesn't necessarily need to use the Redis broker. Any broker that Celery can use is compatible with Airflow, though it is recommended to use either the RabbitMQ broker or the Redis broker.
Celery is quite different from Mesos. While airflow supports the MesosExecutor too, it is recommended to use the CeleryExecutor if you are planning to distribute the workers. From what I know, Airbnb uses the CeleryExecutor and actively maintains it.
For us, the MesosExecutor cannot be used. We need an abstraction level to handle dependencies for jobs; we cannot (and shouldn't) rely on any dependencies being installed on the Mesos slaves. When Docker containers and/or Mesos containers are supported by MesosExecutor, we can turn to it. Also, I like seeing the allocated workers inside Marathon. I am working on how to autoscale workers with Marathon.
The MesosExecutor is still experimental at this stage. It does not support running Docker containers or setting different resource limits per task, and it probably has many other limitations.
I plan to work on this though, it's a community effort and having spent some effort to deploy a Mesos cluster, I feel that adding Celery and another MQ broker is a waste of resources.