I'm running Airflow 2.2.5 using a Docker Compose setup. I use the Celery executor with 10+ worker nodes on different machines. This setup works fine with a few worker nodes, but if I launch all 12 nodes, the worker instances start to crash. I suspect the reason might be that the scheduler can't handle the traffic from all the worker nodes.
I would like to test a setup with multiple schedulers on the main node to see if this solves my problem. I was unable to find an answer on how to implement this sort of setup in my Docker Compose file. Can I just create two services, scheduler1 and scheduler2, with identical definitions, or is there a better way?
The official documentation was a bit short on this one:
https://airflow.apache.org/docs/apache-airflow/2.0.2/scheduler.html?highlight=scheduler#running-more-than-one-scheduler
I know that in the Kubernetes setup the scheduler count is just one parameter, but unfortunately I do not have Kubernetes at hand at the moment.
Update: I found an answer in the post "How to set up multiple schedulers for airflow", but it didn't solve the Celery issue.
I have a k8s cluster that runs the main workload and has a lot of nodes.
I also have a node (I call it the special node) that is NOT part of the cluster and that runs some special containers. This node has access to some resources that those special containers require.
I want to be able to manage the containers on the special node along with the cluster, and make it possible to access them from inside the cluster. The idea is to add the node to the cluster as a worker node, taint it to prevent normal workloads from being scheduled on it, and add tolerations to the pods running the special containers.
The idea looks fine, but there may be a problem. There will be some other containers, as well as non-container daemons and services, running on the special node that are not managed by the cluster (they belong to other activities that have to be kept separate from the cluster). I'm not sure whether that will be a problem, but I have not seen non-cluster containers running alongside pod containers on a worker node before, and I could not find a similar question on the web.
So please enlighten me: is it OK to have non-cluster containers and other daemon services on a worker node? Does it require any precautions, or am I just worrying too much?
Ahmad, from the description above I understand that you are deploying a Kubernetes cluster using kubeadm, minikube, or a similar solution, and that among your servers one has some special capability such as a GPU. For deploying your special pods you can use a node selector, which I assume you are already doing.
As for running a separate container runtime on one of these nodes, you mainly need to consider two points:
1. Running the extra container runtime itself: this can be done, and if you do not integrate that runtime with Kubernetes, it is simply one more piece of software running on your server. Say you used kubeadm on all the nodes and also want to run plain Docker containers: these will stay separate, provided you have drafted a proper architecture and configured a separate, isolated virtual network for them.
2. Now comes the storage part: create separate storage volumes for Kubernetes and for the other container runtime, both to provide isolation and so that if either piece of software fails or gets corrupted it does not affect the other.
If you maintain proper isolation, from storage through to networking, you can run Kubernetes and a separate container runtime side by side; however, this is not a recommended approach for production environments.
I wanted to know if it's possible to set up the KubernetesExecutor on Airflow while having the webserver and scheduler running on an EC2 instance.
Meaning that tasks would run in Kubernetes pods (EKS in my case), but the base services on a regular EC2 instance.
I tried to find information about this but came up short...
The following quote is from Airflow's docs, and it's the reason I'm asking this question:
KubernetesExecutor runs as a process in the Airflow Scheduler. The scheduler itself does not necessarily need to be running on Kubernetes, but does need access to a Kubernetes cluster.
Thanks in advance!
Yes, this is entirely possible.
You just need to run your airflow scheduler and airflow webserver on EC2 and configure the EC2 instance to have all the necessary access (likely via a service account, but this is your decision and deployment configuration) to be able to spin up pods on your EKS cluster.
There is nothing special about it, other than that you will have to learn how to run and configure the components to talk to each other; there are no ready-to-use recipes, so you will simply have to follow Airflow's configuration parameters and set up the authentication scheme you need.
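For example, one quick way to confirm the instance has the access the executor needs is to load the same kubeconfig the scheduler will use and list pods with the official kubernetes Python client. This is just a sketch, assuming a kubeconfig at ~/.kube/config pointing at the EKS cluster and an "airflow" namespace:

```python
# Sketch: verify the EC2 host can reach the EKS cluster with the kubeconfig
# the KubernetesExecutor will use. Assumes `pip install kubernetes`, that
# ~/.kube/config points at EKS, and an assumed "airflow" namespace.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config by default
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="airflow").items:
    print(pod.metadata.name, pod.status.phase)
```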
We decided to run Airflow on Kubernetes. We would like to make use of the power of Kubernetes, but in a balanced way.
We have some very small tasks in our DAGs, for example creating a directory. The KubernetesExecutor spins up a pod for every task; this takes a long time and is therefore overkill for many short tasks.
My question is, is it possible to configure Airflow to spin up a Kubernetes pod for a whole DAG, instead of a pod per task? (Preferably without Celery)
I do not think it is possible to use one pod per DAG, because KubernetesExecutor is designed to request a pod per task:
When a DAG submits a task, the KubernetesExecutor requests a worker pod from the Kubernetes API. The worker pod then runs the task, reports the result, and terminates.
Maybe combining multiple smaller tasks into one is a way to go.
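For instance, several tiny setup steps can be folded into a single task so that only one pod is spun up for them. A rough sketch (the DAG id and paths are made up):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_workspace(base_dir: str) -> None:
    """Several tiny setup steps folded into one task, i.e. one pod."""
    os.makedirs(base_dir, exist_ok=True)
    os.makedirs(os.path.join(base_dir, "logs"), exist_ok=True)
    with open(os.path.join(base_dir, "run.marker"), "w") as marker:
        marker.write("started")


with DAG(
    dag_id="grouped_setup_example",     # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="prepare_workspace",
        python_callable=prepare_workspace,
        op_kwargs={"base_dir": "/tmp/pipeline"},  # hypothetical path
    )
```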
https://airflow.apache.org/docs/apache-airflow/stable/executor/celery_kubernetes.html
The CeleryKubernetesExecutor allows you to use the immediate resources of a Celery worker or spin up a pod for a task. I haven't configured this setup myself, but it seems to match your use case.
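As far as I understand, routing is then decided per task via its queue: tasks whose queue matches the configured kubernetes_queue (default "kubernetes") go through the KubernetesExecutor, everything else stays on the Celery workers. A rough sketch, assuming the executor is already set to CeleryKubernetesExecutor (the DAG id and commands are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="mixed_queue_example",       # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Small task: handled by an already-running Celery worker (default queue).
    make_dir = BashOperator(
        task_id="make_dir",
        bash_command="mkdir -p /tmp/output",
    )

    # Heavy task: its queue matches [celery_kubernetes_executor] kubernetes_queue
    # (default "kubernetes"), so the KubernetesExecutor spins up a pod for it.
    heavy_job = BashOperator(
        task_id="heavy_job",
        bash_command="python /opt/app/train.py",  # hypothetical script
        queue="kubernetes",
    )

    make_dir >> heavy_job
```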
What we want to achieve:
We would like to use Airflow to manage our machine learning and data pipeline, while using Kubernetes to manage the resources and schedule the jobs. What we would like to achieve is for Airflow to orchestrate the workflow (e.g. task dependencies, re-running jobs upon failure) and for Kubernetes to orchestrate the infrastructure (e.g. cluster autoscaling and assignment of individual jobs to nodes). In other words, Airflow tells the Kubernetes cluster what to do and Kubernetes decides how to distribute the work. At the same time we would also like Airflow to be able to monitor the status of the individual tasks. For example, if we have 10 tasks spread across a cluster of 5 nodes, Airflow should be able to communicate with the cluster and report something like: 3 “small tasks” are done, 1 “small task” has failed and will be scheduled to re-run, and the remaining 6 “big tasks” are still running.
Questions:
Our understanding is that Airflow has no Kubernetes Operator; see the open issue at https://issues.apache.org/jira/browse/AIRFLOW-1314. That being said, we don't want Airflow to manage resources (service accounts, env variables, creating clusters, etc.), but simply to send tasks to an existing Kubernetes cluster and have it let Airflow know when a job is done. An alternative would be to use Apache Mesos, but it looks less flexible and less straightforward compared to Kubernetes.
I guess we could use Airflow's bash_operator to run kubectl, but that does not seem like the most elegant solution.
Any thoughts? How do you deal with that?
Airflow has both a Kubernetes Executor and a Kubernetes Operator.
You can use the Kubernetes Operator to send tasks (in the form of Docker images) from Airflow to Kubernetes via whichever Airflow executor you prefer.
Based on your description, though, I believe you are looking for the KubernetesExecutor to schedule all your tasks against your Kubernetes cluster. As you can see from the source code, it has a much tighter integration with Kubernetes.
This also means you do not have to worry about creating the Docker images ahead of time, as is required with the Kubernetes Operator.
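For reference, a minimal DAG using the KubernetesPodOperator could look roughly like this (the image, namespace, and command are placeholders, and the exact import path depends on your cncf-kubernetes provider version):

```python
from datetime import datetime

from airflow import DAG
# In newer provider releases the operator is also exposed under
# airflow.providers.cncf.kubernetes.operators.pod
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="pod_operator_example",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="airflow",                        # assumed namespace
        image="registry.example.com/train:latest",  # hypothetical image
        cmds=["python", "train.py"],
        arguments=["--epochs", "10"],
        get_logs=True,
    )
```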
I have the following system in mind: A master program that polls a list of tasks to see if they should be launched (based on some trigger information). The tasks themselves are container images in some repository. Tasks are executed as jobs on a Kubernetes cluster to ensure that they are run to completion. The master program is a container executing in a pod that is kept running indefinitely by a replication controller.
However, I have not stumbled upon this pattern of launching jobs from a pod. Every tutorial seems to assume that I just call kubectl from outside the cluster. Of course I could do this, but then I would have to ensure the master program's availability and reliability through some other system. So am I missing something? Launching one-off jobs from inside an indefinitely running pod seems to me like a perfectly valid use case for Kubernetes.
Your master program can utilize the Kubernetes client libraries to perform operations on a cluster. Find a complete example here.
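As a rough sketch of what that could look like with the official kubernetes Python client (assuming the pod's service account has RBAC permission to create Jobs in its namespace; the names and image are placeholders):

```python
from kubernetes import client, config

# Running inside a pod, so use the mounted service-account credentials.
# The service account must be allowed to create Jobs in the namespace.
config.load_incluster_config()
batch = client.BatchV1Api()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-task"),   # hypothetical name
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="task",
                        image="registry.example.com/task:latest",  # hypothetical image
                        args=["--run"],
                    )
                ],
            )
        ),
    ),
)

batch.create_namespaced_job(namespace="default", body=job)
```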