I am running Airflow with the CeleryExecutor, packaged and deployed with Helm charts. I have one worker that I can scale up or down by specifying the number of replicas in the YAML file. This worker runs as a StatefulSet and by default uses the airflow queue. Is it possible to create another StatefulSet chart for another worker and specify the queue name? That way I can also specify the queue to use in the DAG definition files.
In your use case it would be better to use Kubernetes Executor instead of splitting the worker management logic between StatefulSets.
The Kubernetes Executor was introduced in Apache Airflow 1.10.0. It creates a new pod for every task instance.
You can find more details about it in this documentation.
And also I recommend this blog.
I hope it helps.
I've created a Flink cluster using Session mode on native K8s using the command:
$ ./bin/kubernetes-session.sh -Dkubernetes.cluster-id=my-first-flink-cluster
based on these instructions.
I am able to submit jobs to the cluster and view the Flink UI. However, I noticed that Flink creates a taskmanager pod only when a job is submitted and deletes it right after the job has finished. Previously I tried the same with a YARN-based deployment on Google Dataproc, and with that method the cluster always had a taskmanager running, which reduced job start time.
Hence, is there a way to keep a taskmanager pod always running with a Flink deployment on K8s, so that job start time is reduced?
Flink's native K8s support is designed for active resource allocation: it starts new TaskManager instances (and with them task slots) when they are needed, and it shuts down TaskManager pods once they are no longer used. That's the behavior you're observing.
What you're looking for is the standalone k8s support. Here, Flink does not try to start new TaskManager pods. The ResourceManager is passive, i.e. it only considers the TaskManagers that are registered. Some outside process (or a user) has to manage TaskManager pods instead. This might lead to jobs failing if there are not enough task slots available.
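The difference between the two modes can be illustrated with a toy model. This is a minimal sketch and not Flink code: the function names, the `slots_per_tm` parameter, and the numbers are all hypothetical and only mirror the behavior described above.

```python
import math


def native_mode(required_slots, running_tms, slots_per_tm):
    """Active ResourceManager (toy model): TaskManagers follow demand."""
    needed_tms = math.ceil(required_slots / slots_per_tm)
    return {
        "started": max(0, needed_tms - running_tms),   # new pods requested
        "stopped": max(0, running_tms - needed_tms),   # unused pods released
        "missing_slots": 0,                            # demand is always met
    }


def standalone_mode(required_slots, running_tms, slots_per_tm):
    """Passive ResourceManager (toy model): only registered TaskManagers count."""
    available_slots = running_tms * slots_per_tm
    return {
        "started": 0,                                  # Flink starts nothing itself
        "stopped": 0,                                  # and stops nothing itself
        "missing_slots": max(0, required_slots - available_slots),
    }


if __name__ == "__main__":
    # A job needing 4 slots, with one 2-slot TaskManager already running.
    print(native_mode(4, 1, 2))
    print(standalone_mode(4, 1, 2))
```

In the standalone case the non-zero `missing_slots` stands for the situation where a job may fail unless someone adds TaskManager pods from outside.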
Best,
Matthias
We decided to run Airflow on Kubernetes. We would like to make use of the power of Kubernetes, but in a balanced way.
We have some very small tasks in our DAGs, for example creating a directory. The KubernetesExecutor spins up a pod for every task. This takes a long time and is therefore overkill for many short tasks.
My question is, is it possible to configure Airflow to spin up a Kubernetes pod for a whole DAG, instead of a pod per task? (Preferably without Celery)
I do not think it is possible to use one pod per DAG, because KubernetesExecutor is designed to request a pod per task:
When a DAG submits a task, the KubernetesExecutor requests a worker pod from the Kubernetes API. The worker pod then runs the task, reports the result, and terminates.
Maybe combining multiple smaller tasks into one is a way to go.
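As a minimal sketch of what combining small tasks could look like: the function below bundles two tiny steps into one callable, so that only one task (and therefore one pod) is needed. The name `prepare_workspace`, the `data` directory, and the `_READY` marker file are made up for illustration; such a function could then be passed as the `python_callable` of a single `PythonOperator`.

```python
import os
import tempfile


def prepare_workspace(base_dir):
    """Run several tiny steps in one task instead of one task per step."""
    # Step 1: create the directory (previously its own task).
    data_dir = os.path.join(base_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    # Step 2: write a marker file (previously another task).
    marker = os.path.join(data_dir, "_READY")
    with open(marker, "w") as fh:
        fh.write("ok\n")
    return marker


if __name__ == "__main__":
    # Try it out in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        print(prepare_workspace(tmp))
```

The trade-off is that a failure in any step re-runs the whole combined task, so this fits best for steps that are cheap and safe to repeat.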
https://airflow.apache.org/docs/apache-airflow/stable/executor/celery_kubernetes.html
The CeleryKubernetes Executor allows you to use the immediate resources of a celery worker or spin up a pod for a task. I haven’t configured this setup but it seems to match your use case.
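As far as the linked documentation describes it, the choice is made per task through the task's `queue`: tasks whose queue equals the configured `kubernetes_queue` (by default `kubernetes`) go to the Kubernetes side, everything else goes to Celery. The sketch below only models that rule in plain Python. `choose_executor` is a hypothetical helper, not an Airflow function.

```python
def choose_executor(task_queue, kubernetes_queue="kubernetes"):
    """Toy model of the routing rule: match on the task's queue name."""
    if task_queue == kubernetes_queue:
        return "KubernetesExecutor"   # task gets its own pod
    return "CeleryExecutor"           # task runs on an existing Celery worker


if __name__ == "__main__":
    for queue in ("default", "kubernetes"):
        print(queue, "->", choose_executor(queue))
```

Under this rule, short tasks such as creating a directory would keep the default queue, and only heavy tasks would set `queue="kubernetes"` on the operator.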
I am assessing the migration of my current Airflow deployment from Celery executor to Kubernetes (K8s) executor to leverage the dynamic allocation of resources and the isolation of tasks provided by pods.
It is clear to me that we can use the native KubernetesPodOperator to run tasks on a K8s cluster via the K8s executor. However, I couldn't find any information about the compatibility of the K8s executor with other operators, such as Bash and Athena.
So here is the question: is it possible to run a Bash (or any other) operator on a K8s-powered Airflow, or should I migrate all my tasks to the KubernetesPodOperator?
Thanks!
Kubernetes executor will work with all operators.
With the Kubernetes executor, Airflow creates a worker pod for every task, whereas the Celery executor runs tasks on its Celery workers.
The KubernetesPodOperator pulls the image you specify, launches a pod from it, and executes your task in that pod.
So if you are to use the KubernetesPodOperator with the KubernetesExecutor, Airflow will launch a worker pod for your task, and that task will launch a pod and monitor its execution. 2 pods for 1 task.
If you use a BashOperator with the KubernetesExecutor, Airflow will launch a worker pod and execute bash commands on that worker pod. 1 pod for 1 task.
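The pod counts above can be written down as a small rule. This is only an illustrative sketch in plain Python: `pods_per_task` is a hypothetical helper, and it counts only pods created for the task itself, not long-running Celery workers.

```python
def pods_per_task(executor, operator):
    """Count the pods created for a single task (toy model)."""
    pods = 0
    if executor == "KubernetesExecutor":
        pods += 1  # the worker pod that Airflow launches for the task
    if operator == "KubernetesPodOperator":
        pods += 1  # the pod that the operator launches and monitors
    return pods


if __name__ == "__main__":
    for executor in ("KubernetesExecutor", "CeleryExecutor"):
        for operator in ("KubernetesPodOperator", "BashOperator"):
            print(executor, operator, pods_per_task(executor, operator))
```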
I'm trying to deploy a highly available Flink cluster on Kubernetes. In the examples below the worker nodes are replicated, but there is only one master pod.
https://github.com/apache/flink-statefun
As far as I understand, there are two approaches to making the job manager HA.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html
https://medium.com/hepsiburadatech/high-available-flink-cluster-on-kubernetes-setup-73b2baf9200e
In the first example we deploy another job manager and switch between them in case of failure.
In the second example Kubernetes redeploys the job manager pod in case of failure.
So I have a few questions:
For both examples what happens to the running jobs when the active job manager fails?
Can the first scenario be applied on kubernetes?
For the second scenario, in case of a job manager failure the Flink UI will be unavailable until the pod recovers, but in the first scenario it will stay available. Am I right?
What are the pros and cons of each scenario?
There is only one approach to making the job manager HA. Both of your links use JobManager HA backed by a ZooKeeper cluster to build an active/standby architecture of JobManagers.
When the active JobManager fails, a failover takes place as described in the Apache Flink documentation (first link): the standby JobManager becomes active.
Of course. Kubernetes is just where the whole Flink cluster is deployed; you can still use the HA cluster mode with ZooKeeper.
No, both will perform the failover, and a standby JobManager will become active.
Kubernetes is only the deployment platform for Flink. Just as you can deploy Flink on physical or virtual servers, you can deploy it on Kubernetes, and things like high availability stay the same.
EDIT:
You can run two or more JobManager pods in Kubernetes, and then it will be equivalent to the first solution.
What we want to achieve:
We would like to use Airflow to manage our machine learning and data pipelines while using Kubernetes to manage the resources and schedule the jobs. What we would like to achieve is for Airflow to orchestrate the workflow (e.g. task dependencies, re-running jobs upon failure) and for Kubernetes to orchestrate the infrastructure (e.g. cluster autoscaling and assignment of individual jobs to nodes). In other words, Airflow tells the Kubernetes cluster what to do and Kubernetes decides how to distribute the work. At the same time we would also want Airflow to be able to monitor the status of the individual tasks. For example, if we have 10 tasks spread across a cluster of 5 nodes, Airflow should be able to communicate with the cluster and report something like: 3 “small tasks” are done, 1 “small task” has failed and will be scheduled to re-run, and the remaining 6 “big tasks” are still running.
Questions:
Our understanding is that Airflow has no Kubernetes-Operator, see open issues at https://issues.apache.org/jira/browse/AIRFLOW-1314. That being said we don’t want Airflow to manage resources like managing service accounts, env variables, creating clusters, etc. but simply send tasks to an existing Kubernetes cluster and let Airflow know when a job is done. An alternative would be to use Apache Mesos but it looks less flexible and less straightforward compared to Kubernetes.
I guess we could use Airflow’s bash_operator to run kubectl, but this does not seem like the most elegant solution.
Any thoughts? How do you deal with that?
Airflow has both a Kubernetes Executor as well as a Kubernetes Operator.
You can use the Kubernetes Operator to send tasks (in the form of Docker images) from Airflow to Kubernetes via whichever Airflow executor you prefer.
Based on your description though, I believe you are looking for the KubernetesExecutor to schedule all your tasks against your Kubernetes cluster. As you can see from the source code it has a much tighter integration with Kubernetes.
It also means you do not have to create the Docker images ahead of time, as is required with the Kubernetes Operator.