I am using Airflow (1.10.3) with AWS RDS Postgres as metaStore and celery_backend, SQS as queue service and CeleryExecutor. I have 1 master machine running airflow webserver and scheduler service, and a 1 worker machine.
Airflow worker is always starving for more tasks (queued) with lot of unused resources (CPU and RAM, with usage always below 20%). I've observed worker pick up tasks in batches, for eg: If there are 10 tasks in queue and 2 running tasks, then it will wait for 2 tasks to complete before picking next batch of tasks from the queue.
Parallelism setting in airflow.cfg in Worker instances.
parallelism = 32 .
dag_concurrency = 32.
non_pooled_task_slot_count = 128.
max_active_runs_per_dag = 32.
max_threads = 2 (no issues in scheduler though, as tasks are queued immediately)
One important thing to point out in my implementation - Airflow task is not a single process task, and individual Task further spawns multiple processes (3-5). Though even after considering process counts, my airflow worker never reaches full parallelism.
Any suggestion to -
a). Is there a way to fully utilise parallel execution of tasks on an airflow worker? Or if there's some more info that I am missing while setting up Airflow.
b). Above mentioned parallelism settings are configured at airflow task as atomic-unit, or number of threads/processes that task spawns?
Thanks!
Related
Context: Running Airflow v2.2.2 with Kubernetes Executor.
I am running a process that creates a burst of quick tasks.
The tasks are short enough that the Kubernetes Pod initialization takes up the majority of runtime for many of them.
Is there a way to re-utilize pre-initialized pods for multiple tasks?
I've found a comment in an old issue that states that when running the Subdag operator, all subtasks will be run on the same pod, but I cannot find any further information. Is this true?
I have searched the following resources:
Airflow Documentation: Kubernetes
Airflow Documentation: KubernetesExecutor
Airflow Documentation: KubernetesPodOperator
StackOverflow threads: Run two jobs on same pod, Best Run Airflow on Kube
Google Search: airflow kubernetes reuse initialization
But haven't really found anything that directly addresses my problem.
I don't think this is possible in Airlfow even with Subdag operator which runs a separate dag as a part of the current dag with the same way used for the other tasks.
To solve your problem, you can use CeleryKubernetesExecutor which combine CeleryExecutor and KubernetesExecutor. By default the tasks will be queued in Celery queue but for the heavy tasks, you can choose K8S queue in order to run them in isolated pods. In this case you will be able to use the Celery workers which are up all the time.
kubernetes_task= BashOperator(
task_id='kubernetes_executor_task',
bash_command='echo "Hello from Kubernetes"',
queue = 'kubernetes'
)
celery_task = BashOperator(
task_id='celery_executor_task',
bash_command='echo "Hello from Celery"',
)
If you are worry about the scalability/cost of Celery workers, you can use KEDA to scale the Celery workers from 0 workers to a maximum number of workers based on queued tasks count.
I am running an airflow cluster on EKS on AWS. I have setup some scaling config for worker setup. If CPU/Mem > 70% then airflow spins up new worker pod. However I am facing an issue when these worker pods are scaling down. When worker pods start scaling down, two things happen:
If no tasks is running on a worker pod, it terminates within 40sec.
If any task is running on a worker pod, it terminates in about 8min, and after one more minute, I find the task failing on UI.
I have setup below two properties in helm chart for worker pod termiantion.
celery:
## if celery worker Pods are gracefully terminated
## - consider defining a `workers.podDisruptionBudget` to prevent there not being
## enough available workers during graceful termination waiting periods
##
## graceful termination process:
## 1. prevent worker accepting new tasks
## 2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
## 3. send SIGTERM to worker
## 4. wait AT MOST `workers.terminationPeriod` for kill to finish
## 5. send SIGKILL to worker
##
gracefullTermination: true
## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
##
gracefullTerminationPeriod: 180
## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
## to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
I can see that worker pod should shutdown after 5 mins or irrespective task running or not. So I am not sure why I see total of 8 min for worker pod termination. And my main issue is there any way I can setup config so that worker pod only terminates when task running on it finishes execution. Since tasks in my dags can run anywhere between few minutes to few hours so I don't want to put a large value for gracefullTerminationPeriod. I Would appreciate any solution around this.
Some more info: Generally the long running task is a python operator which runs either a presto sql query or Databricks job via Prestohook or DatabricksOperator respectively. And I don't want these to recivie SIGTERM before they complete their execution on worker pod scaling down.
This is not possible due to limitations from K8 end. More details are available here. However by using a large value of "gracefulTerminationPeriod" works, although this is not what I intended to do but it works better than I originally thought. When large value of gracefulTerminationPeriod is set, workers doesn't wait around for gracefulTerminationPeriod time to terminate. If a worker pod is marked for termination it terminates as soon as tasks running on it reaches zero.
Until K8 accept proposed changes and new community helm chart is released, I think this is the best solution without incurring costs of keeping worker up.
I'm trying to understand the Arch of Airflow on Kubernetes.
Using the helm and Kubernetes executor, the installation mounts 3 pods called: Trigger, WebServer, and Scheduler...
When I run a dag using the Kubernetes pod operator, it also mounts 2 pods more: one with the dag name and another one with the task name...
I want to understand the communication between pods... So far I know the only the expressed in the image:
Note: I'm using the git sync option
Thanks in advance for the help that you can give me!!
Airflow application has some components that require for it to operate normally: Webserver, Database, Scheduler, Trigger, Worker(s), Executor. You can read about it here.
Lets go over the options:
Kubernetes Executor (As you choose):
In your instance since you are deploying on Kubernetes with Kubernetes Executor then each task being executed is a pod. Airflow wraps the task with a Pod no matter what task it is. This brings to the front the isolation that Kubernetes offer, this also bring the overhead of creating a pod for each task. Choosing Kubernetes Executor often goes with case where many/most of your tasks takes long time to execute - as if your tasks takes 5 seconds to complete it might not be worth to pay the overhead of creating pod for each task. As for what you see as the DAG -> Task1 in your diagram. Consider that the Scheduler launches the Airflow workers. The workers are starting the tasks in new pods. So the worker needs to monitor the execution of the task.
Celery Executor - Setting up a Worker/Pod which tasks can run in it. This gives you speed as there is no need to create pod for each task but there is no isolation for each task. Noting that using this executor doesn't mean that you can't run tasks in their own Pod. User can run KubernetesPodOperator and it will create a Pod for the task.
CeleryKubernetes Executor - Enjoying both worlds. You decide which tasks will be executed by Celery and which by Kubernetes. So for example you can set small short tasks to Celery and longer tasks to Kubernetes.
How will it look like Pod wise?
Kubernetes Executor - Every task creates a pod. PythonOperator, BashOperator - all of them will be wrapped with pods (user doesn't need to change anything on his DAG code).
Celery Executor - Every task will be executed in a Celery worker(pod). So the pod is always in Running waiting to get tasks. You can create a dedicated pod for a task if you will explicitly use KubernetesPodOperator.
CeleryKubernetes - Combining both of the above.
Note again that you can use each one of these executors with Kubernetes environment. Keep in mind that all of these are just executors. Airflow has other components like mentioned earlier so it's very OK to deploy Airflow on Kubernetes (Scheduler, Webserver) but to use CeleryExecutor thus the user code (tasks) are not creating new pods automatically.
As for Triggers since you asked about it specifically - It's a feature added in Airflow 2.2: Deferrable Operators & Triggers it allows tasks to defer and release worker slot.
I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR; Yes, default behavior is to allow sharing but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working on notebooks OR you are submitting jobs to an existing, all-purpose cluster and it is NOT a job cluster with multiple spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is specific Spark term that represents an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?.
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc, you're going to run into the fact that Spark is going to partition your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stage with 8, 200, and 1 tasks respectively.
In a scenario like this FIFO scheduling, as is the default, will result in one of these applications blocking the other since the number of executors is completely overwhelmed by the number of tasks in just one stage.
In a FAIR scheduling mode, there will still be some blocking since the number of executors is small but some work will be done on each job since FAIR scheduling does a round-robin at the task level.
In Apache Spark, you have tighter control by creating different pools of the resources and submitting apps only to those pools where they have "isolated" resources. The "better" way of doing this is with Databricks Job clusters that have isolated compute dedicated to the application being ran.
I have few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there a setup other than specifying the max tasks per node when creating a pool, that needs to be done (to the code) to be able to run parallel tasks with batch?
So if I am understanding this correctly, if I have a Standard_D1_v2 machine with 1 core, I can run up to 4 concurrent tasks running in parallel in it. Is that right? If yes, I ran some tests and I am quite not sure about the behavior that I got. In a pool of D1_v2 machines set up to run 1 task per node, I get about 16 min for my job execution time. Then, using the same applications and same parameters with the only change being a new pool with same setup, also D1_v2, except running 4 tasks per node, I still get a job execution time of about 15 min. There wasn't any improvement in the job execution time for running tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines with 4 cores, set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool (same number of machines as previous one) of D2_v2 machines with 2 cores, set up to run 2 tasks per core for a total of 4 parallel tasks per node. The run time/ job execution time for both these tests were the same. Isn't there supposed to be an improvement considering that 8 tasks are running per node in the first test versus 4 tasks per node in the second test? If yes, what could be a reason why I'm not getting this improvement?
No. Although you may want to look into the task scheduling policy, compute node fill type to control how your tasks are distributed amongst nodes in your pool.
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
Batch merely schedules the tasks concurrently on the node. If the command/process that you're running utilizes all of the cores on the machine and is compute-bound, you won't see an improvement. You should double check your tasks start and end times within the job and the node execution info to see if they are actually being scheduled concurrently on the same node.