Airflow has a queuing mechanism to execute queued tasks, and it does this across all the DAGs. I have a specific DAG which is high priority, meaning if there is a task from this DAG scheduled and queued to run, I want it to be processed with high priority. Is there a way to instruct Airflow to do this?
Airflow considers task priority when scheduling, so you can use priority_weight to increase the ranking of specific tasks. Note that this is not a DAG parameter but an Operator parameter. If you wish to apply it to all tasks in a specific DAG, use default_args.
Example:
from airflow.operators.bash import BashOperator
from airflow.utils.weight_rule import WeightRule
BashOperator(
    task_id="my_task",
    bash_command="echo 1",
    weight_rule=WeightRule.ABSOLUTE,
    priority_weight=10000,
)
You can adjust the weight_rule and priority_weight values according to what you need.
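If you want every task in the high-priority DAG to get the same treatment, here is a minimal sketch using default_args (the DAG id and schedule are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.weight_rule import WeightRule

# Placeholder dag_id and schedule; every task created inside this DAG
# inherits the priority settings from default_args.
with DAG(
    dag_id="high_priority_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args={
        "weight_rule": WeightRule.ABSOLUTE,
        "priority_weight": 10000,
    },
) as dag:
    BashOperator(task_id="my_task", bash_command="echo 1")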
Context: Running Airflow v2.2.2 with Kubernetes Executor.
I am running a process that creates a burst of quick tasks.
The tasks are short enough that the Kubernetes Pod initialization takes up the majority of runtime for many of them.
Is there a way to re-utilize pre-initialized pods for multiple tasks?
I've found a comment in an old issue that states that when running the Subdag operator, all subtasks will be run on the same pod, but I cannot find any further information. Is this true?
I have searched the following resources:
Airflow Documentation: Kubernetes
Airflow Documentation: KubernetesExecutor
Airflow Documentation: KubernetesPodOperator
StackOverflow threads: Run two jobs on same pod, Best Run Airflow on Kube
Google Search: airflow kubernetes reuse initialization
But haven't really found anything that directly addresses my problem.
I don't think this is possible in Airflow, even with the SubDAG operator, which runs a separate DAG as part of the current DAG in the same way as the other tasks.
To solve your problem, you can use the CeleryKubernetesExecutor, which combines CeleryExecutor and KubernetesExecutor. By default, tasks are sent to the Celery queue, but for heavy tasks you can choose the Kubernetes queue in order to run them in isolated pods. This way, your short tasks can run on the Celery workers, which are up all the time.
kubernetes_task = BashOperator(
    task_id='kubernetes_executor_task',
    bash_command='echo "Hello from Kubernetes"',
    queue='kubernetes',
)

celery_task = BashOperator(
    task_id='celery_executor_task',
    bash_command='echo "Hello from Celery"',
)
If you are worried about the scalability/cost of the Celery workers, you can use KEDA to scale them from 0 up to a maximum number of workers based on the queued task count.
I have a reporting application that uses Celery to process thousands of jobs per day. There is a Python module for each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. That means, if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
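For illustration, a per-customer report pipeline expressed as a plain Celery canvas chain could look like this (the task names and broker URL are hypothetical):
from celery import Celery, chain

app = Celery("reports", broker="redis://localhost:6379/0")

@app.task
def extract(customer_id):
    # Gather the customer-specific data.
    return {"customer_id": customer_id, "rows": []}

@app.task
def build_report(data):
    # Turn the extracted data into a report and return its id.
    return f"report-for-{data['customer_id']}"

@app.task
def notify(report_id):
    # Tell the customer the report is ready.
    print(f"{report_id} is ready")

# Each task's return value is passed to the next task in the chain.
chain(extract.s("customer-42"), build_report.s(), notify.s()).apply_async()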
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically; I am not sure if this will work at a scale of thousands of DAGs, though. There are some good examples on astronomer.io of Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and settings, and it all works without any issues.
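A minimal sketch of that pattern, assuming a hypothetical list of customer configurations (in practice this would be loaded from YAML or a database):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical customer configs; in practice read these from YAML or a DB.
CUSTOMERS = [
    {"id": "acme", "schedule": "@daily"},
    {"id": "globex", "schedule": "@hourly"},
]

for customer in CUSTOMERS:
    with DAG(
        dag_id=f"report_{customer['id']}",
        start_date=datetime(2022, 1, 1),
        schedule_interval=customer["schedule"],
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="build_report",
            bash_command=f"echo building report for {customer['id']}",
        )
    # Expose each generated DAG at module level so the scheduler picks it up.
    globals()[dag.dag_id] = dag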
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
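As an illustration, triggering a DAG run with a per-customer conf over the REST API could look like the sketch below; this assumes an Airflow 2.x deployment with the stable /api/v1 endpoint and a basic-auth backend enabled, and the URL, credentials, and dag_id are placeholders:
import requests

# Placeholder URL, credentials, and dag_id.
response = requests.post(
    "http://localhost:8080/api/v1/dags/customer_report/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {"customer_id": "acme", "refresh": True}},
)
response.raise_for_status()
print(response.json())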
Can we limit the number of DAGs running at any time in Apache Airflow?
We have limited resources in the environment. Is there a configuration option to limit the number of DAGs running in Airflow as a whole at any point in time?
The max_active_runs parameter only limits runs within a single DAG.
Is it possible that, if one DAG is running, all other scheduled DAGs wait for the first DAG to complete and then trigger sequentially?
By setting the parallelism configuration option in airflow.cfg, you can limit the total number of tasks (not DAGs) allowed to run in parallel. Then, by setting the dag_concurrency configuration option, you can specify how many tasks a single DAG can run in parallel.
For example, setting parallelism=8 and dag_concurrency=1 will give you at most 8 DAGs running in parallel (with 1 running task each) at any time.
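For completeness, here is a minimal sketch of the DAG-level counterpart mentioned above, with the global airflow.cfg limits noted in comments (the dag_id and schedule are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Global limits live in airflow.cfg: parallelism caps running tasks across
# all DAGs, dag_concurrency caps running tasks per DAG. At the DAG level,
# max_active_runs caps how many runs of this DAG are active at once.
with DAG(
    dag_id="resource_limited_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,
) as dag:
    BashOperator(task_id="only_task", bash_command="echo step")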
Is it possible to change already queued jobs in Windows HPC 2012?
I need to move some files from the head node before running another queued job to free space for that job.
I found this statement in Microsoft TechNet:
The order of the job queue is based on job priority level and submit time. Jobs with higher priority levels run before lower priority jobs. The job submit time determines the order within each priority level.
So, as my already queued jobs are all of "Normal" priority, I can set the priority of my move job higher than "Normal", such as "Highest", to get the job done.
We have a 10-queue setup in our Celery deployment, a large setup where each queue has a group of 5 to 10 tasks, and each queue runs on a dedicated machine (some on multiple machines for scaling).
On the other hand, we have a bunch of periodic tasks running on a separate machine with a single instance, and some of the periodic tasks take long to execute, so I want to run them on the 10 queues instead.
Is there a way to scale Celery Beat, or to use it purely to trigger tasks on a different destination (one of the 10 queues)?
Please advise.
Use Celery task routing to dispatch the tasks to the queues where you need them to run.
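A minimal sketch of that, assuming illustrative task names, queue names, and broker URL:
from celery import Celery
from celery.schedules import crontab

app = Celery("periodic", broker="redis://localhost:6379/0")

# Route the heavy periodic tasks to the existing worker queues
# instead of the single machine that runs beat.
app.conf.task_routes = {
    "reports.tasks.heavy_rollup": {"queue": "queue_1"},
    "reports.tasks.cleanup": {"queue": "queue_2"},
}

# Beat only sends the task messages on schedule; the workers consuming
# those queues do the actual work, so beat itself does not need to scale.
app.conf.beat_schedule = {
    "nightly-heavy-rollup": {
        "task": "reports.tasks.heavy_rollup",
        "schedule": crontab(hour=2, minute=0),
    },
    "hourly-cleanup": {
        "task": "reports.tasks.cleanup",
        "schedule": crontab(minute=0),
        # A queue can also be forced per schedule entry:
        "options": {"queue": "queue_2"},
    },
}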