Specifying DAG queue via Airflow's UI TriggerDag parameters - celery

Thanks for reading this question.
I managed to set up an Airflow cluster following the official instructions and to add workers hosted on remote machines. Everything seems to work fine: the connections to Redis and Postgres are up, and DAG tasks are distributed and executed properly across the different workers.
I can also run DAGs on a specific worker by subscribing each worker to an exclusive queue and hardcoding the queue parameter of each operator in the DAG, roughly as sketched below.
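For context, a minimal sketch of that setup; the queue name queue1 and the task details are illustrative, not copied from my actual deployment:

# On the remote machine, start a Celery worker subscribed only to "queue1":
#   airflow celery worker --queues queue1
from airflow.operators.bash import BashOperator

# In the DAG file, pin a task to that worker by hardcoding its queue:
task_on_worker1 = BashOperator(
    task_id="run_on_worker1",
    queue="queue1",  # must match a queue the worker listens on
    bash_command="echo running on worker1",
)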
My problem is that I would like to parametrize this execution queue at trigger time, via the UI's Trigger DAG button (or the CLI). I tried Jinja templates and XComs, but neither helped: Jinja templates do not seem to be rendered for the queue parameter of operators, and reading an XCom requires the ti context or, again, Jinja templating.
I know some people have written plugins to simplify this task, but since all the information I found predates Airflow 2.x, I wanted to know whether there is already a solution for this problem.
Thank you so much in advance
Edit: I would like to do this: Triggering DAG via UI.
However, the following:

task_test = BashOperator(
    task_id='test2',
    queue="{{ dag_run.conf['queue'] }}",
    bash_command="echo hi",
)

does not work, since the job gets queued on the literal string {{ dag_run.conf['queue'] }} instead of on queue1.
I also tried the following approach, and it doesn't work either: all jobs get scheduled on the default queue.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

with DAG(
    'queue_execution_test',
    default_args=default_args,
    description='Test specific execution.',
    schedule_interval=None,  # Only run on demand
    start_date=days_ago(2),
) as dag:

    run_on_queue = 'default'  # '{{ dag_run.conf["queue"] if dag_run else "default" }}'

    def parse_queue_parameter(ti, **kwargs):  # Task instance for XCom, kwargs for dag_run parameters
        try:
            ti.xcom_push(key='custom_queue', value=kwargs['dag_run'].conf['queue'])
        except (KeyError, TypeError):
            print("No queue defined")

    initial_task = PythonOperator(
        task_id='test1',
        queue=run_on_queue,
        provide_context=True,  # not needed in Airflow 2.x, kept from the original attempt
        python_callable=parse_queue_parameter,
    )

    task_test = BashOperator(
        task_id='test2',
        queue=run_on_queue,
        bash_command="echo hi",
    )

    initial_task.set_downstream(task_test)

Related

Airflow unpause dag during manual trigger

I am using Airflow 2.2.2 and have a DAG which is scheduled to run every 10 minutes and is paused. I am trying to invoke it manually using the Airflow client. The DAG is not getting unpaused and the dag run stays in the queued state. Is it possible to unpause the DAG with the Airflow client when creating the dag run, without invoking an additional API call?
api_instance = dag_run_api.DAGRunApi(api_client)
dag_run = DAGRun(
    logical_date=datetime.now(timezone(timedelta())),
    conf=request_data,
)
api_response = api_instance.post_dag_run(
    "airflow_testn", dag_run
)
You can add the parameter is_paused_upon_creation=False to your DAG definition:
(Optional[bool]) -- Specifies if the dag is paused when created for the first time. If the dag exists already, this flag will be ignored. If this optional parameter is not specified, the global config setting will be used.
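For illustration, a minimal sketch of a DAG declaring this parameter; the schedule and start_date are placeholders:

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="airflow_testn",
    schedule_interval="*/10 * * * *",
    start_date=datetime(2021, 1, 1),
    is_paused_upon_creation=False,  # DAG starts unpaused the first time it is parsed
) as dag:
    ...

Note that, per the docstring above, this only applies the first time the DAG is created; for a DAG that already exists in a paused state, you would still need to unpause it separately (for example via the UI or the DAG update endpoint).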

Airflow PostgresOperator Safety -- execution_timeout not respected? How to kill process in DB if it's taking too long?

Working on getting Airflow implemented at my company but need to perform some safety checks prior to connecting to our prod dbs.
There is concern about stupid SQL being deployed and eating up too many resources. My thought was that an execution_timeout setting on a PostgresOperator task would:
Fail the task
Kill the query process in the db
I have found neither to be true.
Code:
with DAG(
    # Arguments applied to instantiate this DAG. Update your values here
    # All parameters visible in airflow.models.dag
    dag_id=DAG_ID,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(minutes=20),
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['admin'],
    max_active_runs=1
) as dag:
    kill_test = PostgresOperator(
        task_id="kill_test",
        execution_timeout=timedelta(seconds=10),
        postgres_conn_id="usa_db",
        sql="""
            SET application_name to airflow_test;
            <SELECT... intentionally long running query> ;
        """)
Airflow does not fail the task after the timeout.
Even when I manually fail the task in the UI, it does not kill the query in the Postgres db.
What is the deal here? Is there any way to put in safety measures to hard kill an Airflow initiated Postgres query in the db?
I'm not posting it all here, but I have checked:
The Airflow UI, which shows the task instance duration way over the execution timeout
pg_stat_activity, to confirm the query keeps running way over the execution timeout
I guess you are looking for the parameter runtime_parameters={'statement_timeout': '180000ms'} (see the Airflow provider example).
I don't know in which version this was added, but if you update your apache-airflow-providers-postgres package to the latest version you can use the mentioned parameter.
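A minimal sketch of how that could look on the task from the question, assuming a provider version that supports runtime_parameters; the connection id and timeout value are illustrative:

from datetime import timedelta
from airflow.providers.postgres.operators.postgres import PostgresOperator

kill_test = PostgresOperator(
    task_id="kill_test",
    postgres_conn_id="usa_db",
    execution_timeout=timedelta(seconds=10),
    # Ask Postgres itself to cancel the statement server-side after 10 seconds
    runtime_parameters={"statement_timeout": "10000ms"},
    sql="""
        SET application_name to airflow_test;
        -- <intentionally long-running query here>
    """,
)

Because statement_timeout is enforced by the database, the query gets killed in Postgres itself rather than relying on Airflow to terminate it.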

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a python module per each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. That means if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible, since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, though I am not sure how well this will work at a scale of thousands of DAGs. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and configurations, roughly as sketched below. It all works without any issue.
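As an illustration, a rough sketch of generating one DAG per customer from a config dict; the customer names, schedule, and run_report callable are assumptions, not taken from the original setup (in practice the config could be loaded from YAML):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-customer configuration
CUSTOMERS = {
    "acme": {"report_type": "sales", "schedule": "@daily"},
    "globex": {"report_type": "usage", "schedule": "@daily"},
}

def run_report(customer, report_type, **context):
    # Placeholder for the per-report module that encapsulates the job steps
    print(f"Refreshing {report_type} report for {customer}")

for customer, cfg in CUSTOMERS.items():
    with DAG(
        dag_id=f"report_{customer}",
        schedule_interval=cfg["schedule"],
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="refresh_report",
            python_callable=run_report,
            op_kwargs={"customer": customer, "report_type": cfg["report_type"]},
        )
    # Register each generated DAG in the module's globals so Airflow picks it up
    globals()[dag.dag_id] = dag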
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
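For that on-demand part, a hedged sketch of triggering one of the generated DAGs with customer-specific conf through the (pre-2.0, experimental) REST API; the host and DAG id are placeholders:

import requests

# Experimental API endpoint: POST /api/experimental/dags/<dag_id>/dag_runs
response = requests.post(
    "http://airflow-webserver:8080/api/experimental/dags/report_acme/dag_runs",
    json={"conf": {"customer": "acme", "report_type": "sales"}},
)
response.raise_for_status()
print(response.json())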

celery result_backend with postgresql - null for some fields in the task table

I am using Celery for a POC. My objective is to create a workflow across a distributed system.
For the purpose of a POC I have created a couple of docker containers each with a worker started with the queue option. I initiated a chain from yet another container. The chain executes successfully across the containers. Now I decided to enable the result_backend to store the results of each task getting executed.
I set the result_backend to PostgreSQL with the schema option. After executing the chain, I now see the tables created in Postgres. However, the task_meta table has some columns that are null (e.g. worker and queue), whereas task_id and status are correctly populated.
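For reference, the kind of configuration described above, sketched with placeholder connection URLs and schema names:

# Celery configuration (e.g. celeryconfig.py); URLs and schema names are placeholders
broker_url = "redis://redis:6379/0"
result_backend = "db+postgresql://celery:celery@postgres:5432/celery_results"

# Store the result tables in a dedicated schema instead of "public"
database_table_schemas = {
    "task": "celery",
    "group": "celery",
}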
Has anyone faced a similar issue? Any help is appreciated.
I know it has been a long time since you asked the question but I hope it will help others.
By default, Celery does not write all task result attributes to the backend.
You have to configure it by setting result_extended to True, as stated here: https://docs.celeryproject.org/en/stable/userguide/configuration.html#result-extended
So if you configure your app from Python, you will have to do something like:
from celery import Celery

app = Celery(
    broker=broker_url,
    backend=postgresql_url,
)
# Persist extended task attributes (name, args, worker, queue, ...) to the result backend
app.conf.update(result_extended=True)

Apache Airflow through HTTP

I've been running into an issue where I can successfully trigger a DAG from Airflow's REST API commands (https://airflow.apache.org/api.html); however, the DAG instances do not run. I'm calling POST /api/experimental/dags/dag_id/dag_runs, where dag_id is the DAG I'm running. The only thing that happens is that the dag run immediately returns success. When I trigger the DAG manually, I do get running DAG instances (see the second DAG run in the picture). Note that the second DAG run fails; this should not affect the issue I am trying to fix.
[screenshot of the DAG runs]
Fixed the issue: it had to do with the scheduler. I added 'depends_on_past': False, 'start_date': datetime(2019, 6, 1) and it got fixed.
DAG runs created outside the scheduler still must occur after the start_date; if there are no existing runs already, you might want to set the schedule to @once and the start_date to a past date for which you want the execution_date run. This will give you a successful run (once it completes) against which other manual runs can compare themselves for depends_on_past.
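For illustration, a minimal sketch of the default_args described in the fix above; the DAG id and task are placeholders, using the Airflow 1.x import paths this question dates from:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),  # in the past, so triggered runs are not deferred
}

with DAG(
    dag_id='my_http_triggered_dag',
    default_args=default_args,
    schedule_interval='@once',  # one scheduled run; the rest are triggered via the API
) as dag:
    BashOperator(task_id='do_work', bash_command='echo hi')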