Create virtualenv for each DAG instead of each task in Airflow - virtualenv

Right now I have the following DAG (I've omitted some non-relevant syntax):
from airflow.operators.python import PythonVirtualenvOperator

t1 = PythonVirtualenvOperator(requirements=req1)
t2 = PythonVirtualenvOperator(requirements=req2)
t3 = PythonVirtualenvOperator(requirements=req3)
t4 = PythonVirtualenvOperator(requirements=req1)  # Yes, it's the same as t1

t1 >> t2 >> t3 >> t4
I'm a big fan of venvs, and since we are multiple people deploying DAGs on the same server, we can keep our package versions separated.
The issue is that, as written above, a venv has to be created for each of the tasks (which takes some time, not much, but some), and a lot of the packages in the requirements are the same.
Isn't there a way to create a virtual environment for a specific DAG, so that I can use PythonOperator instead of PythonVirtualenvOperator? Or is the best way then to create a Docker container for each DAG?

In Airflow 2.4.0 the ExternalPythonOperator was introduced, which should solve this, per this GitHub issue.
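For what it's worth, a minimal sketch of how that could look, assuming a virtual environment for the DAG has already been built at a path like /path/to/dag_venv (the path and callables below are placeholders, not taken from the question):

from airflow.operators.python import ExternalPythonOperator

# One pre-built virtual environment is reused by every task in the DAG,
# so no venv has to be created at task runtime.
DAG_VENV_PYTHON = "/path/to/dag_venv/bin/python"  # placeholder path

def callable_1():
    # Placeholder: code that needs the packages from req1.
    ...

def callable_2():
    # Placeholder: code that needs the packages from req2.
    ...

t1 = ExternalPythonOperator(task_id="t1", python=DAG_VENV_PYTHON, python_callable=callable_1)
t2 = ExternalPythonOperator(task_id="t2", python=DAG_VENV_PYTHON, python_callable=callable_2)

t1 >> t2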

Related

Specifying DAG queue via Airflow's UI TriggerDag parameters

Thanks for reading this question.
I managed to set up an Airflow cluster following the official instructions here and to add workers hosted on remote machines. Everything seems to work fine (the connections to Redis and Postgres are working, and DAG tasks are distributed and executed properly across the different workers).
Also, I can execute DAGs on a specific worker by subscribing each worker to an exclusive queue and hardcoding the queue parameter of each operator in the DAG.
My problem is that I would like to parametrize said execution queue when triggering the DAG, e.g. via the Trigger DAG button or the Airflow CLI. I tried using Jinja templates and XComs, but neither helped, since Jinja templates don't seem to work on the queue parameter of operators and XCom needs the ti parameter or Jinja templates.
I know some people have written plugins to simplify this task, but since all the information I found predates Airflow 2.x, I wanted to know whether there is already a solution for this problem.
Thank you so much in advance
Edit: I would like to do this: Triggering DAG via UI.
However,
task_test = BashOperator(
    task_id='test2',
    queue="{{ dag_run.conf['queue'] }}",
    bash_command="echo hi",
)
does not work, since the job gets queued on the literal string {{ dag_run.conf['queue'] }} instead of on queue1.
I also tried the following approach, and it doesn't work either, as all jobs get scheduled on the default queue:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

with DAG(
    'queue_execution_test',
    default_args=default_args,
    description='Test specific execution.',
    schedule_interval=None,  # So that it only runs on demand
    start_date=days_ago(2),
) as dag:

    run_on_queue = 'default'  # '{{ dag_run.conf["queue"] if dag_run else "default" }}'

    def parse_queue_parameter(ti, **kwargs):  # Task instance for XCom, kwargs for dag_run parameters
        try:
            ti.xcom_push(key='custom_queue', value=kwargs['dag_run'].conf['queue'])
        except (KeyError, TypeError):
            print("No queue defined")

    initial_task = PythonOperator(
        task_id='test1',
        queue=run_on_queue,
        provide_context=True,
        python_callable=parse_queue_parameter,
    )

    task_test = BashOperator(
        task_id='test2',
        queue=run_on_queue,
        bash_command="echo hi",
    )

    initial_task.set_downstream(task_test)
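For what it's worth, one possible workaround (not from the original post, and untested here) is to read the queue from an Airflow Variable instead of from dag_run.conf. The queue of a task is decided when the scheduler queues it, before any template rendering happens on the worker, so the value has to be known at DAG-parse time; an Airflow Variable is available at that point and can be changed in the UI before triggering. A rough sketch, assuming a Variable named execution_queue:

from airflow.models import Variable
from airflow.operators.bash import BashOperator

# Hypothetical workaround: the target queue is stored in an Airflow Variable
# (set under Admin -> Variables before triggering). Note that it is resolved
# when the DAG file is parsed, not per DAG run.
run_on_queue = Variable.get("execution_queue", default_var="default")

task_test = BashOperator(
    task_id='test2',
    queue=run_on_queue,
    bash_command="echo hi",
)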

How to obtain GCP project name within SQL run against BigQuery from Composer Airflow

I would like to write a query to be executed by Composer/Airflow (python BigQueryOperator referring to an SQL file) along the lines of
SELECT col1, col2, ... FROM `{{ GCP_PROJECT }}.dataset.table` ...
I wish for the GCP project to be parametrised in the SQL so I can deploy the same SQL file in the production/development (prod/dev) environments and test in dev without attempting to query prod tables that the dev environment does not have access to.
Is this something that would already be set up?
I couldn't find any helpful examples on this point in the Composer guide, apart from the fact that GCP_PROJECT is already reserved, and I'm not sure how to pass an environment variable on to the templating anyway. Thanks.
I haven't tested it myself, but I think something like this should work:
import os

from airflow.operators.python import PythonOperator

dag = DAG(...)

def print_env_var():
    print(os.environ["GCP_PROJECT"])

print_context = PythonOperator(
    task_id="gcp_project",
    python_callable=print_env_var,
    dag=dag,
)
Based on the Google documentation, it's not recommended to use reserved environment variables (https://cloud.google.com/composer/docs/how-to/managing/environment-variables#reserved_names):
Cloud Composer uses these reserved names for variables that internal processes use. Do not refer to reserved names in your workflows. Variable values can change without notice.
I think a better approach to handling staging/production environments is to set variables in Airflow itself and read them in the DAG. Here's a great resource that goes into a lot of detail about Airflow variables.
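For example (a minimal sketch, assuming an Airflow Variable named gcp_project is set to the right project in each environment; the operator choice, file name, and arguments are illustrative, and the exact import path depends on the Airflow/Composer version):

# query.sql (rendered by Airflow's Jinja templating before being sent to BigQuery):
#   SELECT col1, col2 FROM `{{ var.value.gcp_project }}.dataset.table`

from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

query_task = BigQueryExecuteQueryOperator(
    task_id="run_query",
    sql="query.sql",       # .sql files are templated because of template_ext
    use_legacy_sql=False,
    dag=dag,
)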

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a Python module for each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. It means that if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler, Celery Beat (I have never seen it fail in 5 years of using Celery).
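As an illustration (a minimal sketch; the broker URL, task body, and customer list are placeholders, not from the original answer), a daily refresh per customer could be expressed with Celery Beat like this:

from celery import Celery
from celery.schedules import crontab

app = Celery("reports", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task
def refresh_report(customer_id):
    # Placeholder: run the customer-specific report pipeline here.
    print(f"Refreshing report for customer {customer_id}")

# Schedule a refresh for each known customer every day at 02:00.
app.conf.beat_schedule = {
    f"refresh-{customer_id}": {
        "task": refresh_report.name,
        "schedule": crontab(hour=2, minute=0),
        "args": (customer_id,),
    }
    for customer_id in ["cust_a", "cust_b"]  # placeholder customer list
}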
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically; I am not sure whether this will work at a scale of 1000s of DAGs, though. There are some good examples on astronomer.io about Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration, with different schedules and configurations. It all works without any issue.
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part. I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
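To illustrate the dynamic-generation pattern (a minimal sketch; the customer list, DAG ids, and schedules are hypothetical, and in practice the configuration could come from a YAML file as described above):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder configuration; could be loaded from YAML or a database instead.
CUSTOMERS = [
    {"id": "cust_a", "schedule": "@daily"},
    {"id": "cust_b", "schedule": "@daily"},
]

def refresh_report(customer_id, **_):
    # Placeholder: run the customer-specific report pipeline here.
    print(f"Refreshing report for {customer_id}")

def build_dag(customer):
    with DAG(
        dag_id=f"report_refresh_{customer['id']}",
        schedule_interval=customer["schedule"],
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="refresh",
            python_callable=refresh_report,
            op_kwargs={"customer_id": customer["id"]},
        )
    return dag

# Register one DAG per customer in the module namespace so the scheduler can discover them.
for customer in CUSTOMERS:
    globals()[f"report_refresh_{customer['id']}"] = build_dag(customer)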

celery result_backend with postgresql - null for some fields in the task table

I am using Celery for a POC. My objective is to create a workflow across a distributed system.
For the purpose of a POC I have created a couple of docker containers each with a worker started with the queue option. I initiated a chain from yet another container. The chain executes successfully across the containers. Now I decided to enable the result_backend to store the results of each task getting executed.
I set the result_backend to PostgreSQL with the schema option. After executing the chain, I do see the tables created in Postgres. However, the task_meta table has some columns as null (e.g. worker, queue), whereas task_id and status are correctly populated.
Has anyone faced a similar issue? Any help is appreciated.
I know it has been a long time since you asked the question but I hope it will help others.
By default, Celery does not write all task result attributes to the backend.
You have to configure it by setting result_extended to True, as stated here: https://docs.celeryproject.org/en/stable/userguide/configuration.html#result-extended
So if you configure your app from Python, you will have to do something like:
from celery import Celery

app = Celery(
    broker=broker_url,
    backend=postgresql_url,
)
app.conf.update(result_extended=True)
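If it helps, one way to verify that the extended attributes are being written (a small sketch; if I remember correctly the AsyncResult object exposes the extended fields once result_extended is enabled, and the task and queue names below are placeholders):

result = some_task.apply_async(queue="worker_queue_1")  # placeholder task and queue
result.get(timeout=30)

# With result_extended=True these should no longer be None, and the
# corresponding columns in the task meta table should be populated as well.
print(result.name, result.worker, result.queue)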

Go through all kubernetes Jobs using google cloud functions

I'm using kubernetes==10.1.0
My code is in Python 3.7 and looks something like this:
from kubernetes import client as kubernetes_client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

BatchV1_api = kubernetes_client.BatchV1Api()
api_response = BatchV1_api.list_namespaced_job(
    namespace="default", watch=False, pretty='true', async_req=False
)
The problem is that I have about 1500 jobs in Kubernetes, but api_response returns only 20.
The goal is to implement a program which goes through all the jobs and deletes old ones, taking the job name as a parameter.
Any idea why I'm getting only partial data from the BatchV1_api.list_namespaced_job function?
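Not an authoritative answer, but one thing worth checking is whether the results are paginated: the list calls in the Kubernetes Python client accept limit and _continue parameters, and the server returns a continue token in the list metadata when there are more items. A rough sketch of paging through all jobs and deleting old ones (the page size and the seven-day retention window are illustrative assumptions):

from datetime import datetime, timedelta, timezone

from kubernetes import client as kubernetes_client, config

config.load_kube_config()
batch_api = kubernetes_client.BatchV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(days=7)  # assumed retention window
continue_token = None

while True:
    page = batch_api.list_namespaced_job(
        namespace="default",
        limit=100,                 # page size (illustrative)
        _continue=continue_token,  # None on the first request
    )
    for job in page.items:
        started = job.status.start_time
        if started is not None and started < cutoff:
            batch_api.delete_namespaced_job(
                name=job.metadata.name,
                namespace="default",
                body=kubernetes_client.V1DeleteOptions(propagation_policy="Background"),
            )
    continue_token = page.metadata._continue
    if not continue_token:
        break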