How to create TaskFlow DAGs dynamically in Airflow 2.0?

I have a parameterized DAG and I want to programmatically create DAG instances based on it.
In the traditional Airflow model, I can achieve this easily using a loop:
# Code sample from: https://github.com/astronomer/dynamic-dags-tutorial/blob/main/dags/dynamic-dags-loop.py
def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


for n in range(1, 4):
    dag_id = 'loop_hello_world_{}'.format(str(n))
    # ...
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
How can I implement similar behavior in Airflow 2.0 using the TaskFlow model?
I have tried to apply the @dag decorator manually, as in the following code, but it does not work. The dynamically created DAGs do not have any tasks in them:
@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False, tags=['test'])
def flow_test():
    @task()
    def sleep():
        time.sleep(5)

    t1 = sleep()


for n in range(1,3):
    dag_id = 'flow_test_{}'.format(str(n))
    globals()[dag_id] = dag(dag_id=dag_id, tags=['test'])(flow_test)()

Related

Dynamically generated apache airflow operator in V1.10.14

My team is trying to create a "sharded" operator. In its simplest form, it's just a KubernetesPodOperator with a few extra arguments { totalPartitions: number, partitionNumber: number } that the job implementation will use to partition its data source and execute some code.
We would like to be able to auto-generate these partitioned jobs in Airflow rather than in the job implementation, something like this (in 2.3.x):
from datetime import datetime
from airflow import DAG
from airflow.decorators import task

with DAG(...) as dag:
    @task
    def sharded_job(total_shards, shard_number):
        return Operator(
            ...
            job_id = "same_job_id_for_all",
            total_shards = total_shards,
            shard_number = shard_number,
        )

    sharded_job.expand(total_shards = [5], shard_number = [0, 1, 2, 3, 4])
but have it work? Additionally, is there a nice way to package the above into a util or ShardedOperator class that can spin up total_shards number of job instances?
If not the above, does airflow have other primitives for partitioning jobs/creating a cluster of same-type jobs?
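The partitioning contract described above (every job instance receives the same total_shards and its own shard_number) can be sketched without Airflow at all. This is only an illustration under assumptions: shard_kwargs and select_partition are hypothetical helper names, and the stride-based split is one possible partitioning scheme, not something the question prescribes. A list of per-shard keyword arguments like the one built here is the kind of input a mapped task or operator would fan out over.

```python
def shard_kwargs(total_shards):
    """Build one kwargs dict per shard (hypothetical helper, not an Airflow API)."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return [
        {"total_shards": total_shards, "shard_number": n}
        for n in range(total_shards)
    ]


def select_partition(items, total_shards, shard_number):
    """Return the slice of `items` owned by one shard, using a simple stride."""
    return items[shard_number::total_shards]


shards = shard_kwargs(5)
data = list(range(12))

# Each job instance would run this with its own kwargs; here we run them all.
parts = [select_partition(data, **kw) for kw in shards]
for kw, part in zip(shards, parts):
    print(kw["shard_number"], part)
```

Because every shard uses the same stride and a distinct offset, the partitions are disjoint and together cover the whole input, which is the property the job implementation relies on.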

Best approach for building an LSH table using Apache Beam and Dataflow

I have an LSH table builder utility class which goes as follows (referred from here):
class BuildLSHTable:
    def __init__(self, hash_size=8, dim=2048, num_tables=10, lsh_file="lsh_table.pkl"):
        self.hash_size = hash_size
        self.dim = dim
        self.num_tables = num_tables
        self.lsh = LSH(self.hash_size, self.dim, self.num_tables)
        self.embedding_model = embedding_model
        self.lsh_file = lsh_file

    def train(self, training_files):
        for id, training_file in enumerate(training_files):
            image, label = training_file
            if len(image.shape) < 4:
                image = image[None, ...]
            features = self.embedding_model.predict(image)
            self.lsh.add(id, features, label)

        with open(self.lsh_file, "wb") as handle:
            pickle.dump(self.lsh,
                        handle, protocol=pickle.HIGHEST_PROTOCOL)
I then execute the following in order to build my LSH table:
training_files = zip(images, labels)
lsh_builder = BuildLSHTable()
lsh_builder.train(training_files)
Now, when I am trying to do this via Apache Beam (code below), it's throwing:
TypeError: can't pickle tensorflow.python._pywrap_tf_session.TF_Operation objects
Code used for Beam:
def generate_lsh_table(args):
    options = beam.options.pipeline_options.PipelineOptions(**args)
    args = namedtuple("options", args.keys())(*args.values())

    with beam.Pipeline(args.runner, options=options) as pipeline:
        (
            pipeline
            | 'Build LSH Table' >> beam.Map(
                args.lsh_builder.train, args.training_files)
        )
This is how I am invoking the beam runner:
args = {
"runner": "DirectRunner",
"lsh_builder": lsh_builder,
"training_files": training_files
}
generate_lsh_table(args)
Apache Beam pipelines should be converted to a standard (for example, proto) format before being executed. As a part of this, certain pipeline objects such as DoFns get serialized (pickled). If your DoFns have instance variables that cannot be serialized, this process cannot continue.
One way to solve this is to load/define such instance objects or modules during execution instead of creating and storing such objects during pipeline submission. This might require adjusting your pipeline.
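The idea of deferring object creation until execution can be sketched with the standard library alone. This is a minimal illustration under assumptions, not Beam code: threading.Lock stands in for the unpicklable TensorFlow model, plain pickle stands in for whatever serializer the runner uses, and EagerBuilder, LazyBuilder and can_pickle are hypothetical names. In a real pipeline the lazy creation would typically live in a DoFn's setup method.

```python
import pickle
import threading


class EagerBuilder:
    """Stores an unpicklable resource at construction (submission) time."""

    def __init__(self):
        # threading.Lock is a stand-in for the model: pickle rejects it.
        self.model = threading.Lock()


class LazyBuilder:
    """Creates the resource on first use, so the instance itself stays picklable."""

    def __init__(self, hash_size=8):
        self.hash_size = hash_size
        self.model = None  # nothing unpicklable is held before serialization

    def setup(self):
        # Runs on the worker, after the object has been deserialized.
        if self.model is None:
            self.model = threading.Lock()

    def process(self, item):
        self.setup()
        return (item, self.hash_size)


def can_pickle(obj):
    """Report whether pickle can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


eager_ok = can_pickle(EagerBuilder())
lazy_ok = can_pickle(LazyBuilder())

# Simulate shipping the object to a worker and using it there.
worker_copy = pickle.loads(pickle.dumps(LazyBuilder()))
result = worker_copy.process("img-0")
print(eager_ok, lazy_ok, result)
```

The same restructuring applied to BuildLSHTable would mean not assigning embedding_model in __init__, and loading it inside the code that runs per element or per worker instead.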

How to test on_failure_callback Airflow operators

I have a scenario where I want to make an integration test for operators called through on_failure_callback in an Airflow DAG.
A minimal example of this DAG is as follows:
def failure_callback(context):
    # CustomOperator in this case links to an external K8s service
    handle_failure = CustomOperator(
        task_id="handle_failure",
        timestamp=context["ts"]
    )
    handle_failure.execute(context=context)


args = {
    "catchup": False,
    "retries": 3,
    "retry_delay": timedelta(seconds=30),
    "start_date": START_DATE,
    "on_failure_callback": failure_callback,
}

with DAG("foo", schedule_interval=None, default_args=args) as dag:
    task_to_fail = SomeOperator()
My first thought for testing would be to run task_to_fail, let it fail, and validate the outcome of the failure_callback with some other process, attempt below:
import pytest
from airflow.models import DagBag, TaskInstance
from dateutil import parser


@pytest.fixture
def foo_dag():
    dag_id = "foo"
    dag_bag = DagBag("dags")
    return dag_bag.dags[dag_id]


@pytest.mark.integration
def test_task_to_fail(foo_dag):
    execution_date = parser.parse("2000-01-01T00:00+00:00")
    task_id = "task_to_fail"
    task = foo_dag.get_task(task_id=task_id)
    task_instance = TaskInstance(task, execution_date)
    with pytest.raises(Exception):
        task_instance.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
    assert "INSERT DESIRED OUTCOME OF `failure_callback` HERE"
The issue I'm having is that failure_callback does not appear to be called when running pytest. I suspect this is due to how TaskInstance is being run (i.e. it does not invoke the on_failure_callback), but I am not sure.
My questions:
Is this the correct way to validate the behavior of this callback? If not, how should this be handled?
Upstream of the task_to_fail task, there are many expensive operations that I want to avoid running during tests. Is it possible to have a full run of a DAG, executed with pytest, starting from a particular task (in this case, task_to_fail)?
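One option that sidesteps both questions is to call the callback directly with a hand-built context and a mocked operator, so neither the failing task nor its upstream tasks need to run. The sketch below is a unit-level check rather than the integration test asked about, and it rests on assumptions: CustomOperator here is a stand-in class, the context contains only the "ts" key the callback reads, and the patching goes through the function's globals only to keep the example self-contained (a real suite would more likely patch by module path).

```python
from unittest import mock


class CustomOperator:
    """Stand-in for the question's operator; the real one calls an external service."""

    def __init__(self, task_id, timestamp):
        self.task_id = task_id
        self.timestamp = timestamp

    def execute(self, context):
        raise RuntimeError("would contact the external K8s service")


def failure_callback(context):
    handle_failure = CustomOperator(
        task_id="handle_failure",
        timestamp=context["ts"],
    )
    handle_failure.execute(context=context)


# Minimal fake context: only the key the callback actually uses.
context = {"ts": "2000-01-01T00:00:00+00:00"}
fake_operator = mock.MagicMock()

# Swap the operator class seen by failure_callback for the duration of the call.
with mock.patch.dict(failure_callback.__globals__, {"CustomOperator": fake_operator}):
    failure_callback(context)

# The callback should build the operator once and execute it once.
fake_operator.assert_called_once_with(task_id="handle_failure", timestamp=context["ts"])
fake_operator.return_value.execute.assert_called_once_with(context=context)
```

This verifies what the callback does once it is invoked; whether Airflow invokes it for a given way of running the task still has to be checked separately.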

airflow TriggerDagRunOperator how to change the execution date

I noticed that for scheduled task the execution date is set in the past according to
Airflow was developed as a solution for ETL needs. In the ETL world,
you typically summarize data. So, if I want to summarize data for
2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be
right after all data for 2016-02-19 becomes available.
however, when a dag triggers another dag the execution time is set to now().
Is there a way to have the triggered dags with the same execution time of triggering dag? Of course, I can rewrite the template and use yesterday_ds, however, this is a tricky solution.
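The convention quoted above can be made concrete with a few lines of plain Python. This is just an arithmetic illustration, assuming a fixed daily interval; earliest_run_time is a hypothetical helper name and nothing here is Airflow's scheduler code.

```python
from datetime import datetime, timedelta


def earliest_run_time(execution_date, interval):
    """A scheduled run covering [execution_date, execution_date + interval)
    can only start once that period has ended."""
    return execution_date + interval


execution_date = datetime(2016, 2, 19)
run_at = earliest_run_time(execution_date, timedelta(days=1))
print("execution_date:", execution_date, "-> starts at:", run_at)
```

A run triggered from another DAG does not follow this rule by default, which is the mismatch the question is about.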
The following class expands on TriggerDagRunOperator to allow passing the execution date as a string that then gets converted back into a datetime. It's a bit hacky but it is the only way I found to get the job done.
from datetime import datetime
import logging

from airflow import settings
from airflow.utils.state import State
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder


class MMTTriggerDagRunOperator(TriggerDagRunOperator):
    """
    MMT-patched for passing explicit execution date
    (otherwise it's hard to hook the datetime.now() date).
    Use when you want to explicitly set the execution date on the target DAG
    from the controller DAG.

    Adapted from Paul Elliot's solution on airflow-dev mailing list archives:
    http://mail-archives.apache.org/mod_mbox/airflow-dev/201711.mbox/%3cCAJuWvXgLfipPmMhkbf63puPGfi_ezj8vHYWoSHpBXysXhF_oZQ@mail.gmail.com%3e

    Parameters
    ------------------
    execution_date: str
        the custom execution date (jinja'd)

    Usage Example:
    -------------------
    my_dag_trigger_operator = MMTTriggerDagRunOperator(
        execution_date="{{execution_date}}",
        task_id='my_dag_trigger_operator',
        trigger_dag_id='my_target_dag_id',
        python_callable=lambda: random.getrandbits(1),
        params={},
        dag=my_controller_dag
    )
    """
    template_fields = ('execution_date',)

    def __init__(
        self, trigger_dag_id, python_callable, execution_date,
        *args, **kwargs
    ):
        self.execution_date = execution_date
        super(MMTTriggerDagRunOperator, self).__init__(
            trigger_dag_id=trigger_dag_id, python_callable=python_callable,
            *args, **kwargs
        )

    def execute(self, context):
        # The templated execution_date arrives as a string; parse it for the run id.
        run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
        dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
        dro = self.python_callable(context, dro)
        if dro:
            session = settings.Session()
            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True)
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
            session.close()
        else:
            logging.info("Criteria not met, moving on")
There is an issue you may run into when using this and not setting execution_date=now(): your operator will throw a mysql error if you try to start a dag with an identical execution_date twice. This is because the execution_date and dag_id are used to create the row index and rows with identical indexes cannot be inserted.
I can't think of a reason you would ever want to run two identical dags with the same execution_date in production anyway, but it is something I ran into while testing and you should not be alarmed by it. Simply clear the old job or use a different datetime.
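The collision described above can be reproduced in miniature with an in-memory SQLite table. This is a sketch under assumptions: the dag_run table below is a hypothetical two-column stand-in that only mirrors the uniqueness of (dag_id, execution_date), not Airflow's actual schema, and create_dagrun is an illustrative function, not the Airflow method of the same name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dag_run (
        dag_id TEXT NOT NULL,
        execution_date TEXT NOT NULL,
        run_id TEXT,
        UNIQUE (dag_id, execution_date)
    )
    """
)


def create_dagrun(conn, dag_id, execution_date):
    """Insert a run row; return False if the same (dag_id, execution_date) exists."""
    run_id = "trig__" + execution_date
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dag_run (dag_id, execution_date, run_id) VALUES (?, ?, ?)",
                (dag_id, execution_date, run_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = create_dagrun(conn, "my_target_dag_id", "2018-01-01T00:00:00")
second = create_dagrun(conn, "my_target_dag_id", "2018-01-01T00:00:00")
third = create_dagrun(conn, "my_target_dag_id", "2018-01-02T00:00:00")
print(first, second, third)
```

The second insert is rejected for the same reason the operator fails on a repeated execution date, while a different date goes through.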
The TriggerDagRunOperator now has an execution_date parameter to set the execution date of the triggered run.
Unfortunately the parameter is not in the template fields.
If it is added to the template fields (or if you subclass the operator and change the template_fields value), it will be possible to use it like this:
my_trigger_task= TriggerDagRunOperator(task_id='my_trigger_task',
trigger_dag_id="triggered_dag_id",
python_callable=conditionally_trigger,
execution_date= '{{execution_date}}',
dag=dag)
It has not been released yet but you can see the sources here:
https://github.com/apache/incubator-airflow/blob/master/airflow/operators/dagrun_operator.py
The commit that did the change was:
https://github.com/apache/incubator-airflow/commit/089c996fbd9ecb0014dbefedff232e8699ce6283#diff-41f9029188bd5e500dec9804fed26fb4
I improved the MMTTriggerDagRunOperator a bit. The function now checks whether the dag_run already exists and, if it does, restarts the DAG using Airflow's clear function. This allows us to create a dependency between DAGs: being able to pass the execution date on to the triggered DAG opens up a whole universe of possibilities. I wonder why this is not the default behavior in Airflow.
def execute(self, context):
    run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
    dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
    dro = self.python_callable(context, dro)
    if dro:
        session = settings.Session()
        dbag = DagBag(settings.DAGS_FOLDER)
        trigger_dag = dbag.get_dag(self.trigger_dag_id)

        if not trigger_dag.get_dagrun( self.execution_date ):
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True
            )
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
        else:
            trigger_dag.clear(
                start_date = self.execution_date,
                end_date = self.execution_date,
                only_failed = False,
                only_running = False,
                confirm_prompt = False,
                reset_dag_runs = True,
                include_subdags= False,
                dry_run = False
            )
            logging.info("Cleared DagRun {}".format(trigger_dag))

        session.close()
    else:
        logging.info("Criteria not met, moving on")
There is a function available in the experimental API section of airflow that allows you to trigger a dag with a specific execution date. https://github.com/apache/incubator-airflow/blob/master/airflow/api/common/experimental/trigger_dag.py
You can call this function as a part of PythonOperator and achieve the objective.
So it will look like
from airflow.api.common.experimental.trigger_dag import trigger_dag
trigger_operator=PythonOperator(task_id='YOUR_TASK_ID',
python_callable=trigger_dag,
op_args=['dag_id'],
op_kwargs={'execution_date': datetime.now()})

Replacing Celerybeat with Chronos

How mature is Chronos? Is it a viable alternative to a scheduler like celery-beat?
Right now our scheduling implements a periodic "heartbeat" task that checks for "outstanding" events and fires them if they are overdue. We are using python-dateutil's rrule for defining this.
We are looking at alternatives to this approach, and Chronos seems a very attractive one: 1) it would remove the need for a heartbeat schedule task, 2) it supports RESTful submission of events in ISO 8601 format, 3) it has a useful interface for management, and 4) it scales.
The crucial requirement is that scheduling needs to be configurable on the fly from the web interface. This is why we can't use celerybeat's built-in scheduling out of the box.
Are we going to shoot ourselves in the foot by switching over to Chronos?
This SO question has solutions to your dynamic periodic task problem. The one below is not the accepted answer at the moment:
from datetime import datetime

from django.db import models
from djcelery.models import PeriodicTask, IntervalSchedule


class TaskScheduler(models.Model):
    periodic_task = models.ForeignKey(PeriodicTask)

    @staticmethod
    def schedule_every(task_name, period, every, args=None, kwargs=None):
        """ schedules a task by name every "every" "period". So an example call would be:
        TaskScheduler.schedule_every('mycustomtask', 'seconds', 30, [1,2,3])
        that would schedule your custom task to run every 30 seconds with the arguments 1, 2 and 3 passed to the actual task.
        """
        permissible_periods = ['days', 'hours', 'minutes', 'seconds']
        if period not in permissible_periods:
            raise Exception('Invalid period specified')
        # create the periodic task and the interval
        ptask_name = "%s_%s" % (task_name, datetime.now())  # create some name for the period task
        interval_schedules = IntervalSchedule.objects.filter(period=period, every=every)
        if interval_schedules:  # just check if interval schedules exist like that already and reuse em
            interval_schedule = interval_schedules[0]
        else:  # create a brand new interval schedule
            interval_schedule = IntervalSchedule()
            interval_schedule.every = every  # should check to make sure this is a positive int
            interval_schedule.period = period
            interval_schedule.save()
        ptask = PeriodicTask(name=ptask_name, task=task_name, interval=interval_schedule)
        if args:
            ptask.args = args
        if kwargs:
            ptask.kwargs = kwargs
        ptask.save()
        return TaskScheduler.objects.create(periodic_task=ptask)

    def stop(self):
        """pauses the task"""
        ptask = self.periodic_task
        ptask.enabled = False
        ptask.save()

    def start(self):
        """starts the task"""
        ptask = self.periodic_task
        ptask.enabled = True
        ptask.save()

    def terminate(self):
        self.stop()
        ptask = self.periodic_task
        self.delete()
        ptask.delete()
I haven't used djcelery yet, but it supposedly has an admin interface for dynamic periodic tasks.