Generating a UUID and using it across an Airflow DAG - constants

I'm trying to create a dynamic Airflow DAG that has the following 2 tasks:
Task 1: Creates files with a generated UUID as part of their name
Task 2: Runs a check on those files
So I defined a variable 'FILE_UUID' and set it as follows: str(uuid.uuid4()). I also created a constant file name:
MY_FILE = '{file_uuid}_file.csv'.format(file_uuid=FILE_UUID)
Task 1 is a BashOperator that gets MY_FILE as part of the command, and it creates a file successfully.
I can see that the generated files include a specific UUID in their name.
Task 2 is a PythonOperator that gets MY_FILE as an op_args, and it fails because it can't access the file. The logs show that it tries to access files with a different UUID.
Why is my "constant" being evaluated separately for every task? Is there any way to prevent that from happening?
I'm using Airflow 1.10, and my executor is LocalExecutor.
I tried setting the constant outside the "with DAG" block and inside it, and also tried working with macros, but then PythonOperator just uses the macro strings literally instead of the values they hold.

Keep in mind that the DAG definition file is a sort of "configuration script", not the actual executable that runs your DAGs. The tasks are executed in completely different environments, most of the time not even on the same machine. Think of it as a configuration XML that sets up your tasks, which are then built and executed on some other machine in the cloud, except that it's Python instead of XML.
In conclusion: your DAG code is Python, but it's not what runs at the runtime of your tasks. The file is parsed again for each task, so if you generate a random uuid there, it gets evaluated at an unknown time and multiple times, once per task and possibly on different machines.
To keep it consistent across tasks you need another approach, for example:
use XCom: the first task generates the uuid and writes it to XCom for all downstream tasks to use.
anchor your uuid to something constant across your pipeline: a source, a date, or whatever (e.g. if it's a daily task, you can build your uuid from date parts, mixing in some dag/task specifics, etc. Anything that makes your uuid the same for all tasks but unique for each day will do)
Example DAG using the first method (XCom):
from datetime import datetime
import uuid
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
with DAG(dag_id='global_uuid',
         schedule_interval='@daily',
         start_date=...) as dag:
    # The return value of the callable is pushed to XCom automatically.
    generate_uuid = PythonOperator(
        task_id='generate_uuid',
        python_callable=lambda: str(uuid.uuid4())
    )
    # Downstream tasks pull the same value from XCom via templating.
    print_uuid1 = BashOperator(
        task_id='print1',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )
    print_uuid2 = BashOperator(
        task_id='print2',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )
    generate_uuid >> print_uuid1 >> print_uuid2
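The second method (anchoring the uuid to something constant for the run) could be sketched as follows. This is a minimal sketch, not Airflow API: the helper name run_uuid and the choice of uuid.uuid5 with NAMESPACE_URL are assumptions for illustration. In a DAG, the two inputs would typically come from the dag_id and the templated execution date.

```python
import uuid

def run_uuid(dag_id: str, execution_date: str) -> str:
    """Derive a stable UUID from values that are constant for one DAG run.

    uuid5 hashes the name into the namespace, so the same inputs always
    give the same UUID, in any process and on any machine.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{dag_id}/{execution_date}"))

# Every task that computes this for the same run gets the same value.
first = run_uuid("global_uuid", "2020-01-01")
second = run_uuid("global_uuid", "2020-01-01")
next_day = run_uuid("global_uuid", "2020-01-02")
print(first == second)    # same run, same UUID
print(first == next_day)  # different day, different UUID
```

Because nothing is stored, no task depends on another task having run first; the trade-off is that the value is predictable rather than random.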


Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve these parameters from within Jupyter notebook code. I've tried doing so in two ways, but neither of them worked for me:
Method A. as Arguments and then tried to retrieve them using sys.argv[].
Method B. as Spark configuration and then tried to retrieve them using sc.getConf().getAll().
I suspect that either:
I am not specifying parameters correctly
or using a wrong way to retrieve them in Jupyter Notebook code
or parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for the Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity of an HDInsight cluster, you must give the entryFilePath as either a .jar file or a .py file. When you do this, you can successfully pass arguments, which can then be read using sys.argv.
The following is an example of how you can pass arguments to a Python script.
The code inside nb1.py (sample) is shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# Use the argument passed from the pipeline as the table name.
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table is created (the name of this table is passed as an argument from the pipeline's Spark activity). We can query this table in the HDInsight cluster's Jupyter notebook using the following query:
select * from new_hvac
NOTE:
Make sure that you are passing the arguments to a Python script (.py file), not to a Python notebook.
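Inside the .py entry file, the arguments configured on the Spark activity arrive as ordinary command-line arguments. A minimal sketch of reading the table name defensively is shown here; the helper name table_name_from_args and the fallback name are hypothetical and not part of the original script.

```python
import sys

def table_name_from_args(argv, default="hvac_default"):
    """Return the first user argument, or a fallback when none was passed.

    argv[0] is the script path; arguments set on the activity start at argv[1].
    """
    return argv[1] if len(argv) > 1 else default

if __name__ == "__main__":
    print(table_name_from_args(sys.argv))
```

A fallback like this keeps the script usable when it is run by hand without arguments, where sys.argv[1] would otherwise raise an IndexError.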

Upgrading Celery 4.x.x to 5.x.x in a Django app - execute_from_commandline() replacement

The usage in 4.x.x was as follows:
from tenant_schemas_celery.app import CeleryApp
class TenantCeleryApp(CeleryApp):
    def create_task_cls(self):
        return self.subclass_with_self('...', abstract=True, name='...', attribute='_app')
tenant_celery = TenantCeleryApp()
base = celery.CeleryCommand(app=tenant_celery)
base.execute_from_commandline('...')
...
Now, after updating the celery library to 5.x.x, the following error shows up:
base = celery.CeleryCommand(app=tenant_celery)
TypeError: __init__() got an unexpected keyword argument 'app'
According to the documentation, the new CeleryCommand uses the click.Command class. How do I change my code to fit it? What is the replacement for execute_from_commandline()?
EDIT:
After some trial and error, the following code works:
tenant_celery.worker_main(argv=['--broker=amqp://***:***@rabbitmq:5672//***',
                                '-A', f'{__name__}:tenant_celery',
                                'worker', '-c', '1', '-Q', 'c1,c2,c3,c4'])
You can do a few things here.
The typical way to invoke / start a worker from within Python is discussed in this answer:
worker = tenant_celery.Worker(
    include=['project.tasks']
)
worker.start()
In this case, you would be responsible for making the worker exit when you are done.
To execute the CeleryCommand / click.Command, you pass in the arguments to the main function
base = CeleryCommand()
base.main(args=['worker', '-A', f'{__name__}:tenant_celery'])
You would still be responsible for controlling how celery exits in this case, too. You may choose a verb other than worker such as multi for whatever celery subcommand you were expecting to call.
You may also want to explicitly specify the name of the celery module for the -A parameter as discussed here.
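For the argv-based route, the order of the arguments matters in Celery 5: app-level options such as -A and --broker go before the subcommand, and worker options such as -c and -Q go after it, as in the working call from the question's edit. A small sketch of building such a list follows; build_worker_argv and its parameters are hypothetical helper names, not Celery API.

```python
def build_worker_argv(app_path, queues, concurrency=1, broker=None):
    """Build an argv list with global options first, then the subcommand."""
    argv = []
    if broker:
        argv.append(f"--broker={broker}")
    argv += ["-A", app_path]  # global option: which app to load
    argv += ["worker", "-c", str(concurrency), "-Q", ",".join(queues)]
    return argv

print(build_worker_argv("myproject.celery:tenant_celery", ["c1", "c2"]))
```

The resulting list could then be passed along the lines of tenant_celery.worker_main(argv=...), as in the question's edit.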

Pipeline Dependencies in Data Fusion

I have three pipelines in Data Fusion, say A, B and C. I want pipeline C to be triggered after both pipeline A and pipeline B have completed. Pipeline triggers only allow a dependency on a single pipeline.
Can this be implemented in Data Fusion?
You can do it using Google Cloud Composer [1]. First, create a new environment in Google Cloud Composer [2]. Once that is done, install a new Python package in your environment [3]; the package you need is "apache-airflow-backport-providers-google" [4].
With this package installed you will be able to use these operators [5]. The one you need is "Start a DataFusion pipeline" [6], which lets you start a pipeline from Airflow.
An example of the Python code would be as follows:
import airflow
import datetime
from airflow import DAG
from airflow import models
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator
)
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with models.DAG(
        'composer_DF',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:
    # the operations.
    A = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="A",
        instance_name="instance_name", task_id="start_pipelineA",
    )
    B = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="B",
        instance_name="instance_name", task_id="start_pipelineB",
    )
    C = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="C",
        instance_name="instance_name", task_id="start_pipelineC",
    )
    # First A then B and then C
    A >> B >> C
You can set the time intervals by checking the Airflow documentation.
Once you have this code saved as a .py file, upload it to the Google Cloud Storage DAG folder of your environment.
When the DAG starts, it will execute task A and when it finishes it will execute task B and so on.
[1] https://cloud.google.com/composer
[2] https://cloud.google.com/composer/docs/how-to/managing/creating
[3] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
[4] https://pypi.org/project/apache-airflow-backport-providers-google/
[5] https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/cloud/operators/datafusion/index.html
[6] https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/datafusion.html#start-a-datafusion-pipeline
There is no direct way that I can think of, but there are two workarounds.
Workaround 1: Merge pipelines A and B into a single pipeline AB, then trigger pipeline C (AB > C).
Pipeline A - (GCS Copy > Decompress),
Pipeline B - (GCS2 > thrashsad)
Use BigQueryExecute to mitigate the error: Invalid DAG. There is an island made up of stages..
In BigQueryExecute, use a valid dummy query.
Merging the two pipelines into one may make pipeline testing harder. To overcome this, you can add a dummy condition to run one pipeline at a time.
In BigQueryExecute, change the query to 'Select ${flag}' and pass the value of flag as a runtime argument, or use 'Select 1 as flag', and set "Row As Arguments" to true.
Add a Condition plugin after BigQueryExecute with the condition runtime['flag'] = 1.
The Condition plugin has two outlets; connect them to pipeline A and pipeline B.
Workaround 2: Store a flag for each of the pipelines (A and B) in a BigQuery table, and create two flows, A > C and B > C, to trigger pipeline C. This triggers pipeline C twice, but with BigQueryExecute and a Condition plugin the rest of pipeline C runs only when both flags are available in the BigQuery table.
How?
In pipelines A and B, write an output row to the BigQuery table 'Pipeline_Run'.
In pipeline C, add BigQueryExecute with the query 'select count(*) as Cnt from ds.Pipeline_Run' and set "Row As Arguments" to true.
In pipeline C, add a Condition plugin that checks whether the value of cnt is 2 (runtime['cnt'] = 2), and connect the rest of the pipeline's plugins to its "Yes" outlet.
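The gating logic of workaround 2 can be illustrated outside Data Fusion. The sketch below is plain Python standing in for the BigQuery table and the Condition plugin; the function name and the in-memory list are assumptions for illustration only.

```python
def should_run_rest_of_c(pipeline_run_rows, expected=2):
    """Mimic the condition runtime['cnt'] = 2 applied to the row count."""
    return len(pipeline_run_rows) == expected

pipeline_run = []          # stands in for the table 'Pipeline_Run'
pipeline_run.append("A")   # pipeline A finishes and triggers C
print(should_run_rest_of_c(pipeline_run))  # first trigger: only one row
pipeline_run.append("B")   # pipeline B finishes and triggers C again
print(should_run_rest_of_c(pipeline_run))  # second trigger: both rows present
```

With this scheme the table would presumably need to be cleared (or the rows keyed by run) between runs, otherwise the count keeps growing past 2.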
You can also explore "schedules" set through the CDAP REST APIs. They allow parallel execution of pipelines with no dependency on Cloud Composer (except for a file-based trigger of the first pipeline in the workflow; for that you would need a Cloud Function or maybe a Cloud Composer file sensor).

How to fix PipelineParam from discarding all information except for name in Kubeflow Pipeline

I'm trying to write an application using Kubeflow Pipelines. I'm running into trouble when passing parameters to the pipeline (the main Python function decorated with @kfp.dsl.pipeline). The parameters should be automatically converted into PipelineParam objects with name, value, etc. However, it seems that everything except the name is being discarded. I'm on an Ubuntu server.
I've tried uninstalling/reinstalling and updating Kubeflow, tried installing several of the most recent versions of kfp (0.1.23, 0.1.22, 0.1.20, 0.1.18), as well as installing on my local machine.
import kfp.dsl
def print_pipeline_param():
    return kfp.dsl.PipelineParam("Name", value="Value")
@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file='/output.txt'):
    print(print_pipeline_param())
    print(output_file)
if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(test_pipeline, __file__ + '.tar.gz')
The result of running this is:
{{pipelineparam:op=;name=Name;value=Value;type=;}}
{{pipelineparam:op=;name=output-file;value=;type=;}}
I should be getting '/output.txt' in the "value" field, but the only field populated is the name. This only happens when passing in parameters to the main pipeline function. This also happens when directly passing in a PipelineParam like so:
@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file=kfp.dsl.PipelineParam("Output File", value="/output.txt")):
    print(output_file)
Prints out: {{pipelineparam:op=;name=output-file;value=;type=;}}

How to pass custom parameters to a locust test class?

I'm currently passing custom parameters to my load test using environment variables. For example, my test class looks like this:
from locust import HttpLocust, TaskSet, task
import os
class UserBehavior(TaskSet):
    @task(1)
    def login(self):
        test_dir = os.environ['BASE_DIR']
        auth = tuple(open(test_dir + '/PASSWORD').read().rstrip().split(':'))
        self.client.request(
            'GET',
            '/myendpoint',
            auth=auth
        )
class WebsiteUser(HttpLocust):
    task_set = UserBehavior
Then I'm running my test with:
locust -H https://myserver --no-web --clients=500 --hatch-rate=500 --num-request=15000 --print-stats --only-summary
Is there a more Locust-native way to pass custom parameters to the locust command line application?
You could use env <parameter>=<value> locust <options> and read <parameter> inside the locust script to get its value.
E.g.,
env IP_ADDRESS=100.0.1.1 locust -f locust-file.py --no-web --clients=5 --hatch-rate=1 --num-request=500
Then use IP_ADDRESS inside the locust script to access its value, which is 100.0.1.1 in this case.
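Inside the locust file, the variable is read like any other environment variable. A minimal sketch, assuming the IP_ADDRESS name from the command above; the helper get_setting and the default value are hypothetical.

```python
import os

def get_setting(name, default=None):
    """Read a setting from the environment, failing loudly if it is required."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"environment variable {name} is not set")
    return value

# Falls back to the default when IP_ADDRESS was not exported.
ip_address = get_setting("IP_ADDRESS", default="127.0.0.1")
print(ip_address)
```

Raising an error for a missing required variable tends to be easier to debug than a KeyError deep inside a task.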
Nowadays it is possible to add custom parameters to Locust (it wasn't possible when this question was originally asked, at which time using env vars was probably the best option).
Since version 2.2, custom parameters are even forwarded to the workers in a distributed run.
https://docs.locust.io/en/stable/extending-locust.html#custom-arguments
from locust import HttpUser, task, events
@events.init_command_line_parser.add_listener
def _(parser):
    parser.add_argument("--my-argument", type=str, env_var="LOCUST_MY_ARGUMENT", default="", help="It's working")
    # Set `include_in_web_ui` to False if you want to hide from the web UI
    parser.add_argument("--my-ui-invisible-argument", include_in_web_ui=False, default="I am invisible")
@events.test_start.add_listener
def _(environment, **kw):
    print("Custom argument supplied: %s" % environment.parsed_options.my_argument)
class WebsiteUser(HttpUser):
    @task
    def my_task(self):
        print(f"my_argument={self.environment.parsed_options.my_argument}")
        print(f"my_ui_invisible_argument={self.environment.parsed_options.my_ui_invisible_argument}")
It is not recommended to run locust from the command line if you want to test with high concurrency. In --no-web mode you can only use one CPU core, so you cannot make full use of your test machine.
Back to your question: there is no other way to pass custom parameters to locust on the command line.