Pipeline Dependencies in Data Fusion - google-cloud-data-fusion

I have three pipelines in Data Fusion, say A, B and C. I want Pipeline C to be triggered after both Pipeline A and Pipeline B have completed. Pipeline triggers only let me put the dependency on one pipeline.
Can this be implemented in Data Fusion?

You can do it using Google Cloud Composer [1]. To do this, first create a new environment in Google Cloud Composer [2]. Once that is done, install a new Python package in your environment [3]; the package you need is "apache-airflow-backport-providers-google" [4].
With this package installed you will be able to use the Data Fusion operators [5]. The one you need is "Start a DataFusion pipeline" [6], which lets you start a pipeline from Airflow.
An example of the Python code would be as follows:
import airflow
import datetime
from airflow import models
from datetime import timedelta
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator
)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with models.DAG(
        'composer_DF',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:
    # The Data Fusion pipeline start tasks.
    A = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="A",
        instance_name="instance_name", task_id="start_pipelineA",
    )
    B = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="B",
        instance_name="instance_name", task_id="start_pipelineB",
    )
    C = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="C",
        instance_name="instance_name", task_id="start_pipelineC",
    )

    # First A, then B, and then C
    A >> B >> C
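Note that A >> B >> C runs the three pipelines strictly one after another. Since the requirement is only that C starts after both A and B have completed, you can instead let A and B run in parallel and fan C in from both (same task objects as above):

    # Run A and B in parallel; C starts only once both have finished.
    [A, B] >> C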
You can adjust the schedule interval by checking the Airflow documentation.
Once you have this code saved as a .py file, save it to the Google Cloud Storage DAG folder of your environment.
When the DAG starts, it will execute task A and when it finishes it will execute task B and so on.
[1] https://cloud.google.com/composer
[2] https://cloud.google.com/composer/docs/how-to/managing/creating
[3] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
[4] https://pypi.org/project/apache-airflow-backport-providers-google/
[5] https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/cloud/operators/datafusion/index.html
[6] https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/datafusion.html#start-a-datafusion-pipeline

There is no direct way I could think of, but here are two workarounds.
Workaround 1: Merge pipelines A and B into a single pipeline AB, then trigger pipeline C (AB > C).
Pipeline A - (GCS Copy > Decompress),
Pipeline B - (GCS2 > thrashsad)
Add a BigQueryExecute stage to mitigate the error: "Invalid DAG. There is an island made up of stages...".
In BigQueryExecute, use a valid dummy query.
Merging the two pipelines into one may make pipeline testing harder. To overcome this you can add a dummy condition so that only one branch runs at a time.
In BigQueryExecute, change the query to 'Select ${flag}' and pass the value of flag as a runtime argument, or use 'Select 1 as flag' and set "Row As Arguments" to true.
Add a Condition plugin after BigQueryExecute and set the condition to runtime['flag'] = 1.
The Condition plugin has two outlets; connect them to pipeline A and pipeline B.
Workaround 2: Store a flag for both pipelines (A & B) in a BigQuery table, and create two triggers, A > C and B > C, to trigger pipeline C. This would trigger pipeline C twice, but with a BigQueryExecute and a Condition plugin the rest of pipeline C will only run when both flags are present in the BigQuery table.
How?
In pipelines A and B, write output (a row) to the BigQuery table 'Pipeline_Run'.
In pipeline C, add a BigQueryExecute with the query 'select count(*) as Cnt from ds.Pipeline_Run' and set "Row As Arguments" to true.
In pipeline C, add a Condition plugin, check whether the value of cnt is 2 (runtime['cnt'] = 2), and connect the rest of the pipeline's plugins to its "Yes" outlet, as sketched below.
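For clarity, this is roughly the check the BigQueryExecute and Condition plugins perform inside pipeline C, written as a minimal Python sketch (it assumes the google-cloud-bigquery client and the ds.Pipeline_Run flag table described above; it is only an illustration, not something you add to the pipeline itself):

    from google.cloud import bigquery

    client = bigquery.Client()
    # Count the flag rows written by pipelines A and B.
    rows = list(client.query("SELECT COUNT(*) AS cnt FROM ds.Pipeline_Run").result())
    both_pipelines_finished = rows[0].cnt == 2  # proceed only when A and B have both written their row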

You can also explore "schedules" set through the CDAP REST APIs. That allows parallel execution of pipelines, with no dependency on Cloud Composer (except for a file-based trigger of the first pipeline in the workflow; for that you would need a Cloud Function, or maybe a Cloud Composer file sensor).
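As an illustration of calling the CDAP REST API from Python, here is a hedged sketch of starting a deployed batch pipeline; the program-lifecycle endpoint below follows the documented pattern, but verify the instance endpoint, namespace and authentication for your setup (the schedule APIs live under the same /v3 API):

    import requests

    # Illustrative values: take the real API endpoint from your Data Fusion instance
    # (e.g. `gcloud beta data-fusion instances describe`) and a valid OAuth access token.
    CDAP_ENDPOINT = "https://<instance>-<project>.datafusion.googleusercontent.com/api"
    AUTH_TOKEN = "<access-token>"
    PIPELINE_NAME = "C"

    # Start the pipeline's workflow through the CDAP program-lifecycle endpoint.
    resp = requests.post(
        "{}/v3/namespaces/default/apps/{}/workflows/DataPipelineWorkflow/start".format(
            CDAP_ENDPOINT, PIPELINE_NAME),
        headers={"Authorization": "Bearer {}".format(AUTH_TOKEN)},
    )
    resp.raise_for_status()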

Related

Pass parameters/arguments to HDInsight/Spark Activity in Azure Data Factory

I have an on-demand HDInsight cluster that is launched from a Spark Activity within Azure Data Factory and runs PySpark 3.1. To test out my code, I normally launch Jupyter Notebook from the created HDInsight Cluster page.
Now, I would like to pass some parameters to that Spark activity and retrieve these parameters from within Jupyter notebook code. I've tried doing so in two ways, but none of them worked for me:
Method A: as Arguments, and then tried to retrieve them using sys.argv.
Method B: as Spark configuration, and then tried to retrieve them using sc.getConf().getAll().
I suspect that either:
I am not specifying parameters correctly
or using a wrong way to retrieve them in Jupyter Notebook code
or parameters are only valid for the Python *.py scripts specified in the "File path" field, but not for the Jupyter notebooks.
Any pointers on how to pass parameters into HDInsight Spark activity within Azure Data Factory would be much appreciated.
The issue is with the entryFilePath. In the Spark activity of an HDInsight cluster, you must give the entryFilePath as either a .jar file or a .py file. When we follow this, we can successfully pass arguments, which can be read using sys.argv.
The following is an example of how you can pass arguments to a Python script.
The code inside nb1.py (sample) is as shown below:
from pyspark import SparkContext
from pyspark.sql import *
import sys
sc = SparkContext()
sqlContext = HiveContext(sc)
# Create an RDD from sample data which is already available
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
# Create a schema for our data
Entry = Row('Date', 'Time', 'TargetTemp', 'ActualTemp', 'BuildingID')
# Parse the data and create a schema
hvacParts = hvacText.map(lambda s: s.split(',')).filter(lambda s: s[0] != 'Date')
hvac = hvacParts.map(lambda p: Entry(str(p[0]), str(p[1]), int(p[2]), int(p[3]), int(p[6])))
# Infer the schema and create a table
hvacTable = sqlContext.createDataFrame(hvac)
hvacTable.registerTempTable('hvactemptable')
dfw = DataFrameWriter(hvacTable)
# Using the argument passed from the pipeline to name the table.
dfw.saveAsTable(sys.argv[1])
When the pipeline is triggered, it runs successfully and the required table is created (the name of this table is passed as an argument from the pipeline's Spark activity). We can query this table in the HDInsight cluster's Jupyter Notebook using the following query:
select * from new_hvac
NOTE:
Please ensure that you are passing arguments to a Python script (.py file), not a Python notebook.
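For completeness, here is a minimal sketch of how the Arguments field of the Spark activity maps to sys.argv inside the submitted .py file (the argument values are only illustrative):

    import sys

    # If the Spark activity's Arguments field contains: new_hvac 2021-01-01
    print(sys.argv[0])  # path of the submitted script
    print(sys.argv[1])  # "new_hvac" (used above as the table name)
    print(sys.argv[2])  # "2021-01-01"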

How to refer previous task and stop the build in azure devops if there is no new data to publish an artifact

Getsolution.exe will output "New data available" or "No new data available". If new data is available then the next jobs should be executed; otherwise nothing should be executed. How should I do it? (I am working in the classic editor.)
Example: I have a set of tasks, consider 4 tasks:
task-1: builds the solution
task-2: runs Getstatus.exe, which gets the status of data available or no data available
task-3: I should be able to use the above task and make a condition (or use some API query) to proceed to publish an artifact if data is available; if not, cleanly break out of the task and stop the build. It shouldn't proceed to publish the artifact or move to the next available task.
task-4: publish artifact
First, what you need is to set a variable in the task where you run Getstatus.exe, and then set a condition on the next tasks based on that variable.
If you set doThing to a value other than Yes, the subsequent tasks will be skipped.
Since we need to execute different tasks based on the result of running Getstatus.exe, we need to set the condition based on that result.
To resolve this, just as Krzysztof Madej said, we can set a variable based on the return value of Getstatus.exe in an inline PowerShell task:
$dataAvailable = & .\Getstatus.exe  # capture the output of Getstatus.exe; adjust the invocation as needed
if ($dataAvailable -eq "True")
{
    Write-Host ("##vso[task.setvariable variable=Status]Yes")
}
elseif ($dataAvailable -eq "False")
{
    Write-Host ("##vso[task.setvariable variable=Status]No")
}
Then set a different condition for the next task, for example a custom condition such as and(succeeded(), eq(variables['Status'], 'Yes')) on the publish-artifact task.
You can check the document Specify conditions for some more details.

How to fix PipelineParam from discarding all information except for name in Kubeflow Pipeline

I'm trying to write an application using Kubeflow Pipelines. I'm running into trouble when passing parameters to the pipeline (the main Python function decorated with @kfp.dsl.pipeline). The parameters should be automatically converted into a PipelineParam with name, value, etc. However, it seems that everything except for the name is being discarded. I'm on an Ubuntu server.
I've tried uninstalling/reinstalling and updating Kubeflow, tried installing several of the most recent versions of kfp (0.1.23, 0.1.22, 0.1.20, 0.1.18), as well as installing on my local machine.
import kfp.dsl

def print_pipeline_param():
    return kfp.dsl.PipelineParam("Name", value="Value")

@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file='/output.txt'):
    print(print_pipeline_param())
    print(output_file)

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(test_pipeline, __file__ + '.tar.gz')
The result of running this is:
{{pipelineparam:op=;name=Name;value=Value;type=;}}
{{pipelineparam:op=;name=output-file;value=;type=;}}
I should be getting '/output.txt' in the "value" field, but the only field populated is the name. This only happens when passing in parameters to the main pipeline function. This also happens when directly passing in a PipelineParam like so:
@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file=kfp.dsl.PipelineParam("Output File", value="/output.txt")):
    print(output_file)
Prints out: {{pipelineparam:op=;name=output-file;value=;type=;}}

Generating uuid and use it across Airflow DAG

I'm trying to create a dynamic Airflow DAG that has the following 2 tasks:
Task 1: Creates files with a generated UUID as part of their name
Task 2: Runs a check on those files
So I defined a variable FILE_UUID and set it as follows: str(uuid.uuid4()). I also created a constant file name:
MY_FILE = '{file_uuid}_file.csv'.format(file_uuid=FILE_UUID)
Then Task 1 is a BashOperator that gets MY_FILE as part of its command, and it creates a file successfully.
I can see the generated files include a specific UUID in the name.
Task 2, a PythonOperator that gets MY_FILE as op_args, fails: it can't access the file. The logs show that it tries to access files with a different UUID.
Why is my "constant" being evaluated separately on every task? Is there any way to prevent that from happening?
I'm using Airflow 1.10; my executor is LocalExecutor.
I tried setting the constant outside the "with DAG" block and inside it, and also tried working with macros, but then the PythonOperator just uses the macro strings literally rather than the values they hold.
You have to keep in mind that the DAG definition file is a sort of "configuration script", not an actual executable that runs your DAGs. The tasks are executed in completely different environments, most of the time not even on the same machine. Think of it like a configuration XML which sets up your tasks, which are then built and executed on some other machine in the cloud - but it's Python instead of XML.
In conclusion - your DAG code is Python, but it's not the one being executed in the runtime of your tasks. So if you generate a random uuid there, it will get evaluated at an unknown time and multiple times - for each task, on different machines.
To have it consistent across tasks you need to find another way, for example:
use XCom: the first task generates the uuid and writes it to XCom for all downstream tasks to pull.
anchor your uuid to something constant across your pipeline - a source, a date, or whatever (e.g. if it's a daily task, you can build your uuid from date parts, mixing in some dag/task specifics - whatever makes your uuid the same for all tasks but unique for each day); see the sketch after the example DAG below.
Example DAG using the first method (XCom):
from datetime import datetime
import uuid
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id='global_uuid',
         schedule_interval='@daily',
         start_date=...) as dag:

    generate_uuid = PythonOperator(
        task_id='generate_uuid',
        python_callable=lambda: str(uuid.uuid4())  # return value is pushed to XCom automatically
    )

    print_uuid1 = BashOperator(
        task_id='print1',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )

    print_uuid2 = BashOperator(
        task_id='print2',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )

    generate_uuid >> print_uuid1 >> print_uuid2
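For the second method, here is a minimal sketch (the helper name is illustrative, not part of Airflow): derive the UUID deterministically from values that are constant for a given run, such as the dag id and the execution date, so every task recomputes the same value on whatever machine it runs.

    import uuid

    def run_uuid(dag_id, ds):
        # uuid5 is deterministic: the same dag_id and execution date (ds, e.g. "2020-01-01")
        # always yield the same UUID, so every task can recompute it independently.
        return str(uuid.uuid5(uuid.NAMESPACE_OID, '{}-{}'.format(dag_id, ds)))

    # e.g. run_uuid('global_uuid', '2020-01-01') returns the same value in every task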

Azure data factory Lookup

I have a Lookup activity which reads data from a SQL table, and its output is passed to multiple Execute Pipeline tasks as parameters. The flow is as follows:
Lookup -> Exct Pipeline 1 -> Exct Pipeline 2 -> Exct Pipeline 3
This works fine for the first pipeline; however, the second Execute Pipeline fails with the following error.
> "The template validation failed: 'The inputs of template action 'Exct
> Pipeline 2' at line '1 and column '178987' cannot reference action
> 'Lookup'. Action 'Lookup' must either be in 'runAfter' path or within
> a scope action on the 'runAfter' path of action 'Execute Pipeline 3',
> or be a Trigger"
Another point to be noted is that the pipeline runs fine when triggered. It only fails when in debug.
Has anyone else seen this issue?