Azure Data Factory Lookup

I have a Lookup activity which reads data from a SQL table, and its output is passed to multiple Execute Pipeline activities as parameters. The flow is as follows:
Lookup -> Exct Pipeline 1 -> Exct Pipeline 2 -> Exct Pipeline 3
This works fine for the first pipeline; however, the second Execute Pipeline fails with the following error.
> "The template validation failed: 'The inputs of template action 'Exct
> Pipeline 2' at line '1' and column '178987' cannot reference action
> 'Lookup'. Action 'Lookup' must either be in 'runAfter' path or within
> a scope action on the 'runAfter' path of action 'Execute Pipeline 3',
> or be a Trigger"
Another point to note is that the pipeline runs fine when triggered. It only fails in debug.
Has anyone else seen this issue?

Related

Pipeline Dependencies in Data Fusion

I have three pipelines in Data Fusion, say A, B and C. I want Pipeline C to be triggered after both Pipeline A and Pipeline B complete. Pipeline triggers put the dependency on one pipeline only.
Can this be implemented in Data Fusion?
You can do it using Google Cloud Composer [1]. To perform this action, first of all you need to create a new environment in Google Cloud Composer [2]; once done, you need to install a new Python package in your environment [3], and the package you will need to install is [4] "apache-airflow-backport-providers-google".
With this package installed you will be able to use these operators [5]; the one you will need is [6] "Start a DataFusion pipeline", and this way you will be able to start a pipeline from Airflow.
An example of the Python code would be as follows:
import airflow
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)
from datetime import timedelta

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with models.DAG(
        'composer_DF',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:
    # One task per Data Fusion pipeline.
    A = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="A",
        instance_name="instance_name", task_id="start_pipelineA",
    )
    B = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="B",
        instance_name="instance_name", task_id="start_pipelineB",
    )
    C = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="C",
        instance_name="instance_name", task_id="start_pipelineC",
    )

    # First A, then B, and then C.
    A >> B >> C
You can set the scheduling intervals by checking the Airflow documentation.
Once you have this code saved as a .py file, save it to the Google Cloud Storage DAG folder of your environment.
When the DAG starts, it will execute task A, and when it finishes it will execute task B, and so on.
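As a side note, since the question asks for C to start only after both A and B complete, Airflow can also express that fan-in directly: assuming the same three tasks, the dependency line [A, B] >> C runs A and B in parallel and starts C only once both have finished.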
[1] https://cloud.google.com/composer
[2] https://cloud.google.com/composer/docs/how-to/managing/creating
[3] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
[4] https://pypi.org/project/apache-airflow-backport-providers-google/
[5] https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/cloud/operators/datafusion/index.html
[6] https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/datafusion.html#start-a-datafusion-pipeline
There is no direct way I could think of, but here are two workarounds.
Workaround 1: Merge pipeline A and pipeline B into a single pipeline AB, then trigger pipeline C (AB > C).
Pipeline A - (GCS Copy > Decompress)
Pipeline B - (GCS2 > thrashsad)
Add a BigQueryExecute to mitigate the error "Invalid DAG. There is an island made up of stages..": in the BigQueryExecute, use a valid dummy query.
Merging the two pipelines into one may make pipeline testing harder. To overcome this, you can add a dummy condition so that only one branch runs at a time:
In the BigQueryExecute, change the query to 'Select ${flag}' and pass the value of flag as a runtime argument, or use 'Select 1 as flag' and set "Row As Arguments" to true.
Add a Condition plugin after the BigQueryExecute with the condition runtime['flag'] = 1.
The Condition plugin has two outlets; connect them to pipeline A and pipeline B.
Workaround 2: Store a completion flag for both pipelines (A & B) in a BigQuery table, and create two flows, A > C and B > C, to trigger pipeline C. This triggers pipeline C twice, but with a BigQueryExecute and a Condition plugin the rest of pipeline C will run only when both flags are present in the BigQuery table.
How?
In pipelines A & B, write an output row to the BigQuery table 'Pipeline_Run'.
In pipeline C, add a BigQueryExecute with the query 'select count(*) as Cnt from ds.Pipeline_Run' and set "Row As Arguments" to true.
In pipeline C, add a Condition plugin, check whether the value of cnt is 2 (runtime['cnt'] = 2), and connect the rest of the pipeline's plugins to its "Yes" outlet.
You can also explore "schedules" set through the CDAP REST APIs. That allows parallel execution of pipelines, and there is no dependency on Cloud Composer (except for a file-based trigger of the first pipeline in the workflow; for that you would need a Cloud Function or perhaps a Cloud Composer file sensor).
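As a rough PowerShell sketch of that approach (assumptions: the Data Fusion instance's API endpoint is stored in CDAP_ENDPOINT, the pipeline app is named "C", and batch pipelines run as the DataPipelineWorkflow workflow; check the CDAP Lifecycle REST reference for the exact paths):
# Assumptions: CDAP_ENDPOINT holds the Data Fusion instance's API endpoint;
# the pipeline app is named "C"; batch pipelines run as "DataPipelineWorkflow".
$token = (& gcloud auth print-access-token)
$headers = @{ Authorization = "Bearer $token" }
# List the schedules attached to pipeline C's workflow.
Invoke-RestMethod -Method Get -Headers $headers `
  -Uri "$env:CDAP_ENDPOINT/v3/namespaces/default/apps/C/workflows/DataPipelineWorkflow/schedules"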

How to refer to a previous task and stop the build in Azure DevOps if there is no new data to publish an artifact

Getstatus.exe will output "new data available" or "no new data available"; if new data is available then the next jobs should be executed, otherwise nothing should be executed. How should I do it? (I am working in the classic editor.)
Example: I have a set of tasks, consider 4 tasks:
task 1: builds the solution
task 2: runs Getstatus.exe, which gets the status of data available or no data available
task 3: I should be able to use the above task's result in a condition (or some API query) and proceed to publishing an artifact if data is available; if not, cleanly break out and stop the build. It shouldn't proceed to publish the artifact or move to the next task.
task 4: publishes the artifact
First what you need is to set a variable in the task where you run Getstatus.exe, and then set a condition on the next tasks. If you set doThing to a value other than Yes, the subsequent tasks will be skipped.
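A minimal sketch of that setup (the variable name doThing is just an example): in the PowerShell task that runs Getstatus.exe, emit a logging command to set the variable:
Write-Host "##vso[task.setvariable variable=doThing]Yes"
Then, in the Custom condition of each following task, reference it:
and(succeeded(), eq(variables['doThing'], 'Yes'))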
Since we need to execute different tasks based on the different results of running Getstatus.exe, we need to set the condition based on the result of running Getstatus.exe.
To resolve this, just like Krzysztof Madej said, we could set variable(s) based on the return value of Getstatus.exe in an inline PowerShell task:
# Capture the output of Getstatus.exe (the invocation path is illustrative).
$dataAvailable = & "$(Build.SourcesDirectory)\Getstatus.exe"
if ($dataAvailable -eq "True")
{
    Write-Host ("##vso[task.setvariable variable=Status]Yes")
}
elseif ($dataAvailable -eq "False")
{
    Write-Host ("##vso[task.setvariable variable=Status]No")
}
Then set the corresponding condition on the next tasks, for example:
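A custom condition on the publish-artifact task could be and(succeeded(), eq(variables['Status'], 'Yes')), where Status is the variable set by the script above; the artifact is then published only when new data is available.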
You could check the document Specify conditions for more details.

Azure Data Factory pipeline execution status

It is kind of annoying that we cannot change the logical order (AND/OR) of the activity dependencies. However, I have another issue: I have on-failure activities to log the error messages in a DB, and since the logging activity succeeds, the entire pipeline succeeds too! Is there any workaround so that if any activity fails, the entire pipeline, and the parent pipeline if it is called from another pipeline, fails as well?
In my screenshot, I have selected the on-completion dependencies to log the success or error.
I see that you defined "On Success" of the copy activity to run "usp_postexecution". Please define an "On Failure" of the copy activity too and add any activity (maybe a Set Variable for testing), then execute the pipeline. The pipeline will fail.
Just to give you more context on what I tried:
I have a variable named "test" of type Boolean, and I am failing it deliberately (by assigning it the non-Boolean value true1).
The pipeline will fail when I define both the success and failure scenarios.
The pipeline will succeed when you have only "Failure" defined.

Azure Data Factory - Event based triggers on multiple files/blobs

I am invoking an ADF V2 pipeline via an event-based trigger when new files/blobs are created in a folder within a blob container.
Blob Container structure:
BlobContainer
  -> FolderName
    -> File1.csv
    -> File2.csv
    -> File3.csv
I've created the trigger with below configuration:
Container Name: BlobContainer
Blob path begins with: FolderName/
Blob path ends with: .csv
Event checked: Blob Created
Problem: Three CSV files are created in the folder on an ad hoc basis. The trigger that invokes the pipeline runs 3 times (probably because 3 blobs are created). The pipeline actually moves the files to another blob container, so the first trigger run succeeds and the remaining 2 fail because the files have already been moved. How can I configure the trigger so that it runs only once per folder even though 3 files are created within it?
Because the files are generated together, I am required to move them together to a new location using ADF.
Your blob event trigger starts the pipeline once per file. To handle this, you can use a Lookup activity to get the file names, then a Filter activity to keep only the required file names; the Filter activity returns a FilteredItemsCount attribute that can be checked in an If Condition activity. When no file is left, FilteredItemsCount returns 0 and the rest of your pipeline does not run.
Summary:
Lookup Activity -> Filter Activity -> If Activity -> Your Pipeline
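For instance, assuming the Filter activity is named FilterFiles, the If Condition expression could be @greater(activity('FilterFiles').output.FilteredItemsCount, 0), so the move logic runs only when matching files remain; the activity name here is illustrative.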

Pass content from build back into Visual Studio Team Services Build

I am running a build on Azure with a custom build agent (using Unity3d, to be precise). I generate output of the build within a .txt file on the build machine and would like to include its content in work items created during the build.
Example:
Unity build fails and an error is logged to Build.log.
A new bug is created with a reference to the build and the error message from the logfile.
Right now I am using a PowerShell script:
$content = [IO.File]::ReadAllText("$(Build.Repository.LocalPath)\BuildProjectPC\Build.log")
Write-Host "##vso[task.setvariable variable=logContent;]$content"
To format the bug I use System.Description = $logContent, but the content of the variable from PS for some reason does not end up in the bug item (it just contains "$logContent").
Any idea or direction how to fix this, or how to feed info back into VSTS?
The variable value used for creating the work item is initialized before the build steps run, so you can't specify a dynamic variable or change the variable's value during a build step and have it used for creating the work item.
You can follow these steps to verify it:
Open your build definition > Variables > Add a new variable (e.g. logContent, value: oldValue)
Select Options > check Work Item on Failure > Additional Fields > System.Title $(logContent)
Add a PowerShell build step: Write-Host "$(logContent)"
Add a PowerShell build step: Write-Host "##vso[task.setvariable variable=logContent;]newValue"
Add a PowerShell build step: Write-Host "$(logContent)"
Add a PowerShell build step: [Environment]::Exit(1)
The log result:
Step 3 outputs oldValue
Step 5 outputs newValue
The created work item's title is oldValue.
The workaround for your requirement is to create a work item and associate it to the build through PowerShell with the REST API (add a PowerShell step at the end of the other steps and check the Always run option).
Associate work item to build:
Request Type: Patch
Request URL: https://[your vsts account].visualstudio.com/DefaultCollection/_apis/wit/workitems/[work item id]?api-version=1.0
Content-Type:application/json-patch+json
Body:
[
  {
    "op": "add",
    "path": "/relations/-",
    "value": {
      "rel": "ArtifactLink",
      "url": "vstfs:///Build/Build/[build id]",
      "attributes": {
        "comment": "Making a new link for the dependency"
      }
    }
  }
]
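A hedged PowerShell sketch of that request (the account, work item id, build id and personal access token are placeholders to substitute):
# Placeholders: [your vsts account], [work item id], [build id], and a PAT.
$pat = "[personal access token]"
$auth = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$pat"))
$headers = @{ Authorization = "Basic $auth" }
$uri = "https://[your vsts account].visualstudio.com/DefaultCollection/_apis/wit/workitems/[work item id]?api-version=1.0"
$body = '[{"op": "add", "path": "/relations/-", "value": {"rel": "ArtifactLink", "url": "vstfs:///Build/Build/[build id]", "attributes": {"comment": "Making a new link for the dependency"}}}]'
# Send the JSON Patch document that links the work item to the build.
Invoke-RestMethod -Method Patch -Uri $uri -Headers $headers -ContentType "application/json-patch+json" -Body $body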
You can refer to this blog for PowerShell scripts to create and associate work items: Build association with work Items in vNext
In the Additional Fields you need to reference the build/release definition variable with the following notation: $(logContent)