Easiest way to fail pipeline in Data Factory?

I have a Data Factory pipeline with an "If Condition" activity, and I want the pipeline to fail on a certain condition. What is the best way to achieve this? There is no Fail activity.

They've just added a "Fail Activity" in Data Factory and Synapse Analytics:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity
It's found in the "General" category in "Activities" in the Data Factory Studio.
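For reference, a minimal Fail activity definition looks roughly like the sketch below (the activity name, message text, and error code are placeholders, not values from the original question):

    {
        "name": "FailOnCondition",
        "type": "Fail",
        "typeProperties": {
            "message": "CountRecs was greater than 0, failing the pipeline.",
            "errorCode": "500"
        }
    }

Both message and errorCode also accept dynamic expressions, so the failure details can be computed at runtime.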

Update September 2021: There is now a Fail activity in ADF.
First of all, please vote for this feedback. This is a common need.
There are several workarounds, such as a Web activity that throws an error by trying to connect to an invalid URL like https://ThrowAnError, or a Stored Procedure activity that executes RAISERROR in Azure SQL Database. But that's as close as we can get to solving your problem currently.
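As a rough sketch of the first workaround (the activity name is a placeholder), a Web activity pointed at an unreachable URL fails at runtime and takes the pipeline down with it:

    {
        "name": "ThrowError",
        "type": "WebActivity",
        "typeProperties": {
            "url": "https://ThrowAnError",
            "method": "GET"
        }
    }

Place this inside the True branch of the If Condition so it only runs when the failure condition is met.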

Related

Get the name of a pipeline and its activities

I am building a pipeline in ADF, and I need to save the name of the pipeline and of the activities it uses to a database. How can I save this information?
You would get a better answer if you could be more specific about when/where you want to do that, i.e. the usage scenario. Without that, my best guess is that you can use PowerShell to obtain this information.
Specifically, you can use the cmdlet Get-AzDataFactoryV2Pipeline, as specified here: https://learn.microsoft.com/en-us/powershell/module/az.datafactory/get-azdatafactoryv2pipeline?view=azps-5.8.0
You can then use a Python script to parse those details and load them into the database; this can all be orchestrated with Azure DevOps pipelines. A sketch of the PowerShell step follows.
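For example, a minimal sketch (the resource group and factory names are placeholders):

    # List every pipeline in the factory, then print each pipeline's
    # name together with the names of the activities it contains.
    $pipelines = Get-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory"
    foreach ($p in $pipelines) {
        Write-Output "Pipeline: $($p.Name)"
        foreach ($a in $p.Activities) {
            Write-Output "  Activity: $($a.Name)"
        }
    }

From there the output can be written to the database however you prefer.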

How to fail Azure Data Factory pipeline based on IF Task

I have a pipeline built on Azure data Factory. It has:
a "LookUp" task that has an SQL query that returns a column [CountRecs]. This columns holds a value 0 or more.
an "if" task to check this returned value. I want to fail the pipeline when the value of [CountRecs]>0
Is this possible?
You could probably achieve this by running a Web activity when your If Condition is true ([CountRecs] > 0). The Web activity should call the REST API below to cancel the pipeline run, using the pipeline run ID (which you can get with the dynamic expression @pipeline().RunId).
Sample Dynamic Expression for Condition: @greater(activity('LookupTableRecordCount').output.firstRow.COUNTRECS, 0)
REST API to Cancel the Pipeline Run: POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/pipelineruns/{runId}/cancel?api-version=2018-06-01
MS Doc related to Rest API: ADF Pipeline Runs - Cancel
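A rough sketch of such a Web activity, assuming the factory's managed identity has been granted permission to cancel runs (all angle-bracket values are placeholders, and ADF requires a body property for POST calls):

    {
        "name": "CancelOwnRun",
        "type": "WebActivity",
        "typeProperties": {
            "url": "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<factoryName>/pipelineruns/@{pipeline().RunId}/cancel?api-version=2018-06-01",
            "method": "POST",
            "body": "{}",
            "authentication": {
                "type": "MSI",
                "resource": "https://management.azure.com/"
            }
        }
    }

Note that a cancelled run ends with status Cancelled rather than Failed, which may or may not matter for your monitoring.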
Another option is to give the Web activity an invalid URL. The Web activity will fail, which fails the If Condition activity, which in turn causes the pipeline to fail.
There is an existing feature request for this same requirement in the ADF user voice forum, raised by other ADF users. I recommend up-voting and/or commenting on that feedback to help raise the priority of the feature request.
ADF User voice feedback related to this requirement: https://feedback.azure.com/forums/270578-data-factory/suggestions/38143873-a-new-activity-for-cancelling-the-pipeline-executi
Hope this helps.
As a sort-of hack solution, you can create a "Set variable" activity whose expression performs a division by zero when a certain condition is met; the runtime error fails the activity and therefore the pipeline. I don't like it, but it works.
@string(
    div(
        1,
        if(
            greater(int(variables('date_diff')), 100),
            0,
            1
        )
    )
)

Triggering Kusto commands using 'ADX Command' activity in ADFv2 vs calling WebAPI on it

In ADFv2 (Azure Data Factory V2), if we need to trigger a command on an ADX (Azure Data Explorer) cluster, we have two choices:
Use the 'Azure Data Explorer Command' activity
Use the POST method provided by the 'Web' activity
Having figured out that both methods work, I would say that from a development/maintenance point of view the first method seems more slick and systematic, especially because it is an out-of-the-box feature for Kusto support in ADFv2. Is there any scenario where the Web activity method would be preferable or more performant? I am trying to figure out whether it's alright to simply use the ADX Command activity all the time to run any Kusto command from ADFv2, instead of ever using the Web activity.
It is indeed recommended to use the "Azure Data Explorer Command" activity:
That activity is more convenient, as you don't have to construct the HTTP request yourself.
The activity also takes care of a few things for you, such as:
If you are running an async command, polling the Operations table until your async command completes.
Logging.
Error handling.
In addition, you should take into consideration that the result format will be different between both cases, and that each activity has its own limits in terms of response size and timeout.
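For completeness, a minimal sketch of the Command activity's JSON (the activity, linked-service name, and the command itself are placeholders):

    {
        "name": "RunControlCommand",
        "type": "AzureDataExplorerCommand",
        "linkedServiceName": {
            "referenceName": "AzureDataExplorerLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "command": ".create table MyTable (Id: int, Name: string)",
            "commandTimeout": "00:10:00"
        }
    }

The equivalent Web activity version would require you to build the management endpoint URL, the authentication, and the request body yourself.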

How to force a pipeline's status to Failed

I'm using a Copy Data activity.
When there is a data error, I export the bad rows to a blob.
But in this case the pipeline's status is still Succeeded. I want to set it to Failed. Is this possible?
"When there is a data error."
It depends on what kind of error you mean here.
1. If you mean a common incompatibility or mismatch error, ADF has a built-in Copy activity feature named fault tolerance, which supports the three scenarios below:
Incompatibility between the source data type and the sink native type.
Mismatch in the number of columns between the source and the sink.
Primary key violation when writing to SQL Server / Azure SQL Database / Azure Cosmos DB.
If you configure to log the incompatible rows, you can find the log file at this path: https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-GUID].csv.
If you want to abort the job as soon as any error occurs, leave fault tolerance disabled, which is the default; a skip-and-log configuration is sketched after the link below.
Please see this case: Fault tolerance and log the incompatible rows in Azure Blob storage
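A rough sketch of the relevant Copy activity typeProperties with fault tolerance enabled (the source/sink types, linked-service name, and path are placeholders):

    "typeProperties": {
        "source": { "type": "SqlSource" },
        "sink": { "type": "SqlSink" },
        "enableSkipIncompatibleRow": true,
        "redirectIncompatibleRowSettings": {
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "path": "redirect/copy-errors"
        }
    }

With enableSkipIncompatibleRow set to false (or omitted), the first incompatible row fails the activity and therefore the pipeline.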
2. If you are talking about your own logic for the data error, e.g. some business rule, I'm afraid ADF can't detect that for you, though I think it's also a common requirement. However, you could follow this case (How to control data failures in Azure Data Factory Pipelines?) as a workaround. The main idea is to use a custom activity to divert the bad rows before the Copy activity executes. In the custom activity, you can upload the bad rows to Azure Blob Storage with the .NET SDK however you want.
Update:
Since you want to log all incompatible rows and force the job to fail at the same time, I'm afraid that cannot be implemented in the Copy activity directly.
However, one idea is to use an If Condition activity after the Copy activity to judge whether the output contains rowsSkipped. If it does, you know some rows were skipped and can inspect them in blob storage; see the expression sketch below.
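A minimal sketch of that condition, assuming fault tolerance is enabled so the Copy output includes rowsSkipped ('Copy data1' is a placeholder activity name):

    @greater(activity('Copy data1').output.rowsSkipped, 0)

When the expression evaluates to true, the True branch can run one of the failure workarounds discussed above (for example a Web activity with an invalid URL).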

How to control data failures in Azure Data Factory Pipelines?

I receive an error from time to time due to data in my source data set that is incompatible with my target data set. I would like to control the action the pipeline takes based on the error type, e.g. output or drop those particular rows, yet complete everything else. Is that possible? Furthermore, is it possible to get hold of the actual failing line(s) from Data Factory without accessing and searching the actual source data set in some simple way?
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
Thanks
I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all they do: just the structure. The orchestration tool can't transform or manipulate the data as part of the activity processing.
To answer your question directly: it's certainly possible. But you need to break out the C# and use ADF's extensibility features to deal with your bad rows before passing them to the final destination.
I suggest you expand your data factory to include a custom activity where you can build some lower level cleaning processes to divert the bad rows as described.
This is an approach we often take, as not all data is perfect (I wish) and ETL or ELT doesn't always work. I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF, because the service doesn't have its own compute or an SSIS-style data flow engine.
So...
In terms of how to do this: first, I recommend you check out this blog post on creating ADF custom activities. Link:
https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Then, within your C# class inherited from IDotNetActivity, do something like the below.
public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // ... resolve connection details from the linked services/datasets here.
    // YourSource and YourDestination are placeholders for the resolved paths.
    using (StreamReader vReader = new StreamReader(YourSource))
    using (StreamWriter vWriter = new StreamWriter(YourDestination))
    {
        while (!vReader.EndOfStream)
        {
            // Data transform logic: read a row, divert it if it's a bad row
            // (e.g. log it or write it to a separate output), otherwise
            // write it to the clean destination.
        }
    }

    // The framework expects a property bag back; it can be empty.
    return new Dictionary<string, string>();
}
You get the idea. Build your own SSIS data flow!
Then write out your clean rows as an output dataset, which can be the input for your next ADF activity, either with multiple pipelines or as chained activities within a single pipeline.
This is the only way you will get ADF to deal with your bad data in the current service offerings.
Hope this helps