How to control data failures in Azure Data Factory Pipelines?

I receive an error from time to time due to incompatible data in my source data set compared to my target data set. I would like to control the action the pipeline takes based on the error type, maybe outputting or dropping those particular rows while completing everything else. Is that possible? Furthermore, is it possible to get hold of the actual failing line(s) from Data Factory in some simple way, without accessing and searching the actual source data set?
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
Thanks

I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all it gets: just the structure. As an orchestration tool, ADF can't transform or manipulate the data as part of the activity processing.
To answer your question directly, it's certainly possible. But you need to break out the C# and use ADF's extensibility functionality to deal with your bad rows before passing the data to the final destination.
I suggest you expand your data factory to include a custom activity where you can build some lower-level cleaning processes to divert the bad rows as described.
This is an approach we often take, as not all data is perfect (I wish) and plain ETL or ELT doesn't cut it. I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF because the service doesn't have its own compute or an SSIS-style data flow engine.
So...
In terms of how to do this: first, I recommend you check out this blog post on creating ADF custom activities:
https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Then, within your C# class that implements IDotNetActivity, do something like the below.
public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // Resolve your real input and output locations from 'datasets' and
    // 'linkedServices' here; YourSource and YourDestination are placeholders.
    using (StreamReader vReader = new StreamReader(YourSource))
    {
        using (StreamWriter vWriter = new StreamWriter(YourDestination))
        {
            while (!vReader.EndOfStream)
            {
                string vLine = vReader.ReadLine();

                // Data transform logic. Example check only: divert rows whose
                // 'Timestamp' column (assumed here to be the first field)
                // can't be converted, as in the error in the question.
                string[] vColumns = vLine.Split(',');
                DateTimeOffset vTimestamp;
                if (!DateTimeOffset.TryParse(vColumns[0], out vTimestamp))
                {
                    logger.Write("Diverting bad row: {0}", vLine);
                    continue; // or write it to a separate 'bad rows' output
                }

                vWriter.WriteLine(vLine);
            }
        }
    }

    return new Dictionary<string, string>();
}
You get the idea. Build your own SSIS data flow!
Then write out your clean rows as an output dataset, which can be the input for your next ADF activity - either with multiple pipelines, or as chained activities within a single pipeline.
This is the only way you will get ADF to deal with your bad data in the current service offerings.
Hope this helps

Related

Automatically map contents of REST JSON body as flat table in Data Flow

With the Copy Data activity it is possible to retrieve data from a REST call (an array of flat JSON objects, similar to OData) and copy the contents to a flat table, keeping the data types from the source but without the necessity to set the schema for that specific data.
When I try to recreate this with Data Flow, I can't get it to work. When I check the Data Preview of my source, I get a hierarchy with a body (with my OData-like data) and a header. And if I send that to my sink (Avro), it will be saved in this same hierarchical structure (including the header). I know I can fix this manually by using a Select operation (body.column1, body.column2, etc.), but I want to make my Data Flow dynamic so I'm able to use it with multiple tables/endpoints.
So I receive it like this with my REST source:
link
And I want it to be like this at my Sink without hardcoding my schema:
link
The only workaround I can come up with is retrieving the data using Copy Data, putting it somewhere temporarily, and then using my data flow to further transform the data. Is there an easier way to do this? I cannot imagine that I'm the only one who has this issue.
Hopefully it's clear and somebody is able to help. Thank you very much in advance.
The data flow projection will get the schema from the API, including both the body and the header. Hence, when you use auto mapping, everything is going to be saved.
Below are workarounds you can consider:
1. As you mentioned, use Copy Data first and then a data flow to further transform the data.
2. Use Select or Derived Column transformations to shape your data and get all the column names, and then finally use the sink. For this you can use column pattern matching syntax, so that one condition can match multiple columns to transform (a rough example follows after the link below).
Check the link below to learn about column pattern mappings.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern
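As a rough illustration of the second workaround only (the exact expressions depend on your projection, so treat this as a sketch of the idea rather than working syntax): in the Select transformation you would add a rule-based mapping whose matching condition picks up every column under body, and whose output name uses $$ so that each matched column keeps its own name. That effectively strips the body/header wrapper without hardcoding any column names.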

How to force a Pipeline's status to Failed

I'm using Copy Data.
When there is some data error, I export those rows to a blob.
But in this case, the pipeline's status is still Succeeded. I want to set it to Failed instead. Is that possible?
When there is some data error.
It depends on what kind of error you mean here.
1. If you mean a common incompatibility or mismatch error, ADF has a built-in feature named fault tolerance in the Copy Activity, which supports the 3 scenarios below:
Incompatibility between the source data type and the sink native type.
Mismatch in the number of columns between the source and the sink.
Primary key violation when writing to SQL Server/Azure SQL Database/Azure Cosmos DB.
If you configure to log the incompatible rows, you can find the log file at this path: https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-GUID].csv.
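For reference (property names quoted from memory, so double-check against the current docs), in the copy activity's JSON these options correspond to something like enableSkipIncompatibleRow set to true plus a redirectIncompatibleRowSettings block that points at the storage linked service and path used for that log file.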
If you want to abort the job as soon as any error occurs, you could instead configure the copy activity's fault tolerance setting to abort the activity on the first incompatible row (the default behaviour).
Please see this case: Fault tolerance and log the incompatible rows in Azure Blob storage
2. If you are talking about your own logic for the data error, maybe some business logic, I'm afraid ADF can't detect that for you, though I think it's also a common requirement. However, you could follow this case (How to control data failures in Azure Data Factory Pipelines?) for a workaround. The main idea is to use a custom activity to divert the bad rows before the copy activity executes. In the custom activity, you could upload the bad rows to Azure Blob Storage with the .NET SDK as you see fit.
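As a rough sketch only - assuming the Azure.Storage.Blobs package, with the connection string, container name and local file path as placeholders rather than anything ADF hands you - the upload step inside such a custom activity could look like this:
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class BadRowUploader
{
    // Hypothetical helper: pushes a local file of rejected rows to a blob
    // container so they can be inspected later. The connection string and
    // container name below are placeholders.
    public static async Task UploadBadRowsAsync(string localFilePath)
    {
        var container = new BlobContainerClient(
            "<your-storage-connection-string>", "bad-rows");
        await container.CreateIfNotExistsAsync();

        // One blob per run, e.g. rejected-20190101120000.csv
        var blob = container.GetBlobClient(
            "rejected-" + DateTime.UtcNow.ToString("yyyyMMddHHmmss") + ".csv");

        using (FileStream vStream = File.OpenRead(localFilePath))
        {
            await blob.UploadAsync(vStream, overwrite: true);
        }
    }
}
You could call something like this at the point in your row-cleaning loop where a row is rejected, or once at the end with the accumulated bad rows.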
Update:
Since you want to log all incompatible rows and force the job to fail at the same time, I'm afraid that can't be implemented in the copy activity directly.
However, I came up with an idea: you could use an If Condition activity after the Copy Activity to check whether the output contains rowsSkipped. If so, have the condition evaluate to False; then you will know there is some skipped data and you can check it in the blob storage.
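For example (the activity name is just a placeholder, and rowsSkipped only appears in the output when fault tolerance actually skipped rows), the If Condition expression could be something along the lines of @greater(coalesce(activity('Copy data1').output.rowsSkipped, 0), 0).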

Azure Copy Activity Rest Results Unexpected

I'm attempting to pull data from the Square Connect v1 API using ADF. I'm utilizing a Copy Activity with a REST source. I am successfully pulling back data, however, the results are unexpected.
The endpoint is /v1/{location_id}/payments. I have three parameters, shown below.
I can successfully pull this data via Postman.
The results are stored in a Blob and are as if I did not specify any parameters whatsoever.
Only when I hardcode the parameters into the relative path do I get correct results.
I feel I must be missing a setting somewhere, but which one?
You can try setting the values you want in a Set Variable activity, and then have your Copy activity reference those variables. This will tell you whether it is an issue with the dynamic content or not; I have run into some unexpected behavior myself. The benefit of the intermediate Set Variable activity is twofold: firstly it coerces the data type, and secondly it lets you see what the value actually is.
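For example (the variable name is purely illustrative): define a pipeline variable such as locationId, assign it in the Set Variable activity, and then build the relative path with an expression like @concat('/v1/', variables('locationId'), '/payments'), or reference @variables('locationId') wherever the parameter is needed.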
My apologies for not using comments. I do not yet have enough points to comment.

How to update data with GraphQL

I am studying GraphQL.
I can retrieve data from my Mongo database with queries, and I can create data with mutations.
But how can I modify existing data?
I am a bit lost here...
Do I have to create a new mutation?
Yes, every mutation describes a specific action that can be done to a bit of data. GraphQL is not like REST - it doesn't specify any standard CRUD-type actions.
When you are writing a mutation to update some data, you have two options. Let's explain them in the context of a todo item that has a completed status, and a text field:
1. Write mutations that represent semantic actions - markTodoCompleted, updateTodoText, etc.
2. Write a generic mutation that just sets any properties passed to it; you could call it updateTodo.
I prefer the first approach, because it makes it more clear what the client is doing when it calls a certain mutation. In the second approach, you need to be careful to validate the values to be set to make sure someone can't set some invalid combination.
In short, you need to define your own mutations to update data.
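To make the two options concrete (these field definitions are only illustrations built from the todo example above, not a required shape): the first approach might give you mutation fields like markTodoCompleted(id: ID!): Todo and updateTodoText(id: ID!, text: String!): Todo, while the second collapses them into a single updateTodo(id: ID!, completed: Boolean, text: String): Todo that updates whichever arguments were supplied.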

Using visjs manipulation to create workflow dependencies

We are currently using visjs version 3 to map the dependencies of our custom built workflow engine. This has been WONDERFUL because it helps us to visualize the flow and find invalid or missing dependencies. What we want to do next is simplify the process of building the dependencies using the visjs manipulation feature. The idea would be that we would display a large group of nodes and allow the user to order them correctly. We then want to be able to submit that json structure back to the server for processing.
Would this be possible?
Yes, this is possible.
Vis.js dispatches various events that relate to user interactions with the graph (e.g. manipulations or position changes), for which you can add handlers that modify or store the data on change. If you use DataSets to store the nodes and edges in your network, you can always use the DataSets' get() function to retrieve all elements in your handler in JSON format. Then, in your handler, just use an ajax request to transmit the JSON to your server, to store the entire graph in your DB or to save the JSON as a file.
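As a purely illustrative example of the saving side: in a handler for an event such as dragEnd, you could build a payload like JSON.stringify({ nodes: nodes.get(), edges: edges.get() }) and send it to your server with an ajax POST.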
The opposite applies for loading the graph: simply query the JSON from your server and inject it into the node and edge DataSets using the set method.
You can also store the network's current options using the network's getOptions method, which returns all applied options as JSON.