Automatically map contents of REST JSON body as flat table in Data Flow - azure-data-factory

With the Copy Data activity it is possible to retrieve data from a REST call (an array of flat JSON objects, similar to OData) and copy the contents to a flat table, keeping the data types from the source, but without the need to set the schema for that specific data.
When I try to recreate this with a Data Flow, I can't get it to work. When I check the Data Preview of my source, I get a hierarchy with a body (containing my OData-like data) and a header. And if I send that to my sink (Avro), it is saved in this same hierarchical structure (including the header). I know I can fix this manually with a Select transformation (body.column1, body.column2, etc.), but I want to make my Data Flow dynamic so I can use it with multiple tables/endpoints.
So I receive it like this with my REST source:
[screenshot: hierarchical structure with body and header]
And I want it to be like this at my sink, without hardcoding my schema:
[screenshot: flat table of the body columns]
The only workaround I can come up with is retrieving the data using Copy Data, putting it somewhere temporarily, and then using my data flow to further transform the data. Is there an easier way to do this? I cannot imagine that I'm the only one who has this issue.
Hopefully it's clear and somebody is able to help. Thank you very much in advance.

The data flow projection will get the schema from the API, including both the body and the header. Hence, when you use auto mapping, everything gets saved.
Below are the workarounds you can consider:
1. As you mentioned, use Copy Data first and then a data flow to further transform.
2. Use Select or Derived Column transformations to shape your data and map all column names before the sink. Here you can use the column pattern matching syntax, so that one condition can transform multiple columns (see the sketch after the link below).
Check the link below to learn about column pattern mappings.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern
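For example, in a Select transformation you can switch to a rule-based mapping so one rule covers every column under the body struct. A rough sketch only; the exact expressions depend on your projection, and body/header are the field names from the example above:

    Hierarchy level:      body
    Matching condition:   true()
    Output column name:   $$

$$ resolves to each matched column's name at run time, so every field under body is mapped through under its own name, with no hardcoded schema.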

Related

Talend - Insert data into DB using Django REST API

I am trying to utilise Django REST APIs to insert data into the database, instead of writing to it directly. I've been able to read JSON data using the tRESTClient component, but I am not too sure about the insertion/POST. Could someone point me to the components (and relations) that I should use?
The current job that I have is mostly:
Read data from raw file -> tMap -> DB
and I wish to do something like:
Read data from raw file -> tMap -> (pass on data to REST endpoint via POST)
I used the tRESTClient component after my tMap and I could see records getting inserted into the DB, but all of them are without any data. Strangely, nowhere was I asked to specify the JSON tree. The number of records getting inserted is equal to the number of rows being read from the raw file, so at least something is right. But I couldn't locate the menu/options to specify which data element read from the raw file should map to which JSON element.
How do I specify the data to JSON mapping?
PS: I realise that this might not be the most efficient way to ingest data but that's what the business wants since it brings in an additional layer of control.

How to control data failures in Azure Data Factory Pipelines?

I receive an error from time to time due to incompatible data in my source data set compared to my target data set. I would like to control the action the pipeline takes based on the error type, e.g. output or drop those particular rows while completing everything else. Is that possible? Furthermore, is it possible to get hold of the actual failing line(s) from Data Factory in some simple way, without accessing and searching the actual source data set?
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
Thanks
I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all: just the structure. The orchestration tool can't do anything to transform or manipulate the data as part of the activity processing.
To answer your question directly: it's certainly possible, but you need to break out the C# and use ADF's extensibility functionality to deal with your bad rows before passing them to the final destination.
I suggest you expand your data factory to include a custom activity where you can build some lower-level cleaning processes to divert the bad rows as described.
This is an approach we often take, as not all data is perfect (I wish) and plain ETL or ELT doesn't work. I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF, because this service doesn't have its own compute or an SSIS-style data flow engine.
So...
In terms of how to do this: first, I recommend you check out this blog post on creating ADF custom activities:
https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Then, within your C# class inherited from IDotNetActivity, do something like the below.
// Required references for an ADF v1 custom activity:
using System.Collections.Generic;
using System.IO;
using Microsoft.Azure.Management.DataFactories.Models;   // LinkedService, Dataset, Activity
using Microsoft.Azure.Management.DataFactories.Runtime;  // IDotNetActivity, IActivityLogger

public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // etc: resolve YourSource and YourDestination from the input/output datasets

    using (StreamReader vReader = new StreamReader(YourSource))
    {
        using (StreamWriter vWriter = new StreamWriter(YourDestination))
        {
            while (!vReader.EndOfStream)
            {
                string vLine = vReader.ReadLine();
                // data transform logic: if bad row, divert/log it; else write it out
                vWriter.WriteLine(vLine);
            }
        }
    }

    // the returned dictionary can pass properties to downstream activities; empty is fine
    return new Dictionary<string, string>();
}
You get the idea. Build your own SSIS data flow!
Then write out your clean row as an output dataset, which can be the input for your next ADF activity. Either with multiple pipelines, or as chained activities within a single pipeline.
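For illustration, a stripped-down sketch of what the custom activity could look like in the ADF v1 pipeline JSON. All names here are placeholders for your own datasets, linked services and assembly; see the linked blog post for the full pipeline definition:

    {
        "name": "CleanBadRowsActivity",
        "type": "DotNetActivity",
        "inputs": [ { "name": "RawInputDataset" } ],
        "outputs": [ { "name": "CleanOutputDataset" } ],
        "linkedServiceName": "AzureBatchLinkedService",
        "typeProperties": {
            "assemblyName": "RowCleaner.dll",
            "entryPoint": "RowCleaner.CleanActivity",
            "packageLinkedService": "StorageLinkedService",
            "packageFile": "customactivities/RowCleaner.zip"
        }
    }

CleanOutputDataset is then what the next activity in the chain consumes as its input.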
This is the only way you will get ADF to deal with your bad data in the current service offerings.
Hope this helps

Using visjs manipulation to create workflow dependencies

We are currently using visjs version 3 to map the dependencies of our custom built workflow engine. This has been WONDERFUL because it helps us to visualize the flow and find invalid or missing dependencies. What we want to do next is simplify the process of building the dependencies using the visjs manipulation feature. The idea would be that we would display a large group of nodes and allow the user to order them correctly. We then want to be able to submit that json structure back to the server for processing.
Would this be possible?
Yes, this is possible.
Vis.js dispatches various events relating to user interactions with the graph (e.g. manipulations or position changes), for which you can add handlers that modify or store the data on change. If you use DataSets to store the nodes and edges of your network, you can always call the DataSets' get() function in your handler to retrieve all elements in JSON format. Then just use an ajax request in your handler to transmit the JSON to your server, storing the entire graph in your DB or saving the JSON as a file.
The opposite applies for loading the graph: simply query the JSON from your server and inject it into the node and edge DataSets using the set() method.
You can also store the network's current options using the network's getOptions() method, which returns all applied options as JSON.
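A minimal sketch of both directions (plain JavaScript; the endpoint URL and the use of fetch are my own assumptions, any ajax call works the same way):

    var nodes = new vis.DataSet([]);
    var edges = new vis.DataSet([]);
    var container = document.getElementById('network');
    var network = new vis.Network(container, { nodes: nodes, edges: edges }, {});

    function saveGraph() {
        // get() returns plain arrays that serialize straight to JSON
        var payload = { nodes: nodes.get(), edges: edges.get() };
        fetch('/api/workflow/dependencies', {   // hypothetical server endpoint
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(payload)
        });
    }

    function loadGraph(data) {
        // the opposite direction: inject server JSON back into the DataSets
        nodes.set(data.nodes);
        edges.set(data.edges);
    }

You could call saveGraph() from a save button or from a network event handler, whichever fits your manipulation workflow.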

Which way should I create a list of objects when sending POST request

When I create several objects of the same type and save them to the database, should I send a list of those objects in one request, or should I send each one individually?
For example, for a todo list I could create multiple todos and then click save to send the whole list at once, or I could save each todo directly when I finish editing it.
The first way saves on the number of requests: only one request is needed to create many objects. But is the first way RESTful? All the information I can find about creation in REST covers creating a single object; on the other hand, will there be problems if the number of requests increases?
----Edit
Thank you guys for answering me.
For a more specific use case: I am using Django REST Framework. I created a Todo model and a corresponding serializer. I am wondering how I can create a list of Todos. I tried to send a list of Todos to the serializer, expecting it to automatically loop through them the same way it does when serializing a list of instances, but that doesn't work. I know I could write a loop that calls the create method each time, but is there a better way to do it?
There is nothing in REST that tells you what kind of payload you are allowed to use. You can POST/PUT whatever you want: one entity representation or many representations, in lists, dictionaries, XML, URL-encoded key/values or JSON, whatever suits your use case best.
In your case you might even want to send a delta/diff list of changes from the client. Let's say, for instance, your client loads three existing todo items. Then the user modifies one of them, deletes another, and adds a new one. You can either do that in three requests or in one single request with add/modify/delete operations encoded in it. Both ways are valid, and the best solution depends on your use case and constraints like bandwidth, processing power and network round-trip time.
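On the Django REST Framework part of your edit: serializers accept a many=True flag, which wraps them in a ListSerializer that does the per-item loop (including create()) for you. A minimal sketch, assuming DRF 3.x and the Todo serializer from your edit:

    # views.py - bulk create sketch; TodoSerializer is assumed from the question
    from rest_framework import status
    from rest_framework.response import Response
    from rest_framework.views import APIView

    from .serializers import TodoSerializer  # your existing serializer

    class TodoBulkCreateView(APIView):
        def post(self, request):
            # many=True: validates each item in the list and creates one object per item
            serializer = TodoSerializer(data=request.data, many=True)
            serializer.is_valid(raise_exception=True)
            serializer.save()
            return Response(serializer.data, status=status.HTTP_201_CREATED)

The client then POSTs a JSON array of todo objects in a single request.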

Breeze column based security

I have a "web forms", "database first entity" project using Breeze. I have a "People" table that includes sensitive data (e.g. SSN#). At the moment I have an IQueryable Web API for GetPeople.
The current page I'm working on is a "Manage people" screen, but it is not meant for editing or viewing SSN#s. I think I know how to use BeforeSaveEntity to make sure that the user won't be able to save SSN changes, but is there any way to not pass the SSN#s to the client?
Note: I'd prefer to use only one EDMX file. Right now the only way I can see to accomplish this is to have a "View" in the database for each set of data I want to pass to the client that is not an exact match of the table.
You can also use JSON.NET serialization attributes to suppress serialization of the SSN from the server to the client. See the JSON.NET documentation on serialization attributes.
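For example, a sketch of what this looks like on the entity (property names assumed from the question; since a database-first EDMX generates the entity classes, in practice you would put the attribute on a metadata "buddy" class or similar so it survives regeneration):

    using Newtonsoft.Json;

    public class Person
    {
        public int Id { get; set; }
        public string Name { get; set; }

        [JsonIgnore]   // suppresses this property when serializing to the client
        public string SSN { get; set; }
    }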
Separate your tables. (For now, this is the only solution that comes to mind.)
Put your SSN data in another table with a related key (a 1-to-1 relation) and the problem will be solved. (Just handle your save in case you need it.)
If you are using Breeze this will work, because you have almost no control over Breeze API interaction after the user logs in, so it is safer to separate your data. (Breeze is usually great, but in this case it's harmful.)