Modifying JSON in Azure Data Factory to give a nested list - azure-data-factory

I'm retrieving data from SQL Server in Azure Data Factory.
The API I'm passing it to requires the JSON in a specific format.
I have been unable to get the data into the required format so far, trying FOR JSON output in T-SQL.
Is there a way to do this in Data Factory with the data it retrieved from SQL Server?
SQL Server Output
EntityName   CustomField1   CustomField2
----------------------------------------
AA01         NGO21          2022-01-01
AA02         BS34           2022-03-01
How it appears in Data Factory
[
{"EntityName": "AA01", "CustomField1": "NGO21", "CustomField2":"2022-01-01"},
{"EntityName": "AA02", "CustomField1": "BS32", "CustomField2":"2022-03-01"}
]
Required output
[
{
"EntityName": "AA01"
"OtherFields":[{"customFieldName": "custom field 1, "customFieldValue": "NGO21"},{"customFieldName": "custom field 2", "customFieldValue": "2022-01-01"} ]
},
{
"EntityName": "AA02"
"OtherFields":[{"customFieldName": "CustomField1", "customFieldValue": "BS34"},{ "CustomFieldName":"CustomField2", "customFieldValue" : "2022-03-01"}]
}
]

I managed to do it in ADF. The solution is quite long; I think that if you write a stored procedure it will be much easier.
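For reference, a minimal T-SQL sketch of that stored-procedure route, assuming a source table named dbo.Entities with the three columns from the sample above (names and types are assumptions, so adjust as needed):

-- nested FOR JSON: the inner subquery builds the OtherFields array,
-- the outer FOR JSON PATH wraps every row into the final JSON array
SELECT
    e.EntityName,
    (
        SELECT v.customFieldName, v.customFieldValue
        FROM (VALUES
                 ('CustomField1', CAST(e.CustomField1 AS nvarchar(100))),
                 ('CustomField2', CAST(e.CustomField2 AS nvarchar(100)))
             ) AS v (customFieldName, customFieldValue)
        FOR JSON PATH
    ) AS OtherFields
FROM dbo.Entities AS e
FOR JSON PATH;

Wrapped in a stored procedure, this should return the payload already in the requested shape, which a Lookup or Copy activity can then pass on to the API.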
That said, here is a quick demo that I built in ADF.
The idea is to build the JSON structure as requested and then use the collect() function to build the array.
We have two pieces to build: one for EntityName and one for the OtherFields array.
Prepare the data:
First, I added the column names into the corresponding rows; we will use this information later on to build our JSON. I used a Derived Column activity to fill the rows with the column names.
Splitting Columns:
In order to build the JSON structure as requested, I split the data into two parallel flows: the first flow (SelectColumn1 activity) selects CustomFieldName1 and CustomFieldValue1, and the second flow (SelectColumn2 activity) selects CustomFieldName2 and CustomFieldValue2.
Note: keep the EntityName column; we will union the data by it later on in the flow.
OtherFields Column:
In order to build the JSON we use the sub-columns feature of a Derived Column activity; this guarantees the JSON structure.
Add a new column named 'OtherFields' and open the Expression Builder.
Add two subcolumns, CustomFieldName and CustomFieldValue; set CustomFieldName1 as the value of the CustomFieldName subcolumn and CustomFieldValue1 as the value of the CustomFieldValue subcolumn.
In the second flow, add a Derived Column activity and repeat the same steps for CustomFieldName2 and CustomFieldValue2.
Union:
Now we have two flows, one extracting field 1 and the other field 2; we need to union the data (you can do it by position or by name).
In order to create an array of JSON objects we need to aggregate the data; this transforms the complex data type {} into an array [].
Aggregate Activity:
Group by -> 'EntityName'
Aggregates -> collect(OtherFields)
Building Outer Json:
As described in the question above, we need a JSON object that consists of two fields: {"EntityName": "", "OtherFields": []}.
In order to do that, we again add a Derived Column activity with two subcolumns.
Also, in order to combine all of the JSON objects into one JSON array, we need a common value to aggregate by. Since the rows have different values, I added a dummy column with the constant 1; this guarantees that all of the JSON objects end up under the same array.
Aggregate Data Json Activity:
The output is an array of JSON objects, so we need to aggregate the data column:
Group by -> dummy
Aggregates -> collect(data)
SelectDataColumn Activity:
Select the data column, because we want it to be our output.
Finally, write to the sink.
P.S.: you can extract the data value so you won't end up with a "data" key.
Output:
ADF activities:
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-union
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

Related

How to map the iterator value to sink in adf

A question concerning Azure Data Factory.
I need to persist the iterator value from a Lookup activity (an Id column from a SQL table) to my sink together with other values.
How do I do that?
I thought that I could just reference the iterator value, @{item().id}, as the source and map it to a destination column name in my SQL table sink. That doesn't seem to work; the resulting value in the destination column is NULL.
I have used two Lookup activities, one for the id values and the other for the remaining values. To combine these values and insert them into the sink table, I have used the following.
The ids Lookup activity output is as follows:
I have one more column to combine with the above id values. The following is the Lookup output for that:
I have given the following dynamic content as the items value of the ForEach:
@range(0, length(activity('ids').output.value))
Inside the ForEach activity, I have used the following Script activity query to insert the data into the sink table as required:
insert into t1 values(@{activity('ids').output.value[item()].id}, '@{activity('remaining rows').output.value[item()].gname}')
The data is inserted successfully.

How to map Data Flow parameters to Sink SQL Table

I need to store/map one or more data flow parameters to my sink (an Azure SQL table).
I can fetch other data from a REST API and am able to map it to my sink columns (see below). I also need to generate some UUIDs as key fields and add these to the same table.
I would like my EmployeeId column to contain my data flow input parameter, e.g. named param_test. In addition to this I need to insert UUIDs into other columns which are not part of my REST input fields.
How do I accomplish that?
You need to use a Derived Column transformation, and there edit the expression to include the parameters (in the Expression Builder a data flow parameter is referenced with a $ prefix, e.g. $param_test).
Adding to @Chen Hirsh: use the same Derived Column to add uuid values to the columns after the REST API source. They will then come through in the sink mapping.
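As an alternative (not part of the answer above, just a hedged option), if generating the keys in the database is acceptable, the Azure SQL table itself can default the key column to a new GUID; the table and column names below are made up for illustration:

-- assumption: dbo.Employee has a uniqueidentifier column named RowKey
ALTER TABLE dbo.Employee
    ADD CONSTRAINT DF_Employee_RowKey DEFAULT NEWID() FOR RowKey;

With a default like this in place, the data flow does not need to supply that column at all.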

Using stringify activity in azure data factory

I need to sync a Cosmos DB container to a SQL database. The objects in Cosmos DB look like this:
[
{
id: "d8ab4619-eb3d-4e25-8663-925bd33b9b1e",
buyerIds: [
"4a7c169f-0642-42a9-b5a7-214a646d6c59",
"87a956b3-2aef-43a1-a0f0-29c07519dfbc",
...
]
},
{...}
]
On the SQL side, the sink table contains 2 columns: Id and BuyerId.
What I want is to convert the buyerIds array to a string joined by commas, for instance, to then be able to pass it to a SQL stored procedure.
The SQL stored procedure will then split the string and insert as many rows into the table as there are buyerIds.
In Azure Data Factory, I tried using a Stringify activity in a data flow, but I get this error and don't understand what I need to change: Stringify expressions must be a complex type or an array of complex types.
My Stringify activity takes the buyerIds column as input and performs the following to create the string:
reduce(buyerIds, '', #acc + ',' + #item, #result)
Do you know what I am missing, or is there a simpler way to do it?
Because your property is an array, you'll want to use Flatten. That will allow you to unroll your array for the target relational destination. Use stringify to turn structures into strings.
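For completeness, if you do go with the comma-joined string plus stored procedure approach described in the question, a rough T-SQL sketch could look like the following (table and procedure names are assumptions, and STRING_SPLIT needs SQL Server 2016 or later):

-- assumption: the sink table is dbo.BuyerLink(Id, BuyerId)
CREATE OR ALTER PROCEDURE dbo.InsertBuyerIds
    @Id nvarchar(50),
    @BuyerIds nvarchar(max)          -- comma-joined string built in ADF
AS
BEGIN
    INSERT INTO dbo.BuyerLink (Id, BuyerId)
    SELECT @Id, LTRIM(value)
    FROM STRING_SPLIT(@BuyerIds, ',')
    WHERE value <> '';               -- skip the empty token a leading comma would produce
END;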

Group the records by columns and pick only the max created date record using mapping data flows in Azure Data Factory

Hi there. In mapping data flows in Azure Data Factory, I tried to create a data flow in which we used an Aggregate transformation to group the records. There is a column called [createddate]; we want to pick the record with the max created date, and the output should show all the columns.
Any advice? Please help me.
The Aggregate transformation will only output the columns participating in the aggregation (the group-by and aggregate-function columns). To include all of the columns in the Aggregate output, add a new column pattern to your Aggregate, changing column1, column2, ... to the names of the columns you are already using in your aggregation.
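If the data happens to come from a SQL source, another option (a different technique from the column-pattern approach above, shown only as an illustrative alternative) is to keep the newest row per group upstream with a window function; the table and column names here are assumptions:

-- assumption: dbo.Records is grouped by col1 and col2, keeping the newest createddate row per group
SELECT *
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY r.col1, r.col2
                              ORDER BY r.createddate DESC) AS rn
    FROM dbo.Records AS r
) AS x
WHERE x.rn = 1;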

How do I query PostgreSQL with IDs from a parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join some data from a PostgreSQL database that is in an AWS tenancy by a unique ID. I am using a data flow to create the unique ID I need from two separate columns using a concatenation. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to create the list of IDs in a parameter that I can use in the select statement out of the data flow output. I tried using a Set Variable activity and was going to put that into a ForEach, but the Set Variable doesn't like the output of the data flow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the data flow to do the join, but the generic PostgreSQL driver isn't available in data flows. I can't copy the entire database over to Azure just to do the join, and I need the data flow in Azure for non-technical reasons.
Edit:
For clarity's sake, I am trying to replace local Python code that does the following:
import pandas as pd

# build the IN (...) list from the parquet IDs and query the database
query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(query + str(tuple(df.id_number)), conn)  # conn: existing DB connection
I'd like to replace this locally executing code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so I don't bring down the entire database.
ADF variables can only store simple types, so we can instead define an Array-type parameter in ADF and set a default value. ADF parameters support any type of element, including complex JSON structures.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array-type parameter and set the JSON array as its default value.
This way we will not get that error.
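However the IDs finally arrive in a single string (for example by joining the parameter's values with commas in a pipeline expression), one possible shape for the PostgreSQL side of the lookup query is sketched below; the table and column names come from the pseudo-code above, and the literal is only a placeholder for the joined IDs:

-- illustrative only: 'id1,id2,id3' stands in for the comma-joined IDs built in the pipeline
select *
from mytable
where id_number = any(string_to_array('id1,id2,id3', ','));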