Using stringify activity in Azure Data Factory - azure-data-factory

I need to sync a Cosmos DB container to a SQL database. The objects in Cosmos DB look like this:
[
{
id: "d8ab4619-eb3d-4e25-8663-925bd33b9b1e",
buyerIds: [
"4a7c169f-0642-42a9-b5a7-214a646d6c59",
"87a956b3-2aef-43a1-a0f0-29c07519dfbc",
...
]
},
{...}
]
On the SQL side, the sink table contains 2 columns: Id and BuyerId.
What I want is to convert the buyerIds array into a comma-joined string, for instance, so that I can then pass it to a SQL stored procedure.
The SQL stored procedure will then split the string and insert as many rows into the table as there are buyerIds.
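For reference, a minimal T-SQL sketch of the kind of stored procedure described here (the procedure, table, and parameter names are made up), using STRING_SPLIT to fan the comma-joined IDs out into rows:

CREATE PROCEDURE dbo.InsertBuyers
    @Id       UNIQUEIDENTIFIER,
    @BuyerIds NVARCHAR(MAX)
AS
BEGIN
    -- one row per buyer id in the comma-joined string
    INSERT INTO dbo.BuyerTable (Id, BuyerId)
    SELECT @Id, value
    FROM STRING_SPLIT(@BuyerIds, ',')
    WHERE value <> '';  -- skip the empty entry left by a leading comma
END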
In Azure Data Factory, I tried using a Stringify activity in a data flow, but I get this error and don't understand what I need to change: "Stringify expressions must be a complex type or an array of complex types."
My Stringify activity takes the buyerIds column as input and performs the following to create the string:
reduce(buyerIds, '', #acc + ',' + #item, #result)
Do you know what I am missing, or a simpler way to do it?

Because your property is an array, you'll want to use Flatten. That will allow you to unroll your array for the target relational destination. Use stringify to turn structures into strings.
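For illustration, a rough data flow script sketch of a Flatten transformation that unrolls buyerIds into one row per buyer (the source stream name is assumed; column names are taken from the question):

CosmosSource foldDown(unroll(buyerIds),
    mapColumn(
        Id = id,
        BuyerId = buyerIds
    )) ~> FlattenBuyerIds

Each output row then carries the document id and a single buyer id, which maps directly onto the Id/BuyerId columns of the relational sink.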

Related

Can't convert table value to string in ADF

I'm trying to loop over data in a SQL table, but when I try to use the value inside a ForEach loop activity using @item(), I get the error:
"Failed to convert the value in 'table' property to 'System.String' type. Please make sure the payload structure and value are correct."
So the row value can't be converted to a string.
Could that be my problem? And if so, what can I do about it?
Here is the pipeline:
I reproduced the above scenario with a SQL table containing table names in the Lookup and CSV files from ADLS Gen2 in the Copy activity, and got the same error.
The above error arises when we pass the Lookup output array items directly into a string parameter inside ForEach.
If we look at the Lookup output below,
the value array is not a plain array; it is an array of objects. So @item() in the first iteration of ForEach is one object, { "tablename": "sample1.csv" }, but our parameter expects a string value, and that's why it gives the above error.
To resolve this, use @item().tablename, which will give the table name in every iteration inside ForEach.
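For illustration, a sketch of the shapes involved (the Lookup activity name 'Lookup1' and the second file name are assumptions; 'tablename' and 'sample1.csv' come from the output above):

Lookup output value (an array of objects):
[ { "tablename": "sample1.csv" }, { "tablename": "sample2.csv" } ]

ForEach Items setting:       @activity('Lookup1').output.value
Inside ForEach (a string):   @item().tablename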
My repro for your reference:
I have given the same in the sink as well, and this is my output.
Pipeline Execution
Copied data in target

Modifying JSON in Azure Data Factory to give a nested list

I'm retrieving data from SQL Server in Azure Data Factory.
The API I'm passing it to requires the JSON in a specific format.
I have been unable to get the data in the required format so far, trying "FOR JSON" output in T-SQL.
Is there a way to do this in Data Factory with the data it retrieved from SQL Server?
SQL Server Output
EntityName  CustomField1  CustomField2
--------------------------------------
AA01        NGO21         2022-01-01
AA02        BS34          2022-03-01
How it appears in Data Factory
[
{"EntityName": "AA01", "CustomField1": "NGO21", "CustomField2":"2022-01-01"},
{"EntityName": "AA02", "CustomField1": "BS32", "CustomField2":"2022-03-01"}
]
Required output
[
  {
    "EntityName": "AA01",
    "OtherFields": [
      { "customFieldName": "CustomField1", "customFieldValue": "NGO21" },
      { "customFieldName": "CustomField2", "customFieldValue": "2022-01-01" }
    ]
  },
  {
    "EntityName": "AA02",
    "OtherFields": [
      { "customFieldName": "CustomField1", "customFieldValue": "BS34" },
      { "customFieldName": "CustomField2", "customFieldValue": "2022-03-01" }
    ]
  }
]
I managed to do it in ADF. The solution is quite long; I think if you write a stored procedure, it will be much easier.
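(For reference, a rough T-SQL sketch of that stored-procedure route — the table name and the exact casts are assumptions — using a nested FOR JSON PATH subquery to produce the OtherFields array:)

SELECT
    t.EntityName,
    OtherFields = (
        SELECT v.customFieldName, v.customFieldValue
        FROM (VALUES ('CustomField1', CONVERT(nvarchar(100), t.CustomField1)),
                     ('CustomField2', CONVERT(nvarchar(100), t.CustomField2))
             ) AS v(customFieldName, customFieldValue)
        FOR JSON PATH
    )
FROM dbo.SourceTable AS t
FOR JSON PATH;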
Here is a quick demo that I built:
The idea is to build the JSON structure as requested and then use the collect() function to build the array.
We have 2 arrays, one for EntityName and one for OtherFields.
Prepare the data:
First, I added column names in the corresponding rows; we will use this info later on to build our JSON. I used a Derived Column activity to fill the rows with the column names.
Splitting Columns:
To build the JSON structure as requested, I split the data into two parallel flows.
The first flow selects CustomFieldName1 and CustomFieldValue1, and the second flow selects CustomFieldName2 and CustomFieldValue2, like so:
SelectColumn2 Activity:
SelectColumn1 Activity:
Note: please keep the EntityName column; we will Union the data by it later on in the flow.
OtherFields Column:
To build the JSON, we need to use the sub-columns feature in a Derived Column activity; that will guarantee the JSON structure.
Add a new column named 'OtherFields' and open the Expression Builder:
Add 2 sub-columns, CustomFieldName and CustomFieldValue; set CustomFieldName1 as the value of the CustomFieldName sub-column and CustomFieldValue1 as the value of the CustomFieldValue sub-column, like so:
In the second flow, add a Derived Column activity and repeat the same steps for CustomFieldName2 and CustomFieldValue2.
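(Stated as an expression, a sketch of what each derived column produces — column names are taken from the question, and @( ... ) is the data flow expression syntax for building a complex/struct value:)

First flow:   OtherFields = @(customFieldName = CustomFieldName1, customFieldValue = CustomFieldValue1)
Second flow:  OtherFields = @(customFieldName = CustomFieldName2, customFieldValue = CustomFieldValue2)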
Union:
Now we have 2 flows, one extracting field 1 and one extracting field 2; we need to Union the data (you can do it by position or by name).
To create an array of JSON, we need to aggregate the data; this will transform the complex data type {} into an array [].
Aggregate Activity:
Group by -> 'EntityName'
Aggregates -> collect(OtherFields)
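(A rough data flow script equivalent of this aggregate, with the incoming stream name assumed:)

UnionAll aggregate(groupBy(EntityName),
    OtherFields = collect(OtherFields)) ~> AggregateOtherFields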
Building the Outer JSON:
As described in the question above, we need a JSON that consists of 2 columns: {"EntityName": "", "OtherFields": []}
To do that, we again need to add a Derived Column activity with 2 sub-columns.
Also, in order to combine all the JSON objects into one JSON array, we need a common value to aggregate by. Since we have different values, I added a dummy column with a constant value of 1; this guarantees that all the JSON objects end up under the same array.
Aggregate Data Json Activity:
The output is an array of JSON objects, so we need to aggregate the data column:
Group by -> dummy
Aggregates -> collect(data)
SelectDataColumn Activity:
Select the data column because we want it to be our output.
Finally, write to the sink.
P.S.: you can extract the data value so you won't end up with a "data" key.
Output:
ADF activities:
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-union
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

How do I query PostgreSQL with IDs from a Parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in Parquet files. I need to join some data from a PostgreSQL database that is in an AWS tenancy, by a unique ID. I am using a data flow to create the unique ID I need from two separate columns using a concatenation. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to create the list of IDs in a parameter that I can use in the select statement out of the data flow output. I tried using a Set Variable activity and was going to put that into a ForEach, but Set Variable doesn't like the output of the data flow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What's a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The Parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the data flow, but the generic PostgreSQL driver isn't available in data flows. I can't copy the entire database over to Azure just to do the join, and I need the data flow in Azure for non-technical reasons.
Edit:
For clarity's sake, I am trying to replace local Python code that does the following:
import pandas as pd

# build the IN list from the IDs found in the parquet file
query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(query + str(tuple(df.id_number)), conn)
I'd like to replace this locally executing code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so I don't bring down the entire database.
ADF variables can only store simple types. So we can define an Array-type parameter in ADF instead and set a default value; ADF parameters support any type of element, including complex JSON structures.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array-type parameter and set its default value:
So we will not get any error.
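For illustration, a sketch of one way to use such an Array parameter to build the IN list for a lookup query (the parameter name idList and the use of the join() pipeline expression function are assumptions, not taken from the question):

Pipeline parameter:  idList (Array), e.g. default value ["id1","id2","id3"]

Lookup query (dynamic content):
select * from tablename
where unique_id in ('@{join(pipeline().parameters.idList, ''',''')}')

which evaluates to: select * from tablename where unique_id in ('id1','id2','id3')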

How can I prevent SQL injection with arbitrary JSONB query string provided by an external client?

I have a basic REST service backed by a PostgreSQL database with a table with various columns, one of which is a JSONB column that contains arbitrary data. Clients can store data filling in the fixed columns and provide any JSON as opaque data that is stored in the JSONB column.
I want to allow the client to query the database with constraints on both the fixed columns and the JSONB. It is easy to translate some query parameters like ?field=value and convert that into a parameterized SQL query for the fixed columns, but I want to add an arbitrary JSONB query to the SQL as well.
This JSONB query string could contain SQL injection; how can I prevent this? I think that because the structure of the JSONB data is arbitrary, I can't use a parameterized query for this purpose. All the documentation I can find suggests I use parameterized queries, and I can't find any useful information on how to actually sanitize the query string itself, which seems like my only option.
For example a similar question is:
How to prevent SQL Injection in PostgreSQL JSON/JSONB field?
But I can't apply the same solution as I don't know the structure of the JSONB or the query, I can't assume the client wants to query a particular path using a particular operator, the entire JSONB query needs to be freely provided by the client.
I'm using golang, in case there are any existing libraries or code fragments that I can use.
Edit: some example queries on the JSONB that the client might do:
(content->>'company') is NULL
(content->>'income')::numeric>80000
content->'company'->>'name'='EA' AND (content->>'income')::numeric>80000
content->'assets'#>'[{"kind":"car"}]'
(content->>'DOB')::TIMESTAMP<'2000-01-30T10:12:18.120Z'::TIMESTAMP
EXISTS (SELECT FROM jsonb_array_elements(content->'assets') asset WHERE (asset->>'value')::numeric > 100000)
Note that these don't cover all possible types of queries. Ideally I want any query that PostgreSQL supports on the JSONB data to be allowed. I just want to check the query to ensure it doesn't contain sql injection. For example, a simplistic and probably inadequate solution would be to not allow any ";" in the query string.
You could allow the users to specify a path within the JSON document, and then parameterize that path within a call to a function like json_extract_path_text. That is, the WHERE clause would look like:
WHERE json_extract_path_text(data, VARIADIC $1) = $2
The path argument is just a list of keys (which you can bind as a single text[] parameter thanks to VARIADIC) describing how to traverse down to the given value, e.g. the keys 'foo', 'bars', '0', 'name'. The right-hand side of the clause would be parameterized along the same rules as you're using for fixed column filtering.
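For example, a sketch of the parameterized form against the question's content column (the table name documents is made up; jsonb_extract_path_text is the jsonb variant of the same function):

SELECT id
FROM   documents
WHERE  jsonb_extract_path_text(content, VARIADIC $1::text[]) = $2
-- e.g. $1 = '{company,name}' and $2 = 'EA' is equivalent to
--      content->'company'->>'name' = 'EA'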

Expression invalid

Azure Data Factory error:
The expression 'item().$v.collection.$v' cannot be evaluated because property '$v' doesn't exist, available properties are '$t, $v._id.$t, $v._id.$v, $v.id.$t, $v.id.$v, $v.database.$t, $v.database.$v, $v.collection.$t, $v.collection.$v, id, _self, _etag, _rid, _attachments, _ts'
How can I get around that?
I am using this expression in a ForEach which is connected to a Lookup activity reading values from Cosmos DB. I am interested in only a single column, but the SQL:
select collection from backups
didn't work, so I switched from "Query" to "Table"; hence the output of the Lookup activity contains a JSON object with field names containing $.
This error results from the ForEach activity treating "." as the property accessor. Please use the expression "@item()['$v.collection.$v']" to get around the error. Thanks.
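For context, a sketch of why the bracket syntax works (the property values shown are placeholders): in "Table" mode the Lookup returns rows whose property names themselves contain dots, e.g.

{ "id": "...", "$v.collection.$t": ..., "$v.collection.$v": "<collection name>", ... }

so @item().$v.collection.$v tries to walk nested properties that don't exist, while @item()['$v.collection.$v'] reads the single flat property by its full name.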