Expression invalid - azure-data-factory

Azure Data Factory error:
The expression 'item().$v.collection.$v' cannot be evaluated because property '$v' doesn't exist, available properties are '$t, $v._id.$t, $v._id.$v, $v.id.$t, $v.id.$v, $v.database.$t, $v.database.$v, $v.collection.$t, $v.collection.$v, id, _self, _etag, _rid, _attachments, _ts'
How can I get around that?
I am using this expression in a ForEach activity that is connected to a Lookup activity reading values from Cosmos DB. I am only interested in a single column, but the SQL:
select collection from backups
didn't work, so I switched from "Query" to "Table"; as a result, the output of the Lookup activity contains a JSON object with fields containing $.

This error results from the ForEach activity treating "." as a property accessor. Please use the expression "@item()['$v.collection.$v']" to get around the error. Thanks.
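For example, bracket notation treats the whole string as one property name rather than a dotted path, so with the properties listed in the error message the expressions inside the ForEach would look like this:
@item()['$v.collection.$v']
@item()['$v.database.$v']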

Related

How to add a null value in the Azure Data Factory Derived Column expression builder

I am currently using Azure Data Factory and creating a Derived Column. Since the field will always be blank, I want the value to be NULL.
In the Derived Column I am currently adding expressions such as toString("null") and toString(null()), but the value appears as a string. I only want null to appear, without quotes, in the JSON document.
I have reproduced the above and got the results below.
I tried to assign null() to a column and it gave an error like the one below.
So, in an ADF Data Flow, null() has to be wrapped in a conversion function such as toInteger() or toString().
When I give toString(null()) for the row where id is 4 in the Derived Column of the data flow, with a JSON sink, it gave me the output below.
You can see that the row with id==4 skipped the null-valued key in the JSON. If you give toString(null()) for every row, the same key will be skipped in every row.
You can go through this link by @ShaikMaheer-MSFT to understand more about this.
AFAIK, the workaround for this is to store the null as the string 'null' so that the key is kept in the JSON, like this, and later handle it as per your requirement.
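A minimal sketch of the two behaviours in the Derived Column expression builder, assuming a hypothetical input column named yourField: the first expression keeps the key and writes the literal string 'null' when the value is missing, while the second always evaluates to null, so the key is skipped in the JSON output.
iif(isNull(yourField), 'null', toString(yourField))
toString(null())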

Can't convert table value to string in ADF

I'm trying to loop over data in a SQL table, but when I try to use the value inside a ForEach loop activity using @item(), I get the error:
"Failed to convert the value in 'table' property to 'System.String' type. Please make sure the payload structure and value are correct."
So the row value can't be converted to a string.
Could that be my problem? And if so, what can I do about it?
Here is the pipeline:
I reproduced the above scenario with a SQL table containing table names in the Lookup and CSV files from ADLS Gen2 in the Copy activity, and got the same error.
The above error arises when the Lookup output array items are passed directly into a string parameter inside the ForEach.
If we look at the lookup output below,
the above value array is not a normal array; it is an array of objects. So, @item() in the 1st iteration of the ForEach is one object, { "tablename": "sample1.csv" }. But our parameter expects a string value, and that's why it gives the above error.
To resolve this, give @item().tablename, which will give the table name in every iteration inside the ForEach.
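As a sketch, these are the two expressions involved (the Lookup activity name Lookup1 is an assumption; tablename is the column from this repro): the first goes in the ForEach activity's Items box, the second into the string parameter inside the ForEach.
@activity('Lookup1').output.value
@item().tablename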
My repro for your reference:
I have given the same in the sink as well, and this is my output.
Pipeline Execution
Copied data in target

How can I check if a JSON field exists using an ADF expression?

I want to do some activity in an ADF pipeline, but only if a field in a JSON output is present. What kind of ADF expression can I use to check that?
I set up two JSON files for testing, one with a firstName attribute and one without:
I then created a Lookup activity to get the contents of the JSON file and a Set Variable activity for testing the expression. I often use Set Variable activities like this; they are a good way to test and view expression results iteratively:
I then created a Boolean variable (which is one of the datatypes supported by Azure Data Factory and Synapse pipelines) and the expression I am using to check the existence of the attribute is this:
@bool(contains(activity('Lookup1').output.firstRow, 'firstName'))
You can then use that Boolean variable in an If Condition activity, to execute subsequent activities conditionally based on the value of the variable.
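As a sketch (the Boolean variable name firstNameExists is an assumption), the If Condition's expression can simply reference that variable, or you can inline the same check directly:
@variables('firstNameExists')
@bool(contains(activity('Lookup1').output.firstRow, 'firstName'))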

Redshift Spectrum table doesn't recognize array

I have run a crawler on a JSON S3 file to update an existing external table.
Once it finished, I checked SVL_S3LOG to see the structure of the external table and saw that it was updated and I have a new column with the Array<int> type, as expected.
When I tried to execute select * on the external table I got this error: "Invalid operation: Nested tables do not support '*' in the SELECT clause.;"
So I tried to spell out the select statement with all the column names:
select name, date, books.... (books is the Array<int> type)
from external_table_a1
and got this error:
Invalid operation: column "books" does not exist in external_table_a1;"
I have also checked the table external_table_a1 under "AWS Glue" and saw that the column "books" is recognized and has the type Array<int>.
Can someone explain why my simple query is wrong?
What am I missing?
Querying JSON data is a bit of a hassle with Redshift: when parsing is enabled (e.g. using the appropriate SerDe configuration) the JSON is stored as a SUPER type. In your case that's the Array<int>.
The AWS documentation on Querying semistructured data seems pretty straightforward, mentioning that PartiQL uses "dotted notation and array subscript for path navigation when accessing nested data". This doesn't work for me, although I can't find any reason for that in their SUPER limitations documentation.
Solution 1
What I had to do is set the flags set json_serialization_enable to true; and set json_serialization_parse_nested_strings to true;, which serializes the SUPER type back to JSON. I can then use JSON functions to query the data. Unnesting the data gets even stranger, because on SUPER types you can only use the unnest syntax select item from table as t, t.items as item. I genuinely don't think this is the intended way to query and unnest SUPER objects, but it's the only approach that worked for me.
They describe this in an older "Amazon Redshift Developer Guide".
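A sketch of Solution 1 with the table and column names from the question (this follows what the answer describes; the exact behaviour may depend on how the external schema is set up):
set json_serialization_enable to true;
set json_serialization_parse_nested_strings to true;
-- unnest the books array: one output row per array element
select t.name, t.date, book
from external_table_a1 as t, t.books as book;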
Solution 2
When you are writing a query, Redshift will try to fit the output into one of the basic column data types. If the result of your query does not match any of those types, Redshift will not process the query. Hence, in order to convert a SUPER value to a compatible type you will have to unnest it (using the rather peculiar Redshift unnest syntax).
For me this works in certain cases, but I'm not always able to properly index arrays, nor can I access the array index (using the my_table.array_column as array_entry at array_index syntax).
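For reference, a sketch of that index syntax as documented for SUPER navigation (the aliases book and idx are arbitrary; as noted above, it may not work in every setup):
select t.name, book, idx
from external_table_a1 as t, t.books as book at idx;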

How do I query PostgreSQL with IDs from a parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join in some data, by a unique ID, from a PostgreSQL database that is in an AWS tenancy. I am using a data flow to create the unique ID I need by concatenating two separate columns. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query against the database, but I can't figure out how to build the list of IDs from the data flow output into a parameter that I can use in the select statement. I tried using a Set Variable activity that I was going to feed into a ForEach, but the Set Variable doesn't like the output of the data flow (an object instead of an array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform it to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total number of unique IDs in the database.
If this were an Azure PostgreSQL database I could just do the join in the data flow, but the generic PostgreSQL connector isn't available in data flows. I can't copy the entire database over to Azure just to do the join, and I need the data flow in Azure for non-technical reasons.
Edit:
For clarity's sake, I am trying to replace local Python code that does the following:
query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(query + str(tuple(df.id_number)), conn)
I'd like to replace this locally executing code with ADF. In the ADF process, replacing the transformation of the IDs seems easy enough in a couple of different ways. But once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so that I don't pull down the entire database.
ADF variables can only store simple types, so we can define an Array type parameter in ADF and set a default value. ADF parameters support any type of element, including complex JSON structures.
For example:
Define a json array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array type parameter and set its default value:
So we will not get any error.
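As a sketch of how such a parameter could then be turned into the IN-list for the query, assuming a hypothetical Array parameter named ids that holds plain ID strings such as ["id1","id2","id3"]:
@concat('select * from tablename where unique_id in (''', join(pipeline().parameters.ids, ''','''), ''')')
This evaluates to select * from tablename where unique_id in ('id1','id2','id3'), which could then be used as the query of a Lookup activity against the PostgreSQL database.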