'int' is a primitive and doesn't support nested properties: Azure Data Factory v2

I am trying to find a substring of a string as part of an activity in ADF. Say, for instance, I want to extract the substring 'de' out of the string 'abcde'. I have tried:
@substring(variables('target_folder_name'), 3, (int(variables('target_folder_name_length'))-3))
where int(variables('target_folder_name_length')) has a value of 5 and variables('target_folder_name') has a value of 'abcde'.
But it gives me: Unrecognized expression: (int(variables('target_folder_name_length'))-3)
On the other hand, if I try this: @substring(variables('target_folder_name'), 2, int(variables('target_folder_name_length'))-3)
This gives me: 'int' is a primitive and doesn't support nested properties
Where am I going wrong?

Use indexof. See my example below:
@substring(variables('testString'), indexof(variables('testString'), variables('de')), length(variables('de')))
Output of the 'result' variable: de

Since the number of leading characters to skip is fixed (3 in your example), you can use the dynamic expression below to get the substring value you need.
@substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))
Here varInputFolderName is a String variable with the value 'abcde', as in this sample.
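For clarity, the pipeline expression language does not support the infix minus operator, which is why the -3 in the question is rejected; sub() takes its place. With the sample value, the expression evaluates roughly like this (a sketch of the steps, not actual activity output):
substring('abcde', 3, sub(length('abcde'), 3))
-> substring('abcde', 3, sub(5, 3))
-> substring('abcde', 3, 2)
-> 'de'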
Here is the pipeline JSON payload for this sample. You can play around with it for further testing.
{
"name": "pipeline_FindSubstring",
"properties": {
"activities": [
{
"name": "setSubstringValue",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "varSubstringOutput",
"value": {
"value": "#substring(variables('varInputFolderName'), 3, sub(length(variables('varInputFolderName')), 3))",
"type": "Expression"
}
}
}
],
"variables": {
"varInputFolderName": {
"type": "String",
"defaultValue": "abcde"
},
"varSubstringOutput": {
"type": "String"
}
},
"annotations": []
}
}

Related

Azure Data Factory Copy Data activity - Use variables/expressions in mapping to dynamically select correct incoming column

I have the below mappings for a Copy activity in ADF:
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"path": "$['id']"
},
"sink": {
"name": "TicketID"
}
},
{
"source": {
"path": "$['summary']"
},
"sink": {
"name": "TicketSummary"
}
},
{
"source": {
"path": "$['status']['name']"
},
"sink": {
"name": "TicketStatus"
}
},
{
"source": {
"path": "$['company']['identifier']"
},
"sink": {
"name": "CustomerAccountNumber"
}
},
{
"source": {
"path": "$['company']['name']"
},
"sink": {
"name": "CustomerName"
}
},
{
"source": {
"path": "$['customFields'][74]['value']"
},
"sink": {
"name": "Landlord"
}
},
{
"source": {
"path": "$['customFields'][75]['value']"
},
"sink": {
"name": "Building"
}
}
],
"collectionReference": "",
"mapComplexValuesToString": false
}
The challenge I need to overcome is that the array indexes of the custom fields of the last two sources might change. So I've created an Azure Function which calculates the correct array index. However, I can't work out how to use the Azure Function's output value in the source path string. I have tried referring to it with an expression like @activity('Get Building Field Index').output, but since the property expects a JSON path, this doesn't work and produces an error:
JSON path $['customFields'][@activity('Get Building Field Index').output]['value'] is invalid.
Is there a different way to achieve what I am trying to do?
Thanks in advance
I have a slightly similar scenario that you might be able to work with.
First, I have a JSON file that is emitted, which I then read in Synapse/ADF with a Lookup activity.
Next, I have a For Each activity that runs a Copy Data activity.
The For Each activity receives my Lookup output and makes my JSON usable by setting the following in the For Each's Settings:
@activity('Lookup').output.firstRow.childItems
My JSON roughly looks as follows:
{"childItems": [
{"subpath": "path/to/folder",
"filename": "filename.parquet",
"subfolder": "subfolder",
"outfolder": "subfolder",
"origin": "A"}]}
This means that in my Copy Data activity inside the For Each, I can access the fields of my JSON like so:
@item()['subpath']
@item()['filename']
@item()['subfolder']
.. etc
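In the pipeline JSON, that For Each setting ends up as an items expression, roughly like this (a sketch; 'Lookup' is simply the activity name used above):
"typeProperties": {
"items": {
"value": "@activity('Lookup').output.firstRow.childItems",
"type": "Expression"
}
}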
Edit:
Adding some screen caps of the parameterization:
https://i.stack.imgur.com/aHpWk.png

JSON Schema - can array / list validation be combined with anyOf?

I have a json document I'm trying to validate with this form:
...
"products": [{
"prop1": "foo",
"prop2": "bar"
}, {
"prop3": "hello",
"prop4": "world"
},
...
There are multiple different forms an object may take. My schema looks like this:
...
"definitions": {
"products": {
"type": "array",
"items": { "$ref": "#/definitions/Product" },
"Product": {
"type": "object",
"oneOf": [
{ "$ref": "#/definitions/Product_Type1" },
{ "$ref": "#/definitions/Product_Type2" },
...
]
},
"Product_Type1": {
"type": "object",
"properties": {
"prop1": { "type": "string" },
"prop2": { "type": "string" }
}
},
"Product_Type2": {
"type": "object",
"properties": {
"prop3": { "type": "string" },
"prop4": { "type": "string" }
}
...
On top of this, certain properties of the individual product array objects may be indirected via further usage of anyOf or oneOf.
I'm running into issues in VSCode using the built-in schema validation, where it throws errors for every item in the products array that doesn't match Product_Type1.
So it seems the validator latches onto the first oneOf entry it finds and won't validate against any of the other types.
I couldn't find any documented limitations of the oneOf mechanism on jsonschema.org, and there is no mention of this on the page specifically dealing with arrays: https://json-schema.org/understanding-json-schema/reference/array.html
Is what I'm attempting possible?
Your general approach is fine. Let's take a slightly simpler example to illustrate what's going wrong.
Given this schema
{
"oneOf": [
{ "properties": { "foo": { "type": "integer" } } },
{ "properties": { "bar": { "type": "integer" } } }
]
}
And this instance
{ "foo": 42 }
At first glance, this looks like it matches /oneOf/0 and not /oneOf/1. It actually matches both schemas, which violates the one-and-only-one constraint imposed by oneOf, and so the oneOf fails.
Remember that every keyword in JSON Schema is a constraint. Anything that is not explicitly excluded by the schema is allowed. There is nothing in the /oneOf/1 schema that says a "foo" property is not allowed. Nor does it say that "foo" is required. It only says that if the instance has a property "foo", then it must be an integer.
To fix this, you will need required and maybe additionalProperties, depending on the situation. I show here how you would use additionalProperties, but I recommend you don't use it unless you need to, because it does have some problematic behaviors.
{
"oneOf": [
{
"properties": { "foo": { "type": "integer" } },
"required": ["foo"],
"additionalProperties": false
},
{
"properties": { "bar": { "type": "integer" } },
"required": ["bar"],
"additionalProperties": false
}
]
}
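Applied to the schema from the question, that means adding required (and, if you want exclusivity, additionalProperties) to each product type, roughly like this (a sketch; pick the required list that actually distinguishes your types):
"Product_Type1": {
"type": "object",
"properties": {
"prop1": { "type": "string" },
"prop2": { "type": "string" }
},
"required": ["prop1", "prop2"]
}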

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
This, however, leads to campaigns being ingested as a plain string.
I could neither find a way to generate an array from it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your flattenSpec. When this flag is set to true (the default), all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level are interpreted as columns.
Here is a good example of and reference for the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
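As a rough sketch based on the spec in the question, the flag sits inside the flattenSpec; note that with field discovery off, the root-level fields have to be listed explicitly (field names here are taken from the question):
"flattenSpec": {
"useFieldDiscovery": false,
"fields": [
{ "type": "root", "name": "event" },
{ "type": "root", "name": "id" },
{ "type": "jq", "name": "campaigns", "expr": ".parameters.campaigns" }
]
}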
It looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so the string_to_array expression should do the trick.
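A minimal sketch of how that might look as a transformSpec inside the dataSchema, assuming the values are separated by ", " (the transform name shadows the flattened campaigns field):
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "campaigns",
"expression": "string_to_array(\"campaigns\", ', ')"
}
]
}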

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a JSON file in Azure Blob Storage and my sink is an Azure SQL Database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different json formats but the copy activity always seems to ignore it.
For example, I've tried:
{
"type": "TabularTranslator",
"columnMappings": {
"view.url": "url"
}
}
and:
"columnMappings": {
"view.url": "url"
}
and:
{
"view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) in your column name.
To use column mapping, you also need to specify the structure in your source and sink datasets.
For your source dataset, you need to specify the format correctly, and since your column name contains a dot, you need to specify the JSON path accordingly.
You could use the ADF UI to set up a copy for a single file first, to get the correct format, structure and column mapping format, and then change it over to the lookup.
As I understand it, your first format should be the right one. If the value is already a JSON object, then you may not need the json() function in your expression.
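If view.url actually refers to a nested property (view -> url) in the source JSON, the Lookup value could also use the newer mappings format, roughly like this (a sketch, reusing the $['...'] path style from the mapping example earlier on this page):
{
"type": "TabularTranslator",
"mappings": [
{
"source": { "path": "$['view']['url']" },
"sink": { "name": "url" }
}
]
}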
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with a dynamic structure. The sink doesn't require one, as we're going to specify the columns in the mapping.
Within the copy activity, format the dynamic json like the following:
{
"structure": [
{
"name": "Address Number"
},
{
"name": "Payment ID"
},
{
"name": "Document Number"
},
...
...
]
}
You would then specify your dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Address Number",
"type": "Int32"
},
"sink": {
"name": "address_number"
}
},
{
"source": {
"name": "Payment ID",
"type": "Int64"
},
"sink": {
"name": "payment_id"
}
},
{
"source": {
"name": "Document Number",
"type": "Int32"
},
"sink": {
"name": "document_number"
}
},
...
...
]
}
}
Assuming these were set in separate variables, you would want to send the source structure as a string and the mapping as JSON:
source: @string(json(variables('str_dyn_structure')).structure)
mapping: @json(variables('str_dyn_translator')).translator
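For reference, wiring that mapping into the copy activity's translator property would then look roughly like this (a sketch; str_dyn_translator is the variable name from the example above):
"translator": {
"value": "@json(variables('str_dyn_translator')).translator",
"type": "Expression"
}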
VladDrak - You could skip the dynamic source definition by building the dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"type": "String",
"ordinal": "1"
},
"sink": {
"name": "dateOfActivity",
"type": "String"
}
},
{
"source": {
"type": "String",
"ordinal": "2"
},
"sink": {
"name": "CampaignID",
"type": "String"
}
}
]
}
}

How to fix Data Lake Analytics script

I would like to use Azure Data Factory with an Azure Data Lake Analytics activity, but without success.
This is my pipeline script:
{
"name": "UsageStatistivsPipeline",
"properties": {
"description": "Standardize JSON data into CSV, with friendly column names & consistent output for all event types. Creates one output (standardized) file per day.",
"activities": [{
"name": "UsageStatisticsActivity",
"type": "DataLakeAnalyticsU-SQL",
"linkedServiceName": {
"referenceName": "DataLakeAnalytics",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "BlobStorage",
"type": "LinkedServiceReference"
},
"scriptPath": "adla-scripts/usage-statistics-adla-script.json",
"degreeOfParallelism": 30,
"priority": 100,
"parameters": {
"sourcefile": "wasb://nameofblob.blob.core.windows.net/$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)",
"destinationfile": "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/DailyResult.csv', SliceStart)"
}
},
"inputs": [{
"type": "DatasetReference",
"referenceName": "DirectionsData"
}
],
"outputs": [{
"type": "DatasetReference",
"referenceName": "OutputData"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 10,
"executionPriorityOrder": "NewestFirst"
}
}
],
"start": "2018-01-08T00:00:00Z",
"end": "2017-01-09T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}}
I have two parameter variables, sourcefile and destinationfile, which are dynamic (the path is built from the date).
Then I have this ADLA script for execution.
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
REFERENCE ASSEMBLY master.[Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@Data =
EXTRACT
jsonstring string
FROM @sourcefile
USING Extractors.Tsv(quoting:false);
@CreateJSONTuple =
SELECT
JsonFunctions.JsonTuple(jsonstring) AS EventData
FROM
@Data;
@records =
SELECT
JsonFunctions.JsonTuple(EventData["records"], "[*].*") AS record
FROM
@CreateJSONTuple;
@properties =
SELECT
JsonFunctions.JsonTuple(record["[0].properties"]) AS prop,
record["[0].time"] AS time
FROM
@records;
@result =
SELECT
...
FROM @properties;
OUTPUT @result
TO @destinationfile
USING Outputters.Csv(outputHeader:false,quoting:true);
Job execution fails and the error is:
EDIT:
It seems that Text.Format is not evaluated and is instead passed into the script as a literal string. The Data Lake Analytics job detail then shows this:
DECLARE @sourcefile string = "$$Text.Format('wasb://nameofblob.blob.core.windows.net/{0:yyyy}/{0:MM}/{0:dd}/0_647de4764587459ea9e0ce6a73e9ace7_2.json', SliceStart)";
In your code sample, the sourcefile parameter is not defined the same way as destinationfile; the latter appears to be correct while the former is not.
The whole string should be wrapped inside $$Text.Format() for both:
"paramName" : "$$Text.Format('...{0:pattern}...', param)"
Also consider passing only the formatted date like so:
"sliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
and then doing the rest in U-SQL:
DECLARE @sliceStartDate DateTime = DateTime.Parse(@sliceStart);
DECLARE @path string = String.Format("wasb://path/to/file/{0:yyyy}/{0:MM}/{0:dd}/file.csv", @sliceStartDate);
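The U-SQL script can then read from @path directly, adapted from the EXTRACT in the script above:
@Data =
EXTRACT jsonstring string
FROM @path
USING Extractors.Tsv(quoting:false);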
Hope this helps