How to flatten a nested JSON array in Azure Data Factory and output the original JSON object as a string?

My input (simplified) is coming from many JSON files structured like this:
{
  "Type": "Root",
  "Id": "R1",
  "Nested": [
    { "Type": "NestedType1", "Id": "N1", "SharedAttribute": 1 },
    { "Type": "NestedType1", "Id": "N2", "SharedAttribute": 2 },
    { "Type": "NestedType2", "Id": "N3", "SharedAttribute": 3, "NestedType2SpecificAttribute": "foo" }
  ]
}
The important bit is that the nested elements share certain attributes (e.g. SharedAttribute), but they can also have various other attributes (e.g. NestedType2SpecificAttribute). I cannot capture all attributes in the input schema because they change over time.
I want the nested array to be transformed so that the output is a Kusto table with all common/shared attributes plus an additional column with a string representing the nested array's items' JSON.
I'm using a DataFlow to read these JSON files from Data Lake.
To extract the array, I added a Flatten transformation, chose to unroll by "Nested" and set the unroll root to the same. This gives me the expected data:
Type          Id   SharedAttribute
NestedType1   N1   1
NestedType1   N2   2
NestedType2   N3   3
But I have no idea how to add an additional column called "RawJson" that contains the source element of the unrolled array (essentially toString(currentItem)), to produce a result like this:
Type          Id   SharedAttribute   RawJson
NestedType1   N1   1                 { "Type": "NestedType1", "Id": "N1", "SharedAttribute": 1 }
NestedType1   N2   2                 { "Type": "NestedType1", "Id": "N2", "SharedAttribute": 2 }
NestedType2   N3   3                 { "Type": "NestedType2", "Id": "N3", "SharedAttribute": 3, "NestedType2SpecificAttribute": "foo" }

Adding to @Mark Kromer MSFT's answer: after the Flatten transformation you will get the array of objects in the output, i.e. the requested "additional column with a string representing the nested array's items' JSON".
Then use a Derived Column transformation to get that output as a String.
(Screenshots: Flatten settings; flattened output as an array of objects; Derived Column settings; final output.)

In the Flatten transformation's "Input Columns" section, you can include the original array: choose the array name that you set to unroll by, and rename it to "RawJson" using the "Name As" property.
In my example, "coordinates" is the array that I both unrolled and included as part of the output.
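To make the target output concrete, here is a minimal sketch in Java using Jackson that produces the table above from the sample document. It is only an illustration of the rows the Data Flow should emit, not of ADF itself, and the class and variable names are made up for the example:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FlattenWithRawJson {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // The sample input from the question, with quoted keys so it is valid JSON.
        String input = "{\"Type\":\"Root\",\"Id\":\"R1\",\"Nested\":["
                + "{\"Type\":\"NestedType1\",\"Id\":\"N1\",\"SharedAttribute\":1},"
                + "{\"Type\":\"NestedType1\",\"Id\":\"N2\",\"SharedAttribute\":2},"
                + "{\"Type\":\"NestedType2\",\"Id\":\"N3\",\"SharedAttribute\":3,"
                + "\"NestedType2SpecificAttribute\":\"foo\"}]}";

        // One output row per element of "Nested": the shared attributes plus the
        // element itself serialized back to a string, i.e. the "RawJson" column.
        for (JsonNode item : mapper.readTree(input).get("Nested")) {
            System.out.printf("%s | %s | %s | %s%n",
                    item.get("Type").asText(),
                    item.get("Id").asText(),
                    item.get("SharedAttribute").asText(),
                    mapper.writeValueAsString(item));
        }
    }
}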

Related

Velocity in AWS Api Gateway: how to access array of objects

So in AWS API Gateway, I'm querying my DynamoDB table and I get this JSON as the reply:
https://pastebin.com/GpQady4Z
So, Items is an array of 3 objects. I need to extract the properties of those objects: TS, Key and CamID.
I'm using Velocity in the Integration Response. Here is my Mapping Template:
#set($count = $input.json('$.Count'))
#set($items = $input.json('$.Items'))
{
"count" : $count,
"items" : $items,
"first_item": $items[0]
},
The result from API Gateway:
{
"count" : 3,
"items" : [{"TS":{"N":"1599050893346"},"Key":{"S":"000000/000000_2020-08-02-12.48.13.775-CEST.mp4"},"CamID":{"S":"000000"}},{"TS":{"N":"1599051001832"},"Key":{"S":"000000/000000_2020-08-02-12.50.01.220-CEST.mp4"},"CamID":{"S":"000000"}},{"TS":{"N":"1599051082769"},"Key":{"S":"000000/000000_2020-08-02-12.51.22.208-CEST.mp4"},"CamID":{"S":"000000"}}],
"first_item":
}
first_item always comes back empty.
Whereas in a plain array like this:
#set($foo = [ 42, "a string", 21, $myVar ])
"test" : $foo[0]
"test" returns 42
Why is my code not working on an array of objects?
$items is a JSON string (not a JSON object), so $items[0] doesn't make sense.
If you want to access the first item, use $input.json('$.Items[0]').
If you want to iterate over the items, convert the JSON string to an object first using $util.parseJson().
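For illustration only, here is the same distinction sketched in Java with Jackson (the string literal is a trimmed copy of the Items payload): a JSON array that arrives as plain text has to be parsed before you can index into it, which is what $util.parseJson() does for you in the mapping template.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringVsObject {
    public static void main(String[] args) throws Exception {
        // What $input.json('$.Items') hands you: a JSON array serialized as text.
        String items = "[{\"TS\":{\"N\":\"1599050893346\"},\"CamID\":{\"S\":\"000000\"}},"
                + "{\"TS\":{\"N\":\"1599051001832\"},\"CamID\":{\"S\":\"000000\"}}]";

        // Indexing the string itself only gets you a character ('['), not an
        // element, just as $items[0] resolves to nothing useful in Velocity.
        System.out.println(items.charAt(0));

        // Parse first, then element access works.
        JsonNode parsed = new ObjectMapper().readTree(items);
        System.out.println(parsed.get(0)); // the first item as a JSON object
    }
}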

How to document a mixed typed array structures in requests/responses with Spring REST Docs

Given the following example JSON document, which is a list of polymorphic objects of types A and B:
[ {
"a" : 1,
"type" : "A"
}, {
"b" : true,
"type" : "B"
}, {
"b" : false,
"type" : "B"
}, {
"a" : 2,
"type" : "A"
} ]
How would I be able to select the As and the Bs so that I can document them differently?
I put an example project on github: https://github.com/dibog/spring-restdocs-polymorphic-list-demo
Here is an excerpt of my attempt to document the fetch method:
.andDo(document("fetch-tree",
    responseFields(
        beneathPath("[0]").withSubsectionId("typeA"),
        fieldWithPath("type")
            .type(JsonFieldType.STRING)
            .description("only node types 'A' and 'B' are supported"),
        fieldWithPath("a")
            .type(JsonFieldType.NUMBER)
            .description("specific field for node type A")
    ),
    responseFields(
        beneathPath("[1]").withSubsectionId("typeB"),
        fieldWithPath("type")
            .type(JsonFieldType.STRING)
            .description("only node types 'A' and 'B' are supported"),
        fieldWithPath("b")
            .type(JsonFieldType.BOOLEAN)
            .description("specific field for node type B")
    )))
But I get the following error message:
org.springframework.restdocs.payload.PayloadHandlingException: [0] identifies multiple sections of the payload and they do not have a common structure. The following non-optional uncommon paths were found: [[0].a, [0].b]
It looks like [0] or [1] does not work and is interpreted as [].
What would be the best way to handle this situation?
Thanks,
Dieter
It looks like [0] or [1] does not work and is interpreted as [].
That's correct. Adding support for indices is being tracked by this issue.
What would be the best way to handle this situation?
The beneathPath method that you've tried to use above returns an implementation of a strategy interface, PayloadSubsectionExtractor. You could provide your own implementation of this interface and, in the extractSubsection(byte[], MediaType) method, extract the JSON for a particular element in the array and return it as a byte[].
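For illustration, here is a sketch of such an extractor using Jackson. It is written against the PayloadSubsectionExtractor contract described above; the class name is made up, and the exact method set (in particular getSubsectionId and withSubsectionId) is an assumption to verify against your Spring REST Docs version.

import java.io.IOException;
import java.io.UncheckedIOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.http.MediaType;
import org.springframework.restdocs.payload.PayloadSubsectionExtractor;

// Extracts a single array element from the payload so that the field
// descriptors are applied to that element only.
public class ArrayElementExtractor
        implements PayloadSubsectionExtractor<ArrayElementExtractor> {

    private final ObjectMapper objectMapper = new ObjectMapper();
    private final int index;
    private final String subsectionId;

    public ArrayElementExtractor(int index, String subsectionId) {
        this.index = index;
        this.subsectionId = subsectionId;
    }

    @Override
    public byte[] extractSubsection(byte[] payload, MediaType contentType) {
        try {
            // Parse the response, pick the requested element, and hand its JSON
            // back to REST Docs as bytes.
            return objectMapper.writeValueAsBytes(
                    objectMapper.readTree(payload).get(this.index));
        }
        catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    @Override
    public String getSubsectionId() {
        return this.subsectionId;
    }

    // Assumed to be part of the interface in recent versions; adjust if yours differs.
    public ArrayElementExtractor withSubsectionId(String subsectionId) {
        return new ArrayElementExtractor(this.index, subsectionId);
    }
}

You would then pass an instance to responseFields(...) in place of beneathPath(...), for example responseFields(new ArrayElementExtractor(0, "typeA"), fieldWithPath("type")..., fieldWithPath("a")...).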

Mongodb: Indexing field of sub-document that can be either text or array

I have a collection of documents representing messages. Each message has multiple fields that change from message to message. They are stored in a "fields" array of sub-documents.
Each element in this array contains the label and the value of a field.
Some fields may contain long lists of strings (IP addresses, URLs, etc.); each string appears on a new line within that field. Lists can be thousands of lines long.
For that purpose, each element also stores a "type": type 1 represents standard text, while type 2 represents a list. For a type 2 field, the "value" in the sub-document is an array holding the list items.
It looks something like this:
"fields" : [
{
"type" : 1,
"label" : "Observed on",
"value" : "01/09/2016"
},
{
"type" : 1,
"label" : "Indicator of",
"value" : "Malware"
},
{
"type" : 2,
"label" : "Relevant IP addresses",
"value" : [
"10.0.0.0",
"190.15.55.21",
"11.132.33.55",
"109.0.15.3"
]
}
]
I want all field values to be searchable and indexed, whether these values are plain strings or elements of an array within "value".
Would setting up a standard index on "fields.value" index both type 1 and type 2 content? Do I need to set up two indexes?
Thanks in advance!
When creating a new index, MongoDB automatically switches to a multikey index if it encounters an array in any document on the indexed field.
Which means that simply:
collection.createIndex( { "fields.value": 1 } )
should work just fine, covering both the type 1 string values and the individual elements of the type 2 arrays.
See: https://docs.mongodb.com/v3.2/core/index-multikey/
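For illustration, here is roughly the same thing from the MongoDB Java driver (connection string, database and collection names are placeholders). The single index serves equality queries against both the type 1 string values and the individual elements of the type 2 arrays:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MultikeyIndexDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> messages =
                    client.getDatabase("test").getCollection("messages");

            // One index on fields.value; MongoDB marks it multikey as soon as it
            // indexes a document whose value is an array (the type 2 fields).
            messages.createIndex(Indexes.ascending("fields.value"));

            // Both of these equality queries can use that single index.
            System.out.println(messages.find(Filters.eq("fields.value", "Malware")).first());
            System.out.println(messages.find(Filters.eq("fields.value", "190.15.55.21")).first());
        }
    }
}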

Find query result to List

I have a database filled with documents like the following:
{
  "_id" : ObjectId("56zeffb2abcf7ff24b46"),
  "id_thing" : -1,
  "data" : {
    "info1" : 36.0709427,
    "date" : ISODate('2005-11-01T00:33:21.987+07:00'),
    "info2" : 24563.87148077
  }
}
My find method returns a List that I perform some operations on:
for (d <- result_of_find_method_here) {
  val l_d = d("data")
}
But I would like l_d to be a List, which it currently is not, and the toList method does not work.
How do I retrieve all the fields of the data container and their values as a list?
EDIT:
I have tried multiple methods, and none of them work because none applies to AnyRef, which is what I get when I iterate over l_d with a foreach loop.
The find method returns a list because more than one document is returned.
l_d is not a list, because d("data") is not a list; it is a key-value store: a dictionary, JSON object or Map in Scala. The question is how you want to represent this data.
Maybe you want to take the values out of the map as a list.
You can convert the map to a list with l_d.toList, or just its values with l_d.values.toList.

RMongo dbGetQueryForKeys(), what is the structure of "keys," and how do I sub-key them?

I am trying to query a Mongo database from R using RMongo and return the values of a couple of nested documents.
Looking through the documentation for RMongo, I understand the following query:
output <- dbGetQueryForKeys(mongo, 'test_data', '{"foo": "bar"}', '{"foo":1}')
Where the arguments are...
db = mongo
collection = 'test_data'
query = '{"foo": "bar"}'
keys = 'Specify a set of keys to return.'
What is the 1 in '{"foo":1}'? What is the structure of this key set? Checking against this blog post, I found a format like:
result <- dbGetQueryForKeys(mongo, "items", "{'publish_date' : { '$gte' : '2011-04-01', '$lt' : '2011-05-01'}}", "{'publish_date' : 1, 'rank' : 1}")
So, apparently, the keys need the value 1?
How would I get keys for nested documents? If I wanted something like...
output <- dbGetQueryForKeys(mongo, 'test_data', '{"foo": "bar"}', '{"foo1.foo2.foo3.foo4":1,"foo1.foo2.foo3.bar4":1}')
For nested keys, I'm currently returning something more like...
X_id
1 50fabd42a29d6013864fb9d7
foo1
1 { "foo2" : { "foo3" : { "foo4" : "090909" , "bar4" : "1"}}}
...where output[,2] is one long string, rather than two separate variables for the values associated with the keys foo4 and bar4 ("090909" and "1"), as I would have expected.
What is the 1 in '{"foo":1}'? What is the structure of this key set?
These keys are the query projections to return for read operations in MongoDB. A value of 1 includes a specific field and 0 excludes it. The default behaviour, when no projection is specified, is to return all fields.
How would I get keys for nested documents?
For nested keys, I'm currently returning something more like...
1 { "foo2" : { "foo3" : { "foo4" : "090909" , "bar4" : "1"}}}
...where output[,2] is one long string, rather than two separate variables for the values associated with the keys foo4 and bar4 ("090909" and "1"), as I would have expected.
The RMongo driver is returning the data including the embedded hierarchy.
You can reshape and flatten the result output using the RMongo dbAggregate() command and the $project operator, which is part of the Aggregation Framework in MongoDB 2.2+.
If your end goal is to extract the values from the nested object for some type of downstream processing in R, this will get you there. It avoids having to build an aggregation pipeline and is a simple solution to your problem. Instead of trying to reach deep into the nested structure and access bar4 directly, extract the top level of the object, which gives you the long string that you referenced.
output <- dbGetQueryForKeys(mongo, 'test_data', '{"foo": "bar"}', '{"foo1.foo2.foo3.foo4":1,"foo1":1}')
Since the output is a data.frame, you can use the 'jsonlite' library to get to your data:
library(jsonlite)
# Parse the JSON string stored in the foo1 column into nested R lists
foo1 <- fromJSON(output$foo1)
# Then drill down to the value you need
bar4 <- foo1$foo2$foo3$bar4