Transform nested JSON in Data Factory to SQL - azure-data-factory

New to Data Factory. I have a JSON file that I need to manipulate, but I can't figure out how to go about it. The file has a generic "Name" property, and its value should become the key name. How can I restructure it so that the value is used as the key?
So far I've been getting complex JSON errors. This JSON is coming from a file store.
[
{
"Version": "1.1",
"Documents": [
{
"DocumentState": "Correct",
"DocumentData": {
"Name": "Name1",
"$type": "Document",
"Fields": [
{
"Name": "Form",
"$type": "Text",
"Value": "Birthday Form"
},
{
"Name": "Date",
"$type": "Text",
"Value": "12/1/1999"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "John"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "Smith"
}
]
}
}
]
},
{
"Version": "1.1",
"Documents": [
{
"DocumentState": "Correct",
"DocumentData": {
"Name": "Name2",
"$type": "Document",
"Fields": [
{
"Name": "Form",
"$type": "Text",
"Value": "Entry Form"
},
{
"Name": "Date",
"$type": "Text",
"Value": "4/3/2010"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "Jane"
},
{
"Name": "LastName",
"$type": "Text",
"Value": "Doe"
}
]
}
}
]
}
]
Expected output
DocumentData: [
{
"Form":"Birthday Form",
"Date": "12/1/1999",
"FirstName": "John",
"LastName": "Smith"
},
{
"Form":"Entry Form",
"Date": "4/3/2010",
"FirstName": "Jane",
"LastName": "Doe"
}
]

@jaimers,
I was able to achieve this by making use of the Data Flow activity.
Below is the complete data flow:
1) Source1
This step gets the data from the source; you will have to configure the source dataset.
The only change I made in the source was to convert Fields.Name, Fields.Type and Fields.Value to string[] (from string).
This is required to create the key/value pairs from the fields in the subsequent steps.
2) Flatten1
I used Flatten at the Documents level,
and got the values of DocumentData.Name and DocumentData.Fields.
Note: if you don't want DocumentData.Name, you can safely ignore it.
3) DerivedColumn1
This is the actual step where I convert each Name: key / Value: value pair into key: value.
To do that, I used the expression below:
keyValues(Fields.Name,Fields.Value)
Note: the keyValues() function expects two array arguments; hence, in the first step we changed the type of Fields.Name and Fields.Value to array.
4) Select
Just to select the columns that need to be sent as the output.
Output

You mentioned SQL in your title, so if you have access to a SQL database, e.g. Azure SQL DB, it is quite capable of manipulating JSON, e.g. using OPENJSON and FOR JSON PATH. A simple example:
DECLARE @json VARCHAR(MAX) = '[
{
"Version": "1.1",
"Documents": [
{
"DocumentState": "Correct",
"DocumentData": {
"Name": "Name1",
"$type": "Document",
"Fields": [
{
"Name": "Form",
"$type": "Text",
"Value": "Birthday Form"
},
{
"Name": "Date",
"$type": "Text",
"Value": "12/1/1999"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "John"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "Smith"
}
]
}
}
]
},
{
"Version": "1.1",
"Documents": [
{
"DocumentState": "Correct",
"DocumentData": {
"Name": "Name2",
"$type": "Document",
"Fields": [
{
"Name": "Form",
"$type": "Text",
"Value": "Entry Form"
},
{
"Name": "Date",
"$type": "Text",
"Value": "4/3/2010"
},
{
"Name": "FirstName",
"$type": "Text",
"Value": "Jane"
},
{
"Name": "LastName",
"$type": "Text",
"Value": "Doe"
}
]
}
}
]
}
]'
-- Restructure the JSON and add a root
SELECT *
FROM OPENJSON ( @json )
WITH
(
Form VARCHAR(50) '$.Documents[0].DocumentData.Fields[0].Value',
[Date] DATE '$.Documents[0].DocumentData.Fields[1].Value',
FirstName VARCHAR(50) '$.Documents[0].DocumentData.Fields[2].Value',
LastName VARCHAR(50) '$.Documents[0].DocumentData.Fields[3].Value'
)
FOR JSON PATH, ROOT('DocumentData');
My results:
NB I've used the ROOT clause to add a root to the JSON document. You could make @json a stored procedure parameter and use a Stored Procedure activity from the pipeline.
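As a rough sketch of that approach (the procedure name here is illustrative, not from the original post), the query above could be wrapped like so:
CREATE OR ALTER PROCEDURE dbo.TransformDocumentData
    @json NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    -- Same restructuring as the ad-hoc query above, returned as a single JSON result
    SELECT *
    FROM OPENJSON ( @json )
    WITH
    (
        Form VARCHAR(50) '$.Documents[0].DocumentData.Fields[0].Value',
        [Date] DATE '$.Documents[0].DocumentData.Fields[1].Value',
        FirstName VARCHAR(50) '$.Documents[0].DocumentData.Fields[2].Value',
        LastName VARCHAR(50) '$.Documents[0].DocumentData.Fields[3].Value'
    )
    FOR JSON PATH, ROOT('DocumentData');
END
The pipeline would then pass the file contents in as @json; if you need to capture the returned JSON rather than just run the procedure, a Lookup or Script activity may be a better fit than the Stored Procedure activity.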

Related

PubSub Subscription error with REPEATED Column Type - Avro Schema

I am trying to use the Pub/Sub subscription "Write to BigQuery" option but am running into an issue with the "REPEATED" column type. The message I get when updating the subscription is:
Incompatible schema mode for field 'Values': field is REQUIRED in the topic schema, but REPEATED in the BigQuery table schema
My Avro Schema is:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": [
{
"type": "record",
"name": "Values",
"fields": [
{
"name": "AttributeID",
"type": "string"
},
{
"name": "AttributeValue",
"type": "string"
}
]
}
]
}
]
}
Input JSON That "Matches" Schema:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": {
"AttributeID": "TEST_ID_1",
"AttributeValue": "Value_1"
}
}
My table looks like:
ItemID | STRING | NULLABLE
UserType | STRING | NULLABLE
Values | RECORD | REPEATED
AttributeID | STRING | NULLABLE
AttributeValue | STRING | NULLABLE
I am able to "Test" and "Validate Schema" and it comes back successful. The question is, what am I missing in the Avro for the Values node to make it "REPEATED" instead of "REQUIRED" so the subscription can be created?
The issue is that Values is not an array type in your Avro schema, meaning it expects only one in the message, while it is a repeated type in your BigQuery schema, meaning it expects a list of them.
Per Kamal's comment above, this schema works:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": {
"type": "array",
"items": {
"name": "NameDetails",
"type": "record",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "Value",
"type": "string"
}
]
}
}
}
]
}
The payload:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": [
{ "AttributeID": "TEST_ID_1" },
{ "AttributeValue": "Value_1" }
]
}
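For reference, the REPEATED mode in the question's table corresponds to an ARRAY of STRUCTs in BigQuery DDL, roughly like this (a sketch; the dataset and table names are made up):
CREATE TABLE my_dataset.items (
  ItemID STRING,
  UserType STRING,
  `Values` ARRAY<STRUCT<AttributeID STRING, AttributeValue STRING>>  -- shows as RECORD / REPEATED
);
That ARRAY is why the Avro side has to declare Values as an array of records for the two schemas to line up.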

How to select filtered postgresql jsonb field with performance prioritization?

A table:
CREATE TABLE events_holder(
id serial primary key,
version int not null,
data jsonb not null
);
The data field can be very large (up to 100 MB) and looks like this:
{
"id": 5,
"name": "name5",
"events": [
{
"id": 255,
"name": "festival",
"start_date": "2022-04-15",
"end_date": "2023-04-15",
"values": [
{
"id": 654,
"type": "text",
"name": "importance",
"value": "high"
},
{
"id": 655,
"type": "boolean",
"name": "epic",
"value": "true"
}
]
},
{
"id": 256,
"name": "discovery",
"start_date": "2022-02-20",
"end_date": "2022-02-22",
"values": [
{
"id": 711,
"type": "text",
"name": "importance",
"value": "low"
},
{
"id": 712,
"type": "boolean",
"name": "specificAttribute",
"value": "false"
}
]
}
]
}
I want to select the data field by version, but filtered with an extra condition: where an event's end_date > '2022-03-15'. The output must look like this:
{
"id": 5,
"name": "name5",
"events": [
{
"id": 255,
"name": "festival",
"start_date": "2022-04-15",
"end_date": "2023-04-15",
"values": [
{
"id": 654,
"type": "text",
"name": "importance",
"value": "high"
},
{
"id": 655,
"type": "boolean",
"name": "epic",
"value": "true"
}
]
}
]
}
How can I do this with maximum performance? How should I index the data field?
My primary solution:
with cte as (
select eh.id, eh.version, jsonb_agg(events) as filteredEvents from events_holder eh
cross join jsonb_array_elements(eh.data #> '{events}') as events
where version = 1 and (events ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
group by id, version
)
select jsonb_set(data, '{events}', cte.filteredEvents) from events_holder, cte
where events_holder.id = cte.id;
But I don't think it's a good approach.
You can do this using a JSON path expression:
select eh.id, eh.version,
jsonb_path_query_array(data,
'$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')
from events_holder eh
where eh.version = 1
and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'
Given your example JSON, this returns:
[
{
"id": 255,
"name": "festival",
"values": [
{
"id": 654,
"name": "importance",
"type": "text",
"value": "high"
},
{
"id": 655,
"name": "epic",
"type": "boolean",
"value": "true"
}
],
"end_date": "2023-04-15",
"start_date": "2022-04-15"
}
]
Depending on your data distribution, a GIN index on data or an index on version could help.
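For example, something along these lines (the index names are arbitrary; whether jsonb_path_ops is usable depends on which operators your queries actually rely on):
-- supports jsonb path operators such as @? on the data column
CREATE INDEX events_holder_data_idx ON events_holder USING gin (data jsonb_path_ops);
-- supports the version = 1 filter
CREATE INDEX events_holder_version_idx ON events_holder (version);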
If you need to re-construct the whole JSON content but with just a filtered events array, you can do something like this:
select (data - 'events')||
jsonb_build_object('events', jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'))
from events_holder eh
...
(data - 'events') removes the events key from the JSON. Then the result of the JSON path query is appended back to that (partial) object.
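Putting both pieces together with the same version and end_date filter as the first query, the full statement looks roughly like this:
select (data - 'events') ||
       jsonb_build_object('events',
         jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'))
from events_holder eh
where eh.version = 1
  and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())';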

AVRO schema with optional record

Hi folks, I need to create an Avro schema for the following example:
{ "Car" : { "Make" : "Ford" , "Year": 1990 , "Engine" : "V8" , "VIN" : "123123123" , "Plate" : "XXTT9O",
"Accident" : { "Date" : "2020/02/02" , "Location" : "NJ" , "Driver" : "Joe" } ,
"Owner" : { "Name" : "Joe" , "LastName" : "Doe" } } }
Accident and Owner are optional objects, and the created schema also needs to validate the following subset message:
{ "Car" : { "Make" : "Tesla" , "Year": 2020 , "Engine" : "4ELEC" , "VIN" : "54545426" , "Plate" : "TESLA" } }
I read the Avro specs and saw a lot of optional attribute and array examples, but none of them worked for a record. How can I define a record as optional? Thanks.
The following schema, without any optional attributes, works:
{
"name": "MyClass", "type": "record", "namespace": "com.acme.avro", "fields": [
{
"name": "Car", "type": {
"name": "Car","type": "record","fields": [
{ "name": "Make", "type": "string" },
{ "name": "Year", "type": "int" },
{ "name": "Engine", "type": "string" },
{ "name": "VIN", "type": "string" },
{ "name": "Plate", "type": "string" },
{ "name": "Accident",
"type":
{ "name": "Accident",
"type": "record",
"fields": [
{ "name": "Date","type": "string" },
{ "name": "Location","type": "string" },
{ "name": "Driver", "type": "string" }
]
}
},
{ "name": "Owner",
"type":
{"name": "Owner",
"type": "record",
"fields": [
{"name": "Name", "type": "string" },
{"name": "LastName", "type": "string" }
]
}
}
]
}
}
]
}
When I change the Owner object as suggested, avro-tools returns an error.
{ "name": "Owner",
"type": [
"null",
"record" : {
"name": "Owner",
"fields": [
{"name": "Name", "type": "string" },
{"name": "LastName", "type": "string" }
]
}
] , "default": null }
]
}
}
]
}
Test:
Projects/avro_test$ java -jar avro-tools-1.8.2.jar fromjson --schema-file CarStackOver.avsc Car.json > o2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.avro.SchemaParseException: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@4034c28c; line: 26, column: 13]
at org.apache.avro.Schema$Parser.parse(Schema.java:1034)
at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
at org.apache.avro.tool.Util.parseSchemaFromFS(Util.java:165)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:83)
at org.apache.avro.tool.Main.run(Main.java:87)
at org.apache.avro.tool.Main.main(Main.java:76)
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@4034c28c; line: 26, column: 13]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:482)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:222)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:197)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:58)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2704)
at org.codehaus.jackson.map.ObjectMapper.readTree(ObjectMapper.java:1344)
at org.apache.avro.Schema$Parser.parse(Schema.java:1032)
You can make records optional by defining them as a union with null, like this:
{
"name": "Owner",
"type": [
"null",
{
"name": "Owner",
"type": "record",
"fields": [
{ "name": "Name", type": "string" },
{ "name": "LastName", type": "string" },
]
}
],
"default": null
},
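For completeness, applying the same pattern to both optional records in the original schema gives something like this (a sketch; only the Accident and Owner fields change from the working schema above):
{ "name": "MyClass", "type": "record", "namespace": "com.acme.avro", "fields": [
  { "name": "Car", "type": {
    "name": "Car", "type": "record", "fields": [
      { "name": "Make", "type": "string" },
      { "name": "Year", "type": "int" },
      { "name": "Engine", "type": "string" },
      { "name": "VIN", "type": "string" },
      { "name": "Plate", "type": "string" },
      { "name": "Accident", "type": [ "null", { "name": "Accident", "type": "record", "fields": [
        { "name": "Date", "type": "string" },
        { "name": "Location", "type": "string" },
        { "name": "Driver", "type": "string" }
      ] } ], "default": null },
      { "name": "Owner", "type": [ "null", { "name": "Owner", "type": "record", "fields": [
        { "name": "Name", "type": "string" },
        { "name": "LastName", "type": "string" }
      ] } ], "default": null }
    ]
  } }
] }
One caveat when testing with avro-tools fromjson: Avro's JSON encoding is stricter than plain JSON for unions, so a non-null branch generally has to be wrapped with the record's name rather than passed as a bare object, which means the subset message may still need adjusting when you encode it.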

Using GraphQL,Springboot,MongoDB.The json is 1000+ lines deeply nested.Instead of updating whole doc,need to update specific key-value at any position

The requirement for the mutation is to behave like an upsert. For example, in the JSON below the mutation needs to change "status" under the Rooms->Availability section, which I do not want to hard-code like this:
db.collection.update({
'Rooms.Availability.status':true
},{
$set : {
'Rooms.Availability.status':False
}
})
for a specific array index, because some other documents may not have a "status" or "Availability" key at all.
Below is a similar JSON structure. Keys can be different in other JSON documents within the same collection.
@GraphQLMutation(name = "updateHotelDetails")
public Hotel updateHotelDetails(Hotel h){ // Do I need to pass whole object as argument or only specific key-value?
mongoTemplate.updateFirst(....); // How can I write update code without hard coding?
}
Document 1:
{
"_id" : ObjectId("71testsrtdtsea6995432"),
"HotelName": "Test71testsrtdtsea699fff",
"Description": ".....",
"Address": {
"Street": "....",
"City": "....",
"State": "...."
},
"Rooms": [
{
"Description": "......",
"Type": ".....",
"Price": "....."
"Availability": [
"status": false,
"readOnly": false
]
},
{
"Description": "......",
"Type": "....",
"Price": "..."
"Availability": [
"status": true,
"readOnly": false
]
"newDynamickey": [
{}
]
},
]
"AdditionalData": [
{
"key1": "Vlaue1",
"key2":"Value2"
},
{...}
]
}

Using Mongoose with a rich document?

I'm working on a prototype that will be used for reporting (read only) where the record is a very rich set of objects embedded into a single document. Essentially the document structure is this (edited for brevity):
{
"_id": ObjectId("56b3af6f84ef45c8903acc51"),
"id": "7815dd97-e895-46e5-b6c9-45184c6eae89",
"survey": {
"id": "1fb21c69-6a5c-4805-b1cf-fabef7a5d0e6",
"type": "Survey",
"data": {
"description": "Testing reporting and data ouput",
"id": "1fb21c69-6a5c-4805-b1cf-fabef7a5d0e6",
"start_date": "2016-02-04T11:12:46Z",
"questions": [
{
"sequence": 1,
"modified_at": "2016-02-04T16:11:04.505849+00:00",
"id": "2a77921b-6853-463b-80e7-5713c82c51ca",
"previous_question": null,
"created_at": "2016-02-04T16:10:56.647746+00:00",
"parent_question": "",
"next_question": "",
"validators": [
"required",
"email"
],
"question_data": {
"modified_at": "2016-02-04T16:10:37.542715+00:00",
"type": "open-ended",
"text": "Please provide your email address",
"id": "27aa00db-4a56-4a3e-bc30-226179062af0",
"reporting_name": "email address",
"created_at": "2016-02-04T16:10:37.542695+00:00"
}
},
{
"sequence": 2,
"modified_at": "2016-02-04T16:09:53.539073+00:00",
"id": "c034819d-9281-4943-801f-c53f4047d03e",
"previous_question": null,
"created_at": "2016-02-04T16:09:53.539051+00:00",
"parent_question": "",
"next_question": null,
"validators": [
"alpha-numeric"
],
"question_data": {
"modified_at": "2016-02-04T16:05:31.008363+00:00",
"type": "open-ended",
"text": "Is there anything else that we could have done to improve your experience?",
"id": "e33c7804-20cb-4473-abfa-77b3c2a3113c",
"reporting_name": "more info open-ended",
"created_at": "2016-02-01T20:19:55.036899+00:00"
}
},
{
"sequence": 1,
"modified_at": "2016-02-04T16:08:55.681461+00:00",
"id": "f91fd70e-f204-4c38-9a56-dd6ff25e4cd8",
"previous_question": "",
"created_at": "2016-02-04T16:08:55.681441+00:00",
"parent_question": "",
"next_question": null,
"validators": [
"required"
],
"question_data": {
"modified_at": "2016-02-04T16:04:56.848528+00:00",
"type": "nps",
"text": "On a scale of 0-10 how likely are you to recommend us to a friend?",
"id": "fdb6b74d-96a3-4680-af35-8b2f6aa2bbc9",
"reporting_name": "key nps",
"created_at": "2016-02-01T20:19:27.371920+00:00"
}
}
],
"name": "Reporting Survey",
"end_date": "2016-02-11T11:12:47Z",
"trigger_active": false,
"created_at": "2016-02-04T16:13:16.808108Z",
"url": "http://www.peoplemetrics.com",
"fatigue_limit": "monthly",
"modified_at": "2016-02-04T16:13:16.808132Z",
"template": {
"id": "0ea02379-c80b-4e17-b0a6-d621d49076b9",
"type": "Template"
},
"landing_page": null,
"trigger": null,
"slug": "test-reporting-survey"
}
},
"invite_code": "7801",
"end_date": null,
"created_at": "2016-02-04T19:38:31.931147Z",
"url": "http://127.0.0.1:8000/api/v0/responses/7815dd97-e895-46e5-b6c9-45184c6eae89",
"answers": {
"data": [
{
"id": "bcc3d0dd-5419-4661-9900-ccda3ac9a308",
"end_datetime": "2016-01-22T19:57:03Z",
"survey_question": {
"id": "662fcdf9-3c92-415e-b779-ac5b0fd330d3",
"type": "SurveyQuestion"
},
"response": {
"id": "7815dd97-e895-46e5-b6c9-45184c6eae89",
"type": "Response"
},
"modified_at": "2016-02-04T19:38:31.972717Z",
"value_type": "number",
"created_at": "2016-02-04T19:38:31.972687Z",
"value": "10",
"slug": "",
"start_datetime": "2016-01-21T10:10:21Z"
},
{
"id": "8696f11e-679a-43da-b6e2-aee72a70ca9b",
"end_datetime": "2016-01-28T13:45:37Z",
"survey_question": {
"id": "f118c9dd-1c03-47e0-80ef-2a36eb3b9a29",
"type": "SurveyQuestion"
},
"response": {
"id": "7815dd97-e895-46e5-b6c9-45184c6eae89",
"type": "Response"
},
"modified_at": "2016-02-04T19:38:32.001970Z",
"value_type": "boolean",
"created_at": "2016-02-04T19:38:32.001939Z",
"value": "True",
"slug": "",
"start_datetime": "2016-02-15T04:51:24Z"
}
]
},
"modified_at": "2016-02-04T19:38:31.931171Z",
"start_date": "2016-02-01T16:14:13Z",
"invite_date": "2016-02-01T13:14:08Z",
"contact": {
"id": "94833455-b9b8-4206-9bc9-a2f96c1706ca",
"type": "Contact",
"external_contactid": null,
"name": "Miss Marceline Herzog PhD"
},
"referring_source": "web"
}
Given a structure in that format, I'm unsure of the best path forward using Mongoose as the ORM. Again, this is read-only, so it would seem that creating a nested schema would work, but the mapping itself seems tedious to say the least. Is there a better/different option available for something with embedded documents?
Interesting. First, I would think about whether I need the whole document and all of its embedded subdocuments' fields. You said it will be read-only, so will each call need the entire document?
If not, I recommend taking a look at the MongoDB drivers (Node.js, .NET, Python, etc.) and using their aggregation pipelines to simplify the document if possible.
If you're using Mongoose, you will probably end up with two or three schemas, with schemas nested inside lists. See the Mongoose docs, e.g.:
var surveySchema = new Schema(
{ "type" : { type: String }, // a field literally named "type" needs the nested form in Mongoose
"data" : [dataSchema],
"invite_code" : String,
"end_date" : Date,
"created_at" : Date,
"url" : String,
"answers" : { "data": [answersSchema] },
"modified_at" : Date,
"start_date" : Date,
"invite_date" : Date,
"contact" : [ContactSchema],
"referring_source" : String
});
Or, you can use Mongoose references and build your own schema depending on what data you need for your report. A simple example:
var surveySchema = {
"id" : { type: Schema.Types.ObjectId },
"description" : { type: Schema.Types.ObjectId, ref: 'Data' }, // refs point at the referenced model by name
"contact" : { type: Schema.Types.ObjectId, ref: 'Contact' } // the model names here are illustrative
};