How to map a JSON string into an object type in a sink transformation - azure-data-factory

I am using Azure Data Factory with a data flow transformation. I have a CSV that contains a column holding a JSON object string; below is an example, including the header:
"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-16 02:12:30.0000000","state","{""jobEndState"":""negative""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 06:33:00.0000000","23","{""unit"":""kg""}"
I want to store the data in a JSON document like this:
{
"id": "99c9347ab7c34733a4fe0623e1496ffd",
"name": "data1",
"values": [
{
"timestamp": "2021-03-18 05:53:00.0000000",
"value": "0",
"metadata": {
"unit": "%"
}
},
{
"timestamp": "2021-03-19 05:53:00.0000000",
"value": "4",
"metadata": {
"jobName": "RecipeB"
}
}
....
]
}
The challenge is that the metadata has dynamic content, meaning it will always be a JSON object but its content can vary, so I cannot define a schema for it. Currently the column "Metadata" in the sink schema is defined as an object, but whenever I run the transformation I run into an exception:
Conversion from ArrayType(StructType(StructField(timestamp,StringType,false),
StructField(value,StringType,false), StructField(metadata,StringType,false)),true) to ArrayType(StructType(StructField(timestamp,StringType,true),
StructField(value,StringType,true), StructField(metadata,StructType(StructField(,StringType,true)),true)),false) not defined

We can get the output you expect, but we need an expression that extracts the object from the Metadata string.
Please refer to my steps; here's my source:
Derived column expressions that build a JSON structure from the data:
#(id=Id,
name=Name,
values=#(timestamp=Timestamp,
value=Value,
metadata=#(unit=substring(split(Metadata,':')[2], 3, length(split(Metadata,':')[2])-6))))
Sink mapping and output data preview:
The catch is that your metadata value is an object whose schema and content can vary: the key may be 'unit', 'jobName', or anything else. The sink schema can only be built manually; it doesn't support expressions. That's the limitation.
We can't achieve fully dynamic metadata within Data Factory.
HTH.
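If stepping outside Data Factory is an option, the fully dynamic reshaping takes only a few lines of script. Here is a minimal Python sketch (not part of the accepted approach above) that uses the sample CSV from the question; json.loads preserves whatever keys each Metadata object happens to contain:

```python
import csv
import io
import json

# Sample rows from the question (doubled quotes are standard CSV escaping).
csv_text = '''"Id","Name","Timestamp","Value","Metadata"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-18 05:53:00.0000000","0","{""unit"":""%""}"
"99c9347ab7c34733a4fe0623e1496ffd","data1","2021-03-19 05:53:00.0000000","4","{""jobName"":""RecipeB""}"
'''

docs = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # Group rows by Id and parse the Metadata string into a real object,
    # so its keys can vary freely from row to row.
    doc = docs.setdefault(row["Id"], {"id": row["Id"], "name": row["Name"], "values": []})
    doc["values"].append({
        "timestamp": row["Timestamp"],
        "value": row["Value"],
        "metadata": json.loads(row["Metadata"]),
    })

print(json.dumps(list(docs.values()), indent=2))
```

This produces exactly the nested shape shown above, with one document per Id and a values array per row.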

Related

InvalidInputException: AWS Personalize error importing boolean fields in user or item metadata

I'm building a recommender system using AWS Personalize. The user-personalization recipe has three dataset inputs: interactions, user_metadata, and item_metadata. I am having trouble importing the user metadata, which contains a boolean field.
I created the following schema:
user_schema = {
"type": "record",
"name": "Users",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "type",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "lang",
"type": [
"null",
"string"
],
"categorical": True
},
{
"name": "is_active",
"type": "boolean"
}
],
"version": "1.0"
}
The dataset CSV file content looks like:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,True
01027061015#mail.ru ,facebook,eng,True
03dadahda#gmail.com ,facebook,geo,True
040168fadw#gmail.com ,facebook,geo,False
I uploaded the CSV file to an S3 bucket.
When I try to create a dataset import job, it gives me the following exception:
InvalidInputException: An error occurred (InvalidInputException) when calling the CreateDatasetImportJob operation: Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
I tested it, and it works without the boolean field is_active. There are no NaN values in that column!
It would be nice to be able to directly test whether your pandas DataFrame or CSV file conforms to a given schema, and possibly get a more detailed error message.
Does anybody know how to format the boolean field to fix this issue?
I found a solution after many trials. I checked the AWS Personalize documentation (https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html#dataset-requirements), which says that for boolean fields the values true and false must be lower case in your data.
I then tried several things, and one of them worked, though it still took hours of trial and error.
Solution:
Convert the column in the pandas DataFrame to string (object) dtype.
Lowercase the True and False string values to get true and false.
Store the pandas DataFrame as a CSV file.
This results in lowercase values of true and false:
USER_ID,type,lang,is_active
1234#gmail.com ,,geo,true
01027061015#mail.ru ,facebook,eng,true
03dadahda#gmail.com ,facebook,geo,true
040168fadw#gmail.com ,facebook,geo,false
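The steps above can be sketched in pandas (a small, hypothetical frame standing in for the real dataset; only the is_active column matters here):

```python
import pandas as pd

# Hypothetical frame mirroring the CSV above.
df = pd.DataFrame({
    "USER_ID": ["1234#gmail.com", "040168fadw#gmail.com"],
    "type": [None, "facebook"],
    "lang": ["geo", "geo"],
    "is_active": [True, False],
})

# Convert the boolean column to string, then lowercase it so the
# exported CSV contains "true"/"false" as Personalize requires.
df["is_active"] = df["is_active"].astype(str).str.lower()

df.to_csv("users.csv", index=False)
```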
That's all! There is no need to change the "boolean" type in the schema to "string"!
Hopefully they'll fix this soon; I've contacted AWS technical support about the same issue.

Apache Kafka & JSON Schema

I am starting to get into Apache Kafka (Confluent) and have some questions regarding the use of schemas.
First, is my general understanding correct that a schema is used for validating the data? My understanding is that when the data is "produced", the schema checks whether the keys and values fit the predefined structure and splits them accordingly.
My current technical setup is as follows:
Python:
from confluent_kafka import Producer
from config import conf
import json
# create producer
producer = Producer(conf)
producer.produce("datagen-topic", json.dumps({"product":"table","brand":"abc"}))
producer.flush()
In Confluent, I set up a JSON key schema for my topic:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"properties": {
"brand": {
"type": "string"
},
"product": {
"type": "string"
}
},
"required": [
"product",
"brand"
],
"type": "object"
}
Now, when I produce the data, the message in Confluent contains content only in "Value"; "Key" and "Header" are null:
{
"product": "table",
"brand": "abc"
}
Basically it makes no difference whether I have this schema set up or not, so I guess it's just not working as I set it up. Can you help me see where my thinking is wrong or where my code is missing something?
The Confluent Python library's plain Producer class doesn't interact with the Schema Registry in any way, so your message isn't validated.
You'll want to use SerializingProducer as in this example: https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/json_producer.py
If you want non-null keys and headers, you'll need to pass those to the produce() call.
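As background on what validation would actually do: the checks implied by the schema above (required keys, string types) can be sketched in plain Python. This is only an illustration of the concept, not how Schema Registry implements it:

```python
import json

# The "required" and "type" constraints from the JSON schema above.
SCHEMA_REQUIRED = ["product", "brand"]

def validate_payload(raw: str) -> list:
    """Return a list of problems; an empty list means the payload
    passes the required/type checks from the schema."""
    obj = json.loads(raw)
    problems = []
    for key in SCHEMA_REQUIRED:
        if key not in obj:
            problems.append(f"missing required key: {key}")
        elif not isinstance(obj[key], str):
            problems.append(f"{key} must be a string")
    return problems

print(validate_payload('{"product": "table", "brand": "abc"}'))  # []
print(validate_payload('{"product": "table"}'))
```

With the plain Producer, no such check ever runs; with SerializingProducer plus a registered schema, a non-conforming message is rejected at produce time.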

Passing multiple json as a payload for a request in Gatling

sample json payload:
'{
"Stub1": "XXXXX",
"Stub2": "XXXXX-3047-4ed3-b73b-83fbcc0c2aa9",
"Code": "CodeX",
"people": [
{
"ID": "XXXXX-6425-EA11-A94A-A08CFDCA6C02"
"customer": {
"Id": 173,
"Account": 275,
"AFile": "tel"
},
"products": [
{
"product": 1,
"type": "A",
"stub1": "XXXXX-42E1-4A13-8190-20C2DE39C0A5",
"Stub2": "XXXXX-FC4F-41AB-92E7-A408E7F4C632",
"stub3": "XXXXX-A2B4-4ADF-96C5-8F3CDCF5821D",
"Stub4": "XXXXX-1948-4B3C-987F-B5EC4D6C2824"
},
{
"product": 2,
"type": "B",
"stub1": "XXXXX-42E1-4A13-8190-20C2DE39C0A5",
"Stub2": "XXXXX-FC4F-41AB-92E7-A408E7F4C632",
"stub3": "XXXXX-A2B4-4ADF-96C5-8F3CDCF5821D",
"Stub4": "XXXXX-1948-4B3C-987F-B5EC4D6C2824"
}
]
}
]
}'
I am working on a POST call. Is there any way to feed multiple JSON files as the payload in Gatling? I am using body(RawFileBody("file.json")) as the JSON body here.
This works fine for a single JSON file, but I want to check the response for multiple JSON files. Is there any way to parametrize this and get a response for each JSON file?
As far as I can see, there are a couple of ways you could do this.
Use a JSON feeder (https://gatling.io/docs/current/session/feeder#json-feeders). This needs your multiple JSON files to be merged into a single file whose root element is a JSON array; essentially you put the JSON objects you have into an array inside one file.
Create a Scala Iterator holding the names of the JSON files you're going to use, e.g.:
val fileNames = Iterator("file1.json", "file2.json")
// and later, in your scenario
body(RawFileBody(fileNames.next()))
Note that this method cannot be shared across users, as the iterator initializes separately for each user; you'd have to use repeat or something similar to send multiple files as a single user.
You could do something similar by keeping the file names in a list in Gatling's session, but that session would still not be shared between the different users you inject into your scenario.
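For the feeder approach, the combined file would be a single JSON file whose root is an array of your payload objects. A hypothetical payloads.json, with the structure trimmed down to two of the fields from the sample payload:

```json
[
  { "Stub1": "XXXXX", "Code": "CodeX" },
  { "Stub1": "YYYYY", "Code": "CodeY" }
]
```

Each element of the array becomes one record served by the feeder, so each virtual user (or iteration) picks up the next payload.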

how to setup copy activity for json in Azure Data Factory v2 schema not detected

I am trying to read a JSON file using a copy activity and write the data to SQL Server.
My JSON file is available in blob storage.
I have set the file format to JSON format.
When I try to import the schema I get the error: Error occurred when deserializing source JSON data. Please check if the data is in valid JSON object format. Activity ID: 2f799221-f037-4f72-8e6c-385778929110
My JSON data:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}
Regards,
Manish
Based on your description and your sample source data, you can import the schema directly; however, the columns are nested.
If you want to flatten the nested JSON into rows before storing it in a SQL Server database, you could run an Azure Function activity before the Copy activity.
Alternatively, you could execute a stored procedure in the SQL Server dataset.
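Whichever route you take (Azure Function or stored procedure), the flattening itself is simple. A hedged Python sketch, using the sample document above, that turns each entry of custom.dimensions into one key/value row for a tabular sink:

```python
import json

# The sample document from the question, trimmed to two dimensions.
doc = json.loads('''{
  "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
  "context": {
    "device": { "type": "PC" },
    "custom": {
      "dimensions": [
        { "TargetResourceType": "Microsoft.Compute/virtualMachines" },
        { "OccurrenceTime": "1/13/2017 11:24:37 AM" }
      ]
    }
  }
}''')

# One flat row per dimension entry, ready to insert into SQL Server.
rows = [
    {"id": doc["id"],
     "device_type": doc["context"]["device"]["type"],
     "key": k,
     "value": v}
    for dim in doc["context"]["custom"]["dimensions"]
    for k, v in dim.items()
]
```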

datatype of complextype entity is null when returning an array of complex types

We have created a complex-type field "carriers", which is an array of Carrier objects. See the metadata below:
"dataProperties": [
{
"name": "carriers",
"complexTypeName":"Carrier#Test",
"isScalar":false
}]
The Carrier entity is defined as below:
{
"shortName": "Carrier",
"namespace": "Test",
"isComplexType": true,
"dataProperties": [
{
"name": "Testing",
"isScalar":true,
"dataType": "String"
}
]
}
We are trying to return an array of complex types from a REST service call in Breeze. We get an error in breeze.debug.js in the method proto._updateTargetFromRaw; it occurs because the dataType is null.
Any idea how to fix this issue?
I'm guessing the problem is in your "complexTypeName". You wrote "Carrier#Test" when I think you meant to write "Carrier:#Test". The ":#" combination separates the short name from the namespace; you omitted the colon.
Hope that's the explanation.
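If that is indeed the cause, the corrected property metadata would read:

```json
"dataProperties": [
  {
    "name": "carriers",
    "complexTypeName": "Carrier:#Test",
    "isScalar": false
  }
]
```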