AVRO schema with optional record - apache-kafka

Hi folks, I need to create an Avro schema for the following example:
{ "Car" : { "Make" : "Ford" , "Year": 1990 , "Engine" : "V8" , "VIN" : "123123123" , "Plate" : "XXTT9O",
"Accident" : { "Date" : "2020/02/02" , "Location" : "NJ" , "Driver" : "Joe" } ,
"Owner" : { "Name" : "Joe" , "LastName" : "Doe" } } }
Accident and Owner are optional objects, and the schema also needs to validate the following subset message:
{ "Car" : { "Make" : "Tesla" , "Year": 2020 , "Engine" : "4ELEC" , "VIN" : "54545426" , "Plate" : "TESLA" } }
I read the Avro spec and saw a lot of optional attribute and array examples, but none of them worked for a record. How can I define a record as optional? Thanks.
The following schema, without any optional attributes, works:
{
"name": "MyClass", "type": "record", "namespace": "com.acme.avro", "fields": [
{
"name": "Car", "type": {
"name": "Car","type": "record","fields": [
{ "name": "Make", "type": "string" },
{ "name": "Year", "type": "int" },
{ "name": "Engine", "type": "string" },
{ "name": "VIN", "type": "string" },
{ "name": "Plate", "type": "string" },
{ "name": "Accident",
"type":
{ "name": "Accident",
"type": "record",
"fields": [
{ "name": "Date","type": "string" },
{ "name": "Location","type": "string" },
{ "name": "Driver", "type": "string" }
]
}
},
{ "name": "Owner",
"type":
{"name": "Owner",
"type": "record",
"fields": [
{"name": "Name", "type": "string" },
{"name": "LastName", "type": "string" }
]
}
}
]
}
}
]
}
When I change the Owner object as suggested, avro-tools returns an error:
{ "name": "Owner",
"type": [
"null",
"record" : {
"name": "Owner",
"fields": [
{"name": "Name", "type": "string" },
{"name": "LastName", "type": "string" }
]
}
] , "default": null }
]
}
}
]
}
Test:
Projects/avro_test$ java -jar avro-tools-1.8.2.jar fromjson --schema-file CarStackOver.avsc Car.json > o2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.avro.SchemaParseException: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream#4034c28c; line: 26, column: 13]
at org.apache.avro.Schema$Parser.parse(Schema.java:1034)
at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
at org.apache.avro.tool.Util.parseSchemaFromFS(Util.java:165)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:83)
at org.apache.avro.tool.Main.run(Main.java:87)
at org.apache.avro.tool.Main.main(Main.java:76)
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream#4034c28c; line: 26, column: 13]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:482)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:222)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:197)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:58)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2704)
at org.codehaus.jackson.map.ObjectMapper.readTree(ObjectMapper.java:1344)
at org.apache.avro.Schema$Parser.parse(Schema.java:1032)

You can make a record optional by putting it in a union with null.
Like this:
{
"name": "Owner",
"type": [
"null",
{
"name": "Owner",
"type": "record",
"fields": [
{ "name": "Name", type": "string" },
{ "name": "LastName", type": "string" },
]
}
],
"default": null
},
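For a quick end-to-end check outside avro-tools, here is a minimal sketch using Python's fastavro (my own addition; the library choice is an assumption, and any Avro implementation should behave the same). Both Accident and Owner are declared as ["null", record] unions with a null default, so the subset message that omits them should still validate:
import fastavro

car_schema = {
    "name": "MyClass", "type": "record", "namespace": "com.acme.avro",
    "fields": [
        {"name": "Car", "type": {
            "name": "Car", "type": "record", "fields": [
                {"name": "Make", "type": "string"},
                {"name": "Year", "type": "int"},
                {"name": "Engine", "type": "string"},
                {"name": "VIN", "type": "string"},
                {"name": "Plate", "type": "string"},
                # optional records: union with null plus a null default
                {"name": "Accident", "type": ["null", {
                    "name": "Accident", "type": "record", "fields": [
                        {"name": "Date", "type": "string"},
                        {"name": "Location", "type": "string"},
                        {"name": "Driver", "type": "string"},
                    ]}], "default": None},
                {"name": "Owner", "type": ["null", {
                    "name": "Owner", "type": "record", "fields": [
                        {"name": "Name", "type": "string"},
                        {"name": "LastName", "type": "string"},
                    ]}], "default": None},
            ]}},
    ],
}

parsed = fastavro.parse_schema(car_schema)

full = {"Car": {"Make": "Ford", "Year": 1990, "Engine": "V8", "VIN": "123123123",
                "Plate": "XXTT9O",
                "Accident": {"Date": "2020/02/02", "Location": "NJ", "Driver": "Joe"},
                "Owner": {"Name": "Joe", "LastName": "Doe"}}}
subset = {"Car": {"Make": "Tesla", "Year": 2020, "Engine": "4ELEC",
                  "VIN": "54545426", "Plate": "TESLA"}}

# Omitted optional fields should fall back to their null defaults, so both pass.
print(fastavro.validate(full, parsed), fastavro.validate(subset, parsed))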

Related

PubSub Subscription error with REPEATED Column Type - Avro Schema

I am trying to use the Pub/Sub subscription "Write to BigQuery" feature but am running into an issue with the "REPEATED" column type. The message I get when I update the subscription is:
Incompatible schema mode for field 'Values': field is REQUIRED in the topic schema, but REPEATED in the BigQuery table schema
My Avro Schema is:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": [
{
"type": "record",
"name": "Values",
"fields": [
{
"name": "AttributeID",
"type": "string"
},
{
"name": "AttributeValue",
"type": "string"
}
]
}
]
}
]
}
Input JSON That "Matches" Schema:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": {
"AttributeID": "TEST_ID_1",
"AttributeValue": "Value_1"
}
}
My table looks like:
ItemID | STRING | NULLABLE
UserType | STRING | NULLABLE
Values | RECORD | REPEATED
AttributeID | STRING | NULLABLE
AttributeValue | STRING | NULLABLE
I am able to "Test" and "Validate Schema" and it comes back with a success. Question is, what am I missing on the Avro for the Values node to make it "REPEATED" vs "Required" for subscription to be created.
The issue is that Values is not an array type in your Avro schema, meaning it expects only one in the message, while it is a repeated type in your BigQuery schema, meaning it expects a list of them.
Per Kamal's comment above, this schema works:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": {
"type": "array",
"items": {
"name": "NameDetails",
"type": "record",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "Value",
"type": "string"
}
]
}
}
}
]
}
The payload:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": [
{ "AttributeID": "TEST_ID_1" },
{ "AttributeValue": "Value_1" }
]
}
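As a quick sanity check before recreating the subscription (my own addition, assuming Python with fastavro installed), the array-of-records schema and payload above can be validated locally:
import fastavro

schema = {
    "type": "record",
    "name": "Avro",
    "fields": [
        {"name": "ItemID", "type": "string"},
        {"name": "UserType", "type": "string"},
        {"name": "Values", "type": {
            "type": "array",
            "items": {
                "name": "NameDetails",
                "type": "record",
                "fields": [
                    {"name": "ID", "type": "string"},
                    {"name": "Value", "type": "string"},
                ],
            },
        }},
    ],
}

payload = {
    "ItemID": "Item_1234",
    "UserType": "Item",
    "Values": [{"ID": "TEST_ID_1", "Value": "Value_1"}],
}

# Raises on mismatch; returns True when the payload fits the array-of-records schema.
print(fastavro.validate(payload, fastavro.parse_schema(schema)))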

org.apache.avro.AvroTypeException: Unknown union branch EventId

I am trying to convert JSON to Avro using 'kafka-avro-console-producer' and publish it to a Kafka topic.
I am able to do this for flat JSON/schemas, but for the schema and JSON given below I am getting an
"org.apache.avro.AvroTypeException: Unknown union branch EventId" error.
Any help would be appreciated.
Schema :
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [{
"name": "before",
"type": ["null", {
"type": "record",
"name": "Value",
"fields": [{
"name": "EventId",
"type": "long"
}, {
"name": "CameraId",
"type": ["null", "long"],
"default": null
}, {
"name": "SiteId",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Value"
}],
"default": null
}, {
"name": "after",
"type": ["null", "Value"],
"default": null
}, {
"name": "op",
"type": "string"
}, {
"name": "ts_ms",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Envelope"
}
And the JSON input looks like this:
{
"before": null,
"after": {
"EventId": 12,
"CameraId": 10,
"SiteId": 11974
},
"op": "C",
"ts_ms": null
}
In my case I can't alter the schema; I can only alter the JSON in such a way that it works.
If you are using the Avro JSON format, the input you have is slightly off. For unions, non-null values need to be wrapped in an object whose key names the union branch: https://avro.apache.org/docs/current/spec.html#json_encoding
See below for an example which I think should work.
{
"before": null,
"after": {
"CoreOLTPEvents.dbo.Event.Value": {
"EventId": 12,
"CameraId": {
"long": 10
},
"SiteId": {
"long": 11974
}
}
},
"op": "C",
"ts_ms": null
}
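As a quick local check (my own addition, assuming Python with fastavro installed), you can feed the wrapped JSON through fastavro's Avro-JSON reader against the schema to confirm it decodes before trying the console producer:
import io
import json

from fastavro import json_reader

# load the Debezium-style envelope schema from a local file (filename is hypothetical);
# if the parser complains about the connect.name metadata, drop those keys as in the answer below
with open("envelope.avsc") as f:
    schema = json.load(f)

# Avro JSON encoding: non-null union values are wrapped in an object keyed by the
# branch name ("CoreOLTPEvents.dbo.Event.Value", "long", ...)
wrapped = {
    "before": None,
    "after": {
        "CoreOLTPEvents.dbo.Event.Value": {
            "EventId": 12,
            "CameraId": {"long": 10},
            "SiteId": {"long": 11974},
        }
    },
    "op": "C",
    "ts_ms": None,
}

# json_reader consumes Avro-JSON text and yields plain Python records
for record in json_reader(io.StringIO(json.dumps(wrapped)), schema):
    print(record["after"])  # -> {'EventId': 12, 'CameraId': 10, 'SiteId': 11974}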
Removing "connect.name": "CoreOLTPEvents.dbo.Event.Value" and "connect.name": "CoreOLTPEvents.dbo.Event.Envelope" as The RecordType can only contains {'namespace', 'aliases', 'fields', 'name', 'type', 'doc'} keys.
Could you try with below schema and see if you are able to produce the msg?
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "EventId",
"type": "long"
},
{
"name": "CameraId",
"type": [
"null",
"long"
],
"default": "null"
},
{
"name": "SiteId",
"type": [
"null",
"long"
],
"default": "null"
}
]
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
},
{
"name": "op",
"type": "string"
},
{
"name": "ts_ms",
"type": [
"null",
"long"
],
"default": null
}
]
}

AWS-API gateway -- jsonschema child object should validate when parent object exists

I need to create a JSON Schema for the following JSON input. The properties under Vehicle (Manufacturer, Model, etc.) should be required only when the Vehicle object exists.
{
"Manufacturer": "",
"Characteristics": {
"Starts": "new",
"vehicle": {
"Manufacturer": "hello",
"Model": "hh",
"Opening": "",
"Quantity": "",
"Principle": "",
"Type": ""
}
}
}
I tried the following JSON Schema. It works when the Vehicle object is absent, but if we rename Vehicle to something else, e.g. Vehicle1, it doesn't give an error. Please guide me on how to fix this.
{
"$schema": "http://json-schema.org/draft-07/schema",
"type": "object",
"properties": {
"Manufacturer": {
"type": [
"string",
"null"
]
},
"Characteristics": {
"type": "object",
"properties": {
"Starts": {
"type": [
"string",
"null"
]
},
"Vehicle": {
"$ref": "#/definitions/Vehicle"
}
},
"required": [
"Starts", "Vehcle"
]
}
},
"required": [
"Manufacturer"
],
"definitions": {
"Vehicle": {
"type": "object",
"properties": {
"Manufacturer": {
"type": [
"string",
"null"
]
},
"Model": {
"type": [
"string",
"null"
]
},
"Opening": {
"type": [
"string",
"null"
]
},
"PanelQuantity": {
"type": [
"string",
"null"
]
},
"Principle": {
"type": [
"string",
"null"
]
},
"Type": {
"type": [
"string",
"null"
]
}
},
"required": ["Manufacturer", "Model", "Opening", "Quantity", "Principle", "Type"]
}
}
}
Thanks,
Bhaskar
Sounds like you want to add "additionalProperties": false -- which will generate an error if any other properties are present that aren't defined under properties.
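For illustration, here is a trimmed-down sketch using the Python jsonschema package (my choice for a local check; API Gateway uses its own validator) showing how "additionalProperties": false on Characteristics rejects a misspelled key such as Vehicle1:
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "Characteristics": {
            "type": "object",
            "additionalProperties": False,   # reject keys not listed under properties
            "properties": {
                "Starts": {"type": ["string", "null"]},
                "Vehicle": {"$ref": "#/definitions/Vehicle"},
            },
            "required": ["Starts"],
        }
    },
    "definitions": {
        "Vehicle": {
            "type": "object",
            "properties": {"Manufacturer": {"type": ["string", "null"]}},
            "required": ["Manufacturer"],
        }
    },
}

good = {"Characteristics": {"Starts": "new", "Vehicle": {"Manufacturer": "hello"}}}
bad = {"Characteristics": {"Starts": "new", "Vehicle1": {"Manufacturer": "hello"}}}

validator = Draft7Validator(schema)
print(list(validator.iter_errors(good)))                 # [] -> valid
print([e.message for e in validator.iter_errors(bad)])   # reports the unexpected 'Vehicle1'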

avro schema question: TypeError: unhashable type: 'dict'

I need to write an Avro schema for the following data. The exposure field is an array of arrays, each containing 3 numbers.
{
"Response": {
"status": "",
"responseDetail": {
"request_id": "Z618978.R",
"exposure": [
[
372,
20000000.0,
31567227140.238808
]
[
373,
480000000.0,
96567227140.238808
]
[
374,
23300000.0,
251567627149.238808
]
],
"product": "ABC",
}
}
}
So I came up with a schema like the following:
{
"name": "Response",
"type":{
"name": "algoResponseType",
"type": "record",
"fields":
[
{"name": "status", "type": ["null","string"]},
{
"name": "responseDetail",
"type": {
"name": "responseDetailType",
"type": "record",
"fields":
[
{"name": "request_id", "type": "string"},
{
"name": "exposure",
"type": {
"type": "array",
"items":
{
"name": "single_exposure",
"type": {
"type": "array",
"items": "string"
}
}
}
},
{"name": "product", "type": ["null","string"]}
]
}
}
]
}
}
When I tried to register the schema, I got the following error: TypeError: unhashable type: 'dict', which suggests a dict was used somewhere a hashable value (such as a dictionary key) was expected.
Traceback (most recent call last):
File "sa_publisher_main4test.py", line 28, in <module>
schema_registry_client)
File "/usr/local/lib64/python3.6/site-packages/confluent_kafka/schema_registry/avro.py", line 175, in __init__
parsed_schema = parse_schema(schema_dict)
File "fastavro/_schema.pyx", line 71, in fastavro._schema.parse_schema
File "fastavro/_schema.pyx", line 204, in fastavro._schema._parse_schema
TypeError: unhashable type: 'dict'
Can anyone help point out what is causing the error?
There are a few issues.
First, at the very top level of your schema, you have the following:
{
"name": "Response",
"type": {...}
}
But this isn't right. The top level should be a record type with a field called Response. So it should look like this:
{
"name": "Response",
"type": "record",
"fields": [
{
"name": "Response",
"type": {...}
}
]
}
The second problem is that for the array of arrays, you currently have the following:
{
"name":"exposure",
"type":{
"type":"array",
"items":{
"name":"single_exposure",
"type":{
"type":"array",
"items":"string"
}
}
}
}
But instead it should look like this:
{
"name":"exposure",
"type":{
"type":"array",
"items":{
"type":"array",
"items":"string"
}
}
}
After fixing those, the schema can be parsed, but your data contains an array of arrays of floats while your schema says it should be an array of arrays of strings. Therefore either the schema item type needs to be changed to float (or double), or the data needs to be strings.
For reference, here's an example script that works after fixing those issues:
import fastavro
s = {
"name":"Response",
"type":"record",
"fields":[
{
"name":"Response",
"type": {
"name":"algoResponseType",
"type":"record",
"fields":[
{
"name":"status",
"type":[
"null",
"string"
]
},
{
"name":"responseDetail",
"type":{
"name":"responseDetailType",
"type":"record",
"fields":[
{
"name":"request_id",
"type":"string"
},
{
"name":"exposure",
"type":{
"type":"array",
"items":{
"type":"array",
"items":"string"
}
}
},
{
"name":"product",
"type":[
"null",
"string"
]
}
]
}
}
]
}
}
]
}
data = {
"Response":{
"status":"",
"responseDetail":{
"request_id":"Z618978.R",
"exposure":[
[
"372",
"20000000.0",
"31567227140.238808"
],
[
"373",
"480000000.0",
"96567227140.238808"
],
[
"374",
"23300000.0",
"251567627149.238808"
]
],
"product":"ABC"
}
}
}
parsed_schema = fastavro.parse_schema(s)
fastavro.validate(data, parsed_schema)
The error you get is because Schema Registry doesn't accept your schema: your top element has to be a record with a "Response" field.
The schema below should work; I changed the array item type, since your message contains floats and not strings.
{
"type": "record",
"name": "yourMessage",
"fields": [
{
"name": "Response",
"type": {
"name": "AlgoResponseType",
"type": "record",
"fields": [
{
"name": "status",
"type": [
"null",
"string"
]
},
{
"name": "responseDetail",
"type": {
"name": "ResponseDetailType",
"type": "record",
"fields": [
{
"name": "request_id",
"type": "string"
},
{
"name": "exposure",
"type": {
"type": "array",
"items": {
"type": "array",
"items": "float"
}
}
},
{
"name": "product",
"type": [
"null",
"string"
]
}
]
}
}
]
}
}
]
}
Your message is also not correct, as the array elements need commas between them:
{
"Response": {
"status": "",
"responseDetail": {
"request_id": "Z618978.R",
"exposure": [
[
372,
20000000.0,
31567227140.238808
],
[
373,
480000000.0,
96567227140.238808
],
[
374,
23300000.0,
251567627149.238808
]
],
"product": "ABC",
}
}
}
As you are using fastavro, I recommend running this code to check that your message matches the schema.
from fastavro.validation import validate
import json
with open('schema.avsc', 'r') as schema_file:
    schema = json.loads(schema_file.read())
message = {
"Response": {
"status": "",
"responseDetail": {
"request_id": "Z618978.R",
"exposure": [
[
372,
20000000.0,
31567227140.238808
],
[
373,
480000000.0,
96567227140.238808
],
[
374,
23300000.0,
251567627149.238808
]
],
"product": "ABC",
}
}
}
try:
    validate(message, schema)
    print('Message is matching schema')
except Exception as ex:
    print(ex)

Kafka JDBC sink not handling null values

I am trying to insert data with the Kafka JDBC Sink connector, but it returns this exception:
org.apache.kafka.connect.errors.DataException: Invalid null value for required INT64 field
The records have the following schema:
[
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int64",
"field": "ID"
},
{
"type": "int64",
"field": "TENANT_ID"
},
{
"type": "string",
"field": "ITEM"
},
{
"type": "int64",
"field": "TIPO"
},
{
"type": "int64",
"field": "BUSINESS_CONCEPT"
},
{
"type": "string",
"field": "ETIQUETA"
},
{
"type": "string",
"field": "VALOR"
},
{
"type": "string",
"field": "GG_T_TYPE"
},
{
"type": "string",
"field": "GG_T_TIMESTAMP"
},
{
"type": "string",
"field": "TD_T_TIMESTAMP"
},
{
"type": "string",
"field": "POS"
}
]
},
"payload": {
"ID": 298457,
"TENANT_ID": 83,
"ITEM": "0-0-0",
"TIPO": 4,
"BUSINESS_CONCEPT": null,
"ETIQUETA": "¿Cuándo ha ocurrido?",
"VALOR": "2019-05-31T10:33:00Z",
"GG_T_TYPE": "I",
"GG_T_TIMESTAMP": "2019-05-31 14:35:19.002221",
"TD_T_TIMESTAMP": "2019-06-05T10:46:55.0106",
"POS": "00000000530096832544"
}
}
]
As you can see, the value BUSINESS_CONCEPT can be null. It is the only null value, so I suppose the exception is due to that field. How could I make the sink insert the value as null?
You need to change the definition of
{
"type": "int64",
"field": "BUSINESS_CONCEPT"
}
to
{
"type": "int64",
"optional": true,
"field": "BUSINESS_CONCEPT"
}
in order to treat BUSINESS_CONCEPT as an optional field. (Connect's JSON schema format marks nullable fields with "optional": true rather than an Avro-style union.)
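For illustration, a minimal sketch of the corrected envelope as the connector would receive it (my own addition; the field list is trimmed, and the producer and connector configuration are assumed):
import json

# Trimmed-down Connect-JSON envelope with BUSINESS_CONCEPT marked optional,
# so a null payload value is accepted (the target column must also be nullable).
record = {
    "schema": {
        "type": "struct",
        "fields": [
            {"type": "int64", "field": "ID"},
            {"type": "int64", "optional": True, "field": "BUSINESS_CONCEPT"},
        ],
    },
    "payload": {"ID": 298457, "BUSINESS_CONCEPT": None},
}

# Serialized form of the Kafka message value expected by the JsonConverter.
print(json.dumps(record, indent=2))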