Avro GenericData.Record ignores data types - Scala

I have the following avro schema
{ "namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
I use the following snippet to set up a Record:
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

val schema = new Schema.Parser().parse(new File("data/user.avsc"))
val user1 = new GenericData.Record(schema) // strangely, this only checks for valid field names, NOT types
user1.put("name", "Fred")
user1.put("favorite_number", "Jones")
I would have thought this would fail to validate against the schema.
When I add the line
user1.put("last_name", 100)
it generates a runtime error, which is what I would expect in the first case as well:
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: last_name
at org.apache.avro.generic.GenericData$Record.put(GenericData.java:125)
at csv2avro$.main(csv2avro.scala:40)
at csv2avro.main(csv2avro.scala)
What's going on here?

It won't fail when you add the value to the Record; it fails when you try to serialize, because that is the point at which Avro matches the value against the schema's types. As far as I'm aware, that is the only place it does type checking.
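As a rough illustration (a sketch building on the snippet in the question, not code from the original answer), serializing with a GenericDatumWriter is where the mismatch surfaces, and GenericData.validate can be used to check a record eagerly:

import java.io.ByteArrayOutputStream
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Putting "Jones" into the ["int", "null"] union is accepted silently...
user1.put("favorite_number", "Jones")

// ...but serialization fails, because this is where Avro resolves values against the schema.
val writer  = new GenericDatumWriter[GenericRecord](schema)
val out     = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(user1, encoder) // throws here, e.g. UnresolvedUnionException: Not in union ["int","null"]: Jones
encoder.flush()

// To fail earlier, validate the record explicitly:
val isValid = GenericData.get().validate(schema, user1) // false for the record above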

Related

Avro: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT

I am working on an Avro schema and trying to create test data to test it with Kafka, but when I produce the message I get this error: "Caused by: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT"
The schema I created is like this:
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "doc": "This schema is for streaming information",
  "fields": [
    {"name": "batchId", "type": "long"},
    {"name": "status", "type": {"type": "enum", "name": "PlannedTripRequestedStatus", "namespace": "com.acme.avro.Dtos", "symbols": ["COMPLETED", "FAILED"]}},
    {"name": "runRefId", "type": "int"},
    {"name": "tripId", "type": ["null", "int"]},
    {"name": "referenceNumber", "type": ["null", "string"]},
    {"name": "errorMessage", "type": ["null", "string"]}
  ]
}
The test data is like this:
{
  "batchId": {
    "long": 3
  },
  "status": "COMPLETED",
  "runRefId": {
    "int": 1000
  },
  "tripId": {
    "int": 200
  },
  "referenceNumber": {
    "string": "ReferenceNumber1111"
  },
  "errorMessage": {
    "string": "Hello World"
  }
}
However, when I registered this schema and tried to produce a message with the Confluent console tool, I got the error org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT. The whole error message is like this:
org.apache.kafka.common.errors.SerializationException: Error deserializing {"batchId": ...} to Avro of schema {"type":...}" at io.confluent.kafka.formatter.AvroMessageReader.readFrom(AvroMessageReader.java:134)
at io.confluent.kafka.formatter.SchemaMessageReader.readMessage(SchemaMessageReader.java:325)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:51)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
Caused by: org.apache.avro.AvroTypeException: Expected long. Got START_OBJECT
at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:511)
at org.apache.avro.io.JsonDecoder.readLong(JsonDecoder.java:177)
at org.apache.avro.io.ResolvingDecoder.readLong(ResolvingDecoder.java:169)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:197)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at io.confluent.kafka.schemaregistry.avro.AvroSchemaUtils.toObject(AvroSchemaUtils.java:213)
at io.confluent.kafka.formatter.AvroMessageReader.readFrom(AvroMessageReader.java:124)
Does anyone know what I did wrong with my schema or test data? Thank you so much!
You only need the type wrapper object when the JSON encoding would otherwise be ambiguous (a union of string and number, for example) or when the field is nullable. For batchId and runRefId, which are plain long and int fields, just use simple values.
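For example, the test message from the question would look like this once the wrappers are dropped from the non-union fields (the nullable union fields keep their type wrappers):
{
  "batchId": 3,
  "status": "COMPLETED",
  "runRefId": 1000,
  "tripId": {
    "int": 200
  },
  "referenceNumber": {
    "string": "ReferenceNumber1111"
  },
  "errorMessage": {
    "string": "Hello World"
  }
}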

Default value for a record in Avro?

I wanted to add a new field of type "record" to an Avro schema; the field cannot be null and therefore has a default value. The topic is set to compatibility type FULL_TRANSITIVE.
The schema did not change from the last version, only the last field produktType was added:
{
  "type": "record",
  "name": "Finished",
  "namespace": "com.domain.finishing",
  "doc": "Schema to indicate the end of the ongoing saga...",
  "fields": [
    {
      "name": "numberOfAThing",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "previousNumbersOfThings",
      "type": {
        "type": "array",
        "items": {
          "type": "string",
          "avro.java.string": "String"
        }
      },
      "default": []
    },
    {
      "name": "produktType",
      "type": {
        "type": "record",
        "name": "ProduktType",
        "fields": [
          {
            "name": "art",
            "type": "int",
            "default": 1
          },
          {
            "name": "code",
            "type": "int",
            "default": 10003
          }
        ]
      },
      "default": { "art": 1, "code": 10003 }
    }
  ]
}
I've checked with the schema-registry that the new version of the schema is compatible.
But when we try to read old messages that do not contain the new field using the new schema (the one with the defaults), we get an EOFException and it does not work.
The part that causes headaches is the newly added field "produktType". It cannot be null, so we tried adding defaults, which works for primitive-type fields ("int" and so on). The line "default": { "art": 1, "code": 10003 } seems to be accepted by the schema registry but does not appear to have any effect when we read messages from the topic that do not contain this field.
The schema registry also marks the schema as incompatible when the final "default": { "art": 1, "code": 10003 } line is missing (although "default": true also passes the compatibility check...).
The Avro specification for complex types contains an example for type "record" with default {"a": 1}, which is where we got the idea from. But since it's not working, something must still be wrong.
There are similar questions, like this one claiming records can only have null as a default, or this unanswered one.
Is this supposed to work? And if so, how can defaults for these "type": "record" fields be defined? Or is it still true that records can only have null as a default?
Thanks!
Update on the compatibility cases:
Schema V1 (old one without the new field): can read v1 and v2 records.
Schema V2 (new field added): cannot read v1 records, can read v2 records.
The case where a consumer using schema v2 encounters records written with v1 is the surprising one, as I thought that is exactly what the defaults are for.
Even weirder: when I don't set the new field's values at all, the resulting v2 record still contains some values.
I have no idea where the value for code comes from; the schema uses different numbers for its defaults.
So one of them seems to work, while the other does not.
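A minimal sketch (not part of the original question) of how this resolution path can be exercised with plain Avro, assuming hypothetical file names finished-v1.avsc / finished-v2.avsc and a payload v1Bytes written with the old schema; the defaults from the reader schema should be applied at this step:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val writerSchema = new Schema.Parser().parse(new java.io.File("finished-v1.avsc")) // old schema (v1)
val readerSchema = new Schema.Parser().parse(new java.io.File("finished-v2.avsc")) // new schema with produktType (v2)

val v1Bytes: Array[Byte] = ??? // placeholder: a binary payload that was written with the v1 schema

// Schema resolution between writer and reader schema is where field defaults get applied.
val reader  = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
val decoder = DecoderFactory.get().binaryDecoder(v1Bytes, null)
val record  = reader.read(null, decoder)

println(record.get("produktType")) // should print the default: {"art": 1, "code": 10003}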

Avro invalid default for union field

I'm trying to serialise an Avro schema and write it to the Hortonworks schema registry, but I'm getting the following error message during the write operation.
Caused by: java.lang.RuntimeException: An exception was thrown while processing request with message: [Invalid default for field viewingMode: null not a [{"type":"record","name":"aName","namespace":"domain.assembled","fields":[{"name":"aKey","type":"string"}]},{"type":"record","name":"anotherName","namespace":"domain.assembled","fields":[{"name":"anotherKey","type":"string"},{"name":"yetAnotherKey","type":"string"}]}]]
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.handleSchemaIdVersionResponse(SchemaRegistryClient.java:678)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.doAddSchemaVersion(SchemaRegistryClient.java:664)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.lambda$addSchemaVersion$1(SchemaRegistryClient.java:591)
This is the Avro schema:
{
  "type": "record",
  "name": "aSchema",
  "namespace": "domain.assembled",
  "fields": [
    {
      "name": "viewingMode",
      "type": [
        {
          "name": "aName",
          "type": "record",
          "fields": [
            {"name": "aKey", "type": "string"}
          ]
        },
        {
          "name": "anotherName",
          "type": "record",
          "fields": [
            {"name": "anotherKey", "type": "string"},
            {"name": "yetAnotherKey", "type": "string"}
          ]
        }
      ]
    }
  ]
}
However, if I add "null" as the first type of the union, this succeeds. Do Avro union types require a "null"? In my case this would be an incorrect representation of the data, so I'm not keen on doing it.
If it makes any difference I'm using avro 1.9.1.
Also, apologies if the tags are incorrect but couldn't find a hortonworks-schema-registry tag and don't have enough rep to create a new one.
Turns out it was an issue with Hortonworks' schema registry.
This has actually already been fixed here and I've requested a new release here. Hopefully this happens soon.
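For anyone hitting something similar, a quick sanity check is to parse the schema with plain Avro, independent of the registry. A rough sketch, assuming the schema JSON is saved locally as aSchema.avsc:

import org.apache.avro.Schema

// If this parses without throwing, the schema itself is valid Avro and the
// rejection comes from the registry's own validation rather than from Avro.
val schemaJson = scala.io.Source.fromFile("aSchema.avsc").mkString
val parsed     = new Schema.Parser().parse(schemaJson)
println(parsed.getField("viewingMode").schema().getTypes) // the two record branches of the union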

Transforming a Map field in an Avro schema to a string using a Kafka SMT in the jdbc-sink-connector?

I have an Avro schema defined as follows:
[
  {
    "namespace": "com.fun.message",
    "type": "record",
    "name": "FileData",
    "doc": "Avro Schema for FileData",
    "fields": [
      {"name": "id", "type": "string", "doc": "Unique file id"},
      {"name": "absolutePath", "type": "string", "doc": "Absolute path of file"},
      {"name": "fileName", "type": "string", "doc": "File name"},
      {"name": "source", "type": "string", "doc": "unique identification of source"},
      {"name": "metaData", "type": {"type": "map", "values": "string"}}
    ]
  }
]
I want to push this data to Postgres using the jdbc-sink-connector, and to do that I need to transform the "metaData" field (which is a map type) in my schema to a string. How do I do this?
You need to use SMTs, and AFAIK there is currently no SMT that perfectly meets your requirement (ExtractField is a Map.get operation, so nested fields cannot be extracted in one pass). You could take a look at Debezium's io.debezium.transforms.UnwrapFromEnvelope SMT, which you can modify in order to extract nested fields.
UnwrapFromEnvelope is used for CDC event flattening, i.e. extracting fields from more complex structures like the envelopes Debezium produces (which I believe are similar to your structure).
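If you do end up writing or adapting such an SMT, wiring it into the sink connector would look roughly like the sketch below. Only the surrounding keys are standard Kafka Connect / JDBC sink settings; the transform class com.example.transforms.MapToJsonString, its field option, the topic name, and the connection URL are all placeholders.

# Hypothetical connector config; com.example.transforms.MapToJsonString is a placeholder
# for a custom SMT that turns the metaData map into a string column.
name=jdbc-sink-filedata
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=file-data
connection.url=jdbc:postgresql://localhost:5432/mydb
transforms=metaDataToString
transforms.metaDataToString.type=com.example.transforms.MapToJsonString
transforms.metaDataToString.field=metaData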

Avro to Scala case class annotation with nested types

I am using Avro serialization for messages on Kafka and am currently processing them with some custom Scala code using this annotation method. The following is a basic schema with a nested record:
{
  "type": "record",
  "name": "TestMessage",
  "namespace": "",
  "fields": [
    {"name": "message", "type": "string"},
    {
      "name": "metaData",
      "type": {
        "type": "record",
        "name": "MetaData",
        "fields": [
          {"name": "source", "type": "string"},
          {"name": "timestamp", "type": "string"}
        ]
      }
    }
  ]
}
And the annotations, I believe, should quite simply look like:
@AvroTypeProvider("schema-common/TestMessage.avsc")
@AvroRecord
case class TestMessage()
The message itself is something like the following:
{"message":"hello 1",
"metaData":{
"source":"postman",
"timestamp":"123456789"
}
}
However, when I log the TestMessage type or view the output in a Kafka consumer in the console, all I see is:
{"message":"hello 1"}
and not the nested type I added to capture MetaData. Is there anything I am missing? Let me know if I can provide further information - thanks!
This should now be fixed in version 0.10.3 for Scala 2.11, and version 0.4.5 for Scala 2.10.
Keep in mind that for every record type in a schema there needs to be a case class that represents it, and for Scala 2.10 the most deeply nested classes must be defined first. A safe definition is the following:
@AvroTypeProvider("schema-common/TestMessage.avsc")
@AvroRecord
case class MetaData()
@AvroTypeProvider("schema-common/TestMessage.avsc")
@AvroRecord
case class TestMessage()