I am using AWS MSK and MSK Connect. The S3 sink connector stopped working properly when I added io.confluent.connect.storage.partitioner.FieldPartitioner. Note: without the FieldPartitioner, the S3 sink worked. Other than this Stack Overflow Question Link, I was not able to find any resource.
Error
ERROR [FieldPart-sink|task-0] Value is not Struct type. (io.confluent.connect.storage.partitioner.FieldPartitioner:81)
Caused by: io.confluent.connect.storage.errors.PartitionException: Error encoding partition.
ERROR [Sink-FieldPartition|task-0] WorkerSinkTask{id=Sink-FieldPartition-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Error encoding partition. (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
MSK Connect Config
connector.class=io.confluent.connect.s3.S3SinkConnector
format.class=io.confluent.connect.s3.format.avro.AvroFormat
flush.size=1
schema.compatibility=BACKWARD
tasks.max=2
topics=MSKTutorialTopic
storage.class=io.confluent.connect.s3.storage.S3Storage
topics.dir=mskTrials
s3.bucket.name=clickstream
s3.region=us-east-1
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=name
value.converter.schemaAutoRegistrationEnabled=true
value.converter.registry.name=datalake-schema-registry
value.convertor.schemaName=MSKTutorialTopic-value
value.converter.avroRecordType=GENERIC_RECORD
value.converter.region=us-east-1
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
Data schema stored in the Glue Schema Registry
{
"namespace": "example.avro",
"type": "record",
"name": "UserData",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favourite_color",
"type": [
"string",
"null"
]
}
]
}
In order to partition by fields, your data needs actual fields. StringConverter cannot parse the data it consumes into such fields, so use an AvroConverter if you have Avro data in the topic. Also, Avro always has a schema, so remove the schemas.enable configuration.
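The question's config already contains AWS Glue Schema Registry converter properties, so here is a sketch of what the converter section could look like with the Glue Avro converter (the class name is taken from the aws-glue-schema-registry Kafka Connect integration; registry, region, and record-type values simply reuse the ones above):
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter.region=us-east-1
value.converter.registry.name=datalake-schema-registry
value.converter.avroRecordType=GENERIC_RECORD
value.converter.schemaAutoRegistrationEnabled=true
key.converter=org.apache.kafka.connect.storage.StringConverter
With the value deserialized into a Connect Struct, the FieldPartitioner can then resolve partition.field.name=name.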
Avro: 10.1
Dataflow (Apache Beam): 2.28.0
Runner: org.apache.beam.runners.dataflow.DataflowRunner
Avro schema piece:
{
"name": "client_timestamp",
"type": [
"null",
{ "type": "long", "logicalType": "local-timestamp-millis" }
],
"default": null,
"doc": "Client side timestamp of this xxx"
},
An exception when writing Avro output file:
Caused by: org.apache.avro.UnresolvedUnionException:
Not in union ["null",{"type":"long","logicalType":"local-timestamp-millis"}]:
2021-03-12T12:21:17.599
Link to a longer stacktrace
Some of the steps taken:
Replacing "logicalType":"local-timestamp-millis" with "logicalType":"timestamp-millis" causes the same error.
Writing Avro locally also works.
Removing the "null" option from the union eliminates the exception.
Try something like the following:
{
"name": "client_timestamp",
"type": ["null", "long"],
"doc": "Client side timestamp of this xxx",
"default": null,
"logicalType": "timestamp-millis"
}
This is one way to define a logicalType in an Avro schema (link).
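For illustration, a minimal Scala sketch of building a record against the ["null", "long"] field suggested above, storing the value as epoch milliseconds so it resolves to the long branch of the union; a LocalDateTime-style value like the one in the error output is what appears to trigger the UnresolvedUnionException. The schema file name and field population are hypothetical:
import java.time.Instant
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder

// Parse the schema that declares client_timestamp as ["null", "long"]
// (hypothetical file name for this sketch).
val schema = new Schema.Parser().parse(new java.io.File("event.avsc"))

val record = new GenericRecordBuilder(schema)
  // Write the timestamp as a plain epoch-millis long; the example value is
  // treated as UTC here purely for the sketch.
  .set("client_timestamp", Instant.parse("2021-03-12T12:21:17.599Z").toEpochMilli)
  .build()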
I'm trying to serialise an Avro schema and then write it to the Hortonworks schema registry, but I'm getting the following error message during the write operation.
Caused by: java.lang.RuntimeException: An exception was thrown while processing request with message: [Invalid default for field viewingMode: null not a [{"type":"record","name":"aName","namespace":"domain.assembled","fields":[{"name":"aKey","type":"string"}]},{"type":"record","name":"anotherName","namespace":"domain.assembled","fields":[{"name":"anotherKey","type":"string"},{"name":"yetAnotherKey","type":"string"}]}]]
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.handleSchemaIdVersionResponse(SchemaRegistryClient.java:678)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.doAddSchemaVersion(SchemaRegistryClient.java:664)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.lambda$addSchemaVersion$1(SchemaRegistryClient.java:591)
This is the avro schema
{
"type": "record",
"name": "aSchema",
"namespace": "domain.assembled",
"fields": [
{
"name": "viewingMode",
"type": [
{
"name": "aName",
"type": "record",
"fields": [
{"name": "aKey","type": "string"}
]
},
{
"name": "anotherName",
"type": "record",
"fields": [
{"name": "anotherKey","type": "string"},
{"name": "yetAnotherKey","type": "string"}
]
}
]
}
]
}
However, if I add "null" as the first type of the union, this succeeds. Do Avro union types require a "null"? In my case this would be an incorrect representation of the data, so I'm not keen on doing it.
If it makes any difference I'm using avro 1.9.1.
Also, apologies if the tags are incorrect, but I couldn't find a hortonworks-schema-registry tag and don't have enough rep to create a new one.
Turns out it was an issue with Hortonworks' schema registry.
This has actually already been fixed here and I've requested a new release here. Hopefully this happens soon.
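For what it's worth, plain Avro is happy with a union that has no "null" branch; a small Scala sketch (against Avro 1.9.1, the version mentioned above) that parses the schema from the question without error, which also points at the registry rather than the schema itself:
import org.apache.avro.Schema

// A union of two record types with no "null" branch parses fine with the
// plain Avro library, so the rejection comes from the registry's validation.
val schemaJson =
  """{"type":"record","name":"aSchema","namespace":"domain.assembled","fields":[
    |  {"name":"viewingMode","type":[
    |    {"type":"record","name":"aName","fields":[{"name":"aKey","type":"string"}]},
    |    {"type":"record","name":"anotherName","fields":[
    |      {"name":"anotherKey","type":"string"},
    |      {"name":"yetAnotherKey","type":"string"}]}
    |  ]}
    |]}""".stripMargin

val parsed = new Schema.Parser().parse(schemaJson)
println(parsed.getField("viewingMode").schema().getTypes.size()) // 2 branches, no null required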
I tried to partition a Spark DataFrame by the timestamp column update_database_time and write it to HDFS with a defined Avro schema. However, after calling the partitionBy method I get this exception:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.
I assume that the partitioning column disappears from the result. How can I redefine the operation so that this does not happen?
Here is the code I use:
dataDF.write
.partitionBy("update_database_time")
.format("avro")
.option(
"avroSchema",
SchemaRegistry.getSchema(
schemaRegistryConfig.url,
schemaRegistryConfig.dataSchemaSubject,
schemaRegistryConfig.dataSchemaVersion))
.save(s"${hdfsURL}${pathToSave}")
Judging by the exception you have provided, the error seems to stem from an incompatibility between the fetched AVRO schema and Spark's schema. Taking a quick look, the most worrisome parts are probably these:
1. (Possibly Catalyst doesn't know how to transform a string into enumType)
Spark schema:
StructField(random_enum,StringType,true)
AVRO schema:
{
"name": "random_enum",
"type": [
"null",
{
"type": "enum",
"name": "enumType",
"symbols": [
"VAL_1",
"VAL_2",
"VAL_3"
]
}
]
}
2. (update_database_time_tz appears only once in the DataFrame's schema, but twice in the AVRO schema)
Spark schema:
StructField(update_database_time_tz,TimestampType,true)
AVRO schema:
{
"name": "update_database_time",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
},
{
"name": "update_database_time_tz",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
}
I'd suggest to consolidate schemas first and get rid of that exception before going inside other possible partitioning problems.
EDIT: Regarding number 2, I missed the fact that there are different names in the AVRO schema, which leads to the problem of a missing update_database_time column in the DataFrame.
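One possible workaround (a sketch, not from the original answer): partitionBy moves the chosen column into the directory layout and drops it from the written records, so partitioning on a duplicated helper column keeps update_database_time in the data and in sync with the provided AVRO schema. The helper column name below is illustrative:
import org.apache.spark.sql.functions.col

dataDF
  // Duplicate the timestamp into a helper column used only for partitioning;
  // Spark drops the partition column from the written files, so the original
  // update_database_time column stays in the records and matches the schema.
  .withColumn("update_database_time_part", col("update_database_time"))
  .write
  .partitionBy("update_database_time_part")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(s"${hdfsURL}${pathToSave}")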
I generated a sample stream with the ksql-datagen utility from the following schema -
{
"type": "record",
"name": "users",
"namespace": "com.example",
"fields": [
{
"name": "registertime",
"type": {
"type":"long",
"arg.properties":{
"range":{"min":1487715775521,"max":1519273364600}
}
}
},
{
"name": "userid",
"type": {
"type":"string",
"arg.properties":{"regex":"User_[1-9][0-2]"}
}
},
{
"name": "regionid",
"type": {
"type":"string",
"arg.properties":{"regex":"Region_[1-9]"}
}
},
{
"name": "gender",
"type": {
"type":"string",
"arg.properties":{
"options":["MALE","FEMALE","OTHER"]
}
}
}
]}
While checking the registered versions, it still picks the "io.confluent.ksql.avro_schemas" namespace -
curl "http://localhost:8081/subjects/test-user-value/versions/1"
{"subject":"test-user-value","version":1,"id":4,"schema":"{"type":"record","name":"KsqlDataSourceSchema","namespace":"io.confluent.ksql.avro_schemas","fields":[{"name":"registertime","type":["null","long"],"default":null},{"name":"userid","type":["null","string"],"default":null},{"name":"regionid","type":["null","string"],"default":null},{"name":"gender","type":["null","string"],"default":null}]}"}
Got the following error while trying to consume with the Kafka Streams API -
Exception in thread "PageView-Users-Stream-Join-eg-1dc610a3-c9d9-4c1e-b5eb-910e4bc74826-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:80)
    at org.apache.kafka.streams.processor.internals.RecordQueue.maybeUpdateTimestamp(RecordQueue.java:160)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:101)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:124)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:711)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:995)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:833)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 4
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class io.confluent.ksql.avro_schemas.KsqlDataSourceSchema specified in writer's schema whilst finding reader's schema for a SpecificRecord.
Answered at https://github.com/confluentinc/schema-registry/issues/980
Datagen always defines the namespace as io.confluent.ksql.avro_schemas. See confluentinc/ksql#1906
There are now other ways to generate test data into Kafka, too.
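If regenerating the data isn't an option, one workaround (a sketch, assuming Confluent's GenericAvroSerde and reusing the topic and registry URL from the question) is to consume the values as GenericRecord, so the io.confluent.ksql.avro_schemas namespace in the writer's schema never has to resolve to a generated class:
import java.util.Collections
import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Consumed

// Configure the value serde against the schema registry (isKey = false).
val valueSerde = new GenericAvroSerde()
valueSerde.configure(Collections.singletonMap("schema.registry.url", "http://localhost:8081"), false)

val builder = new StreamsBuilder()
// Read the datagen topic as GenericRecord instead of a SpecificRecord, so no
// class for KsqlDataSourceSchema is required on the consumer side.
val users = builder.stream[String, GenericRecord]("test-user", Consumed.`with`(Serdes.String(), valueSerde))
users.foreach((key, value) => println(s"$key -> ${value.get("userid")}"))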