Exception while deserializing Avro data using ConfluentSchemaRegistry - apache-kafka

I am new to Flink and Kafka. I am trying to deserialize Avro data using the Confluent Schema Registry. I have already installed Flink and Kafka on an EC2 machine, and the "test" topic was created before running the code.
Code Path: https://gist.github.com/mandar2174/5dc13350b296abf127b92d0697c320f2
The code does the following operations as part of the implementation:
1) Create a Flink DataStream object from a list of User elements (User is an Avro-generated class).
2) Write the DataStream source to Kafka using AvroSerializationSchema.
3) Read the data from Kafka using ConfluentRegistryAvroDeserializationSchema, fetching the schema from the Confluent Schema Registry.
Command to run the Flink executable jar:
./bin/flink run -c com.streaming.example.ConfluentSchemaRegistryExample /opt/flink-1.7.2/kafka-flink-stream-processing-assembly-0.1.jar
Exception while running code:
java.io.IOException: Unknown data format. Magic number does not match
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:55)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:66)
at org.apache.flink.streaming.util.serialization.KeyedDeserializationSchemaWrapper.deserialize(KeyedDeserializationSchemaWrapper.java:44)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:140)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:665)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
The Avro schema I am using for the User class is as follows:
{
  "type": "record",
  "name": "User",
  "namespace": "com.streaming.example",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": ["int", "null"]
    },
    {
      "name": "favorite_color",
      "type": ["string", "null"]
    }
  ]
}
Can someone point out which steps I am missing to deserialize Avro data using the Confluent Schema Registry?

The Avro data needs to be written using the Schema Registry as well in order for the Registry-based deserializer to work.
However, adding a ConfluentRegistryAvroSerializationSchema class is still an open PR in Flink.
The workaround, I believe, would be to use AvroDeserializationSchema, which does not depend on the Registry.
If you did want to use the Registry in the producer code, then you would have to do so outside of Flink until that PR is merged.
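For example, a minimal sketch of that workaround on the consumer side might look like this, inside your existing job where env is your StreamExecutionEnvironment (the bootstrap servers and group id below are placeholders):

import java.util.Properties;
import org.apache.flink.formats.avro.AvroDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Sketch only: reads "test" with the plain Avro deserialization schema, which matches
// records written by AvroSerializationSchema (no Schema Registry framing).
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
props.setProperty("group.id", "avro-test");               // placeholder

DataStream<User> users = env.addSource(new FlinkKafkaConsumer<>(
        "test",
        AvroDeserializationSchema.forSpecific(User.class),
        props));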

Related

Not able to override consumer config in Azure IoT Hub sink connector

I'm making an Azure IoT Hub sink connector using the Microsoft connector class, with an Avro converter on the connector.
I want to use KafkaAvroDeserializer on the consumer to deserialize the Avro data coming from the topic, but I'm unable to override the value deserializer.
I can see consumer.override.value.deserializer in the logs.
Could anyone please suggest a way out?
The relevant line in my config is:
"consumer.value.deserializer": "io.confluent.kafka.serializers.KafkaAvroDeSerializer"
The value deserializer is coming through as a byte-array deserializer, and I want it to be KafkaAvroDeserializer.
I'm getting an error deserializing the Avro data from the Kafka topic. My full config is below:
{
  "config": {
    "IotHub.ConnectionString": "connectionString",
    "IotHub.MessageDeliveryAcknowledgement": "None",
    "confluent.topic.bootstrap.servers": "server",
    "confluent.topic.replication.factor": "1",
    "connector.class": "com.microsoft.azure.iot.kafka.connect.sink.IotHubSinkConnector",
    "consumer.override.auto.register.schemas": "true",
    "consumer.override.id.compatibility.strict": "false",
    "consumer.override.latest.compatibility.strict": "false",
    "consumer.override.schema.registry.url": "registryUrl",
    "consumer.value.deserializer": "io.confluent.kafka.serializers.KafkaAvroDeSerializer",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "name": "TEST1",
    "tasks.max": "1",
    "topics": "testtopicazure3",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.auto.register.schemas": "true",
    "value.converter.schema.registry.url": "registryUrl"
  }
}
I am getting this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
In Connect, you only set value.converter; you do not set consumer client deserializers.
value.converter=io.confluent.connect.avro.AvroConverter
And all of your consumer.override. prefixes should be value.converter. instead.
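For example, a sketch of how those registry settings might look once moved onto the converter (the registry URL is whatever yours is; drop the consumer.override.* and consumer.value.deserializer lines entirely):
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "registryUrl",
"value.converter.auto.register.schemas": "true",
"value.converter.latest.compatibility.strict": "false"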
https://docs.confluent.io/kafka-connectors/self-managed/userguide.html#configuring-key-and-value-converters

Kafka Connect - From JSON records to Avro files in HDFS

My current setup contains Kafka, HDFS, Kafka Connect, and a Schema Registry all in networked docker containers.
The Kafka topic contains simple JSON data without a Schema:
{
"repo_name": "ironbee/ironbee"
}
The Schema Registry contains a JSON Schema describing the data in the Kafka Topic:
{"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "http://example.com/example.json",
"type": "object",
"title": "Root Schema",
"required": [
"repo_name"
],
"properties": {
"repo_name": {
"type": "string",
"default": "",
"title": "The repo_name Schema",
"examples": [
"ironbee/ironbee"
]
}
}}
What I am trying to achieve is a connector that reads JSON data from a topic and dumps it into files in HDFS (Avro or Parquet).
{
  "name": "kafka to hdfs",
  "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
  "topics": "repo",
  "hdfs.url": "hdfs://namenode:9000",
  "flush.size": 3,
  "confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
  "value.converter.schemas.enable": "false",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
If I try to read the raw JSON value via the StringConverter (no schema used) and dump it into Avro files, it works, but it results in Key=null Value={my json} tuples, so there is no usable structure at all.
When I try to use my schema via the JsonSchemaConverter I get the errors
“Converting byte[] to Kafka Connect data failed due to serialization error of topic”
“Unknown magic byte”
I think there is something wrong with the configuration of my connector, but after a week of trying everything, my Google skills have reached their limits.
All the code is available here: https://github.com/SDU-minions/7-Scalable-Systems-Project/tree/dev/Kafka
raw JSON value via the StringConverter (no schema used)
The schemas.enable property only exists on the JsonConverter. Strings don't have schemas, and JSON Schema records always have a schema, so the property doesn't exist on those converters either.
When I try to use my schema via the JsonSchemaConverter I get the errors
Your producer needs to use the Confluent JSON Schema serializer. Otherwise, the data doesn't get sent to Kafka with the "magic byte" referred to in your error.
I personally haven't tried converting JSON schema records to Avro directly in Connect. Usually the pattern is to either produce Avro directly, or convert within ksqlDB, for example to a new Avro topic, which is then consumed by Connect.
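For the first point, a rough sketch of what producing with Confluent's JSON Schema serializer could look like (the serializer class comes from the kafka-json-schema-serializer artifact; the POJO, topic, and hosts mirror your setup but are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RepoProducer {
    // Simple POJO; the serializer derives a JSON Schema from it, registers it,
    // and writes the magic-byte + schema-id framing that JsonSchemaConverter expects.
    public static class Repo {
        public String repo_name;
        public Repo(String repoName) { this.repo_name = repoName; }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:19092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.json.KafkaJsonSchemaSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        try (KafkaProducer<String, Repo> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("repo", new Repo("ironbee/ironbee")));
        }
    }
}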

Debezium, Kafka connect: is there a way to send only payload and not schema?

I have an outbox PostgreSQL table and a Debezium connector in Kafka Connect that creates Kafka messages based on the rows added to the table.
The problem I am facing is with the message format. This is the message value that gets created:
{
  "schema": {
    "type": "string",
    "optional": true,
    "name": "io.debezium.data.Json",
    "version": 1
  },
  "payload": "{\"foo\": \"bar\"}"
}
But (because of the consumer) I need the message to contain only the payload, like this:
{
  "foo": "bar"
}
This is part of my Kafka connector configuration:
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.topic.replacement": "${routedByValue}",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.table.field.event.payload.id": "aggregate_id",
"transforms.outbox.table.fields.additional.placement": "payload_type:header:__TypeId__"
Is there any way to achieve this without creating a custom transformer?
It looks like you're using org.apache.kafka.connect.json.JsonConverter with schemas.enable=true for your value converter. When you do this it embeds the schema alongside the payload in the message.
If you set value.converter.schemas.enable=false you should get just the payload in your message.
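For instance, a sketch of the relevant converter lines in the connector config (leaving the rest of your configuration as it is):
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"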
Ref: Kafka Connect: Converters and Serialization Explained — JSON and Schemas

Sending from Logstash to Kafka with Avro

I am trying to send data from Logstash into Kafka using an Avro schema.
My Logstash output looks like this:
kafka {
  codec => avro {
    schema_uri => "/tmp/avro/hadoop.avsc"
  }
  topic_id => "hadoop_log_processed"
}
My schema file looks like:
{"type": "record",
"name": "hadoop_schema",
"fields": [
{"name": "loglevel", "type": "string"},
{"name": "error_msg", "type": "string"},
{"name": "syslog", "type": ["string", "null"]},
{"name": "javaclass", "type": ["string", "null"]}
]
}
Output of kafka-console-consumer:
CElORk+gAURvd24gdG8gdGhlIGxhc3QgbWVyZ2UtcGCzcywgd2l0aCA3IHNlZ21lbnRzIGxlZnQgb2YgdG90YWwgc256ZTogMjI4NDI0NDM5IGJ5dGVzAAxbbWFpbl0APm9yZy5hcGFjaGUuaGFkb29wLm1hcHJlZC5NZXJnZXI=
CElORk9kVGFzayAnYXR0ZW1wdF8xNDQ1JDg3NDkxNDQ1XzAwMDFfbV8wMDAwMDRfMCcgZG9uZS4ADFttYWluXQA6t3JnLmFwYWNoZS5oYWRvb6AubWFwcmVkLlRhc2s=
CElORk9kVGFzayAnYXR0ZW1wdF8xNDQ1JDg3NDkxNDQ1XzAwMDFfbV8wMDAwMDRfMCcgZG9uZS4ADFttYWluXQA6t3JnLmFwYWNoZS5oYWRvb6AubWFwcmVkLlRhc2s=
CElORk9OVGFza0hlYAJ0YmVhdEhhbmRsZXIgdGhyZWFkIGludGVycnVwdGVkAERbVGFza0hlYXJdYmVhdEhhbmRsZXIgUGluZ0NoZWNrZXJdAG5vcmcuYVBhY2hlLmhhZG9vcC5tYXByZWR1Y2UudjIuYXBwLlRhc2tIZWFydGJ3YXRIYW5kbGVy
I am also getting the following error in my connector:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:488)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:465)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic hadoop_log_processed to Avro:
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:86)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:488)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
I know that I am encoding the data on the Logstash side. Do I have to decode the messages on the way into Kafka, or can I decode/deserialize the data in the connector config?
Is there a way to disable the encoding on the Logstash side? I read about a base64_encoding option, but it seems the codec doesn't have it.
The problem you have here is that Logstash's Avro codec is not serialising the data into an Avro form that the Confluent Schema Registry Avro deserialiser expects.
Whilst Logstash takes an avsc and encodes the data into a binary form based on that, the Confluent Schema Registry [de]serialiser instead stores & retrieves a schema directly from the registry (not avsc files).
So when you get Failed to deserialize data … SerializationException: Unknown magic byte! this is the Avro deserialiser saying that it doesn't recognise the data as Avro that's been serialised using the Schema Registry serialiser.
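To make that concrete: the Registry-aware deserialiser expects every message to start with a one-byte magic marker followed by a four-byte schema ID, and only then the Avro payload. A rough illustrative sketch of that check (not Confluent's actual code) would be:

import java.nio.ByteBuffer;

public class WireFormatCheck {
    private static final byte MAGIC_BYTE = 0x0;

    // Returns the schema ID if the framing is present; Logstash's avro codec writes
    // plain (base64-encoded) Avro with no such prefix, so this is exactly where it fails.
    public static int readSchemaId(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte!");
        }
        return buffer.getInt(); // the Avro-encoded record follows these 5 bytes
    }
}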
I had a quick Google and found this codec that looks like it supports the Schema Registry (and thus Kafka Connect, and any other consumer deserialising Avro data this way).
Alternatively, write your data as JSON into Kafka and use the org.apache.kafka.connect.json.JsonConverter in Kafka Connect to read it from the topic.
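If you go that route, the Logstash output might look roughly like this (same topic as before), and on the Connect side you would set value.converter=org.apache.kafka.connect.json.JsonConverter with value.converter.schemas.enable=false, since the JSON carries no embedded schema:

kafka {
  codec => json
  topic_id => "hadoop_log_processed"
}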
Ref:
http://rmoff.dev/ksldn19-kafka-connect
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/

Apache Kafka with centralized Avro schema

We use Apache Kafka (not Confluent Kafka) 0.10. We would like to set up an Avro schema with Kafka.
I have an Avro schema as follows:
{
  "namespace": "Rule",
  "type": "record",
  "name": "RuleMessage",
  "fields": [
    {
      "name": "station",
      "type": "string"
    },
    {
      "name": "model",
      "type": "string"
    }
  ]
}
I am serializing messages like this:
public byte[] serializeMessage(EventMessage eventMessage) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Binary-encode the record using the schema compiled into the generated class
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<EventMessage> writer = new SpecificDatumWriter<EventMessage>(EventMessage.getClassSchema());
    writer.write(eventMessage, encoder);
    encoder.flush();
    out.close();
    // Raw Avro bytes, with no schema information attached to the message
    return out.toByteArray();
}
This is working as expected.
But we would like to set up an Avro schema at the topic level, so that the topic rejects messages that do not conform to the schema.
Is there any way I could do this with Apache Kafka 0.10?
Thanks
You can use Confluent's Schema Registry (it's open source and Apache licensed) with Apache Kafka 0.10.0 to associate a schema with a topic. It ships with Avro serializers/deserializers that automatically validate Avro schemas in exactly the way you requested.
Please note that there is no such thing as "Confluent Kafka" - it would be a trademark violation for that to exist. Confluent simply packages Apache Kafka in its distribution for convenience, but since the Schema Registry is on GitHub, you can use it without Confluent's packaging if you prefer.
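A minimal sketch of the producer side with the Registry's serializer, assuming RuleMessage is the class generated from the schema above (bootstrap servers, registry URL, and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RuleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // From Confluent's kafka-avro-serializer artifact; registers/looks up the schema
        // in the registry and prefixes each message with its schema ID.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder

        // RuleMessage is the Avro-generated class; the registry enforces schema compatibility on registration.
        try (KafkaProducer<String, RuleMessage> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("rules", new RuleMessage("station-1", "model-a")));
        }
    }
}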