Unable to read Kafka topic avro messages - apache-kafka

The Kafka Connect events for the Debezium connector are Avro encoded.
I set the following in the connect-standalone.properties passed to the Kafka Connect standalone service:
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
internal.key.converter=io.confluent.connect.avro.AvroConverter
internal.value.converter=io.confluent.connect.avro.AvroConverter
schema.registry.url=http://ip_address:8081
internal.key.converter.schema.registry.url=http://ip_address:8081
internal.value.converter.schema.registry.url=http://ip_address:8081
I configured the Kafka consumer code with these properties:
Properties props = new Properties();
props.put("bootstrap.servers", "ip_address:9092");
props.put("zookeeper.connect", "ip_address:2181");
props.put("group.id", "test-consumer-group");
props.put("auto.offset.reset","smallest");
//Setting auto commit to false to ensure that on processing failure we retry the read
props.put("auto.commit.offset", "false");
props.put("key.converter.schema.registry.url", "ip_address:8081");
props.put("value.converter.schema.registry.url", "ip_address:8081");
props.put("schema.registry.url", "ip_address:8081");
In the consumer implementation, the following is the code used to read the key and value components. I am getting the schemas for the key and value from the Schema Registry using REST.
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
return reader.read(null, DecoderFactory.get().binaryDecoder(byteData, null));
Parsing the key worked fine. While parsing the value part of the message, I am getting ArrayIndexOutOfBoundsException.
I downloaded the Avro source code and debugged it, and found that the GenericDatumReader.readInt method returns a negative value. This value is expected to be the index of an array (symbols) and hence should have been positive.
I tried consuming events using the kafka-avro-standalone-consumer, but it threw an ArrayIndexOutOfBoundsException too. So my guess is that the message is improperly encoded at Kafka Connect (the producer) and the issue is with the configuration.
Following are the questions:
Is there anything wrong with the configuration passed at producer or consumer?
Why is key de-serialization working but not that of value?
Is there anything else that needs to be done for things to work (like specifying a character encoding somewhere)?
Can Debezium with Avro be used in production, or is it an experimental feature for now? The post on Debezium Avro specifically says that examples involving Avro will be included in the future.
There have been many posts where Avro deserialization threw an ArrayIndexOutOfBoundsException, but I could not relate them to the problem I am facing.

I followed the steps in http://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html and things are working fine now.
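For reference, a minimal sketch of a consumer along the lines those docs describe, letting KafkaAvroDeserializer handle the wire-format header and registry lookup instead of decoding the bytes by hand (the topic name is a placeholder, and the key type assumes the Debezium key is also an Avro record):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "ip_address:9092");
        props.put("group.id", "test-consumer-group");
        props.put("enable.auto.commit", "false");
        // The Confluent deserializer strips the magic byte + schema ID and fetches the schema itself
        props.put("key.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://ip_address:8081");

        try (KafkaConsumer<GenericRecord, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic name
            ConsumerRecords<GenericRecord, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<GenericRecord, GenericRecord> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}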

Related

Flink KafkaSource with multiple topics, each topic with different avro schema of data

If I create multiple instances of the below, one for each topic:
KafkaSource<T> kafkaDataSource = KafkaSource.<T>builder()
        .setBootstrapServers(consumerProps.getProperty("bootstrap.servers"))
        .setTopics(topic)
        .setDeserializer(deserializer)
        .setGroupId(identifier)
        .setProperties(consumerProps)
        .build();
The deserializer seems to run into some issue and ends up reading data from a different topic, with a different schema than the one it was meant for, and fails!
If I provide all topics in the same KafkaSource, then the watermarks seem to progress across the topics together.
DataStream<T> dataSource = environment.fromSource(kafkaDataSource,
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> {...}),
        "");
Also, the Avro data in Kafka itself holds the magic byte for the schema (the schema info is embedded); I am not using any external Avro registry (it's all in the libraries).
It works fine with FlinkKafkaConsumer (I created multiple instances of it).
FlinkKafkaConsumer<T> kafkaConsumer = new FlinkKafkaConsumer<>(topic, deserializer, consumerProps);
kafkaConsumer.assignTimestampsAndWatermarks(WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
        .withTimestampAssigner((event, timestamp) -> {...}));
I am not sure if the problem is in the way I am using it. Any pointers on how to solve this would be appreciated. Also, FlinkKafkaConsumer is deprecated.
Figured it out based on the code here: Custom avro message deserialization with Flink. I implemented the open method and made the instance fields of the deserializer transient.
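A rough sketch of that shape, assuming one KafkaRecordDeserializationSchema per topic whose Avro reader lives in transient fields and is rebuilt in open() (the schema handling and produced type here are placeholders, not the asker's actual code):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class PerTopicAvroDeserializer implements KafkaRecordDeserializationSchema<GenericRecord> {

    private final String schemaJson; // serializable form of this topic's schema

    // Avro classes are not serializable, so keep them transient and rebuild them in open()
    private transient Schema schema;
    private transient DatumReader<GenericRecord> reader;

    public PerTopicAvroDeserializer(String schemaJson) {
        this.schemaJson = schemaJson;
    }

    @Override
    public void open(DeserializationSchema.InitializationContext context) {
        schema = new Schema.Parser().parse(schemaJson);
        reader = new GenericDatumReader<>(schema);
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<GenericRecord> out) throws IOException {
        // If the bytes carry an embedded schema header, strip it here before decoding
        out.collect(reader.read(null, DecoderFactory.get().binaryDecoder(record.value(), null)));
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeInformation.of(GenericRecord.class);
    }
}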

Kafka Avro deserializer with local schema file

Background:
I used Spring Kafka to implement an Avro-based consumer and producer. The other important components (Kafka broker, ZooKeeper, and Schema Registry) run in a Docker container. This works perfectly fine for me.
What I want:
I want to have a Kafka Avro deserializer (in the consumer) that is independent of the Schema Registry. In my case, I have an Avro schema file that will not change. So I want to get rid of the additional step of using the Schema Registry on the consumer side and rather go for a local schema file.
If the Producer serializer uses the Schema Registry, then the Consumer should as well. Avro requires you to have a reader and writer schema.
If the consumer, for whatever reason, cannot access the Registry over the network, you would need to use ByteArrayDeserializer, then take the byte slice after position 5 (the 0x0 magic byte plus the 4-byte schema integer ID) of the byte[] from the part of the record you want to parse.
Then, from the Avro API, you can use GenericDatumReader along with your local Schema reference to get a GenericRecord instance (a sketch is included after this answer), but this assumes your reader and writer schemas are exactly the same, which you shouldn't rely on, as the producer could change the schema at any time.
Or you can create a SpecificRecord from the schema you have, and configure the KafkaAvroDeserializer to use that.
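A minimal sketch of that byte-slice approach, assuming the value was produced in the Confluent wire format and a local .avsc file holds an identical copy of the writer schema (class and method names here are placeholders):

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class LocalSchemaDecoder {

    private final Schema schema;
    private final GenericDatumReader<GenericRecord> reader;

    public LocalSchemaDecoder(File avscFile) throws IOException {
        this.schema = new Schema.Parser().parse(avscFile);
        this.reader = new GenericDatumReader<>(schema);
    }

    // Decodes a value written in the Confluent wire format: 0x0 magic byte + 4-byte schema ID + Avro payload
    public GenericRecord decode(byte[] value) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(value);
        byte magic = buffer.get();
        int schemaId = buffer.getInt(); // ignored here, since we trust the local schema
        if (magic != 0x0) {
            throw new IOException("Value is not in Confluent wire format");
        }
        return reader.read(null, DecoderFactory.get().binaryDecoder(
                value, 5, value.length - 5, null));
    }
}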

How to deserialize avro message using mirrormaker?

I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that is behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the message correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro message using MirrorMaker before sending it?
Cheers
For MirrorMaker1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to EventHub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker2, you can use AvroConverter followed by some transforms properties, but still ByteArrayConverter would be preferred for a one-to-one byte copy.
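For MirrorMaker2, a minimal mm2.properties sketch along those lines (cluster aliases, topic name, and bootstrap addresses are placeholders; the SASL settings Event Hubs requires are omitted):

clusters = source, target
source.bootstrap.servers = source-broker:9092
target.bootstrap.servers = eventhub-namespace.servicebus.windows.net:9093
source->target.enabled = true
source->target.topics = my-topic
# keep the bytes untouched for a one-to-one copy
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter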

how can I pass KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages to Avro binary format (that is, to encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String, String] which sends the messages to the Kafka topic.
Can someone provide a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to use the KafkaAvroSerializer in your producer config for either serializer config, and also set the Schema Registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
A Java example is here: https://docs.confluent.io/current/schema-registry/serializer-formatter.html
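A short sketch along the lines of that docs example, sending a GenericRecord with KafkaAvroSerializer (the topic name, schema, and addresses are placeholders):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "ip_address:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer ships with the Confluent platform (kafka-avro-serializer artifact)
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://ip_address:8081");

        // Placeholder schema just for illustration
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\",\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}");

        GenericRecord value = new GenericData.Record(schema);
        value.put("body", "hello");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "my-key", value));
        }
    }
}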

Two strange bytes at the beginning of each Kafka message produced by my Kafka Connector

I am developing a Kafka connector which simply creates a message for each line of a file retrieved from an external API. It works nicely, but now when I try to consume the messages I see two strange bytes at the beginning of each value. I can reproduce the problem with the console consumer and with my Kafka Streams processor.
�168410002,OpenX Market,459980962,OpenX_Bidder_Order_merkur_bidder_800x250,313115722,OpenX_Bidder_ANY_LI_merkur_800x250_550,106800839362,OpenX_Bidder_Creative_merkur_800x250_2,10
The source files are fine, and even printlns before creating the SourceRecord don't show these two bytes. I used a Struct with one field before and now use a simple String schema, but I still have the same problem:
def convert(line: String, ...) = {
  ...
  val record = new SourceRecord(
    Partition.sole(partition),
    offset.forConnectApi,
    topic,
    Schema.STRING_SCHEMA,
    line
  )
  ...
So in the above code, if I add println(line) no strange chars are shown.
It looks like you used the AvroConverter or the JsonConverter in your connector. Try using the StringConverter that ships with Kafka as your key.converter and value.converter in the Connect worker. That will encode the data as plain strings, which shouldn't have this extra stuff in them.
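For example, a minimal sketch of those two worker settings (in whatever properties file you already pass to Connect, e.g. connect-standalone.properties):

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter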