Kafka Avro deserializer with local schema file - apache-kafka

Background:
I used Spring Kafka to implement an Avro-based consumer and producer. The other important components, Kafka broker, ZooKeeper, and Schema Registry, run in a Docker container. This works perfectly fine for me.
What I want:
I want a Kafka Avro deserializer (in the consumer) that is independent of the Schema Registry. In my case, I have an Avro schema file that will not change, so I want to get rid of the additional step of contacting the Schema Registry on the consumer side and instead use a local schema file.

If the Producer serializer uses the Schema Registry, then the Consumer should as well. Avro requires you to have a reader and writer schema.
If the consumer, for whatever reason, cannot access the Registry over the network, you would need to use ByteArrayDeserializer, then take the byte slice after position 5 (a 0x0 magic byte plus a 4-byte integer schema ID) of the byte[] from the part of the record you want to parse.
Then, from the Avro API, you can use GenericDatumReader along with your local Schema reference to get a GenericRecord instance, but this assumes your reader and writer schemas are exactly the same (which may not always hold, as the producer could change the schema at any time).
Or you can create a SpecificRecord from the schema you have, and configure the KafkaAvroDeserializer to use that.
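A minimal sketch of that local-schema approach, assuming the producer used the Confluent wire format (magic byte plus 4-byte schema ID before the Avro payload); the topic name, group ID, and schema file path are placeholders:

```java
import java.io.File;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LocalSchemaConsumer {
    public static void main(String[] args) throws Exception {
        // Reader schema loaded from the local .avsc file (assumed to match the writer schema)
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "local-schema-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-avro-topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    byte[] value = record.value();
                    // Skip the 5-byte Confluent header: magic byte 0x0 + 4-byte schema ID
                    GenericRecord avroRecord = reader.read(null,
                            DecoderFactory.get().binaryDecoder(value, 5, value.length - 5, null));
                    System.out.println(avroRecord);
                }
            }
        }
    }
}
```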

Related

How does a kafka connect connector know which schema to use?

Let's say I have a bunch of different topics, each with their own JSON schema. In the Schema Registry, I indicated which schemas exist for the different topics, without directly specifying which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the Schema Registry. So to my knowledge, I never indicated which registered schema a Kafka connector (e.g., JDBC sink) should use in order to deserialize a message from a certain topic?
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using the Schema Registry instead. However, I cannot seem to understand how this would work.
Your producer serializes the schema ID directly into the bytes of each record. Connect (or consumers with the JSON Schema deserializer) read that ID from the record and look up the corresponding schema in the registry.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON, but rather a binary format and enable topic compression such as ZSTD
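For illustration, here is a small helper, under the assumption that the record value follows the Confluent wire format linked above (this is not part of any Confluent API), that extracts the embedded schema ID:

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // Returns the schema ID embedded in a Registry-serialized value:
    // byte 0 is the magic byte 0x0, bytes 1-4 are a big-endian schema ID,
    // and everything after that is the actual payload.
    public static int schemaId(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte magic = buf.get();
        if (magic != 0x0) {
            throw new IllegalArgumentException("Not in Confluent wire format (magic byte was " + magic + ")");
        }
        return buf.getInt();
    }
}
```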

How to deserialize avro message using mirrormaker?

I want to replicate a kafka topic to an azure event hub.
The messages are in Avro format and use a schema that is behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages using MirrorMaker before sending them?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to EventHub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker 2, you can use the AvroConverter followed by some transforms properties, but the ByteArrayConverter would still be preferred for a one-to-one byte copy.
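As a rough sketch only (the cluster aliases, topic name, and registry URL are placeholders, and the exact property names depend on how MirrorMaker 2 is run), the converter choice would look something like this in an mm2.properties file:

```properties
clusters = source, target
source->target.enabled = true
source->target.topics = my-avro-topic

# One-to-one byte copy (preferred): no deserialization at all
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter

# Alternatively, deserialize/re-serialize through the Registry
# value.converter = io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url = http://schema-registry:8081
```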

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksql stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? Also, do I support schema evolution this way? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream.
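To make the lag-monitoring point concrete, here is a minimal AdminClient sketch; the ksql consumer group name below is purely hypothetical (list the real one with the kafka-consumer-groups tool, which can also report these numbers directly):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class KsqlLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Placeholder: ksql uses an internal consumer group per query
        String groupId = "_confluent-ksql-default_query_CSAS_MY_STREAM_1";

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the ksql consumer group
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets of the same partitions
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Lag = end offset minus committed offset, per partition
            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```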

Kafka multiple Topics into the same avro file

I am new to the Kafka world and I would like to ask you for some important information related to my project.
I am using an Avro file for producing and consuming messages. I want to know if I can use the same Avro file for multiple topics, for example by using a different "name" attribute in the producer and a specific "name" attribute in the consumer.
Thanks a lot.
Stefano
You can use one file to send data to multiple topics, yes, although I'm not sure why one would do that.
I would be cautious about merging multiple topics into one Avro file, because the schema must match in every topic for that file.
I would suggest you use the Confluent Schema Registry, for example, rather than sending self-contained Avro events, because if you are not using some registry, then you're likely sending the Avro schema as part of every message, which will slow down the possible throughput of your topic. And then, the name of the Avro schema record in the registry will correspond to the topic name.
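A minimal sketch of that setup, assuming the default TopicNameStrategy; the schema, topic names, and registry URL are placeholders, and the same parsed schema is reused for both topics:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OneSchemaTwoTopics {
    public static void main(String[] args) {
        // Hypothetical schema; one parsed schema serves both topics
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord event = new GenericData.Record(schema);
            event.put("id", "42");

            // With the default TopicNameStrategy, the same schema is registered
            // under the subjects "topic-a-value" and "topic-b-value"
            producer.send(new ProducerRecord<>("topic-a", "42", event));
            producer.send(new ProducerRecord<>("topic-b", "42", event));
            producer.flush();
        }
    }
}
```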

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But hdfs sink connector picks up messages from the Kafka topics directly. What would the key and value converter in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobufs is self-describing (i.e. you can deserialize it without access to the original schema), but since its fields are simply integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated) or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.
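For reference, the converter lines in connect-avro-standalone.properties look like the following (shown with the Avro converter as the existing pattern; a protobuf-capable converter, if you wrote or obtained one as described above, would be substituted for the class names, and the registry URL is a placeholder):

```properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```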