How to handle Kafka schema evolution - apache-kafka

I'm new to Kafka. Here is a question I have about an ever-changing Kafka schema.
How can we handle schema changes at the Kafka consumer end?
If we change the payload structure at the Kafka publisher end, how can I make sure nothing breaks on the Kafka consumer end?
I would be interested to know industry-wide practices for handling this scenario.
I won't be using Confluent's Schema Registry for Avro. Are there any other tried and tested options?

A Schema Registry is the standard solution for centralized schema management and compatibility checks as schemas evolve.
Configure the Schema Registry URL in your Kafka producers and consumers:
// Producer: use the Confluent Avro serializer and point it at the Schema Registry
kafkaProducerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
kafkaProducerProps.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
// Consumer: use the Avro deserializer and deserialize into generated SpecificRecord classes
kafkaConsumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
kafkaConsumerProps.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true");
// Schema Registry location
kafkaConsumerProps.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
Run the Schema Registry on port 8081.
Refer to the URL below for sample code:
https://dzone.com/articles/kafka-avro-serialization-and-the-schema-registry
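To illustrate why compatible changes don't break consumers (a minimal, self-contained sketch with made-up schemas, not taken from the linked article): Avro resolves each record against both the writer schema it was produced with and the reader schema the consumer uses, so a new field with a default value can be added without breaking old data or old readers.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class EvolutionSketch {
    // Old (writer) schema: only an "id" field
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    // New (reader) schema: adds an optional "email" field with a default
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    public static void main(String[] args) throws Exception {
        // Producer side: serialize a record with the old schema
        GenericRecord user = new GenericData.Record(WRITER);
        user.put("id", 42L);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER).write(user, encoder);
        encoder.flush();

        // Consumer side: decode with writer + reader schema; the missing "email" takes its default
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord read = new GenericDatumReader<GenericRecord>(WRITER, READER).read(null, decoder);
        System.out.println(read); // {"id": 42, "email": null}
    }
}

With the registry's default BACKWARD compatibility mode, changes that would break this resolution for existing readers are rejected at registration time, which is what protects the consumer end.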

Related

Where do Kafka consumers get the first version of Avro schema from?

Most Quarkus examples that illustrate the use of Avro and Service registry combine consumer and producer in the same project. In doing so, the schema and the generated classes are available to the producer and the consumer.
I understand that the role of the schema registry is to maintain various versions of a schema and make them available to consumers.
What I do not fully understand is how and when the consumer pulls the schema from the registry. For instance, how does the consumer's team get the initial version of the schema? Do they just go to the registry and download it manually? Do they use the maven plugin to download it and generate the sources?
In the case of Quarkus, the avro extension automatically generates the Java source from avro schema. I wonder if it also downloads the initial schema for consumers.
Consumers can optionally download the schema from the registry ahead of time; the consumer should then be configured to use SpecificRecord deserialization.
For the Confluent serialization format, the schema ID is embedded in each record. When the consumer's deserializer processes the bytes, it looks up that ID in the registry (and caches the result).
Avro requires both a writer schema (the one downloaded by that ID) and a reader schema (either one the consumer app developer has downloaded, or, by default, the writer schema itself), and schema evolution is then evaluated between the two.
Quarkus/Kafka doesn't really matter here; it's all done within the Avro library after the HTTP calls to the registry.
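To make the wire format concrete (an illustration, not part of the original answer): a Confluent-serialized value starts with a magic byte 0, then the 4-byte big-endian schema ID, then the Avro-encoded payload. A minimal sketch for pulling the ID out of a raw record value:

import java.nio.ByteBuffer;

public class WireFormat {
    // Returns the Schema Registry ID embedded in a Confluent-serialized value
    public static int schemaIdOf(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value); // ByteBuffer reads big-endian by default
        byte magic = buf.get();
        if (magic != 0x0) {
            throw new IllegalArgumentException("Not Confluent wire format (magic byte was " + magic + ")");
        }
        return buf.getInt(); // the Avro-encoded body follows these 5 bytes
    }
}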

What is the use of confluent schema registry if Kafka can use Avro without it

Is the difference between vanilla Apache Avro and Avro with the Confluent Schema Registry that, with plain Apache Avro, we send schema+message to the Kafka topic, whereas with the Confluent Schema Registry we send schemaID+message? So the Schema Registry improves performance by allowing the schema to be looked up in the registry. Is there any other benefit to using the Confluent Schema Registry? Also, does Apache Avro support compatibility rules for schema evolution the way the Schema Registry does?
Note: there are other implementations of a "Schema Registry" that can be used with Kafka.
Here is a list of reasons:
Clients can discover schemas without interacting with Kafka. For example, Apache Hive / Presto / Spark can download schemas from the registry to perform analytics.
The registry is centrally responsible for compatibility checks, rather than requiring each client to perform them on its own (which answers your second question); a sketch of such a check follows this list.
The same applies to other serialization formats as well, not only Avro.
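To illustrate that centralized check (the registry URL, subject name, and candidate schema below are made up for the example), the registry exposes a REST endpoint that tests a proposed schema against the latest registered version of a subject:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompatibilityCheckSketch {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";
        String subject = "users-value"; // default TopicNameStrategy subject for topic "users"; must already exist
        // Candidate schema, JSON-escaped into the {"schema": "..."} envelope the registry expects
        String body = "{\"schema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"User\\\",\\\"fields\\\":[{\\\"name\\\":\\\"id\\\",\\\"type\\\":\\\"long\\\"}]}\"}";

        HttpRequest request = HttpRequest.newBuilder(
                URI.create(registry + "/compatibility/subjects/" + subject + "/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"is_compatible":true}
    }
}

The result is decided by the compatibility level configured for the subject, so the rules live in one place instead of in every producing and consuming application.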

Sending Avro messages to Kafka

I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka broker and then use the Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should send these Avro files to my Kafka broker. Do I need a schema registry for my purposes? I believe kafkacat does not support Avro, so I suppose I will have to stick with the kafka-console-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
Question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. It can either be embedded within the JSON message itself (org.apache.kafka.connect.json.JsonConverter with schemas.enable=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka, you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library with examples that you can use.
"without getting Confluent involved"
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry, you can embed the schema in your JSON message, but it's a suboptimal way of doing things.
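If you do run a Schema Registry, a minimal producer sketch looks like the following (the broker address, topic name, and schema are placeholders). By default the serializer registers the schema under the "<topic>-value" subject and embeds only its ID in each message:

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":[{\"name\":\"value\",\"type\":\"double\"}]}");
        GenericRecord reading = new GenericData.Record(schema);
        reading.put("value", 12.3);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("readings", reading)); // flushed when the producer closes
        }
    }
}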

How to replicate schema with Kafka mirror maker?

We are using MirrorMaker to sync on-premise and AWS Kafka topics. How can a topic, with its schema registered on-premise, be replicated exactly the same in other clusters (AWS in this case)?
How is the Avro schema replicated when using MirrorMaker?
MirrorMaker only copies byte arrays, not schemas, and it doesn't care about the format of the data.
As of Confluent 4.x or later, the Schema Registry has the endpoint GET /schemas/ids/(number). So if your consumers are configured to use the original registry, this shouldn't matter, since your destination consumers can look up the schema by the ID embedded in each record (see the consumer-config sketch after this answer).
You can otherwise mirror the _schemas topic as well, as recommended by Confluent when using Confluent Replicator.
If you absolutely need one-to-one schema copying, you would need to implement a MessageHandler interface and pass it to the MirrorMaker command to fetch and post the schemas, similar to the internal logic I added to this Kafka Connect plugin (you could use Connect instead of MirrorMaker): https://github.com/OneCricketeer/schema-registry-transfer-smt
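Sketch of the "point consumers at the original registry" option mentioned above (the host names are assumptions): the schema IDs embedded in the mirrored records still resolve because the consumer queries the on-premise registry, even though it reads from the AWS cluster.

import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Properties;

public class MirroredTopicConsumerConfig {
    public static Properties awsConsumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "aws-broker:9092"); // destination (AWS) cluster
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirrored-topic-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
        // The original (on-premise) registry, which owns the IDs embedded in the records
        props.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://onprem-registry:8081");
        return props;
    }
}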

How is a schema from the Schema Registry propagated over Replicator?

How do schemas from Confluent Schema-Registry get propagated by Confluent-Replicator to destination Kafka-Cluster and Schema-Registry?
Is the schema contained in each replicated message, or are schemas replicated separately through a dedicated topic?
I didn't see any configuration possibilities in Confluent-Replicator regarding this.
It sounds like you are asking how the schema registry can be used in a multi data center environment. There's a pretty good doc on this https://docs.confluent.io/current/schema-registry/docs/multidc.html
Replicator can be used to keep the schema registry data in sync on the backend as shown in the doc.
Schemas are not stored with the topic, only their IDs. And the _schemas topic is not replicated; only the IDs stored within the replicated topics are.
On a high-level, if you use the AvroConverter with Replicator, it'll deserialize the message from the source cluster, optionally rename the topic as per the replicator configuration, then serialize the message and send the new subject name to the destination cluster + registry.
Otherwise, if you use the ByteArrayConverter, it will not inspect the message, and it just copies it along to the destination cluster with no registration.
A small optimization on the Avro approach would be to inspect only the first 5 bytes of each message to confirm it is Avro-encoded per the Schema Registry wire format (magic byte + schema ID), then perform an HTTP lookup against the source registry using the REST API GET /schemas/ids/:id, rename the topic if needed for the destination cluster, and POST the schema to the destination registry. A similar approach can work in any consumer/producer pair, such as a MirrorMaker MessageHandler implementation; a rough sketch follows.
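Not the plugin's actual code, just a rough sketch of that lookup-and-re-register flow. The registry URLs and the "<topic>-value" subject naming (the default TopicNameStrategy) are assumptions; the schema ID would come from the first 5 bytes of each record value as described above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaCopySketch {
    static final String SOURCE_REGISTRY = "http://source-registry:8081";
    static final String DEST_REGISTRY = "http://dest-registry:8081";
    static final HttpClient HTTP = HttpClient.newHttpClient();

    static void copySchema(int schemaId, String destinationTopic) throws Exception {
        // 1. Fetch the writer schema from the source registry by its ID
        HttpResponse<String> fetched = HTTP.send(
            HttpRequest.newBuilder(URI.create(SOURCE_REGISTRY + "/schemas/ids/" + schemaId)).GET().build(),
            HttpResponse.BodyHandlers.ofString());

        // 2. Register it under the destination subject; for Avro, the GET response body
        //    ({"schema": "..."}) happens to match the payload the POST endpoint expects
        HTTP.send(
            HttpRequest.newBuilder(URI.create(DEST_REGISTRY + "/subjects/" + destinationTopic + "-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(fetched.body()))
                .build(),
            HttpResponse.BodyHandlers.ofString());
    }
}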