Kafka - Can you create a schema before the topic exists, and how are the two related? - apache-kafka

Is there any order that must be followed, e.g. should a person create the topic first and then the schema in Schema Registry, or vice versa?
Can two topics use the same schema from Schema Registry?
Does every topic need to have a key and a value (and therefore two schemas per topic)?
What is the relation and possible combinations?
Thanks.

Is there any order that must be followed?
No. If auto topic creation is enabled, you could even start producing Avro immediately to a nonexistent topic. By default, the Confluent serializers automatically register the schema, and the broker creates the topic with the default number of partitions and replicas.
Can two topics use the same schema?
Yes, two distinct topics can share the same Avro schema ID. For example, an Avro string key used by more than one topic creates one subject per topic in the registry, but a single schema ID backs all of them.
Does every topic need to have a key and a value?
Yes. That's part of the Kafka record protocol. The key can be null, however. If you're not using the Avro serializer for the key or for the value, no registry entry is made for that part. You're not required to use Avro for both options if one or the other is
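To make the subject/schema-ID relationship above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Confluent client: the class `ToyRegistry`, the helper `subject_for`, and the topic names `orders` and `payments` are all hypothetical. It assumes the default naming convention, where a topic's subjects are `<topic>-key` and `<topic>-value`.

```python
import json


class ToyRegistry:
    """Toy in-memory stand-in for a schema registry (illustration only)."""

    def __init__(self):
        self._ids = {}      # canonical schema text -> global schema ID
        self.subjects = {}  # subject name -> list of schema IDs (one per version)

    def register(self, subject, schema):
        # Identical schema text always maps to the same global ID.
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = self._ids.setdefault(canonical, len(self._ids) + 1)
        versions = self.subjects.setdefault(subject, [])
        if schema_id not in versions:
            versions.append(schema_id)
        return schema_id


def subject_for(topic, is_key):
    # Default convention: "<topic>-key" or "<topic>-value".
    return f"{topic}-{'key' if is_key else 'value'}"


registry = ToyRegistry()
string_key = "string"  # Avro primitive schema for a string key

# Two topics (hypothetical names) share the same key schema.
id_a = registry.register(subject_for("orders", True), string_key)
id_b = registry.register(subject_for("payments", True), string_key)

print(sorted(registry.subjects))  # two separate subjects
print(id_a == id_b)               # backed by one schema ID
```

Two subjects are created, but both point at the same schema ID, which is the situation described in the answer.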

Related

ksqlDB can't get data from Schema Registry

Case: I have a topic in Kafka named some_table_prefix.table_name. The data is serialized with Avro, but for historical reasons the subject in Schema Registry is named table_name-value.
When I try to set up a ksqlDB stream
CREATE STREAM some_stream_name
WITH (KAFKA_TOPIC='some_table_prefix.table_name', VALUE_FORMAT='AVRO');
I get the error: Schema for message values on topic some_table_prefix.table_name does not exist in the Schema Registry.Subject: some_table_prefix.table_name-value.
Schema Registry is integrated correctly; everything works for other topics.
So, is it possible to specify the Schema Registry subject name when creating a ksqlDB stream, or to resolve this issue some other way?
If you have a topic named table_name with Avro being produced to it (which would automatically create table_name-value in the Registry), then that's what ksqlDB should consume from. If you manually created that subject by posting the schema yourself, without matching the topic name, then that's part of the problem.
As the error says, it's looking for a specific subject in the Registry based on the topic you've provided. To my knowledge, it's not possible to use another subject name, so the workaround is to POST the old subject's schemas into the new one.
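The workaround above could be sketched roughly as follows, using only the Python standard library against the Schema Registry REST API (`GET /subjects/{subject}/versions`, `GET /subjects/{subject}/versions/{version}`, `POST /subjects/{subject}/versions`). The base URL `http://localhost:8081` and the helper names are assumptions for illustration; check them against your own deployment before running `copy_subject`.

```python
import json
import urllib.parse
import urllib.request

CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def subject_url(base_url, subject, suffix=""):
    # Build ".../subjects/<subject>/versions<suffix>" with the subject URL-encoded.
    encoded = urllib.parse.quote(subject, safe="")
    return f"{base_url.rstrip('/')}/subjects/{encoded}/versions{suffix}"


def build_register_request(base_url, subject, schema_text):
    # POST body is {"schema": "<schema as a JSON string>"}.
    body = json.dumps({"schema": schema_text}).encode("utf-8")
    return urllib.request.Request(
        subject_url(base_url, subject),
        data=body,
        headers={"Content-Type": CONTENT_TYPE},
        method="POST",
    )


def copy_subject(base_url, old_subject, new_subject):
    """Re-register every version of old_subject under new_subject, in order."""
    with urllib.request.urlopen(subject_url(base_url, old_subject)) as resp:
        versions = json.load(resp)  # e.g. [1, 2, 3]
    for version in versions:
        url = subject_url(base_url, old_subject, f"/{version}")
        with urllib.request.urlopen(url) as resp:
            schema_text = json.load(resp)["schema"]
        request = build_register_request(base_url, new_subject, schema_text)
        with urllib.request.urlopen(request) as resp:
            print(version, "->", json.load(resp)["id"])


# Example call (requires a reachable registry; URL is an assumption):
# copy_subject("http://localhost:8081",
#              "table_name-value",
#              "some_table_prefix.table_name-value")
```

Because the schema text is unchanged, the registry should hand back the existing schema IDs for the new subject, so messages already in the topic remain readable.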

How to version a field in avro schema when Kafka Consumer updates?

Example: I have a field named
"abc":[
{"key1":"value1", "key2":"value2"},
{"key1":"value1", "key2":"value2"}
]
Consumer1 and consumer2 both consume this field, but consumer2 now requires a few more fields and needs the structure to change.
How should this be addressed following best practice?
You can use the map type in the Avro schema. Keys are always strings, and the values can be of any type, but it must be the same type for the whole map.
So, in your case, introduce a map into your schema. consumer_1 can consume the event and read only the keys it needs, and consumer_2 can do the same, while both still share the same Avro schema.
Note: you cannot send null for a map field unless the schema declares it as a union with null. Send an empty map instead.
If possible, introduce a Schema Registry server for schema versioning. Register the different Avro schemas with Schema Registry, and each one is assigned a version ID. Connect your producer and consumer apps to the Schema Registry server so they can fetch the registered schema for each Kafka message. Consumers can then read messages written with any registered schema version, as long as the versions are compatible.
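As a rough illustration of the map suggestion, the sketch below writes one possible schema for the `abc` field as an array of string-valued maps and shows each consumer picking only the keys it cares about. The record name `Event`, the extra key `key3`, and the helper `pick` are hypothetical; the schema is plain JSON expressed as a Python dict and is not validated against an Avro library here.

```python
import json

# Hypothetical record name "Event"; "abc" is the field from the question.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {
            "name": "abc",
            # Each array element is a map: string keys, string values.
            "type": {"type": "array", "items": {"type": "map", "values": "string"}},
            "default": [],
        }
    ],
}

# A message where the producer has started sending an extra key ("key3").
event = {"abc": [{"key1": "value1", "key2": "value2", "key3": "value3"}]}


def pick(entries, wanted):
    # Keep only the keys a given consumer needs; ignore everything else.
    return [{k: entry[k] for k in wanted if k in entry} for entry in entries]


consumer1_view = pick(event["abc"], ["key1", "key2"])
consumer2_view = pick(event["abc"], ["key1", "key2", "key3"])

print(json.dumps(schema, indent=2))
print(consumer1_view)
print(consumer2_view)
```

The trade-off is that a map gives up per-key typing: every value must share one type, so this fits best when the extra fields are all strings.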

One or more schemas per topic when using Schema Registry with Kafka, and Avro...?

There is something I'm trying to understand about how Avro-serialized messages are treated by Kafka and Schema Registry. From this post I've understood that the schema ID is stored in a predictable place in each message, so it seems we can have messages of various schemas in the same topic and still find the right schema and deserialize them successfully based on just that. On the other hand, I see many people using the expression "a schema attached to a topic", which implies one schema per topic.
So which is right? Can I take advantage of the Schema Registry (e.g. with KSQL) and have messages of various types (or schemas) in the same topic?
Typically you have 1:1 topic/schema relationship, but it is possible (and valid) to have multiple schemas per topic in some situations. For more information, see https://www.confluent.io/blog/put-several-event-types-kafka-topic/
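One way multiple schemas per topic is usually enabled is through the serializer's subject name strategy. The sketch below only models how the three Confluent strategies derive a subject name; the function `subject_name` and the topic and record names are hypothetical, and the actual serializer configuration is not shown.

```python
def subject_name(strategy, topic, record_name, is_key=False):
    """Model of how each naming strategy derives a registry subject."""
    if strategy == "TopicNameStrategy":
        # Default: one subject per topic and key/value side.
        return f"{topic}-{'key' if is_key else 'value'}"
    if strategy == "RecordNameStrategy":
        # Subject depends only on the fully qualified record name.
        return record_name
    if strategy == "TopicRecordNameStrategy":
        # One subject per (topic, record type) pair.
        return f"{topic}-{record_name}"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical topic and record types sharing one topic.
for record in ("com.example.OrderCreated", "com.example.OrderShipped"):
    print(subject_name("TopicNameStrategy", "orders", record))
    print(subject_name("TopicRecordNameStrategy", "orders", record))
```

With the default strategy both event types land in the single subject `orders-value` and must be compatible with each other; with a record-based strategy each type evolves under its own subject.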

In schema registry, consumer's schema could differ from the producer's, what actually it means

When producing Avro data to Kafka, the Avro serializer writes into the byte array the ID of the same schema that was used to write the data.
The Kafka consumer fetches the schema from Schema Registry based on the schema ID in the byte array it receives. So the same schema ID, and therefore the same schema, is used by both the producer and the consumer.
But why do many articles, including this one, say that the consumer's schema could differ from the producer's?
Please help me understand this.
Kafka Consumer fetches the schema from Schema Registry based on schema ID
Only if you let the deserializer do that.
You can define your own compatible schema in the consumer code. It can therefore differ from the producer's, as long as it follows the rules of Avro schema evolution.
Schema evolution happens only during deserialization at the consumer (read). If the consumer's schema is different from the producer's schema, then the value or key is automatically modified during deserialization to conform to the consumer's read schema if possible.
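A heavily simplified sketch of what "modified to conform to the consumer's read schema" means for flat records: fields the reader does not know are dropped, and fields only the reader has are filled from their defaults. This is not the Avro library's implementation (it ignores type promotion, aliases, unions, and nesting), and the schemas and the `resolve` helper are hypothetical.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a decoded record from the writer's schema onto the reader's."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            result[name] = record[name]        # present in both schemas
        elif "default" in field:
            result[name] = field["default"]    # reader-only field: use default
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    return result                              # writer-only fields are dropped


# Producer's (writer) schema, fetched by ID from the registry.
writer = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "nickname", "type": "string"},
]}

# Consumer's own (reader) schema: drops "nickname", adds "country".
reader = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"},
]}

print(resolve({"id": 7, "nickname": "kc"}, writer, reader))
```

The writer's schema is still needed to decode the bytes; the reader's schema only decides the shape of the object handed to the application.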

How does Avro for Kafka work with Schema registry?

I am working with Kafka, and as a beginner the following question popped into my mind.
Every time we design a schema for Avro, we generate the Java object from it using the Avro jars.
We then use that object to populate data and send it from the producer.
To consume the message, we generate the object again in the consumer. The objects generated in both places, producer and consumer, contain a field
"public static final org.apache.avro.Schema SCHEMA$" which actually stores the schema as a String.
If that is the case, then why should Kafka use Schema Registry at all? The schema is already available as part of the Avro objects.
I hope my question is clear. If someone can answer it, it would be a great help.
Schema Registry is the repository that stores the schemas of all the records sent to Kafka. When a Kafka producer sends records using KafkaAvroSerializer, the schema of the record is extracted and stored in Schema Registry, and the actual record in Kafka contains only the schema ID.
When deserializing the record, the consumer reads the schema ID and uses it to fetch the actual schema from Schema Registry. The record is then deserialized using the fetched schema.
So, in short, Kafka does not keep a copy of the schema in every record. Instead, the schema is stored in Schema Registry and referenced via the schema ID.
This saves space when storing records and also helps detect schema compatibility issues between clients.
https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
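The "record only contains the schema ID" idea can be sketched as a small framing helper. This follows the Confluent wire format as I understand it (one magic byte `0`, a 4-byte big-endian schema ID, then the Avro binary payload); the function names `frame` and `unframe` are hypothetical, and the payload bytes below are placeholders rather than a real encoded record.

```python
import struct

MAGIC_BYTE = 0  # format version marker at the start of every message


def frame(schema_id, avro_payload):
    # 1 magic byte + 4-byte big-endian schema ID, followed by the Avro bytes.
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    # Split a message back into (schema_id, avro_payload).
    if len(message) < 5:
        raise ValueError("message too short to contain a schema ID header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


message = frame(42, b"\x06abc")  # placeholder payload bytes
schema_id, payload = unframe(message)
print(len(message) - len(payload), "header bytes, schema ID", schema_id)
```

The header costs five bytes per record regardless of schema size, which is where the space saving over embedding the full schema comes from.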
Schema Registry is a central repository for all schemas and helps enforce schema compatibility rules when new schemas are registered. Without it, schema evolution would be difficult.
Based on the configured compatibility level (backward, forward, full, etc.), Schema Registry rejects any new schema that doesn't conform to it.
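To give a feel for what a backward compatibility check does, here is a deliberately narrow sketch for flat record schemas: a new schema can read old data if every field it adds carries a default. The real registry check covers far more (type changes, unions, nested records, transitive modes); the helper `is_backward_compatible` and the schemas are hypothetical.

```python
def is_backward_compatible(new_schema, old_schema):
    """True if new_schema can read data written with old_schema.

    Simplified rule for flat records: every field in the new schema must
    either exist in the old schema or declare a default value.
    Removing a field is allowed, since the reader just ignores it.
    """
    old_names = {f["name"] for f in old_schema["fields"]}
    return all(
        f["name"] in old_names or "default" in f
        for f in new_schema["fields"]
    )


v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
]}

# Adds a field with a default: old records can still be read.
v2_ok = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"},
]}

# Adds a required field without a default: old records lack a value for it.
v2_bad = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
]}

print(is_backward_compatible(v2_ok, v1))
print(is_backward_compatible(v2_bad, v1))
```

Under a backward compatibility setting, a registry applying this rule would accept `v2_ok` and reject `v2_bad`.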