Consumer schema update during deserialization - apache-kafka

I'm currently studying the Avro schema system and from what I understand the flow of a schema update is:
1) A client changes the schema (maybe by adding a new field with a default value for backwards compatibility) and sends data to Kafka serialized with the updated schema.
2) The Schema Registry runs compatibility checks and registers the new schema as a new version with a unique ID.
3) The consumer (still using the old schema) attempts to deserialize the data and schema evolution drops the new field, allowing the consumer to deserialize the data.
From what I understand, we need to explicitly update the consumers after a schema change in order to supply them with the latest schema.
But why can't the consumer just pull the latest schema when it sees that the ID has changed?

You only need to update consumer schemas if they are using a SpecificRecord subclass; that's effectively skipping the schema-ID lookup.
If you want it to always match the latest, then you can make an HTTP call to the registry's latest-version endpoint (GET /subjects/<subject>/versions/latest), get the schema, then restart the app.
Or, if you always want the consumer to use the ID of the message, use GenericRecord as the object type.
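A minimal sketch of the GenericRecord route, assuming Confluent's KafkaAvroDeserializer, a local registry, and a hypothetical topic and group id; with specific.avro.reader left at its default of false, the deserializer resolves every record against the writer schema it fetches by the ID embedded in the message:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GenericAvroConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "generic-consumer");       // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // false (the default) => deserialize into GenericRecord using the writer schema
        // fetched by the ID embedded in each message; true => generated SpecificRecord classes.
        props.put("specific.avro.reader", "false");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    // Fields added by a newer writer schema show up here without
                    // regenerating classes or redeploying the consumer.
                    System.out.println(record.value());
                }
            }
        }
    }
}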

Related

How to version a field in avro schema when Kafka Consumer updates?

Example: I have a field named
"abc":[
{"key1":"value1", "key2":"value2"},
{"key1":"value1", "key2":"value2"}
]
Consumer1 and consumer2 both consume this field, but now consumer2 requires a few more fields and needs to change the structure.
How should this be addressed while following best practice?
You can use the map type in the Avro schema. The key is always a string, and the value can be any type, but it must be one type for the whole map.
So, in your case, introduce a map into your schema. consumer_1 can consume the event and pick out only the keys it needs, and consumer_2 can do the same, while both still use the same Avro schema.
Note: you cannot send null for the map in the schema; you need to send an empty map.
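A minimal sketch of what that could look like on the producer side, assuming a hypothetical record named Event with the repeated keys folded into an Avro map of strings:

import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MapFieldExample {
    public static void main(String[] args) {
        // "abc" is declared as a map of strings, so producers can add keys that only
        // some consumers care about without changing the schema itself.
        String schemaJson =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"abc\",\"type\":{\"type\":\"map\",\"values\":\"string\"},"
          + "\"default\":{}}"   // no nulls allowed for a map field; default to an empty map
          + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord event = new GenericData.Record(schema);
        event.put("abc", Map.of("key1", "value1", "key2", "value2", "key3", "value3"));

        // consumer_1 reads only key1/key2, consumer_2 also reads key3 -- same Avro schema for both.
        System.out.println(event);
    }
}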
If possible, introduce a Schema Registry server for schema versioning. Register all the different Avro schemas with the Schema Registry and each will be given a version ID. Connect your producer and consumer apps to the Schema Registry server so they can fetch the registered schema for the respective Kafka message. A message written with any registered version of the schema can then be received by any consumer, as long as the versions are compatible.

Why Apache Kafka consumer would use a different version of schema to deserialize record other than the one sent along with the data?

Let us assume I am using Avro serialization while sending data to kafka.
While consuming a record from Apache Kafka, I get both the schema and the record, and I can use the schema to parse the record. I don't understand the scenario in which the consumer would use a different version of the schema to deserialize the record. Can someone help?
The message is serialized with the unique ID for a specific version of the schema when it is produced onto Kafka. The consumer then uses that unique schema ID to deserialize.
Taken from https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution
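For reference, a rough sketch of how that schema ID could be pulled out of a raw message value, assuming the standard Confluent wire format (one magic byte, a 4-byte big-endian schema ID, then the Avro-encoded payload):

import java.nio.ByteBuffer;

public class WireFormat {
    // Confluent wire format: byte 0 is the magic byte (0), bytes 1-4 are the schema ID
    // as a big-endian int, and the Avro binary payload follows.
    public static int schemaIdOf(byte[] messageValue) {
        ByteBuffer buffer = ByteBuffer.wrap(messageValue);
        byte magic = buffer.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Not a Confluent-framed Avro message");
        }
        return buffer.getInt();   // the deserializer looks this ID up in the registry
    }
}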

Kafka topic - schema registry compatibility

I have a v001 topic and a v001-value schema subject.
I made a breaking change to the schema, where some optional fields were made mandatory.
However, the messages already in Kafka from before this change do have all of these fields. Do I have to create a new topic for this change?
The Schema Registry will tell you if you have an incompatible schema, depending on how you have it configured.
https://docs.confluent.io/current/schema-registry/develop/api.html#post--compatibility-subjects-(string-%20subject)-versions-(versionId-%20version)
"have to create new topic for this change"
Not necessarily, you could delete the schema in the registry, then push on a new one.
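As a rough sketch, the compatibility endpoint from that link can be called like this; it assumes the registry runs at localhost:8081, uses the question's v001-value subject, and sends a trivial placeholder schema where you would send your real candidate schema:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompatibilityCheck {
    public static void main(String[] args) throws Exception {
        // Candidate schema wrapped in the registry's request envelope: {"schema": "..."}
        String body = "{\"schema\": \"{\\\"type\\\": \\\"string\\\"}\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/compatibility/subjects/v001-value/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // Expected response body: {"is_compatible": true} or {"is_compatible": false}
        System.out.println(response.body());
    }
}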

How does Avro for Kafka work with Schema registry?

I am working on Kafka, and as a beginner the following question popped into my mind.
Every time we design a schema for Avro, we generate the Java object from it through its jars.
Now we use that object to populate data and send it from the Producer.
For consuming the message we generate the object again in the Consumer. Now the objects generated in both places, Producer and Consumer, contain a field
"public static final org.apache.avro.Schema SCHEMA$" which actually stores the schema as a String.
If that is the case, then why should Kafka use a schema registry at all? The schema is already available as part of the Avro objects.
Hope my question is clear. If someone can answer me, it would be of great help.
Schema Registry is the repository which stores the schemas of all the records sent to Kafka. When a Kafka producer sends records using KafkaAvroSerializer, the schema of the record is extracted and stored in the Schema Registry, and the actual record in Kafka contains only the schema ID.
When deserializing the record, the consumer fetches the schema ID and uses it to fetch the actual schema from the Schema Registry. The record is then deserialized using the fetched schema.
So, in short, Kafka does not keep a copy of the schema in every record; instead, the schema is stored in the Schema Registry and referenced via the schema ID.
This saves space when storing records and also helps detect schema compatibility issues between the various clients.
https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
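To make that flow concrete, here is a minimal producer sketch, assuming Confluent's KafkaAvroSerializer, a local registry, and a hypothetical User record and users topic; the record value on the wire carries only the schema ID plus the Avro-encoded bytes:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers/looks up the schema under the subject "users-value"
            // and writes only the schema ID plus the Avro bytes into the record value.
            producer.send(new ProducerRecord<>("users", "key1", user));
        }
    }
}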
Schema Registry is a central repository for all the schemas and helps enforce schema compatibility rules when registering new schemas, without which schema evolution would be difficult.
Based on the configured compatibility (backward, forward, full, etc.), the Schema Registry will reject any new schema which doesn't conform to the configured compatibility.
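The registry performs these checks server-side, but Avro itself ships a utility you can use to sanity-check a change locally before registering it; a small sketch with a hypothetical User record, where the new version adds a field with a default (a backward-compatible change):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class BackwardCheck {
    public static void main(String[] args) {
        // Writer schema: the old version already in the topic.
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // Reader schema: the new version, adding a field with a default value.
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"\"}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        System.out.println(result.getType()); // COMPATIBLE
    }
}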

Confluent Schema Registry Avro Schema

Hey, I would like to use the Confluent Schema Registry with the Avro serializers. The documentation now basically says: do not use the same schema for multiple different topics.
Can anyone explain to me why?
I researched the source code, and it basically stores the schema in a Kafka topic as follows: (topicname, magicbytes, version -> key), (schema -> value).
Therefore I don't see the problem of using the schema multiple times, except redundancy?
I think you are referring to this comment in the documentation:
We recommend users use the new producer in org.apache.kafka.clients.producer.KafkaProducer. If you are using a version of Kafka older than 0.8.2.0, you can plug KafkaAvroEncoder into the old producer in kafka.javaapi.producer. However, there will be some limitations. You can only use KafkaAvroEncoder for serializing the value of the message and only send value of type Avro record. The Avro schema for the value will be registered under the subject recordName-value, where recordName is the name of the Avro record. Because of this, the same Avro record type shouldn’t be used in more than one topic.
First, the commenter above is correct -- this only refers to the old producer API pre-0.8.2. It's highly recommended that you use the new producer anyway as it is a much better implementation, doesn't depend on the whole core jar, and is the client which will be maintained going forward (there isn't a specific timeline yet, but the old producer will eventually be deprecated and then removed).
However, if you are using the old producer, this restriction is only required if the schema for the two subjects might evolve separately. Suppose that you did write two applications that write to different topics but use the same Avro record type, let's call it record. Now both applications will register it/look it up under the subject record-value and get assigned version=1. This is all fine as long as the schema doesn't change. But let's say application A now needs to add a field. When it does so, the schema will be registered under subject record-value and get assigned version=2. This is fine for application A, but application B has either not been upgraded to handle this schema, or worse, the schema isn't even valid for application B. Either way, you lose the protection the schema registry normally gives you -- now some other application could publish data of that format into the topic used by application B (it looks ok because record-value has that schema registered). Now application B could see data which it doesn't know how to handle since it's not a schema it supports.
So the short version is that because with the old producer the subject has to be shared if you also use the same schema, you end up coupling the two applications and the schemas they must support. You can use the same schema across topics, but we suggest not doing so since it couples your applications (and their development, the teams developing them, etc).
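For what it's worth, newer versions of the Confluent serializers make this subject-naming choice an explicit, configurable strategy; a hedged sketch of the relevant producer settings (config key and strategy class names as shipped in the Confluent serializer libraries, broker and registry addresses hypothetical):

import java.util.Properties;

public class SubjectNamingConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Default: TopicNameStrategy => subject "<topic>-value", so two topics that share a
        // record type still evolve under independent subjects.
        // Switching to io.confluent.kafka.serializers.subject.RecordNameStrategy keys the
        // subject off the record name instead, which reintroduces the cross-topic coupling
        // described above, so only do that deliberately.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.TopicNameStrategy");
        return props;
    }
}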