Why use a schema registry - apache-kafka

I just started working with Kafka. I use Protocol Buffers for the message format and I just learned about the schema registry.
To give some context, we are a small team with a dozen web services, and we use Kafka to communicate between them. We store all the schemas and read/write models in a library that is later imported by each service. This way they know how to serialize/deserialize messages.
But now the schema registry comes into play. Why use it? Now my infrastructure becomes more complicated, plus I need to update the registry every time I change a schema, and I still need to define the read/write models in each service like I do now with the library.
So from my point of view I only see cons, mainly just complicating things, so why should I use a schema registry?
Thanks

The schema registry ensures your messages will not deviate from a common base compatibility guarantee (the first version of the schema).
For example, you have a schema that describes an event like {"first_name": "Jane", "last_name": "Doe"}, but then later decide that names can actually have more than 2 parts, so you then move to a schema that can support {"name": "Jane P. Doe"}... You still need a way to deserialize old data with first_name and last_name fields to migrate to the new schema having only name. Therefore, consumers will need both schemas. The registry will hold that and encode the schema ID within each payload from the producer. After all, the initial events with the two name fields would know nothing about the "future" schema with only name.
You say your models are shared in libraries across services. You probably then have some regression testing and release cycle to publish these between services? The registry will allow you to centralize that logic.
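In practice the integration is mostly configuration. As a rough sketch (not your exact setup), a producer using Confluent's schema-registry-aware Protobuf serializer only needs properties like the ones below; the broker and registry addresses are placeholders. The serializer registers the message's schema on first use and embeds the returned schema id in every record, so consumers can always fetch the exact writer schema instead of relying on the shared library being up to date.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer;

public class RegistryAwareProducerConfig {
    // Producer properties for Confluent's schema-registry-aware Protobuf serializer.
    // The serializer registers the record's Protobuf schema on first use and embeds the
    // returned schema id in every message it produces.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaProtobufSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");                 // assumed registry address
        return props;
    }
}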

Related

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting with schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 defines a read (projection) schema v1, and another consumer C2 defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine a schema evolution without any breaking changes:
- The consumer C1 requests a new property by adding a new optional field to its read schema. This is a backward-compatible change: messages encoded without this field will still be translated into the read schema. Now we've got v2 of C1's read schema.
- The producer P satisfies consumer C1's need by adding the new field to its write schema. The field doesn't have to be required, as this is a forward-compatible change. The consumer C1 will access the data encoded in the newly added field; the consumer C2 will simply ignore it, as it is a tolerant reader. Now we've got v2 of P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some breaking schema changes:
- The producer P decides to delete a non-optional field that one of the consumers might use. This is not a forward-compatible change. Assuming the subject S is configured with the FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
- The consumer C2 requests a new property by adding a new field to its read schema. Since the producer doesn't write that field, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains the versions of the write schema cannot be used, because it is configured with the FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution, because each consumer must hold all versions of the write schema locally.
It's going to be a pain to sync all the consumers when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.
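To illustrate the per-consumer check, here's a rough sketch against the registry's REST compatibility endpoint. The registry URL, the dedicated read-schema subject name ("orders-c1-read") and the candidate schema are all made up; the call tests the schema against the latest version under whatever compatibility level is set on that subject and returns a small JSON document such as {"is_compatible": true}.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadSchemaCompatibilityCheck {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";   // assumed registry address
        String subject = "orders-c1-read";           // hypothetical subject holding C1's read schema

        // The candidate read schema, embedded as a JSON string in the request body the registry expects.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
        String body = "{\"schema\": " + toJsonString(schemaJson) + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registry + "/compatibility/subjects/" + subject + "/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A successful call returns something like {"is_compatible":true}.
        System.out.println(response.body());
    }

    // Minimal JSON string escaping for the embedded schema document.
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}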

Using Kafka with Schema Registry in an app with multiple topics and SubjectNameStrategy

To begin with, I have found a way to do this, more or less, but it's really bad code. So I'm looking for suggestions on how to solve this better, if such an approach exists.
To lay out something to work with: assume you have an app which sends Avro to n topics and uses the schema registry. Assume (at first) that you don't want to use Avro unions, since they bring some issues along. N-1 topics are easy: one schema per topic. But then you have data you need to send in order, which means one topic and a specified grouping key, but these records don't have the same schema. So to do that, you need to register multiple schemas for that topic in the schema registry, which implies the use of key.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy or similar. And here it becomes ugly.
But that setting is per schema registry instance, so you have to declare two (or more) schema registry instances, one for each SubjectNameStrategy key/value combination. This will work.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language-specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
So if you cannot use RecordNameStrategy, and for some reason you still want to use Avro and the schema registry, IIUC you have no other choice than to use Avro unions at the top level and use the default TopicNameStrategy, which is fine now, since you have a single unioned schema. But top-level unions weren't nice to me in the past, since the deserializer doesn't naturally know which type you would like to deserialize from the data. So theoretically a way out of this could be to use, say, the CloudEvents standard (or something similar), set the CloudEvents type attribute according to which type from the union was used to serialize the data, and then have a type->deserializer map, to be able to pick the correct deserializer for the Avro-encoded data in the received CloudEvents message. This will work, and not only for Java.
So to wrap up, these are two generally described solutions to a very simple problem. But to be honest, they seem extremely complicated for a widely adopted stack (Avro/Schema Registry). I'd like to know if there is an easier way through this.
This is a common theme, particularly in CQRS-like systems in which commands may be ordered (eg create before update or delete etc). In these cases, using Kafka, it's often not desirable to publish the messages over multiple topics. You are correct that there are two solutions for sending messages with multiple schemas on the same topic: either a top-level union in the avro schema, or multiple schemas per topic.
You say you don't want to use top-level unions in the schema, so I'll address the case of multiple schemas per topic. You are correct that this excludes the use of any subject naming strategy that includes only the topic name to define the subject. So TopicNameStrategy is out.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language-specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
This is worthy of some clarification.... In the confluent way of things, the 'schema-registry aware avro serializers' first register your writer schema in the registry against a subject name to obtain a schema id. They then prefix your avro bytes with that schema id before publishing to kafka. See the 'Confluent Wire Format' at https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format.
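To make that concrete, here's a small sketch (not the Confluent implementation itself) of splitting a raw record value into the schema id and the payload, following the documented wire format: one magic byte (0), a 4-byte big-endian schema id, then the encoded bytes.

import java.nio.ByteBuffer;

public class WireFormat {
    // Reads the registry schema id off the front of a Kafka record value.
    public static int schemaId(byte[] value) {
        ByteBuffer buffer = ByteBuffer.wrap(value);
        byte magic = buffer.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Not Confluent wire format, magic byte = " + magic);
        }
        return buffer.getInt();
    }

    // Returns the remaining bytes, i.e. the Avro-encoded payload after the 5-byte prefix.
    public static byte[] payload(byte[] value) {
        byte[] avroBytes = new byte[value.length - 5];
        System.arraycopy(value, 5, avroBytes, 0, avroBytes.length);
        return avroBytes;
    }
}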
So the subject naming is a choice in the serializer library; the deserializer just resolves a schema by the id prefixing the kafka message. The confluent Java serializers make this subject naming configurable and define strategies TopicNameStrategy, RecordNameStrategy and TopicRecordNameStrategy. See https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#subject-name-strategy. The three strategies are conventions for defining 'scopes' over which schemas will be tested for compatibility in the registry (per-topic, per-record, or a combination). You've identified RecordNameStrategy fits your use case for having multiple avro schemas per topic.
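With the Java serializers that choice is plain producer configuration. A minimal sketch, with placeholder broker/registry addresses:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class MultiSchemaTopicConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder

        // Subjects are derived from the record's fully-qualified name instead of the topic,
        // so several record types (schemas) can live on the same topic.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return props;
    }
}

Note that the strategy is set on the serializer, so different producers (or the key and value sides) can use different strategies against the same registry.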
However, I think your concern about non-Java support for RecordNameStrategy can be set aside. In the serializer, the subject naming is free to be implemented however the serializer developer chooses. Having worked on this stuff in Java, Python, Go and NodeJS, I've experienced some variety in how third-party serializers work in this regard. Nevertheless, working non-Java libs do exist.
If all else fails, you can write your own 'schema-registry aware serializer' that registers the schema with your chosen subject name prior to encoding the confluent wire-format for Kafka. I've had happy outcomes from other tooling by keeping to one of the well-known confluent strategies, so I can recommend mimicking them.
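If you do write your own, the core is small. A deliberately naive sketch, assuming a registry at a placeholder URL, no schema-id caching and no JSON library: register the schema under a subject of your choosing, then frame the already-encoded Avro bytes in the wire format described above.

import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;

public class MinimalRegistryAwareSerializer {
    private final String registryUrl;   // e.g. http://localhost:8081 (assumption)
    private final HttpClient http = HttpClient.newHttpClient();

    public MinimalRegistryAwareSerializer(String registryUrl) {
        this.registryUrl = registryUrl;
    }

    // Registers the writer schema under a subject named however you like and returns the id
    // assigned by the registry. The response parsing is deliberately naive; a real
    // implementation would use a JSON library and cache ids per schema.
    public int register(String subject, String schemaJson) throws Exception {
        String body = "{\"schema\": \"" + schemaJson.replace("\\", "\\\\").replace("\"", "\\\"") + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryUrl + "/subjects/" + subject + "/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        String response = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        return Integer.parseInt(response.replaceAll("[^0-9]", "")); // expects {"id":123}
    }

    // Frames already-encoded Avro bytes: magic byte 0, 4-byte big-endian schema id, payload.
    public byte[] frame(int schemaId, byte[] avroBytes) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0);
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array());
        out.write(avroBytes);
        return out.toByteArray();
    }
}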

What are the advantages of Confluent's Kafka Avro serializers?

I can't seem to find it clearly stated in the docs: what are the advantages of using AvroKafkaSerializer (which has schema support) vs. serializing the objects "manually" in code and sending them as bytes/strings?
Maybe a schema check when producing a new message? What are the others?
A message schema is a contract between a group of client applications producing and consuming messages. Schema validation is required when you have many independent applications that need to agree on a specific format, in order to exchange messages reliably.
If you also add a Schema Registry into the picture, then you don't need to include the schema in all your services or in every single message; you get it from the common registry instead, with the additional support of schema evolution and validation rules (e.g. backward compatibility, versioning, syntax validation). It is one of the fundamental components in event-driven architectures (EDA).
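As an illustration, here's a minimal consumer sketch with Confluent's schema-registry-aware Avro deserializer (broker, registry, group id and topic are placeholders). Note that no schema is bundled with the application: the deserializer reads the schema id from each message and fetches (and caches) the writer schema from the registry.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.avro.generic.GenericRecord;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-readers");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                              // hypothetical topic
            for (ConsumerRecord<String, GenericRecord> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.println(record.value());
            }
        }
    }
}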

Kafka Avro Schema evolution

I am trying to learn more about the Avro schemas which we use for our Kafka topics and I am relatively new to this.
I was wondering, is there a way to evolve schemas in a particular situation? We update our schema with a new field that can't be null or have any default value, because these new fields are identifiers. The workaround is to create new topics, but is there a better way to evolve existing schemas?
There are four possible compatibility types for a topic's subject:
- Forward: a client which expects the old version of the schema can read the new version
- Backward: a client which expects the new version of the schema can read the old version
- Both: both of the above
- None: none of the above
Consider that there are times when some producers will produce both old and new data, and consumers will expect new or old data.
How would clients behave in your case?
- Adding a field is always forward compatible (old clients just drop the new field).
- It is backward compatible only if you specify a default value.
Also, this is only true if you are planning to convert data to a specific schema (with the corresponding POCO, for example) - if you just convert it to JSON and apply custom treatment, you could have a new client process both schemas.
So there are two possible ways for me to write to the same topic:
1. You set a default value (see the sketch below). You may be misunderstanding default values: it doesn't mean a default value will be written, but (quoting the Avro spec) "A default value for this field, used when reading instances that lack this field (optional)". For example, if you previously had a "name" and want to add "surname", you can set the "surname" default to "NC" (or empty), as you may have done in a database.
2. You set your compatibility type to NONE (or FORWARD) so that you can update your schema (by default, compatibility is BACKWARD). In this case, clients awaiting the new schema won't be able to process old data. But it could fit your usage if you only process incoming data: change the compatibility, update all your producers (so that only new data will arrive), then update your clients awaiting the new schema - and remember to set compatibility back to BACKWARD, or whatever compatibility you really want.
I would go with option 1.
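A small sketch of option 1 with plain Avro, using a hypothetical Person record: v1 writes only name, v2 adds surname with the default "NC", and reading old data with the v2 reader schema fills in the default.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultValueResolution {
    public static void main(String[] args) throws Exception {
        Schema writerV1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"}]}");
        Schema readerV2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"surname\",\"type\":\"string\",\"default\":\"NC\"}]}");

        // Old data written with v1: only "name" is present on the wire.
        GenericRecord oldRecord = new GenericData.Record(writerV1);
        oldRecord.put("name", "Jane");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerV1).write(oldRecord, encoder);
        encoder.flush();

        // Reading with v2: Avro fills the missing "surname" with the declared default.
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(writerV1, readerV2);
        GenericRecord decoded = reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(decoded); // {"name": "Jane", "surname": "NC"}
    }
}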

Confluent Platform: Schema Registry Subjects

I'm working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used with the new Kafka 0.8.2 producer) uses the topic-key|value pattern for subjects (link), whereas the KafkaAvroEncoder (for the old producer) uses the schema.getName()-value pattern (link).
The reason why one would have 2 different subjects per topic (one for key, one for value) is pretty simple:
Say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    { "name": "line", "type": "string" },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          { "name": "host", "type": "string" },
          { "name": "...", "type": "string" }
        ]
      }
    }
  ]
}
A common use case would be that I want to partition entries by source, and thus would like to have two subjects associated with the topic (subjects basically hold the revisions of your Avro schemas) - one for the key (which is SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
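To make that concrete, here's a rough producer sketch with Avro on both the key and the value (topic name, broker and registry addresses are placeholders). With the default TopicNameStrategy in current Confluent serializers, producing to "logs" registers the key schema under the subject logs-key and the value schema under logs-value.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class LogEntryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder

        // Simplified versions of the SourceInfo (key) and LogEntry (value) schemas from above.
        Schema keySchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SourceInfo\",\"fields\":[{\"name\":\"host\",\"type\":\"string\"}]}");
        Schema valueSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");

        GenericRecord key = new GenericData.Record(keySchema);
        key.put("host", "web-01");
        GenericRecord value = new GenericData.Record(valueSchema);
        value.put("line", "GET /index.html 200");

        // Producing registers (or looks up) both schemas in the registry, one subject per side.
        try (KafkaProducer<GenericRecord, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("logs", key, value));
        }
    }
}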
Note: any further information is just my personal thoughts and maybe I just don't yet fully understand how this is supposed to work so I might be wrong.
I actually like how the KafkaAvroEncoder is implemented more than the KafkaAvroSerializer. KafkaAvroEncoder does not in any way force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In this case KafkaAvroSerializer would try to update the topic-key and topic-value subjects, and 99% of the time this would break if compatibility is violated (and if you have multiple Avro schemas they are almost always different and incompatible with each other).
On the other hand, KafkaAvroEncoder cares just about schema names, and you may safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
This inconsistency is still unclear to me, and I hope the Confluent folks can explain it if they see this question/answer.
Hope that helps.