What are the advantages of Confluent's Kafka Avro serializers? - apache-kafka

I can't seem to find clearly stated in the docs what the advantages are of using AvroKafkaSerializer (which has schema support) versus serializing the objects "manually" in code and sending them as bytes/strings.
Maybe a schema check when producing a new message? What are the others?

A message schema is a contract between a group of client applications producing and consuming messages. Schema validation is required when you have many independent applications that need to agree on a specific format, in order to exchange messages reliably.
If you also add a Schema Registry into the picture, then you don't need to bundle the schema into every service or every single message; instead you fetch it from the common registry, with the additional support of schema evolution and validation rules (e.g. backward compatibility, versioning, syntax validation). It is one of the fundamental components in event-driven architectures (EDA).
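To make this concrete, here is a minimal producer sketch using the Confluent KafkaAvroSerializer; the broker/registry URLs, the topic name and the toy Payment record are placeholders:

import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Schema-registry-aware serializer: registers (or looks up) the writer schema
        // and embeds only its id in each record instead of shipping the full schema.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");  // placeholder registry

        // Placeholder record type, for illustration only.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":"
                        + "[{\"name\":\"id\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("id", "42");

        // A record that doesn't match a compatible registered schema fails here,
        // at produce time, instead of blowing up in some downstream consumer.
        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "key-1", value));
        }
    }
}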

Related

test both forwards and backwards compatibility of json schemas for both readers and writers

I am seeking to programmatically answer the question of whether two arbitrary message schemas are compatible. Ideally for each of JSON, Avro and Protobuf. I am aware that internally the Kafka schema registry has such logic. Yet I want to ask programmatically, in a deployment pipeline, whether an old topic reader that I am promoting between environments will be able to read the latest (or historic) messages on the topic.
I am aware that:
The Maven plugin can do this at consumer compile time, but that option isn't open to me as I am not using Java for my "exotic" consumer. Yet I can generate the JSON schema it expects.
I am aware that I can invoke the schema registry API to ask whether a new schema is compatible with an old one, but I want to ask whether a reader that knows its own expected schema is compatible with the latest registered one, and that isn't supported.
I am aware that the topic can be set with FORWARD_TRANSITIVE or FULL_TRANSITIVE compatibility and based on knowing that I can assume my reader will always work. Yet I do not control the many topics in a large organization controlled by many other teams so I cannot enforce that the many teams with many existing topics set a correct policy.
I am aware that with careful testing and change management we can manually verify compatibility. The real world is messy and we will be doing this at scale with many inexperienced teams so anything that can go wrong will most certainly go wrong at some point.
I am aware that other folks must have wanted to do this so if I searched hard enough the answer should be there; yet I have read all the Q&As I could find and I really couldn't find any actual working answer to any of the past times this question has been asked.
What I want is a control so that I don't promote a reader into production without, at that point in time, pulling the latest registered schema and checking that it is compatible with the (generated) schema expected by the (exotic) reader that is being deployed.
I am aware I can just pull the source code that the kafka schema registry / maven plugin uses and roll my own solution but I feel that there must be a convenient solution out there I can easily script on a deployment pipeline to check "is this (generated) schema a subset of that (published/downloaded) one".
Okay, I cracked open the Confluent Kafka client code and wrote a simple Java CLI that uses its logic and the generic ParsedSchema class, which can be a concrete JSON, Avro or Protobuf schema. The dependencies are:
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-schema-registry-client</artifactId>
    <version>7.3.1</version>
</dependency>
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-json-schema-provider</artifactId>
    <version>7.3.1</version>
</dependency>
To be able to handle Avro and Protobuf it would need to have their corresponding schema provider dependencies included. The code then parses a JSON schema from a string with:
final var readSchemaString = ... // load from file
final var jsonSchemaRead = new JsonSchema(readSchemaString);
final var writeSchemaString = ... // load from file
final var jsonSchemaWrite = new JsonSchema(writeSchemaString);
Then we can validate that using my helper class:
final var validator = new Validator();
final var errors = validator.validate(jsonSchemaRead, jsonSchemaWrite);
for (var e : errors) {
    logger.error(e);
}
System.exit(errors.size()); // non-zero exit fails the deployment pipeline step
The validator class is very basic just using "can be read" semantics:
public class Validator {
    SchemaValidator validator = new SchemaValidatorBuilder().canBeReadStrategy().validateLatest();

    public List<String> validate(ParsedSchema reader, ParsedSchema writer) {
        return validator.validate(reader, Collections.singleton(writer));
    }
}
The full code is over on github at /simbo1905/msg-schema-read-validator
At deployment time I can simply curl the latest schema in the registry (or any historic schemas) that writers should be using. I can generate the schema for my reader on disk. I can then run that simple tool to check that things are compatible.
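If you'd rather stay in Java than shell out to curl, roughly the same lookup can be scripted with the registry client that is already a dependency; a rough sketch (the registry URL and subject name are placeholders):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import io.confluent.kafka.schemaregistry.json.JsonSchema;

public class FetchLatestWriteSchema {
    public static void main(String[] args) throws Exception {
        // Placeholder registry URL and subject; substitute the writers' real subject.
        var client = new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        SchemaMetadata latest = client.getLatestSchemaMetadata("orders-value");

        // Feed the downloaded schema string into the Validator shown above.
        var writerSchema = new JsonSchema(latest.getSchema());
        System.out.println(writerSchema.canonicalString());
    }
}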
Running things in a debugger I can see that there are at least 56 different schema compatibility rules that are being checked for JSON schema alone. It would not be feasible to try to code that up oneself.
It should be relatively simple to extend the code to add the Avro and Protobuf providers to get a ParsedSchema for those, if anyone wants to be fully generic.

Why use a schema registry

I just started working with Kafka, I use Protocol Buffers for the message format, and I just learned about the schema registry.
To give some context: we are a small team with a dozen web services, we use Kafka to communicate between them, and we store all the schemas and read/write models in a library that is later imported by each service. This way they know how to serialize/deserialize a message.
But now the schema registry comes into play. Why use it? My infrastructure becomes more complicated, plus I need to update it every time I change a schema, and I still need to define the read/write models in each service like I do now using the library.
So from my point of view I only see cons, mainly just complicating things, so why should I use a schema registry?
Thanks
The schema registry ensures your messages will not deviate from a common base compatibility guarantee (the first version of the schema).
For example, you have a schema that describes an event like {"first_name": "Jane", "last_name": "Doe"}, but then later decide that names can actually have more than 2 parts, so you move to a schema that can support {"name": "Jane P. Doe"}... You still need a way to deserialize old data with first_name and last_name fields while migrating to the new schema that has only name. Therefore, consumers will need both schemas. The registry will hold both versions, and the producer's serializer will encode the schema ID within each payload. After all, the initial events with the two name fields would know nothing about the "future" schema with only name.
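To make that concrete, here is roughly what the consumer side looks like with the Confluent KafkaAvroDeserializer, which fetches whichever writer schema is referenced by the id embedded in each record; the broker/registry URLs, topic and group id are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // placeholder broker
        props.put("group.id", "name-migration-example");             // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Reads the schema id from each record, fetches that exact writer schema
        // from the registry, and decodes the payload with it.
        props.put("value.deserializer",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");   // placeholder registry

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users"));                     // placeholder topic
            for (ConsumerRecord<String, GenericRecord> record : consumer.poll(Duration.ofSeconds(1))) {
                GenericRecord user = record.value();
                // Old records still carry first_name/last_name; newer ones carry name.
                Object name = user.getSchema().getField("name") != null
                        ? user.get("name")
                        : user.get("first_name") + " " + user.get("last_name");
                System.out.println(name);
            }
        }
    }
}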
You say your models are shared in libraries across services. You probably then have some regression testing and release cycle to publish these between services? The registry will allow you to centralize that logic.

How can Confluent SchemaRegistry help ensure read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 defines a read (projection) schema v1,
and another consumer C2 defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by adding a new optional field to its schema. This is a backward-compatible change.
Messages encoded without this field will still be translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies the consumer C1's need by adding the new field to its schema. The field doesn't have to be required, as this is a forward-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forwards-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by adding a new field to its schema. Since it's not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema cannot be used, because it is configured with the FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution because each consumer must hold locally all versions of the write schema.
It's going to be a pain to sync all the consumers when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
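As a rough sketch of such a check using the Java registry client (the registry URL and the per-consumer subject name here are made up for illustration):

import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class ReaderSchemaGate {
    public static void main(String[] args) throws Exception {
        // Placeholder registry URL and dedicated per-consumer subject.
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        String subject = "orders-consumer-c1-read";
        client.updateCompatibility(subject, "BACKWARD_TRANSITIVE");

        // The candidate read schema this consumer is about to deploy,
        // loaded from a file path passed on the command line.
        AvroSchema readSchema = new AvroSchema(
                java.nio.file.Files.readString(java.nio.file.Path.of(args[0])));

        // Fail the pipeline if the new read schema breaks BACKWARD_TRANSITIVE compatibility
        // with everything already in the subject; otherwise record it as the latest version.
        if (!client.testCompatibility(subject, readSchema)) {
            System.err.println("read schema is not backward compatible with " + subject);
            System.exit(1);
        }
        client.register(subject, readSchema);
    }
}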
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.

using kafka with schema-registry, in app with multiple topics and SubjectNameStrategy

To begin with, I have found a way to do this, more or less. But it's really bad code, so I'm looking for suggestions on how to solve this better, if a better approach exists.
To lay out something to work with: assume you have an app which sends Avro to N topics and uses the schema registry. Assume (at first) that you don't want to use Avro unions, since they bring some issues along. N-1 topics are easy: one schema per topic. But then you have data that you need to send in order, which means one topic and a specified grouping key, yet these data don't share the same schema. To do that, you need to register multiple schemas for that topic in the schema registry, which implies the use of key.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy or similar. And here it becomes ugly.
But that setting is per schema registry instance, so you have to declare 2 (or more) schema registry instances, one for each SubjectNameStrategy key/value combination. This will work.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language-specific (which you would most probably like to do in 2021...), you cannot use RecordNameStrategy.
So if you cannot use RecordNameStrategy, and for some reason you still want to use Avro and the schema registry, IIUC you have no other choice than to use Avro unions at the top level with the default TopicNameStrategy, which is fine now, since you have a single unioned schema. But top-level unions weren't nice to me in the past, since the deserializer doesn't naturally know which type you would like to deserialize from the data. So theoretically a way out of this could be to use, say, the CloudEvents standard (or something similar), set the CloudEvents type attribute according to which type from the union was used to serialize the data, and then have a type->deserializer map to be able to pick the correct deserializer for the Avro-encoded data in a received CloudEvents message. This will work, and not only for Java.
So to wrap up, here are two generally described solutions to a very simple problem. But to be honest, they seem extremely complicated for a widely accepted solution (Avro/schema registry). I'd like to know if there is an easier way through this.
This is a common theme, particularly in CQRS-like systems in which commands may be ordered (e.g. create before update or delete). In these cases, using Kafka, it's often not desirable to publish the messages over multiple topics. You are correct that there are two solutions for sending messages with multiple schemas on the same topic: either a top-level union in the Avro schema, or multiple schemas per topic.
You say you don't want to use top-level unions in the schema, so I'll address the case of multiple schemas per topic. You are correct that this excludes the use of any subject naming strategy that includes only the topic name to define the subject. So TopicNameStrategy is out.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language-specific (which you would most probably like to do in 2021...), you cannot use RecordNameStrategy.
This is worthy of some clarification... In the Confluent way of things, the 'schema-registry aware Avro serializers' first register your writer schema in the registry against a subject name to obtain a schema id. They then prefix your Avro bytes with that schema id before publishing to Kafka. See the 'Confluent Wire Format' at https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format.
So the subject naming is a choice in the serializer library; the deserializer just resolves a schema by the id prefixing the Kafka message. The Confluent Java serializers make this subject naming configurable and define the strategies TopicNameStrategy, RecordNameStrategy and TopicRecordNameStrategy. See https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#subject-name-strategy. The three strategies are conventions for defining 'scopes' over which schemas will be tested for compatibility in the registry (per-topic, per-record, or a combination). You've identified that RecordNameStrategy fits your use case for having multiple Avro schemas per topic.
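For reference, on the Java side this choice is plain serializer configuration; a small sketch using the documented strategy class names (the registry URL is a placeholder):

import java.util.Properties;

public class SubjectStrategyConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("schema.registry.url", "http://localhost:8081");   // placeholder registry
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Subject = fully-qualified record name, so several record types
        // (and therefore several subjects) can live on the same topic.
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        // The key side is configured independently and can keep the default.
        props.put("key.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicNameStrategy");
        // ...then pass props to the producer as usual.
    }
}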
However, I think your concern about non-Java support for RecordNameStrategy can be set aside. In the serializer, the subject naming is free to be implemented however the serializer developer chooses. Having worked on this stuff in Java, Python, Go and NodeJS, I've experienced some variety in how third-party serializers work in this regard. Nevertheless, working non-Java libs do exist.
If all else fails, you can write your own 'schema-registry aware serializer' that registers the schema with your chosen subject name prior to encoding the confluent wire-format for Kafka. I've had happy outcomes from other tooling by keeping to one of the well-known confluent strategies, so I can recommend mimicking them.
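To make that last option concrete, the framing itself is tiny; a rough sketch, assuming you have already registered the schema and obtained its id by some other means:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class ConfluentWireFormat {
    // Confluent wire format: magic byte 0, then the 4-byte schema id (big-endian),
    // then the Avro-encoded payload.
    public static byte[] frame(int schemaId, byte[] avroBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0);                                               // magic byte
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array()); // schema id
        out.write(avroBytes);                                       // Avro body
        return out.toByteArray();
    }
}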

Confluent Platform: Schema Registry Subjects

I am working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used with the new Kafka 0.8.2 producer) uses the topic-key|value pattern for subjects (link), whereas KafkaAvroEncoder (for the old producer) uses the schema.getName()-value pattern (link).
The reason why one would have 2 different subjects per topic (one for the key, one for the value) is pretty simple:
say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    { "name": "line", "type": "string" },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          { "name": "host", "type": "string" },
          { "name": "...", "type": "string" }
        ]
      }
    }
  ]
}
A common use case would be that I want to partition entries by source, and thus would like to have two subjects associated with the topic (subjects are basically revisions of Avro schemas) - one for the key (which is SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
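For illustration, with the default topic-based subject naming that just means configuring the registry-aware serializer for both the key and the value; the broker/registry URLs and topic name below are placeholders:

import java.util.Properties;

public class LogEntryProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // placeholder broker
        props.put("schema.registry.url", "http://localhost:8081");   // placeholder registry
        // Both key (SourceInfo) and value (LogEntry) are Avro, so both go through the
        // registry-aware serializer.
        props.put("key.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // If the topic is named "logs", the key schema ends up under subject "logs-key"
        // and the value schema under "logs-value".
    }
}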
Note: any further information is just my personal thoughts, and maybe I just don't yet fully understand how this is supposed to work, so I might be wrong.
I actually like how KafkaAvroEncoder is implemented more than KafkaAvroSerializer. KafkaAvroEncoder does not in any way force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In that case KafkaAvroSerializer would try to update the topic-key and topic-value subjects, and 99% of the time this would break if compatibility is violated (and if you have multiple Avro schemas they are almost always different and incompatible with each other).
On the other side, KafkaAvroEncoder cares just about schema names and you may safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
This inconsistency is still unclear to me and I hope Confluent guys can explain this if they see this question/answer.
Hope that helps you