Kafka Avro Schema evolution - apache-kafka

I am trying to learn more about the Avro schemas which we use for our Kafka topics and I am relatively new to this.
I was wondering whether there is a way to evolve schemas in a particular situation. We are updating our schema with new fields that can't be null or have default values, because these new fields are identifiers. The workaround is to create new topics, but is there a better way to evolve existing schemas?

There are four possible compatibility settings for a topic:
- Forward: a client expecting the old version of the schema can read the new version
- Backward: a client expecting the new version of the schema can read the old version
- Both: both of the above
- None: neither of the above
Consider that there will be times when some producers write both old and new data, and consumers expect either new or old data.
How would clients behave in your case?
- Adding a field is always forward compatible (old clients just drop the new field).
- It is backward compatible only if you specify a default value.
Also, this is only true if you plan to convert data to a specific schema (with the corresponding POCO, for example). If you just convert it to JSON and handle it with custom logic, a new client could process both schemas.
So I see two possible ways to keep writing to the same topic:
1. Set a default value. You may be misunderstanding default values: a default does not mean that a default value will be written. Quoting the Avro specification:
"A default value for this field, used when reading instances that lack this field (optional)"
For example, if you previously had a "name" field and want to add "surname", you can set the default for "surname" to "NC" (or empty), as you might have done in a database.
2. Set your compatibility to none (or forward) so that you can update your schema (by default, compatibility is backward). In this case, clients expecting the new schema won't be able to process old data. But it could fit your usage if you only process incoming data: change the compatibility, update all your producers (so that only new data arrives), then update the clients expecting the new schema. Remember to set the compatibility back to backward, or to whatever compatibility you really want.
I would go with option 1.
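To illustrate how a reader-side default behaves, here is a minimal sketch in plain Python. It only imitates Avro's schema resolution for flat records and is not the Avro library itself; the `resolve` helper and the `Person` schema are assumptions made up for this example.

```python
# Simplified imitation of Avro schema resolution for flat records.
# `resolve` is a hypothetical helper, not part of any Avro library.

READER_SCHEMA_V2 = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        # The default is only consulted when the written data lacks the field.
        {"name": "surname", "type": "string", "default": "NC"},
    ],
}


def resolve(written_record, reader_schema):
    """Project an already decoded record onto the reader schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written_record:
            result[name] = written_record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that exist in the data but not in the reader schema are dropped.
    return result


old_record = {"name": "Jane"}                    # written before "surname" existed
new_record = {"name": "Jane", "surname": "Doe"}  # written with the new schema

print(resolve(old_record, READER_SCHEMA_V2))
print(resolve(new_record, READER_SCHEMA_V2))
```

Note that the default never ends up in the topic: old records stay as they were written, and the value is filled in only when a reader with the new schema decodes them.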

Related

Why use a schema registry

I just started working with Kafka, I use Protocol Buffers for the message format, and I just learned about the schema registry.
To give some context: we are a small team with a dozen web services that communicate through Kafka. We store all the schemas and read/write models in a library that is imported by each service. This way they know how to serialize/deserialize a message.
But now the schema registry comes into play. Why use it? My infrastructure becomes more complicated, I need to update the registry every time I change a schema, and I still need to define the read/write models in each service, as I do now using the library.
So from my point of view I only see cons, mainly that it complicates things. Why should I use a schema registry?
Thanks
The schema registry ensures your messages will not deviate from a common base compatibility guarantee (the first version of the schema).
For example, you have a schema that describes an event like {"first_name": "Jane", "last_name": "Doe"}, but then later decide that names can actually have more than 2 parts, so you then move to a schema that can support {"name": "Jane P. Doe"}... You still need a way to deserialize old data with first_name and last_name fields to migrate to the new schema having only name. Therefore, consumers will need both schemas. The registry will hold that and encode the schema ID within each payload from the producer. After all, the initial events with the two name fields would know nothing about the "future" schema with only name.
You say your models are shared in libraries across services. You probably then have some regression testing and release cycle to publish these between services? The registry will allow you to centralize that logic.
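As a rough illustration of why a consumer needs both schemas, the sketch below uses a plain dictionary as a stand-in for the registry. The schema IDs and the `to_current` helper are hypothetical; a real consumer would fetch the writer schema from the registry by the ID carried in each message.

```python
# In-memory stand-in for a schema registry: schema ID -> field names.
# The IDs and the helper below are made up for illustration.
SCHEMAS_BY_ID = {
    1: ("first_name", "last_name"),  # original event schema
    2: ("name",),                    # later schema with a single name field
}


def to_current(schema_id, record):
    """Upgrade a decoded record to the newest shape, based on its writer schema."""
    fields = SCHEMAS_BY_ID[schema_id]
    if fields == ("first_name", "last_name"):
        # Old events know nothing about "name", so the consumer migrates them.
        return {"name": f"{record['first_name']} {record['last_name']}"}
    return {"name": record["name"]}


print(to_current(1, {"first_name": "Jane", "last_name": "Doe"}))
print(to_current(2, {"name": "Jane P. Doe"}))
```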

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record for any unique key (the user ID, for example) is an insert. Afterwards, any non-null values for the same key are updates (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates to a common schema that you want information for, and filter event types that you want to ignore.
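The table semantics described here can be sketched outside ksqlDB in a few lines of Python. This is only a model of the idea (upsert on non-null values, delete on tombstones, skip unrelated event types), not ksqlDB code; the `unify` and `materialize` helpers and the tuple layout of the events are assumptions.

```python
def unify(event_type, key, payload):
    """Map mixed event types to (key, value) changelog records, or None to skip."""
    if event_type in ("UserCreated", "UserUpdated"):
        return key, payload  # both become a plain upsert
    if event_type == "UserDeleted":
        return key, None     # null value acts as a tombstone
    return None              # e.g. AnotherRecordType is filtered out


def materialize(events):
    """Replay events in order and return the resulting table as a dict."""
    table = {}
    for event_type, key, payload in events:
        record = unify(event_type, key, payload)
        if record is None:
            continue
        k, value = record
        if value is None:
            table.pop(k, None)  # tombstone removes the row
        else:
            table[k] = value    # full replace of the row
    return table


events = [
    ("UserCreated", "u1", {"name": "Jane", "email": "jane@example.com"}),
    ("UserCreated", "u2", {"name": "Bob", "email": "bob@example.com"}),
    ("AnotherRecordType", "x9", {"ignored": True}),
    ("UserUpdated", "u1", {"name": "Jane", "email": "jane@new.example.com"}),
    ("UserDeleted", "u2", None),
]
print(materialize(events))
```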
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say there is a producer P that defines a write Avro schema v1, stored under the logical schema S, a consumer C1 that defines a read (projection) schema v1, and another consumer C2 that defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by the new optional field added to its schema. This is a backward-compatible change.
Messages encoded without this field will be still translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies the consumer C1's need by the new field added to its schema. The field doesn't have to be required as this is a forwards-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forwards-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by the new field added to its schema. Since it's not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema can not be used, because it is configured with FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution because each consumer must hold locally all versions of the write schema.
It's going to be a pain to sync all the consumers when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.
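The rule that a read schema must be checked against all versions of the write schema can be illustrated with a deliberately simplified sketch. It only looks at field names and defaults of flat records and ignores type promotion, aliases and nesting, so it is no substitute for the registry's own check; the helper names and example schemas are assumptions.

```python
def can_read(reader_schema, writer_schema):
    """True if every reader field is either written or has a default."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    return all(
        f["name"] in writer_fields or "default" in f
        for f in reader_schema["fields"]
    )


def backward_transitive_ok(reader_schema, writer_versions):
    """The reader must be able to read data from every writer version."""
    return all(can_read(reader_schema, w) for w in writer_versions)


WRITER_V1 = {"fields": [{"name": "id", "type": "string"}]}
WRITER_V2 = {"fields": [{"name": "id", "type": "string"},
                        {"name": "email", "type": "string"}]}

# Reader that requires "email": breaks on data written with v1.
READER_REQUIRED = {"fields": [{"name": "id", "type": "string"},
                              {"name": "email", "type": "string"}]}
# Reader that adds "email" as optional: works with v1 and v2.
READER_OPTIONAL = {"fields": [{"name": "id", "type": "string"},
                              {"name": "email", "type": ["null", "string"],
                               "default": None}]}

print(backward_transitive_ok(READER_REQUIRED, [WRITER_V1, WRITER_V2]))
print(backward_transitive_ok(READER_OPTIONAL, [WRITER_V1, WRITER_V2]))
```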

using kafka with schema-registry, in app with multiple topics and SubjectNameStrategy

To begin with, I have found a way to do this, more or less, but it's really bad code. So I'm looking for suggestions on how to solve this better, if a better approach exists.
To lay out something to work with: assume you have an app which sends Avro to n topics and uses Schema Registry. Assume (at first) that you don't want to use Avro unions, since they bring some issues along. N-1 topics are easy: one schema per topic. But then you have data that must be sent in order, which means one topic and a specified group key, but this data doesn't share one schema. To do that, you need to register multiple schemas for that topic in Schema Registry, which implies using key.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy or similar. And here it becomes ugly.
But that setting is per schema registry instance, so you have to declare 2 (or more) schema registry instances, one for each SubjectNameStrategy key/value combination. This will work.
But then, according to documentation, RecordNameStrategy is java-platform only (!), so if you would like to create service, which is not language specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
So if you cannot use RecordNameStrategy, and for some reason you still want to use Avro and Schema Registry, IIUC you have no other choice than to use Avro unions at the top level with the default TopicNameStrategy, which is fine now, since you have a single unioned schema. But top-level unions haven't been nice to me in the past, since the deserializer naturally doesn't know which type you would like to deserialize from the data. So, theoretically, a way out could be to use, say, the CloudEvents standard (or something similar): set the CloudEvents type attribute according to which type from the union was used to serialize the data, and keep a type->deserializer map to pick the correct deserializer for the Avro-encoded data in a received CloudEvents message. This will work, and not only for Java.
So to wrap up, here are 2 generally described solutions to a very simple problem. But to be honest, these seem extremely complicated for a widely accepted solution (Avro/Schema Registry). I'd like to know if there is an easier way through this.
This is a common theme, particularly in CQRS-like systems in which commands may be ordered (eg create before update or delete etc). In these cases, using Kafka, it's often not desirable to publish the messages over multiple topics. You are correct that there are two solutions for sending messages with multiple schemas on the same topic: either a top-level union in the avro schema, or multiple schemas per topic.
You say you don't want to use top-level unions in the schema, so I'll address the case of multiple schemas per topic. You are correct that this excludes the use of any subject naming strategy that includes only the topic name to define the subject. So TopicNameStrategy is out.
But then, according to documentation, RecordNameStrategy is java-platform only (!), so if you would like to create service, which is not language specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
This is worthy of some clarification.... In the confluent way of things, the 'schema-registry aware avro serializers' first register your writer schema in the registry against a subject name to obtain a schema id. They then prefix your avro bytes with that schema id before publishing to kafka. See the 'Confluent Wire Format' at https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format.
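For reference, the framing used for Avro in that wire format (a zero magic byte, then the schema ID as a 4-byte big-endian integer, then the Avro bytes) can be sketched as follows. The `frame` and `unframe` names are made up here, and the sketch leaves out the registry call that would produce the schema ID.

```python
import struct

MAGIC_BYTE = 0   # first byte of every framed message
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def frame(schema_id, avro_payload):
    """Prefix already encoded Avro bytes with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message too short for the wire-format header")
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER_SIZE:]


framed = frame(42, b"avro-bytes")
print(framed[:HEADER_SIZE].hex())
print(unframe(framed))
```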
So the subject naming is a choice in the serializer library; the deserializer just resolves a schema by the id prefixing the kafka message. The confluent Java serializers make this subject naming configurable and define strategies TopicNameStrategy, RecordNameStrategy and TopicRecordNameStrategy. See https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#subject-name-strategy. The three strategies are conventions for defining 'scopes' over which schemas will be tested for compatibility in the registry (per-topic, per-record, or a combination). You've identified RecordNameStrategy fits your use case for having multiple avro schemas per topic.
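Since subject naming is just a convention applied by the serializer, the three strategies boil down to simple string rules that any language can reproduce. A minimal sketch, with hypothetical function names and an example topic and record name chosen only for illustration:

```python
def topic_name_strategy(topic, is_key=False):
    """Subject is derived from the topic only: '<topic>-key' or '<topic>-value'."""
    return f"{topic}-{'key' if is_key else 'value'}"


def record_name_strategy(record_full_name):
    """Subject is the fully qualified record name, independent of the topic."""
    return record_full_name


def topic_record_name_strategy(topic, record_full_name):
    """Subject combines both: '<topic>-<fully qualified record name>'."""
    return f"{topic}-{record_full_name}"


print(topic_name_strategy("users"))
print(record_name_strategy("com.example.UserCreated"))
print(topic_record_name_strategy("users", "com.example.UserCreated"))
```

With the first rule every record type on a topic shares one subject, which is why it cannot hold several unrelated schemas; the other two give each record type its own subject.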
However, I think your concern about non-Java support for RecordNameStrategy can be set aside. In the serializer, the subject naming is free to be implemented however the serializer developer chooses. Having worked on this stuff in Java, Python, Go and NodeJS, I've experienced some variety in how third-party serializers work in this regard. Nevertheless, working non-Java libs do exist.
If all else fails, you can write your own 'schema-registry aware serializer' that registers the schema with your chosen subject name prior to encoding the confluent wire-format for Kafka. I've had happy outcomes from other tooling by keeping to one of the well-known confluent strategies, so I can recommend mimicking them.

Backwards compatibility of enum fields

I have an enum field with no default value:
{
"name": "FavouriteIceCream",
"type": "enum",
"symbols": [
"Vanilla",
"Strawberry",
"Chocolate"
]
}
The topic has its compatibility mode set to BACKWARD. If I remove one of the symbols, the Schema Registry API still reports the schema as compatible.
Is this correct? How would it parse a record with the field set to the now removed symbol?
Backward Compatibility
BACKWARD compatibility means that consumers using the new schema can read data produced with the last schema. For example, if there are three schemas for a subject that change in order X-2, X-1, and X then BACKWARD compatibility ensures that consumers using the new schema X can process data written by producers using schema X or X-1, but not necessarily X-2. If the consumer using the new schema needs to be able to process data written by all registered schemas, not just the last two schemas, then use BACKWARD_TRANSITIVE instead of BACKWARD. For example, if there are three schemas for a subject that change in order X-2, X-1, and X then BACKWARD_TRANSITIVE compatibility ensures that consumers using the new schema X can process data written by producers using schema X, X-1, or X-2.
BACKWARD: consumer using schema X can process data produced with schema X or X-1
BACKWARD_TRANSITIVE: consumer using schema X can process data produced with schema X, X-1, or X-2
An example of a backward compatible change is a removal of a field. A consumer that was developed to process events without this field will be able to process events written with the old schema and contain the field – the consumer will just ignore that field.
https://docs.confluent.io/platform/current/schema-registry/avro.html
The Avro specification states:
if the writer's symbol is not present in the reader's enum and the reader has a default value, then that value is used, otherwise an error is signalled.
Your schema does not define a default value for the enum. If you remove one of the symbols, then the reader cannot parse a field set to the now removed symbol.
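The quoted rule can be expressed as a small function. This is a sketch of the resolution rule only, operating on schema dictionaries rather than on encoded Avro data; `resolve_enum` is a hypothetical helper, and the "Unknown" symbol in the second reader schema is an assumption added to show the default case.

```python
def resolve_enum(writer_symbol, reader_enum):
    """Apply the Avro enum resolution rule for one written symbol."""
    if writer_symbol in reader_enum["symbols"]:
        return writer_symbol
    if "default" in reader_enum:
        return reader_enum["default"]
    raise ValueError(f"symbol {writer_symbol!r} not in reader enum and no default")


# Reader schema after "Chocolate" was removed, without a default.
READER_NO_DEFAULT = {
    "name": "FavouriteIceCream",
    "type": "enum",
    "symbols": ["Vanilla", "Strawberry"],
}

# Same change, but with a fallback symbol declared as the enum default.
READER_WITH_DEFAULT = {
    "name": "FavouriteIceCream",
    "type": "enum",
    "symbols": ["Unknown", "Vanilla", "Strawberry"],
    "default": "Unknown",
}

print(resolve_enum("Vanilla", READER_NO_DEFAULT))
print(resolve_enum("Chocolate", READER_WITH_DEFAULT))
try:
    resolve_enum("Chocolate", READER_NO_DEFAULT)
except ValueError as err:
    print("error:", err)
```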