Confluent Schema Registry/Kafka Streams: prevent schema evolution - apache-kafka

Is there a way to configure Confluent Schema Registry and/or Kafka Streams to prevent schema evolution?
Motivation
We have multiple Kafka Streams jobs producing messages for the same topic. All of the jobs should send messages with the same schema, but due to misconfiguration of the jobs, it has happened that some of them send messages with fields missing. This has caused issues downstream and is something we want to prevent.
When this happens, we can see a schema evolution in the schema registry as expected.
Solution
We checked the documentation for Confluent Schema Registry and/or Kafka Streams, but couldn't find a way to prevent the schema evolution.
Hence, we are considering modifying the Kafka Streams jobs to read the schema from Confluent Schema Registry before sending any messages. Only if the registered schema matches the local schema of the messages would we send them.
Is this the right way to go or did we miss a better option?
Update: we found an article on Medium about validating the schema against the schema registry before sending.

It depends on which language and library you use and what kind of APIs they provide. If you are publishing generic records, you can read and parse an .avdl or .avsc file into the record type and build your event against it. If the event you are trying to build isn't compatible with that schema, you won't even be able to build it, so it cannot modify the existing schema. So in this case, simply store a static schema with your source code. With specific records it is more or less the same: you generate your Java/C# (or other language) classes from the schema, then simply new them up and publish. Does that make sense?
PS. I have worked with the C# libs for Kafka; other languages may not have that support, or may have better options.
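A serializer-side setting may also help here, in addition to keeping a static schema in the code. The sketch below only builds a producer configuration with the Confluent serializer's auto.register.schemas setting turned off. With that setting off, a job whose schema is not already registered should fail at serialization time instead of adding a new version. The class name, method name and registry URL are hypothetical placeholders, not taken from the question.

```java
import java.util.Properties;

// Hypothetical helper: producer settings that keep a job from registering
// a new schema version when its local schema differs from the registry.
class NoAutoRegisterConfig {

    static Properties producerProps(String registryUrl) {
        Properties props = new Properties();
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", registryUrl);
        // Only look schemas up in the registry; never register the local one.
        props.put("auto.register.schemas", "false");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder URL (assumption), replace with your registry endpoint.
        System.out.println(producerProps("http://localhost:8081"));
    }
}
```

The schema would then be registered once, out of band (for example from a build or deployment step), and every job would serialize against what is already there.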

Related

How does a kafka connect connector know which schema to use?

Let's say I have a bunch of different topics, each with its own JSON schema. In schema registry, I registered the schemas that exist for the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the schema registry. So, to my knowledge, I never indicated which registered schema a Kafka connector (e.g., JDBC sink) should use to deserialize a message from a certain topic.
Asking here as I can't seem to find anything online.
I am trying to decrease my kafka message size by removing overhead of having to specify the schema in each message, and using schema registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema ID directly into the bytes of each record. Connect (or consumers using the JSON Schema deserializer) reads that ID from the record and uses it to look up the schema in the registry.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
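As a rough illustration of that wire format, the sketch below reads the header of a Confluent-framed record: one magic byte (0) followed by a 4-byte big-endian schema ID, then the payload. The class and method names are made up for this example, and the sample bytes are invented.

```java
import java.nio.ByteBuffer;

// Hypothetical helper that extracts the schema ID from a framed record.
class WireFormatHeader {
    static final byte MAGIC_BYTE = 0x0;

    // Returns the schema ID stored in bytes 1..4 of the record.
    static int schemaId(byte[] record) {
        if (record.length < 5) {
            throw new IllegalArgumentException("record shorter than the 5-byte header");
        }
        ByteBuffer buf = ByteBuffer.wrap(record); // ByteBuffer is big-endian by default
        byte magic = buf.get();
        if (magic != MAGIC_BYTE) {
            throw new IllegalArgumentException("unknown magic byte: " + magic);
        }
        return buf.getInt();
    }

    public static void main(String[] args) {
        // Invented sample: header for schema ID 42, then a tiny JSON payload.
        byte[] record = {0, 0, 0, 0, 42, '{', '}'};
        System.out.println(schemaId(record)); // prints 42
    }
}
```

This is why the connector only needs the registry URL: each record carries the ID of the schema that wrote it.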
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as ZSTD.

What is the value of an Avro Schema Registry?

I have many microservices reading/writing Avro messages in Kafka.
Schemas are great. Avro is great. But is a schema registry really needed? It helps centralize Schemas, yes, but do the microservices really need to query the registry? I don't think so.
Each microservice has a copy of the schema, user.avsc, and an Avro-generated POJO: User extends SpecificRecord. I want a POJO of each Schema for easy manipulation in the code.
Write to Kafka:
byte [] value = user.toByteBuffer().array();
producer.send(new ProducerRecord<>(TOPIC, key, value));
Read from Kafka:
User user = User.fromByteBuffer(ByteBuffer.wrap(record.value()));
Schema Registry gives you a way for a broader set of applications and services to use the data, not just your Java-based microservices.
For example, your microservice streams data to a topic, and you want to send that data to Elasticsearch, or a database. If you've got the Schema Registry you literally hook up Kafka Connect to the topic and it now has the schema and can create the target mapping or table. Without a Schema Registry each consumer of the data has to find out some other way what the schema of the data is.
Taken the other way around too - your microservice wants to access data that's written into a Kafka topic from elsewhere (e.g. with Kafka Connect, or any other producer) - with the Schema Registry you can simply retrieve the schema. Without it you start coupling your microservice development to having to know about where the source data is being produced and its schema.
There's a good talk about this subject here: https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf
Do they need to? No, not really.
Should you save yourself some space on your topic by not sending the schema as part of each message, and by not requiring consumers to hold the schema in order to read anything? Yes, and that is what the AvroSerializer is doing for you: externalizing the schema to somewhere it can be consumed through a simple REST API.
The deserializer then needs to know how to fetch that schema. You can configure it with the specific.avro.reader=true property and let the AvroDeserializer return your generated class, rather than manually invoking fromByteBuffer yourself.
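A minimal sketch of the consumer settings that paragraph refers to, assuming the Confluent KafkaAvroDeserializer is on the classpath. Only the configuration is built here; the class name, method name and registry URL are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper: consumer settings for reading Avro via Schema Registry.
class SpecificReaderConfig {

    static Properties consumerProps(String registryUrl) {
        Properties props = new Properties();
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", registryUrl);
        // Return generated SpecificRecord classes (e.g. User) instead of GenericRecord.
        props.put("specific.avro.reader", "true");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder URL (assumption).
        System.out.println(consumerProps("http://localhost:8081"));
    }
}
```

With settings like these, record.value() would already be a User, so the fromByteBuffer call in the question goes away.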
Also, in larger orgs, shuffling around a single user.avsc file (even if version controlled) does nothing to stop that copy from going stale over time, and it doesn't handle evolution in a clean way.
One of the most important features of the Schema Registry is to manage the evolution of schemas. It provides the layer of compatibility checking. By setting an appropriate Compatibility Type you determine the allowed schema changes.
You can find all the available Compatibility Types here.
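As an illustration of setting a Compatibility Type, the sketch below builds (but does not send) a request against the registry's per-subject config endpoint, PUT /config/{subject}. The registry URL and subject name are placeholders, and the class name is hypothetical; it needs Java 11+ for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that prepares a compatibility-level update for one subject.
class CompatibilityRequest {

    // Compatibility types listed in the Schema Registry documentation.
    enum Compatibility {
        BACKWARD, BACKWARD_TRANSITIVE, FORWARD, FORWARD_TRANSITIVE, FULL, FULL_TRANSITIVE, NONE
    }

    static String body(Compatibility level) {
        return "{\"compatibility\": \"" + level.name() + "\"}";
    }

    static HttpRequest build(String registryUrl, String subject, Compatibility level) {
        return HttpRequest.newBuilder(URI.create(registryUrl + "/config/" + subject))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(level)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL and subject (assumptions).
        HttpRequest req = build("http://localhost:8081", "users-value", Compatibility.FULL_TRANSITIVE);
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending the request with java.net.http.HttpClient is left out on purpose, so the sketch runs without a registry.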

Single letter being prepended to JSON records from NIFI JSONRecordSetWriter

I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor, configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file after publishing look like well-formed JSON. However, when consuming the messages on the command line, I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader fails, of course, with the error here.
As I've tried different things the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream and before being published to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try. Does anyone have any ideas?
You have the "Schema Write Strategy" set to "Confluent Schema Reference", which tells the writer to write the Confluent schema ID reference at the beginning of the message content, so what you are seeing is most likely the bytes of that reference.
If you are using the Confluent Schema Registry, then this is correct behavior, and those values need to be there for the consuming side to determine which schema to use.
If you are not using the Confluent Schema Registry when consuming these messages, just choose one of the other Schema Write Strategies.
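To see why that reference can show up as a single letter, here is a small sketch. It assumes the reference is written as one zero byte followed by a 4-byte big-endian schema ID, and it assumes (this is a guess, not something stated in the question) that the schema IDs involved were 104, 102 and 121. A console consumer would typically not display the zero bytes, leaving only the last byte visible as a character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of how a schema reference prefix looks in a text console.
class SchemaReferencePrefix {

    // Builds: magic byte 0, 4-byte big-endian schema ID, then the JSON payload.
    static byte[] frame(int schemaId, String json) {
        byte[] payload = json.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(5 + payload.length)
                .put((byte) 0)
                .putInt(schemaId)
                .put(payload)
                .array();
    }

    // The last header byte, interpreted as a character.
    static char visiblePrefix(byte[] framed) {
        return (char) (framed[4] & 0xFF);
    }

    public static void main(String[] args) {
        // Assumed schema IDs; chosen because they map to the letters in the question.
        for (int id : new int[] {104, 102, 121}) {
            System.out.println(id + " -> " + visiblePrefix(frame(id, "{}")));
        }
    }
}
```

Under that assumption, the letter would change each time a new schema got registered and the writer started referencing a different ID, which matches the "h", "f", "y" sequence described.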

Confluent Schema Registry Avro Schema

Hey, I would like to use the Confluent Schema Registry with the Avro serializers. The documentation now basically says: do not use the same schema for multiple different topics.
Can anyone explain to me why?
I looked through the source code, and it basically stores the schema in a Kafka topic as follows: (topicname,magicbytes,version->key) (schema->value)
Therefore I don't see the problem with using the schema multiple times, except for redundancy?
I think you are referring to this comment in the documentation:
We recommend users use the new producer in org.apache.kafka.clients.producer.KafkaProducer. If you are using a version of Kafka older than 0.8.2.0, you can plug KafkaAvroEncoder into the old producer in kafka.javaapi.producer. However, there will be some limitations. You can only use KafkaAvroEncoder for serializing the value of the message and only send value of type Avro record. The Avro schema for the value will be registered under the subject recordName-value, where recordName is the name of the Avro record. Because of this, the same Avro record type shouldn’t be used in more than one topic.
First, the commenter above is correct -- this only refers to the old producer API pre-0.8.2. It's highly recommended that you use the new producer anyway as it is a much better implementation, doesn't depend on the whole core jar, and is the client which will be maintained going forward (there isn't a specific timeline yet, but the old producer will eventually be deprecated and then removed).
However, if you are using the old producer, this restriction only matters if the schemas for the two topics might evolve separately. Suppose you wrote two applications that write to different topics but use the same Avro record type; let's call it record. Both applications will register it, or look it up, under the subject record-value and get assigned version=1. This is all fine as long as the schema doesn't change. But let's say application A now needs to add a field. When it does so, the schema will be registered under subject record-value and get assigned version=2. This is fine for application A, but application B has either not been upgraded to handle this schema or, worse, the schema isn't even valid for application B. Either way, you lose the protection the schema registry normally gives you: some other application could now publish data of that format into the topic used by application B (it looks OK because record-value has that schema registered). Application B could then see data which it doesn't know how to handle, since it's not a schema it supports.
So the short version is that because with the old producer the subject has to be shared if you also use the same schema, you end up coupling the two applications and the schemas they must support. You can use the same schema across topics, but we suggest not doing so since it couples your applications (and their development, the teams developing them, etc).
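The coupling described above comes down to how the subject name is derived. The sketch below contrasts the record-based subject from the quoted documentation (recordName-value) with topic-based naming (topic-value), which is the default for serializers used with the new producer. The class, the method names and the topic names are invented for illustration.

```java
// Hypothetical illustration of subject naming; not the registry client's API.
class SubjectNames {

    // Old-producer encoder, per the quoted docs: "<recordName>-value".
    static String recordBasedSubject(String recordName) {
        return recordName + "-value";
    }

    // Topic-based naming: "<topic>-key" or "<topic>-value".
    static String topicBasedSubject(String topic, boolean isKey) {
        return topic + (isKey ? "-key" : "-value");
    }

    public static void main(String[] args) {
        // Two applications, two topics (invented names), same record type "record".
        String a = topicBasedSubject("topic-a", false);
        String b = topicBasedSubject("topic-b", false);
        System.out.println("topic-based: " + a + " vs " + b);

        // With record-based naming both applications land on the same subject.
        System.out.println("record-based: " + recordBasedSubject("record")
                + " vs " + recordBasedSubject("record"));
    }
}
```

With topic-based subjects the two applications get separate version histories, so a change made by application A is never registered under the subject that application B relies on.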

Avro compatibility when only documentation of some of the field changes

I have been using Avro as the data format for offline processing over Kafka in my application. I have a use case where the producer uses a schema which is almost the same as the one used by the consumer, except that the producer has some changes in the documentation of some fields. Would the consumer be able to consume such events without erroring out? I'm debugging an issue where numerous events are missing in the data pipeline and I am trying to figure out the root cause. I noticed this difference and want to understand whether it could cause an issue at all.
You should probably test to confirm, but documentation should not impact schema resolution as the schema should get normalized to the "canonical form":
https://avro.apache.org/docs/1.7.7/spec.html#Parsing+Canonical+Form+for+Schemas