Avro compatibility when only the documentation of some fields changes - apache-kafka

I have been using Avro as the data format for offline processing over Kafka in my application. I have a use case where the producer uses a schema that is almost the same as the consumer's, except that the producer has changed the documentation of some fields. Would the consumer be able to consume such events without erroring out? I'm debugging an issue where numerous events are missing from the data pipeline and trying to figure out the root cause. I noticed this difference and want to understand whether it could cause an issue.

You should probably test to confirm, but documentation should not impact schema resolution, since the schema gets normalized to its Parsing Canonical Form, which drops doc attributes:
https://avro.apache.org/docs/1.7.7/spec.html#Parsing+Canonical+Form+for+Schemas
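For example, a quick way to verify this (a minimal sketch, assuming the Avro Java library; the record and field names are made up) is to normalize both schemas to Parsing Canonical Form and compare the results:
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class DocOnlyDiffCheck {
    public static void main(String[] args) {
        Schema producerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"string\",\"doc\":\"new wording\"}]}");
        Schema consumerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"string\",\"doc\":\"old wording\"}]}");

        // Parsing Canonical Form drops doc and other attributes irrelevant to schema resolution.
        String p = SchemaNormalization.toParsingForm(producerSchema);
        String c = SchemaNormalization.toParsingForm(consumerSchema);
        System.out.println(p.equals(c)); // expected: true
    }
}
If the two canonical forms are equal, schema resolution treats the producer and consumer schemas as identical, so a doc-only change by itself should not explain the missing events.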

Related

Confluent Schema Registry/Kafka Streams: prevent schema evolution

Is there a way to configure Confluent Schema Registry and/or Kafka Streams to prevent schema evolution?
Motivation
We have multiple Kafka Streams jobs producing messages for the same topic. All of the jobs should send messages with the same schema, but due to misconfiguration, some of them have sent messages with missing fields. This has caused issues downstream and is something we want to prevent.
When this happens, we can see a schema evolution in the schema registry as expected.
Solution
We checked the documentation for Confluent Schema Registry and/or Kafka Streams, but couldn't find a way to prevent the schema evolution.
Hence, we are considering modifying the Kafka Streams jobs to read the schema from Confluent Schema Registry before sending. Only if the retrieved schema matches the local schema of the messages do we send them.
Is this the right way to go or did we miss a better option?
Update: we found an article on Medium about validating the schema against the schema registry before sending.
It depends on which language and library you use and what kind of APIs they provide. If you are publishing generic records, you can read and parse an .avdl or .avsc file into the record type and build your event from it. That means that if the event you are trying to build is not compatible with the current schema, you won't even be able to construct it, and hence won't be able to change the existing schema. So in this case, simply store a static schema alongside your source code. With specific records it is more or less the same: you generate your Java/C# (or other language) classes from the schema, instantiate them, and publish. Does that make sense?
PS. I worked with the C# libs for Kafka; maybe other languages don't have that support, or have other, better options.
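As a rough illustration of the pre-flight check suggested in the question (a sketch, not a drop-in solution: the registry URL and the subject name orders-value are placeholders, and it assumes the Confluent Java schema-registry client), the job could fetch the latest registered schema and only produce if it matches the schema it was built with:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import org.apache.avro.Schema;

public class SchemaGuard {
    // Returns true if the locally compiled schema equals the latest schema in the registry.
    public static boolean matchesRegistry(Schema localSchema) throws Exception {
        CachedSchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        SchemaMetadata latest = client.getLatestSchemaMetadata("orders-value");
        Schema registered = new Schema.Parser().parse(latest.getSchema());
        return registered.equals(localSchema);
    }
}
Note that this narrows the window but is not atomic: a misconfigured job could still register a new version between the check and the send.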

KStream.through() not creating intermediate topic, and impacting other processor nodes in topology

I am trying out the sample Kafka Streams code from Chapter 4 of the book Kafka Streams in Action. I pretty much copied the code on GitHub - https://github.com/bbejeck/kafka-streams-in-action/blob/master/src/main/java/bbejeck/chapter_4/ZMartKafkaStreamsAddStateApp.java This is an example using a StateStore. When I run the code as is, no data flows through the topology. I verified that mock data is being generated, as I can see the offsets in the input topic, transactions, go up. However, there is nothing in the output topics, and nothing is printed to the console.
However, when I comment out lines 81-88 (https://github.com/bbejeck/kafka-streams-in-action/blob/master/src/main/java/bbejeck/chapter_4/ZMartKafkaStreamsAddStateApp.java#L81-L88), basically avoiding the creation of the "through()" processor node, the code works. I see data being generated to the "patterns" topics and output in the console.
I am using Kafka broker and client version 2.4. Would appreciate any help or pointers to debug the issue.
Thank you,
Ahmed.
It is well documented that you need to create the intermediate topic that you use via through() manually and upfront, before you start your application. Intermediate topics, similar to input and output topics, are not managed by Kafka Streams; it's the user's responsibility to manage them.
Cf: https://docs.confluent.io/current/streams/developer-guide/manage-topics.html
Btw: there is work in progress to add a new repartition() operator that allows you to repartition via a topic that will be managed by Kafka Streams (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+DSL+with+Connecting+Topic+Creation+and+Repartition+Hint)
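For completeness, a minimal sketch of creating such an intermediate topic upfront with the AdminClient (the topic name, partition count, and replication factor here are placeholders and must match what your topology expects, in particular the partition counts of co-partitioned topics):
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateIntermediateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Create the through() topic before the Streams application starts.
            NewTopic intermediate = new NewTopic("patterns", 1, (short) 1);
            admin.createTopics(Collections.singleton(intermediate)).all().get();
        }
    }
}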

kafka topic name change without changing consumers

We are planning to remove versions from Kafka topic names. Currently, the schema version of the relevant message forms part of the topic name. But in the future we will have a large number of small variations of the message, and we don't want to create too many topics. However, there are already many consumers for these topics, and we don't want all of them to have to change (they should still be able to consume from the topics with version numbers). How can this be achieved? Are there any tools (e.g. Avro) which can help achieve this? Has anyone experienced a similar problem? And a second question: how can a consumer differentiate messages with small changes in structure arriving on the same topic?
If I understood correctly, I would suggest the following.
Regarding the first question:
// Subscribe by pattern so both the old versioned topic names and the new version-less ones match
Pattern pattern = Pattern.compile("topic_name_without_schema_version.*");
kafkaConsumer.subscribe(pattern);
Regarding the second one: what kind of differentiation would you like to achieve? If you just want to know whether the current message is incompatible with the latest Avro schema, you can try to decode the message and catch the exception (if there is one), or you can generate an Avro schema from the current message and check the two schemas for equality.
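A rough sketch of the "try to decode and catch" idea (assuming plain Avro binary payloads without a Confluent wire-format prefix; binary Avro is not self-describing, so this is a best-effort check):
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class MessageProbe {
    public static boolean decodesWith(byte[] payload, Schema expected) {
        try {
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(expected);
            reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
            return true;  // payload decoded cleanly against the expected schema
        } catch (Exception e) {
            return false; // structure does not match the expected schema
        }
    }
}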

Single letter being prepended to JSON records from NIFI JSONRecordSetWriter

I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file look like well-formed JSON after being published. However, when consuming the messages on the command line I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader of course produces the error seen here.
As I've tried different things, the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream, before publishing to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try, does anyone have any ideas?
Since you have the "Schema Write Strategy" set to "Confluent Schema Reference", this is telling the writer to write the Confluent schema id reference at the beginning of the content of the message, so what you are seeing is likely the bytes of that.
If you are using the confluent schema registry then this is correct behavior and those values need to be there for the consuming side to determine what schema to use.
If you are not using confluent schema registry when consuming these messages, just choose one of the other Schema Write Strategies.
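For reference, that prefix is one magic byte (0) followed by a 4-byte big-endian schema id, with the record body after it; the "letter" on the console is just those bytes rendered as text. A small sketch (plain Java, no extra dependencies) that pulls the schema id out of such a message:
import java.nio.ByteBuffer;

public class ConfluentWireFormat {
    // Reads the 5-byte Confluent prefix: magic byte 0x0 plus a 4-byte big-endian schema id.
    public static int schemaId(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        byte magic = buf.get();
        if (magic != 0) {
            throw new IllegalArgumentException("No Confluent schema reference, magic byte = " + magic);
        }
        return buf.getInt();
    }
}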

Confluent Schema Registry Avro Schema

Hey, I would like to use the Confluent Schema Registry with the Avro serializers. The documentation basically says: do not use the same schema for multiple different topics.
Can anyone explain to me why?
I researched the source code, and it basically stores the schema in a Kafka topic as follows: (topicname, magicbytes, version -> key), (schema -> value).
Therefore I don't see the problem with using the schema multiple times, except for redundancy?
I think you are referring to this comment in the documentation:
We recommend users use the new producer in org.apache.kafka.clients.producer.KafkaProducer. If you are using a version of Kafka older than 0.8.2.0, you can plug KafkaAvroEncoder into the old producer in kafka.javaapi.producer. However, there will be some limitations. You can only use KafkaAvroEncoder for serializing the value of the message and only send value of type Avro record. The Avro schema for the value will be registered under the subject recordName-value, where recordName is the name of the Avro record. Because of this, the same Avro record type shouldn’t be used in more than one topic.
First, the commenter above is correct -- this only refers to the old producer API pre-0.8.2. It's highly recommended that you use the new producer anyway as it is a much better implementation, doesn't depend on the whole core jar, and is the client which will be maintained going forward (there isn't a specific timeline yet, but the old producer will eventually be deprecated and then removed).
However, if you are using the old producer, this restriction is only required if the schema for the two subjects might evolve separately. Suppose you wrote two applications that write to different topics but use the same Avro record type, let's call it record. Now both applications will register it/look it up under the subject record-value and get assigned version=1. This is all fine as long as the schema doesn't change. But let's say application A now needs to add a field. When it does so, the schema will be registered under subject record-value and get assigned version=2. This is fine for application A, but application B has either not been upgraded to handle this schema, or worse, the schema isn't even valid for application B. However, you lose the protection the schema registry normally gives you: now some other application could publish data of that format into the topic used by application B (it looks OK because record-value has that schema registered). Now application B could see data which it doesn't know how to handle, since it's not a schema it supports.
So the short version is that because with the old producer the subject has to be shared if you also use the same schema, you end up coupling the two applications and the schemas they must support. You can use the same schema across topics, but we suggest not doing so since it couples your applications (and their development, the teams developing them, etc).
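To make the coupling concrete, a small illustrative sketch of how the subject name is derived in the two cases (the names are examples; the new serializer's default strategy derives the subject from the topic):
import org.apache.avro.Schema;

public class SubjectNaming {
    // Old KafkaAvroEncoder: subject comes from the record name, so every topic
    // carrying this record type shares one subject and one version history.
    static String oldEncoderSubject(Schema recordSchema) {
        return recordSchema.getName() + "-value"; // e.g. "record-value"
    }

    // New KafkaAvroSerializer (default topic name strategy): subject comes from the topic,
    // so each topic's schema can evolve independently even if the record type is reused.
    static String newSerializerSubject(String topic) {
        return topic + "-value"; // e.g. "topicA-value", "topicB-value"
    }
}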