Kafka Avro serialization with schema evolution - apache-kafka

I am trying to build a Kafka pipeline that reads JSON input data into a Kafka topic.
I am using Avro serialization with the Schema Registry, as my schema changes on a regular basis.
As of now, I use GenericRecord to parse the schema.
But I recently learned that avro-tools can read a schema and generate Java classes, which can be used to create producer code.
I am confused about choosing between these two options.
Can you please suggest which one is better, given that my schema changes frequently?

avro-tools are available to read schema and generate java classes which can be used to create Producer Code
They generate SpecificRecord Avro classes, not producer code. But to answer the question: both will work.
The way I see it:
GenericRecord - Think of it as a HashMap<String, Object>. As a consumer, you need to know which fields to get. If, as a producer or schema creator, you are not able to ship your classes as a library to your consumers, this is essentially the best you can get. I believe you'll always be able to read the latest data, though, since all possible fields can be accessed with a get("fieldname") call. See example here
SpecificRecord (what avro-tools generates) - It is just a generated class with getter methods and builder objects / setter methods. Any consumer will be able to import your producer classes as a dependency, deserialize the message, and immediately know what fields are available. You are not guaranteed to get the latest schema here - you will be "downgraded" and limited to whatever schema was used to generate those classes.
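To make the contrast concrete, here is a minimal sketch of the two access styles (the single-field "User" schema is hypothetical; assumes org.apache.avro is on the classpath):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AccessStyles {
    public static void main(String[] args) {
        // Hypothetical schema with a single string field
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
          + "[{\"name\":\"name\",\"type\":\"string\"}]}");

        // GenericRecord: untyped, field access by name, like a Map
        GenericRecord generic = new GenericData.Record(schema);
        generic.put("name", "alice");
        Object name = generic.get("name"); // consumer must know the field name

        // A SpecificRecord generated by avro-tools would instead offer:
        //   User user = User.newBuilder().setName("alice").build();
        //   String typedName = user.getName(); // typed, compile-time checked
        System.out.println(name);
    }
}
```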
I generally use the avro-maven-plugin to create the classes, just as in this example
You could also use Avro reflection (ReflectData) to build an Avro schema from a Java class, rather than the other way around. Annotations such as @Union or @AvroDefault can be applied to fields to control the generated schema.
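A small sketch of the reflection route (the class and field names here are hypothetical; assumes org.apache.avro on the classpath):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.AvroDefault;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

public class ReflectExample {
    static class User {
        String name;
        @Nullable String email;           // becomes a union of null and string
        @AvroDefault("0") int loginCount; // gets a default value of 0
    }

    public static void main(String[] args) {
        // Derive the Avro schema from the Java class, not the other way around
        Schema schema = ReflectData.get().getSchema(User.class);
        System.out.println(schema.toString(true));
    }
}
```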
Further Reading about using the Confluent Schema Registry

Related

ksqlDB generates "connect.name" element in AVRO schema and immediately kicks it out again in new version

I'm defining a new stream with a new topic via Confluent Cloud ksqlDB GUI, automatically registering a new schema with no previous versions.
v1 of the schema looks as expected, but starts with an element
"connect.name": "my.namespace", which I already don't really understand. However, a new version of the schema is also generated immediately; it is v2 and lacks this element. How can this behaviour be explained? There are no connectors in place (which is usually the context in which I would expect a connect element), and this has been the case for all ksqlDB-related schemas as far as I can see.
Furthermore, I could observe that doc fields are evolved "out" of existing Avro schemas (in the context of ksqlDB-related schemas), which adds to the behaviour that does not meet expectations. Maybe it's related, because the Confluent Cloud schema registry kicks out non-required fields by default?
If there are no connectors running, then you are correct that connect.name should probably not be generated.
However, ksqlDB may use Connect Struct types and the associated AvroData class methods toConnectSchema/fromConnectSchema to convert between Kafka records and the internal ksqlDB Row datatypes that can be queried. These same classes are used by the AvroConverter in the Connect framework to read/write external systems, and they are what generate and register the schema you're seeing.

Difference Avro vs CloudEvents vs AsyncAPI | Best fit for schema evolution and naming conventions in Kafka

I am using confluent schema registry.
What is the difference between Avro vs CloudEvents vs AsyncAPI?
What is the best fit for schema evolution and naming conventions in Kafka?
Can we use different schema types for different topics based on requirements?
First of all, looking at the other answer, AsyncAPI is not a library.
CloudEvents is a specification for describing your event data
AsyncAPI is a specification for defining the API of an application that is part of an event-driven architecture. In simple words, it is like OpenAPI for REST.
They both can coexist https://www.asyncapi.com/blog/asyncapi-cloud-events/
Avro is a binary data format for serialization and deserialization, filling the same role as text formats like JSON or XML.
All of these can work together: your Avro payloads can be encapsulated in CloudEvents envelopes, and a number of those events can be listed in an AsyncAPI file that describes your application.
CloudEvents is a specification. It does not prescribe what form the data should take
Avro is a binary data format for serialization and deserialization
AsyncAPI comes closest to a library, in the sense that code can be generated from it and run
You could use AsyncAPI to send non-blocking Avro records (that get registered to the Schema Registry) that adhere to the CloudEvents specification...
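For illustration, a CloudEvents envelope carrying an Avro payload might look like this in the JSON event format (all attribute values here are hypothetical; binary payloads go in data_base64):

```json
{
  "specversion": "1.0",
  "type": "com.example.user.created",
  "source": "/my-kafka-app",
  "id": "a1b2c3d4",
  "datacontenttype": "application/avro",
  "dataschema": "http://my-registry/subjects/users-value/versions/1",
  "data_base64": "..."
}
```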
Can we use different schema types for different topics based on requirement ?
Are you having problems with this? That is how the Schema Registry works by default.
I'd recommend having a look at "AsyncAPI or CloudEvents? Both My Captain!" for a detailed explanation and example of how to use them together.
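As a sketch of why different schemas per topic work out of the box: with the registry's default TopicNameStrategy, every topic maps to its own subject, so each topic can carry its own schema type (topic names below are hypothetical):

```properties
# Default subject naming: <topic>-value / <topic>-key
#   topic "orders" -> subject "orders-value"
#   topic "users"  -> subject "users-value"
value.subject.name.strategy=io.confluent.kafka.serializers.subject.TopicNameStrategy
```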

Kafka Source Connector - How to pass the schema from String (json)

I've developed a custom source connector for an external REST service.
I get JSONs, convert them to org.apache.kafka.connect.data.Struct with manually defined schema (SchemaBuilder) and wrap all this to SourceRecord.
All of this is for one entity only, but there are a dozen of them.
My new goal is to make this connector universal and parametrize the schema. The idea is to get the schema as String (json) from configs or external files and pass it to SourceRecord, but it only accepts Schema objects.
Is there any simple/good ways to convert String/json to Schema or even pass String schema directly?
There is a JSON-to-Avro converter. However, if you are already building a Struct/Schema combination, you shouldn't need to do anything, as the Converter classes in Kafka Connect can handle the conversion for you.
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/
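In other words, the connector keeps emitting Struct + Schema, and the converter configured on the worker (or per connector) does the wire-format conversion. A hedged sketch of such a configuration (the registry URL is hypothetical):

```properties
# The Converter, not the connector, turns Struct/Schema into bytes on the topic
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```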

Random Avro data generator in Java/Scala

Is it possible to generate random Avro data for a specified schema using the org.apache.avro library?
I need to produce this data with Kafka.
I tried to find some kind of random data generator for tests; however, I have only stumbled upon tools for such data generation, or GenericRecord usage. The tools are not very suitable for me, as there is a specific file dependency (reading the file and so on), and GenericRecord instances should be generated one by one, as I understand it.
Are there any other solutions for Java/Scala?
UPDATE: I have found this class, but it does not seem to be accessible from org.apache.avro version 1.8.2
The reason you need to read a file, is that it matches a Schema, which defines the fields that need to be created, and of which types.
That is not a hard requirement, though; there is nothing preventing you from creating random Generic or Specific Records in code via Avro's SchemaBuilder class.
See this repo for an example that compiles an AVSC schema into a Java POJO (the schema could, again, be built with SchemaBuilder instead).
Even the class you linked to uses a schema file
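A minimal sketch of the SchemaBuilder route, with no schema file involved (the record and field names are hypothetical; assumes org.apache.avro on the classpath):

```java
import java.util.Random;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RandomRecords {
    public static void main(String[] args) {
        // Build the schema in code instead of reading an .avsc file
        Schema schema = SchemaBuilder.record("Measurement")
            .fields()
            .requiredString("sensorId")
            .requiredDouble("value")
            .endRecord();

        // Fill a GenericRecord with random values matching the schema
        Random rnd = new Random();
        GenericRecord record = new GenericData.Record(schema);
        record.put("sensorId", "sensor-" + rnd.nextInt(100));
        record.put("value", rnd.nextDouble());
        System.out.println(record);
    }
}
```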
So I personally would probably use Avro4s (https://github.com/sksamuel/avro4s) in conjunction with scalacheck's (https://www.scalacheck.org) Gen to model such tests.
You could use scalacheck to generate random instances of case classes and avro4s to convert them to generic records, extract their schema etc etc.
There's also avro-mocker https://github.com/speedment/avro-mocker though I don't know how easy it is to hook into the code.
I'd just use Podam http://mtedone.github.io/podam/ to generate POJOs and then just output them to Avro using Java Avro library https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Serializing

How to evolve Avro schema with Akka Persistence without sending schema or using a registry?

We are considering a serialization approach for our scala-based Akka Persistence app. We consider it likely that our persisted events will "evolve" over time, so we want to support schema evolution, and are considering Avro first.
We'd like to avoid including the full schema with every message. However, for the foreseeable future, this Akka Persistence app is the only app that will be serializing and deserializing these messages, so we don't see a need for a separate schema registry.
Checking the docs for avro and the various scala libs, I see ways to include the schema with messages, and also how to use it "schema-less" by using a schema registry, but what about the in-between case? What's the correct approach for going schema-less, but somehow including an identifier to be able to look up the correct schema (available in the local deployed codebase) for the deserialized object? Would I literally just create a schema that represents my case class, but with an additional "identifier" field for schema version, and then have some sort of in-memory map of identifier->schema at runtime?
Also, is the correct approach to have one serializer/deserializer class for each version of the schema, so it knows how to translate every version to/from the most recent version?
Finally, are there recommendations on how to unit-test schema evolutions? For instance, store a message in akka-persistence, then actually change the definition of the case class, and then kill the actor and make sure it properly evolves. (I don't see how to change the definition of the case class at runtime.)
After spending more time on this, here are the answers I came up with.
Using avro4s, you can use the default data output stream to include the schema with every serialized message. Or, you can use the binary output stream, which simply omits the schema when serializing each message. ('binary' is a bit of a misnomer here since all it does is omit the schema. In either case it is still an Array[Byte].)
Akka itself supplies a Serializer trait and a SerializerWithStringManifest trait, which automatically include an identifier (and, for the latter, a string manifest) alongside whatever you serialize.
So when you create your custom serializer, you can extend the appropriate trait, define your schema identifier, and use the binary output stream. When those techniques are combined, you'll successfully be using schema-less serialization while including a schema identifier.
One common technique is to "fingerprint" your schema - treat it as a string and then calculate its digest (MD5, SHA-256, whatever). If you construct an in-memory map of fingerprint to schema, that can serve as your application's in-memory schema registry.
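A minimal, standard-library-only sketch of such an in-memory fingerprint registry (the schema JSON below is a hypothetical placeholder; Avro itself also offers a SchemaNormalization class for canonical fingerprints):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class FingerprintRegistry {
    // Hex-encoded SHA-256 digest of the schema's JSON text
    static String fingerprint(String schemaJson) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // fingerprint -> schema text: the app's in-memory "schema registry"
        Map<String, String> registry = new HashMap<>();
        String v1 = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[]}";
        registry.put(fingerprint(v1), v1);

        // Deserialization side: resolve the writer schema by its fingerprint
        System.out.println(v1.equals(registry.get(fingerprint(v1))));
    }
}
```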
So then when deserializing, your incoming object will have the schema identifier of the schema that was used to serialize it (the "writer"). While deserializing, you should know the identifier of the schema to use to deserialize it (the "reader"). Avro4s supports a way for you to specify both using a builder pattern, so avro can translate the object from the old format to the new. That's how you support "schema evolution". Because of how that works, you don't need a separate serializer for each schema version. Your custom serializer will know how to evolve your objects, because that's the part that Avro gives you for free.
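Underneath avro4s this is plain Avro writer/reader schema resolution; a hedged sketch of that mechanism with the Java API (the schemas are hypothetical; assumes org.apache.avro on the classpath):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class Evolution {
    public static void main(String[] args) throws Exception {
        // v1 ("writer") schema: one string field
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
          + "[{\"name\":\"a\",\"type\":\"string\"}]}");
        // v2 ("reader") schema: adds an int field with a default
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
          + "[{\"name\":\"a\",\"type\":\"string\"},"
          + "{\"name\":\"b\",\"type\":\"int\",\"default\":0}]}");

        GenericRecord old = new GenericData.Record(writer);
        old.put("a", "x");

        // Serialize with the writer schema: schema-less binary encoding
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(old, enc);
        enc.flush();

        // Deserialize with both schemas: the new field "b" takes its default
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved =
            new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
        System.out.println(evolved.get("b"));
    }
}
```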
As for unit testing, your best bet is exploratory testing. Actually define multiple versions of a case class in your test, and multiple accompanying versions of its schema, and then explore how Avro works by writing tests that will evolve an object between different versions of that schema.
Unfortunately that won't be directly relevant to the code you are writing, because it's hard to simulate actually changing the code you are testing as you test it.
I developed a prototype that demonstrates several of these answers, and it's available on github. It uses avro, avro4s, and akka persistence. For this one, I demonstrated a changing codebase by actually changing it across commits - you'd check out commit #1, run the code, then move to commit #2, etc. It runs against cassandra so it will demonstrate replaying events that need to be evolved using new schema, all without using an external schema registry.