How to define multiple serializers in kafka? - apache-kafka

Say I publish and consume different types of Java objects. For each one I have to define my own serializer implementation.
How can we provide all of these implementations in the Kafka consumer/producer properties file under the "serializer.class" property?

We have a similar setup with different objects in different topics, but always the same object type within one topic. We use the ByteArrayDeserializer that comes with the Java API 0.9.0.1, which means our message consumers only ever get a byte[] as the value part of the message (we consistently use String for the keys). The first thing the topic-specific message consumer does is call the right deserializer to convert the byte[]. You could use an Apache Commons helper class for that. Simple enough.
If you prefer to let the KafkaConsumer do the deserialization for you, you can of course write your own Deserializer. The deserialize method you need to implement has the topic as the first argument. Use it as a key into a map that provides the necessary deserializer and off you go. My hunch is that in most cases you will just do a normal Java deserialization anyway.
The downside of the second approach is that you need a common superclass for all your message objects in order to parameterize ConsumerRecord<K,V> properly. With the first approach it is ConsumerRecord<String, byte[]> anyway, but then you convert the byte[] to the object you need exactly at the right place and need only one cast right there.
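To make the second approach concrete, here is a minimal sketch of such a dispatching Deserializer, written against a recent kafka-clients API where only deserialize() needs implementing, and using Apache Commons Lang's SerializationUtils as one example of the helper-class route. The topic names and payload formats are made up for illustration:

import org.apache.commons.lang3.SerializationUtils;
import org.apache.kafka.common.serialization.Deserializer;

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Function;

// A value Deserializer that dispatches on the topic name.
// The topic names and payload formats below are illustrative only.
public class PerTopicDeserializer implements Deserializer<Object> {

    private final Map<String, Function<byte[], Object>> byTopic = Map.of(
            // topic carrying plain UTF-8 text
            "text-events", bytes -> new String(bytes, StandardCharsets.UTF_8),
            // topic carrying Java-serialized beans, decoded with Apache Commons Lang
            "java-objects", bytes -> SerializationUtils.deserialize(bytes)
    );

    @Override
    public Object deserialize(String topic, byte[] data) {
        Function<byte[], Object> convert = byTopic.get(topic);
        if (convert == null) {
            throw new IllegalArgumentException("No deserializer registered for topic " + topic);
        }
        return convert.apply(data);
    }
}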

One option is Avro. Avro lets you define record types that you can then easily serialize and deserialize.
Here's an example schema adapted from the documentation:
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "default": null, "type": ["null","int"]},
{"name": "favorite_color", "default": null, "type": ["null","string"]}
]
}
Avro distinguishes between so-called SpecificData and GenericData. With SpecificData readers and writers, you can easily serialize and deserialize known Java objects. The downside is that SpecificData requires compile-time knowledge of the class-to-schema mapping.
On the other hand, GenericData readers and writers let you deal with record types you didn't know about at compile time. While obviously very powerful, this can get kind of clumsy -- you will have to invest time coding around the rough edges.
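As a small illustration (plain Avro, no Kafka involved), here is a minimal GenericData round trip for the User schema above; the field values are made up:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class GenericAvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Parse the User schema at runtime -- no generated classes needed.
        String schemaJson =
              "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\", \"fields\": ["
            + "{\"name\": \"name\", \"type\": \"string\"},"
            + "{\"name\": \"favorite_number\", \"default\": null, \"type\": [\"null\", \"int\"]},"
            + "{\"name\": \"favorite_color\", \"default\": null, \"type\": [\"null\", \"string\"]}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alyssa");
        user.put("favorite_number", 256);

        // Serialize to bytes (this is what you would hand to a ByteArraySerializer).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize again using the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(back.get("name") + " likes " + back.get("favorite_number"));
    }
}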
There are other options out there -- Thrift comes to mind -- but from what I understand, one of the major differences is Avro's ability to work with GenericData.
Another benefit is multi-language compatibility. Avro, I know, has native support for a lot of languages on a lot of platforms. The other options do too, I am sure -- probably any off-the-shelf option is going to be better than rolling your own in terms of multi-language support; it's just a matter of degree.

Related

using kafka with schema-registry, in app with multiple topics and SubjectNameStrategy

To begin with, I have found a way to do this, more or less, but it's really bad code. So I'm looking for suggestions on how to solve this better, if a better approach exists.
To lay out something to work with: assume you have an app which sends Avro to N topics and uses the schema registry. Assume (at first) that you don't want to use Avro unions, since they bring some issues along. N-1 topics are easy, with one schema per topic. But then you have data that you need to send in order, which means one topic and a specified group key, yet these data don't share the same schema. So to do that, you need to register multiple schemas for that topic in the schema registry, which implies using key.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy or similar. And here it becomes ugly.
But that setting is per schema registry instance, so you have to declare two (or more) schema registry instances, one for each SubjectNameStrategy key/value combination. This will work.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language specific (which you most probably would in 2021...), you cannot use RecordNameStrategy.
So if you cannot use RecordNameStrategy, and for some reason you still want to use Avro and the schema registry, IIUC you have no choice other than to use Avro unions at the top level and use the default TopicNameStrategy, which is fine now, since you have a single unioned schema. But top-level unions weren't nice to me in the past, since the deserializer doesn't naturally know which type you would like to deserialize from the data. So theoretically a way out of this could be using, say, the CloudEvents standard (or something similar), setting the CloudEvents type attribute according to which type from the union was used to serialize the data, and then having a type->deserializer map to be able to pick the correct deserializer for the Avro-encoded data in the received CloudEvents message. This will work, and not only for Java.
So to wrap up, these are two generally described solutions to a very simple problem. But to be honest, they seem extremely complicated for such a widely adopted stack (Avro/Schema Registry). I'd like to know if there is an easier way through this.
This is a common theme, particularly in CQRS-like systems in which commands may be ordered (e.g. create before update or delete, etc.). In these cases, using Kafka, it's often not desirable to publish the messages over multiple topics. You are correct that there are two solutions for sending messages with multiple schemas on the same topic: either a top-level union in the Avro schema, or multiple schemas per topic.
You say you don't want to use top-level unions in the schema, so I'll address the case of multiple schemas per topic. You are correct that this excludes the use of any subject naming strategy that includes only the topic name to define the subject. So TopicNameStrategy is out.
But then, according to the documentation, RecordNameStrategy is Java-platform only (!), so if you would like to create a service which is not language specific (which you most probably would in 2021...), you cannot use RecordNameStrategy.
This is worthy of some clarification.... In the confluent way of things, the 'schema-registry aware avro serializers' first register your writer schema in the registry against a subject name to obtain a schema id. They then prefix your avro bytes with that schema id before publishing to kafka. See the 'Confluent Wire Format' at https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format.
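As a small illustration of that wire format (my own sketch, not from the linked docs): the value starts with a magic byte of 0 followed by a 4-byte big-endian schema id, so a helper can peek at the id like this:

import java.nio.ByteBuffer;

// Confluent wire format: 1 magic byte (0), a 4-byte big-endian schema id, then the Avro payload.
public final class WireFormat {
    public static int schemaId(byte[] kafkaValue) {
        ByteBuffer buf = ByteBuffer.wrap(kafkaValue);
        byte magic = buf.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Not Confluent wire format, magic byte = " + magic);
        }
        return buf.getInt();  // the deserializer fetches the writer schema from the registry by this id
    }
}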
So the subject naming is a choice in the serializer library; the deserializer just resolves a schema by the id prefixing the kafka message. The confluent Java serializers make this subject naming configurable and define strategies TopicNameStrategy, RecordNameStrategy and TopicRecordNameStrategy. See https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#subject-name-strategy. The three strategies are conventions for defining 'scopes' over which schemas will be tested for compatibility in the registry (per-topic, per-record, or a combination). You've identified RecordNameStrategy fits your use case for having multiple avro schemas per topic.
However, I think your concern about non-Java support for RecordNameStrategy can be set aside. In the serializer, the subject naming is free to be implemented however the serializer developer chooses. Having worked on this stuff in Java, Python, Go and NodeJS, I've experienced some variety in how third-party serializers work in this regard. Nevertheless, working non-Java libs do exist.
If all else fails, you can write your own 'schema-registry aware serializer' that registers the schema with your chosen subject name prior to encoding the confluent wire-format for Kafka. I've had happy outcomes from other tooling by keeping to one of the well-known confluent strategies, so I can recommend mimicking them.
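For reference, here is a minimal sketch of configuring the Confluent Java Avro serializer with RecordNameStrategy for values; the broker and registry addresses are placeholders:

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class RecordNameStrategyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder registry
        // Register value schemas under the record's fully-qualified name instead of <topic>-value,
        // so several record types can share one topic.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        // ... send SpecificRecord or GenericRecord values of different types to the same topic
        producer.close();
    }
}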

How do you handle nested source data with AVRO serialization in Apache Kafka?

My goal is to grab JSON data from an HTTP source and store it in a Kafka topic using AVRO serialization.
Using Kafka Connect and an HTTP source connector along with a bunch of SMTs, I managed to create a Connect data structure that looks like this when written to the topic with the StringConverter:
Struct{base=stations,cod=200,coord=Struct{lat=54.0,lon=9.0},dt=1632150605}
Thus the JSON was successfully parsed into STRUCTs and I can manipulate individual elements using SMTs. Next, I created a new subject with the corresponding schema inside the Confluent Schema Registry and switched the connector's value converter over to the Confluent AVRO Converter with "value.converter": "io.confluent.connect.avro.AvroConverter".
Instead of the expected serialization I got an error message saying:
org.apache.kafka.common.errors.SerializationException: Error serializing Avro message
Caused by: org.apache.avro.SchemaParseException: Can't redefine: io.confluent.connect.avro.ConnectDefault
As soon as I remove the nested STRUCT with ReplaceField or simplify the structure with Flatten, the AVRO serialization works like a charm. So it looks like the converter cannot handle nested structures.
What is the right way to go when you have nested elements and want them to be serialized as such rather than storing the JSON as a String and trying to deal with object creation in the consumer or beyond? Is this possible in Kafka Connect?
The creation of STRUCT elements from a JSON String can be achieved by different means. Originally, the SMT ExpandJson was used for its simplicity. It does not create sufficiently named STRUCTs, however, as it doesn't have a schema to work off of. And that is what caused the initial error message: the AVRO serializer uses the generic class io.confluent.connect.avro.ConnectDefault for those STRUCTs, and if there is more than one, the ambiguity throws an exception.
Another SMT doing seemingly the same thing is Json Schema, which has a documented FromJson conversion. It does accept a schema and thus gets around ExpandJson's problem of parsing nested elements as a generic type. What is being accepted is a JSON Schema, though, and the mapping to AVRO fullnames works by taking the word "properties" as the namespace and copying the field name. In this example, you would end up with properties.coord as the fullname of the inner element.
As an example, when the following JSON Schema is passed to the SMT:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "coord": {
      "type": "object",
      "properties": {
        "lon": { "type": "number" },
        "lat": { "type": "number" }
      },
      "required": ["lon", "lat"]
    },
    ...
}
The AVRO schema it produces (and thus looks for in the Schema Registry) is:
{
  "type": "record",
  "fields": [
    ...
    {
      "name": "coord",
      "type": {
        "type": "record",
        "name": "coord",
        "namespace": "properties",
        "fields": [
          {"name": "lat", "type": "double"},
          {"name": "lon", "type": "double"}
        ],
        "connect.name": "properties.coord"
      }
    },
    ...
}
In theory, if you have another schema with a coord element on the second level, it will get the same fullname, but since these are not individual entries inside the Schema Registry needing to be referenced, this will not lead to collisions. Not being able to control the namespace of the AVRO record from the JSON Schema is a little bit of a shame, as it feels like you're just about there, but I haven't been able to dig deep enough to offer a solution.
The suggested SMT SetSchemaMetadata (see the first reply to the question) can be useful in this process, but its documentation clashes a little with AVRO naming conventions, as it shows order-value in an example. It will try to find a schema that contains an AVRO record with this name as the root element, and since '-' is an illegal character in an AVRO name, you get an error. If you use the correct name of the root element, though, the SMT does something very useful: its RestService class, which queries the Schema Registry to find a matching schema, fails with a message printing out the exact schema definition that needs to be created, so you don't necessarily have to memorize all the transformation rules.
Thus the answer to the original question is: Yes, it can be done with Kafka Connect. And it also is the best way to go if you
don't want to write your own producer/connector
want to store JSON blobs in a typed way as opposed to converting them after they hit an initial topic
If conversion after data ingestion is an option, the de-, re- and serialization capabilities of ksqlDB seem to be quite powerful.

Event sourcing with kafka

What is the best practice for structuring messages in a topic that contains different event types whose order must be preserved?
Example
Topic: user-events
Event types: UserCreatedEvent, UserUpdatedEvent, UserDeletedEvent.
Those events need to be saved in the same topic and partition to guarantee the order.
Possible solutions I see:
1. Single schema containing all event type fields
   Pro: easy to implement and process as a stream
   Con: possible to have many empty fields, and it's not possible to specify required fields per event type
2. Schema containing all event type schemas: {eventId, timestamp, userCreated: {}, userUpdated: {}, userDeleted: {}}
   Pro: easy to implement, process as a stream, and set up required fields for each event type
   Con: the message type is not clear without inspecting the payload
3. Different schema per event using an Avro union
   Pro: every message is an event
   Con: difficult to deserialize (GenericRecord)
Are there other possible solutions, and how do you normally handle a topic with different message types? How do you process this kind of topic?
Any reference to code example is welcome.
UPDATE
There are two articles from Confluent trying to explain how to solve this:
Should You Put Several Event Types in the Same Kafka Topic?
Putting Several Event Types in the Same Topic
My opinion on the articles is that they only give you a partial answer.
The first tells you when it is a good idea to save different types into the same topic, and event sourcing is a good fit.
The second is more technical and illustrates the possibility of doing this with an Avro union.
But neither of them explains in detail how to do it with a real example.
I have seen projects on GitHub where they simplified the scenario by creating a single schema, more as a state than an actual event (point 1).
Talking to someone with experience using Kafka, I came up with the solution explained at point 2, by nesting the events into a "carrying event".
I managed yesterday (I will share the solution asap) to use an Avro union, deserialize the events as GenericRecord, and do transformations based on the event type.
Since I didn't find any similar solution, I was curious to know if I'm missing something, like drawbacks (e.g. ksqlDB doesn't support different types) or better practices for doing the same in Kafka.
In cases when I need to transfer different objects through one topic, I use a transport container. It stores some meta information about the nested object, plus the serialized object that you want to transport.
The .avsc schema of the container could look like this:
{
  "type": "record",
  "name": "TransportContainer",
  "namespace": "org.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "event_timestamp", "type": "long"},
    {"name": "event", "type": "bytes"}
  ]
}
In "event_type" field you should store the type of the event. You should use it to determine, which schema you need to use to deserialize a nested object, stored in the "event" field. Also, it helps to avoid deserialization of every nested object, if you want to read objects only with a specific type.

What is the best way to publish and consume different type of messages?

Kafka 0.8V
I want to publish/consume byte[] objects, Java bean objects, serializable objects and much more.
What is the best way to define a publisher and consumer for this type of scenario?
When I consume a message from the consumer iterator, I do not know what type of message it is.
Can anybody point me a guide on how to design such scenarios?
I enforce a single schema or object type per Kafka Topic. That way when you receive messages you know exactly what you are getting.
At a minimum, you should decide whether a given topic is going to hold binary or string data, and depending on that, how it will be further encoded.
For example, you could have a topic named Schema that contains JSON-encoded objects stored as strings.
If you use JSON and a loosely-typed language like JavaScript, it could be tempting to store different objects with different schemas in the same topic. With JavaScript, you can just call JSON.parse(...), take a peek at the resulting object, and figure out what you want to do with it.
But you can't do that in a strictly-typed language like Scala. The Scala JSON parsers generally want you to parse the JSON into an already defined Scala type, usually a case class. They do not work with this model.
One solution is to keep the one schema / one topic rule, but cheat a little: wrap an object in an object. A typical example would be an Action object where you have a header that describes the action, and a payload object with a schema dependent on the action type listed in the header. Imagine this pseudo-schema:
{name: "Action", fields: [
{name: "actionType", type: "string"},
{name: "actionObject", type: "string"}
]}
This way, in even a strongly-typed language, you can do something like the following (again this is pseudo-code) :
action = JSONParser[Action].parse(msg)
switch (action.actionType) {
    case "foo" => var foo = JSONParser[Foo].parse(action.actionObject)
    case "bar" => var bar = JSONParser[Bar].parse(action.actionObject)
}
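For what it's worth, here is a concrete (hypothetical) Java take on the same wrapper idea, using Jackson and Java records; the Foo/Bar payload types are invented for the example:

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical wrapper types; assumes Java 16+ records and Jackson 2.12+ for record support.
public class ActionDispatcher {

    public record Action(String actionType, String actionObject) {}
    public record Foo(String id) {}
    public record Bar(int count) {}

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Object handle(String msg) throws Exception {
        // Decode only the lightweight header first...
        Action action = MAPPER.readValue(msg, Action.class);
        // ...then parse the payload with the type named in the header.
        return switch (action.actionType()) {
            case "foo" -> MAPPER.readValue(action.actionObject(), Foo.class);
            case "bar" -> MAPPER.readValue(action.actionObject(), Bar.class);
            default -> throw new IllegalArgumentException("Unknown actionType " + action.actionType());
        };
    }
}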
One of the neat things about this approach is that if you have a consumer that's waiting for only a specific action.actionType, and is just going to ignore all the others, it's pretty lightweight for it to decode just the header and put off decoding action.actionObject until when and if it is needed.
So far this has all been about string-encoded data. If you want to work with binary data, of course you can wrap it in JSON as well, or any of a number of string-based encodings like XML. But there are a number of binary-encoding systems out there, too, like Thrift and Avro. In fact, the pseudo-schema above is based on Avro. You can even do cool things in Avro like schema evolution, which amongst other things provides a very slick way to handle the above Action use case -- instead of wrapping an object in an object, you can define a schema that is a subset of other schemas and decode just the fields you want, in this case just the action.actionType field. Here is a really excellent description of schema evolution.
In a nutshell, what I recommend is:
Settle on a schema-based encoding system (be it JSON, XML, Avro, whatever)
Enforce a one-schema-per-topic rule

Confluent Platform: Schema Registry Subjects

I'm working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used for the new Kafka 0.8.2 producer) uses the topic-key|value pattern for subjects (link), whereas KafkaAvroEncoder (for the old producer) uses the schema.getName()-value pattern (link).
The reason why one would have 2 different subjects per topic (one for key, one for value) is pretty simple:
say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    {"name": "line", "type": "string"},
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          {"name": "host", "type": "string"},
          {"name": "...", "type": "string"}
        ]
      }
    }
  ]
}
A common use case would be that I want to partition entries by source, and thus would like to have two subjects associated with the topic (subjects are basically revisions of Avro schemas) - one for the key (which is SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
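Here is a minimal modern-API sketch of that setup; the topic name "logs" and the addresses are placeholders. With the default topic-based subject naming this registers the subjects logs-key and logs-value:

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");                 // placeholder registry

        // With topic-based subject naming, sending SourceInfo keys and LogEntry values to
        // a topic called "logs" registers the subjects "logs-key" and "logs-value".
        try (KafkaProducer<Object, Object> producer = new KafkaProducer<>(props)) {
            // producer.send(new ProducerRecord<>("logs", sourceInfoRecord, logEntryRecord));
        }
    }
}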
Note: any further information is just my personal thoughts and maybe I just don't yet fully understand how this is supposed to work so I might be wrong.
I actually like how the KafkaAvroEncoder is implemented more than the KafkaAvroSerializer. KafkaAvroEncoder does not in any way force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In this case KafkaAvroSerializer would try to update the topic-key and topic-value subjects, and 99% of the time that would break because compatibility is violated (and if you have multiple Avro schemas they are almost always different and incompatible with each other).
On the other side, KafkaAvroEncoder cares just about schema names and you may safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
This inconsistency is still unclear to me and I hope Confluent guys can explain this if they see this question/answer.
Hope that helps you