Design question around Spring Cloud Stream, Avro, Kafka and losing data along the way - apache-kafka

We have implemented a system consisting of several Spring Boot microservices that communicate via messages posted to Kafka topics. We are using Spring Cloud Stream to handle a lot of the heavy lifting of sending and receiving messages via Kafka, and Apache Avro as the message serialisation format, integrating with a schema registry (the Spring Cloud Stream default implementation for local development, and Confluent Schema Registry for production).


We model our message classes in a common library that every microservice includes as a dependency. We use ‘dynamic schema generation’ to infer the Avro schema from the shape of our message classes just before Avro serialisation occurs, when a microservice acts as a producer and sends a message. The consuming microservice can look up the schema from the registry based on the schema version and deserialise into the message class, which it also has as a dependency.


It works well, however there is one big drawback for us that I wonder if anyone has experienced before and could offer any advice on. If we wish to add a new field to one of the model classes, for example, we do it in the common model class library and update the version of that dependency in the microservice. But it means that we need to update the version of that dependency in every microservice along the chain, even if the in-between microservices do not need that new field. Otherwise, the data value of that new field will be lost along the way, because of the way each microservice consumer deserialises into an object (which might be an out-of-date version of the class) along the way.

To give an example, let's say we have a model class in our model-common library called PaymentRequest (the @Data annotation is Lombok and just generates getters and setters from the fields):
@Data
public class PaymentRequest {
    String paymentId;
    String customerId;
}
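For reference, dynamic schema generation infers a schema from this class that looks roughly like the following (the namespace is just an example here, in practice it follows the package of the class):

{
  "type": "record",
  "name": "PaymentRequest",
  "namespace": "com.example.model",
  "fields": [
    { "name": "paymentId", "type": "string" },
    { "name": "customerId", "type": "string" }
  ]
}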


And we have a microservice called PayService which sends a PaymentRequest message onto a Kafka topic:


@Output("payment-broker")
MessageChannel paymentBrokerTopic();
...

PaymentRequest paymentRequest = getPaymentRequest();

Message<PaymentRequest> message = MessageBuilder.withPayload(paymentRequest).build();
paymentBrokerTopic().send(message);
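(For completeness, here is roughly how that output channel is declared and enabled in PayService; the interface and configuration class names below are just made up for the illustration:)

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.Output;
import org.springframework.messaging.MessageChannel;

// Binding interface declaring the output channel used above
public interface PaymentBindings {

    @Output("payment-broker")
    MessageChannel paymentBrokerTopic();
}

@EnableBinding(PaymentBindings.class)
class PayServiceStreamConfig {
    // Spring Cloud Stream binds "payment-broker" to the destination configured in application.yaml
}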

And we have this config in application.yaml in our Spring Boot application:


spring:
  cloud:
    stream:
      schema-registry-client:
        endpoint: http://localhost:8071
      schema:
        avro:
          dynamicSchemaGenerationEnabled: true
      bindings:
        payment-broker:
          destination: paymentBroker
          contentType: application/*+avro

Spring Cloud Stream’s Avro MessageConverter infers the schema from the PaymentRequest object, registers the schema in the schema registry if there is not already a matching one there, and sends the message to Kafka in Avro format.

Then we have another microservice, BrokerService, which has this consumer (plus an output channel to forward the message on):


@Output("payment-processor")
MessageChannel paymentProcessorTopic();


@Input("payment-request")
SubscribableChannel paymentRequestTopic();

@StreamListener("payment-request")
public void processNewPayment(Message<PaymentRequest> request) {
    // do some processing and then send on…
    paymentProcessorTopic().send(request);
}


It is able to deserialise that Avro message from Kafka into a PaymentRequest POJO, do some extra processing on it, and send the message onwards to another topic, paymentProcessor. That topic is in turn consumed by a third microservice, PaymentProcessor, which has another StreamListener consumer:



@Input("payment-processor")
SubscribableChannel paymentProcessorTopic();


@StreamListener("payment-processor")
public void processNewPayment(Message<PaymentRequest> request) {
    // do some processing and action the request…
}


If we wish to update the PaymentRequest class in the model-common library, so that it has a new field:

@Data
public class PaymentRequest {
    String paymentId;
    String customerId;
    String processorCommand;
}


If we update the dependency version in each of the microservices, the value of that new field gets deserialised into the POJO when the message is read, and reserialised back into the Avro message when it is sent on to the next topic, each time.


However, if we do not update the version of the model-common library in the second service in the chain, BrokerService for example, it will deserialise the message into a version of the class without that new field, and so when the message is reserialised and sent on to the payment-processor topic the Avro message will no longer carry the data for that field.
The third microservice, PaymentProcessor, might have the version of the model-common lib that does contain the new field, but when the message is deserialised into the POJO the value for that field will be null.

I know Avro has schema-evolution features where default values can be assigned to new fields to allow for backwards and forwards compatibility, but that is not sufficient for us here: we need the real values. And ideally we do not want a situation where we have to update the dependency version of the model library in every microservice, because that introduces a lot of work and coupling between services. Often a new field is not needed by the services midway along the chain, and might only be relevant to the first service and the final one, for example.


So has anyone else faced this issue and thought of a good way round it? We are keen not to lose the power of Avro and the convenience of Spring Cloud Stream, but to avoid these dependency issues. Is there anything around custom serializers/deserializers we could try? Or using GenericRecords? Or an entirely different approach?
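For instance, one idea we have been toying with (just a sketch; whether the stock Avro message converter will hand us a GenericRecord like this without customisation is exactly the part we are unsure about) is for the in-between services to work with the Avro GenericRecord directly and forward it untouched, so fields their copy of the class does not know about still survive:

import org.apache.avro.generic.GenericRecord;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;

@StreamListener("payment-request")
public void processNewPayment(Message<GenericRecord> request) {
    GenericRecord avroRecord = request.getPayload();
    // read only the fields this service actually cares about
    Object paymentId = avroRecord.get("paymentId");
    // ... do some processing ...
    // forward the record as-is, so fields unknown to this service's POJO are preserved
    paymentProcessorTopic().send(MessageBuilder.withPayload(avroRecord).build());
}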


Thanks for any help in advance!


Related

Spring cloud stream routing on payload

I want to use Spring Cloud Stream in my microservice to handle events from Kafka.
I read from one topic that can hold several JSON payloads (I have one topic since all the messages arriving on it relate to the same subject).
I have different cloud functions to handle the different payloads.
How can I route the incoming event to a specific function based on a property in its payload?
Say I have a JSON message that can have the following properties:
{
  "type": "A",
  "content": "xyz"
}
So the type property of an incoming message can be A or B.
Say I want to call one bean function when the type is A and another bean function when the type is B.
It is not clear from the question whether you are using the message channel-based Kafka binder or Kafka Streams binder. The comments above imply some reference to KStream. Assuming that you are using the message channel-based Kafka binder, you have the option of using the message routing feature in Spring Cloud Stream. The basic usage is explained in this section of the docs: https://docs.spring.io/spring-cloud-stream/docs/3.2.1/reference/html/spring-cloud-stream.html#_event_routing
You can provide a routing-expression, which is a SpEL expression, to route on the right property values.
If you want advanced routing capabilities beyond what can be expressed through a SpEL expression, you can also implement a custom MessageRoutingCallback. See this sample application for more details: https://github.com/spring-cloud/spring-cloud-stream-samples/tree/main/routing-samples/message-routing-callback
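As a rough sketch of the first option (the bean names, topic name and Event class below are made up for the illustration; the routing expression here assumes the producer also puts the type into a message header, since that keeps the SpEL simple; if the discriminator only exists in the JSON body, the MessageRoutingCallback sample linked above is the more flexible route):

import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RoutingConfig {

    // hypothetical payload with the "type" discriminator from the question
    public static class Event {
        public String type;
        public String content;
    }

    // invoked when the routing expression resolves to "typeA"
    @Bean
    public Consumer<Event> typeA() {
        return event -> System.out.println("handling type A: " + event.content);
    }

    // invoked when the routing expression resolves to "typeB"
    @Bean
    public Consumer<Event> typeB() {
        return event -> System.out.println("handling type B: " + event.content);
    }
}

// application.yml sketch:
// spring.cloud.stream.function.routing.enabled: true
// spring.cloud.stream.bindings.functionRouter-in-0.destination: my-topic
// spring.cloud.function.routing-expression: "headers['type'] == 'A' ? 'typeA' : 'typeB'"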

How to consume and parse different Avro messages in kafka consumer

In my application, Kafka topics are dedicated to a domain (can't change that), and multiple different types of events (1 event = 1 Avro schema message) related to that domain are produced by different microservices into that one topic.
Now I have only one consumer app, in which I should be able to apply the right schema dynamically (by inspecting the event name in the message) and transform into the appropriate POJO (generated from the specific Avro schema) for further event-specific actions.
Every example I find on the net is about a consumer for a single schema type, so I need some help.
Related blog post: https://www.confluent.io/blog/multiple-event-types-in-the-same-kafka-topic/
How to configure the consumer:
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-avro.html#avro-deserializer
https://github.com/openweb-nl/kafka-graphql-examples/blob/307bbad6f10e4aaa6b797a3bbe3b6620d3635263/graphql-endpoint/src/main/java/nl/openweb/graphql_endpoint/service/AccountCreationService.java#L47
https://github.com/openweb-nl/kafka-graphql-examples/blob/307bbad6f10e4aaa6b797a3bbe3b6620d3635263/graphql-endpoint/src/main/resources/application.yml#L20
You need the generated Avro classes on the classpath, most likely by adding a dependency.
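A rough sketch of the dispatch side, assuming the Confluent KafkaAvroDeserializer with specific.avro.reader=true as in the links above (broker/registry URLs, topic, group id and the fully-qualified event type names are all placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.specific.SpecificRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "domain-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // deserialise into the generated SpecificRecord classes rather than GenericRecord
        props.put("specific.avro.reader", "true");

        try (KafkaConsumer<String, SpecificRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("domain-topic"));
            while (true) {
                for (ConsumerRecord<String, SpecificRecord> record : consumer.poll(Duration.ofMillis(500))) {
                    SpecificRecord event = record.value();
                    // dispatch on the event's schema name (or use instanceof against the
                    // generated classes directly, since each value is already that type)
                    switch (event.getSchema().getFullName()) {
                        case "com.example.AccountCreated":    // hypothetical generated type
                            // handle account creation
                            break;
                        case "com.example.MoneyTransferred":  // hypothetical generated type
                            // handle money transfer
                            break;
                        default:
                            // unknown or not-yet-handled event type
                            break;
                    }
                }
            }
        }
    }
}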

How to Handle Deserialization Exception & Converting to New Schema with Spring Cloud Stream?

I am having trouble understanding how to properly handle a deserialization exception within Spring Cloud Stream, primarily because the framework as implemented does not support headers and the DLQ is supposed to use a separate schema from the original message. So the process flow needs to be: consume message -> deserialization error -> DlqHandler -> serialize with NEW schema -> send to DLQ
The documentation linked below doesn't give a good idea of whether that is even possible. I have seen quite a few examples of SeekToCurrentErrorHandler for Spring Kafka, but those, to my knowledge, are different implementations and do not show how I could properly catch the deserialization error and then have a place for custom code to serialize into a new format and go from there.
My main question is: Is capturing the deserialization exception and reserializing possible with spring cloud streams (kafka)?
Spring Cloud Documentation for DLQ
Yes, but not using the binding retry or DLQ properties.
Instead, add a ListenerContainerCustomizer bean and customize the binding's listener container with a SeekToCurrentErrorHandler configured for the retries you need and, probably, a subclass of the DeadLetterPublishingRecoverer using an appropriately configured KafkaTemplate and possibly overriding the createProducerRecord method.
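A minimal sketch of that wiring, assuming a spring-kafka 2.x container (the DLQ topic name is a placeholder, and the injected KafkaOperations would be a template configured with the serializer for the new DLQ schema):

import org.apache.kafka.common.TopicPartition;
import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class DlqConfig {

    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<byte[], byte[]>> dlqCustomizer(
            KafkaOperations<Object, Object> dlqTemplate) {
        return (container, destinationName, group) -> {
            // To publish with a NEW schema, subclass DeadLetterPublishingRecoverer and
            // override createProducerRecord, as mentioned above; this sketch publishes as-is.
            DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                    dlqTemplate,
                    (record, exception) -> new TopicPartition("my-dlq-topic", record.partition()));
            // 1 initial attempt + 2 retries, then the record is handed to the recoverer
            container.setErrorHandler(new SeekToCurrentErrorHandler(recoverer, new FixedBackOff(0L, 2L)));
        };
    }
}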

What is the value of an Avro Schema Registry?

I have many microservices reading/writing Avro messages in Kafka.
Schemas are great. Avro is great. But is a schema registry really needed? It helps centralize Schemas, yes, but do the microservices really need to query the registry? I don't think so.
Each microservice has a copy of the schema, user.avsc, and an Avro-generated POJO: User extends SpecificRecord. I want a POJO of each Schema for easy manipulation in the code.
Write to Kafka:
byte[] value = user.toByteBuffer().array();
producer.send(new ProducerRecord<>(TOPIC, key, value));
Read from Kafka:
User user = User.fromByteBuffer(ByteBuffer.wrap(record.value()));
Schema Registry gives a broader set of applications and services a way to use the data, not just your Java-based microservices.
For example, your microservice streams data to a topic, and you want to send that data to Elasticsearch, or a database. If you've got the Schema Registry you literally hook up Kafka Connect to the topic and it now has the schema and can create the target mapping or table. Without a Schema Registry each consumer of the data has to find out some other way what the schema of the data is.
Taken the other way around too - your microservice wants to access data that's written into a Kafka topic from elsewhere (e.g. with Kafka Connect, or any other producer) - with the Schema Registry you can simply retrieve the schema. Without it you start coupling your microservice development to having to know about where the source data is being produced and its schema.
There's a good talk about this subject here: https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf
Do they need to? No, not really.
Should you save yourself some space on your topic, and not send the schema as part of the message or require the consumers to have the schema available just to read anything? Yes, and that is what the Avro serializer does for you: it externalizes the schema somewhere that is consumable via a simple REST API.
The deserializer then must know where that schema is fetched from, and you can configure it with the specific.avro.reader=true property rather than manually invoking fromByteBuffer yourself, letting the Avro deserializer handle it.
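For comparison with the manual toByteBuffer/fromByteBuffer approach above, the registry-backed setup looks roughly like this (broker and registry URLs are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class RegistryBackedConfig {

    // Producer: KafkaAvroSerializer registers/looks up the schema and writes only its id
    // plus the Avro body, instead of the application shipping schema knowledge itself.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }

    // Consumer: KafkaAvroDeserializer fetches the writer's schema by id from the registry;
    // specific.avro.reader=true gives you the generated User POJO rather than a GenericRecord.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("specific.avro.reader", "true");
        return props;
    }
}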
Also, in larger orgs, shuffling around a single user.avsc file (even if version controlled) doesn't prevent that copy from becoming stale over time or handle evolution in a clean way.
One of the most important features of the Schema Registry is to manage the evolution of schemas. It provides the layer of compatibility checking. By setting an appropriate Compatibility Type you determine the allowed schema changes.
You can find all the available Compatibility Types here.
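For example, a sketch of setting the compatibility type for one subject via the registry's REST config endpoint (the URL and subject name are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        // PUT /config/{subject} on the Schema Registry sets the compatibility type
        // for that subject (e.g. BACKWARD, FORWARD, FULL, NONE)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/config/user-value"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}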

Confluent Schema Registry Avro Schema

Hey, I would like to use the Confluent Schema Registry with the Avro serializers. The documentation basically says: do not use the same schema for multiple different topics.
Can anyone explain to me why?
I researched the source code, and it basically stores the schema in a Kafka topic as follows: (topicname, magicbytes, version -> key), (schema -> value).
Therefore I don't see the problem with using the schema multiple times, except for redundancy?
I think you are referring to this comment in the documentation:
We recommend users use the new producer in org.apache.kafka.clients.producer.KafkaProducer. If you are using a version of Kafka older than 0.8.2.0, you can plug KafkaAvroEncoder into the old producer in kafka.javaapi.producer. However, there will be some limitations. You can only use KafkaAvroEncoder for serializing the value of the message and only send value of type Avro record. The Avro schema for the value will be registered under the subject recordName-value, where recordName is the name of the Avro record. Because of this, the same Avro record type shouldn’t be used in more than one topic.
First, the commenter above is correct -- this only refers to the old producer API pre-0.8.2. It's highly recommended that you use the new producer anyway as it is a much better implementation, doesn't depend on the whole core jar, and is the client which will be maintained going forward (there isn't a specific timeline yet, but the old producer will eventually be deprecated and then removed).
However, if you are using the old producer, this restriction is only required if the schema for the two subjects might evolve separately. Suppose that you did write two applications that wrote to different topics, but used the same Avro record type, let's call it record. Now both applications will register it/look it up under the subject record-value and get assigned version=1. This is all fine as long as the schema doesn't change. But let's say application A now needs to add a field. When it does so, the schema will be registered under subject record-value and get assigned version=2. This is fine for application A, but application B has either not been upgraded to handle this schema, or worse, the schema isn't even valid for application B. However, you lose the protection the schema registry normally gives you: now some other application could publish data of that format into the topic used by application B (it looks ok because record-value has that schema registered). Now application B could see data which it doesn't know how to handle, since it's not a schema it supports.
So the short version is that because with the old producer the subject has to be shared if you also use the same schema, you end up coupling the two applications and the schemas they must support. You can use the same schema across topics, but we suggest not doing so since it couples your applications (and their development, the teams developing them, etc).
