KafkaHandler for multiple JSON types in Single Kafka Topic - Process in batch

TL;DR: is it possible to use separate @KafkaHandler methods for different JSON types in batch mode?
I am consuming a topic that contains several different JSON message types. I process the data and insert it into a database, so I consume in batch mode and do the Kafka commit manually once everything has been inserted into the database.
So I have my factory which includes
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
and the factory sets
factory.setBatchListener(true);
For testing I have two POJOs: TypeA with strings var1 and var2, and TypeB with strings var3 and var4. My consumer then has a method which is essentially:
@KafkaListener(topics = "${kafka.topics}")
public void receive(List<Object> data, Acknowledgment ack) {
    for (int i = 0; i < data.size(); i++) {
        Object d = data.get(i);
        if (d instanceof TypeA) {
            TypeA a = (TypeA) d;
            LOGGER.info("We have Type A - '{}' - '{}'", a.getVar1(), a.getVar2());
        }
        if (d instanceof TypeB) {
            TypeB b = (TypeB) d;
            LOGGER.info("We have Type B - '{}' - '{}'", b.getVar3(), b.getVar4());
        }
    }
    ack.acknowledge();
}
This works, but I have been attempting to get this to work using @KafkaHandler methods for each type instead of using instanceof.
If I remove the line that enables batch processing and move the @KafkaListener annotation to the class level, I can then create separate handlers:
@KafkaHandler
public void receiveA(@Payload TypeA a) {
    LOGGER.info("We have Type A - '{}' - '{}'", a.getVar1(), a.getVar2());
}

@KafkaHandler
public void receiveB(@Payload TypeB b) {
    LOGGER.info("We have Type B - '{}' - '{}'", b.getVar3(), b.getVar4());
}
This works fine, but I lose the batch capability.
If I enable batch mode, it only seems to want a handler for ArrayList, and you can't have separate handlers for the different types.
Is there some middle ground here? Is there any way I could process the individual records using @KafkaHandler but have something that fires once all the records have been processed by their handlers (to handle the acknowledgment and database commits)? Or is there a better way of handling this than the long chain of instanceof checks in the first example?

@KafkaHandler is currently not supported for batch listeners; please open a GitHub issue - we should be able to properly detect the generic list contents type from the generic parameter type.
You might be able to use a custom BatchToRecordAdapter to call the record level listener, and set some flag in the final message in the batch to signal that it is the last.
See https://docs.spring.io/spring-kafka/docs/current/reference/html/#transactions-batch
EDIT
It doesn't make sense to support @KafkaHandler - the batch could contain mixed types.
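As a side note on the last part of the question (replacing the instanceof chain while keeping the batch listener and the manual acknowledgment), here is a minimal sketch using a plain dispatch map keyed by payload class. TypeA, TypeB and the logging calls come from the question; the handlers map itself is just an illustration, not a Spring Kafka feature, and it uses java.util.function.Consumer (not the Kafka Consumer) as the handler type:
// dispatch table built once per listener instance
private final Map<Class<?>, Consumer<Object>> handlers = new HashMap<>();

{
    handlers.put(TypeA.class, d -> {
        TypeA a = (TypeA) d;
        LOGGER.info("We have Type A - '{}' - '{}'", a.getVar1(), a.getVar2());
    });
    handlers.put(TypeB.class, d -> {
        TypeB b = (TypeB) d;
        LOGGER.info("We have Type B - '{}' - '{}'", b.getVar3(), b.getVar4());
    });
}

@KafkaListener(topics = "${kafka.topics}")
public void receive(List<Object> data, Acknowledgment ack) {
    for (Object d : data) {
        // fall back to a warning for any type without a registered handler
        handlers.getOrDefault(d.getClass(),
                unknown -> LOGGER.warn("Unhandled type: {}", unknown.getClass()))
                .accept(d);
    }
    // acknowledge (and run the database commit) only after the whole batch has been handled
    ack.acknowledge();
}
The acknowledgment and the database commit still happen in one place, after every record in the batch has been dispatched.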

Related

Mutiny - Kafka writes happening sequentially

I am new to Quarkus. I am trying to write a REST endpoint using Quarkus reactive that receives an input, does some validation, transforms the input to a list and then writes a message to Kafka. My understanding was that converting everything to Uni/Multi would result in the execution happening asynchronously on the I/O thread. In the IntelliJ logs, however, I can see that the code is executed sequentially on the executor thread. The Kafka write happens in its own network thread, sequentially, which increases latency.
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Multi<OutputSample> send(InputSample inputSample) {
    ObjectMapper mapper = new ObjectMapper();
    // deflateMessage() converts the input to a list of InputSample
    Multi<InputSample> keys = Multi.createFrom().item(inputSample)
            .onItem().transformToMulti(array -> Multi.createFrom().iterable(deflateMessage.deflateMessage(array)))
            .concatenate();
    return keys.onItem().transformToUniAndMerge(payload -> {
        try {
            return producer.writeToKafka(payload, mapper);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        return null;
    });
}
@Inject
@Channel("write")
Emitter<String> emitter;

Uni<OutputSample> writeToKafka(InputSample kafkaPayload, ObjectMapper mapper) throws JsonProcessingException {
    String inputSampleJson = mapper.writeValueAsString(kafkaPayload);
    return Uni.createFrom().completionStage(emitter.send(inputSampleJson))
            .onItem().transform(ignored -> new OutputSample("id", 200, "OK"))
            .onFailure().recoverWithItem(new OutputSample("id", 500, "INTERNAL_SERVER_ERROR"));
}
I have been at this for a couple of days and am not sure if I am doing anything wrong. Any help would be appreciated.
Thanks
Mutiny, like any other reactive library, is designed mainly around data flow control.
That being said, at its heart it offers a set of capabilities (generally through operators) to control flow execution and scheduling. This means that unless you instruct Mutiny objects to go asynchronous, they will simply execute in a sequential (old-fashioned) way.
Execution scheduling is controlled using two operators:
runSubscriptionOn: causes the code snippet generating the items (generally referred to as the upstream) to execute on a thread from the specified Executor
emitOn: causes the subscribing code (generally referred to as the downstream) to execute on a thread from the specified Executor
You can then update your code as follows causing the deflation to go asynchronous:
Multi<InputSample> keys = Multi.createFrom()
        .item(inputSample)
        .onItem()
        .transformToMulti(array -> Multi.createFrom()
                .iterable(deflateMessage.deflateMessage(array)))
        .runSubscriptionOn(Infrastructure.getDefaultExecutor()) // items will be transformed on a separate thread
        .concatenate();
EDIT: Downstream on a separate thread
In order to have the full downstream, transformation and writing to Kafka queue done on a separate thread, you can use the emitOn operator as follows:
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Multi<OutputSample> send(InputSample inputSample) {
    ObjectMapper mapper = new ObjectMapper();
    return Uni.createFrom()
            .item(inputSample)
            .onItem()
            .transformToMulti(array -> Multi.createFrom().iterable(deflateMessage.deflateMessage(array)))
            .emitOn(Executors.newFixedThreadPool(5)) // items will be emitted on a separate thread after transformation
            .onItem()
            .transformToUniAndConcatenate(payload -> {
                try {
                    return producer.writeToKafka(payload, mapper);
                } catch (JsonProcessingException e) {
                    e.printStackTrace();
                }
                return Uni.createFrom().<OutputSample>nothing();
            });
}
Multi is intended to be used when you have a source that emits items continuously until it emits a completion event, which is not your case.
From Mutiny docs:
A Multi represents a stream of data. A stream can emit 0, 1, n, or an
infinite number of items.
You will rarely create instances of Multi yourself but instead use a
reactive client that exposes a Mutiny API.
What you are looking for is a Uni<List<OutputSample>>, because your API returns one and only one item containing the complete result list.
So what you need is to send each message to Kafka without immediately waiting for it to complete, collect the generated Unis, and then combine them into a single Uni.
@POST
public Uni<List<OutputSample>> send(InputSample inputSample) {
    // This could be injected directly inside your producer
    ObjectMapper mapper = new ObjectMapper();
    // Send each item to Kafka and collect the resulting Unis
    List<Uni<OutputSample>> uniList = deflateMessage(inputSample).stream()
            .map(input -> producer.writeToKafka(input, mapper))
            .collect(Collectors.toList());
    // Transform a list of Unis into a single Uni of a list
    @SuppressWarnings("unchecked") // Mutiny API fault...
    Uni<List<OutputSample>> result = Uni.combine().all().unis(uniList)
            .combinedWith(list -> (List<OutputSample>) list);
    return result;
}

How to create a generic deserializer for bytes in Kafka?

I have a proto file with multiple message types. I want to create a generic deserializer for these messages.
Do I need to send each message's type in a Kafka message header so the consumer can deserialize the message using this type information? Is this a best practice, or is there another solution?
Example deserialize method:
public Object deserialize(String topic, Headers headers, byte[] data) {
    if (new String(headers.iterator().next().value()).equals("Person")) {
        return Person.parseFrom(data);
    } else if ....
}
My proto file:
message Person {
    uint64 number = 1;
    string name = 2;
}

message Event {
    string msg = 1;
    int32 code = 2;
}

message Data {
    string inf = 1;
    string desc = 2;
}
....
Do I need to send these message's types with Kafka message header so
Consumer can deserialize these messages with this type information?
If your KafkaConsumer consumes messages from only one topic containing only one type (class) of message, then you can pass that class in the deserializer configuration, for example under keys such as value.class or key.class. You can read these in the Deserializer's configure() method via configs.get("value.class") or configs.get("key.class") and store them in member variables.
void configure(java.util.Map<java.lang.String, ?> configs, boolean isKey)
In case your topic contains different types of messages or your consumer subscribes to different topics each with different types of messages, then storing the class in the Headers should be appropriate.
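For illustration, a minimal sketch of that header-based approach; it assumes the producer writes the message type name into a header (here arbitrarily called "type") and that Person, Event and Data are the classes generated from the proto file above:
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.serialization.Deserializer;
import com.google.protobuf.InvalidProtocolBufferException;

public class ProtoByHeaderDeserializer implements Deserializer<Object> {

    @Override
    public Object deserialize(String topic, byte[] data) {
        throw new UnsupportedOperationException("Headers are required to select the message type");
    }

    @Override
    public Object deserialize(String topic, Headers headers, byte[] data) {
        // the producer is assumed to set this header, e.g.
        // record.headers().add("type", "Person".getBytes(StandardCharsets.UTF_8));
        String type = new String(headers.lastHeader("type").value(), StandardCharsets.UTF_8);
        try {
            switch (type) {
                case "Person": return Person.parseFrom(data);
                case "Event":  return Event.parseFrom(data);
                case "Data":   return Data.parseFrom(data);
                default: throw new IllegalArgumentException("Unknown message type: " + type);
            }
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to parse " + type, e);
        }
    }
}
The matching producer side only needs to add that one header when sending each record.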
Another alternative is to write a wrapper class.
class MessageWrapper {
    private Class messageClass;
    private byte[] messageData;
    ProtobufSchema schema;
}
and then deserialize the data into the MessageWrapper. Here messageData could hold a Person, Data, or Event, and messageClass should help you with parsing.
For example,
mapper.readerFor(messageWrapper.getMessageClass())
        .with(messageWrapper.getSchema())
        .readValue(messageWrapper.getMessageData());
Once you get the object, you can check whether it is an instance of Person, Event, or Data.
You may also look at Generating Protobuf schema from POJO definition and omit the schema field in the MessageWrapper.
Snippet
ProtobufMapper mapper = new ProtobufMapper();
ProtobufSchema schemaWrapper = mapper.generateSchemaFor(messageWrapper.getMessageClass());
NativeProtobufSchema nativeProtobufSchema = schemaWrapper.getSource();
String asProtofile = nativeProtobufSchema.toString();

How to subscribe to multiple topics using the @KafkaListener annotation

I am using the Kafka message broker to publish and subscribe to events, using the Spring infrastructure. My requirement is to create one consumer that subscribes to multiple topics.
The following code works perfectly fine when it subscribes to a single topic.
#KafkaListener(topics = "com.customer.nike")
public void receive(String payload) {
LOGGER.info("received payload='{}'", payload);
}
But I want it to subscribe to some pattern of topics, like:
#KafkaListener(topics = "com.cusotmer.*.nike")
public void receive(String payload) {
LOGGER.info("received payload='{}'", payload);
}
In this code, * will keep changing. It may be some numeric value like 1000, 1010 and so on. For this I also tried SpEL:
#KafkaListener(topics = "#{com.cusotmer.*.nike}")
public void receive(String payload) {
LOGGER.info("received payload='{}'", payload);
}
But this one is also not working for me.
Could someone help me subscribe to multiple topics?
Thanks in advance.
I use @KafkaListener(topics = "#{'${kafka.topics}'.split(',')}"), where kafka.topics is taken from my property file and contains the comma-separated topics to which my listener should listen.
But perhaps, during startup, you can add logic that generates all the possible topics and assigns them to a variable, which can then be used as above (see the sketch below).
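For illustration, a minimal sketch of that idea (imports omitted, as elsewhere in this post). The bean name topicProvider and the 1000-1010 range are made-up examples; the listener resolves the list through a SpEL bean reference:
@Component("topicProvider")
public class TopicProvider {

    // generate every possible topic name once at startup,
    // e.g. com.customer.1000.nike, com.customer.1001.nike, ... com.customer.1010.nike
    public List<String> topics() {
        return IntStream.rangeClosed(1000, 1010)
                .mapToObj(i -> "com.customer." + i + ".nike")
                .collect(Collectors.toList());
    }
}

@KafkaListener(topics = "#{@topicProvider.topics()}")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}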
Update: a wildcard is possible, as Alexandre commented below.
Regarding the subscription to multiple topics, you can use topicPattern to achieve that:
The topic pattern for this listener. The entries can be 'topic
pattern', a 'property-placeholder key' or an 'expression'. The
framework will create a container that subscribes to all topics
matching the specified pattern to get dynamically assigned partitions.
The pattern matching will be performed periodically against topics
existing at the time of check. An expression must be resolved to the
topic pattern (String or Pattern result types are supported).
Mutually exclusive with topics() and topicPartitions().
@KafkaListener(topicPattern = "com.customer.*")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}
Regarding programmatic access to the topic name, you can use a @Header-annotated method parameter to extract a specific header value, defined in KafkaHeaders, which in your case is RECEIVED_TOPIC:
The header containing the topic from which the message was received.
#KafkaListener(topics = "com.customer.nike")
public void receive(String payload, #Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
LOGGER.info("received payload='{}'", payload);
LOG.info("received from topic: {}", topic);
}
If we have to fetch multiple topics from the application.properties file:
@KafkaListener(topics = { "${spring.kafka.topic1}", "${spring.kafka.topic2}" })

Preventing KStream from emitting an old unchanged aggregate value

I have a KStream pipeline which groups by key, windows on some interval, and then applies a custom aggregation:
KStream<String, Integer> input = /* define input stream */;

/* group by key and then apply windowing */
KTable<Windowed<String>, MyAggregate> aggregateTable =
        input.groupByKey()
             .windowedBy(/* window definition here */)
             .aggregate(MyAggregate::new, (key, value, agg) -> agg.addAndReturn(value));

// I need to get a change log of aggregateTable, so:
aggregateTable.toStream().to("output-topic");
The problem is that the majority of the input records will not change the internal state of the MyAggregate object. The structure is similar to:
class MyAggregate {
    private Set<Integer> checkBeforeInsert = /* some predefined values */;
    private List<Integer> actualState = new ArrayList<>();

    public MyAggregate addAndReturn(Integer value) {
        /* for 99% of records the if check passes */
        if (checkBeforeInsert.contains(value)) {
            /* do nothing and return. Note that the state hasn't been changed */
            return this;
        } else {
            actualState.add(value);
            return this;
        }
    }
}
However, Kafka Streams has no clue that the aggregate object hasn't changed: it still stores the aggregate (which is the same as the old one), propagates the same old value to the changelog topic, and triggers aggregateTable.toStream() with the same old value.
Although the semantics of my application work fine (the rest of the application is aware that an unchanged state might arrive), this creates a huge amount of noise traffic on the intermediate topics. I need a way to tell Kafka Streams whether an aggregate has really changed and should be stored, or whether it is the same as the previous record and should just be ignored.

Detecting abandoned processes in Kafka Streams 2.0

I have this case: users collect orders as order lines. I implemented this with a Kafka topic containing order-change events; they are merged, stored in a local key-value store, and broadcast to a second topic as order versions.
I need to somehow react to abandoned orders - ones that were started but have had no change for at least the last x hours.
A simple solution could be to scan the local store every y minutes and post an order status change event to Abandoned. It seems I cannot access the store from outside a processor... but that is also not very elegant coding. Any suggestions are welcome.
--edit
I cannot just add punctuation to the merge/validation transformer, because its output is different and should be routed elsewhere, as in this image (a single Kafka Streams app):
So the "abandoned orders processor/transformer" will be a no-op for its input (the only trigger here is time). Another issue is that in such a case (as in the image) my transformer gets a ForwardingDisabledProcessorContext upon initialization, so I cannot emit any messages in the punctuator. I could just pass the kafkaTemplate bean in there and produce new messages directly, but then the whole processor/transformer is just an empty shell used only to access the local store...
This is a snippet of the code I used:
public class AbandonedOrdersTransformer implements ValueTransformer<OrderEvent, OrderEvent> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // main scheduler
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            KeyValueIterator<String, Order> iter = this.stateStore.all();
            while (iter.hasNext()) {
                KeyValue<String, Order> entry = iter.next();
                if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                        (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                    // SEND ABANDON EVENT "event"
                    context.forward(entry.key, event);
                }
            }
            iter.close();
            context.commit();
        });
    }

    @Override
    public OrderEvent transform(OrderEvent orderEvent) {
        // do nothing
        return null;
    }

    @Override
    public void close() {
        // do nothing
    }
}