How do I concurrently process Reactor Kafka Streams by Topic and Partition with Auto Acknowledgement? - apache-kafka

I am trying to achieve concurrent processing of Kafka Topic-Partitions using Reactor Kafka with auto-acknowledgement. The documentation here makes it seem like this is possible:
http://projectreactor.io/docs/kafka/milestone/reference/#concurrent-ordered
The only difference between that and what I am attempting is I am using auto-acknowledgement.
I have the following code (relevant method is receiveAuto):
public class KafkaFluxFactory<K, V> {

    private final Map<String, Object> properties;

    public KafkaFluxFactory(Map<String, Object> properties) {
        this.properties = properties;
    }

    public Flux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
        return KafkaReceiver.create(ReceiverOptions.create(properties).subscription(topics))
                .receiveAutoAck()
                .flatMap(flux -> flux.groupBy(this::extractTopicPartition))
                .flatMap(topicPartitionFlux -> topicPartitionFlux.publishOn(scheduler));
    }

    private TopicPartition extractTopicPartition(ConsumerRecord<K, V> record) {
        return new TopicPartition(record.topic(), record.partition());
    }
}
When I use this to create a Flux of Consumer Records from Kafka with a parallel Scheduler (Schedulers.newParallel("debug", 10)), I see that they all end up getting processed on the same Thread.
Any thoughts on what I may be doing wrong?

After quite a bit of trial and error, plus some rethinking of what I want to accomplish, I realized I was trying to solve two problems in one bit of code.
The two things I need are:
In-order processing of Kafka Partitions
Ability to parallelize the processing of each partition
In trying to solve both with this piece of code, I was limiting downstream users' ability to configure the level of parallelization. I therefore changed the method to return a Flux of GroupedFluxes, which gives downstream users the right granularity for deciding what to parallelize:
public Flux<GroupedFlux<TopicPartition, ConsumerRecord<K, V>>> receiveAuto(Collection<String> topics) {
    return KafkaReceiver.create(createReceiverOptions(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux.groupBy(this::extractTopicPartition));
}
Downstream, users are able to parallelize each emitted GroupedFlux using whatever Scheduler they wish:
public <V> void work(Flux<GroupedFlux<TopicPartition, V>> flux) {
    flux.doOnNext(groupPublisher -> groupPublisher
            .publishOn(Schedulers.elastic())
            .subscribe(this::doWork))
        .subscribe();
}
This has the desired behavior: each TopicPartition GroupedFlux is processed in order, and in parallel with the other GroupedFluxes.
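For completeness, a minimal sketch of wiring the two methods together (the factory instance, consumer properties and topic names are placeholders, not part of the original code):
// Hypothetical wiring of receiveAuto and work from above; properties and topics are illustrative.
KafkaFluxFactory<String, String> factory = new KafkaFluxFactory<>(consumerProperties);
Flux<GroupedFlux<TopicPartition, ConsumerRecord<String, String>>> partitionFluxes =
        factory.receiveAuto(Arrays.asList("topic-a", "topic-b"));
work(partitionFluxes);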

I guess it executes sequentially, at least in your consumer. To consume in parallel you should convert your Flux to a ParallelFlux:
public ParallelFlux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
    return KafkaReceiver.create(ReceiverOptions.create(properties).subscription(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux.groupBy(this::extractTopicPartition))
            .flatMap(topicPartitionFlux -> topicPartitionFlux) // flatten the groups back into one Flux
            .parallel()                                        // split into parallel rails
            .runOn(Schedulers.parallel());                     // run each rail on the parallel Scheduler
}
Afterwards, in your consumer function, if you want to consume in a parallel way you should use a method such as:
void subscribe(Consumer<? super T> onNext, Consumer<? super Throwable> onError,
               Runnable onComplete, Consumer<? super Subscription> onSubscribe)
or any other overloaded method that takes a Consumer<? super T> onNext argument.
If you just use the method below, you will consume the flux sequentially:
void subscribe(Subscriber<? super T> s)
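For illustration, a minimal usage sketch (the factory, topics, scheduler and handle method are placeholders, not from the answer) of one Consumer-based overload:
// Sketch only: the Consumer-based subscribe overloads keep processing on the parallel rails,
// whereas subscribe(Subscriber) consumes the stream sequentially, as described above.
factory.receiveAuto(topics, scheduler)
        .subscribe(
                record -> handle(record),                                  // onNext, runs per rail
                error -> System.err.println("Consumer failed: " + error),  // onError
                () -> System.out.println("Stream completed"));             // onComplete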

Related

Kafka state-store sharing across sub-topologies

I am trying to create a custom joining consumer to join multiple events.
I have created a topology which has four sub-topologies (subtopology-0, subtopology-1, subtopology-2, subtopology-3), not in the exact order described by topology.describe().
I have created a state store in three of the sub-topologies (subtopology-0, subtopology-1, subtopology-2) and am trying to attach the state stores created in the different sub-topologies using .connectProcessorAndStateStores("PROCESS2", "COUNTS"), as per the Kafka developer guide https://kafka.apache.org/0110/documentation/streams/developer-guide
Here is the code snippet of how I am creating and attaching processors to the topology.
class StreamCustomizer implements KafkaStreamsInfrastructureCustomizer {

    public void someMethod(StreamsBuilder builder) {
        Topology topology = builder.build();
        topology.addProcessor("Processor1", new Processor() {...}, "state-store-1")
                .addStateStore(store1, ...);
        topology.addProcessor("Processor2", new Processor() {...}, "state-store-1")
                .addStateStore(store1, ...);
        topology.addProcessor("Processor3", new Processor() {...}, "state-store-1")
                .addStateStore(store1, ...);
        topology.addProcessor("Processor4", new Processor4() {...}, "Processor1", "Processor2", "Processor3")
                .connectProcessorAndStateStores("Processor4", "state-store-1", "state-store-2", "state-store-3");
    }
}
This is how the processor is defined for all the sub-topologies described above:
new Processor() {
    private ProcessorContext context;
    private KeyValueStore<K, V> store;

    public void init(ProcessorContext context) {
        this.context = context;
        store = (KeyValueStore<K, V>) context.getStateStore("store-name");
    }
}
This is how processor 4 is written, with all the state stores retrieved from the context in the init method:
new Processor4() {
    private KeyValueStore<K, V> store1;
    private KeyValueStore<K, V> store2;
    private KeyValueStore<K, V> store3;
}
I am observing strange behaviour with the above code: store1, store2, and store3 are all re-initialized and none of the keys stored in their respective sub-topologies (1, 2, 3) are preserved. However, the same code works, i.e. all state stores preserve their stored key-value pairs, when the state stores are declared at class level:
class StreamCustomizer implements KafkaStreamsInfrastructureCustomizer {
    private KeyValueStore<K, V> store1;
    private KeyValueStore<K, V> store2;
    private KeyValueStore<K, V> store3;
}
and then, in the processor implementation, the state store is just initialized in the init method:
new Processor() {
    private ProcessorContext context;

    public void init(ProcessorContext context) {
        this.context = context;
        store1 = (KeyValueStore<K, V>) context.getStateStore("store-name-1");
    }
}
Can someone please assist in finding the reason, or point out if there is anything wrong in this topology? I have also read that state stores can be shared within the same sub-topology.
Hard to say (the code snippets are not really clear); however, if you share state you effectively merge sub-topologies. Thus, if you do it correctly, you would end up with a single sub-topology containing all your processors.
As long as you see four sub-topologies, the state stores are not shared yet, i.e. they are not connected correctly.
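To illustrate what "connected correctly" means, here is a generic sketch (the store, topic and processor names are illustrative, not taken from the question): register the store once and attach it to every processor that reads or writes it, either when adding the store or afterwards via connectProcessorAndStateStores.
// Generic sketch; "shared-store", "input-topic", the processor names and MyProcessor are illustrative.
StoreBuilder<KeyValueStore<String, String>> sharedStore =
        Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("shared-store"),
                Serdes.String(), Serdes.String());

topology.addSource("Source", "input-topic")
        .addProcessor("Processor1", MyProcessor::new, "Source")
        .addProcessor("Processor2", MyProcessor::new, "Source")
        .addProcessor("Processor3", MyProcessor::new, "Processor1")
        // register the store once and attach it to the processors that use it...
        .addStateStore(sharedStore, "Processor1", "Processor2")
        // ...or connect it to an already-added processor later on
        .connectProcessorAndStateStores("Processor3", "shared-store");
Once every processor that uses the store is connected to it, topology.describe() should report them as a single sub-topology.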

Mutiny - Kafka writes happening sequentially

I am new to Quarkus. I am trying to write a REST endpoint using Quarkus reactive that receives an input, does some validation, transforms the input to a list and then writes a message to Kafka. My understanding was that converting everything to Uni/Multi would result in the execution happening on the I/O thread in an async manner. In the IntelliJ logs, however, I can see that the code is executed sequentially on the executor thread. The Kafka writes happen sequentially on their own network thread, which increases latency.
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Multi<OutputSample> send(InputSample inputSample) {
    ObjectMapper mapper = new ObjectMapper();
    // deflateMessage() converts the input to a list of InputSample
    Multi<InputSample> keys = Multi.createFrom().item(inputSample)
            .onItem().transformToMulti(array -> Multi.createFrom().iterable(deflateMessage.deflateMessage(array)))
            .concatenate();
    return keys.onItem().transformToUniAndMerge(payload -> {
        try {
            return producer.writeToKafka(payload, mapper);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        return null;
    });
}
@Inject
@Channel("write")
Emitter<String> emitter;

Uni<OutputSample> writeToKafka(InputSample kafkaPayload, ObjectMapper mapper) throws JsonProcessingException {
    String inputSampleJson = mapper.writeValueAsString(kafkaPayload);
    return Uni.createFrom().completionStage(emitter.send(inputSampleJson))
            .onItem().transform(ignored -> new OutputSample("id", 200, "OK"))
            .onFailure().recoverWithItem(new OutputSample("id", 500, "INTERNAL_SERVER_ERROR"));
}
I have been on it for a couple of days. I am not sure if I am doing anything wrong. Any help would be appreciated.
Thanks
Mutiny, like any other reactive library, is designed mainly around data flow control.
That being said, at its heart it offers a set of capabilities (generally through operators) to control flow execution and scheduling. This means that unless you instruct Mutiny objects to go asynchronous, they will simply execute in a sequential (old) fashion.
Execution scheduling is controlled using two operators:
runSubscriptionOn: causes the code generating the items (generally referred to as the upstream) to execute on a thread from the specified Executor
emitOn: causes the subscribing code (generally referred to as the downstream) to execute on a thread from the specified Executor
You can then update your code as follows causing the deflation to go asynchronous:
Multi<InputSample> keys = Multi.createFrom()
        .item(inputSample)
        .onItem()
        .transformToMulti(array -> Multi.createFrom()
                .iterable(deflateMessage.deflateMessage(array)))
        .concatenate()
        .runSubscriptionOn(Infrastructure.getDefaultExecutor()); // items will be transformed on a separate thread
EDIT: Downstream on a separate thread
In order to have the full downstream, transformation and writing to Kafka queue done on a separate thread, you can use the emitOn operator as follows:
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Multi<OutputSample> send(InputSample inputSample) {
    ObjectMapper mapper = new ObjectMapper();
    return Uni.createFrom()
            .item(inputSample)
            .onItem()
            .transformToMulti(array -> Multi.createFrom().iterable(deflateMessage.deflateMessage(array)))
            .emitOn(Executors.newFixedThreadPool(5)) // items will be emitted on a separate thread after transformation
            .onItem()
            .transformToUniAndConcatenate(payload -> {
                try {
                    return producer.writeToKafka(payload, mapper);
                } catch (JsonProcessingException e) {
                    e.printStackTrace();
                }
                return Uni.createFrom().<OutputSample>nothing();
            });
}
Multi is intended to be used when you have a source that emits items continuously until it emits a completion event, which is not your case.
From Mutiny docs:
A Multi represents a stream of data. A stream can emit 0, 1, n, or an
infinite number of items.
You will rarely create instances of Multi yourself but instead use a
reactive client that exposes a Mutiny API.
What you are looking for is a Uni<List<OutputSample>>, because your API returns one and only one item containing the complete result list.
So what you need is to send each message to Kafka without immediately waiting for its result, collect the generated Unis, and then combine them into a single Uni.
@POST
public Uni<List<OutputSample>> send(InputSample inputSample) {
    // This could be injected directly inside your producer
    ObjectMapper mapper = new ObjectMapper();
    // Send each item to Kafka and collect resulting Unis
    List<Uni<OutputSample>> uniList = deflateMessage(inputSample).stream()
            .map(input -> producer.writeToKafka(input, mapper))
            .collect(Collectors.toList());
    // Transform a list of Unis to a single Uni of a list
    @SuppressWarnings("unchecked") // Mutiny API fault...
    Uni<List<OutputSample>> result = Uni.combine().all().unis(uniList)
            .combinedWith(list -> (List<OutputSample>) list);
    return result;
}

Spring Cloud Stream Kafka Commit Failed since the group is rebalanced

I have got the CommitFailedException for some time-consuming Spring Cloud Stream applications. I know to fix this issue I need to set the max.poll.records and max.poll.interval.ms to match my expectations for the time it takes to process the batch. However, I am not quite sure how to set it for consumers in Spring Cloud Stream.
Exception:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:691)
    at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1416)
    at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1377)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitIfNecessary(KafkaMessageListenerContainer.java:1554)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processCommits(KafkaMessageListenerContainer.java:1418)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:739)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:700)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
Moreover, how can I ensure this situation won't happen at all? Or alternatively, how can I inject some sort of rollback in the case of this exception? The reason is that I am doing some other external work, and once it is finished I publish the output message accordingly. Therefore, if the message cannot get published due to any issue after the work was done on the external system, I have to revert it (some sort of atomic transaction over the Kafka publish and the other external systems).
You can set arbitrary Kafka properties either at the binder level (documentation here):
spring.cloud.stream.kafka.binder.consumerProperties
Key/Value map of arbitrary Kafka client consumer properties. In addition to supporting known Kafka consumer properties, unknown consumer properties are allowed here as well. Properties here supersede any properties set in boot and in the configuration property above.
Default: Empty map.
e.g. spring.cloud.stream.kafka.binder.consumerProperties.max.poll.records=10
Or at the binding level (documentation here):
spring.cloud.stream.kafka.bindings.<channelName>.consumer.configuration
Map with a key/value pair containing generic Kafka consumer properties. In addition to having Kafka consumer properties, other configuration properties can be passed here. For example some properties needed by the application such as spring.cloud.stream.kafka.bindings.input.consumer.configuration.foo=bar.
Default: Empty map.
e.g. spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=10
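Putting that together for the exception above, a minimal application.properties sketch at the binder level (the values are examples only and must be tuned to your processing time):
spring.cloud.stream.kafka.binder.consumerProperties.max.poll.records=10
spring.cloud.stream.kafka.binder.consumerProperties.max.poll.interval.ms=600000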
You can get notified of commit failures by adding an OffsetCommitCallback to the listener container's ContainerProperties and setting syncCommits to false. To customize the container and its properties, add a ListenerContainerCustomizer bean to the application.
EDIT
Async commit callback...
@SpringBootApplication
@EnableBinding(Sink.class)
public class So57970152Application {

    public static void main(String[] args) {
        SpringApplication.run(So57970152Application.class, args);
    }

    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<byte[], byte[]>> customizer() {
        return (container, dest, group) -> {
            container.getContainerProperties().setAckMode(AckMode.RECORD);
            container.getContainerProperties().setSyncCommits(false);
            container.getContainerProperties().setCommitCallback((map, ex) -> {
                if (ex == null) {
                    System.out.println("Successful commit for " + map);
                }
                else {
                    System.out.println("Commit failed for " + map + ": " + ex.getMessage());
                }
            });
            container.getContainerProperties().setClientId("so57970152");
        };
    }

    @StreamListener(Sink.INPUT)
    public void listen(String in) {
        System.out.println(in);
    }

    @Bean
    public ApplicationRunner runner(KafkaTemplate<byte[], byte[]> template) {
        return args -> {
            template.send("input", "foo".getBytes());
        };
    }
}
Manual commits (sync)...
@SpringBootApplication
@EnableBinding(Sink.class)
public class So57970152Application {

    public static void main(String[] args) {
        SpringApplication.run(So57970152Application.class, args);
    }

    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<byte[], byte[]>> customizer() {
        return (container, dest, group) -> {
            container.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
            container.getContainerProperties().setClientId("so57970152");
        };
    }

    @StreamListener(Sink.INPUT)
    public void listen(String in, @Header(KafkaHeaders.ACKNOWLEDGMENT) Acknowledgment ack) {
        System.out.println(in);
        try {
            ack.acknowledge(); // MUST USE MANUAL_IMMEDIATE for this to work.
            System.out.println("Commit successful");
        }
        catch (Exception e) {
            System.out.println("Commit failed " + e.getMessage());
        }
    }

    @Bean
    public ApplicationRunner runner(KafkaTemplate<byte[], byte[]> template) {
        return args -> {
            template.send("input", "foo".getBytes());
        };
    }
}
Set your heartbeat interval to less than one third of your session timeout. If the broker cannot determine whether your consumer is alive, it will initiate a partition rebalance among the remaining consumers. The heartbeat thread informs the broker that the consumer is alive in case the application takes a bit longer to process. Change these in your consumer configs:
heartbeat.interval.ms
session.timeout.ms
Try increasing the session timeout if it does not work. You have to fiddle around with these values.
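In a Spring Cloud Stream application these can also be passed through the binder's consumer properties; a sketch with example values only (the heartbeat is kept below one third of the session timeout):
spring.cloud.stream.kafka.binder.consumerProperties.heartbeat.interval.ms=3000
spring.cloud.stream.kafka.binder.consumerProperties.session.timeout.ms=30000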

How to handle backpressure when using Reactor Kafka?

I'm using Reactor Kafka to both consume and produce Kafka events. In the case of consuming events, my consumer is slow and therefore I need to handle backpressure.
However, I experience that no matter what I call Subscription.request() with, the publisher publishes all events from the topic immediately, overwhelming the consumer.
I'm using a custom Subscriber that sets a small number of initial requests by calling Subscription.request() when I subscribe to KafkaReceiver.receive(). To my understanding, this is how I tell the publisher how many events my consumer initially wants.
My subscriber:
public class KafkaEventSubscriber extends BaseSubscriber<EnrichedMetadata> {

    private final int numberOfItemsToRequestOnSubscribe;
    private final int numberOfItemsToRequestOnNext;

    public KafkaEventSubscriber(int numberOfItemsToRequestOnSubscribe,
                                int numberOfItemsToRequestOnNext) {
        this.numberOfItemsToRequestOnSubscribe = numberOfItemsToRequestOnSubscribe;
        this.numberOfItemsToRequestOnNext = numberOfItemsToRequestOnNext;
    }

    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(numberOfItemsToRequestOnSubscribe);
    }

    @Override
    protected void hookOnNext(EnrichedMetadata value) {
        request(numberOfItemsToRequestOnNext);
    }
}
How I use the subscriber:
kafkaReceiver.receive().map(ReceiverRecord::value).map(KafkaConsumer::acknowledge).subscribe(new KafkaEventSubscriber(10, 1));
I expect the KafkaReceiver to output 10 events before any call to the subscriber's onNext() method is made, but the KafkaReceiver outputs all events from the topic that have not already been acked.
No matter what I call Subscription.request() with, the publisher publishes all events from the topic immediately, not respecting the backpressure measures I've taken.

Spring Kafka - access offsetsForTimes to start consuming from specific offset

I have a fairly straightforward Kafka consumer:
MessageListener<String, T> messageListener = record -> {
    doStuff(record.value());
};

startConsumer(messageListener);

protected void startConsumer(MessageListener<String, T> messageListener) {
    ConcurrentMessageListenerContainer<String, T> container = new ConcurrentMessageListenerContainer<>(
            consumerFactory(this.brokerAddress, this.groupId),
            containerProperties(this.topic, messageListener));
    container.start();
}
I can consume messages without any issue.
Now, I have the requirement to seek from a specific offset based on the result of a call to offsetsForTimes on the Kafka Consumer.
I understand that I can seek to a certain position using the ConsumerSeekAware interface:
@Override
public void onPartitionsAssigned(Map<TopicPartition, Long> assignments,
        ConsumerSeekCallback callback) {
    assignments.forEach((t, o) -> callback.seek(t.topic(), t.partition(), ?????));
}
The problem now is that I do not have access to the Kafka Consumer inside the callback, so I have no way to call offsetsForTimes.
Is there any other way to achieve this?
Use a ConsumerAwareRebalanceListener to do the initial seeks (introduced in 2.0).
The current version is 2.2.0.
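A minimal sketch of what that can look like (the containerProperties variable, the timestamp and the handling of missing offsets are placeholders, not part of the original answer): the listener receives the Consumer, so offsetsForTimes and seek can be called on it directly.
// Sketch: seek each newly assigned partition to the first offset at or after a chosen timestamp.
containerProperties.setConsumerRebalanceListener(new ConsumerAwareRebalanceListener() {

    @Override
    public void onPartitionsAssigned(Consumer<?, ?> consumer, Collection<TopicPartition> partitions) {
        long startTimestamp = System.currentTimeMillis() - 3_600_000L; // placeholder: one hour ago
        Map<TopicPartition, Long> query = partitions.stream()
                .collect(Collectors.toMap(tp -> tp, tp -> startTimestamp));
        consumer.offsetsForTimes(query).forEach((tp, offsetAndTimestamp) -> {
            if (offsetAndTimestamp != null) { // null when no record exists at or after the timestamp
                consumer.seek(tp, offsetAndTimestamp.offset());
            }
        });
    }
});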
How to test a ConsumerAwareRebalanceListener?