Spring Cloud Stream Kafka Commit Failed since the group is rebalanced - apache-kafka

I have got the CommitFailedException for some time-consuming Spring Cloud Stream applications. I know to fix this issue I need to set the max.poll.records and max.poll.interval.ms to match my expectations for the time it takes to process the batch. However, I am not quite sure how to set it for consumers in Spring Cloud Stream.
Exception:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:691)
    at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1416)
    at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1377)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitIfNecessary(KafkaMessageListenerContainer.java:1554)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processCommits(KafkaMessageListenerContainer.java:1418)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:739)
    at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:700)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
Moreover, how can I ensure this situation won't happen at all? Or alternatively, how can I inject some sort of roll-back in the case of this exception? The reason is I am doing some other external works and once it is finished I publish the output message accordingly. Therefore, if the message cannot get published due to any issues after the work was done on the external system, I have to revert it back (some sort of atomic transaction over Kafka publish and other external systems).

You can set arbitrary Kafka consumer properties either at the binder level (see the Kafka binder documentation):
spring.cloud.stream.kafka.binder.consumerProperties
Key/Value map of arbitrary Kafka client consumer properties. In addition to support known Kafka consumer properties, unknown consumer properties are allowed here as well. Properties here supersede any properties set in boot and in the configuration property above.
Default: Empty map.
e.g. spring.cloud.stream.kafka.binder.consumerProperties.max.poll.records=10
Or at the binding level (see the Kafka binding documentation):
spring.cloud.stream.kafka.bindings.<channelName>.consumer.configuration
Map with a key/value pair containing generic Kafka consumer properties. In addition to having Kafka consumer properties, other configuration properties can be passed here. For example some properties needed by the application such as spring.cloud.stream.kafka.bindings.input.consumer.configuration.foo=bar.
Default: Empty map.
e.g. spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=10
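For the two settings named in the question, the binding-level form would look roughly like the fragment below. This is a sketch, not a recommendation: the values are illustrative assumptions, and `input` is simply the binding name used in the examples above. The idea is that max.poll.interval.ms should comfortably exceed the worst-case time needed to process one batch of max.poll.records.

```properties
# Illustrative values only (assumptions); tune them to your own processing times.
spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=10
spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.interval.ms=600000
```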
You can get notified of commit failures by adding an OffsetCommitCallback to the listener container's ContainerProperties and setting syncCommits to false. To customize the container and its properties, add a ListenerContainerCustomizer bean to the application.
EDIT
Async commit callback...
@SpringBootApplication
@EnableBinding(Sink.class)
public class So57970152Application {
    public static void main(String[] args) {
        SpringApplication.run(So57970152Application.class, args);
    }
    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<byte[], byte[]>> customizer() {
        return (container, dest, group) -> {
            container.getContainerProperties().setAckMode(AckMode.RECORD);
            // Async commits are required for the commit callback to be invoked
            container.getContainerProperties().setSyncCommits(false);
            container.getContainerProperties().setCommitCallback((map, ex) -> {
                if (ex == null) {
                    System.out.println("Successful commit for " + map);
                }
                else {
                    System.out.println("Commit failed for " + map + ": " + ex.getMessage());
                }
            });
            container.getContainerProperties().setClientId("so57970152");
        };
    }
    @StreamListener(Sink.INPUT)
    public void listen(String in) {
        System.out.println(in);
    }
    @Bean
    public ApplicationRunner runner(KafkaTemplate<byte[], byte[]> template) {
        return args -> {
            template.send("input", "foo".getBytes());
        };
    }
}
Manual commits (sync)...
@SpringBootApplication
@EnableBinding(Sink.class)
public class So57970152Application {
    public static void main(String[] args) {
        SpringApplication.run(So57970152Application.class, args);
    }
    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<byte[], byte[]>> customizer() {
        return (container, dest, group) -> {
            container.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
            container.getContainerProperties().setClientId("so57970152");
        };
    }
    @StreamListener(Sink.INPUT)
    public void listen(String in, @Header(KafkaHeaders.ACKNOWLEDGMENT) Acknowledgment ack) {
        System.out.println(in);
        try {
            ack.acknowledge(); // MUST USE MANUAL_IMMEDIATE for this to work.
            System.out.println("Commit successful");
        }
        catch (Exception e) {
            System.out.println("Commit failed " + e.getMessage());
        }
    }
    @Bean
    public ApplicationRunner runner(KafkaTemplate<byte[], byte[]> template) {
        return args -> {
            template.send("input", "foo".getBytes());
        };
    }
}

Set your heartbeat interval to less than one third of your session timeout. If the broker cannot determine whether your consumer is alive, it initiates a partition rebalance among the remaining consumers. The consumer therefore runs a heartbeat thread that tells the broker it is still alive even when the application takes a bit longer to process records. Change these in your consumer configs:
heartbeat.interval.ms
session.timeout.ms
Try increasing the session timeout if it does not work. You have to fiddle around with these values.

Related

KafkaMessageListenerContainer.stop() is not stopping consumption of messages in message listener

Use case: given a Kafka topic with 100 messages, I want to read the messages from offset 10 to offset 20. I am able to start fetching from the beginning offset. When I reach the end offset, my code stops the container. Even after that code has run, the consumer keeps consuming further messages (from offset 21). It only stops after reading all the messages in the topic.
@Service
public class Consumer1 implements MessageListener<String, GenericRecord> {
    @Override
    public void onMessage(ConsumerRecord<String, GenericRecord> data) {
        log.info("feed record {}", data);
        if (data.offset() == 20) {
            feedService.stopConsumer();
        }
    }
}
@Service
public class FeedService{
    // start logic here
    public void stopConsumer() {
        kafkaMessageListenerContainer.stop();
    }
}
Note: I am using the latest spring-kafka version (2.6.4). One observation: the container's stop() method is executed, but the consumer is not closed, and there are no errors in the output.
The stop() doesn't terminate the current records batch cycle:
while (isRunning()) {
    try {
        pollAndInvoke();
    }
    catch (@SuppressWarnings(UNUSED) WakeupException e) {
        // Ignore, we're stopping or applying immediate foreign acks
    }
pollAndInvoke() calls KafkaConsumer.poll(), gets a collection of records, and invokes your onMessage() for each record. When you call stop() part-way through, the container is usually not at the end of that record list, so it cannot exit immediately.
It actually stops on the next cycle, when isRunning() returns false.
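One way to live with this is to have the listener itself ignore anything past the last wanted offset. The sketch below is a plain-Java model of the loop described above, not Spring Kafka code: `StopAtOffsetModel`, `run`, and the `running` flag are hypothetical stand-ins for the container, its poll loop, and isRunning(). Setting max.poll.records to 1 is another option, at a cost in throughput. Depending on the ack mode, offsets of skipped records may still be committed, so treat this only as an illustration of the control flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Plain-Java model of a poll loop; all names are illustrative, not Spring Kafka API. */
final class StopAtOffsetModel {

    private final AtomicBoolean running = new AtomicBoolean(true);
    private final List<Long> processed = new ArrayList<>();
    private final long lastWantedOffset;

    StopAtOffsetModel(long lastWantedOffset) {
        this.lastWantedOffset = lastWantedOffset;
    }

    /** Stand-in for container.stop(): only flips a flag checked between batches. */
    void stop() {
        running.set(false);
    }

    /** Stand-in for onMessage(): skips everything past the last wanted offset. */
    void onMessage(long offset) {
        if (offset > lastWantedOffset) {
            return; // stop was already requested; ignore the rest of this batch
        }
        processed.add(offset);
        if (offset == lastWantedOffset) {
            stop();
        }
    }

    /** Hands over each batch in full before re-checking the flag, like the loop above. */
    List<Long> run(List<List<Long>> batches) {
        for (List<Long> batch : batches) {
            if (!running.get()) {
                break;
            }
            for (long offset : batch) {
                onMessage(offset);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        StopAtOffsetModel model = new StopAtOffsetModel(20);
        // The first "poll" returns offsets 18..23 in one batch; a second batch would follow.
        List<Long> result = model.run(List.of(
                List.of(18L, 19L, 20L, 21L, 22L, 23L),
                List.of(24L, 25L)));
        System.out.println(result); // prints [18, 19, 20]
    }
}
```

Without the guard at the top of onMessage(), offsets 21 to 23 would also be processed, because they arrived in the same batch as offset 20.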

Which kafka property decides Poll frequency for KafkaConsumer?

I am trying to understand Kafka in some detail with respect to Kafka Streams (the Kafka Streams client talking to Kafka).
I understand that KafkaConsumer (the Java client) gets data from Kafka; however, I am not able to understand how frequently the client polls the Kafka topic to fetch data.
The frequency of the poll is defined by your code because you're responsible to call poll.
A very naive example of user code using KafkaConsumer is like the following
public class KafkaConsumerExample {
    ...
    static void runConsumer() throws InterruptedException {
        final Consumer<Long, String> consumer = createConsumer();
        final int giveUp = 100;
        int noRecordsCount = 0;
        while (true) {
            // poll(Duration) replaces the deprecated poll(long) overload (needs java.time.Duration)
            final ConsumerRecords<Long, String> consumerRecords =
                    consumer.poll(Duration.ofMillis(1000));
            if (consumerRecords.count() == 0) {
                noRecordsCount++;
                if (noRecordsCount > giveUp) break;
                else continue;
            }
            // The time spent in this loop decides when poll() is called again
            consumerRecords.forEach(record -> {
                System.out.printf("Consumer Record:(%d, %s, %d, %d)\n",
                        record.key(), record.value(),
                        record.partition(), record.offset());
            });
            consumer.commitAsync();
        }
        consumer.close();
        System.out.println("DONE");
    }
}
In this case the frequency is defined by the duration of processing the messages in consumerRecords.forEach.
However, keep in mind that if you don't call poll "fast enough" your consumer will be considered dead by the broker coordinator and a rebalance will be triggered.
This "fast enough" is determined by the property max.poll.interval.ms in kafka >= 0.10.1.0. See this answer for more details.
max.poll.interval.ms default value is five minutes, so if your consumerRecords.forEach takes longer than that your consumer will be considered dead.
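As a rough way to turn that limit into a batch size, the sketch below divides the poll interval by the worst-case time per record and keeps a safety margin. It is plain arithmetic, not part of the Kafka client; `PollBudget`, `maxRecordsFor`, and `safetyFactor` are hypothetical names, and the 20-second per-record figure is an assumption chosen for illustration.

```java
import java.time.Duration;

/** Rough sizing helper for max.poll.records; all names here are illustrative. */
final class PollBudget {

    /**
     * Largest batch size whose worst-case processing time still fits inside
     * max.poll.interval.ms, using only a fraction (safetyFactor) of the interval.
     */
    static int maxRecordsFor(Duration maxPollInterval, Duration worstCasePerRecord, double safetyFactor) {
        long perRecordMillis = worstCasePerRecord.toMillis();
        if (perRecordMillis <= 0) {
            throw new IllegalArgumentException("worstCasePerRecord must be at least 1 ms");
        }
        if (safetyFactor <= 0 || safetyFactor > 1) {
            throw new IllegalArgumentException("safetyFactor must be in (0, 1]");
        }
        long budgetMillis = (long) (maxPollInterval.toMillis() * safetyFactor);
        long records = budgetMillis / perRecordMillis;
        // Never go below one record per poll, and stay within int range.
        return (int) Math.max(1, Math.min(records, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Default interval of 5 minutes, assumed worst case of 20 s per record, half the interval as margin.
        int records = maxRecordsFor(Duration.ofMinutes(5), Duration.ofSeconds(20), 0.5);
        System.out.println("max.poll.records <= " + records); // prints max.poll.records <= 7
    }
}
```

If the resulting batch size is too small to be useful, the other lever is raising max.poll.interval.ms itself.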
If you don't want to use the raw KafkaConsumer directly, you could use Alpakka Kafka, a library for consuming from and producing to Kafka topics in a safe and backpressured way (it is based on Akka Streams).
With this library, the frequency of poll is determined by the configuration setting akka.kafka.consumer.poll-interval.
We say it is safe because it keeps polling, so the consumer is not considered dead even when your processing can't keep up with the incoming rate. It is able to do this because KafkaConsumer allows pausing the consumer:
/**
 * Suspend fetching from the requested partitions. Future calls to {@link #poll(Duration)} will not return
 * any records from these partitions until they have been resumed using {@link #resume(Collection)}.
 * Note that this method does not affect partition subscription. In particular, it does not cause a group
 * rebalance when automatic assignment is used.
 * @param partitions The partitions which should be paused
 * @throws IllegalStateException if any of the provided partitions are not currently assigned to this consumer
 */
@Override
public void pause(Collection<TopicPartition> partitions) { ... }
To fully understand this you should read about akka-streams and backpressure.

How do I concurrently process Reactor Kafka Streams by Topic and Partition with Auto Acknowledgement?

I am trying to achieve concurrent processing of Kafka Topic-Partitions using Reactor Kafka with auto-acknowledgement. The documentation here makes it seem like this is possible:
http://projectreactor.io/docs/kafka/milestone/reference/#concurrent-ordered
The only difference between that and what I am attempting is I am using auto-acknowledgement.
I have the following code (relevant method is receiveAuto):
public class KafkaFluxFactory<K, V> {
    private final Map<String, Object> properties;
    public KafkaFluxFactory(Map<String, Object> properties) {
        this.properties = properties;
    }
    public Flux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
        return KafkaReceiver.create(ReceiverOptions.create(properties).subscription(topics))
                .receiveAutoAck()
                .flatMap(flux -> flux.groupBy(this::extractTopicPartition))
                .flatMap(topicPartitionFlux -> topicPartitionFlux.publishOn(scheduler));
    }
    private TopicPartition extractTopicPartition(ConsumerRecord<K, V> record) {
        return new TopicPartition(record.topic(), record.partition());
    }
}
When I use this to create a Flux of Consumer Records from Kafka with a parallel Scheduler (Schedulers.newParallel("debug", 10)), I see that they all end up getting processed on the same Thread.
Any thoughts on what I may be doing wrong?
After quite a bit of trial-and-error plus some rethinking of what I want to accomplish I realized I was trying to solve two problems in one bit of code.
The two things I need are:
In-order processing of Kafka Partitions
Ability to parallelize the processing of each partition
In trying to solve both with this piece of code, I was limiting downstream users' abilities to configure the level of parallelization. I therefore changed the method to return a Flux of GroupedFluxes which provides downstream users with the correct granularity of determining what is parallelizable:
public Flux<GroupedFlux<TopicPartition, ConsumerRecord<K, V>>> receiveAuto(Collection<String> topics) {
    return KafkaReceiver.create(createReceiverOptions(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux.groupBy(this::extractTopicPartition));
}
Downstream, users are able to parallelize each emitted GroupedFlux using whatever Scheduler they wish:
public <V> void work(Flux<GroupedFlux<TopicPartition, V>> flux) {
    flux.doOnNext(groupPublisher -> groupPublisher
                    .publishOn(Schedulers.elastic())
                    .subscribe(this::doWork))
            .subscribe();
}
This has the desired behavior processing each TopicPartition-GroupedFlux in-order and parallel to other GroupedFluxes.
I guess it executes sequentially, at least in your consumer. To consume in parallel you should convert your Flux to a ParallelFlux:
public ParallelFlux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
    return KafkaReceiver.create(ReceiverOptions.create(properties).subscription(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux.groupBy(this::extractTopicPartition))
            .flatMap(topicPartitionFlux -> topicPartitionFlux.parallel().runOn(Schedulers.parallel()));
}
Then, in your consumer function, if you want to consume in parallel you should use a method such as:
void subscribe(Consumer<? super T> onNext, Consumer<? super Throwable> onError,
        Runnable onComplete, Consumer<? super Subscription> onSubscribe)
Or any other overload that takes a Consumer<? super T> onNext argument.
If you just use the method below, you will consume the flux sequentially:
void subscribe(Subscriber<? super T> s)

Consuming messages from a Hazelcast Queue only once in a distributed environment

I have a similar question as this post:
Consume message only once from Topic per listeners running in cluster
When I tried using a queue to publish messages and added an item listener in two different JVMs, I am receiving the messages twice in both of them. I want to receive the message only once in a clustered/distributed environments.
Here's my code snippet:
Publishing of the message:
getQueue().add("some sample message");
I have the same listener configured in two different JVMs which goes like this:
public HazelcastQueueListener(){
    HazelcastInstance instance = HazelcastClient.newHazelcastClient(HazelClientConfig.getClientConfig());
    IQueue<String> queue1 = instance.getQueue("SAMPLEQUEUE");
    queue1.addItemListener(this, false);
}
public static void main(String args[]){
    HazelcastQueueListener listener = new HazelcastQueueListener();
}
@Override
public void itemAdded(ItemEvent<String> arg0) {
    // TODO Auto-generated method stub
    if(arg0!=null){
        System.out.println("Item coming out of queue 1" +arg0);
    }
    else{
        System.out.println("null");
    }
}
You have to poll the queue, like a standard java BlockingQueue in order to consume an item only once.
String item = queue1.take();
AFAIK, Hazelcast doesn't support asynchronous operation on queue. The ItemListener doesn't consume the item, it only notifies that an item is available.
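The difference between being notified and consuming can be sketched with the standard library alone. Hazelcast's IQueue extends java.util.concurrent.BlockingQueue, so the example below uses a LinkedBlockingQueue as a local stand-in (an assumption made to keep it runnable without a cluster); `CompetingConsumers` and `drain` are hypothetical names. Two workers call take() on the same queue, and each item ends up with exactly one of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Competing consumers on one BlockingQueue; a local stand-in for a Hazelcast IQueue. */
final class CompetingConsumers {

    /** Starts the given number of workers that take() from the queue; returns every item consumed. */
    static List<String> drain(BlockingQueue<String> queue, int workers, int expectedItems) {
        List<String> consumed = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(expectedItems);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take(); // removes the item, so no other worker sees it
                        consumed.add(item);
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit quietly when asked to stop
                }
            });
            t.setDaemon(true);
            t.start();
            threads.add(t);
        }
        try {
            done.await(5, TimeUnit.SECONDS); // wait until all expected items were taken
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        threads.forEach(Thread::interrupt);
        return consumed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(List.of("a", "b", "c"));
        List<String> consumed = drain(queue, 2, 3);
        System.out.println(consumed.size() + " items consumed, none twice");
    }
}
```

Which worker receives which item is not deterministic; only the once-only delivery is.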

Apache Curator LeaderSelector: How to avoid giving up leadership by not exiting from takeLeadership() method?

I'm trying to implement a simple leader election based system where my main business logic of the application runs on the elected leader node. As part of acquiring leadership, the main business logic starts various other services. I'm using Apache Curator LeaderSelector recipe to implement the leader selection process.
In my system, the node which gets selected as the leader keeps the leadership until failure forces another leader to be selected. In other words, once I get the leadership I don't want to relinquish it.
According to the Curator LeaderSelector documentation, leadership is relinquished when the takeLeadership() method returns. I want to avoid that, so right now I just block the return with a wait loop.
My question is:
Is this the right way to implement leadership?
Is the wait loop (as shown in the code example below) the right way to block?
public class MainBusinessLogic extends LeaderSelectorListenerAdapter {
    private static final String ZK_PATH_LEADER_ROOT = "/some-path";
    private final CuratorFramework client;
    private final LeaderSelector leaderSelector;
    public MainBusinessLogic() {
        client = CuratorService.getInstance().getCuratorFramework();
        leaderSelector = new LeaderSelector(client, ZK_PATH_LEADER_ROOT, this);
        leaderSelector.autoRequeue();
        leaderSelector.start();
    }
    @Override
    public void takeLeadership(CuratorFramework client) throws IOException {
        // Start various other internal services...
        ServiceA serviceA = new ServiceA(...);
        ServiceB serviceB = new ServiceB(...);
        ...
        ...
        serviceA.start();
        serviceB.start();
        ...
        ...
        // We are done but need to keep leadership to this instance, else all the business
        // logic and services will start on another node.
        // Is this the right way to prevent relinquishing leadership???
        while (true) {
            synchronized (this) {
                try {
                    wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Instead of the wait(), you can just do:
Thread.currentThread().join();
But, yes, that's correct.
BTW - if you prefer a different method, you can use LeaderLatch with a LeaderLatchListener.
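One detail worth illustrating is what happens on interruption. As far as I understand it, LeaderSelectorListenerAdapter cancels leadership when the connection is suspended or lost, and the selector then interrupts the thread inside takeLeadership(); the loop in the question catches that interrupt and simply waits again. The sketch below is plain Java without Curator, and `HoldLeadership`, `holdUntilInterrupted`, and the cleanup callback are hypothetical names. It blocks with Thread.currentThread().join() and lets an interrupt end the hold so that cleanup can run.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Plain-Java sketch of "block until interrupted"; it does not use Curator itself. */
final class HoldLeadership {

    /**
     * Blocks the calling thread until it is interrupted, then runs the cleanup.
     * In a real listener this would be the tail of takeLeadership(), and the
     * cleanup would stop the services started earlier (hypothetical here).
     */
    static void holdUntilInterrupted(Runnable cleanup) {
        try {
            Thread.currentThread().join(); // a thread joining itself waits until interrupted
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
        } finally {
            cleanup.run();
        }
    }

    /** Runs the hold on a worker thread, interrupts it, and reports whether cleanup ran. */
    static boolean demo() {
        AtomicBoolean cleanedUp = new AtomicBoolean(false);
        Thread leader = new Thread(() -> holdUntilInterrupted(() -> cleanedUp.set(true)));
        leader.start();
        leader.interrupt(); // stands in for the selector cancelling leadership
        try {
            leader.join(5000); // wait for the worker to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cleanedUp.get();
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran: " + demo());
    }
}
```

Returning from the method after the interrupt is what lets another node take over, which matches the "until failure forces another leader" requirement in the question.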