Kafka Stream - CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member - apache-kafka

I am running a Kafka Streams application which consumes data from 2 topics and outputs the joined/merged result into a third topic.
The Kafka topics have 15 partitions and a replication factor of 3. We have 5 Kafka brokers and 5 ZooKeeper nodes.
I am running 15 instances of the Kafka Streams application so that each instance can have 1 partition.
Kafka version-
I am getting the below exception in my kafka stream application:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing the session timeout or by reducing the maximum size of
batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:725)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:604)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1173)
at org.apache.kafka.streams.processor.internals.StreamTask.commitOffsets(StreamTask.java:307)
at org.apache.kafka.streams.processor.internals.StreamTask.access$000(StreamTask.java:49)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:268)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.StreamTask.commitImpl(StreamTask.java:259)
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(StreamTask.java:362)
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(StreamTask.java:346)
at org.apache.kafka.streams.processor.internals.StreamThread$3.apply(StreamThread.java:1118)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnStreamTasks(StreamThread.java:1448)
at org.apache.kafka.streams.processor.internals.StreamThread.suspendTasksAndState(StreamThread.java:1110)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1800(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsRevoked(StreamThread.java:218)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:422)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:353)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
2017-08-09 14:50:49 - [ERROR] [click-live-StreamThread-1]
Can someone please help me understand what the cause could be and how to solve it?
Also, when one of my Kafka brokers is down, my Kafka Streams application does not connect to the other brokers. Why?
I have set brokers.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092

Based on the information, this is the most likely solution route:
Try to follow the suggestions in the message:
"You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records"

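A minimal sketch of what following that suggestion might look like as client settings. The property names are standard Kafka client configuration keys, but the values (100 records, 10 minutes) are illustrative assumptions rather than recommendations, and the class name `PollTuning` is hypothetical. It also assumes that `brokers.list` in the question is an application-level key, since the Kafka clients themselves read the broker list from `bootstrap.servers`.

```java
import java.util.Properties;

final class PollTuning {

    // Builds consumer-level overrides. All values are illustrative assumptions.
    static Properties consumerOverrides() {
        Properties props = new Properties();
        // Assumption: "brokers.list" from the question is mapped by the application
        // onto the client key "bootstrap.servers".
        props.setProperty("bootstrap.servers",
                "broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092");
        // Fewer records per poll() means less work between two consecutive polls.
        props.setProperty("max.poll.records", "100");
        // Allow up to 10 minutes between polls before the member is removed from the group.
        props.setProperty("max.poll.interval.ms", "600000");
        return props;
    }

    public static void main(String[] args) {
        consumerOverrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would then be merged into whatever configuration object the application already passes to its Kafka client.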

Kafka Consumer death handling

I have a question regarding the handling of consumer death due to exceeding the timeout values.
My example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding, consuming messages in Kafka, very simplified, works as follows:
1. consumer polls 100 records from topic
2. a heartbeat signal is sent to broker
3. processing records in progress
4. processing records completes
5. finalize processing (commit, do nothing etc.)
6. repeat #1-5 in a loop
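As a rough illustration, the loop in the steps above could be sketched as follows. This is not real consumer code: `poll` and `handler` are hypothetical stand-ins for KafkaConsumer.poll() and the per-record processing, the commit step is only a comment, and the sketch stops on an empty batch purely so that it terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class PollLoopSketch {

    // poll:    hypothetical stand-in for KafkaConsumer.poll()
    // handler: hypothetical per-record processing
    // Returns the records whose processing threw an exception.
    static List<String> run(Supplier<List<String>> poll, Consumer<String> handler) {
        List<String> failed = new ArrayList<>();
        while (true) {
            List<String> batch = poll.get();      // step 1: fetch a batch
            if (batch.isEmpty()) {
                break;                            // sketch only: a real loop keeps polling
            }
            for (String record : batch) {
                try {
                    handler.accept(record);       // steps 3-4: process the record
                } catch (RuntimeException e) {
                    failed.add(record);           // keep looping so the next poll still happens
                }
            }
            // step 5: an offset commit would go here
        }
        return failed;
    }

    public static void main(String[] args) {
        Iterator<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "bad", "c"),
                Collections.<String>emptyList()).iterator();
        List<String> failed = run(batches::next, record -> {
            if (record.equals("bad")) {
                throw new IllegalStateException("cannot process " + record);
            }
        });
        System.out.println("failed records: " + failed);
    }
}
```

The point of the try/catch inside the loop is that one failing record does not end the loop, so the application keeps calling poll.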
My question is: what happens if the time between heartbeats is longer than the configured session.timeout.ms? I understand that if the session times out, the broker initiates a re-balance, the consumer whose processing took longer than session.timeout.ms is marked as dead, and a different consumer is assigned to that partition.
Okay, but what then...?
1. Is that long-processing consumer removed/unsubscribed from the topic, leaving my application with 9 working consumers? What if all the consumers exceed the timeout and are all considered dead: am I left with a running application which does nothing because there are no consumers?
2. When the long-processing consumer finishes processing after the re-balance has already taken place, does the broker initiate a re-balance again and assign the consumer a partition anew? As I understand it, the consumer continues running #1-5 in a loop, and sending a heartbeat to the broker also starts the process of adding the consumer back to the consumer group from which it was removed after being marked dead. Is that correct?
3. Does the application throw some sort of exception indicating that session.timeout.ms was exceeded, so that processing is abruptly stopped?
4. Also, what about the max.poll.interval.ms property? What if we exceed even that period and consumer X finishes processing after max.poll.interval.ms? The consumer has already exceeded session.timeout.ms, was excluded from the consumer group and marked dead, so what difference does this property make when configuring a Kafka consumer?
We have a process which extracts data for processing, and this extraction consists of 50+ SQL queries (the majority being SELECTs, a few UPDATEs). They usually run fast, but of course it all depends on the DB load, possible locks etc., and there is a possibility that the processing takes longer than the session timeout. I do not want to keep increasing the session timeout until "I hit the spot". The process is idempotent; if it is repeated X times within X minutes, we do not care.
Please find the answers below.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout.ms, you will be left with zero consumer instances, and the consumer application is effectively dead unless you restart it.
#2. This depends on how you write your consumer code with respect to poll() and the iteration over consumer records. If you have a proper while(true) loop with a try/catch inside, your consumer will be able to rejoin the consumer group after processing that long-running record.
#3. You will end up with the commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Again, whether the consumer automatically rejoins the consumer group depends on your code.
#4. Answer lies here
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms controls
how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modified
together: heartbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 seconds,
heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.
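The relationship between the two properties described in the passage can be written down as a small check. This is only a sketch of the rule of thumb quoted above (heartbeat interval below the session timeout, roughly one-third of it); the class and method names are hypothetical and nothing here talks to Kafka.

```java
final class TimeoutCheck {

    // Rule of thumb from the passage: about one-third of the session timeout.
    static long suggestedHeartbeatMs(long sessionTimeoutMs) {
        return sessionTimeoutMs / 3;
    }

    // True when the heartbeat interval is positive and strictly below the session timeout.
    static boolean isConsistent(long sessionTimeoutMs, long heartbeatIntervalMs) {
        return heartbeatIntervalMs > 0 && heartbeatIntervalMs < sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 10_000;    // value from the question
        long heartbeatIntervalMs = 2_000;  // value from the question
        System.out.println("consistent: " + isConsistent(sessionTimeoutMs, heartbeatIntervalMs));
        System.out.println("suggested heartbeat (ms): " + suggestedHeartbeatMs(sessionTimeoutMs));
    }
}
```

With the values from the question (10 s session timeout, 2 s heartbeat interval), the heartbeat interval is already below the one-third guideline.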

Can offsets be skipped (almost 12000 in some cases) for a particular partition in the Kafka broker?

Kafka image: confluentinc/cp-kafka:5.2.1
Kafka client: apache.kafka.client 2.5.0
Load: one thread is handling 4 partitions
We are noticing that some offsets are getting skipped (missing) in a partition. At first I thought it was an issue on the consumer side, but the logs of both consumer groups show the offsets being skipped. (Another observation: the consumer thread takes a significant amount of time to jump over the skipped offsets, which causes lag.) This happened when the load on the cluster was high. We are not using Kafka transactions or idempotent producer configuration. What are the possible reasons for this?
Producer properties:
Consumer properties:
- There were no consumer rebalances during this time; I checked the logs.
- We have two consumer groups (in different data centers) and both groups skipped these offsets, so I am ruling out an issue on the consumer side.
- Both showed the same pattern: e.g. both consumer threads stopped consuming at offset 112300 and, after 30 minutes, resumed consuming after skipping 12k offsets. Meanwhile the threads kept consuming other partitions. This only happened for 3-4 partitions.
- So what I'm wondering is: is it normal to have such huge offset gaps (during high load)? I didn't find anything concrete when going through the docs. And what can cause this issue: do the ghost offsets come from the broker side or the producer side?

Kafka behavior during partition re-balancing

Given the following scenario: There is a Kafka (2.1.1) topic with 2 partitions and one consumer. A producer sends a message with keyX to Kafka which ends up on partition 2. The consumer starts processing this message. At the same time a new consumer is starting up and Kafka re-balances the topic. Consumer 1 is now responsible only for partition 1, consumer 2 is responsible for partition 2. The producer sends a message again with the same keyX, this time it will be consumer 2 which processes the message.
Consumer 2 might be processing the message, while consumer 1 has not finished yet.
My question is whether this is a realistic scenario or not, since it might be a problem for me if different consumers would process a message with the same key at the same time.
Any thought on this is welcome, thanks a lot!
Yes, it's a realistic scenario. During a rebalance, consumer 1 will close all of its existing connections. In your case, consumer 1 will close its connections to partitions 1 and 2, so it may not have committed the offset of the message it was processing. This may depend on whether you have configured your consumer with the property enable.auto.commit set to true. With this property set to true, the consumer periodically commits its current offset. The period is defined by auto.commit.interval.ms.
You can also be notified when a rebalance occurs via the consumer listener ConsumerRebalanceListener. It lets you know when a partition is revoked or reassigned.
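For reference, the two auto-commit properties mentioned above could be set like this. The keys are the standard consumer configuration names; the helper `autoCommit` and the 5-second interval are assumptions made for illustration only.

```java
import java.util.Properties;

final class AutoCommitConfig {

    // Returns consumer properties that switch periodic offset commits on or off.
    static Properties autoCommit(boolean enabled, long intervalMs) {
        Properties props = new Properties();
        props.setProperty("enable.auto.commit", Boolean.toString(enabled));
        if (enabled) {
            // Only meaningful when auto-commit is on.
            props.setProperty("auto.commit.interval.ms", Long.toString(intervalMs));
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(autoCommit(true, 5_000));
        System.out.println(autoCommit(false, 5_000));
    }
}
```

With auto-commit disabled, the application has to commit offsets itself, which gives it control over whether an offset is committed before or after the record has been fully processed.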

Issue of Kafka Balancing at high load

Using Kafka version 2.11- to publish 10,000 messages (the total size of all messages is 10 MB), with 2 consumers (same group id) consuming the messages in parallel.
While consuming, the same message was consumed by both consumers.
The errors/warnings below were thrown by Kafka:
WARN: This member will leave the group because consumer poll timeout
has expired. This means the time between subsequent calls to poll()
was longer than the configured max.poll.interval.ms, which typically
implies that the poll loop is spending too much time processing
messages. You can address this either by increasing
max.poll.interval.ms or by reducing the maximum size of batches
returned in poll() with max.poll.records.
INFO: Attempt to heartbeat failed since group is rebalancing
INFO: Sending LeaveGroup request to coordinator
WARN: Synchronous auto-commit of offsets
{ingest-data-1=OffsetAndMetadata{offset=5506, leaderEpoch=null,
metadata=''}} failed: Commit cannot be completed since the group has
already rebalanced and assigned the partitions to another member. This
means that the time between subsequent calls to poll() was longer than
the configured max.poll.interval.ms, which typically implies that the
poll loop is spending too much time message processing. You can
address this either by increasing max.poll.interval.ms or by reducing
the maximum size of batches returned in poll() with max.poll.records.
The configurations below were provided to Kafka:
What should be changed to resolve the duplicate consumption?
Are your consumers in the same group? If yes, you will get duplicate consumption if a consumer leaves, dies or times out without having committed some of the messages it has processed.
If all your messages are consumed by both consumers, you probably have not set the same group id for them.
More info:
So you have set the same group id for all consumers, good. You are in the situation where the cluster/broker thinks that a consumer died and therefore rebalances the load to another one. This other one will start consuming where the last commit was done.
So let's say consumer C_A read offsets up to 100 from partition P_1, processed them and committed '100', then read offsets up to 200 and processed them, but could not commit because the broker considered C_A dead.
The broker reassigns partition P_1 to consumer C_B, which starts from the last commit for the group (100), reads up to 200, processes those records and commits 200.
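The C_A / C_B scenario can be replayed with a toy model to see where the duplicates come from. This is a simulation under simplifying assumptions (one partition, a single committed offset, no real Kafka involved); the class `ReplaySketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ReplaySketch {

    private long committed = 0;                                   // group's committed offset for P_1
    private final Map<Long, Integer> timesProcessed = new HashMap<>();

    // Processes offsets [committed, upTo) and commits only if the member is still in the group.
    void consume(long upTo, boolean stillInGroup) {
        for (long offset = committed; offset < upTo; offset++) {
            timesProcessed.merge(offset, 1, Integer::sum);
        }
        if (stillInGroup) {
            committed = upTo;
        }
    }

    // Number of offsets that were processed more than once.
    long duplicates() {
        return timesProcessed.values().stream().filter(n -> n > 1).count();
    }

    long committed() {
        return committed;
    }

    static ReplaySketch scenario() {
        ReplaySketch p1 = new ReplaySketch();
        p1.consume(100, true);    // C_A processes 0..99 and commits 100
        p1.consume(200, false);   // C_A processes 100..199, but its commit is rejected
        p1.consume(200, true);    // C_B restarts from 100, processes 100..199, commits 200
        return p1;
    }

    public static void main(String[] args) {
        ReplaySketch p1 = scenario();
        System.out.println("committed=" + p1.committed() + ", duplicates=" + p1.duplicates());
    }
}
```

In this model, every offset between the last successful commit and the point C_A had reached is processed twice, which matches the duplicate consumption described in the question.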
So your question is how to avoid that the consumer is considered as dead (I assume it is not dead)?
The answer is already in the yellow WARN message in your question: you can tell your consumer to consume fewer messages per poll (max.poll.records) to reduce the processing time between two polls, and/or you can increase max.poll.interval.ms so that the consumer is given more time before it is considered dead.
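A back-of-the-envelope way to pick the two values is to compare the worst-case time for one batch against the poll interval. The sketch below assumes that processing time grows linearly with the batch size; the per-record time of 1000 ms and the class name `PollBudget` are made-up illustrations.

```java
final class PollBudget {

    // Worst-case time spent processing one full batch, assuming linear cost per record.
    static long worstCaseBatchMs(long maxPollRecords, long worstCaseMsPerRecord) {
        return maxPollRecords * worstCaseMsPerRecord;
    }

    // True when a full batch can be processed before max.poll.interval.ms expires.
    static boolean fits(long maxPollRecords, long worstCaseMsPerRecord, long maxPollIntervalMs) {
        return worstCaseBatchMs(maxPollRecords, worstCaseMsPerRecord) < maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000;   // 5 minutes
        long msPerRecord = 1_000;           // assumed worst case per record
        System.out.println("500 records fit: " + fits(500, msPerRecord, maxPollIntervalMs));
        System.out.println("100 records fit: " + fits(100, msPerRecord, maxPollIntervalMs));
    }
}
```

If a full batch does not fit, either max.poll.records has to go down or max.poll.interval.ms has to go up, as the WARN message says.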

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased session timeout, tried to change group id and it still does the same thing.
Also is the client version of Kafka consumer a big deal?
I'd suggest decoupling the consumer from the processing logic, to start with. E.g. let the Kafka consumer only poll messages and, after sanitizing them if necessary, delegate the actual processing of each record to a separate thread; then see if the same error still occurs. The error says you are spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10.1, where heartbeats were only sent from within poll(), which could make this issue easier to reproduce.
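A minimal sketch of that hand-off, using only the JDK. The polling side is left out: `handOff` is meant to be called from whatever thread runs the poll loop, and `handler` is a hypothetical stand-in for the real processing. Two caveats that the sketch does not solve: KafkaConsumer is not safe for multi-threaded access, so polling and committing should stay on the polling thread, and offsets should only be committed once the worker has actually finished the corresponding records.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class HandOffSketch {

    // A single worker thread keeps records in the order they were handed off.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Consumer<String> handler;

    HandOffSketch(Consumer<String> handler) {
        this.handler = handler;
    }

    // Called from the polling thread; returns immediately so the next poll is not delayed.
    void handOff(List<String> batch) {
        for (String record : batch) {
            worker.execute(() -> handler.accept(record));
        }
    }

    // Stops accepting work and waits for queued records; false on timeout or interrupt.
    boolean drain(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> processed = new CopyOnWriteArrayList<>();
        HandOffSketch sketch = new HandOffSketch(processed::add);
        sketch.handOff(Arrays.asList("r1", "r2"));
        System.out.println("drained: " + sketch.drain(5_000) + ", processed: " + processed);
    }
}
```

The queue behind the executor is unbounded here, so a real application would also need some form of back-pressure if the worker falls far behind the poller.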