CommitFailedException: Commit cannot be completed due to group rebalance - apache-kafka

I am using a Kafka 0.9.0.1 broker and the 0.9.0.1 consumer client. My consumer instances consume records with a processing time of less than 1 second. The other main configs are
enable.auto.commit=false
session.timeout.ms=30000
heartbeat.interval.ms=25000
I am committing offsets after processing.
I am getting the exception
Error UNKNOWN_MEMBER_ID occurred while committing offsets for group
kafka_to_s3
ERROR com.bsb.hike.analytics.consumer.Consumer - unable to commit
retryCount=2 org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed due to group rebalance
once or twice an hour. We are consuming approximately 6 billion events a day. It also seems like offsets are stored in only one partition of the "__consumer_offsets" topic, which increases the load on that particular broker.
Does anybody have a clue about these problems?

Kafka triggers a rebalance if it doesn't receive at least one heartbeat within the session timeout. If a rebalance is triggered, the commit will fail; that is expected. So the question is why the heartbeat did not happen. There might be a couple of reasons for that.
The first thing is that you are doing a manual commit. Starting with 0.9, the heartbeat doesn't happen in a separate thread. The consumer runs on a single thread which handles committing, heartbeating and polling, so a heartbeat is only sent when you call consumer.poll() or consumer.commitSync(). If your processing time exceeds the session timeout, that can cause heartbeats to be missed.
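For illustration, here is a minimal sketch of the single-threaded poll/process/commit loop described above, using the plain Java consumer. The broker address, topic, timeout values and the process() call are placeholders, not a definitive setup:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "kafka_to_s3");
    props.put("enable.auto.commit", "false");
    props.put("session.timeout.ms", "30000");
    props.put("heartbeat.interval.ms", "10000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("events"));

    while (true) {
        // In the 0.9 consumer, heartbeats are only sent from calls like poll() and commitSync()
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            process(record); // if this loop takes longer than session.timeout.ms, the group rebalances
        }
        consumer.commitSync(); // may throw CommitFailedException if a rebalance happened in the meantime
    }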
There is a known issue in kafka 0.9 consumer which might cause the problem you are facing.
https://issues.apache.org/jira/browse/KAFKA-3627
In either case, downgrading your consumer to 0.8 will solve the problem.
Edit: You can try increasing the session timeout to as high as 5 minutes and see if it works.
Regarding the Kafka configs:
The Kafka server expects to receive at least one heartbeat within the session timeout, so the consumer tries to heartbeat at most (session timeout / heartbeat interval) times within that window. Some heartbeats might be missed, so your heartbeat interval should not be more than 1/3 of the session timeout (you can refer to the docs).
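As a concrete (illustrative, not prescriptive) example of that guideline, the configs from the question could be adjusted so the heartbeat interval is at most a third of the session timeout:

    enable.auto.commit=false
    session.timeout.ms=30000
    # at most session.timeout.ms / 3; the original 25000 leaves almost no room for a retried heartbeat
    heartbeat.interval.ms=10000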

Related

Timeout of 60000ms expired before successfully committing the current consumed offsets

For testing purposes, I posted 5k messages on a Kafka topic, and in my Spring Batch application I use a pull method to read 100 messages every iteration; it runs for ~2 hours before it finishes.
At times I am facing the error below and execution stops.
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets
What could be the reason, and what is the fix?
Did you consume all the messages in two hours? If the application is still consuming, the MAX_POLL_INTERVAL_MS_CONFIG limit may be hit. The default is 5 minutes: if the interval between poll() calls is more than 5 minutes, the consumer will be kicked out of the consumer group and a rebalance is triggered.
During this process, the consumer group is unavailable.
I don't have enough information to be sure; this is just a direction for a solution.
Edit (2021-10-12): "If it is still consuming, MAX_POLL_INTERVAL_MS_CONFIG may be triggered" means that if consumption has not finished, the consumption rate is very slow, so this mechanism may kick in. You can adjust this parameter to verify.
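As a hedged sketch of that direction (the values are guesses you would need to tune for your workload, and props is assumed to be the java.util.Properties object passed to your consumer):

    import org.apache.kafka.clients.consumer.ConsumerConfig;

    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // allow up to 10 minutes between poll() calls
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50");         // smaller batches so each poll's work finishes sooner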

Is Poll call during kafka rebalancing a busy wait?

I am using manual Kafka commits by setting the property enable.auto.commit to false while initialising the Kafka consumer, and calling the commit manually after receiving and processing each message.
However, since processing a message in my consumer takes a long time, I am getting an exception with the message "error": "Broker: Group rebalance in progress".
The reason is that a commit issued after the rebalance timeout is rejected with this error. One recovery action is to exit and re-instantiate the process, which will trigger rebalancing and partition assignment again. Another way is to catch this exception and continue as usual, which will work correctly only if the poll() call blocks until the rebalance is complete; otherwise it would fetch the next record from the batch and might process and commit it successfully, leading to loss of the message whose commit failed during the rebalance.
So I need to know the correct way to handle this case: should I re-instantiate the process, or should I catch and ignore the exception?
The best approach is to ignore the error if it happens only occasionally; if it happens frequently, then reduce max.poll.records or increase max.poll.interval.ms to ensure it only happens occasionally. Also, ensure that your code can handle duplicate records (if you can't do that then there is a different answer).
The error you see is, as you probably realise, just because by the time the consumer committed, the group had decided that it had probably gone, and so its partitions were picked up by a different consumer as part of a rebalance - the new consumer would have started from the last committed offset, hence duplicates.
Given that the original consumer is alive and well it will no doubt poll again and so trigger another rebalance. This poll won't block waiting for the rebalance to occur - each poll allows for some communication about the current state of the group (within the polling thread), and after a number of polls the new allocation of partitions will be agreed and accepted, after which the rebalance is considered complete and that poll will tell the consumer its partition allocation and return a set of records.
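If you go the catch-and-ignore route, a minimal Java sketch (assuming your processing is idempotent, since duplicates are possible after a rebalance; the question appears to use a different client library, so treat this as illustrative only):

    import org.apache.kafka.clients.consumer.CommitFailedException;

    try {
        consumer.commitSync();
    } catch (CommitFailedException e) {
        // The group rebalanced before the commit landed; the partition's new owner will
        // re-read from the last committed offset, so just keep polling and processing.
        System.err.println("Commit failed due to rebalance, continuing: " + e.getMessage());
    }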

When does kafka consumer get evicted from the group?

I am using Spring Kafka and want to know when a Kafka consumer gets evicted from the group. Does it get evicted when the processing time taken is more than the poll interval? If yes, isn't the purpose of the heartbeat to indicate that the consumer is alive, and in that case shouldn't the consumer never be evicted unless the process itself fails?
You are correct that the heartbeat thread tells the group that the consumer process is still alive. The reason for additionally considering a consumer to be gone when there is excessive time between polls is to prevent livelock.
Without this, a consumer might never poll, and so would take partitions without making any progress through them.
The question then is really why there is a heartbeat and session timeout. The heartbeat thread is actually doing other stuff (pre-fetching) but I assume the reason it is used to check that consumers are alive is that it is generally talking to the broker more frequently than the polling thread as the latter has to process messages, and so a failed consumer process will be spotted earlier.
In short, there are 3 things that can trigger a rebalance: a change in the number of partitions at the broker end, polling taking longer than max.poll.interval.ms, and a gap between heartbeats longer than session.timeout.ms.
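If you want to observe exactly when your consumer loses or gains partitions (for example after being evicted for exceeding max.poll.interval.ms), one option is to register a ConsumerRebalanceListener when subscribing. A sketch, with a placeholder topic name:

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Called when a rebalance takes partitions away from this consumer;
            // a good place to commit whatever has been fully processed so far.
            System.out.println("Partitions revoked: " + partitions);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            System.out.println("Partitions assigned: " + partitions);
        }
    });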

Does kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers (with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy where, unless one of the processes commits, the other consumers (in different processes) do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? Also, how do I consume messages in parallel across processes?
If it is of any use, I am doing this on a Kubernetes pod, where the 3 processes are 3 different Mule instances.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Autocommit also makes little difference - it just means a commit is done synchronously during a subsequent poll rather than your application code doing it. The only real reason to manually commit is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed - doing this is generally not advisable - better to add consumers to increase throughput rather than trying to share out processing within a consumer.
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions and that committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused by either a change in partitions at the broker (presumably not the case) or a change in group membership, caused by either the heartbeat thread dying (a pod being stopped) or a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, then the new one will poll from that offset. If not, then the new one will poll from either the start of the partition or the high watermark, set by auto.offset.reset - the default is latest (high watermark).
So if a consumer polls but doesn't commit, and doesn't poll again for 5 minutes, then a rebalance happens, a new consumer picks up the partition and starts from the end (skipping any messages up to that point). Its first poll will return nothing as it is starting from the end. If it doesn't poll for 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause - there should be more information about what is going on in your logs - Kafka consumer code puts in plenty of helpful INFO level logging about rebalances.
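For reference, a minimal sketch of what each of the three processes could run, with the same group.id everywhere (the broker address, topic and group id are placeholders, and this assumes a reasonably recent Java client):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "shared-group");      // identical in all three processes
    props.put("auto.offset.reset", "earliest"); // don't skip messages when no offset has been committed yet
    props.put("enable.auto.commit", "true");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("my-topic"));
        while (true) {
            // keep each iteration well under max.poll.interval.ms, or the consumer is evicted
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            records.forEach(r -> System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset()));
        }
    }

With 20 partitions and 3 consumers in the same group, each consumer should end up with roughly 6-7 partitions once the group has stabilised.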

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Then it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group id, and it still does the same thing.
Also, does the client version of the Kafka consumer matter much?
To start with, I'd suggest you decouple the consumer from the processing logic. E.g. let the Kafka consumer only poll messages and, maybe after sanitizing the messages (if necessary), delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using; Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
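A rough sketch of that decoupling (the pool size and the process() call are placeholders; note that committing on the polling thread while workers are still busy can commit records that have not actually been processed, so processing should be idempotent or the offsets managed more carefully):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;

    ExecutorService workers = Executors.newFixedThreadPool(4);

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            workers.submit(() -> process(record)); // heavy work happens off the polling thread
        }
        // poll() is reached again quickly, so the poll interval / session timeout is not exceeded
    }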