Kafka incremental sticky rebalancing - kubernetes

I am running Kafka on Kubernetes using the Kafka Strimzi operator. I am using incremental sticky rebalance strategy by configuring my consumers with the following:
ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
org.apache.kafka.clients.consumer.CooperativeStickyAssignor.class.getName()
Each time I scale the consumers in my consumer group, all existing consumers in the group throw the following exception:
Exception in thread "main" org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation
Any idea on what caused this exception and/or how to resolve it?
Thank you.

A consumer rebalance happens whenever the metadata of a consumer group changes.
Adding more consumers (scaling, in your words) to a group is one such change and triggers a rebalance. During the rebalance, partitions are being re-assigned, so a consumer cannot know which offsets it may commit until the re-assignment is complete. The StickyAssignor does try to preserve the previous assignment as much as possible, but the rebalance is still triggered, and an even distribution of partitions takes precedence over retaining the previous assignment. (Reference - Kafka Documentation)
Beyond that, the exception message is self-explanatory: while a rebalance is in progress, some operations, such as committing offsets, are prohibited.
How to avoid such situations?
This is a tricky one because Kafka needs rebalancing to be able to work effectively. There are a few practices you could use to avoid unnecessary impact:
Increase the maximum allowed time between polls - max.poll.interval.ms - so that these exceptions become less likely.
Decrease the amount of data returned per poll - max.poll.records or max.partition.fetch.bytes
Use the latest version(s) of Kafka (or upgrade if you're on an old one), since many recent releases have improved the rebalance protocol
Use the static membership protocol to reduce rebalances
Consider configuring group.initial.rebalance.delay.ms (a broker setting) for empty consumer groups (either on a first-time deployment or when destroying everything and redeploying)
These techniques can only help you reduce the unnecessary behaviour or exception but will NOT prevent rebalance completely.
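As a rough illustration of the client-side settings listed above, the sketch below collects them in a plain java.util.Properties object using the literal config keys. The class name and every numeric value are illustrative assumptions, not recommendations; suitable values depend on how long your processing takes. The resulting Properties would be passed to the KafkaConsumer constructor.

```java
import java.util.Properties;

final class RebalanceFriendlyConsumerConfig {

    // Builds consumer properties. All values are assumptions chosen for illustration.
    static Properties build() {
        Properties props = new Properties();
        // Assignor from the question (incremental cooperative rebalancing).
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Allow more time between poll() calls before the member is considered failed.
        props.setProperty("max.poll.interval.ms", "600000");
        // Return fewer records per poll() so each batch is processed sooner.
        props.setProperty("max.poll.records", "100");
        // Cap the number of bytes fetched per partition.
        props.setProperty("max.partition.fetch.bytes", "524288");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be compared with the running consumer's config.
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```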

Related

Can a kafka consumer group freeze during a rebalance

Can a rolling deployment of a Kafka consumer group cause the group to freeze?
So let's consider this scenario,
we start a rolling deployment
one consumer leaves the group
Kafka notices this and triggers a rebalance (hence consumption stops)
rebalance happens but soon a new consumer wants to join
also another consumer leaves
again a new rebalance happens
(loop till deployment is complete)
So if you have a large enough cluster and the deployment takes some time to complete on each machine (which is usually the case), will this lead to a complete freeze in consumption?
If yes, what are the strategies for updating a consumer group in production?
Yes, that's definitely possible. There have been a number of recent improvements to mitigate the downtime during events like this. I'd recommend enabling one or both of the following features:
Static membership was added in 2.3 and can prevent a rebalance from occurring when a known member of the group is bounced. This requires both the client and the broker to be on version 2.3+
Incremental cooperative rebalancing enables the group to have faster rebalances AND allows individual members to continue consuming throughout the rebalance. You'll still see rebalances during a rolling deployment but they won't result in a complete freeze in consumption for the duration. This is completely client side so it will work with any brokers, but your clients should be on version 2.5.1+
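A minimal sketch of combining the two features on the client side, using literal config keys in a java.util.Properties object. The class name, the 60000 ms session timeout, the group name and the use of the HOSTNAME environment variable are assumptions for illustration. The group.instance.id must be unique per consumer and stable across restarts; on Kubernetes a StatefulSet pod name has that property, while a Deployment pod name changes on every restart.

```java
import java.util.Properties;

final class StaticMemberConfig {

    // instanceId must be unique per consumer and must survive a restart unchanged.
    static Properties build(String groupId, String instanceId) {
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        // Setting group.instance.id turns the consumer into a static member.
        props.setProperty("group.instance.id", instanceId);
        // Assumed value: should cover the time a bounced instance needs to come back.
        props.setProperty("session.timeout.ms", "60000");
        // Incremental cooperative rebalancing.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        // Assumption: HOSTNAME carries a stable pod name; "consumer-0" is a fallback for local runs.
        String instanceId = System.getenv().getOrDefault("HOSTNAME", "consumer-0");
        System.out.println(build("my-group", instanceId));
    }
}
```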

Static membership in Apache Kafka for consumers

I learned that recent versions of Kafka offer a static membership strategy for consumer subscription, instead of the earlier dynamic membership detection, which helps in the scenario where a consumer is bounced as part of a rolling deployment. When the consumer comes back up after being bounced, it picks up the same partitions and resumes processing.
My question is: what happens if we deliberately shut down a consumer? How will the messages in the partitions assigned to that consumer get processed?
After a consumer has been shut down, the consumer group will undergo a normal rebalance once the consumer's session.timeout.ms has elapsed.
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#kafka-consumers-and-producer-configuration-parameters
When configuring Static Membership, it is important to increase the session.timeout.ms higher than the default of 10000 so that consumers are not prematurely rebalanced. Set this value high enough to allow workers time to start, restart, deploy, etc. Otherwise, the application may get into a restart cycle if it misses too many heartbeats during normal operations. Setting it too high may cause long periods of partial unavailability if a worker dies, and the workload is not rebalanced. Each application will set this value differently based on its own availability needs.
If you subscribe manually, you have to deal with that scenario in your application code. That is the advantage of automatic subscription: after a rebalance, every partition is assigned to some member of the group.
To cater for consumers permanently leaving the group with manual subscription, I guess you would need to track subscriptions somewhere and maybe have each consumer pinging to let you know it is alive.
I'm not sure what use cases the manual subscription is catering for - I will have to go back and check the Javadoc in KafkaConsumer, which is pretty comprehensive. As long as you have no local state in consumers the automatic subscription seems much safer and more resilient.

Suspending Camel KafkaConsumer

My app has N instances running. The number of instances is always greater than the number of Kafka partitions, e.g. 6 instances in a consumer group consuming from 4 Kafka partitions, so only 4 of the instances are actually consuming at any point.
In this context, can I suspend a Kafka consumer Camel route without causing Kafka to attempt to rebalance to other potential consumers? My understanding is that the suspended route would stop polling, causing the others to pick up the load.
This is not a Camel question but a Kafka question. Rebalancing is handled by Kafka and is triggered whenever a consumer explicitly leaves the consumer group or silently dies (stops sending heartbeats).
Kafka 2.3 introduced a new feature called "Static Membership" to avoid rebalancing just because of a consumer restart.
But in your case (another consumer must take the load of a leaving consumer) I think Kafka must trigger a rebalancing over all consumers due to the protocol used.
See also this article for a quite deep dive into rebalancing and its trade-offs between availability and fault-tolerance.
Edit due to comments
If you want to avoid rebalancing, I think you would have to increase both session.timeout.ms (heartbeat timeout) and max.poll.interval.ms (processing timeout).
But even if you set them very high I guess it would not work reliably because route suspension could still happen just before a heartbeat (simply bad timing).
See this q&a for the difference between session.timeout.ms and max.poll.interval.ms.

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after a few minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased session timeout, tried to change group id and it still does the same thing.
Also is the client version of Kafka consumer a big deal?
I'd suggest decoupling the consumer from the processing logic, to start with. For example, let the Kafka consumer only poll messages and, after sanitizing them if necessary, delegate the actual processing of each record to a separate thread, then see whether the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
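The hand-off described above could look roughly like the following sketch. It deliberately uses only the JDK so that it runs on its own: the strings stand in for polled records, process is a hypothetical placeholder for the slow per-record work, and keepAlivePoll stands in for what would be a consumer.poll() call on paused partitions in a real application. With a real KafkaConsumer, the offsets for the batch would be committed only after handleBatch returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class DecoupledPollLoop {

    // Hypothetical stand-in for slow per-record work (database access, etc.).
    static String process(String value) throws InterruptedException {
        Thread.sleep(50);
        return value.toUpperCase();
    }

    // Submits one polled batch to the workers, then keeps invoking keepAlivePoll
    // until every record has been processed, so the polling thread never blocks for long.
    static List<String> handleBatch(List<String> batch, ExecutorService workers,
                                    Runnable keepAlivePoll) {
        List<Future<String>> pending = new ArrayList<>();
        for (String value : batch) {
            pending.add(workers.submit(() -> process(value)));
        }
        List<String> results = new ArrayList<>();
        try {
            while (!pending.stream().allMatch(Future::isDone)) {
                keepAlivePoll.run(); // real code: poll() while the partitions are paused
                Thread.sleep(10);
            }
            for (Future<String> future : pending) {
                results.add(future.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("record processing failed", e.getCause());
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            List<String> out = handleBatch(List.of("a", "b", "c"), workers,
                    () -> System.out.println("poll thread still responsive"));
            System.out.println(out);
        } finally {
            workers.shutdown();
        }
    }
}
```

The design choice here is that the polling thread remains the only thread touching the consumer, which matters because KafkaConsumer is not safe for multi-threaded access; the workers only ever see the record values.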

kafka consumer sessions timing out

We have an application in which a consumer reads a message and the thread does a number of things, including database accesses, before a message is produced to another topic. The time between consuming and producing the message on the thread can be several minutes. Once the message is produced to the new topic, a commit is done to indicate we are done with the work on the consumer queue message. Auto commit is disabled for this reason.
I'm using the high-level consumer, and I'm noticing that the ZooKeeper and Kafka sessions time out because it takes too long before we do anything on the consumer queue. Kafka therefore ends up rebalancing every time the thread goes back to read more from the consumer queue, and after a while it starts to take a long time before a consumer reads a new message.
I can set the ZooKeeper session timeout very high so that this is not a problem, but then I have to adjust the rebalance parameters accordingly, and Kafka won't pick up a new consumer for a while, among other side effects.
What are my options to solve this problem? Is there a way to heartbeat to Kafka and ZooKeeper to keep both happy? Would I still have these same issues if I were to use a simple consumer?
It sounds like your problems boil down to relying on the high-level consumer to manage the last-read offset. Using a simple consumer would solve that problem since you control the persistence of that offset. Note that all the high-level consumer commit does is store the last read offset in zookeeper. There's no other action taken and the message you just read is still there in the partition and is readable by other consumers.
With the Kafka simple consumer, you have much more control over when and how that offset storage takes place. You can even persist that offset somewhere other than ZooKeeper (a database, for example).
The bad news is that while the simple consumer itself is simpler than the high-level consumer, there's a lot more work you have to do code-wise to make it work. You'll also have to write code to access multiple partitions - something the high-level consumer does quite nicely for you.
I think the issue is that the consumer's poll method is what triggers the consumer's heartbeat request, so even when you increase the session timeout, the consumer's heartbeat may not reach the coordinator. Because of the skipped heartbeats, the coordinator marks the consumer as dead. Rejoining is also very slow, especially in the case of a single consumer.
I faced a similar issue, and to solve it I had to change the following parameters in the consumer config properties:
session.timeout.ms=
request.timeout.ms=more than session timeout
You also have to add the following property to server.properties on the Kafka broker node:
group.max.session.timeout.ms =
You can see the following link for more detail.
http://grokbase.com/t/kafka/users/16324waa50/session-timeout-ms-limit
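The relationship between these three settings can be expressed as a small check. This is only a sketch of the constraints as stated in this answer: request.timeout.ms greater than the session timeout, and the session timeout no greater than the broker's group.max.session.timeout.ms. The class name, method name and the numbers in main are illustrative assumptions, since the answer does not give concrete values.

```java
final class TimeoutCheck {

    // True when the client timeouts are consistent with each other and with the broker limit.
    static boolean isConsistent(long sessionTimeoutMs, long requestTimeoutMs,
                                long brokerMaxSessionTimeoutMs) {
        boolean withinBrokerLimit = sessionTimeoutMs <= brokerMaxSessionTimeoutMs;
        boolean requestOutlastsSession = requestTimeoutMs > sessionTimeoutMs;
        return withinBrokerLimit && requestOutlastsSession;
    }

    public static void main(String[] args) {
        // Assumed example values, in milliseconds.
        System.out.println(isConsistent(120_000, 130_000, 300_000));
        // Session timeout above the broker limit.
        System.out.println(isConsistent(400_000, 410_000, 300_000));
    }
}
```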