Can a Kafka consumer group freeze during a rebalance?

Can a rolling deployment of a Kafka consumer group cause the group to freeze?
So let's consider this scenario,
we start a rolling deployment
one consumer leaves the group
Kafka notices this and triggers a rebalance (hence consumption stops)
rebalance happens but soon a new consumer wants to join
also another consumer leaves
again a new rebalance happens
(loop till deployment is complete)
So if you have a large enough cluster and it takes some time for the deployment to complete on each machine (which is usually the case), will this lead to a complete freeze in consumption?
If yes, what are the strategies for updating a consumer group in production?

Yes, that's definitely possible. There have been a number of recent improvements to mitigate the downtime during events like this. I'd recommend enabling one or both of the following features:
Static membership was added in 2.3 and can prevent a rebalance from occurring when a known member of the group is bounced. This requires both the client and the broker to be on version 2.3+.
Incremental cooperative rebalancing enables the group to have faster rebalances AND allows individual members to continue consuming throughout the rebalance. You'll still see rebalances during a rolling deployment, but they won't result in a complete freeze in consumption for the duration. This is completely client side, so it will work with any brokers, but your clients should be on version 2.5.1+.
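As a rough sketch of enabling both features on a Java consumer (the bootstrap server, group id, and group.instance.id values below are placeholders; the property constants are the standard ones from the Java client):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
// Static membership: a stable per-instance id lets a bounced consumer rejoin
// within session.timeout.ms without triggering a rebalance (brokers and clients on 2.3+).
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "my-service-instance-1");
// Incremental cooperative rebalancing: members keep consuming the partitions they
// retain while the rebalance runs (clients on version 2.5.1+, as noted above).
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());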

Related

Kafka incremental sticky rebalancing

I am running Kafka on Kubernetes using the Kafka Strimzi operator. I am using incremental sticky rebalance strategy by configuring my consumers with the following:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        org.apache.kafka.clients.consumer.CooperativeStickyAssignor.class.getName());
Each time I scale the consumers in my consumer group, all existing consumers in the group generate the following exception:
Exception in thread "main" org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation
Any idea on what caused this exception and/or how to resolve it?
Thank you.
The consumer rebalance happens whenever there is a change in the metadata information of a consumer group.
Adding more consumers (scaling, in your words) to a group is one such change and triggers a rebalance. During this change, each consumer is re-assigned partitions and therefore does not know which offsets to commit until the re-assignment is complete. The StickyAssignor does try to preserve the previous assignment as much as possible, but the rebalance will still be triggered, and an even distribution of partitions takes precedence over retaining the previous assignment (reference: Kafka documentation).
Beyond that, the exception message is self-explanatory: while the rebalance is in progress, some operations (such as committing offsets) are prohibited.
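A minimal sketch of the retry that the exception message suggests, assuming a Java consumer that commits explicit offsets for records it has already processed (offsetsToCommit, buildOffsetsFromProcessedRecords, and process are hypothetical names; imports from org.apache.kafka.* are omitted):

// Offsets of the records this consumer has already finished processing.
Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = buildOffsetsFromProcessedRecords(); // hypothetical helper
try {
    consumer.commitSync(offsetsToCommit);
} catch (RebalanceInProgressException e) {
    // Complete the in-flight cooperative rebalance by polling, then retry the
    // commit, as the exception message suggests. With the sticky assignor the
    // consumer usually keeps its partitions, so the retry normally succeeds.
    ConsumerRecords<String, String> extra = consumer.poll(Duration.ZERO);
    consumer.commitSync(offsetsToCommit);
    process(extra); // hypothetical: records returned by this poll still need normal processing
}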
How to avoid such situations?
This is a tricky one because Kafka needs rebalancing to be able to work effectively. There are a few practices you could use to avoid unnecessary impact:
Increase the maximum allowed time between polls - max.poll.interval.ms - so the chance of hitting these exceptions is reduced.
Decrease the amount of data fetched per poll - max.poll.records or max.partition.fetch.bytes.
Try to use the latest version(s) of Kafka (or upgrade if you're on an old one), as many recent releases have made improvements to the rebalance protocol.
Use the static membership protocol to reduce rebalances.
Consider configuring group.initial.rebalance.delay.ms for empty consumer groups (either for a first-time deployment, or when destroying everything and redeploying again).
These techniques can only reduce the unnecessary churn and exceptions; they will NOT prevent rebalances completely.
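As a rough illustration of the first two points (the values are placeholders and should be derived from how long your application takes per record):

Properties props = new Properties();
// Allow more time between successive poll() calls before the group considers this consumer stuck.
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
// Fetch smaller batches so each iteration of the poll loop finishes well within that window.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "262144");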

Selective Kafka rebalancing on Kubernetes infrastructure

I am running a kafka cluster with a set of consumers on a dockerized Kubernetes infrastructure. The typical workflow is that when a certain consumer (of the consumer group) dies, a rebalancing process will be triggered, and a new assignment of the partitions to the set of the consumers (excluding the failed one) is performed.
After some time, Kubernetes controller will recreate/restart the consumer instance that has failed/died and a new rebalance is performed again.
Is there any way to control the first rebalance (when the consumer dies), e.g., to wait a few seconds without rebalancing until the failed consumer returns or a timeout is reached, and, if the consumer does return, to continue consuming based on the old assignment (i.e., without a new rebalance)?
There are three parameters on the basis of which the group coordinator decides whether a consumer is dead or alive:
session.timeout.ms
max.poll.interval.ms
heartbeat.interval.ms
You can avoid unwanted rebalancing by tuning the above three parameters, plus one rule of thumb: use a separate thread for calling third-party APIs inside the poll loop.
Tuning these three parameters requires answering the questions below:
What is the value of max.poll.records?
On average, how long does the application take to process one record (message)?
On average, how long does the application take to process a complete batch?
Please refer to the Kafka consumer configuration:
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
You can also explore cooperative rebalancing:
https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/
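As an illustration only (the numbers below are placeholders that must be derived from your own answers to the questions above):

Properties props = new Properties();
// Heartbeats are sent from a background thread; the coordinator marks the consumer
// dead if none arrive within session.timeout.ms.
props.put("session.timeout.ms", "30000");     // several heartbeat intervals
props.put("heartbeat.interval.ms", "10000");  // typically about 1/3 of the session timeout
// The poll loop itself must call poll() again within max.poll.interval.ms, so it should
// comfortably exceed max.poll.records multiplied by the average time per record.
props.put("max.poll.records", "200");
props.put("max.poll.interval.ms", "300000");  // 200 records x ~1 s each = ~200 s < 300 s

Static membership (group.instance.id), discussed elsewhere in this thread, is another way to let a restarted consumer rejoin within session.timeout.ms without triggering a rebalance.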

Static membership in Apache Kafka for consumers

I learned that in a recent version of Kafka, a static membership strategy is available for consumer subscription, instead of the earlier dynamic membership detection; this helps in the scenario where a consumer is bounced as part of a rolling deployment. When the consumer comes back up after being bounced, it picks up the same partitions and starts processing again.
My question is: what will happen if we deliberately shut down a consumer? How will the messages in the partitions that particular consumer was subscribed to get processed?
After a consumer has been shut down, the consumer group will undergo a normal rebalance once that consumer's session.timeout.ms has elapsed.
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#kafka-consumers-and-producer-configuration-parameters
When configuring Static Membership, it is important to increase the session.timeout.ms higher than the default of 10000 so that consumers are not prematurely rebalanced. Set this value high enough to allow workers time to start, restart, deploy, etc. Otherwise, the application may get into a restart cycle if it misses too many heartbeats during normal operations. Setting it too high may cause long periods of partial unavailability if a worker dies and the workload is not rebalanced. Each application will set this value differently based on its own availability needs.
If you manually subscribe, then you would have to deal with that scenario in your application code - that's the advantage of automatic subscription: all partitions will be assigned to one of the group's members after a rebalance.
To cater for consumers permanently leaving the group with manual subscription, I guess you would need to track subscriptions somewhere and maybe have each consumer ping to let you know it is alive.
I'm not sure what use cases manual subscription is catering for - I would have to go back and check the Javadoc of KafkaConsumer, which is pretty comprehensive. As long as you have no local state in consumers, automatic subscription seems much safer and more resilient.
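For reference, the distinction this answer draws between automatic and manual subscription looks roughly like this in the Java client (the topic name and partition number are illustrative; a given consumer instance uses one approach or the other, never both):

// Option 1 - automatic subscription: the group coordinator assigns partitions and
// reassigns them to other members if this consumer leaves or dies.
consumer.subscribe(Collections.singletonList("orders"));

// Option 2 - manual assignment (on a separate consumer instance): no group management;
// if this process goes away, nothing else picks up partition 0 unless your own code arranges it.
consumer.assign(Collections.singletonList(new TopicPartition("orders", 0)));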

Kafka Consumer Rebalancing and Its Impact

I'm new to Kafka and I'm trying to design a wrapper library in both Java and Go (using Confluent/Kafka-Go) for Kafka to be used internally. For my use case, CommitSync is a crucial step, and we should only do a new read after properly committing the previous one. Repeated processing is not a big issue and our client service is idempotent enough, but data loss is a major issue and must not occur.
I will create X number of consumers initially and will keep polling from them. Hence I would like to know more about the negative scenarios that could happen here, their impact, and how to properly handle them.
I would like to know more about:
1) Network issue during consumer processing:
What happens when the network goes down for a brief period and comes back? Does the Kafka consumer automatically handle this and become alive again when the network comes back, or do we have to reinitialise it? If it comes back alive, does it resume work from where it left off?
Eg: Consumer X read 50 records from partition Y, so internally the consumer's position moved forward by 50. But before committing, a network issue happens and then the consumer comes back alive. Will the consumer still have the metadata about what it read in the last poll? Can it go on to commit +50 to the offset?
2) Rebalancing in consumer groups, and its impact on an existing consumer process - will the existing working consumer instance pause and resume work during a rebalance, or do we have to reinitialize it? How long can a rebalance take? If the consumer comes back alive after the rebalance, does it have metadata about its last read?
3) What happens when a consumer joins during a rebalance? Ideally it is again a rebalancing scenario. What will happen now - will the existing rebalance be discarded and a new one started, or will it wait for the existing rebalance to complete?
What happens when network goes of for a brief period and comes back? Does Kafka consumer automatically handle this and becomes alive when network comes back or do we have to reinitialise them?
The consumer will try to reconnect. If the group coordinator doesn't receive heartbeats, or the consumer stops responding to the brokers within the configured timeouts, then the group rebalances.
If they come back alive do they resume work from where they left of?
From the last committed offset, yes.
whether the existing working consumer instance will pause and resume work during a rebalance
It will pause and resume. No action needed.
How long can rebalance occur?
Varies on many factors, and can happen indefinitely under certain conditions.
If the consumer comes back alive after rebalance, does it have metadata about it last read?
The last committed offsets are stored on the broker, not by consumers.
The existing will be discarded and the new one starts or will wait for the existing rebalance to complete?
All rebalances must complete before any polls continue.
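Given the requirements stated in the question (commit only after processing, duplicates tolerated, no data loss), a typical at-least-once poll loop in the Java client looks roughly like this (running and processRecord are placeholders for your own loop control and processing logic):

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);   // your idempotent processing
    }
    // Commit only after every record from this poll has been processed.
    // If the process dies before this line, the batch is re-read and
    // re-processed after the rebalance: duplicates, but no data loss.
    if (!records.isEmpty()) {
        consumer.commitSync();
    }
}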

Kafka Rest Proxy Consumer Creation

Let's say I have a service that consumes messages through kafka-rest-proxy, always on the same consumer group. Let's also say that it is consuming from a topic that has one partition. When the service is started, it creates a new consumer in kafka-rest-proxy and uses the generated consumer URL until the service is shut down. When the service comes back up, it creates a new consumer in kafka-rest-proxy and uses the new URL (and new consumer) for consuming.
My Questions
Since Kafka can have at most one consumer per partition (within a consumer group), what will happen in Kafka and kafka-rest-proxy when the consumer is restarted? i.e. a new consumer is created in kafka-rest-proxy, but the old one didn't have a chance to be destroyed. So now there are 'n' consumers in kafka-rest-proxy after 'n' restarts of my service, but only one of them is actively being consumed from. Will I even be able to consume messages on my new consumer, since there are more consumers than partitions?
Let's make this more complicated and say that I have 5 instances of my service on the same consumer group and 5 partitions in the topic. After 'n' restarts of all 5 instances of my service, would I even be guaranteed to consume all messages without ensuring the proper destruction of the existing consumers? i.e. what do Kafka and kafka-rest-proxy do during consumer creation when the consumers outnumber the partitions?
What is considered to be the kafka-rest-proxy best practice, to ensure stale consumers are always cleaned up? Do you suggest persisting the consumer url? Should I force a kafka-rest-proxy restart to ensure existing consumers are destroyed before starting my service?
* EDIT *
I believe part of my question is answered with this configuration, but not all of it.
consumer.instance.timeout.ms - Amount of idle time before a consumer instance is automatically destroyed.
Type: int
Default: 300000
Importance: low
If you cannot cleanly shut down the consumer, it will stay alive for a period after the last request was made to it. The proxy will garbage collect stale consumers for exactly this case -- if it isn't cleanly shut down, the consumer would hold on to some partitions indefinitely. By automatically garbage collecting the consumers, you don't need separate durable storage to keep track of your consumer instances. As you discovered, you can control this timeout via the config consumer.instance.timeout.ms.
Since instances will be garbage collected, you are guaranteed to eventually consume all the messages. But during the timeout period, some partitions may still be assigned to the old set of consumers and you will not make any progress on those partitions.
Ideally, unclean shutdown of your app is rare, so best practice is just to clean up the consumer when your app is shutting down. Even in exceptional cases, you can use the finally block of a try/catch/finally to destroy the consumer. If one is left alive, it will eventually recover. Other than that, consider tweaking the consumer.instance.timeout.ms setting to be lower if your application can tolerate that. It just needs to be larger than the longest period between calls that use the consumer (and you should keep in mind possible error cases, e.g. if processing a message requires interacting with another system and that system can become slow/inaccessible, you should account for that when setting this config).
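A hedged sketch of that try/finally cleanup in Java, assuming consumerBaseUri holds the base URI the REST Proxy returned when the consumer instance was created; issuing a DELETE against that instance URL is how the proxy destroys a consumer (check the REST Proxy docs for the exact endpoint shape of your version):

// consumerBaseUri is a placeholder, e.g. http://rest-proxy:8082/consumers/my-group/instances/my-instance
HttpClient http = HttpClient.newHttpClient();
try {
    // ... consume via the proxy's records endpoint ...
} finally {
    try {
        // Destroy the consumer instance so its partitions are released immediately
        // instead of lingering until consumer.instance.timeout.ms expires.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(consumerBaseUri))
                .DELETE()
                .build();
        http.send(delete, HttpResponse.BodyHandlers.discarding());
    } catch (Exception e) {
        // If cleanup fails, the proxy's idle-timeout garbage collection will
        // eventually remove the stale instance anyway.
    }
}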
You can persist the URLs, but even that is at some risk for losing track of consumers since you can't atomically create the consumer and save its URL to some other persistent storage. Also, since completely uncontrolled failures where you have no chance to cleanup shouldn't be a common case, it often doesn't benefit you much to do that. If you need really fast recovery from that failure, the consumer instance timeout can probably be reduced significantly for your application anyway.
Re: forcing a restart of the proxy, this would be fairly uncommon since the REST Proxy is often a shared service and doing so would affect all other applications that are using it.