Kafka Consumer group rebalancing - apache-kafka

I'm using kafka consumer group management for processing my messages.
The processing time for my messages vary from one another. So I have set the max poll interval to 20 min for max records of 20. And I'm using 5 partition and 5 consumer instances with default config values apart from the above two.
But still I'm getting the following error intermittently:
[Consumer clientId=consumer-3, groupId=amc_dashboard_analytics] Attempt to heartbeat failed since group is rebalancing
The understanding is that rebalancing won't occur unless poll is not called before max poll interval is reached as written in consumer config Docs. But for me rebalancing occurs before 20 minutes only.
Also after few hours of running, all the assigned consumers just leave saying "Attempt to heartbeat failed since group is rebalancing" and do not join back again(Ideally should join back again).
Am I missing something here? Any leads would be helpful.

Another reason of rebalance is expiring session.timeout.ms without sending heartbeat. You can consider to increase this consumer config.
From Kafka docs:
heartbeat.interval.ms: The expected time between heartbeats to the
consumer coordinator when using Kafka's group management facilities.
Heartbeats are used to ensure that the consumer's session stays active
and to facilitate rebalancing when new consumers join or leave the
group. The value must be set lower than session.timeout.ms, but
typically should be set no higher than 1/3 of that value. It can be
adjusted even lower to control the expected time for normal
rebalances. (default: 3000)
session.timeout.ms: The timeout used to detect client failures when
using Kafka's group management facility. The client sends periodic
heartbeats to indicate its liveness to the broker. If no heartbeats
are received by the broker before the expiration of this session
timeout, then the broker will remove this client from the group and
initiate a rebalance. Note that the value must be in the allowable
range as configured in the broker configuration by
group.min.session.timeout.ms and group.max.session.timeout.ms. (default: 10000)
You can check this link for more information.
Even if heartbeat is sent in fixed time intervals via separate thread, in some cases heartbeat cannot be sent to broker in session.timeout.ms. Some of the possible reasons of this situation is:
Network problem
stop-the-world garbage collection in consumer or broker sides

Related

Multiple kafka heartbeat failed messages during rebalance

I have a Kafka application where the consumer is polling messages. Each message processing takes around 30-40 mins. I have tuned the following settings so that my consumer instance is not removed from the group, due to long processing of the message:
max.poll.interval.ms:3600000
max.poll.records = 1
But I'm still getting a lot of following rebalance messages in the log:
2022-07-04 12:17:54,168 INFO thread=kafka-coordinator-heartbeat-thread | periodicSync o.a.k.c.c.i.AbstractCoordinator:1054 : [Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
[Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
Is there any other setting I need to do? I'm getting a lot of messages like this frequently and I'm not sure how it might affect the working of the application.
Session timeout between consumer and broker also plays a vital role here. Your consumer may be taking the default session.timeout.ms. Kindly adjust it accordingly.
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Kafka Consumer death handling

I have question regarding handling of consumers death due to exceeding the timeout values.
my example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding consuming messages in Kafka, very simplified, works as follows:
consumer polls 100 records from topic
a heartbeat signal is sent to broker
processing records in progress
processing records completes
finalize processing (commit, do nothing etc.)
repeat #1-5 in a loop
My question is, what happens if time between heartbeats takes longer than previously configured session.timeout.ms. I understand the part, that if session times out, the broker initializes a re-balance, the consumer which processing took longer than the session.timeout.ms value is marked as dead and a different consumer is assigned/subscribed to that partition.
Okey, but what then...?
Is that long-processing consumer removed/unsubscribed from the topic and my application is left with 9 working consumers? What if all the consumers exceed timeout and are all considered dead, am I left with a running application which does nothing because there are no consumers?
Long-processing consumer finishes processing after re-balancing already took place, does broker initializes re-balance again and consumer is assigned a partition anew? As I understand it continues running #1-5 in a loop and sending a heartbeat to broker initializes also process of adding consumer to the consumers group, from which it was removed after being given dead status, correct?
Application throws some sort of exception indicating that session.timeout.ms was exceeded and the processing is abruptly stopped?
Also what about max.poll.interval.ms property, what if we even exceed that period and consumer X finishes processing after max.poll.interval.ms value? Consumer already exceeded the session.timeout.ms value, it was excluded from consumer group, status set to dead, what difference does it gives us in configuring Kafka consumer?
We have a process which extracts data for processing and this extraction consists of 50+ SQL queries (majority being SELECT's, few UPDATES), they usually go fast but of course all depends on the db load and possible locks etc. and there is a possibility that the processing takes longer than the session's timeout. I do not want to infinitely increase sessions timeout until "I hit the spot". The process is idempotent, if it's repeated X times withing X minutes we do not care.
Please find the answers.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout, then you will be left with Zero consumer instance, eventually, consumer application is dead unless you restart.
#2. This depends, how you write your consumer code with respect to poll() and consumer record iterations. If you have a proper while(true) and try and catch inside, you consumer will be able to re-join the consumer group after processing that long running record.
#3. You will end up with the commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
And again it depends on your code, to auto join into the consumer group.
#4. Answer lies here
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Spring Kafka Consumer Unable to Rejoin after LeaveGroup Request

We are using Spring Kafka in Production under heavy load.
We have used #KafkaListener annotations and created those listeners as part of Spring Boot Services.
Very frequently these consumers send LeaveGroup requests to the coordinator and then the consumers hang/stuck indefinitely without any log or error. The only option we are left in that case is to redeploy that particular instance.
This is the series of logs that we see:
Attempt to heartbeat failed since group is rebalancing
Attempt to heartbeat failed since group is rebalancing
This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Member consumer-1-0d0b333d-9e5b-4038-ac3f-e5d59c4e9d19 sending LeaveGroup request to coordinator 172.25.128.233:9092 (id: 2147483560 rack: null)
Additional information:
We are using below Kafka Configs:
Batch Size: 10
Max Poll Interval: 12 mins
Heart Beat: 8 seconds
Max Partition Size: 10 MB
Session Timeout: 25 seconds
Request Timeout: 20 mins
Basically, we want to know once the Consumer leaves the group, why it is not sending the Join Request again?
You should understand that the retry suspends the consumer thread (if a BackOffPolicy is used). There are no calls to Consumer.poll() during the retries. Kafka has two properties to determine consumer health. The session.timeout.ms is used to determine if the consumer is active. Since kafka-clients version 0.10.1.0, heartbeats are sent on a background thread, so a slow consumer no longer affects that. max.poll.interval.ms (default: five minutes) is used to determine if a consumer appears to be hung (taking too long to process records from the last poll). If the time between poll() calls exceeds this, the broker revokes the assigned partitions and performs a rebalance. For lengthy retry sequences, with back off, this can easily happen.
Since version 2.1.3, you can avoid this problem by using stateful retry in conjunction with a SeekToCurrentErrorHandler. In this case, each delivery attempt throws the exception back to the container, the error handler re-seeks the unprocessed offsets, and the same message is redelivered by the next poll(). This avoids the problem of exceeding the max.poll.interval.ms property (as long as an individual delay between attempts does not exceed it). So, when you use an ExponentialBackOffPolicy, you must ensure that the maxInterval is less than the max.poll.interval.ms property. To enable stateful retry, you can use the RetryingMessageListenerAdapter constructor that takes a stateful boolean argument (set it to true). When you configure the listener container factory (for #KafkaListener), set the factory’s statefulRetry property to true.
https://docs.spring.io/spring-kafka/reference/html/#stateful-retry

max.poll.intervals.ms set to int.Max by default

Apache Kafka documentation states:
The internal Kafka Streams consumer max.poll.interval.ms default value
was changed from 300000 to Integer.MAX_VALUE
Since this value is used to detect when the processing time for a batch of records exceeds a given threshold, is there a reason for such an "unlimited" value?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Kafka Streams leverages a heartbeat functionality of the Kafka consumer client in this context, and thus decouples heartbeats ("Is this app instance still alive?") from calls to poll(). The two main parameters are session.timeout.ms (for the heartbeat thread) and max.poll.interval.ms (for the processing thread), and their difference is described in more detail at https://stackoverflow.com/a/39759329/1743580.
The heartbeating was introduced so that an application instance may be allowed to spent a lot of time processing a record without being considered "not making progress" and thus "be dead". For example, your app can do a lot of crunching for a single record for a minute, while still heartbeating to Kafka "Hey, I'm still alive, and I am making progress. But I'm simply not done with the processing yet. Stay tuned."
Of course you can change max.poll.interval.ms from its default (Integer.MAX_VALUE) to a lower setting if, for example, you actually do want your app instance to be considered "dead" if it takes longer than X seconds in-between polling records, and thus if it takes longer than X seconds to process the latest round of records. It depends on your specific use case whether or not such a configuration makes sense -- in most cases, the default setting is a safe bet.
session.timeout.ms: The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
max.poll.interval.ms: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.

Difference between heartbeat.interval.ms and session.timeout.ms in Kafka consumer config

I'm currently running kafka 0.10.0.1 and the corresponding docs for the two values in question are as follows:
heartbeat.interval.ms -
The expected time between heartbeats to the consumer coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the consumer's session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances.
session.timeout.ms -
The timeout used to detect failures when using Kafka's group management facilities. When a consumer's heartbeat is not received within the session timeout, the broker will mark the consumer as failed and rebalance the group. Since heartbeats are sent only when poll() is invoked, a higher session timeout allows more time for message processing in the consumer's poll loop at the cost of a longer time to detect hard failures. See also max.poll.records for another option to control the processing time in the poll loop.
It isn't clear to me why the docs recommend setting heartbeat.interval.ms to 1/3 of session.timeout.ms. Does it not make sense to have these values be the same since the heartbeat is only sent when poll() is invoked, and thus when processing of the current records is done?
The heartbeat.interval.ms specifies the frequency of sending heart beat signal by the consumer. So if this is 3000 ms (default), then every 3 seconds the consumer will send the heartbeat signal to the broker.
The session.timeout.ms specifies the amount of time within which the broker needs to get at least one heart beat signal from the consumer. Otherwise it will mark the consumer as dead. The default value 10000 ms (10 seconds) makes provision for missing three heart beat signals before a broker will mark the consumer as dead.
In a network setup under heavy load, it is normal to miss few heartbeat signals. So it is recommended to wait for missing 3 heart beat signals before marking the consumer as dead. That is the reason for the 1/3 recommendation.
The code makes a hard limit that you cannot set heartbeat.interval.ms no less than request.timeout.ms, otherwise Kafka complains "Heartbeat must be set lower than the session timeout".
If you really have these two configs be the same value, a possible situation is network client will never heartbeat anymore because the session timeout nearly always happens before doing heartbeat.
As for the 1/3, I prefer to think it sort of being a heuristic value.
heartbeat.interval.ms is the duration within which consumer sends signal to kafka broker to indicate it is alive, session.timeout.ms is the maximum duration that kafka broker can wait without a receiving heartbeat from consumer, if session.timeout.ms duration exceeds without receiving a heartbeat from consumer than that consumer will be marked as dead(i.e.it can no more consume message). In a kafka queue where millions of messages processing a day can have more session.timeout.ms duration say upto 30000ms( default is 10s) to keep consumer alive while processing huge volume.