Multiple kafka heartbeat failed messages during rebalance - apache-kafka

I have a Kafka application where the consumer is polling messages. Each message processing takes around 30-40 mins. I have tuned the following settings so that my consumer instance is not removed from the group, due to long processing of the message:
max.poll.interval.ms:3600000
max.poll.records = 1
But I'm still getting a lot of following rebalance messages in the log:
2022-07-04 12:17:54,168 INFO thread=kafka-coordinator-heartbeat-thread | periodicSync o.a.k.c.c.i.AbstractCoordinator:1054 : [Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
[Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
Is there any other setting I need to do? I'm getting a lot of messages like this frequently and I'm not sure how it might affect the working of the application.

Session timeout between consumer and broker also plays a vital role here. Your consumer may be taking the default session.timeout.ms. Kindly adjust it accordingly.
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Related

Kafka Consumer death handling

I have question regarding handling of consumers death due to exceeding the timeout values.
my example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding consuming messages in Kafka, very simplified, works as follows:
consumer polls 100 records from topic
a heartbeat signal is sent to broker
processing records in progress
processing records completes
finalize processing (commit, do nothing etc.)
repeat #1-5 in a loop
My question is, what happens if time between heartbeats takes longer than previously configured session.timeout.ms. I understand the part, that if session times out, the broker initializes a re-balance, the consumer which processing took longer than the session.timeout.ms value is marked as dead and a different consumer is assigned/subscribed to that partition.
Okey, but what then...?
Is that long-processing consumer removed/unsubscribed from the topic and my application is left with 9 working consumers? What if all the consumers exceed timeout and are all considered dead, am I left with a running application which does nothing because there are no consumers?
Long-processing consumer finishes processing after re-balancing already took place, does broker initializes re-balance again and consumer is assigned a partition anew? As I understand it continues running #1-5 in a loop and sending a heartbeat to broker initializes also process of adding consumer to the consumers group, from which it was removed after being given dead status, correct?
Application throws some sort of exception indicating that session.timeout.ms was exceeded and the processing is abruptly stopped?
Also what about max.poll.interval.ms property, what if we even exceed that period and consumer X finishes processing after max.poll.interval.ms value? Consumer already exceeded the session.timeout.ms value, it was excluded from consumer group, status set to dead, what difference does it gives us in configuring Kafka consumer?
We have a process which extracts data for processing and this extraction consists of 50+ SQL queries (majority being SELECT's, few UPDATES), they usually go fast but of course all depends on the db load and possible locks etc. and there is a possibility that the processing takes longer than the session's timeout. I do not want to infinitely increase sessions timeout until "I hit the spot". The process is idempotent, if it's repeated X times withing X minutes we do not care.
Please find the answers.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout, then you will be left with Zero consumer instance, eventually, consumer application is dead unless you restart.
#2. This depends, how you write your consumer code with respect to poll() and consumer record iterations. If you have a proper while(true) and try and catch inside, you consumer will be able to re-join the consumer group after processing that long running record.
#3. You will end up with the commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
And again it depends on your code, to auto join into the consumer group.
#4. Answer lies here
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Kafka Consumer group rebalancing

I'm using kafka consumer group management for processing my messages.
The processing time for my messages vary from one another. So I have set the max poll interval to 20 min for max records of 20. And I'm using 5 partition and 5 consumer instances with default config values apart from the above two.
But still I'm getting the following error intermittently:
[Consumer clientId=consumer-3, groupId=amc_dashboard_analytics] Attempt to heartbeat failed since group is rebalancing
The understanding is that rebalancing won't occur unless poll is not called before max poll interval is reached as written in consumer config Docs. But for me rebalancing occurs before 20 minutes only.
Also after few hours of running, all the assigned consumers just leave saying "Attempt to heartbeat failed since group is rebalancing" and do not join back again(Ideally should join back again).
Am I missing something here? Any leads would be helpful.
Another reason of rebalance is expiring session.timeout.ms without sending heartbeat. You can consider to increase this consumer config.
From Kafka docs:
heartbeat.interval.ms: The expected time between heartbeats to the
consumer coordinator when using Kafka's group management facilities.
Heartbeats are used to ensure that the consumer's session stays active
and to facilitate rebalancing when new consumers join or leave the
group. The value must be set lower than session.timeout.ms, but
typically should be set no higher than 1/3 of that value. It can be
adjusted even lower to control the expected time for normal
rebalances. (default: 3000)
session.timeout.ms: The timeout used to detect client failures when
using Kafka's group management facility. The client sends periodic
heartbeats to indicate its liveness to the broker. If no heartbeats
are received by the broker before the expiration of this session
timeout, then the broker will remove this client from the group and
initiate a rebalance. Note that the value must be in the allowable
range as configured in the broker configuration by
group.min.session.timeout.ms and group.max.session.timeout.ms. (default: 10000)
You can check this link for more information.
Even if heartbeat is sent in fixed time intervals via separate thread, in some cases heartbeat cannot be sent to broker in session.timeout.ms. Some of the possible reasons of this situation is:
Network problem
stop-the-world garbage collection in consumer or broker sides

Kafka10.1 heartbeat.interval.ms, session.timeout.ms and max.poll.interval.ms

I am using kafka 0.10.1.1 and confused with the following 3 properties.
heartbeat.interval.ms
session.timeout.ms
max.poll.interval.ms
heartbeat.interval.ms - This was added in 0.10.1 and it will send heartbeat between polls.
session.timeout.ms - This is to start rebalancing if no request to kafka and it gets reset on every poll.
max.poll.interval.ms - This is across the poll.
But, when does kafka starts rebalancing? Why do we need these 3? What are the default values for all of them?
Thanks
Assuming we are talking about Kafka 0.10.1.0 or upwards where each consumer instance employs two threads to function. One is user thread from which poll is called; the other is heartbeat thread that specially takes care of heartbeat things.
session.timeout.ms is for heartbeat thread. If coordinator fails to get any heartbeat from a consumer before this time interval elapsed, it marks consumer as failed and triggers a new round of rebalance.
max.poll.interval.ms is for user thread. If message processing logic is too heavy to cost larger than this time interval, coordinator explicitly have the consumer leave the group and also triggers a new round of rebalance.
heartbeat.interval.ms is used to have other healthy consumers aware of the rebalance much faster. If coordinator triggers a rebalance, other consumers will only know of this by receiving the heartbeat response with REBALANCE_IN_PROGRESS exception encapsulated. Quicker the heartbeat request is sent, faster the consumer knows it needs to rejoin the group.
Suggested values:
session.timeout.ms : a relatively low value, 10 seconds for instance.
max.poll.interval.ms: based on your processing requirements
heartbeat.interval.ms: a relatively low value, better 1/3 of the session.timeout.ms
session.timeout.ms is closely related to heartbeat.interval.ms.
heartbeat.interval.ms controls how frequently the KafkaConsumer poll() method will send a heartbeat to the group coordinator, whereas session.timeout.ms controls how long a consumer can go without sending a heartbeat.
Therefore, those two properties are typically modified together.
heatbeat.interval.ms must be lower than session.timeout.ms, and is usually set to one-third of the timeout value. So if session.timeout.ms is 3 seconds, heartbeat.interval.ms should be 1 second.
max.poll.interval.ms - The maximum delay between invocations of poll() when using consumer group management.
This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member
just make them more clear, a heartbeat thread(along with user thread which invokes Poll function in the same process) will send heartbeat to coordinator every "heartbeat.interval.ms" time, and the coordinator will mark the consumer in the user thread as dead if it exceeds "session.timeout.ms" or "max.poll.interval.ms".

Difference between heartbeat.interval.ms and session.timeout.ms in Kafka consumer config

I'm currently running kafka 0.10.0.1 and the corresponding docs for the two values in question are as follows:
heartbeat.interval.ms -
The expected time between heartbeats to the consumer coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the consumer's session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances.
session.timeout.ms -
The timeout used to detect failures when using Kafka's group management facilities. When a consumer's heartbeat is not received within the session timeout, the broker will mark the consumer as failed and rebalance the group. Since heartbeats are sent only when poll() is invoked, a higher session timeout allows more time for message processing in the consumer's poll loop at the cost of a longer time to detect hard failures. See also max.poll.records for another option to control the processing time in the poll loop.
It isn't clear to me why the docs recommend setting heartbeat.interval.ms to 1/3 of session.timeout.ms. Does it not make sense to have these values be the same since the heartbeat is only sent when poll() is invoked, and thus when processing of the current records is done?
The heartbeat.interval.ms specifies the frequency of sending heart beat signal by the consumer. So if this is 3000 ms (default), then every 3 seconds the consumer will send the heartbeat signal to the broker.
The session.timeout.ms specifies the amount of time within which the broker needs to get at least one heart beat signal from the consumer. Otherwise it will mark the consumer as dead. The default value 10000 ms (10 seconds) makes provision for missing three heart beat signals before a broker will mark the consumer as dead.
In a network setup under heavy load, it is normal to miss few heartbeat signals. So it is recommended to wait for missing 3 heart beat signals before marking the consumer as dead. That is the reason for the 1/3 recommendation.
The code makes a hard limit that you cannot set heartbeat.interval.ms no less than request.timeout.ms, otherwise Kafka complains "Heartbeat must be set lower than the session timeout".
If you really have these two configs be the same value, a possible situation is network client will never heartbeat anymore because the session timeout nearly always happens before doing heartbeat.
As for the 1/3, I prefer to think it sort of being a heuristic value.
heartbeat.interval.ms is the duration within which consumer sends signal to kafka broker to indicate it is alive, session.timeout.ms is the maximum duration that kafka broker can wait without a receiving heartbeat from consumer, if session.timeout.ms duration exceeds without receiving a heartbeat from consumer than that consumer will be marked as dead(i.e.it can no more consume message). In a kafka queue where millions of messages processing a day can have more session.timeout.ms duration say upto 30000ms( default is 10s) to keep consumer alive while processing huge volume.

Confluent Kafka Consumer Configuration - How session.timeout.ms and max.poll.interval.ms are related?

I'm trying to understand how the default values of below two confluent consumer configurations work together.
max.poll.interval.ms - As per confluent documentation, the default value is 300,000 ms
session.timeout.ms - As per confluent documentation, the default value is 10,000 ms
heartbeat.interval.ms - As per confluent documentation, the default value is 3,000 ms
Let's say if I'm using these default values in my configuration. Now I've a question here.
For example, let's assume for a consumer, consumer is sending heartbeats every 3,000 ms and my first poll happened at the timestamp t1 and then second poll happened at t1 + 20,00 ms. Then would it cause a re-balance because this exceed the 'session.timeout.ms' ? or would it work fine as the consumer did send a heartbeat as per the expected timestamp?
In previous thread Here also explained about session time out and max poll timeout. Let me also explain about my understanding on this.
ConsumerRecords poll(final long timeout):
is used to fetch data sequentially from topic's partition starting from last consumed offset or manual set offset. This will return immediately if there are record available otherwise it will await the passed timeout. If timeout passes will return empty record.
The poll API keep calling to fetch any new message arrived as well as its ensure liveness of consumer.Underneath the covers
session.timeout.ms During each poll Consumer coordinator send heartbeat to broker to ensure that consumer's session live and active. If broker didn't receive any heartbeat till session.timeout.ms broker then broker leave that consumer and do rebalance
You can assume session.timeout.ms is maximum time broker wait to get heartbeat from consumer whereas heartbeat.interval.ms is expected time consumer suppose to send heartbeat to Broker.
thats explained heartbeat.interval.ms always less than session.timeout.ms because ideal case 1/3 of session timeout.
max.poll.interval.ms : The maximum delay between invocations of poll() when using consumer group management. That means consumer maximum time will be idle before fetching more records.If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance by calling poll in order to reassign the partitions to another consumer instance.
If we are doing long batch processing its good to increase max.poll.interval.ms but please note increasing this value may delay a group rebalance since the consumer will only join the rebalance inside the call to poll. We can tune by keeping max poll interval low by tuning max.poll.records.
Now let's discuss how they relate to each other.
Consumer while calling poll its check heartbeat, session time out poll time out in background as below manner:
Consumer coordinator check if consumer is not in rebalancing state if still rebalancing then wait coordinator to join the consumer. wait and call poll . Please note if max.poll.interval.ms large it will take more time to rebalance.
After poll and rebalance completed coordinator check session time out
if session timeout has expired without seeing a successful heartbeat, old coordinator will get disconnected so next poll will try to rebalance.
So Session timeout directly dependent time coordinator liveness if session time out consumer coordinator itself get dead and call poll will have to assign new coordinator before rebalancing.
After session timeout check coordinator validate heartbeat.pollTimeoutExpired if poll timeout has expired, which means that the foreground thread has stalled in between calls to poll(), so member explicitly leave the group and call poll to get join new consumer not whole consumer group coordinator.
After session time out and poll time out validation and before sending heart beat status , consumer coordinator check heart beat timeout, if heart beat exceed max delay heart beat time then pause/wait to retry backoff and poll again.
If heartbeat time is also in limit not exceed then consumer coordinator sent sendHeartbeatRequest
In case of sendHeartbeatRequest success thread will reset heartbeat time and call poll but in case of fail and consumer group is not in rebalance state it will wakeup consumer group coordinator to call poll again.
As mentioned on shared link polling is independent with heartbeat so during polling in case poll is quite larger heartbeat still allow to sent heartbeat which make sure your thread are live means session time out doesn't directly link to poll .
session.timeout.ms: Max time to receive heart beat
max.poll.interval.ms: Max time on independent processing thread
So if you set max.poll.interval.ms 300,000 then will have 300,000 ms to next poll that means consumer thread have max 300,000 ms to complete processing. In between heartbeat will keep sending heartbeat request at heartbeat.interval.ms i.e. 3,000 to indicate thread is still live and in case no heartbeat till session.timeout.ms i.e. 10,000 coordinator will be dead and call poll to reassign new coordinator and rebalancing