Kafka console consumer ERROR "Offset commit failed on partition" - apache-kafka

I am using a kafka-console-consumer to probe a kafka topic.
Intermittently, I am getting this error message, followed by 2 warnings:
[2018-05-01 18:14:38,888] ERROR [Consumer clientId=consumer-1, groupId=console-consumer-56648] Offset commit failed on partition my-topic-0 at offset 444: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-05-01 18:14:38,888] WARN [Consumer clientId=consumer-1, groupId=console-consumer-56648] Asynchronous auto-commit of offsets {my-topic-0=OffsetAndMetadata{offset=444, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-05-01 18:14:38,888] WARN [Consumer clientId=consumer-1, groupId=console-consumer-56648] Synchronous auto-commit of offsets {my-topic-0=OffsetAndMetadata{offset=447, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
It suggested in the warn logs that:
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
So, I either need to increase max.poll.interval.ms or decrease max.poll.records.
Please advise what would be the implication of each method, and which one is preferred on a different situation?

If you increase max.poll.interval.ms that says “it’s ok to spend time processing a large batch of records” and you’ll gain throughput if you can process larger batches more efficiently than smaller ones.
To decrease max.poll.records says ”take fewer records so there’s enough time to process them” and would favor latency over throughput.
Also consider that both are configured fine, but something else is causing performance issues within your poll loop. I would explore that first before changing configuration so you don’t mask a bigger problem.

Related

Multiple kafka heartbeat failed messages during rebalance

I have a Kafka application where the consumer is polling messages. Each message processing takes around 30-40 mins. I have tuned the following settings so that my consumer instance is not removed from the group, due to long processing of the message:
max.poll.interval.ms:3600000
max.poll.records = 1
But I'm still getting a lot of following rebalance messages in the log:
2022-07-04 12:17:54,168 INFO thread=kafka-coordinator-heartbeat-thread | periodicSync o.a.k.c.c.i.AbstractCoordinator:1054 : [Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
[Consumer clientId=consumer-periodicSync-5, groupId=periodicSync] Attempt to heartbeat failed since group is rebalancing
Is there any other setting I need to do? I'm getting a lot of messages like this frequently and I'm not sure how it might affect the working of the application.
Session timeout between consumer and broker also plays a vital role here. Your consumer may be taking the default session.timeout.ms. Kindly adjust it accordingly.
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Kafka Consumer death handling

I have question regarding handling of consumers death due to exceeding the timeout values.
my example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding consuming messages in Kafka, very simplified, works as follows:
consumer polls 100 records from topic
a heartbeat signal is sent to broker
processing records in progress
processing records completes
finalize processing (commit, do nothing etc.)
repeat #1-5 in a loop
My question is, what happens if time between heartbeats takes longer than previously configured session.timeout.ms. I understand the part, that if session times out, the broker initializes a re-balance, the consumer which processing took longer than the session.timeout.ms value is marked as dead and a different consumer is assigned/subscribed to that partition.
Okey, but what then...?
Is that long-processing consumer removed/unsubscribed from the topic and my application is left with 9 working consumers? What if all the consumers exceed timeout and are all considered dead, am I left with a running application which does nothing because there are no consumers?
Long-processing consumer finishes processing after re-balancing already took place, does broker initializes re-balance again and consumer is assigned a partition anew? As I understand it continues running #1-5 in a loop and sending a heartbeat to broker initializes also process of adding consumer to the consumers group, from which it was removed after being given dead status, correct?
Application throws some sort of exception indicating that session.timeout.ms was exceeded and the processing is abruptly stopped?
Also what about max.poll.interval.ms property, what if we even exceed that period and consumer X finishes processing after max.poll.interval.ms value? Consumer already exceeded the session.timeout.ms value, it was excluded from consumer group, status set to dead, what difference does it gives us in configuring Kafka consumer?
We have a process which extracts data for processing and this extraction consists of 50+ SQL queries (majority being SELECT's, few UPDATES), they usually go fast but of course all depends on the db load and possible locks etc. and there is a possibility that the processing takes longer than the session's timeout. I do not want to infinitely increase sessions timeout until "I hit the spot". The process is idempotent, if it's repeated X times withing X minutes we do not care.
Please find the answers.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout, then you will be left with Zero consumer instance, eventually, consumer application is dead unless you restart.
#2. This depends, how you write your consumer code with respect to poll() and consumer record iterations. If you have a proper while(true) and try and catch inside, you consumer will be able to re-join the consumer group after processing that long running record.
#3. You will end up with the commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
And again it depends on your code, to auto join into the consumer group.
#4. Answer lies here
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

Impact of reducing max.poll.records in Kafka Consumer configuration

I am writing an consumer application to pick records from kafka stream and process it using spring-kafka.
My processing steps are as below :
Getting records from stream --> dump it into a table --> Fetch records and call API --> API will update records into a table --> calling Async Commit()
It seems in some scenarios, the API processing taking more time because of more records are being fetched and we are getting below errors?
Member consumer-prov-em-1-399ede46-9e12-4388-b5b8-f198a4e6a5bc
sending LeaveGroup request to coordinator apslt2555.uhc.com:9095 (id:
2147483577 rack: null) due to consumer poll timeout has expired. This
means the time between subsequent calls to poll() was longer than the
configured max.poll.interval.ms, which typically implies that the poll
loop is spending too much time processing messages. You can address
this either by increasing max.poll.interval.ms or by reducing the
maximum size of batches returned in poll() with max.poll.records.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing max.poll.interval.ms or by reducing the maximum size of
batches returned in poll() with max.poll.records.
I know this can be handled by reducing max.poll.records or by increasing max.poll.interval.ms. What I am trying to understand if I set max.poll.records to 10 then what would be poll() behavior? Is it going to take 10 records from stream wait for these records to be committed and then will go for next 10 records ? When the next poll occurs ?Is it also going to impact performance as we are reducing max.poll.records from default 500 to 10.
Do I also have to increase max.poll.interval.ms. Probably make it 10 minutes. Is there any down impact that I should be aware of while changing these values ? Except these parameters, is there any other way to handle these errors ?
max.poll.records allows batch processing consumption model in which records are collected in memory before flushing them to another system. The idea is to get all the records by polling from kafka together and then process that in memory in the poll loop.
If you decrease the number then the consumer will be polling more frequently from kafka. This means it needs to make network call more often. This might reduce performance of kafka stream processing.
max.poll.interval.ms controls the maximum time between poll invocations before the consumer will proactively leave the group. If this number increases then it will take longer for kafka to detect the consumer failures. On the other hand, if this value is too low kafka might falsely detect many alive consumers as failed thus rebalancing more often.

Kafka Listner Exception : Commit cannot be completed

With Manual kafka acknowledgement, while consuming/processing huge amount of messages, we observed the below error.
If we set the max.poll.records property to 1 during consumer creation, will it have performance issues while handling huge load of messages????
2019-10-22 23:01:55.208 UTC [org.springframework.kafka.KafkaListenerEndpointContainer#0-2-C-1] ERROR c.d.s.s.b.a.kafka.KafkaRxTx - Kafka Listner Exception : Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. -> {}
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:691)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1416)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1377)
at brave.kafka.clients.TracingConsumer.commitSync(TracingConsumer.java:151)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.ackImmediate(KafkaMessageListenerContainer.java:922)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processAck(KafkaMessageListenerContainer.java:904)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.access$2000(KafkaMessageListenerContainer.java:384)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ConsumerAcknowledgment.acknowledge(KafkaMessageListenerContainer.java:1593)
Yes it will have a significant impact on performance; how much will depend on your situation; you should run tests.
Also, consider increasing max.poll.interval.ms.

Issue of Kafka Balancing at high load

Using kafka version 2.11-0.11.0.3 to publish 10,000 messages (total size of all messages are 10MB), there will be 2 consumers (with same group-id) to consume the message as a parallel processing.
While consuming, same message was consumed by both the consumers.
Below errors/warning were throws by kafka
WARN: This member will leave the group because consumer poll timeout
has expired. This means the time between subsequent calls to poll()
was longer than the configured max.poll.interval.ms, which typically
implies that the poll loop is spending too much time processing
messages. You can address this either by increasing
max.poll.interval.ms or by reducing the maximum size of batches
returned in poll() with max.poll.records.
INFO: Attempt to heartbeat failed since group is rebalancing
INFO: Sending LeaveGroup request to coordinator
WARN: Synchronous auto-commit of offsets
{ingest-data-1=OffsetAndMetadata{offset=5506, leaderEpoch=null,
metadata=''}} failed: Commit cannot be completed since the group has
already rebalanced and assigned the partitions to another member. This
means that the time between subsequent calls to poll() was longer than
the configured max.poll.interval.ms, which typically implies that the
poll loop is spending too much time message processing. You can
address this either by increasing max.poll.interval.ms or by reducing
the maximum size of batches returned in poll() with max.poll.records.
Below configurations were provided to kafka
server.properties
max.poll.interval.ms=30000
group.initial.rebalance.delay.ms=0
group.max.session.timeout.ms=120000
group.min.session.timeout.ms=6000
consumer.properties
session.timeout.ms=30000
request.timeout.ms=40000
What should have changed to resolve the multiple consumptions?
Are your consumers in the same group? If yes you will have multiple consumption if a consumer leaves/dies/timeouts without having committed some messages it has processed.
If all your messages are consumed by both consumers you probably have not set the same group id for them.
More info:
So you have set the same group id for all consumers, good. You are in the situation where the cluster/broker thinks that a consumer died and therefore rebalances the load to another one. This other one will start consuming where the last commit was done.
So lets say consumer C_A read offsets up to 100 from partition P_1 then processed them then committed '100' then read offsets up to 200 then processed them but could not commit because the broker considered C_A as dead.
The broker reassigns partition P_1 to consumer C_B which will start from the last commit for the group, which is 100, will read up to 200, process and commit 200.
So your question is how to avoid that the consumer is considered as dead (I assume it is not dead)?
The answer is already in the yellow WARN message in your question: you can tell your consumer to consume less messages (max.poll.records) in one poll to reduce the processing time between two polls to the broker AND/OR you can increase the max.poll.interval.ms telling the broker to wait longer before considering your consumer as dead...