For testing purposes, I posted 5k messages to a Kafka topic, and my Spring Batch application uses a pull method to read 100 messages per iteration; it runs for ~2 hours before it finishes.
At times I get the error below and execution stops.
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets
What could be the reason, and how can it be fixed?
Did you consume all the messages within the two hours? If the consumer is still consuming, MAX_POLL_INTERVAL_MS_CONFIG may be the trigger. The default is 5 minutes: if the interval between poll() calls exceeds 5 minutes, the consumer is kicked out of the consumer group and a rebalance happens.
During that rebalance, the consumer group is unavailable.
Without more details I can only point you in a general direction.
Edit (2021-10-12): "If it is still consuming, MAX_POLL_INTERVAL_MS_CONFIG may be triggered" means that if consumption has not finished, the consumption rate is very slow, so this mechanism may be kicking in. You can adjust this parameter and verify whether the error goes away.
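For illustration, here is a minimal sketch of raising that limit on a plain Java KafkaConsumer; the broker address, group id, topic name and the 15-minute value are assumptions to adapt to your setup.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SlowBatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-batch-group");            // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Allow more time between poll() calls than the 5-minute default,
        // so a slow iteration does not get the consumer kicked out of the group.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "900000");        // 15 minutes (tune to your batch time)
        // Cap the batch so one poll() never returns more work than fits in that window.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));          // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                                            // your per-message work
                }
                consumer.commitSync();                                          // commit after the whole batch
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}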
Related
We have two Kafka clusters (active and standby), each with 6 nodes, a topic with 12 partitions, and 12 instances of our app running. The app is a consumer, and all instances use the same consumer group ID to receive events from Kafka. Event processing is sequential: an event comes in, we process it, then do a manual ack. Processing plus the manual acknowledgement takes roughly 5 seconds per event. Although there are multiple instances, only one event is processed at a time. Recently we found an issue in production: consumer rebalancing happens every 2 seconds, the offset commit (manual ack) fails because of it, the same event is delivered twice, and duplicate records get inserted into the database.
Kafka consumer config values are:
max.poll.interval.ms=300000 // 5 minutes
max.poll.records = 500
heartbeat.interval.ms=3000
session.timeout.ms=10000
Errors seen:
1. Commit offset failed since the consumer is not part of an active group.
2. Time between subsequent calls to poll() was longer than the configured max poll interval (in milliseconds), which typically implies that the poll loop is spending too much time on message processing.
But message processing takes only 5 seconds, nowhere near the configured max poll interval of 5 minutes. Since processing is sequential, only one consumer polls and gets an event at a time while the other instances wait for their turn to poll. Could that be causing the two errors above and the rebalancing? Appreciate the help.
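For reference, a minimal sketch of the poll/process/manual-ack flow described above, assuming Spring Kafka with manual acknowledgment mode (the listener container is assumed to be configured with AckMode.MANUAL_IMMEDIATE); the topic, group and processing step are placeholders.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class EventListener {

    // One event at a time: process it, then acknowledge (commit the offset) manually.
    @KafkaListener(topics = "my-topic", groupId = "my-group")  // placeholder names
    public void onEvent(ConsumerRecord<String, String> record, Acknowledgment ack) {
        processEvent(record.value()); // the ~5-second step described in the question
        ack.acknowledge();            // manual ack: commits the offset only after success
    }

    private void processEvent(String payload) {
        // application-specific processing
    }
}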
I am using manual Kafka commit: I set the property enable.auto.commit to false when initialising the Kafka consumer and call commit manually after receiving and processing each message.
However, since message processing in my consumer takes a long time, I am getting an exception with the message "error": "Broker: Group rebalance in progress".
The reason is that a commit issued after the rebalance timeout is rejected with this error. One recovery option is to exit and re-instantiate the process, which triggers rebalancing and partition assignment again. Another is to catch the exception and continue as usual, but that only works correctly if the poll() call blocks until the rebalance is complete; otherwise it fetches the next packet from the batch and might process and commit it successfully, losing the message whose commit failed during the rebalance.
So, what is the correct way to handle this case: should I re-instantiate the process, or should I catch and ignore the exception?
The best approach is to ignore it if it happens occasionally, and if it happens frequently then reduce max.poll.records or increase max.poll.interval.ms so that it only happens occasionally. Also, ensure that your code can handle duplicate records (if you can't do that, then there is a different answer).
The error you see is, as you probably realise, just because by the time the consumer committed, the group had decided that it had probably gone, so its partitions were picked up by a different consumer as part of a rebalance; the new consumer starts from the last committed offset, hence the duplicates.
Given that the original consumer is alive and well, it will no doubt poll again and so trigger another rebalance. That poll won't block waiting for the rebalance to occur: each poll allows for some communication about the current state of the group (within the polling thread), and after a number of polls the new allocation of partitions is agreed and accepted, after which the rebalance is considered complete and that poll tells the consumer its partition allocation and returns a set of records.
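To make the "ignore it if it's occasional" option concrete, here is a hedged sketch using the Java client (the error string in the question suggests a different client library, but the pattern is the same); the consumer is assumed to be configured elsewhere with enable.auto.commit=false.

import java.time.Duration;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TolerantCommitLoop {

    // 'consumer' is assumed to be configured elsewhere with enable.auto.commit=false
    // and already subscribed to its topics.
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record);  // must tolerate duplicates (idempotent writes, upserts, ...)
            }
            try {
                consumer.commitSync();
            } catch (CommitFailedException e) {
                // A rebalance happened while we were processing, so this commit is rejected.
                // The next poll() rejoins the group, and whichever consumer now owns the
                // partition restarts from the last committed offset, hence the duplicates.
                // If this is rare, log and carry on; if frequent, lower max.poll.records
                // or raise max.poll.interval.ms.
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific, idempotent processing
    }
}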
I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group ID, and it still does the same thing.
Also, does the client version of the Kafka consumer matter here?
To start with, I'd suggest you decouple the consumer from the processing logic: let the Kafka consumer only poll messages and, after sanitizing them if necessary, delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
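One possible shape for that decoupling, as a sketch only: the polling thread hands records to a small worker pool so the time between poll() calls stays short. The pool size, offset handling and partition pausing are simplified assumptions, not a production recipe.

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumer {

    // 'consumer' is assumed to be created elsewhere and already subscribed.
    static void run(KafkaConsumer<String, String> consumer) {
        ExecutorService workers = Executors.newFixedThreadPool(4);  // pool size is an assumption
        while (true) {
            // The polling thread only fetches and hands records off, so the gap between
            // poll() calls stays short and the group does not consider this consumer dead.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                workers.submit(() -> process(record));  // the slow work moves off the poll thread
            }
            // Caveat: with this naive hand-off, auto-commit may advance offsets before a worker
            // finishes; a fuller version would pause() partitions and commit after processing.
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}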
I see from the logs that the exact same message is consumed 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages, like 10, 50 or 100, out of the 1000 that are in the queue?
I was looking at the 'fetch.max.bytes' config, but it seems to limit the fetch size in bytes rather than the number of messages.
Thanks
The answer lies in understanding the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete the processing within max.poll.interval.ms. It is therefore assumed to have hung, a rebalance of the partitions is triggered, and the consumer loses ownership of all its partitions. It is marked dead and is no longer part of the consumer group.
Then, when your consumer completes the processing and calls poll() again, two things happen:
1. The commit fails, as the consumer no longer owns the partitions.
2. The broker identifies that the consumer is up again, so a rebalance is triggered and the consumer rejoins the consumer group, starts owning partitions and requests messages from the broker. Since the earlier message was never marked as committed (see #1 above, the failed commit) and is still pending processing, the broker delivers the same message to the consumer again.
The consumer again takes a long time to process, and since it cannot finish in less than max.poll.interval.ms, steps 1 and 2 keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
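As a rough sizing sketch (my own rule of thumb, not an official formula): the consumer must get through one full poll batch within max.poll.interval.ms, so budget roughly max.poll.records times the worst-case per-record time, plus some headroom. The numbers below are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class PollIntervalSizing {
    public static void main(String[] args) {
        int maxPollRecords = 20;             // records returned per poll() (your current setting)
        long worstCaseMsPerRecord = 10_000;  // placeholder: measure this in your processing code
        long marginMs = 60_000;              // headroom for GC pauses, retries, slow I/O

        long requiredPollIntervalMs = maxPollRecords * worstCaseMsPerRecord + marginMs; // 260 000 ms here

        Properties props = new Properties();
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, Integer.toString(maxPollRecords));
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, Long.toString(requiredPollIntervalMs));
        System.out.println("max.poll.interval.ms should be at least " + requiredPollIntervalMs);
    }
}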
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
The reason is that a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it is basically acknowledging receipt of the previous message, advancing the offset, and therefore moving on to the next message. But if that timeout has passed (as is the case for you), the consumer's commit isn't effective because it happens too late; the next time the consumer asks for a message, it is given the same message again.
Some of your options are to:
Increase session.timeout.ms (currently 30000), so the consumer has more time to process the messages.
Decrease max.poll.records (currently 20), so the consumer has fewer messages to work through before the timeout occurs. This doesn't really apply to you, though, because your consumer is already only working on a single message at a time.
Or turn on enable.auto.commit, which probably isn't the best solution for you either, because it might result in dropping messages, as mentioned below:
If we allowed offsets to auto commit as in the previous example messages would be considered consumed after they were given out by the consumer, and it would be possible that our process could fail after we have read messages into our in-memory buffer but before they had been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
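Following the quoted advice, a minimal at-least-once sketch with the Java client: offsets are committed only after the records are written to the database, so a crash in between re-delivers rather than drops them. saveToDatabase is a placeholder for your own persistence call, and the consumer is assumed to have enable.auto.commit=false.

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceLoop {

    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                saveToDatabase(record.value());  // persist first...
            }
            consumer.commitSync();               // ...then advance the offset, so nothing is dropped
        }
    }

    private static void saveToDatabase(String value) {
        // placeholder for the real database insert
    }
}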
I am using a Kafka 0.9.0.1 broker and the 0.9.0.1 consumer client. My consumer instances process each record in less than 1 second. The other main configs are:
enable.auto.commit=false
session.timeout.ms=30000
heartbeat.interval.ms=25000
I am committing offset after processing.
I am getting the exception
Error UNKNOWN_MEMBER_ID occurred while committing offsets for group kafka_to_s3
ERROR com.bsb.hike.analytics.consumer.Consumer - unable to commit retryCount=2 org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
once or twice an hour. We consume approximately 6 billion events a day. It also seems like the offsets are stored in only one partition of the "__consumer_offsets" topic, which increases the load on that particular broker.
Does anybody have a clue about these problems?
Kafka triggers a rebalance if it doesn't receive at least one heartbeat within the session timeout. Once the rebalance is triggered, the commit will fail; that is expected. So the question is why the heartbeat did not happen. There might be a couple of reasons for that.
The first thing is that you are doing a manual commit. As of 0.9, the heartbeat does not happen on a separate thread: the consumer runs on a single thread which handles commits, heartbeats and polling, so a heartbeat is only sent when you call consumer.poll() or consumer.commitSync(). If your processing time exceeds the session timeout, that can cause heartbeats to be missed.
There is a known issue in the Kafka 0.9 consumer which might cause the problem you are facing:
https://issues.apache.org/jira/browse/KAFKA-3627
In either case, downgrading your consumer to 0.8 will solve the problem.
Edit: You can try increasing the session timeout to as high as 5 minutes and see if it works.
Regarding the Kafka configs:
The Kafka server expects to receive at least one heartbeat within the session timeout, so the consumer attempts at most (session timeout / heartbeat interval) heartbeats in that window. Some heartbeats might be missed, so your heartbeat interval should not be more than 1/3 of the session timeout (you can refer to the docs).
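Applied to the configs in this question (where heartbeat.interval.ms=25000 leaves room for barely one heartbeat per 30-second session), a small sketch of that 1/3 rule; the exact values are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");     // broker may expire us after 30 s of silence
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");  // <= session.timeout.ms / 3, so a missed
                                                                          // heartbeat or two does not end the session
    }
}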