What happens to records/messages during consumption when record processing takes longer than 'max.poll.interval.ms'? - apache-kafka

I have the consumer settings below.
auto.offset.reset=earliest
enable.auto.commit=true (default value)
session.timeout.ms=10000 (default value)
max.poll.interval.ms= 300000 (default value)
With the above configuration, let's say I have five messages (m1, m2, m3, m4 and m5) in a topic A (with only 1 partition). Now I have a consumer subscribed to this topic; it processed the first two messages (m1 and m2) without any issues and committed their offsets.
Now, let us say the consumer got the third message m3 and is trying to process it, and the processing took 300100 ms because of some network latency. As per my understanding, the offset commit will not happen because the record processing took more than max.poll.interval.ms, and hence the consumer would be considered dead and removed from the group.
Now I have two questions:
What happens to the message m3? I mean, would it be picked up in the next poll because its offset was not committed?
What happens to the other messages m4 and m5?

Expiring max.poll.interval.ms without calling poll() is one of the reasons for a rebalance. When a rebalance starts in a consumer group, all the consumers in the group have their partitions revoked (they are removed from the member list). During the rebalance, Kafka waits for all healthy consumers to send a JoinGroupRequest (by calling poll()) until the rebalance timeout, which equals max.poll.interval.ms. Once the healthy consumers have sent their JoinGroupRequests, or the rebalance timeout has expired, Kafka assigns partitions to the consumers that sent a JoinGroupRequest.
In your case:
What happens to the message m3? I mean, would it be picked up in the next poll because its offset was not committed?
Answer: Its processing continues even after your consumer's partitions are revoked, unless you have logic to interrupt the processing thread on revocation. So all the messages returned from the previous poll are processed, but their offsets cannot be committed. If the partition is assigned to another consumer as a result of the rebalance, the new consumer will get the same messages starting from m3, so those messages will be processed twice. When the first consumer sends a poll request again, that amounts to a new JoinGroupRequest, and another rebalance will be triggered.
What happens to the other messages m4 and m5?
Answer: If these messages were returned from the same poll() as m3, the result is the same: they will be processed, but their offsets cannot be committed by the old consumer. The new consumer will process the messages and commit the offsets.
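To make the sequence above concrete, here is a minimal poll-loop sketch. It is not code from the question: the broker address, topic name, group id and the process() method are placeholders, and manual commits are used so the commit failure is visible.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class SlowProcessingDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually for clarity
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("A"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Runs inside poll(); it cannot interrupt processing that is already
                    // happening in this thread (e.g. a slow m3).
                    System.out.println("Partitions revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Partitions assigned: " + partitions);
                }
            });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this exceeds max.poll.interval.ms, the group rebalances
                }
                // Throws CommitFailedException if the partition was reassigned in the meantime,
                // so the next owner re-reads from the last committed offset (m3 in the question).
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { // placeholder processing
        System.out.printf("processing %s at offset %d%n", record.value(), record.offset());
    }
}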

Related

What if a Kafka consumer handles a message too long? Will Kafka reappoint this partition to another consumer, and will the message be doubly handled?

Suppose Kafka, 1 partition, 2 consumers (the 2nd consumer is idle).
Suppose the 1st one consumed a message and goes to handle it with 3 other services, but suddenly it gets stuck on one of them and misses Kafka's timeout.
Will Kafka reappoint the partition to the 2nd consumer, and will the message be doubly handled (supposing the 1st one eventually succeeds)?
What if a Kafka consumer handles a message too long? Will Kafka reappoint this partition to another consumer, and will the message be doubly handled?
Yes, that's correct. If a Kafka consumer takes too long to handle a message and the subsequent poll() is delayed, Kafka will re-appoint this partition to another consumer and the message will be processed again (and again).
For more clarity, first we need to decide and define 'how long is too long?'.
This is defined by the property max.poll.interval.ms. From the docs:
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
The consumer group is rebalanced if there are no calls to poll() within this time.
There is one more property, auto.commit.interval.ms. The auto-commit check runs only during poll(): it checks whether the time elapsed is greater than the configured auto-commit interval, and if so, the offsets are committed.
If the Kafka consumer takes too long to process the records, the subsequent poll() call is also delayed and the offsets returned by the last poll() are not committed. If a rebalance happens at this time, the new consumer client assigned to this partition will start processing the messages again.
Consumer group rebalances and the resulting partition reassignments can be avoided by increasing this value. This increases the allowed interval between polls and gives consumers more time to handle the record(s) returned from poll(). Consumers only join a rebalance inside the call to poll(), so increasing the max poll interval also delays group rebalances.
There is one more problem with increasing the max poll interval to a large value: if the consumer dies for some other reason, it takes longer than the configured max.poll.interval.ms to detect the failure.
session.timeout.ms and heartbeat.interval.ms are available in this case to detect the total failure as early as possible.
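Putting the properties discussed above together, a consumer configuration touching all of them might look like the sketch below. The values are just the defaults mentioned in this thread and are purely illustrative; tune them to your actual processing time.

enable.auto.commit=true
auto.commit.interval.ms=5000
max.poll.interval.ms=300000
max.poll.records=500
session.timeout.ms=10000
heartbeat.interval.ms=3000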
For more details about these parameters, please refer to this and to KIP-62.
Please note that the value configured for session.timeout.ms must be within the allowable range configured on the broker by the properties
group.min.session.timeout.ms
group.max.session.timeout.ms
Otherwise, the following exception will be thrown when starting the consumer client:
Exception in thread "main" org.apache.kafka.common.errors.InvalidSessionTimeoutException:
The session timeout is not within the range allowed by the broker
(as configured by group.min.session.timeout.ms and group.max.session.timeout.ms)
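For reference, the broker-side range comes from server.properties. The values below are only illustrative (they are the defaults on recent broker versions); check your own broker configuration.

group.min.session.timeout.ms=6000
group.max.session.timeout.ms=1800000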
Update: To avoid handling the messages again
There is another method in the KafkaConsumer class, commitAsync(), to trigger an offset commit.
ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(500));
// ... process the returned records here ...
kafkaConsumer.commitAsync(); // commits the offsets of the records returned by the last poll()
For more details on commitSync() and commitAsync(), please check this thread.
Committing an offset manually is a way of saying that the offset has been processed, so that Kafka won't send the committed records for the same partition again. When offsets are committed manually, it is important to note that if they are committed before the records are actually processed and the consumer dies for any reason, there is a chance those records won't be processed again.
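As a minimal sketch of this pattern (not from the original answer: the broker address, topic and group id are placeholders, and enable.auto.commit is turned off so commits are fully manual), commitAsync() can be called after each processed batch, with a final commitSync() on shutdown:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitAsyncDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
        kafkaConsumer.subscribe(Collections.singletonList("A"));              // placeholder topic
        try {
            while (true) {
                ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processed offset " + record.offset()); // placeholder processing
                }
                // Non-blocking commit after the batch has been processed; the callback
                // only logs failures rather than retrying them.
                kafkaConsumer.commitAsync((offsets, exception) -> {
                    if (exception != null) {
                        System.err.println("Async commit failed for " + offsets + ": " + exception);
                    }
                });
            }
        } finally {
            try {
                kafkaConsumer.commitSync(); // blocking best-effort commit before shutting down
            } finally {
                kafkaConsumer.close();
            }
        }
    }
}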

Consumer timeout during rebalance

When a consumer drops from a group and a rebalance is triggered, I understand no messages are consumed -
But does an in-flight request for messages stay queued past the max wait time?
Or does Kafka send any payload back during the rebalance?
UPDATE
For clarification, I'm referring specifically to the consumer polling process.
From my understanding, when one of the consumers drop from the consumer group, a rebalance of the partitions to consumers is performed.
During the rebalance, will an error be sent back to the consumer if it has already polled and is waiting for the max wait time to pass?
Or does Kafka wait out the max wait time and send back an empty payload?
Or does Kafka queue the request past the max wait time until the rebalance is complete?
Bottom line - I'm trying to explain periodic timeouts from consumers.
This may be in the docs, but I'm not sure where to find it.
Kafka producers don't send messages directly to their consumers; rather, they send them to the brokers.
The in-flight requests correspond to the producer, not to the consumer.
Whether a consumer leaves the group and a rebalance is triggered or not is quite immaterial to the behaviour of the producer.
Producer messages are queued in the buffer, batched, optionally compressed and sent to the Kafka broker as per the configuration.
In-flight requests are the maximum number of unacknowledged requests
the client will send on a single connection before blocking.
Note that when we say ack, it is acknowledgement by the broker and not by the consumer.
Does Kafka send any payload back during the rebalance?
The Kafka broker doesn't notify its producers of any rebalance.
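For context, that in-flight limit is a producer-side setting. An illustrative producer configuration might look like the following; bootstrap.servers is a placeholder and the values shown are merely typical, not a recommendation.

bootstrap.servers=localhost:9092
acks=all
max.in.flight.requests.per.connection=5
batch.size=16384
linger.ms=0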

Consumer 'group_name' group is rebalancing forever

I am using Kafka 2.11-1.0.1. The application contains consumers with concurrency=5 for the topic 'X', which has 5 partitions.
When the application is restarted and a message is published on topic 'X' before partition assignment,
the 5 consumers of topic 'X' find the group coordinator and send join group requests to it. A response from the group coordinator is expected, but no response is received.
I have checked the Kafka server logs but could not find anything relevant, even at DEBUG log level.
When I run the describe consumer group command, I observe the following:
The consumer group is rebalancing
Old consumers with some lag
New consumers with random names; as time goes on, the number of new consumers keeps increasing
New messages are published on topic 'X', but they are not received by the consumers.
heartbeat.interval.ms and session.timeout.ms are set to their defaults.
This problem occurs if the message is published before partition assignment for topic 'X' and its consumers.
My doubt is: why does the rebalancing never complete, so that the new consumers can start consuming the newly produced messages?
The application has the below consumers in the consumer group:
Consumer A listens to Topic1. Topic1 has 1 partition.
max.poll.interval.ms=4 hours for this consumer.
Consumer B listens to Topic2. Topic2 has 5 partitions.
Consumer B concurrency=5.
max.poll.interval.ms=1 hour for this consumer.
What happens on application restart if a message has already been published on one of the topics:
When the application restarts, one consumer instance (consumerA1) is created and it subscribes to topic1. ConsumerA1 finds the Group Coordinator (GC) and sends a join group request.
ConsumerA1 gets a response from the GC and becomes the leader. Up to this step, no other consumer has initialized.
ConsumerA1 assigns partitions and sends a SyncGroup request to the GC. A new assignment generation is created. In this way the first rebalance completes.
A message was already published on topic1, so consumerA1 fetches this message and starts processing it. Processing this message takes a significant amount of time (say 2 hours).
Now the 5 consumer instances initialize one by one and all of them subscribe to topic2. These consumers find the GC and send join group requests, but the GC does not respond to them.
When consumerA1 sends a heartbeat to the GC, the GC responds that a rebalance is in progress, but consumerA1 does not revoke its partition since it is still processing the message.
According to the rebalancing protocol (see the nice article on rebalancing), the GC waits until all consumers have sent a join group request. In this case, the GC waits for the join group request from consumerA1. The maximum wait is max.poll.interval.ms, i.e. 4 hours in this case.
Root cause:
The Group Coordinator did not wait for all consumers to initialize after the application restart. Therefore an unnecessary first rebalance happened, and consumerA1 fetched the message from its partition and started processing it.
Solution:
To avoid such an unnecessary initial rebalance, Kafka provides a configuration that makes the Group Coordinator wait until consumers have joined the new consumer group. Documentation:
group.initial.rebalance.delay.ms
I checked my Kafka server.properties; it was set to 0.
I tried the default, i.e. 3 seconds.
The initial rebalance was avoided: the GC waited 3 seconds on application restart, and in that time all the other consumers initialized. All consumers sent join group requests, so the GC got a request from every consumer. The GC responded without any delay, and the rebalance proceeded and completed successfully.
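In other words, the broker-side fix boils down to one line in server.properties (3000 ms being the default value referred to above):

group.initial.rebalance.delay.ms=3000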

Issue of Kafka Balancing at high load

Using Kafka version 2.11-0.11.0.3 to publish 10,000 messages (the total size of all messages is 10 MB), there are 2 consumers (with the same group id) consuming the messages in parallel.
While consuming, the same message was consumed by both consumers.
The below errors/warnings were thrown by Kafka:
WARN: This member will leave the group because consumer poll timeout
has expired. This means the time between subsequent calls to poll()
was longer than the configured max.poll.interval.ms, which typically
implies that the poll loop is spending too much time processing
messages. You can address this either by increasing
max.poll.interval.ms or by reducing the maximum size of batches
returned in poll() with max.poll.records.
INFO: Attempt to heartbeat failed since group is rebalancing
INFO: Sending LeaveGroup request to coordinator
WARN: Synchronous auto-commit of offsets
{ingest-data-1=OffsetAndMetadata{offset=5506, leaderEpoch=null,
metadata=''}} failed: Commit cannot be completed since the group has
already rebalanced and assigned the partitions to another member. This
means that the time between subsequent calls to poll() was longer than
the configured max.poll.interval.ms, which typically implies that the
poll loop is spending too much time message processing. You can
address this either by increasing max.poll.interval.ms or by reducing
the maximum size of batches returned in poll() with max.poll.records.
The below configurations were provided to Kafka:
server.properties
max.poll.interval.ms=30000
group.initial.rebalance.delay.ms=0
group.max.session.timeout.ms=120000
group.min.session.timeout.ms=6000
consumer.properties
session.timeout.ms=30000
request.timeout.ms=40000
What should be changed to resolve the duplicate consumption?
Are your consumers in the same group? If yes, you will have duplicate consumption whenever a consumer leaves/dies/times out without having committed some messages it has already processed.
If all your messages are consumed by both consumers, you probably have not set the same group id for them.
More info:
So you have set the same group id for all consumers, good. You are in the situation where the cluster/broker thinks that a consumer died and therefore rebalances the load onto another one. This other one will start consuming where the last commit was done.
So let's say consumer C_A read offsets up to 100 from partition P_1, processed them, and committed '100'; then it read offsets up to 200 and processed them, but could not commit because the broker considered C_A dead.
The broker reassigns partition P_1 to consumer C_B, which will start from the last commit for the group, which is 100, read up to 200, process, and commit 200.
So your question is how to avoid the consumer being considered dead (I assume it is not actually dead)?
The answer is already in the yellow WARN message in your question: you can tell your consumer to consume fewer messages (max.poll.records) per poll, to reduce the processing time between two polls to the broker, AND/OR you can increase max.poll.interval.ms, telling the broker to wait longer before considering your consumer dead.
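Concretely, that means adjusting one or both of the following consumer properties. The values below are only illustrative, not a recommendation for this particular workload; they need to be sized against your actual per-record processing time.

max.poll.records=50
max.poll.interval.ms=600000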

Kafka Message at-least-once mode at multi-consumer

Kafka messaging uses at-least-once message delivery to ensure every message is processed, and uses a message offset to indicate which message is to be delivered next.
When there are multiple consumers, if some deadly message causes a consumer crash during message processing, will this message be redelivered to other consumers and spread the death? If some slow message blocks a single consumer, can the other consumers keep going and process the subsequent messages?
Or even worse, if a slow and deadly message causes a consumer crash, will it cause other consumers to start from its offset again?
There are a few things to consider here:
A Kafka topic partition can be consumed by one consumer in a consumer group at a time. So if two consumers belong to two different groups they can consume from the same partition simultaneously.
Stored offsets are per consumer group. So each topic partition has a stored offset for each active (or recently active) consumer group with consumer(s) subscribed to that partition.
Offsets can be auto-committed at certain intervals, or manually committed (by the consumer application).
So let's look at the scenarios you described.
Some deadly message causes a consumer crash during message processing:
If offsets are auto-committed, chances are that by the time the processing of the message fails and crashes the consumer, the offset has already been committed, and the next consumer in the group that takes over will not see that message anymore.
If offsets are manually committed after processing is done, then the offset of that message will not be committed because of the consumer crash (for simplicity, I am assuming one message is read and processed at a time, but this can easily be generalized). So any other consumer in the group that is (or will be) subscribed to that topic will read the message again after taking over that partition, and it is possible that it will crash other consumers too. If offsets are committed before message processing, then the next consumers won't see the message, because the offset was already committed when the first consumer crashed. A minimal sketch of the commit-after-processing variant follows these scenarios.
Some slow message blocks a single consumer: as long as the consumer is considered alive, no other consumer in the group will take over. If the slowness means poll() is not called within max.poll.interval.ms (or heartbeats stop for longer than session.timeout.ms), the consumer will be considered dead and removed from the group. So whether another consumer in the group will read that message depends on how/when the offset is committed.
Slow and deadly message causes a consumer crash: this scenario is similar to the previous ones in terms of how Kafka handles it; either the slowness is detected first or the crash occurs first. Again, the main thing is how/when the offset is committed.
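Here is the promised sketch of the commit-after-processing variant (at-least-once). It is not from the original answer: the broker address, group id, topic name and the handle() method are placeholders, and auto-commit is disabled so the commit point is explicit.

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AtLeastOnceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic"));      // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record); // if this crashes the consumer, the record is redelivered
                    // Commit *after* processing: at-least-once, duplicates possible.
                    // Committing *before* handle(record) would give at-most-once instead.
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {       // placeholder processing
        System.out.println("handled offset " + record.offset());
    }
}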
I hope that helps with your questions.