Consumer does not know about partition revocation - apache-kafka

Topic name: testTopic
Total # of messages in topic: 1
Partitions: 8
Consumer group name: Consumer1
Consumer language: Java, with a partition (rebalance) listener implementation
Infrastructure: 4 JVMs running in parallel (which means 4 consumers are running with the same group name)
Problem: when I start my first consumer, the listener callback methods are called and partition assignment is done. This consumer starts processing my messages.
For example, the consumer is holding a message MSG-1 and my processor is processing it (I intentionally put in a 20-second thread wait), so the offset for MSG-1 was not yet committed.
Consumer properties:
session.timeout.ms = 15000 (15 seconds)
In the meantime, consumer 2 started.
This consumer started and was assigned partitions (the callback methods were properly called), but it did not consume the messages, because those 2 messages were still held by consumer 1.
Then consumer 1 exceeded the heartbeat interval; the broker considered consumer-1 dead and reassigned all of its partitions to consumer-2.
The callback methods (assigned & revoked) were called on consumer-2. In the meantime, my session timeout expired, so MSG-1 and MSG-2 became available again and were picked up by consumer-2.
Now I have processed MSG-1 and MSG-2 twice: once on consumer-1 and once on consumer-2.
My problems here are:
Why did consumer-1 not get the partition-revoked callback?
After my thread sleep completes (on consumer-1), it tries to commit the offset for the partition, and we get an error saying partition reassignment has been done and the commit is not allowed. This is correct, but how can I get the callback on consumer-1?
-Naresh.

Why did consumer-1 not get the partition-revoked callback?
A consumer only gets the partition-revoked callback if it participates in a rebalance. However, because it timed out and dropped out of the group, it does not participate in the rebalance, and the broker does not send any information to the consumer. Therefore, the consumer does not know that its partitions were revoked (and hence, there is no callback).
After my thread sleep completes (on consumer-1), it tries to commit the offset for the partition, and we get an error saying partition reassignment has been done and the commit is not allowed. This is correct.
Not sure what you mean by "we are getting partition re-assignment is done": because the consumer does not participate in the rebalance, it still thinks it owns the partitions. Hence it tries to commit and, as you correctly said, it is (correctly) not allowed to commit, as it dropped out of the group.
But how can I get the callback on consumer-1?
You need to re-join the group by calling poll() again to get back into a healthy state.
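To make this concrete, here is a minimal sketch of such a poll loop with a rebalance listener; the broker address is a placeholder, and the topic/group names are taken from the question. Note that the revoked callback fires only for live group members, and that a failed commitSync() from a kicked-out consumer is recovered from simply by polling again:
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "Consumer1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("testTopic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Fires only while this consumer is still a live member of the group.
                        System.out.println("Revoked: " + partitions);
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        System.out.println("Assigned: " + partitions);
                    }
                });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            records.forEach(r -> System.out.println("Processing " + r.value()));
            try {
                consumer.commitSync();
            } catch (CommitFailedException e) {
                // The group kicked us out (e.g., session timeout exceeded), so no
                // revoke callback was delivered. The next poll() re-joins the group.
                System.err.println("Commit failed; re-joining on next poll(): " + e.getMessage());
            }
        }
    }
}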
General comment: your timeout configuration seems rather low, and such small timeouts are not recommended in practice. I think it will be hard to get a stable group with such low timeouts, as consumers will most likely hit the timeout regularly, drop out of the group, and need to rejoin again.

Related

Kafka consumer - how does rebalance work if one consumer fails

I'm using AWS Kafka MSK and I have a topic with 2 partitions.
I also have 2 consumers that are part of the same consumer group.
I'm wondering what will happen in the following case:
Consumer A - took messages 1 - 100
Consumer B - took messages 101 - 200
Consumer A failed
Consumer B succeeded
What happens to the messages 1 - 100?
Will the automatic Kafka rebalance assign consumer B to read messages 1 - 100?
Or will the new consumer that starts up in place of Consumer A read the messages?
Thanks in advance.
Offset ranges are for partitions, not topics.
This scenario is not possible for a fresh consumer application unless one of the following is true:
Offsets 0-100 of the partition assigned to consumer B have been removed due to retention
Your code calls the seek method to skip those offsets
On the other hand, suppose the consumer group already existed, had consumed none of the records of the partition assigned to consumer A (say, it had failed before), and had committed offset 100 of the other partition. In that case, perhaps the same thing would happen: the consumer group might fail reading offset 0 of the "first" partition.
When any consumer instance fails, the group will rebalance. Depending on how you handle errors/failures, the previously healthy instance may then be assigned both partitions, and then fail consuming the "first" partition again (since it will be running the same code that died previously). Alternatively, you could write the code to ignore consumer exceptions and optionally route bad offsets to a dead-letter queue; when a failure is logged or ignored, you commit the offsets as the original consumer would have and skip those records (see the sketch below).
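For illustration, the "log or ignore, then commit" approach might look like the fragment below; process() and sendToDeadLetterQueue() are hypothetical helpers (the latter would typically be a producer writing to a DLQ topic), not library calls:
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
    try {
        process(record); // hypothetical business logic
    } catch (Exception e) {
        // Don't let one bad record crash the consumer and then fail again
        // after every rebalance; park it in a DLQ topic and move on.
        sendToDeadLetterQueue(record, e); // hypothetical helper
    }
}
consumer.commitSync(); // commit past the failed records so they are not re-fetched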

What if a Kafka consumer handles a message for too long? Will Kafka reassign the partition to another consumer so that the message is handled twice?

Suppose Kafka, 1 partition, 2 consumers (the 2nd consumer is idle).
Suppose the 1st one consumes a message and goes to handle it with 3 other services, then suddenly gets stuck on one of them and misses Kafka's timeout.
Will Kafka reassign the partition to the 2nd consumer, so that the message is handled twice (supposing the 1st one eventually succeeds)?
What if a Kafka consumer handles a message for too long? Will Kafka reassign the partition to another consumer so that the message is handled twice?
Yes, that's correct. If a Kafka consumer takes too long to handle a message and the subsequent poll() is delayed, Kafka will reassign the partition to another consumer and the message will be processed again (and again).
For more clarity, we first need to decide and define "how long is too long?".
This is defined by the property max.poll.interval.ms. From the docs:
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
The consumer group is rebalanced if there are no calls to poll() within this time.
There is one more property, auto.commit.interval.ms. The auto-commit check runs only during poll(): it checks whether the elapsed time is greater than the configured auto-commit interval and, if so, commits the offsets.
If the consumer takes too long to process the records, the subsequent poll() call is delayed and the offsets returned by the last poll() are not committed. If a rebalance happens at this point, the new consumer client assigned to the partition will start processing the messages again.
The consumer group rebalance, and the resulting partition reassignment, can be avoided by increasing this value. This increases the allowed interval between polls and gives consumers more time to handle the record(s) returned from poll(). Consumers only join a rebalance inside a call to poll(), so increasing max.poll.interval.ms also delays group rebalances.
There is one problem with increasing max.poll.interval.ms to a large value: if the consumer dies for some other reason, it takes longer than the configured max.poll.interval.ms to detect the failure.
session.timeout.ms and heartbeat.interval.ms are available in this case to detect a total failure as early as possible.
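A sketch of how these properties might be set together (the values are illustrative, not recommendations):
Properties props = new Properties();
// The broker evicts the consumer if no heartbeat arrives within this window;
// heartbeats come from a background thread, so this catches hard crashes fast.
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
// Usually kept to about one third of session.timeout.ms.
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
// Upper bound on the gap between poll() calls; allows slow record processing
// without the consumer being dropped from the group.
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");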
For more details about these parameters:
Please refer to this
KIP-62
Please note that the value configured for session.timeout.ms must be within the allowable range configured on the broker by the properties:
group.min.session.timeout.ms
group.max.session.timeout.ms
Otherwise, the following exception will be thrown when starting the consumer client:
Exception in thread "main" org.apache.kafka.common.errors.InvalidSessionTimeoutException:
The session timeout is not within the range allowed by the broker
(as configured by group.min.session.timeout.ms and group.max.session.timeout.ms)
Update: To avoid handling the messages again
There is another method in the KafkaConsumer class, commitAsync(), to trigger the commit-offsets operation.
ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(500));
// ... process the records ...
kafkaConsumer.commitAsync(); // commits the offsets returned by the last poll()
For more details on commitSync() and commitAsync(), please check this thread.
Committing an offset manually is an act of saying that the offset has been processed, so that Kafka won't send the committed records for the same partition again. When committing manually, it is important to note that if offsets are committed before the records are actually processed and the consumer dies for any reason, there is a chance those records won't be processed again.
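To shrink the window of duplicate processing further, offsets can also be committed explicitly after each record; a sketch (process() is a hypothetical handler):
for (ConsumerRecord<String, String> record : records) {
    process(record); // hypothetical business logic
    // Commit the offset of the *next* record to read, right after handling this one,
    // so a crash or rebalance re-delivers at most the record that was in flight.
    kafkaConsumer.commitSync(Collections.singletonMap(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1)));
}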

Consumer 'group_name' group is rebalancing forever

I am using Kafka 2.11-1.0.1. The application contains consumers with concurrency=5 for topic 'X', which has 5 partitions.
When the application is restarted and a message is published on topic 'X' before partition assignment happens,
the 5 consumers of topic 'X' find the group coordinator and send join-group requests to it. A response from the group coordinator is expected, but no response is received.
I checked the Kafka server logs but could not find anything relevant, even at DEBUG log level.
When I run the describe consumer group command, I observe the following:
The consumer group is rebalancing
Old consumers with some lag
New consumers with random names; as time goes on, the number of new consumers keeps increasing
New messages are published on topic 'X', but they are not received by the consumers
heartbeat.interval.ms and session.timeout.ms are set to their defaults.
This problem occurs if a message is published before partition assignment for topic 'X' and its consumers.
My question is: why does the rebalance never complete, so that the new consumers can start consuming the newly produced messages?
The application has the following consumers in the consumer group:
Consumer A listens to Topic1. Topic1 has 1 partition.
max.poll.interval.ms = 4 hours for this consumer.
Consumer B listens to Topic2. Topic2 has 5 partitions.
Consumer B concurrency = 5.
max.poll.interval.ms = 1 hour for this consumer.
What happens on application restart when one of the topics already has a published message:
When the application restarts, one consumer instance (consumerA1) is created and subscribes to topic1. ConsumerA1 finds the group coordinator (GC) and sends a join-group request.
ConsumerA1 gets a response from the GC and becomes the group leader. Up to this step, no other consumer has initialized.
ConsumerA1 assigns partitions and sends a SyncGroup request to the GC. A new assignment generation happens. In this way the first rebalance completes.
A message has already been published on topic1, so consumerA1 fetches it and starts processing. Processing this message takes a significant amount of time (say, 2 hours).
Now the 5 consumer instances initialize one by one, and all of them subscribe to topic2. These consumers find the GC and send join-group requests,
but the GC does not respond to them.
When consumerA1 sends a heartbeat to the GC, the GC responds that a rebalance is in progress, but consumerA1 does not revoke its partition, since it is still processing the message.
According to the rebalance protocol (there is a nice article on rebalancing), the GC waits until all consumers have sent join-group requests. In this case, the GC waits for the join-group request from consumerA1. The maximum wait is max.poll.interval.ms, i.e., 4 hours in this case.
Root Cause:
The group coordinator did not wait for all consumers to initialize after the application restart, so an unnecessary first rebalance happened; as a result, consumerA1 fetched the message from its partition and started processing it.
Solution:
To avoid such an unnecessary initial rebalance, Kafka provides a configuration that makes the group coordinator wait until consumers have joined a new consumer group. Documentation:
group.initial.rebalance.delay.ms
I checked my Kafka server.properties; it was set to 0.
I tried the default, i.e., 3 seconds.
The initial rebalance was avoided: the GC waited 3 seconds on application restart, and in that time all the other consumers initialized. All consumers sent join-group requests, and since the GC received requests from all of them, it responded without any delay; the rebalance proceeded and completed successfully.
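For reference, this is a broker-side setting in server.properties; 3000 ms (3 seconds) is Kafka's default:
# server.properties (broker): how long the group coordinator waits for more
# members to join a brand-new consumer group before the first rebalance.
group.initial.rebalance.delay.ms=3000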

What happens to records/messages when record processing takes longer than 'max.poll.interval.ms'?

I have the following consumer settings:
auto.offset.reset=earliest
enable.auto.commit=true (default value)
session.timeout.ms=10000 (default value)
max.poll.interval.ms= 300000 (default value)
With the above configuration, let's say I have five messages (m1, m2, m3, m4 and m5) in topic A (with only 1 partition). Now I have a consumer subscribed to this topic; it was able to process the first two messages (m1 and m2) without any issues and committed the offsets.
Now, let's say the consumer got the third message, m3, and processing it took 300,100 ms because of some network latency. As per my understanding, the offset commit will not happen, because the record processing took more than max.poll.interval.ms, and hence the consumer will be considered dead and removed from the group.
Now I have two questions:
What happens to message m3? I mean, will it be picked up in the next poll, since its offset was not committed?
What happens to the other messages, m4 and m5?
Exceeding max.poll.interval.ms without calling poll() is one of the causes of a rebalance. When a rebalance starts in a consumer group, all the consumers in the group have their partitions revoked (they are removed from the member list). During the rebalance, Kafka waits for all healthy consumers to send a JoinGroupRequest, by calling poll(), until the rebalance timeout expires (the rebalance timeout equals max.poll.interval.ms). Once the JoinGroupRequests of the healthy consumers have arrived, or the rebalance timeout expires, Kafka assigns partitions to the consumers that sent JoinGroupRequests.
In your case:
What happens to message m3? I mean, will it be picked up in the next poll, since its offset was not committed?
Answer: Its processing continues even after your consumer's partitions are revoked, unless you have logic to interrupt the processing thread on revocation (a sketch of this is shown below). So all the messages returned from the previous poll are processed, but the offsets cannot be committed. If the partition is assigned to another consumer as a result of the rebalance, the new consumer will get the same messages, starting from m3, so the message(s) will be processed twice. When the first consumer sends a poll request again, that amounts to a JoinGroupRequest, and a rebalance will be triggered again.
What happens to the other messages, m4 and m5?
Answer: If these messages were returned from the same poll() as m3, the result is the same: they will be processed, but their offsets cannot be committed by the old consumer. The new consumer will process the messages and commit the offsets.
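If you do want in-flight work to stop when partitions are revoked, the revoke callback can signal the processing thread. A sketch, assuming records are handed off to a hypothetical worker thread ("worker" below); remember the callback runs inside poll(), on the consumer thread:
// "worker" is a hypothetical Thread running the record-processing loop;
// revoked is a java.util.concurrent.atomic.AtomicBoolean shared with it.
final AtomicBoolean revoked = new AtomicBoolean(false);

consumer.subscribe(Collections.singletonList("topicA"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        revoked.set(true);  // processing loop checks this flag between records
        worker.interrupt(); // and blocking calls get interrupted
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        revoked.set(false);
    }
});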

Does kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers (with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy: unless one of the processes commits, the other consumers (in different processes) do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? And also, how can I consume messages in parallel across processes?
If it is of any use, I am doing this in a pod (Kubernetes), where the 3 processes are 3 different Mules.
Committing shouldn't make any difference, because the committed offset is only used when there is a change in group membership. With three processes there will be some rebalancing while they start up, but once all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition, and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Auto-commit also makes little difference: it just means a commit is done synchronously during a subsequent poll rather than by your application code. The only real reason to commit manually is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed. Doing this is generally inadvisable; it is better to add consumers to increase throughput than to try to share out processing within a consumer.
One possible explanation is simply infrequent polling. You mention that other consumers are picking up partitions, and that committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused either by a change in partitions at the broker (presumably not the case here) or by a change in group membership, which in turn is caused either by the heartbeat thread dying (a pod being stopped) or by a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, the new one will poll from that offset. If not, the new one will poll from either the start of the partition or the high watermark, depending on auto.offset.reset (default: latest, i.e., the high watermark).
So, suppose you have a consumer that polls but doesn't commit, and then doesn't poll again for 5 minutes: a rebalance happens, a new consumer picks up the partition and starts from the end (skipping any messages up to that point), and its first poll returns nothing, since it is starting from the end. If it doesn't poll for another 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause; there should be more information about what is going on in your logs, as the Kafka consumer code emits plenty of helpful INFO-level logging about rebalances.
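If that is what is happening, two consumer settings can make the scenario less likely or at least less surprising (the values below are illustrative):
// Start from the beginning of a partition when the group has no committed offset,
// instead of silently skipping to the high watermark:
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// Fetch smaller batches so each poll()'s records are processed well within
// max.poll.interval.ms and slow processing does not trigger rebalances:
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50");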