max.poll.interval.ms and session.timeout.ms | Kafka consumer alive - apache-kafka

Scenario:
Committing offsets manually after processing the messages.
session.timeout.ms: 10 seconds
max.poll.interval.ms: 5 minutes
Processing of messages consumed in a "poll()" is taking 6 minutes
Timeline:
A (0 seconds): the app starts poll(), has consumed the messages and started processing (which will take 6 minutes)
B (3 seconds): a heartbeat is sent
C (6 seconds): another heartbeat is sent
D (5 minutes): another heartbeat is sent (300 s is a multiple of the 3 s heartbeat interval) BUT "max.poll.interval.ms" (5 minutes) is reached
At point "D" will consumer:
send a "LeaveGroup request", so this consumer is considered "dead" and a rebalance happens?
continue sending heartbeats every 3 seconds?
If point "1" is the case, then
a. how will this consumer commit offsets after completing the 6 minutes of processing, considering that its partition(s) have been reassigned due to the rebalancing at point "D"?
b. should "max.poll.interval.ms" be set in advance according to the expected processing time?
If point "2" is the case, then will we never know if the processing is actually blocked?
Thank you.

Starting with Kafka version 0.10.1.0, consumer heartbeats are sent in a background thread, so that the client's processing time can be longer than the session timeout without causing the consumer to be considered dead.
However, max.poll.interval.ms still sets the maximum allowable time between calls to the poll method.
In your case, with a processing time of 6 minutes, it means that at point "D" your consumer will be considered dead.
Your concerns are right, as the consumer will then not be able to commit the offsets after the 6 minutes. Your consumer will get a CommitFailedException (as described in another answer on CommitFailedException).
To conclude, yes, you need to increase the max.poll.interval.ms time if you already know that your processing time will exceed the default time of 5 minutes.
Another option would be to limit the fetched records during a poll by decreasing the configuration max.poll.records which defaults to 500 and is described as: "The maximum number of records returned in a single call to poll()".
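Both options from this answer can be combined in the consumer configuration. A minimal sketch with the Java client, assuming the broker address, group id, and chosen values are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Sketch only: bootstrap server, group id, and the chosen values
// are illustrative placeholders, not recommendations.
public class LongProcessingConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        // Manual offset commits, as in the scenario above.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Option 1: allow up to 10 minutes between poll() calls,
        // since processing a batch is expected to take ~6 minutes.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
        // Option 2: fetch fewer records per poll() (default 500),
        // so each batch finishes well within the interval.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        return props;
    }
}
```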

Related

Spring-kafka Rebalance issue

I am using spring-kafka and I have set enable.auto.commit to false, so the acknowledgement of the records is taken care of by Spring in BATCH mode.
Assuming I have set max.poll.records to 200 and the poll interval is 5 minutes, and I know that all 200 records will not be processed within 5 minutes, I have an interesting observation which I am not able to understand.
The first rebalance happens after 5 minutes, which is expected, but after that rebalance, the next one does not happen for a very long time. Why? It should have happened exactly 5 minutes later; why do the later rebalances take so long?

Kafka is sending same event twice to different instances of micro service

We have 2 Kafka clusters, each with 6 nodes, active and standby, and a topic with 12 partitions, with 12 instances of the app running. Our app is a consumer, and all consumers use the same consumer group ID to receive events from Kafka. Event processing is sequential: 'event comes in -> process the event -> do manual ack'. This processing takes approximately 5 seconds to complete, including the manual acknowledgement. Although there are multiple instances, only one event at a time is processed. Recently we found an issue in production: consumer rebalancing happens every 2 seconds, and because of this the message offset commit (manual ack) fails, the same event is sent twice, and duplicate records are inserted in the database.
Kafka consume config values are:
max.poll.interval.ms = 300000 // 5 minutes
max.poll.records = 500
heartbeat.interval.ms=3000
session.timeout.ms=10000
Errors seen:
1. Commit offset failed since the consumer is not part of the active group.
2. Time between subsequent calls to poll() was longer than the configured max poll interval in milliseconds, which typically implies that the poll loop is spending too much time on message processing.
But message processing takes 5 seconds, not more than the configured max poll interval, which is 5 minutes. As it is sequential processing, only one consumer can poll and get an event at a time, and the other instances have to wait their turn to poll. Is this causing the above 2 errors and the rebalancing? Appreciate the help.

Kafka Consumer polling interval

I have a Kafka topic and I have attached 1 consumer to it (the topic has only 1 partition).
For the timeouts, I am using the default values (heartbeat: 3 sec, session timeout: 10 sec, poll timeout: 5 min).
As per the documentation, the poll timeout means the consumer has to process the message within this time, else the broker will remove this consumer from the consumer group. Now suppose it takes the consumer only 1 minute to finish processing the message.
Now I have two questions
a) Will it call poll() only after 5 minutes, or will it call poll() as soon as it finishes processing?
b) Also, suppose the consumer is sitting idle for some time; what would the frequency of polling be, i.e. at what interval will the consumer poll the broker for messages? Will it be the poll timeout or something else?
I presume the 5 minute setting you are referring to is max.poll.interval.ms, and not the poll timeout.
Also, I presume you are calling Kafka from Java; the answer might be different if you are using a different language.
The poll timeout is the value you pass to the KafkaConsumer poll() method. This is the maximum time the poll() method will block for, after you call it.
The maximum polling interval of 5 minutes means that you must call poll() again before 5 minutes are over since the last call of poll() had returned. If you don't, your consumer will be disconnected.
So your questions:
a) Will it call poll() only after 5 minutes, or will it call poll() as soon as it finishes processing?
That is totally up to you. You are the one who is doing the calling, in your own code. You should have a loop in which you call poll().
b) Also, suppose the consumer is sitting idle for some time; what would the frequency of polling be, i.e. at what interval will the consumer poll the broker for messages? Will it be the poll timeout or something else?
A consumer (i.e. your own application code) shouldn't be sitting idle but should be in a loop. In this loop, you call poll(), then you process the event (1 minute), and then you call poll() again.
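The loop described above can be sketched with the Java client as follows; the topic name, connection settings, and the process() body are placeholders, and the snippet needs a running broker to do anything:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of the poll loop; all connection details are placeholders.
public class PollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                // This Duration is the "poll timeout": the maximum time poll()
                // blocks waiting for records before returning (possibly empty).
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // e.g. your ~1 minute of work
                }
                consumer.commitSync();
                // poll() is called again immediately on the next iteration,
                // so the consumer stays within max.poll.interval.ms as long
                // as one batch is processed faster than that interval.
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```

When idle, the loop simply cycles: each poll() blocks for up to the poll timeout, returns empty, and is called again, so there is no separate "idle polling frequency".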

Kafka consumer behavior in case of DisconnectException

We have several applications consuming from Kafka, that regularly encounter a DisconnectException.
What happens is always like the following:
The application is subscribed on say partitions 5 and 6, messages are processed from both partitions
From time T, no message is consumed on partition 5, only messages of partition 6 are consumed.
At T + around 5 minutes, Kafka consumer spits many log lines:
Error sending fetch request (sessionId=552335215, epoch=INITIAL) to node 0: org.apache.kafka.common.errors.DisconnectException.
After that, the consumption resumes from partition 5 and 6 and catches up the accumulated lag
The same issue occurs if the application consumes a single partition: in this case, no message is consumed for 5 minutes.
My understanding according to https://issues.apache.org/jira/browse/KAFKA-6520 is that in case of connection issue, the Kafka consumer retries (with backoff, up to 1 second by default according to reconnect.backoff.max.ms config), hiding the issue to the end user. The calls to poll() return 0 message, so the polling loop goes on and on.
However, some questions remain:
If the fetch fails due to a connection issue, then the broker does not receive these requests, and after "max.poll.interval.ms" (50 seconds in our case) it should expel the consumer and trigger a rebalance. Why is this not happening?
Since the Kafka consumer retries every second, why would it take systematically 5 minutes to reconnect? Unless there is some infrastructure / network issue going on...
Otherwise, is there any client-side configuration parameter which could explain the 5-minute delay? Could this delay somehow be related to "metadata.max.age.ms" (5 minutes by default)?

What is the delay time between each poll

In the Kafka documentation, I'm trying to understand this property: max.poll.interval.ms
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
This means each poll must happen before this timeout expires, which by default is 5 minutes. So my question is: exactly how much time does the consumer thread take between two consecutive polls?
For example: Consumer Thread 1
First poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Second poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Does the consumer take time between the first and the second poll? If yes, why? And how can we change that time (assume the topic has a huge amount of data)?
It's not clear what you mean by "take time between"; if you are talking about the spring-kafka listener container, there is no wait or sleep.
The consumer is polled immediately after the offsets are committed.
So, max.poll.interval.ms must be large enough for your listener to process max.poll.records (plus some extra, just in case).
But, no, there are no delays added between polls, just the time it takes the listener to handle the results of the poll.
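That sizing rule ("large enough for max.poll.records plus some extra") can be spelled out with simple arithmetic. A rough sketch, where the 50% headroom factor is an assumption for illustration, not a Kafka default:

```java
public class PollIntervalSizing {
    // Rough estimate of the max.poll.interval.ms needed to process one full
    // batch, with 50% headroom on top of the expected work. The headroom
    // factor is an assumption, not a Kafka default.
    static long requiredPollIntervalMs(int maxPollRecords, long perRecordMs) {
        return (long) (maxPollRecords * perRecordMs * 1.5);
    }

    public static void main(String[] args) {
        // 500 records at 600 ms each is 300,000 ms of work; the default
        // 5-minute (300,000 ms) interval leaves no headroom, so either raise
        // max.poll.interval.ms or lower max.poll.records.
        System.out.println(requiredPollIntervalMs(500, 600)); // prints 450000
    }
}
```

If the estimate exceeds the configured interval, either raise max.poll.interval.ms or lower max.poll.records until the batch fits comfortably.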