Kafka consumer behavior in case of DisconnectException - apache-kafka

We have several applications consuming from Kafka that regularly encounter a DisconnectException.
What happens is always the following:
The application is subscribed to, say, partitions 5 and 6, and messages are processed from both partitions.
From some time T, no messages are consumed from partition 5; only messages from partition 6 are consumed.
At around T + 5 minutes, the Kafka consumer emits many log lines:
Error sending fetch request (sessionId=552335215, epoch=INITIAL) to node 0: org.apache.kafka.common.errors.DisconnectException.
After that, consumption resumes from partitions 5 and 6 and catches up on the accumulated lag.
The same issue occurs if the application consumes a single partition: in that case, no messages are consumed for 5 minutes.
My understanding, according to https://issues.apache.org/jira/browse/KAFKA-6520, is that in case of a connection issue the Kafka consumer retries (with backoff, capped at 1 second by default per the reconnect.backoff.max.ms config), hiding the issue from the end user. The calls to poll() return 0 messages, so the polling loop goes on and on.
However, some questions remain:
If the fetch fails due to a connection issue, then the broker does not receive these requests, and after max.poll.interval.ms (50 seconds in our case) it should expel the consumer and trigger a rebalance. Why is this not happening?
Since the Kafka consumer retries every second, why would it systematically take 5 minutes to reconnect? Unless there is some infrastructure/network issue going on...
Otherwise, is there any client-side configuration parameter which could explain the 5-minute delay? Could this delay be somehow related to metadata.max.age.ms (5 minutes by default)?
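For reference, a minimal sketch of where these client-side settings live; the values shown are the Kafka client defaults (not the settings from this question), and the broker address and group id are placeholders.

// Fragment only: the client-side settings discussed above, shown with their defaults.
Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("group.id", "my-group");
props.put("reconnect.backoff.ms", "50");        // first retry after a disconnect
props.put("reconnect.backoff.max.ms", "1000");  // retry backoff cap, 1 second by default
props.put("metadata.max.age.ms", "300000");     // forced metadata refresh, 5 minutes by default
props.put("max.poll.interval.ms", "300000");    // maximum allowed gap between poll() calls, 5 minutes by default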

Related

Kafka Consumer death handling

I have a question regarding the handling of consumer death due to exceeding the timeout values.
My example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding, consuming messages in Kafka, very simplified, works as follows (a rough sketch in code follows the list):
consumer polls 100 records from topic
a heartbeat signal is sent to broker
processing records in progress
processing records completes
finalize processing (commit, do nothing etc.)
repeat #1-5 in a loop
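A minimal sketch of that loop with the plain Java client; the topic and group names are made up, and process() is a placeholder for the actual record handling.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ten-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");
        props.put("max.poll.records", "100");                      // step 1: at most 100 records per poll

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {                                         // step 6: repeat
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));      // step 1: poll records
                // step 2: heartbeats are sent by a background thread while the consumer is alive
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                               // steps 3-4: process the records
                }
                consumer.commitSync();                             // step 5: finalize (commit)
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the application's processing
    }
}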
My question is: what happens if the time between heartbeats becomes longer than the configured session.timeout.ms? I understand the part that if the session times out, the broker initiates a rebalance, the consumer whose processing took longer than the session.timeout.ms value is marked as dead, and a different consumer is assigned/subscribed to that partition.
Okay, but what then...?
Is that long-processing consumer removed/unsubscribed from the topic, leaving my application with 9 working consumers? What if all the consumers exceed the timeout and are all considered dead; am I left with a running application which does nothing because there are no consumers?
If the long-processing consumer finishes processing after the rebalance has already taken place, does the broker initiate a rebalance again and assign the consumer a partition anew? As I understand it, the consumer continues running #1-5 in a loop, and sending a heartbeat to the broker also starts the process of re-adding the consumer to the group from which it was removed after being marked dead, correct?
Does the application throw some sort of exception indicating that session.timeout.ms was exceeded, abruptly stopping the processing?
Also, what about the max.poll.interval.ms property: what if we exceed even that period and consumer X finishes processing after the max.poll.interval.ms value? The consumer has already exceeded the session.timeout.ms value, it was excluded from the consumer group and marked dead, so what difference does this setting make when configuring the Kafka consumer?
We have a process which extracts data for processing, and this extraction consists of 50+ SQL queries (the majority being SELECTs, a few UPDATEs). They usually run fast, but of course it all depends on the DB load and possible locks, etc., and there is a possibility that the processing takes longer than the session timeout. I do not want to keep increasing the session timeout until I "hit the spot". The process is idempotent; if it's repeated X times within X minutes, we do not care.
Please find the answers.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout, then you will be left with zero consumer instances; effectively, the consumer application is dead unless you restart it.
#2. This depends on how you write your consumer code with respect to poll() and consumer record iteration. If you have a proper while(true) loop with a try/catch inside, your consumer will be able to re-join the consumer group after processing that long-running record.
#3. You will end up with a commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
And again, it depends on your code whether the consumer automatically rejoins the consumer group.
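To illustrate #2 and #3, here is a rough sketch of such a loop; runSqlExtraction is a made-up placeholder for the long-running SQL work from the question, and the processing is assumed to be idempotent, as stated there.

// Sketch only; imports come from org.apache.kafka.clients.consumer and java.time.
static void pollLoop(KafkaConsumer<String, String> consumer) {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        try {
            for (ConsumerRecord<String, String> record : records) {
                runSqlExtraction(record);  // placeholder for the 50+ SQL queries
            }
            consumer.commitSync();
        } catch (CommitFailedException e) {
            // The group has already rebalanced; this batch will be redelivered to
            // whichever consumer now owns the partition (acceptable since processing is idempotent).
            // The next call to poll() makes this instance rejoin the group.
            System.err.println("Commit failed after rebalance: " + e.getMessage());
        }
    }
}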
#4. The answer lies here:
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still considered alive defaults to 3 seconds. If more than session.timeout.ms passes without the consumer sending a heartbeat to the group coordinator, it is considered dead and the group coordinator will trigger a rebalance of the consumer group to allocate partitions from the dead consumer to the other consumers in the group. This property is closely related to heartbeat.interval.ms. heartbeat.interval.ms controls how frequently the KafkaConsumer poll() method will send a heartbeat to the group coordinator, whereas session.timeout.ms controls how long a consumer can go without sending a heartbeat. Therefore, those two properties are typically modified together: heartbeat.interval.ms must be lower than session.timeout.ms, and is usually set to one-third of the timeout value. So if session.timeout.ms is 3 seconds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms lower than the default will allow consumer groups to detect and recover from failure sooner, but may also cause unwanted rebalances as a result of consumers taking longer to complete the poll loop or garbage collection. Setting session.timeout.ms higher will reduce the chance of accidental rebalance, but also means it will take longer to detect a real failure.
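To make the ratio concrete, the two settings from the excerpt would typically be configured together along these lines (the values are illustrative, not taken from the excerpt):

// Illustrative fragment: heartbeat interval at roughly one third of the session timeout.
props.put("session.timeout.ms", "10000");    // consumer considered dead after 10 s without heartbeats
props.put("heartbeat.interval.ms", "3000");  // heartbeat roughly every 3 s, about one third of the timeout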

Kafka is sending same event twice to different instances of micro service

We have 2 Kafka clusters, each with 6 nodes, one active and one standby, and a topic with 12 partitions; 12 instances of the app are running. Our app is a consumer, and all consumers use the same consumer group ID to receive events from Kafka. Event processing from Kafka is sequential: 'event comes in -> process the event -> do manual ack'. This event processing takes approximately 5 seconds to complete and do the manual acknowledgement. Although there are multiple instances, only one event at a time is processed. But recently we found an issue in production: consumer rebalancing is happening every 2 seconds, and because of this the message offset commit (manual ack) is failing, the same event is being sent twice, and this results in duplicate record insertion in the database.
Kafka consumer config values are:
max.poll.interval.ms = 300000 // 5 mins
max.poll.records = 500
heartbeat.interval.ms = 3000
session.timeout.ms = 10000
Errors seen:
1. Commit offset failed since the consumer is not part of an active group.
2. Time between subsequent calls to poll() was longer than the configured max poll interval milliseconds, which typically implies that the poll loop is spending too much time on message processing.
But message processing is taking 5 seconds, not more than the configured max poll interval, which is 5 minutes. As it is sequential processing, only one consumer can poll and get an event at a time, and the other instances have to wait for their turn to poll; is that causing the above 2 errors and the rebalancing? Appreciate the help.
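For reference, the flow described above corresponds roughly to the following plain Java client fragment; commitSync() stands in for the "manual ack" (the real application presumably does this through a framework), and processEvent() is a placeholder for the roughly 5 seconds of work.

// Fragment only; imports come from org.apache.kafka.clients.consumer,
// org.apache.kafka.common.TopicPartition, java.time.Duration and java.util.Collections.
props.put("max.poll.interval.ms", "300000");  // 5 minutes
props.put("max.poll.records", "500");
props.put("heartbeat.interval.ms", "3000");
props.put("session.timeout.ms", "10000");
props.put("enable.auto.commit", "false");     // offsets are committed manually below

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> record : records) {
        processEvent(record);  // about 5 seconds of processing (placeholder)
        // "manual ack": commit this record's offset explicitly
        consumer.commitSync(Collections.singletonMap(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1)));
    }
}

Note that with max.poll.records = 500 and roughly 5 seconds per record, a single full batch could take around 2500 seconds to process, far beyond the 5-minute max.poll.interval.ms, which by itself would explain the second error and the repeated rebalances.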

kafka consumer is not picking up the message from topic even though previous message manually committed

I have two messages in a Kafka topic (let's say offsets 1 and 2) and both messages are placed in the same partition (let's say p1).
My consumer app is like this:
My consumer picks up message 1 (with offset 1 from partition 1), sends a manual commit signal to Kafka, and then waits for 5 seconds.
My expectation is that since the commit signal went to Kafka, while my thread 1 is waiting for 5 seconds, another consumer thread should pick up message 2 from partition 1 and process it in a separate thread.
However, it is not working like this; it is processing one after the other. Only after thread 1 finishes its 5 seconds does it pick up the second message from the topic.
NOTE: I have made sure that the number of parallel consumers is set to more than one (in my case 5, and the max consumer pool size is 10).
Am I doing anything incorrect? Has anyone faced a similar issue? If so, what is the solution?
thanks,
Bala
Each partition can be consumed by only one thread at a time, and the other threads will continue to wait (there are other factors) unless a rebalance is triggered, which then assigns that partition to a different thread.
A rebalance will be triggered:
either manually, or
when a new thread is added to the same consumer group, or
when one of the threads stops calling the poll method for max.poll.interval.ms milliseconds (5 minutes by default).
Here is a blog with a lot more details about it.
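A rough sketch of the model described above, with one KafkaConsumer per thread in the same group (the names and pool size here are made up):

// Fragment: one consumer per thread, all in the same consumer group.
ExecutorService pool = Executors.newFixedThreadPool(5);
for (int i = 0; i < 5; i++) {
    pool.submit(() -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "shared-group");  // same group, so partitions are split across threads
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // A partition stays with this thread until a rebalance moves it,
                // e.g. when poll() is not called within max.poll.interval.ms.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.partition() + "@" + r.offset()));
            }
        }
    });
}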

max poll interval and session timeout ms | kafka consumer alive

Scenario:
Committing offsets manually after processing the messages.
session.timeout.ms: 10 seconds
max.poll.interval.ms: 5 minutes
Processing of messages consumed in a "poll()" is taking 6 minutes
Timeline:
A (0 seconds): the app starts poll(), has consumed the messages, and has started processing (this will take 6 minutes)
B (3 seconds): a heartbeat is sent
C (6 seconds): another heartbeat is sent
D (5 minutes): another heartbeat is sent (5 * 60 % 3 = 0) BUT "max.poll.interval.ms" (5 minutes) is reached
At point "D" will consumer:
send "LeaveGroup request" to consider this consumer "dead" and re-balance?
continue sending heartbeats every 3 seconds ?
If point "1" is the case, then
a. how will this consumer commit offsets after completing the processing of 6 minutes considering that its partition(s) are changed due to re-balancing at point "D" ?
b. should the "max.poll.interval.ms" be set in prior according to the expected processing time ?
If point "2" is the case, then will we never know if the processing is actually blocked ?
Thankyou.
Starting with Kafka version 0.10.1.0, consumer heartbeats are sent in a background thread, such that the client processing time can be longer than the session timeout without causing the consumer to be considered dead.
However, max.poll.interval.ms still sets the maximum allowable time between calls to the poll method.
In your case, with a processing time of 6 minutes, it means that at point "D" your consumer will be considered dead.
Your concerns are right, as the consumer will then not be able to commit the messages after 6 minutes. Your consumer will get a CommitFailedException (as described in another answer on CommitFailedException).
To conclude, yes, you need to increase the max.poll.interval.ms time if you already know that your processing time will exceed the default time of 5 minutes.
Another option would be to limit the fetched records during a poll by decreasing the configuration max.poll.records which defaults to 500 and is described as: "The maximum number of records returned in a single call to poll()".
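In configuration terms, the two options just described would look roughly like this (the 7-minute figure is only an example chosen to cover the 6-minute processing time):

// Option 1: allow more time between poll() calls than the processing needs.
props.put("max.poll.interval.ms", "420000");  // e.g. 7 minutes (illustrative value)

// Option 2 (can be combined with option 1): fetch fewer records per poll()
// so that each batch is processed well within the interval.
props.put("max.poll.records", "50");          // default is 500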

kafka increase session timeout so that consumers will not get the same message

I have a Java app that consumes and produces messages from Kafka.
I have a thread pool of 5 threads; each thread creates a consumer, and since I have 5 partitions the work is divided between them.
I have a problem where 2 threads are getting the same message, because the heartbeat doesn't reach the broker, since processing each message takes about an hour.
I tried to increase session.timeout.ms in the broker and also changed group.min.session.timeout.ms so that the max value would allow it.
In this case the consumer cannot be started.
Any ideas?
The keep-alive is not sent, so that's not true, as far as I can see.
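For what it's worth, and assuming a client of version 0.10.1.0 or later (see the background-heartbeat answer earlier in this thread), the setting that has to cover the hour-long processing is the consumer-side max.poll.interval.ms rather than the session timeout; a hedged sketch with illustrative values:

// Illustrative fragment, assuming a >= 0.10.1.0 client where heartbeats run on a
// background thread, so session.timeout.ms can stay small.
props.put("max.poll.records", "1");            // hand over one record per poll()
props.put("max.poll.interval.ms", "7200000");  // 2 hours, comfortably above ~1 hour of work
props.put("session.timeout.ms", "10000");      // background heartbeats keep the session alive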