I am using spring-kafka and I have set enable.auto.commit to false, so the acknowledgment of the records is taken care of by Spring in BATCH mode.
Assuming I have set max.poll.records to 200 and the max poll interval to 5 minutes, and knowing that all 200 records will not be processed within 5 minutes, I made an interesting observation that I am not able to understand.
The first rebalance happens after 5 minutes, which is expected, but after that the next rebalance does not happen for a very long time. Why so? It should have happened exactly 5 minutes later; why do the later rebalances take so long?
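For reference, a minimal sketch of my setup, assuming a plain spring-kafka container factory (broker address, group id, and deserializers are illustrative placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // Spring handles the ack
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);     // up to 200 records per poll()

ConcurrentKafkaListenerContainerFactory<String, String> factory =
        new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
// BATCH ack mode: offsets are committed after the listener processes each polled batch
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);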
We have two Kafka clusters, each with 6 nodes, in active/standby. There is a topic with 12 partitions, and 12 instances of the app are running. Our app is a consumer, and all consumers use the same consumer group ID to receive events from Kafka. Event processing is sequential: an event comes in, we process the event, then do a manual ack. This processing takes approximately 5 seconds to complete before the manual acknowledgment. Although there are multiple instances, only one event at a time is processed.
Recently we found an issue in production: consumer rebalancing is happening every 2 seconds. Because of this, the offset commit (manual ack) fails, the same event is sent twice, and duplicate records are inserted into the database.
Kafka consumer config values are:
max.poll.interval.ms = 300000 // 5 minutes
max.poll.records = 500
heartbeat.interval.ms=3000
session.timeout.ms=10000
Errors seen:
1. Commit offset failed since the consumer is not part of an active group.
2. Time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time on message processing.
But message processing takes 5 seconds, not more than the configured max poll interval of 5 minutes. Since processing is sequential, only one consumer can poll and get an event at a time, and the other instances have to wait their turn to poll. Is that what is causing the two errors above and the rebalancing? I appreciate the help.
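For context, here is a minimal sketch of our "event comes in -> process the event -> do manual ack" loop with the config above (topic name, bootstrap address, and process() are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "shared-group-id");      // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");      // manual ack
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");   // 5 minutes
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("events"));           // placeholder topic
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);       // ~5 seconds per event in this scenario (placeholder)
        consumer.commitSync(); // manual ack after each event
    }
}

Note that with max.poll.records = 500 and ~5 seconds per event, a single poll() can return up to 500 * 5 s = 2500 s (over 40 minutes) of work, far more than the 5-minute max.poll.interval.ms.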
Scenario:
Committing offsets manually after processing the messages.
session.timeout.ms: 10 seconds
max.poll.interval.ms: 5 minutes
Processing of the messages consumed in one poll() takes 6 minutes
Timeline:
A (0 seconds): the app calls poll(), consumes the messages, and starts processing (which will take 6 minutes)
B (3 seconds): a heartbeat is sent
C (6 seconds): another heartbeat is sent
D (5 minutes): another heartbeat is sent (5 * 60 % 3 = 0), BUT max.poll.interval.ms (5 minutes) is reached
At point "D" will consumer:
send "LeaveGroup request" to consider this consumer "dead" and re-balance?
continue sending heartbeats every 3 seconds ?
If point "1" is the case, then
a. how will this consumer commit offsets after completing the processing of 6 minutes considering that its partition(s) are changed due to re-balancing at point "D" ?
b. should the "max.poll.interval.ms" be set in prior according to the expected processing time ?
If point "2" is the case, then will we never know if the processing is actually blocked ?
Thankyou.
Starting with Kafka version 0.10.1.0, consumer heartbeats are sent in a background thread, so the client's processing time can be longer than the session timeout without the consumer being considered dead.
However, the max.poll.interval.ms still sets the maximum allowable time for a consumer to call the poll method.
In your case, with a processing time of 6 minutes, it means that at point "D" your consumer will be considered dead.
Your concerns are right: the consumer will then not be able to commit the offsets after the 6 minutes of processing. It will get a CommitFailedException (as described in another answer on CommitFailedException).
To conclude: yes, you need to increase max.poll.interval.ms if you already know that your processing time will exceed the default of 5 minutes.
Another option would be to limit the number of records fetched during a poll by decreasing max.poll.records, which defaults to 500 and is described as: "The maximum number of records returned in a single call to poll()".
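As an illustration of the two options (the values are examples, not recommendations):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
// Option 1: allow more time between poll() calls (default is 300000, i.e. 5 minutes)
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // 10 minutes, example value
// Option 2: fetch fewer records per poll() so each batch finishes sooner (default 500)
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // example value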
We have several applications consuming from Kafka that regularly encounter a DisconnectException.
What happens is always the following:
The application is subscribed to, say, partitions 5 and 6, and messages are processed from both partitions.
From time T, no message is consumed from partition 5; only messages from partition 6 are consumed.
At T + around 5 minutes, the Kafka consumer spits out many log lines:
Error sending fetch request (sessionId=552335215, epoch=INITIAL) to node 0: org.apache.kafka.common.errors.DisconnectException.
After that, consumption resumes from partitions 5 and 6 and catches up on the accumulated lag.
The same issue occurs if the application consumes a single partition: in that case, no message is consumed for 5 minutes.
My understanding, per https://issues.apache.org/jira/browse/KAFKA-6520, is that in case of a connection issue the Kafka consumer retries (with backoff, capped at 1 second by default per the reconnect.backoff.max.ms config), hiding the issue from the end user. The calls to poll() return 0 messages, so the polling loop goes on and on.
However, some questions remain:
If the fetch fails due to a connection issue, the broker does not receive these requests, and after max.poll.interval.ms (50 seconds in our case) it should expel the consumer and trigger a rebalance. Why is this not happening?
Since the Kafka consumer retries every second, why would it systematically take 5 minutes to reconnect, unless there is some infrastructure or network issue going on?
Otherwise, is there any client-side configuration parameter that could explain the 5-minute delay? Could this delay somehow be related to metadata.max.age.ms (5 minutes by default)?
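To test that last hypothesis, we could simply lower the suspected settings and see whether the 5-minute stall shrinks accordingly; a sketch, with illustrative values:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
// cap on the reconnect backoff (default 1000 ms, per the KAFKA-6520 discussion)
props.put(ConsumerConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG, "1000");
// metadata refresh interval (default 300000 ms = 5 minutes); if the stall tracks
// this value after lowering it, the delay is likely metadata-related
props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "60000");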
In the Kafka documentation, I'm trying to understand this property: max.poll.interval.ms
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
This means each poll must happen before the poll timeout, which defaults to 5 minutes. So my question is: exactly how much time does the consumer thread take between two consecutive polls?
For example: Consumer Thread 1
First poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Second poll--> with 100 records
--> process 100 records (took 1 minute)
--> consumer submitted offset
Does the consumer take time between the first and second poll? If yes, why, and how can we change that time (assuming the topic has a huge amount of data)?
It's not clear what you mean by "take time between"; if you are asking about the spring-kafka listener container, there is no wait or sleep between polls.
The consumer is polled immediately after the offsets are committed.
So, max.poll.interval.ms must be large enough for your listener to process max.poll.records (plus some extra, just in case).
But, no, there are no delays added between polls, just the time it takes the listener to handle the results of the poll.
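Conceptually, the container's main loop looks like this (a simplified sketch, not the actual spring-kafka source; listener() stands in for your @KafkaListener):

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

void runLoop(KafkaConsumer<String, String> consumer) {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        listener(records);     // your processing time (e.g. ~1 minute for 100 records)
        consumer.commitSync(); // offsets committed...
        // ...and the loop immediately calls poll() again; no sleep or wait is added
    }
}

void listener(ConsumerRecords<String, String> records) {
    // placeholder for the work your listener does with each batch
}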
We're using the Kafka consumer client 0.10.2.0 with the following configuration:
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");       // auto-commit enabled
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");  // commit every 1 s
props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 64 * 1024);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 16 * 1024);
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");      // 30 s
props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "40000");      // 40 s
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");   // 10 s
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");          // 100 records per poll
So as you can see we're using autocommit.
The consumer API version that we're using has a dedicated thread for doing auto-commit.
So every second we have an auto-commit, which means that we have a heartbeat every second.
Our application's processing time may occasionally take more than 40 seconds (the request timeout interval).
What I wanted to ask is:
1 - If the processing takes, for example, a minute, will there be a rebalance even though there is an auto-commit heartbeat every second?
2 - What is weirder: in the case of a long execution time, it seems we're getting the same message more than once. Is that normal? If the consumer has committed an offset, why does the rebalance cause the same offset to be consumed again?
Thanks,
Orel
You can use KafkaConsumer.pause() / KafkaConsumer.resume() to prevent consumer rebalancing during long processing pauses; see the JavaDocs and this related question.
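A rough sketch of that pattern (longTaskIsDone() is a placeholder for your own completion check; poll(long) is the overload available in the 0.10.x API):

import java.util.Collection;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

void pollWhileProcessing(KafkaConsumer<String, String> consumer) {
    Collection<TopicPartition> assigned = consumer.assignment();
    consumer.pause(assigned);      // poll() now returns no records but keeps group membership
    while (!longTaskIsDone()) {
        consumer.poll(1000);       // returns empty, yet still counts as a call to poll()
    }
    consumer.resume(assigned);     // start receiving records again
}

boolean longTaskIsDone() { return true; } // placeholder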
Re. 2: Are you sure that these offsets are committed?
Just to clarify: the auto-commit check is called on every poll(), and it checks whether the time elapsed is greater than the configured interval; only then does it commit.
E.g., if the commit interval is 5 seconds and polls happen every 7 seconds, the commit will happen only every 7 seconds.
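In other words, the check behaves roughly like this (a conceptual model, not the real client code):

long commitIntervalMs = 5_000;   // auto.commit.interval.ms
long pollPeriodMs = 7_000;       // how often the application calls poll()
long lastCommitMs = 0;
for (long nowMs = pollPeriodMs; nowMs <= 28_000; nowMs += pollPeriodMs) {
    if (nowMs - lastCommitMs >= commitIntervalMs) {      // checked on every poll()
        System.out.println("commit at t=" + nowMs + " ms"); // 7000, 14000, 21000, 28000
        lastCommitMs = nowMs;
    }
}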
For your questions:
1. Auto-commit does not count as a heartbeat. If there is a long processing time, the commit obviously will not happen, and that leads to a session timeout, which in turn triggers a rebalance.
2. This shouldn't happen unless you are seeking/resetting the offset to a previously committed offset, or a consumer rebalance occurred.
From Kafka v0.10.1.0 on, you don't need to trigger auto-commit manually to heartbeat. The Kafka consumer itself starts a separate background thread for the heartbeat mechanism. To learn more, read KIP-62.
In your case, you can set max.poll.interval.ms to the maximum time your processor takes to handle max.poll.records records.
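For instance, assuming a worst case of 2 seconds per record (an assumed figure, not from your question), a sizing sketch:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
int maxPollRecords = 500;          // max.poll.records
long worstCaseMsPerRecord = 2_000; // assumed worst-case processing time per record
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPollRecords);
// worst-case batch time plus ~20% headroom
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG,
        (int) (maxPollRecords * worstCaseMsPerRecord * 1.2)); // 1,200,000 ms = 20 minutes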