Impact of reducing max.poll.records in Kafka Consumer configuration - apache-kafka

I am writing an consumer application to pick records from kafka stream and process it using spring-kafka.
My processing steps are as below :
Getting records from stream --> dump it into a table --> Fetch records and call API --> API will update records into a table --> calling Async Commit()
It seems in some scenarios, the API processing taking more time because of more records are being fetched and we are getting below errors?
Member consumer-prov-em-1-399ede46-9e12-4388-b5b8-f198a4e6a5bc
sending LeaveGroup request to coordinator apslt2555.uhc.com:9095 (id:
2147483577 rack: null) due to consumer poll timeout has expired. This
means the time between subsequent calls to poll() was longer than the
configured max.poll.interval.ms, which typically implies that the poll
loop is spending too much time processing messages. You can address
this either by increasing max.poll.interval.ms or by reducing the
maximum size of batches returned in poll() with max.poll.records.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing max.poll.interval.ms or by reducing the maximum size of
batches returned in poll() with max.poll.records.
I know this can be handled by reducing max.poll.records or by increasing max.poll.interval.ms. What I am trying to understand if I set max.poll.records to 10 then what would be poll() behavior? Is it going to take 10 records from stream wait for these records to be committed and then will go for next 10 records ? When the next poll occurs ?Is it also going to impact performance as we are reducing max.poll.records from default 500 to 10.
Do I also have to increase max.poll.interval.ms. Probably make it 10 minutes. Is there any down impact that I should be aware of while changing these values ? Except these parameters, is there any other way to handle these errors ?

max.poll.records allows batch processing consumption model in which records are collected in memory before flushing them to another system. The idea is to get all the records by polling from kafka together and then process that in memory in the poll loop.
If you decrease the number then the consumer will be polling more frequently from kafka. This means it needs to make network call more often. This might reduce performance of kafka stream processing.
max.poll.interval.ms controls the maximum time between poll invocations before the consumer will proactively leave the group. If this number increases then it will take longer for kafka to detect the consumer failures. On the other hand, if this value is too low kafka might falsely detect many alive consumers as failed thus rebalancing more often.

Related

Kafka Consumer death handling

I have question regarding handling of consumers death due to exceeding the timeout values.
my example configuration:
session.timeout.ms = 10000 (10 seconds)
heartbeat.interval.ms = 2000 (2 seconds)
max.poll.interval.ms = 300000 (5 minutes)
I have 1 topic, 10 partitions, 1 consumer group, 10 consumers (1 partition = 1 consumer).
From my understanding consuming messages in Kafka, very simplified, works as follows:
consumer polls 100 records from topic
a heartbeat signal is sent to broker
processing records in progress
processing records completes
finalize processing (commit, do nothing etc.)
repeat #1-5 in a loop
My question is, what happens if time between heartbeats takes longer than previously configured session.timeout.ms. I understand the part, that if session times out, the broker initializes a re-balance, the consumer which processing took longer than the session.timeout.ms value is marked as dead and a different consumer is assigned/subscribed to that partition.
Okey, but what then...?
Is that long-processing consumer removed/unsubscribed from the topic and my application is left with 9 working consumers? What if all the consumers exceed timeout and are all considered dead, am I left with a running application which does nothing because there are no consumers?
Long-processing consumer finishes processing after re-balancing already took place, does broker initializes re-balance again and consumer is assigned a partition anew? As I understand it continues running #1-5 in a loop and sending a heartbeat to broker initializes also process of adding consumer to the consumers group, from which it was removed after being given dead status, correct?
Application throws some sort of exception indicating that session.timeout.ms was exceeded and the processing is abruptly stopped?
Also what about max.poll.interval.ms property, what if we even exceed that period and consumer X finishes processing after max.poll.interval.ms value? Consumer already exceeded the session.timeout.ms value, it was excluded from consumer group, status set to dead, what difference does it gives us in configuring Kafka consumer?
We have a process which extracts data for processing and this extraction consists of 50+ SQL queries (majority being SELECT's, few UPDATES), they usually go fast but of course all depends on the db load and possible locks etc. and there is a possibility that the processing takes longer than the session's timeout. I do not want to infinitely increase sessions timeout until "I hit the spot". The process is idempotent, if it's repeated X times withing X minutes we do not care.
Please find the answers.
#1. Yes. If all of your consumer instances are kicked out of the consumer group due to session.timeout, then you will be left with Zero consumer instance, eventually, consumer application is dead unless you restart.
#2. This depends, how you write your consumer code with respect to poll() and consumer record iterations. If you have a proper while(true) and try and catch inside, you consumer will be able to re-join the consumer group after processing that long running record.
#3. You will end up with the commit failed exception:
failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
And again it depends on your code, to auto join into the consumer group.
#4. Answer lies here
session.timeout.ms
The amount of time a consumer can be out of contact with the brokers while still
considered alive defaults to 3 seconds. If more than session.timeout.ms passes
without the consumer sending a heartbeat to the group coordinator, it is considered
dead and the group coordinator will trigger a rebalance of the consumer group to
allocate partitions from the dead consumer to the other consumers in the group. This
property is closely related to heartbeat.interval.ms. heartbeat.interval.ms con‐
trols how frequently the KafkaConsumer poll() method will send a heartbeat to the
group coordinator, whereas session.timeout.ms controls how long a consumer can
go without sending a heartbeat. Therefore, those two properties are typically modi‐
fied together—heatbeat.interval.ms must be lower than session.timeout.ms, and
is usually set to one-third of the timeout value. So if session.timeout.ms is 3 sec‐
onds, heartbeat.interval.ms should be 1 second. Setting session.timeout.ms
lower than the default will allow consumer groups to detect and recover from failure
sooner, but may also cause unwanted rebalances as a result of consumers taking
longer to complete the poll loop or garbage collection. Setting session.timeout.ms
higher will reduce the chance of accidental rebalance, but also means it will take
longer to detect a real failure.

does keeping kafka consumer poll to MAX_VALUE delay the group rebalance?

consumer.poll(Long.MAX_VALUE);
This is to allow the processing (high latency) to complete. Are there any side effects for keeping this high value?
Does it block other messages in the partition being read? Does it adversely affect rebalancing?
There are two configurations that you should understand for this.
timeout that you specify to the poll method.
MAX_POLL_INTERVAL_MS_CONFIG while creating the Kafka consumer.
1. timeout:
From the Kafka documentation for the poll method:
This method returns immediately if there are records available.
Otherwise, it will await the passed timeout. If the timeout
expires, an empty record set will be returned.
#param timeout: The maximum time to block (must not be greater than
{#link Long#MAX_VALUE} milliseconds)
So what if I keep this value higher?
Each iteration of the poll method would wait for this higher value that you set if the records are not available to be consumed.
Does it adversely affect rebalancing?
No, this config does not have a relation to the rebalancing that gets triggered if the consumer in the group is found to be dead by the Kafka consumer group management.
2. MAX_POLL_INTERVAL_MS_CONFIG
From the Kafka documentation for this config
The maximum delay between invocations of poll() when using consumer
group management. This places an upper bound on the amount of time
that the consumer can be idle before fetching more records. If poll()
is not called before expiration of this timeout, then the consumer is
considered failed and the group will rebalance in order to reassign
the partitions to another member.
So what if I keep this value higher?
Sometimes it is known that the message would not be available for consumption during a certain time period. And this is the nature of the application, the producer of the topic might not produce messages in a certain time period.
In that case, we want to stop polling from Kafka. We Know that poll is stopped intentionally and the consumer is not failed. Setting this to a high value like 8 hours will ensure that re-balance simply does not happen.
Does it adversely affect rebalancing?
Yes, this config decides when the rebalancing would be triggered. If you set this to high value, rebalancing would be postponed until that amount of time.

Should we use max.poll.records or max.poll.interval.ms to handle records that take longer to process in kafka consumer?

I'm trying to understand what is better option to handle records that take longer to process in kafka consumer? I ran few tests to understand this and observed that we can control this with by modifying either max.poll.records or max.poll.interval.ms.
Now my question is, what's the better option to choose? Please suggest.
max.poll.records simply defines the maximum number of records returned in a single call to poll().
Now max.poll.interval.ms defines the delay between the calls to poll().
max.poll.interval.ms: The maximum delay between invocations of
poll() when using consumer group management. This places an upper
bound on the amount of time that the consumer can be idle before
fetching more records. If poll() is not called before expiration of
this timeout, then the consumer is considered failed and the group
will rebalance in order to reassign the partitions to another member.
For consumers using a non-null group.instance.id which reach this
timeout, partitions will not be immediately reassigned. Instead, the
consumer will stop sending heartbeats and partitions will be
reassigned after expiration of session.timeout.ms. This mirrors the
behavior of a static consumer which has shutdown.
I believe you can tune both in order to get to the expected behaviour. For example, you could compute the average processing time for the messages. If the average processing time is say 1 second and you have max.poll.records=100 then you should allow approximately 100+ seconds for the poll interval.
If you have slow processing and so want to avoid rebalances then tuning either would achieve that. However extending max.poll.interval.ms to allow for longer gaps between poll does have a bit of a side effect.
Each consumer only uses 2 threads - polling thread and heartbeat thread.
The latter lets the group know that your application is still alive so can trigger a rebalance before max.poll.interval.ms expires.
The polling thread does everything else in terms of group communication so during the poll method you find out if a rebalance has been triggered elsewhere, you find out if a partition leader has died and hence metadata refresh is required. The implication is that if you allow longer gaps between polls then the group as a whole is slower to respond to change (for example no consumers start receiving messages after a rebalance until they have all received their new partitions - if a rebalance occurs just after one consumer has started processing a batch for 10 minutes then all consumers will be hanging around for at least that long).
Hence for a more responsive group in situations where processing of messages is expected to be slow you should choose to reduce the records fetched in each batch.

Do Kafka consumers spin on poll() or are they woken up by a broadcast/signal from the broker?

If I poll() from a consumer in a while True: statement, I see that poll() is blocking. If the consumer is up to date with messages from the topic (offset = OFFSET_END) how is the consumer conducting it's blocking poll()?
Does the consumer default adhere to a pub/sub mentality in which it sleeps and waits for a publish and a broadcast/signal from the broker?
Or is the consumer constantly spinning itself checking the topic?
I'm using the confluent python client, if that matters.
Thanks!
kafka consumers are basically long poll loops, driven (asynchronously) by the user thread calling poll().
the whole protocol is request-response, and entirely client driven. there is no form of broker-initiated "push".
fetch.max.wait.ms controls how long any single broker will wait before responding (if no data), while blocking of the user thread is controlled by argument to poll()
Yes, you are right its while a true condition that waits to consume the message till waiting timeout time.
If it receives a message it will return immediately otherwise it will await to passed timeout and return an empty record.
Kafka Broker use the below parameter to control message to send to Consumer
fetch.min.bytes: The broker will wait for this amount of data to fill BEFORE it sends the response to the consumer client.
fetch.wait.max.ms: The broker will wait for this amount of time BEFORE sending a response to the consumer client unless it has enough data to fill the response (fetch.message.max.bytes)
There is a possibility to take a long time to call the next poll() due to the processing of consumed messages. max.poll.interval.ms prevent not to process take so much time and call the next poll within max.poll.interval.ms otherwise consumer leaves the group and trigger rebalance.
You can get more detail about this here
max.poll.interval.ms: By increasing the interval between expected polls, you can give the consumer more time to handle a batch of
records returned from poll(long). The drawback is that increasing this
value may delay a group rebalance since the consumer will only join
the rebalance inside the call to poll. You can use this setting to
bound the time to finish a rebalance, but you risk slower progress if
the consumer cannot actually call poll often enough.
max.poll.records: Use this setting to limit the total records returned from a single call to a poll. This can make it easier to
predict the maximum that must be handled within each poll interval. By
tuning this value, you may be able to reduce the poll interval, which
will reduce the impact of group rebalancing.

Kafka Listner Exception : Commit cannot be completed

With Manual kafka acknowledgement, while consuming/processing huge amount of messages, we observed the below error.
If we set the max.poll.records property to 1 during consumer creation, will it have performance issues while handling huge load of messages????
2019-10-22 23:01:55.208 UTC [org.springframework.kafka.KafkaListenerEndpointContainer#0-2-C-1] ERROR c.d.s.s.b.a.kafka.KafkaRxTx - Kafka Listner Exception : Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. -> {}
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:691)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1416)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1377)
at brave.kafka.clients.TracingConsumer.commitSync(TracingConsumer.java:151)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.ackImmediate(KafkaMessageListenerContainer.java:922)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processAck(KafkaMessageListenerContainer.java:904)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.access$2000(KafkaMessageListenerContainer.java:384)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ConsumerAcknowledgment.acknowledge(KafkaMessageListenerContainer.java:1593)
Yes it will have a significant impact on performance; how much will depend on your situation; you should run tests.
Also, consider increasing max.poll.interval.ms.