Prevent kafka consumer from timing out for long process - apache-kafka

I need to prevent the kafka consumer from timing out while the application waits for a particular process to complete. My approach is to pause the partitions and then resume them once the process is completed.
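import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.TopicPartition;

// Pause every partition currently assigned to this consumer.
List<TopicPartition> partitionList = new ArrayList<>(kafkaConsumer.assignment());
kafkaConsumer.pause(partitionList);
// processCompleted is a placeholder flag, set elsewhere once the process has finished.
while (!processCompleted) {
    Thread.sleep(10000);
    // Returns no records while paused, but still counts as a call to poll().
    kafkaConsumer.poll(0);
}
kafkaConsumer.resume(partitionList);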
Questions
Does pause send heartbeats to Kafka automatically, or do I still need to poll at regular intervals to send the heartbeat?
Is mine the best approach, or is there a better way of doing it?

Since Kafka 0.10.1, consumers do have a background thread for sending heartbeats: https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread
Thus, you don't need to call poll() to send heartbeats to the brokers. However, there is a second timeout, max.poll.interval.ms: you must call poll() within this interval to avoid it. The default value is 5 minutes. If your wait is longer than that, you can simply increase this timeout, and in that case you don't need to pause any partitions at all.
If you are using an older version, your approach of pausing and calling poll() regularly is the only way to keep sending heartbeats and avoid the timeout.
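As a rough illustration of the "just increase the timeout" option, here is a minimal sketch that builds consumer properties with max.poll.interval.ms sized to the longest expected wait. The class name, the broker address, the group id, and the 2x headroom factor are all assumptions made for the example; only the property name and its 5 minute default come from the answer above.

```java
import java.util.Properties;

final class LongWaitConsumerConfig {
    private LongWaitConsumerConfig() {}

    // Default max.poll.interval.ms mentioned in the answer: 5 minutes.
    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L;

    // Builds properties whose max.poll.interval.ms covers the longest expected wait.
    static Properties forLongestWait(long longestExpectedWaitMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.setProperty("group.id", "long-process-group");      // assumption: example group id
        // Assumption: allow twice the expected wait as headroom, never go below the default,
        // and stay within int range because the setting is an int.
        long withHeadroom = Math.max(DEFAULT_MAX_POLL_INTERVAL_MS, longestExpectedWaitMs * 2);
        long clamped = Math.min(withHeadroom, Integer.MAX_VALUE);
        props.setProperty("max.poll.interval.ms", Long.toString(clamped));
        // A real consumer also needs key.deserializer and value.deserializer.
        return props;
    }
}
```

For a wait of up to 10 minutes this yields max.poll.interval.ms=1200000; for short waits it leaves the 5 minute default in place.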

Related

Can Kafka consumer be paused for definite interval and resume automatically after the interval

Can a Kafka consumer be paused for a definite time interval, say 2 minutes, and be resumed automatically once that interval is over? Or can we provide a time window to the pause command?
Yes, we use this for very long processing times and in-order processing of events. The Kafka client has an API for it: pause and resume. Remember that, so your consumer won't die, you must keep calling poll(). While partitions are paused, poll() won't return any new records, but it still refreshes the "timers" on the poll request. When you are ready, call resume and then poll again.
I am not a fan of using Kafka this way, as I would like Kafka to be more of a real-time streaming machine and this felt to me like it goes against that logic, but it works.
Yes, KafkaConsumer has a pause method. As a parameter it takes a collection of partitions. This method does not affect partition subscription and does not cause a group rebalance. Later, you can use resume method to resume a subscription.
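The answers describe pause and resume as separate calls, so the "resume automatically after 2 minutes" part has to be tracked by the application. A minimal sketch of that bookkeeping follows; TimedPause is a hypothetical helper, not part of the Kafka client, and the consumer calls appear only as comments so the block stays self-contained.

```java
final class TimedPause {
    private final long pauseMs;
    private long resumeAtMs = -1; // -1 means "not paused"

    TimedPause(long pauseMs) {
        this.pauseMs = pauseMs;
    }

    // Record the start of a pause window; call right after consumer.pause(...).
    void start(long nowMs) {
        resumeAtMs = nowMs + pauseMs;
    }

    boolean isPaused() {
        return resumeAtMs >= 0;
    }

    // Returns true exactly once, on the first check at or after the end of the window.
    boolean shouldResume(long nowMs) {
        if (isPaused() && nowMs >= resumeAtMs) {
            resumeAtMs = -1;
            return true;
        }
        return false;
    }
}

// Intended use inside the poll loop (not executed here):
//   consumer.pause(consumer.assignment());
//   pause.start(System.currentTimeMillis());
//   ...on every iteration, keep calling consumer.poll(...), then:
//   if (pause.shouldResume(System.currentTimeMillis())) {
//       consumer.resume(consumer.paused());
//   }
```

The loop keeps polling throughout, so the consumer stays in the group while no records are delivered.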

Should we use max.poll.records or max.poll.interval.ms to handle records that take longer to process in kafka consumer?

I'm trying to understand what the better option is for handling records that take longer to process in a Kafka consumer. I ran a few tests and observed that we can control this by modifying either max.poll.records or max.poll.interval.ms.
Now my question is, what's the better option to choose? Please suggest.
max.poll.records simply defines the maximum number of records returned in a single call to poll().
max.poll.interval.ms, in turn, defines the maximum allowed delay between calls to poll().
max.poll.interval.ms: The maximum delay between invocations of
poll() when using consumer group management. This places an upper
bound on the amount of time that the consumer can be idle before
fetching more records. If poll() is not called before expiration of
this timeout, then the consumer is considered failed and the group
will rebalance in order to reassign the partitions to another member.
For consumers using a non-null group.instance.id which reach this
timeout, partitions will not be immediately reassigned. Instead, the
consumer will stop sending heartbeats and partitions will be
reassigned after expiration of session.timeout.ms. This mirrors the
behavior of a static consumer which has shutdown.
I believe you can tune both in order to get to the expected behaviour. For example, you could compute the average processing time for the messages. If the average processing time is say 1 second and you have max.poll.records=100 then you should allow approximately 100+ seconds for the poll interval.
If you have slow processing and so want to avoid rebalances then tuning either would achieve that. However extending max.poll.interval.ms to allow for longer gaps between poll does have a bit of a side effect.
Each consumer only uses 2 threads - polling thread and heartbeat thread.
The latter lets the group know that your application is still alive, so if the application dies the group can trigger a rebalance before max.poll.interval.ms expires.
The polling thread does everything else in terms of group communication so during the poll method you find out if a rebalance has been triggered elsewhere, you find out if a partition leader has died and hence metadata refresh is required. The implication is that if you allow longer gaps between polls then the group as a whole is slower to respond to change (for example no consumers start receiving messages after a rebalance until they have all received their new partitions - if a rebalance occurs just after one consumer has started processing a batch for 10 minutes then all consumers will be hanging around for at least that long).
Hence for a more responsive group in situations where processing of messages is expected to be slow you should choose to reduce the records fetched in each batch.
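The arithmetic in the answers above can be worked in both directions: size the poll interval for a given batch size, or size the batch for a given poll interval. The sketch below is only illustrative; PollBudget and its parameter names are hypothetical, and the headroom percentage is an assumption rather than a recommended value.

```java
final class PollBudget {
    private PollBudget() {}

    // Interval needed to process a full batch, with headroom given as a percentage.
    static long requiredIntervalMs(long avgMsPerRecord, int maxPollRecords, int headroomPercent) {
        return avgMsPerRecord * maxPollRecords * (100 + headroomPercent) / 100;
    }

    // Largest batch that fits in the interval at the worst-case per-record time (at least 1).
    static int maxRecordsFor(long intervalMs, long worstCaseMsPerRecord) {
        return (int) Math.max(1L, intervalMs / worstCaseMsPerRecord);
    }
}
```

With 1 second per record and max.poll.records=100, requiredIntervalMs(1000, 100, 20) gives 120000 ms, which matches the "100+ seconds" estimate. Going the other way, keeping the 5 minute default with records that can take 10 seconds each gives maxRecordsFor(300000, 10000) = 30.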

Kafka Consumer - continue calling poll() while paused?

I read the docs on using the pause and resume methods for a kafka consumer, and they seem easy enough to implement. However, do I need another thread to continue calling the poll() method while paused to meet the heartbeat requirements and not trigger a rebalance?
My consumer runs SQL scripts after polling the topic and, depending on the messages returned, the scripts may take longer than the current session.timeout.ms interval (we have increased this value, but the run time of the scripts can vary quite a bit and, regardless of the interval, we will exceed it at times). I also want to avoid a rebalance, as safe ordering and data integrity are more important than throughput and error detection.
From version 0.10.1.0 heartbeat is sent via a separate thread so pausing your process thread wouldn't affect heartbeat thread.
You can check this for more information.
Yes, you need to continue calling poll() on the consumer, even if you pause all partitions, or it will be kicked out of any consumer group it's a member of and its assigned partitions will transfer to another consumer. As to which thread ends up calling poll(), that doesn't matter (so long as only a single thread interacts with the consumer at a time).
quoting from kip-62:
max.poll.interval.ms. This config sets the maximum delay between client calls to poll(). When the timeout expires, the consumer will stop sending heartbeats and send an explicit LeaveGroup request.
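One way to satisfy both constraints (keep calling poll(), and let only one thread touch the consumer) is to run the slow work on a worker thread while the consumer's own thread waits and polls. The sketch below shows that shape using only the JDK; KeepAliveLoop is a hypothetical helper, and the keepAlive callback stands in for a poll() call on a paused consumer, which is an assumption of the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class KeepAliveLoop {
    private KeepAliveLoop() {}

    // Runs work on a worker thread; the calling thread invokes keepAlive every
    // pollEveryMs until the work finishes, then returns the work's result.
    static <T> T runWhilePolling(Callable<T> work, Runnable keepAlive, long pollEveryMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = worker.submit(work);
            while (true) {
                try {
                    // Wait up to pollEveryMs for the work to complete.
                    return result.get(pollEveryMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException stillRunning) {
                    // Work not done yet: this is where the paused consumer would be polled.
                    keepAlive.run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("work failed", e.getCause());
                }
            }
        } finally {
            worker.shutdownNow();
        }
    }
}
```

In a consumer, the caller would pause the assigned partitions first, pass something like a poll call as keepAlive, and resume afterwards; pollEveryMs would need to be well below max.poll.interval.ms.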

Do Kafka consumers spin on poll() or are they woken up by a broadcast/signal from the broker?

If I poll() from a consumer in a while True: statement, I see that poll() is blocking. If the consumer is up to date with messages from the topic (offset = OFFSET_END), how is the consumer conducting its blocking poll()?
Does the consumer default adhere to a pub/sub mentality in which it sleeps and waits for a publish and a broadcast/signal from the broker?
Or is the consumer constantly spinning itself checking the topic?
I'm using the confluent python client, if that matters.
Thanks!
Kafka consumers are basically long-poll loops, driven (asynchronously) by the user thread calling poll().
The whole protocol is request-response and entirely client driven. There is no form of broker-initiated "push".
fetch.max.wait.ms controls how long any single broker will wait before responding (if there is no data), while blocking of the user thread is controlled by the argument to poll().
Yes, you are right: it is a while-true loop in which each poll() waits for messages until the timeout passed to it expires.
If it receives a message it returns immediately; otherwise it waits for the given timeout and returns an empty result.
The Kafka broker uses the parameters below to control when messages are sent to the consumer:
fetch.min.bytes: The broker will wait for this amount of data to fill BEFORE it sends the response to the consumer client.
fetch.wait.max.ms: The broker will wait for this amount of time BEFORE sending a response to the consumer client unless it has enough data to fill the response (fetch.message.max.bytes)
Processing the consumed messages can delay the next call to poll(). max.poll.interval.ms bounds that delay: the consumer must call poll() again within max.poll.interval.ms, otherwise it leaves the group and triggers a rebalance.
You can get more detail about this here
max.poll.interval.ms: By increasing the interval between expected polls, you can give the consumer more time to handle a batch of
records returned from poll(long). The drawback is that increasing this
value may delay a group rebalance since the consumer will only join
the rebalance inside the call to poll. You can use this setting to
bound the time to finish a rebalance, but you risk slower progress if
the consumer cannot actually call poll often enough.
max.poll.records: Use this setting to limit the total records returned from a single call to a poll. This can make it easier to
predict the maximum that must be handled within each poll interval. By
tuning this value, you may be able to reduce the poll interval, which
will reduce the impact of group rebalancing.

heartbeat failed for group because it's rebalancing

What's the exact reason for a heartbeat failure for a group because it's rebalancing? What causes a rebalance when all the consumers in the group are up?
Thank you.
Heartbeats are the basic mechanism to check if all consumers are still up and running. If you get a heartbeat failure because the group is rebalancing, it indicates that your consumer instance took too long to send the next heartbeat and was considered dead and thus a rebalance got triggered.
If you want to prevent this from happening, you can either increase the timeout (session.timeout.ms) or make sure your consumer sends heartbeats more often (heartbeat.interval.ms). Heartbeats are basically embedded in poll(), so you need to make sure you call poll() frequently enough. This can usually be achieved by limiting the number of records a single poll returns via max.poll.records (to shorten the time it takes to process all the data that got fetched).
Update
Since Kafka 0.10.1, heartbeats are sent in a background thread, and not when poll() is called (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread). In this new design, the configurations session.timeout.ms and heartbeat.interval.ms still work the same way. Additionally, there is max.poll.interval.ms, which determines how often poll() must be called. If you fail to call poll() within max.poll.interval.ms, the heartbeat thread assumes that the processing thread died and sends a leave-group request, which triggers a rebalance; the heartbeat thread stops sending heartbeats afterwards. If your processing thread is OK but just slow, the next call to poll() will initiate another rebalance to re-join the group.
For more details, cf. Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1
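To make the interplay of the two timeouts concrete, here is a deliberately simplified model of the rules described in the update above. It is not how the client is implemented; LivenessModel, its Outcome values, and the parameter names are hypothetical, and it ignores details such as static membership.

```java
final class LivenessModel {
    private LivenessModel() {}

    enum Outcome { IN_GROUP, LEAVES_GROUP, SESSION_EXPIRED }

    // Applies the two rules from the text:
    //  - no poll() within max.poll.interval.ms: the consumer itself leaves the group;
    //  - no heartbeat within session.timeout.ms: the group considers the consumer dead.
    static Outcome evaluate(long msSinceLastPoll, long msSinceLastHeartbeat,
                            long maxPollIntervalMs, long sessionTimeoutMs) {
        if (msSinceLastPoll > maxPollIntervalMs) {
            return Outcome.LEAVES_GROUP;
        }
        if (msSinceLastHeartbeat > sessionTimeoutMs) {
            return Outcome.SESSION_EXPIRED;
        }
        return Outcome.IN_GROUP;
    }
}
```

For example, a consumer that has been processing for 6 minutes with a 5 minute max.poll.interval.ms ends up in LEAVES_GROUP even though its heartbeat thread was healthy, which is the "slow but alive" case the answer describes.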