Kafka Rebalancing and listeners pitfalls - apache-kafka

I am reading Kafka: The Definitive Guide and would like to better understand the rebalance listener. The example in the book simply uses a HashMap to maintain the current offsets that have been processed and commits the current state when a partition is revoked. I have two questions about the code example:
The language used leads me to assume that these callbacks are made on a different thread. So, shouldn't thread safety be considered when applying the current offsets? Additionally, shouldn't the current batch be cancelled after this is committed?
It says to use commitSync to make sure offsets are committed before the rebalance proceeds. However this is only synchronous within that consumer. Is there some mechanism where the coordinator will not proceed until it hears back from all subscribed consumers?

I re-read the section in the book and I agree I was a bit confused too!
The Javadoc states:
This callback will only execute in the user thread as part of the
poll(long) call whenever partition assignment changes.
I had a look at the code and confirmed the rebalance listener methods are indeed called in the same thread that owns the Consumer.
Yes you should use commitSync() when committing in the rebalance listener.
To explain why, let's look at the golden path example. We start with a consumer happily consuming and heartbeating regularly to the coordinator. At some point the coordinator returns a REBALANCE_IN_PROGRESS error to a heartbeat request. This can be caused by a new member wanting to join the group, a member leaving or failing to heartbeat, or partitions being added to or removed from the subscription. At this point, all consumers need to rejoin the group.
Before attempting to rejoin the group, the consumer will synchronously execute ConsumerRebalanceListener.onPartitionsRevoked(). Once the listener returns, the consumer will send a JoinRequest to the coordinator to rejoin the group.
That said, and I think this is what you were thinking about, if your callback takes too long (> session.timeout.ms) to commit, the group could already be in another generation, with the partitions whose offsets you are trying to commit assigned to another member. In that case, the commit will fail even though it was synchronous. But by using commitSync() in the listener you are guaranteed the consumer won't rejoin the group before completing the commit.
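Since the listener runs on the polling thread, a plain HashMap is enough to track offsets. Below is a minimal sketch of that bookkeeping, using only the JDK. The class name RevokeOffsetTracker and the "topic-partition" string keys are made up for illustration; with the real client you would key on TopicPartition, wrap each value in OffsetAndMetadata, and pass the resulting map to consumer.commitSync(...) from inside onPartitionsRevoked().

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tracks the next offset to commit for each partition. */
class RevokeOffsetTracker {
    // Key is a "topic-partition" string here; the real client would use TopicPartition.
    // A plain HashMap is fine because poll() and the listener share one thread.
    private final Map<String, Long> nextOffsets = new HashMap<>();

    /** Call after a record has been fully processed. */
    void markProcessed(String partition, long offset) {
        // The committed offset is the position of the next record to read.
        nextOffsets.put(partition, offset + 1);
    }

    /** Call from onPartitionsRevoked: returns what to commit and forgets revoked partitions. */
    Map<String, Long> offsetsForRevoked(Collection<String> revoked) {
        Map<String, Long> toCommit = new HashMap<>();
        for (String partition : revoked) {
            Long next = nextOffsets.remove(partition);
            if (next != null) {
                toCommit.put(partition, next);
            }
        }
        return toCommit;
    }
}
```

Removing the revoked partitions from the map means stale offsets cannot be committed later if the same partition is assigned back to this consumer in a future generation.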

Related

How does a consumer in Kafka know when another consumer takes ownership of a partition it currently owns during rebalancing?

As I read Kafka's paper, I found the paper says that during rebalance, "when there are multiple consumers within a group, each of them will be notified of a broker or a consumer change. However, the
notification may come at slightly different times at the consumers.
So, it is possible that one consumer tries to take ownership of a
partition still owned by another consumer. When this happens, the
first consumer simply releases all the partitions that it currently
owns, waits a bit and retries the rebalance process"
I am wondering how one consumer knows when another consumer takes ownership of a partition it currently owns?
I think that may be some underlying mechanism notifying this condition or coordination within a consumer group.

Is Poll call during kafka rebalancing a busy wait?

I am using manual kafka commit by setting property enable.auto.commit as false while initialising the Kafka consumer and calling kafka commit manually after receiving and processing the message.
However, since the processing of a message in my consumer is time-consuming, I am getting an Exception with the message "error": "Broker: Group rebalance in progress"
The reason is that a commit made after the rebalance timeout is rejected with this error. One recovery action is to exit and re-instantiate the process, which will trigger rebalancing and partition assignment again. Another is to catch this exception and continue as usual, which will work correctly only if the poll() call blocks until the rebalancing is complete; otherwise it will fetch the next packet from the batch and might process and commit it successfully, leading to loss of the message whose commit failed during rebalancing.
So I need to know the correct way to handle this case: should I re-instantiate the process, or should I catch and ignore the exception?
The best approach is to ignore if it happens occasionally, and if it happens frequently then reduce the max.poll.records or increase the max.poll.interval.ms to ensure it does only happen occasionally. Also, ensure that your code can handle duplicate records (if you can't do that then there is a different answer).
The error you see is, as you probably realise, just because by the time the consumer committed, the group had decided that it had probably gone, and so its partitions were picked up by a different consumer as part of a rebalance. The new consumer would have started from the last committed offset, hence duplicates.
Given that the original consumer is alive and well, it will no doubt poll again and so trigger another rebalance. This poll won't block waiting for the rebalance to occur: each poll allows for some communication about the current state of the group (within the polling thread), and after a number of polls the new allocation of partitions will be agreed and accepted. At that point the rebalance is considered complete, and that poll will tell the consumer its partition allocation and return a set of records.
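As a rough sketch of the tuning mentioned above, the consumer properties might be built like this. The property names are the standard consumer configs; the values 50 and 600000 are placeholders I picked for illustration, not recommendations, and the class name SlowConsumerConfig is made up. You would still need to add bootstrap.servers, group.id and the deserializers before passing this to a KafkaConsumer.

```java
import java.util.Properties;

/** Hypothetical helper that collects the settings discussed for a slow consumer. */
class SlowConsumerConfig {
    static Properties build() {
        Properties props = new Properties();
        // Commit manually, only after a record has been processed.
        props.setProperty("enable.auto.commit", "false");
        // Fewer records per poll() means less work between two polls (illustrative value).
        props.setProperty("max.poll.records", "50");
        // More time allowed between two polls before the member is considered gone (illustrative value).
        props.setProperty("max.poll.interval.ms", "600000");
        return props;
    }
}
```

A reasonable starting point is to make sure the worst-case processing time for max.poll.records records stays comfortably below max.poll.interval.ms.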

Kafka Consumer Rebalancing and Its Impact

I'm new to Kafka and I'm trying to design a wrapper library in both Java and Go (uses Confluent/Kafka-Go) for Kafka to be used internally. For my use-case, CommitSync is a crucial step and we should do a read only after properly committing the old one. Repeated processing is not a big issue and our client service is idempotent enough. But data loss is a major issue and should not occur.
I will create X number of consumers initially and will keep on polling from them. Hence I would like to know more about the negative scenarios that could happen here, their impact, and how to properly handle them.
I would like to know more about:
1) Network issue during consumer processing:
What happens when the network goes off for a brief period and comes back? Does the Kafka consumer automatically handle this and become alive when the network comes back, or do we have to reinitialise them? If they come back alive, do they resume work from where they left off?
Eg: Consumer X read 50 records from Partition Y. Now internally the consumer offset moved to +50. But before committing, a network issue happens and then the consumer comes back alive. Will the consumer still have the metadata about what it read in the last poll? Can it go on to commit +50 in offset?
2) Rebalancing in consumer groups. Impact of them on existing consumer process - whether the existing working consumer instance will pause and resume work during a rebalance or do we have to reinitialize them? How long can a rebalance take? If the consumer comes back alive after rebalance, does it have metadata about its last read?
3) What happens when a consumer joins during a rebalancing. Ideally it is again a rebalancing scenario. What will happen now? The existing will be discarded and the new one starts or will wait for the existing rebalance to complete?
What happens when the network goes off for a brief period and comes back? Does the Kafka consumer automatically handle this and become alive when the network comes back, or do we have to reinitialise them?
The consumer will try to reconnect. If the consumer group coordinator doesn't receive heartbeats, or the brokers don't respond to the consumer's requests, then the group rebalances.
If they come back alive, do they resume work from where they left off?
From the last committed offset, yes.
whether the existing working consumer instance will pause and resume work during a rebalance
It will pause and resume. No action needed.
How long can rebalance occur?
It depends on many factors, and a rebalance can go on indefinitely under certain conditions.
If the consumer comes back alive after rebalance, does it have metadata about its last read?
The last committed offsets are stored on the broker, not by consumers.
The existing will be discarded and the new one starts or will wait for the existing rebalance to complete?
All rebalances must complete before any polls continue.
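To make "resume from the last committed offset" concrete, here is a toy model of a single partition in plain Java, with no Kafka client involved. The class AtLeastOnceReplay and its methods are invented for this illustration. It assumes the only state that survives a lost partition or a restart is the committed offset, which matches the answers above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition: only the committed offset survives a failure. */
class AtLeastOnceReplay {
    private long committed = 0;                        // where any consumer resumes from
    private final List<Long> processed = new ArrayList<>();

    /**
     * Reads up to batchSize offsets starting at the committed offset and processes them.
     * If commit is false, this simulates losing the partition before the commit succeeds.
     */
    void pollAndProcess(int batchSize, long logEnd, boolean commit) {
        long position = committed;                     // in-memory position is not trusted after a failure
        long end = Math.min(position + batchSize, logEnd);
        for (long offset = position; offset < end; offset++) {
            processed.add(offset);                     // "processing" just records the offset
        }
        if (commit) {
            committed = end;                           // commit = last processed offset + 1
        }
    }

    long committed() { return committed; }

    List<Long> processed() { return processed; }
}
```

If the second of three batches of size 3 is processed but not committed, the third call reads offsets 3 to 5 again. Nothing is skipped, but those three records are processed twice, which is why committing only after processing needs an idempotent client service.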

How does spring kafka handle maintaining a heartbeat

In the kafka consumer documentation https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html it states that care needs to be taken to make sure poll is called every so often or the broker will assume the consumer is dead.
The most reliable procedure was pretty complicated:
For use cases where message processing time varies unpredictably,
neither of these options may be sufficient. The recommended way to
handle these cases is to move message processing to another thread,
which allows the consumer to continue calling poll while the processor
is still working. Some care must be taken to ensure that committed
offsets do not get ahead of the actual position. Typically, you must
disable automatic commits and manually commit processed offsets for
records only after the thread has finished handling them (depending on
the delivery semantics you need). Note also that you will need to
pause the partition so that no new records are received from poll
until after thread has finished handling those previously returned.
Does spring kafka handle this for me under the hood?
The heartbeat is mentioned only very briefly in the documentation. Apparently the heartbeat is managed on a different thread:
Since version 0.10.1.0 heartbeats are sent on a background thread
You can also read this github issue to read more about the heartbeat.

Can a Kafka consumer commit an offset in a separate thread?

Does Kafka permit one thread or process to consume data from a partition, while another thread or process takes the responsibility of manually committing the offset once the data has been completely processed?
Direct from the KafkaConsumer documentation:
The Kafka consumer is NOT thread-safe. All network I/O happens in the
thread of the application making the call.
...
The only exception to this rule is wakeup(), which can safely be used from an external thread to interrupt an active operation.
So, no it is not recommended to use the consumer outside of one thread, beyond the wakeup exception.
Yes, I believe it's possible. As noted above KafkaConsumer objects are not thread safe hence each thread should have its own instance. Both instances should have the same group id and auto-commit should of course be disabled. There are commit methods that take specific partitions and offsets as parameters:
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#commitSync-java.util.Map-
and
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#commitAsync-java.util.Map-org.apache.kafka.clients.consumer.OffsetCommitCallback-
However, I think you may not be able to do this when using the automatic group management via the subscribe method (the old high level consumer-style usage) but rather you will have to manage partition assignment manually using assign method (like with the old simple consumer). But you can give the former a try and see if that too is possible.
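A different approach from the two answers above, which keeps to the single-thread rule, is to let worker threads only report finished offsets and have the polling thread do every commit. The sketch below uses only the JDK. The class CompletedOffsets and the "topic-partition" string keys are hypothetical, and it assumes the records of any one partition are processed in order by a single worker, so the highest finished offset is safe to commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical hand-off between worker threads and the single polling thread. */
class CompletedOffsets {
    // Key is a "topic-partition" string here; the real client would use TopicPartition.
    private final ConcurrentHashMap<String, Long> next = new ConcurrentHashMap<>();

    /** Called by worker threads once a record is fully processed. Never touches the consumer. */
    void markDone(String partition, long offset) {
        // Keep the highest next-offset seen; merge is atomic per key.
        // Assumes in-order processing within a partition (see the note above).
        next.merge(partition, offset + 1, Long::max);
    }

    /** Called only by the polling thread: a copy of the offsets that are safe to commit. */
    Map<String, Long> snapshot() {
        return new HashMap<>(next);
    }
}
```

In the poll loop, the polling thread would take snapshot(), convert it to a Map of TopicPartition to OffsetAndMetadata, and call commitSync or commitAsync itself, so the consumer object is never shared between threads.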