Kafka is marking the coordinator dead when autoscaling is on - Kubernetes

We run a Kubernetes cluster with Kafka 0.10.2. In the cluster we have a replica set of 10 replicas running one of our services, which consumes from one topic as a single consumer group.
Recently we turned on autoscaling for this replica set, so it can increase or decrease the number of pods based on CPU consumption.
Soon after this feature was enabled we started to see lag building up in our Kafka queue. I looked at the logs and saw the consumer marking the coordinator dead very often (almost every 5 minutes) and then reconnecting to the same coordinator a few seconds later.
I also saw frequently in the logs:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
It normally takes a few seconds to process a message, and we never had this kind of issue before. I assume the problem relates to a bad partition assignment, but I can't pinpoint it.
If we kill a pod that got "stuck", Kafka reassigns its partitions to another pod and that pod gets stuck as well; but if we scale the replica set down to 0 and then back up, the messages are consumed quickly!
Relevant consumer configurations:
heartbeat.interval.ms = 3000
max.poll.interval.ms = 300000
max.poll.records = 500
session.timeout.ms = 10000
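For context, a minimal sketch of how these settings plug into the consumer; this is not our exact code, and the broker address, group id and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ServiceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // placeholder
        props.put("group.id", "my-service-group");      // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // the settings listed above
        props.put("heartbeat.interval.ms", "3000");
        props.put("max.poll.interval.ms", "300000");
        props.put("max.poll.records", "500");
        props.put("session.timeout.ms", "10000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                // on a 0.10.x client poll() takes a long (milliseconds)
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // the whole batch must finish within max.poll.interval.ms
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // a few seconds of work per message in our case
    }
}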
Any suggestions?

I am not saying this is the problem, but Spring Kafka 1.1.x had a very complicated threading model (required by the 0.9 clients). For long-running listeners we had to pause/resume the consumer thread; I saw some issues with early Kafka versions where the resume didn't always work.
KIP-62 allowed us to greatly simplify the threading model.
This was incorporated into the 1.3.x line.
I would suggest upgrading to at least cloud-stream Ditmars, which uses spring-kafka 1.3.x. The current 1.3.x version is 1.3.8.

Related

When does a Kafka consumer get evicted from the group?

I am using Spring Kafka and want to know when a Kafka consumer gets evicted from the group. Does it get evicted when the processing time exceeds the poll interval? If so, isn't the purpose of the heartbeat to indicate that the consumer is alive? In that case the consumer should never be evicted unless the process itself fails.
You are correct that the heartbeat thread tells the group that the consumer process is still alive. The reason for additionally considering a consumer to be gone when there is excessive time between polls is to prevent livelock.
Without this, a consumer might never poll, and so would hold on to its partitions without making any progress through them.
The question then is really why there is a heartbeat and session timeout. The heartbeat thread is actually doing other stuff (pre-fetching) but I assume the reason it is used to check that consumers are alive is that it is generally talking to the broker more frequently than the polling thread as the latter has to process messages, and so a failed consumer process will be spotted earlier.
In short, there are three things that can trigger a rebalance: a change in the number of partitions on the broker side, polling taking longer than max.poll.interval.ms, and a gap between heartbeats longer than session.timeout.ms.
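If you want to observe these rebalances from the application side, the Java consumer lets you pass a ConsumerRebalanceListener to subscribe(); a minimal sketch, with a made-up broker address, group id and topic:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLogger {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // fires whenever one of the three triggers above causes a rebalance
                    System.out.println("Revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Assigned: " + partitions);
                }
            });

            while (true) {
                // poll(Duration) needs a 2.0+ client; older clients use poll(long)
                consumer.poll(Duration.ofSeconds(1)).forEach(record -> {
                    // normal processing here
                });
            }
        }
    }
}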

Processing Kafka messages taking a long time

I have a Python process (or rather, a set of processes running in parallel within a consumer group) that processes data according to inputs coming in as Kafka messages from a certain topic. Usually each message is processed quickly, but sometimes, depending on the content of the message, it may take a long time (several minutes). In this case, the Kafka broker disconnects the client from the group and initiates a rebalance. I could set session_timeout_ms to a really large value, but it would have to be something like 10 minutes or more, which means that if a client dies, the cluster would not be properly rebalanced for 10 minutes. This seems like a bad idea. Also, most messages (about 98% of them) are fast, so paying such a penalty for just 1-2% of messages seems wasteful. OTOH, the slow messages are frequent enough to cause a lot of rebalances and cost a lot of performance (since while the group is rebalancing, nothing gets done, and then the "dead" client re-joins and causes another rebalance).
So, I wonder, are there any other ways of handling messages that take a long time to process? Is there any way to initiate heartbeats manually, to tell the broker "it's OK, I am alive, I'm just working on the message"? I thought the Python client (I use kafka-python 1.4.7) was supposed to do that for me, but it doesn't seem to happen. Also, the API doesn't seem to have a separate "heartbeat" function at all. And as I understand it, calling poll() would actually get me the next messages while I am not even done with the current one, and would also mess up the iterator API of the Kafka consumer, which is quite convenient to use in Python.
In case it's important, the Kafka cluster is Confluent, version 2.3 if I remember correctly.
In Kafka 0.10.1+, polling and the session heartbeat are decoupled from each other.
max.poll.interval.ms defines how much time a consumer instance is allowed to take to complete processing before timing out; if processing takes longer than max.poll.interval.ms, the consumer group presumes the instance has died, removes it from the group, and invokes a rebalance.
Increasing this value increases the interval between expected polls, which gives the consumer more time to handle a batch of records returned from poll(long).
But at the same time it also delays group rebalances, since the consumer will only join a rebalance inside a call to poll().
session.timeout.ms is the timeout used to decide whether the consumer is still alive and sending heartbeats at the defined interval (heartbeat.interval.ms). The general rule of thumb is that heartbeat.interval.ms should be 1/3 of the session timeout, so in case of a network failure a consumer can miss at most 3 heartbeats before the session times out.
session.timeout.ms: a low value detects failures more quickly.
max.poll.interval.ms: a large value reduces the risk of failures caused by long processing times, but increases rebalancing time.
Note: a large number of partitions and topics consumed by the consumer group also affects the overall rebalance time.
The other approach, if you really want to get rid of rebalancing, is to assign partitions to each consumer instance manually using assign() (see the sketch below). In that case each consumer instance runs independently with its own assigned partitions, but you will not be able to leverage the rebalance mechanism to assign partitions automatically.
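The asker is on kafka-python, but to illustrate the idea, here is a rough sketch of manual assignment with the Java client (broker address, topic and partition numbers are made up):

import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ManuallyAssignedConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "manual-group");            // still used for committing offsets
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() instead of subscribe(): no group membership, so no rebalances,
            // and neither max.poll.interval.ms nor session.timeout.ms can evict this instance
            List<TopicPartition> partitions = Arrays.asList(
                    new TopicPartition("my-topic", 0),    // made-up partitions
                    new TopicPartition("my-topic", 1));
            consumer.assign(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // long-running processing is fine here; nothing will revoke the partitions
                }
                consumer.commitSync(); // commit manually once the batch is done
            }
        }
    }
}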

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group id, but it still does the same thing.
Also, does the client version of the Kafka consumer matter here?
I'd suggest you decouple the consumer and the processing logic, to start with. E.g. let the Kafka consumer only poll messages and, after sanitizing them if necessary, delegate the actual processing of each record to a separate thread (see the sketch below); then check whether the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
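As a rough illustration of that suggestion (not the asker's actual code; broker address, topic and the processing step are placeholders), one common pattern is to pause the assigned partitions, hand the batch to a worker thread, and keep calling poll() so the consumer stays in the group:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "eventGroup");              // group name from the log above
        props.put("enable.auto.commit", "false");         // commit only after processing succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService worker = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            Future<?> inFlight = null;

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

                if (!records.isEmpty()) {
                    // stop fetching while the worker chews on this batch,
                    // but keep calling poll() so the group does not kick us out
                    consumer.pause(consumer.assignment());
                    inFlight = worker.submit(() -> records.forEach(record -> {
                        // slow processing goes here
                    }));
                }

                if (inFlight != null && inFlight.isDone()) {
                    inFlight = null;
                    consumer.commitSync();            // now it is safe to commit the batch
                    consumer.resume(consumer.paused());
                }
            }
        }
    }
}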

max.poll.interval.ms set to Integer.MAX_VALUE by default

Apache Kafka documentation states:
The internal Kafka Streams consumer max.poll.interval.ms default value was changed from 300000 to Integer.MAX_VALUE.
Since this value is used to detect when the processing time for a batch of records exceeds a given threshold, is there a reason for such an "unlimited" value?
Does it enable applications to become unresponsive? Or does Kafka Streams have a different way to leave the consumer group when processing takes too long?
Kafka Streams leverages the heartbeat functionality of the Kafka consumer client in this context, and thus decouples heartbeats ("Is this app instance still alive?") from calls to poll(). The two main parameters are session.timeout.ms (for the heartbeat thread) and max.poll.interval.ms (for the processing thread); their difference is described in more detail at https://stackoverflow.com/a/39759329/1743580.
The heartbeating was introduced so that an application instance may be allowed to spend a lot of time processing a record without being considered "not making progress" and thus "dead". For example, your app can do a lot of crunching on a single record for a minute, while still heartbeating to Kafka: "Hey, I'm still alive, and I am making progress. But I'm simply not done with the processing yet. Stay tuned."
Of course you can change max.poll.interval.ms from its default (Integer.MAX_VALUE) to a lower setting if, for example, you actually do want your app instance to be considered "dead" if it takes longer than X seconds in-between polling records, and thus if it takes longer than X seconds to process the latest round of records. It depends on your specific use case whether or not such a configuration makes sense -- in most cases, the default setting is a safe bet.
session.timeout.ms: The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
max.poll.interval.ms: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
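If you do want such an upper bound, consumer-level settings can be passed through the Streams configuration. A minimal sketch, assuming a 1.0+ Streams client and using a made-up application id, bootstrap address and topology:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsPollIntervalOverride {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Override the internal consumer's max.poll.interval.ms so the instance is
        // considered dead if processing a batch takes longer than 10 minutes.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}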

Kafka: increase session timeout so that consumers will not get the same message

I have a Java app that consumes and produces messages from Kafka.
I have a thread pool of 5 threads; each thread creates a consumer, and since I have 5 partitions the work is divided between them.
The problem is that 2 threads are getting the same message, because no heartbeat reaches the broker while each message takes about an hour to process.
I tried to increase session.timeout.ms and also changed group.min.session.timeout.ms on the broker so that the maximum value would allow it.
In this case the consumer cannot be started.
Any ideas?
The keep-alive is not sent, so that doesn't seem to be true as far as I can tell.
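A couple of things worth checking, sketched below with made-up values (not the asker's actual configuration): the consumer's session.timeout.ms has to fall inside the broker's [group.min.session.timeout.ms, group.max.session.timeout.ms] range or the consumer is rejected when it tries to join the group, and on 0.10.1+ clients it is max.poll.interval.ms, not session.timeout.ms, that has to cover the hour-long processing:

import java.util.Properties;

public class LongProcessingConsumerConfig {

    // Broker side (server.properties), example values only: a consumer's session.timeout.ms
    // must fall inside this range or the consumer will fail to join the group.
    //   group.min.session.timeout.ms=6000
    //   group.max.session.timeout.ms=1800000

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "hour-long-jobs");          // placeholder
        // Keep the session timeout moderate; on a 0.10.1+ client the background
        // heartbeat thread keeps the session alive while a message is being processed.
        props.put("session.timeout.ms", "30000");
        // This, not session.timeout.ms, is what has to cover the hour-long processing.
        props.put("max.poll.interval.ms", String.valueOf(2 * 60 * 60 * 1000)); // 2 hours
        // Take one message per poll so a single slow message doesn't hold up a whole batch.
        props.put("max.poll.records", "1");
        return props;
    }
}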