Kafka consumer group keep moving to PreparingRebalance state and stops consuming - apache-kafka

We have a Kafka Streams consumer group that keep moving to PreparingRebalance state and stop consuming.
The pattern is as follows:
Consumer group is running and stable for around 20 minutes
New consumers (members) start to appear in the group state without any clear reason, these new members only originate from a small number of VMs (not the same VMs each time), and they keep joining
Group state changes to PreparingRebalance
All consumers stop consuming, showing these logs:
"Group coordinator ... is unavailable or invalid, will attempt rediscovery"
The consumer on VMs that generated extra members show these logs:
Offset commit failed on partition X at offset Y: The coordinator is not aware of this member.
Failed to commit stream task X since it got migrated to another thread already. Closing it as zombie before triggering a new rebalance.
Detected task Z that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Will try to rejoin the consumer group.
We kill all consumer processes on all VMs, the group moves to Empty with 0 members, we start the processes and we're back to step 1
Kafka version is 1.1.0, streams version is 2.0.0
We took thread dumps from the misbehaving consumers, and didn't see more consumer threads than configured.
We tried restarting kafka brokers, cleaning zookeeper cache.
We suspect that the issue has to do with missing heartbeats, but the default heartbeat is 3 seconds and message handling times are no where near that.
Anyone encountered a similar behaviour?

Related

kafka consumers in consumer group not resuming messages after restart

Hope you are having good day.
I have an issue with kafka consumers on kubernetes. I am running 3 replicas inside a consumer group
I have a topic with 3 partitions and 3 brokers with offsets replication factor set to 3. My offset in consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: When by any means if a consumer replica inside the consumer group "abc-consumer-group" restarts OR if a broker(leader) restarts, it is not resuming from the point where it stopped. It states that I am up to date and no messages I have to process.
Any suggestions please where to look at?
Tried increasing rebalance, heartbeat, session timeout on broker level, no luck.
And yes whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka. I do see it happening but still not consumers are not resuming messages. It states nothing to process.

Kafka Streams Apps Threads fail transaction and are fenced and restarted after Kafka broker restart

We are noticing Streams Apps threads fail transactions during rolling restarts of our Kafka Brokers. The transaction failure causes stream thread fencing which in turn causes a restart of the thread and re-balancing. The re-balancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and prevent processing delays as much as possible.
For our rolling Broker restarts we use the controlled.shutdown=true configuration, and before each restart we wait for all partitions to be in-sync across all replicas.
For our Streams Apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the streams apps themselves are smooth and without re-balances.
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
Broker starts shutting down
App logs error producing to topic due to NOT_LEADER_OR_FOLLOWER
App heartbeats failing because coordinator is restarting broker
App discovers new group coordinator (this bounces a a bit between the restarting broker and live brokers)
App stabilizes
Broker starting up again
App fails to do fetch request to starting broker due to FETCH_SESSION_ID_NOT_FOUND
App discovers starting broker as transaction coordinator
App transaction fails due to one of two reasons:
InvalidProducerEpochException: Producer attempted to produce with an old epoch.
ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one
Stream threads end up in fatal error state, get fenced and restarted which causes a rebalance.
What could be causing the two exceptions that cause stream thread transactions to fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction states with the in-sync brokers. This could explain old epochs or different transactional ids to be known by that broker.
How can we further identify what is going wrong here and how it can be improved?
you can set request.timeout.ms in kafka streams which will make stream API wait for a longer period of time. if kafka broker is not up in a given period of time then only it will throw an exception which can be handled by using ProductionExceptionHandler as described in Handling exceptions in Kafka streams

Duplicate message consumption in Kafka due to auto-downscaling/deletion of pods

Background
We have a simple producer/consumer style application with Kafka as the message broker and Consumer Processes running as Kubernetes pods. We have defined two topics namely the in-topic and the out-topic. A set of consumer pods that belong to the same consumer group read messages from the in-topic, perform some work and finally write out the same message (key) to the out-topic once the work is complete.
Issue Description
We noticed that there are duplicate messages being written out to the out-topic by the consumers that are running in the Kubernetes pods. To rephrase, two different consumers are consuming the same messages from the in-topic twice and thus publishing the same message twice to the out-topic as well. We analyzed the issue and can safely conclude that this issue only occurs when pods are auto-downscaled/deleted by Kubernetes.
In fact, an interesting observation we have is that if any message is read by two different consumers from the in-topic (and thus published twice in the out-topic), the given message is always the last message consumed by one of the pods that was downscaled. In other words, if a message is consumed twice, the root cause is always the downscaling of a pod.
We can conclude that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic.
Consumer configuration
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "3600000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG"org.apache.kafka.common.serialization.StringDeserializer")
Zookeeper/broker logs :
[2021-04-07 02:42:22,708] INFO [GroupCoordinator 0]: Preparing to rebalance group PortfolioEnrichmentGroup14 in state PreparingRebalance with old generation 1 (__consumer_offsets-17) (reason: removing member PortfolioEnrichmentConsumer13-9aa71765-2518-
493f-a312-6c1633225015 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,331] INFO [GroupCoordinator 0]: Stabilized group PortfolioEnrichmentGroup14 generation 2 (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,335] INFO [GroupCoordinator 0]: Assignment received from leader for group PortfolioEnrichmentGroup14 for generation 2 (kafka.coordinator.group.GroupCoordinator)
What we tried
Looking at the logs, it was clear that rebalancing takes place because of the heartbeat expiration. We added the following configuration parameters to increase the heartbeat and also increase the session time out :
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "900000");
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "512");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");
However, this did not solve the issue. Looking at the broker logs, we can confirm that the issue is due to the downscaling of pods.
Question : What could be causing this behavior where a message is consumed twice when a pod gets downscaled?
Note : I already understand the root cause of the issue; however, considering that a consumer is a long lived process running in an infinite loop, how and why is Kubernetes downscaling/killing a pod before the consumer commits the offset? How do I tell Kubernetes not to remove a running pod from a consumer group until all Kafka commits are completed?
"What could be causing this behavior where a message is consumed twice when a pod gets downscaled?"
You have provided the answer already yourself: "[...] that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic."
As the message was processed but not committed, another pod is re-processing the same message again after the downscaling happens. Remember that adding or removing a consumer from a consumer group always initiates a Rebalancing. You have now first-hand experience why this should generally be avoided as much as feasible. Depending on the Kafka version a rebalance will cause every single consumer of the consumer group to stop consuming until the rebalancing is done.
To solve your issue, I see two options:
Only remove running pods out of the Consumer Group when they are idle
Reduce the consumer configuration auto.commit.interval.ms to 1 as this defaults to 5 seconds. This will only work if you set enable.auto.commit to true.
If you want your consumer to commit message/s before exiting you would need to handle exit signal to your consumer. A lot of languages do support this. Have a look at this thread on how to do this in java - How to finish kafka consumer safety?(Is there meaning to call thread#join inside shutdownHook ? ).
That being said, please note that there is no 100% guarantee to achieving exactly once. Your process can be killed forcefully by OS before even given time to run any exit clean up (kill -9 <process_id>.

Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid

Environment
3-node Kafka Cluster
Amazon MSK
v2.3
1 topic
6 partitions
1 consumer group with 2 consumers
Running in Kubernetes
Confluent .NET SDK 1.2.2
Except for bootstrap.servers and group.id, all of the default settings.
Problem
First, one of my consumers encounters the following exception.
Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid
at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)
at Confluent.Kafka.Consumer`2.Commit(IEnumerable`1 offsets)
The exception is trapped and the consumer is supposed to retry, but instead the app sits idle. The container is still up and running, but not consuming any more messages.
What's weirder is that the broker never reassigns that consumer's partitions so the consumer lag on those partitions begins to grow. It seems like the consumer is both alive (since the broker is not reassigning its partitions) and dead (since it cannot commit its offset or consume more messages). If we intervene and manually restart the consumers then the partitions are reassigned and the situation goes back to normal.
I'm not entirely sure what to make of the exception above. Google doesn't offer much. The most relevant lead I have is this issue in GitHub, which involves a broker restarting. To my knowledge, that is not happening in my situation. Any assistance would be greatly appreciated.
at least I have found a solution for me.
In my code I did a manual commit and set EnableAutoCommit = false.
Somehow it was possible that for an offset a commit was executed twice. I removed the manual commits on the consumer and set EnableAutoCommit = true.
After that it worked.

What happens to consumer groups in Kafka if the entire cluster goes down?

We have a consumer service that is always trying to read data from a topic using a consumer group. Due to redeployments, our Kafka cluster periodically is brought down and recreated again.
Whenever the cluster comes back again, we observed that although the previous topics are picked up (probably from zookeeper), the previous consumer groups are not created. Because of this, our running consumer process which is created with a previous consumer group gets stuck and never comes out.
Is this how the behavior of the consumer groups should be or is there a configuration we need to enable somewhere?
Any help is greatly appreciated.
Kafka Brokers keep a cache of healthy consumers and consumer groups, if the entire cluster is destroyed/recreated it no longer has knowledge of those consumers and groups, including offsets. The consumers will have to reconnect and re-establish the group and offsets from the beginning of the topic.
Operationally it makes more sense to keep the Kafka cluster running long-term, and do version upgrades in a rolling fashion so you don't interrupt the service.