Background
We have a simple producer/consumer style application with Kafka as the message broker and Consumer Processes running as Kubernetes pods. We have defined two topics namely the in-topic and the out-topic. A set of consumer pods that belong to the same consumer group read messages from the in-topic, perform some work and finally write out the same message (key) to the out-topic once the work is complete.
Issue Description
We noticed that there are duplicate messages being written out to the out-topic by the consumers that are running in the Kubernetes pods. To rephrase, two different consumers are consuming the same messages from the in-topic twice and thus publishing the same message twice to the out-topic as well. We analyzed the issue and can safely conclude that this issue only occurs when pods are auto-downscaled/deleted by Kubernetes.
In fact, an interesting observation we have is that if any message is read by two different consumers from the in-topic (and thus published twice in the out-topic), the given message is always the last message consumed by one of the pods that was downscaled. In other words, if a message is consumed twice, the root cause is always the downscaling of a pod.
We can conclude that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic.
Consumer configuration
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "3600000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG"org.apache.kafka.common.serialization.StringDeserializer")
Zookeeper/broker logs :
[2021-04-07 02:42:22,708] INFO [GroupCoordinator 0]: Preparing to rebalance group PortfolioEnrichmentGroup14 in state PreparingRebalance with old generation 1 (__consumer_offsets-17) (reason: removing member PortfolioEnrichmentConsumer13-9aa71765-2518-
493f-a312-6c1633225015 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,331] INFO [GroupCoordinator 0]: Stabilized group PortfolioEnrichmentGroup14 generation 2 (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,335] INFO [GroupCoordinator 0]: Assignment received from leader for group PortfolioEnrichmentGroup14 for generation 2 (kafka.coordinator.group.GroupCoordinator)
What we tried
Looking at the logs, it was clear that rebalancing takes place because of the heartbeat expiration. We added the following configuration parameters to increase the heartbeat and also increase the session time out :
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "900000");
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "512");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");
However, this did not solve the issue. Looking at the broker logs, we can confirm that the issue is due to the downscaling of pods.
Question : What could be causing this behavior where a message is consumed twice when a pod gets downscaled?
Note : I already understand the root cause of the issue; however, considering that a consumer is a long lived process running in an infinite loop, how and why is Kubernetes downscaling/killing a pod before the consumer commits the offset? How do I tell Kubernetes not to remove a running pod from a consumer group until all Kafka commits are completed?
"What could be causing this behavior where a message is consumed twice when a pod gets downscaled?"
You have provided the answer already yourself: "[...] that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic."
As the message was processed but not committed, another pod is re-processing the same message again after the downscaling happens. Remember that adding or removing a consumer from a consumer group always initiates a Rebalancing. You have now first-hand experience why this should generally be avoided as much as feasible. Depending on the Kafka version a rebalance will cause every single consumer of the consumer group to stop consuming until the rebalancing is done.
To solve your issue, I see two options:
Only remove running pods out of the Consumer Group when they are idle
Reduce the consumer configuration auto.commit.interval.ms to 1 as this defaults to 5 seconds. This will only work if you set enable.auto.commit to true.
If you want your consumer to commit message/s before exiting you would need to handle exit signal to your consumer. A lot of languages do support this. Have a look at this thread on how to do this in java - How to finish kafka consumer safety?(Is there meaning to call thread#join inside shutdownHook ? ).
That being said, please note that there is no 100% guarantee to achieving exactly once. Your process can be killed forcefully by OS before even given time to run any exit clean up (kill -9 <process_id>.
Related
Hope you are having good day.
I have an issue with kafka consumers on kubernetes. I am running 3 replicas inside a consumer group
I have a topic with 3 partitions and 3 brokers with offsets replication factor set to 3. My offset in consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: When by any means if a consumer replica inside the consumer group "abc-consumer-group" restarts OR if a broker(leader) restarts, it is not resuming from the point where it stopped. It states that I am up to date and no messages I have to process.
Any suggestions please where to look at?
Tried increasing rebalance, heartbeat, session timeout on broker level, no luck.
And yes whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka. I do see it happening but still not consumers are not resuming messages. It states nothing to process.
We have a remote kafka cluster that belongs to an external service, with which we pull data using a mirrormaker to our internal kafka cluster.
The following situation has occurred - on the side of the external service, one of the cluster brokers has fallen due to technical reasons.
The following appeared in the mirrormaker logs:
...
ERROR [Consumer clientId=XXX-1, groupId=YYY] Offset commit failed on partition PARTITION_NAME at offset 123456: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
WARN Failed to commit offsets because the consumer group has rebalanced and assigned partitions to another instance. If you see this regularly, it could indicate that you need to either increase the consumer's session.timeout.ms or reduce the number of records handled on each iteration with max.poll.records (kafka.tools.MirrorMaker$)
...
Next, consumers reconnected to alive nodes in the cluster and continued to read messages.
The problem is that due to the fall of the broker on the side of the external kafka, the messages could be read, but could not be committed. For this reason, after the rebalancing, the messages were read again and duplicates appeared in our internal cluster.
Are there any ways that would help in this situation to avoid duplicates in the internal cluster? (except for those indicated in the log warning.)
Maybe there are some consumer configuration parameters that would help to solve problems with duplicates.
My kafka topic has two partitions and single kafka consumer. I have deployed my consumer application(spring kafka) in the AWS. In logs I see kafka consumer re-balance in time to time. This is not frequent. As per the current observation when consumer is listening to the topic and idle this re-balancing occurs. Appreciate if someone can explain me this behavior. I have posted some logs here.
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Request joining group due to: group is already rebalancing
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Revoke previously assigned partitions order-response-v3-qa-0, order-response-v3-qa-1
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Revoke previously assigned partitions order-response-v3-qa-0, order-response-v3-qa-1
b2b-group: partitions revoked: [order-response-v3-qa-0, order-response-v3-qa-1
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] (Re-)joining group
Re-balancing is a feature that automatically optimizes uneven workloads as well as topology changes (e.g., adding or removing brokers). This is achieved via a background process that continuously checks a variety of metrics to determine if and when a to rebalance should occur.
you can go through the below link for further knowledge:
https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2
Kafka starts a rebalancing if a consumer joins or leaves a group. Below are various reasons why that can or will happen.
A consumer joins a group:
Application Start/Restart — If we deploy an application (or restart it), a new consumer joins the group
Application scale-up — We are creating new pods/application
A consumer leaves a group:
max.poll.interval.ms exceeded — polled records not processed in time
session.timeout.ms exceeded — no heartbeats sent, likely because of an application crash or a network error
Consumer shuts down
Pod relocation — Kubernetes relocates pods sometimes, e.g. if nodes are removed via kubectl drain or the cluster is scaled down. The consumer shuts down (leaves the group) and is restarted again on another node (joins the group).
Application scale-down
If you would like to understand more in depth. Here is one of the amazing article I have read
https://medium.com/bakdata/solving-my-weird-kafka-rebalancing-problems-c05e99535435
Environment
3-node Kafka Cluster
Amazon MSK
v2.3
1 topic
6 partitions
1 consumer group with 2 consumers
Running in Kubernetes
Confluent .NET SDK 1.2.2
Except for bootstrap.servers and group.id, all of the default settings.
Problem
First, one of my consumers encounters the following exception.
Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid
at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)
at Confluent.Kafka.Consumer`2.Commit(IEnumerable`1 offsets)
The exception is trapped and the consumer is supposed to retry, but instead the app sits idle. The container is still up and running, but not consuming any more messages.
What's weirder is that the broker never reassigns that consumer's partitions so the consumer lag on those partitions begins to grow. It seems like the consumer is both alive (since the broker is not reassigning its partitions) and dead (since it cannot commit its offset or consume more messages). If we intervene and manually restart the consumers then the partitions are reassigned and the situation goes back to normal.
I'm not entirely sure what to make of the exception above. Google doesn't offer much. The most relevant lead I have is this issue in GitHub, which involves a broker restarting. To my knowledge, that is not happening in my situation. Any assistance would be greatly appreciated.
at least I have found a solution for me.
In my code I did a manual commit and set EnableAutoCommit = false.
Somehow it was possible that for an offset a commit was executed twice. I removed the manual commits on the consumer and set EnableAutoCommit = true.
After that it worked.
We have a Kafka Streams consumer group that keep moving to PreparingRebalance state and stop consuming.
The pattern is as follows:
Consumer group is running and stable for around 20 minutes
New consumers (members) start to appear in the group state without any clear reason, these new members only originate from a small number of VMs (not the same VMs each time), and they keep joining
Group state changes to PreparingRebalance
All consumers stop consuming, showing these logs:
"Group coordinator ... is unavailable or invalid, will attempt rediscovery"
The consumer on VMs that generated extra members show these logs:
Offset commit failed on partition X at offset Y: The coordinator is not aware of this member.
Failed to commit stream task X since it got migrated to another thread already. Closing it as zombie before triggering a new rebalance.
Detected task Z that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Will try to rejoin the consumer group.
We kill all consumer processes on all VMs, the group moves to Empty with 0 members, we start the processes and we're back to step 1
Kafka version is 1.1.0, streams version is 2.0.0
We took thread dumps from the misbehaving consumers, and didn't see more consumer threads than configured.
We tried restarting kafka brokers, cleaning zookeeper cache.
We suspect that the issue has to do with missing heartbeats, but the default heartbeat is 3 seconds and message handling times are no where near that.
Anyone encountered a similar behaviour?