Will Kafka group rebalance causes less and less consumers alive? - apache-kafka

If Kafka consumers consume messages in sync mode, but their dependent downstream service is broken(eg. cannot connect to DB). Kafka coordinator keeps triggering group rebalance, will all the consumers be dead in the end?

As far as consumer is responding for heart beat intervals, Kafka does not do the re-balancing. If suppose you consumer failed to respond for heart beat to zookeeper then Kafka assume the consumer is dead and perform re balancing. it is based on couple of properties here
heartbeat.interval.ms
The expected time between heartbeats to the consumer coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the consumer's session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances.
session.timeout.ms
The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms
zookeeper.session.timeout.ms
ZooKeeper session timeout. If the consumer fails to heartbeat to ZooKeeper for this period of time it is considered dead and a rebalance will occur.

Related

kafka consumers in consumer group not resuming messages after restart

Hope you are having good day.
I have an issue with kafka consumers on kubernetes. I am running 3 replicas inside a consumer group
I have a topic with 3 partitions and 3 brokers with offsets replication factor set to 3. My offset in consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: When by any means if a consumer replica inside the consumer group "abc-consumer-group" restarts OR if a broker(leader) restarts, it is not resuming from the point where it stopped. It states that I am up to date and no messages I have to process.
Any suggestions please where to look at?
Tried increasing rebalance, heartbeat, session timeout on broker level, no luck.
And yes whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka. I do see it happening but still not consumers are not resuming messages. It states nothing to process.

Weired Kafka Consumer Group Re-balancing

My kafka topic has two partitions and single kafka consumer. I have deployed my consumer application(spring kafka) in the AWS. In logs I see kafka consumer re-balance in time to time. This is not frequent. As per the current observation when consumer is listening to the topic and idle this re-balancing occurs. Appreciate if someone can explain me this behavior. I have posted some logs here.
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Request joining group due to: group is already rebalancing
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Revoke previously assigned partitions order-response-v3-qa-0, order-response-v3-qa-1
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] Revoke previously assigned partitions order-response-v3-qa-0, order-response-v3-qa-1
b2b-group: partitions revoked: [order-response-v3-qa-0, order-response-v3-qa-1
[Consumer clientId=consumer-b2b-group-1, groupId=b2b-group] (Re-)joining group
Re-balancing is a feature that automatically optimizes uneven workloads as well as topology changes (e.g., adding or removing brokers). This is achieved via a background process that continuously checks a variety of metrics to determine if and when a to rebalance should occur.
you can go through the below link for further knowledge:
https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2
Kafka starts a rebalancing if a consumer joins or leaves a group. Below are various reasons why that can or will happen.
A consumer joins a group:
Application Start/Restart — If we deploy an application (or restart it), a new consumer joins the group
Application scale-up — We are creating new pods/application
A consumer leaves a group:
max.poll.interval.ms exceeded — polled records not processed in time
session.timeout.ms exceeded — no heartbeats sent, likely because of an application crash or a network error
Consumer shuts down
Pod relocation — Kubernetes relocates pods sometimes, e.g. if nodes are removed via kubectl drain or the cluster is scaled down. The consumer shuts down (leaves the group) and is restarted again on another node (joins the group).
Application scale-down
If you would like to understand more in depth. Here is one of the amazing article I have read
https://medium.com/bakdata/solving-my-weird-kafka-rebalancing-problems-c05e99535435

Kafka Streams Apps Threads fail transaction and are fenced and restarted after Kafka broker restart

We are noticing Streams Apps threads fail transactions during rolling restarts of our Kafka Brokers. The transaction failure causes stream thread fencing which in turn causes a restart of the thread and re-balancing. The re-balancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and prevent processing delays as much as possible.
For our rolling Broker restarts we use the controlled.shutdown=true configuration, and before each restart we wait for all partitions to be in-sync across all replicas.
For our Streams Apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the streams apps themselves are smooth and without re-balances.
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
Broker starts shutting down
App logs error producing to topic due to NOT_LEADER_OR_FOLLOWER
App heartbeats failing because coordinator is restarting broker
App discovers new group coordinator (this bounces a a bit between the restarting broker and live brokers)
App stabilizes
Broker starting up again
App fails to do fetch request to starting broker due to FETCH_SESSION_ID_NOT_FOUND
App discovers starting broker as transaction coordinator
App transaction fails due to one of two reasons:
InvalidProducerEpochException: Producer attempted to produce with an old epoch.
ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one
Stream threads end up in fatal error state, get fenced and restarted which causes a rebalance.
What could be causing the two exceptions that cause stream thread transactions to fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction states with the in-sync brokers. This could explain old epochs or different transactional ids to be known by that broker.
How can we further identify what is going wrong here and how it can be improved?
you can set request.timeout.ms in kafka streams which will make stream API wait for a longer period of time. if kafka broker is not up in a given period of time then only it will throw an exception which can be handled by using ProductionExceptionHandler as described in Handling exceptions in Kafka streams

What happens when ISR are down but message was written in leader

When there's a failure writing a message to all ISR replicas, but the message is already persisted on the leader, what happens?
Even if the request fails, is the data still is available for consumers?
Are consumers able to read "uncommitted" data?
The message is not visible to the KafkaConsumer until the topic configuration min.insync.replicas is fulfilled.
Kafka is expecting the failed replica broker to get up and running again, so the replication can complete.
Note, that this scenario is only relevant if your KafkaProducer configuration acks is set to all. When you set it to 0 or 1, the Consumer will be able to consume the message as soon as the data arrives at the partition leader.
In general, the client (here KafkaProducer) only communicates with the partition leader and depending on its mode such as synchronous, asynchronous or fire-and-forget and its acks configuration waits or does not wait for a reply. The replication of the data itself is independent of the producer and is handled by a (single-thread) on the partition leader broker. The leader will only notify on the success or failure on the replication and the leader/replicas continue to ensure to satisfy the replication set with the topic configuration replication.factor.
The producer may still try to resend the message in case the message was only acknowledged by producer but not by the replicas, depending on its acks setting and if the exception is retriable or not. In that case you will end up having duplicate values. You can avoid this by enabling idempotence.

Apache Kafka Cluster Consumer Fail Over

I had a set up of 3 Node Zookeeper and 3 Broker Cluster when one of my brokers goes down in Cluster, the producer is not giving any Error but, consumers will throw an error saying that...
Marking coordinator Dead for the group... Discovered coordinator for
the group.
According to my knowledge if any one Broker available across the cluster I should not be stopped consuming messages.
But, as of now Server.1, server.2, server.3 if my server.2 goes down my all consumers stops consuming messages.
What are the exact parameters to set to achieve failover of producers and as well as consumers?
if my server.2 goes down my all consumers stops consuming messages.
For starters, you disable unclear leader election in the brokers, and create your topics with --replication-factor=3 and a configuration of min.insync.replicas=2.
To ensure that a producer has at least two durable writes (as set by the in-sync replcicas), then set acks=all
Then, if any broker fails, and assuming a leader election does not have any error, a producer and consumer should seemlessly re-connect to the new leader TopicPartitions.