High Level Consumer Failure in Kafka - apache-kafka

I have the following Kafka Setup
Number of producer : 1
Number of topics : 1
Number of partitions : 2
Number of consumers : 3 (with same group id)
Number of Kafka cluster : none(single Kafka server)
Zookeeper.session.timeout : 1000
Consumer Type : High Level Consumer
Producer produces messages without any specific partitioning logic(default partitioning logic).
Consumer 1 consumes messages continuously. When I abruptly kill consumer 1, I would expect consumer 2 or consumer 3 to take over consuming the messages.
In some cases rebalance occurs and consumer 2 starts consuming messages. This is perfectly fine.
But in some cases neither consumer 2 nor consumer 3 consumes anything. I have to manually kill all the consumers and start all three again.
Only after this restart does consumer 1 start consuming again.
In short, the rebalance succeeds in some cases and fails in others.
Is there any configuration that I am missing?

Kafka uses Zookeeper to coordinate high level consumers.
From http://kafka.apache.org/documentation.html :
Partition Owner registry
Each broker partition is consumed by a single consumer within a given
consumer group. The consumer must establish its ownership of a given
partition before any consumption can begin. To establish its
ownership, a consumer writes its own id in an ephemeral node under the
particular broker partition it is claiming.
/consumers/[group_id]/owners/[topic]/[broker_id-partition_id] -->
consumer_node_id (ephemeral node)
There is a known quirk of ephemeral nodes: they can linger for up to 30 seconds after a ZK client goes down abruptly:
http://developers.blog.box.com/2012/04/10/a-gotcha-when-using-zookeeper-ephemeral-nodes/
So you may be running into this if you expect consumers 2 and 3 to start reading messages immediately after consumer 1 is terminated.
You can also check that /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] contains correct data after rebalancing.
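To illustrate why a lingering ephemeral node can block the takeover, here is a minimal sketch in Python. It is a toy in-memory model, not a real ZooKeeper client: the class `FakeZooKeeper`, its methods, the group and topic names, and the 30-second timeout are all assumptions made up for the illustration.

```python
class FakeZooKeeper:
    """Toy stand-in for ZooKeeper ephemeral nodes (hypothetical, in-memory only)."""

    def __init__(self, session_timeout_s):
        self.session_timeout_s = session_timeout_s
        self.nodes = {}      # path -> session id that owns the ephemeral node
        self.last_seen = {}  # session id -> time of the last heartbeat

    def heartbeat(self, session, now):
        self.last_seen[session] = now

    def expire_sessions(self, now):
        # A session that has been silent longer than the timeout is dead,
        # and every ephemeral node it created disappears with it.
        dead = {s for s, t in self.last_seen.items()
                if now - t > self.session_timeout_s}
        self.nodes = {p: s for p, s in self.nodes.items() if s not in dead}
        for s in dead:
            del self.last_seen[s]

    def create_ephemeral(self, path, session, now):
        """Try to claim a partition; fails while the old owner node still exists."""
        self.heartbeat(session, now)
        self.expire_sessions(now)
        if path in self.nodes:
            return False
        self.nodes[path] = session
        return True


zk = FakeZooKeeper(session_timeout_s=30)
owner_path = "/consumers/my-group/owners/my-topic/0-0"

zk.create_ephemeral(owner_path, "consumer-1", now=0)
# consumer-1 is killed right after t=0 and sends no more heartbeats.
claimed_early = zk.create_ephemeral(owner_path, "consumer-2", now=5)
claimed_late = zk.create_ephemeral(owner_path, "consumer-2", now=31)
```

In this model the claim at t=5 fails because the dead consumer's node is still present, while the claim at t=31 succeeds. If a real consumer gives up retrying before the old node expires, the partition would stay unowned, which matches the symptom described in the question.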

Related

kafka consumers in consumer group not resuming messages after restart

Hope you are having a good day.
I have an issue with Kafka consumers on Kubernetes. I am running 3 replicas inside a consumer group.
I have a topic with 3 partitions and 3 brokers with offsets replication factor set to 3. My offset in consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: whenever a consumer replica inside the consumer group "abc-consumer-group" restarts, or a broker (leader) restarts, the consumer does not resume from the point where it stopped. It reports that it is up to date and has no messages to process.
Any suggestions please where to look at?
I tried increasing the rebalance, heartbeat, and session timeouts at the broker level, with no luck.
And yes, whenever a consumer is added to or removed from the consumer group, Kafka takes care of rebalancing. I do see it happening, but the consumers still do not resume consuming messages. They report that there is nothing to process.

Kafka consumer is not reading from only one partition out of 4

I was using Kafka 0.9 and recently migrated to Kafka 1.0, but the client I am using is still 0.9. Irrespective of this, I have been facing a problem where our consumers intermittently stop consuming from one or two of the partitions.
I have 5 consumers reading from 24 partitions; these are consumer threads created by an application deployed on a single server. Frequently one of the consumer threads will stop reading from one of the partitions it is assigned.
E.g.: one consumer thread is reading from partitions 1, 2, 3, and 4. It stops reading from partition 1 and lag builds up on it. I have to restart the consumer for it to start picking up messages from that partition again.
I want to understand the issue here.
My consumer configuration
session.timeout.ms=150000
request.timeout.ms=300000
max.partition.fetch.bytes=153600

Apache Kafka Topic Partition Message Handling

I'm a bit confused on the Topic partitioning in Apache Kafka. So I'm charting down a simple use case and I would like to know what happens in different scenarios. So here it is:
I have a Topic T that has 4 partitions TP1, TP2, TP3 and TP4.
Assume that I have 8 messages M1 to M8. Now when my producer sends these messages to the topic T, how will they be received by the Kafka broker under the following scenarios:
Scenario 1: There is only one Kafka broker instance that has Topic T with the aforementioned partitions.
Scenario 2: There are two Kafka broker instances, with each node having the same Topic T with the aforementioned partitions.
Now assuming that kafka broker instance 1 goes down, how will the consumers react? I'm assuming that my consumer was reading from broker instance 1.
I'll answer your questions by walking you through partition replication, because you need to learn about replication to understand the answer.
A single broker is considered the "leader" for a given partition. All produces and consumes occur with the leader. Replicas of the partition are replicated to a configurable amount of other brokers. The leader handles replicating a produce to the other replicas. Other replicas that are caught up to the leader are called "in-sync replicas." You can configure what "caught up" means.
A message is only made available to consumers when it has been committed to all in-sync replicas.
If the leader for a given partition fails, the Kafka controller (the broker responsible for cluster-wide coordination) will elect a new leader from the list of in-sync replicas and consumers will begin consuming from this new leader. Consumers will have a few milliseconds of added latency while the new leader is elected. A new controller will also be elected automatically if the controller fails (this adds more latency, too).
If the topic is configured with no replicas, then when the leader of a given partition fails, consumers can't consume from that partition until the broker that was the leader is brought back online. Or, if it is never brought back online, the data previously produced to that partition will be lost forever.
To answer your question directly:
Scenario 1: if replication is configured for the topic, and there exists an in-sync replica for each partition, a new leader will be elected, and consumers will only experience a few milliseconds of latency because of the failure.
Scenario 2: now that you understand replication, I believe you'll see that this scenario is Scenario 1 with a replication factor of 2.
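The failover rule described above can be sketched in a few lines of Python. This is a simplified model, not Kafka's actual election code: the function `elect_leader` and its arguments are hypothetical, and it assumes the new leader is the first surviving in-sync replica in assignment order.

```python
def elect_leader(replicas, isr, failed_broker):
    """Pick a new leader for one partition after a broker failure.

    replicas: broker ids assigned to the partition, in assignment order
    isr: set of broker ids currently in sync with the leader
    Returns the new leader id, or None if the partition goes offline.
    """
    survivors = [b for b in replicas if b in isr and b != failed_broker]
    return survivors[0] if survivors else None


# Replication factor 2: broker 2 is in sync, so it takes over from broker 1.
with_replica = elect_leader(replicas=[1, 2], isr={1, 2}, failed_broker=1)

# No replicas: nobody can take over until broker 1 comes back.
without_replica = elect_leader(replicas=[1], isr={1}, failed_broker=1)

# Replica exists but had fallen out of sync: it is not eligible here.
out_of_sync = elect_leader(replicas=[1, 2], isr={1}, failed_broker=1)
```

Here `with_replica` is 2, while the other two cases yield `None`, meaning the partition is unavailable to consumers until the failed broker returns.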
You may also be interested to learn about acks in the producer.
In the producer, you can configure acks such that the produce is acknowledged when:
the message is put on the producer's socket buffer (acks=0)
the message is written to the log of the lead broker (acks=1)
the message is written to the log of the lead broker, and replicated to all other in-sync replicas (acks=all)
Further, you can configure the minimum number of in-sync replicas required to commit a produce. Then, if not enough in-sync replicas exist given this configuration, the produce will fail. You can build your producer to handle this failure in different ways: buffer, retry, do nothing, block, etc.
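As a rough sketch of how the acks setting and the minimum in-sync replica count interact, here is a small Python model. It does not talk to a broker; the function `produce_result`, its parameters, and the returned strings are made up for illustration, and it assumes the minimum only applies when acks=all.

```python
def produce_result(acks, isr_size, min_insync_replicas=1):
    """Model when a produce request is acknowledged for a given acks setting.

    acks: "0", "1" or "all"
    isr_size: number of in-sync replicas, counting the leader
    min_insync_replicas: minimum ISR size required when acks is "all"
    """
    if acks == "0":
        return "acked on send, no broker confirmation"
    if acks == "1":
        return "acked after leader write"
    if acks == "all":
        if isr_size < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return "acked after all in-sync replicas have the message"
    raise ValueError("acks must be '0', '1' or 'all'")


healthy = produce_result("all", isr_size=3, min_insync_replicas=2)
degraded = produce_result("all", isr_size=1, min_insync_replicas=2)
leader_only = produce_result("1", isr_size=1, min_insync_replicas=2)
```

The `degraded` case is the failure your producer has to handle (buffer, retry, block, and so on), whereas `leader_only` still succeeds because the minimum is not checked for acks=1 in this model.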

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, zookeeper and Storm.
In our environment we have one Kafka broker connecting to multiple zookeepers. Is there an advantage to having the producer send messages for a specific topic and partition to one broker with multiple zookeepers vs multiple brokers with multiple zookeepers?
Yes there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. So, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.
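To make the difference between one broker and several concrete, here is a simplified Python sketch of how partition replicas might be spread over brokers. The function `assign_replicas` and the plain round-robin placement are assumptions for illustration; Kafka's real assignment differs in detail, so the exact leaders will not match my example above.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin replica placement; the first broker in each list is the leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [brokers[(p + i) % len(brokers)]
                         for i in range(replication_factor)]
    return assignment


# One broker: every partition lives on it, and there is nothing to fail over to.
single = assign_replicas(3, ["A"], replication_factor=1)

# Two brokers, replication factor 2: leadership is split and each partition has a backup.
pair = assign_replicas(3, ["A", "B"], replication_factor=2)
```

With one broker, `single` maps every partition to `["A"]`. With two, `pair` is `{0: ["A", "B"], 1: ["B", "A"], 2: ["A", "B"]}`, so writes are shared between the brokers and each partition survives the loss of either one.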

Partition re-balance on brokers in Kafka 0.8

The relatively scarce documentation for Kafka 0.8 does not mention the expected behaviour for balancing existing topics, partitions and replicas across brokers.
More specifically, what is the expected behaviour when a broker joins and when a broker (leader or not) crashes?
Thanks.
I tested those 2 cases a while ago, and not under heavy load. I had one producer sending 10k messages (just a little string) synchronously to a topic with a replication factor of 2 and 2 partitions, on a cluster of 2 brokers. There were 2 consumers. Each component was deployed on a separate machine. What I observed is:
In normal operation: broker 1 is leader for partition 1 and replica for partition 2. Broker 2 is leader for partition 2 and replica for partition 1. Bringing a broker 3 into the cluster doesn't automatically trigger a rebalance of partitions.
On broker revival (crashed, then rebooted): rebalancing is transparent to the producer and consumers. The rebooting broker replicates the log first and then makes itself available.
On broker crash (leader or not): simulated by a kill -9 on any one broker. The producer and consumers freeze until the killed broker's ephemeral node in ZK expires. After that, operations resume normally.