Kafka on Kubernetes - Clients are unable to retrieve metadata

I have a Kafka cluster and ZooKeeper both running on Kubernetes. As outlined in this answer, I have configured the internal broker port as well as the advertised external port for the clients:
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
inter.broker.listener.name=PLAINTEXT
listeners=PLAINTEXT://:29092,PLAINTEXT_HOST://0.0.0.0:9093
advertised.listeners=PLAINTEXT://:29092,PLAINTEXT_HOST://{EXTERNAL-IP-ADDRESS}:9093
zookeeper.connect=zk-cs.analytics.svc:2181
I expect the inter-broker communication to happen on 29092. External clients should be able to connect on port 9093.
I have one external IP for the entire Kubernetes service, which means that this is the only external IP that should be exposed from the Kafka brokers. As far as I understand, the Kubernetes load balancer will route any request to this IP to one of my brokers.
I have validated that my Kafka brokers registered correctly with ZooKeeper:
get /brokers/ids/0
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT","PLAINTEXT_HOST":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafka-0.kafka-hs.analytics.svc.cluster.local:29092","PLAINTEXT_HOST://{EXTERNAL-IP-ADDRESS}"],"jmx_port":-1,"host":"kafka-0.kafka-hs.analytics.svc.cluster.local","timestamp":"1525689391350","port":29092,"version":4}
cZxid = 0x90000029f
ctime = Mon May 07 12:36:31 CEST 2018
mZxid = 0x90000029f
mtime = Mon May 07 12:36:31 CEST 2018
pZxid = 0x90000029f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1632acfab520009
dataLength = 344
numChildren = 0
Creating a topic looks fine to me in the logs; the relevant output is below.
Primary:
[2018-05-07 10:41:12,760] DEBUG [TopicChangeListener on Controller 0]: Topic change listener fired for path /brokers/topics with children test-topic (kafka.controller.PartitionStateMachine$TopicChangeListener)
[2018-05-07 10:41:12,767] INFO [TopicChangeListener on Controller 0]: New topics: [Set(test-topic)], deleted topics: [Set()], new partition replica assignment [Map([test-topic,0] -> List(0, 1))] (kafka.controller.PartitionStateMachine$TopicChangeListener)
[2018-05-07 10:41:12,768] INFO [Controller 0]: New topic creation callback for [test-topic,0] (kafka.controller.KafkaController)
[2018-05-07 10:41:12,770] INFO [Controller 0]: New partition creation callback for [test-topic,0] (kafka.controller.KafkaController)
[2018-05-07 10:41:12,771] INFO [Partition state machine on Controller 0]: Invoking state change to NewPartition for partitions [test-topic,0] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,772] TRACE Controller 0 epoch 12 changed partition [test-topic,0] state from NonExistentPartition to NewPartition with assigned replicas 0,1 (state.change.logger)
[2018-05-07 10:41:12,774] INFO [Replica state machine on controller 0]: Invoking state change to NewReplica for replicas [Topic=test-topic,Partition=0,Replica=0],[Topic=test-topic,Partition=0,Replica=1] (kafka.controller.ReplicaStateMachine)
[2018-05-07 10:41:12,778] TRACE Controller 0 epoch 12 changed state of replica 0 for partition [test-topic,0] from NonExistentReplica to NewReplica (state.change.logger)
[2018-05-07 10:41:12,779] TRACE Controller 0 epoch 12 changed state of replica 1 for partition [test-topic,0] from NonExistentReplica to NewReplica (state.change.logger)
[2018-05-07 10:41:12,779] INFO [Partition state machine on Controller 0]: Invoking state change to OnlinePartition for partitions [test-topic,0] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,780] DEBUG [Partition state machine on Controller 0]: Live assigned replicas for partition [test-topic,0] are: [List(0, 1)] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,782] DEBUG [Partition state machine on Controller 0]: Initializing leader and isr for partition [test-topic,0] to (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,805] TRACE Controller 0 epoch 12 changed partition [test-topic,0] from NewPartition to OnlinePartition with leader 0 (state.change.logger)
[2018-05-07 10:41:12,806] TRACE Controller 0 epoch 12 sending become-follower LeaderAndIsr request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to broker 1 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,809] TRACE Controller 0 epoch 12 sending become-leader LeaderAndIsr request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to broker 0 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,810] TRACE Controller 0 epoch 12 sending UpdateMetadata request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to brokers Set(0, 1, 2, 3, 4) for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,811] INFO [Replica state machine on controller 0]: Invoking state change to OnlineReplica for replicas [Topic=test-topic,Partition=0,Replica=0],[Topic=test-topic,Partition=0,Replica=1] (kafka.controller.ReplicaStateMachine)
[2018-05-07 10:41:12,812] TRACE Controller 0 epoch 12 changed state of replica 0 for partition [test-topic,0] from NewReplica to OnlineReplica (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Controller 0 epoch 12 changed state of replica 1 for partition [test-topic,0] from NewReplica to OnlineReplica (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Broker 0 received LeaderAndIsr request PartitionState(controllerEpoch=12, leader=0, leaderEpoch=0, isr=[0, 1], zkVersion=0, replicas=[0, 1]) correlation id 5 from controller 0 epoch 12 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Broker 0 received LeaderAndIsr request PartitionState(controllerEpoch=12, leader=0, leaderEpoch=0, isr=[0, 1], zkVersion=0, replicas=[0, 1]) correlation id 4 from controller 0 epoch 12 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,816] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 2 (state.change.logger)
[2018-05-07 10:41:12,817] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-2.kafka-hs.analytics.svc.cluster.local:29092 (id: 2 rack: null) (state.change.logger)
[2018-05-07 10:41:12,823] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-3.kafka-hs.analytics.svc.cluster.local:29092 (id: 3 rack: null) (state.change.logger)
[2018-05-07 10:41:12,823] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-4.kafka-hs.analytics.svc.cluster.local:29092 (id: 4 rack: null) (state.change.logger)
[2018-05-07 10:41:12,827] TRACE Broker 0 handling LeaderAndIsr request correlationId 4 from controller 0 epoch 12 starting the become-leader transition for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,828] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions test-topic-0 (kafka.server.ReplicaFetcherManager)
[2018-05-07 10:41:12,852] INFO Completed load of log test-topic-0 with 1 log segments and log end offset 0 in 17 ms (kafka.log.Log)
[2018-05-07 10:41:12,853] INFO Created log for partition [test-topic,0] in /tmp/kafka-logs with properties {compression.type -> producer, message.format.version -> 0.10.2-IV0, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2018-05-07 10:41:12,853] INFO Partition [test-topic,0] on broker 0: No checkpointed highwatermark is found for partition test-topic-0 (kafka.cluster.Partition)
[2018-05-07 10:41:12,861] TRACE Broker 0 stopped fetchers as part of become-leader request from controller 0 epoch 12 with correlation id 4 for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,861] TRACE Broker 0 completed LeaderAndIsr request correlationId 4 from controller 0 epoch 12 for the become-leader transition for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,864] WARN Broker 0 ignoring LeaderAndIsr request from controller 0 with correlation id 5 epoch 12 for partition [test-topic,0] since its associated leader epoch 0 is not higher than the current leader epoch 0 (state.change.logger)
[2018-05-07 10:41:12,865] TRACE Controller 0 epoch 12 received response {error_code=0,partitions=[{topic=test-topic,partition=0,error_code=11}]} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:12,865] TRACE Controller 0 epoch 12 received response {error_code=0,partitions=[{topic=test-topic,partition=0,error_code=0}]} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 6 (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 5 (state.change.logger)
[2018-05-07 10:41:12,868] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:26,213] INFO Partition [test-topic,0] on broker 0: Shrinking ISR for partition [test-topic,0] from 0,1 to 0 (kafka.cluster.Partition)
[2018-05-07 10:41:28,721] DEBUG [IsrChangeNotificationListener on Controller 0]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:28,735] DEBUG [IsrChangeNotificationListener on Controller 0]: Sending MetadataRequest to Brokers:ArrayBuffer(0, 1, 2, 3, 4) for TopicAndPartitions:Set([test-topic,0], [__consumer_offsets,30], [__consumer_offsets,6]) (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:28,735] INFO Leader not yet assigned for partition [__consumer_offsets,30]. Skip sending UpdateMetadataRequest. (kafka.controller.ControllerBrokerRequestBatch)
[2018-05-07 10:41:28,735] INFO Leader not yet assigned for partition [__consumer_offsets,6]. Skip sending UpdateMetadataRequest. (kafka.controller.ControllerBrokerRequestBatch)
[2018-05-07 10:41:28,735] TRACE Controller 0 epoch 12 sending UpdateMetadata request (Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12) to brokers Set(0, 1, 2, 3, 4) for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 6 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 3 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 7 (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-2.kafka-hs.analytics.svc.cluster.local:29092 (id: 2 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-3.kafka-hs.analytics.svc.cluster.local:29092 (id: 3 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:28,741] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-4.kafka-hs.analytics.svc.cluster.local:29092 (id: 4 rack: null) (state.change.logger)
[2018-05-07 10:41:28,746] DEBUG [IsrChangeNotificationListener on Controller 0]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:36,297] TRACE [Controller 0]: checking need to trigger partition rebalance (kafka.controller.KafkaController)
[2018-05-07 10:41:36,298] DEBUG [Controller 0]: preferred replicas by broker Map(0 -> Map([test-topic,0] -> List(0, 1))) (kafka.controller.KafkaController)
[2018-05-07 10:41:36,302] DEBUG [Controller 0]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-05-07 10:41:36,303] TRACE [Controller 0]: leader imbalance ratio for broker 0 is 0.000000 (kafka.controller.KafkaController)
Replica #1:
[2018-05-07 10:41:12,822] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 3 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 4 (state.change.logger)
Replica #2:
[2018-05-07 10:41:12,823] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 1 (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 2 (state.change.logger)
However, whenever I connect with the console producer to the cluster, I get the following error:
.\kafka-console-producer.bat --broker-list {EXTERNAL-IP-ADDRESS}:9093 --topic test-topic --property parse.key=true --property key.separator=:
>testKey:23487239847237894asduhzdfhzusfhhsdf
[2018-05-07 12:42:58,395] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 2 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,512] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 3 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,641] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 4 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,765] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 5 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,886] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 6 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
Is it a problem that a Kubernetes Service usually exposes one external IP address and all of my Kafka brokers are advertising this IP? Are there solutions to this?

I expect the inter-broker communication to happen on 29092.
Yes, the brokers use 29092 for internal communication.
External clients should be able to connect on port 9093. I have one external IP for the entire Kubernetes service, which means that this is the only external IP that should be exposed from the Kafka brokers. As far as I understand, the Kubernetes load balancer will route any request to this IP to one of my brokers.
Yes, Kubernetes routes all traffic sent to that Service's IP to one of your brokers, and that is exactly the problem.
Internally, you use a headless Service to discover the addresses of your Kafka brokers, so they are reachable by DNS names of the form kafka-[_NUM_OF_THE_REPLICA_]._SERVICE_NAME_, and that works without any problems.
For access from outside the cluster, you need to expose each replica on its own address or port. But you have only one Service, which load-balances requests across all brokers.
To fix it, create a separate Service for each replica and use their external addresses as the EXTERNAL-IP-ADDRESS values in your configuration.
Here is an example from an issue in the GitHub repo that provides a Kafka cluster configuration for Kubernetes:
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-es-0
spec:
  ports:
  - port: 9092
    name: kafka-port
    protocol: TCP
  selector:
    pod-name: kafka-0
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-es-1
spec:
  ports:
  - port: 9092
    name: kafka-port
    protocol: TCP
  selector:
    pod-name: kafka-1
  type: LoadBalancer
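With per-replica Services like these, each broker can then advertise the external address of its own Service on the external listener, so the metadata returned to clients points at an address that actually reaches the partition leader. A rough sketch of what the per-broker configuration could look like (the IP addresses below are placeholders for whatever external IPs Kubernetes assigns to kafka-es-0 and kafka-es-1, and the Service port has to match the external listener port, 9093 in your configuration versus 9092 in the example above):
# broker 0 (pod kafka-0), assuming kafka-es-0 received external IP 203.0.113.10
advertised.listeners=PLAINTEXT://kafka-0.kafka-hs.analytics.svc.cluster.local:29092,PLAINTEXT_HOST://203.0.113.10:9093
# broker 1 (pod kafka-1), assuming kafka-es-1 received external IP 203.0.113.11
advertised.listeners=PLAINTEXT://kafka-1.kafka-hs.analytics.svc.cluster.local:29092,PLAINTEXT_HOST://203.0.113.11:9093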

Related

mysql table record not being consumed by Kafka

I just started learning Kafka and I am running Kafka 2.13-2.8.0 on Windows Server 2012 R2. I started ZooKeeper using the following:
zookeeper-server-start.bat ../../config/zookeeper.properties
I started kafka using the following:
kafka-server-start.bat ../../config/server.properties
I started a connector with the following:
connect-standalone.bat ../../config/connect-standalone.properties ../../config/mysql.properties
The content of my mysql.properties file is as follows:
name=test-source-mysql-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://127.0.0.1:3306/DBName?user=username&password=userpassword
mode=incrementing
incrementing.column.name=id
topic.prefix=test-mysql-jdbc-
I started a consumer with and without a partition option:
kafka-console-consumer.bat -topic test-mysql-jdbc-groups -bootstrap-server localhost:9092 -from-beginning [-partition 0]
Everything seemingly started without issues, but when I add a record to my MySQL table called groups, I do not see it in my consumer. I checked all the various logs. The only error messages I saw were in state-change.log, and they looked like the following:
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-2 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-1 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-0 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-0 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-1 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-2 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
I also noticed this message in ZooKeeper:
INFO Expiring session timeout of exceeded (org.apache.zookeeper.server.ZooKeeperServer)
Please could anyone give me pointers as to what I could be doing wrong? Thanks

Apache storm kafka spout only reading from half of a topic's partitions

A problem developed on our production Storm cluster that we cannot figure out or work around.
At some point it appears that the Kafka spout stopped reading from half of the topic's partitions. There are 40 partitions, and it is only reading from 20 of them. We cannot find any changes that we made to either the Storm cluster or Kafka at the time this started happening.
We changed the consumer group id and set the spout config startOffsetTime to OffsetRequest.LatestTime to try to get it reading fresh data from all partitions. It still only connects to the same 20 partitions. We've looked at the node /<topic>/<consumer_group> inside the Storm zookeeper and see only 20 partitions there.
We have verified that messages are being published to all 40 partitions.
The Kafka version is 0.9.0.1 and the Storm version is 1.1.0.
Any tips on how to debug or where to look would be greatly appreciated. Did I mention that this is happening in production? Did I mention it started a week ago, and we just noticed this morning? :(
Additional info: we found some errors in the Kafka state change log (partition 9 is one of the affected partitions and the timestamp in the log looks to be about the time that the problem started)
kafka.common.NoReplicaOnlineException: No replica for partition [transcription-results,9] is alive. Live brokers are: [Set()], Assigned replicas are: [List(1, 4, 0)]
[2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica (state.change.logger)
[2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 0 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request correlationId 158 from controller 0 epoch 44 starting the become-follower transition for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
//... removed some TRACE statements here
[2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from controller 1 with correlation id 1 epoch 47 for partition [transcription-results,9] since its associated leader epoch 441 is old. Current leader epoch is 441 (state.change.logger)
[2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
[2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 679 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)
Possibly caused by this bug: https://issues.apache.org/jira/browse/KAFKA-3963
But how can we recover from it?
I'd start by looking in Kafka's Zookeeper under /brokers/topics to verify that all partitions are listed. That's where storm-kafka reads the partitions from.
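For example, using the zookeeper-shell.sh script that ships with Kafka, you could inspect the topic and partition znodes directly (the host, chroot, and topic name here are assumptions to adjust to your setup):
bin/zookeeper-shell.sh zk-host:2181 get /brokers/topics/transcription-results
bin/zookeeper-shell.sh zk-host:2181 ls /brokers/topics/transcription-results/partitions
If Kafka's ZooKeeper shows all 40 partitions but the spout's /<topic>/<consumer_group> node in the Storm ZooKeeper only shows 20, the missing state is on the Storm side rather than in Kafka's metadata.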

Kafka console producer cannot connect to the broker

Connecting to a Kafka broker with the console producer using the following command:
KAFKA_HEAP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Dsun.security.krb5.debug=true" \
bin/kafka-console-producer.sh \
--broker-list server-01.eigenroute.com:9092 \
--topic test-topic \
--producer.config config/sasl-producer.properties
fails with this warning:
>test message
[2018-01-06 15:29:10,724] WARN [Producer clientId=console-producer] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-06 15:29:10,816] WARN [Producer clientId=console-producer] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
My Kafka broker seems to be functioning without problems:
KAFKA_HEAP_OPTS="-Djava.security.auth.login.config=/home/kafka/kafka_2.11-1.0.0/config/jaas.conf -Dsun.security.krb5.debug=true -Djava.security.krb5.conf=/etc/krb5.conf -Xmx256M -Xms128M" bin/kafka-server-start.sh config/server-sasl-brokers-zookeeper.properties
[2018-01-06 19:59:27,853] INFO KafkaConfig values:
advertised.host.name = null
advertised.listeners = SASL_PLAINTEXT://server-01.eigenroute.com:9092
...
zookeeper.connect = zookeeper-server-01.eigenroute.com:2181,zookeeper-server-02.eigenroute.com:2181,zookeeper-server-03.eigenroute.com:2181/apps/kafka-cluster-demo
...
[2018-01-06 19:59:29,173] INFO zookeeper state changed (SaslAuthenticated) (org.I0Itec.zkclient.ZkClient)
[2018-01-06 19:59:29,207] INFO Created zookeeper path /apps/kafka-cluster-demo (kafka.server.KafkaServer)
...
[2018-01-06 19:59:30,174] INFO zookeeper state changed (SaslAuthenticated) (org.I0Itec.zkclient.ZkClient)
[2018-01-06 19:59:30,389] INFO Cluster ID = TldZ-s6DQtWxpjl045dPlg (kafka.server.KafkaServer)
[2018-01-06 19:59:30,457] INFO [ThrottledRequestReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
...
[2018-01-06 19:59:33,035] INFO Successfully authenticated client: authenticationID=kafka-broker-1-1/server-01.eigenroute.com@EIGENROUTE.COM; authorizationID=kafka-broker-1-1/server-01.eigenroute.com@EIGENROUTE.COM. (org.apache.kafka.common.security.authenticator.SaslServerCallbackHandler)
[2018-01-06 19:59:33,082] INFO [ReplicaFetcherManager on broker 11] Removed fetcher for partitions test-topic-0 (kafka.server.ReplicaFetcherManager)
[2018-01-06 19:59:33,381] INFO Replica loaded for partition test-topic-0 with initial high watermark 0 (kafka.cluster.Replica)
[2018-01-06 19:59:33,385] INFO [Partition test-topic-0 broker=11] test-topic-0 starts at Leader Epoch 1 from offset 0. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
[2018-01-06 19:59:33,424] INFO [ReplicaFetcherManager on broker 11] Removed fetcher for partitions test-topic-0 (kafka.server.ReplicaFetcherManager)
[2018-01-06 19:59:33,424] INFO [Partition test-topic-0 broker=11] test-topic-0 starts at Leader Epoch 2 from offset 0. Previous Leader Epoch was: 1 (kafka.cluster.Partition)
[2018-01-06 20:09:31,261] INFO [GroupMetadataManager brokerId=11] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-01-06 20:19:31,261] INFO [GroupMetadataManager brokerId=11] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-01-06 20:29:31,261] INFO [GroupMetadataManager brokerId=11] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
Here is my producer config (config/sasl-producer.properties):
bootstrap.servers=server-01.eigenroute.com:9092
compression.type=none
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useKeyTab=true \
storeKey=true \
keyTab="/Users/shafiquejamal/allfiles/kerberos/producer1.whatever.keytab" \
principal="producer1/whatever#EIGENROUTE.COM";
Here is my broker config (config/server-sasl-brokers-zookeeper.properties):
broker.id=11
listeners=SASL_PLAINTEXT://server-01.eigenroute.com:9092
advertised.listeners=SASL_PLAINTEXT://server-01.eigenroute.com:9092
# host.name=server-01.eigenroute.com
security.inter.broker.protocol=SASL_PLAINTEXT
# sasl.kerberos.service.name=kafka-broker-1-1/server-01.eigenroute.com
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/var/log/kafka
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper-server-01.eigenroute.com:2181,zookeeper-server-02.eigenroute.com:2181,zookeeper-server-03.eigenroute.com:2181/apps/kafka-cluster-demo
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
Note that I am using SASL authentication between the Kafka broker and ZooKeeper, and between the Kafka broker and Kafka clients (in this case, just one producer). Here are the contents of my Kafka broker jaas.conf file:
KafkaServer {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/home/kafka/kafka_2.11-1.0.0/config/kafka-broker-1-1.server-01.eigenroute.com.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka-broker-1-1"
principal="kafka-broker-1-1/server-01.eigenroute.com#EIGENROUTE.COM";
};
// This is for the broker acting as a client to ZooKeeper
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/home/kafka/kafka_2.11-1.0.0/config/kafka-broker-1-1.server-01.eigenroute.com.keytab"
storeKey=true
useTicketCache=false
serviceName="zookeeper"
principal="kafka-broker-1-1/server-01.eigenroute.com#EIGENROUTE.COM";
};
In my /etc/hosts file, I have the following entry:
127.0.0.1 server-01.eigenroute.com
Any suggestions on why the producer client cannot connect to the Kafka broker? Thanks!
UPDATE: Here is the content of the ZooKeeper znode /apps/kafka-cluster-demo/brokers/ids/11:
[zk: zookeeper-server-02.eigenroute.com:2181(CONNECTED) 27] get /apps/kafka-cluster-demo/brokers/ids/11
{"listener_security_protocol_map":{"SASL_PLAINTEXT":"SASL_PLAINTEXT"},"endpoints":["SASL_PLAINTEXT://server-01.eigenroute.com:9092"],"jmx_port":-1,"host":null,"timestamp":"1515275931134","port":-1,"version":4}
cZxid = 0x2c0000023c
ctime = Sat Jan 06 21:58:51 UTC 2018
mZxid = 0x2c0000023c
mtime = Sat Jan 06 21:58:51 UTC 2018
pZxid = 0x2c0000023c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1001d6237f1001c
dataLength = 209
numChildren = 0
There are two problems in my configuration above. The first is that, for the producer properties, in config/sasl-producer.properties, the line
sasl.kerberos.service.name=kafka
should instead be
sasl.kerberos.service.name=kafka-broker-1-1
This is because the service name in the client must match the service name in the broker. After fixing this, a second problem arose:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after...
The following post had the answer for solving this:
ERROR Error when sending message to topic
For the Kafka broker, in config/server-sasl-brokers-zookeeper.properties I had to change
listeners=SASL_PLAINTEXT://server-01.eigenroute.com:9092
to
listeners=SASL_PLAINTEXT://0.0.0.0:9092
(This might have something to do with using AWS.) Now everything works: the producer can write to the topic and the consumer can read from it.
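In other words, the working configuration separates the bind address from the advertised address; a minimal sketch of the relevant broker lines, using the hostname from the setup above:
listeners=SASL_PLAINTEXT://0.0.0.0:9092
advertised.listeners=SASL_PLAINTEXT://server-01.eigenroute.com:9092
Binding to 0.0.0.0 lets the broker accept connections on every local interface (which matters on AWS, where the public hostname does not map to a local interface), while advertised.listeners is what gets registered in ZooKeeper and handed to clients in metadata responses.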

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition

We are running Kafka (version kafka_2.11-0.10.1.0) in a 2-node cluster.
We have 2 producers (Java API) acting on different topics. Each topic has single partition.
The topic where we had this issue, has one consumer running.
This setup had been running fine for 3 months before we saw this issue. The cases/solutions suggested for this issue in other forums don't seem to apply to my scenario.
Exception at the producer:
2017-11-25T17:40:33,035 [kafka-producer-network-thread | producer-1] ERROR client.producer.BingLogProducerCallback - Encountered exception in sending message:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
We haven't enabled retries for the messages, because this is transactional data and we want to maintain the order.
Producer config:
bootstrap.servers : server1ip:9092
acks :all
retries : 0
linger.ms :0
buffer.memory :10240000
max.request.size :1024000
key.serializer : org.apache.kafka.common.serialization.StringSerializer
value.serializer : org.apache.kafka.common.serialization.StringSerializer
We are connecting to server1 at both producer and consumer.
The controller log at server2 indicates that some shutdown happened at around the same time, but I don't understand why this happened.
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 2 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 1 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,314] INFO [SessionExpirationListener on 2], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: Controller resigning, broker id 2 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Stopped (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [delete-topics-thread-2], Shutdown completed (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [Partition state machine on Controller 2]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-11-25 17:34:18,318] INFO [Replica state machine on controller 2]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-2-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,354] INFO [BrokerChangeListener on Controller 2]: Broker change listener fired for path /brokers/ids with children 1,2 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2017-11-25 17:34:18,355] DEBUG [DeleteTopicsListener on 2]: Delete topics listener fired for topics to be deleted (kafka.controller.PartitionStateMachine$DeleteTopicsListener)
[2017-11-25 17:34:18,362] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/ESQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,368] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/Test1 (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,369] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[2]}} for path /brokers/topics/ImageQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,374] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"8":[1,2],"4":[1,2],"9":[2,1],"5":[2,1],"6":[1,2],"1":[2,1],"0":[1,2],"2":[1,2],"7":[2,1],"3":[2,1]}} for path /brokers/topics/NMS_NotifyQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,375] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/TempBinLogReqQ

Killing node with __consumer_offsets leads to no message consumption at consumers

I have a 3-node (node0, node1, node2) Kafka cluster (broker0, broker1, broker2) with replication factor 2, and ZooKeeper (using the ZooKeeper packaged with the Kafka tar) running on a different node (node 4).
I started broker 0 after starting ZooKeeper, and then the remaining nodes. The broker 0 logs show that it is reading __consumer_offsets, so it seems they are stored on broker 0. Below are sample logs:
Kafka Version: kafka_2.10-0.10.2.0
[2017-06-30 10:50:47,381] INFO [GroupCoordinator 0]: Loading group metadata for console-consumer-85124 with generation 2 (kafka.coordinator.GroupCoordinator)
[2017-06-30 10:50:47,382] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-41 in 23 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-06-30 10:50:47,382] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-44 (kafka.coordinator.GroupMetadataManager)
[2017-06-30 10:50:47,387] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-44 in 5 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-06-30 10:50:47,387] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-47 (kafka.coordinator.GroupMetadataManager)
[2017-06-30 10:50:47,398] INFO [Group Metadata Manager on Broker 0]: Finished loading offsets from __consumer_offsets-47 in 11 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-06-30 10:50:47,398] INFO [Group Metadata Manager on Broker 0]: Loading offsets and group metadata from __consumer_offsets-1 (kafka.coordinator.GroupMetadataManager)
Also, I can see GroupCoordinator messages in the same broker 0 logs.
[2017-06-30 14:35:22,874] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-34472 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:35:22,877] INFO [GroupCoordinator 0]: Group console-consumer-34472 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:35:25,946] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-6612 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:35:25,946] INFO [GroupCoordinator 0]: Group console-consumer-6612 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:35:38,326] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-30165 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:35:38,326] INFO [GroupCoordinator 0]: Group console-consumer-30165 with generation 2 is now empty (kafka.coordinator.GroupCoordinator)
[2017-06-30 14:43:15,656] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 3 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-06-30 14:53:15,653] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
While testing fault tolerance for the cluster using kafka-console-consumer.sh and kafka-console-producer.sh, I see that on killing broker 1 or broker 2, the consumer still receives new messages from the producer, and the rebalance happens correctly.
However, killing broker 0 means no new or old messages are consumed, regardless of the number of consumers.
Below is the state of topic before and after broker 0 is killed.
Before
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 2 Replicas: 2,0 Isr: 0,2
Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2
After
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 2 Replicas: 2,0 Isr: 2
Topic: test-topic Partition: 1 Leader: 1 Replicas: 0,1 Isr: 1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2
Following are the WARN messages seen in the consumer logs after broker 0 is killed
[2017-06-30 14:19:17,155] WARN Auto-commit of offsets {test-topic-2=OffsetAndMetadata{offset=4, metadata=''}, test-topic-0=OffsetAndMetadata{offset=5, metadata=''}, test-topic-1=OffsetAndMetadata{offset=4, metadata=''}} failed for group console-consumer-34472: Offset commit failed with a retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2017-06-30 14:19:10,542] WARN Auto-commit of offsets {test-topic-2=OffsetAndMetadata{offset=4, metadata=''}, test-topic-0=OffsetAndMetadata{offset=5, metadata=''}, test-topic-1=OffsetAndMetadata{offset=4, metadata=''}} failed for group console-consumer-30165: Offset commit failed with a retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
Broker Properties. The remaining default properties are unchanged.
broker.id=0
delete.topic.enable=true
auto.create.topics.enable=false
listeners=PLAINTEXT://XXX:9092
advertised.listeners=PLAINTEXT://XXX:9092
log.dirs=/tmp/kafka-logs-test1
num.partitions=3
zookeeper.connect=XXX:2181
Producer properties. The remaining default properties are unchanged.
bootstrap.servers=XXX,XXX,XXX
compression.type=snappy
Consumer properties. The remaining default properties are unchanged.
zookeeper.connect=XXX:2181
zookeeper.connection.timeout.ms=6000
group.id=test-consumer-group
As far as I understand, if the node holding __consumer_offsets and acting as the GroupCoordinator dies, then the consumers are unable to resume normal operations in spite of new leaders being elected for the partitions.
I saw something similar described in another post, which suggests restarting the dead broker node. However, in a production environment there would be a delay in message consumption until broker 0 is restarted, in spite of having more nodes available.
Q1: How can the above situation be mitigated ?
Q2: Is there a way to change the GroupCoordinator, __consumer_offsets to another node?
Any suggestions/help is appreciated.
Check the replication factor on the __consumer_offsets topic. If it's not 3 then that's your problem.
Run the following command and see whether the first line of the output says "ReplicationFactor:1" or "ReplicationFactor:3":
kafka-topics --zookeeper localhost:2181 --describe --topic __consumer_offsets
It's a common problem when doing trials: you first set up one node, and this topic gets created with a replication factor of 1. Later, when you expand to 3 nodes, you forget to change the topic-level settings on this existing topic, so even though the topics you are producing to and consuming from are fault tolerant, the offsets topic is still stuck on broker 0 only.
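If it does say ReplicationFactor:1, one possible way to fix it (a sketch only; __consumer_offsets has 50 partitions by default, so the reassignment JSON needs an entry for every partition, and the broker ids here assume brokers 0, 1 and 2) is to raise the replication factor with the partition reassignment tool:
cat > increase-offsets-rf.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"__consumer_offsets","partition":0,"replicas":[0,1,2]},
  {"topic":"__consumer_offsets","partition":1,"replicas":[1,2,0]},
  {"topic":"__consumer_offsets","partition":2,"replicas":[2,0,1]}
]}
EOF
kafka-reassign-partitions --zookeeper localhost:2181 --reassignment-json-file increase-offsets-rf.json --execute
After the reassignment completes, killing broker 0 should no longer take the group coordinator's data offline, because the offsets partitions will have replicas on the other brokers.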