Apache Kafka in KRaft mode fails frequently

We have created a 3-node Kafka 3.3.1 cluster in KRaft mode, based on the bitnami-kafka image. The basic configuration for all nodes is shown below (the port numbers differ per node, with other changes as required):
KAFKA_ENABLE_KRAFT: 'yes'
KAFKA_KRAFT_CLUSTER_ID: xxyyddjjjddkk1234
KAFKA_CFG_PROCESS_ROLES: broker,controller
KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
KAFKA_CFG_LISTENERS: CONTROLLER://:9093,INSIDE://:9092,EXTERNAL://:9094
KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,INSIDE:PLAINTEXT,EXTERNAL:PLAINTEXT
KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@dpkafka01:9093,2@dpkafka02:9093,3@dpkafka03:9093
KAFKA_CFG_ADVERTISED_LISTENERS: INSIDE://dpkafka02:9092,EXTERNAL://_{HOSTIP}:9098
KAFKA_BROKER_ID: 2
KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
KAFKA_HEAP_OPTS: "-Xmx1G -Xms256m"
KAFKA_LOG_DIRS: /bitnami/kafka/kafka-logs
KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'false'
KAFKA_LOG_RETENTION_MS: 7200000
KAFKA_LOG_SEGMENT_MS: 86400000
KAFKA_LOG_DELETE_RETENTION_MS: 7200000
KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS: 60000
KAFKA_LOG_CLEANUP_POLICY: "compact,delete"
KAFKA_CFG_GROUP_INITIAL_REBALANCE_DELAY_MS: 12000
KAFKA_CFG_NUM_RECOVERY_THREADS_PER_DATA_DIR: 4
KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
KAFKA_CFG_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
KAFKA_CFG_TRANSACTION_STATE_LOG_MIN_ISR: 2
ALLOW_PLAINTEXT_LISTENER: 'yes'
BITNAMI_DEBUG: 'true'
KAFKA_OPTS: -javaagent:/opt/bitnami/kafka/libs/jmx_prometheus_javaagent.jar=7072:/opt/bitnami/kafka/libs/prom-jmx-agent-config.yml
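For reference, the Bitnami image maps KAFKA_CFG_* variables to server.properties keys by lowercasing them and replacing underscores with dots, so the controller-related settings above should end up roughly as follows (note the id@host:port format expected by the quorum voters property; this is a sketch, not the actual generated file):
process.roles=broker,controller
controller.listener.names=CONTROLLER
controller.quorum.voters=1@dpkafka01:9093,2@dpkafka02:9093,3@dpkafka03:9093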
The cluster works for a while, but then one or two of the nodes shut down very frequently. The logs are not very helpful for identifying the root cause. Some relevant log entries we see before the state changes to shutdown are:
[2022-12-04 08:35:16,928] INFO [RaftManager nodeId=2] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Cancelled in-flight FETCH request with correlation id 73082 due to node 3 being disconnected (elapsed time since creation: 2471ms, elapsed time since send: 2471ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,508] INFO [RaftManager nodeId=2] Completed transition to CandidateState(localId=2, epoch=31047, retries=1, electionTimeoutMs=1697) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,508] INFO [Controller 2] In the new epoch 31047, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,802] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31048, voters=[1, 2, 3], electionTimeoutMs=0) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,802] INFO [Controller 2] In the new epoch 31048, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,815] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 3 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,815] INFO [BrokerLifecycleManager id=2] Unable to send a heartbeat because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:27,830] INFO [RaftManager nodeId=2] Completed transition to Voted(epoch=31048, votedId=1, voters=[1, 2, 3], electionTimeoutMs=1014) (org.apache.kafka.raft.QuorumState)
.....
[2022-12-04 08:35:32,210] INFO [Broker id=2] Stopped fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,211] INFO [Broker id=2] Started fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Cancelled in-flight FETCH request with correlation id 675913 due to node 1 being disconnected (elapsed time since creation: 4394ms, elapsed time since send: 4394ms, request timeout: 30000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1961820001, epoch=181722) to node 1: (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Client was shutdown before response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:108)
at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:78)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:309)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:124)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:123)
at scala.Option.foreach(Option.scala:407)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:123)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:106)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,245] INFO [GroupCoordinator 2]: Resigned as the group coordinator for partition 13 in epoch Some(3200) (kafka.coordinator.group.GroupCoordinator)
....
[2022-12-04 08:35:48,229] INFO [Controller 2] Unfenced broker: 2 (org.apache.kafka.controller.ClusterControlManager)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31055, voters=[1, 2, 3], electionTimeoutMs=1607) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Vote request VoteRequestData(clusterId='<redacted>', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=31055, candidateId=3, lastOffsetEpoch=31052, lastOffset=6552512)])]) with epoch 31055 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:48,254] WARN [Controller 2] Renouncing the leadership due to a metadata log event. We were the leader at epoch 31052, but in the new epoch 31055, the leader is (none). Reverting to last committed offset 6552511. (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 8243762 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] alterPartition: failed with NotControllerException in 8005283 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7743806 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7243753 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151815 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151616 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6743693 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6243134 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5742969 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5242852 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4742694 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4242529 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3742380 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3242258 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2741822 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2241677 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1741549 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1241369 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 741246 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 244485 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 241049 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 196629 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 27063 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,255] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:48,255] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessExitingFaultHandler)
java.lang.NullPointerException
at org.apache.kafka.timeline.SnapshottableHashTable$HashTier.mergeFrom(SnapshottableHashTable.java:125)
at org.apache.kafka.timeline.Snapshot.mergeFrom(Snapshot.java:68)
at org.apache.kafka.timeline.SnapshotRegistry.deleteSnapshot(SnapshotRegistry.java:236)
at org.apache.kafka.timeline.SnapshotRegistry$SnapshotIterator.remove(SnapshotRegistry.java:67)
at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:214)
at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1232)
at org.apache.kafka.controller.QuorumController.access$3300(QuorumController.java:150)
at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleLeaderChange$3(QuorumController.java:1076)
at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$appendRaftEvent$4(QuorumController.java:1101)
at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:496)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] shutting down (kafka.server.BrokerServer)
[2022-12-04 08:35:48,261] INFO [BrokerLifecycleManager id=2] Beginning controlled shutdown. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:48,277] INFO [RaftManager nodeId=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=31055, leaderId=3, voters=[1, 2, 3], highWatermark=Optional[LogOffsetMetadata(offset=6552512, metadata=Optional[(segmentBaseOffset=6497886,relativePositionInSegment=3821894)])], fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,355] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Recorded new controller, from now on will use broker dpkafka03:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)
I would appreciate guidance from anyone experienced with KRaft-mode Kafka clusters on debugging this issue. A separate problem is that the container does not exit after the error, which causes the services to fail; if it exited, our orchestration layer would restart it. (That is a different issue, related to how we use the Bitnami images.)
I also haven't found many production examples that use KRaft mode. Are we missing some configuration, or do we need to change any default configuration values, such as the request timeout, in KRaft mode?
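If we do end up tuning, the Raft quorum timeouts that correspond to the 2000 ms request/fetch timeouts visible in the logs would presumably be set in the same KAFKA_CFG_ style as the rest of our configuration; the values below are illustrative guesses, not recommendations:
KAFKA_CFG_CONTROLLER_QUORUM_REQUEST_TIMEOUT_MS: 5000
KAFKA_CFG_CONTROLLER_QUORUM_FETCH_TIMEOUT_MS: 5000
KAFKA_CFG_CONTROLLER_QUORUM_ELECTION_TIMEOUT_MS: 2000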

I just consulted my workmate about that NullPointerException and got this patch from him (https://github.com/alexkuoecity).
diff --git a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
index 299f65a6f7..e87ce22264 100644
--- a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
+++ b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
@@ -105,6 +105,7 @@ class SnapshottableHashTable<T extends SnapshottableHashTable.ElementWithStartEp
         HashTier(int size) {
             this.size = size;
+            this.deltaTable = new BaseHashTable<T>(size);
         }
     @SuppressWarnings("unchecked")
I applied this to the 3.3 branch of Kafka and it seems to work, but I still have no idea about the root cause, so use it at your own risk.
$ git clone https://github.com/apache/kafka.git
$ cd kafka
$ git checkout 3.3
$ patch -p1 < file.patch
$ ./gradlew releaseTarGz
Then copy the resulting Kafka directory into your Docker image and run it.
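A hypothetical Dockerfile fragment for that last step (the extracted directory name, the ownership, and the assumption that the releaseTarGz output lands under core/build/distributions come from my setup; adjust to yours):
FROM bitnami/kafka:3.3.1
# Overwrite the bundled Kafka with the patched build extracted from the releaseTarGz tarball
COPY --chown=1001:root kafka_2.13-3.3.1/ /opt/bitnami/kafka/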

This happens to me as well. When I deploy, docker compose starts fine, but after I reboot the VM entirely, I get a different error:
kafka_1 | [2022-12-06 15:23:04,721] ERROR [Controller 1] writeNoOpRecord: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)
kafka_1 | [2022-12-06 15:23:04,721] ERROR [Controller 1] maybeBalancePartitionLeaders: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)
It is probably related to VM performance, since the machine is slower during the boot stage.
So I run docker compose restart and it works again. I finally fixed it by adding restart: always; on the second attempt it starts fine.
I have not seen any problems so far, but I have not tested under high load.
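In compose terms, the fix is just the restart policy on the service; the service and image names below are placeholders for my setup:
services:
  kafka:
    image: bitnami/kafka:3.3.1
    restart: always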

Related

Apache Kafka: Trouble deleting and re-creating topic with a dot in the name

I have run into an issue in Kafka where, if I create a topic with a dot in the name, delete it, and then create it again, topic creation fails. I am using kafka_2.13-3.3.1 with KRaft on a 5-node cluster.
I originally ran into this problem while setting up MirrorMaker 2. It creates topics with dots in their names; I nuked the MM2 topics to redo MM2, and now MM2 can't recreate its topics.
Anyway, here is a simple CLI example:
# bin/kafka-topics.sh --create --topic a.test.topic --bootstrap-server kfk-01:9092
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic a.test.topic.
# bin/kafka-topics.sh --delete --topic a.test.topic --bootstrap-server kfk-01:9092
# bin/kafka-topics.sh --create --topic a.test.topic --bootstrap-server kfk-01:9092
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Error while executing topic command : The server experienced an unexpected error when processing the request.
[2023-02-06 19:35:59,110] ERROR org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
I don't think this is a timing issue; if I perform this exercise with a topic without dots in the name, it always succeeds.
I am getting some messages in the server logs on the local node. It alternates between this one:
[2023-02-06 20:01:31,740] WARN [Controller 1] createTopics: failed with unknown server exception NoSuchElementException at epoch 10188 in 40 us. Renouncing leadership and reverting to the last committed offset 2760857. (org.apache.kafka.controller.QuorumController)
java.util.NoSuchElementException
at org.apache.kafka.timeline.SnapshottableHashTable$CurrentIterator.next(SnapshottableHashTable.java:167)
at org.apache.kafka.timeline.SnapshottableHashTable$CurrentIterator.next(SnapshottableHashTable.java:139)
at org.apache.kafka.timeline.TimelineHashSet$ValueIterator.next(TimelineHashSet.java:120)
at org.apache.kafka.controller.ReplicationControlManager.validateNewTopicNames(ReplicationControlManager.java:799)
at org.apache.kafka.controller.ReplicationControlManager.createTopics(ReplicationControlManager.java:567)
at org.apache.kafka.controller.QuorumController.lambda$createTopics$7(QuorumController.java:1832)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:767)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-02-06 20:01:31,740] INFO [RaftManager nodeId=1] Received user request to resign from the current epoch 10188 (org.apache.kafka.raft.KafkaRaftClient)
[2023-02-06 20:01:31,740] INFO [RaftManager nodeId=1] Completed transition to ResignedState(localId=1, epoch=10188, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1140, unackedVoters=[2, 3, 4, 5], preferredSuccessors=[2, 3, 4, 5]) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,750] INFO [RaftManager nodeId=1] Completed transition to Unattached(epoch=10189, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1909) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,750] INFO [Controller 1] In the new epoch 10189, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:31,754] INFO [RaftManager nodeId=1] Completed transition to Voted(epoch=10189, votedId=2, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1929) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,754] INFO [RaftManager nodeId=1] Vote request VoteRequestData(clusterId='ZmJlNWVjMDI5OWFlNDVhYw', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=10189, candidateId=2, lastOffsetEpoch=10188, lastOffset=2760858)])]) wit
[2023-02-06 20:01:31,763] INFO [RaftManager nodeId=1] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=10189, leaderId=2, voters=[1, 2, 3, 4, 5], highWatermark=Optional.empty, fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,763] INFO [Controller 1] In the new epoch 10189, the leader is 2. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:31,823] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,468] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,512] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,595] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,595] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat] Client requested disconnect from node 1 (org.apache.kafka.clients.NetworkClient)
[2023-02-06 20:01:33,595] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker kfk-02:9091 (id: 2 rack: null) (kafka.server.BrokerToControllerRequestThread)
[2023-02-06 20:01:33,671] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
and this one:
[2023-02-06 20:02:15,898] ERROR [Controller 1] createTopics: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:16,034] INFO [RaftManager nodeId=1] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2023-02-06 20:02:16,039] INFO [RaftManager nodeId=1] Completed transition to CandidateState(localId=1, epoch=10190, retries=1, electionTimeoutMs=1411) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:02:16,040] INFO [Controller 1] In the new epoch 10190, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:16,051] INFO [RaftManager nodeId=1] Completed transition to Leader(localId=1, epoch=10190, epochStartOffset=2760947, highWatermark=Optional.empty, voterStates={1=ReplicaState(nodeId=1, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=
[2023-02-06 20:02:16,067] INFO [Controller 1] Becoming the active controller at epoch 10190, committed offset 2760946, committed epoch 10189 (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:17,742] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2023-02-06 20:02:17,743] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker kfk-01:9091 (id: 1 rack: null) (kafka.server.BrokerToControllerRequestThread)
What am I doing wrong? Is this a bug?
Use Kafka 3.3.2, where this is fixed:
https://issues.apache.org/jira/browse/KAFKA-14337
Or heed the warning and use underscores (or hyphens) instead, since metric names will always use dot notation.
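For example, the same exercise with a dot-free (hyphenated) name should succeed, matching the observation above that topics without dots in the name always work:
# bin/kafka-topics.sh --create --topic a-test-topic --bootstrap-server kfk-01:9092
# bin/kafka-topics.sh --delete --topic a-test-topic --bootstrap-server kfk-01:9092
# bin/kafka-topics.sh --create --topic a-test-topic --bootstrap-server kfk-01:9092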

org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader

Introduction:
Previously, I saw a similar question (this link), but mine is different as we use Kafka KRaft instead of Kafka with Zookeeper.
Specification:
Kafka version: 3.3.1
Number of brokers: 8
Minimum replication factor of topics: 3
Problem Description:
At the time of writing, I have experienced this issue numerous times. The Kafka log is shown below:
[2023-01-09 09:53:03,929] WARN [Controller 3] maybeFenceReplicas: failed with unknown server exception NotLeaderException at epoch 2641 in 1913 us. Renouncing leadership and reverting to the last committed offset 9986340. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 415741179 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 206629449 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206629220 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206626538 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 205746648 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 7549 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6986 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6399 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 5912 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,932] ERROR [Controller 3] Unexpected exception while executing deferred write event maybeFenceReplicas. Rescheduling for a minute from now. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.common.errors.UnknownServerException: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
Caused by: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
And
ERROR [Controller 3] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
As this is a production node, we constantly monitor it using Prometheus and Grafana. The timestamps indicate that this broker had trouble at 2023-01-09 09:53. According to the monitoring, the other 7 brokers should have kept working properly and data loss should not occur, but the monitoring results differ from what we expected.
The issue happened again at 11:31.
Observations:
In this case, I assume that there is no data loss based on the monitoring screenshots and the topic messages.
Is this correct? How can we prevent this issue from recurring?
When the NotLeaderException is thrown, the Kafka Producer should retry the write until successful. This exception normally occurs if the leader broker fails and a new leader has not yet been elected.
It's quite hard to detect whether or not there was data loss from the monitoring graph. It looks like during the times the leader failed, messages dropped because the producers will have been receiving the NotLeaderException until the new leader was elected. Once the new leader was elected, the producers were able to continue as normal.
This does not necessarily mean there was no data loss though. It's the producers responsibility to ensure no data loss.
For example if acks=0 and a message was sent to the topic but was not received successfully before the leader failed, that message would not exist in the topic, however, the producer would have assumed a successful write and move on to the next message.
To ensure message availability and durability, the following configurations should be set:
acks=all (producer): for a write to be considered successful, it must be acknowledged by all replicas in the ISR.
min.insync.replicas >= 2 (topic/broker): at least 2 replicas must be in sync before the write is considered successful.
Depending on which vendor you use, some of the above configurations are set by default.
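As a sketch of where these settings live: acks is a producer property, while min.insync.replicas is applied at the topic (or broker) level; the bootstrap server and topic name below are placeholders:
# producer.properties
acks=all
# topic-level override via the admin CLI
bin/kafka-configs.sh --bootstrap-server broker1:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config min.insync.replicas=2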
Hope this helps!

kafka broker Connection to INTERNAL_BROKER_DNS failed

I have 3 Kafka brokers in MSK. Two of them give the error below.
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file LOG_DIR/__amazon_msk_connect_offsets_debezium-kafka-connector-x/leader-epoch-checkpoint
Caused by: java.io.IOException: No space left on device
The other broker gives the error below.
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Member connect-1-97b56132-1f54-46a4-91f1-8d31e61e18a9 in group __amazon_msk_connect_cluster_debezium-kafka-connector-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Stabilized group __amazon_msk_connect_cluster_debezium-kafka-connector-x generation 1115 (__consumer_offsets-18) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Assignment received from leader for group __amazon_msk_connect_cluster_debezium-kafka-connector-x for generation 1115 (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Preparing to rebalance group __amazon_msk_connect_cluster_debezium-kafka-connector-x in state PreparingRebalance with old generation 1115 (__consumer_offsets-18) (reason: error when storing group assignment during SyncGroup (member: connect-1-9c20e001-f852-4007-8614-78a5c27207f6)) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,876] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={test-0=PartitionData(fetchOffset=149547, logStartOffset=7151, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), test-0=PartitionData(fetchOffset=74123, logStartOffset=826, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=INVALID, epoch=INITIAL), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to INTERNAL_BROKER_DNS (id: 3 rack: null) failed.
I want to know why I'm getting the Error in response for fetch request error above. What does it mean?
No space left on device: you need a larger EBS volume, or you need to remove log segments from it manually, assuming you can SSH to the broker. If not, contact MSK support.
Given that some brokers are failing due to storage, the other brokers will simply be unable to connect to them.
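If the brokers are still reachable with the admin client, one stopgap (a general Kafka approach, not MSK-specific advice) is to temporarily lower retention on the largest topics so old segments are deleted and space is freed; the broker address and topic name below are placeholders:
bin/kafka-configs.sh --bootstrap-server BROKER_DNS:9092 --alter \
  --entity-type topics --entity-name big-topic \
  --add-config retention.ms=3600000
# revert once disk usage drops
bin/kafka-configs.sh --bootstrap-server BROKER_DNS:9092 --alter \
  --entity-type topics --entity-name big-topic \
  --delete-config retention.ms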

Confluent Control Center failure: Unable to fetch consumer offsets for cluster id

I am running Confluent Platform (version 6.1.1). I deploy the following components: 3 brokers, 3 ZooKeeper nodes, Schema Registry, 3 Kafka Connect workers, KSQL, and Confluent Control Center (CCC).
The CCC has entered a failed state, and I am having difficulty bringing it back.
To make things cleaner, I created another EC2 instance (m4.2xlarge) on which I configured a new CCC, with the aim of connecting it to the current cluster. The new CCC has exactly the same configuration as the failed one, but with a different confluent.controlcenter.id.
I start the CCC and it runs. I can access the CCC UI, but it is not working properly: the pages take too long to load, and it keeps flip-flopping on the state of the Connect cluster and of the brokers (sometimes healthy, sometimes not).
For example it looks like this (screenshots omitted):
After running for some time, it is automatically restarted, and it keeps restarting every 5-7 minutes.
When it starts, I see a bunch of new topics created in the Kafka cluster.
After that, in control-center.log, I see:
INFO [main] Setting offsets for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] Setting offsets for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] action=starting topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] waiting for streams to be in running state REBALANCING (io.confluent.command.CommandStore)
INFO [main] Streams state RUNNING (io.confluent.command.CommandStore)
INFO [main] action=started topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=completed operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] action=started topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Health Check (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Alert Manager (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Consumer Offsets Fetch (io.confluent.controlcenter.ControlCenter)
INFO [control-center-heartbeat-0] current clusterId=lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] broker id set has changed new={1001=[10.251.xx.xx:9093 (id: 1001 rack: null)], 1002=[10.251.xx.xx:9093 (id: 1002 rack: null)], 1003=[10.251.xx.xx:9093 (id: 1003 rack: null)]} removed={} (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] new controller=10.251.xx.xx:9093 (id: 1002 rack: null) (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [main] Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
INFO [main] Adding listener: http://0.0.0.0:9021 (io.confluent.rest.ApplicationServer)
INFO [main] x509=X509@3a8ead9(ip-44-135-xx-xx.eu-central-1.compute.internal,h=[ip-44-135-xx-xx.eu-central-1.compute.internal],w=[]) for Server@7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] x509=X509@3831f4c2(caroot,h=[eu-central-1.compute.internal],w=[]) for Server@7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] jetty-9.4.38.v20210224; built: 2021-02-24T20:25:07.675Z; git: 288f3cc74549e8a913bf363250b0744f2695b8e6; jvm 11.0.13+8-LTS (org.eclipse.jetty.server.Server)
INFO [main] DefaultSessionIdManager workerName=node0 (org.eclipse.jetty.server.session)
INFO [main] No SessionScavenger set, using defaults (org.eclipse.jetty.server.session)
INFO [main] node0 Scavenging every 660000ms (org.eclipse.jetty.server.session)
INFO [main] Started o.e.j.s.ServletContextHandler@1ef5cde4{/,[jar:file:/usr/share/java/acl/acl-6.1.1.jar!/io/confluent/controlcenter/rest/static],AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started o.e.j.s.ServletContextHandler@5401c6a8{/ws,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started NetworkTrafficServerConnector@5d6b5d3d{HTTP/1.1, (http/1.1)}{0.0.0.0:9021} (org.eclipse.jetty.server.AbstractConnector)
INFO [main] Started @36578ms (org.eclipse.jetty.server.Server)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.count type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.rate type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.timestamp type=monitoring cluster= value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.min type=monitoring cluster= value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.count type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.rate type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.timestamp type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.min type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=segment.bytes value=1073741824 expected=134217728 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=delete.retention.ms value=86400000 expected=259200000 (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] misconfigured topic=_confluent-metrics config=min.insync.replicas value=1 expected=2 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_GG_347 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled describeConsumerGroups request with correlation id 168 due to node 1001 being disconnected
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=connect-mongo-dci-grid-partner-test11 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_UPWARD_GG_355 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_eim_c3_non_prod-4 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
...
and so on...
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
and so on...
In the control-center-kafka.log I see:
INFO [control-center-heartbeat-1] Kafka version: 6.1.1-ce (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka commitId: 73deb3aeb1f8647c (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka startTimeMs: 1654853610852 (org.apache.kafka.common.utils.AppInfoParser)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-message-rekey-store-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-trigger-event-rekey-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-MonitoringStream-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-aggregatedTopicPartitionTableWindows-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.1:9093 (id: 1001 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
and so on ...
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1001: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1478925475, epoch=1) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-6-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1947312909, epoch=1) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
and so on ...
Any ideas what can be wrong here?

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition

We are running Kafka (version kafka_2.11-0.10.1.0) in a 2-node cluster.
We have 2 producers (Java API) writing to different topics. Each topic has a single partition.
The topic where we had this issue has one consumer running.
This setup had been running fine for 3 months when we saw this issue. None of the suggested causes/solutions for this issue in other forums seem to apply to my scenario.
Exception at the producer:
2017-11-25T17:40:33,035 [kafka-producer-network-thread | producer-1] ERROR client.producer.BingLogProducerCallback - Encountered exception in sending message ; org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
We haven't enabled retries for the messages, because this is transactional data and we want to maintain the order.
Producer config:
bootstrap.servers: server1ip:9092
acks: all
retries: 0
linger.ms: 0
buffer.memory: 10240000
max.request.size: 1024000
key.serializer: org.apache.kafka.common.serialization.StringSerializer
value.serializer: org.apache.kafka.common.serialization.StringSerializer
We are connecting to server1 at both producer and consumer.
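For context, the retry-versus-ordering trade-off we are weighing would look roughly like this in producer properties (a sketch, not our current config): with retries enabled, per-partition ordering is only preserved when at most one request is in flight per connection.
retries: 3
max.in.flight.requests.per.connection: 1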
The controller log at server2 indicates that some kind of shutdown happened around the same time, but I don't understand why this happened.
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 2 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 1 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,314] INFO [SessionExpirationListener on 2], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: Controller resigning, broker id 2 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Stopped (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [delete-topics-thread-2], Shutdown completed (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [Partition state machine on Controller 2]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-11-25 17:34:18,318] INFO [Replica state machine on controller 2]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-2-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,354] INFO [BrokerChangeListener on Controller 2]: Broker change listener fired for path /brokers/ids with children 1,2 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2017-11-25 17:34:18,355] DEBUG [DeleteTopicsListener on 2]: Delete topics listener fired for topics to be deleted (kafka.controller.PartitionStateMachine$DeleteTopicsListener)
[2017-11-25 17:34:18,362] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/ESQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,368] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/Test1 (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,369] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[2]}} for path /brokers/topics/ImageQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,374] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"8":[1,2],"4":[1,2],"9":[2,1],"5":[2,1],"6":[1,2],"1":[2,1],"0":[1,2],"2":[1,2],"7":[2,1],"3":[2,1]}} for path /brokers/topics/NMS_NotifyQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,375] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/TempBinLogReqQ