Apache Kafka: Trouble deleting and re-creating topic with a dot in the name - apache-kafka

I have run into an issue in Kafka where, if I create a topic with a dot in the name, delete it, and then create it again, topic creation fails. I am using kafka_2.13-3.3.1 with KRaft on a 5-node cluster.
I originally ran into this problem while setting up MirrorMaker 2. It creates topics with dots in their names; I nuked the MM2 topics to redo MM2, and now MM2 can't recreate its topics.
Anyway, here is a simple CLI example:
# bin/kafka-topics.sh --create --topic a.test.topic --bootstrap-server kfk-01:9092
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic a.test.topic.
# bin/kafka-topics.sh --delete --topic a.test.topic --bootstrap-server kfk-01:9092
# bin/kafka-topics.sh --create --topic a.test.topic --bootstrap-server kfk-01:9092
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Error while executing topic command : The server experienced an unexpected error when processing the request.
[2023-02-06 19:35:59,110] ERROR org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
I don't think this is a timing issue: if I perform this exercise with a topic without dots in the name, it always succeeds.
I am also getting messages in the server logs on the local node. It goes back and forth between this one:
[2023-02-06 20:01:31,740] WARN [Controller 1] createTopics: failed with unknown server exception NoSuchElementException at epoch 10188 in 40 us. Renouncing leadership and reverting to the last committed offset 2760857. (org.apache.kafka.controller.QuorumController)
java.util.NoSuchElementException
at org.apache.kafka.timeline.SnapshottableHashTable$CurrentIterator.next(SnapshottableHashTable.java:167)
at org.apache.kafka.timeline.SnapshottableHashTable$CurrentIterator.next(SnapshottableHashTable.java:139)
at org.apache.kafka.timeline.TimelineHashSet$ValueIterator.next(TimelineHashSet.java:120)
at org.apache.kafka.controller.ReplicationControlManager.validateNewTopicNames(ReplicationControlManager.java:799)
at org.apache.kafka.controller.ReplicationControlManager.createTopics(ReplicationControlManager.java:567)
at org.apache.kafka.controller.QuorumController.lambda$createTopics$7(QuorumController.java:1832)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:767)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-02-06 20:01:31,740] INFO [RaftManager nodeId=1] Received user request to resign from the current epoch 10188 (org.apache.kafka.raft.KafkaRaftClient)
[2023-02-06 20:01:31,740] INFO [RaftManager nodeId=1] Completed transition to ResignedState(localId=1, epoch=10188, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1140, unackedVoters=[2, 3, 4, 5], preferredSuccessors=[2, 3, 4, 5]) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,750] INFO [RaftManager nodeId=1] Completed transition to Unattached(epoch=10189, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1909) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,750] INFO [Controller 1] In the new epoch 10189, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:31,754] INFO [RaftManager nodeId=1] Completed transition to Voted(epoch=10189, votedId=2, voters=[1, 2, 3, 4, 5], electionTimeoutMs=1929) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,754] INFO [RaftManager nodeId=1] Vote request VoteRequestData(clusterId='ZmJlNWVjMDI5OWFlNDVhYw', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=10189, candidateId=2, lastOffsetEpoch=10188, lastOffset=2760858)])]) wit
[2023-02-06 20:01:31,763] INFO [RaftManager nodeId=1] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=10189, leaderId=2, voters=[1, 2, 3, 4, 5], highWatermark=Optional.empty, fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:01:31,763] INFO [Controller 1] In the new epoch 10189, the leader is 2. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:31,823] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,468] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,512] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,595] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:01:33,595] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat] Client requested disconnect from node 1 (org.apache.kafka.clients.NetworkClient)
[2023-02-06 20:01:33,595] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker kfk-02:9091 (id: 2 rack: null) (kafka.server.BrokerToControllerRequestThread)
[2023-02-06 20:01:33,671] ERROR [Controller 1] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
and this one:
[2023-02-06 20:02:15,898] ERROR [Controller 1] createTopics: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:16,034] INFO [RaftManager nodeId=1] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2023-02-06 20:02:16,039] INFO [RaftManager nodeId=1] Completed transition to CandidateState(localId=1, epoch=10190, retries=1, electionTimeoutMs=1411) (org.apache.kafka.raft.QuorumState)
[2023-02-06 20:02:16,040] INFO [Controller 1] In the new epoch 10190, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:16,051] INFO [RaftManager nodeId=1] Completed transition to Leader(localId=1, epoch=10190, epochStartOffset=2760947, highWatermark=Optional.empty, voterStates={1=ReplicaState(nodeId=1, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=
[2023-02-06 20:02:16,067] INFO [Controller 1] Becoming the active controller at epoch 10190, committed offset 2760946, committed epoch 10189 (org.apache.kafka.controller.QuorumController)
[2023-02-06 20:02:17,742] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2023-02-06 20:02:17,743] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker kfk-01:9091 (id: 1 rack: null) (kafka.server.BrokerToControllerRequestThread)
What am I doing wrong? Is this a bug?

Upgrade to Kafka 3.3.2, where this is fixed:
https://issues.apache.org/jira/browse/KAFKA-14337
Or heed the warning and use underscores (or hyphens) instead of dots: metric names use dots as separators, so dots in topic names get replaced and can collide with underscore-named topics.
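For example, a dot-free variant of the earlier command sidesteps the collision entirely (illustrative only; pick whatever dot-free naming convention suits you):
# bin/kafka-topics.sh --create --topic a_test_topic --bootstrap-server kfk-01:9092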

Related

org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader

Introduction:
Previously, I saw a similar question (this link), but mine is different because we use Kafka in KRaft mode rather than Kafka with ZooKeeper.
Specification:
Kafka version: 3.3.1
Number of brokers: 8
Minimum replication factor of topics: 3
Problem Description:
At the time of writing, I had experienced this issue numerous times. The Kafka log is shown below:
[2023-01-09 09:53:03,929] WARN [Controller 3] maybeFenceReplicas: failed with unknown server exception NotLeaderException at epoch 2641 in 1913 us. Renouncing leadership and reverting to the last committed offset 9986340. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 415741179 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] writeNoOpRecord: failed with NotControllerException in 206629449 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206629220 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 206626538 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 205746648 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 7549 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6986 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 6399 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,931] INFO [Controller 3] maybeFenceReplicas: failed with NotControllerException in 5912 us (org.apache.kafka.controller.QuorumController)
[2023-01-09 09:53:03,932] ERROR [Controller 3] Unexpected exception while executing deferred write event maybeFenceReplicas. Rescheduling for a minute from now. (org.apache.kafka.controller.QuorumController)
org.apache.kafka.common.errors.UnknownServerException: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
Caused by: org.apache.kafka.raft.errors.NotLeaderException: Append failed because the replication is not the current leader
at org.apache.kafka.raft.KafkaRaftClient.lambda$append$27(KafkaRaftClient.java:2262)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.kafka.raft.KafkaRaftClient.append(KafkaRaftClient.java:2261)
at org.apache.kafka.raft.KafkaRaftClient.scheduleAtomicAppend(KafkaRaftClient.java:2257)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:813)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:792)
at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:903)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:791)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
And
ERROR [Controller 3] processBrokerHeartbeat: unable to start processing because of NotControllerException. (org.apache.kafka.controller.QuorumController)
As this is our production cluster, we constantly monitor it using Prometheus and Grafana. The timestamp indicates that this broker had trouble at 2023-01-09 09:53. According to the monitoring, the other 7 brokers should be working properly and data loss shouldn't occur, but the monitoring results are different from what we expected.
The issue happened again at 11:31.
Observations:
In this case, I assume that there is no data loss based on the monitoring screenshots and the topic messages.
Is this correct? How can we prevent this issue from recurring?
When the NotLeaderException is thrown, the Kafka Producer should retry the write until successful. This exception normally occurs if the leader broker fails and a new leader has not yet been elected.
It's quite hard to detect whether or not there was data loss from the monitoring graph. It looks like during the times the leader failed, messages dropped because the producers will have been receiving the NotLeaderException until the new leader was elected. Once the new leader was elected, the producers were able to continue as normal.
This does not necessarily mean there was no data loss, though. It is the producer's responsibility to ensure no data loss.
For example, if acks=0 and a message was sent to the topic but was not received successfully before the leader failed, that message would not exist in the topic; the producer, however, would have assumed a successful write and moved on to the next message.
To ensure message availability and durability, producers should have the following configurations set (see the sketch after this list):
acks=all
For a write to be considered successful, it needs to be acknowledged by all replicas in the ISR
min.insync.replicas >=2
At least 2 replicas must be in sync before the write is considered successful.
Depending on what vendor you use, some of the above configurations are set by default.
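As a rough sketch (broker address and topic name are placeholders, not taken from the original setup), acks is a producer setting, while min.insync.replicas is a broker default or per-topic setting:
# producer configuration (client side)
acks=all
# per-topic override on the broker side, applied to an existing topic:
# bin/kafka-configs.sh --bootstrap-server broker:9092 --alter --entity-type topics --entity-name my-topic --add-config min.insync.replicas=2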
Hope this helps!

Apache Kafka in kraft mode fails frequently

We have created a 3-node kafka-3.3.1 cluster in KRaft mode, based on the bitnami-kafka image. The basic configuration for all nodes is as follows (the port number differs for each node, with other changes as required):
KAFKA_ENABLE_KRAFT: 'yes'
KAFKA_KRAFT_CLUSTER_ID: xxyyddjjjddkk1234
KAFKA_CFG_PROCESS_ROLES: broker,controller
KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
KAFKA_CFG_LISTENERS: CONTROLLER://:9093,INSIDE://:9092,EXTERNAL://:9094
KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,INSIDE:PLAINTEXT,EXTERNAL:PLAINTEXT
KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@dpkafka01:9093,2@dpkafka02:9093,3@dpkafka03:9093
KAFKA_CFG_ADVERTISED_LISTENERS: INSIDE://dpkafka02:9092,EXTERNAL://_{HOSTIP}:9098
KAFKA_BROKER_ID: 2
KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
KAFKA_HEAP_OPTS: "-Xmx1G -Xms256m"
KAFKA_LOG_DIRS: /bitnami/kafka/kafka-logs
KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'false'
KAFKA_LOG_RETENTION_MS: 7200000
KAFKA_LOG_SEGMENT_MS: 86400000
KAFKA_LOG_DELETE_RETENTION_MS: 7200000
KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS: 60000
KAFKA_LOG_CLEANUP_POLICY: "compact,delete"
KAFKA_CFG_GROUP_INITIAL_REBALANCE_DELAY_MS: 12000
KAFKA_CFG_NUM_RECOVERY_THREADS_PER_DATA_DIR: 4
KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
KAFKA_CFG_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
KAFKA_CFG_TRANSACTION_STATE_LOG_MIN_ISR: 2
ALLOW_PLAINTEXT_LISTENER: 'yes'
BITNAMI_DEBUG: 'true'
KAFKA_OPTS: -javaagent:/opt/bitnami/kafka/libs/jmx_prometheus_javaagent.jar=7072:/opt/bitnami/kafka/libs/prom-jmx-agent-config.yml
While the cluster works for a while, one or two of the nodes shut down very frequently. The logs are not very helpful for identifying the root cause. Some relevant log lines we see before the state changes to shutdown are:
[2022-12-04 08:35:16,928] INFO [RaftManager nodeId=2] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Disconnecting from node 3 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:17,414] INFO [RaftManager nodeId=2] Cancelled in-flight FETCH request with correlation id 73082 due to node 3 being disconnected (elapsed time since creation: 2471ms, elapsed time since send: 2471ms, request timeout: 2000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,508] INFO [RaftManager nodeId=2] Completed transition to CandidateState(localId=2, epoch=31047, retries=1, electionTimeoutMs=1697) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,508] INFO [Controller 2] In the new epoch 31047, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,802] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31048, voters=[1, 2, 3], electionTimeoutMs=0) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:27,802] INFO [Controller 2] In the new epoch 31048, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:27,815] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 3 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:27,815] INFO [BrokerLifecycleManager id=2] Unable to send a heartbeat because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:27,830] INFO [RaftManager nodeId=2] Completed transition to Voted(epoch=31048, votedId=1, voters=[1, 2, 3], electionTimeoutMs=1014) (org.apache.kafka.raft.QuorumState)
.....
[2022-12-04 08:35:32,210] INFO [Broker id=2] Stopped fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,211] INFO [Broker id=2] Started fetchers as part of become-follower for 479 partitions (state.change.logger)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,232] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Cancelled in-flight FETCH request with correlation id 675913 due to node 1 being disconnected (elapsed time since creation: 4394ms, elapsed time since send: 4394ms, request timeout: 30000ms) (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:32,233] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1961820001, epoch=181722) to node 1: (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Client was shutdown before response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:108)
at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:78)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:309)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:124)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:123)
at scala.Option.foreach(Option.scala:407)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:123)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:106)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,234] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,237] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2022-12-04 08:35:32,245] INFO [GroupCoordinator 2]: Resigned as the group coordinator for partition 13 in epoch Some(3200) (kafka.coordinator.group.GroupCoordinator)
....
[2022-12-04 08:35:48,229] INFO [Controller 2] Unfenced broker: 2 (org.apache.kafka.controller.ClusterControlManager)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Completed transition to Unattached(epoch=31055, voters=[1, 2, 3], electionTimeoutMs=1607) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,254] INFO [RaftManager nodeId=2] Vote request VoteRequestData(clusterId='<redacted>', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=31055, candidateId=3, lastOffsetEpoch=31052, lastOffset=6552512)])]) with epoch 31055 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2022-12-04 08:35:48,254] WARN [Controller 2] Renouncing the leadership due to a metadata log event. We were the leader at epoch 31052, but in the new epoch 31055, the leader is (none). Reverting to last committed offset 6552511. (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 8243762 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] alterPartition: failed with NotControllerException in 8005283 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7743806 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 7243753 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151815 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 7151616 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6743693 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 6243134 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5742969 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 5242852 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4742694 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 4242529 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3742380 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 3242258 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2741822 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 2241677 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1741549 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 1241369 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 741246 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 244485 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] writeNoOpRecord: failed with NotControllerException in 241049 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] maybeFenceReplicas: failed with NotControllerException in 196629 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,254] INFO [Controller 2] processBrokerHeartbeat: failed with NotControllerException in 27063 us (org.apache.kafka.controller.QuorumController)
[2022-12-04 08:35:48,255] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2022-12-04 08:35:48,255] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessExitingFaultHandler)
java.lang.NullPointerException
at org.apache.kafka.timeline.SnapshottableHashTable$HashTier.mergeFrom(SnapshottableHashTable.java:125)
at org.apache.kafka.timeline.Snapshot.mergeFrom(Snapshot.java:68)
at org.apache.kafka.timeline.SnapshotRegistry.deleteSnapshot(SnapshotRegistry.java:236)
at org.apache.kafka.timeline.SnapshotRegistry$SnapshotIterator.remove(SnapshotRegistry.java:67)
at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:214)
at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1232)
at org.apache.kafka.controller.QuorumController.access$3300(QuorumController.java:150)
at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleLeaderChange$3(QuorumController.java:1076)
at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$appendRaftEvent$4(QuorumController.java:1101)
at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:496)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:829)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2022-12-04 08:35:48,259] INFO [BrokerServer id=2] shutting down (kafka.server.BrokerServer)
[2022-12-04 08:35:48,261] INFO [BrokerLifecycleManager id=2] Beginning controlled shutdown. (kafka.server.BrokerLifecycleManager)
[2022-12-04 08:35:48,277] INFO [RaftManager nodeId=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=31055, leaderId=3, voters=[1, 2, 3], highWatermark=Optional[LogOffsetMetadata(offset=6552512, metadata=Optional[(segmentBaseOffset=6497886,relativePositionInSegment=3821894)])], fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2022-12-04 08:35:48,355] INFO [BrokerToControllerChannelManager broker=2 name=heartbeat]: Recorded new controller, from now on will use broker dpkafka03:9093 (id: 3 rack: null) (kafka.server.BrokerToControllerRequestThread)
We would appreciate guidance from anyone experienced with KRaft-mode Kafka clusters on debugging this issue. Another problem is that the container doesn't exit after the error, which causes the services to fail; if it did exit, the container would be restarted by our orchestration layer. (This is a separate problem related to our use of the Bitnami images.)
I also haven't found many production examples out there that use KRaft mode. Are we missing some configuration, or do we need to change any default values, such as the request timeout, in KRaft mode?
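As an illustrative sketch (not something the poster tried), if the elections correlate with slow I/O or CPU starvation, one knob worth experimenting with is the KRaft quorum fetch/request timeouts, which default to roughly 2 seconds and can be raised through the same Bitnami-style environment variables; this may reduce spurious elections but will not fix the NullPointerException itself:
KAFKA_CFG_CONTROLLER_QUORUM_FETCH_TIMEOUT_MS: 5000
KAFKA_CFG_CONTROLLER_QUORUM_REQUEST_TIMEOUT_MS: 5000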
I just consulted my workmate about that NullPointerException and got this patch from him (https://github.com/alexkuoecity).
diff --git a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
index 299f65a6f7..e87ce22264 100644
--- a/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
+++ b/metadata/src/main/java/org/apache/kafka/timeline/SnapshottableHashTable.java
@@ -105,6 +105,7 @@ class SnapshottableHashTable<T extends SnapshottableHashTable.ElementWithStartEp
         HashTier(int size) {
             this.size = size;
+            this.deltaTable = new BaseHashTable<T>(size);
         }
         @SuppressWarnings("unchecked")
I applied this to the 3.3 branch of Kafka and it seems to work, but I still have no idea about the root cause, so use it at your own risk.
$ git clone https://github.com/apache/kafka.git
$ cd kafka
$ git checkout 3.3
$ patch -p1 < file.patch
$ ./gradlew releaseTarGz
Then copy the whole Kafka directory into your Docker image and run it.
This happens to me as well. When I deploy, docker compose starts fine, but after I reboot the VM entirely, I get another error:
kafka_1 | [2022-12-06 15:23:04,721] ERROR [Controller 1] writeNoOpRecord: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)
kafka_1 | [2022-12-06 15:23:04,721] ERROR [Controller 1] maybeBalancePartitionLeaders: unable to start processing because of TimeoutException. (org.apache.kafka.controller.QuorumController)
This is probably related to VM performance; the machine is slower during the boot stage.
So I run docker compose restart and it works again. I finally fixed it by adding restart: always (see the sketch below); on the second attempt it works.
I have not seen any problems so far, but I have not tested under high load.
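For reference, a minimal compose fragment with that restart policy might look like this (service name and image tag are placeholders):
services:
  kafka:
    image: bitnami/kafka:3.3
    restart: always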

Why must controller be localhost when running KRaft mode

In KRaft mode, the Kafka broker does not start unless the controller listens on localhost. For example, none of the following work on my laptop:
listeners=PLAINTEXT://10.0.0.48:9092,CONTROLLER://10.0.0.48:9093
listeners=PLAINTEXT://192.168.56.1:9092,CONTROLLER://192.168.56.1:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://192.168.56.1:9093
If I replace the controller IP address with localhost in any of the above, kafka-server-start.sh starts successfully.
I get the following logs continuously in the failure scenario:
[2022-10-27 15:06:19,885] WARN [BrokerToControllerChannelManager broker=1 name=heartbeat] Connection to node 1 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2022-10-27 15:06:19,885] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker localhost:9093 (id: 1 rack: null) (kafka.server.BrokerToControllerRequestThread)
[2022-10-27 15:06:19,935] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat]: Recorded new controller, from now on will use broker localhost:9093 (id: 1 rack: null) (kafka.server.BrokerToControllerRequestThread)
[2022-10-27 15:06:19,936] INFO [BrokerToControllerChannelManager broker=1 name=heartbeat] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
Until I get the following error and kafka-server-start.sh exits:
[2022-10-27 15:06:22,804] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
java.util.concurrent.CancellationException
at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2468)
at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:485)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:174)
at java.base/java.lang.Thread.run(Thread.java:832)
It seems like it expects the controller to be localhost. If this is the case, why?
I had to change controller.quorum.voters to match what's in listeners:
controller.quorum.voters=1@192.168.56.1:9093
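In other words, the address in controller.quorum.voters has to agree with the CONTROLLER listener. A minimal single-node KRaft sketch, reusing the addresses from the question (adjust to your host), would be:
# server.properties (combined broker/controller, illustrative)
process.roles=broker,controller
node.id=1
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://192.168.56.1:9092,CONTROLLER://192.168.56.1:9093
advertised.listeners=PLAINTEXT://192.168.56.1:9092
controller.quorum.voters=1@192.168.56.1:9093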

mysql table record not being consumed by Kafka

I just started learning Kafka and I am running kafka_2.13-2.8.0 on Windows Server 2012 R2. I started ZooKeeper using the following:
zookeeper-server-start.bat ../../config/zookeeper.properties
I started Kafka using the following:
kafka-server-start.bat ../../config/server.properties
I started a connector with the following:
connect-standalone.bat ../../config/connect-standalone.properties ../../config/mysql.properties
The content of my mysql.properties file is as follows:
name=test-source-mysql-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://127.0.0.1:3306/DBName?user=username&password=userpassword
mode=incrementing
incrementing.column.name=id
topic.prefix=test-mysql-jdbc-
I started a consumer with and without a partition option:
kafka-console-consumer.bat --topic test-mysql-jdbc-groups --bootstrap-server localhost:9092 --from-beginning [--partition 0]
All seemingly started without issues, but when I add a record to my MySQL table called groups, I do not see it in my consumer. I checked all the various logs. The only error messages I saw were in state-change.log, and they looked like the following:
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-2 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-1 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-0 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-0 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-1 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-2 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
I also noticed this message in ZooKeeper:
INFO Expiring session timeout of exceeded (org.apache.zookeeper.server.ZooKeeperServer)
Please could anyone give me pointers as to what I could be doing wrong? Thanks
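One quick check (not part of the original post) is to confirm that the connector actually created the topic and that its partitions have an online leader:
kafka-topics.bat --bootstrap-server localhost:9092 --list
kafka-topics.bat --bootstrap-server localhost:9092 --describe --topic test-mysql-jdbc-groups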

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition

We are running Kafka (version kafka_2.11-0.10.1.0) in a 2-node cluster.
We have 2 producers (Java API) acting on different topics. Each topic has a single partition.
The topic where we had this issue has one consumer running.
This setup had been running fine for 3 months before we saw this issue. All the suggested causes/solutions for this issue in other forums don't seem to apply to my scenario.
Exception at the producer:
2017-11-25T17:40:33,035 [kafka-producer-network-thread | producer-1] ERROR client.producer.BingLogProducerCallback - Encountered exception in sending message ;
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
We haven't enabled retries for the messages because this is transactional data and we want to maintain ordering (see the note after the producer config below).
Producer config:
bootstrap.servers : server1ip:9092
acks : all
retries : 0
linger.ms : 0
buffer.memory : 10240000
max.request.size : 1024000
key.serializer : org.apache.kafka.common.serialization.StringSerializer
value.serializer : org.apache.kafka.common.serialization.StringSerializer
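A hedged aside (values are illustrative, not from the original setup): ordering can usually be preserved even with retries enabled by capping in-flight requests, which lets the producer ride out a leader election instead of failing the send:
retries=3
max.in.flight.requests.per.connection=1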
We are connecting to server1 at both producer and consumer.
The controller log at server2 indicates that some shutdown happened around the same time, but I don't understand why this happened.
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 2 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] DEBUG [Controller 2]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2017-11-25 17:31:44,776] TRACE [Controller 2]: leader imbalance ratio for broker 1 is 0.000000 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,314] INFO [SessionExpirationListener on 2], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: Controller resigning, broker id 2 (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] DEBUG [Controller 2]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,317] INFO [delete-topics-thread-2], Stopped (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [delete-topics-thread-2], Shutdown completed (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
[2017-11-25 17:34:18,318] INFO [Partition state machine on Controller 2]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-11-25 17:34:18,318] INFO [Replica state machine on controller 2]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,318] INFO [Controller-2-to-broker-2-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-2-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutting down (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Stopped (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller-2-to-broker-1-send-thread], Shutdown completed (kafka.controller.RequestSendThread)
[2017-11-25 17:34:18,319] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,353] DEBUG [IsrChangeNotificationListener] Fired!!! (kafka.controller.IsrChangeNotificationListener)
[2017-11-25 17:34:18,354] INFO [BrokerChangeListener on Controller 2]: Broker change listener fired for path /brokers/ids with children 1,2 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2017-11-25 17:34:18,355] DEBUG [DeleteTopicsListener on 2]: Delete topics listener fired for topics to be deleted (kafka.controller.PartitionStateMachine$DeleteTopicsListener)
[2017-11-25 17:34:18,362] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/ESQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,368] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/Test1 (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,369] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[2]}} for path /brokers/topics/ImageQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,374] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"8":[1,2],"4":[1,2],"9":[2,1],"5":[2,1],"6":[1,2],"1":[2,1],"0":[1,2],"2":[1,2],"7":[2,1],"3":[2,1]}} for path /brokers/topics/NMS_NotifyQ (kafka.controller.PartitionStateMachine$PartitionModificationsListener)
[2017-11-25 17:34:18,375] INFO [AddPartitionsListener on 2]: Partition modification triggered {"version":1,"partitions":{"0":[1]}} for path /brokers/topics/TempBinLogReqQ