I have a Kafka cluster running on Kubernetes along with ZooKeeper on Kubernetes. As outlined in this answer, I have configured the internal broker port as well as the advertised external port for the clients:
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
inter.broker.listener.name=PLAINTEXT
listeners=PLAINTEXT://:29092,PLAINTEXT_HOST://0.0.0.0:9093
advertised.listeners=PLAINTEXT://:29092,PLAINTEXT_HOST://{EXTERNAL-IP-ADDRESS}:9093
zookeeper.connect=zk-cs.analytics.svc:2181
I expect the inter-broker communication to happen on 29092. External clients should be able to connect on port 9093.
I have one external IP for the entire Kubernetes service, which means that this is the only external IP that should be exposed from the Kafka brokers. As far as I understand, the Kubernetes load balancer will route any request to this IP to one of my brokers.
I have validated that my kafka brokers registered correctly to ZooKeeper:
get /brokers/ids/0
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT","PLAINTEXT_HOST":"PLAINTEXT"},"endpoints":["PLAINTEXT://kafka-0.kafka-hs.analytics.svc.cluster.local:29092","PLAINTEXT_HOST://{EXTERNAL-IP-ADDRESS}"],"jmx_port":-1,"host":"kafka-0.kafka-hs.analytics.svc.cluster.local","timestamp":"1525689391350","port":29092,"version":4}
cZxid = 0x90000029f
ctime = Mon May 07 12:36:31 CEST 2018
mZxid = 0x90000029f
mtime = Mon May 07 12:36:31 CEST 2018
pZxid = 0x90000029f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1632acfab520009
dataLength = 344
numChildren = 0
Creating a topic looks good in the logs to me, logs are below.
Primary:
[2018-05-07 10:41:12,760] DEBUG [TopicChangeListener on Controller 0]: Topic change listener fired for path /brokers/topics with children test-topic (kafka.controller.PartitionStateMachine$TopicChangeListener)
[2018-05-07 10:41:12,767] INFO [TopicChangeListener on Controller 0]: New topics: [Set(test-topic)], deleted topics: [Set()], new partition replica assignment [Map([test-topic,0] -> List(0, 1))] (kafka.controller.PartitionStateMachine$TopicChangeListener)
[2018-05-07 10:41:12,768] INFO [Controller 0]: New topic creation callback for [test-topic,0] (kafka.controller.KafkaController)
[2018-05-07 10:41:12,770] INFO [Controller 0]: New partition creation callback for [test-topic,0] (kafka.controller.KafkaController)
[2018-05-07 10:41:12,771] INFO [Partition state machine on Controller 0]: Invoking state change to NewPartition for partitions [test-topic,0] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,772] TRACE Controller 0 epoch 12 changed partition [test-topic,0] state from NonExistentPartition to NewPartition with assigned replicas 0,1 (state.change.logger)
[2018-05-07 10:41:12,774] INFO [Replica state machine on controller 0]: Invoking state change to NewReplica for replicas [Topic=test-topic,Partition=0,Replica=0],[Topic=test-topic,Partition=0,Replica=1] (kafka.controller.ReplicaStateMachine)
[2018-05-07 10:41:12,778] TRACE Controller 0 epoch 12 changed state of replica 0 for partition [test-topic,0] from NonExistentReplica to NewReplica (state.change.logger)
[2018-05-07 10:41:12,779] TRACE Controller 0 epoch 12 changed state of replica 1 for partition [test-topic,0] from NonExistentReplica to NewReplica (state.change.logger)
[2018-05-07 10:41:12,779] INFO [Partition state machine on Controller 0]: Invoking state change to OnlinePartition for partitions [test-topic,0] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,780] DEBUG [Partition state machine on Controller 0]: Live assigned replicas for partition [test-topic,0] are: [List(0, 1)] (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,782] DEBUG [Partition state machine on Controller 0]: Initializing leader and isr for partition [test-topic,0] to (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) (kafka.controller.PartitionStateMachine)
[2018-05-07 10:41:12,805] TRACE Controller 0 epoch 12 changed partition [test-topic,0] from NewPartition to OnlinePartition with leader 0 (state.change.logger)
[2018-05-07 10:41:12,806] TRACE Controller 0 epoch 12 sending become-follower LeaderAndIsr request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to broker 1 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,809] TRACE Controller 0 epoch 12 sending become-leader LeaderAndIsr request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to broker 0 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,810] TRACE Controller 0 epoch 12 sending UpdateMetadata request (Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12) to brokers Set(0, 1, 2, 3, 4) for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,811] INFO [Replica state machine on controller 0]: Invoking state change to OnlineReplica for replicas [Topic=test-topic,Partition=0,Replica=0],[Topic=test-topic,Partition=0,Replica=1] (kafka.controller.ReplicaStateMachine)
[2018-05-07 10:41:12,812] TRACE Controller 0 epoch 12 changed state of replica 0 for partition [test-topic,0] from NewReplica to OnlineReplica (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Controller 0 epoch 12 changed state of replica 1 for partition [test-topic,0] from NewReplica to OnlineReplica (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Broker 0 received LeaderAndIsr request PartitionState(controllerEpoch=12, leader=0, leaderEpoch=0, isr=[0, 1], zkVersion=0, replicas=[0, 1]) correlation id 5 from controller 0 epoch 12 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,813] TRACE Broker 0 received LeaderAndIsr request PartitionState(controllerEpoch=12, leader=0, leaderEpoch=0, isr=[0, 1], zkVersion=0, replicas=[0, 1]) correlation id 4 from controller 0 epoch 12 for partition [test-topic,0] (state.change.logger)
[2018-05-07 10:41:12,816] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 2 (state.change.logger)
[2018-05-07 10:41:12,817] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-2.kafka-hs.analytics.svc.cluster.local:29092 (id: 2 rack: null) (state.change.logger)
[2018-05-07 10:41:12,823] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-3.kafka-hs.analytics.svc.cluster.local:29092 (id: 3 rack: null) (state.change.logger)
[2018-05-07 10:41:12,823] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-4.kafka-hs.analytics.svc.cluster.local:29092 (id: 4 rack: null) (state.change.logger)
[2018-05-07 10:41:12,827] TRACE Broker 0 handling LeaderAndIsr request correlationId 4 from controller 0 epoch 12 starting the become-leader transition for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,828] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions test-topic-0 (kafka.server.ReplicaFetcherManager)
[2018-05-07 10:41:12,852] INFO Completed load of log test-topic-0 with 1 log segments and log end offset 0 in 17 ms (kafka.log.Log)
[2018-05-07 10:41:12,853] INFO Created log for partition [test-topic,0] in /tmp/kafka-logs with properties {compression.type -> producer, message.format.version -> 0.10.2-IV0, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2018-05-07 10:41:12,853] INFO Partition [test-topic,0] on broker 0: No checkpointed highwatermark is found for partition test-topic-0 (kafka.cluster.Partition)
[2018-05-07 10:41:12,861] TRACE Broker 0 stopped fetchers as part of become-leader request from controller 0 epoch 12 with correlation id 4 for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,861] TRACE Broker 0 completed LeaderAndIsr request correlationId 4 from controller 0 epoch 12 for the become-leader transition for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:12,864] WARN Broker 0 ignoring LeaderAndIsr request from controller 0 with correlation id 5 epoch 12 for partition [test-topic,0] since its associated leader epoch 0 is not higher than the current leader epoch 0 (state.change.logger)
[2018-05-07 10:41:12,865] TRACE Controller 0 epoch 12 received response {error_code=0,partitions=[{topic=test-topic,partition=0,error_code=11}]} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:12,865] TRACE Controller 0 epoch 12 received response {error_code=0,partitions=[{topic=test-topic,partition=0,error_code=0}]} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 6 (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:12,867] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 5 (state.change.logger)
[2018-05-07 10:41:12,868] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:26,213] INFO Partition [test-topic,0] on broker 0: Shrinking ISR for partition [test-topic,0] from 0,1 to 0 (kafka.cluster.Partition)
[2018-05-07 10:41:28,721] DEBUG [IsrChangeNotificationListener on Controller 0]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:28,735] DEBUG [IsrChangeNotificationListener on Controller 0]: Sending MetadataRequest to Brokers:ArrayBuffer(0, 1, 2, 3, 4) for TopicAndPartitions:Set([test-topic,0], [__consumer_offsets,30], [__consumer_offsets,6]) (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:28,735] INFO Leader not yet assigned for partition [__consumer_offsets,30]. Skip sending UpdateMetadataRequest. (kafka.controller.ControllerBrokerRequestBatch)
[2018-05-07 10:41:28,735] INFO Leader not yet assigned for partition [__consumer_offsets,6]. Skip sending UpdateMetadataRequest. (kafka.controller.ControllerBrokerRequestBatch)
[2018-05-07 10:41:28,735] TRACE Controller 0 epoch 12 sending UpdateMetadata request (Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12) to brokers Set(0, 1, 2, 3, 4) for partition test-topic-0 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 6 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 3 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-1.kafka-hs.analytics.svc.cluster.local:29092 (id: 1 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 7 (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-2.kafka-hs.analytics.svc.cluster.local:29092 (id: 2 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-3.kafka-hs.analytics.svc.cluster.local:29092 (id: 3 rack: null) (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-0.kafka-hs.analytics.svc.cluster.local:29092 (id: 0 rack: null) (state.change.logger)
[2018-05-07 10:41:28,741] TRACE Controller 0 epoch 12 received response {error_code=0} for a request sent to broker kafka-4.kafka-hs.analytics.svc.cluster.local:29092 (id: 4 rack: null) (state.change.logger)
[2018-05-07 10:41:28,746] DEBUG [IsrChangeNotificationListener on Controller 0]: ISR change notification listener fired (kafka.controller.IsrChangeNotificationListener)
[2018-05-07 10:41:36,297] TRACE [Controller 0]: checking need to trigger partition rebalance (kafka.controller.KafkaController)
[2018-05-07 10:41:36,298] DEBUG [Controller 0]: preferred replicas by broker Map(0 -> Map([test-topic,0] -> List(0, 1))) (kafka.controller.KafkaController)
[2018-05-07 10:41:36,302] DEBUG [Controller 0]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-05-07 10:41:36,303] TRACE [Controller 0]: leader imbalance ratio for broker 0 is 0.000000 (kafka.controller.KafkaController)
Replica #1:
[2018-05-07 10:41:12,822] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 3 (state.change.logger)
[2018-05-07 10:41:28,739] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 4 (state.change.logger)
Replica #2:
[2018-05-07 10:41:12,823] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 1 (state.change.logger)
[2018-05-07 10:41:28,740] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:12),ReplicationFactor:2),AllReplicas:0,1) for partition test-topic-0 in response to UpdateMetadata request sent by controller 0 epoch 12 with correlation id 2 (state.change.logger)
However, whenever I connect with the console producer to the cluster, I get the following error:
.\kafka-console-producer.bat --broker-list {EXTERNAL-IP-ADDRESS}:9093 --topic test-topic --property parse.key=true --property key.separator=:
>testKey:23487239847237894asduhzdfhzusfhhsdf
[2018-05-07 12:42:58,395] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 2 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,512] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 3 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,641] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 4 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,765] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 5 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2018-05-07 12:42:58,886] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 6 : {test-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
Is it a problem that a Kubernetes Service usually exposes one external IP address and all of my Kafka brokers are advertising this IP? Are there solutions to this?
I expect the inter-broker communication to happen on 29092.
Yes, they use 29092 for internal communication.
External clients should be able to connect on port 9093. I have one external IP for the entire Kubernetes service, which means that this is the only external IP that should be exposed from the Kafka brokers. As far as I understand, the Kubernetes load balancer will route any request to this IP to one of my brokers.
Yes, Kubernetes will route all traffic from that service to one of your brokers and that is a problem.
Internally, you use Headless Service to discover addresses of your Kafka brokers, so they are available by DNS names kafka-[_NUM_OF_THE_REPLICA_]._SERVICE_NAME_ and it works without any problems.
For access from the outside of the cluster, you need to expose all your replicas on the different addresses or ports. But, you have only one service which can balance requests between services.
To fix it, you should create a separate service for each replica and use themes external addresses as EXTERNAL-IP-ADDRESSES in your configuration.
Here is an example from the issue in GitHub repo where you get a configuration of Kafka cluster for Kubernetes:
---
apiVersion: v1
kind: Service
metadata:
name: kafka-es-0
spec:
ports:
- port: 9092
name: kafka-port
protocol: TCP
selector:
pod-name: kafka-0
type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
name: kafka-es-1
spec:
ports:
- port: 9092
name: kafka-port
protocol: TCP
selector:
pod-name: kafka-1
type: LoadBalancer
A problem developed on our production Storm cluster that we cannot figure out or work around.
At some point it appears that the kafka spout stopped reading from half of the topic partitions. There are 40 partitions, and it is only reading from 20 of them. We cannot find any changes that we made to either the storm cluster or kafka at the time this started happening.
We changed the consumer group id and set the spout config startOffsetTime to OffsetRequest.LatestTime to try to get it reading fresh data from all partitions. It still only connects to the same 20 partitions. We've looked at the node /<topic>/<consumer_group> inside the Storm zookeeper and see only 20 partitions there.
We have verified that messages are being published to all 40 partitions.
Kafka version is 0.9.0.1,storm version is 1.1.0.
Any tips on how to debug or where to look would be greatly appreciated. Did I mention that this is happening in production? Did I mention it started a week ago, and we just noticed this morning? :(
Additional info: we found some errors in the Kafka state change log (partition 9 is one of the affected partitions and the timestamp in the log looks to be about the time that the problem started)
kafka.common.NoReplicaOnlineException: No replica for partition
[transcription-results,9] is alive. Live brokers are: [Set()], Assigned replicas are: [List(1, 4, 0)]
[2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica (state.change.logger)
[2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 0 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request correlationId 158 from controller 0 epoch 44 starting the become-follower transition for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
//... removed some TRACE statements here
[2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from controller 1 with correlation id 1 epoch 47 for partition [transcription-results,9] since its associated leader epoch 441 is old. Current leader epoch is 441 (state.change.logger)
[2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
[2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 679 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)
Possibly caused by this bug: https://issues.apache.org/jira/browse/KAFKA-3963
But how can we recover from it?
I'd start by looking in Kafka's Zookeeper under /brokers/topics to verify that all partitions are listed. That's where storm-kafka reads the partitions from.