Kafka Streams shutdown after IllegalStateException: No current assignment for partition - apache-kafka

I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instances is legitimately killed, which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one or more previously healthy nodes fail. The logs indicate that the Streams application transitions into the PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we often also seem to receive some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);

This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files

Related

Debugging root cause for high Purgatory size in Kafka

We are using the ELK stack (7.10.2) in Kubernetes (1.21.5). Some time ago our service provider, Gardener, changed the OS version (318.9.0 -> 576.1.0), and our troubles with the logging stack started.
It seems that Kafka (v 2.8.1, 2 pods) does not stream data to Logstash (7.10.2, 2 pods) continuously, but sends it in chunks every few moments. In Kibana we do not see log records being added continually; instead, a batch of new records appears every few moments. Under high load (e.g. when debugging some component in the k8s cluster), this delay rises to minutes.
We discovered that the delayed fetch purgatory metric jumps in a very similar pattern,
like a "saw" (see screenshot). When I downgrade the OS version on the nodes from the current one (576.2.0, orange) to the previous one (318.9.0, blue), the problem disappears. As you would expect, we cannot stay on the old OS version much longer.
I asked the Gardener staff for assistance, but without a root cause they are not able to help us. We did not change any settings or component versions, just the OS version on the nodes.
From the Logstash debug log I can see that Logstash is continuously connecting to and disconnecting from Kafka:
[2022-01-17T08:53:33,232][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-58gnz-containers-10, groupId=containers] Attempt to heartbeat failed since group is rebalancing
[2022-01-17T08:53:30,501][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-ct29t-containers-49, groupId=containers] Discovered group coordinator elk-kafka-0.kafka.logging.svc.cluster.local:9092 (id: 2147483647 rack: null)
[2022-01-17T08:53:30,001][INFO ][org.apache.kafka.common.utils.AppInfoParser] Kafka startTimeMs: 1642409610000
These lines keep repeating in a loop.
I can see a similar situation on Kafka:
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,365] DEBUG Accepted connection from /10.250.1.127:53678 on /100.96.30.21:9092 and assigned it to processor 1, sendBufferSize [actual|requested]: [102400|102400] recvBufferSize [actual|requested]: [102400|102400] (kafka.network.Acceptor)
[2022-01-20 11:55:04,365] DEBUG Processor 1 listening to new connection from /10.250.1.127:53678 (kafka.network.Processor)
[2022-01-20 11:55:04,368] DEBUG [SocketServer listenerType=ZK_BROKER, nodeId=0] Connection with /10.250.1.127 disconnected (org.apache.kafka.common.network.Selector)
I attempted:
doubling the resources for Kafka and Logstash (no change)
changing the container engine from Docker to ContainerD (the problem got worse with ContainerD, ~400 -> ~1000)
changing the Logstash parameters for the Kafka plugin (no change)
comparing kernel settings (5.4.0 -> 5.10.0; I did not spot any interesting changes)
temporarily disabling Karydia for Kafka, Logstash and ZooKeeper (no change)
temporarily upgrading the Logstash version (7.10.2 -> 7.12.0, without success; all tested versions show the same bad behavior, and moving to a higher version is currently not possible without changing the versions of other ELK components)
Unfortunately, I am not a Kafka expert. I am not sure whether the connecting/disconnecting is caused by some non-optimal setting of ours, or whether the communication is being interfered with by something unknown to us.
I would like to ask the community for help with this problem. Suggestions on how to continue the investigation are very welcome too.

io.confluent.ksql.exception.KafkaTopicExistsException: when launching ksql-server-start ksql-server.properties

I've been working with ksql for quite some time. The Kafka cluster has 3 nodes. I've been using a UDF as well, and all looks good until I stop the servers and start them again.
On server start I'm seeing the following in the logs:
[2019-04-03 11:29:54,381] ERROR Exception encountered running command: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).. Retrying in 5000 ms (io.confluent.ksql.util.RetryUtil:80)
[2019-04-03 11:29:54,381] ERROR Stack trace: io.confluent.ksql.exception.KafkaTopicExistsException: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:51)
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:35)
at io.confluent.ksql.services.KafkaTopicClientImpl.validateTopicProperties(KafkaTopicClientImpl.java:292)
at io.confluent.ksql.services.KafkaTopicClientImpl.createTopic(KafkaTopicClientImpl.java:76)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.createSinkTopic(KsqlStructuredDataOutputNode.java:244)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.buildStream(KsqlStructuredDataOutputNode.java:146)
at io.confluent.ksql.physical.PhysicalPlanBuilder.buildPhysicalPlan(PhysicalPlanBuilder.java:106)
at io.confluent.ksql.QueryEngine.buildPhysicalPlan(QueryEngine.java:113)
at io.confluent.ksql.KsqlEngine$EngineExecutor.execute(KsqlEngine.java:625)
at io.confluent.ksql.KsqlEngine$EngineExecutor.access$800(KsqlEngine.java:577)
at io.confluent.ksql.KsqlEngine.execute(KsqlEngine.java:247)
at io.confluent.ksql.rest.server.computation.StatementExecutor.startQuery(StatementExecutor.java:277)
at io.confluent.ksql.rest.server.computation.StatementExecutor.executeStatement(StatementExecutor.java:191)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleStatementWithTerminatedQueries(StatementExecutor.java:167)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleRestore(StatementExecutor.java:101)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$null$0(CommandRunner.java:139)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:63)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:36)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$processPriorCommands$1(CommandRunner.java:135)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.confluent.ksql.rest.server.computation.CommandRunner.processPriorCommands(CommandRunner.java:134)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:414)
at io.confluent.ksql.rest.server.KsqlServerMain.createExecutable(KsqlServerMain.java:80)
at io.confluent.ksql.rest.server.KsqlServerMain.main(KsqlServerMain.java:42)
(io.confluent.ksql.util.RetryUtil:84)
Though I've stopped/terminated all the queries, the log prints all the commands I've executed for my testing from the beginning till date, including create, select, drop. I pulled the UDF .jar out of the /ext folder and the server started, though the log reports that the UDF function I'm using is not available.
This is my ksql-server.properties:
bootstrap.servers=hostname:9092
service.id=cyan_ksql
commit.interval.ms=5000
cache.max.bytes.buffering=20000000
num.stream.threads=10
fail.on.deserialization.error=false
listeners=http://localhost:8088
ksql.extension.dir=/opt/ksql-master/ext/
I'm going nuts with this error. I'm deleting the topic and somehow it's recreated. Someone please help.
Check out the error:
A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required.
KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1)
If you've deleted the topic, then either
it didn't actually get deleted
it got deleted and something else recreated it with nine partitions, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) of the default of four
another KSQL command is creating it ahead of the one that errors out, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) of the default of four
If you want to blow away your state and start from scratch, simply change your ksql.service.id, which will cause KSQL to use a new command topic (the command topic is what gets replayed when you restart the process).
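To make the mismatch in the error message concrete, here is a minimal sketch of the kind of comparison being reported. It is not KSQL's actual code: the class and method names are hypothetical, and the rule (any difference in partition count or replication factor counts as a mismatch) is a simplifying assumption.

```java
public class TopicCheckSketch {

    // Hypothetical, simplified check: returns null when the existing topic matches
    // what the query expects, otherwise a description of the mismatch.
    static String describeMismatch(String topic,
                                   int expectedPartitions, int actualPartitions,
                                   int expectedReplicas, int actualReplicas) {
        if (expectedPartitions == actualPartitions && expectedReplicas == actualReplicas) {
            return null;
        }
        return "Topic '" + topic + "' expects " + expectedPartitions
                + " partitions (topic has " + actualPartitions + "), and "
                + expectedReplicas + " replication factor (topic has " + actualReplicas + ")";
    }

    public static void main(String[] args) {
        // Values from the error above: default of 4 partitions, existing topic has 9.
        System.out.println(describeMismatch("czxcorp-structured-data-enriched", 4, 9, 1, 1));
        // With an override such as WITH (PARTITIONS=9) the expectation would match.
        System.out.println(describeMismatch("czxcorp-structured-data-enriched", 9, 9, 1, 1));
    }
}
```

In this sketch the first call reports the mismatch and the second returns null, which mirrors why adding the partition override to the query (or removing whatever recreates the topic) should make the error go away.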

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
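For reference, the settings above could be assembled roughly as follows. This is a minimal sketch that uses plain string keys so it runs without the Kafka jars; it assumes the constants resolve to the key names shown (processing.guarantee, application.id, num.stream.threads, batch.size), and the class name is only illustrative.

```java
import java.util.Properties;

public class ExactlyOnceConfigSketch {

    // Builds the configuration listed above using literal config keys.
    // Assumption: these strings are what the StreamsConfig/ProducerConfig constants resolve to.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("processing.guarantee", "exactly_once"); // StreamsConfig.PROCESSING_GUARANTEE_CONFIG
        props.put("application.id", "Some app id");        // StreamsConfig.APPLICATION_ID_CONFIG
        props.put("num.stream.threads", 1);                // StreamsConfig.NUM_STREAM_THREADS_CONFIG
        props.put("batch.size", 102400);                   // ProducerConfig.BATCH_SIZE_CONFIG
        return props;
    }

    public static void main(String[] args) {
        buildConfig().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a real application the resulting Properties object would be passed to the KafkaStreams constructor together with the topology.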
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At the time of writing this is unresolved; the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Kafka Stream Startup Issue - org.apache.kafka.streams.errors.LockException

I have a Kafka Streams application (version 0.11) which takes data from a few topics, joins it, and writes the result to another topic.
Kafka Configuration:
5 Kafka brokers - version 0.11
Kafka topics - 15 partitions and a replication factor of 3.
A few million records are consumed/produced every hour. Whenever I take any Kafka broker down, it throws the exception below:
org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:99)
at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:80)
at org.apache.kafka.streams.processor.internals.StandbyTask.<init>(StandbyTask.java:62)
at org.apache.kafka.streams.processor.internals.StreamThread.createStandbyTask(StreamThread.java:1325)
at org.apache.kafka.streams.processor.internals.StreamThread.access$2400(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$StandbyTaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:254)
at org.apache.kafka.streams.processor.internals.StreamThread.addStandbyTasks(StreamThread.java:1366)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1200(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsAssigned(StreamThread.java:185)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:265)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:363)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
I have read in a few Jira issues that cleaning up the streams might help fix the issue. But is cleaning up the streams every time we start the Kafka Streams application a proper solution or just a patch? Also, stream cleanup will delay the application startup, right?
Note: Do I need to call streams.cleanUp() before calling streams.start() each time I start the Kafka Streams application?
Seeing an org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10 is actually expected and should resolve itself. The thread will back off to wait until another thread releases the lock, and retry later. Thus, you might even see this WARN message in the logs multiple times if the retry happens before the second thread has released the lock.
However, eventually the lock should be released by the second thread and the first thread will be able to get the lock. Afterwards, Streams should just move forward. Note that it's a WARN message and not an error.
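The back-off-and-retry behaviour described above can be illustrated with a small self-contained sketch. It is not the Streams implementation: a ReentrantLock stands in for the state directory lock, and the class, method names, attempt count and back-off time are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockRetrySketch {

    // Tries to take the lock up to maxAttempts times, sleeping backoffMs after each failure.
    // Returns the 1-based attempt that succeeded, or -1 if every attempt failed.
    static int acquireWithBackoff(ReentrantLock lock, int maxAttempts, long backoffMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (lock.tryLock()) {
                return attempt;
            }
            System.out.println("WARN attempt " + attempt + ": failed to lock, backing off");
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    // Returns {result while another thread holds the lock, result after it was released}.
    static int[] demo() {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();   // signal that the lock is now taken
                release.await();    // keep holding it until told to let go
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        try {
            held.await();
            int whileHeld = acquireWithBackoff(lock, 2, 10);    // other thread still owns the lock
            release.countDown();
            holder.join();
            int afterRelease = acquireWithBackoff(lock, 2, 10); // lock is free again
            return new int[] {whileHeld, afterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new int[] {-1, -1};
        }
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("while held: " + result[0] + ", after release: " + result[1]);
    }
}
```

While the second thread holds the lock, every attempt logs a warning and fails; once it releases the lock, the very next attempt succeeds, which is the pattern behind the repeated WARN messages.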

Cannot create more than 15 topics in Kafka

My colleague and I were testing Kafka on a 3-node cluster and ran into this problem while trying to test the performance of sending messages to multiple topics. We can't create more than 15 topics. The first 15 topics work fine, but when we try to create the 16th topic (and any topic after it), a lot of errors start to appear on the 2 follower servers.
One with a lot of errors like this:
ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,36] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
The other with errors like this:
[2017-06-16 18:44:07,146] ERROR [KafkaApi-1] Error when handling request {replica_id=2,max_wait_time=500,min_bytes=1,max_bytes=10485760,topics=[{topic=__consumer_offsets,partitions=[{partition=6,fetch_offset=5,max_bytes=1048576},{partition=36,fetch_offset=3,max_bytes=1048576},{partition=18,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-12,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=multi-test-11,partitions=[{partition=2,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=0,fetch_offset=5,max_bytes=1048576},{partition=45,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-16,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=27,fetch_offset=0,max_bytes=1048576},{partition=12,fetch_offset=0,max_bytes=1048576},{partition=9,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-10,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-9,partitions=[{partition=2,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=39,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-4,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=21,fetch_offset=10,max_bytes=1048576}]},{topic=multi-test-3,partitions=[{partition=2,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-13,partitions=[{partition=0,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=3,fetch_offset=10,max_bytes=1048576},{partition=48,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-8,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=33,fetch_offset=0,max_bytes=1048576},{partition=30,fetch_offset=15,max_bytes=1048576},{partition=15,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-1,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=multi-test-0,partitions=[{partition
=2,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-2,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=42,fetch_offset=3,max_bytes=1048576},{partition=24,fetch_offset=0,max_bytes=1048576}]}]} (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 1 failed to record follower 2's position -1 since the replica is not recognized to be one of the assigned replicas for partition multi-test-16-0.
at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:246)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:917)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:917)
at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:462)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:530)
at kafka.server.KafkaApis.handle(KafkaApis.scala:81)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
at java.lang.Thread.run(Thread.java:748)
We assigned a replication factor of 2 and 3 partitions to each topic, and every topic is created in the same way. I deleted and recreated each topic manually just to make sure that 15-16 is the exact number at which everything goes wrong.
Well, weird problems seem to always have weird answers.
It turns out the problem was that one of our nodes was using an x32 CPU; switching it to an x64 machine solved the problem.