io.confluent.ksql.exception.KafkaTopicExistsException: when launching ksql-server-start ksql-server.properties - apache-kafka

I'm working with ksql from quite some time. Kafka cluster if of 3 nodes. I've been using udf as well and all looks good until I stop the servers and start them again.
On server start I'm seeing the following in the logs:
[2019-04-03 11:29:54,381] ERROR Exception encountered running command: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).. Retrying in 5000 ms (io.confluent.ksql.util.RetryUtil:80)
[2019-04-03 11:29:54,381] ERROR Stack trace: io.confluent.ksql.exception.KafkaTopicExistsException: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:51)
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:35)
at io.confluent.ksql.services.KafkaTopicClientImpl.validateTopicProperties(KafkaTopicClientImpl.java:292)
at io.confluent.ksql.services.KafkaTopicClientImpl.createTopic(KafkaTopicClientImpl.java:76)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.createSinkTopic(KsqlStructuredDataOutputNode.java:244)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.buildStream(KsqlStructuredDataOutputNode.java:146)
at io.confluent.ksql.physical.PhysicalPlanBuilder.buildPhysicalPlan(PhysicalPlanBuilder.java:106)
at io.confluent.ksql.QueryEngine.buildPhysicalPlan(QueryEngine.java:113)
at io.confluent.ksql.KsqlEngine$EngineExecutor.execute(KsqlEngine.java:625)
at io.confluent.ksql.KsqlEngine$EngineExecutor.access$800(KsqlEngine.java:577)
at io.confluent.ksql.KsqlEngine.execute(KsqlEngine.java:247)
at io.confluent.ksql.rest.server.computation.StatementExecutor.startQuery(StatementExecutor.java:277)
at io.confluent.ksql.rest.server.computation.StatementExecutor.executeStatement(StatementExecutor.java:191)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleStatementWithTerminatedQueries(StatementExecutor.java:167)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleRestore(StatementExecutor.java:101)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$null$0(CommandRunner.java:139)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:63)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:36)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$processPriorCommands$1(CommandRunner.java:135)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.confluent.ksql.rest.server.computation.CommandRunner.processPriorCommands(CommandRunner.java:134)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:414)
at io.confluent.ksql.rest.server.KsqlServerMain.createExecutable(KsqlServerMain.java:80)
at io.confluent.ksql.rest.server.KsqlServerMain.main(KsqlServerMain.java:42)
(io.confluent.ksql.util.RetryUtil:84)
Though I've stopped/terminated all the queries, the log prints all the commands I've executed from the beginning for my testing till data, including create, select, drop. I've pulled out the .jar(UDF) from /ext folder and the server started, though the log prints udf function(i'm using) not available.
This is my ksql-server.properties:
bootstrap.servers=hostname:9092
service.id=cyan_ksql
commit.interval.ms=5000
cache.max.bytes.buffering=20000000
num.stream.threads=10
fail.on.deserialization.error=false
listeners=http://localhost:8088
ksql.extension.dir=/opt/ksql-master/ext/
Going nuts with the error. I'm deleting the topic and somehow its recreated. Someone please help.

Check out the error:
A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required.
KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1)
If you've deleted the topic then either
it didn't actually get deleted
it got deleted and something else recreated it with nine partitions and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9) to the default four
another KSQL command is creating it ahead of the one that errors out and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9) to the default four
If you want to blow away your state and start from scratch, simply change your ksql.service.id which will cause KSQL to use a new command topic (which is what get replayed when you restart the process)

Related

Apache Kafka: kafka.common.OffsetsOutOfOrderException when reassigning __consumer_offsets to new brokers

we have upgraded Kafka from 2.2 to 2.6, and added 4 new brokers to the existing 4 brokers. This went fine.
After that we started to reassign topic data to the new brokers. Most topics went ok, but on one of the 50 partitions of __consumer_offsets, the reassignment hangs. 49 of the partitions were successfully moved from the old brokers (id 3,4,5,6) to the new (ids 10,11,12,13).
But on __consumer_offsets-18 we consistently get this error (in the server.log of the new brokers)
[2020-10-24 15:04:54,528] ERROR [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Unexpected error occurred while processing data for partition __consumer_offsets-18 at offset 1545264631 (kafka.server.ReplicaFetcherThread)
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append to __consumer_offsets-18: ArrayBuffer(1545264631, 1545264632,
... thousands of other ids
1545272005, 1545272006, 1545272007)
at kafka.log.Log.$anonfun$append$2(Log.scala:1126)
at kafka.log.Log.append(Log.scala:2340)
at kafka.log.Log.appendAsFollower(Log.scala:1036)
at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:939)
at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:946)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:168)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:332)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:320)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:319)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:319)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:134)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:117)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2020-10-24 15:04:54,534] INFO [ReplicaFetcher replicaId=10, leaderId=4, fetcherId=0] Truncating partition __consumer_offsets-31 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2020-10-24 15:04:54,547] WARN [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Partition __consumer_offsets-18 marked as failed (kafka.server.ReplicaFetcherThread)
Any idea what is wrong here? The whole cluster seems to process data nicely. It's just that we can't move over this particular partition. We tried various things (cancelling the reassignment, restarting it with all partitions, restarting it with just partition 18, restarting all brokers) to no avail
Help very much appreciated, this is happening in PROD only after it worked successfully on all test environments.
EDIT: we actually found that in the huge list of message offsets in the list of the exception, there is actually a descrepancy. The relevant part of this list is
... 1545271418, 1545271419, 1, 1, 1545271422, 1545271423,
Obviously the two '1' entries there look really wrong! They should be 1545271420/1545271421. Could it be that the leader really has some kind of data corruption?
We were ultimately able to solve this issue - mainly by sitting and waiting
The issue was indeed that somewhen, somehow, the data on the leader of this __consumer_offset-18 partition got corrupted. This probably happened during the upgrade from Kafka 2.2 -> 2.6. We were doing this in a rather dangerous way, as we know now: we simply stopped all brokers, updated the SW on all, and restarted them. We thought since we can afford this outage on the weekend, this would be a safe way. But we will certainly never do that again. At least not unless we know 100% that all producers and all consumers are really stopped. This was NOT the case during that upgrade, we overlooked one consumer and left that running, and that consumer (group) was storing their offsets in the __consumer_offset-18 partition. So that action - taking all brokers down and upgrade them - while consumers/producers are still running, did probably cause the corruption
Lesson learnt: never do that again, always do rolling upgrades, even if they take a lot longer.
The issue was then actually solved through the nature of the compacting topic. The default settings for compaction are to run it every week (or when a segment gets bigger than 1GB, which will not happen in our case). Compaction of that __consumer_offset-18 partition kicked in yesterday evening. And by that the 2 corrupted offsets were purged away. After that it was uphill, the reassign of that partition to the new brokers then worked like a charm.
We could certainly have speeded up this recovery by setting the topic parameters in a way that compaction would kick in earlier. But it was only on Wednesday when we reached the understanding that the problem could actually be resolved that way. We decided to leave everything as it was and wait another day.

Kafka Streams shutdown after IllegalStateException: No current assignment for partition

I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instance of the application is legitimately killed which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one ore more previously healthy nodes fail. The logs are indicating that the Streams application transitions into a PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we often seem to also recieve some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files

Cannot Restart Kafka Consumer Application, Failing due to OffsetOutOfRangeException

Currently, my Kafka Consumer streaming application is manually committing the offsets into Kafka with enable.auto.commit set to false.
The application failed when I tried restarting it throwing below exception:
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions:{partition-12=155555555}
Assuming the above error is due to the message not present/partition deleted due to retention period, I tried below method:
I disabled the manual commit and enabled auto commit(enable.auto.commit=true and auto.offset.reset=earliest)
Still it fails with the same error
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions:{partition-12=155555555}
Please suggest ways to restart the job so that it can successfully read the correct offset for which message/partition is present
You are trying to read offset 155555555 from partition 12 of topic partition, but -most probably- it might have already been deleted due to your retention policy.
You can either use Kafka Streams Application Reset Tool in order to reset your Kafka Streams application's internal state, such that it can reprocess its input data from scratch
$ bin/kafka-streams-application-reset.sh
Option (* = required) Description
--------------------- -----------
* --application-id <id> The Kafka Streams application ID (application.id)
--bootstrap-servers <urls> Comma-separated list of broker urls with format: HOST1:PORT1,HOST2:PORT2
(default: localhost:9092)
--intermediate-topics <list> Comma-separated list of intermediate user topics
--input-topics <list> Comma-separated list of user input topics
--zookeeper <url> Format: HOST:POST
(default: localhost:2181)
or start your consumer using a fresh consumer group ID.
I met the same problem and I use package org.apache.spark.streaming.kafka010 in my application.In the begining,I suscepted the auto.offset.reset strategy take no effect,but when I read the description of the method fixKafkaParams in the object KafkaUtils,i found the configuration has been overwrited.I guess the reason why it tweak the configuration ConsumerConfig.AUTO_OFFSET_RESET_CONFIG for executor is to keep consistent offset obtained by driver and executor.

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Cannot create more than 15 topics in Kafka

Me and my colleague were testing out Kafka on a 3 nodes cluster and we encountered this problem were trying to test the performance of sending message to multiple topics. We can't create more than 15 topics. The first 15 topics works fine. But when trying to create the 16th topic(and topics onward), a lot of errors started to appear in the 2 follower servers.
One with a lot of errors like this:
ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,36] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
The other with errors like this:
[2017-06-16 18:44:07,146] ERROR [KafkaApi-1] Error when handling request {replica_id=2,max_wait_time=500,min_bytes=1,max_bytes=10485760,topics=[{topic=__consumer_offsets,partitions=[{partition=6,fetch_offset=5,max_bytes=1048576},{partition=36,fetch_offset=3,max_bytes=1048576},{partition=18,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-12,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=multi-test-11,partitions=[{partition=2,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=0,fetch_offset=5,max_bytes=1048576},{partition=45,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-16,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=27,fetch_offset=0,max_bytes=1048576},{partition=12,fetch_offset=0,max_bytes=1048576},{partition=9,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-10,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-9,partitions=[{partition=2,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=39,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-4,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=21,fetch_offset=10,max_bytes=1048576}]},{topic=multi-test-3,partitions=[{partition=2,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-13,partitions=[{partition=0,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=3,fetch_offset=10,max_bytes=1048576},{partition=48,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-8,partitions=[{partition=0,fetch_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=33,fetch_offset=0,max_bytes=1048576},{partition=30,fetch_offset=15,max_bytes=1048576},{partition=15,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-1,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=multi-test-0,partitions=[{partition=2,fetch_offset=0,max_bytes=1048576}]},{topic=multi-test-2,partitions=[{partition=1,fetch_offset=1,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=42,fetch_offset=3,max_bytes=1048576},{partition=24,fetch_offset=0,max_bytes=1048576}]}]} (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 1 failed to record follower 2's position -1 since the replica is not recognized to be one of the assigned replicas for partition multi-test-16-0.
at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:246)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:917)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:917)
at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:462)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:530)
at kafka.server.KafkaApis.handle(KafkaApis.scala:81)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
at java.lang.Thread.run(Thread.java:748)
We assigned a replication factor of 2 and a partition of 3 for each topic and every topic is created in the same way.I deleted and recreated each topic manually just to make sure that 15-16 is the exact number that everything went wrong.
Well, weird problems seems to always have weird answers.
Turns out, the problem is that one of our node is using a x32 cpu, switching it to a x64 machine solved the problem.