Kafka generates huge extra data when creating new topics - apache-kafka

I have a 3-node Zookeeper cluster (version 3.4.11) and a 2-node Kafka cluster (version 0.11.3). We wrote a producer that sends messages to specific topics and partitions of the Kafka cluster (I have done this before and the producer is tested). Here are the broker configs:
broker.id=1
listeners=PLAINTEXT://node1:9092
num.partitions=24
delete.topic.enable=true
default.replication.factor=2
log.dirs=/data
zookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181
log.retention.hours=168
zookeeper.session.timeout.ms=40000
zookeeper.connection.timeout.ms=10000
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=2
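With these settings, each auto-created topic gets 24 partitions and a replication factor of 2; as a quick sanity check (the topic name below is just an example), the actual assignment can be inspected with:
kafka-topics.sh --zookeeper zoo1:2181,zoo2:2181,zoo3:2181 --describe --topic Topic1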
In the beginning there are no topics on the brokers; they will be created automatically. When I start the producer, the Kafka cluster shows strange behavior:
1- It creates all the topics, but although the producing rate is only 10KB per second, in less than one minute the log directory of each broker goes from zero to 9.0 GB of data, and all brokers shut down (because of lack of log-dir capacity).
2- Right after producing starts, I try to consume data using the console consumer and it just errors:
WARN Error while fetching metadata with correlation id 2 : {Topic1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
3- Here is the error that repeatedly appears in the broker logs:
INFO Updated PartitionLeaderEpoch. New: {epoch:0, offset:0}, Current: {epoch:-1, offset-1} for Partition: Topic6-6. Cache now contains 0 entries. (kafka.server.epoch.LeaderEpochFileCache)
WARN Newly rolled segment file 00000000000000000000.log already exists; deleting it first (kafka.log.Log)
WARN Newly rolled segment file 00000000000000000000.index already exists; deleting it first (kafka.log.Log)
WARN Newly rolled segment file 00000000000000000000.timeindex already exists; deleting it first (kafka.log.Log)
ERROR [Replica Manager on Broker 1]: Error processing append operation on partition Topic6-6 (kafka.server.ReplicaManager)
kafka.common.KafkaException: Trying to roll a new log segment for topic partition Topic6-6 with start offset 0 while it already exists.
After many repetitions of the above log we get:
ERROR [ReplicaManager broker=1] Error processing append operation on partition Topic24-10 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.InvalidOffsetException: Attempt to append an offset (402) to position 5 no larger than the last offset appended (402)
And at the end (when there is no space left in the log dir) it errors:
FATAL [Replica Manager on Broker 1]: Error writing to highwatermark file: (kafka.server.ReplicaManager)
java.io.FileNotFoundException: /data/replication-offset-checkpoint.tmp (No space left on device)
and shuts down!
4- I set up a new single-node Kafka (version 0.11.3) on another machine and it works well with the same producer and the same Zookeeper cluster.
5- I turned one of the two Kafka brokers off, and with just that one broker (of the cluster) it behaves the same as the two-node Kafka cluster did.
What is the problem?
UPDATE1: I tried Kafka version 2.1.0, but got the same result!
UPDATE2: I found the root of the problem. When producing, I create 25 topics, each of which has 24 partitions. Surprisingly, each topic, just after creation (using the kafka-topics.sh command and with no data stored), occupies 481MB of space! For example, in the log directory of topic "topic20", each partition directory has the following files, totalling 21MB:
00000000000000000000.index (10MB)
00000000000000000000.log (0MB)
00000000000000000000.timeindex (10MB)
leader-epoch-checkpoint (4KB)
Kafka also writes the following lines to the server.log file for each topic partition:
[2019-02-05 10:10:54,957] INFO [Log partition=topic20-14, dir=/data] Loading producer state till offset 0 with message format version 2 (kafka.log.Log)
[2019-02-05 10:10:54,957] INFO [Log partition=topic20-14, dir=/data] Completed load of log with 1 segments, log start offset 0 and log end offset 0 in 1 ms (kafka.log.Log)
[2019-02-05 10:10:54,958] INFO Created log for partition topic20-14 in /data with properties {compression.type -> producer, message.format.version -> 2.1-IV2, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, message.downconversion.enable -> true, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> false, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2019-02-05 10:10:54,958] INFO [Partition topic20-14 broker=0] No checkpointed highwatermark is found for partition topic20-14 (kafka.cluster.Partition)
[2019-02-05 10:10:54,958] INFO Replica loaded for partition topic20-14 with initial high watermark 0 (kafka.cluster.Replica)
[2019-02-05 10:10:54,958] INFO [Partition topic20-14 broker=0] topic20-14 starts at Leader Epoch 0 from offset 0. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
There is no error in the server log. I can even consume data if I produce to a topic. Since the total log directory space is 10GB, in my scenario Kafka needs 12025MB for the 25 topics, which is greater than the total directory space, so Kafka errors out and shuts down!
Just as a test, I set up another Kafka broker (namely Broker2) using the same Zookeeper cluster; creating a new topic with 24 partitions there occupies just 100KB for all the empty partitions!
So I'm really confused! Broker1 and Broker2 run the same Kafka version (0.11.3); only the OS and filesystem differ:
Broker1 (occupies 481MB for a new topic):
OS CentOS 7 with XFS as the filesystem
Broker2 (occupies 100KB for a new topic):
OS Ubuntu 16.04 with ext4 as the filesystem

Why does Kafka preallocate 21MB for each partition?
This is normal behavior; the preallocated size of the indices is controlled by the server property segment.index.bytes, whose default value is 10485760 bytes, or 10MB. That's because the indices in each partition directory are allocated 10MB each:
00000000000000000000.index (10MB)
00000000000000000000.log(0MB)
00000000000000000000.timeindex(10MB)
leader-epoch-checkpoint(4KB)
On the other hand, the Kafka documentation says about that property:
We preallocate this index file and shrink it only after log rolls.
But in my case it never shrank the indices. After a lot of searching, I found out that some versions of Java 8 (update 192 in my case) have a bug when working with many small files, and it is fixed in update 202. So I updated Java to update 202 and that solved the problem.
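If you hit something similar, a quick way to check whether the preallocated index files are really consuming disk space (rather than being sparse) is to compare the apparent size with the blocks actually allocated; the paths below follow the layout from the question:
# apparent size (10MB) vs. blocks actually allocated on disk
ls -ls /data/topic20-14/00000000000000000000.index
du -h --apparent-size /data/topic20-14/00000000000000000000.index
du -h /data/topic20-14/00000000000000000000.index
# total on-disk usage per partition directory of the topic
du -sh /data/topic20-*
On the ext4 broker these files are evidently sparse (hence only ~100KB per topic), while on the affected XFS broker the full 10MB per index was being allocated until the Java update.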

Related

Kafka delete topics when auto.create.topics is enabled

We have a 2-node Kafka cluster with both auto.create.topics.enable and delete.topic.enable set to true. Our app reads from a common request topic and responds to a response topic provided by the client in the request payload.
auto.create.topics.enable is set to true because our client has an auto-scale environment wherein a new worker reads from a new response topic. Due to some implementation issues on the client side, a lot of topics were created that have never been used (end offset is 0), and we are attempting to clean that up.
The problem is that upon deleting a topic, it is recreated almost immediately. These topics don't have any consumers (the workers listening to them are already dead).
I have tried the following:
Kafka CLI delete command
kafka-topics.sh --zookeeper localhost:2181 --topic <topic-name> --delete
Create a zookeeper node under
/admin/delete_topics/<topic-name>
Neither seems to work. In the logs, I see that a request for deletion was received and the corresponding logs/indexes were deleted. But within a few seconds/minutes, the topic is auto-created. Logs for reference -
INFO [Partition <topic-name>-0 broker=0] No checkpointed highwatermark is found for partition <topic-name>-0 (kafka.cluster.Partition)
INFO Replica loaded for partition <topic-name>-0 with initial high watermark 0 (kafka.cluster.Replica)
INFO Replica loaded for partition <topic-name>-0 with initial high watermark 0 (kafka.cluster.Replica)
INFO [Partition <topic-name>-0 broker=0] <topic-name>-0 starts at Leader Epoch 0 from offset 0. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-4.7a79dfc720624d228d5ee90c8d4c325e-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-4.7a79dfc720624d228d5ee90c8d4c325e-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-4.7a79dfc720624d228d5ee90c8d4c325e-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-4 in /home/ec2-user/data/kafka/0/<topic-name>-4.7a79dfc720624d228d5ee90c8d4c325e-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-2.d32a905f9ace459cb62a530b2c605347-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-2.d32a905f9ace459cb62a530b2c605347-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-2.d32a905f9ace459cb62a530b2c605347-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-2 in /home/ec2-user/data/kafka/0/<topic-name>-2.d32a905f9ace459cb62a530b2c605347-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-3.0670e8aefae5481682d53afcc09bab6a-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-3.0670e8aefae5481682d53afcc09bab6a-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-3.0670e8aefae5481682d53afcc09bab6a-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-3 in /home/ec2-user/data/kafka/0/<topic-name>-3.0670e8aefae5481682d53afcc09bab6a-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-7.ac76d42a39094955abfb9d37951f4fae-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-7.ac76d42a39094955abfb9d37951f4fae-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-7.ac76d42a39094955abfb9d37951f4fae-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-7 in /home/ec2-user/data/kafka/0/<topic-name>-7.ac76d42a39094955abfb9d37951f4fae-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-1.4872c74d579f4553a881114749e08141-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-1.4872c74d579f4553a881114749e08141-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-1.4872c74d579f4553a881114749e08141-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-1 in /home/ec2-user/data/kafka/0/<topic-name>-1.4872c74d579f4553a881114749e08141-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-0.489b7241226341f0a7ffa3d1b9a70e35-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-0.489b7241226341f0a7ffa3d1b9a70e35-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-0.489b7241226341f0a7ffa3d1b9a70e35-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-0 in /home/ec2-user/data/kafka/0/<topic-name>-0.489b7241226341f0a7ffa3d1b9a70e35-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-5.6d659cd119304e1f9a4077265364d05b-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-5.6d659cd119304e1f9a4077265364d05b-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-5.6d659cd119304e1f9a4077265364d05b-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-5 in /home/ec2-user/data/kafka/0/<topic-name>-5.6d659cd119304e1f9a4077265364d05b-delete. (kafka.log.LogManager)
INFO Deleted log /home/ec2-user/data/kafka/0/<topic-name>-6.652d1aec02014a3aa59bd3e14635bd3b-delete/00000000000000000000.log. (kafka.log.LogSegment)
INFO Deleted offset index /home/ec2-user/data/kafka/0/<topic-name>-6.652d1aec02014a3aa59bd3e14635bd3b-delete/00000000000000000000.index. (kafka.log.LogSegment)
INFO Deleted time index /home/ec2-user/data/kafka/0/<topic-name>-6.652d1aec02014a3aa59bd3e14635bd3b-delete/00000000000000000000.timeindex. (kafka.log.LogSegment)
INFO Deleted log for partition <topic-name>-6 in /home/ec2-user/data/kafka/0/<topic-name>-6.652d1aec02014a3aa59bd3e14635bd3b-delete. (kafka.log.LogManager)
INFO [GroupCoordinator 0]: Removed 0 offsets associated with deleted partitions: <topic-name>-0. (kafka.coordinator.group.GroupCoordinator)
INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(<topic-name>-0) (kafka.server.ReplicaFetcherManager)
INFO [ReplicaAlterLogDirsManager on broker 0] Removed fetcher for partitions Set(<topic-name>-0) (kafka.server.ReplicaAlterLogDirsManager)
INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(<topic-name>-0) (kafka.server.ReplicaFetcherManager)
INFO [ReplicaAlterLogDirsManager on broker 0] Removed fetcher for partitions Set(<topic-name>-0) (kafka.server.ReplicaAlterLogDirsManager)
INFO Log for partition <topic-name>-0 is renamed to /home/ec2-user/data/kafka/0/<topic-name>-0.185c7eda12b749a2999cd39b3f90c738-delete and is scheduled for deletion (kafka.log.LogManager)
INFO Creating topic <topic-name> with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(0, 1)) (kafka.zk.AdminZkClient)
INFO [KafkaApi-0] Auto creation of topic <topic-name> with 1 partitions and replication factor 2 is successful (kafka.server.KafkaApis)
INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(<topic-name>-0) (kafka.server.ReplicaFetcherManager)
INFO [Log partition=<topic-name>-0, dir=/home/ec2-user/data/kafka/0] Loading producer state till offset 0 with message format version 2 (kafka.log.Log)
INFO [Log partition=<topic-name>-0, dir=/home/ec2-user/data/kafka/0] Completed load of log with 1 segments, log start offset 0 and log end offset 0 in 0 ms (kafka.log.Log)
INFO Created log for partition <topic-name>-0 in /home/ec2-user/data/kafka/0 with properties {compression.type -> producer, message.format.version -> 2.2-IV1, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, message.downconversion.enable -> true, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> false, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 86400000, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
INFO [Partition <topic-name>-0 broker=0] No checkpointed highwatermark is found for partition <topic-name>-0 (kafka.cluster.Partition)
INFO Replica loaded for partition <topic-name>-0 with initial high watermark 0 (kafka.cluster.Replica)
INFO Replica loaded for partition <topic-name>-0 with initial high watermark 0 (kafka.cluster.Replica)
INFO [Partition <topic-name>-0 broker=0] <topic-name>-0 starts at Leader Epoch 0 from offset 0. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
Does anyone know the reason behind the topic being re-created when no consumers are listening and no producers are producing to it? Could replication be behind it (some race condition, perhaps)? We are using Kafka 2.2.
Deleting the log directory for that topic directly seems to work; however, this is cumbersome when there are thousands of topics. We want a cleanup script that does this periodically because, due to the auto-scale nature of the client environment, there may frequently be orphaned response topics.
Update
I tried Giorgos' suggestion of disabling auto.create.topics.enable and then deleting the topic. This time the topic did get deleted, but none of my applications threw any errors (which leads me to the conclusion that there are no consumers/producers for the said topic).
Further, when auto.create.topics.enable is enabled and the topic is created with replication-factor=1, the topic does not get re-created after deletion. This leads me to believe that perhaps replication is the culprit. Could this be a bug in Kafka?
Jumped the gun here; it turns out something in the customer environment is listening to and re-creating these topics.
Even though you've mentioned that no consumer/producer is consuming from or producing to the topic, it sounds like this is the case. Maybe you have some connectors running on Kafka Connect that replicate data from/to Kafka?
If you still can't find what is causing the re-creation of the deleted Kafka topics, I would suggest setting auto.create.topics.enable to false (temporarily) so that topics cannot be automatically re-created. Then the process that is causing the topic re-creation will normally fail, and this will be reported in your logs.
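If you do go the cleanup-script route once auto-creation is temporarily disabled, a minimal sketch could look like the following (the ZooKeeper/broker addresses and the 'response-' topic prefix are placeholders for your environment):
#!/bin/bash
ZK=localhost:2181
BROKERS=localhost:9092
# delete every matching topic whose end offset is still 0 on all partitions (i.e. never written to)
for topic in $(kafka-topics.sh --zookeeper "$ZK" --list | grep '^response-'); do
  end=$(kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list "$BROKERS" --topic "$topic" --time -1 | awk -F: '{sum += $3} END {print sum+0}')
  if [ "$end" -eq 0 ]; then
    kafka-topics.sh --zookeeper "$ZK" --topic "$topic" --delete
  fi
done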

Missing messages on Kafka's compacted topic

I have a topic that is compacted:
/opt/kafka/bin/kafka-topics.sh --zookeeper localhost --describe --topic myTopic
Topic:myTopic PartitionCount:1 ReplicationFactor:1 Configs:cleanup.policy=compact
There are no messages on it:
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic --from-beginning --property print.key=true
^CProcessed a total of 0 messages
Both the earliest and the latest offset on the only partition that's there are 12, though.
/opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic myTopic --time -2
myTopic:0:12
/opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic myTopic --time -1
myTopic:0:12
I wonder what could have happened to these 12 messages? The number is correct; I was expecting them to be there, but for some reason they're gone.
As far as I understand, even if these 12 messages had the same key, I should have seen at least one - that's how compaction works.
The topic in question was created as compacted. The only weird thing that might have happened during that time is that the Kafka instance lost its Zookeeper data completely. Is it possible that this also caused the data loss?
To rephrase the last question: can something bad happen to the physical data on Kafka if I remove all the Kafka-related ZNodes on Zookeeper?
In addition, here are some logs from Kafka startup.
[2019-04-30 12:02:16,510] WARN [Log partition=myTopic-0, dir=/var/lib/kafka] Found a corrupted index file corresponding to log file /var/lib/kafka/myTopic-0/00000000000000000000.log due to Corrupt index found, index file (/var/lib/kafka/myTopic-0/00000000000000000000.index) has non-zero size but the last offset is 0 which is no greater than the base offset 0.}, recovering segment and rebuilding index files... (kafka.log.Log)
[2019-04-30 12:02:16,524] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Completed load of log with 1 segments, log start offset 0 and log end offset 12 in 16 ms (kafka.log.Log)
[2019-04-30 12:35:34,530] INFO Got user-level KeeperException when processing sessionid:0x16a6e1ea2000001 type:setData cxid:0x1406 zxid:0xd11 txntype:-1 reqpath:n/a Error Path:/config/topics/myTopic Error:KeeperErrorCode = NoNode for /config/topics/myTopic (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-04-30 12:35:34,535] INFO Topic creation Map(myTopic-0 -> ArrayBuffer(0)) (kafka.zk.AdminZkClient)
[2019-04-30 12:35:34,547] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions myTopic-0
(kafka.server.ReplicaFetcherManager)
[2019-04-30 12:35:34,580] INFO [Partition myTopic-0 broker=0] No checkpointed highwatermark is found for partition myTopic-0 (kafka.cluster.Partition)
[2019-04-30 12:35:34,580] INFO Replica loaded for partition myTopic-0 with initial high watermark 0 (kafka.cluster.Replica)
[2019-04-30 12:35:34,580] INFO [Partition myTopic-0 broker=0] myTopic-0 starts at Leader Epoch 0 from offset 12. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
And the messages were indeed removed:
[2019-04-30 12:39:24,199] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Found deletable segments with base offsets [0] due to retention time 10800000ms breach (kafka.log.Log)
[2019-04-30 12:39:24,201] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Rolled new log segment at offset 12 in 2 ms. (kafka.log.Log)
NoNode for /config/topics/myTopic
Kafka no longer knows this topic exists or that it should be compacted, which seems to be evident from the log cleaner logs:
due to retention time 10800000ms breach
So yes, Zookeeper is very important. But so is gracefully shutting down a broker with kafka-server-stop; otherwise, forcibly killing the process or the host machine can leave you with corrupted partition segments.
I'm not entirely sure what conditions would lead to this:
the last offset is 0 which is no greater than the base offset 0
But assuming that you had a full cluster and that the topic had a replication factor higher than 1, you could hope that at least one replica was healthy.
The way to recover a broker with a corrupted index/partition would be to stop the Kafka process, delete the corrupted partition folder from disk, restart Kafka on that machine, and then let it replicate back from a healthy instance.
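In addition, once the topic exists again, it's worth explicitly reapplying and verifying the compaction setting rather than relying on whatever the broker auto-created; for example (paths and addresses as in the question):
/opt/kafka/bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name myTopic --add-config cleanup.policy=compact
/opt/kafka/bin/kafka-topics.sh --zookeeper localhost --describe --topic myTopic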

Kafka broker shutdown while cleaning up log files

We have a Kafka cluster with 3 brokers (version 1.1.0) that had been running well for over 6 months.
Then, after 2018/12/12, we changed the partition count from 3 to 48 for every topic, and since then the brokers shut down every 5-10 days.
We then upgraded the brokers from 1.1.0 to 2.1.0, but they still keep shutting down every 5-10 days.
Each time, one broker shuts down after the following error log; several minutes later, the other 2 brokers shut down too, with the same error but for other partition log files.
[2019-01-11 17:16:36,572] INFO [ProducerStateManager partition=__transaction_state-11] Writing producer snapshot at offset 807760 (kafka.log.ProducerStateManager)
[2019-01-11 17:16:36,572] INFO [Log partition=__transaction_state-11, dir=/kafka/logs] Rolled new log segment at offset 807760 in 4 ms. (kafka.log.Log)
[2019-01-11 17:16:46,150] WARN Resetting first dirty offset of __transaction_state-35 to log start offset 194404 since the checkpointed offset 194345 is invalid. (kafka.log.LogCleanerManager$)
[2019-01-11 17:16:46,239] ERROR Failed to clean up log for __transaction_state-11 in dir /kafka/logs due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.NoSuchFileException: /kafka/logs/__transaction_state-11/00000000000000807727.log
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:809)
at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:222)
at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:488)
at kafka.log.Log.asyncDeleteSegment(Log.scala:1838)
at kafka.log.Log.$anonfun$replaceSegments$6(Log.scala:1901)
at kafka.log.Log.$anonfun$replaceSegments$6$adapted(Log.scala:1896)
at scala.collection.immutable.List.foreach(List.scala:388)
at kafka.log.Log.replaceSegments(Log.scala:1896)
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:583)
at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:515)
at kafka.log.Cleaner.$anonfun$doClean$6$adapted(LogCleaner.scala:514)
at scala.collection.immutable.List.foreach(List.scala:388)
at kafka.log.Cleaner.doClean(LogCleaner.scala:514)
at kafka.log.Cleaner.clean(LogCleaner.scala:492)
at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:353)
at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:319)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Suppressed: java.nio.file.NoSuchFileException: /kafka/logs/__transaction_state-11/00000000000000807727.log -> /kafka/logs/__transaction_state-11/00000000000000807727.log.deleted
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:806)
... 17 more
[2019-01-11 17:16:46,245] INFO [ReplicaManager broker=2] Stopping serving replicas in dir /kafka/logs (kafka.server.ReplicaManager)
[2019-01-11 17:16:46,314] INFO Stopping serving logs in dir /kafka/logs (kafka.log.LogManager)
[2019-01-11 17:16:46,326] ERROR Shutdown broker because all log dirs in /kafka/logs have failed (kafka.log.LogManager)
If you have not changed the log.retention.bytes, log.retention.hours, log.retention.minutes, or log.retention.ms configs, Kafka tries to delete logs after 7 days. Based on the exception, Kafka wants to clean up the file /kafka/logs/__transaction_state-11/00000000000000807727.log, but there is no such file in the Kafka log directory, so it throws an exception, which causes the broker to shut down.
If you are able to shut down the cluster and Zookeeper, do so and clean up /kafka/logs/__transaction_state-11 manually.
Note: I don't know whether it is harmful or not, but you can follow the posts on safely removing a Kafka topic.
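A sketch of that manual cleanup on the affected broker, assuming the partition also has replicas on the other brokers so it can be restored afterwards (the server.properties path is a placeholder):
# on the affected broker, stop Kafka first
kafka-server-stop.sh
# move the problematic partition directory out of the log dir rather than deleting it outright
mv /kafka/logs/__transaction_state-11 /tmp/__transaction_state-11.bak
# restart the broker; it will re-fetch the partition from the current leader
kafka-server-start.sh -daemon /path/to/server.properties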

cluster no response due to replication

I found this in my server.log:
[2016-03-29 18:24:59,349] INFO Scheduling log segment 3773408933 for log g17-4 for deletion. (kafka.log.Log)
[2016-03-29 18:24:59,349] INFO Scheduling log segment 3778380412 for log g17-4 for deletion. (kafka.log.Log)
[2016-03-29 18:24:59,403] WARN [ReplicaFetcherThread-3-4], Replica 2 for partition [g17,4] reset its fetch offset from 3501121050 to current leader 4's start offset 3501121050 (kafka.server.ReplicaFetcherThread)
[2016-03-29 18:24:59,403] ERROR [ReplicaFetcherThread-3-4], Current offset 3781428103 for partition [g17,4] out of range; reset offset to 3501121050 (kafka.server.ReplicaFetcherThread)
[2016-03-29 18:25:27,816] INFO Rolled new log segment for 'g17-12' in 1 ms. (kafka.log.Log)
[2016-03-29 18:25:35,548] INFO Rolled new log segment for 'g18-10' in 2 ms. (kafka.log.Log)
[2016-03-29 18:25:35,707] INFO Partition [g18,10] on broker 2: Shrinking ISR for partition [g18,10] from 2,4 to 2 (kafka.cluster.Partition)
[2016-03-29 18:25:36,042] INFO Partition [g18,10] on broker 2: Expanding ISR for partition [g18,10] from 2 to 2,4 (kafka.cluster.Partition)
The replica's offset is larger than the leader's, so the replica's data will be deleted and then copied again from the leader.
But while it is copying, the cluster is very slow; some Storm topologies fail due to no response from Kafka.
How do I prevent this problem from occurring?
How do I slow down the replication rate while the replica is catching up?
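If the brokers are on a version that supports replication quotas (0.10.1 or later), the catch-up traffic can be throttled with dynamic configs; a sketch with a placeholder rate and ZooKeeper address, using the broker and topic names from the logs above:
# cap replication traffic to ~10 MB/s on broker 2
kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type brokers --entity-name 2 --add-config 'leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760'
# mark which replicas the throttle applies to (here: all replicas of topic g17)
kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name g17 --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'
# remove the throttle once the replica has caught up
kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type brokers --entity-name 2 --delete-config 'leader.replication.throttled.rate,follower.replication.throttled.rate'
On older brokers the closest levers are the static server.properties settings num.replica.fetchers and replica.fetch.max.bytes, which trade replication throughput against catch-up time.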

When does Kafka change leader?

I had been running my services that work with Kafka for a year already, and no spontaneous leader changes happened.
But for the last 2 weeks it has started to happen quite often.
The Kafka log for that:
[2015-09-27 15:35:14,826] INFO [ReplicaFetcherManager on broker 2]
Removed fetcher for partitions [myTopic] (kafka.server.ReplicaFetcherManager)
[2015-09-27 15:35:14,830] INFO Truncating log myTopic-0 to offset 11520979. (kafka.log.Log)
[2015-09-27 15:35:14,845] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 713276 from client ReplicaFetcherThread-0-2 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:14,857] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 256685 from client mirrormaker-1 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:20,171] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic,0] (kafka.server.ReplicaFetcherManager)
What can cause a leader switch? If there is information about this in the Kafka documentation, please just point me to the link; I've failed to find it.
System configuration
kafka version: kafka_2.10-0.8.2.1
os: Red Hat Enterprise Linux Server release 6.5 (Santiago)
server.properties (differs from default):
broker.id=001
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.bytes=-1
controlled.shutdown.enable=true
auto.create.topics.enable=false
It appears that the lead broker is down for that partition. It might be that the data directory (log.dirs) configured in server.properties is out of space and the broker cannot accommodate new data.
Also, what is the replication factor of the topic and the size of the broker cluster?
I am assuming you have one topic and one partition with a replication factor of 2, which is not a good configuration for optimal Kafka performance and consumers.
Your logs are not clear enough to explain the leader switch. The major issue with your topic may be that it has only one leader because it has only one partition, and that single log is getting bigger day by day. Kafka internally does rebalancing at some level (details not confirmed); that could be the reason for your leader switch, but I am not sure.
Also, your 2nd log line says some of the logs are truncated. Can you please go through the logs in detail and check whether this happens only after truncation?
As you mentioned, you already checked your Kafka log directory files and their sizes. Please run the describe command when you hit this issue; the leader switch will be reflected there as well. Or, if you can, set up a dashboard that displays the leader over time. Then it will be easier to find the root cause.
bin/kafka-topics.sh --describe --zookeeper Zookeeperhost:Port --topic TopicName
Suggestion: I would suggest creating a new topic with more partitions (read the Kafka documentation to get a good idea of the optimum number of partitions) and start writing to it. Or you can look into how to change the number of partitions for the current topic.
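A sketch of both options (host, topic names, partition count, and replication factor are placeholders); note that adding partitions to an existing topic changes the key-to-partition mapping for keyed messages:
# create a new topic with more partitions and a higher replication factor
bin/kafka-topics.sh --create --zookeeper Zookeeperhost:Port --topic NewTopicName --partitions 6 --replication-factor 2
# or add partitions to the existing topic (the count can only be increased, never reduced)
bin/kafka-topics.sh --alter --zookeeper Zookeeperhost:Port --topic TopicName --partitions 6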
Last thing: is the leader switch causing issues for your clients, or are you only worried about the warnings?