We have messages disappearing from topics on Apache Kafka, versions 2.3, 2.4.0, 2.4.1 and 2.5.0. We notice this when we perform a rolling deployment of our clusters, and unfortunately it doesn't happen every time, so it's very inconsistent.
Sometimes we lose all messages inside a topic, other times we lose all messages inside a partition. When this happens, the following log line is always present:
[2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5, dir=/var/kafkadata/data01/data] Deleting segments List(LogSegment(baseOffset=6, size=728, lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
There is also an earlier log entry saying this segment hit the retention time breach of 24 hours. In this example, the message was produced ~12 minutes before the deployment.
Notice that all messages that are wrongly deleted have largestTime=0, while the ones that are properly deleted have a valid timestamp there. From what we read in the documentation and code, it looks like largestTime is used to calculate whether a given segment has reached the time breach or not.
Since we can observe this in multiple versions of Kafka, we think this might be related to something external to Kafka, e.g. ZooKeeper.
Does anyone have any idea why this could be happening? We are using ZooKeeper 3.6.0.
We found out that the cause was not related to Kafka itself but to the volume where we stored the logs. Still, the following explanation might be useful for educational purposes:
In detail, it was a permission problem where Kafka was not able to read the .timeindex files when the log cleaner was triggered. This caused largestTime to be 0 and led to some messages being deleted way before the retention time.
Each topic partition is divided into several segments, which are stored in separate .log files that contain the actual messages. For each .log file there is a .timeindex file containing a mapping between offsets and timestamps.
When Kafka needs to check whether a segment is deletable, it looks up the most recent timestamp in the segment and stores it as largestTime. Then it checks whether the retention limit has been reached: currentTime - largestTime > retentionTime.
If so, it deletes the segment and the respective messages.
Since Kafka was not able to read the file, largestTime was 0 and the check reduced to currentTime > retentionTime, which was always true for our 1-day retention.
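For illustration only, here is a minimal sketch of the check described above (not the real Kafka code); the method and parameter names are assumptions based on the logs and source we read:

public class RetentionCheckSketch {
    // Sketch of the retention check described above (not the actual Kafka code).
    // largestTime normally comes from the segment's .timeindex; when that file
    // cannot be read, it ends up as 0 and the segment looks infinitely old.
    static boolean retentionBreached(long largestTime, long retentionMs) {
        long now = System.currentTimeMillis();
        return now - largestTime > retentionMs;
    }

    public static void main(String[] args) {
        long retentionMs = 24L * 60 * 60 * 1000;                                // 1-day retention
        long recentTimestamp = System.currentTimeMillis() - 12 * 60 * 1000;     // produced ~12 minutes ago
        System.out.println(retentionBreached(recentTimestamp, retentionMs));    // false: segment is kept
        System.out.println(retentionBreached(0L, retentionMs));                 // true: segment deleted immediately
    }
}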
Ensure the date is synced between all Kafka brokers and ZooKeeper nodes.
Run the date command on each node and compare year, day, hour and minute.
Related
I want to know how Kafka would handle this situation: a consumer has come across a poison-pill message and is not committing past it. No one notices for a long time (15 days), but the retention period on the topic is 7 days. Let's say that this poison pill is in a log segment file that has satisfied the requirements to be deleted by the retention period.
What happens?
Does Kafka allow this log segment file to be deleted while a consumer is actively trying to read from it?
Does Kafka delete the log segment file and leave the consumer scrambling to figure out where to start reading from, using the auto.offset.reset setting?
It'll be option 2. You can find logs on the consumer instances indicating that it's seeking to the beginning/end, or it will fail with an "offset out of range" error if auto.offset.reset is set to none.
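As a rough sketch of where that setting comes into play (the broker address, group id and topic name below are placeholders), a consumer configured with auto.offset.reset=none surfaces the deleted-segment case as an out-of-range error instead of silently seeking:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;

public class OffsetResetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stuck-consumer-group");    // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // "earliest"/"latest": the consumer silently seeks when its committed offset is gone.
        // "none": poll() fails instead of seeking, which makes the data loss visible.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            System.out.println("Fetched " + records.count() + " records");
        } catch (OffsetOutOfRangeException e) {
            // This is where the "committed offset points into a deleted segment" case can surface.
            System.err.println("Committed offset no longer exists in the log: " + e.getMessage());
        }
    }
}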
Kafka 0.11.0.0 has been running in production. We see that log compaction of the consumer offsets topic is not happening: in the consumer offset partitions, we see log segments remaining there from the last 3 months. The log cleaner logs showed that it failed to build the map for compaction due to a CorruptRecordException.
Since there were a lot of segment files, each 100 MB in size, in the partitions, instead of running DumpLogSegments and finding the bad segment, we decided to go ahead and delete the old segment files and keep only the ones from the last 3 days. After this, we restarted Kafka and it seemed to work fine.
But within 2 days of doing this, we are seeing the logs build up again, just as they did before. We no longer see a CorruptRecordException in the logs, but the offsets are still not getting compacted, and it's been 7 days since then.
None of the default values for compaction or retention were changed, and preallocate is also set to false. Can anybody give me any insight into what could be going on here?
Edit:
The CorruptRecordException that I was running into seems to originate from AbstractLegacyRecordBatch.java:
long offset = offsetAndSizeBuffer.getLong(Records.OFFSET_OFFSET);
int size = offsetAndSizeBuffer.getInt(Records.SIZE_OFFSET);
if (size < LegacyRecord.RECORD_OVERHEAD_V0)
    throw new CorruptRecordException(String.format("Record size is less than the minimum record overhead (%d)", LegacyRecord.RECORD_OVERHEAD_V0));
Any idea when this can occur, and why the compaction is not happening even after the old segments were deleted?
One topic has 20 partitions, and almost every partition has more than 20,000 log segment files, most of them created months ago. Even after I set retention.ms to a very short value, the segments are not deleted, while other topics recycle segments normally.
I am wondering what the issue is and how to solve it, because I worry that the total number of segments will keep growing until it exceeds the OS vm.max_map_count, which would damage the Kafka process itself. The following image shows the describe output for the abnormal topic.
Not sure what the issue is exactly, but some things to consider:
Broker vs topic-specific configs. Check to make sure your topic actually has the configs you think it has, and is not inheriting them from the broker settings (see the AdminClient sketch at the end of this answer).
Configs related to retention. As mentioned by Girogos Myrianthous, you can look at log.retention.check.interval.ms and log.cleanup.policy. I would also look at the roll-related settings, like log.roll.hours. I believe that in some cases Kafka will not delete a segment until its partition rolls, even if the segment is old. And rolling follows this behavior:
The log rolling time is no longer depending on log segment create time. Instead it is now based on the timestamp in the messages. More specifically, if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms (http://kafka.apache.org/20/documentation.html)
So make sure to consider the record timestamps, not just the segment files' age.
Finally:
What version of Kafka are you using?
Have you looked carefully at the broker logs? Broker logs are how I've solved all such problems that I've encountered.
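On the first point (broker vs topic configs), one way to see which values are actually in effect and where they come from is the AdminClient; the following is only a sketch, with the broker address and topic name as placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowTopicConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder topic
            Config config = admin.describeConfigs(Collections.singleton(topic))
                    .all().get().get(topic);
            // source() shows whether each value is a topic override, a broker default, etc.
            config.entries().forEach(entry ->
                    System.out.printf("%s = %s (source: %s)%n",
                            entry.name(), entry.value(), entry.source()));
        }
    }
}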
We have a Kafka cluster for a Kafka Streams application.
After some hours our broker went down and we got an OutOfMemory exception.
We saw that vm.max_map_count was not high enough and that the number of memory maps for the process was above 40K.
Can someone explain what the problem could be, or what influences that parameter?
The number always increases and never goes down.
Based on the pull request at https://github.com/apache/kafka/pull/4358/files (both the change being proposed and the comments reacting to it), it appears that each log segment (i.e. file) in each partition on each topic on the broker consumes two maps.
I would expect the value to rise until you reach a steady-state where all topics have logs that are old enough to start being deleted due to the retention interval. At that point, each new file would be expected to occur at around the same time as an older one is deleted (assuming roughly constant message rates). I would expect the value to drop if topics were deleted or if you changed the configuration of an existing topic or the full broker (e.g. reduce the log retention time or cause the logs to roll over less frequently), and to go up if you change the configuration in the opposite direction.
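To put rough, purely hypothetical numbers on that: a broker hosting 1,000 partitions with 50 segments each would need about 1,000 × 50 × 2 = 100,000 maps, which is already above the common Linux default vm.max_map_count of 65530. So a broker that stops deleting old segments, or that rolls segments very aggressively, can hit this limit long before disk space becomes a problem.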
I set
log.retention.minutes=8
in server.properties to clean up data under kafka-logs/ every 8 minutes automatically.
Is it possible to have the cleaner delete only data that has already been consumed, so that data not yet consumed by a consumer is retained?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the minimum time a message is kept. It is possible for a message in a topic with a retention time of minutes to last for weeks (depending on other configuration settings).
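For reference, these are the broker settings that control the roll and retention behaviour described above (the values below are arbitrary examples, not recommendations):

# Segments roll when either limit is reached; a rolled segment is immutable.
log.segment.bytes=1073741824
log.roll.hours=168
# Retention deletes whole, already-rolled segments, checked periodically.
log.retention.minutes=8
log.retention.check.interval.ms=300000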
The concept of "consumer offsets" is the mechanism Kafka uses to avoid reconsumption of messages. Kafka 0.11 will also include exactly-once capabilities.