When Kafka purges messages - apache-kafka

I have an Apache Kafka cluster with the retention policy set to delete and the retention period set to 24 hours.
Then I changed the retention period dynamically and set it to 1 minute for one specific topic. But the old messages are still there, so I have several questions:
What is the trigger point for retention? I assume that even though an explicit time to live is set for messages, it is not guaranteed that they will be deleted exactly after this time. So what is the process? (I can't find anything about it in the reference.)
If I change the retention period at runtime, will the old messages obey it? As far as I understand, the retention period is a topic-wide property and should apply just as well to messages that were published under the first retention period.

On each broker the partitions are divided into log segments. By default a segment stores 1 GB of data (log.segment.bytes). In addition, a new log segment is rolled by default every 7 days (log.roll.hours).
Each broker schedules a cleaner thread which is responsible for periodically checking which segments are eligible for deletion. By default, the cleaner thread runs a check every 5 minutes (this can be configured through the broker config log.retention.check.interval.ms).
A segment is removable if the most recent message within that segment is older than the configured retention period. In addition, the active segment (the one the broker is currently writing to) can't be deleted.
In order to remove a log segment as soon as possible, you should configure log rolling in line with your retention period. For example, if your retention period is configured to 24 hours, it could be a good idea to configure log.roll.hours to 1 hour.
Note that segment deletion can happen at a different time on each broker, as the cleaner threads are scheduled independently.
You can check a specific topic's configuration with the kafka-configs script. Example:
./bin/kafka-configs --describe --zookeeper localhost:2181 --entity-type topics --entity-name __consumer_offsets
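For the original question (lowering retention dynamically), something along these lines should work; the topic name is a placeholder and the exact flags depend on your Kafka version. segment.ms is added only because retention can remove closed segments only:
# Set retention to 1 minute and roll a new segment every 5 minutes for one topic
./bin/kafka-configs --alter --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --add-config retention.ms=60000,segment.ms=300000
# Verify the override took effect
./bin/kafka-configs --describe --zookeeper localhost:2181 --entity-type topics --entity-name my-topic
Even then, messages only disappear once their segment has been closed and the next retention check has run, so expect a delay of roughly segment.ms plus log.retention.check.interval.ms on top of the retention period.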

The retention policy is applied to closed segments only. If a segment is still active, the data in it won't be purged until the segment is closed and a new one is opened.

Related

How does Kafka handle a situation where retention period expires while a consumer offset is within the segment file?

I want to know how Kafka would handle this situation. A consumer has come across a poison pill message and is not committing past it. No one notices for a long time (15 days). The retention period on the topic is 7 days. Let's say that this poison pill is in a log segment file that has satisfied the requirements to be deleted by the retention period.
What happens?
Does Kafka allow this log segment file to be deleted while a consumer is actively trying to read from it?
Does Kafka delete the log segment file and leave the consumer scrambling to figure out where to start reading from by using the auto.offset.reset setting?
It'll be option 2. You can find logs on the consumer instances indicating that the consumer is seeking to the beginning/end, or, if auto.offset.reset=none, it will fail with an error saying that the offset is out of range.
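As an illustration (topic and group names are made up), the behaviour can be controlled from the console consumer via --consumer-property:
# Fail with an "offset out of range" error instead of silently jumping
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic my-topic --group my-group --consumer-property auto.offset.reset=none
# Or jump to the oldest message that still exists on the broker
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic my-topic --group my-group --consumer-property auto.offset.reset=earliest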

Kafka - Compact and Time Based Retention

I have tried creating a Kafka topic configuration that uses compaction and deletion to achieve the following:
Within the retention period, retain the latest version of each key
After the retention period, remove any message older than the retention timestamp
For this, I have tried the following topic specific config:
cleanup.policy=[compact,delete]
retention.ms=864000000 (10 days)
min.compaction.lag.ms=3600000 (1 hour)
min.cleanable.dirty.ratio=0.1
segment.ms=3600000 (1 hour)
The broker configuration is as follows:
log.retention.hours=168 (7 days)
log.segment.bytes=1.1 GB
log.cleanup.policy=delete
delete.retention.ms=86400000 (1 day)
When I set retention.ms to a smaller amount in testing, e.g. 20 minutes or 1 hour, I can correctly see the data is pruned after the retention period, adjusting only retention.ms on the topic.
I can see that the data is being compacted as expected, but after the 10-day retention period, if I read the topic from the beginning, data much older than 10 days is still there. Is this a problem with such a long retention period?
Am I missing any configuration here? I have checked the Kafka logs and can see the broker is rolling the segments and compacting as expected, but I can't see anything about deletes.
Kafka version is 5.1.2-1
It might be the case that your topic and broker configurations override each other; the topic-level settings take precedence over the broker defaults, so only one set of values is actually applied.
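One way to check which values are actually in effect is to compare the topic overrides with the broker defaults. A rough sketch with a placeholder ZooKeeper address and topic name (the server.properties path depends on your installation):
# Topic-level overrides (these win over the broker defaults)
./bin/kafka-configs --describe --zookeeper localhost:2181 --entity-type topics --entity-name my-topic
# Broker defaults that apply when no override is set
grep -E "log\.(cleanup\.policy|retention|roll|segment)" config/server.properties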

Kafka: Messages disappearing from topics, largestTime=0

We have messages disappearing from topics on Apache Kafka versions 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we made a rolling deployment of our clusters, and unfortunately it doesn't happen every time, so it's very inconsistent.
Sometimes we lose all messages inside a topic, other times we lose all messages inside a partition. When this happens the following log is a constant:
[2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5, dir=/var/kafkadata/data01/data] Deleting segments List(LogSegment(baseOffset=6, size=728, lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
There is also a preceding log line saying this segment hit the 24-hour retention breach. In this example, the message was produced ~12 minutes before the deployment.
Notice that all segments that are wrongly deleted have largestTime=0, while the ones that are properly deleted have a valid timestamp there. From what we read in the documentation and the code, it looks like largestTime is used to calculate whether a given segment has reached the time breach or not.
Since we can observe this on multiple versions of Kafka, we think it might be related to something external to Kafka, e.g. ZooKeeper.
Does anyone have any idea why this could be happening? We are using ZooKeeper 3.6.0.
We found out that the cause was not related to Kafka itself but to the volume where we stored the logs. Still, the following explanation might be useful for educational purposes:
In detail, it was a permission problem where Kafka was not able to read the .timeindex files when the log cleaner was triggered. This caused largestTime to be 0 and led to some messages being deleted way before the retention time.
Each topic partition is divided into several segments, which are stored as .log files containing the actual messages. For each .log file there is a .timeindex file mapping timestamps to offsets.
When Kafka needs to check whether a segment is deletable, it looks up the most recent timestamp of the segment and stores it as largestTime. Then it checks whether the retention limit has been reached: currentTime - largestTime > retentionTime.
If so, it deletes the segment and the respective messages.
Since Kafka was not able to read the file, largestTime was 0, the check degenerated to currentTime > retentionTime, and that was always true for our 1-day retention.
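If you suspect the same problem, a rough way to check is to look at the segment files of the partition from the deletion log above; the directory and file names here are taken from that example and will differ on your brokers (the tool is kafka-dump-log.sh in plain Apache distributions):
# The broker user must be able to read the .timeindex files, otherwise largestTime falls back to 0
ls -l /var/kafkadata/data01/data/test-lost-messages-5/
# Dump the time index to see which timestamps the retention check will use
./bin/kafka-dump-log --files /var/kafkadata/data01/data/test-lost-messages-5/00000000000000000006.timeindex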
Ensure date is synced between all Kafka brokers and ZooKeeper nodes.
Bash command: date.
Compare year, day, hour and minute.
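For example, assuming SSH access and placeholder hostnames:
for host in kafka1 kafka2 kafka3 zookeeper1; do echo -n "$host: "; ssh "$host" date; done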

What happens to the data logs if I set 'log.cleanup.policy' to both 'delete' and 'compact' when I'm using a time based retention policy

I'm trying to understand how compaction and log cleanup happen if I set 'log.cleanup.policy' to both 'delete' and 'compact' at the same time while using a time based retention policy.
Let's say our retention period is 7 days (the default) and during these 7 days I have the below pattern for my data flow. Please help me understand how it looks after 7 days.
When you use both compact and delete as the log.cleanup.policy, logs will be compacted periodically in the background to retain at least the last known value for each message key within the log of data for a single topic partition. Compaction can be configured with these parameters:
log.cleaner.min.compaction.lag.ms: The minimum time a message will remain uncompacted in the log
log.cleaner.max.compaction.lag.ms: The maximum time a message will remain ineligible for compaction in the log
As you have already said, for your example:
Before compaction:
After compaction:
Logs will also be deleted after the log.retention.hours period, without considering whether or not they have been compacted. Log retention is checked according to this parameter:
log.retention.check.interval.ms: The frequency in milliseconds that the log cleaner checks whether any log is eligible for deletion (default is 5 minutes)
As per my understanding of the JIRA below, all keys that haven't been updated for some time are automatically expired. That's why we would have only three keys (those which were updated during this time) and their corresponding values after the retention period has been met.
https://issues.apache.org/jira/browse/KAFKA-4015
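As a concrete illustration, a topic combining both policies could be created roughly like this (topic name, partition/replication counts and addresses are placeholders; retention.ms matches the 7-day example above):
./bin/kafka-topics --create --zookeeper localhost:2181 --topic my-compacted-topic --partitions 3 --replication-factor 3 --config "cleanup.policy=compact,delete" --config retention.ms=604800000 --config min.compaction.lag.ms=60000
With this, records whose key has been overwritten become eligible for compaction after min.compaction.lag.ms, and whole segments older than retention.ms are deleted regardless of their keys.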

Kafka topic with very low retention time

Are there any significant disadvantages if I set the retention time of a certain topic to, let's say, 10 minutes?
It should not have any disadvantage as such, since cleanup runs as a background process, but be aware that in Kafka partitions are split into segments. A new segment is rolled over when the configured time or size is reached.
Kafka will not delete an active segment, so depending on your configuration and data load it may or may not delete a segment as soon as desired. For the desired result, please check the broker configs below as well (see the sketch after this list):
Log retention check frequency - log.retention.check.interval.ms
Segment roll time - log.roll.ms
Segment file deletion delay - log.segment.delete.delay.ms
Log cleaner configs (log.cleaner.*)
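For a 10-minute retention to behave as expected, the segment roll time has to be shorter than the retention window. A minimal sketch with placeholder names, setting both at the topic level:
# Keep messages ~10 minutes and close segments every 2 minutes so they become deletable
./bin/kafka-configs --alter --zookeeper localhost:2181 --entity-type topics --entity-name my-short-lived-topic --add-config retention.ms=600000,segment.ms=120000
The main cost of very low retention combined with frequent rolls is more, smaller segment files and a bit more cleanup work, rather than anything on the produce/consume path itself.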