Kafka - Compact and Time Based Retention - apache-kafka

I have tried to create a Kafka topic configuration that uses both compaction and deletion, to achieve the following:
Within the retention period, retain the latest version of each key
After the retention period, remove any message older than the retention threshold
For this, I have tried the following topic-specific config:
cleanup.policy=[compact,delete]
retention.ms=864000000 (10 days)
min.compaction.lag.ms=3600000 (1 hour)
min.cleanable.dirty.ratio=0.1
segment.ms=3600000 (1 hour)
The broker configuration is as follows:
log.retention.hours=168 (7 days)
log.segment.bytes=1.1gb
log.cleanup.policy=delete
delete.retention.ms=86400000 (1 day)
When I set retention.ms to a smaller value in test (e.g. 20 minutes or 1 hour), adjusting only retention.ms on the topic, I can see the data being pruned correctly after the retention period.
I can see that the data is being compacted as expected, but if I read the topic from the beginning after the 10-day retention period, data much older than 10 days is still there. Is this a problem with such a long retention period?
Am I missing any configuration here? I have checked the Kafka logs and can see the broker rolling the segments and compacting as expected, but I can't see anything about deletes.
Kafka version is 5.1.2-1.

It might be that your topic and broker configurations conflict. Topic-level settings take precedence over the broker-level defaults, so it is worth checking which values are actually in effect for the topic.
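As a rough illustration of that precedence, the sketch below resolves the retention value in effect for a topic: a topic-level retention.ms wins if present, otherwise the broker default applies. The function name and the dictionary layout are hypothetical, and the model covers only these two settings, not Kafka's full configuration resolution.

```python
HOUR_MS = 60 * 60 * 1000


def effective_retention_ms(topic_config, broker_config):
    """Return the retention (in ms) in effect for a topic.

    Simplified model: a topic-level retention.ms overrides the broker's
    log.retention.hours default. Other broker settings are ignored here.
    """
    if "retention.ms" in topic_config:
        return int(topic_config["retention.ms"])
    return int(broker_config["log.retention.hours"]) * HOUR_MS


# Values from the question: 10 days on the topic, 7 days (168 h) on the broker.
topic = {"retention.ms": "864000000"}
broker = {"log.retention.hours": "168"}

print(effective_retention_ms(topic, broker))  # topic override applies
print(effective_retention_ms({}, broker))     # no override: broker default applies
```

With the values from the question, the topic override (10 days) is the one in effect, not the broker's 7 days.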

Related

How does Kafka handle a situation where retention period expires while a consumer offset is within the segment file?

I want to know how Kafka would handle this situation. A consumer has come across a poison pill message and is not committing past it. No one notices for a long time (15 days). The retention period on the topic is 7 days. Let's say that this poison pill is in a log segment file that has satisfied the requirements to be deleted by the retention period.
What happens?
Does Kafka allow this log segment file to be deleted while a consumer is actively trying to read from it?
Does Kafka delete the log segment file and leave the consumer scrambling to figure out where to start reading from, using the auto.offset.reset setting?
It'll be option 2: Kafka deletes the segment regardless of where consumers are positioned. You can find logs on the consumer instances indicating that it is seeking to the beginning or end; if auto.offset.reset=none, the consumer fails instead, reporting that the offset is out of range.
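The consumer-side outcome can be sketched as follows. This is a simplified model, not the client's actual code: the function name and the representation of the log as a start/end offset pair are assumptions, while earliest, latest and none are the documented values of auto.offset.reset.

```python
def resolve_position(committed, log_start, log_end, auto_offset_reset):
    """Pick the offset a consumer resumes from (simplified model).

    committed: last committed offset for the partition
    log_start: first offset still on disk after retention ran
    log_end:   next offset to be written
    """
    if log_start <= committed <= log_end:
        return committed  # committed offset still exists, resume there
    # The committed offset was removed by retention: it is out of range.
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise RuntimeError(
        f"offset {committed} out of range [{log_start}, {log_end}] "
        "and auto.offset.reset=none"
    )


# Poison pill at offset 100; retention has since removed everything below 5000.
print(resolve_position(100, 5000, 9000, "earliest"))  # 5000
print(resolve_position(100, 5000, 9000, "latest"))    # 9000
```

In both reset modes the consumer silently skips the poison pill along with every other deleted message, which is why the none setting is sometimes preferred when data loss must be noticed.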

When Kafka purges messages

I have an Apache Kafka cluster with the retention policy set to delete and the retention period set to 24 hours.
I then changed the retention period dynamically, setting it to 1 minute for one specific topic. But the old messages are still there, so I have several questions:
What is the trigger point for retention? I assume that even with an explicit time to live set for messages, it is not guaranteed that they will be deleted exactly after this time. So what is the process? (I can't find anything in the reference.)
If I change the retention period at runtime, will the old messages obey it? As far as I understand, the retention period is a topic-wide property and should also apply to messages that were published under the first retention period.
On each broker, partitions are divided into log segments. By default a segment stores up to 1 GB of data (log.segment.bytes). In addition, a new log segment is rolled every 7 days by default (log.roll.hours).
Each broker schedules a cleaner thread that periodically checks which segments are eligible for deletion. By default, the cleaner thread runs this check every 5 minutes (configurable through the broker config log.retention.check.interval.ms).
A segment is removable if the most recent message within it is older than the configured retention period. In addition, the active log segment (the one the broker is currently writing to) can't be deleted.
To have a log segment removed as soon as possible, configure log rolling in line with your retention period. For example, if your retention period is 24 hours, it could be a good idea to set log.roll.hours to 1.
Note that segment deletion can happen at different times on each broker, because the cleaner threads on different brokers are not synchronized with each other.
Check the topic-specific configuration with the kafka-configs script, for example:
./bin/kafka-configs --describe --zookeeper localhost:2181 --entity-type topics --entity-name __consumer_offsets
The retention policy is applied to closed segments only. If your segment is still active, the data in it won't be purged until it is closed and a new segment is opened.
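The eligibility rule as stated in these answers can be modelled in a few lines. This is only a sketch of the rule as described here (closed segment, newest message older than the retention period), not the broker's implementation; the Segment class and its field names are hypothetical.

```python
from dataclasses import dataclass

MINUTE_MS = 60 * 1000


@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int  # timestamp of the newest message in the segment
    active: bool = False   # True for the segment currently being written


def deletable_segments(segments, now_ms, retention_ms):
    """Return closed segments whose newest message is older than retention."""
    return [
        seg for seg in segments
        if not seg.active and now_ms - seg.max_timestamp_ms > retention_ms
    ]


now = 10 * MINUTE_MS
segments = [
    Segment(base_offset=0, max_timestamp_ms=1 * MINUTE_MS),
    Segment(base_offset=100, max_timestamp_ms=5 * MINUTE_MS),
    Segment(base_offset=200, max_timestamp_ms=6 * MINUTE_MS, active=True),
]

# Retention lowered to 1 minute: both closed segments qualify, but the
# active one is kept even though its data is already 4 minutes old.
doomed = deletable_segments(segments, now, retention_ms=1 * MINUTE_MS)
print([seg.base_offset for seg in doomed])  # [0, 100]
```

In this model, lowering the retention period does apply to previously written messages, but only once the segment that holds them has been closed and the next periodic check has run.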

Kafka topic with very low retention time

Are there any significant disadvantages if I set the retention time of a certain topic to, let's say, 10 minutes?
There should be no disadvantage as such, since retention runs as a background process. Keep in mind, though, that in Kafka partitions are split into segments, and a new segment is rolled when the configured time or size is reached.
Kafka will not delete an active segment, so depending on your config and data load it may or may not delete a segment when you expect. For the desired result, check the following broker configs as well:
Log retention check frequency - log.retention.check.interval.ms
Segment roll time - log.roll.ms
Delay before a deleted segment's files are removed from disk - log.segment.delete.delay.ms
Log cleaner configs
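To see why data load matters, here is a minimal sketch of the "time or size" roll condition. It is a simplification, since the broker evaluates rolling when messages are appended and applies further conditions. The function name is hypothetical; the defaults used are log.segment.bytes=1073741824 and log.roll.hours=168.

```python
MINUTE_MS = 60 * 1000
HOUR_MS = 60 * MINUTE_MS

DEFAULT_SEGMENT_BYTES = 1073741824  # log.segment.bytes default (1 GiB)
DEFAULT_ROLL_MS = 168 * HOUR_MS     # log.roll.hours default (7 days)


def should_roll(segment_bytes, segment_age_ms,
                max_bytes=DEFAULT_SEGMENT_BYTES, roll_ms=DEFAULT_ROLL_MS):
    """True once the active segment has hit its size or age limit."""
    return segment_bytes >= max_bytes or segment_age_ms >= roll_ms


# A low-volume topic: 5 MB written over the last 10 minutes.
size = 5 * 1024 * 1024
age = 10 * MINUTE_MS

print(should_roll(size, age))                         # False
print(should_roll(size, age, roll_ms=5 * MINUTE_MS))  # True
```

With the defaults, the segment stays active and a 10-minute retention has nothing it can delete; with a shorter roll time the segment closes and becomes a candidate for deletion.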

Mystery about Kafka's retention period

I have a topic on our production cluster with a retention period of 432000000 ms, i.e. 5 days. But its earliest messages usually have timestamps from 10 days ago! For example, today, on 22 March, I checked the data in that topic using the console consumer. The first record had a timestamp of 12 March. The data went into the topic at nearly the same time it was generated, so there is no difference between the timestamp in the log and the time it was actually queued. So how can Kafka be storing messages well past the configured retention period?
The retention settings are lower bounds.
In your example it means Kafka will not delete any messages that are less than 5 days old.
The logs on disk are split into several segments. Kafka only deletes whole segments and does not touch the latest (active) segment. So for a segment to be deleted, the last message in it has to be older than 5 days and it must not be the latest segment.
By default, Kafka rolls a new segment only when the current one is older than 7 days (log.roll.hours=168) or has reached its maximum size (log.segment.bytes=1GB).
So it looks like you have not produced enough data to roll a new segment because of size. I suggest reducing log.roll.hours to force new segments to be created more frequently.
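The arithmetic behind this can be made explicit. Under the rules described above, and assuming the topic never fills a segment by size, the oldest message in a segment can wait a full roll period for the segment to close, then a full retention period for the segment's last message to expire, then up to one check interval. The function below is a hypothetical helper that just adds those three terms.

```python
MINUTE_MS = 60 * 1000
DAY_MS = 24 * 60 * MINUTE_MS


def max_message_age_ms(retention_ms, roll_ms, check_interval_ms=5 * MINUTE_MS):
    """Upper bound on message age when segments only roll by time."""
    return roll_ms + retention_ms + check_interval_ms


# Settings from the question: 5 days retention, default 7-day roll.
bound = max_message_age_ms(retention_ms=5 * DAY_MS, roll_ms=7 * DAY_MS)
observed = 10 * DAY_MS  # earliest record seen was 10 days old

print(bound // DAY_MS)    # 12 (whole days)
print(observed <= bound)  # True
```

A 10-day-old record is therefore within what the defaults allow. Reducing the roll time to one hour brings the bound down to just over 5 days.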

What is the significance of the log retention period in Kafka?

If I have the log retention period set to 2 hours for a partition, will only the consumed messages be purged after 2 hours, or all messages, whether consumed or not?
Once the retention period is over, all messages are discarded, whether consumed or not. Here is a brief note from the official documentation:
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.