I have set the retention time, but I do not know when the messages were inserted. Is there a way to know when the first/oldest message in a topic/partition will be deleted?
You can look at the creation time of the data files on the Kafka brokers; each file gets deleted at that time plus the retention time.
For example, if you see a file dated from 12 hours ago with 20 hours retention, it'll get deleted in 8 hours.
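The arithmetic in that example can be sketched as follows. This is only an illustration, assuming the file's timestamp is a fair proxy for when the retention clock starts; the function name time_until_deletion is hypothetical and not part of Kafka.

```python
from datetime import datetime, timedelta

def time_until_deletion(file_time: datetime, retention: timedelta, now: datetime) -> timedelta:
    """Return how long until a data file is expected to be deleted.

    Assumption: deletion happens at file_time + retention. A result of
    zero means the file is already eligible for deletion.
    """
    remaining = (file_time + retention) - now
    return max(remaining, timedelta(0))

now = datetime(2024, 1, 1, 12, 0, 0)       # arbitrary reference time
file_time = now - timedelta(hours=12)      # file dated 12 hours ago
remaining = time_until_deletion(file_time, timedelta(hours=20), now)
print(remaining)  # 8:00:00
```

The broker only checks retention periodically, so the actual deletion may happen somewhat later than the computed time.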
My Kafka topic's retention period is 7 days, but I need to push data that is about to expire because of the retention period to a new Kafka topic or some other storage.
So is there any way to access the data that is going to be deleted after 7 days just before it gets deleted? Or is there a way to set up a process that automatically pushes the data that is about to be deleted somewhere else?
Since Kafka 0.10, each message has a timestamp. Simply set up a consumer group that starts every hour, processes each topic partition from the initial offset (auto.offset.reset=earliest), and pushes to the new topic the messages whose timestamps put them within one hour of expiring. The consumer group then stops and is restarted one hour later.
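The selection step of such an hourly job might look like the sketch below. It only shows the timestamp filter; reading from Kafka and producing to the new topic are left out, and the Record type and expiring_soon function are hypothetical names used for illustration.

```python
from typing import Iterable, List, NamedTuple

class Record(NamedTuple):
    offset: int
    timestamp_ms: int   # message timestamp (available since Kafka 0.10)
    value: str

def expiring_soon(records: Iterable[Record], retention_ms: int,
                  now_ms: int, window_ms: int) -> List[Record]:
    """Select records whose age will reach retention_ms within the next window_ms."""
    selected = []
    for r in records:
        expires_at = r.timestamp_ms + retention_ms
        if now_ms <= expires_at < now_ms + window_ms:
            selected.append(r)
    return selected

HOUR = 3_600_000
DAY = 24 * HOUR
now = 10 * DAY                      # arbitrary "current time" in ms
records = [
    Record(0, now - 7 * DAY + HOUR // 2, "expires in 30 minutes"),
    Record(1, now - 6 * DAY, "expires in one day"),
    Record(2, now - 7 * DAY - HOUR, "already past retention"),
]
to_archive = expiring_soon(records, retention_ms=7 * DAY, now_ms=now, window_ms=HOUR)
print([r.offset for r in to_archive])  # [0]
```

Because retention is a lower bound, a record that is already past the retention period may still be readable; whether to archive those as well is a design choice this sketch does not make.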
I have tried creating a Kafka topic configuration that uses both compaction and deletion, to achieve the following:
Within the retention period, retain the latest version of each key
After the retention period, remove any message older than the retention period
For this, I have tried the following topic-specific config:
cleanup.policy=[compact,delete]
retention.ms=864000000 (10 days)
min.compaction.lag.ms=3600000 (1 hour)
min.cleanable.dirty.ratio=0.1
segment.ms=3600000 (1 hour)
The broker configuration is as follows:
log.retention.hours=7 days
log.segment.bytes=1.1gb
log.cleanup.policy=delete
delete.retention.ms=1 day
When I set this to a smaller amount in a test, e.g. 20 minutes or 1 hour, I can see that the data is correctly pruned after the retention period, adjusting only retention.ms on the topic.
I can see that the data is being compacted as expected, but after the 10-day retention period, if I read the topic from the beginning, data much older than 10 days is still there. Is this a problem with such a long retention period?
Am I missing any configuration here? I have checked the Kafka logs and can see the broker rolling the segments and compacting as expected, but I can't see anything about deletes.
Kafka Version is
5.1.2-1
It might be the case that your topic and broker configurations override each other, so that only the one with higher precedence is actually applied.
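As a rough way to reason about which value ends up being used, the sketch below models the precedence under the assumption that a topic-level setting wins over the corresponding broker-level default whenever both are set. The helper effective_topic_config is hypothetical, the name mapping covers only a few settings, and the broker's 7-day retention from the question is written as log.retention.ms purely for simplicity.

```python
# Topic-level name -> broker-level default name (small subset for illustration).
BROKER_DEFAULT_FOR = {
    "cleanup.policy": "log.cleanup.policy",
    "retention.ms": "log.retention.ms",
    "segment.bytes": "log.segment.bytes",
}

def effective_topic_config(broker: dict, topic: dict) -> dict:
    """Resolve each topic setting: topic override first, then broker default."""
    effective = {}
    for topic_key, broker_key in BROKER_DEFAULT_FOR.items():
        if topic_key in topic:
            effective[topic_key] = topic[topic_key]
        elif broker_key in broker:
            effective[topic_key] = broker[broker_key]
    return effective

# Values taken from the question; 7 days = 604800000 ms.
broker = {"log.cleanup.policy": "delete", "log.retention.ms": "604800000"}
topic = {"cleanup.policy": "compact,delete", "retention.ms": "864000000"}
resolved = effective_topic_config(broker, topic)
print(resolved)
# {'cleanup.policy': 'compact,delete', 'retention.ms': '864000000'}
```

Under this model, the topic in the question would use the 10-day retention and the combined policy, not the broker's values.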
I'm trying to understand how compaction and log cleanup happen if I set 'log.cleanup.policy' to both 'delete' and 'compact' at the same time while using a time-based retention policy.
Let's say our retention period is 7 days (the default) and during these 7 days I have the pattern below for my data flow. Please help me understand what the log looks like after 7 days.
When you use both compact and delete as the log.cleanup.policy, logs are compacted periodically in the background to retain at least the last known value for each message key within the log of a single topic partition. Compaction can be configured with these parameters:
log.cleaner.min.compaction.lag.ms: The minimum time a message will remain uncompacted in the log
log.cleaner.max.compaction.lag.ms: The maximum time a message will remain ineligible for compaction in the log
As you have already said, for your example:
Before compaction:
After compaction:
Logs are also deleted once the log.retention.hours period has passed, regardless of whether they have been compacted. How often retention is checked is controlled by this parameter:
log.retention.check.interval.ms: The frequency in milliseconds that the log cleaner checks whether any log is eligible for deletion (default is 5 minutes)
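To make the combined behaviour concrete, here is a small simulation with made-up data (not the data from the question). It is a simplified model working on individual messages: compaction keeps the latest value per key, and deletion then drops anything older than the retention period whether or not it survived compaction. Real brokers operate on whole segments, so actual timing will differ; Msg, compact and delete_expired are hypothetical names.

```python
from typing import List, NamedTuple

class Msg(NamedTuple):
    offset: int
    key: str
    value: str
    timestamp_ms: int

def compact(log: List[Msg]) -> List[Msg]:
    """Keep only the latest message per key, preserving offset order."""
    latest = {}
    for m in log:
        latest[m.key] = m          # later offsets overwrite earlier ones
    return sorted(latest.values(), key=lambda m: m.offset)

def delete_expired(log: List[Msg], retention_ms: int, now_ms: int) -> List[Msg]:
    """Drop messages older than the retention period, compacted or not."""
    return [m for m in log if now_ms - m.timestamp_ms <= retention_ms]

DAY = 86_400_000
now = 10 * DAY
log = [
    Msg(0, "k1", "a", now - 9 * DAY),
    Msg(1, "k2", "b", now - 8 * DAY),   # latest value for k2, but older than 7 days
    Msg(2, "k1", "c", now - 3 * DAY),
    Msg(3, "k3", "d", now - 1 * DAY),
]
after = delete_expired(compact(log), retention_ms=7 * DAY, now_ms=now)
print([(m.key, m.value) for m in after])  # [('k1', 'c'), ('k3', 'd')]
```

In this model the key k2 disappears entirely: its only value survived compaction but was older than the retention period.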
As I understand the issue linked below, all keys that have not been updated for some time are automatically expired. That is why, once the retention period has passed, we would have only three keys (those that were updated during this time) and their corresponding values.
https://issues.apache.org/jira/browse/KAFKA-4015
I have a topic configured on our production cluster with a retention period of 432000000 ms, i.e. 5 days. But its earliest messages usually have timestamps from 10 days ago! For example, today, on 22 March, I checked the data in that topic using the console consumer command, and the first record had a timestamp of 12 March. This data went into the topic at nearly the same time as it was generated, so there is no difference between the timestamp in the log and the actual time it was queued. So how can Kafka be storing messages well past the configured retention period?
The retention settings are lower bound limits.
In your example it means Kafka will not delete any messages that are less than 5 days old.
The logs on disk are split up in several segments. Kafka only performs deletion on full segments and does not touch the latest (active) segment. So in order for a segment to be deleted, the last message in it has to be older than 5 days and it must not be the latest segment.
By default, Kafka only rolls new segments if they are older than 7 days (log.roll.hours=168) or if they reach their max size (log.segment.bytes=1GB).
So it looks like you have not produced enough data to roll a new segment because of size, so I suggest reducing log.roll.hours to force new segments to be created more frequently.
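The rule described above (only closed segments are deleted, and only once their last message is older than the retention period) can be modelled as follows. This is a simplified sketch with hypothetical names and made-up segment boundaries, chosen to resemble the situation in the question.

```python
from typing import List, NamedTuple

class Segment(NamedTuple):
    base_offset: int
    first_ts_ms: int   # timestamp of the first message in the segment
    last_ts_ms: int    # timestamp of the last message in the segment

def deletable_segments(segments: List[Segment], retention_ms: int, now_ms: int) -> List[Segment]:
    """Segments are ordered oldest to newest; the last one is the active segment."""
    closed = segments[:-1]                       # never touch the active segment
    return [s for s in closed if now_ms - s.last_ts_ms > retention_ms]

def oldest_readable_ts(segments: List[Segment], retention_ms: int, now_ms: int) -> int:
    """Timestamp of the oldest message still readable after deletion runs."""
    doomed = set(deletable_segments(segments, retention_ms, now_ms))
    remaining = [s for s in segments if s not in doomed]
    return remaining[0].first_ts_ms

DAY = 86_400_000
now = 22 * DAY                                   # think "22 March"
segments = [
    Segment(0,    5 * DAY,  11 * DAY),           # last message 11 days old: deletable
    Segment(1000, 12 * DAY, 18 * DAY),           # last message 4 days old: kept
    Segment(2000, 19 * DAY, 22 * DAY),           # active segment: always kept
]
oldest = oldest_readable_ts(segments, retention_ms=5 * DAY, now_ms=now)
print((now - oldest) // DAY)  # 10
```

With 5-day retention and segments that roll roughly every 7 days, the oldest readable message here is 10 days old, because its segment also contains messages that are only 4 days old.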
In server.properties I set
log.retention.minutes=8
to clean up the data under kafka-logs/ automatically every 8 minutes.
Is it possible to have the cleaner remove only the data that has already been consumed,
so that data not yet consumed by a consumer is retained?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the minimum time a message is kept. It is possible for a message with a retention time of minutes to last for weeks (depending on other configuration settings).
The concept of "consumer offsets" is the mechanism Kafka uses to avoid reconsumption of messages. Kafka 0.11 will also contain exactly-once capabilities.