I have a topic on our production cluster with a retention period of 432000000 ms, i.e. 5 days. Yet the earliest messages it holds usually carry timestamps from 10 days ago. For example, today, on 22nd March, I checked the data in that topic using the console consumer command, and the first record had a timestamp of 12th March. The data went into the topic at nearly the same time it was generated, so there is no difference between the timestamp in the log and the actual time it was queued. How can Kafka be storing messages well past the configured retention period?
The retention settings are lower bound limits.
In your example it means Kafka will not delete any messages that are less than 5 days old.
The logs on disk are split up in several segments. Kafka only performs deletion on full segments and does not touch the latest (active) segment. So in order for a segment to be deleted, the last message in it has to be older than 5 days and it must not be the latest segment.
By default, Kafka only rolls new segments if they are older than 7 days (log.roll.hours=168) or if they reach their max size (log.segment.bytes=1GB).
So it looks like you've not produced enough data to roll a new segment because of size. I suggest reducing log.roll.hours to force new segments to be created more frequently.
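To make the segment rule concrete, here is a minimal Python sketch of the time-based check described above. It is a simplified model, not Kafka's code: the Segment class, the segment names and the timeline are hypothetical and only chosen to match the dates in the question (a segment opened on 12th March that rolled after 7 days, checked on 22nd March).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str                     # hypothetical label, not a real file name
    last_message_age_days: float  # age of the newest record in the segment
    active: bool = False          # True for the segment still being written to


def deletable_segments(segments, retention_days):
    """Names of segments that time-based retention may delete.

    A segment qualifies only if it is not the active one and even its
    newest record is older than the retention period.
    """
    return [
        s.name
        for s in segments
        if not s.active and s.last_message_age_days > retention_days
    ]


RETENTION_MS = 432_000_000
retention_days = RETENTION_MS / (24 * 60 * 60 * 1000)  # 5.0 days

# Assumed timeline: first segment holds 12th to 19th March, so on 22nd March
# its newest record is 3 days old although its oldest record is 10 days old.
segments = [
    Segment("segment-1", last_message_age_days=3.0),
    Segment("segment-2", last_message_age_days=0.0, active=True),
]

print(retention_days)                                # 5.0
print(deletable_segments(segments, retention_days))  # []
```

Under these assumptions nothing is deletable yet, which is why the 10-day-old records are still readable. With a shorter roll interval the old records would sit in their own closed segments and age out sooner.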
Related
I have tried creating a Kafka topic configuration that uses compaction and deletion, to achieve the following:
Within the retention period, retain the latest version of the key
After the retention period, any message older than the timestamp to be removed
For this, I have tried the following topic specific config:
cleanup.policy=[compact,delete]
retention.ms=864000000 (10 days)
min.compaction.lag.ms=3600000 (1 hour)
min.cleanable.dirty.ratio=0.1
segment.ms=3600000 (1 hour)
The broker configuration is as following:
log.retention.hours=7 days
log.segment.bytes=1.1gb
log.cleanup.policy=delete
delete.retention.ms=1 day
When I set this to a smaller amount in test, e.g. 20mins, 1hr etc, I can correctly see the data is pruned after the retention period, only adjusting retention.ms on the topic.
I can see that the data is correctly being compacted as expected, but after the 10 day retention period if I read the topic from the beginning, data much older than 10 days is still there. Is this a problem with such a long retention period?
Am I missing any configuration here? I have checked the kafka logs and see the broker is rolling the segments and compacting as expected, but can't see anything about deletes?
Kafka Version is
5.1.2-1
Your topic and broker configurations may be overriding each other. Where both define the same setting, the topic-level value takes precedence over the broker-level default, so check which value is actually in effect for the topic.
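As far as I know, a topic-level setting wins over the broker-level default that backs it. The sketch below models that lookup in Python for three of the settings from the question. The function name and the dictionaries are hypothetical, and the broker values are written in milliseconds purely for illustration.

```python
# Topic-level setting -> broker-level setting that supplies its default.
TOPIC_TO_BROKER = {
    "cleanup.policy": "log.cleanup.policy",
    "retention.ms": "log.retention.ms",
    "delete.retention.ms": "log.cleaner.delete.retention.ms",
}


def effective_topic_config(topic_overrides, broker_config):
    """Resolve each setting: topic override first, broker default otherwise."""
    effective = {}
    for topic_key, broker_key in TOPIC_TO_BROKER.items():
        if topic_key in topic_overrides:
            effective[topic_key] = topic_overrides[topic_key]
        else:
            effective[topic_key] = broker_config.get(broker_key)
    return effective


# Illustrative values modelled on the question (7 days, 1 day, 10 days in ms).
broker_config = {
    "log.cleanup.policy": "delete",
    "log.retention.ms": 604_800_000,
    "log.cleaner.delete.retention.ms": 86_400_000,
}
topic_overrides = {
    "cleanup.policy": "compact,delete",
    "retention.ms": 864_000_000,
}

print(effective_topic_config(topic_overrides, broker_config))
```

Here the topic keeps its own cleanup.policy and retention.ms, and only delete.retention.ms falls back to the broker value. On a real cluster, describing the topic's configuration is the reliable way to see what is in effect.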
I'm trying to understand how compaction and log cleanup happen if I set 'log.cleanup.policy' to both 'delete' and 'compact' at the same time while using a time-based retention policy.
Let's say our retention period is 7 days (the default) and during these 7 days my data flow follows the pattern below. Please help me understand what it looks like after 7 days.
When you use both compact and delete as the log.cleanup.policy, logs are compacted periodically in the background to retain at least the last known value for each message key within the log of a single topic partition. Compaction can be tuned with these config parameters:
log.cleaner.min.compaction.lag.ms: The minimum time a message will remain uncompacted in the log
log.cleaner.max.compaction.lag.ms: The maximum time a message will remain ineligible for compaction in the log
As you have already said, for your example:
Before compaction:
After compaction:
Logs will also be deleted after log.retention.hours period without considering whether or not it is compacted. Log retention is checked according to this parameter:
log.retention.check.interval.ms: The frequency in milliseconds that the log cleaner checks whether any log is eligible for deletion (default is 5 minutes)
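The combined effect of the two policies can be sketched in Python as below. This is a record-level simplification (Kafka actually works on whole segments), and the keys, values and ages are made-up sample data, not the data from the question.

```python
def compact(records):
    """Keep only the latest record per key, preserving log order.

    records: list of (key, value, age_days), oldest first.
    """
    latest = {}
    for offset, (key, value, age_days) in enumerate(records):
        latest[key] = (offset, key, value, age_days)  # later offsets overwrite
    return [(k, v, a) for _, k, v, a in sorted(latest.values())]


def apply_retention(records, retention_days):
    """Drop records older than the retention period, compacted or not."""
    return [r for r in records if r[2] <= retention_days]


# Hypothetical log: k4 is never updated, k1 is updated once.
log = [
    ("k4", "v1", 10),
    ("k1", "v1", 9),
    ("k1", "v2", 6),
    ("k2", "v1", 2),
]

compacted = compact(log)
print(compacted)                      # [('k4', 'v1', 10), ('k1', 'v2', 6), ('k2', 'v1', 2)]
print(apply_retention(compacted, 7))  # [('k1', 'v2', 6), ('k2', 'v1', 2)]
```

Compaction alone would keep k4 forever because it is the latest value for its key. Adding delete is what lets a key that has not been updated within the retention period disappear.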
As per my understanding of the KIP below, all keys that haven't been updated for some time are automatically expired. That's why, once the retention period has passed, we would have only the three keys that were updated during this time, along with their corresponding values.
https://issues.apache.org/jira/browse/KAFKA-4015
From Kafka Docs I got interested and tried the following 2 retention types together
log.retention.bytes:
The maximum size of the log before deleting it
Type: long
Default: -1
Valid Values:
Importance: high
Update Mode: cluster-wide
log.retention.ms
The number of milliseconds to keep a log file before deleting it (in milliseconds). If not set, the value in log.retention.minutes is used. If set to -1, no time limit is applied.
Type: long
Default: null
Valid Values:
Importance: high
Update Mode: cluster-wide
as follows:
log.retention.bytes = 1Gb
log.retention.ms = 7 days
Problem Situation
My topic's messages currently sit in two different log files, both of which are < 1 GB.
Let's say the log.1 file has 400 MB of messages, with its oldest message > 7 days old,
and it sits on top of
the log.2 file, which has 500 MB of messages, with its newest message > 7 days old.
I understand Kafka would clean up all records belonging to the log.2 file, in other words remove this log file from the topic.
What happens to the records in log.1 that are older than 7 days?
Two properties define message retention in Kafka: log.retention.bytes and log.retention.ms (applied per topic, per partition). Data removal works on a FIFO basis, i.e., the messages that were pushed to a topic first are deleted first.
The values you have set for these are:
log.retention.bytes = 1Gb (per topic per partition)
log.retention.ms = 7 days (per topic)
It means that whichever limit is breached first, would lead to data purge in Kafka.
For example, let's assume that the size of messages in your topic takes 500 MB of space (which is less than log.retention.bytes) but older than 7 days (i.e. greater than the default log.retention.ms). In this case the data older than 7 days would be purged (on FIFO basis).
Likewise, if, for a given topic, the space occupied by the messages exceeds the log.retention.bytes but are not older than log.retention.ms, in this case too, the data would be purged (on FIFO basis).
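The "whichever limit is breached first" rule can be sketched as follows. This is a simplified Python model under stated assumptions: the function name, segment names, sizes and ages are hypothetical, sizes are in MB for readability, and the size check assumes Kafka removes an old segment only if the remaining log would still be at least as large as the limit.

```python
def segments_to_delete(segments, retention_mb, retention_days):
    """Pick segments to delete, oldest first (FIFO).

    segments: list of (name, size_mb, newest_record_age_days), oldest first;
    the last entry is treated as the active segment and is never deleted.
    retention_mb < 0 means no size limit.
    """
    doomed = []
    total_mb = sum(size for _, size, _ in segments)
    for name, size, newest_age_days in segments[:-1]:
        too_old = newest_age_days > retention_days
        # Size rule: only drop a segment if the log still meets the limit afterwards.
        too_big = retention_mb >= 0 and total_mb - size >= retention_mb
        if not (too_old or too_big):
            break  # stop at the first segment that must be kept
        doomed.append(name)
        total_mb -= size
    return doomed


# Time limit breached, size limit not: only the expired segment goes.
by_time = [("seg-1", 500, 8), ("seg-2", 400, 2), ("seg-3", 100, 0)]
print(segments_to_delete(by_time, retention_mb=1024, retention_days=7))  # ['seg-1']

# Size limit breached, time limit not: the oldest segment goes anyway.
by_size = [("seg-a", 600, 3), ("seg-b", 600, 2), ("seg-c", 100, 0)]
print(segments_to_delete(by_size, retention_mb=500, retention_days=7))   # ['seg-a']
```

Note that deletion happens per segment, so a segment whose newest record is still inside the retention period keeps all of its records, including older ones.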
The concept of making data expire is called cleanup, and messages on a topic are not removed immediately after they are consumed or expired. What happens in the background is that, once either limit is breached, the messages are marked as deleted. There are 3 log cleanup policies in Kafka: DELETE (the default), COMPACT, and DELETE AND COMPACT. Log compaction is done by the Kafka Log Cleaner, a pool of background compaction threads.
To turn on compaction for a topic, use the topic config cleanup.policy=compact (log.cleanup.policy is the broker-level equivalent). To delay compaction of records after they are written, use the topic config min.compaction.lag.ms (log.cleaner.min.compaction.lag.ms at broker level). Records won’t get compacted until after this period, which gives consumers time to read every record. This could be the reason that older messages are not getting deleted immediately, so check the value of the compaction delay property.
Below links might be helpful:
https://medium.com/#sunny_81705/kafka-log-retention-and-cleanup-policies-c8d9cb7e09f8
http://cloudurable.com/blog/kafka-architecture-log-compaction/index.html
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/broker-configurations/
I'm paraphrasing here, from the relevant section of a book, Kafka - Definitive Guide. It'll most likely clear your doubt.
log.retention.bytes : This denotes the total number of bytes of messages retained per partition. So, if we have a topic with 8 partitions, and log.retention.bytes is set to 1GB, then the amount of data retained for the topic will be 8GB at most. This means if we ever choose to increase the number of partitions for a topic, total amount of data retained will also increase.
log.retention.ms : The most common configuration for how long Kafka will retain messages is by time. The default is specified in the configuration file using the log.retention.hours parameter, and it is set to 168 hours, or one week. However, there are two other parameters allowed, log.retention.minutes and log.retention.ms. All three of these specify the same configuration (the amount of time after which messages may be deleted), but the recommended parameter to use is log.retention.ms, as the smaller unit size will take precedence if more than one is specified. This will make sure that the value set for log.retention.ms is always the one used.
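The precedence between the three time settings can be sketched in a few lines of Python. The function name is hypothetical, and the config is modelled as a plain dict of already-parsed numbers.

```python
def effective_retention_ms(config):
    """Resolve time-based retention: ms beats minutes, minutes beat hours."""
    if "log.retention.ms" in config:
        return config["log.retention.ms"]
    if "log.retention.minutes" in config:
        return config["log.retention.minutes"] * 60 * 1000
    # Fall back to hours, which defaults to 168 (one week).
    return config.get("log.retention.hours", 168) * 60 * 60 * 1000


print(effective_retention_ms({}))                              # 604800000
print(effective_retention_ms({"log.retention.minutes": 8}))    # 480000
print(effective_retention_ms({"log.retention.hours": 24,
                              "log.retention.ms": 86_400_000 * 5}))  # 432000000
```

In the last call both hours and ms are set, and the ms value (5 days) is the one that applies.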
Retention By Time and Last Modified Times : Retention by time is performed by examining the last modified time (mtime) on each log segment file on disk. Under normal cluster operations, this is the time that the log segment was closed, and represents the timestamp of the last message in the file. However, when using administrative tools to move partitions between brokers, this time is not accurate and will result in excess retention for these partitions.
Configuring Retention by Size and Time : If you have specified a value for both log.retention.bytes and log.retention.ms (or another parameter for retention by time), messages may be removed when either criterion is met. For example, if log.retention.ms is set to 86400000 (1 day) and log.retention.bytes is set to 1000000000 (1 GB), it is possible for messages that are less than 1 day old to get deleted if the total volume of messages over the course of the day is greater than 1 GB. Conversely, if the volume is less than 1 GB, messages can be deleted after 1 day even if the total size of the partition is less than 1 GB.
I set the following in server.properties:
log.retention.minutes=8
to clean the data under kafka-logs/ automatically every 8 minutes.
Is it possible to have the cleaner remove only the data that has been consumed, so that data not yet consumed by a consumer is retained?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the time a message is kept at least. It is possible for a message with retention time of minutes to last for weeks (depending on other configuration settings).
The concept of "consumer offsets" is the mechanism Kafka uses to avoid reconsumption of messages. Kafka 0.11 will also contain exactly-once capabilities.
I have set the retention time, but I do not know the time of insert. Can we find out when the first/oldest message in a topic/partition will be deleted?
You can look at the creation time of the data files on kafka brokers, and they get deleted at this time + retention time.
For example, if you see a file dated from 12 hours ago with 20 hours retention, it'll get deleted in 8 hours.
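The same arithmetic as a small Python sketch. The function name and the fixed "now" are hypothetical, and the result is only an estimate: the actual removal happens at the broker's next retention check after that point, and never for the active segment.

```python
from datetime import datetime, timedelta


def expected_deletion_time(file_time, retention):
    """Earliest time the file becomes eligible for deletion."""
    return file_time + retention


now = datetime(2024, 1, 1, 12, 0)      # arbitrary reference time for the example
file_time = now - timedelta(hours=12)  # file dated 12 hours ago
retention = timedelta(hours=20)

remaining = expected_deletion_time(file_time, retention) - now
print(remaining)  # 8:00:00
```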