If I have the log retention period set to 2 hours for a partition, then after 2 hours will only the consumed messages be purged, or will all messages be purged whether consumed or not?
Once the retention period is over, all messages will be discarded, whether they were consumed or not. Here is a brief note from the official documentation:
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
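For instance, here is a minimal sketch (the topic name "my-topic" and the broker address are illustrative) of setting retention to 2 hours on a specific topic with the AdminClient; after that window both consumed and unconsumed messages become eligible for deletion:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetTopicRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // retention.ms = 2 hours; applies to consumed and unconsumed messages alike
                ConfigEntry retention = new ConfigEntry("retention.ms", String.valueOf(2 * 60 * 60 * 1000L));
                AlterConfigOp op = new AlterConfigOp(retention, AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(op))).all().get();
            }
        }
    }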
Like we have in MQ solutions, is it possible to have a message automatically deleted in Kafka once it is consumed?
Since I don't have control over when a message will be consumed, it's not possible to define retention by time or byte size.
You can override the time-based retention configuration on a per-topic basis, and even set it to -1 so messages are never deleted by time at all. Size-based retention is unlimited by default, and you don't have to use it. That said, I am not sure Kafka is best suited for your use case, as it is meant for real-time, high-performance stream processing. As another note, you can use a compacted topic and send a tombstone message to delete a record once it has been processed, but fundamentally Kafka does not delete messages on consumption.
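As a rough sketch of the tombstone approach (the topic name "commands" and the key "order-42" are made up for illustration, and the topic is assumed to have cleanup.policy=compact), deleting a key once it has been processed just means producing a record with a null value for that key:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SendTombstone {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A null value is a tombstone: once the log cleaner compacts the segment,
                // earlier records with the same key are removed.
                producer.send(new ProducerRecord<>("commands", "order-42", null));
                producer.flush();
            }
        }
    }

Note that the old records only disappear after the cleaner has actually run on the relevant (closed) segments, so the deletion is eventual rather than immediate.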
I have tried creating a Kafka topic configuration that uses compaction and deletion, to achieve the following:
Within the retention period, retain the latest version of the key
After the retention period, remove any message older than that timestamp
For this, I have tried the following topic-specific config:
cleanup.policy=[compact,delete]
retention.ms=864000000 (10 days)
min.compaction.lag.ms=3600000 (1 hour)
min.cleanable.dirty.ratio=0.1
segment.ms=3600000 (1 hour)
The broker configuration is as follows:
log.retention.hours=168 (7 days)
log.segment.bytes=1.1 GB
log.cleanup.policy=delete
delete.retention.ms=86400000 (1 day)
When I set this to a smaller amount in testing, e.g. 20 minutes or 1 hour, adjusting only retention.ms on the topic, I can correctly see that the data is pruned after the retention period.
I can see that the data is being compacted as expected, but after the 10-day retention period, if I read the topic from the beginning, data much older than 10 days is still there. Is this a problem with such a long retention period?
Am I missing any configuration here? I have checked the Kafka logs and can see that the broker is rolling the segments and compacting as expected, but I can't see anything about deletes.
The Kafka version is 5.1.2-1.
It might be that your topic-level and broker-level configurations override each other and only the one with higher precedence is evaluated; topic-level settings take precedence over the broker defaults.
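One way to check which value actually wins is to describe the topic's effective configuration; each entry reports the source it comes from. A minimal sketch, assuming the topic is named "my-compacted-topic" (the name and broker address are placeholders):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.common.config.ConfigResource;

    public class DescribeTopicConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
                Config config = admin.describeConfigs(Collections.singletonList(topic))
                                     .all().get().get(topic);
                // Print each setting together with where it comes from
                // (DYNAMIC_TOPIC_CONFIG, STATIC_BROKER_CONFIG, DEFAULT_CONFIG, ...)
                config.entries().forEach(e ->
                        System.out.println(e.name() + " = " + e.value() + "  [" + e.source() + "]"));
            }
        }
    }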
We are considering implementing a timeout as part of a Kafka-based API by utilising its time-based retention capabilities.
Basically, we would set log.retention.ms=10000 to make messages expire from a command topic if they are not processed within 10 seconds.
I am wondering, though, whether this would provide a message-level guarantee (i.e. every message is available for the same amount of time), given that retention policies operate at the log-segment level (based on the largest timestamp per segment).
Of course, we could reduce log.segment.bytes to achieve more granular retention control, but I'm not sure about the performance implications.
Any advice?
Nick
In Kafka, the retention settings are lower bounds, i.e. Kafka guarantees it will not delete a message before its retention limit is reached.
In practice, that means messages can stay in the log for longer than their retention limits.
Also, as you said, Kafka operates at the log-segment level. For time-based retention, a segment becomes eligible for deletion only once its latest message is older than the limit, and that never applies to the active segment. So retention can't be used to provide a per-message time to live.
I don't know about your use case, but maybe have a look at the offsetsForTimes() and seek() APIs in the consumer. These let you choose, based on time, where the consumer starts reading.
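As an illustration of that approach (the topic name "commands", the group id, and the broker address are placeholders), a consumer can seek a partition to the first offset whose timestamp is at or after a cutoff, effectively skipping anything older than 10 seconds:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SeekPastExpired {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "command-handler");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("commands", 0);
                consumer.assign(Collections.singletonList(tp));
                long cutoff = System.currentTimeMillis() - 10_000L; // ignore messages older than 10 s
                Map<TopicPartition, OffsetAndTimestamp> offsets =
                        consumer.offsetsForTimes(Collections.singletonMap(tp, cutoff));
                OffsetAndTimestamp oat = offsets.get(tp);
                if (oat != null) {
                    consumer.seek(tp, oat.offset());      // first offset with timestamp >= cutoff
                } else {
                    consumer.seekToEnd(Collections.singletonList(tp)); // nothing newer than the cutoff
                }
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        System.out.println(r.key() + " -> " + r.value()));
            }
        }
    }

Unlike retention, this enforces the "only messages newer than X" rule on the consumer side, per message, regardless of how segments roll.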
Finally, if you really need strong per message TTL, maybe Kafka is not the best tool.
According to the official Kafka documentation (https://kafka.apache.org/documentation/#gettingStarted), there are time- and size-based retention parameters. Is there a way to configure Kafka to always keep the last message per topic, regardless of how old it is?
Currently I am thinking of republishing the message at the end of the expiration period, but that does not look like a good idea.
See the documentation section on log compaction: a topic setting of cleanup.policy=compact keeps messages retained indefinitely, but only the latest message for each unique key.
Note that all messages will be retained within an open "segment", which defaults to 1GB worth of data, while any closed, old segments will have uniquely keyed events. You can tune the segment size and "dirty ratio" of a topic to make the LogCleaner more aggressive, but this comes at a performance cost.
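As a sketch of that tuning (the topic name, partition and replica counts, and the concrete values are illustrative only), these settings can be supplied when creating the compacted topic:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, String> configs = new HashMap<>();
                configs.put("cleanup.policy", "compact");                        // keep latest value per key
                configs.put("segment.bytes", String.valueOf(64 * 1024 * 1024));  // 64 MB segments instead of the 1 GB default
                configs.put("min.cleanable.dirty.ratio", "0.1");                 // let the cleaner compact closed segments sooner
                NewTopic topic = new NewTopic("latest-values", 3, (short) 1).configs(configs);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }

Smaller segments roll (and therefore become eligible for compaction) sooner, at the cost of more files and more cleaner work.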
I set the following in server.properties to clean data under kafka-logs/ automatically every 8 minutes:
log.retention.minutes=8
Is it possible to have the cleaner only clean up data that has already been consumed, and retain data that has not yet been consumed by a consumer?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the minimum time a message is kept. It is possible for a message with a retention time of minutes to last for weeks (depending on other configuration settings).
The concept of "consumer offsets" is the mechanism Kafka uses to avoid re-consuming messages. Kafka 0.11 will also include exactly-once capabilities.
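As a brief sketch of that mechanism (the topic name "events", the group id, and the print statement standing in for real processing are placeholders), a consumer records its progress by committing offsets; the messages themselves stay in the log until retention removes them:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommitOffsets {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        System.out.println(record.offset() + ": " + record.value()); // stand-in for real processing
                    }
                    // Committing stores the group's position in __consumer_offsets;
                    // it does not delete anything from the topic itself.
                    consumer.commitSync();
                }
            }
        }
    }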