We are considering implementing a timeout as part of a Kafka-based API by utilising its time-based retention capabilities.
Basically, setting log.retention.ms=10000 so that messages expire from a command topic if not processed within 10 seconds.
I am wondering though whether this would provide a message level guarantee (i.e. every message is available the same amount of time) given that retention policies operate at the log segment level (based on largest timestamp per segment).
Of course, we could reduce log.segment.bytes to get more granular retention control, but I am not sure about the performance implications.
Any advice?
Nick
In Kafka, the retention settings are lower bounds, i.e. Kafka guarantees it will not delete a message before its retention limit is reached.
In practice, that means messages can stay in the log for longer than their retention limits.
Also, as you said, Kafka operates at the log segment level. For time-based retention, a segment only becomes eligible for deletion once the latest message in it is older than the limit. And that does not apply to the active segment. So retention can't be used to provide a per-message time to live.
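To make the segment-level behaviour concrete, here is a minimal sketch of the rule described above. It is a simplified model, not broker code: the `Segment` class and `deletable_segments` function are hypothetical names, and the last segment in the list is simply treated as the active one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    timestamps_ms: List[int]  # timestamps of the messages stored in this segment

    @property
    def max_timestamp_ms(self) -> int:
        # Time-based retention looks at the newest message in the segment.
        return max(self.timestamps_ms)


def deletable_segments(segments: List[Segment], now_ms: int, retention_ms: int) -> List[Segment]:
    """Return the closed segments whose newest message is older than retention_ms.

    Simplifying assumption: the last segment is the active one and is never eligible.
    """
    closed = segments[:-1]
    return [s for s in closed if now_ms - s.max_timestamp_ms > retention_ms]


RETENTION_MS = 10_000
log = [Segment([0, 9_000]), Segment([12_000])]

# At t=15s the message written at t=0 is already 15s old, but its segment's
# newest message is only 6s old, so nothing is eligible yet.
at_15s = deletable_segments(log, now_ms=15_000, retention_ms=RETENTION_MS)

# At t=20s the newest message in the first segment is 11s old, so the whole
# segment (including the t=0 message) becomes eligible together.
at_20s = deletable_segments(log, now_ms=20_000, retention_ms=RETENTION_MS)
```

In this model the message written at t=0 survives for at least 20 seconds against a 10 second retention setting, which is why retention only works as a lower bound.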
I don't know about your use case, but maybe have a look at the offsetsForTimes() and seek() APIs in the consumer. These let you select what the consumer reads based on time.
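If the timeout has to be enforced per message, one option is to check the record timestamp on the consumer side instead of relying on the broker. The sketch below is plain Python with no Kafka client involved; `TTL_MS`, `is_expired` and `earliest_live_timestamp_ms` are hypothetical names, and the record timestamp is assumed to come from whatever consumer record type you use.

```python
import time
from typing import Optional

TTL_MS = 10_000  # hypothetical per-command time to live (10 seconds)


def is_expired(record_timestamp_ms: int, ttl_ms: int = TTL_MS, now_ms: Optional[int] = None) -> bool:
    """True if the record is older than ttl_ms and should be skipped."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - record_timestamp_ms > ttl_ms


def earliest_live_timestamp_ms(now_ms: int, ttl_ms: int = TTL_MS) -> int:
    """Timestamp to hand to a timestamp-to-offset lookup such as offsetsForTimes(),
    so the consumer can seek past everything that has already timed out."""
    return now_ms - ttl_ms
```

A consumer loop would call `is_expired` on each record and drop the ones that return True. After a restart it could look up the offset for `earliest_live_timestamp_ms(...)` and seek there rather than reading stale commands.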
Finally, if you really need strong per message TTL, maybe Kafka is not the best tool.
As in MQ solutions, is it possible to have a message automatically deleted in Kafka once it is consumed?
Since I don't control when the message will be consumed, I can't define retention by time or byte size.
You can override time-based retention on a per-topic basis, and even set retention.ms to -1 to disable time-based deletion entirely. Size-based retention is unlimited by default, so you don't have to use it. That said, I am not sure Kafka is best suited to your use case, as it is meant for real-time, high-performance streaming. Another note: you can use a compacted topic (cleanup.policy=compact) and send a tombstone message (same key, null value) to delete a record once it has been processed. But basically, Kafka does not have automatic delete on consumption.
I have a Kafka Streams use case where I need to aggregate against past data that might have been consumed even months earlier.
I wonder whether that means I need to be concerned about the default retention period of internal topics, e.g. XXX-REDUCE-STATE-STORE-changelog and XXX-AGGREGATE-STATE-STORE-repartition, and explicitly change it somehow.
If yes, is there a way to configure it for a Streams app? If I set the default retention period at the broker level, will my newly created internal topics be retained forever?
Figured out that XXX-REDUCE-STATE-STORE-changelog topics have cleanup.policy=compact. That means the latest value for each key is kept indefinitely, because log compaction is used instead of time-based deletion. XXX-AGGREGATE-STATE-STORE-repartition topics have retention.ms=-1 by default, even if the broker-level default is set to some other value.
According to the official Kafka documentation (https://kafka.apache.org/documentation/#gettingStarted), there are time and size retention parameters. Is there a way to configure Kafka to always keep the last message per topic, regardless of how old it is?
Currently I am thinking of republishing it at the end of the expiration period, but that does not look like a good idea.
See the section on log compaction: a topic with cleanup.policy=compact retains messages indefinitely, but only the latest message for each key.
Note that all messages are retained within the open (active) segment, which defaults to 1GB worth of data, while closed segments that have been cleaned hold only the latest event per key. You can tune the segment size and the "dirty ratio" (min.cleanable.dirty.ratio) of a topic to make the LogCleaner more aggressive, but this comes at a performance cost.
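As a rough illustration of what compaction leaves behind, here is a small model in plain Python. It is only a sketch: `compact` is a hypothetical helper, a value of `None` stands in for a tombstone, and tombstones are dropped immediately, whereas a real broker keeps them around for a while before removing them.

```python
from typing import Dict, List, Optional, Tuple

# (key, value); a value of None models a tombstone
Record = Tuple[str, Optional[str]]


def compact(records: List[Record]) -> List[Record]:
    """Keep only the latest record per key, in log order, and drop keys
    whose latest record is a tombstone."""
    latest_offset: Dict[str, int] = {}
    for offset, (key, _value) in enumerate(records):
        latest_offset[key] = offset  # later offsets overwrite earlier ones

    return [
        (key, value)
        for offset, (key, value) in enumerate(records)
        if latest_offset[key] == offset and value is not None
    ]


closed_segment = [("a", "1"), ("b", "2"), ("a", "3"), ("b", None)]
compacted = compact(closed_segment)  # only the newest "a" survives; "b" was tombstoned
```

To always keep the last message of a topic, every message would need to share the same key, so that the newest one always replaces the previous one.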
I am fairly new to kafka so forgive me if this question is trivial. I have a very simple setup for purposes of timing tests as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but it seemed this degraded performance, then I set them to the maximum my broker could take before being completely full, but again the performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered sequence of segments, and a segment is a collection of messages. Kafka applies retention at the segment level rather than the message level, so it keeps removing the oldest segments from the log as they violate the retention policy.
Apache Kafka provides us with the following retention policies -
Time Based Retention
Under this policy, we configure the maximum time a segment (and hence its messages) can live. Once the newest message in a segment is older than the configured retention time, the segment is marked for deletion (with the default cleanup.policy=delete). The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
# Configures retention time in milliseconds (7 days)
log.retention.ms=604800000
# Used if log.retention.ms is not set (7 days)
log.retention.minutes=10080
# Used if log.retention.minutes is not set (7 days)
log.retention.hours=168
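The priority order can be sketched as a small resolver. This is an illustration of the precedence described above rather than broker code: `effective_retention_ms` is a hypothetical helper, and treating any negative value as "no time limit" is an assumption of this sketch.

```python
from typing import Mapping

DEFAULT_RETENTION_HOURS = 168  # 7 days


def effective_retention_ms(props: Mapping[str, str]) -> int:
    """Resolve time-based retention: ms wins over minutes, minutes over hours.

    Returns -1 when the winning value is negative (assumed to mean no time limit).
    """
    if "log.retention.ms" in props:
        value_ms = int(props["log.retention.ms"])
    elif "log.retention.minutes" in props:
        value_ms = int(props["log.retention.minutes"]) * 60 * 1000
    else:
        hours = int(props.get("log.retention.hours", str(DEFAULT_RETENTION_HOURS)))
        value_ms = hours * 60 * 60 * 1000
    return -1 if value_ms < 0 else value_ms


seven_days_ms = effective_retention_ms({})  # falls back to 168 hours
ten_seconds_ms = effective_retention_ms(
    {"log.retention.ms": "10000", "log.retention.hours": "168"}
)  # the ms setting takes priority
```

Seven days works out to 168 hours, 10080 minutes or 604800000 milliseconds, so the three settings only describe the same period when they use those values.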
Size based Retention
Under this policy, we configure the maximum size of the log for a topic partition. Once the log reaches this size, Kafka starts removing the oldest segments. This policy is not popular because it gives poor visibility into when a message will expire. However, it can come in handy when we need to control the size of a log because of limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
# Configures the maximum size of a log (per partition), here 100 MiB
log.retention.bytes=104857600
So for your use case, you should configure log.retention.bytes so that your disk does not fill up.
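A back-of-the-envelope way to pick the value is to divide the disk space you are willing to give Kafka across the partitions on the broker and leave headroom for one extra segment per partition. The helper below is only a sketch: `max_retention_bytes` is a hypothetical name, and the assumption that a partition can overshoot the limit by roughly one segment is a rule of thumb, not a guarantee.

```python
def max_retention_bytes(disk_budget_bytes: int, partitions_on_broker: int, segment_bytes: int) -> int:
    """Suggest a log.retention.bytes value for a given disk budget.

    Assumption: each partition may temporarily hold about one segment more
    than the configured limit, so that headroom is subtracted up front.
    """
    if partitions_on_broker <= 0:
        raise ValueError("partitions_on_broker must be positive")
    per_partition_budget = disk_budget_bytes // partitions_on_broker
    return max(per_partition_budget - segment_bytes, 0)


GIB = 1024 ** 3
MIB = 1024 ** 2

# Example: 10 GiB of disk, two partitions in total (assumed one per topic), 100 MiB segments.
suggested = max_retention_bytes(10 * GIB, partitions_on_broker=2, segment_bytes=100 * MIB)
```

With these example numbers the suggestion is 5 GiB minus 100 MiB per partition. Replicas hosted on the same broker count as partitions here too.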
If I have the log retention period set to 2 hours for a partition, will only the consumed messages be purged after 2 hours, or all messages, whether consumed or not?
Once the retention period is over, all messages are discarded, whether they have been consumed or not. Here is a brief note from the official documentation:
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.