Retention period of internal Kafka Streams topics - apache-kafka

I have a Kafka Streams use case where I need to perform an aggregate operation over past data that may have been consumed months earlier.
I wonder whether this means I need to be concerned about the default retention period of the internal topics, e.g. XXX-REDUCE-STATE-STORE-changelog and XXX-AGGREGATE-STATE-STORE-repartition, and explicitly change or set it somehow.
If so, is there a way to configure it for the Streams application? If I set the default retention period at the broker level, will my newly created internal topics have infinite retention?

Figured out that the XXX-REDUCE-STATE-STORE-changelog topics have cleanup.policy=compact, meaning messages are never deleted because log compaction is enabled. The XXX-AGGREGATE-STATE-STORE-repartition topics have retention.ms=-1 by default, even if the broker-level default is set to a different value.
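If you ever do need to override configs on the internal topics, one option is the "topic." prefix in the Streams configuration, which is applied to the changelog and repartition topics that Streams creates. A minimal sketch (application id and bootstrap servers are placeholders; as noted above, the defaults may already cover you):

```java
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class StreamsRetentionConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-aggregation-app"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker

        // "topic."-prefixed settings are passed to all internal topics Streams creates.
        // Here: disable time-based deletion regardless of the broker-level default.
        props.put(StreamsConfig.topicPrefix(TopicConfig.RETENTION_MS_CONFIG), "-1");
        return props;
    }
}
```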

Related

kafka automatic deletion based on message consumption

Like what we have in MQ solutions, is it possible to have a message automatically deleted in Kafka once it is consumed?
As I don't have control over when a message will be consumed, it's not possible to define retention by time or byte size.
You can override the time-based retention configuration on a per-topic basis, or even set retention.ms=-1 so nothing is ever deleted. Size-based retention is unlimited by default, and you don't have to use it. That said, I'm not sure Kafka is best suited for your use case, as it is meant for real-time, high-throughput stream processing. On another note, you can use a compacted topic and send a tombstone message to delete a record once it has been processed, but fundamentally Kafka does not delete messages on consumption.
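For illustration, a minimal sketch of the tombstone approach mentioned above: produce a record with a null value for the processed key to a topic configured with cleanup.policy=compact (the topic name, key, and broker address are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TombstoneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: on a compacted topic, compaction will
            // eventually remove all records with this key.
            producer.send(new ProducerRecord<>("commands", "order-42", null));
        }
    }
}
```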

log retention to implement message TTL

we are considering implementing a timeout as part of a Kafka-based API by utilising its time-based retention capabilities.
Basically, setting log.retention.ms = 10000 to make messages expire from a command topic if not processed within 10 seconds.
I am wondering, though, whether this would provide a message-level guarantee (i.e. every message is available for the same amount of time), given that retention policies operate at the log segment level (based on the largest timestamp per segment).
Of course, we could reduce log.segment.bytes to achieve more granular retention control, but I'm not sure about the implications for performance.
Any advice?
Nick
In Kafka, the retention settings are lower bounds, ie Kafka guarantees it will not delete a message before its retention limits are reached.
In practice, that means messages can stay in the log for longer than their retention limits.
Also, as you said, Kafka operates at the log segment level. For time-based retention, a segment only becomes eligible for deletion once its latest message is older than the limit, and this never applies to the active segment. So retention can't be used to provide a per-message time to live.
I don't know about your use case, but maybe have a look at the offsetsForTimes() and seek() APIs in the consumer. These let you control what the consumer reads based on time.
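A rough sketch of that approach (topic, partition, group id, and the 10-second cutoff are placeholders): look up the first offset at or after a timestamp for each partition, then seek to it before polling, so anything older is skipped.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SeekByTime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "command-reader");             // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("commands", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(tp));

            // Only consider messages produced in the last 10 seconds.
            long cutoff = Instant.now().minusSeconds(10).toEpochMilli();
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(tp, cutoff);

            Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
            OffsetAndTimestamp ot = result.get(tp);
            if (ot != null) {
                consumer.seek(tp, ot.offset()); // skip anything older than the cutoff
            }
            consumer.poll(Duration.ofMillis(500)).forEach(r ->
                    System.out.println(r.key() + " -> " + r.value()));
        }
    }
}
```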
Finally, if you really need strong per message TTL, maybe Kafka is not the best tool.

Permanent Kafka Streams/KSQL retention policy

I'm presently working on a use case in which user interaction with a platform is tracked, generating a stream of events that is stored in Kafka and subsequently processed in Kafka Streams/KSQL.
But I've run into an issue concerning the state store and changelog topic retention policies. User sessions could happen indefinitely far apart in time, so I must guarantee that the state will be persisted throughout that period and restored in case of node and cluster-wide failures. During our searches, we came across the following information:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Kafka Streams allows for stateful stream processing, i.e. operators that have an internal state. (...). The default implementation used by Kafka Streams DSL is a fault-tolerant state store using 1. an internally created and compacted changelog topic (for fault-tolerance) and 2. one (or multiple) RocksDB instances (for cached key-value lookups). Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to get managed correctly.
(...) Thus, RocksDB memory requirement does not grow infinitely (in contrast to changelog topic). (KAFKA-4015 was fixed in 0.10.1 release, and windowed changelog topics don't grow unbounded as they apply an additional retention time parameter).
Retention time in kafka local state store / changelog
"For windowed KTables there is a local retention time and there is the changlog retention time. You can set the local store retention time via Materialized.withRetentionTime(...) -- the default value is 24h.
If a new application is created, changelog topics are created with the same retention time as local store retention time."
https://docs.confluent.io/current/streams/developer-guide/config-streams.html
The windowstore.changelog.additional.retention.ms parameter states:
Added to a window's maintainMs to ensure data is not deleted from the log prematurely. Allows for clock drift.
It would seem that Kafka Streams maintains both a (replicated) local state store and a changelog topic for fault tolerance, both with a finite, configurable retention period, and will apparently erase records once the retention time expires. This would lead to unacceptable data loss in our platform, thus raising the following questions:
Does Kafka Streams actually clean up the default state store over time or have I misunderstood something? Is there an actual risk of data loss?
In that case, is it advisable or even possible to set an infinite retention policy to the state store? Or perhaps there could be another way of making sure the state will be persisted, such as using a more traditional database as state store, if that makes sense?
Does the retention policy apply to standby replicas?
If it's impossible to persist the state permanently, could there be another stream processing framework that better suits our use case?
Any clarification would be appreciated.
Seems you're asking about two different things. Session windows and changelog topics...
Compacted topics retain the latest record for each unique key forever. Session windows, on the other hand, should probably be closed over time; a user session a week/month/year from today is arguably a new session, and you should tie the individual session windows together as a collection keyed by the userId, rather than only storing the most recent session (which implies removing previous sessions from the state store).
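Not the poster's code, but a minimal sketch of a session-windowed aggregation with an explicit store retention, assuming String-keyed events on a hypothetical "user-events" topic keyed by userId; the 30-minute inactivity gap and 30-day retention are illustrative values:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.state.SessionStore;

import java.time.Duration;

public class SessionCountTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            // Close a session after 30 minutes of inactivity.
            .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
            .count(Materialized.<String, Long, SessionStore<Bytes, byte[]>>as("user-sessions")
                // Keep windowed state (local store and changelog) for 30 days;
                // older sessions are dropped rather than retained forever.
                .withRetention(Duration.ofDays(30)));

        return builder;
    }
}
```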

Kafka stream - define a retention policy for a changelog

I use Kafka Streams for some time-windowed aggregations.
I'm interested only in the final result of each window, so I use the .suppress() feature, which creates a changelog topic for its state.
The cleanup policy for this changelog topic is "compact", which to my understanding means at least the most recent record for each key will be kept.
The problem in my application is that keys often change. This means that the topic will grow indefinitely (each window will bring new keys which will never be deleted).
Since the aggregation is per window, after the aggregation was done, I don't really need the "old" keys.
Is there a way to tell Kafka Streams to remove keys from previous windows?
For that matter, I think configuring the changelog topic's cleanup policy to "compact,delete" would do the job (which is available in Kafka according to KIP-71 and KAFKA-4015).
But is it possible to change this policy using the Kafka Streams API?
The suppress() operator sends tombstone messages to the changelog topic when a record is evicted from its buffer and sent downstream. Thus, you don't need to worry about unbounded growth of the topic. Changing the compaction policy might in fact break the guarantees that the operator provides, and you might lose data.
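For context, a minimal sketch of the kind of topology being discussed: a windowed count emitted only once the window closes, via suppress(). Topic names, window size, and grace period are placeholders, not taken from the question:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class FinalResultsTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String())) // placeholder topic
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
            .count()
            // Emit only the final count per window; when a record is evicted from the
            // suppression buffer and forwarded downstream, suppress() writes a tombstone
            // to its changelog topic, so old keys do not accumulate there.
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream((windowedKey, count) -> windowedKey.key())
            .to("final-counts", Produced.with(Serdes.String(), Serdes.Long())); // placeholder topic

        return builder;
    }
}
```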

Kafka Materialized Views TTL

As far as I know, Kafka by default keeps records in topics for 7 days and then deletes them. But what about Kafka materialized views: how long will Kafka keep the data there (indefinitely or for a limited time)? Also, does Kafka replicate materialized views across the cluster?
Kafka topics can be configured either with a retention time or with log compaction. With log compaction, the latest record for each key is never deleted, while older records with the same key are garbage collected at regular intervals. See https://kafka.apache.org/documentation/#compaction
When Kafka Streams creates a KTable or state store and creates a changelog topic for fault tolerance, it creates this changelog topic with log compaction enabled.
Note: if you read a topic directly as a KTable or GlobalKTable (i.e., builder.table(...)), no additional changelog topic is created; the source topic is used for this purpose. Thus, the source topic should be configured with log compaction (and not with retention time).
You can configure the desired replication factor with the StreamsConfig parameter replication.factor. You can also manually change the replication factor at any time if you wish, e.g. via the bin/kafka-topics.sh command.
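A small sketch of setting that parameter in the Streams configuration (application id and bootstrap servers are placeholders; 3 is just an example value matching a three-broker cluster):

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ReplicationConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "materialized-view-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        // Replication factor for the internal changelog/repartition topics Streams creates.
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
        return props;
    }
}
```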