Let's say I have an A-Event KStream aggregated into an A-Snapshot KTable and a B-Event KStream aggregated into a B-Snapshot KTable. Neither A-Snapshot nor B-Snapshot conveys null values (delete events are instead aggregated as a state attribute of the snapshot). At this point, we can assume we have a persisted Kafka changelog topic and a local RocksDB store for both the A-KTable and the B-KTable aggregations. My topology then joins the A-KTable with the B-KTable to produce a joined AB-KStream.

My problem is with the A-KTable and B-KTable materialization lifecycles (both the changelog topics and the local RocksDB stores). Say the A-Event and B-Event topics have a retention policy of 2 weeks: is there a way to propagate the upstream topics' delete retention policies to the internal KTable materializations (changelog topic and RocksDB store)? Alternatively, can the KTable materialization be configured with some sort of retention policy that would manage the lifecycle of both the changelog topic and the RocksDB store, considering I can't explicitly emit A-KTable and B-KTable snapshot tombstones? I am concerned that the changelogs and the local stores will grow indefinitely.
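A rough sketch of the topology described above (topic names, serdes, and the toy concatenation "aggregation" are illustrative placeholders, not the actual snapshot logic):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class AbJoinTopology {

    // Two event streams aggregated into snapshot tables, then a KTable-KTable
    // join emitted downstream as a stream of joined AB records.
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // A-Event stream -> A-Snapshot table (deletes would be folded into the snapshot state).
        KTable<String, String> aSnapshot = builder
                .stream("a-events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .aggregate(() -> "", (key, event, snapshot) -> snapshot + "|" + event,
                        Materialized.with(Serdes.String(), Serdes.String()));

        // B-Event stream -> B-Snapshot table.
        KTable<String, String> bSnapshot = builder
                .stream("b-events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .aggregate(() -> "", (key, event, snapshot) -> snapshot + "|" + event,
                        Materialized.with(Serdes.String(), Serdes.String()));

        // KTable-KTable join producing the AB stream.
        KStream<String, String> joinedAb = aSnapshot
                .join(bSnapshot, (a, b) -> a + "+" + b)
                .toStream();
        joinedAb.to("ab-joined", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```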
At the moment, Kafka Streams doesn't support out-of-the-box functionality to clean up changelog topics based on the source topics' retention policy. By default, changelog topics use the "compact" cleanup policy.
There is an open JIRA issue for this:
https://issues.apache.org/jira/browse/KAFKA-4212
One option is to inject tombstone messages, but that's not an elegant approach.
In the case of a windowed store, you can use the "compact,delete" cleanup policy.
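For example, a minimal sketch of a windowed aggregation whose retention bounds both the RocksDB window store and its "compact,delete" changelog (topic and store names are assumptions; Materialized.withRetention needs Kafka Streams 2.1+, and TimeWindows.ofSizeWithNoGrace needs 3.0+):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

class WindowedRetentionExample {
    static void addHourlyCounts(StreamsBuilder builder) {
        // Hourly counts kept for 14 days: old windows are dropped from RocksDB and
        // aged out of the "compact,delete" changelog once retention has passed.
        builder.stream("a-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("a-hourly-counts")
                       .withRetention(Duration.ofDays(14)));
    }
}
```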
Related
We have a Kafka Streams aggregation topology. We need to keep the size of the changelog topic in check to reduce Kafka storage costs, so we use a transformer (DSL API) in the topology to schedule a punctuation that deletes old records from the state store using keyValueStore.delete().
I am able to verify that, after a delete, the deleted key is no longer in the state store on subsequent triggers of the punctuation. But does the delete remove the record from the changelog topic as well? More importantly, does it reduce the size of the changelog topic so that the Kafka storage cost stays in check?
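A sketch of the kind of transformer described here, assuming a TTL-style purge keyed on a "last seen" timestamp (store name, record types, and TTL are illustrative):

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class TtlTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private static final long TTL_MS = Duration.ofDays(14).toMillis();

    private ProcessorContext context;
    private KeyValueStore<String, Long> lastSeenStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.lastSeenStore = (KeyValueStore<String, Long>) context.getStateStore("last-seen-store");

        // Wall-clock punctuation that purges entries older than the TTL.
        context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            long cutoff = now - TTL_MS;
            try (KeyValueIterator<String, Long> it = lastSeenStore.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (entry.value < cutoff) {
                        // Removes the key locally and writes a tombstone to the changelog.
                        lastSeenStore.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        lastSeenStore.put(key, context.timestamp());
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() { }
}
```

The store itself would be registered on the StreamsBuilder (e.g. via Stores.keyValueStoreBuilder(...) and addStateStore(...)) and attached by name in the transform() call.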
Yes, changes to the state store are applied to the changelog topic.
No, there's no actual record deletion in the changelog topic when you issue a "delete" command. Be aware that a "delete" command is in fact a record with a null value (aka a tombstone) written into a topic (changelog or any other) - see here:
null values are interpreted in a special way: a record with a null value represents a "DELETE" or tombstone for the record's key
So it is really the interpretation that makes it feel like a deletion. One could read a changelog topic (you'll have to know the exact topic's name) as a KStream or with the Kafka Consumer API and would find the tombstone records there (until they are removed by the compaction or retention thread). But if you read a changelog or any compacted topic as a KTable, then a tombstone record will trigger a deletion from the associated store - you'll no longer find the related key in the store despite the fact that the record still exists in the underlying compacted topic.
If the compaction policy is enabled on a topic (it is enabled by default on changelog topics), then older records for a given key are removed and only the latest one is kept. So at some point you'll only have the delete record left, because the previous records with the same key were removed by Kafka's compaction thread.
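As an illustration of the tombstones being visible, here is a sketch that reads a changelog topic back as a plain KStream. Changelog names follow the pattern <application.id>-<store name>-changelog; the name and the String serdes below are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;

class ChangelogInspector {
    // Reading the changelog as a KStream surfaces the null-value (tombstone)
    // records that a KTable view would interpret as deletions.
    static void inspect(StreamsBuilder builder) {
        builder.stream("my-app-my-store-changelog",
                        Consumed.with(Serdes.String(), Serdes.String()))
               .foreach((key, value) -> {
                   if (value == null) {
                       System.out.println("tombstone for key " + key);
                   } else {
                       System.out.println(key + " -> " + value);
                   }
               });
    }
}
```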
I use Kafka Streams for some aggregations over a time window.
I'm interested only in the final result of each window, so I use the .suppress() feature which creates a changelog topic for its state.
The cleanup policy for this changelog topic is "compact", which to my understanding will keep at least the last event for each key indefinitely.
The problem in my application is that keys often change. This means that the topic will grow indefinitely (each window will bring new keys which will never be deleted).
Since the aggregation is per window, after the aggregation was done, I don't really need the "old" keys.
Is there a way to tell Kafka Streams to remove keys from previous windows?
For that matter, I think configuring the changelog topic's cleanup policy to "compact,delete" would do the job (which is available in Kafka according to KIP-71 and KAFKA-4015).
But is it possible to change this retention policy using the Kafka Streams API?
The suppress() operator sends tombstone messages to its changelog topic when a record is evicted from its buffer and sent downstream. Thus, you don't need to worry about unbounded growth of the topic. Changing the compaction policy might in fact break the guarantees that the operator provides, and you might lose data.
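For reference, a minimal sketch of the pattern in question, a windowed count that only emits final per-window results via suppress() (topic names and serdes are assumptions; TimeWindows.ofSizeAndGrace needs Kafka Streams 3.0+):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

class FinalWindowResults {
    // The suppression buffer's changelog is managed by the operator itself,
    // which emits tombstones once a window is closed and forwarded downstream.
    static void add(StreamsBuilder builder) {
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
               .count()
               .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
               .toStream()
               // Flatten the windowed key so plain String serdes can be used downstream.
               .map((windowedKey, count) ->
                       KeyValue.pair(windowedKey.key() + "@" + windowedKey.window().start(),
                                     String.valueOf(count)))
               .to("final-counts-per-window", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```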
I have a KTable linked to a topic; however, when retention kicks in on that topic, the messages are also deleted from my KTable. Is it possible to keep the values in the KTable while those in its topic are deleted?
Changelog topics are log compacted and have infinite retention. Whenever a new event arrives on your topic, it will update the state in the KTable for that key.
In your scenario, if you need the data to remain available even after the source topic's records have been deleted, I would recommend publishing the KTable to an output topic to make it more durable.
As KTables are exposed only within the application and are built on top of changelog topics, once your application is gone you will lose the data unless you use a persistent state store.
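A minimal sketch of that suggestion, streaming the KTable out to a separate topic (names and serdes are assumptions; the output topic would be configured with cleanup.policy=compact so it keeps the latest value per key):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

class SnapshotExporter {
    // Streams the KTable's change events to a regular output topic, so the latest
    // value per key survives even after the source topic's retention removes the
    // original events.
    static void export(StreamsBuilder builder) {
        KTable<String, String> table = builder.table("source-topic",
                Consumed.with(Serdes.String(), Serdes.String()));
        table.toStream()
             .to("ktable-snapshot-output", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```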
As far as I know, Kafka by default keeps records in topics for 7 days and then deletes them. But what about Kafka materialized views: how long will Kafka keep the data there (indefinitely or for a limited time)? Also, does Kafka replicate materialized views across the cluster?
Kafka topics can be configured either with a retention time or with log compaction. With log compaction, the latest record for each key is never deleted, while older records with the same key are garbage collected at regular intervals. See https://kafka.apache.org/documentation/#compaction
When Kafka Streams creates a KTable or state store and creates a changelog topic for fault tolerance, it creates this changelog topic with log compaction enabled.
Note: if you read a topic directly as a KTable or GlobalKTable (i.e., builder.table(...)), no additional changelog topic will be created; the source topic itself is used for this purpose. Thus, the source topic should be configured with log compaction (and not with retention time).
You can configure the desired replication factor with the StreamsConfig parameter replication.factor. You can also manually change the replication factor of any topic at any time if you wish, e.g., via the bin/kafka-topics.sh command.
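For example, a sketch of a Streams configuration that sets the replication factor for internal topics (application id and bootstrap servers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

class ReplicationConfig {
    // replication.factor applies to internal topics (changelogs, repartition topics).
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
        return props;
    }
}
```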
I'm experimenting with kafka streams and I have the following setup:
I have an existing kafka topic with a key space that is unbounded (but predictable and well known).
My topic has a retention policy (in bytes) to age out old records.
I'd like to materialize this topic into a KTable where I can use the Interactive Queries API to retrieve records by key.
Is there any way to make my KTable "inherit" the retention policy from my topic? So that when records are aged out of the primary topic, they're no longer available in the KTable?
I'm worried about dumping all of the records into the KTable and having the StateStore grow unbounded.
One solution I can think of is to transform this into a windowed stream with hopping windows equal to a time-to-live for the record, but I'm wondering whether there's a better, more native solution.
Thanks.
Unfortunately, it is not supported at the moment. There is a JIRA for it, though: https://issues.apache.org/jira/browse/KAFKA-4212
Another possibility would be to insert tombstone messages (<key,null>) into the input topic. The KTable would pick those up and delete the corresponding key from the store.
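A minimal sketch of producing such a tombstone (topic name, key, and bootstrap servers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

class TombstoneSender {
    // Sends a <key, null> record into the input topic; the KTable reading that
    // topic deletes the key from its store when it processes the null value.
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("input-topic", "key-to-expire", null));
        }
    }
}
```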