I have a Kafka KTable linked to a topic, but when retention kicks in on the topic, the messages are also deleted from my KTable. Is it possible to keep the values in the KTable while those in its topic are deleted?
Changelog topics are log compacted with infinite retention time. At any time, if a new event arrives on your topic, it will update the state in the KTable for that key.
In your scenario, if you need the data to stay available even when the source topic's records are deleted, I would recommend publishing the KTable to a separate output topic to make it more durable.
Since KTables are exposed only within the application and built on top of changelog topics, once your application is gone you will lose the data unless you use a persistent state store.
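For example, here is a minimal sketch of that approach, assuming String keys/values and placeholder topic, store, and application names:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KTable;

    public class KTableToOutputTopic {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Build the KTable from the (placeholder) source topic.
            KTable<String, String> table = builder.table("source-topic");

            // Re-publish every table update to a separate output topic that you own
            // and can configure with its own cleanup/retention policy.
            table.toStream().to("ktable-output-topic");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-export-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }
    }

The output topic is a regular topic you own, so you can configure it with log compaction or whatever retention fits your needs.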
How can I access the list of all uncommitted messages in a topic in Kafka?
To access uncommitted messages, the question first assumes you are committing at all. Otherwise, you're just consuming a topic.
The only way to get any records from Kafka is by using the Consumer API.
Commits aren't required (disable auto commit and don't explicitly call the commit methods in code). However, if the app restarts for any reason, the auto.offset.reset property will apply to every topic you're consuming, meaning you either skip everything or have to block your main code execution until the consumer has re-read the topic from the beginning. One popular app that doesn't commit any offsets or create a consumer group is the Confluent Schema Registry; Kafka Streams does the same for its changelog topics.
If you want "another consumer" to read the entire topic, it just needs a different group id. That's it.
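As a sketch (the topic name and broker address are placeholders), a consumer that never commits and uses a fresh group id, so auto.offset.reset=earliest makes it read the whole topic every time it starts:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.UUID;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReadWholeTopic {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // A fresh group id, so no previously committed offsets apply.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "reader-" + UUID.randomUUID());
            // Never commit, so every restart falls back to auto.offset.reset.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("some-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }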
We have a Kafka Streams aggregation topology. We need to keep the size of the changelog topic in check to reduce Kafka storage costs, so we use a transformer (DSL API) in the topology to schedule a punctuation that deletes old records from the state store using keyValueStore.delete().
I am able to verify that after a delete, on subsequent scheduled triggers of the punctuation, the deleted key is no longer in the state store. But does it remove the record from the changelog topic as well? More importantly, does it reduce the size of the changelog topic so that the Kafka storage cost is kept in check?
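For illustration, a minimal sketch of the kind of purge punctuator described above; the store name "agg-store", the daily wall-clock schedule, the 14-day cutoff, and the Long value (standing in for the real aggregate and assumed to hold a last-update timestamp) are all assumptions:

    import java.time.Duration;

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class PurgingTransformer implements Transformer<String, Long, KeyValue<String, Long>> {

        // Assumed to be the aggregation's store, connected to this transformer by name.
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, Long>) context.getStateStore("agg-store");

            // Run once per day on wall-clock time and delete entries older than 14 days.
            context.schedule(Duration.ofDays(1), PunctuationType.WALL_CLOCK_TIME, now -> {
                long cutoff = now - Duration.ofDays(14).toMillis();
                try (KeyValueIterator<String, Long> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, Long> entry = it.next();
                        if (entry.value < cutoff) {
                            // Removes the key from the local store and writes a
                            // tombstone to the changelog topic.
                            store.delete(entry.key);
                        }
                    }
                }
            });
        }

        @Override
        public KeyValue<String, Long> transform(String key, Long value) {
            return KeyValue.pair(key, value); // pass records through unchanged
        }

        @Override
        public void close() {}
    }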
Yes, changes to the state store are applied to the changelog topic.
No, there's no actual record deletion in the changelog topic when you issue a "delete" command. Be aware that a "delete" is in fact a record with a null value (aka a tombstone) written to the topic (changelog or any other) - see here:
null values are interpreted in a special way: a record with a null value represents a "DELETE" or tombstone for the record's key
So, in fact, it is the interpretation that makes it feel like a deletion. One could read a changelog topic (you'll have to know the exact topic name) as a KStream or with the Kafka Consumer API and would find the tombstone records there (until they are removed by the compaction or retention thread). But if you read a changelog or any compacted topic as a KTable, then a tombstone record triggers a deletion from the associated store - you'll no longer find the related key in the store even though the tombstone itself still exists in the compacted topic.
If the compaction policy is enabled on a topic (it is enabled by default on changelog topics), then older records for a specific key are removed until only the last one remains. So at some point you'll only have the delete record left, because the previous records with the same key have been removed by the Kafka compaction thread.
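For example, a small Streams app that reads a changelog topic like any other topic and surfaces the tombstones; the changelog name below is an assumption (Kafka Streams names changelogs <application.id>-<store name>-changelog):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class TombstoneViewer {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Read the changelog like any other topic; tombstones show up as null values.
            builder.<String, String>stream("my-app-agg-store-changelog")
                   .filter((key, value) -> value == null)
                   .foreach((key, value) -> System.out.println("tombstone for key " + key));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tombstone-viewer");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }
    }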
I'm observing my Kafka Streams app reporting consumer lag for topics used to fill Global KTables. Is it correct that offsets are not committed for such topics?
This would make sense, as the topic is read from the beginning on every startup, so keeping track of the offset inside the consumer would be sufficient.
It would, however, be useful to know for sure, so that monitoring can exclude such consumer/topic pairs.
Correct, offsets are not committed for "global topics" -- the main reason is that all KafkaStreams instances read all partitions, and committing multiple offsets for the same partition is not possible.
You can still access the "global consumer" metrics, which also report their local lag.
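As a sketch, you can pull those metrics from a running KafkaStreams instance and filter by name; the "records-lag" filter is an assumption and the exact metric names can vary by client version:

    import java.util.Map;

    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public class GlobalLagPrinter {
        // Prints any lag-related metrics exposed by the embedded consumers,
        // including the "global consumer" used for GlobalKTable topics.
        public static void printConsumerLag(KafkaStreams streams) {
            Map<MetricName, ? extends Metric> metrics = streams.metrics();
            for (Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
                MetricName name = entry.getKey();
                if (name.name().contains("records-lag")) {
                    System.out.printf("%s %s = %s%n",
                            name.group(), name.tags(), entry.getValue().metricValue());
                }
            }
        }
    }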
Let's say I have an A-Event KStream aggregated into an A-Snapshot KTable, and a B-Event KStream aggregated into a B-Snapshot KTable. Neither A-Snapshot nor B-Snapshot conveys null values (delete events are instead aggregated as a state attribute of the snapshot). At this point, we can assume we have a persisted Kafka changelog topic and a RocksDB local store for both the A-KTable and B-KTable aggregations. My topology then joins the A-KTable with the B-KTable to produce a joined AB-KStream.
That said, my problem is with the A-KTable and B-KTable materialization lifecycles (both the changelog topic and the local RocksDB store). Let's say the A-Event and B-Event topic retention strategies are set to 2 weeks: is there a way to tie the retention of Kafka's internal KTable materialization (changelog and RocksDB) to the delete retention policies of the upstream event topics? Otherwise, can we configure the KTable materialization with some sort of retention policy that manages both the changelog topic and the RocksDB store lifecycle, considering I can't explicitly emit A-KTable and B-KTable snapshot tombstones? I am concerned that the changelog and the local store will grow indefinitely.
At the moment, Kafka Streams doesn't offer out-of-the-box functionality to propagate cleanup into changelog topics based on the source topics' retention policy. By default, changelog topics use the "compact" cleanup policy.
There is an open JIRA issue for the same :
https://issues.apache.org/jira/browse/KAFKA-4212
One option is to inject tombstone messages, but that's not a nice way to do it.
In case of a windowed store, you can use the "compact,delete" cleanup policy.
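As a sketch of the windowed case (the topic and store names and the 14-day retention are assumptions), withRetention() bounds both the local window store and its changelog:

    import java.time.Duration;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.WindowStore;

    public class WindowedRetentionExample {
        public static void buildTopology(StreamsBuilder builder) {
            builder.stream("a-event-topic", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey()
                   // Count per 1-day window; old windows expire after 14 days,
                   // and the changelog is created with cleanup.policy=compact,delete.
                   .windowedBy(TimeWindows.of(Duration.ofDays(1)))
                   .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("a-windowed-counts")
                           .withRetention(Duration.ofDays(14)));
        }
    }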
As far as I know, Kafka by default keeps the records in a topic for 7 days and then deletes them. But how about Kafka materialized views: how long will Kafka keep the data there (indefinitely or only for a limited time)? Also, does Kafka replicate materialized views over the cluster?
Kafka topics can be configured either with a retention time or with log compaction. With log compaction, the latest record for each key will never be deleted, while older records with the same key are garbage collected at regular intervals. See https://kafka.apache.org/documentation/#compaction
When Kafka Streams creates a KTable or state store and creates a changelog topic for fault tolerance, it creates this changelog topic with log compaction enabled.
Note: if you read a topic directly as a KTable or GlobalKTable (i.e., builder.table(...)), no additional changelog topic is created; the source topic is used for this purpose. Thus, the source topic should be configured with log compaction (and not with a retention time).
You can configure the desired replication factor with the StreamsConfig parameter replication.factor. You can also manually change the replication factor of a topic at any time if you wish, e.g., via the bin/kafka-topics.sh command.
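For example (the application id and broker address are placeholders):

    import java.util.Properties;

    import org.apache.kafka.streams.StreamsConfig;

    public class ReplicationConfigExample {
        public static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Internal topics (changelogs, repartition topics) are created with this
            // replication factor; pick a value your cluster can satisfy.
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
            return props;
        }
    }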