Kafka Materialized Views TTL - apache-kafka

As far as I know, Kafka by default keeps records in a topic for 7 days and then deletes them. But what about Kafka materialized views: how long does Kafka keep the data there (indefinitely or for a limited time)? Also, does Kafka replicate materialized views across the cluster?

Kafka topics can be configured either with a retention time or with log compaction. With log compaction, the latest record for each key is never deleted, while older records with the same key are garbage collected at regular intervals. See https://kafka.apache.org/documentation/#compaction
When Kafka Streams creates a KTable or state store and creates a changelog topic for fault tolerance, it creates this changelog topic with log compaction enabled.
Note: if you read a topic directly as a KTable or GlobalKTable (i.e., builder.table(...)), no additional changelog topic is created; the source topic itself is used for this purpose. Thus, the source topic should be configured with log compaction (and not with retention time).
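For example, a source topic intended to be read as a KTable can be created with compaction enabled up front. This is only a sketch: the topic name, partition count, and bootstrap address are placeholders, and older brokers take --zookeeper instead of --bootstrap-server.

```shell
# Create a compacted source topic suitable for reading as a KTable
# (topic name and addresses are illustrative)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic user-profiles \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact
```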
You can configure the desired replication factor with the StreamsConfig parameter replication.factor. You can also change the replication factor of existing topics at any time if you wish, e.g., via the bin/kafka-reassign-partitions.sh tool.
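As a sketch, the setting goes into the Streams application's configuration like any other client property (the file name and values here are illustrative):

```shell
# Append the replication factor for Streams-internal topics
# to an illustrative properties file
cat >> streams.properties <<'EOF'
application.id=my-streams-app
bootstrap.servers=localhost:9092
# applied to changelog and repartition topics created by Kafka Streams
replication.factor=3
EOF
```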

Related

Why schema registry internal topic _schemas has only single partition?

From the Confluent documentation:
Kafka is used as the Schema Registry storage backend. The special Kafka topic <kafkastore.topic> (default _schemas), with a single partition, is used as a highly available write-ahead log.
The _schemas topic is created with a single partition. What is the design rationale behind this? Having more than one partition would surely improve schema lookups by consumers.
The schemas topic must be totally ordered, and Kafka only guarantees order within a partition; therefore it has one partition, using the default partitioner. There is only one consumer anyway (the master registry server), so it doesn't need to scale. The HTTP server can handle thousands of requests perfectly fine; all schemas are stored in memory after the topic has been consumed. Consumers and producers also cache schemas after using them once.
The replication factor of one allows for local development without editing configs; you should change this in production.
Kafka's own internal topics (the consumer offsets and transaction state topics) are also commonly set to a replication factor of 1 in the quickstart configs, by the way. And num.partitions defaults to 1 for auto-created topics.
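You can confirm the single-partition layout on a running cluster. A sketch, assuming a local broker on the default port:

```shell
# Describe the Schema Registry log topic; expect a single partition
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic _schemas
```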

Retention period of internal kafka stream topics

I have a use case for Kafka Streams where I need to perform an aggregate operation on past data that might have been consumed even months earlier.
I wonder whether this means I need to be concerned about the default retention period of internal topics, e.g., XXX-REDUCE-STATE-STORE-changelog and XXX-AGGREGATE-STATE-STORE-repartition, and explicitly change/set it somehow?
If yes, is there a way to configure it for the streams app? If I set the default retention period at the broker level, will my newly created internal topics have infinite retention?
Figured out that XXX-REDUCE-STATE-STORE-changelog topics have cleanup.policy=compact, meaning messages will never be deleted, as log compaction is enabled. XXX-AGGREGATE-STATE-STORE-repartition topics have retention.ms=-1 by default, even if the broker-level default is set to some other value.
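These per-topic settings can be verified on a running cluster with kafka-configs.sh. A sketch; the application prefix and broker address are placeholders:

```shell
# Show topic-level overrides such as cleanup.policy for a changelog topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics \
  --entity-name my-app-XXX-REDUCE-STATE-STORE-changelog \
  --describe
```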

Kafka stream KTable changelog TTL

Let’s say I have an A-Event KStream aggregated into an A-Snapshot KTable, and a B-Event KStream aggregated into a B-Snapshot KTable. Neither A-Snapshot nor B-Snapshot conveys null values (delete events are instead aggregated as a state attribute of the snapshot). At this point, we can assume we have a persisted changelog topic and a local RocksDB store for both the A-KTable and B-KTable aggregations. My topology then joins the A-KTable with the B-KTable to produce a joined AB-KStream.
That said, my problem is with the A-KTable and B-KTable materialization lifecycles (both the changelog topics and the local RocksDB stores). Let’s say the A-Event and B-Event topics' retention strategies are set to 2 weeks: is there a way to have the upstream event topics' delete retention policies carry over to the internal KTable materialization (changelog and RocksDB)? Otherwise, can the KTable materialization be configured with some sort of retention policy that would manage the lifecycle of both the changelog topic and the RocksDB store, considering I can’t explicitly emit A-KTable and B-KTable snapshot tombstones? I am concerned that the changelog and the local store will grow indefinitely.
At the moment, Kafka Streams doesn't support out-of-the-box functionality to apply cleanup to changelog topics based on the source topics' retention policy. By default, changelog topics use the "compact" cleanup policy.
There is an open JIRA issue for this:
https://issues.apache.org/jira/browse/KAFKA-4212
One option is to inject tombstone messages, but that's not a nice way to do it.
In the case of a windowed store, you can use the "compact,delete" cleanup policy.
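As a workaround sketch, a changelog topic's cleanup policy can also be changed by hand from outside the Streams application. The topic name, broker address, and two-week retention below are placeholders; note the bracket syntax kafka-configs.sh uses for list-valued configs:

```shell
# Combine compaction with time-based deletion on a changelog topic
# (1209600000 ms = 14 days)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-app-A-SNAPSHOT-changelog \
  --alter --add-config 'cleanup.policy=[compact,delete],retention.ms=1209600000'
```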

apache kafka retention kstream and linked topic

I have a KTable linked to a Kafka topic; however, when retention kicks in on the topic, the messages are also deleted from my KTable. Is it possible to keep the values in the KTable while those of its source topic are deleted?
Changelog topics are log compacted and have infinite retention. At any time, if a new event arrives on your topic, it will update the state in the KTable for that key.
In your scenario, if you need the data to remain available even after the source topic's records are deleted, I would recommend publishing the KTable to an output topic to make it more durable.
As KTables are exposed only within the application and are built on top of changelog topics, once your application is gone you will lose the data unless you use a persistent state store.

Kafka System Tools for Replay

I have a case where we need to move data from one topic to another. I saw a utility in the Kafka documentation, "ReplayLogProducer". It's supposed to be run as indicated below.
bin/kafka-run-class.sh kafka.tools.ReplayLogProducer
Does this tool require the source topic to have the same number of partitions as the destination topic? How does retention of data work on the new topic?
It would be great if anyone could provide insight into best practices to follow or caveats to keep in mind while running this tool.
The command-line kafka.tools.ReplayLogProducer tool does not require the partitions to be the same. By default it uses the default partitioning strategy: a hash of the message's key if present, or round-robin if your messages don't have keys. One of the main use cases is copying data from an old topic to a new one after changing the number of partitions or the partitioning strategy.
It's still not documented, but the ability to specify a custom partitioner was apparently added by KAFKA-1281: you can now specify custom producer options with --property. So to use a different partitioning strategy, try:
bin/kafka-run-class.sh kafka.tools.ReplayLogProducer --property partitioner.class=my.custom.Partitioner
Retention of data in the new topic depends on how the new topic is configured via cleanup.policy, retention.ms, or retention.bytes. Note that with time-based retention (the default), retention is relative to the time the messages were replayed, not to their original creation time. This is an issue with regular replication or MirrorMaker as well, and ReplayLogProducer is no different. Proposals KIP-32 and KIP-33 should make it possible to configure retention by the "creation time" of your messages instead, but since Kafka 0.10 is not yet released, it's not yet clear whether ReplayLogProducer will preserve message creation time.
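Because retention on the destination is relative to the replay time, it can help to set the new topic's retention explicitly after the copy. A sketch with placeholder names (brokers of that era may require --zookeeper with kafka-configs instead):

```shell
# Explicitly set time-based retention on the destination topic
# (604800000 ms = 7 days)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name new-topic \
  --alter --add-config retention.ms=604800000
```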