We have a requirement to delete/purge the data of any given partition within a topic. I am using Kafka 0.10.0.1. Is there any way to delete the entire content of a partition on demand, and if so, how? I see that with log compaction we can post a null message for a key to delete it, but other than that, is there any way to achieve deletion?
Kafka does not currently support reducing the number of partitions for a topic, so no out-of-the-box tool is offered to remove a partition directly.
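For the compaction-based approach mentioned in the question, here is a minimal sketch of publishing a tombstone (a record with a null value) so that compaction eventually removes that key. The topic name and broker address are placeholders, and the topic has to be configured with cleanup.policy=compact:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: once compaction runs, the key is removed
            // from the partition log (requires cleanup.policy=compact on the topic).
            producer.send(new ProducerRecord<>("my-compacted-topic", "key-to-delete", null));
            producer.flush();
        }
    }
}
```

Note that this only removes data per key once compaction kicks in; it does not purge a whole partition at once.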
I came across the following two passages in the book "Mastering Kafka Streams and ksqlDB". The author uses two terms, "compacted topics" and "uncompacted topics"; what do they really mean?
Do they have anything to do with "log compaction"?
Tables can be thought of as updates to a database. In this view of the logs, only the current state (either the latest record for a given key or some kind of aggregation) for each key is retained. Tables are usually built from compacted topics.
Streams can be thought of as inserts in database parlance. Each distinct record remains in this view of the log. Streams are usually built from uncompacted topics.
Yes, they refer to log compaction. According to the Kafka docs:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition
https://kafka.apache.org/documentation/#compaction
If log compaction is enabled on a topic, Kafka removes any old record when there is a newer record with the same key in the partition log.
For a more detailed explanation of log compaction, refer to https://medium.com/swlh/introduction-to-topic-log-compaction-in-apache-kafka-3e4d4afd2262
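As an illustration of enabling compaction (not part of the answer above), a minimal sketch of creating a compacted topic with the Java AdminClient; the topic name, partition count, replication factor, and broker address are assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, replication factor 1,
            // with cleanup.policy=compact so only the latest value per key is kept.
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```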
Yes, these terms are synonymous.
Ref: Log Compaction
From this article:
The idea behind compacted topics is that no duplicate keys exist. Only the most recent value for a message key is maintained.
It is mostly used for scenarios such as restoring state to what it was before an application crash or system failure, or reloading a cache after an application restart.
As an example of the above, Kafka has the topic __consumer_offsets, which can be used to continue from the last message that was read after a crash or a restart. A schema registry is also often used to ensure compatible communication between producers and consumers; the schemas it uses are maintained in the _schemas topic.
Suppose I want to retain no more than the 100 most recent messages per key in a Kafka topic. Can I achieve this somehow? For example, can I configure the compaction policy to store the N most recent messages per key (not only one)?
There is no such way that Kafka provides. Kafka doesn't remove records based on a per-key message count: compaction keeps only the latest record per key, and deletion-based retention depends on the age and size of the log, not on how many messages exist per key.
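For reference, the time and size limits mentioned above are the topic-level settings retention.ms and retention.bytes (the latter applies per partition). A minimal sketch of setting them with the Java AdminClient, assuming a broker at localhost:9092 and a hypothetical topic name and values:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Keep records for 7 days, or until a partition exceeds 1 GiB.
            Map<ConfigResource, Collection<AlterConfigOp>> configs = Map.of(topic, List.of(
                    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                                      AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                                      AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(configs).all().get();
        }
    }
}
```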
I use Kafka Streams for some aggregations over a time window.
I'm interested only in the final result of each window, so I use the .suppress() feature, which creates a changelog topic for its state.
The retention policy configuration for this changelog topic is "compact", which to my understanding will keep at least the last event for each key.
The problem in my application is that keys change often. This means that the topic will grow indefinitely (each window brings new keys which are never deleted).
Since the aggregation is per window, after the aggregation was done, I don't really need the "old" keys.
Is there a way to tell Kafka Streams to remove keys from previous windows?
For that matter, I think configuring the changelog topic's retention policy to "compact,delete" would do the job (this is available in Kafka according to KIP-71 and KAFKA-4015).
But is it possible to change that retention policy using the Kafka Streams API?
The suppress() operator sends tombstone messages to the changelog topic when a record is evicted from its buffer and sent downstream. Thus, you don't need to worry about unbounded growth of the topic. Changing the compaction policy might in fact break the guarantees that the operator provides, and you might lose data.
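For illustration, a minimal sketch of the pattern discussed here, emitting only the final count per window via suppress(); the topic names, serdes, window size, and grace period are assumptions:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class FinalWindowResults {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // 5-minute windows with a 1-minute grace period for late records
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
               .count()
               // Emit each window's result only once, after the window has closed.
               // suppress() backs its buffer with a compacted changelog topic and writes
               // tombstones when records are evicted and forwarded downstream.
               .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
               // Flatten the windowed key into a plain string for the output topic.
               .toStream((windowedKey, count) ->
                       windowedKey.key() + "@" + windowedKey.window().start())
               .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}
```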
I have an application in which I'm using a KStream-KStream join and a KStream-KTable join.
I updated the input source topic's partition count from 4 to 16, and the application stopped with the error below.
Could not create internal topics: Existing internal topic application-test-processor-KSTREAM-JOINTHIS-0000000009-store-changelog has invalid partitions. Expected: 16 Actual: 4. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing. Retry #3
How do I update the internal changelog topic's partition count when a source topic's partition count is updated?
Note: we are using Kafka version 0.10.2.1.
I looked at the application reset tool at this link: https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html
but it doesn't say how to update the changelog topic's partition count.
Thanks in advance.
Using the reset tool is actually recommended.
The state of your application is sharded based on the number of input partitions, which was originally 4. Thus, changing it to 16 broke the application. If you were to manually add partitions to the changelog topic (which is possible and would resolve the exception, but would not really fix the issue), the state would not be redistributed and would therefore be corrupted.
If you use the reset tool, you delete all state and let your application reprocess all input data from scratch. This allows Kafka Streams to recreate the state correctly (now with 16 shards).
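On the application side, after running the reset tool you also need to wipe the local state directories, which can be done with KafkaStreams#cleanUp() before starting. A minimal sketch, assuming a newer Kafka Streams API than 0.10.2.1, a hypothetical application id, and a placeholder topology:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ResetAndRestart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "application-test"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Placeholder topology; in the real application this would be the joins.
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // After running the application reset tool, clean up the local state stores
        // so the application rebuilds its state (now with 16 shards) from scratch.
        streams.cleanUp();
        streams.start();
    }
}
```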
I'm experimenting with Kafka Streams and I have the following setup:
I have an existing kafka topic with a key space that is unbounded (but predictable and well known).
My topic has a retention policy (in bytes) to age out old records.
I'd like to materialize this topic into a KTable where I can use the Interactive Queries API to retrieve records by key.
Is there any way to make my KTable "inherit" the retention policy from my topic, so that when records are aged out of the primary topic, they're no longer available in the KTable?
I'm worried about dumping all of the records into the KTable and having the StateStore grow unbounded.
One solution I can think of is to transform this into a windowed stream with hopping windows equal to a time-to-live for each record, but I'm wondering if there's a better, more native solution.
Thanks.
Unfortunately, this is not supported at the moment. There is a JIRA ticket for it though: https://issues.apache.org/jira/browse/KAFKA-4212
Another possibility would be to insert tombstone messages (<key,null>) into the input topic. The KTable would pick those up and delete the corresponding key from the store.
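For context, a minimal sketch of the setup described in the question, assuming a recent Kafka Streams version and hypothetical topic, store name, and serdes: the topic is materialized into a KTable backed by a queryable store, and a tombstone written to the input topic (as suggested above) would remove that key from the store.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class QueryableTable {
    public static void build(StreamsBuilder builder) {
        // Materialize the topic into a KTable backed by a queryable state store.
        builder.table("input-topic",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("records-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));
    }

    public static String lookup(KafkaStreams streams, String key) {
        // Interactive Queries: read the latest value for a key from the store.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("records-store",
                        QueryableStoreTypes.keyValueStore()));
        return store.get(key);
    }
}
```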