How to delete unused topics? - apache-kafka

In our design, Kafka topics are created on the fly and used for a short period of time. As new topics are created and used, previously created topics go out of use, so there is a need to periodically delete unused topics. Say, for simplicity, we would like to delete all topics that have not been used (and are empty) for the last 2 days.
Is there an established solution for this case? Any pointers on how to achieve this?
(We will be using AWS MSK (Kafka version 2.8))
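There is no built-in, established solution for this; a common approach is a small scheduled cleanup job. Below is a minimal, hypothetical sketch of just the selection logic; in a real job you would populate topic_stats from the Kafka AdminClient (e.g. per-partition beginning/end offsets and last-record timestamps) and pass the result to a delete-topics call. All names here are made up for illustration.

```python
from datetime import datetime, timedelta

def select_stale_topics(topic_stats, now, max_idle=timedelta(days=2)):
    """Pick topics that are empty and have been idle for at least `max_idle`.

    topic_stats maps topic name -> (message_count, last_write_time).
    In a real job these values would come from the Kafka AdminClient:
    message_count from end offset minus beginning offset per partition,
    last_write_time from the timestamp of the latest record.
    """
    stale = []
    for topic, (message_count, last_write) in topic_stats.items():
        if message_count == 0 and now - last_write >= max_idle:
            stale.append(topic)
    return stale
```

A scheduler (cron, Quartz, a Kubernetes CronJob) would run this selection periodically and feed the result into the AdminClient's topic deletion.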

Related

Apache Kafka messages got archived - is it possible to retrieve the messages

We are using Apache Kafka and we process more than 30 million messages per day. We have a retention policy of 30 days. However, before 30 days had passed, our messages got archived.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to an older index to retrieve the data through a query?
What other options do we have?
If we have "disk backup", could we use that for retrieving the data?
Thank You
I'm assuming your messages got deleted by the Kafka cluster here.
In general, no - if the records were deleted due to duration/size-related retention policies, then they have been removed.
Theoretically, if you have access to backups you might move the Kafka data/log files back into a broker's log directory, but the behaviour is undefined. Trying that with a fresh cluster with infinite size/time policies (so nothing gets purged instantly) might work and let you consume the data again.
In my experience, until the general availability of Tiered Storage, there is no free or easy way to recover deleted data via the Kafka consumer protocol; you have to persist it somewhere else ahead of time.
For example, you can use a Kafka Connect sink connector to write to some external, more persistent storage. Then, would you want to write a job that scrapes that data? Sure, you could have a SQL database table of STRING topic, INT timestamp, BLOB key, BLOB value, and maybe track "consumer offsets" separately from it. But with that design, Kafka doesn't really seem useful, as you'd be reimplementing various parts of it when you could have just added more storage to the Kafka cluster.
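To make that table design concrete, here is a rough sketch using SQLite purely for illustration (table and column names are made up; a real setup would use a Connect sink connector writing to a production database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kafka_archive (
        topic  TEXT NOT NULL,
        ts     INTEGER NOT NULL,   -- record timestamp (epoch millis)
        key    BLOB,
        value  BLOB
    )
""")

# A sink connector (or custom consumer) would append every record here...
conn.execute(
    "INSERT INTO kafka_archive (topic, ts, key, value) VALUES (?, ?, ?, ?)",
    ("orders", 1650000000000, b"user-1", b'{"total": 42}'),
)

# ...and a replay job would query by topic and time range.
rows = conn.execute(
    "SELECT key, value FROM kafka_archive WHERE topic = ? AND ts >= ? ORDER BY ts",
    ("orders", 1640000000000),
).fetchall()
```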
Is it possible to reset the "start index" to older index to retrieve the data through query?
That is what auto.offset.reset=earliest will do, or kafka-consumer-groups --reset-offsets --to-earliest.
have "disk backup", could we use that
With caution, maybe. For example, you can copy old broker log segments onto a server, but there aren't any tools I know of that will retroactively discover the new "low watermark" (log start offset) of each topic (maybe the broker finds it upon restart; I haven't tested). You'd need to copy this data for each broker manually, I believe, since the replicas wouldn't know about the old segments (again, they might after a full cluster restart).
Plus, the consumer offsets would already be reading way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you have gaps in the segment files. E.g. your current oldest segment is N and you copy N-2, but not N-1... You might then run into an error, or the consumer will simply apply the auto.offset.reset policy and seek to the next available offset or to the very end of the topic.
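The seek behaviour described here can be illustrated with a tiny stand-alone simulation; this is not Kafka's actual code, just the auto.offset.reset decision logic in miniature:

```python
def resolve_position(requested, log_start, log_end, policy="latest"):
    """Where a consumer ends up when it requests `requested` on a partition
    whose retained range is [log_start, log_end).

    If the requested offset is still available it is used as-is; otherwise
    auto.offset.reset decides: 'earliest' seeks to the first available
    offset, 'latest' to the end, and anything else fails (analogous to
    OffsetOutOfRangeException with auto.offset.reset=none).
    """
    if log_start <= requested < log_end:
        return requested
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    raise ValueError("offset out of range and no reset policy")
```

So a consumer committed at offset 5 on a partition whose retained range starts at 100 will jump to 100 with "earliest", or skip straight to the end with "latest".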

How to implement Time To Live (TTL) for a Kafka topic

I want to have a topic deleted after some predefined time of inactivity.
To give you some context, there's a microservice that has many replicas, and each replica has its own topic to communicate, identified by its replica Id (e.g. topic_microservice-name_<random_id>).
If for any reason, a replica crashes, K8s will start another Pod, with a completely different replica Id, therefore the previous topic will not be used anymore. For this reason, after some time there could be many useless topics.
Does kafka have a built-in Time To Live for the whole topic?
Another idea I have is a Quartz job iterating over all topics, somehow getting the last modified/written date, and checking whether the TTL has expired.
There currently isn't a way to give a topic a TTL, where once the TTL expires Kafka automatically deletes the topic.
One can configure retention on the topic level (retention.ms - how long messages should be retained for this topic, or retention.bytes - the maximum size a partition may grow to before old segments are discarded). With this, you could have a separate service leveraging the AdminClient to run scheduled operations on your topics. The logic could simply be: iterate over the topics, filter out the active ones, and delete each topic that has been inactive long enough for the retention policy to take effect (i.e. it is now empty).
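That filtering step could be sketched as follows, assuming you have already fetched each topic's per-partition log-start/end offsets (via the AdminClient or a consumer); everything here is illustrative, not a real client call:

```python
def is_drained(partition_offsets):
    """True once retention has purged every record: for each partition the
    log start offset has caught up with the end offset.

    partition_offsets: iterable of (log_start_offset, end_offset) pairs,
    which a real job would fetch from the cluster per topic.
    """
    return all(start == end for start, end in partition_offsets)

def topics_to_delete(all_topics, active_topics, offsets_by_topic):
    """Inactive topics whose retention window has fully elapsed."""
    return [t for t in all_topics
            if t not in active_topics and is_drained(offsets_by_topic[t])]
```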
The original question, whether a Kafka topic actually has a TTL, has already been answered (it is NO as of writing this answer).
This answer deals with several ways to handle deletion of topics w.r.t your scenario.
Write a container preStop hook
Here you can execute the topic-deletion code upon pod termination. This could be a simple approach.
Hook implementations include an exec command or an HTTP call.
You can, for example, include a small wrapper script on top of kafka-topics.sh, or a simple Python script that connects to the broker and deletes the topic.
You might also want to take note of terminationGracePeriodSeconds and increase it accordingly if your topic-deletion script takes longer than this value.
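As an illustration, a Pod spec fragment with such a hook might look like this (the script path, container name, and environment variable are hypothetical; delete-topic.sh would wrap kafka-topics.sh):

```yaml
spec:
  terminationGracePeriodSeconds: 60   # give the deletion script time to finish
  containers:
    - name: my-microservice
      image: my-microservice:latest
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - /opt/scripts/delete-topic.sh "topic_microservice-name_${REPLICA_ID}"
```

Note that preStop hooks are best-effort: they do not run if the node dies abruptly, so a separate reconciliation job is still a useful backstop.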
Get notified using Kubernetes Watch APIs
You may need to write a client that listens to these events and uses the AdminClient to delete the topics corresponding to the terminated pod. This watcher typically needs to run separately from the pods being terminated.
Find out which topics need to be deleted by getting the list of active pods.
Retrieve the pod replicas available in the Kubernetes cluster using Kubernetes API.
Iterate over all the topics and delete those which do not conform to the above retrieved list.
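These steps could be sketched as follows, assuming the per-replica naming scheme from the question (the regex and function names are made up; in a real job all_topics would come from the AdminClient and active_replica_ids from the Kubernetes API):

```python
import re

# Hypothetical naming scheme from the question: topic_<service>_<replica_id>
TOPIC_PATTERN = re.compile(r"^topic_(?P<service>.+)_(?P<replica_id>[^_]+)$")

def orphaned_topics(all_topics, active_replica_ids):
    """Topics following the per-replica naming scheme whose replica id no
    longer matches any live pod. Topics not matching the scheme are left
    alone, so unrelated topics are never deleted by accident."""
    orphans = []
    for topic in all_topics:
        match = TOPIC_PATTERN.match(topic)
        if match and match.group("replica_id") not in active_replica_ids:
            orphans.append(topic)
    return orphans
```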
P.S:
Note that deleting topics is an administrative task and is typically done manually after some verification checks.
Creating a lot of topics isn't recommended, as maintenance becomes difficult. If your applications are creating a lot of topics, for example as many as the number of workload instances running, then it might be time to rethink your application design.

Splitting Kafka into separate topic or single topic/multiple partitions

As usual, it's a bit confusing to see the benefits of one splitting method over the others.
I can't see the difference/pros-cons between having
Topic1 -> P0 and Topic 2 -> P0
over Topic 1 -> P0, P1
and a consumer pull from 2 topics or single topic/2 partitions, while P0 and P1 will hold different event types or entities.
The only benefit I can see is that if another consumer needs Topic 2's data, then it's easy to consume.
Regarding topic auto-generation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide between multiple topics and multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, packing different entities into the partitions of a single topic won't allow you to implement, e.g., message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine while a topic can be distributed across the whole Kafka Cluster by partitioning it across multiple partitions. Kafka Docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism - more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously the number of active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
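On the ordering point above: keyed messages preserve per-key order because the target partition is a deterministic function of the key. A simplified stand-in for the default partitioner (Kafka actually applies murmur2 to the serialized key; crc32 is used here only to keep the sketch dependency-free):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Simplified keyed partitioner. Kafka's default partitioner hashes the
    serialized key with murmur2 and takes it modulo the partition count;
    crc32 here is just a deterministic stand-in for the sketch."""
    return zlib.crc32(key) % num_partitions

# All messages for the same user land in the same partition,
# so their relative order is preserved for consumers.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
```

Note this is also why increasing the partition count of an existing topic breaks key-to-partition mapping for previously written data.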
Coming to your second question, I am not sure what your requirement is or how it relates to the first question. When a producer attempts to write a message to a Kafka topic that does not exist, the topic is created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
Just adding some things on top of Giorgos' answer:
By choosing the second approach (a single topic with multiple partitions) over the first one, you would lose a lot of features that Kafka offers at the topic level: data balancing across brokers, removing individual topics, separate consumer groups, ACLs, joins with Kafka Streams, etc.
I think this flag can easily be compared with automatically creating tables in your database: it's handy in dev environments, but you never want it to happen in production.

Kafka: multiple consumers in the same group

Let's say I have a Kafka cluster with several topics spread over several partitions. Also, I have a cluster of applications acting as Kafka clients. Each application in that cluster has a client subscribed to the same set of topics, identical across the whole cluster. Also, all of these clients share the same Kafka group ID.
Now, speaking of commit mode: I really do not want to specify offsets manually, but I do not want to use autocommit either, because I need to do some handling after I receive my data from Kafka.
With this setup, I expect the "same data received by different consumers" problem to occur, because I do not specify an offset before reading (consuming), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of multiple reads? Several options come to mind:
1) Exclusive (sequential) Kafka access: until one consumer has committed its read, no other consumer accesses Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that given that a read might fail (and the offset will not be committed) - we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend something to achieve the behavior I need.
Every partition is consumed by only one client within a consumer group - another client with the same group ID won't be assigned that partition, so concurrent reads of the same records won't occur...
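The invariant behind this answer, that each partition is owned by exactly one group member at a time, can be illustrated with a toy assignor (Kafka's real range/round-robin/sticky assignors are more involved, but preserve the same invariant):

```python
def assign_round_robin(partitions, consumers):
    """Toy group assignment: distribute partitions over group members
    round-robin. Each partition ends up with exactly one consumer, which is
    why two members of the same group never read the same records."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

With 5 partitions and 2 consumers in the group, one consumer owns three partitions and the other owns two; adding a sixth consumer to this group would leave one member idle.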

what is the best approach to keep two kafka clusters in Sync

I have to set up two Kafka clusters in two different data centers (DCs) with the same topics and configuration. The reason is that the connectivity between the two data centers is poor, so we cannot create a single global cluster.
We have producers and consumers publishing and subscribing to the topics of each DC.
The problem is that I need to keep both clusters in sync.
Let's say: all messages written to the first DC should eventually be replicated to the second, and the other way around.
I am evaluating the Kafka MirrorMaker tool, creating the mirror by consuming messages from the first cluster and producing them to the second one. However, it is also required to replicate data from the second cluster to the first, because writing is allowed on both clusters.
I don't think the Kafka MirrorMaker tool fits our case.
I'd appreciate any suggestions.
Thanks in advance.
Depending on your exact requirements, you can use MirrorMaker for your use case.
One option would be to just have two separate topics, let's call them topic1 on cluster 1 and topic2 on cluster 2. All your producing threads write to the "local" topic, and you use MirrorMaker to replicate this topic to the remote cluster.
For your consumers, you simply subscribe to both topics on whatever cluster is closest to you, that way you will get all records that were written on either cluster.
I have created an illustration that hopefully helps:
Alternatively, you could create aggregation topics on both clusters and use MirrorMaker to replicate data into them; this would give you all data in one topic for consumption.
You would have duplicate data on the same cluster this way, but you could take care of that with lower retention settings on the input topics.
Again, hopefully the following picture helps to explains my thinking:
In order for this to work, you will need to configure MirrorMaker to replicate a topic into a topic with a different name, which is not a standard thing for it to do; I have written a small blog post on how to do this, if you want to investigate this option further.
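For completeness: MirrorMaker 2 (shipped with Kafka 2.4 and later) supports this kind of active/active setup out of the box; its default replication policy renames replicated topics by prefixing the source cluster alias, so the rename workaround is no longer needed. A minimal mm2.properties sketch (cluster aliases, hosts, and topic names are placeholders):

```properties
# mm2.properties (illustrative; aliases and addresses are placeholders)
clusters = dc1, dc2
dc1.bootstrap.servers = kafka-dc1:9092
dc2.bootstrap.servers = kafka-dc2:9092

# replicate in both directions
dc1->dc2.enabled = true
dc1->dc2.topics = topic1
dc2->dc1.enabled = true
dc2->dc1.topics = topic1

# with the default replication policy, dc2 ends up with a "dc1.topic1"
# topic and dc1 with "dc2.topic1"; consumers subscribe to both local topics
```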