How to migrate a kafka topic to log compaction? - apache-kafka

we have a Kafka topic, which cleanup.policy is currently delete. Messages, which have been produced on this topic, have no keys. I'm able to alter the configuration of this topic and it won't accept any new messages without a key, which is reasonable and desired.
I'm wondering what Kafka is going to do with these old keyless messages, though. Are they going to be treated like they have one key, or aren't they going to be affected by the new cleanup policy?
Are there best practices for migrating, I'm not able to find something about that. Is this an unusual use case?
Thanks in advance

I've made some tests in my Kafka Cluster and answering this for future questions:
Messages without a key are going to be deleted
If you don't add new messages, you might end up with some of the old messages in the partitions, because they are "in the last segment". They are going to be deleted, when you add new messages
I think I'll introduce a new compacted topic and republish my data to the new one. This causes all consumers to consume a new topic, but this is ok in my case.
Good luck future me

Related

Kafka client and aggregated events

In event-driven design we strive to find out events that we interested of. Using Kafka we can easily subscribe (a new group.id) to a topic and start consuming events. If retention policy is default one we could consume also one week old messages if specify auto.offset.reset=earliest. Right? But what if we want to start from the very beginning? I guess that KTable should be used but I'm not sure what will happened when a new client has subscribed to a stateful stream. Could you tell me is it true that the new subscriber will receive all aggregated messages?
You can't consume data that has been deleted.
That's why KTables are built on top of compacted topics, which will store the latest keys for each record, and have infinite retention.
If you want to read the "current state" of the table, to get all aggregated messages, then you can use Interactive Queries.
not sure what will happened when a new client has subscribed to a stateful stream
It needs to read the entire compacted topic, starting from the beginning (earliest available offset, not necessarily the first ever produced message) since it cannot easily find where in the topic that each unique key may start.

kafka message expiry event - how to capture

I am a beginner to Kafka, and recently started using in my projects at work. One important thing that I wanna know is, whether it is possible to capture event(s) when messages expire in kafka. The intent is to trap these expired messages and back them up in a backup store.
I believe the goal you want to achieve is similar to Apache Kafka Tiered Storage which is still under development in Open Source Apache Kafka
Messages don't expire. I think you could think of two different scenarios when you think about messages that expire.
A topic is configured with cleanup.policy = delete . After retention.ms or retention.bytes it looks like messages expire. However what actually happens is that a whole log segment, whose newest message is older than retention.ms or if the partitions retention.bytes is exceeded, will be deleted. It will only be considered for deletion if it is not the active segment which Kafka currently writes to.
A topic is configured with cleanup.policy = tombstone. When two log segments are merged, Kafka will make sure that only the latest version for each distinct key will be kept. To "delete" messages one would send a message with a key to target a message and with an empty value - also called a tombstone.
There's no hook or event you could subscribe to, in order to figure out if either of these two cases will happen. You'd have to take of the logic on the client side, which is hard because the Kafka API does not expose any details about the log segments within a partition.

How I can use Kafka like relational database

good time of day. I am sorry my poor English. I have some issue, can you help me to understand how i can use kafka and kafka streams like database.
My problem is i have some microservices and each service have their data in own database. I need for report purposes collect data in one point, for this i chose the kafka. I use debezuim maybe you know it (change data capture debezium), each table in relational database it is a topic in kafka. And i wrote the application with kafka stream (i joined streams each other) so far good. Example: I have the topic for ORDER and ORDER_DETAILS, after a while will come some event for join this topic, problem is i dont know when come this event maybe after minutes or after monthes or after years. How i can get data in topics ORDER and ORDER_DETAIL after month or year ? It is right way save data in topic infinitely? can you give me some advice maybe have some solutions.
The event will come as soon as there is a change in the database.
Typically, the changes to the database tables are pushed as messages to the topic.
Each and every update to the database will be a kafka message. Since there is a message for every update, you might be interested in only the latest value (update) for any given key which mostly will be the primary key
In this case, you can maintain the topic infinitely (retention.ms=-1) but compact (cleanup.policy=compact) it in order to save space.
You may also be interested in configuring segment.ms and/or segment.bytes for further tuning the topic retention.

How to make restart-able producer?

Latest version of kafka support exactly-once-semantics (EoS). To support this notion, extra details are added to each message. This means that at your consumer; if you print offsets of messages they won't be necessarily sequential. This makes harder to poll a topic to read the last committed message.
In my case, consumer printed something like this
Offset-0 0
Offset-2 1
Offset-4 2
Problem: In order to write restart-able proudcer; I poll the topic and read the content of last message. In this case; last message would be offset#5 which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at : Getting the last message sent to a kafka topic. The only problem is that instead of using consumer.seek(partition, last_offset=1); I would use consumer.seek(partition, last_offset-2). This can immediately resolve my issue, but it's not an ideal solution.
What would be the most reliable and best solution to get last committed message for a consumer written in Java? OR
Is it possible to use local state-store for a partition? OR
What is the most recommended way to store last message to withstand producer-failure? OR
Are kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI- I am not looking for quick fix
In my case, multiple producers push data to one big topic. Therefore, reading entire topic would be nightmare.
The solution that I found is to maintain another topic i.e. "P1_Track" where producer can store metadata. Within a transaction a producer will send data to one big topic and P1_Track.
When I restart a producer, it will read P1_Track and figure out where to start from.
Thinking about storing last committed message in a database and using it when producer process restarts.

Kafka stream application not consume data after restart

After I did restart our Kafka cluster my application of Kafka streams didn't receive messages from input topic and I got an exception of "can׳t create internal topic". After some research, I did reset with the Kafka tool (to the input topic and the application) the tool is Kafka-streams-application-reset.sh.
Unfortunately, it didn't resolve the problem and I also got the exception again
From the error message, you can infer that the topic already exists and thus, cannot be created. The reason for the failure is, that the existing topic does not have the expected number of partitions (it has 1 instead of 150) -- if the number of partitions would match, Kafka Streams would just use the existing topic.
This can happen, if you have topic auto-create enabled at the brokers (and the topic was created with a wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions for the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this, it to either manually delete this topic: note, that this might result in data loss and you should only do this, if you know that it is what you want.
Another (better way) would be, to reset the application cleanly using bin/kafka-streams-application-reste.sh in combination with KafkaStreams#cleanup().
Because you need to clean up the application and users should be aware of the implication, Kafka Streams fails to make user aware of the issue instead of "auto magically" take some actions that might be undesired from a user point of view.
Check out the docs for more details. There is also a blog post that explains application reset in details:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/