Cannot make all messages in a Kafka topic expired with retention - apache-kafka

I often clean up all current messages in a Kafka topic by updating retention.ms to 10, which should make all messages expire after 10 ms. However, sometimes the messages cannot be cleaned up that way, and I have to
drop and re-create the topic in order to clean up all messages.
I'm not sure whether it's related or not, but it often happens after all consumers of that topic have stopped working for some reason.
What could be the root cause for this?

retention.ms is only the minimum time a message is retained. The retention check runs periodically (the Kafka docs state every 300000 ms by default, via log.retention.check.interval.ms) and deletes only closed log segments (default segment size 1 GB), so you may have to wait for the check to run, or produce more data into the topic so the active segment rolls.
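The two conditions above can be sketched without a broker. This is a plain Python illustration of the deletion rule, not Kafka's actual code: only closed segments whose newest record is older than the retention are eligible, and the active segment never is.

```python
def expired_segments(segments, active_segment, retention_ms, now_ms):
    """Return the closed segments whose newest record is older than retention.
    The active (open) segment is never eligible, no matter how old its data is."""
    return [s for s in segments
            if s is not active_segment
            and now_ms - s["last_record_ts"] > retention_ms]

# Two closed segments plus the active one; retention.ms = 10
now = 1_000_000
segments = [
    {"name": "00000000.log", "last_record_ts": now - 60_000},  # old, closed
    {"name": "00000100.log", "last_record_ts": now - 30_000},  # old, closed
    {"name": "00000200.log", "last_record_ts": now - 60_000},  # old, but ACTIVE
]
active = segments[-1]

doomed = expired_segments(segments, active, retention_ms=10, now_ms=now)
print([s["name"] for s in doomed])  # the active segment is not in the list
```

This is why a tiny retention.ms alone is not enough: until the active segment rolls (and the periodic check runs), the data stays visible.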

Related

Apache Kafka Cleanup while consuming messages

Playing around with Apache Kafka and its retention mechanism I'm thinking about following situation:
A consumer fetches first batch of messages with offsets 1-5
The cleaner deletes the first 10 messages, so the topic now has offsets 11-15
In the next poll, the consumer fetches the next batch with offsets 11-15
As you can see the consumer lost the offsets 6-10.
Question: is such a situation possible at all? In other words, will the cleaner execute while there is an active consumer? If yes, is the consumer able to somehow recognize that gap?
Yes such a scenario can happen. The exact steps will be a bit different:
Consumer fetches message 1-5
Messages 1-10 are deleted
Consumer tries to fetch message 6 but this offset is out of range
Consumer uses its offset reset policy auto.offset.reset to find a new valid offset.
If set to latest, the consumer moves to the end of the partition
If set to earliest the consumer moves to offset 11
If none or unset, the consumer throws an exception
To avoid such scenarios, you should monitor the lead of your consumer group. It's similar to the lag, but the lead indicates how far from the start of the partition the consumer is. Being near the start has the risk of messages being deleted before they are consumed.
If consumers get too close to the log start, you can dynamically add more consumers or increase the topic retention size/time if needed.
Setting auto.offset.reset to none will throw an exception if this happens; the other values only log it.
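The reset behaviour described in the steps above can be sketched in a few lines. This is plain Python with illustrative names, not the Kafka client's actual implementation:

```python
class OffsetOutOfRangeError(Exception):
    pass

def resolve_offset(requested, log_start, log_end, auto_offset_reset):
    """Mimic what the consumer does when a fetch offset is out of range."""
    if log_start <= requested <= log_end:
        return requested                      # offset still valid, no reset needed
    if auto_offset_reset == "earliest":
        return log_start                      # jump to the first surviving record
    if auto_offset_reset == "latest":
        return log_end                        # jump to the end of the partition
    # "none" (or unset): surface the gap to the application
    raise OffsetOutOfRangeError(f"offset {requested} not in [{log_start}, {log_end}]")

# Messages 1-10 were deleted, the log now holds offsets 11-15; consumer asks for 6:
print(resolve_offset(6, 11, 15, "earliest"))  # 11
print(resolve_offset(6, 11, 15, "latest"))    # 15
```

With none, the exception is the only way the application learns that records were skipped, which is why that setting is recommended when silent gaps are unacceptable.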
Question, is such a situation possible at all? will the cleaner execute while there is an active consumer
Yes, if the messages have crossed TTL (Time to live) period before they are consumed, this situation is possible.
Is the consumer able to somehow recognize that gap?
If you suspect your configuration (high consumer lag, low TTL) might lead to this, the consumer should track offsets. The kafka-consumer-groups.sh command gives you the position of all consumers in a consumer group, as well as how far behind the end of the log they are.
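Lead and lag, as used in both answers above, are simple to compute from the numbers kafka-consumer-groups.sh reports. A sketch (not Kafka's own code):

```python
def consumer_lag(log_end_offset, committed_offset):
    """How far behind the newest record the consumer is."""
    return log_end_offset - committed_offset

def consumer_lead(committed_offset, log_start_offset):
    """How far ahead of the oldest surviving record the consumer is.
    A small lead means retention may delete records before they are read."""
    return committed_offset - log_start_offset

# Partition holds offsets 100..1000, consumer has committed up to 120:
print(consumer_lag(1000, 120))   # 880
print(consumer_lead(120, 100))   # 20 -> dangerously close to the log start
```

A large lag is merely slow; a small lead is what actually risks data loss to retention.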

Kafka messages not getting purged

I am new to Kafka. I am experimenting with how to purge messages in a Kafka topic. I found that if we set the "retention.ms" property for a topic to some small value, let's say 1 second, then after 1 second the messages in the topic should be purged, as per my understanding.
I ran 1 producer which produced a few messages to the topic and stopped it after some time. At the same time I ran a console consumer, so it got the generated messages.
I started another console consumer for the same topic after the retention time had elapsed, let's say after 1-2 minutes. But to my surprise, I was still able to get the messages on that topic.
I started a console consumer again after 2 more minutes, when I finally didn't see any messages in the topic. It took almost 3-4 minutes for Kafka to purge the messages.
Is there any additional settings required at Kafka so that messages will be purged instantly ?
Setting retention.ms will not guarantee a message is deleted from the topic immediately, only that it becomes eligible for deletion.
If your messages are key/value pairs on a compacted topic, setting a retention time alone is not good enough. You have to look at the compaction parameters as well:
log.cleanup.policy
log.cleaner.min.compaction.lag.ms
log.cleaner.enable
Another set of parameters controls the time-based deletion of messages, if they are present in your config:
log.retention.ms
log.roll.hours
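Put together, for a fast purge with the default delete policy, the topic-level overrides that matter are retention and segment roll. A hedged example (the values are illustrative, not recommended production settings):

```properties
# Topic-level overrides for fast purging (cleanup.policy=delete assumed):
retention.ms=1000    # segments whose newest record is older than 1 s become deletable
segment.ms=1000      # roll the active segment quickly; the active segment is never deleted
```

Even with these settings, deletion still waits for the broker's periodic retention check (log.retention.check.interval.ms, 5 minutes by default), which matches the 3-4 minute delay observed above.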

Kafka retention AFTER initial consuming

I have a Kafka cluster with one consumer, which is processing TB's of data every day. Once a message is consumed and committed, it can be deleted immediately (or after a retention of few minutes).
It looks like the log.retention.bytes and log.retention.hours configurations count from message creation, which is not good for me.
In case where the consumer is down for maintenance/incident, I want to keep the data until it comes back online. If I happen to run out of space, I want to refuse accepting new data from the producers, and NOT delete data that wasn't consumed yet (so the log.retention.bytes doesn't help me).
Any ideas?
If you can ensure your messages have unique keys, you can configure your topic to use compaction instead of a timed-retention policy. Then have your consumer, after processing each message, send a message back to the same topic with the same key but a null value. Kafka would compact away such messages. You can tune the compaction parameters to your needs (and the log segment file size: since the head segment is never compacted, you may want to set it to a smaller size if you want compaction to kick in sooner).
However, as I mentioned before, this would only work if messages have unique keys, otherwise you can't simply turn on compaction as that would cause loss of previous messages with the same key during periods when your consumer is down (or has fallen behind the head segment).
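The compact-with-tombstones idea can be sketched without a broker. This naive Python model keeps only each key's latest value and drops keys whose latest value is a tombstone, which is what Kafka's cleaner eventually does to the compacted part of the log:

```python
def compact(log):
    """Naive log compaction: keep each key's latest value; a None value
    (tombstone) removes the key entirely once compaction runs."""
    latest = {}
    for key, value in log:          # later records win
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", None),      # consumer wrote a tombstone after processing order-2
]
print(compact(log))  # {'order-1': 'paid'}
```

Note the unique-key requirement from the answer above: if two unrelated messages shared a key, compaction would silently discard the earlier one even though it was never processed.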

Apache Kafka and durable subscription

I'm considering using Apache Kafka and I could not find any information about durable subscriptions. Let's say I have an expiration of 5 seconds for messages in my partition. Now if a consumer fails and reconnects after 5 seconds, the messages it missed will be gone. Even worse, it won't know that it missed a message. The durable subscription pattern solves this by saving the message for the consumer that failed or was disconnected. Is a similar feature implemented in Kafka?
This is not supported by Kafka. But you can of course always increase your retention time, and thus limit the probability that a consumer misses messages.
Furthermore, if you set auto.offset.reset to none you will get an exception that informs you if a consumer misses any messages. Hence, it is possible to get informed if this happens.
Last but not least, it might be possible to use a compacted topic -- this would ensure that messages are not deleted until you explicitly write a so-called tombstone message. Note that records must have unique keys to use a compacted topic.

Simple-Kafka-consumer message delivery duplication

I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume the messages successfully, but the problem occurs when I restart the consumer, wherein some of the already consumed messages are again getting picked up by consumer from Kafka (not all messages, but a few of the last consumed messages).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses ZooKeeper to store consumer offsets. Since ZooKeeper operations are pretty slow, it's not advisable to commit the offset after consuming every single message.
It's possible to add a shutdown hook to the consumer that manually commits the topic offset before exit. However, this won't help in certain situations (like a JVM crash or kill -9). To guard against those situations, I'd advise implementing custom commit logic that commits the offset locally after processing each message (to a file or local database), and also commits the offset to ZooKeeper every 1000 ms. Upon consumer startup, both of these locations should be queried, and the maximum of the two values should be used as the consumption offset.
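The recovery step of that dual-commit scheme might look like the following plain Python sketch (the storage backends are stand-ins, and the function name is illustrative):

```python
def startup_offset(local_offset, zookeeper_offset):
    """On consumer startup, resume from the larger of the two committed offsets.
    The local store is written per message while ZooKeeper is only written
    every ~1000 ms, so the local value is usually ahead -- unless the local
    file was lost, in which case the ZooKeeper value limits the replay."""
    candidates = [o for o in (local_offset, zookeeper_offset) if o is not None]
    return max(candidates) if candidates else 0

print(startup_offset(local_offset=1540, zookeeper_offset=1500))  # 1540
print(startup_offset(local_offset=None, zookeeper_offset=1500))  # 1500 (local file lost)
```

Taking the maximum bounds the duplication to at most the window since the last ZooKeeper commit; it does not eliminate redelivery entirely, which is why consumers should still be idempotent.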