How does Kafka replay work with log compaction? - apache-kafka

In Kafka, if log compaction is enabled, only the most recent value for each key is stored. If we try to replay these messages, will it replay just the latest messages? How exactly does Kafka replay work?

Yes. For each key, records at earlier offsets are removed and the record at the newest offset is kept; the surviving records keep their original offsets. The consumer skips over the resulting gaps in the broker offsets and reads every message that is still available.
Also, log compaction runs in the background on a schedule, so you might see the same key more than once within a partition for a certain amount of time, depending on the properties defined on the broker/topic.
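To make the offset gaps concrete, here is a minimal plain-Java simulation. It does not use the Kafka client at all; the `Entry` class, the `compact` method and the sample data are made up for illustration, and the only assumption taken from the answer above is that compaction keeps the newest record per key and leaves offsets unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompactionReplaySketch {

    /** One log entry; its offset is assigned at append time and never changes. */
    static final class Entry {
        final long offset;
        final String key;
        final String value;

        Entry(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    /** Keeps only the highest-offset entry per key; survivors keep their offsets. */
    static List<Entry> compact(List<Entry> log) {
        Map<String, Long> newestOffset = new HashMap<>();
        for (Entry e : log) {
            newestOffset.put(e.key, e.offset); // later entries overwrite earlier ones
        }
        List<Entry> kept = new ArrayList<>();
        for (Entry e : log) {
            if (newestOffset.get(e.key) == e.offset) {
                kept.add(e);
            }
        }
        return kept;
    }

    /** Hypothetical sample data: key k1 is written three times. */
    static List<Entry> sampleLog() {
        List<Entry> log = new ArrayList<>();
        log.add(new Entry(0, "k1", "a"));
        log.add(new Entry(1, "k2", "b"));
        log.add(new Entry(2, "k1", "c"));
        log.add(new Entry(3, "k3", "d"));
        log.add(new Entry(4, "k1", "e"));
        return log;
    }

    /** Offsets a consumer would see when replaying the given log from the start. */
    static List<Long> replayedOffsets(List<Entry> log) {
        List<Long> offsets = new ArrayList<>();
        for (Entry e : log) {
            offsets.add(e.offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Offsets 0 and 2 are gone after compaction; a replay simply skips them.
        System.out.println(replayedOffsets(compact(sampleLog()))); // [1, 3, 4]
    }
}
```

A replay from offset 0 of the compacted sample log returns the records at offsets 1, 3 and 4, in that order.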

Related

Does a Kafka consumer read messages from the active segment in the partition?

Let us say I have a partition (partition-0) with 4 segments that are committed and are eligible for compaction. So all these segments will not have any duplicate data since the compaction is done on all the 4 segments.
Now, there is an active segment which is still not closed. Meanwhile, if the consumer starts reading data from partition-0, does it also read the messages from the active segment?
Note: My goal is to not provide duplicate data to the consumer for a particular key.
Your concerns are valid as the Consumer will also read the messages from the active segment. Log compaction does not guarantee that you have exactly one value for a particular key, but rather at least one.
Here is how Log Compaction is introduced in the documentation:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition.
However, you can try to get the compaction running more frequently to keep your active and non-compacted segments as small as possible. This, however, comes at a cost, as running the log cleaner takes up resources.
There are a lot of topic-level configurations related to log compaction. Here are the most important ones; all details can be looked up in the topic configuration section of the Kafka documentation:
delete.retention.ms
max.compaction.lag.ms
min.cleanable.dirty.ratio
min.compaction.lag.ms
segment.bytes
However, I am quite convinced that you will not be able to guarantee that your consumer is never getting any duplicates with a log compacted topic.
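As an illustration of why duplicates remain visible, the following is a small plain-Java simulation, not Kafka code. It assumes, as described above, that only closed segments are compacted and that the active segment is read as-is; the class, method names and sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ActiveSegmentSketch {

    static String[] record(String key, String value) {
        return new String[] {key, value};
    }

    /** Compacts the closed part of the log: one value per key, ordered by last write. */
    static List<String[]> compactClosed(List<String[]> closed) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] kv : closed) {
            latest.remove(kv[0]);      // re-insert so the key moves to its newest position
            latest.put(kv[0], kv[1]);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            out.add(record(e.getKey(), e.getValue()));
        }
        return out;
    }

    /** What a consumer reads from the beginning: compacted closed segments, then the active one untouched. */
    static List<String> read(List<String[]> closed, List<String[]> active) {
        List<String> seen = new ArrayList<>();
        for (String[] kv : compactClosed(closed)) {
            seen.add(kv[0] + "=" + kv[1]);
        }
        for (String[] kv : active) {
            seen.add(kv[0] + "=" + kv[1]);
        }
        return seen;
    }

    /** Hypothetical data: k1 and k2 appear in both the closed and the active part. */
    static List<String> sampleRead() {
        List<String[]> closed = Arrays.asList(
                record("k1", "v1"), record("k2", "v1"), record("k1", "v2"));
        List<String[]> active = Arrays.asList(
                record("k1", "v3"), record("k2", "v2"));
        return read(closed, active);
    }

    public static void main(String[] args) {
        System.out.println(sampleRead()); // [k2=v1, k1=v2, k1=v3, k2=v2]
    }
}
```

Even though the closed part holds exactly one value per key, the consumer still sees k1 and k2 twice, because the newer values sit in the active segment.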

Kafka compaction for de-duplication

I'm trying to understand how Kafka compaction works and have the following question: Does Kafka guarantee uniqueness of keys for messages stored in a topic with compaction enabled?
Thanks!
Short answer is no.
Kafka doesn't guarantee uniqueness of keys stored in a topic with compaction enabled.
In Kafka you have two types of cleanup.policy:
delete - Messages become unavailable after a configured time. Several properties control this: log.retention.hours, log.retention.minutes, log.retention.ms. By default log.retention.hours is set to 168, which means that messages older than 7 days will be deleted.
compact - For each key at least one message will be available. In some situations it will be exactly one, but in most cases it will be more. The compaction process runs periodically in the background. It recopies parts of the log, removing duplicates and leaving only the last value for each key.
If you want to read only one value for each key, you have to use the KTable<K,V> abstraction from Kafka Streams.
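The "one value per key" view can be pictured as an upsert table built on the consumer side. The sketch below is plain Java, not the Kafka Streams API; it only mimics the update semantics usually associated with a KTable (a newer value replaces the older one, a null value deletes the key). The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LatestValueTableSketch {

    private final Map<String, String> table = new LinkedHashMap<>();

    /** Applies one consumed record; a null value is treated as a delete marker. */
    void apply(String key, String value) {
        if (value == null) {
            table.remove(key);
        } else {
            table.put(key, value);
        }
    }

    /** Copy of the current state: exactly one value per live key. */
    Map<String, String> snapshot() {
        return new LinkedHashMap<>(table);
    }

    /** Hypothetical stream of records, including a duplicate key and a delete. */
    static Map<String, String> sampleSnapshot() {
        LatestValueTableSketch t = new LatestValueTableSketch();
        t.apply("user-1", "a");
        t.apply("user-2", "b");
        t.apply("user-1", "c");   // replaces the earlier value for user-1
        t.apply("user-2", null);  // delete marker removes user-2
        return t.snapshot();
    }

    public static void main(String[] args) {
        System.out.println(sampleSnapshot()); // {user-1=c}
    }
}
```

The table is only "final" once the consumer has read up to the end of the partition; until then it may still hold values that a later record will replace.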
Related question regarding latest value for key and compaction:
Kafka only subscribe to latest message?
Looking at the 4 guarantees of Kafka compaction, number 4 states:
Any consumer progressing from the start of the log will see at least
the final state of all records in the order they were written.
Additionally, all delete markers for deleted records will be seen,
provided the consumer reaches the head of the log in a time period
less than the topic's delete.retention.ms setting (the default is 24
hours). In other words: since the removal of delete markers happens
concurrently with reads, it is possible for a consumer to miss delete
markers if it lags by more than delete.retention.ms.
So, you can see more than one value for a key whenever the older records sit in a part of the log that the cleaner has not compacted yet.
As I understand it, delete.retention.ms (24h by default, i.e. delete.retention.ms=86400000) only controls how long delete markers stay in the log after compaction; it does not by itself make keys unique. Older records for a key disappear only once the cleaner has processed their segments, and many other messages for the same key may have arrived since then.
So, it is guaranteed that you'll catch at least the last value, but not just the last, because compaction has not yet acted on recent messages.
edit. As cricket's comment states, even if you set a delete retention property of 1 day, log.roll.ms is what defines when a log segment is closed, based on the messages' timestamps. As the last (active) segment is never eligible for compaction, it becomes the second factor that prevents you from having just the last value for a given key. If your topic starts at T0, then messages after T0+log.roll.ms will be in the open log segment and thus not compacted.

Kafka retention AFTER initial consuming

I have a Kafka cluster with one consumer, which is processing TBs of data every day. Once a message is consumed and committed, it can be deleted immediately (or after a retention of a few minutes).
It looks like the log.retention.bytes and log.retention.hours configurations count from message creation, which is not good for me.
In case where the consumer is down for maintenance/incident, I want to keep the data until it comes back online. If I happen to run out of space, I want to refuse accepting new data from the producers, and NOT delete data that wasn't consumed yet (so the log.retention.bytes doesn't help me).
Any ideas?
If you can ensure your messages have unique keys, you can configure your topic to use compaction instead of a time-based retention policy. Then have your consumer, after it has processed each message, send a message back to the same topic with the same key but a null value. Kafka will compact away such messages. You can tune the compaction parameters to your needs, including the log segment file size: since the head segment is never compacted, you may want to set it to a smaller size if you want compaction to kick in sooner.
However, as I mentioned before, this would only work if messages have unique keys, otherwise you can't simply turn on compaction as that would cause loss of previous messages with the same key during periods when your consumer is down (or has fallen behind the head segment).
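A small plain-Java simulation of this acknowledge-by-null-value idea follows. It is not Kafka code: the log is a list of key/value pairs, and `survivors` models a fully compacted log in which the null-value records themselves have already been cleaned up (an assumption). All names and sample data are hypothetical. The second sample shows the key-reuse problem described above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AckByTombstoneSketch {

    static String[] record(String key, String value) {
        return new String[] {key, value};
    }

    /** Model of the log after compaction: last value per key, keys ending in a null value removed. */
    static Map<String, String> survivors(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] kv : log) {
            if (kv[1] == null) {
                latest.remove(kv[0]);   // consumer acknowledged this key
            } else {
                latest.put(kv[0], kv[1]);
            }
        }
        return latest;
    }

    /** Unique keys: m1 and m2 were processed and acknowledged, m3 is still pending. */
    static List<String[]> uniqueKeysLog() {
        return Arrays.asList(
                record("m1", "payload-1"), record("m2", "payload-2"), record("m3", "payload-3"),
                record("m1", null), record("m2", null));
    }

    /** Reused key while the consumer is down: nothing has been acknowledged yet. */
    static List<String[]> reusedKeyLog() {
        return Arrays.asList(record("order", "first"), record("order", "second"));
    }

    public static void main(String[] args) {
        System.out.println(survivors(uniqueKeysLog())); // {m3=payload-3}
        System.out.println(survivors(reusedKeyLog()));  // {order=second}
    }
}
```

With unique keys only the unprocessed message m3 remains. With a reused key the unprocessed value "first" is lost, which is why the approach depends on key uniqueness.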

Apache Kafka and durable subscription

I'm considering using Apache Kafka and I could not find any information about durable subscriptions. Let's say I have an expiration of 5 seconds for messages in my partition. Now if a consumer fails and reconnects after 5 seconds, the message it missed will be gone. Even worse, it won't know that it missed a message. The durable subscription pattern solves this by saving the message for the consumer that failed or was disconnected. Is a similar feature implemented in Kafka?
This is not supported by Kafka. But you can of course always increase your retention time, and thus limit the probability that a consumer misses messages.
Furthermore, if you set auto.offset.reset to none you will get an exception that informs you if a consumer misses any messages. Hence, it is possible to get informed if this happens.
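For reference, a minimal sketch of consumer properties with that setting. It only builds a java.util.Properties object, so it runs without the Kafka client on the classpath; the broker address and group id are placeholders, and the String deserializers are an assumption about the message format.

```java
import java.util.Properties;

final class StrictOffsetResetConfig {

    /** Builds consumer properties; bootstrapServers and groupId are caller-supplied placeholders. */
    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumption: keys and values are plain strings.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // "none": do not silently jump to earliest/latest when the group's offset
        // is missing or no longer in the log; the consumer raises an exception instead.
        props.setProperty("auto.offset.reset", "none");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(
                consumerProperties("localhost:9092", "example-group").getProperty("auto.offset.reset"));
    }
}
```

These properties would be passed to a KafkaConsumer. In the Java client the resulting exceptions are, as far as I know, NoOffsetForPartitionException and OffsetOutOfRangeException, so the polling loop is the place to catch them and decide how to recover.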
Last but not least, it might be possible to use a compacted topic. This would ensure that messages are not deleted until you explicitly write a so-called tombstone message. Note that records must have unique keys to use a compacted topic.

Can Kafka compaction overwrite messages with same partition key?

I am using the following code to write to Kafka:
import kafka.producer.KeyedMessage;
// Message key derived from the current wall-clock time in milliseconds
String partitionKey = "" + System.currentTimeMillis();
KeyedMessage<String, String> data = new KeyedMessage<String, String>(topic, partitionKey, payload);
We are using Kafka version 0.8.1.1.
Is it possible that when multiple threads are writing, some of them (with different payload) write with same partition key and because of that Kafka overwrites these messages (due to same partitionKey)?
The documentation that got us thinking in this direction is:
http://kafka.apache.org/documentation.html#compaction
I found some more material at https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction
Salient points:
Before 0.8 version, Kafka supported only a single retention mechanism: deleting old segments of log.
Log compaction provides an alternative such that it maintains the most recent entry for each unique key, rather than maintaining only recent log entries.
There is a per-topic option to choose either "delete" or "compact".
Compaction guarantees that each key is unique in the tail of the log. It works by recopying the log from beginning to end, removing keys which have a later occurrence in the log.
Any consumer that stays within the head of the log (~1GB) will see all messages.
So whether we have log compaction or not, it follows that Kafka deletes older records but the records in the head of the log are safe from that.
The missing-records problem will occur only when downstream clients are unable to empty the Kafka queues for a very long time (such that the per-topic size/time limit is hit).
I think this is expected behavior, since we cannot keep records forever. They have to be deleted at some point.
Sounds very possible. Compaction saves the last message for each key. If you have multiple messages sharing a key, only the last one will be saved after compaction. The normal use-case is database replication where only the latest state is interesting.
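If the topic is compacted and every message must survive, the key has to be unique per message. A millisecond timestamp alone is not, because two threads can send within the same millisecond. Below is a minimal sketch of two alternatives using only the JDK; the class and method names are made up, and the counter variant is only unique within a single JVM.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class UniqueKeySketch {

    private static final AtomicLong SEQUENCE = new AtomicLong();

    /** Timestamp plus a process-wide counter: distinct even within the same millisecond (single JVM only). */
    static String nextKey() {
        return System.currentTimeMillis() + "-" + SEQUENCE.incrementAndGet();
    }

    /** Random UUID: no coordination between producers needed; collisions are practically negligible. */
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    /** Generates keys from several threads and returns how many distinct keys were produced. */
    static int distinctKeys(int threads, int perThread) {
        Set<String> keys = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    keys.add(nextKey());
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return keys.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys(4, 1000)); // 4000: no two threads produced the same key
    }
}
```

Either key could replace the plain timestamp in the producer code from the question. Keep in mind that the key also decides which partition a message goes to.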