A common concern for PII on a compacted topic is making sure the topic gets compacted after some time even though no new message is written to it to trigger a segment close and compaction.
Using Kafka 2.6.
The topic I have needs to be compacted 1 hour after a PII-cleaning record is written, and as the topic is very low volume there might not be any more writes for a couple of days. As a result, both the old and the new key/value pairs stay in the log.
From reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Add+a+Maximum+Log+Compaction+Lag it's not clear whether a write needs to happen or whether some housekeeping would ensure the active segment is closed and then compacted.
I am configuring the topic with:
cleanup.policy=compact
min.cleanable.dirty.ratio=0.01
min.compaction.lag.ms=<1 hour>
max.compaction.lag.ms=<1 hour>
segment.ms=<1 hour>
segment.bytes=<1 MB>
What am I missing?
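For concreteness, here is a small sketch of the placeholders above written out in the units these settings take (milliseconds and bytes). The name topic_config is only illustrative, and reading "1 MB" as 1024 * 1024 bytes is an assumption.

```python
# Convert the human-readable placeholders into milliseconds and bytes.
HOUR_MS = 60 * 60 * 1000   # 1 hour in milliseconds
MB_BYTES = 1024 * 1024     # assumption: "1 MB" means 1024 * 1024 bytes

# Illustrative name; the keys are the topic-level settings from the question.
topic_config = {
    "cleanup.policy": "compact",
    "min.cleanable.dirty.ratio": "0.01",
    "min.compaction.lag.ms": str(HOUR_MS),
    "max.compaction.lag.ms": str(HOUR_MS),
    "segment.ms": str(HOUR_MS),
    "segment.bytes": str(MB_BYTES),
}

for key, value in topic_config.items():
    print(f"{key}={value}")
```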
Related
I have a Kafka changelog topic which has the cleanup policy COMPACT_DELETE.
It has a retention.ms of 86520000, which is approximately 1 day.
However, I've observed that partitions in this topic have data which is over a month old.
I would expect that, since DELETE is also part of the cleanup policy, there should be no messages older than 1 day in any of the partitions in this topic.
The major problem here is that this topic is constantly growing and never settles down, which is causing disk issues on the Kafka broker side.
I'd like to understand why retention.ms isn't kicking in for COMPACT_DELETE topics.
Following up on this question, I would like to know the semantics between consumer groups and offset expiry. In general, I'm curious to know how the Kafka protocol determines that a specific offset (for a consumer-group, topic, partition combination) has expired. Is it based on periodic commits from consumers that are part of the group protocol, or does the expiry timer only start ticking after all consumers are deemed dead/closed? I'm thinking this could have repercussions when dealing with topic-partitions to which data isn't produced frequently. In my case, we have a consumer group reading from a fairly idle topic (not much data produced). Since the consumer group doesn't periodically commit any offsets, can we ever be in danger of losing previously committed offsets? For example, when some unforeseen rebalance happens, could the topic-partitions get re-assigned with the committed offsets lost, causing the consumer to read data from the earliest (configured auto.offset.reset) point?
For user-topics, offset expiry / topic retention is completely decoupled from consumer-group offsets. Segments do not "reopen" when a consumer accesses them.
At a minimum, segment.bytes, retention.ms (or minutes/hours) and retention.bytes all determine when log segments get deleted.
For the internal __consumer_offsets topic, offsets.retention.minutes controls when committed offsets are removed (also in coordination with its segment.bytes).
Background threads on the broker (such as the LogCleaner) remove closed segments on a periodic basis, not the consumers. If a consumer is lagging considerably and requests offsets from a segment that has already been deleted, then auto.offset.reset gets applied.
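To make the last point concrete, here is a minimal sketch of how the starting position could be resolved. It is a simplified model, not the client's actual code; resolve_start_offset and its parameters are hypothetical names chosen for illustration.

```python
def resolve_start_offset(committed, log_start, log_end, auto_offset_reset="earliest"):
    """Return the offset a consumer would start fetching from (simplified model).

    committed: last committed offset for the group, or None if there is none
    log_start: first offset still present in the partition
    log_end:   offset that the next produced record will get
    """
    in_range = committed is not None and log_start <= committed <= log_end
    if in_range:
        return committed
    # The committed offset is missing or points at data that was already deleted.
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise ValueError("no valid offset and auto.offset.reset=none")


# Committed offset 50 pointed into a segment that has since been deleted.
print(resolve_start_offset(50, log_start=100, log_end=500))   # 100
# Committed offset is still valid, so it is used as-is.
print(resolve_start_offset(250, log_start=100, log_end=500))  # 250
```

Note that in this model an idle topic is harmless as long as the committed offset still exists and still falls inside the log.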
Let us say I have a partition (partition-0) with 4 segments that are committed and are eligible for compaction. So all these segments will not have any duplicate data since the compaction is done on all the 4 segments.
Now, there is an active segment which is not yet closed. Meanwhile, if the consumer starts reading data from partition-0, does it also read the messages from the active segment?
Note: My goal is to not provide duplicate data to the consumer for a particular key.
Your concerns are valid as the Consumer will also read the messages from the active segment. Log compaction does not guarantee that you have exactly one value for a particular key, but rather at least one.
Here is how Log Compaction is introduced in the documentation:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition.
However, you can try to get the compaction running more frequently so that your active and non-compacted segments are as small as possible. This comes at a cost, though, as running the log cleaner takes up resources.
There are a lot of topic-level configurations related to log compaction. Here are the most important ones; all details can be looked up in the Kafka topic configuration documentation:
delete.retention.ms
max.compaction.lag.ms
min.cleanable.dirty.ratio
min.compaction.lag.ms
segment.bytes
However, I am quite convinced that you will not be able to guarantee that your consumer is never getting any duplicates with a log compacted topic.
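A toy model of the situation, in case it helps. The records and the split into closed and active segments are made up for the example, and the de-duplication at the end is just one possible way to cope on the consumer side, not something the broker does for you.

```python
def compact(records):
    """Keep only the last record for each key, preserving order and offsets."""
    last_offset = {key: offset for offset, key, _ in records}
    return [r for r in records if last_offset[r[1]] == r[0]]


# (offset, key, value) tuples; hypothetical data for illustration.
closed_segments = [(0, "A", "v1"), (1, "B", "v1"), (2, "A", "v2"), (3, "B", "v2")]
active_segment = [(4, "A", "v3"), (5, "A", "v4")]

# In this model the cleaner only rewrites closed segments;
# the active segment is left untouched.
log_after_cleaning = compact(closed_segments) + active_segment

# A consumer reading from the beginning still sees several values for key A.
seen_for_a = [value for _, key, value in log_after_cleaning if key == "A"]
print(seen_for_a)  # ['v2', 'v3', 'v4']

# Consumer-side de-duplication: keep only the latest value per key.
latest = {}
for _, key, value in log_after_cleaning:
    latest[key] = value
print(latest)  # {'A': 'v4', 'B': 'v2'}
```

If the consumer only ever needs the current value per key, building such a map while reading gives it one value per key regardless of how far compaction has progressed.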
In all the Kafka tutorials I've read so far they all mentioned "Kafka partitions are immutable". However, I also read from this site https://towardsdatascience.com/log-compacted-topics-in-apache-kafka-b1aa1e4665a7 that from time to time, Kafka will remove older messages in the partition (depending on the retention time you set in the log-compact command). You can see from the screenshot below that data within the partition has clearly changed after removing the duplicate Keys in the partition:
So my question is what exactly does it mean to say "Kafka partitions are immutable"?
Kafka partitions are described as "immutable" because a producer can only append messages to a partition; it cannot change the value of an existing message (i.e. one with the same key). From a producer's point of view, the partition is a commit log that works in append-only mode.
Of course, it means that without any kind of mechanisms like deletion (by retention time) and compaction, the partition size could grow endlessly.
At this point you could think .. "so it's not immutable!" as you mentioned.
Well, as I said the immutability is from a producer's point of view. Deletion and compaction are administrative operations.
For example, deleting records is also possible using the Admin Client API ... but we are always talking about administrative stuff, not producer/consumer related stuff.
If you think about compaction and how it works, the producer initially sends, for example, a message with key = A and payload = "Hello". After a while in order to "update" the value, it sends a new message with same key = A and payload = "Hi" ... but actually it's a really new message appended at the end of the partition log; it will be the compaction thread in the broker doing the work of deleting the old message with "Hello" payload leaving just the new one.
In the same way, a producer can send a message with key = A and payload = null. This is how a message is actually deleted (the null payload is called a "tombstone"). The producer is still just appending a new message to the partition; it is always the compaction thread that deletes the previous message with key = A once it sees the tombstone.
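The "Hello" / "Hi" / tombstone sequence above can be sketched as a toy model. The names produce and compact are hypothetical, and drop_tombstones is a stand-in for the point at which the broker finally removes the tombstone itself (after delete.retention.ms).

```python
log = []  # append-only list of (offset, key, value)


def produce(key, value):
    """Producer side: the only possible operation is an append."""
    log.append((len(log), key, value))


def compact(records, drop_tombstones=False):
    """Broker side (model): keep the latest record per key.

    Surviving records keep their original offsets and contents.
    """
    last = {key: offset for offset, key, _ in records}
    kept = [r for r in records if last[r[1]] == r[0]]
    if drop_tombstones:
        kept = [r for r in kept if r[2] is not None]
    return kept


produce("A", "Hello")
produce("A", "Hi")
after_update = compact(log)
print(after_update)            # [(1, 'A', 'Hi')]

produce("A", None)             # tombstone for key A
after_tombstone = compact(log)
print(after_tombstone)         # [(2, 'A', None)]

after_delete_retention = compact(log, drop_tombstones=True)
print(after_delete_retention)  # []
```

The producer never touches an existing entry, and the record that survives compaction keeps its original offset (1 for "Hi").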
Individual messages are immutable.
Compaction or retention will drop messages. It doesn't alter messages or offsets.
Data in Kafka is stored in topics, topics are partitioned, each partition is further divided into segments and finally each segment has a log file to store the actual message, an index file to store the position of the messages in the log file and timeindex file, for example:
$ ls -l /mnt/data/kafka/*consumer*/00000000004618814867*
-rw-r--r-- 1 kafka kafka 10485760 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.index
-rw-r--r-- 1 kafka kafka 8189913 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.log
-rw-r--r-- 1 kafka kafka 10485756 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.timeindex
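The file names in the listing are the base offset of the segment (the offset of its first message), zero-padded to 20 digits. A small sketch of working with that naming scheme follows; the helper names and the example base offsets 0 and 1000 are made up for illustration.

```python
import bisect
import os


def base_offset(path):
    """Parse the base offset from a segment file name."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    return int(stem)


def segment_file_name(offset, suffix=".log"):
    """Build a segment file name: base offset zero-padded to 20 digits."""
    return f"{offset:020d}{suffix}"


def find_segment(sorted_base_offsets, target):
    """Return the base offset of the segment that would contain target."""
    i = bisect.bisect_right(sorted_base_offsets, target) - 1
    if i < 0:
        raise ValueError("offset is before the first segment")
    return sorted_base_offsets[i]


path = "/mnt/data/kafka/__consumer_offsets-7/00000000004618814867.log"
print(base_offset(path))              # 4618814867
print(segment_file_name(4618814867))  # 00000000004618814867.log

# Hypothetical base offsets of three segments in one partition.
print(find_segment([0, 1000, 4618814867], 1500))  # 1000
```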
When log.cleanup.policy (or cleanup.policy on a particular topic) is set to delete, whole log segments (one or more) are deleted.
When it is set to compact, compaction is done in the background by periodically recopying log segments: the cleaner recopies the log from beginning to end, removing keys which have a later occurrence in the log. New, clean segments are swapped into the log immediately, so the additional disk space required is just one additional log segment (not a full copy of the log). In other words, the old segment is replaced by a new compacted segment.
See more about distributed logs:
https://kafka.apache.org/documentation.html#compaction
https://medium.com/#durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
https://bookkeeper.apache.org/distributedlog/docs/0.5.0/user_guide/architecture/main
https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-1-storage-mechanics/
Immutability is a property of the records stored within the partitions themselves. When a source (documentation or articles) mentions immutability in the context of topics or partitions, it is usually referring to one of two things, both of which are correct in a limited context:
Records are immutable. Once a record is written, its contents can never be altered. A record can be deleted by the broker when either (a) the contents of the partition are pruned due to the retention limit, (b) a new record is added for the same key that supersedes the original record and compaction takes place, or (c) a record is added for the same key with a null value, which acts as a tombstone record, deleting the original without adding a replacement.
Partitions are append-only from a client's perspective, in that a client is not permitted to modify records or directly remove records from a partition, only append to the partition. This is somewhat debatable, because a client can induce the deletion of a record through the compaction feature, although this operation is asynchronous and the client cannot specify precisely which record should be deleted.
I have a question about Kafka topic cleanup policies and their interaction with log.retention....
For example, if I set cleanup.policy to compact, will compaction only start after the topic's retention time has passed, or does the retention time have no effect on compaction?
Second part of the question: if I use compact,delete together and have log.retention set to, let's say, 1 day, is the topic compacted all the time while its content is deleted after one day? Or do both compaction and deletion happen only after one day?
Thx for answers...
Log segments can be deleted or compacted, or both, to manage their size. The topic-level configuration cleanup.policy determines the way the log segments for the topic are managed.
Log cleanup by compaction
If the topic-level configuration cleanup.policy is set to compact, the log for the topic is compacted periodically in the background by the log cleaner.
In a compacted topic, the log only needs to contain the most recent message for each key, while earlier messages can be discarded.
There is no need to set log.retention to -1 or any other value. Your topics will be compacted, and old messages are removed only according to the compaction rules, never because of their age.
Note that only inactive (closed) segments can be compacted; the active segment is never compacted.
Log cleanup by using both
You can specify both delete and compact values for the cleanup.policy configuration at the same time. In this case, the log is compacted, but the cleanup process also follows the retention time or size limit settings.
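A rough model of the combined behaviour follows, assuming time-based retention is evaluated per segment against the newest timestamp in that segment. The function names and sample records are hypothetical, and the real cleaner may also merge small segments, which this sketch ignores.

```python
DAY_MS = 24 * 60 * 60 * 1000


def compact(segments):
    """Keep only the latest record per key across the closed segments."""
    last = {}
    for seg in segments:
        for offset, key, _ts in seg:
            last[key] = offset
    cleaned = [[r for r in seg if last[r[1]] == r[0]] for seg in segments]
    return [seg for seg in cleaned if seg]  # drop segments that became empty


def apply_retention(segments, now_ms, retention_ms):
    """Drop whole segments whose newest record is older than retention_ms."""
    return [seg for seg in segments
            if now_ms - max(ts for _, _, ts in seg) <= retention_ms]


# Closed segments of (offset, key, timestamp_ms) records; made-up data.
segments = [
    [(0, "A", 1 * DAY_MS), (1, "B", 1 * DAY_MS)],
    [(2, "A", 8 * DAY_MS), (3, "C", 9 * DAY_MS + DAY_MS // 2)],
]
now_ms = 10 * DAY_MS

compacted = compact(segments)
remaining = apply_retention(compacted, now_ms, retention_ms=DAY_MS)
print(remaining)
```

In this run the first segment is dropped entirely, while the second one survives because its newest record is only half a day old. That also keeps the record for key A, which is two days old, so with segment-level retention a record can outlive retention.ms.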
I would suggest going through the following links:
https://ibm.github.io/event-streams/installing/capacity-planning/
https://kafka.apache.org/documentation/#compaction
https://cwiki.apache.org/confluence/display/KAFKA/KIP-71%3A+Enable+log+compaction+and+deletion+to+co-exist