Kafka log.segment.bytes vs log.retention.hours - apache-kafka

I was following the book "Kafka: The Definitive Guide" (First Edition) to understand when log segments are deleted by the broker.
As I understood the text, a segment does not become eligible for deletion until it is closed, and a segment can be closed only once it has reached log.segment.bytes in size (assuming log.segment.ms is not set). Once a segment becomes eligible for deletion, the log.retention.ms policy applies to decide when to finally delete it.
However, this seems to contradict the behaviour I see in our production cluster (Kafka 2.5):
The log segment gets deleted as soon as log.retention.ms is satisfied, even when the segment size is less than log.segment.bytes.
[2020-12-24 15:51:17,808] INFO [Log partition=Topic-2, dir=/Folder/Kafka_data/kafka] Found deletable segments with base offsets [165828] due to retention time 604800000ms breach (kafka.log.Log)
[2020-12-24 15:51:17,808] INFO [Log partition=Topic-2, dir=/Folder/Kafka_data/kafka] Scheduling segments for deletion List(LogSegment(baseOffset=165828, size=895454171, lastModifiedTime=1608220234000, largestTime=1608220234478)) (kafka.log.Log)
The segment size is still less than 1 GB, yet the segment got deleted.
The book states that at the time of writing the Kafka version was 0.9.0.1. So was this behaviour changed in a later version of Kafka? (I could not find any specific mention of such a change in the Kafka docs.) Below is the snippet from the book.

Setting: log.retention.ms and log.retention.bytes
The most common configuration for how long a Kafka broker will retain messages (actually, "log segments") is by time (in milliseconds), specified using the log.retention.ms parameter (defaults to 1 week). If set to -1, no time limit is applied.
Another way to expire messages is based on the total number of bytes retained. This value is set using the log.retention.bytes parameter, and it is applied per partition. Its default value is -1, which allows for infinite retention. This means that if you have a topic with 8 partitions, and log.retention.bytes is set to 1 GB, the amount of data retained for the topic will be 8 GB at most. If you have specified both log.retention.bytes and log.retention.ms, messages may be removed when either criterion is met.
Setting: log.segment.bytes and log.segment.ms
As messages are produced to the Kafka broker, they are appended to the current log segment for the partition. Once the log segment has reached the size specified by the log.segment.bytes parameter (default 1 GB), the log segment is closed and a new one is opened. Only once a log segment has been closed can it be considered for expiration (by log.retention.ms or log.retention.bytes).
Another way to control when log segments are closed is by using the log.segment.ms parameter, which specifies the amount of time after which a log segment should be closed. Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first.
A smaller log-segment size means that files must be closed and allocated more often, which reduces the overall efficiency of disk writes. Adjusting the size of the log segment can be important if topics have a low produce rate. For example, if a topic receives only 100 megabytes per day of messages and log.segment.bytes is set to the default, it will take 10 days to fill one segment. As messages cannot be expired until the log segment is closed, if log.retention.ms is set to 1 week, there will actually be up to 17 days of messages retained until the closed segment expires. This is because once the log segment is closed with the current 10 days of messages, that log segment must be retained for 7 days before it expires based on the time policy.
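
To make the arithmetic in that example explicit, here it is as a quick sketch (all numbers come from the quoted text):

# produce rate: 100 MB/day; log.segment.bytes = 1 GB (the default)
# time to fill one segment: ~1024 MB / 100 MB per day ≈ 10 days
# retention after the segment is closed: log.retention.ms = 7 days
# worst-case age of the oldest retained message ≈ 10 + 7 = 17 days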

Hopefully this makes it clearer:
segment.ms => the maximum age of the segment file (from the date of creation)
retention.ms => the maximum age of any message in a segment (that is closed), beyond which the segment is eligible for deletion (if the delete policy is set)
So if a segment is the "active segment", it can be rolled over based on segment.ms (or segment.bytes), but NOT by retention.ms. Retention only comes into play on closed (not active) segments.
So the behavior quoted from the book is correct. However, you assume that the segment is still active, while the INFO logs show that it has been scheduled for deletion.
This cannot happen to an active segment (assuming no bug). The segment has to be closed (not active) before any of the retention.* properties can take effect.

What you observe is the expected behavior. In short, if you have an active segment that is not full yet and segment.ms has passed, it will be closed and turned into an "old" log segment even though it is not full. Note that even when log.segment.ms is not set explicitly, the broker-level defaults apply (log.roll.ms/log.roll.hours, the latter defaulting to 168 hours, i.e. 7 days), so segments are still rolled on time.
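
As a minimal sketch of how to verify this on a test topic (the topic name and bootstrap address are hypothetical; kafka-configs.sh with --bootstrap-server for topic configs assumes roughly Kafka 2.3+):

# roll segments after 7 days even if they never reach segment.bytes,
# and make rolled segments eligible for deletion after 7 days
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name test-topic --alter \
  --add-config segment.ms=604800000,retention.ms=604800000

With these (default-equivalent) values, an under-filled active segment is still closed after 7 days and then deleted by the retention policy, which matches the INFO logs above.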

Related

How do I delete old Kafka logs safely in server.properties

I use Kafka version 2.3 and I want to delete old Kafka logs. There are two folders:
log.dirs=/var/www/html/zookeeper_1/zookeeper_data_1
kafka_2.10-0.8.2.2/logs
What is the difference between the two folders, and how do I delete the old logs?
I would argue that the safest way to delete older logs is to properly configure your retention policy.
In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes, while the latter is triggered by log.retention.hours.
Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512 MB, you will always have at least 512 MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512 MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512 MB of data plus whatever is produced within the 5-minute window, before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 closed segments that count toward retention, and a 3rd, active segment that data is currently written to).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. I would also advise setting a time-based retention policy as well, configuring log.retention.hours accordingly. If you don't need your data after 2 days, set log.retention.hours=48.
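
Putting that advice together, a server.properties sketch (the sizes are illustrative assumptions, not recommendations):

# delete (rather than compact) closed segments
log.cleanup.policy=delete
# keep at most ~48h of data...
log.retention.hours=48
# ...and guarantee at least 512 MB per partition
log.retention.bytes=536870912
# roll segments at 256 MB so retention has units to delete
log.segment.bytes=268435456
# how often the retention checks run (5 minutes, the default)
log.retention.check.interval.ms=300000

Worst-case disk usage per partition is then roughly log.retention.bytes + log.segment.bytes, plus whatever is produced within one check interval.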
One folder is ZooKeeper data; the other is Kafka 0.8.2.2 data, which is not directly compatible with Kafka 2.3.
You could delete segments from the latter by hand, but doing so risks corrupting the topic, so you should let Kafka clean itself up instead.

Does retention.bytes define the maximum size of inactive segments?

I have a Kafka retention setting like this:
# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
So both log.retention.bytes and log.segment.bytes are set to 1 GB, and I created a topic with only one partition. After producing messages to the topic, I observed that there are always two log files: one that has already reached 1 GB, and the active one, which is receiving messages.
My question is: does log.retention.bytes define the maximum total size of the inactive segment files, not including the active one?
Thanks
Yes, that's roughly correct. I usually don't like to describe this setting as the "maximum size", as that's not completely right.
One way to see it is to consider log.retention.bytes the minimum amount of data that must be left after Kafka deletes segments, or the amount of data Kafka guarantees to keep at any time (provided the time retention limit is not reached first, obviously!).
The active segment is not eligible for deletion. So, as you noticed, when the first segment fills up, Kafka does not delete anything even though you have reached 1 GB. Instead it rolls a new segment (the new active one). Once this new segment also reaches 1 GB, you effectively have 2 GB of data on disk.
At that point a new segment is rolled again and you have 2 inactive segments. Only now can Kafka delete a segment and still satisfy log.retention.bytes, as there will be 1 GB of data on disk plus the active segment.
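
The timeline can be summarized with the asker's values (1 GB each; the comments are the illustration):

log.retention.bytes=1073741824
log.segment.bytes=1073741824
# segment 1 fills (1 GB) -> rolled; 1 inactive + 1 active, nothing deletable yet
# segment 2 fills (1 GB) -> rolled; 2 inactive + 1 active
# cleaner runs           -> deletes segment 1, since 1 GB (segment 2) still remains
# peak disk usage ≈ log.retention.bytes + log.segment.bytes (+ the growing active segment)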

Kafka topic record retention policies not clear

Reading the Kafka docs, I got interested in and tried the following 2 retention types together:
log.retention.bytes:
The maximum size of the log before deleting it.
Type: long. Default: -1. Importance: high. Update mode: cluster-wide.
log.retention.ms:
The number of milliseconds to keep a log file before deleting it. If not set, the value in log.retention.minutes is used. If set to -1, no time limit is applied.
Type: long. Default: null. Importance: high. Update mode: cluster-wide.
I set them as:
log.retention.bytes = 1Gb
log.retention.ms = 7 days
Problem Situation
My topic currently has all its messages in two different log files, both of which are < 1 GB.
Let's say log.1 has 400 MB of messages, with its oldest message > 7 days old,
and it sits on top of
log.2, which has 500 MB with its newest message > 7 days old.
I understand Kafka would clean up all records belonging to the log.2 file, in other words remove this log segment from the topic.
What happens to the records in the log.1 which are older than 7 days?
There are two properties that define message retention in Kafka: log.retention.bytes and log.retention.ms (at the per-topic, per-partition level). The strategy for data removal works on a FIFO basis, i.e., the messages that were pushed to a topic first are deleted first.
In your case the values are:
log.retention.bytes = 1 GB (per topic, per partition; the actual default is -1, i.e. unlimited)
log.retention.ms = 7 days (per topic; this is the default)
Whichever limit is breached first leads to a data purge in Kafka.
For example, let's assume that the messages in your topic take up 500 MB of space (less than log.retention.bytes) but are older than 7 days (i.e., greater than the default log.retention.ms). In this case the data older than 7 days is purged (on a FIFO basis).
Likewise, if, for a given topic, the space occupied by the messages exceeds log.retention.bytes but the messages are not older than log.retention.ms, the data is purged in the same way (on a FIFO basis).
The concept of making data expire is called cleanup, and the messages on a topic are not removed immediately after they are consumed or expired. What happens in the background is that, once either of the limits is breached, the messages are marked as deleted. There are 3 log cleanup policies in Kafka: delete (the default), compact, and compact,delete. Log compaction is performed by the Kafka log cleaner, a pool of background compaction threads.
To turn on compaction for a topic, use the topic config cleanup.policy=compact (the broker-level equivalent is log.cleanup.policy). To delay compacting records after they are written, use the topic config min.compaction.lag.ms (broker-level: log.cleaner.min.compaction.lag.ms). Records won't get compacted until after this period; the setting gives consumers time to read every record. This could be the reason that older messages are not getting deleted immediately, so check the value of this property for your topic.
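
For instance, a compacted topic with a 60-second compaction lag could be created like this (a sketch; the topic name and broker address are hypothetical, and kafka-topics.sh with --bootstrap-server assumes roughly Kafka 2.2+):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic compacted-topic --partitions 1 --replication-factor 1 \
  --config cleanup.policy=compact \
  --config min.compaction.lag.ms=60000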
Below links might be helpful:
https://medium.com/@sunny_81705/kafka-log-retention-and-cleanup-policies-c8d9cb7e09f8
http://cloudurable.com/blog/kafka-architecture-log-compaction/index.html
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/broker-configurations/
I'm paraphrasing the relevant section of the book Kafka: The Definitive Guide here. It'll most likely clear up your doubt.
log.retention.bytes: This denotes the total number of bytes of messages retained per partition. So, if we have a topic with 8 partitions and log.retention.bytes is set to 1 GB, the amount of data retained for the topic will be 8 GB at most. This means that if we ever choose to increase the number of partitions for a topic, the total amount of data retained will also increase.
log.retention.ms: The most common configuration for how long Kafka will retain messages is by time. The default is specified in the configuration file using the log.retention.hours parameter, and it is set to 168 hours, or one week. However, two other parameters are also allowed: log.retention.minutes and log.retention.ms. All three of these specify the same configuration (the amount of time after which messages may be deleted), but the recommended parameter to use is log.retention.ms, because if more than one is specified, the one with the smaller unit takes precedence; this makes sure the value set for log.retention.ms is always the one used.
Retention By Time and Last Modified Times : Retention by time is performed by examining the last modified time (mtime) on each log segment file on disk. Under normal cluster operations, this is the time that the log segment was closed, and represents the timestamp of the last message in the file. However, when using administrative tools to move partitions between brokers, this time is not accurate and will result in excess retention for these partitions.
Configuring Retention by Size and Time: If you have specified a value for both log.retention.bytes and log.retention.ms (or another parameter for retention by time), messages may be removed when either criterion is met. For example, if log.retention.ms is set to 86400000 (1 day) and log.retention.bytes is set to 1000000000 (1 GB), it is possible for messages that are less than 1 day old to be deleted if the total volume of messages over the course of the day exceeds 1 GB. Conversely, if the volume is less than 1 GB, messages can be deleted after 1 day even though the total size of the partition is below the byte limit.
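
The example from that last paragraph, expressed as broker settings (a sketch using the quoted numbers):

# deletion triggers on whichever limit a closed segment breaches first
log.retention.ms=86400000
log.retention.bytes=1000000000
# >1 GB produced per day -> the size limit deletes messages younger than 1 day
# <1 GB produced per day -> the time limit deletes messages after 1 day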

Kafka Log Compaction

When Kafka does log compaction, the log of a partition is split into a "dirty"/"head" part and a "tail" part. I know that compaction happens only on the tail part of the log. But does the dirty/head part include the active segment's records, along with the records of closed segments that are older than log.cleaner.min.compaction.lag.ms?
Docs says
"If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag. "
But since the head/dirty part does NOT get compacted anyway, is the active segment counted as part of the head/dirty part for compaction?
I got the answer to my question; my initial understanding was incorrect. The way it works is that part of the head/dirty portion of the log does get compacted, and the head does not include the active segment.
Jun Rao explains this at 40:00 in the video below:
https://vimeo.com/185844593/77f7d239a3

Kafka optimal retention and deletion policy

I am fairly new to Kafka, so forgive me if this question is trivial. I have a very simple setup for the purposes of timing tests, as follows:
Machine A -> writes to topic 1 (broker) -> Machine B reads from topic 1
Machine B -> writes the message just read to topic 2 (broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, which fills up the space on my small broker very quickly. I'm experimenting with different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but that seemed to degrade performance; then I set them to the maximum my broker could take before being completely full, but again performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered set of segments, where a segment is a collection of messages. Apache Kafka provides retention at the segment level instead of at the message level; hence, Kafka keeps removing segments from the tail end of the log as they violate the retention policies.
Apache Kafka provides us with the following retention policies -
Time-Based Retention
Under this policy, we configure the maximum time a segment (and hence its messages) can live for. Once a segment has spanned the configured retention time, it is marked for deletion or compaction, depending on the configured cleanup policy. The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:

# Configures retention time in milliseconds
log.retention.ms=1680000
# Used if log.retention.ms is not set
log.retention.minutes=1680
# Used if log.retention.minutes is not set
log.retention.hours=168
Size-Based Retention
Under this policy, we configure the maximum size of the log data structure for a topic partition. Once the log reaches this size, Kafka starts removing segments from its tail end. This policy is not popular, as it does not provide good visibility about message expiry; however, it can come in handy when we need to control the size of a log due to limited disk space.
Here is the parameter that you can set in your Kafka broker properties file:

# Configures maximum size of a log (per partition)
log.retention.bytes=104857600
So, for your use case, you should configure log.retention.bytes so that your disk does not get full.
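
As a closing sizing sketch (all numbers are hypothetical assumptions), suppose 24 partitions share a 100 GB disk and you reserve 20% headroom:

# per-partition budget: 80 GB / 24 partitions ≈ 3.3 GB; configure a bit below it
log.retention.bytes=3221225472
# remember: peak usage per partition ≈ log.retention.bytes + log.segment.bytes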