Confusion about using the log.retention.bytes parameter to retain topic data in Apache Kafka

"log.retention.bytes" is the parameter we are using to retain the logs of topic messages and I had given value as 1073741824.
I had referred the Kafka documentation, where it says the size given in "log.retention.bytes" is per partition, so that means suppose if I have 20 partitions for all the topics I am using, then total size of bytes that Kafka will retain is 20*1073741824 according to the documentation.
But what clarity I need is
Will Kafka retain 20*1073741824 bytes for all the topics?
(or)
Will Kafka retain 20*1073741824 bytes per topic?

log.retention.bytes is the parameter used to limit the log size for each topic partition. By default, log size is unlimited.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
If you set log.retention.bytes = 1 GB, Kafka will trigger a clean-up when the partition size reaches 1 GB. Remember that this is not the topic size; it is the partition size.
Kafka gives you another option to configure the retention period, i.e. log.retention.ms. The default retention period is seven days. If you want to change the duration, you can specify your own value for the log.retention.ms configuration.
If you specify both configurations, the clean-up starts when either criterion is met.
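Since the limit is enforced per partition, the broker-wide log.retention.bytes applies to every partition of every topic: each topic can retain up to (its own partition count) * 1073741824 bytes, rather than all topics sharing a single 20*1073741824 budget. The broker-wide log.retention.bytes / log.retention.ms defaults can also be overridden per topic via the topic configs retention.bytes and retention.ms. As a minimal sketch using the Java AdminClient, assuming a topic named my-topic and a broker at localhost:9092 (both placeholders):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetTopicRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Per-topic overrides of the broker-wide log.retention.* defaults.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                List<AlterConfigOp> ops = List.of(
                    // 1 GiB per partition; a 20-partition topic can hold ~20 GiB in total.
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                                      AlterConfigOp.OpType.SET),
                    // 7 days in milliseconds; whichever limit is hit first triggers deletion.
                    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                                      AlterConfigOp.OpType.SET)
                );
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }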

Related

How to define the topic retention bytes and segment bytes so that the size of a topic partition does not exceed a specific size

We have a Kafka cluster with 3 nodes.
Kafka contains 5 topics and each topic has 100 partitions.
Now we want to set the retention bytes and the segment bytes in such a way that each topic partition will not exceed 5 GB (because we are limited by the Kafka disk size).
Is it possible to tune the values of retention bytes and segment bytes so that no topic partition ever grows beyond 5 GB?
There is no way to cap the size of a topic. It's even possible that retention will go above retention.bytes if you push data into the topic faster than the LogCleaner thread has time to clean it up.
Also note that upcoming versions of Kafka will offer infinite retention.
Or you could similarly use the tiered storage features of Apache Pulsar instead of Kafka.
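As a rough sizing sketch (illustrative numbers only): because retention only removes closed log segments, a partition can temporarily exceed retention.bytes by roughly one segment plus whatever arrives before the next retention check, so the retention target should sit below the 5 GB hard limit:

    public class PartitionSizing {
        public static void main(String[] args) {
            long hardCapBytes  = 5L * 1024 * 1024 * 1024;  // 5 GiB hard limit per partition (disk constraint)
            long segmentBytes  = 512L * 1024 * 1024;       // topic config segment.bytes: roll segments at 512 MiB
            long burstHeadroom = 512L * 1024 * 1024;       // slack for data arriving before the next retention check

            // Retention deletes only closed segments, so target a size below the hard cap.
            long retentionBytes = hardCapBytes - segmentBytes - burstHeadroom;  // 4 GiB here

            System.out.println("segment.bytes="   + segmentBytes);
            System.out.println("retention.bytes=" + retentionBytes);
            // Apply both as topic-level configs (e.g. with the AdminClient as in the sketch above);
            // this is still a soft target, not a guaranteed cap.
        }
    }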

Kafka topic record retention policies not clear

From the Kafka docs I got interested and tried the following two retention settings together:
log.retention.bytes:
The maximum size of the log before deleting it
Type: long, Default: -1, Importance: high, Update Mode: cluster-wide
log.retention.ms
The number of milliseconds to keep a log file before deleting it. If not set, the value in log.retention.minutes is used. If set to -1, no time limit is applied.
Type: long, Default: null, Importance: high, Update Mode: cluster-wide
set as:
log.retention.bytes = 1 GB
log.retention.ms = 7 days
Problem Situation
Currently my topic has messages spread across two different log files, both of which are < 1 GB.
Let's say the log.1 file has 400 MB of messages, with its oldest message > 7 days old,
which sits on top of
the log.2 file, which has 500 MB of messages whose newest message is > 7 days old.
I understand Kafka would clean up all records belonging to the log.2 file, in other words remove this log from the topic.
What happens to the records in log.1 that are older than 7 days?
There are two properties that define message retention in Kafka - log.retention.bytes and log.retention.ms (applied per partition of each topic). The strategy for data removal works on a FIFO basis, i.e., the message that was pushed to a topic first is deleted first.
You have set these values as:
log.retention.bytes = 1 GB (applied per partition)
log.retention.ms = 7 days (the default time-based retention)
It means that whichever limit is breached first leads to a data purge in Kafka.
For example, let's assume that the messages in your topic take up 500 MB of space (which is less than log.retention.bytes) but are older than 7 days (i.e. greater than the default log.retention.ms). In this case the data older than 7 days would be purged (on a FIFO basis).
Likewise, if, for a given topic, the space occupied by the messages exceeds log.retention.bytes but the messages are not older than log.retention.ms, the data would still be purged (on a FIFO basis).
The concept of making data expire is called cleanup, and messages on a topic are not immediately removed after they are consumed/expired. What happens in the background is that, once either limit is breached, the segments are marked for deletion. There are three log cleanup policies in Kafka: delete (default), compact, and delete,compact (both). Log compaction is performed by the Kafka Log Cleaner, a pool of background compaction threads.
To turn on compaction for a topic, use the topic config cleanup.policy=compact. To delay compaction of records after they are written, use the topic config min.compaction.lag.ms (the broker-level equivalent is log.cleaner.min.compaction.lag.ms). Records won't get compacted until after this period, which gives consumers time to read every record. This could be the reason that older messages are not being deleted immediately; you can check the value of this property for the compaction delay.
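For illustration, a hedged sketch of switching a topic to compaction with a one-hour compaction lag via the Java AdminClient (topic name, broker address, and values are placeholders):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class EnableCompaction {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
                List<AlterConfigOp> ops = List.of(
                    // Compact instead of delete; "compact,delete" applies both policies.
                    new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"),
                                      AlterConfigOp.OpType.SET),
                    // Records newer than this lag (1 hour) are not yet eligible for compaction.
                    new AlterConfigOp(new ConfigEntry("min.compaction.lag.ms", "3600000"),
                                      AlterConfigOp.OpType.SET)
                );
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }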
Below links might be helpful:
https://medium.com/#sunny_81705/kafka-log-retention-and-cleanup-policies-c8d9cb7e09f8
http://cloudurable.com/blog/kafka-architecture-log-compaction/index.html
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/broker-configurations/
I'm paraphrasing here from the relevant section of the book Kafka: The Definitive Guide. It will most likely clear up your doubt.
log.retention.bytes : This denotes the total number of bytes of messages retained per partition. So, if we have a topic with 8 partitions, and log.retention.bytes is set to 1GB, then the amount of data retained for the topic will be 8GB at most. This means if we ever choose to increase the number of partitions for a topic, total amount of data retained will also increase.
log.retention.ms : The most common configuration for how long Kafka will retain messages is by time. The default is specified in the configuration file using the log.retention.hours parameter, and it is set to 168 hours, or one week. However, there are two other parameters allowed, log.retention.minutes and log.retention.ms. All three of these specify the same configuration—the amount of time after which messages may be deleted—but the recommended parameter to use is log.retention.ms, as the smaller unit size will take precedence if more than one is specified. This makes sure that the value set for log.retention.ms is always the one used.
Retention By Time and Last Modified Times : Retention by time is performed by examining the last modified time (mtime) on each log segment file on disk. Under normal cluster operations, this is the time that the log segment was closed, and represents the timestamp of the last message in the file. However, when using administrative tools to move partitions between brokers, this time is not accurate and will result in excess retention for these partitions.
Configuring Retention by Size and Time : If you have specified a value for both log.retention.bytes and log.retention.ms (or another parameter for retention by time), messages may be removed when either criterion is met. For example, if log.retention.ms is set to 86400000 (1 day) and log.retention.bytes is set to 1000000000 (1 GB), it is possible for messages that are less than 1 day old to be deleted if the total volume of messages over the course of the day is greater than 1 GB. Conversely, if the volume is less than 1 GB, messages can be deleted after 1 day even if the total size of the partition is less than 1 GB.

Kafka config replica.fetch.max.bytes on a per-topic level

I would like to set up a Kafka cluster to only allow large messages on a particular topic. From the docs I see that if I wanted to do this at the level of the entire cluster, I could set message.max.bytes to allow a larger amount of data on the broker and replica.fetch.max.bytes to allow it to be replicated, but my understanding is that this would increase memory usage for all topics in my cluster, not just the one that I know can receive large messages. There is also a topic-level setting max.message.bytes that controls the maximum size of messages, but I don't see a topic-level setting controlling the maximum data size of replication operations. It seems strange that one of these closely tied settings is not configurable at the topic level; perhaps I'm missing where such a setting is, or there is another way to accomplish these goals?
replica.fetch.max.bytes can only be set on the broker level. However, you can set max.partition.fetch.bytes on the consumer side:
The maximum amount of data per-partition the server will return.
Records are fetched in batches by the consumer. If the first record
batch in the first non-empty partition of the fetch is larger than
this limit, the batch will still be returned to ensure that the
consumer can make progress. The maximum record batch size accepted by
the broker is defined via message.max.bytes (broker config) or
max.message.bytes (topic config). See fetch.max.bytes for limiting the
consumer request size.
Note that this is a per-partition configuration, meaning that if you set it to a large number, it will consume a lot of memory in case you have a lot of partitions too.
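On the consumer side, that could look like the following sketch (topic name, group id, broker address, and the 10 MB figure are all placeholders); the per-partition fetch size is raised so it comfortably covers the topic's max.message.bytes:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class LargeMessageConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "large-message-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            // Allow up to 10 MB per partition per fetch; keep this >= the topic's max.message.bytes.
            props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10 * 1024 * 1024);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("large-message-topic"));
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> rec : records) {
                    System.out.printf("offset=%d size=%d bytes%n", rec.offset(), rec.value().length);
                }
            }
        }
    }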

Increase the number of messages read by a Kafka consumer in a single poll

Kafka consumer has a configuration max.poll.records which controls the maximum number of records returned in a single call to poll() and its default value is 500. I have set it to a very high number so that I can get all the messages in a single poll.
However, the poll returns only a few thousand messages (roughly 6000) in a single call even though the topic has many more. How can I further increase the number of messages read by a single consumer?
You can increase the consumer poll() batch size by increasing max.partition.fetch.bytes, but per the documentation it is still limited by fetch.max.bytes, which also needs to be increased to accommodate the required batch size. The documentation also mentions another pair of properties, message.max.bytes (broker config) and max.message.bytes (topic config), that restrict the batch size. So one way is to increase all of these properties based on your required batch size; see the sketch after the config descriptions below.
In the consumer config, max.partition.fetch.bytes has a default value of 1048576:
The maximum amount of data per-partition the server will return. Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). See fetch.max.bytes for limiting the consumer request size
In the consumer config, fetch.max.bytes has a default value of 52428800:
The maximum amount of data the server should return for a fetch request. Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress. As such, this is not an absolute maximum. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). Note that the consumer performs multiple fetches in parallel.
In the broker config, message.max.bytes has a default value of 1000012:
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
This can be set per topic with the topic level max.message.bytes config.
In the topic config, max.message.bytes has a default value of 1000012:
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
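Putting these together, a sketch of a consumer configuration tuned for large polls (all values are illustrative, not recommendations):

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class BigBatchConsumerConfig {
        public static Properties bigBatchProps() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "big-batch-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            // Upper bound on the *number* of records a single poll() may return (default 500).
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50_000);
            // Per-partition byte limit for one fetch (default 1048576).
            props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 16 * 1024 * 1024);
            // Total byte limit for one fetch request across all partitions (default 52428800).
            props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 64 * 1024 * 1024);
            // Note: the broker still caps record-batch size via message.max.bytes (broker)
            // and max.message.bytes (topic), independently of these consumer settings.
            return props;
        }
    }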
Most probably your payload is limited by max.partition.fetch.bytes, which is 1MB by default. Refer to Kafka Consumer configuration.
Here's a good, detailed explanation:
MAX.PARTITION.FETCH.BYTES
This property controls the maximum number of bytes the server will return per partition. The default is 1 MB, which means that when KafkaConsumer.poll() returns ConsumerRecords, the record object will use at most max.partition.fetch.bytes per partition assigned to the consumer. So if a topic has 20 partitions, and you have 5 consumers, each consumer will need to have 4 MB of memory available for ConsumerRecords. In practice, you will want to allocate more memory, as each consumer will need to handle more partitions if other consumers in the group fail.
max.partition.fetch.bytes must be larger than the largest message a broker will accept (determined by the message.max.bytes property in the broker configuration), or the broker may have messages that the consumer will be unable to consume, in which case the consumer will hang trying to read them.
Another important consideration when setting max.partition.fetch.bytes is the amount of time it takes the consumer to process data. As you recall, the consumer must call poll() frequently enough to avoid session timeout and subsequent rebalance. If the amount of data a single poll() returns is very large, it may take the consumer longer to process, which means it will not get to the next iteration of the poll loop in time to avoid a session timeout. If this occurs, the two options are either to lower max.partition.fetch.bytes or to increase the session timeout.
Hope it helps!

Apache Kafka: reduce kafka disk usage

I have a question about Kafka's disk.
Kafka will fail when its disk becomes full.
So I want to reduce the disk usage to less than x% by discarding the old data stored on the Kafka disk (or discarding a copy of the data) when the Kafka disk usage reaches x%. Do I need to modify the Kafka source code to do this?
You can configure retention.bytes for your topics.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
See https://kafka.apache.org/documentation/#topicconfigs
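You should not need to modify the Kafka source: size retention.bytes from your disk budget and apply it per topic. A rough sketch, under the assumption that partition replicas are spread evenly across brokers (all numbers, the topic name, and the broker address are placeholders):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class CapDiskUsage {
        public static void main(String[] args) throws Exception {
            long diskBytesPerBroker = 1_000L * 1024 * 1024 * 1024; // ~1 TB data disk per broker
            double targetUsage = 0.80;                             // aim to stay below 80 % usage
            int partitionReplicasPerBroker = 200;                  // (partitions * replication factor) / brokers

            long retentionBytes = (long) (diskBytesPerBroker * targetUsage) / partitionReplicasPerBroker;

            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                admin.incrementalAlterConfigs(Map.of(topic, List.of(
                        new AlterConfigOp(new ConfigEntry("retention.bytes", Long.toString(retentionBytes)),
                                          AlterConfigOp.OpType.SET)
                ))).all().get();
            }
            // retention.bytes is a soft limit: only closed segments are deleted,
            // so leave headroom below the physical disk capacity.
        }
    }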