Apache Kafka: reduce Kafka disk usage

I have a question about Kafka's disk usage.
Kafka will fail when its disk becomes full.
So I want to keep disk usage below x% by discarding the oldest data stored on the Kafka disk (or discarding a copy of the data) once usage reaches x%. Do I need to modify the Kafka source code to do this?

You can configure retention.bytes for your topics.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
See https://kafka.apache.org/documentation/#topicconfigs
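For example, with the stock command-line tools (a sketch only; the topic name my-topic and the broker address localhost:9092 are assumptions), you could cap each partition of a topic at roughly 1 GiB:

# Set a per-partition size limit of 1 GiB on an existing topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.bytes=1073741824

# Confirm the override is in place
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --describe

Note that this limit is per partition, and deletion happens a whole segment at a time, so disk usage can briefly exceed the configured value. There is no built-in "keep disk usage below x%" setting, so you have to size retention.bytes against your disk yourself.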

Related

Are there any storage limits for a Kafka compacted topic?

When doing stateful processing in Kafka Streams we can hold large state. We can provision more disk space for the client as the data grows. But what about the changelog topic? The local state is backed up in this compacted topic. Are there any limitations on how much data we can store in this topic?
We have not encountered any issues yet, but I see that some cloud services do have limitations on the size of a compacted topic. Is this a Kafka limitation? And if yes, do these limitations also apply to non-compacted topics?
Infinite retention of any topic log segments can be achieved by setting
log.retention.bytes = -1
log.retention.hours = -1
These options have been available since version 0.9.0.0, so this is a mature feature in Kafka.
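The same thing can be done for a single topic (a sketch; the topic name and broker address are assumptions), since topic-level overrides take precedence over the broker defaults:

# Disable both the time-based and size-based limits for one topic only
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-app-changelog \
  --alter --add-config retention.ms=-1,retention.bytes=-1

Note that changelog topics backing key-value state stores are compacted, so their size is driven mainly by the number of distinct keys rather than by total throughput.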
However, many suggest that using Kafka as permanent storage is not what it was designed for, and as the amount of data stored in Kafka grows, users eventually hit a "retention cliff," at which point it becomes significantly more expensive to store, manage, and retrieve data. Infrastructure costs also rise, because the longer the retention period, the more hardware is required.
Having said that, people do use Kafka as persistent storage; for example, The New York Times uses Kafka as a source of truth, storing 160 years of journalism going back to the 1850s.
I would suggest using a small message size if you decide to use Kafka as a System of Record (SOR) and to hold the state of an entity.
Kafka is very explicit that its performance depends heavily on event/message size, which is why there is a size limit on messages.
Kafka has a default limit of 1 MB per message in a topic, because very large messages are considered inefficient and an anti-pattern in Apache Kafka.
See the Kafka documentation for more on handling larger messages.
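If you genuinely need larger messages, the limit can be raised per topic (a sketch; the topic name large-payloads and the 5 MB value are placeholders), but the producer's max.request.size and the consumer's max.partition.fetch.bytes then have to be raised to match, as does the broker's replica.fetch.max.bytes if the topic is replicated:

# Allow messages up to ~5 MB on this one topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name large-payloads \
  --alter --add-config max.message.bytes=5242880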
Each Kafka topic partition log is made up of segment files on disk; once the active segment reaches the configured segment size (log.segment.bytes, 1 GB by default), it is rolled and a new log file is created, so a partition typically contains multiple log files at any one time.

Span Kafka topic partition across directories

I've a Kafka topic with one partition. I'm trying to send messages to the broker. The source is 1.5 TB in size. My broker has two directories to store Kafka partitions:
/dev/sdc1 1.1T 567G 460G 56% /data_disk_0
/dev/sdd1 1.1T 1.1T 0 100% /data_disk_1
Each one is 1.1 TB in size. As my topic has only one partition, Kafka is storing all the messages on /dev/sdd1. Eventually that disk fills up completely because the source is larger than the target disk. Can I span my topic partition so that half the data is stored on disk0 and the other half on disk1, without changing the number of partitions?
Please advise. I couldn't find any configuration changes in Kafka that would achieve this.
This isn't possible at the Kafka configuration level. You'd have to use RAID or logical volume groups to pool the disks together as one volume.
The Kafka documentation mentions:
You can either RAID these drives together into a single volume or format and mount each drive as its own directory
If your data is so heavily skewed towards one disk, meaning towards certain partitions, you should check how your producers are partitioning the data, consider persisting such a large topic somewhere else, or turn on compaction or retention limits for these topics.
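If you prefer the "one directory per drive" option from that quote, the relevant broker setting is log.dirs (the paths below are assumptions matching the question). Kafka spreads partitions across the listed directories, but any single partition always lives entirely in one of them:

# server.properties: one log directory per physical disk
log.dirs=/data_disk_0/kafka-logs,/data_disk_1/kafka-logs

So for a single-partition 1.5 TB topic this does not help on its own; pooling the disks into one large volume with RAID or LVM, as described above, is the only way to let that one partition exceed the size of an individual disk.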

Kafka disk space gets full

I have a Kafka service with a 1000 GB disk and this running parameter:
log.retention.bytes=350000000000
However, disk usage reaches 90% (900 GB). Since that parameter is in effect, disk usage should not exceed 326 GB. Why could this happen?
Other properties:
log.index.interval.bytes=4000
log.segment.bytes=250000000
log.index.size.max.bytes=10485760
log.retention.ms=168
While the official documentation isn't very clear:
The maximum size of the log before deleting it
the Confluent documentation on topic configs (which should really be considered the official documentation anyway) has a better description (under retention.bytes):
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
In short, this config isn't even per topic; it's per partition. For example, with log.retention.bytes=350000000000, a broker hosting three partitions at that limit can legitimately hold over 1 TB of log data. I'm not aware of a Kafka config that acts as a broker-wide size limit.
If you're trying to balance data load across multiple brokers in a cluster, perhaps you should look at Cruise Control.
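To see where the space is actually going, per partition and per log directory (a sketch; the broker address is an assumption), the bundled kafka-log-dirs.sh tool prints the size of each partition in bytes as JSON:

# List the size of every partition in every log directory
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe

# Restrict the output to one topic
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe --topic-list my-topic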

Confusion in using log.retention.bytes parameter in logging Topic data in Apache Kafka

"log.retention.bytes" is the parameter we are using to retain the logs of topic messages and I had given value as 1073741824.
I had referred the Kafka documentation, where it says the size given in "log.retention.bytes" is per partition, so that means suppose if I have 20 partitions for all the topics I am using, then total size of bytes that Kafka will retain is 20*1073741824 according to the documentation.
What I need clarity on is:
Will Kafka retain 20*1073741824 bytes across all the topics?
(or)
Will Kafka retain 20*1073741824 bytes per topic?
log.retention.bytes is the parameter used to limit the retained log size for each topic partition. By default, the log size is unlimited.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
If you set log.retention.bytes = 1 GB, Kafka will trigger a clean-up activity when a partition's size reaches 1 GB. Remember that it is not a topic size; it is a per-partition size, so a topic with 20 partitions can retain up to 20 GB of log data in total.
Kafka gives you another option to configure the retention period, i.e. log.retention.ms. The default retention period is seven days. If you want to change the duration, you can specify your own value for the log.retention.ms configuration.
If you specify both configurations, clean-up will start when either of the criteria is met.
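As a concrete sketch (topic name, partition count, and values are illustrative only): with retention.bytes=1073741824 and 20 partitions, that single topic can retain up to roughly 20 GiB; every other topic gets its own per-partition budget on top of that.

# Create a topic with both size- and time-based retention overrides.
# Whichever limit is hit first triggers segment deletion.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders --partitions 20 --replication-factor 1 \
  --config retention.bytes=1073741824 \
  --config retention.ms=604800000

Keep in mind that replication multiplies the on-disk footprint again: with replication factor 3, those 20 GiB become up to 60 GiB of disk across the cluster.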

Kafka optimal retention and deletion policy

I am fairly new to Kafka, so forgive me if this question is trivial. I have a very simple setup for the purposes of timing tests, as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, which fills up the space on my small broker very quickly. I'm experimenting with different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but that seemed to degrade performance; then I set them to the maximum my broker could take before being completely full, but again performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered set of segments, where a segment is a collection of messages. Apache Kafka provides retention at the segment level rather than at the message level; hence, Kafka keeps removing whole segments from the old end of the log as they violate the retention policies.
Apache Kafka provides us with the following retention policies:
Time Based Retention
Under this policy, we configure the maximum time a segment (and hence the messages in it) can live for. Once a segment has exceeded the configured retention time, it is marked for deletion or compaction, depending on the configured cleanup policy. The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
# Configures retention time in milliseconds
log.retention.ms=1680000
# Used if log.retention.ms is not set
log.retention.minutes=1680
# Used if log.retention.minutes is not set
log.retention.hours=168
Size based Retention
Under this policy, we configure the maximum size of the log for a topic partition. Once the log reaches this size, Kafka starts removing segments from its old end. This policy is less popular because it does not provide good visibility into when messages will expire. However, it can come in handy when we need to control the size of a log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
# Configures the maximum size of a log (per topic partition)
log.retention.bytes=104857600
So, for your use case, you should configure log.retention.bytes so that your disk does not get full.
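As a rough starting point only (all values are assumptions chosen to illustrate the trade-off, not recommendations), a broker-side configuration that keeps deletions small and frequent rather than rare and large might look like this:

# server.properties sketch: size-capped retention with smallish segments,
# so each clean-up pass deletes a modest amount of data at a time
# Per-partition cap of ~1 GiB
log.retention.bytes=1073741824
# Also expire anything older than one hour
log.retention.ms=3600000
# Roll a new segment roughly every 128 MiB
log.segment.bytes=134217728
# How often the cleaner checks whether anything is eligible for deletion
log.retention.check.interval.ms=300000
# How long to wait before physically removing a deleted segment file
log.segment.delete.delay.ms=60000

Because Kafka deletes whole segments, a smaller log.segment.bytes makes the retention limits take effect more promptly and spreads the deletion work out, at the cost of more files and more frequent segment rolls.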