Kafka disk space gets full - apache-kafka

I have a Kafka service with 1000GB disk and this running parameter:
log.retention.bytes=350000000000
However, disk usage reaches 90% (900GB). With that parameter set, disk usage should not exceed 326GB. Why could this happen?
Other properties:
log.index.interval.bytes=4000
log.segment.bytes=250000000
log.index.size.max.bytes=10485760
log.retention.ms=168

While the official documentation isn't very clear:
The maximum size of the log before deleting it
the Confluent documentation on topic configs (which should really be considered the official documentation anyway) has a better one (under retention.bytes):
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
In short, this config isn't even per topic. It's per partition. I'm not aware of a Kafka config that acts as a broker-wide size limit.
If you're trying to balance data load across multiple brokers in a cluster, perhaps you should look at Cruise Control.
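A rough sketch of the multiplication the quoted documentation describes, which shows how the disk can fill up even though the limit is respected. The number of partition replicas hosted on the broker is a made-up value, since the question does not give it, and the function name is only illustrative.

```python
def max_retained_bytes(retention_bytes: int, partition_replicas_on_broker: int) -> int:
    """Rough upper bound on data kept by size-based retention on one broker.

    The limit applies to every partition separately, so the bound scales
    with the number of partition replicas the broker hosts.
    """
    return retention_bytes * partition_replicas_on_broker


retention_bytes = 350_000_000_000  # log.retention.bytes from the question
disk_bytes = 1000 * 10**9          # 1000GB disk from the question

# Hypothetical: the broker hosts 3 partition replicas in total.
bound = max_retained_bytes(retention_bytes, 3)
print(bound, bound > disk_bytes)
```

With only three partition replicas on the broker, the per-partition limit already allows more data than the disk can hold.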

Related

Is there any storage limits for a Kafka compacted topic?

When doing stateful processing in Kafka Streams we can hold large state. We can provision more disk space for the client as the data grows. But what about the changelog topic? The local state is backed up in this compacted topic. Are there any limitations on how much data we can store in this topic?
We have not encountered any issues yet. But I see that some cloud services do have limitations on the size of a compacted topic. Is this a Kafka limitation? And if yes, do these limitations also apply to non-compacted topics?
Infinite retention of any topic log segments can be achieved by setting
log.retention.bytes = -1
log.retention.hours = -1
This option has been available since version 0.9.0.0, which indicates that it is a mature feature of Kafka.
However, many suggest that using Kafka as permanent storage is not what it was designed to do and, as the amount of data stored in Kafka increases, users eventually hit a “retention cliff,” at which point it becomes significantly more expensive to store, manage, and retrieve data. Infrastructure costs also increase, because the longer the retention period, the more hardware is required.
Having said that, it seems that people do use Kafka for persistent storage. For example, The New York Times uses Kafka as a source of truth, storing 160 years of journalism going back to the 1850s.
I would suggest using a small message size if you decide to use Kafka as a System of Record (SOR) and to hold the state of an entity.
Kafka's performance depends heavily on the event/message size, so there is a size limit on messages.
Kafka has a default limit of 1MB per message in the topic. This is because very large messages are considered inefficient and an anti-pattern in Apache Kafka.
More on handling larger messages here.
Each Kafka topic partition log is stored on disk as segment files. A segment grows until it reaches log.segment.bytes (1GB by default in Apache Kafka; hosted services may configure smaller values such as 100MB), at which point a new log file is created. It's possible to have multiple log files in a partition at any one time.

How do I delete old Kafka logs safely in server.properties

I use Kafka version 2.3 and I want to delete old Kafka logs.
There are two folders:
log.dirs=/var/www/html/zookeeper_1/zookeeper_data_1
kafka_2.10-0.8.2.2/logs
What is the difference between the two folders, and how do I delete the old logs?
I would argue that the safest way to delete older logs is to properly configure your retention policy.
In Kafka, there are two types of log retention: size-based and time-based. The former is controlled by log.retention.bytes and the latter by log.retention.hours.
Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, take the following factors into consideration:
log.retention.bytes is a minimum guarantee for a single partition of a topic. If you set log.retention.bytes to 512MB, then once a partition has accumulated that much data you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size is determined by the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will have up to 3 segments on disk (2 segments that reach the retention size, plus a 3rd one, the active segment, to which data is currently being written).
Finally, you should do the math and compute the maximum size that might be occupied by Kafka logs at any given time on your disk and tune the aforementioned parameters accordingly. I would also advise setting a time retention policy and configuring log.retention.hours accordingly. If you no longer need your data after 2 days, then set log.retention.hours=48.
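The math mentioned above could be sketched roughly as follows. This is only an estimate under stated assumptions: it ignores index files and anything else on the disk, and the ingest rate and the number of partition replicas are hypothetical values chosen for illustration.

```python
MB = 1024**2


def peak_partition_bytes(retention_bytes: int, segment_bytes: int,
                         ingest_bytes_per_sec: int, check_interval_ms: int) -> int:
    """Rough peak size of one partition under size-based retention."""
    # After a retention check, up to retention_bytes plus roughly one
    # more segment can still be on disk.
    retained = retention_bytes + segment_bytes
    # Data written before the next retention check runs.
    grown = ingest_bytes_per_sec * check_interval_ms // 1000
    return retained + grown


# 1GB retention and 512MB segments as in the example above;
# 1 MB/s ingest is hypothetical, 300000 ms is the 5-minute check interval.
per_partition = peak_partition_bytes(1024 * MB, 512 * MB, 1 * MB, 300_000)

# Hypothetical: 12 partition replicas hosted on this broker.
total = per_partition * 12
print(per_partition // MB, total // MB)
```

Comparing the total against the disk size, with some headroom, gives a starting point for tuning the three parameters.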
One is ZooKeeper data; the other is Kafka 0.8.2.2 data, which is not directly compatible with Kafka 2.3.
You would delete segments from the latter. However, doing so by hand can corrupt the topic, so you should let Kafka clean itself up.

Apache Kafka: reduce kafka disk usage

I have a question about Kafka's disk.
Kafka will fail when its disk becomes full.
So I want to reduce the disk usage to less than x% by discarding the old data stored on the Kafka disk (or discarding a copy of the data) when the Kafka disk usage reaches x%. Do I need to modify the Kafka source code to do this?
You can configure retention.bytes for your topics.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
See https://kafka.apache.org/documentation/#topicconfigs
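Note that retention.bytes is a static per-partition limit and does not react to disk usage, so the x% target has to be translated into a value up front. A minimal sketch of that sizing, assuming all partitions on the broker share the same limit; the function name, the partition replica count and the 80% target are hypothetical, and index files and other disk consumers are ignored.

```python
def retention_bytes_for_budget(disk_bytes: int, max_usage_percent: int,
                               partition_replicas: int, segment_bytes: int) -> int:
    """Per-partition retention.bytes that keeps usage near the target."""
    budget = disk_bytes * max_usage_percent // 100
    per_partition = budget // partition_replicas
    # Leave room for roughly one extra segment per partition, because
    # retention removes whole segments only.
    return max(per_partition - segment_bytes, 0)


# Hypothetical broker: 1000GB disk, 80% target, 20 partition replicas,
# 1GiB segments.
value = retention_bytes_for_budget(1000 * 10**9, 80, 20, 1_073_741_824)
print(value)
```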

Kafka optimal retention and deletion policy

I am fairly new to Kafka so forgive me if this question is trivial. I have a very simple setup for purposes of timing tests as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but it seemed this degraded performance, then I set them to the maximum my broker could take before being completely full, but again the performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a Log data structure to manage its messages. A Log is basically an ordered set of Segments, where a Segment is a collection of messages. Apache Kafka applies retention at the Segment level rather than at the Message level. Hence, Kafka removes Segments from the old end of the Log once they violate the retention policy.
Apache Kafka provides us with the following retention policies -
Time Based Retention
Under this policy, we configure the maximum time a Segment (and hence its messages) can live for. Once a Segment has exceeded the configured retention time, it is marked for deletion or compaction depending on the configured cleanup policy. The default retention time for Segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
Configures retention time in milliseconds (604800000 ms = 7 days)
log.retention.ms=604800000
Used if log.retention.ms is not set
log.retention.minutes=10080
Used if log.retention.minutes is not set
log.retention.hours=168
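The priority order can be illustrated with a small sketch. The function below is a hypothetical helper that mimics the fallback described above; it is not part of Kafka, and it does not handle special values such as -1.

```python
def effective_retention_ms(props: dict) -> int:
    """Resolve the time-based retention from broker properties."""
    # Highest priority: milliseconds.
    if "log.retention.ms" in props:
        return int(props["log.retention.ms"])
    # Next: minutes, used only when ms is not set.
    if "log.retention.minutes" in props:
        return int(props["log.retention.minutes"]) * 60 * 1000
    # Lowest priority: hours, with the 7-day default (168 hours).
    return int(props.get("log.retention.hours", 168)) * 60 * 60 * 1000


print(effective_retention_ms({"log.retention.hours": "48"}))
print(effective_retention_ms({"log.retention.ms": "60000",
                              "log.retention.hours": "48"}))
```

The second call shows that the milliseconds setting wins when both are present.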
Size based Retention
Under this policy, we configure the maximum size of a Log data structure for a Topic partition. Once the Log reaches this size, Kafka starts removing Segments from its old end. This policy is not popular because it does not provide good visibility into message expiry. However, it can come in handy when we need to control the size of a Log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
Configures maximum size of a Log (per partition)
log.retention.bytes=104857600
So for your use case you should configure log.retention.bytes so that your disk does not fill up.
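To make the Segment-level behaviour concrete, here is a simplified model of size-based retention. It is a sketch of the idea rather than Kafka's actual code: segment sizes are plain integers (think MB), the last element stands for the active Segment, and the function name is made up for this example.

```python
def apply_size_retention(segment_sizes: list, retention_bytes: int) -> list:
    """Return the segments left after one size-based retention pass.

    segment_sizes is ordered oldest first; the last entry is the active
    segment, which is never removed here.
    """
    remaining = list(segment_sizes)
    excess = sum(remaining) - retention_bytes
    # Drop the oldest segment only if the log still holds at least
    # retention_bytes afterwards.
    while len(remaining) > 1 and excess - remaining[0] >= 0:
        excess -= remaining[0]
        remaining.pop(0)
    return remaining


print(apply_size_retention([100, 100, 100, 40], 250))
print(apply_size_retention([100, 100, 100, 40], 200))
```

In the first call nothing is removed, because dropping a whole 100 Segment would leave less than 250. In the second call one Segment is removed and 240 remains, which is why the Log can sit above the configured size.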

Partitions and Replications for the Apache Kafka

I have read the entire documentation on the suggested website http://kafka.apache.org/ and was not able to understand the hardware requirements.
1) I need a clarification on: how many partitions and replicas are required for collecting a minimum of 50GB of data per day for a single topic?
2) It is stated that the 0000000000000.log file is able to store up to 100GB of data. Is it possible to reduce this log file size to reduce I/O usage?
If the data is ingested uniformly during the entire day, you need to ingest something like 600KB per second. The rest depends on the number of messages in those 600KB (according to Jay Kreps's explanation here, you need to allow for something like 22 bytes of overhead per message). Keep in mind that the way you ACK the messages from the producer is also very important.
But with 1 topic and 1 partition you should be able to get this throughput from a producer.
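The arithmetic behind that figure, as a quick sketch. The 1000-byte message size is an assumption made only for illustration, since the question does not state one; the 22 bytes of overhead is the figure quoted above.

```python
def required_bytes_per_sec(bytes_per_day: float) -> float:
    """Average ingest rate if data arrives uniformly over 24 hours."""
    return bytes_per_day / 86_400


def with_overhead(payload_bytes_per_sec: float, message_bytes: int,
                  overhead_per_message: int = 22) -> float:
    """Add the per-message overhead to the payload rate."""
    messages_per_sec = payload_bytes_per_sec / message_bytes
    return payload_bytes_per_sec + messages_per_sec * overhead_per_message


rate = required_bytes_per_sec(50 * 10**9)  # 50GB per day
total = with_overhead(rate, 1000)          # hypothetical 1000-byte messages
print(round(rate), round(total))
```

This gives about 579KB per second of payload, which is where the "something like 600KB" comes from.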
1. Check this link, it has the answer on how to choose the number of partitions:
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
2. Yes, it is possible to change the maximum size of a log file in Kafka. You have to set the property below on each of the brokers and then restart the brokers.
log.segment.bytes=1073741824
The line above sets the log segment size to 1GB.