Kafka optimal retention and deletion policy - apache-kafka

I am fairly new to kafka so forgive me if this question is trivial. I have a very simple setup for purposes of timing tests as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but it seemed this degraded performance, then I set them to the maximum my broker could take before being completely full, but again the performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!

Apache Kafka uses Log data structure to manage its messages. Log data structure is basically an ordered set of Segments whereas a Segment is a collection of messages. Apache Kafka provides retention at Segment level instead of at Message level. Hence, Kafka keeps on removing Segments from its end as these violate retention policies.
Apache Kafka provides us with the following retention policies -
Time Based Retention
Under this policy, we configure the maximum time a Segment (hence messages) can live for. Once a Segment has spanned configured retention time, it is marked for deletion or compaction depending on configured cleanup policy. Default retention time for Segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
Configures retention time in milliseconds
log.retention.ms=1680000
Used if log.retention.ms is not set
log.retention.minutes=1680
Used if log.retention.minutes is not set
log.retention.hours=168
Size based Retention
In this policy, we configure the maximum size of a Log data structure for a Topic partition. Once Log size reaches this size, it starts removing Segments from its end. This policy is not popular as this does not provide good visibility about message expiry. However it can come handy in a scenario where we need to control the size of a Log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
Configures maximum size of a Log
log.retention.bytes=104857600
So according to your use case you should configure log.retention.bytes so that your disk should not get full.

Related

kafka automatic deletion based on message consumption

Like how we have in MQ solutions , is it possible to have the message automatically deleted in Kafka once it is consumed ?
As I don't have control when the message will be consumed ,its not possible to define retention by time / byte size
You can override the configuration of retention by time per topic basis, even set it to 0 for no deletion at all. Retention byte size retention is not limited by default, and you don't have to use it. Being said that I am not sure Kafka is best suited for your use case as it meant to use used for real time high performance streaming processes... another note you can use COMPACT topic and send tombstone message to delete a record once processed, but basically kafka does not have automatic delete on consumption

How i delete old Kafka logs Safely in server.properties

I used Kafka Version 2.3, I want to delete old kafka logs
there are two folders
log.dirs=/var/www/html/zookeeper_1/zookeeper_data_1
kafka_2.10-0.8.2.2/logs
What is the difference between two folders, and I want to delete old log?
I would argue that the safest way to delete older logs is to properly configure your retention policy.
In Kafka, there are two types of log retention; size and time retention. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
Assuming that you want a delete cleanup policy, you'd need to configure the following parameters to
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, it means you will always have 512MB of data (per partition) in your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value) at any given time, you will have at least 512MB of data + the size of data produced within the 5 minute window, before the retention policy is triggered.
A topic log on disk, is made up of segments. The segment size is dependent to log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on the disk (2 segments which reach the retention and the 3rd one will be the active segment where data is currently written to).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk and tune the aforementioned parameters accordingly. I would also advice to set a time retention policy as well and configure log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
One is Zookeeper data, the other is Kafka 0.8.2.2 data, which is not directly compatible with Kafka 2.3
You'd delete segments from the latter, however it'll have the potential to corrupt the topic if you do so, so you should let Kafka clean itself up

log retention to implement message TTL

we are considering to implement a timeout as part of a Kafka-based API by utilising its time based retention capabilities.
Basically, setting log.retention.ms = 10000 to make messages expire from a command topic if not processed within 10seconds.
I am wondering though whether this would provide a message level guarantee (i.e. every message is available the same amount of time) given that retention policies operate at the log segment level (based on largest timestamp per segment).
Of course, we can reduce log.segment.bytes to achieve more granular retention control, not sure though about the implications on performance.
any advice?
Nick
In Kafka, the retention settings are lower bounds, ie Kafka guarantees it will not delete a message before its retention limits are reached.
In practice, that means messages can stay in the log for longer than their retention limits.
Also as you said, Kafka operate at the log segment level. For time retention, only once the latest message in a segment gets older than the limit, this segment becomes eligible for deletion. And that does not apply to the active segment. So retention can't be used to provide per message time to live.
I don't know about your use case but maybe have a look at the offsetsForTimes() and seek() APIs in the consumer. These allow to select what the consumer will read based on time.
Finally, if you really need strong per message TTL, maybe Kafka is not the best tool.

Kafka disk space gets full

I have a Kafka service with 1000GB disk and this running parameter:
log.retention.bytes=350000000000
However, the usage of disk space reaches 90% (900GB). Since that parameter is running, the disk size should not exceeds 326GB. Why could this happen?
Other properties:
log.index.interval.bytes=4000
log.segment.bytes=250000000
log.index.size.max.bytes=10485760
log.retention.ms=168
while the official documentation isnt very clear:
The maximum size of the log before deleting it
the confluent documentation on topic configs (which should really be considered the official documentation anyway) has a better one (under retention.bytes):
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
in short, this config isnt even per topic. its per partition. im not aware of a kafka config that acts as a broker-wide size limit.
if youre trying to balance data load across multiple brokers in a cluster perhaps you should look at cruise control

Kafka PersistentWindowStore rebalancing mechanics

I am creating a 30-minute de-duplication store for a Kafka Streams application loosely based upon this confluent code (to solve a different problem to Kafka's exactly-once processing guarantee), and want to minimise topology startup time.
This code makes use of a persistent window store, which requires that I specify the number of log segments to make use of. Assuming I want to use 2 segments, and am using the default segment size of 1GB, does this mean that during rebalancing, the client will have to read 2GB of data before the application launches?
The segment parameter configures something different in Kafka Streams -- it's not related to segments in the brokers (just the same name).
Using a windowed store, the retention time of the store, is divided by the number of segments. If all data is a segment is older than the retention time, the complete segment is dropped and a new empty segment is created. Those segments, only exist client-side.
The number of record that need to be restored, only depend on the retention time (and your input data rate). It's independent of segments size. Segment size only defined how fine grained older records are expired.