kafka __consumer_offsets topic logs rapidly growing in size reducing disk space - apache-kafka

I find that the __consumer_offsets topic log size is growing rapidly, and after studying it further I found the topics with the highest volume. I changed the retention policy on those topics to stop the rate of growth, but I would also like to increase disk space and delete all the old logs for the __consumer_offsets topic.
But I'm worried this will corrupt the other topics and the consumers/producers, or lose valuable metadata. Is there a way I can accomplish this? I'm looking at the topic-level config parameters, which include cleanup policy and compression, but I'm not sure how to apply them specifically to the topics that caused this rapid growth.
https://docs.confluent.io/current/installation/configuration/topic-configs.html
Appreciate any assistance here.

The topic "__consumer_offsets" is an internal topic which is used to manage the offsets of each Consumer Group. Producers will not be directly impacted by any change/modification in this topic.
That said, and echoing your own concern, you should be very careful about changing the configuration of this topic.
I suggest tweaking the topic configuration around compaction. The cleanup policy should be kept at "compact".
Reduce max.compaction.lag.ms (cluster-wide setting: log.cleaner.max.compaction.lag.ms), which defaults to MAX_LONG, to something like 60000.
Reduce the ratio at which a compaction is triggered through min.cleanable.dirty.ratio (cluster-wide setting: log.cleaner.min.cleanable.ratio), which defaults to 0.5, to something like 0.1.
That way, compactions will run more often without losing any essential information.
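For example, both settings could be applied as per-topic overrides with kafka-configs.sh; this is only a sketch with illustrative values, assuming a ZooKeeper-based cluster as in the example further down:
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name __consumer_offsets \
--add-config max.compaction.lag.ms=60000,min.cleanable.dirty.ratio=0.1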
Deleting old records in __consumer_offsets
The topic will pile up if you use many unique Consumer Groups (e.g. by using the console consumer, which by default creates a random Consumer Group each time it is executed).
To clean "old and un-needed" entries in the topic you need to be aware of how to delete a message from a compacted topic. This is done by producing a message with a null value (a tombstone) for the key in question; compaction will eventually delete all messages for that key. You just have to figure out the keys of the messages you want to get rid of.
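As a generic illustration of the tombstone mechanism, here is a sketch for a compacted topic with plain string keys; note that the keys in __consumer_offsets use an internal binary format, so this only shows the idea, and it assumes the third-party kcat tool is installed:
echo "some-old-key:" | kcat -b localhost:9092 -t my-compacted-topic -P -K: -Z
Here -K: splits key and value on ':' and -Z turns the empty value into a NULL tombstone.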

In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to configure the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size is determined by the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments that have reached the retention limit and a 3rd, active segment that data is currently written to).
Finally, you should do the math and compute the maximum size that Kafka logs might occupy on your disk at any given time, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
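As a rough, purely illustrative calculation: with log.retention.bytes=1GB, log.segment.bytes=512MB and 50 partitions (the default for __consumer_offsets), the worst case is roughly 50 * (1GB + 512MB) = 75GB, plus whatever is produced between two retention checks.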
Now in order to change the retention policy just for the __consumer_offsets topic, you can simply run:
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name __consumer_offsets \
--add-config retention.bytes=...
As a side note, you must be very careful with the retention policy for __consumer_offsets, as this might mess up all your consumers.
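If you do change it, it may be worth verifying the override afterwards (same assumptions as in the command above):
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--describe \
--entity-type topics \
--entity-name __consumer_offsets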

Related

How do I delete old Kafka logs safely in server.properties

I'm using Kafka version 2.3 and I want to delete old Kafka logs.
There are two folders:
log.dirs=/var/www/html/zookeeper_1/zookeeper_data_1
kafka_2.10-0.8.2.2/logs
What is the difference between the two folders, and how do I delete the old logs?
I would argue that the safest way to delete older logs is to properly configure your retention policy.
In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
Assuming that you want a delete cleanup policy, you'd need to configure the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size is determined by the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments that have reached the retention limit and a 3rd, active segment that data is currently written to).
Finally, you should do the math and compute the maximum size that Kafka logs might occupy on your disk at any given time, and tune the aforementioned parameters accordingly. I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
One is Zookeeper data; the other is Kafka 0.8.2.2 data, which is not directly compatible with Kafka 2.3.
You'd delete segments from the latter; however, doing so has the potential to corrupt the topic, so you should let Kafka clean itself up.
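If in doubt about which directory the running broker actually writes its topic data to, you can check the log.dirs entry in the broker's server.properties; a sketch, assuming a default installation layout where the file lives under config/:
grep -E '^log\.dirs' config/server.properties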

When kafka purges messages

I have Apache Kafka cluster with retention policy delete and retention period set to 24 hrs.
Then I have changed retention period dynamically and set it to 1 minute for some specific topic. But old messages are still there, so I have several questions:
What is the trigger point for retention? I assume that although an explicit time to live is set for messages, it is not guaranteed that messages will be deleted exactly after that time. So what is the process? (I can't find anything in the reference.)
If I change the retention period at runtime, will old messages obey it? As far as I understand, the retention period is a topic-wide property and should apply as well to messages that were published under the first retention period.
On each broker the partitions are divided into segment logs. By default a segment will store 1GB of data (log.segment.bytes). In addition, a new log segment is rolled out by default every 7 days (log.roll.hours).
Each broker schedules a cleaner thread which is responsible for periodically checking which segments are eligible for deletion. By default, the cleaner thread will run a check every 5 minutes (this can be configured through the broker config log.retention.check.interval.ms).
A segment is removable if the most recent message within that segment is older than the configured retention period. In addition, the active segment log (the one the broker is currently writing to) can't be deleted.
In order to be able to remove a segment log as soon as possible, you should configure the log rolling in correlation with your retention period. For example, if your retention period is configured to 24 hours, it could be a good idea to configure log.roll.hours to 1 hour.
Note that segment deletion can actually happen at different times on each broker, as the cleaner threads are not scheduled in sync across brokers.
Check the specific topic configuration with the kafka-configs script.
Example:
./bin/kafka-configs --describe --zookeeper localhost:2181 --entity-type topics --entity-name __consumer_offsets
The retention policy is applied to closed segments only. If your segment is still active, the data in that segment won't be purged until it is closed and a new segment is opened.
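To make a short retention like the 1 minute from the question take effect reasonably quickly, the segment roll interval has to be shortened as well. A sketch with illustrative values (my-topic is a placeholder, and a ZooKeeper-based cluster is assumed as in the other examples):
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name my-topic \
--add-config retention.ms=60000,segment.ms=300000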

Kafka - is it possible to alter Topic's partition count while keeping the change transparent to Producers and Consumers?

I am investigating Kafka to assess its suitability for our use case. Can you please help me understand how flexible Kafka is with changing the number of partitions for an existing topic?
Specifically,
Is it possible to change the number of partitions without tearing down the cluster?
And is it possible to do that without bringing down the topic?
Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
Ideally, I would want the change to be transparent to the producers and consumers. Does Kafka ensure this?
Update:
From my understanding so far, it looks like Kafka's design cannot allow this, because the mapping of consumer groups to partitions would have to be altered. Is that correct?
1. Is it possible to change the number of partitions without tearing down the cluster?
Yes, Kafka supports increasing the number of partitions at runtime, but doesn't support decreasing the number of partitions due to its design.
2. And is it possible to do that without bringing down the topic?
Yes, provided you are increasing partitions.
3. Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
As mentioned earlier, removing partitions is not supported.
When you increase the number of partitions, the existing messages will remain in the same partitions as before; only new messages will be considered for the new partitions (also depending on your partitioner logic). Increasing the partitions for a topic will trigger a cluster rebalance, whereby the consumers and producers get notified with the updated metadata of the topic. Producers will start sending messages to the new partitions after receiving the updated metadata, and the consumer rebalance will redistribute the partitions among the consumer groups and resume consumption from the last committed offset. All this happens under the hood, so you won't have to make any changes on the client side.
Yes, it is perfectly possible. You just execute the following command against the topic of your choice: bin/kafka-topics.sh --zookeeper zk_host:port --alter --topic <your_topic_name> --partitions <new_partition_count>. Remember, Kafka only allows increasing the number of partitions, because decreasing it would cause data loss.
There's a catch here. The Kafka documentation says the following:
Be aware that one use case for partitions is to semantically partition data, and adding partitions doesn't change the partitioning of existing data so this may disturb consumers if they rely on that partition. That is if data is partitioned by hash(key) % number_of_partitions then this partitioning will potentially be shuffled by adding partitions but Kafka will not attempt to automatically redistribute data in any way.
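As a small worked example of that hash(key) % number_of_partitions point: a key whose hash is 7 maps to partition 1 when the topic has 3 partitions (7 % 3 = 1), but to partition 3 once a 4th partition is added (7 % 4 = 3), so existing keys lose their partition affinity for new messages.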
Yes, if by bringing down the topic you mean deleting the topic.
Once you've increased the partition count, Kafka triggers a rebalance for the consumers subscribing to that topic, and on subsequent polls the partitions are distributed across the consumers. It's transparent to the client code; you don't have to worry about it.
NOTE: As I mentioned before, you can only add partitions, removing is not possible.
One more thing: if you are using stateful operations in clients, like aggregations (making use of a state store), a change in partition count will kill all the stream threads in the consumer. This is expected, as an increase in partitions may corrupt stateful applications. So beware of changing the partition count; it may break stateful consumers connected to the topic.
Good read: Why does kafka streams threads die when the source topic partitions changes ? Can anyone point to reading material around this?

Why do we have __consumer_offsets partitions default value 50?

Why does the __consumer_offsets partition count have a default value of 50? i.e. the offsets.topic.num.partitions default value is 50. We could even use offsets.topic.num.partitions=1.
Kafka really scales by splitting data over partitions, allowing them to be distributed across many servers. This is particularly true for this internal topic, which is used to store consumer group data.
Also, as this setting can't be changed after deployment, it makes sense to have a relatively large default value. This allows clusters to grow from a handful of brokers to dozens of them without hitting scaling issues with this internal topic.
For development purposes, and if you have very limited hardware resources, you could set it to 1, but I wouldn't recommend it. In my experience, the cost of having 50 partitions in a development environment is negligible.
The __consumer_offsets topic is used in different scenarios, for example when a consumer starts working and needs to obtain an initial offset, or when it commits its last processed offset. So, depending on how consumers commit their offsets (automatically by default) and on the number of consumers and brokers, the number of partitions of __consumer_offsets can have a direct effect on the performance and reliability of offset tracking. Hence, the default value is a good starting point for most setups, but you should know that you may need to tune it for your application.
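To check how many partitions the topic actually has on a given cluster, you can describe it; a sketch, assuming a ZooKeeper-based setup as in the earlier examples:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets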

Kafka optimal retention and deletion policy

I am fairly new to kafka so forgive me if this question is trivial. I have a very simple setup for purposes of timing tests as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but that seemed to degrade performance; then I set them to the maximum my broker could take before being completely full, but again performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered set of segments, and a segment is a collection of messages. Apache Kafka provides retention at the segment level instead of at the message level. Hence, Kafka keeps removing segments from the oldest end of the log as they violate the retention policies.
Apache Kafka provides us with the following retention policies:
Time Based Retention
Under this policy, we configure the maximum time a segment (and hence its messages) can live for. Once a segment has exceeded the configured retention time, it is marked for deletion or compaction depending on the configured cleanup policy. The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
# Configures retention time in milliseconds
log.retention.ms=1680000
# Used if log.retention.ms is not set
log.retention.minutes=1680
# Used if log.retention.minutes is not set
log.retention.hours=168
Size Based Retention
In this policy, we configure the maximum size of the log for a topic partition. Once the log reaches this size, Kafka starts removing segments from its oldest end. This policy is not popular, as it does not provide good visibility into message expiry. However, it can come in handy when we need to control the size of a log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
# Configures maximum size of a Log
log.retention.bytes=104857600
So, according to your use case, you should configure log.retention.bytes so that your disk does not get full.
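If the goal is quick deletion on a small broker, per-topic overrides can also be used instead of broker-wide settings. A sketch with purely illustrative values, assuming a ZooKeeper-based cluster and using topic1 as a placeholder for the topics in the question:
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name topic1 \
--add-config retention.bytes=104857600,retention.ms=600000,segment.bytes=26214400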