Will Kafka reuse an old disk for writes after a new disk has been added? - apache-kafka

I have a question about using multiple disks per Kafka broker.
Assume that a Kafka broker has 3 disks associated with it.
i) Disk-1 became full within 5 days.
ii) Disk-2 reached about 40% usage over the following 3 days.
Now, once the log.retention.hours=168 (7 days) retention period has elapsed, let's say the data on Disk-1 is deleted, so Disk-1 is free again while Disk-2 is 40% used.
Will Kafka reuse Disk-1 for new writes, or will it only write to the other disks, i.e. Disk-2, Disk-3, and so on?
Basically, my question is: will Kafka write to an older disk again if that disk has enough free space due to message deletion after the maximum retention period?

When a partition is created, each broker that hosts a replica selects a log directory to put the data for that partition. On a broker, data for a specific partition is only ever stored in that selected log directory.
Log directories are specified in the broker configuration via the log.dirs setting.
If you have multiple log directories, when creating a partition, the log directory containing the fewest partitions is picked.
When producing messages to a partition, the data goes into the log directory where that partition lives.
In short, the answer to your specific question is "it depends", but hopefully I've described the process clearly enough for you to figure out the answer for your exact situation; see also the sketch below for checking where each partition currently lives.
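If you want to verify this on a running cluster, recent Kafka distributions ship a kafka-log-dirs.sh tool (added by KIP-113) that reports which log directory holds each partition. A minimal sketch, assuming a broker reachable at localhost:9092 and a topic named my-topic (both placeholders):

# print, per broker, the log directory and size of every partition of my-topic
bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe --topic-list my-topic

The output is JSON; each partition appears under exactly one logDir per broker, and that is the directory Kafka keeps appending to, regardless of how full the other disks are.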

Related

How do I safely delete old Kafka logs in server.properties?

I use Kafka version 2.3 and I want to delete old Kafka logs.
There are two folders:
log.dirs=/var/www/html/zookeeper_1/zookeeper_data_1
kafka_2.10-0.8.2.2/logs
What is the difference between the two folders, and how do I delete the old logs?
I would argue that the safest way to delete older logs is to properly configure your retention policy.
In Kafka, there are two types of log retention: size-based and time-based. The former is triggered by log.retention.bytes, while the latter is triggered by log.retention.hours.
Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window, before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk (2 segments that have reached the retention threshold, plus a 3rd, active segment that data is currently being written to).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. I would also advise setting a time-based retention policy as well and configuring log.retention.hours accordingly; for example, if you don't need your data after 2 days, set log.retention.hours=48. A combined sketch follows.
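Putting those pieces together, a sketch of what the relevant server.properties entries could look like (the sizes and times are illustrative, not recommendations):

log.cleanup.policy=delete
log.retention.hours=48                  # time retention: drop data older than 2 days
log.retention.bytes=1073741824          # size retention: 1GB per partition
log.segment.bytes=536870912             # roll a new segment every 512MB
log.retention.check.interval.ms=300000  # look for deletable segments every 5 minutes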
One folder is ZooKeeper data; the other is Kafka 0.8.2.2 data, which is not directly compatible with Kafka 2.3.
You could delete segments from the latter, but doing so by hand has the potential to corrupt the topic, so you should let Kafka clean itself up.

kafka partition has lots of log segments

One topic has 20 partitions, and almost every one has more than 20,000 log segment files, most of them created months ago. Even after I set retention.ms to a very short value, the segments are not deleted, while other topics recycle normally.
I am wondering what the underlying issue is and how to solve it, because I'm worried that the total number of segments will keep increasing until it exceeds the OS vm.max_map_count limit, which would damage the Kafka process itself. The image below shows the describe output for the abnormal topic.
Not sure what the issue is exactly, but some things to consider:
Broker vs topic-specific configs. Check to make sure your topic actually has the configs you think it has, and is not inheriting them from the broker settings (see the kafka-configs.sh sketch after this list).
Configs related to retention. As mentioned by Giorgos Myrianthous, you can look at log.retention.check.interval.ms and log.cleanup.policy. I would also look at the roll-related settings, like log.roll.hours. I believe that in some cases, Kafka will not delete a segment until its partition rolls, even if the segment is old. And rolling follows this behavior:
The log rolling time is no longer depending on log segment create time. Instead it is now based on the timestamp in the messages. More specifically, if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms (http://kafka.apache.org/20/documentation.html)
So make sure to consider the record timestamps, not just the segment files' age.
Finally:
What version of Kafka are you using?
Have you looked carefully at the broker logs? Broker logs are how I've solved all such problems that I've encountered.
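To rule out the broker-vs-topic config question from the first point, a sketch using the stock kafka-configs.sh tool, assuming a recent Kafka version (older ones use --zookeeper instead of --bootstrap-server) and a topic named my-topic (a placeholder):

# show only the overrides set on the topic itself;
# anything not listed here is inherited from the broker defaults
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --describe

# set the short retention explicitly as a topic-level override
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --alter --add-config retention.ms=600000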

kafka : How to delete data which already been consumed by consumer?

I set log.retention.minutes=8 in server.properties
to clean the data under kafka-logs/ automatically every 8 minutes.
Is it possible to make the cleaner only clean up data that has already been consumed,
so that data not yet consumed by any consumer is retained?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the minimum time a message is kept. It is possible for a message with a retention time of minutes to last for weeks (depending on other configuration settings); a sketch of the relevant knobs follows.
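If you want deletions to actually happen close to the 8-minute mark, the segment roll and the retention check have to be at least as aggressive as the retention itself. A sketch of the relevant server.properties knobs (values are illustrative):

log.retention.minutes=8
log.roll.ms=60000                       # roll a new segment every minute so old data becomes eligible for deletion
log.retention.check.interval.ms=60000   # check for expired segments every minute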
The concept of "consumer offsets" is the mechanism Kafka uses to avoid re-consumption of messages. Kafka 0.11 will also contain exactly-once capabilities.
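To see how far your consumers have actually read (and therefore what you would lose if retention deleted unconsumed data), a sketch with the stock kafka-consumer-groups.sh tool, assuming a broker at localhost:9092 and a group named my-group (both placeholders):

# prints per-partition committed offset, log-end offset, and lag for the group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group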

Partitions and Replications for the Apache Kafka

I have read the entire documentation from the suggested website http://kafka.apache.org/ and was not able to understand the hardware requirements.
1) I need clarification on how many partitions and replicas are required for collecting a minimum of 50GB of data per day for a single topic.
2) It is given that the 0000000000000.log file is able to store up to 100GB of data. Is it possible to reduce this log file size to reduce I/O usage?
If the data is ingested uniformly over the entire day, that means you need to ingest something like 600KB per second; the real load depends on how many messages make up those 600KB (according to Jay Kreps' explanation here, you need to calculate something like 22 bytes of overhead per message). Keep in mind that the way you ack the messages from the producer is also very important.
But you should be able to get this throughput from a producer with 1 topic and 1 partition.
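For reference, the back-of-the-envelope math behind that 600KB-per-second figure (a sketch in shell; it ignores replication traffic and per-message overhead):

# 50GB per day, in KB, spread evenly over 86400 seconds
echo $(( 50 * 1024 * 1024 / 86400 ))   # prints 606, i.e. roughly 600KB per second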
1. Check this link; it has the answer on how to choose the number of partitions:
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
2. Yes, it is possible to change the maximum size of a log segment file in Kafka. You have to set the property below on each of the brokers and then restart the brokers:
log.segment.bytes=1073741824
The line above sets the log segment size to 1GB.
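If you only need smaller segments on a single topic rather than broker-wide, the same limit also exists as the per-topic override segment.bytes, which does not require a broker restart. A sketch, assuming a recent Kafka version and a topic named my-topic (a placeholder):

# set a 256MB segment size on one topic only
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --alter --add-config segment.bytes=268435456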

Need help to understand Kafka storage

I am new to Kafka. In the following link: http://notes.stephenholiday.com/Kafka.pdf
it is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is the segment file here?
When I create a topic with partitions, each partition gets an index file and a .log file. Is this .log file the segment file? If so, it is already on disk, so why does it say "for better performance, we flush the segment files to disk"? And if it is flushing to disk, where on the disk is it flushing to?
It seems that until the data is flushed to disk, it is not available to the consumer. That adds some latency to reading the message, but why?
I also want help understanding: when a consumer wants to read some data, does it read from disk (partition, segment file), or is there some cache mechanism? If so, how and when is the data persisted into the cache?
I am not sure all these questions are valid, but it will help me understand if anybody can clear them up.
You can think of the segment file as being backed by the OS page cache: appended data sits in the page cache until it is flushed.
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal sizes. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time. Messages are exposed to consumers after the file gets flushed.
Also, please refer to the document below:
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Don't get confused when you see the word "filesystem" there: a write to the filesystem initially lands in the OS page cache and is only forced onto disk by the flush policy. Also, the link you mentioned is really very much outdated.
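The flush policy that quote refers to is controlled by broker properties. A sketch of what tightening it could look like in server.properties (values illustrative; by default Kafka leaves flushing to the operating system):

log.flush.interval.messages=10000   # force an fsync after this many messages per partition
log.flush.interval.ms=1000          # ...or after this many milliseconds, whichever comes first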