Kafka internal storage - apache-kafka

As per the Kafka book:
A producer publishes messages to a topic. These messages are written
to a segment, and after that the batch of messages is stored to disk.
A consumer subscribes to a topic and reads the messages from the segment.
I read that physically a segment is nothing but a file. I am confused about the role of the disk if we are storing messages in segments (the file system).
Can someone explain the relationship between segments and disk?

Messages are published to a topic partition. When a broker receives messages, it writes them into the OS page cache (i.e. the otherwise unused portion of main memory). The OS then periodically flushes (fsyncs) the dirty pages to disk.
Kafka physically organises data on the file system into directories under KAFKA_LOGS_DIR, one per partition and named after the topic and partition number, where KAFKA_LOGS_DIR is one of the log directories configured on each broker (default: /tmp/kafka-logs/).
Each of these directories contains a number of log-structured files called segments, to which messages are written sequentially. In addition, each segment log has two index files attached to it.
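To make the layout concrete, here is a minimal Python sketch that groups the files of one partition directory by segment. It assumes the usual naming in which all files of a segment share a zero-padded base offset as their file name, with the extensions .log, .index and .timeindex. The function name group_segments and the sample file names are made up for illustration.

```python
from collections import defaultdict

def group_segments(filenames):
    """Group segment files by base offset (the numeric part of the file name)."""
    segments = defaultdict(set)
    for name in filenames:
        stem, dot, ext = name.partition(".")
        if dot and stem.isdigit():  # skip files that are not segment files
            segments[int(stem)].add(ext)
    return dict(sorted(segments.items()))

# Hypothetical contents of one partition directory.
files = [
    "00000000000000000000.log",
    "00000000000000000000.index",
    "00000000000000000000.timeindex",
    "00000000000000170210.log",
    "00000000000000170210.index",
    "00000000000000170210.timeindex",
    "leader-epoch-checkpoint",
]
layout = group_segments(files)
for base_offset, extensions in layout.items():
    print(base_offset, sorted(extensions))
```

Each key of the result is the offset of the first message in that segment, so the file names alone tell you which segment covers which offset range.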

Related

Kafka Streams WindowStore keeps open many .sst small files

I have a Kafka Streams application (Kafka 2.2.x) that stores its local state in RocksDB through a WindowStore object. For every message, the stream calls a basic function to check whether the received message key is already in the state store.
We configured RocksDB as follows:
- max_open_files = 5000;
- target_file_size_base = 134217728 (128 MB)
- compaction_style = LEVEL
Despite this, the application stores a very large number of small files and we cannot tell whether the compaction process is working. This is a problem because
sometimes (once or twice a day) the application throws a "too many files exception" and a Kafka rebalance starts.
We also increased the number of open files that the service can handle in systemd.
How can we limit the number of open files? Why is the number of files opened by the Kafka Streams process sometimes bigger than 5000?

How does Kafka internally order the messages within a partition? Does it store them as it received from the producer?

I want to understand the order Kafka follows internally when placing messages in a partition that it receives from a number of different producers.
A partition is a shard of the topic. Each partition is written to its own files, in a directory that carries the topic's name. Writing to and reading from a file is sequential, and that is how a partition maintains its order.
Does it store them as it received from the producer?
Yes. As soon as a message is received it is written to a buffer, much as some relational databases write to a write-ahead log. Kafka uses the operating system's page cache as that buffer to get high read and write performance. Periodically, depending on the configuration, the data is flushed to the file on disk.
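As a rough illustration of "stored in the order received", here is a toy append-only log in Python. It is not Kafka's implementation, only a sketch of the idea that a record's position in the partition is decided by arrival order at the broker, no matter which producer sent it. The class name PartitionLog and the sample records are hypothetical.

```python
class PartitionLog:
    """Toy append-only log: a record's offset is its position in arrival order."""

    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)  # next offset = number of records so far
        self._records.append(record)
        return offset

    def read(self, offset, max_records=10):
        # Sequential read starting at the given offset.
        return self._records[offset:offset + max_records]

log = PartitionLog()
# Records from three hypothetical producers, in the order the broker received them.
offsets = [log.append(r) for r in ["p1:a", "p2:x", "p1:b", "p3:m"]]
print(offsets)
print(log.read(1, max_records=2))
```

Records from the same producer keep their relative order, but records from different producers are simply interleaved as they arrive.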

Can a Kafka consumer read records from a Broker's page cache?

Kafka's documentation clearly states that messages/records are immediately written to the file system as they are received by the Broker. With the default configuration, this means that the Broker flushes records to the page cache immediately and later the Kernel can flush it to disk.
My question is: can a consumer read a record that is in the page cache but that has not yet been flushed to disk by the kernel?
If the answer is yes, how will the consumer keep track of the offset it reads from?
If the answer is no, then it would mean that the record has to be read back from disk to the page cache before it is sent out to NIC via zero-copy. Correct?
Thanks,
Whenever there is a read or write operation on a file, the data goes through the page cache first. For a read, if the data is already present in the page cache, no actual disk read is issued and the data is served from the page cache. It is not that the Kafka consumer itself reads from the broker's page cache; this is done by the file system and is hidden behind the actual read call. In most cases, records are read from Kafka sequentially, which lets it use the page cache effectively.
Zero-copy optimization is used for reads by Kafka clients, copying data directly from the page cache to the NIC buffer.
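Both points can be tried outside Kafka. The Python sketch below is only an analogy, not broker code: the first function appends bytes to a file without calling fsync and reads them back through a second handle, showing that data not yet forced to disk is still readable. The second function ships file bytes to a socket with socket.sendfile(), which uses os.sendfile where the platform supports it. The function names and the file name segment.log are made up for illustration.

```python
import os
import socket
import tempfile

def read_back_without_fsync(payload):
    """Append bytes without fsync, then read them through a second handle."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.log")  # hypothetical segment file
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        try:
            os.write(fd, payload)  # handed to the kernel; os.fsync() is never called
            with open(path, "rb") as reader:
                return reader.read()  # the kernel can serve this from the page cache
        finally:
            os.close(fd)

def send_with_sendfile(payload):
    """Send the contents of a file to a socket using socket.sendfile()."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.log")
        with open(path, "wb") as f:
            f.write(payload)
        left, right = socket.socketpair()
        with left, right, open(path, "rb") as f:
            # Uses os.sendfile where available, so the bytes need not be
            # copied through this process's own buffers.
            left.sendfile(f)
            left.shutdown(socket.SHUT_WR)
            received = b""
            while True:
                chunk = right.recv(4096)
                if not chunk:
                    break
                received += chunk
        return received

record = b"key=42,value=hello\n"
seen_by_reader = read_back_without_fsync(record)
seen_by_socket = send_with_sendfile(record)
print(seen_by_reader == record, seen_by_socket == record)
```

In both cases the reader never needs to know whether the bytes came from memory or from disk, which is why the consumer's offset tracking is unaffected by the flush state.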

How is Apache Kafka offset generated?

I went through the question "How is the kafka offset value computed?".
From the kafka documentation on replication:
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades.
From the kafka documentation on Efficiency:
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks.
I did not see any details anywhere on how the offset is generated for a topic. Are offsets generated by a single machine in the cluster, in which case there is one master, or does Kafka have distributed logging that relies on some kind of clock synchronization and generates messages in a consistent order across all nodes?
Any pointers or additional information will be helpful.
Offsets are not generated explicitly for each message, and messages also do not store their offset.
A topic consists of partitions, and messages are written to partitions in chunks called segments (on the file system there is a folder for a topic, with subfolders for each partition; a segment corresponds to a file within a partition's folder).
Furthermore, an index is maintained per partition and stored along with the segment files. It uses the offset of the first message of each segment as the key and points to the segment. For all subsequent messages within a segment, the offset of a message can be computed from its logical position within the segment (added to the offset of the first message).
If you start a new topic, or actually a new partition, a first segment is created and its start offset, zero, is inserted into the index. Messages get written to the segment until it is full. Then a new segment is started and its start offset is added to the index. The start offset of the new segment can easily be computed as the start offset of the latest segment plus the number of messages within that segment.
Thus, for each partition, the broker that hosts the partition (i.e. the leader for this partition) tracks the offsets for the partition by maintaining the index. If segments are deleted because the retention time has passed, the segment file is deleted and its entry in the index is removed.
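The bookkeeping described above can be sketched in a few lines of Python. This is a simplified model rather than Kafka's code: the index is just a sorted list of segment start offsets, and the helper names find_segment and next_base_offset as well as the sample numbers are hypothetical.

```python
from bisect import bisect_right

def find_segment(base_offsets, target):
    """Return the start offset of the segment that contains `target`."""
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is below the first retained segment")
    return base_offsets[i]

def next_base_offset(last_base, messages_in_last_segment):
    """Start offset of a new segment = last start offset + messages in that segment."""
    return last_base + messages_in_last_segment

# Hypothetical index for one partition: start offsets of its three segments.
base_offsets = [0, 1000, 2500]

segment_for_1700 = find_segment(base_offsets, 1700)
position_in_segment = 1700 - segment_for_1700

# The last segment fills up with 300 messages, so a new one is rolled.
base_offsets.append(next_base_offset(base_offsets[-1], 300))

# Retention removes the oldest segment; its index entry goes away with it.
retained = base_offsets[1:]
print(segment_for_1700, position_in_segment, base_offsets, retained)
```

Because everything is derived from per-partition start offsets and positions, no clock or cluster-wide coordination is needed to assign offsets.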

Need help to understand Kafka storage

I am new to Kafka. From the link: http://notes.stephenholiday.com/Kafka.pdf
It is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is a segment file here?
When I create a topic with partitions, each partition has an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does it say "For better performance, we flush the segment files to
disk"? If it is flushing to disk, where on the disk is it flushing to?
It seems that until a message is flushed to disk it is not available to the consumer. That adds some latency to reading the message, but why?
I would also like to understand whether, when a consumer wants to read some data, it reads from disk (partition, segment file) or whether there is some caching mechanism; if so, how and when is data persisted into the cache?
I am not sure whether all these questions are valid, but it would help my understanding if anybody could clear this up.
You can think of this segment file as the OS page cache.
Kafka has a very simple storage layout. Each partition of a topic
corresponds to a logical log. Physically, a log is implemented as a
set of segment files of equal sizes. Every time a producer publishes a
message to a partition, the broker simply appends the message to the
last segment file. Segment file is flushed to disk after configurable
number of messages has been published or after certain amount of time.
Messages are exposed to consumer after it gets flushed.
Please also refer to the documentation below.
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and
supports the ability to configure the flush policy that controls when
data is forced out of the OS cache and onto disk using the flush. This
flush policy can be controlled to force data to disk after a period of
time or after a certain number of messages has been written. There are
several choices in this configuration.
Don't get confused by the word "filesystem" there: the OS page cache is also part of the filesystem. Also, the link you mentioned is very much outdated.
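The difference between "written to the filesystem" and "forced to disk" can be shown with a small Python sketch. It is a toy model, not broker code: every append goes to the file right away with os.write, while os.fsync is only called after a configurable number of messages or a configurable amount of time, loosely mirroring what the broker settings log.flush.interval.messages and log.flush.interval.ms control. The class FlushingAppender and its parameters are hypothetical.

```python
import os
import tempfile
import time

class FlushingAppender:
    """Append immediately; fsync only every N messages or every S seconds."""

    def __init__(self, path, flush_every_n=None, flush_every_s=None):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.flush_every_n = flush_every_n
        self.flush_every_s = flush_every_s
        self.unflushed = 0
        self.fsync_calls = 0
        self.last_flush = time.monotonic()

    def append(self, payload):
        os.write(self.fd, payload)  # reaches the filesystem (page cache) right away
        self.unflushed += 1
        too_many = (self.flush_every_n is not None
                    and self.unflushed >= self.flush_every_n)
        too_old = (self.flush_every_s is not None
                   and time.monotonic() - self.last_flush >= self.flush_every_s)
        if too_many or too_old:
            self.flush()

    def flush(self):
        os.fsync(self.fd)  # force the cached pages out to the storage device
        self.fsync_calls += 1
        self.unflushed = 0
        self.last_flush = time.monotonic()

    def close(self):
        os.close(self.fd)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.log")  # hypothetical segment file
    appender = FlushingAppender(path, flush_every_n=3)
    for i in range(7):
        appender.append(b"msg-%d\n" % i)
    size_on_filesystem = os.path.getsize(path)  # all 7 messages are visible
    fsyncs = appender.fsync_calls               # but only 2 forced flushes happened
    appender.close()
print(size_on_filesystem, fsyncs)
```

The file already contains all seven messages even though the seventh was never fsynced, which is the sense in which data is "immediately written to the filesystem" while the flush policy only decides when it is forced onto the disk.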