Need help to understand Kafka storage - apache-kafka

I am new to Kafka. The following is mentioned in this paper: http://notes.stephenholiday.com/Kafka.pdf
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is the segment file here?
When I create a topic with partitions, each partition gets an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does the paper say "For better performance, we flush the segment files to disk"? And if it is flushing to disk, where on the disk is it flushing to?
It seems that until the data is flushed to disk, it is not available to the consumer. That adds some latency before a message can be read, but why?
I also want to understand: when a consumer wants to read some data, does it read from disk (the partition's segment files), or is there some cache mechanism? If so, how and when is data persisted into that cache?
I am not sure all of these questions are valid, but it will help me understand if anybody can clarify them.

You can think of the unflushed tail of a segment file as living in the OS page cache.
Kafka has a very simple storage layout. Each partition of a topic
corresponds to a logical log. Physically, a log is implemented as a
set of segment files of equal sizes. Every time a producer publishes a
message to a partition, the broker simply appends the message to the
last segment file. A segment file is flushed to disk after a configurable
number of messages has been published or after a certain amount of time has elapsed.
Messages are exposed to consumers only after they have been flushed.
Also, please refer to the documentation below:
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and
supports the ability to configure the flush policy that controls when
data is forced out of the OS cache and onto disk using the flush. This
flush policy can be controlled to force data to disk after a period of
time or after a certain number of messages has been written. There are
several choices in this configuration.
Don't get confused by the word "filesystem" there: a write to the filesystem lands in the OS page cache first, and the kernel flushes it to the physical disk later. Also note that the paper you linked is quite old, so some of its details are outdated.
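For reference, the flush policy mentioned above can also be overridden per topic via the flush.messages and flush.ms configs. A minimal sketch using the Java AdminClient, where the broker address, topic name and values are just placeholder assumptions:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class FlushPolicyExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // flush.messages: fsync the log after this many messages have been appended
            // flush.ms: fsync the log after this many milliseconds
            // These override the broker-wide log.flush.interval.* defaults for this topic only.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1)
                    .configs(Map.of("flush.messages", "10000", "flush.ms", "1000"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}

In practice most deployments leave these unset and rely on the OS background flush plus replication for durability, which is what the linked documentation recommends.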


Artemis - Messages Sync Between Memory & Journal

From reading the Artemis docs I understood that Artemis keeps all currently active messages in memory and can offload messages to a paging area for a given queue/topic according to the settings, and that Artemis journals are append-only.
With respect to this:
How and when does the broker sync messages to and from the journal (only during restart?)
How does it identify which message to delete from the journal? (For example, if the journal is append-only and a consumer ACKs a persistent message, how does the broker remove that single message from the journal without keeping an index?)
Isn't it a performance hit to keep every active message in memory, and couldn't that even make the broker run out of memory? To avoid this, paging settings have to be configured for every queue/topic, otherwise the broker may fill up with messages. Please correct me if I am wrong.
Any reference link that explains message syncing and this behaviour would be helpful. The Artemis docs explain the append-only mode, but maybe there is a section or article about these storage concepts that I am missing.
By default, a durable message is persisted to disk after the broker receives it and before the broker sends a response back to the client that the message was received. In this way the client can know for sure that if it receives the response back from the broker that the durable message it sent was received and persisted to disk.
When using the NIO journal-type in broker.xml (i.e. the default configuration), data is synced to disk using java.nio.channels.FileChannel.force(boolean).
Since the journal is append-only during normal operation then when a message is acknowledged it is not actually deleted from the journal. The broker simply appends a delete record to the journal for that particular message. The message will then be physically removed from the journal later during "compaction". This process is controlled by the journal-compact-min-files & journal-compact-percentage parameters in broker.xml. See the documentation for more details on that.
Keeping message data in memory actually improves performance dramatically vs. evicting it from memory and then having to read it back from disk later. As you note, this can lead to memory consumption problems, which is why the broker supports paging, blocking, etc. The main thing to keep in mind is that a message broker is not a storage medium like a database. Paging is a palliative measure meant to be used as a last resort to keep the broker functioning. Ideally the broker should be configured to handle the expected load without paging (e.g. acquire more RAM, allocate more heap). In other words, message production and message consumption should be balanced. The broker is designed for messages to flow through it. It can certainly buffer messages (potentially millions depending on the configuration & hardware), but when it's forced to page the performance will drop substantially, simply because disk is orders of magnitude slower than RAM.
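For reference, the settings mentioned above all live in broker.xml. A rough, illustrative fragment (the values are just examples, not recommendations):

<core xmlns="urn:activemq:core">
   <journal-type>NIO</journal-type>
   <!-- compact once at least this many journal files exist... -->
   <journal-compact-min-files>10</journal-compact-min-files>
   <!-- ...and less than this percentage of the data in them is still live -->
   <journal-compact-percentage>30</journal-compact-percentage>
   <address-settings>
      <address-setting match="#">
         <!-- start paging to disk once this much message data for the address is held in memory -->
         <max-size-bytes>104857600</max-size-bytes>
         <page-size-bytes>10485760</page-size-bytes>
         <address-full-policy>PAGE</address-full-policy>
      </address-setting>
   </address-settings>
</core>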

Upload files to Kafka and further handling?

Is it a good approach to send the binary data of uploaded files to Kafka and then distribute the handling of those uploads across services that consume the Kafka topic?
I see some advantages:
Filtering the uploaded data
Replication
Multiple services can handle the uploads, not only one
What do you think about that?
Is it a good approach to send the binary data of uploaded files to Kafka
and then distribute the handling of those uploads across services that
consume the Kafka topic?
Typically, files are uploaded to a file system and their URIs are stored in the Kafka message. This keeps the Kafka message relatively small, thereby increasing the throughput of its clients.
If, instead, we put large objects in the Kafka message, the consumer would have to read the entire file, so your poll() will take longer than usual.
On the other hand, if we just put the URI of the file instead of the file itself, message consumption will be relatively faster and you can delegate the processing of the files to, perhaps, another thread (possibly from a thread pool), thereby increasing your application throughput.
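A minimal sketch of that approach, assuming the file has already been stored somewhere externally (the broker address, topic name and URI below are made up for illustration):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UploadNotifier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The file itself lives in external storage; only its URI travels through Kafka.
            String fileUri = "s3://uploads-bucket/2024/01/invoice-42.pdf"; // hypothetical location
            producer.send(new ProducerRecord<>("file-uploads", "invoice-42", fileUri));
        }
    }
}

Consumers then fetch the URI from the topic and decide how (and on which thread pool) to download and process the actual file.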
Replicas
Just as there are replicas in Kafka, there can also be replicas for the file system. Even Kafka stores its messages on the file system (as segment files), so the replication of your uploaded files may as well be done by the file system itself.
The best way is to put a URI that points to the file in the Kafka
message and then put a handler behind that URI which will be
responsible for giving you the file, and possibly for giving you a replica in case the original file is deleted.
The handler can be loosely coupled from the rest of your system, built specifically for managing the files, maintaining replicas, etc.
Filtering the uploaded data
The filtering of the uploaded data can be done only when you actually read the contents of the file. You can do that even by putting the URI of your file in the message and reading it from there. For example, if you are using Kafka Streams, you can put that filtering logic in transform() or mapValues(), etc.:
// 'builder' is a StreamsBuilder; 'Upload', validate(), read() and outputTopic are placeholders
KStream<String, Upload> uploads = builder.stream(topic);
uploads.mapValues(v -> v.getFileURI())
       .filter((k, fileURI) -> validate(read(fileURI)))  // read the file via its URI and validate it
       .to(outputTopic);
Hitting segment.bytes
Another disadvantage of storing files in the message itself is that you might hit the segment.bytes limit if the files are large, and you would need to keep increasing segment.bytes to meet the new size requirements of the files.
Another point: if segment.bytes is set to 1GB, your first message (file) is 750MB and your next message is 251MB, the 251MB message cannot fit into the first segment, so your first segment will contain only one message even though it has not reached the size limit. This means that relatively few messages will be stored per segment.

Kafka internal storage

As per the Kafka book:
Producer publishes messages to a topic. These messages are written
into a segment, and after that the batch of messages is stored to disk.
Consumer subscribes to a topic and reads the messages from the segment.
I read that a segment is physically nothing but a file. I am confused about what the role of the disk is if we are storing messages in segments (on the file system).
Can someone explain to me the relationship between segments and the disk?
Messages are published to a topic/partition. When a broker receives messages, it writes them into the OS page cache (i.e., the unused portions of main memory). The OS then periodically flushes (fsync) the dirty pages to disk.
Kafka physically organises data on the file system into directories named <topic>-<partition>, which are stored under the log directories configured on each broker (default: /tmp/kafka-logs/).
Each of these directories contains a number of log-structured files called segments, in which messages are written sequentially. In addition, each log segment is accompanied by two index files: an offset index (.index) and a time index (.timeindex).
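To make that concrete, the on-disk layout for a topic named my-topic, partition 0, would look roughly like this (segment files are named after the first offset they contain; the names below are illustrative):

/tmp/kafka-logs/my-topic-0/00000000000000000000.log        <- segment file: the messages themselves
/tmp/kafka-logs/my-topic-0/00000000000000000000.index      <- offset index for that segment
/tmp/kafka-logs/my-topic-0/00000000000000000000.timeindex  <- timestamp index for that segment
/tmp/kafka-logs/my-topic-0/00000000000000368769.log        <- next segment, starting at offset 368769
...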

Can a Kafka consumer read records from a Broker's page cache?

Kafka's documentation clearly states that messages/records are immediately written to the file system as they are received by the Broker. With the default configuration, this means that the Broker writes records to the page cache immediately and the kernel flushes them to disk later.
My question is: can a consumer read a record that is in the page cache but that has not yet been flushed to disk by the kernel?
If the answer is yes, how will the consumer keep track of the offset it reads from?
If the answer is no, then it would mean that the record has to be read back from disk to the page cache before it is sent out to NIC via zero-copy. Correct?
Thanks,
Whenever there is a read/write operation on a file, the data is written to/fetched from the page cache first. In the case of a read, if the data is already present in the page cache, no actual disk read is issued and the data is served from the page cache. So it is not that the Kafka consumer reads from the broker's page cache directly; this is done by the file system and is hidden from the actual read call. In most cases the records from Kafka are read sequentially, which allows Kafka to use the page cache effectively.
The zero-copy optimization is used for reads from Kafka clients, copying data directly from the page cache to the NIC buffer.
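The zero-copy path is the OS sendfile mechanism, which Kafka reaches through java.nio.channels.FileChannel.transferTo. A minimal, standalone sketch of that API (the file path and target socket are placeholders, not Kafka internals):

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel segment = FileChannel.open(
                     Path.of("/tmp/kafka-logs/my-topic-0/00000000000000000000.log"),
                     StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = segment.size();
            while (remaining > 0) {
                // Bytes move from the page cache to the socket without being copied into user space.
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}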

How does Apache Kafka use open file descriptors?

I wanted to know how Kafka uses open file descriptors. Why is it recommended to allow a large number of open file descriptors? Does it impact producer and consumer throughput?
Brokers create and maintain file handles for each log segment file and each network connection. The total number can be very large if the broker hosts many partitions and each partition has many log segment files. The same applies to the network connections.
I don't immediately see any performance decline caused by setting a large file-max, but page cache misses do matter.
Kafka keeps one file descriptor open for every segment file, and it fails miserably if the limit is too low. I don't know if it affects consumer throughput, but I assume it doesn't since Kafka appears to ignore the limit until it is reached.
The number of segment files is the number of partitions multiplied by some number that depends on the retention policy. The default retention policy is to start a new segment after one week (or 1GB, whichever occurs first) and to delete a segment when all data in it is more than one week old.
(disclaimer: This answer is for Kafka 1.0 based on what I have learnt from one installation I have)
You can estimate it as follows: a broker that hosts many partitions needs at least the following number of file descriptors just to track its log segment files:
(number of partitions) * (partition size / segment size)
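As a worked example with assumed numbers: a broker hosting 1,000 partitions, each retaining about 50 GB with 1 GB segments, needs roughly 1,000 * 50 = 50,000 descriptors for the .log files alone, before counting the index files that accompany each segment and the sockets for client and replication connections. That is why limits far above the typical OS default of 1,024 are recommended.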