Upload files to Kafka and further handling?

Is it a good approach to send the binary data of uploaded files to Kafka and then distribute the handling of those uploads across several services connected to a Kafka topic?
I see some advantages:
Filtering of the uploaded data
Replication
Several services can handle the uploads, not only one
What do you think about that?

Is it a good approach to send the binary data of uploaded files to Kafka and then distribute the handling of those uploads across several services connected to a Kafka topic?
Typically, files are uploaded to a file system (or object store) and only their URIs are stored in the Kafka message. This keeps the Kafka messages relatively small, thereby increasing the throughput of its clients.
If we put large objects in the Kafka message instead, the consumer has to read the entire file, so your poll() will take longer than usual.
On the other hand, if we put just a URI of the file rather than the file itself, message consumption is relatively fast and you can delegate the processing of files to, say, another thread (possibly from a thread pool), thereby increasing your application's throughput.
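To make the pattern concrete, here is a minimal producer sketch that publishes only the URI of an already-uploaded file. The topic name, record key, and storage path are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileUriProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The file itself was already written to shared storage by the upload service.
            String fileUri = "hdfs://storage/uploads/2023/invoice-42.pdf";
            // Key by file name so all events for the same file land on one partition.
            producer.send(new ProducerRecord<>("file-uploads", "invoice-42.pdf", fileUri));
        }
    }
}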
Replicas
Just as there are replicas in Kafka, there can also be replicas for the file system. Even Kafka stores its messages in the file system (as segment files), so replication may as well be handled by the file system itself.
The best way is to put a URI that points to the file in the Kafka message and then put a handler for that URI which will be responsible for giving you the file, and possibly a replica in case the original file is deleted.
The handler can be loosely coupled from the rest of your system, built specifically for managing the files, maintaining replicas, etc.; a possible shape is sketched below.
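A purely hypothetical shape for such a handler (the names FileStore, put and open are assumptions, not an existing API):

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

public interface FileStore {
    // Store an uploaded file and return the URI to embed in the Kafka message.
    URI put(String name, InputStream contents) throws IOException;

    // Resolve a URI taken from a Kafka message to the file contents,
    // falling back to a replica if the original copy is gone.
    InputStream open(URI fileUri) throws IOException;
}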
Filtering uploaded data
The filtering of uploaded data can only be done when you actually read the contents of the file. You can do that even when you put just the URI of the file in the message and read from there. For example, if you are using Kafka Streams, you can put the filtering logic in transform() or mapValues() etc.:
// sketch: builder is a StreamsBuilder; the value type is assumed to expose getFileURI(),
// and validate()/read() are your own helpers
builder.stream(topic)
       .mapValues(v -> v.getFileURI())
       .filter((k, fileURI) -> validate(read(fileURI)))
       .to(outputTopic);
Hitting segment.bytes
Another disadvantage of storing the files in your messages is that you might hit the segment.bytes limit if the files are large. You would need to keep raising segment.bytes to meet the size requirements of new files.
Another point: if segment.bytes is set to 1GB, your first message (file) is 750MB and the next message is 251MB, then the 251MB message cannot fit in the first segment, so the first segment will hold only one message even though it has not reached the limit. This means relatively fewer messages are stored per segment.
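If you do decide to carry large payloads in the topic anyway, the relevant limits are ordinary topic-level configs. A minimal sketch of creating such a topic with the AdminClient, where the topic name, partition/replica counts and sizes are assumptions:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class LargeMessageTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic uploads = new NewTopic("uploads", 6, (short) 3)
                    .configs(Map.of(
                            "max.message.bytes", "10485760",    // allow batches up to 10 MB
                            "segment.bytes", "1073741824"));    // roll a new segment file at 1 GB
            admin.createTopics(Collections.singleton(uploads)).all().get();
        }
    }
}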

Related

Integrating a large XML file size with Kafka

The XML file (~100 MB) is a batch export of an external system's entire database (the batch export runs every 6 hours).
I can not change the integration to use Debezium connector for example.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or is there an architecture for sending the XML file as individual messages with an XSD schema?
Isn't receiving its content as one very large message a bad thing for the architecture?
The default max.message.bytes configuration on the broker and topic level in Kafka is around 1MB, and it is not advisable to increase that significantly, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a typed format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data. A sketch of this splitting approach follows below.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can consume the paths from the Kafka topic and do its processing from there.
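As a rough sketch of the first option, assuming the export looks like <rows><row>...</row></rows>, one could stream-parse the file with StAX and publish each row as its own message. The topic name, element names and file path here are all assumptions, and in practice you would convert each chunk to Avro rather than forward raw XML:

import java.io.FileInputStream;
import java.io.StringWriter;
import java.util.Properties;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class XmlChunkProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Identity transformer used to copy one <row> subtree into a String.
        Transformer copier = TransformerFactory.newInstance().newTransformer();
        copier.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             FileInputStream in = new FileInputStream("export.xml")) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    StringWriter chunk = new StringWriter();
                    // Copies the whole <row> element and advances the reader past it.
                    copier.transform(new StAXSource(reader), new StreamResult(chunk));
                    producer.send(new ProducerRecord<>("db-export-rows", chunk.toString()));
                } else {
                    reader.next();
                }
            }
        }
    }
}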
Writing a Kafka producer that unmarshalls the XML file into Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.

What defines the scope of a kafka topic

I'm looking to try out using Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are asynchronous at high rates and some are based upon request from user/events.
Now looking at Kafka, it breaks things out into topics and partitions, etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, leading to hundreds of topics, or do I cluster them into related message types? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I'm also in a dilemma where there will be upwards of 10 copies of a single application (a display), all of which get a very large amount of data (in the form of a lightweight video stream of sorts) and would be sending out user commands from each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10 copies, and that these particular applications may not want to receive the video stream at all times.
A third and final question: I read a bit about replay-ability of messages. Is this only within a single topic, or can the replay-ability go over a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your datasets. Note, however, that the default maximum message size is just 1MB, so streaming video/images/media is arguably the wrong use case for Kafka alone; a protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally rather than in response to load. Consumers poll data at a rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, they can be independently seeked, as in the sketch below.
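A minimal sketch of that "independently seek" point, assuming a topic named telemetry and one consumer group per display instance (both names are illustrative): each consumer rewinds its assigned partitions and replays whatever the retention policy still holds, at its own pace.

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "display-7");   // one group per display instance
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("telemetry"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rewind the newly assigned partitions to replay retained data.
                    consumer.seekToBeginning(partitions);
                }
            });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}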

Remove and add compression in kafka topic. What will happen to the existing data in the topic?

Suppose a topic is set up without compression and some data already exists in it.
If the topic is now switched to compression, will the existing data be compressed?
In the other direction: if a topic is set up with compression and some data already exists in it, will that existing data be decompressed once compression is removed?
This question comes from worrying about the data consumers. If a topic holds some data that is compressed and some that is not, that seems very messy. Or do the brokers know which events in the topic are compressed and which are not, and deliver the right data either way?
If the existing data does not match the compression setup, I will remove it by configuring a very low retention time. Once the topic is completely empty, I will then ingest data so that every event is consistently compressed or uncompressed.
Both compressed and uncompressed records can coexist in a single topic. The corresponding compression type is stored in each record (each record batch, actually), so the consumer knows how to handle the message.
On the broker side, it normally does not care whether a record batch is compressed. Assuming no down-conversion of old-format records occurs, the broker always stores the batch as it is.
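To make that concrete, a small sketch of where the compression choice usually lives: it is a producer-side setting (it can also be set per topic or broker), and each batch carries its own codec, which is why mixed batches in one topic are not a problem for consumers. The topic name and codec below are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "lz4");   // batches produced from now on are lz4-compressed

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "compressed payload"));
        }
        // Older, uncompressed batches already in the topic are left untouched;
        // a plain consumer reads both kinds transparently.
    }
}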

Can a Kafka consumer read records from a Broker's page cache?

Kafka's documentation clearly states that messages/records are immediately written to the file system as they are received by the Broker. With the default configuration, this means that the Broker flushes records to the page cache immediately and later the Kernel can flush it to disk.
My question is: can a consumer read a record that is in the page cache but that has not yet been flushed to disk by the kernel?
If the answer is yes, how will the consumer keep track of the offset it reads from?
If the answer is no, then it would mean that the record has to be read back from disk to the page cache before it is sent out to NIC via zero-copy. Correct?
Thanks,
Whenever there is a read or write operation on a file, the data goes through the page cache first. On a read, if the data is already present in the page cache, no actual disk read is issued and the data is served from the page cache. So it is not that the Kafka consumer reads from the broker's page cache directly; this is done by the file system and is hidden behind the ordinary read call. In most cases, records in Kafka are read sequentially, which allows the page cache to be used very effectively.
The zero-copy optimization is used on every read from a Kafka client, copying data directly from the page cache to the NIC buffer.

Need help to understand Kafka storage

I am new to Kafka. From the link: http://notes.stephenholiday.com/Kafka.pdf
It is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is a segment file here?
When I create a topic with partitions, each partition gets an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does it say "For better performance, we flush the segment files to disk"? And if it is flushing to disk, where on disk is it flushing?
It also seems that until a message is flushed to disk, it is not available to the consumer. That adds some latency to reading the message, but why?
I would also like to understand: when a consumer wants to read some data, does it read from disk (the partition's segment file), or is there some cache mechanism? If so, how and when is data persisted into that cache?
I am not sure all of these questions are valid, but it will help me understand if anybody can clear them up.
You can think of the segment file (before it is flushed) as living in the OS page cache.
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal sizes. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time. Messages are exposed to consumers only after they are flushed.
Please also refer to the document below.
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Don't get confused when you see the word "filesystem" there: writes to the filesystem land in the OS page cache first. Also, the link you mentioned is really very much outdated.
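As a hedged illustration of the flush policy that quote refers to, the per-topic knobs are flush.messages and flush.ms; here is a sketch of setting them with the AdminClient. The topic name and values are assumptions, and by default Kafka leaves flushing to the OS:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class FlushPolicyExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            Collection<AlterConfigOp> ops = List.of(
                    // force a flush at least every 10,000 messages ...
                    new AlterConfigOp(new ConfigEntry("flush.messages", "10000"), AlterConfigOp.OpType.SET),
                    // ... or at least once per second, whichever comes first
                    new AlterConfigOp(new ConfigEntry("flush.ms", "1000"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}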