Removing or adding compression on a Kafka topic: what will happen to the existing data in the topic?

Suppose a topic is configured without compression and already contains some data. If compression is now enabled on the topic, will the existing data be compressed?
The other direction: if a topic is configured with compression and already contains data, will the existing data be decompressed when compression is removed?
This question matters for the data consumer. If a topic ends up containing a mix of compressed and uncompressed records, is that a mess, or do the brokers know which events are compressed and which are not within the same topic, and deliver the right data?
If the existing data does not match the new compression setting, I would purge it by configuring a very low retention time, wait until the topic is completely empty, and only then ingest data so that every event is consistently compressed or uncompressed.

Both compressed and uncompressed records can coexist in a single topic. The compression type is stored in each record (each record batch, actually), so the consumer knows how to handle each message.
On the broker side, it normally does not care whether a record batch is compressed. Assuming no down-conversion of old-format records is required, the broker always stores the batch as it is.
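A minimal sketch (broker address and topic name are assumptions) of how compression is typically chosen: the producer sets compression.type, the batch is stored and delivered with that codec recorded in it, and records written earlier are simply left as they are.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Only batches sent from now on are compressed with lz4; batches already
        // in the topic keep whatever codec (or none) they were written with.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}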

Related

How is it possible to aggregate messages from a Kafka topic based on duration (e.g. 1h)?

We are streaming messages to a Kafka topic at a rate of a few hundred per second. Each message has a timestamp and a payload. Ultimately, we would like to aggregate one hour's worth of data, based on the timestamps of the messages, into Parquet files and upload them to cheap remote storage (an object store).
A naive approach would be to have the consumer simply read the messages from the topic, do the aggregation/roll-up in memory, and, once there is one hour's worth of data, generate and upload the Parquet file.
However, if the consumer crashes or needs to be restarted, we would lose all data since the beginning of the current hour, whether we use enable.auto.commit=true or enable.auto.commit=false with a manual commit after each batch of messages.
A simple solution for the consumer could be to keep reading until one hour's worth of data is in memory, generate (and upload) the Parquet file, and only then call commitAsync() or commitSync() (using enable.auto.commit=false and an external store to keep track of the offsets).
But this would leave millions of messages uncommitted for at least one hour. I am wondering whether Kafka even allows delaying the commit for so many messages / such a long time (I seem to remember having read about this somewhere, but for the life of me I cannot find it again).
Actual questions:
a) Is there a limit to the number of messages (or a duration) that can remain uncommitted before Kafka considers the consumer broken or stops giving it additional messages? That seems counter-intuitive, though, since otherwise what would be the purpose of enable.auto.commit=false and managing the offsets in the consumer (e.g. with the help of an external database)?
b) In terms of robustness/redundancy and scalability, it would be great to have more than one consumer in the consumer group. If I understand correctly, it is never possible to have more than one consumer per partition. If we then run more than one consumer and configure multiple partitions per topic, we cannot do this kind of aggregation/roll-up, since the messages will now be distributed across consumers. The only way to work around this would be an additional (external) temporary store for all the messages belonging to one such one-hour group, correct?
You can configure Kafka Streams with a TimestampExtractor to aggregate data into different types of time windows.
As for "into parquet files and upload them to a cheap remote storage (object-store)": the Kafka Connect S3 sink, or the Pinterest Secor tool, already does this.
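A minimal sketch of that Streams approach (topic names, the payload layout, and the String serdes are assumptions; count() stands in for whatever roll-up you actually need): a custom TimestampExtractor pulls the event time out of the message, and records are grouped into tumbling one-hour windows. Writing the windowed results out as Parquet is left to a sink such as the Kafka Connect S3 sink.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class HourlyRollup {

    // Assumption: the message value starts with a millisecond epoch timestamp, e.g. "1700000000000,payload".
    public static class PayloadTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            return Long.parseLong(record.value().toString().split(",", 2)[0]);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-rollup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, PayloadTimestampExtractor.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Tumbling one-hour windows based on the extracted event time.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               .count()
               .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().startTime())
               .mapValues(count -> count.toString())
               .to("hourly-counts", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}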

Upload files to Kafka and further handling?

Is it a good approach to send the binary data of uploaded files to Kafka and then distribute the handling of those uploads across several services that are connected to the Kafka topic?
I see some advantages:
Filtering of the uploaded data
Replication
Several services can handle the uploads, not just one
What do you think about that?
Is it a good approach to send the binary data of uploaded files to Kafka and then distribute the handling of those uploads across several services that are connected to the Kafka topic?
Typically, files are uploaded to a file system and their URIs are stored in the Kafka message. This keeps the Kafka message size relatively small, thereby increasing the throughput of its clients.
If, instead, we put large objects in the Kafka message, the consumer would have to read the entire file, so your poll() would take longer than usual.
On the other hand, if we put just a URI of the file instead of the file itself, then message consumption will be relatively fast, and you can delegate the processing of the files to another thread (possibly from a thread pool), thereby increasing your application's throughput.
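As a minimal sketch of that pattern (the topic name, shared storage path, and broker address are assumptions): the file itself is copied to storage that every consuming service can reach, and only its URI travels through Kafka, keeping the message small.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FileUploadProducer {
    public static void main(String[] args) throws Exception {
        Path uploaded = Path.of(args[0]);
        // Assumption: /mnt/shared/uploads is storage reachable by all consuming services.
        Path stored = Path.of("/mnt/shared/uploads", uploaded.getFileName().toString());
        Files.copy(uploaded, stored, StandardCopyOption.REPLACE_EXISTING);

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The message carries only the URI; consumers fetch the file when they need it.
            producer.send(new ProducerRecord<>("uploads", uploaded.getFileName().toString(),
                                               stored.toUri().toString()));
        }
    }
}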
Replicas
Just as there are replicas in Kafka, there can also be replicas for the file system. Even Kafka stores its messages in a file system (as segment files), so the replication may as well be done by the file system itself.
The best way is to put a URI that points to the file in the Kafka message, and then put a handler for that URI which will be responsible for giving you the file and possibly a replica in case the original file is deleted.
The handler may be loosely coupled from the rest of your system, built specifically for managing the files, maintaining replicas, etc.
Filtering uploaded data
The filtering of the uploaded data can only be done when you actually read the contents of the file. You can do that even by putting the URI of your file in the message and reading the file from there. For example, if you are using Kafka Streams, you can put that filtering logic in transform() or mapValues(), etc.:
// Assumption: builder is a StreamsBuilder, and getFileURI(), validate() and read() are your own helpers.
builder.stream(topic)
       .mapValues(v -> v.getFileURI())
       .filter((k, fileURI) -> validate(read(fileURI)))
       .to(outputTopic);
Hitting segment.bytes
Another disadvantage of storing files in your messages is that you might hit the segment.bytes limit if the files are large. You would need to keep changing segment.bytes to meet the new size requirements of the files.
Another point: if segment.bytes is set to 1GB, your first message (file) is 750MB and your next message is 251MB, then the 251MB message cannot fit in the first segment, so the first segment will hold only one message even though it has not reached the limit. This means relatively few messages are stored per segment.
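If you do end up having to adjust it, segment.bytes is a topic-level configuration and can be changed without restarting the brokers. A minimal sketch with the Java AdminClient (broker address, topic name, and the new value are assumptions):

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseSegmentBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "uploads"); // assumed topic
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("segment.bytes", "1610612736"), // 1.5 GB, illustrative only
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}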

Integrating a large XML file with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I cannot change the integration to use, for example, a Debezium connector.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or, an architecture that sends the contents of the XML file as individual messages with an XSD schema?
Isn't receiving its content as one very large message a bad thing for the architecture?
The default max.message.bytes configuration at the broker and topic level in Kafka is roughly 1 MB, and it is not advisable to increase that configuration significantly, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database, as in the sketch below. In addition, use a typesafe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can consume the paths from the Kafka topic and process the referenced files.
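A rough sketch of the first option (the <row> element name, topic name, broker address, and the plain String payload standing in for Avro are all assumptions): a streaming XML parser such as StAX emits one small Kafka record per database row without ever loading the whole 100 MB document into memory.

import java.io.FileInputStream;
import java.util.Properties;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class XmlExportSplitter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(args[0]));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String currentId = null;
            StringBuilder row = new StringBuilder();
            boolean inRow = false;

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "row".equals(reader.getLocalName())) {
                    // Assumption: each <row id="..."> element represents one database row.
                    inRow = true;
                    currentId = reader.getAttributeValue(null, "id");
                    row.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS && inRow) {
                    row.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && "row".equals(reader.getLocalName())) {
                    inRow = false;
                    // One small record per row; an Avro serializer could replace the String payload.
                    producer.send(new ProducerRecord<>("db-export", currentId, row.toString()));
                }
            }
        }
    }
}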
Writing a Kafka producer that unmarshals the XML file into Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.

How does Kafka internally order the messages within a partition? Does it store them as received from the producer?

I wanted to understand the order in which Kafka internally places the messages it receives from a number of different producers into a partition.
A partition is a shard of the topic, and each partition is written to its own set of files in a directory named after the topic. Writing to and reading from a file is sequential; that is how a partition maintains its order.
Does it store them as received from the producer?
Yes. As soon as a message is received, it is written into a buffer, quite similar to the way some relational databases write to a write-ahead log. Kafka uses the operating system's page cache as that buffer to achieve high read and write performance. Periodically, depending on the configuration, Kafka flushes the data to the file on disk.
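As a small illustration of that append order (broker address, topic name, and group id are assumptions), a consumer assigned to a single partition reads records back with strictly increasing offsets, i.e. in exactly the order the broker appended them:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PartitionOrderDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(new TopicPartition("events", 0))); // assumed topic/partition
            long previousOffset = -1;
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Offsets within one partition only ever grow: append order is read order.
                    assert record.offset() > previousOffset;
                    previousOffset = record.offset();
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}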

Need help understanding Kafka storage

I am new to Kafka. In the paper at this link: http://notes.stephenholiday.com/Kafka.pdf it is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is a segment file here?
When I create a topic with partitions, each partition has an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does it say "For better performance, we flush the segment files to disk"? And if it is flushing to disk, where on the disk is it flushing to?
It seems that until the data is flushed to disk, it is not available to the consumer. Doesn't that add latency to reading the messages, and why?
I would also like to understand: when a consumer wants to read some data, does it read from disk (partition, segment file), or is there some caching mechanism? If so, how and when is the data persisted into the cache?
I am not sure all of these questions are valid, but it would help me understand if anybody could clear them up.
You can think of the not-yet-flushed segment file contents as living in the OS page cache.
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal sizes. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time. Messages are exposed to consumers only after the flush.
Please also refer to the documentation below:
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Don't get confused when you see the word filesystem there: writes to the filesystem first land in the OS page cache, and the link you mentioned (the paper) is really very much outdated.
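For reference, the flush policy that documentation section describes is driven by broker settings like the following (the values here are purely illustrative; by default Kafka leaves flushing entirely to the operating system):

# server.properties (illustrative values only)
# Force an fsync after this many messages have been appended to a log...
log.flush.interval.messages=10000
# ...or after this many milliseconds have passed, whichever comes first.
log.flush.interval.ms=1000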