Integrating a large XML file with Kafka - apache-kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I cannot change the integration to use, for example, a Debezium connector.
I only have access to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or, what architecture would let me send the contents of the XML file as individual messages, described by an XSD schema?
Is receiving its content as one single, large message a bad thing for the architecture?

The default message size limit in Kafka (message.max.bytes on the broker, max.message.bytes at the topic level) is about 1 MB, and it is not advisable to increase it significantly, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a typesafe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the path in a Kafka message. That way, a consumer can read the paths from the Kafka topic and process the files from there (see the sketch below).
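As a rough illustration of the second option (not from the original answer), a minimal producer that publishes only the location of the export; the broker address, topic name, and HDFS path are placeholders:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ExportPathProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                  // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The Kafka message carries only the location of the export, never its 100 MB content.
            String hdfsPath = "hdfs://namenode:8020/exports/db-export-latest.xml"; // hypothetical path
            producer.send(new ProducerRecord<>("db-export-paths", hdfsPath));      // hypothetical topic
        }
    }
}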

Writing a Kafka producer that unmarshals the XML file into Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.
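A condensed sketch of that kind of producer, assuming JAXB for the unmarshalling and Confluent's KafkaAvroSerializer with a Schema Registry; the DbExport/DbRow classes, the row schema, and all addresses are placeholders, not the answerer's actual code:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import javax.xml.bind.JAXBContext;
import java.io.File;
import java.util.Properties;

public class XmlToAvroProducer {
    // Hypothetical Avro schema for one database row; in practice it would mirror the XSD.
    private static final Schema ROW_SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"DbRow\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"payload\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        // Unmarshal the export into JAXB-generated objects (DbExport and DbRow are placeholders).
        JAXBContext ctx = JAXBContext.newInstance(DbExport.class);
        DbExport export = (DbExport) ctx.createUnmarshaller().unmarshal(new File("export.xml"));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                        // assumption
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");               // assumption

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // One Kafka message per database row instead of one 100 MB message.
            for (DbRow row : export.getRows()) {
                GenericRecord record = new GenericData.Record(ROW_SCHEMA);
                record.put("id", row.getId());
                record.put("payload", row.getPayload());
                producer.send(new ProducerRecord<>("db-rows", row.getId(), record));
            }
        }
    }
}

For a file this size it may also be worth parsing with a streaming API (StAX) instead of loading the whole document into memory at once.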

Related

Upload files to Kafka and further handling?

Is it a good way to send the binary data of uploaded files to Kafka and then distribute the handling of the uploads across some services that are connected to a Kafka topic?
I see some advantages:
Filtering uploaded data
Replicas
Several services can handle uploads, not only one
What do you think about that?
Is it a good way to send the binary data of uploaded files to Kafka and then distribute the handling of the uploads across some services that are connected to a Kafka topic?
Typically, files are uploaded to a file system and their URIs are stored in the Kafka message. This keeps the Kafka message size relatively small, thereby increasing the throughput of its clients.
If, instead, we put large objects in the Kafka message, the consumer would have to read the entire file, so your poll() will take longer than usual.
On the other hand, if we just put the URI of the file instead of the file itself, message consumption will be relatively fast, and you can delegate the processing of the files to another thread (possibly from a thread pool), thereby increasing your application's throughput (see the sketch below).
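A minimal sketch of that pattern (not from the original answer), assuming the messages are plain String URIs and a hypothetical processFile() helper that fetches and handles the file:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FileUriConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", "file-handlers");             // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(4);   // heavy file work happens off the poll loop
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("uploaded-file-uris"));    // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String fileUri = record.value();
                    pool.submit(() -> processFile(fileUri));      // fetch + process the file elsewhere
                }
            }
        }
    }

    private static void processFile(String fileUri) {
        // placeholder: download the file from the URI and apply filtering/validation
    }
}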
Replicas
Just as there are replicas in Kafka, there can also be replicas for the file system. Even Kafka stores its messages in the file system (as segment files), so the replication may as well be done by the file system itself.
The best way is to put a URI that points to the file in the Kafka message and then put a handler for that URI which will be responsible for giving you the file, and possibly for giving you a replica in case the original file is deleted.
The handler may be loosely coupled from the rest of your system, built specifically for managing the files, maintaining replicas, etc.
Filtering uploaded data
The filtering of uploaded data can be done only when you actually read the contents of the file. You can do that even with just the URI of the file in the message, by reading the file from there. For example, if you are using Kafka Streams, you can put that filtering logic in transform() or mapValues(), etc.
StreamsBuilder builder = new StreamsBuilder();
builder.stream(topic)
       .mapValues(v -> v.getFileURI())
       .filter((k, fileURI) -> validate(read(fileURI)))
       .to(..)
Hitting segment.bytes
Another disadvantage of storing files in your messages is that you might hit the segment.bytes limit if the files are large. You would need to keep changing segment.bytes every time to meet the new size requirements of the files.
Another point: if segment.bytes is set to 1 GB, your first message (file) is 750 MB, and your next message is 251 MB, then the 251 MB message can't fit in the first segment, so the first segment will hold only one message even though it hasn't reached the limit. This means that relatively fewer messages will be stored per segment.

Removing and adding compression on a Kafka topic: what will happen to the existing data in the topic?

Suppose a topic is set without compression and some data already exists in the topic. If the topic is then switched to compression, will the existing data be compressed?
In the other direction, if a topic is set with compression and some data already exists in the topic, will the existing data be decompressed?
This question comes from a concern for the data consumer: when some of the data in a topic is compressed and some is not, it seems very messy. Or do the brokers know which events in the topic are compressed and which are not, and deliver the right data either way?
If the existing data does not correspond to the compression setup, I will remove the existing data by configuring a very low retention time. Once the topic is completely empty, I will then ingest data so that every event is consistently either compressed or not compressed.
Both compressed and uncompressed records can coexist in a single topic. The corresponding compression type is stored in each record (record batch, actually), so the consumer knows how to handle the message.
On the broker side, it normally does not care whether a record batch is compressed. Assuming no down-conversion of old-formatted records occurs, the broker always saves the batch as it is.
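For context, compression is controlled in two places: compression.type on the producer (how new batches are compressed) and compression.type on the topic, which defaults to "producer" and, if set to a specific codec, is applied by the broker only to batches written from then on. A minimal sketch of changing the topic setting with the AdminClient, with the broker address and topic name as placeholders:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicCompressionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker

        try (Admin admin = Admin.create(props)) {
            // Switch the topic-level compression.type; this only affects batches written
            // from now on -- records already in the log stay exactly as they were stored.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // hypothetical topic
            AlterConfigOp setCompression =
                new AlterConfigOp(new ConfigEntry("compression.type", "lz4"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompression))).all().get();
        }
    }
}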

How does Kafka internally order the messages within a partition? Does it store them as it received them from the producer?

I wanted to understand the order that Kafka follows internally to place messages in a partition when it receives them from a bunch of varying producers.
A partition is a shard of the topic, and each partition is written to its own set of files on disk, under a directory named after the topic and partition. Writing to and reading from these files is sequential; that is how a partition maintains its order.
Does it store them as it received them from the producer?
Yes. As soon as a message is received, it is written into a buffer, quite similar to how some relational databases write to a write-ahead log. Kafka uses the operating system's page cache as that buffer to obtain high read and write performance. Periodically, depending on the configuration, Kafka flushes the data to the file.

kafka + reading from topic log file

I have a topic log file and the corresponding .index file. I would like to read the messages in a streaming fashion and process them. How and where should I start?
Should I load these files into a Kafka producer and read from a topic?
Can I directly write a consumer to read the data from the file and process it?
I have gone through the Kafka website, and everywhere the examples use the pre-built Kafka producers and consumers, so I couldn't get enough guidance.
I want to read in a streaming fashion in Java.
The text looks encrypted, so I am not posting the input files.
Any help is really appreciated.
You can dump the log segments and use the deep-iteration option to deserialize the data into something more readable.
If you want to "stream it", then use a standard Unix pipe to output to some other tool and do the aggregate operations there.
Better, though, is to use Kafka Streams to actually read from the topic across all partitions, rather than from the single partition held on that single broker.
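A minimal sketch of the Kafka Streams route (not from the original answer), assuming String keys and values; the application id, broker address, and topic name are placeholders, and the foreach is just stand-in processing:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class TopicLogReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-log-reader");        // assumption
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Reads every partition of the topic, not just the segment files sitting on one broker.
        builder.<String, String>stream("my-topic")                                  // hypothetical topic
               .foreach((key, value) -> System.out.println(key + " -> " + value));  // placeholder processing

        new KafkaStreams(builder.build(), props).start();
    }
}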

Is there a way to limit the size of avro files when writing from kafka via hdfs connector?

Currently we use the Flink FsStateBackend for checkpointing and set fileStateSizeThreshold to limit the size of data written to Avro/JSON files on HDFS to 128 MB, also closing files after a certain delay in checkpoint actions.
Since we are not using advanced Flink features in a new project, we want to use Kafka Streaming with the Kafka Connect HDFS Connector to write messages directly to HDFS (without spinning up Flink).
However, I cannot find whether there are options to limit the file size of the HDFS files from the Kafka connector, except maybe flush.size, which seems to limit the number of records.
If there are no such settings on the connector, how do people manage the file sizes of streaming data on HDFS in another way?
There is no file size option, only time-based rotation and flush size. You can set a large flush size that you never expect to reach, and then time-based rotation will do a best-effort partitioning of large files into date partitions (we've been able to get 4 GB output files per topic partition within an hourly directory from Connect); see the example configuration below.
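As a rough illustration (not from the original answer), a Confluent HDFS sink configuration along those lines might look like the following; the topic, URL, and numbers are placeholders:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-topic
hdfs.url=hdfs://namenode:8020
# Record-count threshold set so high it is effectively never reached...
flush.size=1000000
# ...so files are instead closed by time-based rotation (here roughly once per hour).
rotate.interval.ms=3600000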
Personally, I suggest additional tools such as Hive, Pig, DistCp, or Flink/Spark, depending on what's available (and not all at once), running in an Oozie job to "compact" these streaming files into larger files.
See my comment here
Before Connect, there was Camus, which is now Apache Gobblin. Within that project, it offers the ideas of compaction and late-event processing, plus Hive table creation.
The general answer here is that you have a designated "hot landing zone" for streaming data, and then you periodically archive it or "freeze" it (which brings up technology names like Amazon Glacier/Snowball and Snowplow).