Purpose of +tmp in Kafka hdfs connect - apache-kafka

I am planning to use the Kafka HDFS connector to move messages from Kafka to HDFS. While looking into it, I see there are parameters like flush.size and rotate.interval.ms with which you can batch messages in heap and write the batch at once.
Is the batch written to the WAL first and then to the configured location? I also see that it creates a +tmp directory. What is the purpose of the +tmp directory? Couldn't we directly write the whole batch as a file under the specified location with offset ranges?

When the Kafka consumer writes to HDFS, it writes to the WAL first. The +tmp directory holds all the temporary files, which are combined into larger HDFS files and then moved to the actual configured location.
In fact, you can refer to the actual implementation to understand this in depth:
https://github.com/confluentinc/kafka-connect-hdfs/blob/121a69133bc2c136b6aa9d08b23a0799a4cd8799/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L611

Related

Upload files to Kafka and further handling?

Is it a good idea to send the binary data of uploaded files to Kafka and then distribute the upload handling across several services connected to the Kafka topic?
I see some advantages:
Filtering the uploaded data
Replication
Several services can handle the upload, not just one
What do you think about that?
Is it a good idea to send the binary data of uploaded files to Kafka and then distribute the upload handling across several services connected to the Kafka topic?
Typically, files are uploaded to a file system and only their URIs are stored in the Kafka message. This keeps the Kafka messages relatively small, thereby increasing the throughput of the clients.
If we put large objects in the Kafka message, the consumer has to read the entire file, so your poll() will take longer than usual.
On the other hand, if we just put the URI of the file instead of the file itself, then the message consumption will be relatively faster and you can delegate the processing of the files to, say, another thread (possibly from a thread pool), thereby increasing your application throughput.
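For illustration, a minimal producer sketch of that pattern, assuming the file has already been uploaded to shared storage; the topic name, key, and path are made up:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileUriProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The message carries only the location of the file, not its contents,
            // so the consumer's poll() stays fast and file handling can be delegated.
            String fileUri = "hdfs://namenode/uploads/2024-01-01/invoice-42.pdf";
            producer.send(new ProducerRecord<>("uploaded-files", "invoice-42", fileUri));
        }
    }
}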
Replicas
Just as there are replicas in Kafka, there can also be replicas in the filesystem. Even Kafka stores its messages in the file system (as segment files), so replication may as well be handled by the filesystem itself.
The best way is to put a URI that points to the file in the Kafka message and then put a handler behind that URI which is responsible for giving you the file, and possibly for giving you a replica in case the original file is deleted.
The handler may be loosely coupled from the rest of your system, built specifically for managing the files, maintaining replicas, etc.
Filtering the uploaded data
The filtering of uploaded data can only be done when you actually read the contents of the file. You can do that even when only the URI of the file is in the message, by reading from that URI. For example, if you are using Kafka Streams, you can put that filtering logic in transform() or mapValues(), etc.
// builder is a StreamsBuilder; read() and validate() are your own helpers
builder.stream(topic)
       .mapValues(v -> v.getFileURI())
       .filter((k, fileURI) -> validate(read(fileURI)))
       .to(outputTopic);
Hitting segment.bytes
Another disadvantage of storing files in your messages is that you might hit the segment.bytes limit if the files are large. You would need to keep changing segment.bytes to meet the new size requirements of the files.
Another point: if segment.bytes is set to 1 GB, your first message (file) is 750 MB and your next message is 251 MB, then the 251 MB message cannot fit into the first segment, so the first segment will contain only one message even though it has not reached the limit. This means that relatively few messages will be stored per segment.

Integrating a large XML file with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I cannot change the integration to use, for example, a Debezium connector.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or is there an architecture for sending the contents of the XML file as individual messages with an XSD schema?
Isn't receiving its content as a single large message a bad thing for the architecture?
The default max.message.bytes configuration at the broker and topic level in Kafka is roughly 1 MB, and it is not advisable to increase it significantly, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a typesafe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can consume the paths from the Kafka topic and do some processing on them.
Writing a Kafka producer that unmarshals the XML files to Java objects and sends the serialized objects in Avro format to the cluster was the solution for me, roughly along the lines of the sketch below.
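A minimal sketch of that approach, assuming a JAXB-annotated DatabaseExport class, an Avro-generated RowRecord class, and a local Schema Registry; all class names, topic names, and URLs here are made up:
import java.io.File;
import java.util.Properties;
import javax.xml.bind.JAXBContext;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class XmlExportProducer {
    public static void main(String[] args) throws Exception {
        // Unmarshal the whole 6-hourly export into Java objects (hypothetical DatabaseExport type).
        DatabaseExport export = (DatabaseExport) JAXBContext
                .newInstance(DatabaseExport.class)
                .createUnmarshaller()
                .unmarshal(new File("/data/export.xml"));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the row schema with the Schema Registry.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, RowRecord> producer = new KafkaProducer<>(props)) {
            // One message per database row keeps each message far below max.message.bytes.
            for (RowRecord row : export.getRows()) {
                producer.send(new ProducerRecord<>("db-export-rows", row.getId().toString(), row));
            }
        }
    }
}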

How to cache a single csv file into a KTable in Kafka?

We have a situation where we have to cache and persist a CSV file in a Kafka KTable. Is this possible in Kafka? From what I have researched, we can read a CSV file into a KTable, but it won't be persisted (I might be wrong here). I have not been able to find anything related to this in the docs.
To be a little specific:
We need to take a CSV file.
Send it to a KTable and cache/persist it as it is.
One more thing: if it's possible, will it read the file line by line, or can the whole file be sent with a single key?
Thank you!
It's possible, yes, although I'm not sure I understand why you wouldn't just load the CSV itself within the application as a list of rows.
will it read the file line by line, or can the whole file be sent with a single key?
That depends on how you read the file. And you'd first have to produce the data to Kafka: a KTable must consume from a topic, not from files.
Note: Kafka has a default maximum message size of 1 MB and is not meant for file transfer.
it won't be persisted
I'm not sure where you read that. You can persist the data in a compacted topic, although you'd then want some key for each row of the file, as in the sketch below.
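A minimal sketch of that idea, assuming the topic csv-rows has been created with cleanup.policy=compact and that the first CSV column is a suitable key; all names and paths here are made up:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class CsvToKTable {
    public static void main(String[] args) throws Exception {
        // Step 1: produce one record per CSV line, keyed by the first column,
        // into a compacted topic so the latest row per key is retained.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String line : Files.readAllLines(Paths.get("/data/input.csv"))) {
                String key = line.split(",", 2)[0];
                producer.send(new ProducerRecord<>("csv-rows", key, line));
            }
        }

        // Step 2: in a Kafka Streams application, materialize the topic as a KTable.
        // (Building and starting the KafkaStreams instance is omitted here.)
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> rows = builder.table("csv-rows");
    }
}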

Is there a way to limit the size of avro files when writing from kafka via hdfs connector?

Currently we use the Flink FsStateBackend for checkpointing and set fileStateSizeThreshold to limit the size of data written to Avro/JSON files on HDFS to 128 MB. We also close files after a certain delay in checkpoint actions.
Since we are not using advanced Flink features in a new project, we want to use Kafka streaming with the Kafka Connect HDFS connector to write messages directly to HDFS (without spinning up Flink).
However, I cannot find any options to limit the file size of the HDFS files from the Kafka connector, except maybe flush.size, which seems to limit the number of records.
If there are no such settings on the connector, how do people manage the file sizes of streaming data on HDFS in another way?
There is no file size option, only time-based rotation and flush size. You can set a large flush size that you never expect to reach, and then time-based rotation will do a best-effort partitioning of large files into date partitions (we have been able to get 4 GB output files per topic partition within an hour directory from Connect); see the configuration sketch below.
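For illustration, a rough sketch of such a connector configuration, assuming the Confluent HDFS sink connector; the connector name, topic, URLs, and values are made up and the exact property set depends on your connector version:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my-topic
hdfs.url=hdfs://namenode:8020
format.class=io.confluent.connect.hdfs.avro.AvroFormat
# A flush size you never expect to reach, so time drives the rotation instead
flush.size=1000000
# Close and commit open files every 10 minutes (best-effort size control)
rotate.interval.ms=600000
# Hourly directories, so each file lands in a single time partition
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
partition.duration.ms=3600000
locale=en-US
timezone=UTC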
Personally, I suggest additional tools such as Hive, Pig, DistCp, or Flink/Spark, depending on what's available (and not all at once), running in an Oozie job to "compact" these streaming files into larger files.
See my comment here
Before Connect, there was Camus, which is now Apache Gobblin. Within that project, it offers the ideas of compaction and late-event processing, plus Hive table creation.
The general answer here is that you have a designated "hot landing zone" for streaming data, which you then periodically archive or "freeze" (which brings up technology names like Amazon Glacier/Snowball and Snowplow).

Need help to understand Kafka storage

I am new to Kafka. From this link: http://notes.stephenholiday.com/Kafka.pdf
the following is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is a segment file here?
When I create a topic with partitions, each partition has an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does it say "For better performance, we flush the segment files to disk"? If it is flushing to disk, where on the disk is it flushing?
It seems that until it is flushed to disk, it is not available to the consumer. That adds some latency to reading the message, but why?
I also want help understanding this: when a consumer wants to read some data, does it read from disk (partition, segment file), or is there some cache mechanism? If so, how and when is the data persisted into the cache?
I am not sure whether all of these questions are valid, but it would help me understand if anybody could clear them up.
You can think of this segment file as sitting in the OS page cache until it is flushed.
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal sizes. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. The segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time. Messages are exposed to consumers after they get flushed.
Also, please refer to the documentation below:
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Don't get confused when you see the word filesystem there: writes to the filesystem first land in the OS page cache. Also, the link you mentioned is really quite outdated.
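For reference, the flush policy that documentation refers to is set through broker (or per-topic) properties such as the following; by default both are effectively disabled, leaving fsync timing to the operating system, and the values below are only an illustration:
# server.properties (sketch): force an fsync every 10,000 messages or every second
log.flush.interval.messages=10000
log.flush.interval.ms=1000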