How to cache a single csv file into a KTable in Kafka? - apache-kafka

We have a situation where we have to cache and persist a CSV file in a Kafka KTable. Is it possible in Kafka? According to what I have researched, we can read a CSV file into a KTable, but it won't be persisted (I might be wrong here). I have not been able to find anything related to it in the docs.
To be a little specific:
We need to take a CSV file.
Send it to a KTable and cache/persist it as it is.
One more thing: if it's possible, will it read the file line by line, or can the whole file be sent with a single key?
Thank you!

It's possible, yes, although I'm not sure I understand why you wouldn't just load the CSV itself within the application as a list of rows.
will it read the file line by line, or can the whole file be sent with a single key?
Depends on how you read the file. Either way, you'd first have to produce the data to Kafka; a KTable must consume from a topic, not from files.
Note: Kafka has a default maximum message size of 1 MB, and it is not meant for file transfer.
it won't be persisted
I'm not sure where you read that. You can persist the data in a compacted topic, although you'd then want some key for each row of the file.
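For what it's worth, here is a minimal sketch of that approach in Java, assuming a compacted topic named "csv-table", a broker on localhost:9092, and that the first CSV column is a unique id usable as the row key (all of these are placeholder assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class CsvToKTable {

    public static void main(String[] args) throws Exception {
        // 1) Produce the file line by line. Each row gets a key so that a
        //    compacted topic (cleanup.policy=compact) keeps the latest version
        //    of that row indefinitely.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            for (String line : Files.readAllLines(Paths.get("input.csv"))) {
                String key = line.split(",")[0]; // assumption: first column is a unique id
                producer.send(new ProducerRecord<>("csv-table", key, line));
            }
        }

        // 2) Materialize the topic as a KTable with Kafka Streams.
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "csv-ktable-app");
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        streamsProps.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        streamsProps.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> table = builder.table("csv-table");
        table.toStream().foreach((key, row) -> System.out.println(key + " -> " + row));

        new KafkaStreams(builder.build(), streamsProps).start();
    }
}
```

Sending the whole file under a single key is also possible as long as it stays under the broker's message size limit, but per-row keys are what make compaction (and therefore long-term persistence of the latest state) useful.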

Related

FilePulse SourceConnector

I would like to continuously read a CSV file into ksqlDB with the FilePulse Source Connector, but it does not work correctly: either
a) the connector reads the file only once, or
b) the connector reads all data from the file, but in that case there are duplicates in the Kafka topic (every time the connector reads the appended file, it inserts all of the file's data into the topic, not only the changed data).
Are there any options to solve this (to continuously read only appended data from the file, or to remove duplicates in the Kafka topic)?
Thank you
To my knowledge, the file source connector doesn't track the file content. The connector only sees a modified file, so it reads the whole thing on any update. Otherwise, reading the file only once is the expected behavior, and you'd need to handle duplicates in your processing logic (or reset your consumer offsets); for example, make a table in ksqlDB.
If you want to tail a file for appends, other options like the spooldir connector or Filebeat/Fluentd would be preferred (and are actually documented as production-grade solutions for reading files into Kafka).
Disclaimer: I'm the author of Connect FilePulse
Connect FilePulse is probably not the best solution for continuously reading files, and as already mentioned in other answers, it might be a good idea to use solutions like Filebeat, Fluentd, or Logstash.
But FilePulse actually does support continuous reading, using the LocalRowFileInputReader with the reader's property read.max.wait.ms. Here is an older answer to a similar question: Stackoverflow: How can be configured kafka-connect-file-pulse for continuous reading of a text file?

Integrating a large XML file size with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I can not change the integration to use Debezium connector for example.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or is there an architecture for sending the XML file's content as individual messages with an XSD schema?
Isn't receiving its content as one large message a bad thing for the architecture?
The default max.message.bytes configuration on broker and topic level in Kafka is about 1 MB, and it is not advisable to significantly increase that configuration, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a typesafe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can consume the paths from the Kafka topic and do its processing on the referenced files.
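A minimal sketch of that second option, assuming a topic named "xml-exports" and a placeholder HDFS path; the Kafka message itself stays tiny while the file lives in HDFS:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class XmlPointerProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Only the pointer travels through Kafka; consumers fetch the file from HDFS themselves.
            String hdfsPath = "hdfs://namenode:8020/exports/export-2023-01-01.xml";
            producer.send(new ProducerRecord<>("xml-exports", "export-2023-01-01", hdfsPath));
        }
    }
}
```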
Writing a Kafka producer that unmarshals the XML files into Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.
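For completeness, a rough, self-contained sketch of that approach; the element and field names ("export", "row", "id", "name"), the topic name, and the Schema Registry URL are all assumptions for illustration, and real code would use JAXB and Avro classes generated from the XSD and the Avro schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class XmlBatchProducer {

    // Placeholder JAXB classes standing in for classes generated from the XSD.
    @XmlRootElement(name = "export")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Export {
        @XmlElement(name = "row")
        public List<Row> rows = new ArrayList<>();
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Row {
        public String id;
        public String name;
    }

    // Placeholder Avro schema for one database row.
    private static final Schema ROW_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                    + "{\"name\":\"id\",\"type\":\"string\"},"
                    + "{\"name\":\"name\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        // Unmarshal the whole export (workable for ~100 MB; a streaming parser such
        // as StAX would avoid holding the entire file in memory).
        Export export = (Export) JAXBContext.newInstance(Export.class)
                .createUnmarshaller()
                .unmarshal(new File("export.xml"));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed Schema Registry location

        // One small, schema-backed message per database row instead of one 100 MB message.
        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            for (Row row : export.rows) {
                GenericRecord record = new GenericData.Record(ROW_SCHEMA);
                record.put("id", row.id);
                record.put("name", row.name);
                producer.send(new ProducerRecord<>("db-export-rows", row.id, record));
            }
        }
    }
}
```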

Remove and add compression in kafka topic. What will happen to the existing data in the topic?

If a topic is configured without compression and some data already exists in it, and the topic is then switched to compression, will the existing data be compressed?
The other direction: if a topic is configured with compression and some data already exists in it, will the existing data be decompressed once compression is removed?
This question comes from worrying about the data consumer. When a topic has some data that is compressed and some that is not, is that very messy, or do the brokers know which events in the same topic are compressed and which are not, and deliver the right data?
If the existing data does not match the compression setup, I would remove it by configuring a very low retention time. Once the topic is completely empty, I would then ingest data again so that every event is consistently either compressed or uncompressed.
Both compressed and uncompressed records can coexist in a single topic. The compression type is stored in each record (each record batch, actually), so the consumer knows how to handle the message.
On the broker side, it normally does not care whether a record batch is compressed. Assuming no down-conversion of old-format records occurs, the broker always stores the batch as it is.
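As an illustration, here is a minimal sketch (the topic name "events" and the broker address are assumptions) of switching the topic-level compression.type with the Java AdminClient. Only batches produced after the change are affected; batches already in the log keep whatever codec they were written with, and consumers read both transparently:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetTopicCompression {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Set compression.type=gzip on the topic; existing log segments are not rewritten.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            AlterConfigOp setGzip = new AlterConfigOp(
                    new ConfigEntry("compression.type", "gzip"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setGzip))).all().get();
        }
    }
}
```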

Purpose of +tmp in Kafka hdfs connect

I am planning to use the Kafka HDFS connector for moving messages from Kafka to HDFS. While looking into it, I see there are parameters like flush.size and rotate.interval.ms with which you can batch messages in heap and write the batch at once.
Is the batch written to a WAL first and then to the configured location? I also see it creates a +tmp directory. What's the purpose of the +tmp directory? We could directly write the whole batch as a file under the specified location with offset ranges.
When the connector writes to HDFS, it writes to the WAL first. The +tmp directory holds all the temporary files, which get combined into larger HDFS files and then moved to the actual configured location.
In fact, you can refer to the actual implementation to understand this in depth:
https://github.com/confluentinc/kafka-connect-hdfs/blob/121a69133bc2c136b6aa9d08b23a0799a4cd8799/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L611

kafka + reading from topic log file

I have a topic log file and the corresponding .index file. I would like to read the messages in a streaming fashion and process them. How and where should I start?
Should I load these files to Kafka producer and read from topic?
Can I directly write a consumer to read data from the file and process it?
I have gone through the Kafka website, and everywhere it uses pre-built Kafka producers and consumers in the examples, so I couldn't get enough guidance.
I want to read in streaming fashion in Java.
The text looks encrypted, so I am not posting the input files.
Any help is really appreciated.
You can dump the log segments (e.g. with the kafka-dump-log tool) and use the deep-iteration option to deserialize the data into something more readable.
If you want to "stream it", then use a standard Unix pipe to send the output to some other tool
do aggregate operations
Then use Kafka Streams to actually read from the topic across all partitions, rather than from the single partition on that single broker.
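A minimal Kafka Streams sketch of that last step, assuming a topic named "my-topic" with String keys and values, and a simple count-per-key as the aggregate operation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class TopicAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Count messages per key as an example aggregate operation.
        KTable<String, Long> counts = builder.<String, String>stream("my-topic")
                .groupByKey()
                .count();
        counts.toStream().foreach((key, count) -> System.out.println(key + " -> " + count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```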