I have a folder that contains two files. I want to display the names of those two files, and when I add another file, display its name too. My question is: can I do that with Kafka?
It's a strange use case for Kafka. I think this should be done by adding Apache Flume. Flume can poll a directory and, when it discovers new files, send them to Kafka; you can then process those messages to recover the file names.
Does that solve your problem?
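For reference, a minimal Flume agent sketch along those lines, assuming a local directory, a single-node Kafka broker, and Flume 1.7+ style Kafka sink keys (all paths, agent/component names, and the topic are placeholders):

```properties
# Spooling-directory source -> memory channel -> Kafka sink (placeholder names throughout)
agent.sources = spool-src
agent.channels = mem-ch
agent.sinks = kafka-sink

agent.sources.spool-src.type = spooldir
agent.sources.spool-src.spoolDir = /path/to/watched/folder
# Put the file name into an event header so it can be recovered downstream
agent.sources.spool-src.basenameHeader = true
agent.sources.spool-src.channels = mem-ch

agent.channels.mem-ch.type = memory

agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.kafka.bootstrap.servers = localhost:9092
agent.sinks.kafka-sink.kafka.topic = new-files
agent.sinks.kafka-sink.channel = mem-ch
```

A downstream consumer of the `new-files` topic would then read the event header (or body) to display the file names as they arrive.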
I am creating Kafka topics programmatically. When the application ends, I have to delete the created Kafka topics. I am using adminClient.deleteTopics(TOPIC_LIST), but the topics still do not get deleted; instead they end up "marked for deletion". Does anyone know how I can delete a topic permanently using Java?
The method you're using is correct.
There's no method available to wait for the data on disk to be permanently removed, unless you decide to use something like JSch and manually delete the remote file paths on the brokers.
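As a sketch of what the correct call looks like (the broker address and topic names below are placeholders, not your actual values), you can at least block until the controller accepts the deletion; also note that deletion only completes if the brokers run with delete.topic.enable=true, otherwise topics stay "marked for deletion":

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DeleteTopicsResult;

public class TopicCleanup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient adminClient = AdminClient.create(props)) {
            // Stand-in for the TOPIC_LIST mentioned in the question
            List<String> topicList = Arrays.asList("topic-a", "topic-b");
            DeleteTopicsResult result = adminClient.deleteTopics(topicList);
            // Block until the controller has accepted the deletion request.
            // This does NOT guarantee the log segments are already gone from disk.
            result.all().get();
        }
    }
}
```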
I would like to continuously read a CSV file into ksqlDB with the FilePulse Source Connector, but it does not work correctly: either
a) the connector reads the file only once, or
b) the connector reads all the data from the file, but in that case there are duplicates in the Kafka topic (every time the connector reads the appended file, it inserts all the data from the file into the topic, not only the changed data).
Is there any option to solve this (to continuously read only the appended data from the file, or to remove the duplicates from the Kafka topic)?
Thank you
To my knowledge, the file source connector doesn't track the file content. The connector only sees a modified file, so it reads the whole thing on any update. Otherwise, reading the file once is the expected behavior, and you should reset your consumer offsets or handle the duplicates in your processing logic; for example, make a table in ksqlDB.
If you want to tail a file for appends, other options like the spooldir connector or Filebeat/Fluentd would be preferred (and are actually documented as production-grade solutions for reading files into Kafka).
Disclaimer: I'm the author of Connect FilePulse
Connect FilePulse is probably not the best solution for continuously reading files, and as already mentioned in the other answers, it might be a good idea to use solutions like Filebeat, Fluentd, or Logstash.
But FilePulse actually does support continuous reading using the LocalRowFileInputReader with the reader's read.max.wait.ms property. Here is an older answer to a question similar to yours: Stack Overflow: How can be configured kafka-connect-file-pulse for continuous reading of a text file?
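Purely as a sketch of that direction (the class names below assume FilePulse 2.x, and the directory, topic, and exact property keys should be double-checked against the FilePulse documentation for your version):

```properties
# FilePulse source connector sketch -- verify keys against the docs for your FilePulse version
name=csv-filepulse-source
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic=csv-rows
tasks.max=1

# Read the file row by row and keep waiting for new lines to be appended
tasks.reader.class=io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader
read.max.wait.ms=60000

# Scan a local directory for the CSV file(s) to ingest
fs.listing.class=io.streamthoughts.kafka.connect.filepulse.fs.LocalFSDirectoryListing
fs.listing.directory.path=/path/to/csv/dir
```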
We have a situation where we have to cache and persist a CSV file in a Kafka KTable. Is that possible in Kafka? From what I have researched, we can read a CSV file into a KTable, but it won't be persisted (I might be wrong here). I have not been able to find anything related to it in the docs.
To be a little more specific:
We need to take a CSV file.
Send it to a KTable and cache/persist it as it is.
One more thing: if it's possible, will it read the file line by line, or can the whole file be sent with a single key?
Thank you!
It's possible, yes, although I'm not sure I understand why you wouldn't just load the CSV itself within the application as a list of rows.
will it read the file line by line, or can the whole file be sent with a single key?
That depends on how you read the file. And you'd first have to produce the data to Kafka; a KTable must consume from a topic, not from files.
Note: Kafka has a default max message size of 1 MB and is not meant for file transfer.
it won't be persisted
I'm not sure where you read that. You can persist the data in a compacted topic, although you'd then want some key for each row of the file.
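For illustration, a minimal Kafka Streams sketch, assuming each CSV row has already been produced to a compacted topic keyed by a row identifier (the application id, broker address, and the topic name "csv-rows" are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class CsvTableApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "csv-ktable-demo");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "csv-rows" is assumed to be a compacted topic where each record is one CSV line,
        // keyed by the row's identifier; the KTable is backed by a local state store.
        KTable<String, String> csvTable = builder.table("csv-rows");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```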
I am planning to use the Kafka HDFS connector for moving messages from Kafka to HDFS. While looking into it, I see there are parameters like flush.size and rotate.interval.ms with which you can batch messages in heap and write the batch at once.
Is the batch written to a WAL first and then to the mentioned location? I also see it creates a +tmp directory. What's the purpose of the +tmp directory? Couldn't the whole batch be written directly as a file under the specified location with offset ranges?
When the Kafka Connect HDFS sink writes to HDFS, it writes to the WAL first. The +tmp directory holds all the temporary files, which get compressed together into larger HDFS files; these are then moved to the actual defined location.
In fact, you can refer to the actual implementation to understand this in depth:
https://github.com/confluentinc/kafka-connect-hdfs/blob/121a69133bc2c136b6aa9d08b23a0799a4cd8799/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L611
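For context, a minimal sketch of an HDFS sink configuration using the batching parameters mentioned above (the topic name, HDFS URL, and the sizes/intervals are placeholder values, not recommendations):

```properties
name=hdfs-sink-example
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Placeholder topic and HDFS namenode address
topics=my-topic
hdfs.url=hdfs://namenode:8020

# Commit a file once 1000 records have been buffered in heap...
flush.size=1000
# ...or once the rotation interval (in ms) has elapsed for an open file
rotate.interval.ms=600000
```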
I'm using spooldir as a Flume source with a Kafka sink. Is there any way I can transfer both the content and the filename to Kafka?
For example, if the filename is test.txt and the content is hello world, I need to display:
hello world
test.txt
Some sources allow adding the name of the file as a header of the Flume event created from the input data; that's the case for the spooldir source.
And some sinks allow configuring the serializer used for writing the data, such as the HDFS one; in that case, I've read there is a header_and_text serializer (never tested it). Nevertheless, the Kafka sink does not expose parameters for doing that.
So, IMHO, you have two options:
Configure the spooldir source to add the above-mentioned header with the file name, and develop a custom interceptor in charge of modifying the data with that header value (see the sketch after this list). Interceptors are pieces of code running at the output of the sources that "intercept" the events and modify them before they are effectively put into the Flume channel.
Modify the data you send to the spooldir source by adding a first data line containing the file name.
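A rough sketch of the first option, assuming the spooldir source is configured with basenameHeader = true so that every event carries a basename header (the class name is illustrative, and this is not a tested implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Appends the value of the "basename" header (set by the spooldir source when
// basenameHeader = true) to the end of the event body, so that both the content
// and the file name reach the Kafka sink.
public class FilenameAppendInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String filename = event.getHeaders().get("basename");
        if (filename != null) {
            String body = new String(event.getBody(), StandardCharsets.UTF_8);
            event.setBody((body + "\n" + filename).getBytes(StandardCharsets.UTF_8));
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new FilenameAppendInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}
```

The interceptor would then be registered on the spooldir source in the agent configuration via its Builder class, so every event is rewritten before it is put into the channel feeding the Kafka sink.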