FilePulse SourceConnector - apache-kafka

I would like to continuously read a CSV file into ksqlDB with the FilePulse source connector, but it does not work correctly:
a) the connector reads the file only once, or
b) the connector reads all data from the file, but then there are duplicates in the Kafka topic (every time the connector reads the appended file, it inserts all the data from the file into the topic, not only the changed data).
Is there any option to solve this (to continuously read only the appended data from the file, or to remove the duplicates in the Kafka topic)?
Thank you

To my knowledge, the file source connector doesn't track the file content. The connector only sees a modified file, so it reads the whole thing on any update. Otherwise, reading the file once is the expected behavior; to re-process the data you would reset your consumer offsets and handle the duplicates in your processing logic, for example by making a table in ksqlDB.
If you want to tail a file for appends, other options such as the Spool Dir connector, or Filebeat/Fluentd, would be preferable (and are actually documented as production-grade solutions for reading files into Kafka).
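Just to illustrate what tailing a file for appends involves, here is a toy sketch, not production-grade: it assumes the kafka-python client, a local broker, and made-up file/topic names, and it handles none of the rotation, retries, or offset tracking that Filebeat/Fluentd give you.

    import time

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Follow the file like `tail -f`: only lines appended after startup are sent.
    with open("/var/data/input.csv") as f:
        f.seek(0, 2)  # jump to the end so existing content is skipped
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)  # nothing new yet; poll again shortly
                continue
            producer.send("csv-appends", line.rstrip("\n").encode("utf-8"))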

Disclaimer: I'm the author of Connect FilePulse
Connect FilePulse is probably not the best solution for continuously reading files, and as already mentioned in other answers, it might be a good idea to use solutions like Filebeat, Fluentd or Logstash.
But FilePulse actually does support continuous reading, using the LocalRowFileInputReader with the reader property read.max.wait.ms. Here is an older answer to a question similar to yours: Stackoverflow: How can be configured kafka-connect-file-pulse for continuous reading of a text file?
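For reference, here is a rough sketch of registering such a connector through the Kafka Connect REST API. The property names follow what I recall of FilePulse 2.x, and the topic, paths and values are made up, so treat this as an assumption to verify against the FilePulse documentation rather than a working config:

    import requests  # pip install requests

    # Hypothetical FilePulse config: the row reader plus read.max.wait.ms are
    # the parts that enable continuous reading; the remaining required settings
    # (file listing, cleanup policy, state storage) are version-dependent and
    # deliberately left out here.
    connector = {
        "name": "csv-filepulse",
        "config": {
            "connector.class": "io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector",
            "topic": "csv-data",
            "tasks.max": "1",
            # Row-by-row reader that keeps waiting (here up to 1 hour) for new lines.
            "tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader",
            "read.max.wait.ms": "3600000",
        },
    }

    resp = requests.post("http://localhost:8083/connectors", json=connector)
    resp.raise_for_status()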

Related

Kafka connect topic message modification before writing to sink database

I have set up Kafka Connect between my source and destination. For example, I have a table in MySQL which I want to send to MongoDB; I have set up MySQL as the source and MongoDB as the sink, and it's working fine.
My MySQL table has a column called 'download_link' which holds an S3 download link to a PDF. With Kafka set up as it is, this link goes to MongoDB as-is, but what I need is: after I receive the message from the MySQL source, I want to execute some Python code which downloads the PDF file and extracts the text from it, so that what goes into MongoDB is not the link but the extracted text. Is it possible to do something like this?
Can someone provide some resources on how I can achieve this?
I want to execute a python code ...
Kafka Connect cannot do this.
Since you have the apache-kafka-streams tag, refer to this post - Does Kafka python API support stream processing?
You would run your Python stream processor after the source connector, send the data to new topic(s), then use a Connect sink on those.
Keep in mind that Kafka messages have a maximum size, so extracting large PDF text blobs and persisting them in the topic(s) might not be the best idea. Instead, you could have the MongoDB writer application download the PDF before writing to the database, but as stated, you'd need to write Java to do that with Kafka Connect. Otherwise, you're left with other Python processes that consume from Kafka and write to Mongo.
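As a rough sketch of that "Python processor between two topics" idea (the topic names, the 'download_link' field name and the text-extraction step are placeholders, and the kafka-python client is assumed):

    import json

    import requests
    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

    def extract_text(pdf_bytes: bytes) -> str:
        # Placeholder: plug in your PDF-to-text library of choice here.
        return ""

    consumer = KafkaConsumer(
        "mysql.mydb.documents",  # topic written by the MySQL source connector (assumed name)
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:
        record = dict(message.value)
        pdf = requests.get(record["download_link"]).content  # fetch the PDF from S3
        record["document_text"] = extract_text(pdf)          # replace the link with text
        record.pop("download_link", None)
        producer.send("documents.enriched", record)          # point the Mongo sink at this topic

The MongoDB sink connector would then read the enriched topic instead of the original one.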

Kafka Connect Reread Entire File for KSQLDB Debugging or KSQLDB possibility to insert all events after query creation?

I am starting to develop with ksqlDB and, alongside it, Kafka Connect. Kafka Connect is awesome and everything works well; it deliberately does not reread records it detects as already read in the past (extremely useful for production). But for developing and debugging ksqlDB queries it is necessary to replay the data, as ksqlDB only creates table entries on the fly from emitted changes; if nothing is replayed, the 'test' query stays empty. Any advice on how to replay a CSV file with Kafka Connect after the file has been ingested for the first time? Maybe ksqlDB has the possibility to reread the whole topic after the table is created. Does someone have an answer for a beginner?
Create the source connector with a different name, or give the CSV file a new name. Both should cause it to be re-read.
I also had a problem in my configuration file, and ksqlDB did not recognize the
SET 'auto.offset.reset'='earliest';
option. Use the above command (note the single quotes) to force ksqlDB to reread the whole topic from the beginning after a table/stream creation command. You have to set it manually every time you connect via the ksql CLI or a client.

How to cache a single csv file into a KTable in Kafka?

We have a situation where we have to cache and persist a CSV file in a Kafka KTable. Is that possible in Kafka? According to what I have researched, we can read a CSV file into a KTable but it won't be persisted (I might be wrong here). I have not been able to find anything related to this in the docs.
To be a little specific:
We need to take a CSV file.
Send it to a KTable and cache/persist it as it is.
One more thing: if it's possible, will it read the file line by line, or can the whole file be sent with a single key?
Thank you!
It's possible, yes, although I'm not sure I understand why you wouldn't just load the CSV itself within the application as a list of rows.
will it read the file line by line or the whole file can be sent too with a single key?
That depends on how you read the file. In any case, you'd first produce the data to Kafka; a KTable must consume from a topic, not from files.
Note: Kafka has a default maximum message size of 1 MB, and it is not meant for file transfer.
it won't be persisted
I'm not sure where you read that. You can persist the data in a compacted topic, although you'd then want a key for each row of the file.
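To make the "produce keyed rows first" part concrete, here is a small sketch; the kafka-python client, a local broker, a hypothetical users.csv with an id column, and a topic you have already created with cleanup.policy=compact are all assumptions:

    import csv
    import json

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # One keyed record per CSV row; the key is what compaction (and the KTable
    # built on this topic) uses to keep only the latest version of each row.
    with open("users.csv", newline="") as f:
        for row in csv.DictReader(f):
            producer.send("csv-table-topic", key=row["id"], value=row)

    producer.flush()

A Kafka Streams KTable or a ksqlDB table defined over that compacted topic then gives you the persisted, cached view of the file.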

kafka + reading from topic log file

I have a topic log file and the corresponding .index file. I would like to read the messages in a streaming fashion and process them. How and where should I start?
Should I load these files into a Kafka producer and read from the topic?
Can I directly write a consumer to read the data from the file and process it?
I have gone through the Kafka website and elsewhere, but the examples all use pre-built Kafka producers and consumers, so I couldn't get enough guidance.
I want to read in a streaming fashion in Java.
The text looks encrypted, so I am not posting the input files.
Any help is really appreciated.
You can dump the log segments with Kafka's DumpLogSegments tool and use its deep-iteration option to deserialize the data into something more readable.
If you want to "stream it", use a standard Unix pipe to send that output to some other tool.
do aggregate operations
Then use Kafka Streams to actually read from the topic across all partitions, rather than from the single partition whose segment files sit on that one broker.

Single letter being prepended to JSON records from NIFI JSONRecordSetWriter

I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file after publishing look like well-formed JSON. However, when consuming the messages on the command line I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader of course produces the error here.
As I've tried different things the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream and before being published to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try, does anyone have any ideas?
Since you have the "Schema Write Strategy" set to "Confluent Schema Reference", the writer is told to write the Confluent schema id reference at the beginning of the content of each message, so what you are seeing is most likely the bytes of that header.
If you are using the Confluent Schema Registry, then this is correct behavior, and those bytes need to be there for the consuming side to determine which schema to use.
If you are not using the Confluent Schema Registry when consuming these messages, just choose one of the other Schema Write Strategies.
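For anyone who wants to see what that prefix is, here is a small sketch that assumes the standard Confluent wire format (one magic byte of 0x00, a 4-byte big-endian schema id, then the record itself); the schema id value and JSON payload below are made up:

    import json
    import struct

    def split_confluent_frame(raw: bytes):
        # Standard Confluent wire format: magic byte 0x00, 4-byte schema id, payload.
        if len(raw) > 5 and raw[0] == 0:
            schema_id = struct.unpack(">I", raw[1:5])[0]
            return schema_id, json.loads(raw[5:])
        return None, json.loads(raw)  # no header: treat as plain JSON

    # A record whose schema id happens to be 121 -- chr(121) is "y", which is why
    # a stray letter shows up when the raw bytes are printed as text.
    framed = b"\x00\x00\x00\x00\x79" + b'{"field": "value"}'
    print(split_confluent_frame(framed))  # (121, {'field': 'value'})

The changing letters ("h", "f", "y") would be consistent with the schema id changing as the flow was reconfigured, since for small ids only the last byte of the header lands on a printable character.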