Reading: Kafka Connect FileStreamSource ignores appended lines
An answer from 2018 states:
Kafka Connect does not "watch" or "tail" a file. I don't believe it is documented anywhere that it does do that.
It seems Kafka Connect does now support this, as
https://docs.confluent.io/5.5.0/connect/managing/configuring.html#standalone-example
states that the file is watched:
FileSource Connector
The FileSource Connector reads data from a file and sends it to Apache Kafka®. Beyond the configurations common to all connectors it takes only an input file and output topic as properties. Here is an example configuration:
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
My configuration is the same as in the question Kafka Connect FileStreamSource ignores appended lines.
connect-file-source.properties contains:
name=my-file-connector
connector.class=FileStreamSource
tasks.max=1
file=/data/users/zamara/suivi_prod/app/data/logs.txt
topic=connect-test
Starting connect standalone with
connect-standalone connect-standalone.properties connect-file-source.properties
adds all the contents of the file logs.txt to the topic connect-test, but adding new lines to logs.txt does not add those lines to the topic. Is there configuration required to enable Kafka to watch the file so that new data added to logs.txt is added to the topic connect-test?
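As an aside, one quick way to check whether appended lines ever reach the topic (assuming a broker listening on localhost:9092) is to leave a console consumer running while appending to the file:
kafka-console-consumer --bootstrap-server localhost:9092 --topic connect-test --from-beginning
# in another shell, append a test line to the watched file
echo "new log line" >> /data/users/zamara/suivi_prod/app/data/logs.txt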
Unless you're just experimenting with FileStreamSource for educational purposes, you're heading down a blind alley here. The connector only exists as a sample connector.
To ingest files into Kafka use Kafka Connect Spooldir, Kafka Connect FilePulse, or look at things like Filebeat from Elastic.
I am starting to play with CDC and Kafka Connect.
After countless hours of trying, I have come to understand the logic:
Set Kafka Connect properties (bin/connect-standalone.sh) with your cluster information
Set Kafka Connect configuration file (config/connect-standalone.properties)
Download your Kafka connector (in this case MySQL from Debezium)
Configure connector properties in whatevername.properties
In order to run a worker with Kafka Connect, you need to run:
./bin/connect-standalone.sh config/connect-standalone.properties
which responds with:
INFO Usage: ConnectStandalone worker.properties connector1.properties [connector2.properties ...] (org.apache.kafka.connect.cli.ConnectStandalone:62)
I know we need to run:
./bin/connect-standalone.sh config/connect-standalone.properties myconfig.properties
My issue is that I cannot find any format description or example of that myconfig.properties file.
【Extra Info】
Debezium configuration properties list:
https://docs.confluent.io/debezium-connect-mysql-source/current/mysql_source_connector_config.html#mysql-source-connector-config
https://debezium.io/documentation/reference/1.5/connectors/mysql.html
【Question】
Where can I find an example of the connector properties?
Thanks!
I'm not sure if I understood your question, but here is an example of properties for this connector:
connector.class=io.debezium.connector.mysql.MySqlConnector
name=someuniquename
database.hostname=192.168.99.100
database.port=3306
database.user=debezium-user
database.password=debezium-user-pw
database.server.id=184054
database.server.name=fullfillment
database.include.list=inventory
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=dbhistory.fullfillment
include.schema.changes=true
The original config is the one from the documentation, which I converted from JSON to properties: https://debezium.io/documentation/reference/1.5/connectors/mysql.html#mysql-example-configuration
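Purely as a sketch for comparison, the same JSON form can be posted to a Connect worker's REST API when running in distributed mode; the endpoint localhost:8083 is just the default REST port, and the config mirrors the properties above:
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "someuniquename",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "192.168.99.100",
    "database.port": "3306",
    "database.user": "debezium-user",
    "database.password": "debezium-user-pw",
    "database.server.id": "184054",
    "database.server.name": "fullfillment",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.fullfillment",
    "include.schema.changes": "true"
  }
}'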
I set up everything as recommended for a quick start, using a text file that contains one sentence as the source. When I launch a consumer console for the first time, I am able to read the sentence from the file (in JSON format), but when I add something to the file it does not show up in the consumer console. When I use the producer console to add something to the topic, it shows up right away in the consumer console. What could be the problem?
zookeeper UP
Connector UP
consumer UP
producer UP
Kafka UP
Kafka doesn't watch files for changes. You would need to program your own code to detect file modifications on disk and then restart the producer thread to pick up those changes.
Alternatively, use the kafka-connect-spooldir connector, available on GitHub.
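A third quick-and-dirty option (just a sketch; the file path and the broker address localhost:9092 are placeholders) is to skip Connect entirely and pipe tail into the console producer:
# follow the file by name (re-opens it after rotation) and push each new line to the topic
tail -F /path/to/your/logs.txt \
  | kafka-console-producer --broker-list localhost:9092 --topic connect-test
# newer Kafka versions take --bootstrap-server instead of --broker-list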
I created a new topic and placed the file at the wrong path, so I had to edit these files:
bin/connect-standalone.sh
config/connect-standalone.properties
config/connect-file-source.properties
config/connect-file-sink.properties
Edit these lines:
topic=my_created_topic
file=PATH_TO_MY_SOURCE_FILE
Everything is working perfectly, yah!!!!!!!!!!
Is the Kafka Spool Directory connector suitable for loading streaming data (logs) into Kafka in production? Can it be run in distributed mode? Is there any other connector that can be used, since the FileStream source connector is not suitable for production?
Does this match your requirements?
It "provides the capability to watch a directory for files and read the data as new files are written to the input directory."
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector
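As a rough illustration only (property names are taken from the jcustenborder kafka-connect-spooldir connector and may differ between versions; the paths and topic are placeholders), a CSV spool-dir source config can look something like this:
name=csv-spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
tasks.max=1
topic=spooldir-csv-topic
# directories the connector watches, moves finished files to, and parks bad files in
input.path=/data/spooldir/input
finished.path=/data/spooldir/finished
error.path=/data/spooldir/error
# Java regex; the double backslash survives properties-file escaping as \.
input.file.pattern=^.*\\.csv$
csv.first.row.as.header=true
schema.generation.enabled=true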
It can be argued that something like Flume, Logstash, Filebeat, Fluentd, Syslog, GELF, or other log solutions is more appropriately suited for your purpose of collecting logs into Kafka.
I have been using Kafka Connect for the Confluent Platform using the following guide:
Kafka connect quickstart
But it doesn't update the sink file anymore; any changes in the source file are not written to the Kafka topic.
I have already deleted all the tmp files, but nothing changed.
Thanks in advance
Start up a new file source connector with a new location for storing the offsets. This connector is meant as a demo and really doesn't handle anything except a simple file that only gets appended to. Note that you shouldn't be doing anything with this connector other than a simple demo. Have a look at the connector hub if you need something for production.
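In standalone mode the offsets live in a flat file configured on the worker, so "a new location for storing the offsets" just means pointing that property at a fresh file. A minimal sketch, assuming you edit connect-standalone.properties (the new path itself is arbitrary):
# connect-standalone.properties (worker config)
offset.storage.file.filename=/tmp/connect-new.offsets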
To the OP: I had this happen about 5 minutes ago, but when I restarted the connector it was fine; both test.sink.txt and the consumer are getting the new lines. So in a nutshell, just restart your connector.
If the FileStreamSource/Sink stops working after it worked fine, and you've already restarted ZooKeeper, the Kafka server, and the connector but it still does not work, then the problem is with the connect.offsets file in the Kafka directory.
You should delete it and create a new empty one.
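In practice that means removing the offsets file your standalone worker points at and recreating it empty; /tmp/connect.offsets is only the common default, so check offset.storage.file.filename in your worker config for the actual path:
rm /tmp/connect.offsets      # path taken from offset.storage.file.filename
touch /tmp/connect.offsets   # recreate an empty offsets file, then restart the worker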
I faced the same problem before, but correcting the paths of the input and output files in the properties files as below worked for me. It then streamed from the input file (test.txt) to the output file (test.sink.txt).
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/home/mypath/kafka/test.txt
topic=connect-test
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/home/mypath/kafka/test.sink.txt
topics=connect-test
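Assuming the two snippets above are saved as config/connect-file-source.properties and config/connect-file-sink.properties (the filenames are just an assumption), both connectors can run in a single standalone worker:
bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties config/connect-file-sink.properties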
I have a 1-node Kafka that crashed recently. I was able to salvage the .log and .index files from /tmp/kafka-logs/mytopic-0/, and I have moved these files to a different server and installed Kafka on it.
Is there a way to have the new kafka server serve the data contained in these .log files?
Update:
I probably didn't do this the right way, but here is what I've tried:
created a topic named recovermytopic on the new kafka server
stopped kafka
moved all the .log files into /tmp/kafka-logs/recovermytopic-0
restarted kafka
It appeared that for each .log file, Kafka generated a .index file, which looked promising, but after the index files were created I saw the messages below:
WARN Partition [recovermytopic,0] on broker 0: No checkpointed highwatermark is found for partition [recovermytopic,0] (kafka.cluster.Partition)
INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions [recovermytopic,0] (kafka.server.ReplicaFetcherManager)
When I try to check the topic using kafka-console-consumer, the kafka server says:
INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor)
No messages are being consumed.
Kafka comes packaged with a DumpLogSegments tool that will extract messages (along with offsets, etc.) from Kafka data log files:
$KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files mytopic-0/00000000000000132285.log > 00000000000000132285_messages.out
The output will vary a bit depending on which version of Kafka you're using, but it should be easy to extract the message keys and values with the use of sed or some other tool. The messages can then be replayed into your Kafka cluster using the kafka-console-producer.sh tool, or programmatically.
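As a rough sketch of that extract-and-replay step (the field labels in the dump output and the broker address localhost:9092 are assumptions to adapt for your version and setup):
# extract just the message payloads from the dump
$KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log \
  --files mytopic-0/00000000000000132285.log \
  | sed -n 's/.*payload: //p' > payloads.txt

# replay them into the topic on the new broker
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 \
  --topic recovermytopic < payloads.txt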
While this method is a bit roundabout, I think it's more transparent/reliable than trying to get a broker to start with data log files obtained from somewhere else. I've tested the DumpLogSegments tool with various versions of Kafka from 0.9 all the way up to 2.4.