I want to send my Google Chrome history to Kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use the Kafka file source connector.
But how can I send newly added Chrome history entries after I start the Kafka source connector?
Is there any way to track changes to the source file so the Kafka broker can pick them up?
Indeed, you can use FileStreamSourceConnector to achieve that. You do not need anything else.
Once you start FileStreamSourceConnector, it will hook into the specified file. So, whenever new data is appended to the file, the connector will automatically produce it to the topic.
From the FileStreamSourceConnector documentation:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
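For reference, a minimal standalone file source configuration looks like the following; the connector name, file path, and topic here are illustrative, not values from the question:

```properties
# config/connect-file-source.properties (illustrative values)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# File to tail; lines appended after startup are produced to the topic
file=/path/to/watched/file
topic=history-topic
```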
Related
I set everything up as recommended in the quick start and used a text file containing one sentence as the source/producer. When I launch a consumer console for the first time, I am able to read the sentence (in JSON format) from the file, but when I add something to the file it does not show up in the consumer console. When I use the producer console to add something to the topic, it shows up right away in the consumer console. What could be the problem?
ZooKeeper UP
Connector UP
Consumer UP
Producer UP
Kafka UP
Kafka doesn't watch files for changes. You would need to write your own code to detect file modifications on disk and then restart the producer thread to pick up those changes.
Alternatively, use the kafka-connect-spooldir connector, available on GitHub.
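As a sketch, a minimal configuration for that connector might look like this; the class and property names follow the jcustenborder/kafka-connect-spooldir project, and the paths and topic are assumptions:

```properties
# Illustrative spooldir source config (paths and topic are assumptions)
name=spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirLineDelimitedSourceConnector
topic=my-topic
input.path=/data/input
finished.path=/data/finished
error.path=/data/error
input.file.pattern=.*\.txt
```

Unlike FileStreamSource, this connector picks up whole files dropped into `input.path` and moves them to `finished.path` once processed.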
I had created a new topic and placed the file at the wrong path, so I had to edit these files:
bin/connect-standalone.sh
config/connect-standalone.properties
config/connect-file-source.properties
config/connect-file-sink.properties
---------- edit these lines------------------------------
topic=my_created_topic
file=PATH_TO_MY_SOURCE_FILE
Everything is working perfectly now!
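For completeness, the standalone worker is then launched with the worker properties plus the connector properties files, which is the standard quick-start invocation:

```shell
bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties config/connect-file-sink.properties
```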
I use the HDFS sink connector and I want to know how many records are put to HDFS.
In the logs there is a message when the connector starts to write a file to HDFS, but not how many records it contains.
For example:
INFO Opening record writer for: hdfs://hdfs/path/+tmp/table/partition=2020-02-27/19955b52-8189-4f70-94b5-46d579cd1505_tmp.avro (io.confluent.connect.hdfs.avro.AvroRecordWriterProvider)
Is it possible by extending the connector itself?
I use the Kafka Connect HDFS 2 Sink connector.
Out of the box, not that I know of (of course, it's open source, and you could look). Each file can contain a variable amount of data, so a metric tracking record counts wouldn't be all that useful.
I cannot recall whether debug or trace logs expose that information.
You can use the Hive/Spark/HDFS CLI to inspect each file, though.
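For a quick count of one Avro file, the Avro tools jar can decode it to JSON, one record per line; the file path and jar version below are illustrative:

```shell
# Copy one file out of HDFS, then count its records (illustrative paths)
hdfs dfs -get /path/table/partition=2020-02-27/part-000.avro .
java -jar avro-tools-1.9.2.jar tojson part-000.avro | wc -l
```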
I'm developing a Kafka sink connector on my own. My deserializer is JSONConverter. However, when someone sends malformed JSON data to my connector's topic, I want to omit that record and send it to a specific topic of my company.
My confusion is: I can't find any API to get my Connect worker's bootstrap.servers. (I know it's in Confluent's etc directory, but it's not a good idea to hard-code the path to "connect-distributed.properties" just to get bootstrap.servers.)
So the question is: is there another way to get the value of bootstrap.servers conveniently in my connector program?
Instead of trying to send the "bad" records from a SinkTask to Kafka yourself, you should use the dead letter queue feature that was added in Kafka Connect 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
For more details, see the KIP that added this feature.
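The DLQ is enabled entirely through the sink connector's configuration, so no code changes are needed; a minimal fragment looks like this (the topic name is illustrative):

```properties
# Keep processing past bad records instead of failing the task
errors.tolerance=all
# Route records that fail conversion/processing to a DLQ topic
errors.deadletterqueue.topic.name=my-connector-dlq
# Replication factor 1 is only appropriate for single-broker test clusters
errors.deadletterqueue.topic.replication.factor=1
# Attach headers describing the failure reason to each DLQ record
errors.deadletterqueue.context.headers.enable=true
```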
I have a source path and a destination path in HDFS. Our upstream places files in the source path, and we check for any new files added there; if there are any,
we copy them from the source path to the destination path.
For this we currently use a shell script, but I want to use Kafka in between. I researched it and found only HDFS sink connectors; there are no source connectors for HDFS.
My question is: can we use Kafka here, and how?
For this use case, I don't think you need Kafka.
There are various ways to do this. One way, for example, is to use the ZooKeeper watcher interface, which gets triggered on watched events, and programmatically fire the copy to Hadoop from your program.
As an alternative, Hadoop 2.6 introduced DFSInotifyEventInputStream, which you can use for this. You can get an instance of it from HdfsAdmin and then just call .take() or .poll() to get all the events, and based on the event type you can take action.
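A rough sketch of that approach is below. It requires the hadoop-hdfs client on the classpath; the NameNode URI and watched path are assumptions, and this is an untested illustration rather than a drop-in program. In recent Hadoop versions, take() returns an EventBatch:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

// Sketch: react to new files appearing under a source path
public class HdfsWatcher {
    public static void main(String[] args) throws Exception {
        HdfsAdmin admin = new HdfsAdmin(
                URI.create("hdfs://namenode:8020"), new Configuration());
        DFSInotifyEventInputStream stream = admin.getInotifyEventStream();
        while (true) {
            EventBatch batch = stream.take();  // blocks until events arrive
            for (Event event : batch.getEvents()) {
                if (event.getEventType() == Event.EventType.CREATE) {
                    Event.CreateEvent ce = (Event.CreateEvent) event;
                    if (ce.getPath().startsWith("/data/source/")) {
                        // trigger the copy to the destination path here
                        System.out.println("New file: " + ce.getPath());
                    }
                }
            }
        }
    }
}
```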
What is the best way to ingest a log file into HDFS while it is still being written? I am trying to configure Apache Flume, and am looking for sources that can also offer data reliability. I tried "exec" and later also looked at "spooldir", but the following documentation at flume.apache.org has cast doubt on my approach -
Exec Source:
One of the most commonly requested features is the use case like-
"tail -F file_name" where an application writes to a log file on disk and
Flume tails the file, sending each line as an event. While this is
possible, there’s an obvious problem; what happens if the channel
fills up and Flume can’t send an event? Flume has no way of indicating
to the application writing the log file, that it needs to retain the
log or that the event hasn’t been sent for some reason. Your
application can never guarantee data has been received when using a
unidirectional asynchronous interface such as ExecSource!
Spooling Directory Source:
Unlike the Exec source, "spooldir" source is reliable and will not
miss data, even if Flume is restarted or killed. In exchange for this
reliability, only immutable files must be dropped into the spooling
directory. If a file is written to after being placed into the
spooling directory, Flume will print an error to its log file and stop
processing.
Is anything better available that I can use to ensure Flume will not miss any events and also reads in real time?
I would recommend using the Spooling Directory Source because of its reliability. A workaround for the immutability requirement is to compose the files in a second directory and, once they reach a certain size (in bytes or number of log lines), move them to the spooling directory.
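That workaround can be sketched as a small rotation step. The directory names and size threshold below are assumptions: files are composed in a staging directory and moved into the spooling directory only once they reach the threshold, so Flume only ever sees immutable files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Sketch: move "full" files from a staging directory into the Flume
// spooling directory, so Flume never sees a file that is still growing.
public class SpoolRotator {
    public static void rotate(Path staging, Path spoolDir, long thresholdBytes)
            throws IOException {
        try (Stream<Path> files = Files.list(staging)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(f) && Files.size(f) >= thresholdBytes) {
                    // An atomic move means Flume cannot observe a partial file
                    Files.move(f, spoolDir.resolve(f.getFileName()),
                            StandardCopyOption.ATOMIC_MOVE);
                }
            }
        }
    }
}
```

Run on a timer (cron, a scheduled executor, etc.), this keeps the spooling directory populated only with closed, immutable files, which is exactly what the Flume documentation quoted above requires.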