Can we use Apache Kafka as a system for file watching?

I have a source path and a destination path in HDFS. Our upstream places files in the source path, and we check whether any new files have been added there; if so, we copy them from the source path to the destination path.
Right now we do this with a shell script, but I want to use Kafka in between. I researched this and found only HDFS sink connectors; there are no source connectors for HDFS.
My question is: can we use Kafka here, and how?

For this use case, I don't think you need Kafka.
There are various ways to do this. One way, for example, is to use the ZooKeeper watcher interface, which gets triggered on watched events, and then programmatically fire the copy to Hadoop from your program.
Alternatively, Hadoop 2.6 introduced DFSInotifyEventInputStream, which you can use for this. You can get an instance of it from HdfsAdmin, then call .take() or .poll() to receive the events and act on them, for example as sketched below.
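A minimal sketch of that approach in Java, assuming Hadoop 2.7+ (where the stream returns EventBatch objects); the NameNode URI and /data/source path are placeholders, and note that reading the inotify stream typically requires HDFS superuser privileges:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class SourcePathWatcher {
    public static void main(String[] args) throws Exception {
        // hdfs://namenode:8020 and /data/source are placeholders -- adjust to your cluster
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration());
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = events.take();          // blocks until events arrive
            for (Event event : batch.getEvents()) {
                if (event.getEventType() == Event.EventType.CREATE) {
                    Event.CreateEvent create = (Event.CreateEvent) event;
                    if (create.getPath().startsWith("/data/source/")) {
                        System.out.println("New file: " + create.getPath());
                        // trigger your copy to the destination path here,
                        // e.g. with FileSystem.rename() or a DistCp job
                    }
                }
            }
        }
    }
}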

Related

Unable to change default /etc/kafka/connect-log4j.properties location for different kafka connectors

I am using multiple Kafka connectors, but every connector writes its log to the same connect.log file. I want each connector to write to a different log file. For that, during startup I need to change the default /etc/kafka/connect-log4j.properties file, but I am unable to change it.
Sample Start Script:
/usr/bin/connect-standalone ../properties/sample-worker.properties ../properties/sample-connector.properties > /dev/null 2>&1 &
Is there any way to change the default /etc/kafka/connect-log4j.properties file during the startup of the connectors?
Kafka uses log4j and has an environment variable for overriding its configuration file:
export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:///some/other/log4j.properties"
connect-standalone.sh ...
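Inside that custom properties file you can then route individual connector loggers to their own files. A minimal log4j 1.x sketch, using the HDFS connector's io.confluent.connect.hdfs package as an example logger (pick the logger names and file paths that match your connectors):

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.hdfsConnector=org.apache.log4j.FileAppender
log4j.appender.hdfsConnector.File=/var/log/kafka/hdfs-connector.log
log4j.appender.hdfsConnector.layout=org.apache.log4j.PatternLayout
log4j.appender.hdfsConnector.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.logger.io.confluent.connect.hdfs=INFO, hdfsConnector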
Generally, it would be best to use connect-distributed and use some log aggregation tool like the ELK stack to parse through log events for different connectors

Is there any way to send Chrome history logs to Kafka?

I want to send my Google Chrome history to Kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use Kafka file source connector.
But how can I send newly added Chrome history entries after I start the Kafka source connector?
Is there any way to track changes to the source file so that the Kafka broker can pick them up?
Indeed, you can use FileStreamSourceConnector to achieve that. You do not need anything else.
Once you start FileStreamSourceConnector, it will hook onto the specified file, so whenever new data is appended to the file, the connector will automatically produce it to the topic (a minimal configuration is sketched at the end of this answer).
From the link that I shared above:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
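For reference, a minimal standalone connector configuration might look like the following; the connector name, topic, and the assumption that the history file is plain appended text are illustrative, so adjust them to your setup:

name=chrome-history-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
topic=chrome-history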

Is there a way to know how many records are written by a Kafka Connect sink?

I use the HDFS sink connector and I want to know how many records are put to HDFS.
In the logs I can see when the connector starts to write a file to HDFS, but not how many records it contains.
For example:
INFO Opening record writer for: hdfs://hdfs/path/+tmp/table/partition=2020-02-27/19955b52-8189-4f70-94b5-46d579cd1505_tmp.avro (io.confluent.connect.hdfs.avro.AvroRecordWriterProvider)
Is it possible by extending the connector itself?
I use the Kafka Connect HDFS 2 sink.
Out of the box, not that I know of (of course, it's open source, and you could look). Each file would have a variable amount of data, so metric tracking wouldn't be all that useful.
I cannot recall whether debug or trace logs expose that information.
You can use Hive/Spark/HDFS CLI to inspect each file, though; one possible approach is sketched below.
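For example, since the HDFS 2 sink writes Avro files here, you could count the records in one finished file with avro-tools (the file path and jar location are placeholders):

hdfs dfs -copyToLocal /path/table/partition=2020-02-27/some-file.avro .
java -jar avro-tools.jar tojson some-file.avro | wc -l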

Confluent Control Center Connect database

When I create a source or sink connector using Confluent Control Center, where does it save the settings related to that connector? Are there files I can browse? We are planning to create 50+ connectors, and at some point we will need to copy them from one environment to another, so I was wondering if there is an easy way to do that.
Kafka Connect in distributed mode uses Kafka topics for storing configuration.
Kafka Connect supports a REST API. You can use it for viewing existing connector configuration, creating new connectors (including programmatically/automatically for your 50+ connectors), starting/stopping connectors, and so on.
The REST API is documented here.
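For instance, a rough sketch of copying connectors between environments with the REST API (the hostnames, port, and connector name are placeholders):

# list the connectors on the source environment's Connect worker
curl -s http://connect-src:8083/connectors

# export one connector's configuration
curl -s http://connect-src:8083/connectors/my-hdfs-sink/config > my-hdfs-sink.json

# create (or update) the same connector in the target environment
curl -s -X PUT -H "Content-Type: application/json" \
     --data @my-hdfs-sink.json \
     http://connect-dst:8083/connectors/my-hdfs-sink/config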
Kafka Connect distributed mode is started with a property file. That property file defines a "config topic".
The connectors you're able to load, however, are not stored there; that topic only holds the running source/sink configurations.
The classes themselves are bundled as JAR files on the classpath of the individual Connect workers, and Control Center currently has no way of provisioning new connector classes. In other words, you must use something like Ansible, or manually connect to each worker, download the connector you want, and extract it next to the other connectors.
For example, let's pretend you wanted the Syslog connector.
You'd already have folders for these under /usr/share/java in the Confluent installation:
kafka-connect-hdfs
kafka-connect-jdbc
...
So, you download or build that Syslog connector, make a kafka-connect-syslog folder, and drop all the necessary JAR libraries there.
Once you do this on all Connect instances, you'll also need to restart the Kafka Connect process on those machines; a sketch of these steps is shown at the end of this answer.
Once Control Center connects back to the Connect workers, you'll be able to configure your new connector classes.
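As a rough sketch of those manual steps on one worker (the plugin folder, jar source directory, and service name are assumptions for a packaged Confluent install; adjust them to how Connect is run in your environment):

sudo mkdir -p /usr/share/java/kafka-connect-syslog
sudo cp kafka-connect-syslog/lib/*.jar /usr/share/java/kafka-connect-syslog/
# restart the worker so it picks up the new plugin
sudo systemctl restart confluent-kafka-connect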

Ingesting a log file into HDFS using Flume while it is being written

What is the best way to ingest a log file into HDFS while it is being written? I am trying to configure Apache Flume and want sources that can also offer data reliability. I tried to configure "exec" and later also looked at "spooldir", but the following documentation at flume.apache.org made me doubt my approach:
Exec Source:
One of the most commonly requested features is the use case like-
"tail -F file_name" where an application writes to a log file on disk and
Flume tails the file, sending each line as an event. While this is
possible, there’s an obvious problem; what happens if the channel
fills up and Flume can’t send an event? Flume has no way of indicating
to the application writing the log file, that it needs to retain the
log or that the event hasn’t been sent for some reason. Your
application can never guarantee data has been received when using a
unidirectional asynchronous interface such as ExecSource!
Spooling Directory Source:
Unlike the Exec source, "spooldir" source is reliable and will not
miss data, even if Flume is restarted or killed. In exchange for this
reliability, only immutable files must be dropped into the spooling
directory. If a file is written to after being placed into the
spooling directory, Flume will print an error to its log file and stop
processing.
Is anything better available that I can use to ensure Flume will not miss any events and also reads in real time?
I would recommend using the Spooling Directory Source because of its reliability. A workaround for the immutability requirement is to compose the files in a second directory and, once they reach a certain size (in bytes or number of log lines), move them into the spooling directory; a minimal agent configuration is sketched below.
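A minimal sketch of such an agent, assuming /var/log/app/ready is the spooling directory that your rotation step moves completed files into (all names and paths here are placeholders):

agent.sources = spool
agent.channels = ch
agent.sinks = hdfs-out

agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/app/ready
agent.sources.spool.channels = ch

agent.channels.ch.type = file

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/app
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.channel = ch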