Ingesting a log file into HDFS using Flume while it is being written - real-time

What is the best way to ingest a log file into HDFS while it is still being written? I am trying to configure Apache Flume and am looking for sources that can also give me data reliability. I tried "exec" and later also looked at "spooldir", but the following documentation at flume.apache.org has cast doubt on my approach -
Exec Source:
One of the most commonly requested features is the use case like-
"tail -F file_name" where an application writes to a log file on disk and
Flume tails the file, sending each line as an event. While this is
possible, there’s an obvious problem; what happens if the channel
fills up and Flume can’t send an event? Flume has no way of indicating
to the application writing the log file, that it needs to retain the
log or that the event hasn’t been sent for some reason. Your
application can never guarantee data has been received when using a
unidirectional asynchronous interface such as ExecSource!
Spooling Directory Source:
Unlike the Exec source, "spooldir" source is reliable and will not
miss data, even if Flume is restarted or killed. In exchange for this
reliability, only immutable files must be dropped into the spooling
directory. If a file is written to after being placed into the
spooling directory, Flume will print an error to its log file and stop
processing.
Is anything better available that I can use to ensure Flume will not miss any events and also reads in real time?

I would recommend using the Spooling Directory Source because of its reliability. A workaround for the immutability requirement is to compose the files in a second directory and, once they reach a certain size (in bytes or number of log lines), move them to the spooling directory, for example with a small mover process like the one sketched below.
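A minimal sketch of such a mover, assuming the application itself rolls its log files inside a staging directory and that the staging and spooling directories are on the same filesystem (all paths and the active file name below are placeholders):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SpoolMover {
    // Placeholder paths; point these at your own staging and Flume spooling directories.
    private static final Path STAGING = Paths.get("/var/log/app/staging");
    private static final Path SPOOL = Paths.get("/var/flume/spool");
    // The file the application is still appending to; it must never be moved.
    private static final String ACTIVE_FILE = "app.log";

    static void moveCompletedFiles() throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(STAGING)) {
            for (Path file : files) {
                if (file.getFileName().toString().equals(ACTIVE_FILE)) {
                    continue; // skip the file that is still being written
                }
                // Once a file lands in the spooling directory it is never touched again,
                // which satisfies the immutability requirement of the Spooling Directory Source.
                Files.move(file, SPOOL.resolve(file.getFileName()),
                        StandardCopyOption.ATOMIC_MOVE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            moveCompletedFiles();
            Thread.sleep(10_000); // poll every 10 seconds
        }
    }
}
```

A cron job with mv achieves the same thing; the important part is that a file only ever appears in the spooling directory with its final, never-changing content.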

Related

Is there any way to send Chrome history logs to Kafka?

I want to send my Google Chrome history to Kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use the Kafka file source connector.
But how can I send newly added Chrome history log entries after I run the Kafka source connector?
Is there any way to track changes to the source file so the Kafka broker can acknowledge them?
Indeed, you can use FileStreamSourceConnector to achieve that. You do not need anything else.
Once you start FileStreamSourceConnector, it will hook onto the specified file, so whenever new data is appended to the file, your connector will automatically produce it to the topic.
From the link that I shared above:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
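As an illustration only (the Connect URL, connector name and topic below are assumptions; the file path is the one from the question), registering the connector against a distributed worker's REST API could look like this:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterChromeHistorySource {
    public static void main(String[] args) throws Exception {
        // Connector name and topic are illustrative; the file path is taken from the question.
        String body = "{"
                + "\"name\": \"chrome-history-source\","
                + "\"config\": {"
                + "\"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                + "\"tasks.max\": \"1\","
                + "\"file\": \"C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history\","
                + "\"topic\": \"chrome-history\""
                + "}}";

        // POST the connector config to the Connect REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In standalone mode, the same settings go into a .properties file passed to connect-standalone instead.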

Is there a way to know how many records are written by a Kafka Connect sink?

I use the HDFS sink connector and I want to know how many records are put to HDFS.
In the logs I can see when the connector starts writing a file to HDFS, but not how many records it contains.
For example:
INFO Opening record writer for: hdfs://hdfs/path/+tmp/table/partition=2020-02-27/19955b52-8189-4f70-94b5-46d579cd1505_tmp.avro (io.confluent.connect.hdfs.avro.AvroRecordWriterProvider)
Is it possible by extending the connector itself?
I use the Kafka Connect HDFS 2 sink.
Out of the box, not that I know of (of course, it's open source, and you could look). Each file would have a variable amount of data, so metric tracking wouldn't be all that useful.
I cannot recall if debug or trace logs expose that information.
You can use Hive/Spark/HDFS CLI to inspect each file, though.
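Since the HDFS 2 sink in the question is writing Avro, one option is a small sketch like the following (the class name and passing the file path as an argument are mine), which counts the records in a single committed file straight from HDFS:

```java
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroRecordCount {
    public static void main(String[] args) throws Exception {
        // First argument: the full HDFS path of one Avro file written by the sink.
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(new FsInput(path, conf), new GenericDatumReader<>())) {
            long count = 0;
            while (reader.hasNext()) {
                reader.next(); // iterate records without keeping them in memory
                count++;
            }
            System.out.println(path + ": " + count + " records");
        }
    }
}
```

You could run it over each file listed by hdfs dfs -ls for the partition in question.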

Can we use Apache Kafka as a system for file watching

I have a source path and a destination path in HDFS. Our upstream places files in the source path, and we check for any new files added to the source path; if there are any, we copy them from the source path to the destination path.
Right now we are using a shell script for this, but I want to use Kafka in between. I researched it and found only HDFS sink connectors; there are no source connectors for HDFS.
My question is: can we use Kafka here, and how?
For this use case, I don't think you need Kafka.
There are various ways to do this. One way, for example, is to use the ZooKeeper watcher interface, which gets triggered on watched events, and programmatically fire the copy to Hadoop from your program.
As an alternative, Hadoop 2.6 introduced DFSInotifyEventInputStream, which you can use for this. You can get an instance of it from HdfsAdmin and then just call .take() or .poll() to get the events, and based on each event you can take action.
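A minimal sketch of that second approach, assuming the Hadoop 2.7+ shape of the API (take() returning an EventBatch) and placeholder values for the NameNode URI and watched directory; note that reading the inotify stream generally requires HDFS superuser privileges:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class HdfsSourceWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and watched directory; adjust to your cluster.
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration());
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = events.take(); // blocks until the next batch of edit-log events
            for (Event event : batch.getEvents()) {
                if (event.getEventType() == Event.EventType.CREATE) {
                    Event.CreateEvent create = (Event.CreateEvent) event;
                    if (create.getPath().startsWith("/data/source/")) {
                        // A new file appeared under the source path.
                        System.out.println("New file: " + create.getPath());
                    }
                }
            }
        }
    }
}
```

The println is where you would trigger the actual copy to the destination path, for example with FileSystem.rename or by kicking off a DistCp job.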

How does HornetQ persist messages?

We are in the process of planning our server machine switch. While we are doing the switch, we need to be able to continue to receive traffic and save the JMS messages that are generated.
Is it possible to move the persisted message queue from one JBoss 7.1.1/HornetQ to another?
HornetQ uses a set of binary journal files to store the messages in the queues.
You can use export journal / export data... or you can use bridges to transfer data.
You should find the relevant information in the documentation on hornetq.org.

HBase for file I/O, and a way to connect to HDFS from a remote client

Please be aware, before you read, that I'm not fluent in English.
I'm new to NoSQL and am now trying to use HBase for file storage; I'll store files in HBase as binary.
I don't need any statistics. All I need is file storage.
Is that recommended?!
I am worried about I/O speed.
Actually, because I couldn't find any way to connect to HDFS without Hadoop, I want to try HBase for file storage. I can't set up Hadoop on the client computer. I was trying to find a library, like JDBC for an RDBMS, that would help the client connect to HDFS to get files, but I couldn't find anything and just chose HBase instead of a connection library.
Can I get any help from someone?
It really depends on your file sizes. In HBase it is generally not recommended to store files or LOBs; the default maximum KeyValue size is 10 MB. I have raised that limit and run tests with >100 MB values, but you do risk OOMEs on your region servers, as they have to hold the entire value in memory. Configure your JVM memory with care.
When this type of question is asked on the hbase-users list, the usual response is to recommend using HDFS if your files can be large.
You should be able to use Thrift to connect to HDFS to avoid installing the Hadoop client on your client computer.
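If you do end up storing small files in HBase, writing one is just an ordinary Put of the raw bytes; a sketch, with a placeholder table name and column family, and files kept well below hbase.client.keyvalue.maxsize:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFilePut {
    public static void main(String[] args) throws Exception {
        // Picks up cluster connection details from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("files"))) {  // placeholder table name
            byte[] content = Files.readAllBytes(Paths.get(args[0]));     // the file to store
            Put put = new Put(Bytes.toBytes(args[0]));                   // row key = file path, for illustration
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
            table.put(put);
        }
    }
}
```

Reading the file back is a symmetric Get on the same row key and column.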