I installed Kafka on Linux, created a topic, and published messages to it. The data is saved in the folder /tmp/kafka-logs/topicname-0, and as I checked, the local file system type is XFS. Is there any way Kafka can save its data directly on an HDFS file system? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem; it cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is to use a connector to push the data to HDFS: https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
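As a rough sketch, a minimal configuration for that HDFS sink connector could look like the following (the topic name and namenode URL are placeholders for your environment; check the connector documentation for your version):

```properties
# Minimal HDFS sink connector config (standalone mode sketch)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=topicname
hdfs.url=hdfs://namenode:8020
# Number of records to accumulate before writing a file to HDFS
flush.size=1000
```

Note this copies topic data into HDFS; the broker itself still stores its own log segments on the local filesystem.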
Is the Kafka Spool Directory connector suitable for loading streaming data (logs) into Kafka in production? Can it be run in distributed mode? Is there any other connector that can be used, since the FileStream source connector is not suitable for production?
Does this match your requirements? The connector "provides the capability to watch a directory for files and read the data as new files are written to the input directory."
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector
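For illustration, a Spooldir CSV source configuration could look roughly like this (the paths and topic name are placeholders, and the property names should be checked against the connector documentation for your version):

```properties
# Sketch of a SpoolDir CSV source connector config
name=csv-spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
tasks.max=1
topic=my_topic
# Directory to watch, plus where processed and failed files are moved
input.path=/data/input
finished.path=/data/finished
error.path=/data/error
input.file.pattern=.*\.csv
# Use the first CSV row as field names and infer a schema
csv.first.row.as.header=true
schema.generation.enabled=true
```

Because it runs through Kafka Connect, this can also be deployed in distributed mode for fault tolerance.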
It can be argued that something like Flume, Logstash, Filebeat, Fluentd, syslog, GELF, or other log-shipping solutions is better suited for your purpose of collecting logs into Kafka.
Is there a procedure in Kafka to take a backup of Kafka broker data?
How do backup and restore work in Kafka?
Note: one method is to create another data center (DC) and configure inter-DC replication.
But is there any other method to take a backup of the data?
Thanks!
One approach I'd recommend is to continuously back up your Kafka data into HDFS. To do this, you can use the Confluent HDFS Sink connector. You can store your records in Avro or Parquet format.
On the flip side, using HDFS as the data source allows you to replay all of your records back into Kafka.
[1] Confluent HDFS-Sink: https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
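As an example of this backup approach, a sink configuration writing several topics to HDFS in Parquet format might look roughly like this (the topic names, namenode URL, and sizes are placeholders, not recommendations):

```properties
# Sketch: continuous backup of topics to HDFS as Parquet files
name=hdfs-backup-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=orders,payments
hdfs.url=hdfs://namenode:8020
# Write Parquet instead of the default format
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Roll files after this many records or this much time, whichever first
flush.size=10000
rotate.interval.ms=600000
```

The files land under a per-topic directory in HDFS, so a restore amounts to replaying those files back into Kafka.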
I want to push data from one of the ingestion processors of Apache NiFi to Kafka, and from there to HDFS for storage.
Is it possible to connect an ingestion processor of Apache NiFi with Kafka?
NiFi ships with several Kafka processors.
Just start typing "Kafka" into the search box when you add one. Use the version that matches your Kafka installation. For example, absolutely don't use the Kafka 0.8 processors (called GetKafka and PutKafka) with a Kafka 0.10.x installation.
You'll need to set the bootstrap servers, of course, then whatever other producer properties you care about, like the topic name
Attach a ConsumeKafka processor to PutHdfs
Side note: Kafka Connect HDFS uses purely Kafka-based API methods to ship data from Kafka to Hadoop. You don't need NiFi unless you're ingesting some other type of data.
You can use the PutKafka processor to push data from NiFi to Kafka. In the Add Processor dialog, type PutKafka to find it.
For HDFS, you can use the PutHDFS processor. You need the core-site.xml and hdfs-site.xml files to use PutHDFS. You can download the HDFS configuration files from the HDFS menu in Ambari: click Actions and select Download Client Configs. Specify the file locations separated by commas.
I am writing a Kafka producer
It has to read data from a local Linux folder and write to my topic
Is it possible to do something like that?
What would be my code snippet here (in Scala)
Business case -
Real time data will be written on a local Linux folder in form of CSV files here - /data/data01/pharma/2017/
How can I move this data to a topic I created?
My consumer will read this data and add to Spark streaming data frame for processing
Real time data will be written on a local linux folder
There are many frameworks that let you handle this.
Those I'm aware of with Kafka connections:
Filebeat
FluentD / Fluentbit
Spark Streaming (or SparkSQL / Structured Streaming)
Flume
Apache Nifi (better to run as a cluster, though, not locally)
Kafka Connect with the FileStreamSourceConnector, which is included with Apache Kafka (you don't need Confluent Platform)
The point being: don't reinvent the wheel and risk writing unnecessary (and possibly faulty) code, although you could easily write your own KafkaProducer code to do this.
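If you do want to roll your own, a minimal sketch of a Scala KafkaProducer that reads the CSV files from the folder in the question and sends each line as a record could look like this (the broker address and topic name are placeholders; this does a one-shot read of existing files rather than watching the directory for new ones):

```scala
import java.io.File
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object CsvFolderProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder broker
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  val dir = new File("/data/data01/pharma/2017/")

  // Read every CSV file in the folder once and send each line as one record
  for {
    file <- Option(dir.listFiles()).getOrElse(Array.empty)
    if file.getName.endsWith(".csv")
    line <- Source.fromFile(file).getLines()
  } producer.send(new ProducerRecord[String, String]("pharma", line)) // placeholder topic

  producer.flush()
  producer.close()
}
```

A Spark Structured Streaming consumer can then subscribe to that topic to build the streaming DataFrame mentioned in the question. For production, one of the frameworks listed above handles file tracking, retries, and offsets for you.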
If you want to read a single file, then
cat "${file}" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic
If the files are created dynamically, then you need to monitor them and feed their contents to kafka-console-producer.sh.
I can see a property in config/server.properties called log.dir. Does this mean Kafka uses the same directory for storing both its application logs and its data?
Kafka topics are "distributed and partitioned append-only logs". The parameter log.dir defines where topics (i.e., data) are stored.
It is not related to application/broker logging.
The default log.dir is /tmp/kafka-logs which you may want to change in case your OS has a /tmp directory cleaner.
log.dir or log.dirs in config/server.properties specify the directories in which the log data is kept.
The server (application) log directory is kafka_base_dir/logs by default. You can modify it by specifying another directory for kafka.logs.dir in log4j.properties.
log.dir in server.properties is the place where the Kafka broker will store the commit logs containing your data. Typically this will be your high-speed mounted disk for mission-critical use cases.
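For example, to move the data directory off /tmp, you might set something like the following in server.properties (the path is only an example):

```properties
# server.properties: where the broker stores topic data (not application logs)
log.dirs=/var/lib/kafka/data
```

Multiple comma-separated paths can be given to spread partitions across disks.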
For application/broker logging you can use general log4j logging to get the event logging in your custom location. Below are the variables to do this.
-Dlog4j.configuration=file:<configuration file with log rolling, logging level etc.> and -Dkafka.logs.dir=<path to logs>
The directory locations of logs and data were well described by Mathias. Note, though, that the data files are designed for internal processing by the Kafka engine; to store and manipulate the data elsewhere, you can use Kafka Connect. Kafka Connect is a tool for scalable and reliable streaming of data between Apache Kafka and other systems.
It makes it simple to define connectors that move large amounts of data into and out of Kafka. Kafka Connect can ingest an entire database, making the data available for stream processing, or sink the data of a single topic (or multiple topics) to another system or database for further analysis.