Kafka Connect to load streaming log data into Kafka - apache-kafka

Is the Kafka Spool Directory connector suitable for loading streaming log data into Kafka in production? Can it be run in distributed mode? Is there any other connector that can be used, since the FileStream source connector is not suitable for production?

Does this match your requirements? The Spool Dir connector "provides the capability to watch a directory for files and read the data as new files are written to the input directory."
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector; see the example configuration sketched below.
It can be argued that something like Flume, Logstash, Filebeat, Fluentd, Syslog, GELF, or another log-shipping solution is better suited to your purpose of collecting logs into Kafka.
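For reference, here is a minimal sketch of a Spool Dir CSV source configuration (the property names follow the jcustenborder kafka-connect-spooldir connector; the topic, directory paths, and file pattern are placeholders to adapt to your environment):

name=csv-spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
tasks.max=1
# Topic the parsed records are written to (placeholder)
topic=spooldir-csv-topic
# Directory the connector watches for new files (placeholder)
input.path=/data/spooldir/input
# Where successfully processed and failed files are moved (placeholders)
finished.path=/data/spooldir/finished
error.path=/data/spooldir/error
input.file.pattern=.*\.csv
csv.first.row.as.header=true
# Let the connector infer key/value schemas instead of declaring them explicitly
schema.generation.enabled=true

As with any source connector, this can also be submitted to a distributed Connect cluster; the watched directory then needs to be accessible from the worker that runs the task.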

Related

Dump Kafka to GCS

Can someone help me with the best possible way to dump data from a Kafka topic to Google Cloud Storage?
I would like to build a near-real-time pipeline which is capable of creating multiple files based on time or size cuts.
Data in the Kafka topic is in JSON format.
Confluent offers a GCS Kafka Connect sink, but you could also try using Google Cloud Dataflow / Apache Beam to do the same; a rough Beam sketch follows below.
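If you go the Beam route, a sketch in Java could look like the following (the bootstrap servers, topic, and bucket are placeholders; the five-minute fixed window is what drives the time-based file cuts):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaToGcsPipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromKafka", KafkaIO.<String, String>read()
                .withBootstrapServers("kafka:9092")           // placeholder
                .withTopic("json-topic")                      // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
         // Keep only the JSON record values
         .apply(Values.<String>create())
         // Time-based cut: close a set of files for every five-minute window
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
         .apply("WriteToGcs", TextIO.write()
                .to("gs://my-bucket/kafka-dump/part")         // placeholder bucket/prefix
                .withWindowedWrites()
                .withNumShards(1));

        p.run();
    }
}

Running this on Dataflow additionally needs the Dataflow runner dependency and pipeline options such as the GCP project and a temp/staging location.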

How to use Kafka connect to output to dynamic directory in GCS?

I am fetching JSON data from a Kafka topic. I need to dump this data onto GCS (Google Cloud Storage) into a directory, wherein the directory name will be fetched from the value of "ID" in the JSON data.
I googled and did not find any similar use case wherein Kafka Connect can be used to interpret the JSON data and create directories dynamically based on the value from the JSON data.
Can this be achieved using Kafka Connect?
You can use the Kafka Connect GCS sink connector, which is provided by Confluent.
The Google Cloud Storage (GCS) connector, currently available as a sink, allows you to export data from Kafka topics to GCS objects in various formats. In addition, for certain data layouts, the GCS connector exports data by guaranteeing exactly-once delivery semantics to consumers of the GCS objects it produces.
Here's an example configuration for the connector:
name=gcs-sink
connector.class=io.confluent.connect.gcs.GcsSinkConnector
tasks.max=1
topics=gcs_topic
gcs.bucket.name=#bucket-name
gcs.part.size=5242880
flush.size=3
gcs.credentials.path=#/path/to/credentials/keys.json
storage.class=io.confluent.connect.gcs.storage.GcsStorage
format.class=io.confluent.connect.gcs.format.avro.AvroFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=BACKWARD
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
# Uncomment and insert license for production use
# confluent.license=
You can find more details on installation and configuration in the Confluent documentation for the GCS sink connector.
Dynamic directories derived from the message contents aren't really possible out of the box with most connectors. Instead, you can implement your own Kafka Connect sink task that processes Kafka records and then writes them to the correct GCS "directories" (object prefixes) based on your JSON.
The method you'd override in the sink task is put(), which receives the batch of records delivered to the task.
The source code of Confluent's AWS S3 sink connector (the kafka-connect-storage-cloud project) is a useful reference for a complete storage sink.
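A minimal sketch of such a custom task, assuming the record values arrive as JSON strings with an "ID" field and leaving the actual GCS write to a hypothetical uploadToGcs() helper:

import java.util.Collection;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class IdPartitioningGcsSinkTask extends SinkTask {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void start(Map<String, String> props) {
        // Read the bucket name, credentials path, etc. from the connector config here.
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                // Assumes the value is a JSON string; with a different converter
                // it may already be a Struct or a Map.
                JsonNode json = mapper.readTree(record.value().toString());
                String id = json.path("ID").asText("unknown");
                // One "directory" per ID: <id>/<topic>-<partition>-<offset>.json
                String objectName = id + "/" + record.topic() + "-"
                        + record.kafkaPartition() + "-" + record.kafkaOffset() + ".json";
                uploadToGcs(objectName, record.value().toString());
            } catch (Exception e) {
                throw new ConnectException("Failed to process record at offset "
                        + record.kafkaOffset(), e);
            }
        }
    }

    private void uploadToGcs(String objectName, String payload) {
        // Hypothetical helper: write the payload to the object name above
        // using the google-cloud-storage client.
    }

    @Override
    public void stop() {
        // Release any GCS client resources here.
    }

    @Override
    public String version() {
        return "0.0.1";
    }
}

You would also need a matching SinkConnector class that returns this task class and declares its configuration.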

How to integrate/connect an Apache NiFi ingestion processor to Apache Kafka?

I want to push data from one of the ingestion processors of Apache NiFi to Kafka and further on to HDFS for storage.
Is it possible to connect an ingestion processor of Apache NiFi with Kafka?
NiFi ships with several Kafka processors.
Just start typing Kafka into the search box when you add one. Use the version that matches your Kafka installation; for example, absolutely don't run the Kafka 0.8 processors (GetKafka and PutKafka) against a Kafka 0.10.x installation.
You'll need to set the bootstrap servers, of course, then whatever other producer properties you care about, such as the topic name.
Attach a ConsumeKafka processor to PutHDFS.
Sidenote: Kafka Connect HDFS uses purely Kafka-based API methods to ship data from Kafka to Hadoop. You don't require NiFi unless you're ingesting some other type of data.
You can use the PutKafka processor for pushing data from NiFi to Kafka. In the Add Processor dialog, type PutKafka to find the processor.
For HDFS, you can use the PutHDFS processor. You need the core-site.xml and hdfs-site.xml files to use PutHDFS. You can download the HDFS configuration files from the HDFS menu inside Ambari: click Actions and select Download Client Configs. Specify the file locations as a comma-separated list in the processor's configuration.

Apache Kafka topic data in HDFS format

I installed Kafka on Linux, created a topic, and published messages to it; the data is saved in the folder /tmp/kafka-logs/topicname-0. I checked and the local file system type is XFS. Is there any way Kafka can save data on an HDFS file system? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem; it cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is using the HDFS sink connector to push the data there (https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html); a minimal example configuration is sketched below.
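A minimal sketch of that configuration (the topic name and namenode URL are placeholders; the property names follow Confluent's kafka-connect-hdfs quickstart):

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Topic(s) to export to HDFS (placeholder)
topics=test_hdfs
# URL of the HDFS namenode (placeholder)
hdfs.url=hdfs://localhost:9000
# Number of records to accumulate before writing a file to HDFS
flush.size=3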

In which directory does Apache Kafka store the data on broker nodes?

I can see a property in config/server.properties called log.dir. Does this mean Kafka uses the same directory for storing both logs and data?
Kafka topics are "distributed and partitioned append-only logs". The parameter log.dir defines where topics (i.e., data) are stored.
It is not related to application/broker logging.
The default log.dir is /tmp/kafka-logs, which you may want to change in case your OS has a /tmp directory cleaner.
log.dir or log.dirs in config/server.properties specify the directories in which the log data is kept.
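For example, in config/server.properties (the directory paths below are just illustrative mount points):

# Comma-separated list of directories in which partition data is stored
# (log.dir is the single-directory variant; log.dirs takes precedence if both are set)
log.dirs=/data/kafka-1,/data/kafka-2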
The server log directory is kafka_base_dir/logs by default. You could modify it by specifying another directory for 'kafka.logs.dir' in log4j.properties.
log.dir in server.properties is the place where the Kafka broker will store the commit logs containing your data. Typically this will be your high-speed mounted disk for mission-critical use cases.
For application/broker logging you can use the general log4j configuration to send the event logging to a custom location. Below are the JVM options to do this:
-Dlog4j.configuration=file:<configuration file with log rolling, logging level etc.> and -Dkafka.logs.dir=<path to logs>
The directory locations of logs and data were well described by Mathias. However, that data is designed for internal processing by the Kafka engine; if you need to store and manipulate it elsewhere, you could use Kafka Connect. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.
It makes it simple to define connectors that move large amounts of data into and out of Kafka. Kafka Connect can ingest an entire database, making the data available for stream processing, or sink the data of a single topic (or multiple topics) to another system or database for further analysis.