I am writing a Kafka producer.
It has to read data from a local Linux folder and write to my topic.
Is it possible to do something like that?
What would my code snippet be here (in Scala)?
Business case:
Real-time data will be written to a local Linux folder in the form of CSV files, here: /data/data01/pharma/2017/
How can I move this data to a topic I created?
My consumer will read this data and add it to a Spark Streaming DataFrame for processing.
Real-time data will be written to a local Linux folder
There are many frameworks that allow you to handle this.
Those I'm aware of with Kafka connections:
Filebeat
FluentD / Fluentbit
Spark Streaming (or SparkSQL / Structured Streaming)
Flume
Apache NiFi (better to run as a cluster, though, not locally)
Kafka Connect with a FileStreamConnector, which is included with Apache Kafka (you don't need Confluent Platform)
The point being: don't reinvent the wheel, which bears the risk of writing unnecessary (and possibly faulty) code; that said, you could easily write your own KafkaProducer code to do this.
If you want to read a single file, then
cat ${file} | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic
If the files are created dynamically, then you need to monitor them and feed them to kafka-console-producer.sh.
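If you do decide to roll your own producer, a minimal Scala sketch could look like the following. This reads every CSV file in the folder once and sends each line as one record; the broker address, the topic name `pharma`, and having `kafka-clients` on the classpath are all assumptions.

```scala
import java.io.File
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CsvFileProducer {
  // Helper: select CSV files only
  def isCsv(name: String): Boolean = name.toLowerCase.endsWith(".csv")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumption: local broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val topic = "pharma" // hypothetical topic name

    // Send each line of each CSV file in the folder as one record
    val dir = new File("/data/data01/pharma/2017/")
    for (file <- dir.listFiles().filter(f => isCsv(f.getName))) {
      val source = Source.fromFile(file)
      try source.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String](topic, line))
      } finally source.close()
    }
    producer.flush()
    producer.close()
  }
}
```

Note this reads existing files once; it does not watch the folder for newly arriving files, which is exactly the part the frameworks listed earlier handle for you.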
I am working on an existing project and using Kafka to fetch some data from a DB (for generating reports). I have some questions, which may sound silly to many (sorry for that). These are the steps I have performed so far:
Installed Confluent
Started Zookeeper, Kafka and Schema Registry
I downloaded the MySQL connector JARs and copied them to kafka-connect-jdbc
Then I made a MySQL properties file with the connection URL, topic prefix, etc.
Then I started the MySQL connector using this command:
bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/source-quickstart-mysql.properties
After this, if I run the Avro consumer in the terminal using this command:
bin/kafka-avro-console-consumer --topic mysql-01 --bootstrap-server localhost:9092 --from-beginning
it gives the data successfully.
Now, the problem and confusion:
I want to get the same data using Spring Boot.
I am writing code only for the consumer. Do I need a producer here? (As far as my understanding goes, I already have the data in my topic; I just need to fetch it.)
I have made an Avro schema for it as well. It gets deserialized, but I don't get the data.
The data which get printed in terminal is:
{"cust_code":27,"cust_description":{"string":"Endline survey completed"}}
The data in Spring Boot console is:
{"cust_code": "cust_description":{"string":""}}
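(You don't need a producer for this; the connector already wrote the data.) A common cause of empty fields like this is the Spring Boot consumer not being configured with Confluent's Avro deserializer and Schema Registry, so the Avro bytes never map onto your generated class. A sketch of the relevant consumer configuration, assuming Schema Registry on localhost:8081 and a generated specific-record class:

```properties
# application.properties (sketch; adjust to your environment)
spring.kafka.bootstrap-servers=localhost:9092
spring.kafka.consumer.group-id=mysql-report-consumer
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
spring.kafka.consumer.properties.schema.registry.url=http://localhost:8081
# Deserialize into the generated Avro class instead of a GenericRecord
spring.kafka.consumer.properties.specific.avro.reader=true
```

Without `specific.avro.reader=true`, the deserializer returns a `GenericRecord`, which won't populate the fields of a typed listener argument.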
For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory? Since the connectors then organize all files in HDFS by topic, I don't think this should be an issue but I'm wondering if anyone has experience with this setup.
Basic example:
Connector-1 config
name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput
Connector-2 config
name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput
distributed doesn't work for our current setup
You should be able to run connect-distributed on the exact same nodes where connect-standalone is run.
We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted
Yeah, I would suggest not bundling all topics into one connector.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?
That is my personal recommendation, and yes, they can, because the HDFS path is named by the topic name, further split by the partitioning scheme.
Note: the same also applies to all other storage connectors (S3 & GCS).
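To illustrate why the writes don't collide: with the default partitioner, each connector ends up writing under its own topic subdirectory. The paths below are illustrative of the default layout (`topics.dir` defaults to `topics`; filenames encode topic, partition, and the start/end offsets of the batch):

```
hdfs://kafkaOutput/topics/topic1/partition=0/topic1+0+0000000000+0000000999.avro
hdfs://kafkaOutput/topics/topic2/partition=0/topic2+0+0000000000+0000000999.avro
```

So connect-1 and connect-2 never touch the same files even though they share `hdfs.url`.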
I want to push data from one of the ingestion processors of Apache NiFi to Kafka, and further to HDFS for storage.
Is it possible to connect an ingestion processor of Apache NiFi with Kafka?
NiFi ships with several Kafka processors.
Just start typing "Kafka" into the search box when you add one. Use the version that matches your Kafka installation; for example, absolutely don't use the Kafka 0.8 processors (called GetKafka & PutKafka) with a Kafka 0.10.x installation.
You'll need to set the bootstrap servers, of course, then whatever other producer properties you care about, like the topic name.
Attach a ConsumeKafka processor to PutHDFS.
Side note: Kafka Connect HDFS uses purely Kafka-based API methods to ship data from Kafka to Hadoop. You don't require NiFi unless you're ingesting some other type of data.
You can use the PutKafka processor to push data from NiFi to Kafka. In the Add Processor dialog, type "PutKafka" to find the processor.
For HDFS, you can use the PutHDFS processor. You need the core-site.xml and hdfs-site.xml files to use the PutHDFS processor. You can download the HDFS configuration files from the HDFS menu inside Ambari: in the HDFS menu, click Actions and select Download Client Configs. Specify the file locations as a comma-separated list.
I currently have Kafka installed on Linux; I created a topic and published messages to it, and the data is saved in the folder /tmp/kafka-logs/topicname-0. I checked that the local filesystem type is xfs. Is there any way Kafka can save its data on the HDFS filesystem instead? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem; it cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is using a connector to push the data to HDFS: https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
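As a sketch, a minimal standalone configuration for that HDFS sink connector might look like the following (the topic name, NameNode address, and flush size are all illustrative):

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=topicname
hdfs.url=hdfs://namenode:8020
flush.size=1000
```

Kafka itself keeps writing its logs to the local filesystem; the connector continuously copies committed records into HDFS.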
I have experimented with the basic examples of publishing random messages from producer to consumer on the command line.
Now I want to publish all 1 GB of data present on my local machine. For that, I am struggling to load the 1 GB of data into the producer.
Please help me out.
You can dump a file to a Kafka topic with simple shell redirection. Assuming 1.xml is the 1 GB file, you can use the following command:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 < ./1.xml
But please make sure that you set the following properties in the broker configuration:
socket.request.max.bytes, socket.receive.buffer.bytes, socket.send.buffer.bytes.
You need to set max.message.bytes for the test123 topic if your message size is big.
Also, change the -Xmx parameter in kafka-console-producer.sh to avoid out-of-memory issues.
These are the general steps to load data into Kafka.
We will be able to help more if you provide the error.
So a couple of approaches can help:
1) You can use big data platforms like Flume, which are built for such use cases.
2) If you want to implement your own code, then you can use the Apache Commons IO library, which helps you capture events when a new file arrives in a folder (monitoring events happening inside a directory); once you have that, you can call the code that publishes the data to Kafka.
3) In our project we use the Logstash API to do the same: it fetches from a folder, publishes the data from the files to Kafka, and then processes it through Storm.
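For option 2, the directory-monitoring part can also be done with the JDK's own WatchService, with no extra library. A sketch in Scala, where the `publish` callback is a placeholder for your own KafkaProducer call (the watched path and the println are illustrative):

```scala
import java.nio.file.{FileSystems, Files, Path, Paths, StandardWatchEventKinds}
import scala.jdk.CollectionConverters._

object DirectoryWatcher {
  // Hand every line of one file to the publish callback
  def publishFile(file: Path, publish: String => Unit): Unit =
    Files.readAllLines(file).asScala.foreach(publish)

  // Block forever, publishing each newly created file in `dir`
  def watch(dir: Path, publish: String => Unit): Unit = {
    val watcher = FileSystems.getDefault.newWatchService()
    dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE)
    while (true) {
      val key = watcher.take() // blocks until an event arrives
      for (event <- key.pollEvents().asScala) {
        val created = dir.resolve(event.context().asInstanceOf[Path])
        publishFile(created, publish)
      }
      key.reset()
    }
  }

  def main(args: Array[String]): Unit =
    watch(Paths.get("/data/incoming"), line => println(s"would publish: $line"))
}
```

In a real pipeline, `publish` would wrap `producer.send(new ProducerRecord(topic, line))`; keeping it a function parameter keeps the file-watching logic testable on its own.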