Load a 1GB file into a Kafka producer directly from my local machine

I have experimented with the basic examples of publishing random messages from a producer to a consumer on the command line.
Now I want to publish the full 1GB of data present on my local machine. For that, I am struggling to load that 1GB of data into the producer.
Please help me out.

You can simply dump a file to a Kafka topic with shell redirection. Assuming 1.xml is the 1GB file, you can use the following command.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 < ./1.xml
But please make sure that the following broker properties (in server.properties) are set large enough:
socket.request.max.bytes, socket.receive.buffer.bytes, socket.send.buffer.bytes.
You also need to set max.message.bytes for the test123 topic if your individual messages are large.
Also increase the -Xmx value used by kafka-console-producer.sh (or set KAFKA_HEAP_OPTS) to avoid OutOfMemory issues.
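For reference, a hedged sketch with the stock CLI tools (the 10 MB value is just an example, and newer Kafka versions use --bootstrap-server instead of --zookeeper for kafka-configs.sh):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name test123 --add-config max.message.bytes=10485760
export KAFKA_HEAP_OPTS="-Xmx1G"
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 --producer-property max.request.size=10485760 < ./1.xml
Keep in mind that the console producer sends one message per input line, so a 1GB file that is a single line would still exceed any reasonable message limit.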
These are the general steps to load data into Kafka.
We will be able to help more if you provide the exact error.
A couple of approaches can help:
1) You can use big data platforms like Apache Flume, which are built for exactly such use cases.
2) If you want to implement your own code, you can use the Apache Commons (commons-io) library, which helps you capture events when a new file arrives in a folder (it can monitor a directory for changes); once you have that, you can call the code which publishes the data to Kafka (see the sketch after this list).
3) In our project we use the Logstash API to do the same: it fetches from a folder, publishes the data from the file to Kafka, and then processes it through Storm.
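For option 2, a minimal Scala sketch, assuming the commons-io directory monitor and the plain Kafka producer client; the folder path, topic name, and poll interval are placeholders:

import java.io.File
import java.util.Properties
import org.apache.commons.io.monitor.{FileAlterationListenerAdaptor, FileAlterationMonitor, FileAlterationObserver}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object FolderToKafka extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  // Watch the folder and publish every line of each newly created file as one record
  val observer = new FileAlterationObserver(new File("/path/to/watch"))
  observer.addListener(new FileAlterationListenerAdaptor {
    override def onFileCreate(file: File): Unit = {
      val source = Source.fromFile(file)
      try source.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("my-topic", file.getName, line))
      } finally source.close()
    }
  })

  new FileAlterationMonitor(5000, observer).start() // poll the folder every 5 seconds
}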


Kafka file stream connect and stream API

I am working on the FileStream connector. I have more than ten million records in the file (it's not a single file; it's partitioned by account #). I have to load these files into the topic and update my streams. I have gone through standalone streams, and I have the following questions and need help to achieve this.
Looking at the data set, I have two account #s and each account has 5 rows; I would need to group them into two rows, keyed by acctNbr.
How do I write my source connector to read the file and apply the grouping logic?
My brokers are running on Linux machines X, Y, Z. After developing the source connector, should my jar file be deployed to every broker (if I start running it in distributed mode)?
I have only a 30-minute window to extract the file drop to the topic. What parameters are there to tune the logic to get my work done within that window? FYI, this topic will have more than 50 partitions and a 3-broker setup.
Data set:
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
how to write my source connector to read the file and get the grouping logic
The FileStream connector cannot do this, and was not intended for such a purpose; it is only an example for writing your own connectors. In other words, do not use it in production.
That being said, you can use alternative solutions like Flume, Filebeat, Fluentd, NiFi, StreamSets, etc., to glob your file paths and then send all records line by line into a Kafka topic.
post-development of source connector, my jar file should it deploy in every broker
You should not run Connect on any broker. The Connect servers are called workers.
have only 30 mins window to extract file drop to the topic?
Not clear where this number came from. Any of the methods listed above watch for new files without any defined window.

Confusion regarding Producers and Consumers in Kafka

I am working on an existing project and using Kafka to fetch some data from a DB (for generating reports). I have some questions, which may sound silly to many (sorry for that). I am listing the steps which I have performed till now.
Installed Confluent
Started Zookeeper, Kafka and Schema Registry
I downloaded the MySQL connector jars and copied them to kafka-connect-jdbc
Then I made a MySQL properties file with the connection URL, topic prefix, etc.
Then I started the MySQL connector using this command
bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/source-quickstart-mysql.properties
After this, if I run the Avro consumer command in a terminal using this command
bin/kafka-avro-console-consumer --topic mysql-01 --bootstrap-server localhost:9092 --from-beginning
it gives the data successfully.
Now, the problem and confusion.
I want to get the same data by using Spring Boot.
I am writing code only for the consumer. Do I need a producer here? (As far as I understand, I already have the data in my topic; I just need to fetch it.)
I have made an Avro schema for it as well. It gets deserialized, but I don't get the data.
The data which gets printed in the terminal is:
{"cust_code":27,"cust_description":{"string":"Endline survey completed"}}
The data in the Spring Boot console is:
{"cust_code": "cust_description":{"string":""}}

Is it possible to log all incoming messages in Apache Kafka

I need to know if it is possible to configure logging for an Apache Kafka broker so that it writes all produced/consumed topics and their messages.
I've been looking at log4j.properties, but none of the suggested properties seems to do what I need.
Thanks in advance.
Looking at the log files generated by Kafka, none of them seems to contain the messages written to the different topics.
UPDATE:
Not exactly what I was looking for, but for anyone looking for something similar, I found https://github.com/kafka-lens/kafka-lens, which provides a friendly GUI to view the messages on different topics.
I feel like there's some confusion with the word "log".
As you're talking about log4j, I assume you mean what I'd call "application logs". Kafka does not write the records it handles to its application/log4j logs. In Kafka, the log4j logs are only used to trace errors and give some context about the work the brokers are doing.
On the other hand, Kafka writes/reads records into/from its "log", the Kafka log. These are stored in the path specified by log.dirs (/tmp/kafka-logs by default) and are not directly readable. You can use the DumpLogSegments tool to read these files if you want, for example:
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --print-data-log \
  --files /tmp/kafka-logs/topic-0/00000000000000000000.log

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro, and the schema is stored in the Schema Registry.
We tried the following strategies:
Using kafka-console-consumer with a StringDeserializer or BinaryDeserializer: we were unable to obtain a file which we could parse in Java; we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON which also includes some raw bytes, for example when deserializing a BigDecimal. We didn't even know which parsing option to choose (it is not Avro, and it is not plain JSON).
Other unsuitable strategies:
Deploying a dedicated Kafka consumer would require us to package and place that code on some production server, since we are talking about our production cluster. It just takes too long. After all, isn't the Kafka console consumer already a consumer with configurable options?
Potentially suitable strategies:
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offset, since apparently the consumer created by the connector stays active even when we delete the sink.
Isn't there a simple, easy way to dump the content of the values (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect its output to a file (> topic.txt) to save it.
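For example, a hedged sketch (the topic name and registry URL are placeholders):
bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic my-avro-topic --from-beginning --property schema.registry.url=http://localhost:8081 > topic.txt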
If you did use the plain console consumer, you can't parse the Avro immediately, because you still need to extract the schema ID from the data (the 4 bytes after the first "magic byte"), then use the Schema Registry client to retrieve the schema, and only then can you deserialize the messages. Any Avro library you use to read the file as the console consumer writes it expects one entire schema at the header of the file, not just an ID pointing into the registry on every line. (The basic Avro library doesn't know anything about the registry either.)
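To illustrate, a minimal Scala sketch that pulls the schema ID out of one raw record value, assuming the Confluent wire format described above (magic byte 0, a 4-byte schema ID, then the Avro payload):

import java.nio.ByteBuffer

object WireFormat {
  // value: the raw bytes of a single record value, e.g. obtained with a ByteArrayDeserializer
  def schemaIdOf(value: Array[Byte]): Int = {
    val buf = ByteBuffer.wrap(value)
    val magic = buf.get()            // first byte: the "magic byte", 0 in the Confluent wire format
    require(magic == 0, s"unexpected magic byte: $magic")
    val schemaId = buf.getInt()      // next 4 bytes: the Schema Registry schema ID
    // whatever remains in buf is the Avro-encoded value; it can only be decoded
    // after fetching the writer schema for this ID from the registry
    schemaId
  }
}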
The only things configurable about the console consumer are the formatter and the registry. You can add decoders by additionally exporting them onto the CLASSPATH.
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
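A minimal sketch of such a consumer (in Scala, since the question mentions Java/Scala), assuming the Confluent KafkaAvroDeserializer is on the classpath; the topic, group id, registry URL, and output file are placeholders:

import java.io.PrintWriter
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object DumpTopicToFile extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "avro-dump")
  props.put("auto.offset.reset", "earliest")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer")
  props.put("schema.registry.url", "http://localhost:8081")

  val consumer = new KafkaConsumer[String, GenericRecord](props)
  consumer.subscribe(Collections.singletonList("my-avro-topic"))

  val out = new PrintWriter("topic-dump.json")
  try {
    // Poll a few times and write each record value in its Avro JSON representation
    for (_ <- 1 to 10; record <- consumer.poll(Duration.ofSeconds(1)).asScala)
      out.println(record.value().toString)
  } finally {
    out.close()
    consumer.close()
  }
}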
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect sink connectors.
Included ones are for Hadoop (HDFS), S3, Elasticsearch, and JDBC. I think there's a FileSink connector as well.
We didn't find a simple way to reset the consumer offset
The connector name controls whether a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest the standalone connector, where you can set the offset.storage.file.filename property to control how the offsets are stored.
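For example, a hedged sketch of the relevant lines in a standalone worker's properties file (the offsets path and registry URL are placeholders):
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
offset.storage.file.filename=/tmp/connect.offsets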
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, have you seen that Kafka 0.11 lets you reset consumer offsets?
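That tooling looks roughly like this (the group and topic names are placeholders, and the group must be inactive when you run it):
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --topic my-avro-topic --reset-offsets --to-earliest --execute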
Alternative options include Apache NiFi or StreamSets; both integrate with the Schema Registry and can parse Avro data to transport it to numerous systems.
One option to consider, along with cricket_007's, is to simply replicate the data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.

Kafka producer to read from a local Linux folder

I am writing a Kafka producer.
It has to read data from a local Linux folder and write to my topic.
Is it possible to do something like that?
What would my code snippet be here (in Scala)?
Business case:
Real-time data will be written to a local Linux folder in the form of CSV files, here: /data/data01/pharma/2017/
How can I move this data to a topic I created?
My consumer will read this data and add it to a Spark Streaming DataFrame for processing.
Real time data will be written on a local linux folder
There are many frameworks that allow you to handle this.
Those I'm aware of with Kafka connections:
Filebeat
FluentD / Fluentbit
Spark Streaming (or SparkSQL / Structured Streaming)
Flume
Apache NiFi (better to run as a cluster, though, not locally)
Kafka Connect with a FileStreamConnector which is included with Apache Kafka (don't need Confluent Platform)
The point being: don't reinvent the wheel, which bears the risk of writing unnecessary (and possibly faulty) code, although you could easily write your own KafkaProducer code to do this (see the sketch at the end of this answer).
If you want to read a single file, then:
cat ${file} | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic
If the files are created dynamically, then you need to monitor them and feed them to kafka-console-producer.sh.
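If you do decide to write your own producer, here is a minimal Scala sketch that reads every CSV file in the folder from the question and sends each line as one record; the topic name and broker address are placeholders:

import java.io.File
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object CsvFolderProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  val folder = new File("/data/data01/pharma/2017/")
  val csvFiles = Option(folder.listFiles()).getOrElse(Array.empty[File]).filter(_.getName.endsWith(".csv"))

  for (file <- csvFiles) {
    val source = Source.fromFile(file)
    // one CSV line becomes one Kafka record, keyed by the file name
    try source.getLines().foreach { line =>
      producer.send(new ProducerRecord[String, String]("pharma-topic", file.getName, line))
    } finally source.close()
  }

  producer.flush()
  producer.close()
}

This only reads files that are already present; picking up files as they arrive would additionally need some form of directory watching.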