Unable to create Kafka Producer object in IntelliJ - Scala

I am trying my hand at Kafka in IntelliJ using Spark and Scala. While creating the producer object I am unable to resolve the error. The code in the Scala object is given below:
import java.util.Properties
import org.apache.kafka.clients.producer._
import kafka.producer.KeyedMessage
import org.apache.spark._

object kafkaProducer {
  def main(args: Array[String]) {
    val topic = "jovis"
    val props = new Properties()
    props.put("metadata.broker.list", "localhost:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val config = new ProducerConfig(props)
    // Error in line below
    val producer = new Producer[String, String](config)
    val conf = new SparkConf().setAppName("Kafka").setMaster("local")
    //val ssc = new StreamingContext(conf, Seconds(10))
    val sc = new SparkContext(conf)
    val data = sc.textFile("/home/hdadmin/empname.txt")
    var i = 0
    while (i <= data.count) {
      data.collect().foreach(x => {
        println(x)
        producer.send(new KeyedMessage[String, String](topic, x))
        Thread.sleep(1000)
      })
    }
  }
}
Error Log:
constructor ProducerConfig in class ProducerConfig cannot be accessed in object kafkaProducer
val config = new ProducerConfig(props)
trait Producer is abstract; cannot be instantiated.
val producer = new Producer[String, String](config)
I have imported the dependency jars below:
http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.8.2.0/kafka-clients-0.8.2.0.jar
http://central.maven.org/maven2/org/apache/kafka/kafka_2.11/0.10.2.1/kafka_2.11-0.10.2.1.jar
Apart from that, I have started the ZooKeeper server as well.
Where am I going wrong?

Maybe this will help you:
what is the difference between kafka ProducerRecord and KeyedMessage
Please try the new producer API, e.g. with the sbt dependency "org.apache.kafka" %% "kafka" % "0.8.2.0":
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.clients.producer.KafkaProducer

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String](topic, key, value))
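For completeness, here is a minimal sketch of the producer built entirely on the new API. The topic "jovis" and the broker address localhost:9092 are taken from the question; the key/value strings are placeholders. Note that the new API uses bootstrap.servers and the serializer classes below instead of metadata.broker.list and serializer.class:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NewApiProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // New-API configuration keys
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("jovis", "someKey", "someValue"))
    producer.close()
  }
}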

Related

Getting "expected class or object definition" when running from the command line in Scala, but working fine from IntelliJ

I am trying to read Kafka messages from a producer, apply some basic filtering to them, and finally print the output at the consumer end. I am passing a first file as an argument, which contains a set of values, and a second file as the second argument, which contains the filter criteria.
When I run the same code from IntelliJ it works fine, but when I try "scalac" from the command line I get "expected class or object definition".
package kafka_db

object kafkap extends App {
  import org.apache.spark.SparkContext
  import org.apache.spark.SparkConf
  import java.util.Properties
  import scala.io.Source
  import org.apache.kafka.clients.producer._
  import kafkaProducer.kafkaProducerScala.producer

  val conf = new SparkConf().
    setMaster(args(0)).
    setAppName("kafkap")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  val topic = "kafkatopic"

  for (line2 <- Source.fromFile(args(2)).getLines) {
    val c = line2.toInt
    for (line <- Source.fromFile(args(1)).getLines) {
      val a = line.toInt
      val b = if (a > c) {
        var d = a
        val record = new ProducerRecord[String, String](topic, d.toString)
        producer.send(record)
      }
    }
  }
  producer.close()
}
While running we were missing the jars on the classpath. SBT can be another option for this solution.
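As a sketch of the SBT route, a minimal build.sbt along these lines would put the required jars on the compile and run classpath. The Scala, Spark and Kafka versions below are assumptions and should be aligned with your installation:
name := "kafkap"

scalaVersion := "2.11.12" // assumption; match your Spark build

// Versions below are assumptions; align them with your Spark/Kafka installation.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
)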

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error,
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    val zookeeperConnect = "localhost:2181"
    val sessionTimeoutMs = 10000
    val connectionTimeoutMs = 10000
    val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)
    val topicName = "testTopic"
    val numPartitions = 8
    val replicationFactor = 1
    val topicConfig = new Properties
    val isSecureKafkaCluster = false
    val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
    AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)

    // Create producer for topic testTopic and actually push values to the topic
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9592")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "testTopic"
    for (i <- 1 to 50) {
      val record = new ProducerRecord(TOPIC, "key", s"hello $i")
      producer.send(record)
    }
    val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
    producer.send(record)
    producer.flush()
    producer.close()
  }
}
I know this is too late, but for others who are looking at the same issue:
In recent versions of Kafka, kafka.utils has been deprecated, so please use the Kafka AdminClient APIs instead.
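As an illustration, a topic like testTopic in the question could be created with the AdminClient instead of ZkUtils/AdminUtils. This is a minimal sketch, assuming a reasonably recent kafka-clients (0.11+) and a broker on localhost:9092; partition count and replication factor are taken from the question:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicWithAdminClient {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val admin = AdminClient.create(props)
    // 8 partitions and a replication factor of 1, matching the question's values
    val topic = new NewTopic("testTopic", 8, 1.toShort)
    admin.createTopics(Collections.singleton(topic)).all().get()
    admin.close()
  }
}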

Kafka and Spark Streaming Simple Producer Consumer

I do not know why the data sent by the producer does not reach the consumer.
I am working on a Cloudera virtual machine.
I am trying to write a simple producer and consumer, where the producer uses Kafka and the consumer uses Spark Streaming.
The producer code in Scala:
import java.util.Properties
import org.apache.kafka.clients.producer._

object kafkaProducer {
  def main(args: Array[String]) {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "test"
    for (i <- 1 to 50) {
      Thread.sleep(1000) // every 1 second
      val record = new ProducerRecord(TOPIC, generator.getID().toString(), generator.getRandomValue().toString())
      producer.send(record)
    }
    producer.close()
  }
}
The consumer code in Scala:
import java.util
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._
import java.util.Properties
import kafka.producer._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._

object kafkaConsumer {
  def main(args: Array[String]) {
    var totalCount = 0L
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("AnyName").set("spark.driver.host", "localhost")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val stream = KafkaUtils.createStream(ssc, "localhost:9092", "spark-streaming-consumer-group", Map("test" -> 1))
    stream.foreachRDD((rdd: RDD[_], time: Time) => {
      val count = rdd.count()
      println("\n-------------------")
      println("Time: " + time)
      println("-------------------")
      println("Received " + count + " events\n")
      totalCount += count
    })

    ssc.start()
    Thread.sleep(20 * 1000)
    ssc.stop()
    if (totalCount > 0) {
      println("PASSED")
    } else {
      println("FAILED")
    }
  }
}
The problem is resolved by changing the following line in the consumer code:
val stream = KafkaUtils.createStream(ssc, "localhost:9092", "spark-streaming-consumer-group", Map("test" -> 1))
The second parameter should be the ZooKeeper address (port 2181, not 9092); ZooKeeper will then resolve the connection to the Kafka broker on port 9092 automatically.
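For reference, the corrected call would look like this (assuming ZooKeeper is running on its default port 2181):
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("test" -> 1))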
Note: Kafka should be started from the terminal before running both the producer and the consumer.

Reuse Kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that sources data from Kafka and does some transformations (on each message) before inserting the data into MongoDB. We have a middleware application that pushes the messages (in bulk) into Kafka and waits for an acknowledgement (for each message) from the Spark Streaming application. If the acknowledgement is not received by the middleware within a certain period of time (5 seconds) after sending a message into Kafka, the middleware application re-sends the message. The Spark Streaming application is able to receive around 50-100 messages (in one batch) and send acknowledgements for all of them in under 5 seconds. But if the middleware application pushes over 100 messages, the middleware ends up re-sending messages because Spark Streaming is too slow to send the acknowledgements. In our current implementation, we create the producer each time we want to send an acknowledgement, which takes 3-4 seconds.
package com.testing

import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats

object Sample_Streaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      val jsonInput: JValue = parse(x)
      /* Do all the transformations using Json libraries */
      val json4s_transformed = "transformed json"
      val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
      val df = sqlContext.read.schema(outSchema).json(rdd)
      df.write.option("spark.mongodb.output.uri", "connectionURI")
        .option("collection", "Collection")
        .mode("append").format("com.mongodb.spark.sql").save()

      val producer = new KafkaProducer[String, String](props)
      val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
      producer.send(message)
      producer.close()
    })

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
So we tried another approach: creating the producer outside of the foreachRDD and reusing it for the entire batch interval (following is the code). This seems to have helped, as we are not creating the producer each time we want to send an acknowledgement. But for some reason, when we monitor the application on the Spark UI, the streaming application's memory consumption increases steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors that get initiated by YARN.
object Sample_Streaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema

    val producer = new KafkaProducer[String, String](props)
    KafkaDstream.foreachRDD(rdd =>
      rdd.collect().map { x =>
        val jsonInput: JValue = parse(x)
        /* Do all the transformations using Json libraries */
        val json4s_transformed = "transformed json"
        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)
        df.write.option("spark.mongodb.output.uri", "connectionURI")
          .option("collection", "Collection")
          .mode("append").format("com.mongodb.spark.sql").save()

        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
        producer.send(message)
        producer.close()
      }
    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
My questions are:
How do I monitor the Spark application's memory consumption? Currently we manually monitor the application every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).
What are the best practices followed in the industry when using Spark Streaming and Kafka?
Kafka is a broker: it gives you delivery guarantees for the producer and the consumer. It's overkill to implement an 'over the top' acknowledgement mechanism between the producer and the consumer. Ensure that the producer behaves correctly and that the consumer can recover in case of failure, and end-to-end delivery will be ensured.
Regarding the job, it's no wonder its performance is poor: the processing is being done sequentially, element by element, up to the point of the write to the external DB. This is plain wrong and should be addressed before attempting to fix any memory consumption issues.
This process could be improved like this:
val producer = // create producer

val jsonDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    val json = parse(elem)
    render(doAllTransformations(json)) // output should be a String-formatted JSON object
  }
}

jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = // create message
  producer.send(msg)
  producer.flush() // force send. *DO NOT close*, otherwise it will not be able to send any more messages
}
This process could be improved further if we could replace all the string-centric JSON transformations with case class instances.
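As a rough sketch of that last idea, the stream could carry a hypothetical case class (Event below is made up for illustration; the real fields depend on your JSON payload) instead of JSON strings, reusing the json4s imports and sqlContext from the code above:
// Hypothetical message model for illustration only.
case class Event(id: String, value: Double)

implicit val formats: Formats = DefaultFormats   // json4s formats needed for extract

val eventDStream = kafkaDstream.map { elem =>
  parse(elem).extract[Event]                     // parse each message once into a typed object
}

eventDStream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF()                            // schema is derived from the case class
  df.write.option("spark.mongodb.output.uri", "connectionURI")
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
}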

Can anyone share a Flink Kafka example in Scala?

Can anyone share a working example of Flink Kafka (mainly receiving messages from Kafka) in Scala? I know there is a KafkaWordCount example in Spark. I just need to print out Kafka messages in Flink. It would be really helpful.
The following code shows how to read from a Kafka topic using Flink's Scala DataStream API:
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object Main {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("zookeeper.connect", "localhost:2181")
    properties.setProperty("group.id", "test")
    val stream = env
      .addSource(new FlinkKafkaConsumer082[String]("topic", new SimpleStringSchema(), properties))
      .print

    env.execute("Flink Kafka Example")
  }
}
To complement what Robert added about reading, below is a piece of application code for sending messages to the Kafka topic.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaProducer {
  def main(args: Array[String]): Unit = {
    KafkaProducer.sendMessageToKafkaTopic("localhost:9092", "topic_name")
  }

  def sendMessageToKafkaTopic(server: String, topic: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", server)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val record = new ProducerRecord[String, String](topic, "HELLO WORLD!")
    producer.send(record)
    producer.close()
  }
}