Kafka and Spark Streaming Simple Producer Consumer - scala

I do not know why the data sent by the producer does not reach the consumer.
I am working on the Cloudera virtual machine.
I am trying to write a simple producer and consumer, where the producer uses Kafka and the consumer uses Spark Streaming.
The Producer Code in scala:
import java.util.Properties
import org.apache.kafka.clients.producer._
object kafkaProducer {
  def main(args: Array[String]) {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "test"
    for (i <- 1 to 50) {
      Thread.sleep(1000) // every 1 second
      val record = new ProducerRecord(TOPIC, generator.getID().toString(), generator.getRandomValue().toString())
      producer.send(record)
    }
    producer.close()
  }
}
The Consumer Code in scala :
import java.util
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._
import java.util.Properties
import kafka.producer._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
object kafkaConsumer {
  def main(args: Array[String]) {
    var totalCount = 0L
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("AnyName").set("spark.driver.host", "localhost")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")
    val stream = KafkaUtils.createStream(ssc, "localhost:9092", "spark-streaming-consumer-group", Map("test" -> 1))
    stream.foreachRDD((rdd: RDD[_], time: Time) => {
      val count = rdd.count()
      println("\n-------------------")
      println("Time: " + time)
      println("-------------------")
      println("Received " + count + " events\n")
      totalCount += count
    })
    ssc.start()
    Thread.sleep(20 * 1000)
    ssc.stop()
    if (totalCount > 0) {
      println("PASSED")
    } else {
      println("FAILED")
    }
  }
}

The problem was resolved by changing this line in the consumer code:
val stream = KafkaUtils.createStream(ssc, "localhost:9092", "spark-streaming-consumer-group", Map("test" -> 1))
The second parameter should be the ZooKeeper address (port 2181), not the Kafka broker address (port 9092). The consumer then discovers the Kafka broker on port 9092 through ZooKeeper automatically.
Note: Kafka should be started from a terminal before running the producer and the consumer.
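For reference, here is a minimal sketch of the corrected consumer setup. It assumes ZooKeeper listens on localhost:2181 and that the receiver-based spark-streaming-kafka (0.8) connector used in the question is on the classpath. The object name and the local[2] master are my own assumptions: Spark's documentation recommends more local threads than receivers, because a receiver occupies one thread for itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object FixedKafkaConsumer { // hypothetical name, not from the question
  // Assumed address: ZooKeeper on its default port 2181 (not the broker port 9092)
  val zkQuorum = "localhost:2181"
  val groupId = "spark-streaming-consumer-group"
  // Topic name -> number of receiver threads for that topic
  val topics = Map("test" -> 1)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption: one thread for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("AnyName")
    val ssc = new StreamingContext(conf, Seconds(2))
    // The second argument is the ZooKeeper quorum, not the Kafka broker list
    val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
    stream.map(_._2).print() // print the message values of each batch
    ssc.start()
    ssc.awaitTerminationOrTimeout(20 * 1000)
    ssc.stop()
  }
}
```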

Related

How to store data into HDFS using spark streaming

I want to store streaming data in HDFS. This is Spark Streaming code that captures data from a Kafka topic.
I tried this:
lines.saveAsHadoopFiles("hdfs://192.168.10.31:9000/user/spark/mystream/", "abc")
This is my code. Please let me know where to add the code that saves the data to HDFS, and how. I am receiving the output in the console and need to store it in HDFS.
Thanks in advance.
package com.spark.cons.conskafka
import java.util.HashMap
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import _root_.kafka.serializer.StringDecoder
object Consume {
def createContext(brokers: String, topics: String, checkpointDirectory: String): StreamingContext = {
println("Creating new context")
val conf = new SparkConf().setMaster("local[*]").setAppName("Spark Streaming - Kafka DirectReceiver - PopularHashTags").set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
// Set the Spark StreamingContext to create a DStream for every 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint(checkpointDirectory)
// Define the Kafka parameters, broker list must be specified
val kafkaParams = Map[String, String](
"metadata.broker.list" -> brokers,
// start from the largest offset, i.e. read only messages that arrive after the stream starts
"auto.offset.reset" -> "largest")
// Define which topics to read from
val topicsSet = topics.split(",").toSet
// Map value from the kafka message (k, v) pair
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Filter hashtags
val hashTags = lines.map(_._2).flatMap(_.split(" ")).filter(_.startsWith("#"))
// Get the top hashtags over the previous 60/10 sec window
val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
lines.print()
// Print popular hashtags
topCounts60.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})
topCounts10.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 10 seconds (%s total):".format(rdd.count()))
topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})
lines.count().map(cnt => "Received " + cnt + " kafka messages.").print()
ssc
}
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println(s"""
|Usage: KafkaDirectPopularHashTags <brokers> <topics> <checkpointDirectory>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
| <checkpointDirectory> the directory where the metadata is stored
|
""".stripMargin)
System.exit(1)
}
// Create an array of arguments: brokers, topicname, checkpoint directory
val Array(brokers, topics, checkpointDirectory) = args
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => createContext(brokers, topics, checkpointDirectory))
ssc.start()
ssc.awaitTermination()
}
}
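One possible approach, offered as a hedged sketch rather than a definitive answer: DStream has a saveAsTextFiles(prefix, suffix) output operation that writes one directory per batch, named prefix-TIME_IN_MS.suffix. The helper object below and the "batch" file prefix are my own assumptions; the HDFS URI is the one from the question.

```scala
import org.apache.spark.streaming.dstream.DStream

object HdfsSink { // hypothetical helper, not part of the question's code
  // HDFS location from the question; the trailing "batch" prefix is an assumption
  val outputPrefix = "hdfs://192.168.10.31:9000/user/spark/mystream/batch"

  // Drops the Kafka key and writes the message values of every batch
  // to a directory named "<outputPrefix>-<batch time in ms>.txt"
  def saveValues(lines: DStream[(String, String)]): Unit =
    lines.map(_._2).saveAsTextFiles(outputPrefix, "txt")
}
```

In the code above, HdfsSink.saveValues(lines) would be called inside createContext, after lines is defined and before ssc is returned, since output operations must be registered before the context starts.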

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error:
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
import spark.implicits._
val zookeeperConnect = "localhost:2181"
val sessionTimeoutMs = 10000
val connectionTimeoutMs = 10000
val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)
val topicName = "testTopic"
val numPartitions = 8
val replicationFactor = 1
val topicConfig = new Properties
val isSecureKafkaCluster = false
val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)
// Create producer for topic testTopic and actually push values to the topic
val props = new Properties()
props.put("bootstrap.servers", "localhost:9592")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val TOPIC = "testTopic"
for (i <- 1 to 50) {
val record = new ProducerRecord(TOPIC, "key", s"hello $i")
producer.send(record)
}
val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
producer.send(record)
producer.flush()
producer.close()
}
}
I know this is late, but for others looking into the same issue:
In recent versions of Kafka, kafka.utils has been deprecated, so please use the Kafka AdminClient API instead.
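As a rough sketch of what that could look like for the topic creation in the question, assuming kafka-clients 0.11 or later and a broker reachable at localhost:9092 (both are assumptions, as is the object name). The topic name, partition count and replication factor are taken from the question.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicSketch { // hypothetical name
  // The AdminClient connects to the brokers, not to ZooKeeper
  def adminProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    props
  }

  def main(args: Array[String]): Unit = {
    val admin = AdminClient.create(adminProps("localhost:9092")) // assumed broker address
    try {
      // Same values as in the question: 8 partitions, replication factor 1
      val topic = new NewTopic("testTopic", 8, 1.toShort)
      // Blocks until the brokers confirm the creation; throws if it fails,
      // for example when the topic already exists
      admin.createTopics(Collections.singletonList(topic)).all().get()
    } finally admin.close()
  }
}
```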

Spark w Kafka - can't get enough parallelization

I am running Spark with the local[8] configuration. The input is a Kafka stream with 8 brokers. But as the system monitor shows, the job is not parallel enough: it seems that only about one node is running at a time. The input to the Kafka stream is about 1.6 GB, so it should be processed much faster.
[Screenshot: system monitor]
Kafka Producer:
import java.io.{BufferedReader, FileReader}
import java.util
import java.util.{Collections, Properties}
import logparser.LogEvent
import org.apache.hadoop.conf.Configuration
import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
object sparkStreaming{
private val NUMBER_OF_LINES = 100000000
val brokers ="localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099"
val topicName = "log-1"
val fileName = "data/HDFS.log"
val producer = getProducer()
// no hdfs , read from text file.
def produce(): Unit = {
try { //1. Get the instance of Configuration
val configuration = new Configuration
val fr = new FileReader(fileName)
val br = new BufferedReader(fr)
var line = ""
line = br.readLine
var count = 1
//while (line != null){
while ( {
line != null && count < NUMBER_OF_LINES
}) {
System.out.println("Sending batch " + count + " " + line)
producer.send(new ProducerRecord[String, LogEvent](topicName, new LogEvent(count,line,System.currentTimeMillis())))
line = br.readLine
count = count + 1
}
producer.close()
System.out.println("Producer exited successfully for " + fileName)
} catch {
case e: Exception =>
System.out.println("Exception while producing for " + fileName)
System.out.println(e)
}
}
private def getProducer() : KafkaProducer[String,LogEvent] = { // create instance for properties to access producer configs
val props = new Properties
// Kafka broker addresses
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
// Require acknowledgement from all in-sync replicas for each request
props.put("acks", "all")
// If a request fails, the producer retries automatically
props.put("retries", "100")
// Maximum size in bytes of a batch of records sent to one partition
props.put("batch.size", "16384")
// Wait up to 1 ms for more records so they can be batched into fewer requests
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.LogEventSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, LogEvent](props)
producer
}
def sendBackToKafka(logEvent: LogEvent): Unit ={
producer.send(new ProducerRecord[String, LogEvent] ("times",logEvent))
}
def main (args: Array[String]): Unit = {
println("Starting to produce");
this.produce();
}
}
Consumer:
package logparser
import java.io._
import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
object consumer {
var tFromKafkaToSpark: Long = 0
var tParsing : Long = 0
val startTime = System.currentTimeMillis()
val CPUNumber = 8
val pw = new PrintWriter(new FileOutputStream("data/Streaming"+CPUNumber+"config2x.txt",false))
pw.write("Writing Started")
def printstarttime(): Unit ={
pw.print("StartTime : " + System.currentTimeMillis())
}
def printendtime(): Unit ={
pw.print("EndTime : " + System.currentTimeMillis());
}
val producer = getProducer()
private def getProducer() : KafkaProducer[String,TimeList] = { // create instance for properties to access producer configs
val props = new Properties
val brokers ="localhost:9090,"
// Kafka broker addresses
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
// Require acknowledgement from all in-sync replicas for each request
props.put("acks", "all")
// If a request fails, the producer retries automatically
props.put("retries", "100")
// Maximum size in bytes of a batch of records sent to one partition
props.put("batch.size", "16384")
// Wait up to 1 ms for more records so they can be batched into fewer requests
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.TimeListSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, TimeList](props)
producer
}
def sendBackToKafka(timeList: TimeList): Unit ={
producer.send(new ProducerRecord[String, TimeList]("times",timeList))
}
def main(args: Array[String]) {
val topics = "log-1"
//val Array(brokers, ) = Array("localhost:9092","log-1")
val brokers = "localhost:9092"
// Create context with 1 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[" + CPUNumber + "]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
var kafkaParams = Map[String, AnyRef]("metadata.broker.list" -> brokers)
kafkaParams = kafkaParams + ("bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099")
kafkaParams = kafkaParams + ("auto.offset.reset"-> "latest")
kafkaParams = kafkaParams + ("group.id" -> "test-consumer-group")
kafkaParams = kafkaParams + ("key.deserializer" -> classOf[StringDeserializer])
kafkaParams = kafkaParams + ("value.deserializer"-> "logparser.LogEventDeserializer")
//kafkaParams.put("zookeeper.connect", "192.168.101.165:2181");
kafkaParams = kafkaParams + ("enable.auto.commit"-> "true")
kafkaParams = kafkaParams + ("auto.commit.interval.ms"-> "1000")
kafkaParams = kafkaParams + ("session.timeout.ms"-> "20000")
kafkaParams = kafkaParams + ("metadata.max.age.ms"-> "1000")
val messages = KafkaUtils.createDirectStream[String, LogEvent](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, LogEvent](topicsSet, kafkaParams))
var started = false
val lines = messages.map(_.value)
val lineswTime = lines.map(event =>
{
event.addNextEventTime(System.currentTimeMillis())
event
}
)
lineswTime.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
val logLines = lineswTime.map(
(event) => {
//println(event.getLogline.stringMessages.toString)
event.setLogLine(event.getContent)
println("Got event with id = " + event.getId)
event.addNextEventTime(System.currentTimeMillis())
println(event.getLogline.stringMessages.toString)
event
}
)
//logLines.foreachRDD(a => a.foreach(e => println(e.getTimeList + e.getLogline.stringMessages.toString)))
val x = logLines.map(le => {
le.addNextEventTime(System.currentTimeMillis())
sendBackToKafka(new TimeList(le.getTimeList))
le
})
x.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
//logLines.map(ll => ll.addNextEventTime(System.currentTimeMillis()))
println("--------------***///*****-------------------")
//logLines.print(10)
/*
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
*/
// Start the computation
ssc.start()
ssc.awaitTermination()
ssc.stop(false)
pw.close()
}
}
There's a piece of information missing from your problem statement: how many partitions does your input topic log-1 have?
My guess is that the topic has fewer than 8 partitions.
The parallelism of Spark Streaming (in the case of a Kafka source) is tied (modulo repartitioning) to the total number of Kafka partitions it consumes (i.e. the RDDs' partitions are taken from the Kafka partitions).
If, as I suspect, your input topic has only a few partitions, then for each micro-batch Spark Streaming will assign the computation to only that many nodes. All the other nodes will sit idle.
The fact that you see all the nodes working in an almost round-robin fashion is because Spark does not always choose the same node to process the data of a given partition; it actively mixes things up.
To get a better idea of what is happening, I would need to see some statistics from the Streaming page of the Spark UI.
Given the information you have provided so far, however, insufficient Kafka partitioning would be my best bet for this behaviour.
Everything that consumes from Kafka is limited by the number of partitions your topic(s) have: one consumer per partition. How many do you have?
Although Spark can redistribute the work, this is not recommended, as you might spend more time exchanging data between executors than actually processing it.
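To make the arithmetic behind both answers concrete, here is a small illustrative model. It assumes one Spark task per Kafka partition and no repartitioning; the object and function names are made up for this sketch.

```scala
object ParallelismSketch { // hypothetical name
  // Upper bound on the tasks that can run at the same time for one micro-batch,
  // assuming one task per Kafka partition and no repartitioning
  def concurrentTasks(kafkaPartitions: Int, cores: Int): Int =
    math.min(kafkaPartitions, cores)

  def main(args: Array[String]): Unit = {
    val cores = 8 // matches local[8] in the question
    for (partitions <- Seq(1, 4, 8, 16)) {
      val busy = concurrentTasks(partitions, cores)
      println(s"partitions=$partitions -> at most $busy of $cores cores busy")
    }
  }
}
```

Under this model, a topic with a single partition keeps only one of the eight cores busy per batch, which would match the behaviour described in the question.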

Reuse kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that sources data from Kafka and applies some transformations to each message before inserting the data into MongoDB. We have a middleware application that pushes messages in bulk into Kafka and waits for an acknowledgement for each message from the Spark Streaming application. If the middleware does not receive the acknowledgement within a certain period (5 seconds) after sending the message into Kafka, it re-sends the message. The Spark Streaming application is able to receive around 50-100 messages in one batch and send acknowledgements for all of them in under 5 seconds. But if the middleware pushes more than 100 messages, it ends up re-sending messages because the Spark Streaming application is slow to send the acknowledgements. In our current implementation, we create the producer each time we want to send an acknowledgement, which takes 3-4 seconds.
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)
KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
{
val jsonInput: JValue = parse(x)
/*Do all the transformations using Json libraries*/
val json4s_transformed = "transformed json"
val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
val df = sqlContext.read.schema(outSchema).json(rdd)
df.write.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()
val producer = new KafkaProducer[String, String](props)
val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
producer.send(message)
producer.close()
}
}
)
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
So we tried another approach: creating the producer outside of foreachRDD and reusing it for the entire batch interval (code below). This seems to have helped, as we no longer create the producer each time we want to send an acknowledgement. But for some reason, when we monitor the application in the Spark UI, the streaming application's memory consumption increases steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors that YARN starts.
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
val producer = new KafkaProducer[String, String](props)
KafkaDstream.foreachRDD(rdd =>
{
rdd.collect().map ( x =>
{
val jsonInput: JValue = parse(x)
/*Do all the transformations using Json libraries*/
val json4s_transformed = "transformed json"
val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
val df = sqlContext.read.schema(outSchema).json(rdd)
df.write.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()
val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
producer.send(message)
producer.close()
}
)
}
)
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
My questions are:
How do I monitor the Spark application's memory consumption? Currently we are manually monitoring the application every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).
What are the industry best practices for using Spark Streaming with Kafka?
Kafka is a broker: it gives you delivery guarantees for both the producer and the consumer. It is overkill to implement an 'over the top' acknowledgement mechanism between the producer and the consumer. Ensure that the producer behaves correctly and that the consumer can recover in case of failure, and end-to-end delivery will be ensured.
Regarding the job, it is no wonder that its performance is poor: the processing is done sequentially, element by element, up to the point of the write to the external DB. This is plain wrong and should be addressed before attempting to fix any memory consumption issues.
This process could be improved like this:
// Create the producer once on the driver and reuse it for every batch
val producer = new KafkaProducer[String, String](props)
// Parse and transform the messages in parallel instead of collecting them to the driver
val jsonDStream = KafkaDstream.transform { rdd =>
  rdd.map { elem =>
    val json = parse(elem)
    // doAllTransformations stands for the JSON transformations; the output should be a String-formatted JSON object
    compact(render(doAllTransformations(json)))
  }
}
jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = new ProducerRecord[String, String]("topic_name", null, "message_received")
  producer.send(msg)
  producer.flush() // force send. *DO NOT close*, otherwise it will not be able to send any more messages
}
This process could be improved further if we could replace all the string-centric JSON transformation by case class instances.
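A minimal sketch of that last idea, using the json4s library the question already imports. The Event case class and its fields are purely hypothetical, since the real message layout is not shown in the question, and the transformation is only a stand-in.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical message shape; the real fields are not shown in the question
case class Event(id: Long, payload: String)

object CaseClassSketch { // hypothetical name
  implicit val formats: Formats = DefaultFormats

  // Parse the raw Kafka value straight into a typed instance
  def toEvent(raw: String): Event = parse(raw).extract[Event]

  // Stand-in for a typed transformation that replaces string manipulation
  def transform(e: Event): Event = e.copy(payload = e.payload.trim.toUpperCase)
}
```

With this in place, the stream could be mapped as KafkaDstream.map(CaseClassSketch.toEvent).map(CaseClassSketch.transform), so that later steps work on typed values rather than on JSON strings.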

Unable to create Kafka Producer Object in Intellij

I am trying out Kafka in IntelliJ using Spark and Scala. When creating the producer object I get errors that I am unable to fix. The code of the Scala object is given below:
import java.util.Properties
import org.apache.kafka.clients.producer._
import kafka.producer.KeyedMessage
import org.apache.spark._
object kafkaProducer {
def main(args: Array[String]){
val topic = "jovis"
val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")
val config = new ProducerConfig(props)
//Error in Line below
val producer = new Producer[String, String](config)
val conf = new SparkConf().setAppName("Kafka").setMaster("local")
//val ssc = new StreamingContext(conf, Seconds(10))
val sc = new SparkContext(conf)
val data = sc.textFile("/home/hdadmin/empname.txt")
var i = 0
while(i <= data.count){
data.collect().foreach(x => {
println(x)
producer.send(new KeyedMessage[String, String](topic, x))
Thread.sleep(1000)
})
}
Error Log:
constructor ProducerConfig in class ProducerConfig cannot be accessed in object kafkaProducer
val config = new ProducerConfig(props)
Trait Producer is abstract;Cannot be instantiated.
val producer = new Producer[String, String](config)
I have imported the dependency jars below:
http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.8.2.0/kafka-clients-0.8.2.0.jar
http://central.maven.org/maven2/org/apache/kafka/kafka_2.11/0.10.2.1/kafka_2.11-0.10.2.1.jar
Apart from that I have started zookeeper server as well.
Where am I going wrong?
Maybe this will help you:
What is the difference between Kafka ProducerRecord and KeyedMessage
Please try the new producer API, "org.apache.kafka" %% "kafka" % "0.8.2.0":
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.clients.producer.KafkaProducer
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String](topic, key, value))
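Note that the new producer is configured with bootstrap.servers, key.serializer and value.serializer rather than the metadata.broker.list and serializer.class properties used in the question. A fuller sketch under those settings follows; the object name, the broker address and the sample lines are assumptions, while the topic name comes from the question.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NewApiProducer { // hypothetical name
  // Configuration keys expected by the new (org.apache.kafka.clients) producer
  def producerProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }

  def main(args: Array[String]): Unit = {
    // Assumed broker address
    val producer = new KafkaProducer[String, String](producerProps("localhost:9092"))
    try {
      // Topic name from the question; the lines are placeholder values
      Seq("line 1", "line 2").foreach { line =>
        producer.send(new ProducerRecord[String, String]("jovis", line))
      }
    } finally producer.close() // close() waits for buffered records to be sent
  }
}
```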