Spark-Kafka Stream - Duplicate Kafka messages - Scala

I am using Spark DStreams to consume data from a Kafka topic, which has 5 partitions. Below is the code for the same:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SparkKafkaListener extends Serializable {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val kafkaConfig = Map[String, Object](<Kafka Configurations>)
    val sc = spark.sparkContext
    val topic = Array("topic")

    // Creating Streaming Context
    @transient
    val ssc = new StreamingContext(sc, Seconds(300))

    // Building Spark-Kafka DStreams object
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topic, kafkaConfig)
    )

    // Iterating over the stream of RDDs and performing some operation,
    // committing offsets at the end after storing the Kafka message and header
    stream.foreachRDD((rdd, batchTime) => {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(record => record.value()).saveAsTextFile("path")
      rdd.map(message => message.headers())
        .map(headers => headers.iterator)
        .map(it => it.next())
        .map(header => header.value().map(_.toChar).mkString)
        .saveAsTextFile("path")
      val commits = new offsetCommit(logger, util, props, batchID, x12Type)
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, commits)
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
After saving the messages from the Kafka topic, I try to commit the offset ranges via the commitAsync call. The issue is that the commit only takes effect on the following run, not the current one. For instance, the offsets of batch 1 are committed only after batch 2 finishes, instead of right after batch 1 completes. This lag is leading to duplicate data and sometimes data loss.
I'm not sure whether I'm missing some property that would commit the offsets of the respective batch immediately, without any lag. I would really appreciate any help with this.
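A minimal sketch of the commit pattern from the spark-streaming-kafka-0-10 integration guide, assuming the same stream and output paths as above. As far as I understand, commitAsync only queues the offset ranges and they are sent to the broker on a later poll (i.e. during a subsequent batch), which matches the lag described above, so the guide relies on idempotent output rather than on immediate commits:
// A minimal sketch of the documented commit pattern (no custom callback). commitAsync only
// queues the offsets; they are sent to the broker on a later poll (i.e. during a subsequent
// batch), so the output itself has to be idempotent to avoid duplicates on restart.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Write the records first (this should be idempotent, e.g. overwrite a batch-keyed path)
  rdd.map(_.value()).saveAsTextFile("path")

  // Then queue the commit; no custom OffsetCommitCallback is strictly required
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}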

Related

Reuse kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that sources data from Kafka and does some transformations (on each message) before inserting the data into MongoDB. We have a middleware application that pushes the messages (in bulk) into Kafka and waits for an acknowledgement (for each message) from the Spark Streaming application. If the middleware does not receive the acknowledgement within a certain period (5 seconds) after sending a message into Kafka, it re-sends the message. The Spark Streaming application is able to receive around 50-100 messages (in one batch) and send acknowledgements for all of them in under 5 seconds. But if the middleware pushes over 100 messages, the delay in Spark Streaming's acknowledgements causes the middleware to re-send messages. In our current implementation we create the producer each time we want to send an acknowledgement, which takes 3-4 seconds.
package com.testing

import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats

object Sample_Streaming {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      {
        val jsonInput: JValue = parse(x)

        /*Do all the transformations using Json libraries*/

        val json4s_transformed = "transformed json"
        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)
        df.write.option("spark.mongodb.output.uri", "connectionURI")
          .option("collection", "Collection")
          .mode("append").format("com.mongodb.spark.sql").save()

        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
        producer.send(message)
        producer.close()
      }
    })

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
So we tried another approach: creating the producer outside of the foreachRDD and reusing it for the entire batch interval (code below). This seems to have helped, since we are no longer creating the producer each time we want to send an acknowledgement. But for some reason, when we monitor the application on the Spark UI, its memory consumption increases steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors started by YARN.
object Sample_Streaming {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")
    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()
    val outSchema = schemaDf.schema

    val producer = new KafkaProducer[String, String](props)

    KafkaDstream.foreachRDD(rdd => {
      rdd.collect().map(x => {
        val jsonInput: JValue = parse(x)

        /*Do all the transformations using Json libraries*/

        val json4s_transformed = "transformed json"
        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)
        df.write.option("spark.mongodb.output.uri", "connectionURI")
          .option("collection", "Collection")
          .mode("append").format("com.mongodb.spark.sql").save()

        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
        producer.send(message)
        producer.close()
      })
    })

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
My questions are:
How do I monitor the Spark application's memory consumption? Currently we check the application manually every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).
What are the best practices followed in the industry when using Spark Streaming with Kafka?
Kafka is a broker: it gives you delivery guarantees for the producer and the consumer. It's overkill to implement an 'over the top' acknowledgement mechanism between the producer and the consumer. Ensure that the producer behaves correctly and that the consumer can recover in case of failure, and end-to-end delivery will be ensured.
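For instance, a minimal sketch of producer-side delivery settings (the exact values are illustrative, not a recommendation):
import org.apache.kafka.clients.producer.ProducerConfig

// Illustrative delivery-guarantee settings on the producer used for acknowledgements
props.put(ProducerConfig.ACKS_CONFIG, "all")   // wait for all in-sync replicas to acknowledge
props.put(ProducerConfig.RETRIES_CONFIG, "3")  // retry transient send failures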
Regarding the job, it's no wonder its performance is poor: the processing is being done sequentially, element by element, up to the point of the write to the external DB. This is plain wrong and should be addressed before attempting to fix any memory consumption issues.
This process could be improved like:
val producer = // create producer
val jsonDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    val json = parse(elem)
    compact(render(doAllTransformations(json))) // output should be a String-formatted JSON object
  }
}

jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = // create message
  producer.send(msg)
  producer.flush() // force send. *DO NOT close* the producer, otherwise it will not be able to send any more messages
}
This process could be improved further if we replaced all the string-centric JSON transformations with case class instances.
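A minimal sketch of that idea, assuming a hypothetical Event(id, payload) shape for the incoming JSON (the field names and the Mongo options are placeholders carried over from the snippet above):
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical shape of the incoming messages
case class Event(id: String, payload: String)

implicit val formats: Formats = DefaultFormats

val eventDStream = kafkaDstream.transform { rdd =>
  rdd.map(elem => parse(elem).extract[Event]) // json4s maps the JSON straight into the case class
}

eventDStream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF() // the schema is derived from the case class, no string-level JSON handling
  df.write.option("spark.mongodb.output.uri", "connectionURI")
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
}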

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application and am just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {

  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092

  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))

  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking into the API docs leads me to believe this is not the intended behavior. I've been looking around online for a solution, but nothing involving Spark has mentioned this exception. Is there any way for me to properly fix this?
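One thing that may be worth trying while investigating this: StreamingContext.stop has an overload that controls whether the SparkContext is stopped and whether the stop is graceful. A minimal sketch (not a confirmed fix for the cleaner-thread warning):
// Wait for all received data to be processed before stopping, and also stop the SparkContext
ssc.stop(stopSparkContext = true, stopGracefully = true)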
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.

Spark Stream kafka stream - returned rdd doesn't do anything in foreachRDD

I am facing a weird issue here. I am reading Avro records from Kafka and trying to deserialize them and store them in a file. I am able to get the records from Kafka, but somehow when I try to use a function on the rdd records it refuses to do anything:
import java.nio.ByteBuffer
import java.util.UUID

import io.confluent.kafka.serializers.KafkaAvroDecoder
import com.my.project.avro.AvroDeserializer
import com.my.project.util.SparkJobLogging
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka._
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.DStream

object KafkaConsumer extends SparkJobLogging {

  var schemaRegistry: SchemaRegistryClient = null
  val url = "url:8181"
  schemaRegistry = new CachedSchemaRegistryClient(url, 1000)

  def createKafkaStream(ssc: StreamingContext): DStream[(String, Array[Byte])] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "zk.server:2181",
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "smallest",
      "bootstrap.servers" -> "bootstrap.server:9092",
      "zookeeper.connection.timeout.ms" -> "6000",
      "schema.registry.url" -> "registry.url:8181"
    )
    val topic = "my.topic"
    KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Set(topic))
  }

  def processRecord(avroStream: Array[Byte]) = {
    println(AvroDeserializer.toRecord(avroStream, schemaRegistry))
  }

  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("AvroDeserilizer")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val topicStream = createKafkaStream(ssc).map(_._2)
    topicStream.foreachRDD(
      rdd => if (!rdd.isEmpty()) {
        logger.info(rdd.count())
        rdd.foreach(avroRecords => processRecord(avroRecords))
      }
    )

    ssc.start()
    ssc.awaitTermination()
  }
}

object AvroDeserializer extends SparkJobLogging {
  def toRecord(buffer: Array[Byte], registry: SchemaRegistryClient): GenericRecord = {
    val bb = ByteBuffer.wrap(buffer)
    bb.get()                  // consume MAGIC_BYTE
    val schemaId = bb.getInt  // consume schemaId
    val schema = registry.getByID(schemaId) // consult the Schema Registry
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
    reader.read(null, decoder) // null -> as we are not providing any datum
  }
}
Up to the statement logger.info(rdd.count()) everything works fine, and I see the exact record counts in the log. However, after that nothing works.
When I tried
val record= rdd.first()
processRecord(record)
it worked, but rdd.foreach(avroRecords => processRecord(avroRecords)) and rdd.map(avroRecords => processRecord(avroRecords)) don't work. It just prints the following on every streaming call:
17/05/14 01:01:24 INFO scheduler.DAGScheduler: Job 2 finished: foreach at KafkaConsumer.scala:56, took 42.684999 s
17/05/14 01:01:24 INFO scheduler.JobScheduler: Finished job streaming job 1494738000000 ms.0 from job set of time 1494738000000 ms
17/05/14 01:01:24 INFO scheduler.JobScheduler: Total delay: 84.888 s for time 1494738000000 ms (execution: 84.719 s)
17/05/14 01:01:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
17/05/14 01:01:24 INFO scheduler.InputInfoTracker: remove old batch metadata:
17/05/14 01:01:26 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:26 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/05/14 01:01:29 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:29 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
It just prints the last 2 lines in the log till the next streaming context call.
Your println statements are being run on distributed workers, not in the current process, so you don't see them. You could try replacing println with log.info to verify this.
Ideally you should be turning your DStream[Array[Byte]] into a DStream[GenericRecord] and writing that to a file, using .saveAsTextFiles or something similar. You may want a stream.take() in there because the stream could be infinite.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
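A minimal sketch of that suggestion, assuming the topicStream from the question (the deserialization then runs on the executors, so the registry client has to be reachable from there):
// Turn the byte-array stream into a stream of stringified GenericRecords and write each
// batch out; saveAsTextFiles creates one output directory per batch interval.
val recordStream: DStream[String] =
  topicStream.map(bytes => AvroDeserializer.toRecord(bytes, schemaRegistry).toString)

recordStream.saveAsTextFiles("/data/avro-records")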
val topicStream = createKafkaStream(ssc).map(_._2)
topicStream.foreachRDD(
  rdd => if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.foreach(avroRecords => processRecord(avroRecords))
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it.
Although the above method didn't work for me, I found a different way in the Confluent documentation. The KafkaAvroDecoder will communicate with the Schema Registry, get the schema, and deserialize the data, so it removes the need for a custom deserializer.
import io.confluent.kafka.serializers.KafkaAvroDecoder

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "schema.registry.url" -> schemaRegistry,
  "key.converter.schema.registry.url" -> schemaRegistry,
  "value.converter.schema.registry.url" -> schemaRegistry,
  "auto.offset.reset" -> "smallest"
)
val topicSet = Set(topics)
val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicSet).map(_._2)

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.saveAsTextFile("/data/")
  }
}

ssc.start()
ssc.awaitTermination()
Dependency jar: kafka-avro-serializer-3.1.1.jar. This is working perfectly for me right now, and I hope it will be helpful to someone in the future.
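If it helps, the sbt coordinates for that jar would presumably be the following (group id and repository assumed):
// Assumed coordinates for the dependency jar mentioned above (Confluent's Maven repository)
resolvers += "confluent" at "https://packages.confluent.io/maven/"
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.1"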

How to write spark streaming DF to Kafka topic

I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this:
input.foreachRDD(rdd =>
  rdd.foreachPartition(partition =>
    partition.foreach {
      case x: String => {
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")

        println(x)
        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("output", null, x)
        producer.send(message)
      }
    }
  )
)
and it works as intended, but instantiating a new KafkaProducer for every message is clearly infeasible in a real context, and I'm trying to work around it.
I would like to keep a reference to a single instance for every process and access it when I need to send a message. How can I write to Kafka from Spark Streaming?
Yes, unfortunately Spark (1.x, 2.x) doesn't make it straightforward to write to Kafka in an efficient manner.
I'd suggest the following approach:
Use (and re-use) one KafkaProducer instance per executor process/JVM.
Here's the high-level setup for this approach:
First, you must "wrap" Kafka's KafkaProducer because, as you mentioned, it is not serializable. Wrapping it allows you to "ship" it to the executors. The key idea here is to use a lazy val so that you delay instantiating the producer until its first use, which is effectively a workaround so that you don't need to worry about KafkaProducer not being serializable.
You "ship" the wrapped producer to each executor by using a broadcast variable.
Within your actual processing logic, you access the wrapped producer through the broadcast variable, and use it to write processing results back to Kafka.
The code snippets below work with Spark Streaming as of Spark 2.0.
Step 1: Wrapping KafkaProducer
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

class MySparkKafkaProducer[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {

  /* This is the key idea that allows us to work around running into
     NotSerializableExceptions. */
  lazy val producer = createProducer()

  def send(topic: String, key: K, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, key, value))

  def send(topic: String, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, value))

}

object MySparkKafkaProducer {

  import scala.collection.JavaConversions._

  def apply[K, V](config: Map[String, Object]): MySparkKafkaProducer[K, V] = {
    val createProducerFunc = () => {
      val producer = new KafkaProducer[K, V](config)

      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }

      producer
    }
    new MySparkKafkaProducer(createProducerFunc)
  }

  def apply[K, V](config: java.util.Properties): MySparkKafkaProducer[K, V] = apply(config.toMap)

}
Step 2: Use a broadcast variable to give each executor its own wrapped KafkaProducer instance
import java.util.Properties

import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc: StreamingContext = {
  val sparkConf = new SparkConf().setAppName("spark-streaming-kafka-example").setMaster("local[2]")
  new StreamingContext(sparkConf, Seconds(1))
}

ssc.checkpoint("checkpoint-directory")

val kafkaProducer: Broadcast[MySparkKafkaProducer[Array[Byte], String]] = {
  val kafkaProducerConfig = {
    val p = new Properties()
    p.setProperty("bootstrap.servers", "broker1:9092")
    p.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
    p.setProperty("value.serializer", classOf[StringSerializer].getName)
    p
  }
  ssc.sparkContext.broadcast(MySparkKafkaProducer[Array[Byte], String](kafkaProducerConfig))
}
Step 3: Write from Spark Streaming to Kafka, re-using the same wrapped KafkaProducer instance (for each executor)
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.RecordMetadata

val stream: DStream[String] = ???
stream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val metadata: Stream[Future[RecordMetadata]] = partitionOfRecords.map { record =>
      kafkaProducer.value.send("my-output-topic", record)
    }.toStream
    metadata.foreach { metadata => metadata.get() }
  }
}
Hope this helps.
My first advice would be to try creating a new instance in foreachPartition and measuring whether that is fast enough for your needs (instantiating heavy objects in foreachPartition is what the official documentation suggests).
Another option is to use an object pool as illustrated in this example:
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala
I however found it hard to implement when using checkpointing.
Another version that is working well for me is a factory, as described in the following blog post; you just have to check whether it provides enough parallelism for your needs (check the comments section):
http://allegro.tech/2015/08/spark-kafka-integration.html
With Spark >= 2.2
Both read and write operations on Kafka are possible using the Structured Streaming API.
Build stream from Kafka topic
// Subscribe to a topic and read messages starting from the earliest offsets
val ds = spark
  .readStream // use `read` for batch, like DataFrame
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic1")
  .option("startingOffsets", "earliest")
  .load()
Read the key and value and apply the schema to both; for simplicity, we convert both of them to String type:
val dsStruc = ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Since dsStruc has a schema, it accepts all SQL-style operations such as filter, agg, select, etc.
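For example (hypothetical operations on the typed stream):
// Keep only records whose value contains "error", and count records per key
val errors = dsStruc.filter(_._2.contains("error"))
val countsPerKey = dsStruc.groupByKey(_._1).count()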
Write stream to Kafka topic
dsStruc
  .writeStream // use `write` for batch, like DataFrame
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("topic", "target-topic1")
  .start()
More configuration for Kafka integration to read or write
Key artifacts to add in the application
"org.apache.spark" % "spark-core_2.11" % 2.2.0,
"org.apache.spark" % "spark-streaming_2.11" % 2.2.0,
"org.apache.spark" % "spark-sql-kafka-0-10_2.11" % 2.2.0,
There is a Streaming Kafka Writer maintained by Cloudera (actually spun off from a Spark JIRA [1]). It basically creates a producer per partition, which amortizes the time spent creating 'heavy' objects over a (hopefully large) collection of elements.
The Writer can be found here: https://github.com/cloudera/spark-kafka-writer
I was having the same issue and found this post.
The author solves the problem by creating one producer per executor. Instead of sending the producer itself, he sends only a "recipe" for creating a producer on an executor, by broadcasting it.
val kafkaSink = sparkContext.broadcast(KafkaSink(conf))
He uses a wrapper that lazily creates the producer:
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer = createProducer()

  def send(topic: String, value: String): Unit = producer.send(new ProducerRecord(topic, value))
}

object KafkaSink {
  def apply(config: Map[String, Object]): KafkaSink = {
    val f = () => {
      val producer = new KafkaProducer[String, String](config)

      sys.addShutdownHook {
        producer.close()
      }

      producer
    }
    new KafkaSink(f)
  }
}
The wrapper is serializable because the Kafka producer is initialized just before first use on an executor. The driver keeps the reference to the wrapper and the wrapper sends the messages using each executor's producer:
dstream.foreachRDD { rdd =>
  rdd.foreach { message =>
    kafkaSink.value.send("topicName", message)
  }
}
Why is it infeasible? Fundamentally, each partition of each RDD is going to run independently (and may well run on a different cluster node), so you have to redo the connection (and any synchronization) at the start of each partition's task. If the overhead of that is too high, then you should increase the batch size in your StreamingContext until it becomes acceptable (obviously there's a latency cost to doing this).
(If you're not handling thousands of messages in each partition, are you sure you need spark-streaming at all? Would you do better with a standalone application?)
This might be what you want to do. You basically create one producer for each partition of records.
input.foreachRDD(rdd =>
  rdd.foreachPartition(partitionOfRecords => {
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    partitionOfRecords.foreach {
      case x: String => {
        println(x)
        val message = new ProducerRecord[String, String]("output", null, x)
        producer.send(message)
      }
    }
  })
)
Hope that helps
With Spark < 2.2
Since there is no direct way of writing messages to Kafka from Spark Streaming in those versions, create a KafkaSink writer (a ForeachWriter):
import java.util.Properties
import org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter

class KafkaSink(topic: String, servers: String) extends ForeachWriter[(String, String)] {

  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val results = new scala.collection.mutable.HashMap[String, String]
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: (String, String)): Unit = {
    producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
Write messages using SinkWriter
val topic = "<topic2>"
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
Reference link

How to deserialize Flume's Avro events coming to Spark?

I have a Flume Avro sink and a SparkStreaming program that reads the sink.
CDH 5.1, Flume 1.5.0, Spark 1.0, using Scala as the programming language on Spark.
I was able to build the Spark example and count the Flume Avro events.
However, I was not able to deserialize the Flume Avro event into a string/text and then parse the structured row.
Does anyone have an example of how to do this using Scala?
You can deserialize the flume events with the below code:
val eventBody = stream.map(e => new String(e.event.getBody.array))
Here's an example of a Spark Streaming application for analyzing popular hashtags from Twitter, using a Flume Twitter source and an Avro sink to push the events to Spark:
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume._

object PopularHashTags {

  val conf = new SparkConf().setMaster("local[4]").setAppName("PopularHashTags").set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)

  def main(args: Array[String]) {

    sc.setLogLevel("WARN")

    System.setProperty("twitter4j.oauth.consumerKey", <consumerKey>)
    System.setProperty("twitter4j.oauth.consumerSecret", <consumerSecret>)
    System.setProperty("twitter4j.oauth.accessToken", <accessToken>)
    System.setProperty("twitter4j.oauth.accessTokenSecret", <accessTokenSecret>)

    val ssc = new StreamingContext(sc, Seconds(5))
    val filter = args.takeRight(args.length)

    val stream = FlumeUtils.createStream(ssc, <hostname>, <port>)

    val tweets = stream.map(e => new String(e.event.getBody.array))

    val hashTags = tweets.flatMap(status => status.split(" ").filter(_.startsWith("#")))

    val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))

    // Print popular hashtags
    topCounts60.foreachRDD(rdd => {
      val topList = rdd.take(10)
      println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    stream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
You can implement a custom decoder in order to deserialize. Provide the expected type information along with it.
Try the code below:
stream.map(e => "Event:header:" + e.event.get(0).toString
+ "body: " + new String(e.event.getBody.array)).print