Spark Streaming using Kafka: empty collection exception - scala

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val slice=30
val lines = messages.map(_._2)
val dStreamDst=lines.transform(rdd => {
val y= rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)),1)).reduceByKey(_ + _)
})
dStreamDst.print()
When I run it, I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. I stopped using transform and print() and switched to foreachRDD, which turned out to be the best solution for me.

You are encountering this because you are working with the DStream through the transform() API. With that method you are handed the RDD that represents the snapshot of data for one batch interval, in your case 10 seconds. Your code fails because no data arrived during a particular interval, so the RDD you operate on is empty, and reduce() on an empty RDD throws the "empty collection" error.
Use rdd.isEmpty() to check that the RDD is not empty before invoking your operation:
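lines.transform { rdd =>
  if (rdd.isEmpty()) {
    // Nothing arrived in this batch: return an empty RDD of the result type,
    // so both branches have the same type, RDD[(String, Int)]
    rdd.sparkContext.emptyRDD[(String, Int)]
  } else {
    // rest of the transformation, ending in reduceByKey(_ + _)
  }
}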
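The per-batch logic from the question can also be written so that an empty batch is handled up front. Below is a minimal sketch on a plain Seq, so it runs without Spark; the name bucketCounts and the assumed line layout ("timestamp,something,destination") are inferred from the question's code, not confirmed by it. It parses the timestamp to Int before taking the minimum, because comparing the raw strings would order "9" after "10".

```scala
object BucketCounts {
  // Assumed input layout: "timestamp,<ignored>,destination" (inferred from the question).
  def bucketCounts(lines: Seq[String], slice: Int): Map[String, Int] =
    if (lines.isEmpty) {
      // The guard: an empty batch yields an empty result instead of an exception.
      Map.empty
    } else {
      val parsed = lines.map { line =>
        val fields = line.split(",")
        (fields(0).toInt, fields(2))
      }
      // Numeric minimum, not the lexicographic minimum of the raw strings.
      val minTs = parsed.map(_._1).min
      parsed
        .map { case (ts, dst) => s"${(ts - minTs) / slice * slice} $dst" }
        .groupBy(identity)
        .map { case (key, hits) => key -> hits.size }
    }
}
```

Inside transform the same shape applies, with rdd.isEmpty() as the guard and reduceByKey doing the counting.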

Related

Spark Streaming 1.6 + Kafka: Too many batches in "queued" status

I'm using Spark Streaming to consume messages from a Kafka topic that has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below:
def createStreamingContext(conf: Conf): StreamingContext = {
val dateFormat = conf.dateFormat.apply
val hiveTable = conf.tableName.apply
val sparkConf = new SparkConf()
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.driver.allowMultipleContexts", "true")
val sc = SparkContextBuilder.build(Some(sparkConf))
val ssc = new StreamingContext(sc, Seconds(conf.batchInterval.apply))
val kafkaParams = Map[String, String](
"bootstrap.servers" -> conf.kafkaBrokers.apply,
"key.deserializer" -> classOf[StringDeserializer].getName,
"value.deserializer" -> classOf[StringDeserializer].getName,
"auto.offset.reset" -> "smallest",
"enable.auto.commit" -> "false"
)
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
conf.topics.apply().split(",").toSet[String]
)
val windowedKafkaStream = directKafkaStream.window(Seconds(conf.windowDuration.apply))
ssc.checkpoint(conf.sparkCheckpointDir.apply)
val eirRDD: DStream[Row] = windowedKafkaStream.map { kv =>
val fields: Array[String] = kv._2.split(",")
createDomainObject(fields, dateFormat)
}
eirRDD.foreachRDD { rdd =>
val schema = SchemaBuilder.build()
val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)
eirDF
.select(schema.map(c => col(c.name)): _*)
.write
.mode(SaveMode.Append)
.partitionBy("year", "month", "day")
.insertInto(hiveTable)
}
ssc
}
As it can be seen from the code, I used window to achieve this (and please correct me if I'm wrong): Since there's an action to insert into a hive table, I want to avoid writing to HDFS too often, so what I want is to hold enough data in memory and only then write to the filesystem. I thought that using window would be the right way to achieve it.
Now, in the image below, you can see that many batches are queued and that the batch being processed takes forever to complete.
I'm also providing the details of the single batch being processed:
Why are there so many tasks for the insert action, when there aren't many events in the batch? Sometimes having 0 events also generates thousands of tasks that take forever to complete.
Is the way I process microbatches with Spark wrong?
Thanks for your help!
Some extra details:
YARN containers have a max of 2 GB.
In this YARN queue, the maximum number of containers is 10.
When I look at the details of the queue where this Spark application is running, the number of pending containers is extremely large, around 15k.
Well, I finally figured it out. Apparently Spark Streaming does not get along with empty events, so inside the foreachRDD portion of the code, I added the following:
eirRDD.foreachRDD { rdd =>
if (rdd.take(1).length != 0) {
//do action
}
}
That way we skip empty micro-batches. The isEmpty() method did not work for me.
Hope this helps somebody else! ;)
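Separately from the empty-batch guard, the windowing described in the question may be worth a second look. As far as I know, window(windowDuration) with a single argument slides by the batch interval, so consecutive windows overlap and each event reaches foreachRDD several times. The sketch below models that on plain Scala collections, with no Spark involved; windows, size and step are names made up for the illustration, and sliding only approximates how Spark forms windows at the start of a stream.

```scala
object WindowOverlap {
  // Each inner Seq stands for the events of one micro-batch.
  // size = window length in batches, step = slide interval in batches.
  def windows[A](batches: Seq[Seq[A]], size: Int, step: Int): List[Seq[A]] =
    batches.sliding(size, step).map(_.flatten).toList

  // How many windows a given event ends up in.
  def occurrences[A](event: A, ws: List[Seq[A]]): Int =
    ws.count(_.contains(event))
}
```

With six one-event batches and a window of three batches, a step of 1 puts a middle event into three windows, while a step of 3 puts it into exactly one. If the goal is a single write per window, passing the same duration as both the window length and the slide interval, window(d, d), gives non-overlapping windows.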

Stateful streaming Spark processing

I'm learning Spark and trying to build a simple streaming service.
For example, I have a Kafka queue and a Spark job like word count. That example is stateless. I'd like to accumulate the word counts, so that if the word test is sent several times in different messages, I get the total number of its occurrences.
Using other examples like StatefulNetworkWordCount, I've tried to modify my Kafka streaming service:
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint("/tmp/data")
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
// Update the cumulative count using mapWithState
// This will give a DStream made of state (which is the cumulative count of the words)
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
output
}
val stateDstream = wordDstream.mapWithState(
StateSpec.function(mappingFunc) /*.initialState(initialRDD)*/)
stateDstream.print()
stateDstream.map(s => (s._1, s._2.toString)).foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
I get a lot of errors like:
17/03/26 21:33:57 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@2b680207)
- field (class: com.DirectKafkaWordCount$$anonfun$main$2, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.DirectKafkaWordCount$$anonfun$main$2, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
The stateless version, however, works fine without errors:
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _).map(s => (s._1, s._2.toString))
wordCounts.print()
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
My question is how to make the streaming word count stateful.
At this line:
ssc.checkpoint("/tmp/data")
you've enabled checkpointing, which means that everything referenced inside:
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
has to be serializable, and sc itself is not, as you can see from the error message:
object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@2b680207)
Removing the checkpointing line makes this error go away, but stateful operations such as mapWithState need a checkpoint directory, so that only helps the stateless version.
Another way is to work on each RDD of the DStream as it is computed and write the data directly to Redis, without referencing sc inside the closure, something like:
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition(partition => RedisContext.setZset("word_count", partition, ttl, redisConfig))
}
RedisContext is a serializable object that doesn't depend on SparkContext.
See also: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/redisFunctions.scala
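The mechanism behind the error can be reproduced without Spark: a function value is only serializable if everything it captures is. This is a minimal sketch using plain Java serialization; Context is a hypothetical stand-in for SparkContext, and isSerializable and demo are names invented for the illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCapture {
  // Hypothetical stand-in for SparkContext: deliberately not Serializable.
  class Context(val name: String)

  // True if Java serialization accepts the object,
  // false if it runs into a non-serializable reference.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Returns (closure capturing the context, closure capturing nothing).
  def demo(): (Boolean, Boolean) = {
    val ctx = new Context("driver")
    val capturesContext: String => String = word => s"${ctx.name}:$word"
    val selfContained: String => String = word => word.toUpperCase
    (isSerializable(capturesContext), isSerializable(selfContained))
  }
}
```

The first closure should fail to serialize because it drags ctx along, which is the situation of the foreachRDD closure that references sc. In Spark code, a commonly used way around it is to obtain the context from the RDD inside the closure (rdd.sparkContext) rather than capturing the outer sc; whether that fits together with the spark-redis implicits here is an assumption to verify.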

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map((x$2) => x$2._2)
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
val message = rdd.map(parse)
println(message.collect())
}
}
The parse function simply extracts relevant fields from the Json message into a tuple.
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
rdd.foreachPartition { partition =>
partition.foreach{msg =>
val message = parse(msg)
println(message)
}
}
}
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using spark 2.0.0, scala 2.11.8 and spark streaming kafka 0.8.
Here is a sample streaming app that converts each message of the batch to upper case inside foreachRDD and prints it. Try this sample app and then recheck your application. Hope this helps.
object SparkKafkaStreaming {
def main(args: Array[String]) {
//Broker and topic
val brokers = "localhost:9092"
val topic = "myTopic"
//Create context with 5 second batch interval
val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
//Create direct kafka stream with brokers and topics
val topicsSet = Set[String](topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
//Message
val msg = msgStream.map(_._2)
msg.print()
//For each
msg.foreachRDD { rdd =>
if (!rdd.isEmpty) {
println("-----Convert Message to UpperCase-----")
//convert messages to upper case
rdd.map { x => x.toUpperCase() }.collect().foreach(println)
} else {
println("No Message Received")
}
}
//Start the computation
ssc.start()
ssc.awaitTermination()
}
}
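One likely reading of the first snippet in the question, offered as an assumption since its output is not shown: println(message.collect()) prints the array object itself, which on the JVM is a single line with a type tag and a hash, so it can look as if only one thing was processed. The sample above avoids this by calling foreach(println) on the collected array. A small Spark-free sketch of the difference, with asReference and asLines as illustrative names:

```scala
object PrintCollected {
  // Default toString of a JVM array: a type tag plus a hash, not the elements.
  def asReference[A](collected: Array[A]): String = collected.toString

  // One element per line, which is what foreach(println) would show.
  def asLines[A](collected: Array[A]): String = collected.mkString("\n")
}
```

Applied to the question's code, println(message.collect().mkString("\n")) or message.collect().foreach(println) stays at the RDD level and prints every parsed message.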

No messages received when using foreachPartition spark streaming

I am pulling from Kafka using Spark Streaming. When I use foreachPartition on my RDD I never get any messages received. If I read the messages from the RDD using a foreach it works fine. However I need to use the partition function so I can have a socket connection on each executor.
This is the code that connects to Spark and creates the stream:
val kafkaParams = Map(
"zookeeper.connect" -> zooKeepers,
"group.id" -> ("metric-group"),
"zookeeper.connection.timeout.ms" -> "5000")
val inputTopic = "threatflow"
val conf = new SparkConf().setAppName(applicationTitle).set("spark.eventLog.overwrite", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val streams = (1 to numberOfStreams) map { _ =>
KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
}
val kafkaStream = ssc.union(streams)
kafkaStream.foreachRDD { (rdd, time) =>
calcVictimsProcess(process, rdd, time.milliseconds)
}
ssc.start()
ssc.awaitTermination()
Here is my code that attempts to process the messages using foreachPartition instead of foreach:
val threats = rdd.map(message => gson.fromJson(message._2.substring(1, message._2.length()), classOf[ThreatflowMessage]))
threats.flatMap(mapSrcVictim).reduceByKey((a,b) => a + b).foreachPartition{ partition =>
val socket = new Socket(InetAddress.getByName("localhost"),4242)
val writer = new BufferedOutputStream(socket.getOutputStream)
partition.foreach{ value =>
val parts = value._1.split("-")
val put = "put %s %d %d type=%s address=%s unique=%s\n".format("metric", bucket, value._2, parts(0),parts(1),unique)
Thread.sleep(10000)
}
writer.flush()
socket.close()
}
As I said, simply switching this to foreach works, but that is not an option because I need the sockets to be created per executor.
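One detail in the posted loop is worth checking: the put string is built for every record but never written to writer, and each record also sleeps for 10 seconds. Below is a minimal sketch of the per-partition write, kept free of Spark so it can be tried on its own; writePartition is a hypothetical helper, and the types of bucket (Long), unique (String) and the counts (Int) are assumptions, since the question does not show them.

```scala
import java.io.{BufferedOutputStream, OutputStream}
import java.nio.charset.StandardCharsets

object PartitionWriter {
  // Formats every record of one partition and writes it to `out`.
  // Returns the number of records written.
  def writePartition(partition: Iterator[(String, Int)],
                     out: OutputStream,
                     bucket: Long,
                     unique: String): Int = {
    val writer = new BufferedOutputStream(out)
    var written = 0
    partition.foreach { case (key, count) =>
      val parts = key.split("-")
      val put = "put %s %d %d type=%s address=%s unique=%s\n"
        .format("metric", bucket, count, parts(0), parts(1), unique)
      // The step missing from the question's loop: actually send the line.
      writer.write(put.getBytes(StandardCharsets.UTF_8))
      written += 1
    }
    writer.flush()
    written
  }
}
```

Inside foreachPartition this would be called once per partition with socket.getOutputStream as out, followed by socket.close(), so one socket is still opened per partition on the executor.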

KafkaRDD scala minimal example

I'm trying to get a running example using KafkaRDD:
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val offsetRanges = Array(
OffsetRange("topic", 0, 0, 2)
)
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)
rdd.map(x => println(x)).collect()
res: Array[Unit] = Array((), ())
I have been careful in creating "topic" with a single partition and writing 2 messages, hello, world.
I can get what looks like a correct RDD, but how can I access its content? Am I missing something?
Thanks, E.
The problem is this line, I believe:
rdd.map(x => println(x)).collect()
Because of the way an RDD works, the function passed to rdd.map runs on the executors. When you call println there, the output goes to the executor's stdout. To print to stdout in the driver application, try this instead:
rdd.collect().map(x => println(x))
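The Array((), ()) shown in the question is the result of the map itself: println returns Unit, so mapping it over two messages yields two Unit values. A small sketch on a plain Seq, with viaMap and viaForeach as illustrative names:

```scala
object MapVsForeach {
  // map keeps println's return value (Unit) for every element.
  def viaMap(messages: Seq[String]): Seq[Unit] =
    messages.map(m => println(m))

  // foreach runs the side effect and returns a single Unit.
  def viaForeach(messages: Seq[String]): Unit =
    messages.foreach(println)
}
```

On the collected array, rdd.collect().foreach(println) states the intent more directly than map, since only the side effect is wanted.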